Natural language processing (NLP) has become an important way to track and categorize viewership in the age of cookie-less ad targeting. While users resist being identified by a single user ID, they are much less sensitive to, and may even welcome, advertisers personalizing media content based on discovered preferences. This personalization builds on improvements to the original latent Dirichlet allocation (LDA) algorithm, many of which incorporate word2vec concepts.
The classic LDA algorithm, introduced by Blei, Ng, and Jordan in 2003, raised industry-wide interest in machine understanding of documents. Incidentally, it also helped establish variational inference as a major research direction in Bayesian modeling. LDA can process massive document collections, summarize each document's main themes with a manageable set of topics, and do so with relatively high efficiency (compared with the more traditional Monte Carlo methods, which sometimes run for months). These strengths made it the de facto standard in document classification.
However, the original LDA approach lacks certain desirable properties. In the end, it is fundamentally a word-counting technique. Consider these two statements:
“His next idea will be the breakthrough the industry has been waiting for.”
“He is praying that his next idea will be the breakthrough the industry has been waiting for.”
After removal of common stop words, these two sentences, which convey nearly opposite sentiments, have almost identical word-count features. It would be unreasonable to expect a classifier to tell them apart if those counts are its only inputs.
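To see how little separates the two statements, here is a minimal Python sketch of the word-count features. The stop-word list is a hypothetical one, chosen only to be large enough for these two sentences; a real pipeline would use a standard list and a proper tokenizer.

```python
import re
from collections import Counter

# Hypothetical stop-word list, just large enough for these two sentences.
STOP_WORDS = {"his", "he", "is", "that", "will", "be", "the", "has", "been", "for"}

def bag_of_words(text):
    """Lowercase, keep alphabetic tokens, drop stop words, count the rest."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

optimistic = "His next idea will be the breakthrough the industry has been waiting for."
doubtful = ("He is praying that his next idea will be the breakthrough "
            "the industry has been waiting for.")

counts_a = bag_of_words(optimistic)
counts_b = bag_of_words(doubtful)

# Words whose counts differ between the two sentences, as (count_a, count_b).
difference = {
    word: (counts_a[word], counts_b[word])
    for word in set(counts_a) | set(counts_b)
    if counts_a[word] != counts_b[word]
}

print(sorted(counts_a))  # shared content words
print(difference)        # the only feature that separates the sentences
```

Under these assumptions the two feature vectors differ in a single count, the word "praying", which is all a count-based model has to work with.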
The latest advances in the field improve upon the original algorithm on several fronts. Many of them incorporate the word2vec concept, in which each word is represented by an embedding vector that reflects its semantic meaning, so that vector arithmetic captures relationships between words. For example, king − man + woman ≈ queen.
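The analogy arithmetic can be sketched in a few lines of Python. The three-dimensional vectors below are hand-picked, hypothetical values (roughly "royalty", "male", "female"), not output from a trained word2vec model, which would typically have hundreds of dimensions.

```python
import math

# Hypothetical toy embeddings; axes loosely mean (royalty, male, female).
EMBEDDINGS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in
              zip(EMBEDDINGS[a], EMBEDDINGS[b], EMBEDDINGS[c])]
    candidates = [w for w in EMBEDDINGS if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, EMBEDDINGS[w]))

result = analogy("king", "man", "woman")
print(result)
```

Excluding the three input words from the candidates is a common convention in analogy lookups; without it, the nearest neighbor is often one of the inputs.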
Autoencoding variational inference for topic models (AVITM) speeds up inference on new documents that are not part of the training set. Its variant ProdLDA replaces LDA's mixture of topics with a product of experts to achieve higher topic coherence. Topic-based classification can potentially perform better as a result.
Doc2vec – generates semantically meaningful vectors that represent a paragraph or an entire document while taking word order into account.
LDA2vec – derives embedding vectors for entire documents in the same semantic space as the word vectors.
Both Doc2vec and LDA2vec provide document vectors ideal for classification applications.
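The product-of-experts idea behind ProdLDA can be contrasted with LDA's mixture in a small Python sketch. The vocabulary, topic distributions, and topic proportions below are hypothetical. To make the two results directly comparable, the sketch uses log topic probabilities as the expert weights, whereas ProdLDA itself learns unconstrained weights.

```python
import math

VOCAB = ["team", "win", "election", "budget"]

# Hypothetical topic-word distributions; each row sums to 1.
TOPICS = [
    [0.40, 0.40, 0.10, 0.10],  # a "sports"-like topic
    [0.10, 0.40, 0.40, 0.10],  # a "politics"-like topic
]
THETA = [0.5, 0.5]  # hypothetical topic proportions for one document

def mixture_of_experts(theta, topics):
    """LDA-style word distribution: weighted average of the topic distributions."""
    return [sum(t * topic[i] for t, topic in zip(theta, topics))
            for i in range(len(topics[0]))]

def product_of_experts(theta, topics):
    """Product-of-experts word distribution: softmax of weighted log weights."""
    logits = [sum(t * math.log(topic[i]) for t, topic in zip(theta, topics))
              for i in range(len(topics[0]))]
    peak = max(logits)  # subtract the max for numerical stability
    unnormalized = [math.exp(x - peak) for x in logits]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

mixture = mixture_of_experts(THETA, TOPICS)
product = product_of_experts(THETA, TOPICS)

for word, m, p in zip(VOCAB, mixture, product):
    print(f"{word:10s} mixture={m:.3f} product={p:.3f}")
```

In this toy setup the product puts more mass than the mixture on "win", the word both topics agree on, because a word needs support from every expert to score highly. That sharpening is one intuition for the more coherent topics reported for ProdLDA.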
All these new techniques achieve scalability through GPU or parallel computing. Although research results demonstrate a significant improvement in topic coherence, many investigators now de-emphasize topic distribution as the means of document interpretation. Instead, the unique numerical representation of each document has become the primary concern for classification accuracy. The derived topics are often treated simply as intermediate factors, not unlike the filtered partial image features in a convolutional neural network.
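As an illustration of classifying on document vectors directly, here is a minimal nearest-centroid sketch in Python. The three-dimensional vectors and the two labels are hypothetical stand-ins for what a model such as Doc2vec or LDA2vec might produce; any standard classifier could take their place.

```python
import math

# Hypothetical document vectors grouped by label (toy values, not model output).
TRAINING = {
    "sports":   [[0.9, 0.1, 0.2], [0.8, 0.2, 0.1]],
    "politics": [[0.1, 0.9, 0.3], [0.2, 0.8, 0.2]],
}

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def classify(doc_vector, centroids):
    """Return the label whose centroid is nearest in Euclidean distance."""
    return min(centroids, key=lambda name: math.dist(doc_vector, centroids[name]))

centroids = {name: centroid(vectors) for name, vectors in TRAINING.items()}
label = classify([0.7, 0.3, 0.2], centroids)
print(label)
```

Note that the classifier never inspects topics or words. It sees only the numerical representation of each document, which is why the quality of that representation matters more than the interpretability of the topics behind it.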