Machine Learning
Natural Language Processing
31 Oct, 2018
Nick Jenkins
Head of Data Science - The Evolved Group
The future of Text Analytics - 2019 and beyond

We can’t help but try to predict the future, especially when it comes to the application of new technologies. Who knew before the fact how radically our lives would transform due to the internet, computers, electricity and the industrial revolution? The most recent of these developments seems to be the rise of AI and machine learning. Will it lead to a future like Terminator or WALL-E, or something more benign? We read about impressive advances such as game grandmasters beaten by AI, self-driving cars and Alexa in the home. Naturally, one might ask when we will have an AI that can beat chess grandmasters AND cook our pancakes. An AI of that kind is known as ‘General AI’; most AI technology today is quite specialised. Unfortunately, General AI seems to be a long way off, but we can achieve a lot with focused effort in a single domain.
Marketers have been leveraging machine learning techniques for some time. When we look to answer business questions such as valuing a customer, uncovering the factors that contribute to customer churn or building propensity models to estimate purchase likelihood, we may well be using a machine learning model.
In the past few decades we have been able to leverage these techniques thanks to an increasing focus on collecting structured data. We record customer details at signup, at purchase, when complaints are received and through customer feedback. In market research we focus on understanding the latter, with the aim of generating actionable insights. We’ve come a long way in a short time since the humble clipboard and paper survey: we now conduct online surveys, available to users on mobile devices at relevant times and places, making structured data easier to collect than ever.
When survey questionnaires are designed, we know - or think we know - the areas of customer experience to ask respondents about. A well-designed survey can facilitate good analysis, which can produce the insights required to improve the customer experience. There are some issues with this approach, however: attention spans are short and surveys are long. By having a designed conversation, we might miss other important details of a customer’s experience, and not all customers have a similar experience, particularly when it comes to the products they purchase. These issues have led to the use of open-ended text comments as a way to capture the things that are important to customers.
Unlike the structured, numeric data we are used to capturing through rating scales, text data is not as easy to analyse. A typical approach has been for a research consultant to read some or all of the comments and code the key themes, sentiment and any specific product or brand mentions. This process tends towards the qualitative and some excellent insights can be retrieved. But it can be slow, costly and result in poor replication over time (as different consultants apply different interpretations). This process helps turn the text data into something structured: does it belong to these categories or not?
This challenge is not limited to marketing; it is part of a broader machine learning domain called Natural Language Processing (NLP). While many of the other AI domains discussed earlier appear to have had their successes and be ‘solved’, NLP is a domain where progress has been comparatively slower. You’ll notice we still haven’t mastered voice recognition - and by ‘mastered’ we generally mean capable of above-human performance. It turns out that understanding language is HARD, primarily due to the unstructured nature of the information. Once we turn the unstructured data into something structured - like the coding example above - we can perform analysis and perhaps even further machine learning. The process of turning unstructured text into something structured is known as ‘encoding’.
There are many ways we might go about encoding text data. Take this article, for example: we could count the number of letter ‘e’s, count the number of words, or count how many times the word ‘data’ is mentioned. Creating these features might tell us something useful about the article. From some quick calculations, the most common words are ‘the’, ‘to’, ‘we’ and (of course) ‘and’. Let’s try removing these highly common words through a process called stopword removal. Now we’re left with the following top words: ‘data’, ‘machine’, ‘learning’, ‘something’, ‘customer’, ‘text’, ‘AI’. This analysis already goes some way towards summarising the concepts of the article, but other types of text analysis will need other techniques.
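To make this concrete, a quick sketch in Python might look something like the following - assuming a plain-text copy of the article saved as ‘article.txt’ (a hypothetical filename) and using only a small, illustrative stopword list rather than a full linguistic resource:

```python
# Minimal sketch: count word frequencies, then drop highly common (stop) words.
from collections import Counter
import re

# A small illustrative subset of stopwords, not an exhaustive list
STOPWORDS = {"the", "to", "we", "and", "a", "of", "is", "in", "that", "it", "for"}

def top_words(text, n=10):
    # Lower-case and keep only alphabetic tokens (with apostrophes)
    tokens = re.findall(r"[a-z']+", text.lower())
    # Remove stopwords so the remaining counts reflect content words
    content = [t for t in tokens if t not in STOPWORDS]
    return Counter(content).most_common(n)

print(top_words(open("article.txt").read()))
```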
I’ll discuss some methods and obstacles to improving this encoding scheme, but first it’s worth discussing some of the analysis we may be looking to do with text. According to the Wikipedia article for ‘text mining’, typical use cases include text categorisation, text clustering, concept/entity extraction, production of granular taxonomies (hierarchies), sentiment analysis, document summarisation and entity relation modelling (i.e. learning relations between named entities).
Our encoding here might be simple, but there are many issues with text data that this approach doesn’t address. While the top 3,000 words cover 95% of English usage, the language has a very long tail of over 170,000 words - and that doesn’t include slang, brand names and place names, never mind misspellings and multi-word phrases. The meanings of words are difficult too: it’s common for two different words to mean the same thing, or for a single word to mean multiple things. The ordering of our sentences matters, as do devices like sarcasm, humour and negation. Language is incredibly complex!
Students are still required to study language mastery in Year 12 and beyond. If we think about it, there is a near-infinite number of ways we might encode the previous paragraph in order to relate it to other text data. We might take all of the English words, or better yet 3-4 word combinations, so we can capture the effects of word order. The issue is that we rapidly run into a limitation of machine learning known as the ‘curse of dimensionality’: for every feature we add, we need more cases/examples to learn from. In practice this limits us to a few hundred features at most to cover thousands of training examples for a task like text classification.
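A rough illustration of how quickly the feature count grows when we move from single words to 3-4 word combinations, using a handful of made-up comments:

```python
# Illustrating the 'curse of dimensionality': compare the number of features
# produced by single words versus 3-4 word combinations on a tiny corpus.
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the service was great and the staff were friendly",
    "delivery was slow but the product quality was great",
    "i would not recommend this product to a friend",
]

for ngram_range in [(1, 1), (3, 4)]:
    vec = CountVectorizer(ngram_range=ngram_range).fit(corpus)
    print(ngram_range, "->", len(vec.vocabulary_), "features")
```

Even on three short comments the 3-4 word combinations produce several times as many features as single words, and the gap widens rapidly as the corpus grows.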
Despite these challenges, we’ve developed better encoding systems to help us adapt and overcome many of these issues, in our quest to perform the types of analysis outlined above. The process for performing text analysis involves two steps: unstructured data becomes structured data through an encoding step. Then, the structured data is used for a variety of machine learning and AI use cases. We’ll discuss briefly what some of these machine learning steps are, before discussing more about our currently employed encoding methods, and how they help overcome the issues in language.
Text categorisation and sentiment analysis are examples of supervised learning. These problems are similar to the marketing challenges mentioned earlier around customer churn - we have an outcome we’ve previously documented, and we train a model to learn the relationship between examples and their category or sentiment. Clustering, entity extraction and entity relation modelling are examples of ‘unsupervised learning’. This is similar to segmentation work, in that we don’t have predefined groups, but we hope that by specifying some uniform, mathematical concept of ‘similarity’ the whole data set can self-organise into interpretable groups. Lastly, document summarisation and hierarchy production are similar to unsupervised techniques, but they include a ‘re-encoding’ step, such that their outputs are converted back into text.
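As a hedged sketch of the supervised case - not our production pipeline - the snippet below encodes a few made-up comments as TF-IDF features and fits a simple classifier to their labels. A real project would use many more examples and the richer encodings discussed below:

```python
# Toy supervised text classification: labelled comments become TF-IDF features,
# and a simple model learns the mapping from features to sentiment labels.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up comments and labels, purely for illustration
comments = [
    "loved the friendly staff and quick service",
    "terrible experience, my order arrived broken",
    "great value, would definitely buy again",
    "very disappointed with the support team",
]
labels = ["positive", "negative", "positive", "negative"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(comments, labels)
print(model.predict(["the staff were wonderful"]))
```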
Our techniques for encoding are an area of constant research. We look for small improvements, and we attempt to identify systemic issues, understand them and fix them. It’s a partnership of artificial intelligence, human intelligence, CPU cycles and grunt work.
While some of the techniques depend on the application, we generally use the following steps for text analysis:
Pre-processing
This may include steps like removing stop-words, converting text to lower case, grouping common multi-word phrases into single tokens (known as n-grams), removing punctuation and correcting common spelling errors. In doing so we may throw out some useful information, such as the capitalisation of certain brands, or a smiley emoji. Good pre-processing keeps some of that information while still reducing the total amount of variation in the data set.
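A simple sketch of these pre-processing steps might look like the following; the phrase and stopword lists here are small, illustrative stand-ins for the much larger resources used in practice:

```python
# Sketch of pre-processing: lower-casing, phrase grouping, punctuation removal
# (while keeping simple emoticons) and stopword removal.
import re

STOPWORDS = {"the", "to", "we", "and", "a", "of", "is"}  # illustrative subset
PHRASES = {"customer service": "customer_service", "value for money": "value_for_money"}

def preprocess(text):
    text = text.lower()
    for phrase, token in PHRASES.items():      # keep known phrases together as one token
        text = text.replace(phrase, token)
    text = re.sub(r"[^\w\s:)(]", " ", text)    # drop punctuation, keep simple emoticons
    return [t for t in text.split() if t not in STOPWORDS]

print(preprocess("The customer service was great :)"))
# -> ['customer_service', 'was', 'great', ':)']
```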
Converting words into word vectors
While we could write entire articles on word vectors alone, this technology gives us a few great benefits in text analysis. It allows us to massively reduce the number of features: instead of needing 170,000 features to represent each word, we only need ~128 or so dimensions to encode all of them fairly accurately. Word vectors also allow us to determine similarity between words, which is useful for clustering, and they let us leverage what has been learned from millions of other text documents when understanding new, unseen text. Word vectors capture how words are used in sentences - for example, words like ‘good’, ‘bad’ and ‘great’ are often interchangeable at the same point in a sentence, so their vectors end up close together.
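As a minimal sketch of the idea (not our production pipeline), here is the general API shape using gensim’s Word2Vec. In practice the vectors would be trained on millions of documents, or pre-trained vectors would be loaded; the tiny corpus below is only to show the mechanics:

```python
# Train word vectors on a toy corpus and query similarity between words.
from gensim.models import Word2Vec

sentences = [
    ["the", "service", "was", "good"],
    ["the", "service", "was", "great"],
    ["the", "service", "was", "bad"],
    ["delivery", "was", "slow"],
]

model = Word2Vec(sentences, vector_size=128, window=5, min_count=1, epochs=50)
print(model.wv["good"].shape)                 # each word becomes a 128-dimensional vector
print(model.wv.most_similar("good", topn=3))  # nearest words by cosine similarity
```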
Adding further conceptual information to word vectors
At this point we’ve done sufficient work to reduce the feature space into something tractable, so now we can start adding helpful features. One we might add here is the positivity/negativity associated with each word, or scores along other emotional spectrums. This helps tease apart word vector groupings so that ‘good, great’ is not confused with ‘bad, awful’.
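One simple way to picture this step is to append a valence score as an extra dimension on each word vector. The sketch below uses a tiny hand-made lexicon purely for illustration; a real project would draw on a published sentiment lexicon:

```python
# Sketch: extend each 128-d word vector with a positivity/negativity score.
import numpy as np

VALENCE = {"good": 0.8, "great": 0.9, "bad": -0.8, "awful": -0.9}  # illustrative values

def augment(word, vector):
    # Append the word's valence as an extra dimension (0.0 if unknown)
    score = VALENCE.get(word, 0.0)
    return np.concatenate([vector, [score]])

vec = np.random.rand(128)           # stand-in for a learned word vector
print(augment("great", vec).shape)  # (129,) - original dimensions plus valence
```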
Reintroducing the ‘orderliness’ of word usage
Especially when running sentiment and text classification projects, the order in which words appear is important information that can help determine whether what a person is saying is positive or negative. We use some of the latest deep learning architectures (such as LSTMs) to incorporate this information into the text analysis.
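As a rough sketch of what an order-aware model can look like (one common architecture, not necessarily our exact design), here is an embedding layer feeding an LSTM in Keras; the vocabulary size and sequence length are placeholder values:

```python
# Order-aware sentiment model: embeddings feed an LSTM that reads words in sequence.
import tensorflow as tf

VOCAB_SIZE, MAX_LEN = 20000, 100  # placeholder sizes

model = tf.keras.Sequential([
    tf.keras.Input(shape=(MAX_LEN,)),                # sequences of word indices
    tf.keras.layers.Embedding(VOCAB_SIZE, 128),      # word vectors
    tf.keras.layers.LSTM(64),                        # reads the sequence in order
    tf.keras.layers.Dense(1, activation="sigmoid"),  # positive vs negative
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```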
These techniques have seen us achieve some fantastic results in predicting customer sentiment and classifying comments, and these projects help deliver real insight into what customers are thinking. The techniques also fuel other research projects such as our chatbot and conversational scripts. Whereas we have fairly clear metrics to measure performance for these supervised techniques, it is comparatively less clear how to judge the results of unsupervised techniques such as clustering. We’re currently in a development cycle to review and improve our clustering methods.
What remains is to reflect on the problem domain of Natural Language Processing in relation to other domains. There’s something rather beautiful in the fact it’s taking so long for research to ‘crack’ language – it gives us as humans something to hang onto as we blaze forth into the future towards increasingly powerful AI applications. Natural Language Processing appears to be one of the last dominos that will fall along the way to the creation of a General AI. As we’ve discussed in this article, the best we can do for the moment is to coerce our human expression and experience into a rough structure.