Making Sense of “Slanguage” in Text Analytics

By Manya Mayes

If your organization is actively analyzing the sentiment of patient and customer commentary, then you have some insight into the new 4G mobile and social media world. Not only do customers have an opinion, they are more than ready to broadcast it – and in whatever “slanguage” suits them. But how do you make sense of it? Are there sentiment analysis tools that can accurately process this type of challenging customer feedback in call center transcripts as well as in social media?

Slanguage is language characterized by excessive use of slang, particular to a group or subgroup of people, such as users of social media forums. Internet slanguage includes abbreviated and clipped speech but is unique in that it also contains letters, numbers, and non-alphanumeric characters.

Slanguage can cover any topic, such as politics, economics, and current affairs, but it is most commonly seen in expressions of personal feelings and opinions. It can also be punchy and playful and, in the case of social media, can convey a savvy quality in the author (“hip to the Internet”). Mixing letters with numbers in words like “gr8” is an example. It is shorthand (for the sake of speed and brevity) based on knowledge of orthography, a keyboard, and phonetics (how words sound): it contains a number (“8”) which, when spoken aloud, makes the same sound as the “-eat” in the standard spelling “great.” This is actually a clever formulation of slang – far more special than coining new terms.

Character-Related Challenges: The sheer volume of customer opinion freely available via the Internet has given forward-thinking organizations the ability to gain key insights from valuable customer feedback. The problem, though, with this voluminous feedback is that there is no editor-in-chief to review the accuracy of the language being published. Moreover, the frequent need to fit opinion into a 160-character SMS text message or a 140-character tweet means there is motivation to be brief and to the point.

In an effort to make a statement in 160 characters or fewer, it is the individual words, rather than the statement itself, that get shortened. For some users, predictive text on mobile phones and iPads allows typing in shorthand and prompts users with the full word equivalent. But when these users move to an environment without predictive text, shorthand becomes the norm. Bcuz y’all no wot I mean neway! Tools are propelling the creation of this new language, too: TweetDeck, for example, offers the ability to “TweetShrink” written text so that customers can fit in as much information as they possibly can. So, now there are even more ways to misspell a word. That can be a real problem for call center analysts.

Is “Twitterspeak” Really Different? Let’s look at the following tweet:

Doctor couldn’t give me any meds because I waited too long to go. FML

When I first saw this, I had no idea what FML meant. I had to go to one of the “netlingo” sites to get a definition. Suffice it to say that from a sentiment analysis point of view, it is negative. Without knowing what it means, software performing sentiment analysis could classify this statement as “neutral” given it is a statement of fact. But the addition of FML to the end of the tweet gives the statement a negative slant, and it is exactly this information that should not be overlooked.
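
To make the point concrete, here is a minimal sketch of lexicon-based scoring, assuming a hypothetical word list with made-up scores (the names BASE_LEXICON and SLANG_LEXICON and their values are illustrative only, not taken from any real sentiment tool). It shows the same tweet coming out neutral until the slang term is added to the lexicon.

```python
import re

# Hypothetical mini-lexicons; the scores are illustrative assumptions.
BASE_LEXICON = {"lousy": -1, "great": 1}
SLANG_LEXICON = {"fml": -2, "gr8": 1}


def score(text, lexicon):
    """Sum the lexicon scores of the lowercase word tokens in text."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return sum(lexicon.get(token, 0) for token in tokens)


def label(total):
    """Map a numeric score to a coarse sentiment label."""
    if total < 0:
        return "negative"
    if total > 0:
        return "positive"
    return "neutral"


tweet = "Doctor couldn't give me any meds because I waited too long to go. FML"

# Without slang entries, no token carries a score, so the tweet looks neutral.
without_slang = label(score(tweet, BASE_LEXICON))

# With the slang entries merged in, "FML" tips the tweet to negative.
with_slang = label(score(tweet, {**BASE_LEXICON, **SLANG_LEXICON}))

print(without_slang, with_slang)
```

Real sentiment software is far more elaborate than a word list, but the gap it has to close is the same: a term it does not know contributes nothing.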

Let’s compare Twitter and survey text. If you parse sample records of each with a mature knowledge extraction base and then evaluate the differences, you can find that Twitter has far fewer known words and far more unknown words – to the tune of seven times more distinct words than the survey text examined, leaving a business analyst with seven times more information in which to separate the signal from the noise.
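
The kind of comparison described above can be sketched as follows. This is only an illustration: the KNOWN vocabulary stands in for a real knowledge extraction base, and the sample records are invented toy data, not the survey or Twitter text actually examined.

```python
import re


def vocabulary_profile(records, known_words):
    """Return (distinct known tokens, distinct unknown tokens) for a list of texts."""
    distinct = set()
    for text in records:
        distinct.update(re.findall(r"[a-z0-9']+", text.lower()))
    known = distinct & known_words
    return len(known), len(distinct - known)


# Hypothetical stand-in for a knowledge base vocabulary.
KNOWN = {"the", "doctor", "was", "great", "very", "helpful", "i", "mean", "no"}

# Invented toy records, for illustration only.
survey = ["The doctor was great", "The doctor was very helpful"]
tweets = ["doc was gr8", "bcuz y'all no wot i mean neway"]

print("survey:", vocabulary_profile(survey, KNOWN))
print("tweets:", vocabulary_profile(tweets, KNOWN))
```

Run over real samples of equal size, the second number is the one to watch: it counts the distinct words the knowledge base cannot yet interpret.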

Dealing with Reality: I have analyzed text that is grammatically correct and ready for publishing and text that is neither of those things but gets published anyway. News and media sites, such as The New York Times, regularly post articles with the occasional spelling mistake or grammatical error. Meanwhile, social media sites, such as Twitter, post tweets with occasional grammatical accuracy!  The omnipresence of what I call the “Ten Transgressions of Text” (which include misspellings/typos, shorthand, slanguage, and profanity) makes for challenging analyses.

For the analyst who is responsible for gleaning value from this slanguage, there is always the ethical question of who should be reading any X-rated content and what HR and legal think of exposing their employees to such content. Did I think I would ever be analyzing information that looked like this?  Actually, I did. Did it surprise me when a large market research organization asked me if we could automatically remove profanity?  No, not really. Did it feel weird to aggregate lists of text speak, slanguage, and profanity to remove from analyses? Absolutely!  This slanguage could be enough to make the hairs on your neck stand up.

There is a real problem here in that profanity and crassness can hold information that is vital to a company. It is important to identify the meaning but also to abstract away from the meaning because the words themselves may not be that valuable; it is what they convey about the speaker that we care about. This approach allows analysts to get the information they are seeking while remaining “protected” from details, and, most important, to aggregate similar data for slicing and dicing. For example, grouping the following three items together allows for aggregation of three different statements, which are all getting at the same thing:

  • “That product is really lousy” as standard
  • “That product is hella lousy” as slang, and
  • “That product is *BLEEP* lousy” as profanity
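
One way to picture this grouping is the short sketch below. It assumes a hypothetical lookup table (INTENSIFIERS) that maps each surface word to a shared placeholder plus a register label; the table and the placeholder name are illustrative, not a description of any particular product.

```python
from collections import Counter

# Hypothetical table: surface form -> (shared placeholder, register).
INTENSIFIERS = {
    "really": ("INTENSIFIER", "standard"),
    "hella": ("INTENSIFIER", "slang"),
    "*bleep*": ("INTENSIFIER", "profanity"),
}


def normalize(statement):
    """Swap known intensifiers for a placeholder and record the registers seen."""
    words, registers = [], []
    for word in statement.split():
        entry = INTENSIFIERS.get(word.lower())
        if entry:
            words.append(entry[0])
            registers.append(entry[1])
        else:
            words.append(word)
    return " ".join(words), registers


statements = [
    "That product is really lousy",
    "That product is hella lousy",
    "That product is *BLEEP* lousy",
]

# All three statements collapse to one normalized form that can be counted.
grouped = Counter(normalize(s)[0] for s in statements)
# The register of each statement is kept as a separate attribute.
registers = [normalize(s)[1] for s in statements]

print(grouped)
print(registers)
```

The analyst sees one aggregated statement with a count of three, while the register column still records how each customer chose to say it.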

Given the seemingly large number of options for spelling a word, I also like to say that there are at least 16 ways to misspell one. Consider the number of ways you could spell “BlackBerry”: blackberry, black berry, bberry, bb, blackbery, blackberi, blckberry, blkberry, or blacberrie. You get the idea.
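
As a rough sketch of how such variants might be folded back into one brand name, the example below uses Python's standard difflib module to measure string similarity. The 0.7 cutoff and the ALIASES table are assumptions chosen for illustration; very short forms such as “bb” share too few characters with the full name for similarity matching, so they need an explicit alias.

```python
from difflib import SequenceMatcher

CANONICAL = "BlackBerry"
ALIASES = {"bb": CANONICAL}  # hypothetical alias table for very short forms
THRESHOLD = 0.7              # assumed similarity cutoff, not a tuned value


def resolve_brand(mention):
    """Return the canonical brand name for a mention, or None if it does not match."""
    key = mention.lower().replace(" ", "")
    if key in ALIASES:
        return ALIASES[key]
    similarity = SequenceMatcher(None, CANONICAL.lower(), key).ratio()
    return CANONICAL if similarity >= THRESHOLD else None


variants = [
    "blackberry", "black berry", "bberry", "bb", "blackbery",
    "blackberi", "blckberry", "blkberry", "blacberrie",
]
resolved = [resolve_brand(v) for v in variants]

print(resolved)
print(resolve_brand("strawberry"))
```

A single cutoff will always trade missed variants against false matches, which is why a curated alias list usually sits alongside any fuzzy matching.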

Further, consider the following text lingo examples from netlingo.com:

*$: Starbucks (I wonder if Starbucks knows this lingo for their brand!)

AWLTP: Avoiding Work like the Plague

AYOR: At Your Own Risk (useful for analyzing product sentiment)

BlkBry: BlackBerry (I missed that one in my list above)

BTD: Bored to Death (useful for analyzing movie reviews and presentation feedback!)

Slanguage Analysis (Natural Slanguage Processing): When you are presented with the task of analyzing slanguage-ridden text, all is not lost. If you are to understand what people are saying about your organization and services, text analytics provides the means to automatically find the signal amidst the noise. The presence of slanguage, emoticons, clipped text, abbreviations, and profanity can tell you something about both the content and the author. The aim of the analytics is to report results in your own business terms – not the terms of TweetShrink.

I am frequently asked, “How much information in social media data is useful?” The answer to that question must depend on the definition of “useful.” In a manual review of a sample of tweets, some 63 percent of extracted information was considered to be well-formed (representative of the information in the tweets). The remaining 37 percent was noise. If the well-formed information does one of the following – helps protect customer safety and avoid a public recall for an automotive manufacturer; detects a segment of high-value, unhappy customers; detects an issue that, when fixed, helps leapfrog a competitor; or identifies customers with “at-risk” behavior – then it is useful.

Context is Key: The aim is not to remove all of the noise but to give meaning to it based on the appropriate context. In order to do this, healthcare organizations will have to face language evolution, and text analytics vendors will need to put processes in place to record and translate the language changes in the context in which they are used. R&R to the military has a completely different translation than R&R for the automotive industry, and it is that context that helps us understand which translation is appropriate.
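
A minimal sketch of context-dependent translation might look like the following. The GLOSSARY table and the two expansions of R&R are my assumptions about common usage in each field (the glossary is hypothetical, and a real system would need a maintained, much larger one).

```python
# Hypothetical glossary keyed by (term, domain); the expansions are assumed.
GLOSSARY = {
    ("r&r", "military"): "rest and recuperation",
    ("r&r", "automotive"): "remove and replace",
}


def translate(term, domain):
    """Return the domain-specific expansion, or the term unchanged if unknown."""
    return GLOSSARY.get((term.lower(), domain), term)


print(translate("R&R", "military"))
print(translate("R&R", "automotive"))
print(translate("R&R", "healthcare"))  # no entry, so the term is left as-is
```

Leaving unknown terms untouched, rather than guessing, keeps them visible so they can be reviewed and added to the glossary as the language changes.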

The ability to automate the detection of slanguage and provide the correct context-based translation for the relevant call center industry is key. The technology will evolve as the language does.

Manya Mayes is the director of advanced analytics at Attensity, which delivers an integrated suite of customer analytics and engagement applications that help organizations use the voice of the customer to deliver a superior customer experience. Manya has spent over 20 years using statistical methods to analyze data, the last 15 working specifically with data and text mining software.

[From the February/March 2011 issue of AnswerStat magazine]