TEXT MINING - TAPPING HIDDEN KERNELS OF WISDOM
A thought paper on ‘business benefits of text mining.’

Executive Summary

Text mining has evolved and attracted considerable interest in recent times, and with good reason. Organizations collect huge amounts of text data, and they can use various text mining tools and techniques to analyse that data for meaningful, actionable insights.

This paper discusses how automatic document classification, information retrieval, word frequency calculation, sentiment analysis, topic modelling and trend analysis can be used for root cause analysis, devising competitive strategies, enhancing customer experience and so on.

The Approach

Text data can take many forms: Facebook posts and comments, tweets, customer feedback, blog posts, sales representatives' notes, patient health records, complaint logs, third-party surveys (free-text formats), newswires, newsfeeds, documents, etc. Conceptually, text mining is a three-step process (Figure 1).

Comments and feedback can be stored in simple tables in a spreadsheet or database, and documents can be stored in folders. Text available on social media and other websites can be collected using scraping tools. Known attributes of the text, such as date, location, customer ID and business area, should also be captured and recorded, as they help in filtering and selective analysis. Incoming text can be cleaned up by removing irrelevant data, if required. This text data can then be mined with the help of commercial programs (SAS, SPSS, SAP InfiniteInsight), niche commercial or open source programs (Attensity, OpenText, SmartLogic, openNLP, KNIME) or analytics tools (R, Python) (Figure 2).
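The capture, clean-up and filtering steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Comment` record, its field names and the `clean` and `select` helpers are hypothetical, and the clean-up here only strips URLs and extra whitespace.

```python
import re
from dataclasses import dataclass
from datetime import date


@dataclass
class Comment:
    """One piece of collected text plus its known attributes (hypothetical schema)."""
    text: str
    posted_on: date
    location: str
    business_area: str


def clean(text: str) -> str:
    """Strip URLs and collapse whitespace; real pipelines usually do more."""
    text = re.sub(r"https?://\S+", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def select(comments, business_area, since):
    """Keep cleaned comments for one business area posted on or after `since`."""
    return [
        clean(c.text)
        for c in comments
        if c.business_area == business_area and c.posted_on >= since
    ]


# Made-up sample records, for illustration only.
comments = [
    Comment("Lost my bag again!  See http://example.com/pic", date(2015, 3, 2), "Delhi", "baggage"),
    Comment("Great   crew on the flight", date(2015, 3, 5), "Mumbai", "cabin"),
    Comment("Bag arrived late", date(2014, 12, 1), "Delhi", "baggage"),
]

selected = select(comments, "baggage", date(2015, 1, 1))
print(selected)
```

Recording the attributes alongside the text is what makes the selective analysis possible: the same records can be filtered by location or date without touching the text itself.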

Figure 1: The three-step text mining process.
• Extraction: web sensors, open and paid APIs, web scraping, internal data, customer feedback, value chain partner interaction
• Loading: data into flat files or spreadsheets, data load in databases
• Analytics: advanced text analytics, word frequency, word associations, trend analysis, sentiment analysis

Figure 2: Text mining tools.
• Inputs: surveys, emails, logs, web scraping tools
• Tools: SAS Text Analytics, SPSS Text Analytics, SAP InfiniteInsight, Smart Logic, RapidMiner, OpenText, R text mining
• Outputs: word cloud, sentiment analysis and trends, topic analysis

Techniques

Document Classification and Grouping

The immediate benefit organisations get from text mining is better classification of documents (mails, feedback, comments). It helps them classify documents by function, department, business area, etc., improving efficiency. Queries or newswires can be classified and routed to the people concerned, so they can take appropriate action. Techniques like K Nearest Neighbor (KNN), Decision Trees and Bayesian Classifiers can all be used to classify documents. Usually, documents are matched against already known categories, but there are cases where several incoming documents need to be grouped by their structure rather than classified against known categories. This is where the concept of high homogeneity and high heterogeneity comes into the picture: documents within a group should be as similar as possible, and the groups themselves as different as possible. In this regard, document classification and grouping are very similar to supervised and unsupervised learning respectively.
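As a rough illustration of the K Nearest Neighbor idea mentioned above, the sketch below classifies a new document by majority vote among the most similar labelled documents. It rests on several assumptions that are not from this paper: similarity is measured as Jaccard overlap of word sets, and the function names, training texts and category labels are all invented for the example.

```python
from collections import Counter


def word_set(text):
    """Lower-case the text and return its distinct words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Share of words two documents have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def knn_classify(text, labelled_docs, k=3):
    """Return the majority label among the k most similar labelled documents."""
    query = word_set(text)
    ranked = sorted(
        labelled_docs,
        key=lambda pair: jaccard(query, word_set(pair[0])),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Made-up training data: (document text, known category).
training = [
    ("refund not received for cancelled ticket", "billing"),
    ("charged twice on my credit card", "billing"),
    ("invoice amount is wrong", "billing"),
    ("baggage lost at the airport", "operations"),
    ("flight delayed and baggage damaged", "operations"),
    ("baggage delayed on arrival", "operations"),
]

label = knn_classify("my baggage was damaged on the flight", training, k=3)
print(label)
```

Because the categories are known in advance, this is the supervised case; grouping documents without predefined labels would call for a clustering technique instead.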

Information Retrieval

Businesses need information to base their decisions on, and hence must have such insightful information at their disposal. Information retrieval refers to the technique of fetching a set of documents containing the desired information. This can be achieved by passing a ‘clue’ to the program, in the form of a keyword or a combination of words. The program returns the documents whose text most closely matches the ‘clue’. From a text mining perspective, the single or multiple words passed as the ‘clue’ can themselves be considered a document. Based on ‘similarity’ measures, the set of documents with the best match is returned. The simplest and most obvious similarity measure is the count of common words. More sophisticated measures extend the idea of word counts, for example weighted counts and cosine similarity.
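The retrieval idea above, treating the ‘clue’ as a small document and ranking by similarity, can be sketched as follows. This is a minimal sketch, not a description of any particular tool: the weighting shown is one common TF-IDF variant chosen as an example of weighted counts, and the function names and sample documents are assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def build_index(docs):
    """Return one weighted word vector per document, plus the weight table."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    doc_freq = Counter(w for tokens in tokenized for w in set(tokens))
    # Words that occur in fewer documents get a higher weight.
    idf = {w: math.log(n / df) + 1.0 for w, df in doc_freq.items()}
    vectors = [
        {w: count * idf[w] for w, count in Counter(tokens).items()}
        for tokens in tokenized
    ]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse word-weight vectors."""
    dot = sum(weight * b.get(w, 0.0) for w, weight in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(clue, docs, top_n=2):
    """Return up to top_n documents with a non-zero similarity to the clue."""
    vectors, idf = build_index(docs)
    # The clue is treated as a small document; words never seen are ignored.
    query = {w: c * idf[w] for w, c in Counter(tokenize(clue)).items() if w in idf}
    scored = sorted(
        ((cosine(query, vec), doc) for vec, doc in zip(vectors, docs)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return [doc for score, doc in scored if score > 0][:top_n]


# Made-up documents, for illustration only.
docs = [
    "in-flight entertainment screen was broken",
    "baggage handling was slow at the airport",
    "hotel room was clean and quiet",
]
results = search("baggage airport", docs)
print(results)
```

Cosine similarity normalises for document length, so a long document is not favoured simply because it contains more words.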

Word Frequency

One of the most widely used word-frequency techniques in text mining is the wordcloud: a visually appealing representation of the most frequent words in the text data, in which the font size of each word is proportional to its frequency. Most text mining tools provide options to control the number of words, remove redundant words, reduce the frequency of similar words, etc. Wordclouds offer a simple yet insightful way to highlight hitherto unknown aspects of the business. They can be made more insightful by building drilldown capabilities that search for comments containing a specified ‘keyword’. Organisations can either opt for specialized software to generate wordclouds, or choose general-purpose text mining programs that have such functionality built in.
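The counting behind a wordcloud, and the keyword drilldown described above, can be sketched as follows. The stop-word list, function names and sample comments are assumptions made for illustration; a real tool would ship a much longer list of redundant words and would also handle the rendering.

```python
import re
from collections import Counter

# Illustrative stop-word list; real tools ship much longer ones.
STOP_WORDS = {"the", "a", "an", "is", "was", "and", "of", "to", "in", "my", "it"}


def word_frequencies(comments, top_n=5):
    """Count non-stop-words across all comments and return the top_n."""
    words = []
    for comment in comments:
        for word in re.findall(r"[a-z']+", comment.lower()):
            if word not in STOP_WORDS:
                words.append(word)
    return Counter(words).most_common(top_n)


def drill_down(comments, keyword):
    """Return the comments that mention `keyword` as a whole word."""
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    return [c for c in comments if pattern.search(c)]


# Made-up comments, for illustration only.
comments = [
    "The baggage was delayed",
    "Baggage delayed again and the staff was rude",
    "Friendly staff",
]

top = word_frequencies(comments, top_n=3)
print(top)
print(drill_down(comments, "staff"))
```

In a wordcloud, the counts returned by `word_frequencies` would drive the font sizes, and `drill_down` corresponds to clicking a word to read the comments behind it.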

Sentiment Analysis

The most exciting and challenging aspect of text mining is analyzing the sentiments in the text data. Imagine a scenario where businesses can understand the sentiments (positive, negative or neutral) latent in comments, posts and feedback without going through copious amounts of text data. Such information would be extremely valuable to organisations operating in customer-centric B2C markets, like hospitality, travel, banking, retail, etc. Sentiments can be analyzed in two ways:

• Machine Learning: A training data set with predefined sentiments is made available to the program. The algorithm trains on the occurrence of words or patterns, and new incoming texts are classified based on the occurrence of such patterns or words.
• Sentiment Lexicons: Matches of positive or negative words are found and a sentiment measure is calculated.

However, this is easier said than done. The complexity of languages, sarcasm, colloquialism, poor spelling and grammar can mar the accuracy of results in this area. In addition, sentiments need to be analyzed holistically and in context. For example, a comment that finds two products or services similar can only be construed as positive or negative by knowing about the products or services being discussed. The comment could be a compliment or a criticism, depending on what is being compared. Sentiment analysis engines are generally 60-70% accurate (humans are 80% accurate), but results are improving with continuous research.

Topic & Content Analysis

Advanced text mining can be used to analyze topics and context. Just like sentiment analysis, topic analysis can be done either through machine learning or based on predefined dictionaries. Advanced techniques, like Support Vector Machines, Latent Dirichlet Allocation, etc., can also be used. In addition, word clustering (Hierarchical or K Means Clustering techniques), network diagrams and word associations can be used to look for topics or context based on natural word groups. Co-occurrence and proximity are two of the most useful ways to group words into similar topics.
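The lexicon-based approach to sentiment analysis described above can be sketched as follows. The word lists are tiny and invented for this example rather than taken from any published lexicon, and the one-word negation rule is an added assumption, included only to show why context matters.

```python
import re

# Tiny illustrative lexicons; these word lists are assumptions, not a published resource.
POSITIVE = {"good", "great", "friendly", "clean", "excellent"}
NEGATIVE = {"bad", "rude", "dirty", "delayed", "broken"}
NEGATORS = {"not", "never", "no"}


def sentiment_score(text):
    """Return positive hits minus negative hits, flipping a word that follows a negator."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, token in enumerate(tokens):
        polarity = (token in POSITIVE) - (token in NEGATIVE)
        if polarity and i > 0 and tokens[i - 1] in NEGATORS:
            polarity = -polarity
        score += polarity
    return score


def label(text):
    """Map the numeric score to positive, negative or neutral."""
    score = sentiment_score(text)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"


print(label("Great room and friendly staff"))
print(label("The room was not clean"))
print(label("Great, another delayed flight"))
```

The last comment is sarcastic and clearly negative to a human reader, yet its one positive and one negative word cancel out and it is labelled neutral. This is the kind of error that limits the accuracy of simple lexicon matching.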


Business Use Cases

Text mining finds application across industries and functions: travel, retail, banking, hospitality, healthcare, etc. Market intelligence teams can segregate news feeds and web articles using document classification or grouping. Similarly, incoming emails can be separated and auto-forwarded to the respective teams. An airline may find that its in-flight entertainment system or baggage handling process is a sore point for its customers, or a hotel may find that seemingly innocuous construction near its premises is irritating its otherwise satisfied customers. HR teams can analyze employee feedback and find potential areas of improvement for organizational development. Reviews can be made even more valuable by crawling the top web searches for user queries and providing text mining results (frequent words, sentiments, broad areas of discussion, trends, etc.) in real time.

Text mining can also be used in tandem with voice-to-text technology to analyze transcripts of airline cockpit voice recorders and gain insights. This can help explain the reasons behind anomalous and risky flight manoeuvres or flight incidents. Voice-based feedback in hotels or banks can also serve as an important input for text mining.

Text mining can be very valuable for both intra- and inter-organizational benchmarking. One can view word clouds, sentiments, etc. for two different time frames, geographies, departments, functions, etc. Text mining can be used for comparison against industry rivals too. If time-stamped data is available, it can be used for a trend analysis of sentiments or social outreach (number of comments, posts, likes, etc.). For example, an F&B organization can see how its flagship product fares against its rival’s product. A simple wordcloud and drilldown may reveal that it is not considered as healthy a breakfast companion as the beverage sold by its competitor. Text mining can also be used to find associations between diseases, as well as interactions between drugs and their adverse effects.
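The trend analysis on time-stamped data mentioned above can be sketched as a simple monthly roll-up. This is an illustration under assumptions: each record is taken to be a (date, sentiment score) pair, the function name and sample values are invented, and how the score is produced is left open.

```python
from collections import defaultdict
from datetime import date


def monthly_trend(records):
    """Return {(year, month): (comment count, average score)} in date order.

    `records` is an iterable of (date, score) pairs; how the score is
    produced (lexicon, machine learning, manual tagging) is left open.
    """
    buckets = defaultdict(list)
    for posted_on, score in records:
        buckets[(posted_on.year, posted_on.month)].append(score)
    return {
        month: (len(scores), sum(scores) / len(scores))
        for month, scores in sorted(buckets.items())
    }


# Made-up records, for illustration only.
records = [
    (date(2015, 1, 5), 1),
    (date(2015, 1, 20), -1),
    (date(2015, 2, 3), -1),
    (date(2015, 2, 14), -2),
]

trend = monthly_trend(records)
print(trend)
```

Running the same roll-up separately for two geographies, departments or rival products gives the side-by-side comparison used in benchmarking.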


Author

Anand Nath Jha, Analytics Architect – DWBI & Analytics, ITC Infotech

Mr. Anand has 17 years of diverse experience in strategic Marketing, Analytics, Project Management and Aerospace Engineering. He has worked for Honeywell, General Electric, Hindustan Aeronautics Ltd. and LM Windpower. He graduated in Aerospace Engineering from IIT Kanpur, and earned an MBA from the University of Phoenix and an Advanced Certification in Analytics from IIM Lucknow.

Co-Author

Viros Sharma, Vice President & Global Practice Head – DWBI & Analytics, ITC Infotech

Mr. Viros has more than 20 years of experience in the DW/BI Consulting and Practice Building space. He has worked for multinational IT companies like BearingPoint and iGATE in India and the USA. He did his AMP from IIMB and holds double masters in Mathematics and Computer Applications.

About ITC Infotech

ITC Infotech, a fully owned subsidiary of USD 7 billion ITC Ltd, provides IT services and solutions to leading global customers. The company has carved a niche for itself by addressing customer challenges through innovative IT solutions. ITC Infotech is focused on servicing the BFSI (Banking, Financial Services & Insurance), CPG&R (Consumer Packaged Goods & Retail), Life Sciences, Manufacturing & Engineering Services, THT (Travel, Hospitality and Transportation) and Media & Entertainment industries. For more information, please visit http://www.itcinfotech.com | Or write to: [email protected]