Big Data

Knowledge Engineering with Big Data

Xindong Wu, Hefei University of Technology and University of Vermont
Huanhuan Chen, University of Science and Technology of China
Gong-Qing Wu, Hefei University of Technology
Jun Liu and Qinghua Zheng, Xi'an Jiaotong University
Xiaofeng He and Aoying Zhou, East China Normal University
Zhong-Qiu Zhao, Hefei University of Technology
Bifan Wei, Xi'an Jiaotong University
Ming Gao, East China Normal University
Yang Li, University of Science and Technology of China
Qiping Zhang, Hefei University of Technology
Shichao Zhang, Zhejiang Gongshang University
Ruqian Lu, Chinese Academy of Sciences
Nanning Zheng, Xi'an Jiaotong University

BigKE is a big data knowledge engineering framework that handles fragmented knowledge modeling and online learning from multiple information sources, nonlinear fusion of fragmented knowledge, and automated demand-driven knowledge navigation.

Along with all the advances in computing and information technologies, the prevalence of big data in scientific research and industrial applications is rapidly increasing. Massive scientific datasets are being accumulated, stored, and processed in research areas besides computer science—the Sloan Digital Sky Survey (SDSS) project,1 located at the Apache Point Observatory in New Mexico, for example, amasses big data related to astronomy. In its first few weeks of operation, the volume of collected astronomical data was larger than all the data collected in the history of astronomy to that point. So far, the SDSS has gathered more than 140 Tbytes of data. When the Large Synoptic Survey Telescope (LSST) goes online in 2016, it will produce an estimated 20 Tbytes of data per night, with at least 50 Pbytes of data saved by the end of the survey.2

In addition to scientific research, big data is increasingly part of our businesses and everyday lives, with many insurance and other applications tapping into large, fast-moving, complex streams of big data and applying advanced analytical techniques to transform the way they do business.3–5 In many situations, storing all observed data is infeasible, making it essential to efficiently acquire useful knowledge from different local sources. However, streaming data usually comes from multiple, heterogeneous, autonomous sources with complex and evolving relationships, and traditional algorithms can typically acquire only pieces of knowledge from single sources.6 Moreover, these pieces of knowledge are usually fragmented, with uncertainty, incompleteness, and varying levels of quality. This fragmented knowledge is like the pieces of a jigsaw puzzle—each piece provides some limited information, but not the whole picture. Traditional knowledge engineering can't obtain and process such fragmented knowledge because it's usually acquired from different sources. We propose a framework for big data knowledge engineering (BigKE) to address this problem.

A Three-Tier Approach

BigKE consists of three tiers: fragmented knowledge modeling and online learning from multiple information sources, nonlinear fusion of fragmented knowledge, and automated demand-driven knowledge navigation.

Unprecedented data volumes require an effective data analysis platform to achieve fast responses and real-time classification. Conventional offline data mining methods are inappropriate for streaming data because they require all the data to be available in advance. Online learning methods (the first tier of BigKE) can help tackle this challenge and swiftly adapt to concept drift in streaming data.

BigKE's second tier fuses the fragmented knowledge acquired from multiple local sources to obtain a complete set of knowledge. Taken individually, such knowledge is likely to be biased and inaccurate due to the limited scope of its analysis. BigKE's nonlinear fusion process provides the unified representation and foundation for the construction of a knowledge ontology7—that is, a complete set of knowledge for the target learning goal.

BigKE's third tier takes the fused knowledge ontology and the unified representation from an overall perspective, implementing demand-driven knowledge navigation for personalized knowledge services.

Take the development of machine translation as a motivating example of knowledge engineering with big data. Since machine translation was first investigated in 1954, it has been widely acknowledged that the problem can't be solved by memory and grammar rules alone. The IBM project Candide8,9 aimed to improve translation, and the outcome was inspiring: it trained on some 300 million sentences from parliamentary documents published in both English and French. The core underlying mechanism was the transformation of the translation problem into a statistical one.10 This transformation worked well and improved translation accuracy, but even though a lot of effort was devoted to the project, little further progress was achieved. In 2006, Google stepped into the machine translation field by fusing autonomous, heterogeneous sources. To train Google Translator,11 a large amount of accessible material was collected from the Internet, including government documents and associated translations, along with low-quality materials of every stripe.

Although its information sources are diverse, Google Translator achieves state-of-the-art performance, as evidenced by its share of the translation market, because it accepts all kinds of sources, including inaccurate ones, in its training phase and then fuses all of that fragmented knowledge. The training materials are complementary, with inaccurate materials revised during the training process. Google's translation system follows an approach similar to BigKE's—it consists of online learning, fusion of fragmented knowledge, and knowledge service.

Along with techniques that can be integrated into this three-tiered framework, semantics-based automated service discovery12 and mobile app classification with enriched contextual information13 are good examples of BigKE applications. Semantics-based automated service discovery addresses Web service discovery without explicitly associated semantic descriptions, and mobile app classification explores additional Web knowledge and enriched contextual information. To enhance knowledge acquisition from multiple, heterogeneous, autonomous webpages for personalized knowledge services, researchers at the Academy of Mathematics and Systems Science (AMSS) at the Chinese Academy of Sciences undertook the Second-Generation Browser Project (SGBP)14 in 2011 to develop a new type of browser capable of extracting and synthesizing knowledge according to user demand.

Various algorithms have attempted to process large-scale, high-dimensional data. Divide-and-conquer anchoring,15 for example, was proposed to tackle nonnegative matrix factorization in high dimensions. By dividing a high-dimensional anchoring problem into cheaper subproblems in low-dimensional spaces and then combining the anchors detected in those subproblems, this algorithm solves the problem in a parallel way. To represent data, multiview intact space learning16 was proposed to offer complementary information from different views of multiple sources, helping the browser discover latent representations. In addition, an incremental, scalable community detection method based on locality-sensitive hashing17 models dynamic unstructured data as streams and processes it in real time.
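To make the three-tier flow concrete, here is a minimal sketch, in Python, of how the tiers could be composed. All names (Fragment, FragmentLearner, KnowledgeFuser, Navigator) are hypothetical stand-ins rather than any published BigKE implementation, and each tier is reduced to a stub.

```python
from dataclasses import dataclass

@dataclass
class Fragment:
    """A piece of fragmented knowledge learned from one local source."""
    source: str
    head: str
    tail: str
    confidence: float

class FragmentLearner:
    """Tier 1 stub: online learning of fragments from one data stream."""
    def __init__(self, source):
        self.source = source
    def learn(self, record):
        # A real implementation would update an online model per record;
        # here each record is simply wrapped as a knowledge fragment.
        return Fragment(self.source, record["head"], record["tail"], 0.5)

class KnowledgeFuser:
    """Tier 2 stub: fuse fragments into a single knowledge graph."""
    def fuse(self, fragments):
        graph = {}
        for frag in fragments:
            graph.setdefault(frag.head, set()).add(frag.tail)
        return graph

class Navigator:
    """Tier 3 stub: demand-driven lookup over the fused graph."""
    def __init__(self, graph):
        self.graph = graph
    def answer(self, entity):
        return sorted(self.graph.get(entity, set()))

# Two autonomous sources feed tier 1; tiers 2 and 3 follow in sequence.
streams = {
    "forum":  [{"head": "migraine", "tail": "stress"}],
    "papers": [{"head": "migraine", "tail": "magnesium"}],
}
fragments = [FragmentLearner(src).learn(rec)
             for src, records in streams.items() for rec in records]
graph = KnowledgeFuser().fuse(fragments)
print(Navigator(graph).answer("migraine"))  # ['magnesium', 'stress']
```

The point of the sketch is only the composition: fragments flow upward through learning, fusion, and navigation, with each tier consuming the previous tier's output.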

Related Work in Modeling Big Data for Knowledge Engineering

BigKE starts with fragmented knowledge for big data applications. Here we review the 4P medical model, the IBM 4V model, the five Rs of big data, and the HACE theorem, and compare them with our BigKE model for modeling big data in knowledge engineering.

The 4P Medical Model
Several research efforts on medical diagnosis systems have relevance to big data applications. Most early systems were based on domain expertise, which led to expert systems.1 The lack of specific knowledge about patients and their histories, disease history in patient families, and context information about patient communities means these systems are often less effective. Gradually, the importance of individual knowledge and social factors in curing diseases was recognized. In this context, the 4P medical model was proposed.2 The 4P model includes predictive, preventive, personalized, and participatory dimensions. It demonstrates a general trend in current healthcare by stating the importance of expertise and the participation of social and personal factors. Such expert, social, and personal information could be accessed from different storage sites—such as medical publications, online doctor guides, patient communities on Facebook, and so on—that are typically autonomous, heterogeneous data sources. To infer integrated knowledge from these information sources, we urgently need data mining algorithms capable of handling diverse and large volumes of data.

The fusion of medical knowledge and big data, which comes from both individuals and society, leads to pervasive medical care.3 The major difference between pervasive and current healthcare is the proportion of individual information and domain expertise. Pervasive healthcare extracts fragmented knowledge from clients and other accessible sources (such as social media), fuses the fragmented knowledge into a global knowledge graph, and performs inference based on the overall knowledge graph together with individual requests. In contrast, conventional knowledge engineering relies heavily on domain expertise, weakening the personalized and social factors.

The IBM 4V Model
Gartner analysts4 proposed the 3V model (volume, velocity, and variety) to capture essential properties of big data. The volume dimension denotes that the learning problem can't always be carried out in batches, velocity characterizes data speed, and variety refers to the inconsistency in architecture and storage structures. Recently, IBM refined the 3V model after reflecting on new features in big data, extending it to a 4V model that includes veracity. Although the 3V model touched on essential properties of big data, the critical factor affecting strategy selection by commercial companies is data veracity. IBM holds the opinion that this revised model will drive companies to pay more attention to advanced technologies for data quality and to explore big data in full.

The Five Rs of Big Data
Merritte Stidston proposed the five Rs of big data from the data management perspective: relevant, real time, realistic, reliable, and ROI.5 The five Rs provide supporting evidence for the validation of our BigKE framework. Characteristics like relevant (data fit) and real time (data on time) emphasize that an application system should draw on relevant data sources and respond quickly, leading to the evaluation of data sources and the use of online learning algorithms. Data quality, or the reliable dimension, emphasizes the need to quantify key factors such as trustworthiness. Data insights, or the realistic dimension, are analogous to knowledge fusion. The ultimate goal, ROI (return on investment for commercial companies), corresponds to the knowledge services in BigKE's third tier.

The HACE Theorem
According to the HACE theorem,6 the exploration of big data starts with heterogeneous data from autonomous sources with diverse infrastructures, seeking complex and evolving relationships. The HACE theorem unifies previous work on the 4P, 4V, and 5R models and provides a framework to process big data. However, it doesn't solve the problem of learning from fragmented knowledge in a systematic engineering framework. In this big data era, a vast amount of fragmented knowledge comes from autonomous and heterogeneous sources. Our BigKE framework aims to explore the utility of fragmented knowledge to generate integrated knowledge and to provide personalized knowledge services. BigKE also goes a step further to connect fragmented knowledge with a novel view on knowledge engineering.

Second-Generation Browser Project
Current browsers are incapable of extracting the required knowledge from a huge set of webpages and producing a survey or solution for users. The Second-Generation Browser Project (SGBP),7 undertaken by a team of researchers at the Academy of Mathematics and Systems Science in 2011, aims to develop a new type of browser capable of extracting and synthesizing knowledge from sets of webpages according to user demand. It consists of three parts: a document library tree constructor, a survey report generator, and a query/answer server. The styles of service are threefold: the SGBP server provides a tree-style text library for reader browsing, it can produce a knowledge survey with segments from this tree, and it answers readers' questions with SQL-like tools to fetch and locate solutions.

References
1. S. Linnainmaa, "Overview of Expert Systems Technology," Espoo, vol. 1, 1990, pp. 13–35.
2. C. Auffray, D. Charron, and L. Hood, "Predictive, Preventive, Personalized and Participatory Medicine: Back to the Future," Genome Medicine, vol. 2, no. 8, 2010, pp. 57–59.
3. D. Vassis et al., "Providing Advanced Remote Medical Treatment Services through Pervasive Environments," Personal and Ubiquitous Computing, vol. 14, no. 6, 2010, pp. 563–573.
4. D. Che, M. Safran, and Z. Peng, "From Big Data to Big Data Mining: Challenges, Issues, and Opportunities," Database Systems for Advanced Applications, Springer, 2013, pp. 1–15.
5. M. Stidston, "Business Leaders Need R's not V's: The 5 R's of Big Data," 2007; https://www.mapr.com/blog/business-leaders-need-r%E2%80%99s-not-v%E2%80%99s-5-r%E2%80%99s-big-data#.VYgNtfmUeEk.
6. X. Wu et al., "Data Mining with Big Data," IEEE Trans. Knowledge and Data Eng., vol. 26, no. 1, 2014, pp. 97–107.
7. R. Lu et al., "KACTL: Knowware Based Automated Construction of a Treelike Library from Web Documents," Proc. Int'l Conf. Web Information Systems and Mining (WISM 12), 2012, pp. 645–656.

A Knowledge Engineering Framework for Big Data Challenges

In August 1977, Edward Feigenbaum proposed the concept of knowledge engineering for the first time, at the 5th International Joint Conference on Artificial Intelligence (IJCAI 77). Since then, knowledge engineering has been a foundation of expert systems (also known as knowledge-based systems).18 The construction and interpretation of a knowledge-based system rely on carefully designed knowledge acquisition, representation, and inference. Knowledge engineering is also a foundation of various intelligent systems, which differ from expert systems because of context awareness and adaptability. We can roughly divide knowledge engineering's evolution into four generations:

• The first generation focused on empirical applications. In 1965, Dendral,19 a computer program made to derive molecular structures, marked the emergence of the first expert system. It outperformed contemporaneous experts, including its designers.
• The second generation was also marked by an expert system, MYCIN,20 and its successor, EMYCIN.21 The MYCIN system could acquire and interpret medical knowledge; it was designed to diagnose diseases and interact with users in English.
• The third generation involved wide industrial applications in the 1980s. The products of knowledge engineering entered the industrial sector on a large scale and brought remarkable profit.
• Since the beginning of the 21st century, the focus has been on big data. A vast amount of data is generated and processed every day, and accessible content on the Internet is far beyond the exploration capabilities of data consumers, who usually can't locate the relevant information within an acceptable time frame. In addition to the large volumes of data, the behavior of data consumers also keeps evolving with ongoing events.

To cope with the challenges brought by the big data phase of knowledge engineering's evolution, BigKE uses its three-tiered framework to offer several advantages over conventional knowledge engineering. As Figure 1 shows, BigKE starts by learning fragmented knowledge from autonomous, heterogeneous sources with the aim of delivering personalized knowledge services. The first tier handles uncertainty and heterogeneity, generating fragmented knowledge. The associated temporal and spatial characteristics of data and the raw fragmented knowledge are combined for shared contents or topics.22,23 These combined knowledge fragments can then be adopted to evaluate the reliability and quality of the data sources and the obtained knowledge. The second tier fuses the fragmented knowledge to generate integrated knowledge via knowledge graph discovery, which is a graphical knowledge representation scheme suitable for knowledge fusion and navigation. Discovering possible connections between fragmented knowledge and subgraphs makes learning on the graph feasible. Personalized inquiries and context information are modeled in BigKE's third tier to provide demand-driven knowledge services. This goal is realized by knowledge compilation and publication, which tailor the knowledge to personalized requirements and effectively present the knowledge to users, respectively.
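As a concrete picture of what the first tier's combined fragments might look like before fusion, the following sketch groups knowledge fragments by shared topic and counts independent sources per topic. The tuple layout and the source-counting heuristic are assumptions made for illustration, not a prescribed BigKE format.

```python
from collections import defaultdict

# Each fragment: (topic, statement, source, timestamp); this tuple layout
# is an assumed toy schema, not a prescribed BigKE format.
fragments = [
    ("flu", "cases rising in region A", "tweets", "2015-01-10"),
    ("flu", "queries for flu symptoms up", "search-logs", "2015-01-11"),
    ("traffic", "congestion on the ring road", "road-sensors", "2015-01-10"),
]

# Combine fragments that share a topic, keeping provenance and time.
by_topic = defaultdict(list)
for topic, statement, source, timestamp in fragments:
    by_topic[topic].append((source, statement, timestamp))

# A topic backed by several independent sources earns more trust, which is
# one crude way to score the reliability of sources and fragments.
for topic, items in by_topic.items():
    distinct_sources = {source for source, _, _ in items}
    print(topic, "independent sources:", len(distinct_sources))
```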

The generation of integrated knowledge from fragmented knowledge happens in a manner that's often taken for granted but that's closely related to the involvement of intelligence. As research publications accumulate rapidly, they can collectively hold hidden knowledge. For example, Don Swanson used analysis and knowledge fusion techniques24 to study thousands of publications on migraine headaches, ultimately learning that several factors lead to them, most of which were connected to magnesium. This connection between migraines and magnesium was revealed solely through publication analysis. (It was later clinically verified.) The generation of integrated knowledge in BigKE could be useful for industrial and commercial sectors as well.

[Figure 1 shows the BigKE framework as a three-layer stack. The first tier (fragmented knowledge modeling through online learning from multiple data sources) comprises online learning with data streams and feature streams; modeling of data and feature streams with spatial and temporal characteristics; data reliability evaluation of autonomous sources; and semantic encapsulation of fragmented knowledge. The second tier (nonlinear knowledge fusion) comprises association analysis and emerging patterns in fragmented knowledge; nonlinear fusion of reliable subgraphs; and evaluation and evolution of the knowledge graph. The third tier (demand-driven knowledge service) comprises social and personalized modeling in conjunction with context awareness; knowledge navigation and path discovery; and knowledge compilation and knowledge publication.]

Figure 1. The BigKE framework. The first tier handles uncertainty and heterogeneity, generating fragmented knowledge. The second tier fuses the fragmented knowledge to generate integrated knowledge via knowledge graph discovery. The third tier models personalized inquiries and context information to provide demand-driven knowledge services.

On 2 June 2014, the US Food and Drug Administration (FDA) launched OpenFDA (http://open.fda.gov), which opened access to more than 300 million adverse reaction records. The FDA provides an API to download raw data, technical reports, and so on, and has built a community to promote interactions among scientists, with the goal of encouraging innovation around FDA data. The launch of the OpenFDA project reveals that conventional expert systems are incapable of tackling such big problems on their own. A "big," yet understandable approach is urgently required. BigKE is such a framework—it can provide understandable results for big data applications.

Fragmented Knowledge Modeling and Online Learning

In the big data era, information comes from multiple, heterogeneous, autonomous sources. The storage architectures are diverse, the management of the data sources is relatively autonomous, and the relationships among data objects are complex and evolving, all of which contributes to the emergence of fragmented knowledge. Fragmented knowledge modeling and online learning from multiple autonomous information sources are therefore inevitable and crucial for big data applications.

Traditional offline learning methods process data in batches, whereas online learning methods use data streams. Online learning methods are more capable of handling concept drift,25 which is hard to spot with offline learning algorithms. On Twitter, for example, discussion topics evolve rapidly in response to emergencies or events, a characteristic that requires data mining algorithms to adapt quickly.

Compared with conventional knowledge engineering, BigKE focuses on online learning techniques that operate on autonomous and heterogeneous sources in three ways.

First, BigKE uses both feature and data streams to compress data transmission. Feature streams derived from temporal and spatial information in data sources serve as a compact yet effective representation of data from multiple autonomous sources. One example is the use of Bayesian networks to model vehicle status for Mercedes-Benz.26 The complex status of a vehicle is represented by the parameters of a Bayesian network whose dimensionality is much less than that of the streaming sensor data. Vehicle statuses can be reconstructed and monitored from these parameters, which greatly reduces bandwidth usage and improves service levels.

Second, online learning algorithms are designed especially for scenarios in which rapid concept drift is occurring. Fast concept drift requires data mining algorithms to focus on the latest information. The impact of prior knowledge should be weakened as the underlying concept shifts—in other words, data mining algorithms should have shorter memories when concept drift is occurring. The length of memory should be dynamically tuned to adapt to concept changes.

Third, the quality of data collected from multiple autonomous sources varies, requiring effective evaluation approaches. Data quality can be measured according to either source quality or evidence from other data sources. For example, data collected from academic publications is more accurate than that from Internet forums or tweets, but with large volumes of data, these low-quality sources can be reciprocally verified and fused to generate high-quality knowledge. Google's prediction of flu trends from search keywords is an example of this point.27

To evaluate knowledge quality, the proper handling of data source heterogeneity and autonomy is crucial to successfully implementing data mining and fusion methods. To do this, a co-learning algorithm seeks to obtain supervised information from other peers—specifically, it attempts to construct a learning procedure in which several tasks optimize the same learning target or loss function as a whole. The tasks in co-learning complement each other for better performance. Transfer learning, on the other hand, aims to make a learning task "borrow" information from relevant tasks to enhance its performance. In BigKE, the relative quality of data sources is assessed in the learning procedure, and irrelevant information is disregarded. Therefore, the fusion of fragmented knowledge is expected to generate high-quality knowledge by conducting co-learning and transfer learning on multiple heterogeneous, autonomous data sources.

Many examples demonstrate the importance of modeling fragmented knowledge, especially in situations where no qualified models are available. ZestFinance, a financial company focused on analyzing and evaluating potential customers, tries to collect every possible record, including how long users browse the company website, which letters they capitalize in their credit card applications, whether their mobile phone bills are in arrears, and so on. This information comes from multiple sources, and the company builds various models with it. Data from a single source could be less convincing, but with all the relevant data from multiple sources contributing to the model, it's likely that reliable information will be obtained. In this example, modeling fragmented data demonstrates great potential for forecasting potential customers' behaviors.28
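The idea of dynamically tuning a learner's memory under concept drift can be sketched as follows. The drift test here (a simple comparison of window halves) is a stand-in for whatever detector a production system would actually use, and the data is invented.

```python
from collections import deque

class AdaptiveWindowLearner:
    """Online mean-predictor whose memory shrinks when drift is suspected."""
    def __init__(self, max_window=200):
        self.window = deque(maxlen=max_window)

    def update(self, value):
        self.window.append(value)
        if self._drift_suspected():
            # Shorter memory: keep only the most recent half of the window.
            recent = list(self.window)[len(self.window) // 2:]
            self.window = deque(recent, maxlen=self.window.maxlen)

    def _drift_suspected(self, threshold=1.0):
        # Crude stand-in detector: compare means of the older and newer halves.
        n = len(self.window)
        if n < 20:
            return False
        values = list(self.window)
        old, new = values[: n // 2], values[n // 2:]
        mean = lambda xs: sum(xs) / len(xs)
        return abs(mean(new) - mean(old)) > threshold

    def predict(self):
        return sum(self.window) / len(self.window)

# Usage: the stream's mean jumps from 0 to 5 halfway through.
learner = AdaptiveWindowLearner()
for t in range(400):
    learner.update(0.0 if t < 200 else 5.0)
print(round(learner.predict(), 2))  # prints 5.0 once the old regime is dropped
```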

Nonlinear Fusion of Fragmented Knowledge

Knowledge fusion relates autonomous knowledge sources and generates integrated knowledge. It differs from knowledge aggregation in that the latter just collects relevant knowledge and doesn't infer from it. Knowledge fusion not only collects knowledge, but also "creates" integrated knowledge by inferring new or hidden knowledge.

BigKE adopts a knowledge graph29 as the primary knowledge representation, with other knowledge representation schemes used for in-depth domain expertise. The fragmented knowledge acquired from online learning algorithms is modeled as metaknowledge, and the fragment relationships30 are represented by nodes and links in the graph.31 With this representation, the fusion of fragmented knowledge is transformed into a fusion problem on local knowledge (sub)graphs.

Google is an example of a system that has successfully applied a knowledge graph representation to organize search output entries. A knowledge graph can reduce the ambiguity of search entries by considering the search context, facilitating the visualization of the search engine's output. For example, a user entering "Amazon" into a search engine might be looking for the shopping website, not the world's biggest river—the knowledge graph can help here by minimizing confusion.

Google's knowledge graph differs significantly from BigKE's, however. Google's knowledge graph is dedicated to finding the most relevant information and improving the user experience; the ultimate goal is to better understand user inquiries. The Google knowledge project has distinct clients and can base its services on a knowledge ontology, such as Freebase or Wikipedia. BigKE, on the other hand, aims to provide services without distinct users or clients, and its knowledge sources don't come from an existing ontology. Rather, they come from large quantities of fragmented data. To provide better services, BigKE doesn't rely on static knowledge sources, but on sources that are evolving all the time.

The main research topics in BigKE's knowledge fusion are

• to evaluate and quantify the relationships of fragmented knowledge (relationships among fragmented knowledge pieces are unclear, so knowledge fusion methods must be able to evaluate and quantify them), and
• to fuse and represent the fragmented knowledge. (Human cognition is based on the relationships of fragments, with related fragmented knowledge pieces connected as a graph32; following this philosophy, representing knowledge as a graph facilitates the understanding of the relationships among fragmented knowledge pieces,33,34 transforming the problem of knowledge fusion into how to connect and merge nodes in the graph.)

There are several advantages to using a knowledge graph in knowledge fusion. First, the knowledge has a unified representation format for both local and global components, facilitating the fusion procedure. By representing specific domain knowledge in a unified graphical form, the domain knowledge from experts can also be fused, improving the knowledge graph's quality. Second, a demand-driven knowledge service is transformed into path navigation in a knowledge graph.35 The metaknowledge and its relationships in the graph solve a specific question.

In the past few years, several smart city projects36 have emerged in pilot cities such as Beijing, Shanghai, and Guangzhou. The initial aim of such projects was to fuse all possible sensor data and provide intelligent solutions for modern city situations, such as traffic control, crime, safety, and so on. The implementation of a smart city requires the exploration of multiple data sources and their relationships, and demand-driven knowledge services for intelligent management and operations. In this sense, BigKE provides essential support for the successful realization of such projects.
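A minimal sketch of the (sub)graph view of fusion, assuming each source contributes (head, relation, tail) edges with confidences: fusion merges matching edges while accumulating support and provenance. The additive weighting rule is an illustrative assumption, not the article's prescribed method.

```python
from collections import defaultdict

# Local knowledge subgraphs from two autonomous sources, with per-edge
# confidence values (all data invented for illustration).
subgraphs = {
    "medical-papers": [("migraine", "associated_with", "magnesium", 0.9)],
    "web-forums": [
        ("migraine", "associated_with", "magnesium", 0.4),
        ("migraine", "associated_with", "caffeine", 0.3),
    ],
}

fused = defaultdict(lambda: {"support": 0.0, "sources": set()})
for source, edges in subgraphs.items():
    for head, rel, tail, conf in edges:
        key = (head, rel, tail)
        fused[key]["support"] += conf       # accumulate evidence
        fused[key]["sources"].add(source)   # track provenance

# Edges confirmed by several sources accumulate more support than any single
# source could give them, which is the intuition behind nonlinear fusion.
for (head, rel, tail), info in sorted(fused.items(), key=lambda kv: -kv[1]["support"]):
    print(f"{head} -{rel}-> {tail}: support={info['support']:.1f}, sources={len(info['sources'])}")
```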

Automated Demand-Driven Knowledge Navigation

Knowledge navigation has already been successfully applied in Web-based e-learning systems,37 which aim to deliver learning content to students in an intelligent way. An effective e-learning system should be able to customize the delivery process according to students' personalized needs. Generally speaking, an e-learning system models learning content in a graphical form—the nodes and links correspond to knowledge units and their semantic relationships—and the customized requirements are formulated as a pathway on the graph.

Navigation in a knowledge graph is a way to connect fragmented knowledge for personalized demands. The most feasible solution is depicted in the knowledge graph as the most probable path. Obtaining and understanding human requests is a nontrivial task that involves many techniques, such as context awareness38 and collaborative filtering.39 Context-aware computing explores user context to avoid confusion and obtain a more confident understanding. Similarly, collaborative filtering, which is often used in recommendation systems, employs sorting techniques to filter ambiguous results.

Unlike the data-driven forward chaining of traditional knowledge engineering (such as the "recognize-act" cycles of match, conflict resolution, and apply in production systems, Rete,40 and LFA41,42), BigKE performs knowledge inference on the knowledge graph directly. Discovery algorithms might have a high computational complexity on a large-scale graph, so approximation techniques must be investigated for path discovery in the knowledge graph.
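Taking "the most probable path" literally, one standard trick is to assign each edge a probability, use negative log-probabilities as edge lengths, and run Dijkstra's algorithm, so that the shortest path is the most probable one. The graph and edge probabilities below are invented for illustration.

```python
import heapq
import math

# Toy knowledge graph with edge probabilities (invented values).
graph = {
    "query":     [("concept_a", 0.9), ("concept_b", 0.5)],
    "concept_a": [("answer", 0.6)],
    "concept_b": [("answer", 0.95)],
    "answer":    [],
}

def most_probable_path(graph, start, goal):
    """Dijkstra on -log(p) edge weights: shortest length = highest probability."""
    heap = [(0.0, start, [start])]
    best = {}
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == goal:
            return math.exp(-cost), path
        if best.get(node, float("inf")) <= cost:
            continue
        best[node] = cost
        for nxt, prob in graph[node]:
            heapq.heappush(heap, (cost - math.log(prob), nxt, path + [nxt]))
    return 0.0, []

prob, path = most_probable_path(graph, "query", "answer")
print(path, round(prob, 3))  # ['query', 'concept_a', 'answer'] with p = 0.54
```

For graphs at the scale the article discusses, an exact search like this would be replaced by the approximation techniques the authors call for.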

Big Data Challenges

Many researchers have investigated big data–related technologies and proposed many techniques and methods, including the HACE theorem,6 cloud computing,43,44 Hadoop,45 and MapReduce.46

Fragmented Knowledge versus Domain Expertise

In addition to IBM's 4 Vs (volume, velocity, variety, and veracity),47 another V characteristic that's often ignored is the value of big data analysis from heterogeneous, autonomous sources—it's an extreme challenge to obtain valuable fragmented knowledge from local information. Generally, knowledge is fragmented because different data sources prefer their own technologies for data acquisition, recording, representation, and processing, and they generate or collect data in a distributed rather than centralized fashion.

Compared with the domain expertise found in traditional knowledge engineering, fragmented knowledge exhibits lower accuracy but plays a vital role in the big data era. A lot of useful information can be found in fragmented knowledge compared to in-depth knowledge from domain experts. Because human experts are restricted by domain knowledge, expert-based learning is prone to being inefficient and human-biased. Moreover, domain expertise might be unavailable for some specific tasks. Take Ford Motor Company as an example. Its development team argued over whether its SUV trunk door should be electric. Doors that open electrically are more convenient for drivers, but they have a limited opening angle. Previous surveys didn't uncover this problem, yet social media indicated that customers were aware of it. In this case, domain expertise and traditional surveys were insufficient.

In some applications, domain expertise has a comparative advantage. It might be easier to obtain expert assistance when there is relatively little accumulated data available for a problem, and the pure data might be misjudged. Expertise can offset the effect of having less available data.

Knowledge Fusion

Earlier approaches to knowledge fusion depended on formal logic-based techniques. For example, fusion rule technology48 was developed to merge structured reports, such as XML documents, in which text entries are restricted to simple phrases. Knowledge fusion with big data offers opportunities to go beyond traditional structured data: blogs, short instant messages, emails, sensor data that isn't all-inclusive, and other nonstructured data can be merged together to derive integrated knowledge.

There are several challenges with nonlinear knowledge fusion in big data. The complexity has many aspects, including

• heterogeneous data types, such as tabular data (relational databases), text, hypertext, images, audio, and video;
• intrinsic semantic associations in data—for example, news stories on the Web, comments on Twitter, pictures on Flickr, and videoclips on YouTube could share common topics; and
• relationship networks among data, such as individual pages on the Internet linked to each other via hyperlinks to form a complex network.6,49

A new knowledge fusion approach must be carefully designed so that data can be linked through complex relationships to form useful knowledge; the growth of data volumes and item relationships should help form legitimate patterns to predict future trends. Only then can the value of big data be utilized to its greatest potential.

Demand-Driven Big Data Mining Algorithms

Because a knowledge graph evolves with time and the observed objects, the information extracted from it isn't constant. In addition, user requests can also evolve, all of which complicates data mining tasks. To incorporate a knowledge graph with user inquiries, data mining algorithms must model and analyze dynamic knowledge and dynamic requirements simultaneously.

A context-aware model is a promising approach that can help satisfy these requirements.50 With a knowledge graph, a context-aware model can aggregate individual, group, and social behaviors, emotions, and preferences, along with technologies such as social psychology, frequent pattern mining, and machine learning. By analyzing the interactions among behaviors, emotions, and preferences in context-aware models, a data mining task can discover knowledge and provide knowledge services more effectively and accurately. Demand-driven big data mining emphasizes the combination of all this information to better understand users and then infers from the knowledge graph to provide knowledge services, such as personalized knowledge compilation for medical training based on a user's interests and background.
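To illustrate how context could disambiguate a demand-driven inquiry (recall the "Amazon" example earlier), this sketch scores candidate graph entities against terms drawn from the user's current context. The Jaccard-overlap scoring, the toy schema, and all data are assumptions made purely for illustration.

```python
# Candidate interpretations of the query "amazon" in a knowledge graph,
# each tagged with descriptive terms (an assumed toy schema).
candidates = {
    "Amazon (company)": {"shopping", "retail", "cloud", "delivery"},
    "Amazon (river)":   {"river", "rainforest", "brazil", "wildlife"},
}

def rank_by_context(candidates, context_terms):
    """Score each candidate by Jaccard overlap with the user's context."""
    scores = {
        name: len(terms & context_terms) / len(terms | context_terms)
        for name, terms in candidates.items()
    }
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Context gathered from the user's recent activity (invented).
context = {"order", "delivery", "shopping"}
print(rank_by_context(candidates, context))  # company interpretation ranks first
```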

To explore the essential generation mechanisms underlying fragmented data, BigKE combines fragmented knowledge from multiple sources and domain expertise to provide reliable and personalized solutions to users. This could prove to be the most useful way of answering user queries, given the Web's fragmented nature.

Acknowledgments

This work is supported by the Program for Changjiang Scholars and Innovative Research Team in University (PCSIRT) of the Ministry of Education of China (under grant IRT13059), the National Natural Science Foundation of China (under grants 61229301 and 61203292), and the National 973 Program of China (under grant 2013CB329604).

References
1. S.M. Kent, "Sloan Digital Sky Survey," Science with Astronomical Near-Infrared Sky Surveys, N. Epchtein and A. Omont, eds., Springer, 1994, pp. 27–30.
2. N. Kaiser et al., "Pan-STARRS: A Large Synoptic Survey Telescope Array," Astronomical Telescopes and Instrumentation, Int'l Soc. Optics and Photonics, 2002, pp. 154–164.
3. C. Dobre and F. Xhafa, "Intelligent Services for Big Data Science," Future Generation Computer Systems, vol. 37, 2014, pp. 267–281.
4. B. Wixom et al., "The Current State of Business Intelligence in Academia: The Arrival of Big Data," Comm. Assoc. Information Systems, vol. 34, no. 1, 2014, pp. 1–14.
5. C. Jin et al., "Efficient Clustering of Uncertain Data Streams," Knowledge and Information Systems, vol. 40, no. 3, 2014, pp. 509–539.
6. X. Wu et al., "Data Mining with Big Data," IEEE Trans. Knowledge and Data Eng., vol. 26, no. 1, 2014, pp. 97–107.
7. G. Stoilos and G. Stamou, "Reasoning with Fuzzy Extensions of OWL and OWL 2," Knowledge and Information Systems, vol. 40, no. 1, 2014, pp. 205–242.
8. A.L. Berger et al., "The Candide System for Machine Translation," Proc. Workshop Human Language Technology, 1994, pp. 157–162.
9. J.S. White, T.A. O'Connell, and L.M. Carlson, "Evaluation of Machine Translation," Proc. Workshop Human Language Technology, 1993, pp. 206–210.
10. M. Aiken et al., "Automatic Translation in Multilingual Electronic Meetings," Industrial Management & Data Systems, vol. 109, no. 7, 2009, pp. 916–925.
11. I. Garcia and V. Stevenson, "Reviews: Google Translator Toolkit," Multilingual Computing & Technology, vol. 20, no. 6, 2009, p. 16.
12. A.V. Paliwal et al., "Semantics-Based Automated Service Discovery," IEEE Trans. Services Computing, vol. 5, no. 2, 2012, pp. 260–275.
13. H. Zhu et al., "Mobile App Classification with Enriched Contextual Information," IEEE Trans. Mobile Computing, vol. 13, no. 7, 2014, pp. 1550–1563.
14. R. Lu et al., "KACTL: Knowware Based Automated Construction of a Treelike Library from Web Documents," Proc. Int'l Conf. Web Information Systems and Mining (WISM 12), 2012, pp. 645–656.
15. T. Zhou, W. Bian, and D. Tao, "Divide-and-Conquer Anchoring for Near-Separable Nonnegative Matrix Factorization and Completion in High Dimensions," Proc. 13th Int'l Conf. Data Mining (ICDM 13), 2013, pp. 917–926.
16. C. Xu, D. Tao, and C. Xu, "Multi-view Intact Space Learning," to be published in IEEE Trans. Pattern Analysis and Machine Intelligence, 2015.
17. Z. Wu and M. Zou, "An Incremental Community Detection Method for Social Tagging Systems Using Locality-Sensitive Hashing," Neural Networks, vol. 58, 2014, pp. 14–28.
18. R. Studer, V.R. Benjamins, and D. Fensel, "Knowledge Engineering: Principles and Methods," Data & Knowledge Eng., vol. 25, no. 1, 1998, pp. 161–197.
19. R.K. Lindsay et al., "Dendral: A Case Study of the First Expert System for Scientific Hypothesis Formation," Artificial Intelligence, vol. 61, no. 2, 1993, pp. 209–261.
20. E. Shortliffe, Computer-Based Medical Consultations: MYCIN, Elsevier, 2012.
21. W.J. Van Melle, System Aids in Constructing Consultation Programs, vol. 11, UMI Research Press, 1981.
22. I. Ben-Gal et al., "Peer-to-Peer Information Retrieval Using Shared-Content Clustering," Knowledge and Information Systems, vol. 39, no. 2, 2014, pp. 383–408.
23. A. Camerra et al., "Beyond One Billion Time Series: Indexing and Mining Very Large Time Series Collections with iSAX2+," Knowledge and Information Systems, vol. 39, no. 1, 2014, pp. 123–151.
24. D.R. Swanson, "Migraine and Magnesium: Eleven Neglected Connections," Perspectives in Biology and Medicine, vol. 31, no. 4, 1988, pp. 526–557.
25. I. Žliobaitė, "Learning under Concept Drift: An Overview," 2010; arXiv preprint arXiv:1010.4784.
26. M.L. Schwall et al., "A Probabilistic Vehicle Diagnostic System Using Multiple Models," Innovative Applications of Artificial Intelligence, 2003, pp. 123–128.
27. H.A. Carneiro and E. Mylonakis, "Google Trends: A Web-Based Tool for Real-Time Surveillance of Disease Outbreaks," Clinical Infectious Diseases, vol. 49, no. 10, 2009, pp. 1557–1564.
28. P. Crossman, "ZestFinance Aims to Fix Underwriting for the Underbanked," Am. Banker, 2012; www.americanbanker.com/issues/177_223/zestfinance-aims-to-fix-underwriting-for-the-underbanked-1054464-1.html.
29. M. Schuhmacher and S.P. Ponzetto, "Knowledge-Based Graph Document Modeling," Proc. 7th ACM Int'l Conf. Web Search and Data Mining, 2014, pp. 543–552.
30. R.T. Plant and R. Gamble, "Using Meta-knowledge within a Multilevel Framework for KBS Development," Int'l J. Human-Computer Studies, vol. 46, no. 4, 1997, pp. 523–547.
31. F.N. Stokman and P.H. de Vries, Structuring Knowledge in a Graph, Springer, 1988.
32. P.M. Ryu, H.-K. Kim, and S.-Y. Park, "Graph-Based Knowledge Consolidation in Ontology Population," IEICE Trans. Information and Systems, vol. 96, no. 9, 2013, pp. 2139–2142.
33. A. Barr, "Meta-knowledge and Cognition," Proc. Int'l Joint Conf. Artificial Intelligence, 1979, pp. 31–33.
34. R.M. Ratwani, J.G. Trafton, and D.A. Boehm-Davis, "Thinking Graphically: Connecting Vision and Cognition during Graph Comprehension," J. Experimental Psychology: Applied, vol. 14, no. 1, 2008, pp. 36–49.
35. B. Sarrafzadeh, O. Vechtomova, and V. Jokic, "Exploring Knowledge Graphs for Exploratory Search," Proc. 5th Information Interaction in Context Symp., 2014, pp. 135–144.
36. L. Hao et al., "The Application and Implementation Research of Smart City in China," Proc. 2012 Int'l Conf. System Science and Eng. (ICSSE 12), 2012, pp. 288–292.
37. G. Acampora et al., "Optimizing Learning Path Selection through Memetic Algorithms," Proc. 2008 IEEE Int'l Joint Conf. Neural Networks, 2008, pp. 3869–3875.
38. G.D. Abowd et al., "Towards a Better Understanding of Context and Context-Awareness," Proc. 1st Int'l Symp. Handheld and Ubiquitous Computing, 1999, pp. 304–307.
39. D.H. Park et al., "A Literature Review and Classification of Recommender Systems Research," Expert Systems with Applications, vol. 39, no. 11, 2012, pp. 10059–10072.
40. C.L. Forgy, "Rete: A Fast Algorithm for the Many Pattern/Many Object Pattern Match Problem," Artificial Intelligence, vol. 19, 1982, pp. 17–37.
41. X. Wu, "LFA: A Linear Forward-Chaining Algorithm for AI Production Systems," Expert Systems, vol. 10, no. 4, 1993, pp. 237–242.
42. X. Wu, G. Fang, and M. Gams, "LFA+: A Fast Chaining Algorithm for Rule-Based Systems," Informatica: An Int'l J. Computing and Informatics, vol. 22, no. 3, 1998, pp. 329–349.
43. M. Armbrust et al., "A View of Cloud Computing," Comm. ACM, vol. 53, no. 4, 2010, pp. 50–58.
44. W. Liu et al., "Security-Aware Intermediate Data Placement Strategy in Scientific Cloud Workflows," Knowledge and Information Systems, vol. 41, no. 2, 2014, pp. 423–447.
45. A. Thusoo et al., "Hive: A Petabyte Scale Data Warehouse Using Hadoop," Proc. IEEE 26th Int'l Conf. Data Eng. (ICDE 10), 2010, pp. 996–1005.
46. J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Comm. ACM, vol. 51, no. 1, 2008, pp. 107–113.
47. D. Che, M. Safran, and Z. Peng, "From Big Data to Big Data Mining: Challenges, Issues, and Opportunities," Database Systems for Advanced Applications, LNCS 7827, Springer, 2013, pp. 1–15.
48. A. Hunter and R. Summerton, "Fusion Rules for Context-Dependent Aggregation of Structured News Reports," J. Applied Non-classical Logics, vol. 14, no. 3, 2004, pp. 329–366.
49. A.S. Maiya and T.Y. Berger-Wolf, "Expansion and Decentralized Search in Complex Networks," Knowledge and Information Systems, vol. 38, no. 2, 2014, pp. 469–490.
50. S. Hammoudi, S. Vale, and S. Loiseau, "Context-Aware Model Driven Development: Applications to Web Services Platform," Proc. CEUR Workshop, 2007, pp. 478–481.

The Authors

Xindong Wu is a Chang Jiang Scholar in the School of Computer Science and Information Engineering at Hefei University of Technology, China, and a professor of computer science at the University of Vermont. His research interests include data mining, knowledge-based systems, and Web information exploration. Wu has a PhD in artificial intelligence from the University of Edinburgh. He's a Fellow of IEEE and the AAAS. Contact him at [email protected].

Huanhuan Chen is a professor with the USTC-Birmingham Joint Institute in Intelligent Computation and Its Applications at the University of Science and Technology of China. His research interests include statistical machine learning, data mining, fault diagnosis, and evolutionary computation. Chen has a PhD from the School of Computer Science at the University of Birmingham. Contact him at [email protected].

Gong-Qing Wu is an associate professor of computer science at Hefei University of Technology. His research interests include data mining and Web intelligence. Wu has a PhD in computer science from Hefei University of Technology. Contact him at [email protected].

Jun Liu is a professor in the Department of Computer Science at Xi'an Jiaotong University. His research interests include text mining, data mining, and e-learning. Liu has a PhD in computer science from Xi'an Jiaotong University. Contact him at [email protected].

Qinghua Zheng is a professor in computer science and a vice president at Xi'an Jiaotong University. His research interests include multimedia distance education and computer network security. Zheng has a PhD in system engineering from Xi'an Jiaotong University. Contact him at [email protected].

Xiaofeng He is a professor at the Institute for Data Science and Engineering at East China Normal University. His research interests include machine learning, data mining, and information retrieval. He has a PhD in computer science from Pennsylvania State University. Contact him at [email protected].

Aoying Zhou is a professor and head of the Institute for Data Science and Engineering at East China Normal University. His research interests include Web data management, data management for data-intensive computing, memory cluster computing, and benchmarking for big data and performance. Contact him at [email protected].

Zhong-Qiu Zhao is an associate professor at Hefei University of Technology. His research interests include pattern recognition, image processing, and computer vision. Zhao has a PhD in pattern recognition and intelligent systems from the University of Science and Technology of China. Contact him at [email protected].

Bifan Wei is a research assistant professor at the Distance Education College of Xi'an Jiaotong University. His research interests include Web data mining, faceted search, and taxonomy learning. Wei has a PhD in computer science from Xi'an Jiaotong University. Contact him at [email protected].

Ming Gao is an associate professor at the Institute for Data Science and Engineering at East China Normal University. His research interests include link prediction and social mining, data stream management and mining, and uncertain data management. Gao has a PhD in computer science from Fudan University. Contact him at [email protected].

Yang Li is a PhD student in the School of Computer Science at the University of Science and Technology of China. His research interests include data mining, machine learning, and dynamic systems. Contact him at [email protected].

Qiping Zhang is a research assistant professor in the School of Computer Science and Information Engineering at Hefei University of Technology. His research interests include data mining and intelligent decision making. Zhang has a PhD in management science and engineering from Hefei University of Technology. Contact him at [email protected].

Shichao Zhang is a distinguished professor in the Department of Computer Science at Zhejiang Gongshang University, China. His research interests include data quality and pattern discovery. Zhang has a PhD in computer science from Deakin University. Contact him at [email protected].

Ruqian Lu is a professor of computer science in the Academy of Mathematics and Systems Science at the Chinese Academy of Sciences. His research interests include artificial intelligence, knowledge engineering, knowledge-based software engineering, formal semantics of programming languages, and quantum information processing. Lu is a Fellow of the Chinese Academy of Sciences. Contact him at [email protected].

Nanning Zheng is a professor and the director of the Institute of Artificial Intelligence and Robotics at Xi'an Jiaotong University. His research interests include computer vision, pattern recognition, and intelligent systems. Zheng has a PhD in electrical engineering from Keio University. He is a Fellow of IEEE and a member of the Chinese Academy of Engineering. Contact him at [email protected].