Apache Mahout - Isabel

Apache Mahout: Making data analysis easy

Isabel Drost

Nighttime: Co-founder and committer, Apache Mahout. Organiser of Berlin Hadoop Get Together.

Daytime: Software developer. Guest lecturer at TU Berlin. Co-organiser of Berlin Buzzwords 2010.





“Mastering Data-Intensive Collaboration and Decision Making”: EU-funded research project
– Number of partners: 8
– Coordinator: Research Academic Computer Technology Institute (CTI), Greece

Hello Devoxx!

Machine learning background?

Agenda
● Data Mining/Machine Learning?
● Why is scaling hard?
● Going beyond simple statistics.

Data Mining Applications
● Marketing.
● Surveillance.
● Fraud Detection.
● Scientific Discovery.
● Discover items usually purchased together.
= Extracting patterns from data.

Machine Learning Applications
● E-Mail spam classification.
● News-topic discovery.
● Building recommender systems.
= Extracting prediction models from data.

Machine learning – what's that?

Archimedes taking a Warm Bath. Image by John Leech, from: The Comic History of Rome by Gilbert Abbott à Beckett. Bradbury, Evans & Co, London, 1850s.

Archimedes model of nature

June 25, 2008 by chase-me http://www.flickr.com/photos/sasy/2609508999

An SVM's model of nature

The challenge

Mission: Provide scalable data mining algorithms.

http://www.flickr.com/photos/honou/2936937247/

HowTo: From data to information.

January 3, 2006 by Matt Callow http://www.flickr.com/photos/blackcustard/81680010

http://www.flickr.com/photos/[email protected]/3344809375/in/photostream/

http://www.flickr.com/photos/redux/409356158/

http://www.flickr.com/photos/disowned/1158260369/

The HDFS filesystem is not restricted to MapReduce jobs. It can be used for other applications, many of which are under way at Apache. The list includes the HBase database, the Apache Mahout machine learning system, and matrix operations.

http://www.flickr.com/photos/noodlepie/2675987121/

http://www.flickr.com/photos/topsy/204929063/

From data to information.
● Collect data and define your learning problem.
● Data preparation.
● Training a prediction model.
● Checking the performance of your model.

● Remove noise.
● Convert text to vectors.
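
A minimal sketch of what "remove noise" could mean for text input. The slides do not define noise, so this assumes it covers case differences, punctuation, and very common words. The stop-word list and the function name `remove_noise` are hypothetical, chosen only for illustration.

```python
import re

# Hypothetical stop-word list, for illustration only; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "to", "in"}

def remove_noise(text):
    """Lowercase the text, keep only alphanumeric tokens, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]

print(remove_noise("The weather is sunny, and the sky is blue!"))
# prints ['weather', 'sunny', 'sky', 'blue']
```

The cleaned token list is what a later vectorisation step would consume.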

From texts to vectors

If we looked at two words only:

[Figure labels: Sunny weather; High performance computing; Aaron; Zuse]

Binary bag of words
● Imagine an n-dimensional space.
● Each dimension = one possible word in texts.
● Entry in vector is one, if word occurs in text.

$$b_{i,j} = \begin{cases} 1 & \text{if } x_i \in d_j \\ 0 & \text{else} \end{cases}$$

● Problem: Number of word occurrences not accounted for.
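
The binary bag-of-words encoding can be sketched in a few lines. This is an illustration under simple assumptions (whitespace tokenisation, lowercasing, vocabulary sorted alphabetically), not Mahout's implementation; the function names are hypothetical.

```python
def build_vocabulary(documents):
    """Assign one dimension index to every distinct word, in alphabetical order."""
    words = sorted({word for doc in documents for word in doc.lower().split()})
    return {word: index for index, word in enumerate(words)}

def binary_bag_of_words(document, vocabulary):
    """Return a vector with 1 where the vocabulary word occurs in the document, else 0."""
    present = set(document.lower().split())
    # Dict iteration follows insertion order, i.e. the sorted vocabulary order.
    return [1 if word in present else 0 for word in vocabulary]

documents = ["Sunny weather", "High performance computing"]
vocabulary = build_vocabulary(documents)
# Dimensions: computing, high, performance, sunny, weather

print(binary_bag_of_words("Sunny weather", vocabulary))
# prints [0, 0, 0, 1, 1]
print(binary_bag_of_words("Sunny sunny sunny weather", vocabulary))
# prints [0, 0, 0, 1, 1]
```

The second call returns the same vector as the first, which is the problem named above: repeated occurrences of a word leave the encoding unchanged.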

Term Frequency
● Imagine an n-dimensional space.
● Each dimension