Machine Learning

Learning

To gain knowledge or understanding or skill through:

  • study
  • instruction
  • experience

 

Machine Learning

The field of study that gives computers the ability to learn without being explicitly programmed.

The goal is to devise programs that learn and improve their performance with experience, without human intervention.

 

Training Data

  1. Set of examples (input -> output) for learning
  2. Used to build model

 

Test data

  1. Used to:
    • test how well your model can predict
    • estimate model properties
  2. It is always outside the training data set but follows the same probability distribution as the training data
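
To make the split between training and test data concrete, here is a minimal sketch in plain Python. The function name `train_test_split`, the 25% test fraction and the toy examples are all assumptions made for illustration; they were not part of the talk.

```python
import random

def train_test_split(examples, test_fraction=0.25, seed=0):
    """Shuffle the examples and split them into disjoint training and test sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical (input -> output) examples: a number and whether it is even.
examples = [(n, n % 2 == 0) for n in range(20)]
train, test = train_test_split(examples)
print(len(train), len(test))  # 15 5
```

Because both parts are drawn from the same shuffled pool, the test set stays outside the training set while following the same distribution.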

 

Feature

  1. also called predictor
  2. It is a meaningful attribute
  3. Internal representation of data
  4. quantity describing an instance
  5. property of an instance

 

Tuple

  1. A record in a database
Features are columns and Tuples are rows

 

If we increase the number of records and attributes in a data set, then the Machine Learning problem also becomes a Big Data problem.

 

Supervised learning

  1. It uses a training data set consisting of (input -> correct output) pairs to train the model
  2. Example:
    • Page Ranking Algorithm
    • Next-word recommendation in instant messaging applications (WhatsApp, SMS)

 

Unsupervised learning

  1. No labelled training data set exists (there are inputs but no correct outputs)
  2. Unsupervised learning algorithms are often the most difficult because there is no “fixed” objective.
  3. Used in Exploratory Data Analysis (EDA)
  4. Example:
    • used in recommendation systems to find users in an existing database who are similar to me

 

Machine Learning types

 

  • Classification: We have a set of pre-defined classes and we want to know which class a new object belongs to. It is predictive modelling: we give pre-defined groups and predict the group of new data.
  • Clustering: We group a set of objects and find whether there is some relationship between the objects. It is descriptive modelling: we try to find groups which occur naturally in the data.

 

Classification

There are 6 items categorised in 2 classes:

Example of Classification

 

Each category has a label, e.g. “Eatables” and “Non-Eatables”. If we have to predict the class of a new item, “strawberry”, it will be assigned the label “Eatables”.

 

Clustering

There are 6 items categorised in 2 groups:

Example of Clustering

 

Each group is unnamed, i.e. there is no label attached to the group. If we have to predict the group of a new item, “strawberry”, it will be placed in the first group.

 

Accuracy

  1. How often is the prediction correct?
  2. Accuracy is not a reliable metric for the real performance of a model because it yields misleading results if the training data set is unbalanced (i.e. the number of samples in different classes varies greatly).
  3. Example:
    1. Let the number of cats be 95 and the number of dogs be 5
    2. The classifier can easily become biased into classifying all samples as cats
    3. Overall accuracy = 95%
    4. BUT 100% recognition rate for cats and 0% recognition rate for dogs
  4. One of the ways to improve accuracy is to provide more balanced data.
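
The cats-and-dogs example can be reproduced in a few lines of Python. This is only a sketch: the helper names `accuracy` and `recognition_rate` are made up for illustration, and the "classifier" is simply a list that predicts "cat" for every sample.

```python
# 95 cats and 5 dogs, as in the example above.
actual = ["cat"] * 95 + ["dog"] * 5
# A biased classifier that labels every sample as a cat.
predicted = ["cat"] * 100

def accuracy(actual, predicted):
    """Fraction of all samples whose prediction matches the actual label."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

def recognition_rate(actual, predicted, label):
    """Fraction of the samples of `label` that were predicted as `label`."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if a == label]
    return sum(a == p for a, p in pairs) / len(pairs)

print(accuracy(actual, predicted))                 # 0.95
print(recognition_rate(actual, predicted, "cat"))  # 1.0
print(recognition_rate(actual, predicted, "dog"))  # 0.0
```

Looking at the per-class rate next to the overall accuracy is what exposes the problem: the 95% figure hides the fact that no dog is ever recognised.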

 

One of the interesting points Satish Patil made at the Pune Python Meetup was this:

There is no right or wrong model. There is no best or worst model. There are ONLY useful and non-useful models.

Nobody knows what percentage of accuracy is good. How much accuracy is needed depends on the business context.

Consider a company that wants to launch a new product and uses Machine Learning to estimate the probability of success of the product. It is the company which DECIDES that if the probability is below 60%, it will not launch the product. This is not something that the developer decides; it depends entirely on the business context.

 

Market Basket Analysis

  1. Also called affinity analysis
  2. Association Rule:
    • discovering interesting relation/connection/association between specific objects
  3. Sometimes, certain products are typically purchased together like:
    • beer and chips
    • beer and diapers
    • bread and eggs
    • shampoo and conditioner
  4. So, market basket analysis tells a retailer that a promotion involving just one of the items from a set would likely drive sales of the others
  5. This technique is used by retailers to:
    • improve product placement
    • plan marketing
    • develop new products
    • make discount plans
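
One common way to quantify such associations is with the support and confidence of a rule. These two measures are standard in association rule mining, but I am adding them here as an illustration; the talk notes do not name them, and the baskets below are invented.

```python
# Hypothetical baskets; each set is one customer's purchase.
baskets = [
    {"beer", "chips"},
    {"beer", "chips", "diapers"},
    {"bread", "eggs"},
    {"beer", "diapers"},
    {"bread", "eggs", "chips"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Of the baskets containing `antecedent`, the fraction that also contain `consequent`."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)

print(support({"beer", "chips"}, baskets))        # share of baskets with both items
print(confidence({"beer"}, {"chips"}, baskets))   # how often beer buyers also buy chips
print(confidence({"bread"}, {"eggs"}, baskets))   # how often bread buyers also buy eggs
```

A rule such as "bread -> eggs" with high confidence is the kind of signal a retailer could use when deciding placement or discounts.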

 

Titanic Data Set

The Titanic data set was used in the machine learning talk at the Pune Python Meetup. It can be downloaded here.

Some features in the data set can be ignored because they are not important, such as:

  • Passenger ID
  • Name
  • Ticket Number
  • Cabin

and there are some important columns which help in classifying, such as:

  • Survived (the class label to be predicted)
  • Gender

 

Impurity Measure

  1. Measures how well the classes are separated
  2. Should be 0 when all data belong to one class

 

Entropy

  1. Entropy can be a measure of the quality of a model
  2. It is a measure of how spread out the probabilities are.
  3. The more equal the share of probability across all the classes, the higher the entropy. The more skewed the share among the classes, the lower the entropy.
  4. The goal in machine learning is to get a very low entropy in order to make the most accurate decisions and classifications
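
The behaviour described above can be checked with the usual Shannon entropy formula, H = -sum(p * log2(p)). The notes do not state which formula was used in the talk, so treat this as a sketch under that assumption.

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits; zero probabilities contribute nothing."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))    # 1.0  (equal shares: highest entropy for two classes)
print(entropy([1.0, 0.0]))    # 0.0  (all data in one class: a pure state)
print(entropy([0.95, 0.05]))  # skewed shares: somewhere in between, close to 0
```

This also matches the impurity requirement: the value is 0 when all data belong to one class.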

 

Decision Tree

  1. A way of graphically representing a sequential decision process
  2. Non-leaf nodes are labelled with attribute/ question
  3. Leaf nodes are labelled with class
decision tree based on titanic data set
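
A decision tree can be written down as nested dictionaries, with non-leaf nodes holding an attribute and leaf nodes holding a class. The tree below is hypothetical and is NOT the one shown in the talk; the attribute names `gender` and `is_child` are assumptions chosen only to illustrate the structure.

```python
# A hypothetical tree: internal nodes ask about an attribute,
# leaves hold a class label.
tree = {
    "attribute": "gender",
    "branches": {
        "female": {"label": "survived"},
        "male": {
            "attribute": "is_child",
            "branches": {
                True: {"label": "survived"},
                False: {"label": "did not survive"},
            },
        },
    },
}

def classify(node, record):
    """Walk from the root to a leaf, following the branch for each answer."""
    while "label" not in node:
        answer = record[node["attribute"]]
        node = node["branches"][answer]
    return node["label"]

print(classify(tree, {"gender": "male", "is_child": False}))  # did not survive
```

Each lookup is one step of the sequential decision process: answer the question at the current node, then move to the matching child.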

 

Pruning

  1. Data can contain noise:
    • an instance can contain errors
    • wrong classification
    • wrong attribute value
  2. If a particular feature is not used by a tuple, or if the feature does not influence the outcome, then it is removed.

 

Data Preprocessing

  1. Converting data into interval form
  2. Machine learning algorithms learn from data, so it is important to feed them the right data
  3. Data preprocessing basically involves:
    • correcting mistakes
    • handling missing values
    • handling outliers
    • normalizing values
    • handling nominal values
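
Two of these steps, handling missing values and normalizing values, can be sketched in plain Python. Filling with the mean and min-max scaling are just common choices that I am assuming here; the talk notes do not say which methods were recommended, and the `ages` column is invented.

```python
def fill_missing(values):
    """Replace None with the mean of the known values (one common choice)."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

def min_max_normalize(values):
    """Rescale values linearly so the smallest becomes 0.0 and the largest 1.0.

    Assumes the values are not all equal (otherwise the range is zero).
    """
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]

ages = [20, None, 40, 30]  # hypothetical column with one missing value
filled = fill_missing(ages)
print(filled)                     # [20, 30.0, 40, 30]
print(min_max_normalize(filled))  # [0.0, 0.5, 1.0, 0.5]
```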

 

Missing Value

  1. The value of an attribute which is not known or does not exist
  2. Example:
    • value was not measured
    • instrument malfunction
    • attribute does not apply
  3. If a column contains “Not Available”, it is NOT considered a missing value.

 

Outliers

  1. Samples which are far away from the other samples
  2. They can be mistakes/noise or represent a special behaviour
  3. Outliers are generally removed
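
One simple way to flag samples that are "far away" is to measure how many standard deviations each one lies from the mean (a z-score). This rule, the threshold of 2.0 and the readings below are my own assumptions for illustration, not the method from the talk.

```python
import statistics

def find_outliers(samples, threshold=2.0):
    """Return the samples lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(samples)
    spread = statistics.pstdev(samples)  # population standard deviation
    return [x for x in samples if abs(x - mean) / spread > threshold]

readings = [10, 12, 11, 13, 12, 11, 95]  # hypothetical measurements
print(find_outliers(readings))  # [95]
```

Whether a flagged value is then removed or kept depends on whether it is noise or a genuinely special behaviour.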

 

Questions that were asked in meetup

  1. Can data be extended to multiple dimensions?
  2. Can distance be other than Euclidean?
    • Yes, Manhattan distance
  3. Are there online courses that teach ML intro?
    • Yes
  4. What is “k” in k-means?
    • k is no. of clusters
  5. Can we use ML for trading?
    • Yes
  6. Any daily-life clustering example?
  7. Any software product based on unsupervised learning?
    • Google Maps
    • Matrimony/ Dating websites
    • Red Coupon (real estate)
    • Amazon recommendation
    • Netflix
  8. Is the order in which features are given important?
    • No
  9. Why do we say that one model is better than the other?
  10. What if accuracy is not the concern?
    • Accuracy is one way of looking at prediction
  11. Do you think that if model changes, something in feature has changed?
  12. We have tools like WEKA, so why would anyone prefer Python or R?
    • depends on the language available or language the company uses
  13. How do we know that a particular feature is important or not?
  14. What if some features are more influential than others? How will the decision tree be affected?
  15. How to handle outliers in a decision tree?
  16. Will the algorithm figure out the relationship between input and output?
    • This is possible through Regression
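
Two of the answers above, the distance measures and the meaning of “k” in k-means, are easier to see in code. The following is a minimal sketch rather than a production implementation: it starts from the first k points as centroids (a naive assumption made to keep the run repeatable), uses Euclidean distance inside `k_means`, and works on invented 2-D points.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute differences along each dimension."""
    return sum(abs(x - y) for x, y in zip(a, b))

def k_means(points, k, iterations=10):
    """Group points into k clusters; returns (centroids, clusters)."""
    centroids = [tuple(p) for p in points[:k]]  # naive start: first k points
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

print(euclidean((0, 0), (3, 4)))  # 5.0
print(manhattan((0, 0), (3, 4)))  # 7

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]  # hypothetical data
centroids, clusters = k_means(points, k=2)
print(clusters)
```

The same functions work for points with more than two dimensions, since both distances simply loop over the coordinates.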

 


Event Report: April Pune Python Meetup

The April Pune Python Meetup (@PythonPune) was held on April 30, 2016 at Red Hat, Pune. Around 70 people registered for the meetup, but the turnout was around 72-73 because a few people registered on the spot.

Python Pune Meetups are organised by Chandan Kumar (@ciypro), a fellow Red Hat employee, Python programmer and FOSS enthusiast who has contributed to many upstream projects.

The meetup started around 10:45 with introductions, where everybody introduced themselves. Almost everybody knew Python; only 1-2 people did not. A few people were experienced in machine learning and some were completely new to it. I had a course on machine learning in college, where I learnt the theory and did some practical assignments in the R language. The crowd was diverse, consisting of students, data scientists and professors, with ages ranging from 18 to 70.

The speakers at this meetup were Satish Patil (@DataGeekSatish) and Sudarshan Gadhave (@sudarshan1989), who took a session on Introduction to Machine Learning.

Satish Patil in Pune Python Meetup

 

Sudarshan Gadhave in Pune Python Meetup

Satish Patil is the Founder and Chief Data Scientist of Lemoxo Technologies, Pune, where he advises companies large and small on their data strategy. He has 10+ years of research experience in the field of drug discovery and development. He shared a few real-life machine learning examples from his field at the meetup!

Satish is passionate about applying technology, artificial intelligence, design thinking and cognitive science to better understand, predict and improve business functions. He has a great interest in Machine Learning, Artificial Intelligence, Data Visualisation, Big Data.

Satish covered the following topics:

  • What is Machine Learning
  • The Black Box of Machine Learning
  • features
  • training and test data set
  • classification
  • clustering
  • pure and impure states
  • entropy
  • decision tree
  • supervised and unsupervised learning
  • market basket analysis
  • data pre-processing
  • Titanic data set
  • K means algorithm

Machine Learning is a vast subject and it definitely requires more sessions to grasp, but Satish made a remarkable effort to help us understand all the above topics in layman's terms.

There are a lot of books, courses and material available online for Machine Learning, so why this meetup? Well, the best part about this meetup was the way Satish explained the BUSINESS CONTEXT of MACHINE LEARNING. This was something new for me to learn. Getting to know real-life examples from an entrepreneur-cum-data scientist was really interesting.

The Machine Learning Workshop in Pune Python Meetup

The details of his talk will be in my next blog.

Chandan Kumar talked about Fedora Labs. The Fedora Science spin comes pre-installed with essential tools for scientific and numerical work, such as an IDE and tools and libraries for programming in Python, C, C++, Java and R. It basically eliminates the need to download a bunch of scientific packages yourself.

If you need any help regarding the spin, you can get it from the #fedora-science channel on Freenode (IRC).

As Chandan Kumar ALWAYS encourages us to contribute to open source, he introduced us to WHAT CAN I DO FOR FEDORA?. Pune Python meetups and Devsprint are a great platform to seek help if you want to contribute to open source.

Chandan Kumar in Pune Python Meetup

 

Thanks to Satish Patil and Sudarshan Gadhave for conducting an awesome workshop! We hope to see more such workshops by you in the meetups.

Thanks to Red Hat for the food, beverages and venue.

Thanks to Chandan Kumar, Pravin Kumar (@kumar_pravin), Amol Kahat, Sudhir Verma for organising such interesting meetups where we always learn something new 🙂

 

 
