To gain knowledge, understanding, or skill through study or experience.
The field of study that gives computers the ability to learn without being explicitly programmed.
The goal is to devise programs that learn and improve their performance with experience, without human intervention.
- Set of examples (input -> output) for learning
- Used to build model
- Used to test:
- how well the model can predict
- estimate model properties
- It always lies outside the training data set but follows the same probability distribution as the training data
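A minimal sketch of splitting labeled (input -> output) examples into training and test sets. Pure Python; the function name and the 80/20 split are illustrative choices, not from the talk:

```python
import random

def train_test_split(examples, test_fraction=0.2, seed=42):
    """Shuffle examples and split them into training and test sets."""
    rng = random.Random(seed)
    shuffled = examples[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

# 10 toy (input -> output) examples
data = [(x, 2 * x) for x in range(10)]
train, test = train_test_split(data)
print(len(train), len(test))  # 8 2
```

The training portion is used to build the model; the held-out portion estimates how well it predicts on unseen data.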
- also called predictor
- It is a meaningful attribute
- Internal representation of data
- quantity describing an instance
- property of an instance
- A record in a database
If we increase the number of records and attributes in a data set, then the Machine Learning problem also becomes a Big Data problem.
- It uses a training data set consisting of input -> correct output pairs to train the model
- Page Ranking Algorithm
- Next word recommendation in Instant Messaging Application/ Whatsapp/ SMS
- No training data set exists
- Unsupervised learning algorithms are the most difficult because there is no “fixed” objective.
- used in Exploratory Data Analysis (EDA)
- used in recommendation systems to find users in the existing database who are similar to a given user
| Classification | Clustering |
|---|---|
| We have a set of pre-defined classes and we want to know which class a new object belongs to. | Group a set of objects and find whether there is some relationship between the objects. |
| It is predictive modelling. We give pre-defined groups and predict the group of new data. | It is descriptive modelling. We try to find groups which occur naturally in the data. |
There are 6 items categorised in 2 classes:
Each category has a label, e.g. Eatables and Non-Eatables. If we have to predict the class of a new item “strawberry”, it will be assigned the label “Eatable”.
There are 6 items categorised in 2 groups:
Each group is unnamed, i.e. there is no label attached to the group. If we have to predict the group of a new item “strawberry”, it will fall in the first group.
- How often is the prediction correct?
- Accuracy is not a reliable metric for the real performance of a model because it yields misleading results when the training data set is unbalanced (i.e. the number of samples in different classes varies greatly).
- Let the number of cats be 95 and the number of dogs be 5
- The classifier can easily become biased into classifying all samples as cats
- Overall accuracy = 95%
- BUT a 100% recognition rate for cats and a 0% recognition rate for dogs
- One way to improve accuracy is to provide more balanced data.
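The cats-and-dogs example above can be reproduced in a few lines (a sketch; the numbers follow the notes):

```python
# Unbalanced data set: 95 cats, 5 dogs
actual = ["cat"] * 95 + ["dog"] * 5
# A biased classifier that labels every sample as "cat"
predicted = ["cat"] * 100

# Overall accuracy looks great...
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def recall(label):
    """Fraction of samples with this true label that were recognised."""
    hits = sum(1 for a, p in zip(actual, predicted) if a == label and p == label)
    total = sum(1 for a in actual if a == label)
    return hits / total

print(accuracy)        # 0.95
print(recall("cat"))   # 1.0
print(recall("dog"))   # 0.0
```

Per-class recall exposes what the 95% overall accuracy hides: the model never recognises a dog.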
This is one of the interesting things explained by Satish Patil at the Pune Python Meetup:
There is no right or wrong model. There is no best or worst model. There is ONLY useful and non-useful model.
Nobody knows what percentage of accuracy is good. How much accuracy is needed depends on the business context.
Consider a company which wants to launch a new product and wants to estimate the probability of its success using Machine Learning. It is the company that DECIDES that if the probability is below 60%, they will not launch the product. This is not something the developer decides; it depends entirely on the business context.
Market Basket Analysis
- Also called affinity analysis
- Association Rule:
- discovering interesting relations/connections/associations between specific objects
- Sometimes, certain products are typically purchased together like:
- beer and chips
- beer and diapers
- bread and eggs
- shampoo and conditioner
- So, market basket analysis tells a retailer that a promotion involving just one item from the set would likely drive sales of the others
- This technique is used by retailers to:
- improve product placement
- develop new products
- design discount plans
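A minimal market basket sketch, assuming a hypothetical transaction log (the baskets below are made up for illustration; real data would come from sales records):

```python
from itertools import combinations
from collections import Counter

# Hypothetical transactions; each basket is the set of items bought together
transactions = [
    {"beer", "chips"},
    {"beer", "chips", "diapers"},
    {"bread", "eggs"},
    {"beer", "diapers"},
    {"bread", "eggs", "shampoo", "conditioner"},
]

pair_counts = Counter()
item_counts = Counter()
for basket in transactions:
    item_counts.update(basket)
    pair_counts.update(frozenset(p) for p in combinations(sorted(basket), 2))

n = len(transactions)
for pair, count in pair_counts.most_common(3):
    a, b = sorted(pair)
    support = count / n                  # how often the pair occurs overall
    confidence = count / item_counts[a]  # P(b in basket | a in basket)
    print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
```

Pairs with high support and confidence (like beer and chips here) are the association rules a retailer would act on.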
Titanic Data Set
The Titanic data set was used in the machine learning talk at the Pune Python Meetup. It can be downloaded here.
There are some features in the data set which can be ignored as they are not important like:
- Passenger ID
- Ticket Number
and there are some important features which help in classifying like:
- Measures how well the classes are separated
- Should be 0 when all data belong to one class
- Entropy can be a measure of the quality of a model
- It is a measure of how spread out the probabilities are.
- The more equal the probability shares across the classes, the higher the entropy. The more skewed the shares, the lower the entropy.
- The goal in machine learning is to get a very low entropy in order to make the most accurate decisions and classifications
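The behaviour described above follows from the standard Shannon formula, H = -Σ p·log2(p). A plain-Python sketch:

```python
import math

def entropy(class_counts):
    """Shannon entropy (in bits) of a class distribution."""
    total = sum(class_counts)
    probs = [c / total for c in class_counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

print(entropy([6, 0]))   # 0.0  -> all data in one class
print(entropy([3, 3]))   # 1.0  -> perfectly even split, highest entropy
print(entropy([5, 1]))   # ~0.65 -> skewed split, lower entropy
```

This matches the notes: entropy is 0 when all data belong to one class, and highest when the classes share the data equally.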
- A way of graphically representing a sequential decision process
- Non-leaf nodes are labelled with attribute/ question
- Leaf nodes are labelled with class
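A tiny hand-built tree can illustrate the structure above (the questions and class labels here are hypothetical, echoing the earlier Eatables example):

```python
# Non-leaf nodes ask a question about an attribute; leaves carry a class label.
tree = {
    "question": "is_edible",
    "yes": "Eatable",                      # leaf
    "no": {
        "question": "is_electronic",
        "yes": "Electronics",              # leaf
        "no": "Other",                     # leaf
    },
}

def classify(node, instance):
    """Walk from the root to a leaf, answering one question per level."""
    if isinstance(node, str):              # leaf node: class label
        return node
    answer = "yes" if instance[node["question"]] else "no"
    return classify(node[answer], instance)

print(classify(tree, {"is_edible": True, "is_electronic": False}))   # Eatable
print(classify(tree, {"is_edible": False, "is_electronic": True}))   # Electronics
```

A learning algorithm builds such a tree automatically, typically choosing at each node the question that reduces entropy the most.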
- Data can contain noise:
- instance can contain error
- wrong classification
- wrong attribute value
- If a particular feature is not used by any tuple, or does not influence the outcome, it is removed.
- Converting data into interval form
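Converting values into interval form (binning) might look like this sketch; the age bins are a hypothetical example:

```python
def to_interval(value, edges, labels):
    """Map a numeric value to the interval (bin) it falls into."""
    for edge, label in zip(edges, labels):
        if value < edge:
            return label
    return labels[-1]

# Hypothetical age bins: <18, 18-39, 40-64, 65+
edges = [18, 40, 65]
labels = ["child", "adult", "middle-aged", "senior"]
ages = [5, 25, 50, 80]
print([to_interval(a, edges, labels) for a in ages])
# ['child', 'adult', 'middle-aged', 'senior']
```

Many algorithms (and humans reading the model) handle a handful of intervals more easily than raw continuous values.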
- Machine learning algorithms learn from data, so it's important to feed them the right data
- Data preprocessing basically involves:
- correcting mistakes
- handling missing values
- handling outliers
- normalizing values
- encoding nominal values
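Normalizing values is often done with min-max scaling; a sketch with illustrative data:

```python
def min_max_normalize(values):
    """Rescale values linearly to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

heights_cm = [150, 160, 170, 180, 190]
print(min_max_normalize(heights_cm))  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

Putting all features on a comparable scale stops a large-valued feature (e.g. income) from dominating a small-valued one (e.g. age) in distance calculations.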
- The value of an attribute which is not known or does not exist
- value was not measured
- instrument malfunction
- attribute does not apply
- If a column contains the string “Not Available”, it is NOT automatically considered a missing value.
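A sketch of mapping such sentinel strings to a real missing-value marker; the set of sentinels below is an assumption, not from the talk:

```python
# Strings like "Not Available" are ordinary values to the machine; they must
# be converted to a real missing-value marker (None here) explicitly.
SENTINELS = {"Not Available", "N/A", "", "?"}

def clean(column):
    return [None if v in SENTINELS else v for v in column]

ages = ["34", "Not Available", "28", "?"]
print(clean(ages))  # ['34', None, '28', None]
```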
- samples which are far away from other samples
- They can be mistakes/noise or represent a special behaviour
- Outliers are generally removed
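One common way to flag outliers is a z-score rule; a sketch where the 2-standard-deviation threshold is a typical choice, not from the talk:

```python
import statistics

def outliers(samples, threshold=2.0):
    """Flag samples more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)
    return [x for x in samples if abs(x - mean) > threshold * stdev]

data = [10, 11, 9, 10, 12, 11, 10, 95]   # 95 is far away from the rest
print(outliers(data))  # [95]
```

Whether a flagged sample is then removed or kept (as genuinely special behaviour) is a judgment call, as the notes point out.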
Questions that were asked in the meetup
- Can data be extended to multiple dimensions?
- Can the distance be other than Euclidean?
- Yes, Manhattan distance
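The two distances compared, as a quick sketch:

```python
import math

def euclidean(p, q):
    """Straight-line distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Sum of absolute coordinate differences (city-block distance)."""
    return sum(abs(a - b) for a, b in zip(p, q))

p, q = (0, 0), (3, 4)
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7
```

Both generalise to any number of dimensions, which also answers the previous question.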
- Are there online courses that teach ML intro?
- What is “k” in k-means?
- k is the number of clusters
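A minimal 1-D k-means sketch showing the role of k (the data is illustrative; real implementations such as scikit-learn's KMeans handle far more detail):

```python
import random

def k_means(points, k, iterations=20, seed=0):
    """Minimal 1-D k-means: k is the number of clusters to find."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # random initial centroids
    for _ in range(iterations):
        # assign each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # move each centroid to the mean of its cluster
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Two obvious groups, around 2 and around 50
points = [1, 2, 3, 48, 50, 52]
print(k_means(points, k=2))  # [2.0, 50.0]
```

The caller chooses k up front; picking a sensible k for real data is itself a non-trivial problem.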
- Can we use ML for trading?
- Any daily life clustering example
- Any software product based on unsupervised learning?
- Google Maps
- Matrimony/ Dating websites
- Red Coupon (real estate)
- Amazon recommendation
- Is the order in which features are given important?
- Why do we say that one model is better than the other?
- What if accuracy is not the concern?
- Accuracy is one way of looking at prediction
- Do you think that if model changes, something in feature has changed?
- We have tools like WEKA, so why would anyone prefer Python or R?
- depends on the language available or language the company uses
- How do we know that a particular feature is important or not?
- What if some features are more influential than others? How will the decision tree be affected?
- How to handle outliers in a decision tree?
- Will the algorithm figure out the relationship between input and output?
- This is possible through Regression
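A least-squares regression sketch that recovers a hidden input-output relationship; the data is made up so that the true line is y = 2x + 1:

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b: learns the input-output relation."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - a * mean_x
    return a, b

xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]        # hidden relationship: y = 2x + 1
a, b = fit_line(xs, ys)
print(a, b)  # 2.0 1.0
```

Given only the examples, the fit recovers slope 2 and intercept 1, which is exactly the "figure out the relationship between input and output" the question asks about.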