- I will introduce the class
- Discuss the book(s) and homework
- Introduce the class website
- Begin the class with the material in Chapter 1 of the book
January 22, 2020
In this class we will be learning about "Modern Applied Statistics" using computers and lots of data.
What is Statistical Learning? The answer depends on who you talk to.
The author of our book offers several descriptions, such as:
"… machine learning provides a set of tools that use computers to transform data into actionable knowledge."
or
"… making sense of complex data."
or
"… computer scientist Tom M. Mitchell states that a machine learns whenever it is able to utilize its experience such that its performance improves on similar experiences in the future."
Do an extensive Google search on the terms
Report on what these terms mean.
Do they mean the same thing?
Or are there differences?
Data has long been expensive. Experimental data basically had to be collected by hand.
Recently (not really, but ok) data has become very cheap. Observational data is being collected automatically and stored in large amounts.
There are still opportunities in the traditional areas of data analysis, but there are new frontiers for people who have the skills to access, interact with, model, and analyze the large amounts of data that are already stored electronically.
You need to know something about databases!
What is the difference between machine learning and data mining?
The main difference is that machine learning is used to group similar observations based on important variables or to develop models for prediction, while data mining is used to search for "hidden nuggets" in data.
However, many of the same "tools" are discussed in both areas.
The basic learning process:
\(y = f(x_1, x_2, \ldots, x_p)\)
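As a rough illustration of the relationship above, here is a minimal sketch of "learning" an approximation of f from examples. It assumes a single feature (p = 1), a straight-line form for f, and made-up toy data; the names `fit_line` and `f_hat` are hypothetical and not from the book. The course itself works in R; Python is used here only to keep the sketch short and dependency free.

```python
# Hypothetical toy data (an assumption for illustration only):
# one feature x and a response y that is roughly 2*x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 9.0, 10.8]


def fit_line(xs, ys):
    """Estimate the intercept and slope of a line by ordinary least squares."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of cross products and sum of squares around the means.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


intercept, slope = fit_line(xs, ys)


def f_hat(x):
    """The learned approximation of f, usable on inputs not seen before."""
    return intercept + slope * x


print(f"f_hat(x) = {intercept:.2f} + {slope:.2f} * x")
print(f"prediction at x = 6: {f_hat(6.0):.2f}")
```

The data are the "experience"; the fitted line is what was learned from it. With more features, the same idea extends to \(f(x_1, x_2, \ldots, x_p)\).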
Assessing the success of learning:
Models are not perfect, but some are useful. (Is this the exact quote? Who said this?)
This is often discussed as checking the assumptions and/or validating the model.
Overfitting is a problem that needs to be avoided.
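One way to see overfitting is to compare a model's error on the data it was trained on with its error on new data. The sketch below is an illustration under stated assumptions, not a method from the book: the data are simulated (a line plus noise), and the "model" is a deliberately extreme one that simply memorizes the training set and predicts with the nearest stored example. The names `make_data`, `predict_1nn`, and `mse` are hypothetical.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible


def make_data(n):
    """Simulated data (an assumption): y = 2x + 1 plus Gaussian noise."""
    data = []
    for _ in range(n):
        x = random.uniform(0.0, 10.0)
        y = 2.0 * x + 1.0 + random.gauss(0.0, 2.0)
        data.append((x, y))
    return data


train = make_data(30)  # data the model learns from
test = make_data(30)   # new data the model has never seen


def predict_1nn(x, memory):
    """Predict with the y of the closest stored x (pure memorization)."""
    nearest = min(memory, key=lambda pair: abs(pair[0] - x))
    return nearest[1]


def mse(data, predict):
    """Mean squared error of a prediction function over (x, y) pairs."""
    return sum((y - predict(x)) ** 2 for x, y in data) / len(data)


train_mse = mse(train, lambda x: predict_1nn(x, train))
test_mse = mse(test, lambda x: predict_1nn(x, train))

print(f"training error: {train_mse:.3f}")
print(f"test error:     {test_mse:.3f}")
```

The memorizing model reproduces its training data exactly, so its training error is zero, yet it still makes errors on the held-out data because it has also memorized the noise. Judging a model only on its training data hides this gap.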
Steps to apply machine learning to your data:
Selecting a machine learning algorithm:
Understand the data.
What are the observations (also called units of observation, or examples)?
What are the variables or features?
Are the variables numeric or categorical?
Knowing your data can lead to a possible model and learning algorithm.
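The questions above can be asked of a data set directly. The following is a minimal sketch using a small made-up table; the records, the variable names, and the helper `variable_type` are all hypothetical and only for illustration. Each record is one observation (example) and each key is one variable (feature).

```python
# Hypothetical records (an assumption for illustration only):
# each dict is one observation, each key is one variable.
observations = [
    {"age": 34, "income": 52000.0, "region": "north", "owns_home": "yes"},
    {"age": 51, "income": 61000.0, "region": "south", "owns_home": "no"},
    {"age": 27, "income": 38500.0, "region": "north", "owns_home": "no"},
]


def variable_type(values):
    """Label a variable numeric if every value is an int or float, else categorical."""
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return "numeric"
    return "categorical"


features = list(observations[0].keys())
types = {
    name: variable_type([row[name] for row in observations])
    for name in features
}

print(f"{len(observations)} observations, {len(features)} features")
for name, kind in types.items():
    print(f"  {name}: {kind}")
```

This rule is intentionally simple. Real data often stores categories as numeric codes (for example, a ZIP code), so the type suggested by the storage format should be checked against what the variable actually means.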
Types of machine learning algorithms:
In Statistics we consider Inference vs. Prediction
See page 21 of our book for a table that presents the different types of algorithms and the tasks they are used for.
The table includes the chapters of the book that cover each topic.
R is a very good platform for doing machine learning. There are many packages that have been written for implementing the algorithms. This is the platform that we are going to focus on for the course.
For Big Data see SparkR
Python is another good platform for doing machine learning. The scikit-learn package has grown to be very useful for machine learning.
Both R and Python have their strengths and weaknesses.
There was a bit of a competition between the two.
It would be good to become familiar with both and good at one.
The author summarizes Chapter 1 with, "Machine Learning originated at the intersection of statistics, database science, and computer science. It is a powerful tool, capable of finding actionable insights in large quantities of data."
Get better at using IT tools!!!