Python Machine Learning: Unlock Deeper Insights into Machine Learning With This Vital Guide to Cutting-edge Predictive Analytics (English) Paperback – September 23, 2015
Unlock deeper insights into Machine Learning with this vital guide to cutting-edge predictive analytics
About This Book
- Leverage Python's most powerful open-source libraries for deep learning, data wrangling, and data visualization
- Learn effective strategies and best practices to improve and optimize machine learning systems and algorithms
- Ask – and answer – tough questions of your data with robust statistical models, built for a range of datasets
Who This Book Is For
If you want to find out how to use Python to start answering critical questions of your data, pick up Python Machine Learning – whether you want to get started from scratch or want to extend your data science knowledge, this is an essential and unmissable resource.
What You Will Learn
- Explore how to use different machine learning models to ask different questions of your data
- Learn how to build neural networks using Pylearn2 and Theano
- Find out how to write clean and elegant Python code that will optimize the strength of your algorithms
- Discover how to embed your machine learning model in a web application for increased accessibility
- Predict continuous target outcomes using regression analysis
- Uncover hidden patterns and structures in data with clustering
- Organize data using effective pre-processing techniques
- Get to grips with sentiment analysis to delve deeper into textual and social media data
Machine learning and predictive analytics are transforming the way businesses and other organizations operate. Being able to understand trends and patterns in complex data is critical to success, becoming one of the key strategies for unlocking growth in a challenging contemporary marketplace. Python can help you deliver key insights into your data – its unique capabilities as a language let you build sophisticated algorithms and statistical models that can reveal new perspectives and answer key questions that are vital for success.
Python Machine Learning gives you access to the world of predictive analytics and demonstrates why Python is one of the world's leading data science languages. If you want to ask better questions of data, or need to improve and extend the capabilities of your machine learning systems, this practical data science book is invaluable. Covering a wide range of powerful Python libraries, including scikit-learn, Theano, and Pylearn2, and featuring guidance and tips on everything from sentiment analysis to neural networks, you'll soon be able to answer some of the most important questions facing you and your organization.
Style and approach
Python Machine Learning connects the fundamental theoretical principles behind machine learning to their practical application in a way that focuses you on asking and answering the right questions. It walks you through the key elements of Python and its powerful machine learning libraries, while demonstrating how to get to grips with a range of statistical models.
Sebastian Raschka is a PhD student at Michigan State University, where he develops new computational methods in the field of computational biology. He has been ranked as the number one most influential data scientist on GitHub by Analytics Vidhya. He has years of experience in Python programming and has conducted several seminars on the practical applications of data science and machine learning. Talking and writing about data science, machine learning, and Python motivated Sebastian to write this book in order to help people develop data-driven solutions without necessarily needing a machine learning background. He has also actively contributed to open source projects, and methods that he implemented are now successfully used in machine learning competitions such as Kaggle. In his free time, he works on models for sports predictions, and if he is not in front of the computer, he enjoys playing sports.
Data Scientist; B.S. in Economics and M.S. in Business Analytics; experienced (though by no means expert) user of Scikit-learn
I've purchased and read (virtually) every Machine Learning book that aims to teach the reader the basics of ML using the Scikit-learn library as the main focus. I've found them to be...less than satisfactory. The examples in other books often use ML techniques in contexts for which they are not intended to be used and/or contexts they are not used in out in the real world (among other issues I have found within them).
In stark contrast, Python Machine Learning by Sebastian Raschka is stunningly impressive, not only for the breadth and depth of coverage, but also for the manner in which the information is presented to the reader.
To date, I have not encountered a book on ML that incorporates multiple levels of learning in a manner such as this. It is the textual equivalent of a Neural Network with hundreds of hidden layers running on the latest NVIDIA GPU (if that comparison is lost on you, don’t worry; it’ll all make sense by the time you finish the book).
One of the underlying (though understated) themes in the book is the importance of using visual aids where appropriate to gauge the performance of the algorithms you’re using as well as to understand exactly what is going on behind the scenes, so to speak. If you’re a novice user of the Matplotlib graphics library for Python, this book will greatly improve your visualization skills by the time you’re done, which I found to be an added bonus.
Another underlying theme is basic optimization using the NumPy library. This is reinforced throughout the book in the examples that you code by hand. Ditto for the Pandas library. To those of you brand-new to Python, you may not fully appreciate this aspect of the book until you gain some more experience and you’ve gone through the book a few times. For those of you who are more experienced users, the examples provide an amazing amount of insight into simple ways to make your code more efficient. Indeed, “Best Practices” abound in Python Machine Learning.
As a final general thought, Sebastian is an active contributor to Scikit-learn, something I do not believe to be the case with the authors of the other books that I’ve read. In order to effectively demonstrate and communicate the power of the Scikit-learn library, you really need to be familiar with it at a fundamental level. Sebastian has this knowledge in spades, and that becomes readily apparent as you progress through the book. He makes no assumption about the knowledge base of the reader (he doesn’t have to) because the book incorporates learning styles appropriate for differing skill levels (see below).
FOR BEGINNING USERS:
You may have some experience with Scikit-learn and Python, though not necessarily enough where you have developed some of the “best practices” I mentioned above; you’re still getting comfortable using the library and the Python environment. This book is definitely for you!
The best way to learn this subject is by coding examples. You could not ask for a finer book on the subject; for those just starting their ML journey, you’re in good hands with Sebastian. You’ll get an excellent, hands-on education using some of the most important ML algorithms in use today in the most popular ML library used in Python. You’ll begin to develop good habits and you’ll see, from a basic level, how to actually create algorithms on your own, outside of Scikit-learn! Then, after you’ve had the experience of coding the algorithm by hand, you’ll move to Scikit-learn and get even more hands-on experience. You will learn about the tried-and-true algorithms that have been around for decades as well as concepts that are still in their infancy and are considered the current state of the art. And as I said, you’ll learn about them by actually using them to build ML models. You’ll see how you take a concept for a project and turn it into reality using some really fantastic algorithms.
FOR INTERMEDIATE USERS:
You are comfortable using Python and Scikit-learn and have participated (or you are considering participating) in one or more Kaggle competitions. You may have some good habits/best practices formed but you’re looking to take the next step; you may know how to take a project from the data gathering and cleaning stage to a final model, but you may not have actually done it or you aren’t sure how to properly evaluate the model you have created in the end stages; you want to gain a thorough understanding of which situations are appropriate for each of the algorithms and more importantly, which situations are NOT appropriate for each of the algorithms; you want to gain a firm knowledge of how the algorithms work and you’re curious about what the state-of-the-art concepts are. Good news!
This book is DEFINITELY for you.
There comes a time in every Data Scientist’s life when you have read everything you can find on how to structure and complete projects and you feel confident that you’re ready. Then you start and you realize during the course of a project that you suddenly have a dozen more questions:
What should I do with all of these missing values?
Should I use PCA and/or other dimensionality reduction techniques?
How many folds should I use in my Cross Validation?
Should I use Nested Cross Validation or will simple K-Fold Cross Validation suffice?
Do I need to standardize my data in order to run a Logistic Regression algorithm?
How about with a Random Forest?
What performance metric is most appropriate for my model?
What are L1 and L2 regularization (again) and when should I use them?
If you have ever asked yourself any of these questions, rest assured this book will become your go-to reference for these questions as well as questions that you haven’t even thought of yet. Sebastian will fill in the gaps in your knowledge and you’ll gain the confidence to tackle the projects you have been looking forward to working on all this time.
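To make a couple of those questions concrete, here is a minimal sketch (not taken from the book) of standardizing inside a Scikit-learn pipeline and scoring it with stratified K-fold cross validation. The dataset, the 10 folds, and the regularization settings are assumptions chosen only for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small built-in dataset, used here only for illustration.
X, y = load_breast_cancer(return_X_y=True)

# Keeping the scaler inside the pipeline means it is re-fit on the
# training part of every fold, so no test-fold statistics leak in.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l2", C=1.0, max_iter=1000),
)

# 10 folds is an arbitrary choice for this sketch.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Swapping the logistic regression for a tree-based model such as a random forest is a quick way to see that standardization matters far less there, since trees split on feature thresholds rather than on distances or weighted sums.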
FOR ADVANCED USERS:
Much of the information in this book may be familiar to you; however, the mathematical concepts behind the algorithms may not be. You may be interested in reading the seminal research on each of the concepts presented in the book. Sebastian has you covered as well. He provides symbolic mathematical proofs for those so inclined, as well as a multitude of citations for where you can find the research that supports and/or explores the concepts more thoroughly. The book is well researched and cited, and the concepts are given very thorough treatment.
I realize the experience levels described above are subjective. They are present merely to serve as reference points for the readers and to underscore my belief that Python Machine Learning has something for virtually every skill level. I cannot recommend this book more highly!
BONUS - Topics/Algorithms Covered Throughout the Book (there are a TON!):
Stochastic Gradient Descent (SGD)
Support Vector Machines (SVM)
Difference between L1 and L2 regularization (with excellent graphics showing the difference)
Out-of-core/online learning (truly Big Data)
The Kernel Trick
Random Forest Classifier
Parametric vs. Non-parametric (and which are which)
K-Nearest Neighbors (KNN)
Bias/Variance Trade-off (great graphics showing the difference)
Standardization (Data Preprocessing)
Scaling (Data Preprocessing)
Correct Mapping of data types for use in Scikit-learn algorithms
Sequential Forward Selection (Feature Selection)
Sequential Backward Selection (Feature Selection)
Feature Importances using Random Forests (Feature Selection)
Common pitfalls (“gotchas”) that can arise with use of Random Forests
Principal Component Analysis (PCA)
Linear Discriminant Analysis (LDA – this topic is almost never covered in similar books)
Kernel PCA + caveats for its use
Use of Pipelines in Scikit-learn for streamlining the modeling process (this never gets coverage and is a big efficiency boost)
Cross Validation (K Fold)
Nested Cross Validation
Common Metrics for Model Evaluation and how to graph each to gauge performance
Ensemble Methods (Majority Voting Classifier)
Plotting Decision Boundaries (important for gaining insight into ensemble performance)
Bootstrap Aggregating (Bagging)
Sentiment Analysis using bag-of-words model
Sentiment Analysis using SGD Classifier and Out-of-Core learning to analyze large document datasets via streaming/mini-batching for Data that is too large to fit in memory at once
Embedding Machine Learning algorithms into web applications using the web framework called Flask—this is a hot skill to have in the job market
Regression Analysis for Continuous Target Variables
Aesthetic adjustments/extensions to Matplotlib graphs using the Seaborn library
Dealing with non-linear relationships in the context of Regression (Data Transformations)
Clustering (K-Means, Agglomerative, Divisive)
Hard vs. Soft clustering
The “Elbow” method for clustering
Common “gotchas” to be aware of when using clustering algorithms
Artificial Neural Networks (ANN)
Multi-Layer Perceptron (MLP) Neural Net
Using the Theano library to run Neural Networks on Graphics Processing Units (GPU); this is an extremely hot topic and demonstrates the timelessness of many of these algorithms.
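As one example from the list above, out-of-core learning on text can be sketched roughly as follows. This is an illustrative sketch, not the book's code: `stream_docs` and `mini_batches` are hypothetical helpers, the four toy reviews stand in for a file too large to load at once, and the batch size and hash width are arbitrary.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier


def stream_docs():
    """Hypothetical stand-in for reading (text, label) pairs from disk."""
    docs = [
        ("a wonderful, moving film", 1),
        ("terrible plot and awful acting", 0),
        ("great cast and a wonderful script", 1),
        ("boring, awful and far too long", 0),
    ]
    for _ in range(25):  # pretend the file is much larger
        yield from docs


def mini_batches(stream, size):
    """Group a stream of (text, label) pairs into lists of at most `size`."""
    texts, labels = [], []
    for text, label in stream:
        texts.append(text)
        labels.append(label)
        if len(texts) == size:
            yield texts, labels
            texts, labels = [], []
    if texts:
        yield texts, labels


# HashingVectorizer is stateless, so it needs no pass over the full corpus.
vectorizer = HashingVectorizer(n_features=2**12, alternate_sign=False)
clf = SGDClassifier(random_state=1)

n_batches = 0
for texts, labels in mini_batches(stream_docs(), size=10):
    X = vectorizer.transform(texts)
    # partial_fit updates the model one mini-batch at a time; the full
    # set of class labels must be supplied on the first call.
    clf.partial_fit(X, labels, classes=[0, 1])
    n_batches += 1

pred = clf.predict(vectorizer.transform(["a wonderful cast", "awful and boring"]))
print(pred)
```

Only one mini-batch is ever held in memory, which is the point of the technique: the vectorizer hashes tokens directly to column indices and the classifier is updated incrementally.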
Let’s face it, we know that machine learning isn’t an easy subject. You need theory…but you also need practice in the form of some serious coding before you really start understanding it. And this is one area where Sebastian’s book shines: it contains a plethora of really good code examples that are illuminating and well explained, and which cover a very wide range of different machine learning algorithms. And, speaking of code, as another reviewer has pointed out, another huge plus is that, in many places, Sebastian shows you how to gauge the performance of your code and make it more efficient.
For me, the best measure of any book such as this is how many “ah ha!” moments I had while reading it. And I had more than a few while reading Sebastian’s book. One such “ah ha!” moment came while reading chapter 12 (and this also illustrates that nice blend of theory and practice I already mentioned above). In this particular chapter, he discusses training artificial neural networks for image recognition. At the heart of this approach is back propagation, which is pretty much THE bread and butter behind multilayered neural networks. He presents a detailed discussion of back propagation in two separate pieces: one that is intuitive and “top down”; the other a more mathematical, “bottom up” approach that goes through the algorithm step by step, showing how the gradients are computed and the weights updated. His treatment of back propagation was one of the better explanations I’ve seen and really cleared things up for me.
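For readers who want to see the "gradients computed, weights updated" step in code, here is a minimal NumPy sketch of back propagation for a network with one hidden layer. It is not the book's implementation: the layer sizes, the random toy data, the sigmoid activations, and the sum-of-squares loss are all simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 5 samples, 3 features, 2 classes (one-hot targets).
X = rng.normal(size=(5, 3))
Y = np.eye(2)[rng.integers(0, 2, size=5)]

# One hidden layer with 4 units; sizes are arbitrary for this sketch.
W1 = rng.normal(scale=0.5, size=(3, 4))
W2 = rng.normal(scale=0.5, size=(4, 2))


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def loss(W1, W2):
    # Forward pass, then sum-of-squares error.
    A1 = sigmoid(X @ W1)
    A2 = sigmoid(A1 @ W2)
    return 0.5 * np.sum((A2 - Y) ** 2)


def backprop(W1, W2):
    # Forward pass, keeping the activations for the backward pass.
    A1 = sigmoid(X @ W1)
    A2 = sigmoid(A1 @ W2)
    # Error term at the output layer (derivative w.r.t. its net input).
    delta2 = (A2 - Y) * A2 * (1.0 - A2)
    # Propagate the error back through W2 to the hidden layer.
    delta1 = (delta2 @ W2.T) * A1 * (1.0 - A1)
    # Gradients of the loss with respect to each weight matrix.
    return X.T @ delta1, A1.T @ delta2


grad_W1, grad_W2 = backprop(W1, W2)

# A single small gradient-descent step on both weight matrices.
eta = 0.05
before = loss(W1, W2)
after = loss(W1 - eta * grad_W1, W2 - eta * grad_W2)
print(before, after)
```

A useful sanity check for any hand-written back propagation is to compare one entry of the analytic gradient against a finite-difference estimate of the loss; the two should agree to several decimal places.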
One last thing I must mention: at the time of release, this was the first machine learning book for Python (to my knowledge) that has an entire chapter devoted to Theano, which he uses to parallelize neural network training. For those who don’t know, Theano is a particularly nice (not to mention very powerful) Python library for doing machine learning, most especially if you can utilize the power of GPU computing. In addition, that particular chapter (13) also introduces the brand new Python library named Keras, which is built on top of Theano and is a really nice library for the rapid building and prototyping of neural networks (in the spirit of Torch). Being a brand new library, his treatment of Keras was necessarily brief, but it was a great starting point.
In conclusion, I am very confident that if you do pick up this book, you won’t be at all disappointed. And be sure to grab the accompanying code for the book from his GitHub repository (just look for “python-machine-learning-book” on github.com/rasbt). His code is top notch and I’ve yet to encounter any problems with it.