A Turkish version of this post is also available.
This time I wanted to write a technical post. In the Computer Engineering Department (CED) of Yildiz Technical University (YTU), students are required to complete two projects in order to graduate. The first is called the "Computer Project" (CP) and the second the "Graduation Project" (GP). These projects improve a student's skills in system design and software development and steer them toward a specific field of Computer Engineering. One of the reasons I'm writing this article is to help future students overcome their fears about these two projects. The main aim of this post, though, is to give an idea of Machine Learning (ML) and of working with Python.
For those who haven't heard of ML: what it basically does is enable a computer to act without being explicitly programmed. This means that, by learning from past experience as we humans do, it tries to perform better. Given a set of examples of objects, such as computers and telephones, it tries to learn the features that make a computer a computer and a phone a phone.
Java is the language most often used for projects in the CED of YTU, but it isn't our only option. We have the capability and enough resources to run projects in other languages that we are more familiar with. Of course, no matter what you're working on, please do a feasibility check first to be sure that the project can be done in the language you prefer. For both projects I chose to work with Python because of several advantages of the language:
1) I feel more comfortable with Python. Given a reasonable task, I feel that Python won't fail me as long as I do the right things. This doesn't mean I'm a Python expert; it means I know I can get it done by working on it.
2) Especially in ML, there is a lot of research and many projects developed in Python. When you face a problem during development, you are more likely to find a solution to it.
3) Thanks to awesome libraries and toolkits, it's really capable of building a real-life system. There are many mature libraries and packages that you can use. It's possible to build an ML system on the back end and show the results in a web application, or to expose an API so that a mobile application can train or test your model through your Python system.
4) You're always on the go. You can build a demo application in minutes to test whether it's feasible to go with Python, or you can just search Google or GitHub to see for yourself. In both projects mentioned above, your mentor will usually give you tasks that last one or two weeks, and Python's flexibility and rapid development capabilities will help you complete them on time.
5) It's reassuring to know that industry leaders use it to develop the products we use in our everyday lives; see, for example, "How we use Python at Spotify" or "Practical Machine Learning in Python".
In short, Python is an incredible programming language not only for general-purpose work but also for Data Science, including ML.
You could add many more items to the list above, but I guess this is enough for this post. The main steps of an ML system are basically as follows:
- Collecting data
- Data preprocessing and Feature Extraction
  - Removing irrelevant information
  - Normalization etc.
- Building a model
- Testing the model
- Evaluating the model
I will go through each step with a few lines of code to give you some insight into it. For that I will use scikit-learn, which is an awesome ML library, though not the only one.
1) Collecting data
To build an ML system we need a dataset to work on. You can either collect data from websites by writing a program, or use one of the sample datasets for ML, such as the Iris dataset. More sample datasets can be found on the sklearn.datasets page, the UCI KDD Archive or the UCI Machine Learning Repository.
There are even Python wrappers to load these datasets into your application. Isn't that cool?
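As a small sketch of such a wrapper, the snippet below loads the Iris dataset that ships with scikit-learn. The variable names are my own choice for illustration.

```python
from sklearn.datasets import load_iris

# Load the bundled Iris dataset (no download needed)
iris = load_iris()

X = iris.data      # feature matrix: one row per flower, one column per measurement
y = iris.target    # integer class label for each flower

print(X.shape)               # (150, 4)
print(iris.feature_names)
print(iris.target_names)
```

The returned object also carries the feature and class names, which is handy when you inspect results later.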
2) Preprocessing and Feature Extraction
Once we have the data, we usually need to transform it into another form or remove some irrelevant information from it. Let's say you have tweet data and want to build a system that does sentiment analysis. Removing web URLs, mentions and hashtags so that only relevant information remains would be a logical thing to do, because a web URL is unlikely to be a feature that defines the sentiment of a tweet.
Another example is finding the roots of words so that you feed your model with unique features. For a sentiment analysis application based on a Turkish dataset, using "gidiyorum" ("I am going") and "gidiyorsun" ("you are going") as two different features will increase your feature count and dimensionality. So in the feature extraction step you may want to use the stems of words as features, or correct misspelled words, using a morphological parser such as Zemberek.
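The tweet cleaning described above could be sketched roughly as follows. The regular expressions and the `clean_tweet` helper are hypothetical and kept deliberately simple; real tweets may need more careful rules.

```python
import re

# Hypothetical patterns for this sketch; real tweets may need more careful rules
URL_PATTERN = re.compile(r'https?://\S+')
MENTION_PATTERN = re.compile(r'@\w+')
HASHTAG_PATTERN = re.compile(r'#\w+')


def clean_tweet(text):
    """Remove URLs, mentions and hashtags, then collapse extra whitespace."""
    for pattern in (URL_PATTERN, MENTION_PATTERN, HASHTAG_PATTERN):
        text = pattern.sub(' ', text)
    return ' '.join(text.split())


# Made-up tweet for illustration
tweet = 'Loved the concert last night @some_friend #music http://example.com/photo'
print(clean_tweet(tweet))  # Loved the concert last night
```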
Preprocessing doesn't always mean removing information. Say we are given another classification problem, spam classification. This time we can replace every web URL with a token such as $WEBURL$, which represents the fact that a web URL is present in the mail. All of this is what we call preprocessing.
In some problems we may have features that scale over different ranges. For example, in a house price prediction problem, the number of rooms of a house may range between 1 and 6, but the area of the house may range between 60 and 200 square meters. In this kind of problem we apply normalization, which can also be done easily with scikit-learn.
From image processing to text classification, our statistical models always work on vectors and matrices, which means we need to turn our documents or information into vectors. There are a number of vectorizers available in scikit-learn. CountVectorizer is one of them; it converts documents to a matrix of token counts. Imagine we want to count every feature's occurrences in a text. Here is an example of CountVectorizer:
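A minimal sketch of the URL token idea, assuming a simple pattern that only matches http and https links. The `mark_urls` helper is hypothetical.

```python
import re

URL_PATTERN = re.compile(r'https?://\S+')
URL_TOKEN = '$WEBURL$'


def mark_urls(text):
    """Replace each web URL with a placeholder token instead of deleting it."""
    # A function replacement avoids any escape processing of the token
    return URL_PATTERN.sub(lambda match: URL_TOKEN, text)


# Made-up mail text for illustration
mail = 'You won a prize! Claim it at http://example.com/win now'
print(mark_urls(mail))  # You won a prize! Claim it at $WEBURL$ now
```

This way the model can still learn that mails containing links behave differently, without treating every distinct URL as its own feature.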
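One way to do this is min-max scaling, sketched below with scikit-learn's `MinMaxScaler`. The house values are made up for illustration, and min-max scaling is only one of several normalization options (`StandardScaler` is another).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up houses for illustration: [number of rooms, area in square meters]
houses = np.array([
    [1, 60],
    [3, 110],
    [4, 150],
    [6, 200],
], dtype=float)

# Rescale each column independently to the range [0, 1]
scaler = MinMaxScaler()
houses_scaled = scaler.fit_transform(houses)
print(houses_scaled)
```

After scaling, both columns lie between 0 and 1, so neither feature dominates just because of its unit.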
    from sklearn.feature_extraction.text import CountVectorizer

    document = [
        'I love going out at weekends',
        'Yesterday we went to theatre with some friends.',
        'How many tickets do I need?',
        'Weekends are awesome. I love them.',
    ]

    # Use single words (unigrams) as features
    vectorizer = CountVectorizer(ngram_range=(1, 1), analyzer='word')
    X = vectorizer.fit_transform(document).toarray()
    features = vectorizer.get_feature_names_out()

    print(X)
    print(features)
In this setup, every word in the document is treated as a feature, but you can change the parameters given to CountVectorizer to change the features used to build your model. For instance, if you pass ngram_range=(3, 3), analyzer='char', you get character-level 3-grams of the document. The snippet above prints the matrix X followed by the feature list.
Normally our variable X, which is the return value of CountVectorizer's fit_transform call, would be a sparse matrix, since it contains only a small number of non-zero elements; we call toarray() to get a dense array for printing. In the output, each line of our document has become a vector. These vectors hold the count of each feature in that line, which here is only ones and zeros because no word repeats within a line. The features are listed after the matrix, and each column of the vectors corresponds to the feature at the same index in that list. By doing so, we've vectorized our text document so that we can build a model.
3) Building a model and training
When it comes to building a model, we need to understand our problem very well. If we are going to predict house prices from given sample data, it would be more logical to train a linear regression model. As another example, if we have an unsupervised learning problem, such as clustering venues according to their categories and some other features, we may want to build a k-Means model to group similar venues. For demonstration purposes, a basic example is given below.
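    from sklearn.svm import SVC
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split

    # Load the sample dataset: feature matrix X and class labels y
    dataset = load_iris()
    X = dataset.data
    y = dataset.target

    # Hold out a third of the samples for testing
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

    # Train a Support Vector Classifier on the training partition
    model = SVC()
    model.fit(X_train, y_train)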
In the snippet above we load a sample dataset, split it into train and test partitions, and train our SVM model on the train set. Our model is an SVC instance, which is a Support Vector Classifier; created with no arguments, it uses scikit-learn's default RBF kernel (pass kernel='linear' if you want a linear kernel).
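For the house-price case mentioned at the start of this step, a linear regression sketch might look like the following. The rooms, areas and prices are made up, and the prices are generated by a simple rule so that the fit is easy to check by hand.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data: [number of rooms, area in square meters]
X_houses = np.array([
    [1, 60],
    [2, 85],
    [3, 110],
    [4, 150],
    [6, 200],
], dtype=float)

# Made-up prices generated by a simple rule so the fit is easy to check:
# price = 20000 * rooms + 1500 * area
y_prices = 20000 * X_houses[:, 0] + 1500 * X_houses[:, 1]

reg = LinearRegression()
reg.fit(X_houses, y_prices)

# Predict the price of an unseen house with 3 rooms and 120 square meters
new_house = np.array([[3, 120]], dtype=float)
print(reg.predict(new_house))
```

Because the made-up prices follow the rule exactly, the prediction should come out at about 240000; real data is noisy and will not fit this cleanly.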
4) Testing the model
After fitting the SVM model it is pretty simple to test it with our test set:
    predictions = model.predict(X_test)
It's pretty easy, isn't it? The predictions array holds the labels our SVM model predicted for the test set.
5) Evaluating the model
After building, training and testing a model, it's time for evaluation, to see how well the model performed on our problem. It's a crucial part of an ML system because we may change our strategy according to the result. There are a number of metrics for measuring our model's performance:
- Mean Squared Error (MSE), Mean Absolute Error (MAE)
- Accuracy Score
- Precision, Recall and F1 scores
In this step, we need to know what the value of each of these metrics means. For instance, if we have skewed data, such as 90% positive examples and 10% negative examples, the accuracy score may not tell the whole truth. In such cases we can look at the F1 score to decide whether our model worked well. To learn how our model performed on our test set, we can use the sklearn.metrics module with the predictions we got from the model:
    from sklearn.metrics import accuracy_score

    # Fraction of test samples whose predicted label matches the true label
    acc_score = accuracy_score(y_test, predictions)
    print(acc_score)
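The point about skewed data can be sketched with made-up labels. Here a useless "model" always predicts the positive class: accuracy still looks good, while the F1 score of the minority class exposes the problem. Using `pos_label=0` to score the negative class is my choice for this illustration.

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up skewed labels: 90 positive (1) and 10 negative (0) examples
y_true = [1] * 90 + [0] * 10

# A useless "model" that always predicts the positive class
y_pred = [1] * 100

acc = accuracy_score(y_true, y_pred)
# F1 for the minority (negative) class; zero_division=0 avoids a warning
# when the model never predicts that class
f1_negative = f1_score(y_true, y_pred, pos_label=0, zero_division=0)

print(acc)          # 0.9
print(f1_negative)  # 0.0
```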
Moreover, there are many options we can try with a low-performing system, by which I mean a system that doesn't produce the results we expect. For instance, after plotting the learning curve of our model we can decide to collect more data and retrain, or we can conclude that we have many more features than we should and need to reduce their number, which is called dimensionality reduction.
So, we went through a typical ML system and its main components. I hope this gives you an idea of working with Python on an ML problem. For people who are looking for a platform to work on in their CP or GP, I suggest building an ML system with the power of Python and its awesome libraries. You can pick a type of ML problem, find a proper dataset and work on it for your CP or GP. Some subtopics in ML:
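As one possible illustration of dimensionality reduction, the sketch below applies PCA to the Iris data. PCA is just one technique among several, and keeping two components is an arbitrary choice for this example.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data          # 150 samples, 4 features

pca = PCA(n_components=2)     # keep 2 components (an arbitrary choice here)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                    # (150, 2)
print(pca.explained_variance_ratio_)      # share of variance kept by each component
```

The explained variance ratio tells you how much information survives the reduction, which helps when deciding how many components to keep.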
- Deep learning
- Active learning
- Sentiment analysis
- Recommender systems
- Collaborative filtering
- Image recognition
For those who want to go further in the field, here are some useful resources that can help you build up knowledge about ML:
- Curated list
- Books & Videos:
- Andrew Ng
- The Elements of Statistical Learning
- Introduction to Machine Learning
- There are plenty more resources available
- I've taken the course from Stanford University; it was a comprehensive, fundamental course covering the basics. I recommend taking it.
Thank you for reading this far. If you have questions, please leave a comment below.