During the honeymoon phase, with the first MOOCs and Andrew Ng repeating “concretely” 10 times in a 6-minute video, machine learning seems pretty easy and intuitive. There are plenty of Medium articles and tutorials that we can read quickly, even in the Parisian metro, and still understand.

But sooner or later, during an interview or with coworkers, we come to realize that there is far more to data science than just reading blog articles or following well-designed MOOCs. Proper code versioning, clean code habits, advanced machine learning libraries and algorithms, dataviz, advanced probability and statistics … Whether on the theoretical or the practical side, there is a huge gap between readers and newcomers (or sometimes bloggers like me haha) and professional practitioners, the real (data) scientists. Last year, while still in uni, I took a few hours of my free time to bridge that gap, at least on the theoretical side. These are some of the resources I used. As I am always learning, I will add more resources as I discover them.

Even though the latest state-of-the-art models (BERT, XLNet) are not included, I still share these NLP notes from CS224n.

Wikistat.fr: French only, but really worth it! If I had to choose only one site, it would certainly be this one.

Books

These are some books that I spent time reading. They require much more effort than the PDFs shared above.

MLAPP, Machine Learning: A Probabilistic Perspective, by Kevin Murphy

ESL, The Elements of Statistical Learning, by Hastie, Tibshirani, and Friedman

Deep Learning, by Goodfellow, Bengio, and Courville

While The Elements of Statistical Learning was the first one I read, I found it too verbose in some parts. Chapters 1 to 4 were really worth my time though (these notes helped me a lot!). Machine Learning: A Probabilistic Perspective is my favorite: it is more concise and tries (and sometimes fails) to go straight to the point in every chapter. It is easy to miss some steps in the equations, but that is part of the learning process haha! Finally, I read the Deep Learning book “just for fun”. As Deep Learning is more experimental than theoretical, with a lot of trial and error, I did not want to spend too much time trying to understand the theory. Understanding the main architectures (MLP, ConvNet, RNN …), backpropagation, and why LSTM and GRU architectures mitigate the vanishing gradient problem was enough for me.

Other resources

Some other books or PDFs that I found interesting:

In the example below we have two classes: $C_0$ and $C_1$. The points of class $C_0$ follow a normal distribution with variance 4. The points of class $C_1$ follow a normal distribution with variance 1. Class $C_0$ represents 90% of the data set and class $C_1$ represents 10%.
The following image represents a dataset containing 50 points as well as the theoretical distributions of the two classes in the corresponding proportions. The overlap between the two classes is varied by changing the mean of class $C_1$.

The theoretical minimum error probability is given by the area under the minimum of the two overlapping curves, each class density being weighted by its proportion $P(C_i)$. It is given by the following expression.
$$
P(false)=\int_R P(false|x)P(x)dx=\int_R \min\big(P(x|C_0)P(C_0),\ P(x|C_1)P(C_1)\big)dx
$$
This probability could be used as a separability measure because it measures the overlap between the distributions of classes $C_1$ and $C_0$. However, in practice we cannot calculate this integral because we do not have the exact expression of the probability densities.
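For the toy example above, where the densities are known, the integral can be approximated numerically. The following is a minimal sketch, not the code used to produce the figures: it weights each Gaussian density by its class proportion (90% and 10%) and assumes that class $C_0$ is centered at 0, which the text does not state. The function names, integration bounds and grid size are arbitrary choices.

```python
import math


def normal_pdf(x, mean, std):
    """Density of a normal distribution N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def min_error_probability(mean_1, mean_0=0.0, std_0=2.0, std_1=1.0,
                          prior_0=0.9, prior_1=0.1,
                          lo=-30.0, hi=30.0, steps=60000):
    """Trapezoidal approximation of the area under the minimum of the
    two prior-weighted class densities.

    mean_0 = 0 is an assumption; std_0 = 2 and std_1 = 1 correspond to
    the variances 4 and 1 of the example.
    """
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * dx
        f = min(prior_0 * normal_pdf(x, mean_0, std_0),
                prior_1 * normal_pdf(x, mean_1, std_1))
        weight = 0.5 if i in (0, steps) else 1.0  # trapezoid end points
        total += weight * f * dx
    return total


# The error shrinks as the mean of C_1 moves away from C_0.
for mean_1 in (0.0, 2.0, 4.0, 8.0):
    print(mean_1, min_error_probability(mean_1))
```

When both classes share the same mean, the weighted density of $C_1$ lies below that of $C_0$ everywhere, so the best possible decision is to always predict $C_0$ and the minimum error equals the minority proportion, 10%.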

Separability in the linear case

Another expression of class separability is given by Wikipedia in the linear case:

Let $X_0$ and $X_1$ be two sets of points in an n-dimensional Euclidean space. Then $X_0$ and $X_1$ are linearly separable if there are $n+1$ real numbers $w_1, w_2, \dots, w_n, k$ such that $\sum_{i=1}^n w_ix_i>k$ for any $x \in X_0$ and $\sum_{i=1}^n w_ix_i<k$ for any $x \in X_1$.

However, it does not give any separability measure to be used in concrete cases.
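To make the definition concrete, here is a small sketch that checks whether a given candidate hyperplane $(w, k)$ separates two sets of points. It only verifies the condition for weights you supply; it does not search for them, and the function name and toy points are made up for illustration.

```python
def separates(w, k, X0, X1):
    """Return True if sum_i w_i * x_i > k for every x in X0
    and sum_i w_i * x_i < k for every x in X1."""
    def score(x):
        return sum(w_i * x_i for w_i, x_i in zip(w, x))

    return all(score(x) > k for x in X0) and all(score(x) < k for x in X1)


# Toy 2-dimensional example (hypothetical points).
X0 = [(2.0, 2.0), (3.0, 1.0)]
X1 = [(0.0, 0.0), (-1.0, 1.0)]

print(separates((1.0, 1.0), 2.0, X0, X1))   # the line x1 + x2 = 2
print(separates((1.0, -1.0), 0.0, X0, X1))  # the line x1 - x2 = 0
```

The answer is binary: the sets are separated by this hyperplane or they are not, which is why the definition alone does not provide a graded measure.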

My trick: supervised clustering

In theory

In the absence of a ready-made separability measure, I have found a way to estimate the separability of classes:

Perform clustering with an algorithm appropriate to your dataset. See the scikit-learn page.

Choose k, the number of clusters, in a way that is consistent with silhouette analysis. See sklearn.

The separability measure is then the homogeneity score of the resulting clusters with respect to the class labels. This measure involves the conditional entropy of the class given the cluster, $H(C|K)$, normalized by the entropy of the class, $H(C)$. The lower the conditional entropy, the more information the cluster K gives about the class C, and therefore the more homogeneous the clusters are.
The homogeneity score $h$, bounded between 0 and 1, is defined as follows, with a maximum value of 1 (perfect homogeneity):

$$
h=1-\frac{H(C|K)}{H(C)}
$$
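The formula can be computed directly from two label lists. The sketch below uses only the standard library; the function names are mine, and the convention of returning 1 when there is a single class (so that $H(C)=0$) is an assumption made to avoid a division by zero.

```python
import math
from collections import Counter


def entropy(labels):
    """Empirical entropy H(C) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def conditional_entropy(classes, clusters):
    """Empirical conditional entropy H(C|K) of the class given the cluster."""
    n = len(classes)
    joint = Counter(zip(classes, clusters))
    cluster_sizes = Counter(clusters)
    return -sum((n_ck / n) * math.log(n_ck / cluster_sizes[k])
                for (_, k), n_ck in joint.items())


def homogeneity(classes, clusters):
    """h = 1 - H(C|K) / H(C); defined as 1 when there is a single class."""
    h_c = entropy(classes)
    if h_c == 0.0:
        return 1.0
    return 1.0 - conditional_entropy(classes, clusters) / h_c


# Each cluster contains a single class: perfectly homogeneous.
print(homogeneity([0, 0, 1, 1], [0, 0, 1, 1]))
# Each cluster is a 50/50 mix of both classes: no information on the class.
print(homogeneity([0, 1, 0, 1], [0, 0, 1, 1]))
```

scikit-learn exposes the same quantity as `sklearn.metrics.homogeneity_score(labels_true, labels_pred)`.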

For more information on this measure, do read this research paper written by Rosenberg and Hirschberg.

It is therefore a form of supervised clustering: labels are used (they are involved in the calculation of the conditional entropy) to optimize the clustering.
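Put together, the procedure could look like the sketch below. It makes several assumptions that are not in the text: K-means as the clustering algorithm, a simplified silhouette analysis that just keeps the k with the highest average silhouette, a candidate range of 2 to 6 clusters, and synthetic 2D data with a 90/10 class imbalance. The function name `separability_score` is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import homogeneity_score, silhouette_score


def separability_score(X, y, candidate_ks=range(2, 7), random_state=0):
    """Cluster X, keep the k with the best average silhouette,
    and return (k, homogeneity of those clusters w.r.t. the labels y)."""
    best_k, best_sil, best_labels = None, -1.0, None
    for k in candidate_ks:
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=random_state).fit_predict(X)
        sil = silhouette_score(X, labels)
        if sil > best_sil:
            best_k, best_sil, best_labels = k, sil, labels
    return best_k, homogeneity_score(y, best_labels)


# Synthetic, well-separated data: 90 points of class 0, 10 points of class 1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(90, 2)),
               rng.normal(8.0, 1.0, size=(10, 2))])
y = np.array([0] * 90 + [1] * 10)

k, h = separability_score(X, y)
print(k, h)
```

Moving the second blob closer to the first one should lower the score, which is the behavior expected from a separability measure.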

In practice

The following image shows the correlation between class separability (and therefore cluster homogeneity) and the performance of several classifiers: Random Forest (RF), KNN, MLP (Multi-Layer Perceptron), SVM (RBF kernel) and Logistic Regression, for an imbalance rate of 1 percent. It justifies (I hope) the use of this class separability measure.

Thus, no matter how much class imbalance there may be, if separability is poor, there is no point in bringing out an artillery of techniques to get around the problem. It will be better to work on the data (feature engineering, creation of new variables, discussion with an expert) to increase class separability.

While trading may be the most exciting and lucrative domain of application of Machine Learning, it is also one of the most challenging. Trading is not only about buying or selling, nor is it just about analysing the financial state of a target company. One of the reasons why it is so difficult to be a top trader is that it requires considering large amounts of data of different natures. This also explains the machine learning hype in trading. Text, speech, numbers, images … Machine learning algorithms can deal with almost any type of data. In this series of articles, we will introduce an implementation of a not-so-common deep learning approach to stock price trend prediction based on financial news. Our inspiration comes from the recent research paper “Listening to Chaotic Whispers: A Deep Learning Framework for News-oriented Stock Trend Prediction” (LCW).

Recent trends in research papers and blog articles

Many approaches introduced in recent years’ research papers suffer from incompleteness. One of those approaches consists in designing an algorithm based on the last few days’ stock prices only. Recurrent Neural Networks are widely used for that purpose. Another way is to use sentiment analysis in the trading policy. If you are familiar with machine learning for trading, you have certainly come across “Stock trading with Twitter using sentiment analysis”. Reinforcement learning is also trendy now, as shown in this paper released in July 2018. While these solutions give impressive results, we think that they underuse the potential power of machine learning algorithms.

What makes this paper so special?

Nowadays, A.I. is trying to become more human. And algorithmic trading is no exception. Some of the recently published research papers try to design frameworks that imitate real investors. LCW is one of them and that’s why we chose it.

Where is the innovation?

To us, LCW does better at replicating human behavior than many other papers on this topic. As you will understand, this paper is mainly about text mining. Now, imagine you are an investor trying to predict the variation of one stock tomorrow. You may try to gather as much information as possible about the company over the last few days. And then you get an idea of how the stock price might evolve over the next few days. This is the use case that LCW tries to solve using a deep learning framework that takes time sequences of articles as input.

The authors have taken into account three characteristics of the learning process followed by an investor struggling with the “chaotic news”:

First, the Sequential Context Dependency. This simply refers to the fact that a single news item is more informative within a broader context than in isolation.

Second, the Diverse Influence. One critical news item can affect the stock price for weeks, whereas a trivial one may have zero effect.

Third, the “Effective and Efficient Learning”. It consists in learning from the more common situations before turning to exceptional cases.

This is not a theoretical paper but rather a math-engineering paper. The design process might be like this:

– We have to deal with sequences of press articles. What neural network could we use for that?

– Alright, now give me a simple neural network to perform a three-class classification.

– Multilayer perceptron!

There you have it:

As you can see, the innovation is in the design of the whole framework, and in the way they connect different neural networks to solve the whole problem. We will let you read more about the architecture in section 4.2 of the paper.

They have also implemented a Self-Paced Learning algorithm, which aims at more effective and efficient learning. You may read section 4.3 for more information about this algorithm. We did not have enough time to implement this part of the paper.

Our workflow

Pierrick and I are two French engineering students, currently in the second year of our Engineering Master program. We are by no means experts in trading, and we are beginners in machine learning. This is our first paper implementation, and our first technical blog post as well, so we are open to any constructive criticism of both our code and our articles.

Our workflow is divided into 4 steps:

Scraping

For the purpose of this research project we have scraped all the articles published on reuters.com from 2015 to 2017. We mainly used the BeautifulSoup and urllib libraries, as well as the multiprocessing library. And yes, the whole project is in Python 3.

Articles Vectorization

We chose not to follow the paper on this part. After collecting more than 1 million articles (see 5.1.1 in the paper), the authors trained a Word2Vec model on the whole vocabulary of their articles. Then they computed the mean of the vectors of all the words in an article to obtain a vector representation of it. We preferred to use Doc2Vec for a better representation of the article. Our choice was inspired by this comparison. We used the Gensim library for that.
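To illustrate the paper's baseline that we replaced, here is a minimal sketch of the "mean of word vectors" representation. The tiny embedding table stands in for a trained Word2Vec model and is entirely made up; skipping out-of-vocabulary words and returning a zero vector for an empty article are assumptions as well.

```python
def mean_article_vector(tokens, word_vectors, dim):
    """Average the vectors of the known words of an article.

    word_vectors: dict mapping a word to a list of `dim` floats
    (a stand-in for a trained embedding model).
    """
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim  # assumption: empty or fully unknown article
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


# Hypothetical 2-dimensional embeddings.
toy_vectors = {
    "stock": [1.0, 0.0],
    "rises": [0.0, 1.0],
}

print(mean_article_vector(["stock", "rises", "today"], toy_vectors, dim=2))
```

Averaging discards word order: two articles made of the same words in a different order get the same vector, which is one reason to look at a document-level model such as Doc2Vec.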

Dataset Creation

This part consisted of creating a dataset that the HAN network can train on. Our data X is a time sequence of vectorized articles from day t-10 to day t, and the target value Y is the variation of the corresponding stock on day t+1. The Scikit-Learn and Pickle libraries were very helpful for this task.
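The windowing described above can be sketched as follows. This is not our actual dataset code: the function name and data layout are hypothetical, and the window is read as inclusive, so days t-10 to t make 11 days.

```python
def build_samples(daily_vectors, daily_variations, window=11):
    """Build (X, Y) pairs from per-day data.

    daily_vectors[d]:    the vectorized articles of day d
    daily_variations[d]: the stock variation on day d
    Each sample uses days t-10..t as input and day t+1 as target.
    """
    X, Y = [], []
    # t must have `window - 1` days before it and one day after it.
    for t in range(window - 1, len(daily_vectors) - 1):
        X.append(daily_vectors[t - window + 1: t + 1])
        Y.append(daily_variations[t + 1])
    return X, Y


# 15 days of placeholder data: day indices stand in for article vectors.
days = list(range(15))
variations = [0.01 * d for d in days]

X, Y = build_samples(days, variations)
print(len(X), X[0], Y[0])
```

With 15 days of data, only days 10 to 13 can play the role of t, so four samples are produced.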

Model Training

We used Keras to build the model. The TimeDistributed wrapper was of great use for applying the attention mechanism to the input and output of the GRU network.
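For readers unfamiliar with attention, the sketch below shows a generic additive attention pooling in NumPy: each vector of a sequence gets a score, the scores are turned into weights with a softmax, and the sequence is summarized by the weighted sum. This is a generic illustration under our own naming, not the Keras code of this project nor the exact formulation of the paper; the parameters W, b and u would normally be learned, and here they are just passed in.

```python
import numpy as np


def attention_pool(H, W, b, u):
    """Additive attention pooling over a sequence.

    H: (steps, d) sequence of vectors (e.g. articles of a day, or GRU outputs)
    W: (d, a), b: (a,), u: (a,) attention parameters (learned in practice)
    Returns the attention weights (steps,) and the pooled vector (d,).
    """
    scores = np.tanh(H @ W + b) @ u          # one score per step
    scores = scores - scores.max()           # numerical stability of softmax
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights, weights @ H


rng = np.random.default_rng(0)
H = rng.normal(size=(5, 4))                  # 5 steps, vectors of size 4
W = rng.normal(size=(4, 3))
b = np.zeros(3)
u = rng.normal(size=3)

weights, pooled = attention_pool(H, W, b, u)
print(weights, pooled.shape)
```

The weights are what lets the model give a critical article more influence than a trivial one; with all parameters set to zero, every step gets the same weight and the pooling reduces to a plain mean.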

Final word

Implementing this paper was thrilling, and we look forward to writing about each step of the implementation. Many thanks to the authors for this inspiring paper.

To stay in touch, feel free to contact Pierrick or myself on LinkedIn!