Modelling the Stock Market using Twitter


M. Sebastian A. Wolfram
Master of Science
Artificial Intelligence
School of Informatics
University of Edinburgh
Abstract
Stock markets are driven by a multitude of dynamics in which facts and beliefs play
a major role in affecting the price of a company’s stock. In today’s information age,
news can, in some cases, spread around the globe faster than the events themselves unfold. While this speed can
be beneficial for many applications, including disaster prevention, our aim in this thesis
is to use the timely release of information to model the stock market. We extract facts
and beliefs from the population using one of the fastest growing social networking
tools on the Internet, namely Twitter. We examine the use of Natural Language Processing techniques with a predictive machine learning approach to analyze millions of
Twitter posts from which we draw distinctive features to create a model that enables
the prediction of stock prices. We selected several stocks from the NASDAQ stock exchange and collected Intra-Day stock quotes over a period of two weeks. We built
different feature representations from the raw Twitter posts and combined them with
the stock price in order to build a regression model using the Support Vector Regression algorithm. We were able to build models of the stocks which predicted discrete
prices that were close to a strong baseline. We further investigated the prediction of
future prices, on average predicting 15 minutes ahead of the actual price, and evaluated the results using a Virtual Stock Trading Engine. These results were in general
promising, but also contained some random variation across the different datasets.
Acknowledgements
I would like to thank Miles Osborne not only for supervising my thesis, but also for
being a great mentor during the entire process. I will dearly miss his first question at
the beginning of every weekly meeting, ‘Are we rich yet?’ I am dedicating this work
to my father, without whom I would not have had the opportunity to begin, nor been
able to finish, this master’s degree. I am very thankful to my wife who has greatly supported me
in the most difficult moments, especially because she sacrificed a lot for the pursuit of
my goals. I want to thank my mom for her constant support and motivational talk and
my sister for sending me a lot of pictures of my cute niece, Amelie.
Declaration
I declare that this thesis was composed by myself, that the work contained herein is
my own except where explicitly stated otherwise in the text, and that this work has not
been submitted for any other degree or professional qualification except as specified.
(M. Sebastian A. Wolfram)
Chapter 1
Introduction
One popular area amongst researchers and financial analysts for pattern recognition
and machine learning applications is that of the highly dynamic and data-intensive financial
markets. Besides the evident motivation of gaining an advantage in investment opportunities, predicting the price of a security using statistical tools, machine
learning techniques, or fundamental and technical approaches remains a field
of extensive research, and no method has yet been discovered that can reliably accomplish
such a task. Efforts that have attempted to solve this problem have shown only small
and unstable successes. Furthermore, stock market prediction research stands against
widely accepted theories that imply predicting the price of a security is an impossible
task. One such theory is the Random Walk hypothesis (Malkiel, 1996) which states
that the price movement of a stock is no more predictable than the random selection of
successive steps in the positive, negative or equal direction of the value of the stock.
Moreover, the efficient-market hypothesis (EMH) (Fama, 1965) says that the prices
of securities reflect all available information about the current financial standing of the
company, and that any new information is immediately incorporated
into the value of the stock. Together, these hypotheses say that an attempt
to predict market values is based solely on chance and that investors placing orders do
so at the security’s intrinsic value rather than at an anticipated lower buying or higher
selling value.
On the other hand, an experiment conducted by Lebaron (1999), in which he created an artificial stock market to study the decision processes of stock
traders based on the timely introduction of information, revealed a lag between the time that
information was introduced and the time the market adjusted itself. To exploit
this lag, research in the analysis of textual data has found modest success in predicting stock market prices using text mining and Natural Language Processing (NLP)
techniques. These methods usually extract information from various sources on the
web including news wires, personal and company websites and blogs as well as social
networking communities and micro blogs.
In today’s information age, opinions, facts and random chatter are created and exchanged at extraordinary rates. There are many different types of media individuals use
to share information, but it is the infrastructure of wireless networking combined with small and inexpensive mobile devices which made the explosion of fast data
and information exchange possible. For individuals involved in this social ensemble,
reading and submitting status updates has become a new way of life. While many
people send their information privately to other groups or individuals, developments in
social networking have created networks which are open to the world. A virtual environment to exchange information and communicate is a great sandbox to learn about
opinions of groups that relate to topics of interest.
Unfortunately, not all data roaming in the social network cloud is meaningful information, which is a problem both for individuals trying to pay attention to status updates
and for researchers trying to mine for relevant topics while ignoring the noise and
spam that surrounds them.
Twitter has proved to be a useful source of rapid information in various scenarios, such as analyzing and predicting the spread of, and beliefs about, the recent swine
flu pandemic (Ritterman et al., 2009; Harvey, 2009), and potentially as an early alert
system for earthquakes, since people post Twitter messages as soon as an earthquake happens. Generally, these submissions appear within 20 seconds of stronger earthquakes
in areas with higher technology density (Earle, 2010). Twitter is also an enormous
discussion forum for many technical and economic topics, letting people express their
sentiment about products, services and entire organizations. Gathering this information directly from the population has been shown to be a valuable resource
for analyzing a company’s branding success and incorporating the findings into its overall branding strategy (Jansen and Zhang, 2009).
In this thesis, the problem of stock market prediction is coupled with a complex
Information Retrieval (IR) task and we attempt to solve it using NLP techniques, by
transforming the raw Twitter posts into linguistic textual representations such as the
bag of words model. We use different filtering methods to reduce the dimensions of
our raw data and finally approach the task of predicting a price by building a regression
model using the Support Vector Regression (SVR) machine learning algorithm from
statistical learning theory (Vapnik, 1998). Our data for text analysis comes from the
micro-blogging community Twitter and is made available through the University of
Edinburgh, School of Informatics using the Twitter Streaming API. As we will describe in the next chapter, there are various incentives to choose Twitter as our data
source, the primary one being the timely release of millions of posts worldwide. These
releases may even be fast enough to retrieve relevant information about stocks capable
of predicting future prices before the market adjusts itself. In addition to the Twitter
data we downloaded Intra-Day stock price information from several NASDAQ stocks
in order to build and train a regression model and predict a discrete stock price.
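As a minimal sketch of the bag of words representation mentioned above, the following Python fragment turns a handful of posts into term-count vectors. The tokenizer, the function names and the example posts are illustrative assumptions and are not taken from the framework described in this thesis; the resulting vectors are the kind of input a regression algorithm such as SVR would consume.

```python
import re
from collections import Counter


def tokenize(post):
    """Lowercase a post and split it into alphanumeric word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", post.lower())


def bag_of_words(posts):
    """Return (vocabulary, vectors) with one term-count vector per post."""
    counts = [Counter(tokenize(post)) for post in posts]
    # The vocabulary is the sorted set of all terms seen in any post.
    vocabulary = sorted(set().union(*counts)) if counts else []
    vectors = [[count[term] for term in vocabulary] for count in counts]
    return vocabulary, vectors


# Hypothetical example posts, not taken from the thesis dataset.
posts = ["Apple stock is up", "apple is down, stock is down"]
vocabulary, vectors = bag_of_words(posts)
print(vocabulary)
print(vectors)
```

Each position in a vector corresponds to one vocabulary term, so the dimensionality grows with the vocabulary, which is why filtering methods are needed before training.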
Main Findings
Most research relating to modelling stock prices using NLP techniques with textual
data has focused on three distinct aspects: First, most data sources came from news
articles or blogs relating to companies rather than micro-blogs. Second, inference was
usually accomplished through classifying the direction of the stock price rather than
the actual amount. And third, the training of prediction models was mostly done using
End of Day stock price data rather than Intra-Day minute stock quotes. This resulted
in systems predicting the stock price for the next trading day. Furthermore, as we
will explore in our literature review in the next chapter, current research has found
only subtle successes in modelling stock prices as well as predicting discrete future
price values. We therefore asked: is it possible to use micro-blog data to model stock
prices? Additionally, can we predict the price of specific stocks some period of time
into the future? And finally, can we make a reasonable profit? From our
experiments, we draw the following conclusions:
- We can use Twitter posts to build a close model of stock prices and we showed
that the regression line of the model approaches our strong baseline in all test
- From experiments which attempted to predict future stock prices, our results
varied across different stock selections; however, in several cases we were able to
attain significant profits from the evaluation of our Virtual Stock Trading Engine.
- Finally, our results indicate that using Twitter as a source of information to predict stock prices contradicts the EMH.
Although our results did not perform better than our strong baseline, higher accuracies
should be reachable if more time is spent on handling Twitter spam and noise as
well as on feature exploration using techniques we have not yet explored.
Report Structure
We describe the findings stated in the previous section by a detailed explanation of our
methods, experimental setup, and results in the following chapters as outlined below.
• Chapter 2 introduces the background to our thesis including financial market
theories, social media, micro-blogging and Twitter. The Chapter also discusses
related research in the field of market prediction using NLP techniques as well
as other related topics in order to motivate and support our hypothesis.
• In Chapter 3 we discuss the methods used to conduct our experiments and begin by introducing the overall framework design. We then describe the learning
algorithm used as well as the different evaluation methods applied to the experiment results.
• In Chapter 4 we lay out the experimental setup necessary to conduct the experiments. The Chapter includes discussions on data pre-processing, feature
exploration as well as dataset requirements.
• Chapter 5 discusses the implementation of different feature construction approaches which are tested on different stocks and parameter settings. The Chapter details all our findings in sequential order.
• In Chapter 6 we analyze the results obtained from our experiments and draw a
conclusion of the work we conducted. We close our thesis by suggesting future
Chapter 1 Summary
We introduced our research goal of predicting the stock market using Twitter and
motivated our objective with a brief overview of theories governing the financial
markets, related research, and a few relevant Twitter benefits and
scenarios. We concluded the chapter by pointing out our main findings and laying out the
structure for the rest of this thesis.
Chapter 2
Background
In this chapter we introduce the major topics and theories relating to our project in
order to explain our motivation for the foundations of the framework. First, we explain
the theory behind the EMH and the Random Walk hypothesis and follow the discussion
by introducing an experiment based on virtual stock markets that contradicts the EMH.
Then, we continue with an introduction to social media and the micro-blogging website Twitter. Finally, the chapter concludes with an exploration of relevant work and
techniques applied to the area of IR in relation to the task of predicting stock market
prices. These publications relate to our hypothesis and will further help motivate our
Efficient Market and Random Walk Hypotheses
There have been many attempts to predict stock prices, and the quest to understand
the dynamics behind financial markets is an ongoing research area both in academia
and finance. Standing in the way of such research are theories which claim that the
financial market dynamics are stochastic and ’informationally efficient’. This implies
that predicting the price movement of securities is not possible. The theory of random
walk states that
“the future path of the price level of a security is no more predictable than
the path of a series of cumulated random numbers” (Fama, 1965).
This means that successive stock price changes are independent while the change
in value follows some probability distribution. The independence assumption implies
that the probability distribution of the change in the price $p_t$ of a security at time $t$ is independent
of that of the change in $p_{t-n}$, where $n$ is the number of previous time units. Additionally, the EMH states that the efficiency of financial markets causes stock prices
to reflect all the information and data known about the security
at the point when a trader places an order, making it impossible to beat the
market. Therefore a stock trader who places an order to purchase or sell stocks does
so at the stock’s intrinsic value instead of an anticipated lower buying or higher selling value.
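To make the random walk described above concrete, the following Python sketch simulates a price series in which each step moves the price up, down, or not at all, independently of all earlier steps. The uniform choice between the three directions, the fixed step size and the function name are assumptions made purely for illustration; the hypothesis itself only requires that successive changes be independent draws from some distribution.

```python
import random


def simulate_random_walk(start_price, n_steps, step_size=0.01, seed=0):
    """Simulate a price path whose successive changes are independent.

    Each step is drawn uniformly from {-step_size, 0, +step_size}
    (an illustrative assumption, not a claim about real markets).
    """
    rng = random.Random(seed)
    prices = [start_price]
    for _ in range(n_steps):
        change = rng.choice([-step_size, 0.0, step_size])
        prices.append(prices[-1] + change)
    return prices


# 390 steps roughly corresponds to one trading day of minute quotes.
prices = simulate_random_walk(100.0, 390)
print(prices[:5])
```

Because every change is drawn without reference to the past, no model trained on earlier prices of such a series can forecast its next move better than chance.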
EMH is divided into three versions characterized by different strengths, beginning
with the ‘Weak’ version, which asserts that the price of a security reflects all the information contained in
past prices. The second, ‘Semi-Strong’ version incorporates the weak version with the
addition that all publicly available information, including any new information, is instantly reflected in the price, not allowing a trader to
gain any advantage. Finally, the last version is termed ‘Strong’ and incorporates the first
two versions with the addition that non-public information, such as insider information
and other unknown facts, is also reflected in the price. While there is evidence against
the ‘Strong’ version of the EMH (Findlay and Williams, 2000), our goal in this thesis is
to find evidence that helps reject (or support) the ‘Weak’ and ‘Semi-Strong’ versions, by
the use of timely releases of information and opinions on Twitter, which could be used
to form a prediction model and beat the market before it adjusts itself to the intrinsic value.
Simulated Stock Markets
One way to study the dynamics of financial markets is to create an artificial simulation of them and let virtual agents trade by giving them trading rules. Researchers have
attempted to create computer simulations of financial markets, and we introduce
one such work here because its findings contradict the Random Walk theory
and, in part, form the basis of our hypothesis.
In an experiment conducted by Lebaron (1999), an artificial stock market was created to study the decision processes of stock traders based on the timely
introduction of information. The simulated trading agents were designed to act like their human
counterparts, which was accomplished by applying dynamic rules that were constantly
evaluated and updated so that agents would use optimal rules whenever they discovered them. Using those rules, the agents built forecasts of future prices and dividends
and, during trading sessions, accumulated the rules that worked best and pruned
those that did not. This was achieved by the use of a genetic algorithm. Although the artificial market was a multi-agent framework, the setting did not allow agents to interact
with each other directly; however, they did react to each other indirectly when other agents’ behavior
affected prices. The rules translated into actions which triggered trading behaviors. One of the findings was that changes in certain parameters that governed aspects
of time dramatically changed the behavior of trading agents, indicating that the time
taken to address new information or changes in prices was relevant to the strategies
(rules) the agents would optimize. It was found that fast-reacting agents selected technical rules while slow-reacting agents selected fundamental rules. One interpretation is
that this difference revealed a lag between the time that information was introduced and
the time the market adjusted itself, during which the agents could act.
We view this time-lag as a basis of our thesis and hope to discover patterns in Twitter
data which may lead to a valuable short term prediction of the future stock prices,
enabling the execution of a basic profitable trading strategy.
Social Media and Social Networking
Countries which have the infrastructure to offer their citizens affordable and reliable
internet and mobile communication services are actively participating in Social Media
and Social Networking due to the ever-growing number of social networking websites (SNW)
and services available on the internet. This phenomenon has existed for many years:
some of the first community websites launched in the mid 90s on services like
Geocities. However, these websites were not considered to be social web services
as we know them today. A study of Social Networking Sites (Boyd and
Ellison, 2008) offers the following definition:
“We define social network sites as web-based services that allow individuals to (1) construct a public or semi-public profile within a bounded system, (2) articulate a list of other users with whom they share a connection,
and (3) view and traverse their list of connections and those made by others within the system. The nature and nomenclature of these connections
may vary from site to site.”
Some of the best examples of SNW available and popular today are Facebook and
the micro-blogging community Twitter. Other similar services such as MySpace and
Friendster have lost popularity but remain part of many available choices. Moreover,
services such as LinkedIn, YouTube or Flickr focus on specialized social networking groups or services. LinkedIn, for example, has a target audience consisting of
business professionals giving members the opportunity to stay connected with previous colleagues and increasing one’s outreach to new career opportunities. YouTube’s
focus is mainly on the user creation and distribution of video content. While these websites have specialized motives, most of them have in common the ability to enable the
exchange of media through online social interaction. Some examples of social media
include but are not limited to electronic media such as images, sound and video, blogs,
slogans, events, and other information or ideas. Further, with worldwide growing mobile connectivity the exchange of social media seems to have no end as it enables its
users to access and create content on the fly.
Besides the websites mentioned so far, a great deal of social networking comes
from the blogging community. There, individual users create web-pages known as
blogs, which consist of paragraphs or longer articles listed in reverse chronological order
(starting with the newest post at the top of the website), usually including meta-tags such
as release date and author. The term blog comes from the combination of the
words web and log. There are many reasons for people to blog, such as expressing
their opinion or emotions on certain topics or just writing a diary of specific events. The
popularity of blogging was further fueled through the releases of journals of daily activities of interesting people including serious commentaries on important issues (Nardi
et al., 2004). Many celebrities and influential persons have blogs to communicate with
their followers. From the perspective of text analysis, it is sometimes more difficult
to analyze blogs, unless information about page visits, registered users, and number
of comments is publicly available. With the popularity of blogs came the invention of
micro-blogs. These follow a similar format as regular blogs with the main difference
of being much shorter.
The two most popular micro-blogging SNW are, at the time of this writing, Facebook and Twitter. Figure 2.1 shows the percentage of global internet users visiting Twitter on a daily basis from mid 2008 until
August 2010. These percentages are still on the rise and already exceed those of search
engines like Yahoo and Bing. While there are some conceptual and usage differences between Facebook and Twitter, one of the more significant, and a major aspect of
this thesis, is the fact that most Twitter users have public profiles, in contrast to Facebook, where users restrict much more personal information as well as status updates
from users not part of their network. In fact, Twitter accounts are by default public so
that new posts are automatically submitted to the public Twitter timeline.
Figure 2.1: Percentage of global internet users visiting Twitter on a daily basis.
We are therefore interested in Twitter status update releases since we get access
to millions of users’ opinions and their daily chatter. Twitter is increasingly becoming a good source of data due to a variety of users from different backgrounds and
professions. As with blogs, many celebrities, authors and other influential figures use
Twitter to stay connected with their audiences. Besides individual Twitter users, there
are also many organizational Twitter accounts that release posts into the public stream
with either commercial or informational purposes. These organizations include news
outlets, companies, research/nonprofit organizations, and many more. A second distinction of Twitter is its message limit of 140 characters. This restriction was initially
implemented to conform to the character limit of the Short Message Service (SMS)
used in mobile phone text communication and may seem to contradict the expressiveness of social media. Nonetheless, it has turned out to be popular and has given rise to
additional benefits. For example, the character limit requires people to express their opinions and
comments in a concise manner. This results in posts that are to the point, without much
of the noise found in traditional blogs and articles. Also, advertisements, tags, links
and many other sources of unrelated text usually found on blogs, news and other informational web-pages do not exist in the raw Twitter feed, making it easier to process.
Nonetheless, the absence of such data does not imply that possibly important information is lost, because Twitter incorporates some of this data inside the actual
posts. For instance, the hash tag is used to relate posts to specific topics: e.g. ‘#twitter’ in a post represents a tag of the topic ‘twitter’. Moreover, due to the shortness of
the posts, individuals and organizations are much faster at publishing messages. Organizations that use Twitter to inform their followers about news, products or service
updates do not have to spend much time on writing lengthy articles, and avoid the bottlenecks of editing and formatting. Additionally, users can use mobile
devices to publish news as soon as it happens. In fact, there is an entire array of possible input methods, as shown in Figure 2.2. Finally, it is also possible to analyze the
importance of a Twitter account by the use of Twitter meta-data. Information about the
number of posts, ‘followers’ (number of users that are following the Twitter account),
‘following’ (number of other users the account is following) and the network of links
between users could be used to determine the rank of a specific user and therefore the user’s
impact on their audience.
Figure 2.2: Twitter Input/Output Methods. Image from (Krishnamurthy et al., 2008)
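The hash tag convention described above is simple to recover from raw text. The following Python sketch extracts the topic tags from a post with a regular expression; the pattern and the function name are illustrative assumptions and do not necessarily match Twitter's own rules for what counts as a valid tag.

```python
import re

# Assumed pattern: a '#' followed by one or more word characters.
HASHTAG_PATTERN = re.compile(r"#(\w+)")


def extract_hashtags(post):
    """Return the lowercased topic tags found in a post, in order of appearance."""
    return [tag.lower() for tag in HASHTAG_PATTERN.findall(post)]


# Hypothetical example post.
print(extract_hashtags("Trying out the new #Twitter API for #nlp research"))
```

Tags extracted in this way could serve either as additional features or as a filter for selecting posts that relate to a specific topic.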
The combination of millions of publicly released posts, with the speed of the release as well as the limit in character length makes Twitter an ideal data source for our
thesis as it enables us to perform real time search on the belief of the population.
Literature Review
One of the more recent works is by Schumaker and Chen (2009).
They attempted to predict the actual price of stocks listed in the S&P 500 using an SVR
algorithm by applying text mining techniques to financial news articles and transforming them into feature representations including bag of words, noun phrases and named
entities. Their objective was to find out how accurate the predictions of the proposed
models could be concerning the task of forecasting the actual stock price 20 minutes
after the release of a news article. Moreover, they were also interested in exploring
the best techniques for analyzing and decomposing news articles using various IR algorithms. They investigated over 9000 news articles relating to stocks in the S&P 500.
They tested the data on four different models: first a simple linear regression model, followed by three models using SVR. Their main finding was that the best performance
was achieved by combining the article terms with the stock price at the release of the article.
Mittermayer (2004) proposed a system that categorized press-release articles using
a Support Vector Machine (SVM) algorithm. These categorizations were then used in
a trading system that attempted to predict price trends immediately after the release
of the article. He used press release articles rather than news articles, claiming that
such text would contain a “better source of unexpected information”. The results showed
that the system performed better than a random exchange of securities and returned an
average profit of 0.11% per trade. With certain established trading rules, Mittermayer
was also able to make slight profits taking transaction costs into consideration. A
similar approach was taken by Fung et al. (2002), who used pattern
recognition methodologies to model stock price trends using a regression algorithm
in combination with an SVM classifier that categorized features extracted from
news articles to predict either a stock price increase or decrease. They implemented an
incremental K-means clustering algorithm for filtering articles and associating them
with directional price trends. Their system performed better than random, resulting in
moderately profitable outcomes.
With respect to the use of Twitter as a data source, another more recent work
conducted by Tayal and Komaragiri (2009) compared traditional blogs with micro-blogs to determine the predictive power on stock prices given the use of either data
source. Their research focused on sentiment analysis of blogs and micro-blogs and
found that in their experiments micro-blogs consistently outperformed blogs in their
predictive accuracy. They obtained their two data sources from the web service Google
Blogsearch and Twitter. The system used the stock name to filter and reduce the text
after which they performed sentiment analysis using a lexicon of positive and negative
terms. In the experiments, they predicted the actual stock price of the following day
from the models of each data source. They also found that the character limit of Twitter
helped determine more concise sentiment results since one Twitter post usually relates
to one topic.
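Lexicon-based sentiment analysis of the kind described above can be sketched in a few lines of Python. The two word lists below are tiny hypothetical lexicons chosen for illustration only, and the scoring rule (positive matches minus negative matches) is one simple assumption among many possible ones; neither is taken from the cited work.

```python
import re

# Hypothetical miniature lexicons; a real lexicon would be far larger.
POSITIVE_TERMS = {"gain", "up", "strong", "buy"}
NEGATIVE_TERMS = {"loss", "down", "weak", "sell"}


def sentiment_score(post, positive=POSITIVE_TERMS, negative=NEGATIVE_TERMS):
    """Return (#positive terms - #negative terms) found in the post."""
    tokens = re.findall(r"[a-z]+", post.lower())
    positive_hits = sum(token in positive for token in tokens)
    negative_hits = sum(token in negative for token in tokens)
    return positive_hits - negative_hits


# Hypothetical example posts.
print(sentiment_score("Strong earnings, shares up"))
print(sentiment_score("weak outlook, sell now"))
```

Because a short post usually addresses a single topic, a score computed over the whole post is less likely to mix the sentiment of unrelated subjects than a score computed over a long blog article.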
A paper written by Yi (2009) compared three sources of text from social media
and built predictive models using SVR for each text source in order to predict a real-valued
price. One of the three data sources of this work was also Twitter, and the results
showed that Twitter was the best performing source of information.
Research by Lavrenko et al. (2000) focused on constructing language models from
news stories and stock prices by identifying features in articles that indicate whether
or not a particular article type shows patterns for potentially influencing the behavior
of specific stocks. This model has provided evidence that, in time series, news stories
can be associated with trends.
Rather than focusing on the content of news articles, Peramunetilleke and Wong
(2002) used news article headlines to classify stock movement into up, down or
steady directions. This was done by using different document and term weighting techniques, including term frequency - inverse document frequency (tf-idf) and term frequency - category discrimination frequency (tf-cdf). Results showed better performance than random guessing.
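For readers unfamiliar with tf-idf weighting, the following Python sketch computes it for a small set of tokenized headlines. It uses the raw term count as the term frequency and log(N / df) as the inverse document frequency, which is one common variant and an assumption here; the exact weighting used in the cited work is not specified in this chapter, and the example headlines are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    weight = (count of term in document) * log(N / number of documents containing term)
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in documents:
        term_freq = Counter(tokens)
        weights.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return weights


# Hypothetical tokenized headlines.
headlines = [["dollar", "rises"], ["dollar", "falls"], ["rates", "steady"]]
weights = tf_idf(headlines)
print(weights[0])
```

Terms that occur in many headlines, such as 'dollar' in this example, receive a lower weight than terms that single out one headline, such as 'rises'.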
In contrast to the methods introduced so far, which mostly applied either SVM or
SVR models with textual feature representations of posts relating to stocks, the work
done by Huang et al. (2005) also uses SVM but in combination with financial macroeconomic variables of the NIKKEI 225 Index. While this work uses a different type of
data, its significance lies in the comparison of different classification methods on high-dimensional
time series data, which found that, out of all classification methods tested,
SVM performed best because the algorithm uses structural risk minimization rather than empirical risk minimization. As we will describe in Chapter 3, this
element makes SVM less vulnerable to the overfitting problem.
Additionally, in contrast to the machine learning algorithms introduced, the work
conducted by Thomas and Sycara (2000) implemented two classification methods to
analyze text posted on discussion forums. The focus of the paper was on the genetic
algorithm. They analyzed different extracted numerical representations from the posts,
including the number of messages as well as the total number of words posted per day
for a given stock. Due to high noise in their dataset, they found that aggregating several runs of the genetic algorithm into a single predictor helped improve performance.
Nonetheless, they were only able to show results from data sources that contained more
than 10,000 posts concerning a stock.
There has been research in areas that are not related to financial markets but which
also use documents and query terms to forecast trends and discrete values. For instance, Google Flu Trends is a service which uses aggregated Google search data to
estimate flu activity. The research, published in Nature (Ginsberg et al., 2009), found
a correlation between the frequency of keyword searches relating to flu topics and
the number of people that are actually reporting flu symptoms to their doctors. They
compared their results with agencies such as the U.S. Centers for Disease Control and
Prevention (CDC), which deliver reports about flu outbreaks within 1-2 weeks, and
found that their forecasts not only matched the reports well, but also had earlier flu
detecting signals than the delayed releases from the CDC. Although Google Flu Trends is
updated only on a daily basis, the system is capable of producing near real-time results, due
to the analysis of real-time user submitted query terms. In the same way as Google Flu
Trends’ results can complement data released by agencies like the CDC, research by
Connor et al. (2010) shows that using Twitter as a source of sentiment detection for
consumers’ opinions on presidential job approval can also be a good supplement to the
expensive and slow traditional polling techniques. In this research, one billion Twitter
messages posted between 2008 and 2009 were analyzed by a lexical sentiment analysis. While the results fluctuated on different dataset instances, the highest correlation
between Twitter sentiment and actual polls was 80%.
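A correlation of this kind between a sentiment series and a poll series is commonly measured with the Pearson coefficient, which the following Python sketch computes from first principles. Whether the cited work used exactly this coefficient is an assumption on our part, and the two short series are hypothetical values chosen only to exercise the function.

```python
import math


def pearson_correlation(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length series.

    Raises ZeroDivisionError if either series is constant.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length of at least 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(variance_x * variance_y)


# Hypothetical daily sentiment scores and poll percentages.
sentiment = [0.10, 0.30, 0.20, 0.50, 0.40]
polls = [42.0, 45.0, 44.0, 49.0, 46.0]
print(pearson_correlation(sentiment, polls))
```

A value near 1 indicates that the two series rise and fall together, a value near 0 that they are linearly unrelated.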
Finally, on the subject of feature selection for text classification, Forman (2003)
motivated the use of certain heuristics in a study comparing 12 feature selection methods using different text sources such as Reuters or the Text REtrieval
Conference (TREC) and performing classification using the SVM algorithm. He analyzed his results using a variety of measures including accuracy, precision and F-measure. His experiments on dataset preparation found that to decrease dimensionality
of a bag of words feature representation without losing classification accuracy, the rare
word count cutoff threshold should be set low.
Chapter 2 Summary
In this Chapter we introduced theories of financial markets and their implications for our
task. We described the EMH and contrasted it with other research forming arguments
and motivations in favor of our hypothesis. In addition, the Chapter gave an introduction to Social Media and Social Networking, listing the major contributing services of
the social online communication movement. Finally, we introduced related work most
relevant to our goal. A wide range of papers implement the SVM and SVR algorithms
due to their predictive advantage in the field of text classification and regression. While
many papers focus on predicting price movements as well as discrete prices, only a few
use these predictions on time series data of Intra-day stock quotes. To our knowledge,
there has not been any research done that uses Twitter posts as the main data source in
order to forecast real valued prices within a short period of time (several minutes).
Chapter 3
Methods
In this Chapter we describe the methods and algorithms we chose to construct our
framework, which is divided into four major components: Data Pre-Processing, Feature Selection and Construction, Regression, and Evaluation. We will concentrate on
explaining the approach of our prediction task and leave individual implementations
of data processing and feature exploration for the next Chapter. We start by briefly
introducing the individual components of our framework.
Framework Design
To accomplish our objectives explained in Section 1.1 we developed the prediction
framework illustrated in Figure 3.1. The framework is coded using the Python programming language. Python is often used for processing text due to its ease of use in
handling files for input/output operations. In the design shown in Figure 3.1, the four
major components of the framework are represented in the dashed boxes, labeled with
their relevant framework component name. Within are depicted individual processes
that play a major role in the overall function of the component. The first component
(Pre-Processing Framework) is responsible for data collection, pre-processing and filtering. Section 3.2 will discuss the methods we used to derive the relevant keywords
used to filter irrelevant data out of the Twitter posts. Section 3.3 will describe the
methods used to extract quotes from the NASDAQ stock market. The second component (Feature Selection and Construction Framework) involves several NLP techniques
which we will define in Chapter 4 and explore and evaluate in Chapter 5. The third
component (Regression Framework) handles the implementation of the SVR algorithm
which we will explain in Section 3.4. The final component (Evaluation Framework) is
used to quantify our results. We will describe the methods for different error measures
and evaluation rules in Section 3.5.
Figure 3.1: Prediction Framework Design
Keyword Expansion
One of the first tasks involved in Pre-Processing our data is finding relevant keywords
that we can use to filter irrelevant posts out of the raw dataset. This section describes
how we constructed a list of query words used to search and filter the pre-processed
dataset.
Having a dataset consisting of millions of Twitter posts requires a way to search and
filter so that we end up with posts that relate to our task. Therefore, we create a list of
query terms which relate to the company whose stock price we want to forecast. For our
experiments, we chose four different stocks from companies listed in the NASDAQ
index. The companies are Google Inc., Apple Inc., First Solar, Inc. and Intel Corporation with the symbols GOOG, AAPL, FSLR and INTC respectively. For the rest of
this paper, we will refer to a stock using either the company name or the stock symbol.
A discussion about the choice of these stocks will be given in Section 3.5.1. For each
of these symbols, we created a list of no more than 5 keywords that best described
the company and its products and services. For example, for the technology company
Apple, the initial terms were:
apple, mac, ipod, steve, jobs
We chose only five keywords, since the rest would be discovered using a query expansion tool. Query expansion can be compared to a thesaurus, where original query
terms are complemented with terms of similar meaning. This helps broaden search
results since using a single term may leave out relevant documents which could have
been included in the result set if a related query term had been used. For our
purpose we used Google’s query expansion web service Google Sets. The resulting
set of 43 additional terms included words such as ’iphone’, ’windows’, ’macintosh’
and ’google’, but also terms like ’howto’, ’englisch’ or ’cool’. Table 3.1 shows the
complete list obtained by Google Sets. While the list was comprehensive in describing the company, it still lacked some information about stock symbols and company
names. We therefore included a list of stock symbols and company names gathered
from Google Finance. From the list generated by Google Sets and the list of related
companies on Google Finance, we manually concatenated the final list of query terms.
This step could however easily be automated by creating a Python script which gathers
the information from these services and outputs the data to a file.
Table 3.1: Google Sets query expansion results
Initial Key Terms
After Query Expansion
opensource podcast
microsoft tutorial
programme community
podcasting webdesign
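The filtering step described in this section can be sketched in Python as follows. This is a minimal illustration and not the script used in this thesis: the function names `build_query_terms` and `matches_query` are hypothetical, and case-insensitive whole-word matching of single-word terms is an assumption.

```python
import re


def build_query_terms(seed_terms, expanded_terms, symbols_and_names):
    """Merge the three term lists into one lower-cased list without duplicates."""
    seen = set()
    merged = []
    for term in [*seed_terms, *expanded_terms, *symbols_and_names]:
        key = term.strip().lower()
        if key and key not in seen:
            seen.add(key)
            merged.append(key)
    return merged


def matches_query(post, query_terms):
    """Return True if any query term occurs as a whole word in the post.

    Assumption: terms are single words; multi-word company names would
    need phrase matching instead.
    """
    words = set(re.findall(r"[a-z0-9]+", post.lower()))
    return any(term in words for term in query_terms)


terms = build_query_terms(["apple", "mac"], ["iphone", "Apple"], ["AAPL"])
relevant = [p for p in ["New iPhone rumors today", "Eating a pineapple"]
            if matches_query(p, terms)]
```

Whole-word matching avoids false hits such as 'pineapple' for the term 'apple', at the cost of missing inflected forms.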
Intra-Day Stock Quote Extraction
To do regression and predict the value of a stock, we have to create a dataset that
combines the stock quotes with instances of relevant Twitter posts. The stock price is
declared as the target label and will be used to compare results of our predictions to
a baseline as well as the actual stock price. There are two types of stock quotes that
have been used in stock market prediction using text regression. The first is the End of
Day stock market data which refers to the stock price that was recorded after the last
trade of the day was completed but before the extended after-hour market opens. This
price is useful for looking at statistics that range over long periods of time. The second
type is the Intra-Day stock quotes. These are recorded at different time intervals during
the opening hours of the stock market. Generally, time steps are in the range of one to
twenty minutes, depending on the organization recording the data. While it is simple to
download free End of Day data from many financial websites such as Yahoo Finance,
it is very difficult to find free-of-charge Intra-Day historical stock quotes. Moreover,
Intra-Day stock quotes are more relevant to our prediction task as we should get more
accurate regression estimates in combination with real time Twitter posts as opposed
to using one price for an entire day of Twitter feeds.
Our initial raw Twitter dataset provided by the University of Edinburgh, School of
Informatics [5] contained Twitter posts ranging from November 11th 2009 to February
1st 2010. Unfortunately, we were not able to find free-of-charge Intra-Day historical
stock prices for the specified date range. Therefore, we created a method to download
new stock quotes on the fly. We developed a web crawler using Python which downloaded up-to-date stock information and parsed out the price of a stock in one minute
intervals. To download stock quotes we crawled a service on Google Finance which
outputs a simple string of key/value pairs of live stock information. We ran this script
during the regular opening hours of the NASDAQ stock exchange as well as during
extended pre- and after-market hours. In the end, we crawled a list of four top NASDAQ stocks and collected the data for a period of two weeks, from July 19 to July
30, 2010. For a complete list of the stocks and sample stock charts over the two week
period refer to Appendix A.
Support Vector Regression
Since we are primarily concerned with the prediction of a real valued Dollar amount,
we decided on the use of a regression algorithm well suited for text analysis. In particular, we chose the SVR algorithm (Smola and Schölkopf, 2004). SVR is the regression counterpart to the popular SVM (Vapnik, 1999) algorithm used for classification.
SVM’s popularity over the last decade is in part due to the superior Structural Risk
Minimization (SRM) principle, as demonstrated in (Gunn et al., 1997) but first introduced in
(Vapnik and Chervonenkis, 1974). It stands in contrast to the established Empirical
Risk Minimization (ERM) principle particularly known from neural networks. The
SRM principle addresses the problem of overfitting, where an increasingly complex model fits the training data too well, resulting in inaccurate predictions when new
data is observed, by finding a balance between the model’s complexity and the
closeness of fitting the training samples correctly. Even though the SVM algorithm
was designed for classification problems, it was soon extended to work for
regression tasks as well.
[5] This dataset is no longer available since Twitter requested to stop sharing the data.
Figure 3.2: Example of a maximum margin including its support vectors indicated as
double circles on the margin lines. (Image from (Chen et al., 2005))
SVM is generally used for two class problems and the transition from classification
to regression is a small step but contains a significant difference in loss function. SVM
tries to find an optimal separating hyperplane between two classes. The hyperplane
is optimal since there is only one that can maximize the so-called margin. In the
simplest case the maximum margin is the longest distance that separates the two closest
and linearly separable points from two opposing classes. Such points are known as
the support vectors and are retained to build the model used for generalization. See
Figure 3.2 for an example of the margin including its support vectors. In more complex
problems, SVM employs the use of different kernels to map non-separable data points
into a higher dimensional space in which the classes can be linearly separated. The
choice of kernel depends on the domain and in the context of statistical text analysis
a linear kernel function, k(x, x') = ⟨x, x'⟩, has proved to be the best choice in our experiments.
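Given a set of training samples
$$(x_1, y_1), \ldots, (x_n, y_n)$$
for a linear regression task with the linear function
$$f(x) = \langle w, x \rangle + b \tag{3.2}$$
we wish to find the weight vector w that gives the optimal function f(x) by minimizing
$$\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*)$$
$$\text{subject to} \quad \begin{cases} y_i - \langle w, x_i \rangle - b \le \varepsilon + \xi_i \\ \langle w, x_i \rangle + b - y_i \le \varepsilon + \xi_i^* \\ \xi_i, \xi_i^* \ge 0 \end{cases}$$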
The constant C > 0 is also known as the regularization constant (or regularizer)
and determines the trade-off between how flat the function is and how much deviation larger than ε is permitted. SVR is accomplished by the use of a different loss function than in SVM (Smola,
1996), one that includes a distance measure and allows sparseness in the support vectors. One example of such a function is the ε-insensitive loss function
$$|\xi|_\varepsilon := \begin{cases} 0 & \text{if } |\xi| \le \varepsilon \\ |\xi| - \varepsilon & \text{otherwise.} \end{cases}$$
Finally, w̄ from the regression function given in equation (3.2) is defined as
$$\bar{w} = \sum_{i=1}^{n} \beta_i x_i$$
$$\bar{b} = -\frac{1}{2} \langle \bar{w}, (x_r + x_s) \rangle$$
Additionally, β are the coefficients of the samples, where samples that have nonzero coefficients are the support vectors. SVR therefore uses a small subset of the data
to construct the final model.
Figure 3.3 illustrates such a case where points outside the shaded area contribute
to the cost to a certain extent as the deviations are penalized in a linear fashion.
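The linear penalty outside the ε-tube can be made concrete with a small Python sketch. It is an illustration only and not part of the LIBSVM-based implementation used in this thesis: the function names are hypothetical, and the objective is evaluated for a fixed w and b, using the fact that the smallest feasible value of ξ_i + ξ_i^* then equals the ε-insensitive loss of the residual.

```python
def epsilon_insensitive_loss(residual, epsilon):
    """Zero inside the epsilon-tube, linear in |residual| outside of it."""
    return max(0.0, abs(residual) - epsilon)


def svr_objective(w, b, samples, targets, C, epsilon):
    """Evaluate 0.5 * ||w||^2 + C * sum of epsilon-insensitive losses."""
    flatness = 0.5 * sum(wj * wj for wj in w)
    penalty = 0.0
    for x, y in zip(samples, targets):
        prediction = sum(wj * xj for wj, xj in zip(w, x)) + b
        penalty += epsilon_insensitive_loss(y - prediction, epsilon)
    return flatness + C * penalty


# Toy values, chosen only for illustration.
value = svr_objective(w=[1.0], b=0.0, samples=[[1.0], [2.0]],
                      targets=[1.2, 4.0], C=0.1, epsilon=0.5)
```

In the toy call, the first sample lies inside the tube and contributes nothing, while the second deviates by 2.0 and is penalized by 2.0 - 0.5 = 1.5.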
Besides its use as a regression algorithm to predict the value of a stock price, SVR has
also been found to work well in time series forecasting applications such as in the works
by (Mukherjee et al., 1997; Thissen et al., 2003; Muller et al., 1997).
Figure 3.3: The soft margin loss setting for a linear SVM (Image from (Schölkopf and
Smola, 2002))
Evaluation Methods
In this section we define evaluation methods that will be applied to our experiments in
order to compare and draw conclusions from our results.
Stock Selection
As mentioned before, we chose four different stocks from the NASDAQ stock exchange: Google Inc. (GOOG), Apple Inc. (AAPL), First Solar, Inc. (FSLR) and Intel
Corporation (INTC). All experiments are conducted in the same date span starting
from Monday, July 19, 2010 through Friday, July 30, 2010.
Our aim was to select two popular stocks (Google & Apple) that were frequently
mentioned on Twitter. Second, we wanted to add a stock that was less popular and out
of a specialized domain. For this we chose the solar company First Solar. And finally
we added one stock which did not have a significant change in value over the period
of our experiments in order to see if the learning algorithm would have difficulty
finding distinctive patterns. A snapshot of the chart of INTC is shown in Figure 3.4 which
shows the prices during the time period of our experiments. The remaining charts
of AAPL, GOOG, and FSLR can be found in Appendix A in Figures A.1, A.2, and
A.3 respectively. All experiments are evaluated on all four stock datasets unless
specified otherwise.
We carried out multiple runs of individual experiments with different parameter
settings as well as kernel selections of the SVR algorithm and found that the linear
kernel worked best with parameter settings C = 0.1 and ε = 0.5. Other experiments use
a validation set to determine the optimal values of C and ε, which is explained in
more detail in Section 5.5.
Figure 3.4: Intel Corporation stock chart snapshot (INTC)
Error Measure
For each experiment we calculated the Mean Squared Error (MSE) of a strong baseline.
The MSE is defined as
$$\mathrm{MSE}(t) = \frac{1}{N} \sum_{i=1}^{N} (y(x_i) - t_i)^2$$
where t is the target value, y the predicted value and N the number of samples in the
testing set. This baseline is calculated by taking the Simple Moving Average (SMA)
of a series of stock prices in the testing set. The SMA is defined as
$$\mathrm{SMA}_t = \frac{1}{T} \sum_{i=0}^{T-1} P_{t-i}$$
where t represents the latest value in the time series, T is the number of time series
steps and P is the price at each step.
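Both measures are straightforward to compute. The following Python sketch illustrates them; the function names are hypothetical, and shortening the window at the start of the series, where fewer than T ticks are available, is an assumption that the definitions above do not cover.

```python
def mean_squared_error(predictions, targets):
    """Average squared difference between predicted and target values."""
    return sum((y - t) ** 2 for y, t in zip(predictions, targets)) / len(targets)


def simple_moving_average(prices, t, window):
    """Mean of the price at step t and the (window - 1) ticks before it.

    Assumption: near the start of the series the window is truncated
    to the ticks that exist.
    """
    ticks = prices[max(0, t - window + 1): t + 1]
    return sum(ticks) / len(ticks)


# Toy series with a window of 3 ticks.
prices = [1.0, 2.0, 3.0, 4.0, 5.0]
baseline = [simple_moving_average(prices, t, window=3) for t in range(len(prices))]
baseline_mse = mean_squared_error(baseline, prices)
```

With one-minute quotes, a call with `window=60` corresponds to a one-hour running average.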
We used the target price of each testing sample as the starting point and summed
it with the 59 previous ticks. Since we captured stock quotes every minute, the
running average spanned a period of 1 hour (a total of 60 ticks), making it a very
strong baseline for our algorithm to beat. Often, a random baseline is also included for
comparison; however, we believe that in our case a random baseline is too weak and
Virtual Stock Trading Engine
Finally, to better understand the meaning of our results and the impact of the MSE
values, we created a Virtual Stock Trading Engine to evaluate whether it is possible
to make a profit from the predictions the model generated. The engine imitates a day
trading agent and follows general rules to formulate a fairly realistic simulation environment. At the start of a trading session the agent is given an initial capital. In all our
evaluations, we chose to set the capital of each trial to 10,000 units. The trial begins by
looping through the time series of testing examples beginning with the oldest instance.
Although the test set contained instances which belonged to the time periods of pre- and after-market hours, we decided not to include these instances and allow the agent
to only place orders within the regular opening hours of the NASDAQ stock exchange
which are between 9:30AM and 4:00PM Eastern Daylight Time (EDT). However, in
our final experimentations, where we predict prices during the entire two week period,
we change this restriction, also allowing trades to be carried out during extended hours.
Every transaction was accompanied by a transaction cost or commission which we
set to 1 unit based on the popular online broker firm Interactive Brokers [7]. The agent
was able to place regular market or short orders at any given time of the regular opening hours. Moreover, for each of the stocks in our experiments, we set a minimum
profit target value which included at least twice the commission - once for a purchase
order and once for a selling order (or the equivalent for short orders). Unfortunately,
since we were not able to capture the asking or bid prices, we replaced these values
with the current price at each time interval.
[7] While this broker firm offers low transaction costs, customers must have a minimum of $10,000.00 starting capital to open an account. On this
basis we chose our starting capital of 10,000 units.
During a trading session the agent follows these rules:
• Given the current capital, the current stock price and the predicted stock price, the
agent calculates the potential profit per stock, how many stocks can be purchased
with the current capital and the total profit, given the total number of stocks the
agent can purchase with the available capital. This calculation also takes the
commission into account.
• If the potential profit is bigger than the target profit, the agent places an order
purchasing the maximum number of stocks possible. This applies to both market
and short orders.
• At each time interval, the agent checks whether stocks can be sold or shorted.
Since we used the predicted value to calculate the potential profit, the agent must
sell/short the orders at least 15 minutes after the purchase was made.
⇒ We repeat this process until there are no testing instances during market hours
left, after which the total profit/loss is returned.
It is important to note that once an order has been placed and paid for, the agent
cannot place another order for at least 15 minutes, after which the orders are liquidated
and ready for further transactions. After this point, it is not guaranteed that a new order
will be placed. A new order entirely depends on the threshold of the target profit and
whether the variables of capital, current and predicted price meet that desired profit. In
our experiments, we found that different target profit values rendered different results.
Therefore we ran the virtual stock trading engine with several different target profit
thresholds and averaged the results.
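The trading rules of this section can be summarized in a simplified Python sketch. It is not the engine used for our experiments: the function name and parameters are hypothetical, `predictions[i]` is assumed to be the predicted price 15 ticks after minute i, every position is closed exactly 15 ticks after it was opened, and the default target profit of 5 units is an arbitrary illustrative value.

```python
import math


def simulate_trading(prices, predictions, capital=10_000.0, commission=1.0,
                     target_profit=5.0, hold_ticks=15):
    """Return the total profit or loss of a simple day-trading agent.

    Assumptions: one tick per minute, one open position at a time, and the
    current price is used in place of bid and ask prices.
    """
    start_capital = capital
    i = 0
    while i + hold_ticks < len(prices):
        price, predicted = prices[i], predictions[i]
        shares = math.floor((capital - commission) / price)
        direction = 1 if predicted > price else -1  # 1 = buy, -1 = short
        expected = shares * abs(predicted - price) - 2 * commission
        if shares > 0 and expected > target_profit:
            exit_price = prices[i + hold_ticks]
            capital += direction * shares * (exit_price - price) - 2 * commission
            i += hold_ticks  # capital is tied up until the position is closed
        else:
            i += 1
    return capital - start_capital


# Toy series: the price rises from 100 to 101 after 15 minutes.
prices = [100.0] * 15 + [101.0] * 5
profit = simulate_trading(prices, [101.0] + prices[1:])
```

In the toy run the agent buys 99 shares at 100, sells them at 101 and pays two commissions, which leaves a profit of 97 units.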
Chapter 3 Summary
This chapter presented the general framework including a brief introduction to the individual components. We motivated our choices of algorithms as well as described
methods of obtaining different data collections required for our task. Finally, we introduced the different methods used for the evaluation of our experimental results.
Chapter 4
Experimental Setup
In this chapter we explore the experimental setup required for our experiments.
Besides explaining the main processes of data cleanup and feature selection & construction we also discuss data requirements and dataset construction.
Twitter Data
In IR, text is usually defined as documents, where a document can be a journal article,
web page, email, news story, book, etc. In our case, the documents are micro-blogs
and therefore short sentences or single paragraphs. Whereas most documents require a
lengthy process before being published, Twitter posts or tweets are released within
seconds from millions of users each day. Due to this huge number of posts, we must
put more emphasis on problems concerning relevance and evaluation of information,
including filtering out a lot of noise surrounding relevant information.
The raw Twitter posts are gathered from the Informatics Forum at the University
of Edinburgh using the Twitter streaming API. Twitter releases chunks of data in short
time intervals (about 15 minutes) which are only a subset of the full public Twitter
timeline. The size of each file is not constant and is managed by Twitter. In its raw
form, the data consists of one line per post with each line having the following fields
separated by tabs:
• Date (e.g. Sun Jul 18 10:33:28 +0000 2010)
• Username (max 20 characters)
• Text (max 140 characters)
• Source (e.g. web)
Raw Data Cleanup
Raw Twitter data contains a very large amount of noise that is not relevant to the query
task at hand. Not only does such noise render prediction of stock prices much more
difficult if it is not filtered out correctly, but it also contributes to longer processing
times. After analyzing the raw data, it became apparent that much noise comes from
non-English languages, automated bot postings, and many sources of spam. In this
thesis we are only concerned with posts that are written in languages using the Latin alphabet, such as
English. Therefore, we wanted to exclude languages like Chinese due to their different
character set. An easy way to filter out such languages is by checking if a post
contains non-ASCII Unicode characters, in which case it is excluded from the dataset. However,
many posts written in Latin-alphabet languages may still contain some non-ASCII characters and
penalizing those may remove posts that could be relevant to the IR task. For this reason,
we checked each character in a post and counted the number of ASCII and non-ASCII
characters. We removed the post if the ratio of ASCII characters was below some
threshold. We found that requiring at least 95% ASCII characters
worked very well. We cleaned up the data set by removing around one third of the
posts. Figure 4.1 shows a snapshot of the data that has been removed using this process.
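The character-ratio filter can be sketched in a few lines of Python. The function names are hypothetical and the treatment of empty posts is an assumption; the 95% threshold is the value reported above.

```python
def ascii_ratio(post):
    """Fraction of characters in the post that are plain ASCII."""
    if not post:
        return 0.0  # assumption: empty posts are treated as noise
    return sum(1 for ch in post if ord(ch) < 128) / len(post)


def keep_post(post, threshold=0.95):
    """Keep a post only if at least `threshold` of its characters are ASCII."""
    return ascii_ratio(post) >= threshold


posts = [
    "Apple stock is up today",
    "café opens near the Apple store today",
    "苹果股票今天上涨 AAPL",
]
kept = [p for p in posts if keep_post(p)]
```

The second post survives because a single accented character keeps it above the threshold, while the third post is removed.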
While the reduction was already very extensive we looked into further methods to
perform additional clean up in order to remove posts submitted by unwanted automated
bots and other sources of spam. Mowbray (2010) analyzed statistics of the behavior
of Twitter users and found that regular users do not submit more than 100 tweets per
day. He also discovered that starting from June 2009 a huge number of Twitter accounts started publishing more than 100 tweets per day, and in some cases even
more than 1,000. In his paper, he correlates this sudden rise of automated postings with
literature releases that occurred a few months prior. These documents include the Twitter API handbook released in April 2009 as well as marketing books such as “Twitter
Marketing Tips” (Brooks, 2009) or “Dominate Your Market with Twitter: Tweet Your
Way to Business Success” (Jon Smith, 2009).
Krishnamurthy et al. (2008) created a detailed characterization of Twitter and analyzed, among other things, the characteristics of user accounts. They found that there are
three distinct Twitter groups. The first group has many followers but at the same time,
the group does not follow many accounts in return. They label accounts in this group
broadcasters, which include radio stations that use Twitter to broadcast the current
songs being played, as well as news or media outlets, such as the New York Times or
BBC, which are broadcasting current headlines. The second group is labeled acquaintances and represents the users with a ratio of followers to following that is close to 1.
Users who use Twitter on a regular basis fall into this group. A third group has a much
larger number of following accounts compared to the followers and is characterized as
potential spammers as these accounts try to connect with any user they can, in the hope
of being followed in return. As a result, these accounts start spamming all the followers
they managed to obtain.
Figure 4.1: Samples of posts that have been removed by the Raw Data Cleanup process where each line represents a separate post.
Another interesting finding by Huberman et al. (2009) explains two types of networks amongst Twitter users. Even though users may have many followers and following in their network, there is only a small subset of users to whom they have posted a
tweet directly. Huberman et al. define the former as the dense and the latter as the
sparse network. The sparse network proves to be the more influential network since
users belonging to it are more engaged in back-to-back communications. On the other
hand, accounts that constantly submit posts to their entire follower base rather than a
subset are therefore a separate group, which in many cases can also be characterized
as spammers but could also be broadcasters.
Since the release of the research papers mentioned in this section, new Twitter
features and tools have been made available which simplify the automation of posting
tweets. Given these findings and the complexity of determining different groups of
users while distinguishing spammers from non-spammers, we decided not to remove posts
generated by automated bots in order not to lose any meaningful data such as
broadcasts from news outlets and others. Instead, we leave the task of spam detection
and Twitter account ranking and categorization as future work.
After cleaning up the dataset from noise the next objective was to further filter the
posts in order to retrieve a new dataset that contained stock related messages. Here
we applied the query expansion process introduced in Section 3.2, which used the Google
Sets automated keyword expansion web service.
Feature Construction
Our first set of features had to be very simplistic and a base for future improvement. We
created a vector space model using a basic bag of words approach which is usually a
standard in IR and text mining applications (Croft et al., 2009). This model is beneficial
due to its capabilities to implement term weighting, ranking, or relevance feedback. In
this model each document is part of a t dimensional vector space where t is the number
of indexed words. A document, in our case a Twitter post, is represented by a vector
of such indices:
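$$D_i = (d_{i1}, d_{i2}, \ldots, d_{it})$$
where d_{ij} represents the weight of the jth word. A corpus of n Twitter posts is
represented as a matrix where each row is a separate bag of words representation of
one Twitter post and each column describes the weight that is attributed to a word for
a given document:
$$\begin{array}{c|cccc} & \mathrm{Term}_1 & \mathrm{Term}_2 & \cdots & \mathrm{Term}_t \\ \hline D_1 & d_{11} & d_{12} & \cdots & d_{1t} \\ D_2 & d_{21} & d_{22} & \cdots & d_{2t} \\ \vdots & \vdots & \vdots & & \vdots \\ D_n & d_{n1} & d_{n2} & \cdots & d_{nt} \end{array}$$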
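As an illustration of this representation, the following Python sketch builds the matrix for two toy posts. The function names are hypothetical, and raw term counts are used as weights, which is only one of several possible weighting schemes.

```python
def build_vocabulary(tokenized_posts):
    """Assign each distinct word a column index in order of first appearance."""
    vocabulary = {}
    for tokens in tokenized_posts:
        for token in tokens:
            if token not in vocabulary:
                vocabulary[token] = len(vocabulary)
    return vocabulary


def to_vector(tokens, vocabulary):
    """Represent one post as a row of term counts over the vocabulary."""
    vector = [0] * len(vocabulary)
    for token in tokens:
        if token in vocabulary:
            vector[vocabulary[token]] += 1
    return vector


posts = [["apple", "iphone", "apple"], ["intel", "chip"]]
vocabulary = build_vocabulary(posts)
matrix = [to_vector(tokens, vocabulary) for tokens in posts]
```

Each row of `matrix` corresponds to one document D_i and each column to one term.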
As input we used our preprocessed and filtered dataset which relates to a specific
stock in question. For the bag of words approach we needed to split each post into
individual words, which we accomplished by the use of a tokenizer. Our tokenizer
uses Python’s built-in shlex library, which is a module for lexical text analysis that
is useful for parsing text and creating tokens. The shlex library is very powerful in
terms of the granularity of tokenizing text and has additional features such as ignoring certain characters which may be important to specialized queries. For example,
if our tokenizer encounters a Twitter topic keyword, e.g. ’#finance’, we do not want
to tokenize this sequence into ’#’ and ’finance’ since hash tags are used to identify
Twitter topics similar to tags in blogs. Additionally, the tokenizer would completely
split Uniform Resource Locators (URLs) into fragments due to many non-alphabetic
characters. Posts that contain frequent occurrences of specific URLs may be important
information for the regression task. In order to retain occurrences of URLs in posts
we used regular expressions to identify, index, and then remove any URLs before allowing the tokenizer to continue. Finally we added additional rules that determined
whether a token should be kept. For example, numbers found in tweets can relate to
anything and are in most cases useless noise that should be excluded. Additionally,
any tokens that are one character long will not add much to the predictive power, such
as words like ’a’, ’I’ or any punctuation marks and symbols. We were careful to not add too
many such rules for several reasons. First, numbers can in some cases have important
information; for instance, when users chat about specific products, a number will indicate which models or versions they refer to. As an example, in the string ’Windows
7’ the tokenizer would ignore the number seven. Additionally, dates or times are also
ignored. A second problem with our parser is that strings such as ’I.B.M.’ will not be
retained since the tokenizer would produce 5 individual tokens of ’I’ ’.’ ’B’ ’.’ ’M’. It
was beyond the scope of this thesis to find an optimal tokenizer that best fits the task
of parsing Twitter posts for predicting stock quotes. We therefore leave this feature for
future work. The final output of the tokenizer is a list of key/value pairs where each
line consists of a distinct word and its frequency. At this point the
frequency is not a required attribute for the construction of the bag of words, but will
be used in calculating term weights as described in Section 5.3.
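A tokenizer following the rules of this section might look as follows. This is a minimal sketch and not our actual implementation: the URL pattern, the lower-casing of tokens and the function name `tokenize` are assumptions. The shlex settings disable quote and comment handling, since '#' is a comment character in shlex by default.

```python
import re
import shlex
from collections import Counter

# Assumption: URLs start with http:// or https:// and end at whitespace.
URL_PATTERN = re.compile(r"https?://\S+")


def tokenize(post):
    """Split a post into word tokens, keeping hash tags and URLs intact."""
    urls = URL_PATTERN.findall(post)
    text = URL_PATTERN.sub(" ", post)
    lexer = shlex.shlex(text)
    lexer.commenters = ""    # '#' must not start a comment
    lexer.quotes = ""        # apostrophes must not open a quoted string
    lexer.wordchars += "#"   # keep '#finance' as a single token
    tokens = []
    for token in lexer:
        token = token.lower()
        # Drop one-character tokens (punctuation, 'a', 'I') and pure numbers.
        if len(token) <= 1 or token.isdigit():
            continue
        tokens.append(token)
    return tokens + urls


tokens = tokenize("I love my new #iPhone 4 http://example.com/a?b=1 !")
frequencies = Counter(tokens)  # distinct word -> frequency pairs
```

As discussed above, these rules also discard version numbers such as the '7' in 'Windows 7'.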
Dataset Construction
The final step involved before training a regression model is to construct a dataset
that follows the required format of the SVR algorithm. In Section 3.4 we introduced
the concepts behind SVR and the reason why the algorithm is well suited for text
regression. There are numerous SVR implementations available on the web and we
decided to employ the LIBSVM (Chang and Lin, 2001) toolkit due to its cross platform
implementations as well as numerous successes in research literature.
LIBSVM requires the following format to create both training and testing datasets:
<label_1> <index_11>:<value_11> ... <index_1t>:<value_1t>
<label_2> <index_21>:<value_21> ... <index_2t>:<value_2t>
...
<label_n> <index_n1>:<value_n1> ... <index_nt>:<value_nt>
where t is the index of the current word in the bag of words and n is the number
of samples in the dataset. Each line contains index-value pairs where each <index_nt>
represents the index of a distinct word from the bag and <value_nt> corresponds to
the frequency of the word occurring in the current sample. The index is an incremental
integer value starting at one enumerating the set of features. This representation is
sparse as zero values are ignored so that only non-zero values are represented with
their respective index. Furthermore, <label_n> is the target value used for regression,
in our case, the price of the stock we are interested in at the time the Twitter post was
released. Before matching the target price with a post, we had to convert the date-time
from the recorded stock quotes to match the time zone of the Twitter posts. In our case,
the Twitter posts used the Coordinated Universal Time (UTC) time zone whereas the
stock prices were recorded using EDT.
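The two steps of this section, time zone conversion and writing the sparse format, can be sketched in Python as follows. The function names and the sample price are hypothetical; the fixed offset of four hours reflects that EDT is UTC-4, and the date format follows the example given for the raw Twitter data.

```python
from datetime import datetime, timedelta, timezone

EDT = timezone(timedelta(hours=-4))  # Eastern Daylight Time is UTC-4


def quote_time_to_utc(naive_edt):
    """Interpret a naive quote timestamp as EDT and convert it to UTC."""
    return naive_edt.replace(tzinfo=EDT).astimezone(timezone.utc)


def parse_tweet_time(stamp):
    """Parse a timestamp such as 'Sun Jul 18 10:33:28 +0000 2010'."""
    return datetime.strptime(stamp, "%a %b %d %H:%M:%S %z %Y")


def to_libsvm_line(label, term_counts, vocabulary):
    """Format one sample as '<label> <index>:<value> ...' with sorted indices.

    `vocabulary` maps each word to its one-based feature index; words that
    are not in the vocabulary and zero counts are skipped.
    """
    pairs = sorted((vocabulary[term], count)
                   for term, count in term_counts.items()
                   if term in vocabulary and count != 0)
    return " ".join([f"{label}"] + [f"{index}:{value}" for index, value in pairs])


vocabulary = {"apple": 1, "iphone": 2, "stock": 3}
line = to_libsvm_line(251.53, {"stock": 1, "apple": 2}, vocabulary)
opening = quote_time_to_utc(datetime(2010, 7, 19, 9, 30))
```

Once both timestamps are in UTC, a post can be matched to the quote recorded in the same minute.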
Chapter 4 Summary
In this Chapter we laid out the experimental setup requirements and described the
remaining components which play an important role in our framework design. Raw
Data cleanup is the first process which helps remove a small portion of spam and posts
that are written in other languages. We explained the process for the creation of the bag
of words feature vector with the use of a tokenizer to split individual words. Finally,
we looked at the requirements of the dataset which are needed to transform the feature
vector into a format compatible with the LIBSVM tool.
Chapter 5
Implementation and Results
In this chapter we will discuss the implementation of different feature construction
approaches and evaluate their performance on multiple datasets. We then analyze the
results and explore new ideas and improvements. We begin our description with the
most basic feature representation followed by improvements over our results.
Simple Bag of Words
Our first objective was to test the performance of the bag of words feature set which
we obtained from the filtered Twitter posts. The posts were filtered using keywords obtained by the query expansion algorithm explained in Section 3.2. In this experiment
we matched the text with the stock price that existed at the same time of the release
of the post. At this point, we were not yet interested in forecasting a future price, but
rather test if we can build a regression model of the posts that can fit the time line
of our selected stocks. We added two additional filters to the already pre-processed
dataset: one limited the number of features and the other the number of posts. The
dimensionality of our feature space was initially relatively high: after preprocessing
it contained on average over 250,000 distinct terms, most of which were redundant
and therefore had low frequency counts (e.g. ’zzzzzzzzzzzz’). We applied a threshold on the frequency of each term and kept only features with three or more counts
(Joachims, 1998). Additionally, we only included posts that contained at least three
distinct activated features, i.e. features whose attribute value ≠ 0. This helped remove
posts which had no meaningful information because they were either too
short or contained numerous repetitions of the same words.
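The two filters can be sketched as below. This is an illustrative sketch only: the function name is hypothetical, and it assumes that the term threshold is applied to the total count of a term over the whole corpus.

```python
from collections import Counter


def filter_features_and_posts(posts, min_term_count=3, min_distinct=3):
    """Apply the two filters to a list of tokenized posts.

    1. Keep only terms that occur at least min_term_count times in the corpus
       (assumption: the count is taken over all posts).
    2. Keep only posts with at least min_distinct distinct retained terms.
    Returns the sorted vocabulary and the filtered posts.
    """
    term_counts = Counter(tok for post in posts for tok in post)
    vocab = {term for term, count in term_counts.items() if count >= min_term_count}
    kept = []
    for post in posts:
        retained = [tok for tok in post if tok in vocab]
        if len(set(retained)) >= min_distinct:
            kept.append(retained)
    return sorted(vocab), kept
```

A post such as ['a', 'a', 'a'] is dropped by the second filter even though its only term is frequent, which mirrors the removal of posts that merely repeat the same word.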
The model for this experiment was built using two different training sets. The first
contained all training examples retained after preprocessing the raw data. The second
contained a reduced form of the full training set in order to speed up training of the
model and also test whether or not we could achieve similar performance as with the
complete set. In order to reduce the training set and maintain the same proportions of
examples with respect to the time series, we removed every n-th example, rather than
cropping the entire set from the bottom or top.
Table 5.1: Simple Bag of Words - experimentation results
(rows: Stock Symbol; Full Dataset Size, Baseline MSE, Prediction MSE; Reduced Dataset Size 14,046, Baseline MSE, Prediction MSE)
Figure 5.1: AAPL Simple Bag of Words - experimentation results
Before training, we also removed 10% from the bottom of the training sets and kept
those examples for testing. The bottom 10% corresponds to examples with dates at the
end of the time series. The training sets comprised samples from Monday, July 19
to Thursday, July 29, 2010. The test sets contained samples from the remaining day, Friday,
July 30, 2010. Table 5.1 shows the results of the simple bag of words experiment for
all four stocks.
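The dataset reduction and the chronological hold-out can be sketched as follows. The helper names are hypothetical, and the sketch assumes that "removing every n-th example" means dropping the samples at positions n, 2n, 3n and so on from a time-ordered list.

```python
def drop_every_nth(samples, n):
    """Remove every n-th sample (1-based) so the rest stays spread over the whole series."""
    return [s for position, s in enumerate(samples, start=1) if position % n != 0]


def chronological_split(samples, test_fraction=0.10):
    """Hold out the last test_fraction of a time-ordered list for testing."""
    cut = len(samples) - int(round(len(samples) * test_fraction))
    return samples[:cut], samples[cut:]


# Hypothetical time-ordered sample ids.
samples = list(range(1, 21))
reduced = drop_every_nth(samples, 2)
train, test = chronological_split(reduced)
```

Because the split is taken from the end of the list rather than at random, no sample from the test day can leak into training.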
Figure 5.2: GOOG Simple Bag of Words - experimentation results
Figure 5.3: FSLR Simple Bag of Words - experimentation results
Figure 5.4: INTC Simple Bag of Words - experimentation results
While none of the experiments performed better than the baseline, Apple and
Google were much closer than Intel Corporation and especially First Solar. As described in Section 3.5.2, the baseline of each sample is calculated by taking the SMA
of the last 60 ticks (one tick per minute) starting from the current price at the release
date of the Twitter post. None of the results seemed to follow the actual price line.
The prediction results of AAPL, for instance, simply overlap the actual price line as
seen in Figure 5.1 and exhibit very strong noise. In Figure 5.4 the predicted price
for INTC was not even in the same region and showed no correlation with the actual
price. We found the reason for these extreme differences in error scores by investigating
the historical stock charts which can be found in Appendix A. For AAPL (Figure A.1),
the average price over the time period of our training data is similar to the
average price predicted in the experiment. This is the case for all four stocks. The
regression model therefore built a price line that correlated with the average of the
training data, making it ineffective for our task. To build a better regression model,
we needed features that represented the current state of the price. We decided to include the price of the stock as a new feature in the training set. This approach will be
discussed in Section 5.2.
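A small numeric sketch illustrates why a model that reproduces the training average scores poorly when the price trends. All prices below are hypothetical, and the "last known price" predictor is only a simple stand-in for a price-aware feature, not the SMA baseline of Section 3.5.2.

```python
def mse(predicted, actual):
    """Mean squared error between two equally long price sequences."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical prices: the series keeps rising after the training period.
train_prices = [250.0, 252.0, 254.0, 256.0]
test_prices = [258.0, 259.0, 260.0]

# A model that only learned the training average predicts one constant value.
mean_price = sum(train_prices) / len(train_prices)
constant_prediction = [mean_price] * len(test_prices)

# A predictor that knows the most recent price tracks the series far more closely.
last_known = [train_prices[-1]] + test_prices[:-1]

error_constant = mse(constant_prediction, test_prices)
error_last_known = mse(last_known, test_prices)
```

In this toy series the constant prediction has an error more than ten times larger than the price-aware one, which is the effect that motivates adding a price feature.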
The examination of the results of the full datasets compared to the reduced sets
indicated that the performance was rather similar. However, since the model performance of this experiment was poor, we decided to investigate the results of the two
dataset sizes in the next experiments, rather than inferring a meaning at this stage.
Another point to notice is the difference in the size of the datasets across stock
symbols. While we expected to have bigger datasets from the Apple set, it was almost
half the size of Google’s or First Solar’s sets. Moreover, we expected to find fewer
matches concerning FSLR which, to the contrary, contained almost as many posts as
the GOOG dataset. To find clues about the reasons behind these numbers we decided
to look at the bag of words representations for each stock and compare it with the
query terms initially created. After ordering the terms of the FSLR bag of words by
frequency, besides finding many stop words on top of the list, we also found words
from the original list of expanded query terms such as ’video’ with 52,177 counts,
’world’ with 28,149 counts, and ’house’ with 20,783 counts. On the other hand, terms
like ’green’, ’wind’ , ’energy’, and ’solar’ had frequencies of 6685, 4377, 4198 and
1389 respectively. The comparison of the other stocks showed that the AAPL bag of
words had query terms with top frequencies such as ’iphone’, ’media’, ’google’, and
’ipad’. Similarly, the GOOG bag of words contained query terms with high frequencies such as ’youtube’, ’msn’, ’twitter’ and ’business’. At first glance, these terms
seem to have more relevance to the company than the top query terms of FSLR. We
concluded that the performance of the query expansion algorithm was responsible for
these deviations. Furthermore, the query terms generated by the algorithm are questionable, given that the term ’video’ was generated for the First Solar query terms but
not for Google’s list. We will therefore explore the tf-idf algorithm in Section 5.3 to
determine weights and possible solutions for optimizing query term generation as well
as feature selection improvements.
Figure 5.5: AAPL Simple Moving Average - experimentation results
Moving Average and Stop Words
Schumaker and Chen (2009), who analyzed news articles to predict companies’ stock prices, found that combining the bag of words
features with the price of the stock at the release of a news article greatly improved results over their baseline. We decided to add a similar feature, but rather than
adding the stock price which was current at the release of the Twitter post, we used
the SMA to account for sudden dips or spikes in price movements. Our
new feature was therefore calculated similarly to the baseline described in Section 3.5.2: we
averaged the price of the stock at the release of the Twitter post with the previous 59
minute ticks.
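The SMA feature can be sketched as below. The function names are hypothetical; the sketch assumes one price per minute in a list, shortens the window at the very start of the series, and appends the feature under an index one past the bag of words vocabulary, none of which is specified in the text.

```python
def sma_at(ticks, i, window=60):
    """Average of the tick at position i and the (window - 1) ticks before it.

    Assumption: near the start of the series, where fewer than `window` ticks
    exist, the average is taken over the ticks that are available.
    """
    start = max(0, i - window + 1)
    chunk = ticks[start:i + 1]
    return sum(chunk) / len(chunk)


def append_feature(libsvm_line, index, value):
    """Append one extra '<index>:<value>' pair to an existing sparse sample line."""
    return f"{libsvm_line} {index}:{value:g}"


# Hypothetical minute ticks: 0.0, 1.0, ..., 99.0
ticks = [float(i) for i in range(100)]
feature = sma_at(ticks, 99)
line = append_feature("259.28 1:1 2:2", 4, feature)
```

With a vocabulary of three words, index 4 is the first free feature index, so the SMA never collides with a word feature.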
Additionally, we added two minor improvements: first, removing the most common
stop words found in the bag of words representations of all stocks and, second, performing stemming on every token using the stemmer from the NLTK toolkit. These
improvements are standard techniques used in IR and may help to further decrease
the prediction error (Croft et al., 2009). Table 5.2 shows the results after running the
experiments.
We ran the experiments on the entire dataset as well as the reduced set as described
in the previous section and found that the results were again reasonably similar. We
also found that adding the SMA as a feature brought the MSE score for all stocks very
close to the baseline, without outperforming it. As expected, stemming and removing
stop words increased the performance in all cases as shown in the last row of Table 5.2.
On the other hand, the baseline also increased in 3 out of 4 cases because stemming
and stop word removal also affected the size of the training and test sets, which in
turn affected the baseline calculation. Figures 5.5, 5.6, 5.7 and 5.8 show the results
of the experiment in the time line with the predicted price plotted against the actual
price. As the results were still not satisfactory, we turned our focus to finding possible
improvements relating to the query terms.
Table 5.2: Simple Moving Average - experimentation results
(rows: Stock Symbol; Full Dataset Size, Baseline MSE, Prediction MSE + SMA Feature; Reduced Dataset Size, Baseline MSE, Prediction MSE + SMA Feature; Reduced Dataset Size - stop/stem, Baseline MSE, Prediction MSE + SMA Feature - stop/stem 0.3888)
Figure 5.6: GOOG Simple Moving Average - experimentation results
Figure 5.7: FSLR Simple Moving Average - experimentation results
Figure 5.8: INTC Simple Moving Average - experimentation results
Weighted Query Terms
The analysis of the results obtained so far showed that our model is capable of building
a regression line which follows the curve of the actual stock price. However, the performance was not good enough, as the prediction line still deviated too strongly from
the actual stock prices, indicating that the list of keywords we used to construct our
features either retained Tweets that were not relevant enough or that the retained posts
did not have terms that would help discriminate the direction of the stock price. In
this section we look at the process of feature construction in more detail and start by
analyzing the Twitter posts that were retained after applying the query term filters used
to remove irrelevant information. Our goal was to understand if there was a useful
correlation between price changes and key word frequencies of posts. As an example,
we will focus our discussion on the Apple stock.
We counted the frequencies of each key word in the entire corpus and found that
keywords such as ’iphone’, ’media’, ’google’, ’web’, ’apple’, and ’ipad’ had the highest frequency counts, while terms such as ’msft’, ’aapl’, ’mot’, ’macosx’, and ’goog’ had
very low frequencies. Most of the latter keywords are the stock symbols of the companies that compete with Apple while the former keywords relate more to the Apple
company and its product line. Then, there are also keywords that are generally common, such as ’media’, ’google’ and ’web’.
We decided to take some of the keywords and plot them against the stock price
to see if there was a correlation between the price and the number of mentions of
each word and to observe if any keyword had a bigger impact than others. We found
that keywords that had a low frequency did not show noticeable correlations to price
changes. On the other hand, keywords that had a very high frequency count and which
related directly to the company including currently popular products and services did
show considerable correlation with price changes e.g. ’apple’ and ’ipad’. Yet, other
high frequency terms that did not relate to the company or that described less popular
products or services such as the term ’ipod’ or ’media’ did not show similar correlations. Figure 5.9 depicts the keyword counts of two keywords, ’apple’ and ’ipod’ as
well as the price of the AAPL stock. The interval on the x-axis is represented in hours.
The frequency of the keywords is aggregated over every hour. Similarly, the price of
the stock is averaged over every hour. While we are not expecting the frequencies to
model the price, we are able to see spikes in keyword mentions of the term ’apple’
during several strong price changes of the stock. We also included the term ’ipod’
to demonstrate as a comparison that it did not show reasonable correlations with the price.
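The hourly aggregation behind such a chart can be sketched as follows. The function names and data layout are hypothetical. The Pearson coefficient at the end is an addition for illustration: the analysis described above relied on visual inspection of the charts, not on a computed coefficient.

```python
from collections import defaultdict
from datetime import datetime
from math import sqrt


def hourly_keyword_counts(posts, keyword):
    """posts: iterable of (datetime, tokens). Returns {hour_start: mentions of keyword}."""
    counts = defaultdict(int)
    for when, tokens in posts:
        hour = when.replace(minute=0, second=0, microsecond=0)
        counts[hour] += tokens.count(keyword)
    return dict(counts)


def hourly_mean_price(quotes):
    """quotes: iterable of (datetime, price). Returns {hour_start: average price}."""
    buckets = defaultdict(list)
    for when, price in quotes:
        buckets[when.replace(minute=0, second=0, microsecond=0)].append(price)
    return {hour: sum(prices) / len(prices) for hour, prices in buckets.items()}


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (dev_x * dev_y) if dev_x and dev_y else 0.0
```

To compare a keyword with the price, the two dictionaries would be aligned on their shared hour keys before the coefficient is computed.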
Given that a number of terms showed correlations with price changes we decided
to investigate the text of these posts during the periods of time of high activity to see
what sort of language was used and what the meaning and sentiment reflected. Below,
we show five random samples from July 21 as we witnessed both considerable spikes
in price increase as well as in frequencies of key terms on that date. (Usernames are
replaced by asterisks to protect the users’ privacy.)
- theres a #mendeley iphone app.. going to check it out. (via
- apple enjoys solid q3 on strong ipad, iphone sales
- flashlight app secretly lets you enable iphone tethering #macworld
- apple profits soar thanks to iphone and ipad: sky newsother technology companies have been posting good profits wi...
- @*********: the ipad clearly cannibalized mac sales last quarter ” except the
Figure 5.9: AAPL query term frequencies of ’ipod’ and ’apple’ against the Apple stock
price. For the ’apple’ term, we can observe correlation between high term frequencies
and major price changes as on July 21, where both the price and the term count rise
The samples exhibit sentiment that indicates a positive trend for Apple as the discussions touched on the topic of Apple’s positive quarterly profits. Therefore the increased mention of the keyword ’apple’ on Twitter correlates with the increase in price
of the stock. These are the type of posts we would like our keyword filter method to
While we mentioned many benefits associated with the short size of Twitter posts
in Section 2.3, a drawback is that more query terms are required to select posts that may
contain important information capable of building a close regression model. Since the
previous results showed that the majority of key terms did not have noticeable correlations with price changes, we did not want to penalize these key terms by removing
them entirely from the filter process. But at the same time, we did not want to select
tweets just on the basis of matching a seemingly unimportant key term. We therefore
assigned weights between 0 and 1 to the key terms. This was mainly a semi-manual
process where we examined the price changes and frequencies of the key terms from
charts we generated, such as the one in Figure 5.9.
Generating the weighted query terms was the first step in improving our filter method.
We also needed to assign a weight to each occurrence of the query terms in a Twitter
post. For this task, we counted the occurrences of every query term appearing in each
Twitter post as well as the total number of times each word appeared in the entire
corpus. With these counts we used the tf-idf algorithm to calculate the weights of each
post. This algorithm has two parts:
$$tf_{ik} = f_{ik} \qquad (5.1)$$
where $tf_{ik}$ is the term frequency weight of term k in post i and $f_{ik}$ is the number of
occurrences of the term k in the post.
$$idf_k = \log \frac{N}{n_k} \qquad (5.2)$$
where $idf_k$ is the inverse document frequency weight for term k, N is the number of posts
in the Twitter dataset, and $n_k$ is the number of posts in which term k occurs. To calculate the
final weight, equations (5.1) and (5.2) are multiplied. Usually equation (5.1)
is normalized in order to avoid favoring longer documents since they may contain a
higher query term count regardless of the general importance of the term. Since Twitter
posts are limited to 140 characters, we decided to remove the normalization factor.
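The weighting can be sketched in a few lines. The function name and the sample posts are hypothetical, and the natural logarithm is an assumption, since the text does not state the base of the logarithm.

```python
from math import log


def tf_idf_weights(posts, query_terms):
    """Return one {term: weight} dict per tokenized post.

    weight = f_ik * log(N / n_k), with f_ik the raw (unnormalized) count of
    term k in post i, N the number of posts and n_k the number of posts that
    contain term k. Terms that do not occur in a post are omitted.
    """
    n_posts = len(posts)
    doc_freq = {k: sum(1 for post in posts if k in post) for k in query_terms}
    weighted = []
    for post in posts:
        weights = {}
        for k in query_terms:
            f_ik = post.count(k)
            if f_ik and doc_freq[k]:
                weights[k] = f_ik * log(n_posts / doc_freq[k])
        weighted.append(weights)
    return weighted


# Hypothetical tokenized posts and query terms.
posts = [["apple", "ipad", "apple"], ["apple", "web"], ["web", "web"], ["music"]]
weights = tf_idf_weights(posts, ["apple", "ipad", "web"])
```

In this toy corpus 'ipad' occurs in only one of four posts, so a single mention of it weighs as much as two mentions of the more common 'apple'.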
In our first experiments, the pre-processing step retained all tweets that contained
at least one of the query terms. With the query term weights and
the tf-idf weights described above, we wanted to reduce the dataset by
selecting only posts that adequately match the weighted query terms. To measure how
closely a document matches our weighted query terms, we applied the cosine similarity
measure defined as:
$$\mathrm{Cosine}(D_i, Q) = \frac{\sum_{j=1}^{t} d_{ij} \cdot q_j}{\sqrt{\sum_{j=1}^{t} d_{ij}^2 \cdot \sum_{j=1}^{t} q_j^2}}$$
where $D_i$ is the vector of weighted query terms occurring in Twitter post i and
Q is the vector of weighted query terms. The numerator is the inner product
of the post weights and the query term weights. The denominator normalizes the resulting weight by dividing by the product of the lengths of both vectors. We found
that using a threshold of 0.35 gave the best results, returning the most relevant Twitter
posts. As a remark on our previous experimentation with different dataset sizes, we
would like to point out that the cosine similarity measure reduced the initial dataset
considerably by removing posts that did not reach the threshold. Therefore we stopped
our experimentation with two separate dataset sizes as described in Sections 5.1 and 5.2.
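The similarity filter can be sketched as follows. The function names and the example weights are hypothetical, and treating a post whose score equals the threshold as retained is an assumption.

```python
from math import sqrt


def cosine(post_weights, query_weights):
    """Cosine similarity between a post's term weights and the query term weights.

    Both arguments are {term: weight} dicts; terms missing from the post count as 0.
    """
    dot = sum(post_weights.get(k, 0.0) * q for k, q in query_weights.items())
    post_len = sqrt(sum(post_weights.get(k, 0.0) ** 2 for k in query_weights))
    query_len = sqrt(sum(q ** 2 for q in query_weights.values()))
    if post_len == 0.0 or query_len == 0.0:
        return 0.0
    return dot / (post_len * query_len)


def select_posts(weighted_posts, query_weights, threshold=0.35):
    """Return the indices of posts whose similarity reaches the threshold."""
    return [i for i, w in enumerate(weighted_posts)
            if cosine(w, query_weights) >= threshold]


# Hypothetical query term weights between 0 and 1 and three weighted posts.
query = {"apple": 1.0, "ipad": 0.8, "media": 0.2}
posts = [{"apple": 1.4, "ipad": 1.4}, {"media": 0.7}, {}]
kept = select_posts(posts, query)
```

Here the post that only matches the low-weight term 'media' scores about 0.15 and is dropped, while the post matching two high-weight terms is kept.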
Table 5.3: Weighted Query Terms Filter Method - experimentation results
(rows: Stock Symbol, Baseline MSE, Prediction MSE + SMA Feature 0.2578)
Figure 5.10: AAPL Weighted Query Terms Filter Method - experimentation results
After running the experiments on the new datasets we achieved improvements over
the previous results as shown in Table 5.3.
It should be noted that the MSE of the baseline also improved due to the fact that
we retained a smaller subset of posts. Figures 5.10, 5.11, 5.12, and 5.13 show the
prediction results of AAPL, GOOG, FSLR, and INTC respectively against the actual
price of each stock. The reduction of the datasets had the highest impact on the INTC
stock. This is contrary to our assumption made in Section 3.5.1, where we presumed
that FSLR would have fewer samples than the other stocks.
Figure 5.11: GOOG Weighted Query Terms Filter Method - experimentation results
Figure 5.12: FSLR Weighted Query Terms Filter Method - experimentation results
Figure 5.13: INTC Weighted Query Terms Filter Method - experimentation results
Table 5.4: 15 Minute Predictions - experimentation results
(rows: Stock Symbol, Baseline MSE, 15 minute pred. MSE, Profit/loss (10,000 starting capital) 10,130.54 9,983.7 9923.65)
Figure 5.14: AAPL 15 Minute Predictions - experimentation results
Figure 5.15: GOOG 15 Minute Predictions - experimentation results
Figure 5.16: FSLR 15 Minute Predictions - experimentation results
Figure 5.17: INTC 15 Minute Predictions - experimentation results
5.4 15 Minute Predictions
Up until now, the experiments predicted the price at the time of the
release of the Twitter posts. Yet, our goal is to predict the stock price t minutes into
the future, so that an agent could use a trading rule to decide whether to act upon
the prediction and buy or short a security, or stay idle until a better prediction is made
thereafter. Since the results of the experiments in Section 5.3 approached the baseline
fairly well in all cases, we decided to transform our datasets in order to use stock quotes
15 minutes after Twitter posts have been released. We chose a 15 minute
prediction merely because new chunks of Twitter posts are released to us
every 15 minutes. In this set of experiments we started using our Virtual Stock Trading
Engine introduced in Section 3.5.3.
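The transformation of the target values can be sketched as a lookup into the minute quotes. The function name is hypothetical, and choosing the first quote at or after the post time plus 15 minutes is an assumption about how the lookup is resolved when no quote falls exactly on that instant.

```python
from bisect import bisect_left
from datetime import datetime, timedelta


def future_price(quote_times, quote_prices, post_time, horizon_minutes=15):
    """Return the price of the first quote at or after post_time + horizon.

    quote_times must be sorted in ascending order. Returns None when the quote
    series ends before the target time, so such posts can be dropped.
    """
    target = post_time + timedelta(minutes=horizon_minutes)
    i = bisect_left(quote_times, target)
    return quote_prices[i] if i < len(quote_times) else None


# Hypothetical one-minute quotes for one hour of trading.
start = datetime(2010, 7, 30, 13, 30)
quote_times = [start + timedelta(minutes=m) for m in range(60)]
quote_prices = [100.0 + m for m in range(60)]
label = future_price(quote_times, quote_prices, datetime(2010, 7, 30, 13, 32, 20))
```

Only the label of each sample changes; the feature side, including the SMA at the release time of the post, stays as before.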
Table 5.4 shows the results after transforming the target values in the datasets while
Figures 5.14 to 5.17 again show the predicted price against the actual price of all four stocks.
As we expected, the results deteriorated slightly in every test case, because in the
previous dataset the SMA feature was much closer to the target value, whereas now
it was 15 minutes further away. Nonetheless, the results are still satisfactory enough to
continue experimenting in this domain. The trading engine, however,
showed mixed results. With 10,000 units starting capital and 1 unit transaction cost
(based on the transaction costs of the online broker firm),
only AAPL and GOOG turned the investment into profits during one trading session.
Parameter Validation
In all previous experiments, we have used 90% of the time series data for training and
the final day (10%) for testing. In a real world application, we would probably not
train a model to predict prices for an entire day since that model may ignore new and
unseen events that could not have been captured during training. Also, while the MSE
approached the baseline fairly well in all cases, we did not manage to beat the baseline
in any test case. In particular, we wanted to know if a model would perform similarly
or better if it could capture features in a shorter time interval. We therefore asked if
it was possible to train several models using subsets of the dataset and forecast prices
during the entire two week period rather than just on the last day. Specifically, can we
take a few hours, or even several minutes, of Twitter posts and predict the stock price
with similar accuracy? Each new model takes the form of
< train1 >< test1+15min >< train2 >< test2+15min > . . . < traint >< testt+15min >
where < traint > represents a subset of n training instances, < testt+15min > contains one testing sample which is at least 15 minutes away from the train samples, and
t is the number of possible train/test subsets for the two week period. We made some
modifications to the Virtual Stock Trading Engine in order to accommodate the dataset
transformations. Finally, all previous experiments built a model using the same SVR
parameters as reported in Section 3.5.1; and while we reported the experiments on the
most effective parameters, we continued experimenting with different control parameters. Around 60% of the time, the same parameters performed best. However, as
an additional improvement, we wanted to find out if a validation set could always select
the optimal parameters for the algorithm, under the assumption that a trained model was
capable of forecasting twice the distance into the future. This assumption is needed because all samples contain the 15 minute forecast price as the target
value. This means that, in a real world scenario, validation could only happen after
observing the actual target value, which would be revealed after 15 minutes. The same
assumption is made (after validation) for the case of predicting the final price, which
must therefore be twice the distance (30 minutes) into the future. For the experiments
conducted in this section, the datasets then take the form of
< train1 >< val1+15min >< test1+30min > . . . < traint >< valt+15min >< testt+30min >
where the training set now also includes a validation set, with < valt+15min > 15
minutes into the future for validation and parameter optimization and < testt+30min > 30
minutes into the future, for the actual prediction using the optimized parameters. Table
5.5 shows the results of the 15 minute predictions as well as the 30 minute predictions
including 15 minute validation. In the first three rows of the Table we included results
from the validation set without making any future forecasts in order to compare the
performance of validation on a simple model. This is equivalent to a 0 minutes wait
period before using samples for validation or testing.
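One way to generate such train/validation/test subsets is sketched below. The names are hypothetical, and the sketch assumes non-overlapping training windows of a fixed number of samples, with the validation and test samples chosen as the first samples at least 15 and 30 minutes after the last training sample. The parameter selection helper simply picks the candidate with the lowest validation error.

```python
from bisect import bisect_left
from datetime import datetime, timedelta


def rolling_windows(times, train_size, val_gap=15, test_gap=30):
    """Yield (train_indices, val_index, test_index) over time-sorted sample times.

    Assumption: each new training window starts right after the previous test sample.
    """
    start = 0
    while start + train_size <= len(times):
        train = range(start, start + train_size)
        last_train_time = times[train[-1]]
        val = bisect_left(times, last_train_time + timedelta(minutes=val_gap))
        test = bisect_left(times, last_train_time + timedelta(minutes=test_gap))
        if test >= len(times):
            break  # no sample far enough in the future is left
        yield train, val, test
        start = test + 1


def pick_parameters(candidates, validation_error):
    """Return the candidate parameter setting with the lowest validation error."""
    return min(candidates, key=validation_error)


# Hypothetical samples, one per minute for two hours.
base = datetime(2010, 7, 19, 13, 30)
times = [base + timedelta(minutes=m) for m in range(120)]
windows = list(rolling_windows(times, train_size=10))
```

In use, `validation_error` would train a model on the training indices with the given parameters and return its error on the validation sample; only the selected parameters are then used for the 30 minute prediction.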
We also evaluated the 15 minute predictions without validation by plotting the
prediction results as dots (in red) against the plot of the actual price line (in blue)
as seen in Figures 5.18, 5.20, and 5.22 for AAPL, GOOG, and FSLR respectively.
Furthermore, we plot on a separate figure the difference between the predicted
and the actual price as seen in Figures 5.19, 5.21 and 5.23. Each point on the graph
comes from a single prediction during the two week interval; the points are connected to
form a line. While we cannot draw direct comparisons to our previous experiments because
we are using different test sets, the results for the 15 minute prediction
without validation are similar to previous results. These results show that using shorter
training sets to build a model for short term prediction is possible. Once again, we
do not have any valuable results from INTC since the dataset was too small. A point
to note on all graphs showing the prediction dots against the actual price line is the
square peaks and dips which look like irregularities. These are in fact the price values
during the pre- or after-market hours, which were included in these experiments since
predictions were now made across multiple days.
Table 5.5: Multiple Equally Sized Datasets - experimentation results
(rows: Stock Symbol; Baseline MSE, Prediction 0m val. & 0m pred., Profit/loss 0m val. & 0m pred.; Baseline MSE, Prediction no val. & 15m pred., Profit/loss no val. & 15m pred. 10155.41 10099.55; Baseline MSE, Prediction 15m val. & 30m pred., Profit/loss 15m val. & 30m pred.)
Figure 5.18: AAPL Multiple Equally Sized Datasets - experimentation results
Figure 5.19: AAPL Multiple Equally Sized Datasets - prediction error results
Figure 5.20: GOOG Multiple Equally Sized Datasets - experimentation results
Figure 5.21: GOOG Multiple Equally Sized Datasets - prediction error results
Figure 5.22: FSLR Multiple Equally Sized Datasets - experimentation results
Figure 5.23: FSLR Multiple Equally Sized Datasets - prediction error results
In terms of the experiments over different forecasting distances, we found mixed
results concerning the Virtual Stock Trading Engine. The first experiment used samples for validation and testing that were 0 minutes away from the training instances.
While this is not a possible real world scenario, our aim was to test the performance
of the parameter optimization. As expected, in all cases of the first experiment a profit
was attained. The second experiment results are rather unexpected, since in two out of
the three cases the profit increased over the previous experiment. However, in the case
of FSLR, the profit decreased substantially as anticipated. Finally, the third experiment
had the disadvantage of predicting 30 minutes into the future. As we expected, more
test cases show significant losses. In the case of AAPL, however, we see a large increase in profits. We assume that this is probably due to increased random noise
as we attempt to predict further into the future.
Accumulating Training Data
Our final experiment built upon the previous experiments which used multiple small
datasets. This time, however, we asked if we could increase the accuracy over time if
we kept all previously seen samples and reused them in all subsequent training sets.
Therefore, rather than using equally sized training sets, we began by constructing a
small training set containing 100 training instances and a test set with one sample. The
samples drawn for this initial dataset were again from the very beginning of the time
series. The following training sets were built by accumulating previous training data
and concatenating them with new samples. We added one new sample after every
prediction was made; however, this new sample was also 15 minutes away from the
last predicted value. In this experiment, the training set gradually grew, containing
more and more samples. Using this method, we expected to see higher errors at the
beginning of the timeline while, as the number of training samples increased, the
error should gradually decrease.
Table 5.6: Accumulating Training Data - experimentation results
(rows: Stock Symbol, Baseline MSE, Prediction with Accumulating Training MSE 2.6591)
Figure 5.24: AAPL Accumulating Training Data - experimentation results
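The accumulating scheme described above can be sketched as an expanding window. The function name is hypothetical, and the sketch reflects one reading of the procedure: the training set always starts at the first sample, grows by one sample per prediction, and each test sample is the first sample at least 15 minutes after the last training sample.

```python
from bisect import bisect_left
from datetime import datetime, timedelta


def accumulating_windows(times, initial_size=100, gap_minutes=15):
    """Yield (train_indices, test_index) with a training set that grows by one sample."""
    end = initial_size
    while end <= len(times):
        target = times[end - 1] + timedelta(minutes=gap_minutes)
        test = bisect_left(times, target)
        if test >= len(times):
            break  # no sample far enough in the future is left
        yield range(0, end), test
        end += 1


# Hypothetical samples, one per minute for two hours.
base = datetime(2010, 7, 19, 13, 30)
times = [base + timedelta(minutes=m) for m in range(120)]
windows = list(accumulating_windows(times))
```

Since every window retrains on all earlier samples, the number of predictions is far larger than with fixed-size windows, at the price of a training cost that grows with each step.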
Figures 5.24, 5.26, and 5.28 show the actual against the predicted
prices. The first aspect to notice is that there are many more prediction dots. This is
because we are accumulating one sample at a time to build each new training
set rather than having fewer fixed sized training sets as in Section 5.5. Moreover, the
predicted values model the actual price lines more accurately from a graphical point
of view. Nonetheless, the aim of this experiment was to test whether accumulating
training data would decrease the prediction error over time. We can assess this by
comparing Figures 5.25, 5.27, and 5.29 with the equivalent figures from the previous
experiment, 5.19, 5.21, and 5.23 respectively. It is not very clear if accumulating training data improved results over time, but certain decreases in spikes suggest this
effect. A decrease in error is noticeable in the case of the GOOG stock and, to a certain
extent, also in the case of AAPL. FSLR, on the other hand, shows signs of
strong variations towards the right of the graphs in both Figures 5.23 and 5.29. These
spikes seem to correlate with the strong decline in the actual price seen in Figures 5.22 and
5.28, which may not have been captured in the trained model. From an error score
perspective, we again have similar results as before.
Figure 5.25: AAPL Accumulating Training Data - prediction error results
Figure 5.26: GOOG Accumulating Training Data - experimentation results
Figure 5.27: GOOG Accumulating Training Data - prediction error results
Figure 5.28: FSLR Accumulating Training Data - experimentation results
Figure 5.29: FSLR Accumulating Training Data - prediction error results
Chapter 5 Summary
In this chapter we presented different experiments to validate whether it is possible to
model the stock market using Twitter and further test if we can predict future stock
prices. We started with the basic bag of words approach in the first experiment. We
added the SMA as a feature in the second experiment which significantly improved
results. We then fine-tuned the results by exploring feature selection techniques. We
turned our attention to a new set of experiments and tested the prediction of future
price movements including the use of a validation set to fine-tune model parameters.
Moreover, experiments were conducted on subsets of the dataset in order to predict
prices during the entire two week period.
Chapter 6
The experiments described in the previous chapter can be divided into two categories.
The first set of experiments concentrated on building a regression model of the current
stock price using Twitter posts. That is, the target value in each data sample used
the price of a stock that was current at the release of a Twitter post. The second set
of experiments attempted to use a target price that was set in the future. A further
distinction that can be made is the use of the data to create training and test sets. Our
focus in the initial experiments used 90% of the data in chronological order for training
and the final 10% for testing. In contrast, later experiments divided the dataset into
several training and testing sets in order to predict prices during the entire two week
period of available data.
Our discussion begins by analyzing the first category of experiments. Results from
the simple bag of words model in Section 5.1 indicate one major problem. Training on
all data from the first nine of the ten business days merely predicted an average of
the training period. The result was a nearly straight line with many noise spikes, as
depicted in Figure 5.1. We had expected that a large amount of training data would help
accumulate distinctive support vectors. However, in time series prediction, the older
data becomes, the more it loses its predictive power. Since we gave all data the same
importance, that is, we did not decrease the weight of older data, the algorithm used
the entire dataset equally, creating the average lines we have shown. To overcome this
problem, we added the SMA of the last 60 minutes of the stock price to our feature
vector. A similar approach proved successful in the research conducted by Schumaker
and Chen (2009), where they
Chapter 6. Conclusion
used the last known price rather than the SMA. An additional improvement that may
help in reducing the importance of older data could be borrowed from an algorithm in
Reinforcement Learning (RL). In RL, learning environments are described by a set of
finite or infinite states in which the learner finds itself. While there are different learning
approaches, learned values for each state are generally stored in a value function. A
concept called Eligibility Traces is used to give more importance to recently visited
states while decreasing the influence of states that have not been visited in a while
(Sutton and Barto, 1998). These values are then updated in the value function. This
same principle could be applied to the posts, by giving older posts smaller weight than
more recent posts. Furthermore, posts that lie too far in the past could be pruned.
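As a minimal sketch of this idea, the following function assigns each post a weight that decays exponentially with its age and prunes posts beyond a cut-off. This is only an analogy to eligibility traces, not the RL algorithm itself, and the half-life, the cut-off, and the function name are hypothetical choices that were not part of our experiments.

```python
import math
from typing import List, Optional


def recency_weights(ages_minutes: List[float], half_life: float = 60.0,
                    max_age: Optional[float] = None) -> List[float]:
    """Return one weight per post: 1.0 for a brand-new post, halving every `half_life` minutes.

    Posts older than `max_age` (if given) are pruned by assigning them weight 0.
    Both parameters are illustrative assumptions.
    """
    decay = math.log(2) / half_life
    weights = []
    for age in ages_minutes:
        if max_age is not None and age > max_age:
            weights.append(0.0)  # too far in the past: pruned
        else:
            weights.append(math.exp(-decay * age))
    return weights
```

Such weights could then be supplied as per-sample weights to a learner that supports them, so that recent posts influence the fitted model more strongly than old ones.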
While the new SMA feature did not yet address the problems of time series data
we mentioned, it improved the results significantly. As shown in Figure 5.5, the prediction line adjusted to the actual price line but still contained many spikes, which
indicated not only that we had not yet found features that were relevant to our task, but
also that many samples were probably noise or spam. We started
addressing this problem in Section 5.3 by analyzing the query terms used to filter
irrelevant posts out of the raw dataset. One key problem with the short length
of Twitter posts is that using too few query terms will inevitably skip posts which may
have strong predictive content. With the query expansion algorithm described
in Section 3.2, a second problem arose: using too many keywords that relate to our
query will include many posts that are not relevant to our task. It is essential to find
the right balance, which we attempted by applying weights to the query terms. However, other methods should also be taken into consideration in future work, such as
calculating information gain on individual features in order to remove the least helpful
features and apply appropriate weights to more informative features. While these improvements helped increase the accuracy significantly towards the baseline, as seen in
Table 5.3, one of the most important contributors to our remaining prediction error is
spam and noise. Twitter’s rapid growth in popularity has triggered a constant battle
between relevancy and spam in Twitter posts, which has been fueled by the releases of
the Twitter API. Furthermore, we should take into account who is posting information
that is considered relevant and influential, which posts are being re-tweeted most often, and which users have the strongest reach. This information can be obtained from a
combination of Twitter meta-data and the network of connected users. This knowledge
could then be used to assign additional weights to different posts. We believe that the
largest improvements to our approach can be gained from identifying and removing spam
as well as identifying and ranking relevant sources (i.e. user accounts). Across the
different stocks in our experiments, we obtained comparable results in most cases, although we had
expected the error measures of First Solar (FSLR) and Intel Corporation (INTC)
to be worse than those of Google (GOOG) and Apple (AAPL). We found that the FSLR
and INTC datasets did not contain considerable fluctuations in price or noticeable spikes, and that those factors contributed to the accuracy observed. Therefore, future
experimentation should include more data with various fluctuations as well as stocks
that are not part of the technology domain.
In the second category of experiments, we examined the prediction of future prices
as well as the implementation of new training/testing intervals. We changed the target
value of all samples in our dataset to use the price 15 minutes ahead of the current
value. Results in Table 5.4 show an increase in the MSE, more than doubling from
0.2578 to 0.5959 for Apple, with similar results for the other tested stocks. Since this
was expected, we moved on to address the problem of how the dataset was split. As discussed
above, using the first 90% of the dataset for training and the remaining 10% for testing
proved problematic for time series prediction. For this reason, we created multiple shorter datasets in order to capture relevant information in close proximity to
the prediction at hand. To test our results of the 15 minute predictions, we created a
Virtual Stock Trading Engine described in Section 3.5.3. The stock trading engine did
not take into account all the predictions made by the model. It only selected those that
would yield a potential profit above a threshold of twice the commission
rate. While we found variations across datasets, our findings generally indicate that
Twitter, as a source of near real-time information for predicting the price ahead of
time, can be used to make reasonable profits before the market adjusts itself. As the
time difference increases, the profits become less stable, as shown in Table 5.5. From
the results using smaller datasets, we can see that the results are similar to those of our initial experiments. While the MSE results are still not better than the baseline, the
graph of the predictions indicates closer matches, as seen in Figure 5.24, with less noise
than in previous experiments.
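The selection rule of the trading engine can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the engine described in Section 3.5.3: we assume here that the commission rate is a fraction of the current price, that the potential profit is the predicted relative price change, and that both long and short positions are allowed. The function name and return format are hypothetical.

```python
from typing import List, Tuple


def select_trades(current: List[float], predicted: List[float],
                  commission_rate: float) -> List[Tuple[int, str, float]]:
    """Keep only predictions whose expected relative gain exceeds twice the commission rate.

    Returns (sample index, side, expected relative gain) for each selected trade.
    Assumption: commission_rate is a fraction of the traded price (e.g. 0.002 = 0.2%).
    """
    threshold = 2.0 * commission_rate
    trades = []
    for i, (now, future) in enumerate(zip(current, predicted)):
        expected = (future - now) / now  # predicted relative change
        if expected > threshold:
            trades.append((i, "buy", expected))
        elif -expected > threshold:
            trades.append((i, "short", -expected))  # assumption: short selling is permitted
    return trades
```

Predictions that imply only a small move are ignored, because the commission paid on entering and leaving the position would consume the expected profit.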
We set out to build a regression model of stocks using Twitter and Intra-Day minute
data. We used several NLP techniques to pre-process the raw Twitter text including
word tokenization, stop-word removal, and stemming of words. We implemented filtering techniques that included keyword expansion, term weighting using the tf-idf
weighting scheme, and the cosine similarity measure to reduce the dataset and create a
feature vector space of the tokenized and weighted terms. In our experiments, we
found that the best feature vector with the lowest error measure was obtained from our
feature selection methods with the addition of a new feature, which we constructed using the average stock price over the last 60 ticks of minute stock quotes. Results show
that the predictions were very close to our strong baseline, with an MSE score of 0.2578
for Apple versus the baseline MSE of 0.221. Similarly, Google’s MSE score was 1.675
compared to the 1.6123 baseline. Finally, we also found that predicting the future price
can be achieved at short distances (15 minutes) into the future, but accuracy becomes
unstable as the forecast distance increases (30 minutes). From our work, we conclude
that information and beliefs can be extracted from the population, giving a small but
significant advantage in predicting market prices. In Section 6.3 we will point out
possible improvements for strengthening our claims.
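The additional price feature mentioned above, the average over the last 60 ticks of minute quotes, can be computed with a short sketch like the following. The function name is hypothetical, and averaging over fewer ticks at the start of the series is an assumption about how the first 59 values are handled.

```python
from collections import deque
from typing import Iterable, List


def simple_moving_average(prices: Iterable[float], window: int = 60) -> List[float]:
    """Return the SMA over the last `window` ticks for every tick in `prices`.

    Assumption: until `window` ticks are available, the average is taken
    over the ticks seen so far.
    """
    recent = deque(maxlen=window)
    running_sum = 0.0
    averages = []
    for price in prices:
        if len(recent) == window:
            running_sum -= recent[0]  # this value is evicted by the append below
        recent.append(price)
        running_sum += price
        averages.append(running_sum / len(recent))
    return averages
```

For example, `simple_moving_average([1, 2, 3, 4], window=2)` yields `[1.0, 1.5, 2.5, 3.5]`; the value at the time of a post would be appended to that post's feature vector.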
Future Work
For future work, we would be interested in exploring additional features as well as
filtering methods using Twitter meta-data. For example, meta-data contains information about the number of followers and the number of users being followed. These values can be used to
determine important and influential users. The meta-data field ’statuses count’ may be
useful to distinguish spammers from real users, as real users publish on average fewer
than 100 posts per day (Mowbray, 2010). In parallel with new and improved features, the problem of spam must also be addressed. Twitter meta-data, Twitter statistics, and
Twitter-specific functions such as re-tweets and hash-tags could be a starting point.
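A possible starting point for such a spam filter is sketched below. It is a hypothetical heuristic, not something we implemented: it assumes that the account creation time is available alongside the status count, and it treats the 100 posts per day figure as a hard cut-off, which is a simplification of the average reported by Mowbray (2010).

```python
from datetime import datetime


def posts_per_day(statuses_count: int, created_at: datetime, now: datetime) -> float:
    """Average number of posts per day over the lifetime of an account.

    Assumption: `created_at` is the account creation time taken from the user meta-data.
    Accounts younger than one day are treated as one day old to avoid division by zero.
    """
    age_days = max((now - created_at).total_seconds() / 86400.0, 1.0)
    return statuses_count / age_days


def looks_like_spammer(statuses_count: int, created_at: datetime, now: datetime,
                       max_daily_posts: float = 100.0) -> bool:
    """Flag accounts whose lifetime posting rate exceeds `max_daily_posts` (illustrative threshold)."""
    return posts_per_day(statuses_count, created_at, now) > max_daily_posts
```

An account created ten days ago with 5,000 posts would be flagged, while one with 500 posts over the same period would not. In practice such a rule would likely be combined with other signals, such as follower counts and re-tweet behaviour.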
Important information may have been lost in the process of tokenizing the data.
Therefore, creating proper rules that are specific to the Twitter dataset in conjunction
with the topic of financial markets may lead to additional improvements. Additionally,
using larger datasets including stocks from different domains as well as recording stock
data from time periods with more volatile price changes and stock volumes would be
an important field of research. Sentiment analysis has proved able to discriminate the beliefs
of the population on different topics. Therefore, adding lexical knowledge about
positive and negative terms could potentially lead to additional meaningful features.
Appendix A
Stock Charts
Figures A.1 (AAPL), A.2 (GOOG), and A.3 (FSLR) show snapshots of the charts of three of the four stocks we selected for our
experiments. The chart for INTC can be found in Section 3.5.1. The snapshots
cover the time period from July 19th, 2010 to July 20th, 2010. This data was retrieved from the Google Finance website.
Figure A.1: Apple Inc. stock chart snapshot (AAPL)
Figure A.2: Google Inc. stock chart snapshot (GOOG)
Figure A.3: First Solar, Inc. stock chart snapshot (FSLR)
Appendix B
The following is a list of acronyms used in this paper:
CDC U.S. Centers for Disease Control and Prevention
EDT Eastern Daylight Time
EMH efficient - market hypothesis
ERM Empirical Risk Minimization
IR Information Retrieval
MSE Mean Squared Error
NLP Natural Language Processing
SMA Simple Moving Average
SMS Short Message Service
SNW social networking websites
SRM Structural Risk Minimization
SVM Support Vector Machine
SVR Support Vector Regression
tf-cdf term frequency - category discrimination frequency
tf-idf term frequency - inverse document frequency
Appendix B. Acronyms
TREC Text REtrieval Conference
URL Uniform Resource Locator
UTC Coordinated Universal Time
