Financial Machine Learning (CMSE11475)
Group Project Assignment
2023/2024
Content
Project Description
Individual Project
Project Deadline and Submission
Project topic
Project Hints
Suggested Topics
Forecasting Limit Order Book
Forecasting Stock Volatility
Forecasting High Frequency Cryptocurrency Return
Project Description
The project aims to practise the use of state-of-the-art machine learning models to analyse financial data and
solve financial problems.
Individual Project:
This is an individual project; no groups are required. Each student shall select their own topic and data and
answer their own research question alone. Cooperation and discussion during the learning process are
encouraged, but the project must be completed as the student's own work, not as group work.
Project Deadline and Submission:
Individual projects run from 15th January 2024 (week 1) to 29th March 2024 (week 10).
The submission deadline is 14:00, Thursday, 4th April 2024.
The submission consists of the project report and all implementation code (do NOT submit any
data). The code shall work on the originally provided datasets. The report and the code shall be ZIPPED into
one package for submission.
The report MUST follow the given template. All sections are required. The code MUST have complete and
detailed comments for every major logical section.
Project topic
Each student should individually choose one of the following suggested topics (with provided data) as
their own project. You are encouraged to revise or improve the project topic to make it more practical,
challenging, and suitable for your own research question. It is fine if many students select the same suggested
topic, as long as their code and project reports are significantly distinct.
The aim of this project is to apply at least THREE out of five techniques illustrated in the course (Deep Neural
Network; XGBoost; Cross-validation; Ensemble Model; Interpretability) to solve a financial problem.
Project Hints
All suggested topics are based on the computer lab examples with some changes and extensions. You can
easily find similar methods and models in the computer lab examples. Carefully studying those examples
and their code is crucial for understanding this course and completing the coursework.
Suggested Topics
Forecasting Limit Order Book
Topic
Can we use deep neural networks to forecast high-frequency returns at multiple horizons for stocks using
their limit order book information?
Data
10-level high-frequency limit order book data for five stocks: Apple, Amazon, Intel, Microsoft, and Google on 21st
June 2012. File sizes range from 40 MB to over 100 MB. You may choose to use only part of the data.
Method
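You may define the following features:
1) Limit Order Book (LOB) state
The LOB state $s_t^{LOB}$ collects $a_t^i$, $b_t^i$, $v_t^{i,a}$ and $v_t^{i,b}$, where $a_t^i$ and $b_t^i$ are the ask and bid
prices of the 10 levels ($i = 1, \dots, 10$), and $v_t^{i,a}$ and $v_t^{i,b}$ are the ask and bid volumes of the 10 levels
($i = 1, \dots, 10$).
$$s_t^{LOB} \in \mathbb{R}^{40}$$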
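2) Bid-Ask Order Flow (OF)
$$bOF_{t,i} = \begin{cases} v_t^{i,b}, & \text{if } b_t^i > b_{t-1}^i \\ v_t^{i,b} - v_{t-1}^{i,b}, & \text{if } b_t^i = b_{t-1}^i \\ -v_t^{i,b}, & \text{if } b_t^i < b_{t-1}^i \end{cases}$$
$$aOF_{t,i} = \begin{cases} v_t^{i,a}, & \text{if } a_t^i > a_{t-1}^i \\ v_t^{i,a} - v_{t-1}^{i,a}, & \text{if } a_t^i = a_{t-1}^i \\ -v_t^{i,a}, & \text{if } a_t^i < a_{t-1}^i \end{cases}$$
$$OF_t \in \mathbb{R}^{20}$$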
3) Order Flow Imbalance (OFI)
$$OFI_{t,i} = bOF_{t,i} - aOF_{t,i}, \quad i = 1, \dots, 10$$
$$OFI_t \in \mathbb{R}^{10}$$
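As a rough illustration, the order flow and order flow imbalance definitions above could be computed with NumPy as sketched below. The function names, the array layout (one row per time step, one column per level) and the toy numbers are assumptions made for this example; they are not taken from the lab code or the dataset.

```python
import numpy as np

def order_flow(price, volume):
    """Order flow for one side of the book, following the piecewise
    definition above (the same form is used for bids and asks).

    price, volume: arrays of shape (T, levels).
    Returns an array of shape (T - 1, levels), for t = 1, ..., T - 1.
    """
    price = np.asarray(price, dtype=float)
    volume = np.asarray(volume, dtype=float)
    p_now, p_prev = price[1:], price[:-1]
    v_now, v_prev = volume[1:], volume[:-1]
    return np.where(
        p_now > p_prev, v_now,                        # price moved up
        np.where(p_now == p_prev, v_now - v_prev,     # price unchanged
                 -v_now),                             # price moved down
    )

def order_flow_features(bid_p, bid_v, ask_p, ask_v):
    """Return (bOF, aOF, OFI), each of shape (T - 1, levels)."""
    b_of = order_flow(bid_p, bid_v)
    a_of = order_flow(ask_p, ask_v)
    return b_of, a_of, b_of - a_of

# Toy single-level example (hypothetical numbers, not from the dataset).
bid_p = [[10.0], [10.1], [10.1], [10.0]]
bid_v = [[5], [7], [4], [6]]
ask_p = [[10.2], [10.2], [10.3], [10.2]]
ask_v = [[3], [8], [2], [9]]
b_of, a_of, ofi = order_flow_features(bid_p, bid_v, ask_p, ask_v)
print(ofi.ravel())  # OFI values: 2, -5, 3
```

The first time step is dropped because the definitions need the previous observation; with 10 levels per side the three outputs have 10 columns each.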
The features can be defined as a vector
$$\mathbf{X}_t = (s_t^{LOB}, bOF_{t,i}, aOF_{t,i}, OFI_t)^T$$
The total dimension of the feature vector $\mathbf{X}_t$ is 40+20+10=70, so $\mathbf{X}_t \in \mathbb{R}^{70}$.
The target is the LOB mid-point return $\mathbf{r}_t$ over $H$ future horizons ($H \geq 1$):
$$\mathbf{r}_t = (r_{t,1}, \dots, r_{t,H})^T$$
The project is to estimate the function $f(\cdot)$ that takes a sequence of historical $\mathbf{X}_t$ as input and generates
the vector $\mathbf{r}_t$ as output:
$$\mathbf{r}_t = f(\mathbf{X}_t, \mathbf{X}_{t-1}, \mathbf{X}_{t-2}, \dots, \mathbf{X}_{t-W})$$
where $W$ is the look-back window and $r_{t,j}$, $j = 1, \dots, H$, are the elements of $\mathbf{r}_t$.
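A minimal sketch of how the input sequences and multi-horizon targets might be assembled is given below. It assumes that the return at horizon $j$ is the simple return of the mid-point price from $t$ to $t + j$; the brief does not fix this definition, so treat it as one possible choice. The function name `make_windows` is hypothetical.

```python
import numpy as np

def make_windows(X, mid, W, H):
    """Build (input, target) pairs for a sequence model.

    X:   features, shape (T, D), e.g. D = 70.
    mid: mid-point prices, shape (T,).
    Returns inputs of shape (n, W + 1, D) and targets of shape (n, H),
    where n = T - W - H (requires T > W + H).
    """
    X = np.asarray(X, dtype=float)
    mid = np.asarray(mid, dtype=float)
    inputs, targets = [], []
    for t in range(W, len(X) - H):
        inputs.append(X[t - W:t + 1])                        # X_{t-W}, ..., X_t
        # Assumed definition: r_{t,j} = mid_{t+j} / mid_t - 1 for j = 1..H.
        targets.append(mid[t + 1:t + H + 1] / mid[t] - 1.0)
    return np.stack(inputs), np.stack(targets)

# Toy data: 10 time steps, 70 features, mid-point prices 1, 2, ..., 10.
X = np.zeros((10, 70))
mid = np.arange(1.0, 11.0)
inputs, targets = make_windows(X, mid, W=3, H=2)
print(inputs.shape, targets.shape)  # (5, 4, 70) (5, 2)
```

The inputs are ordered oldest first, which is the layout most sequence models expect. No target uses information from the input window beyond the current mid-point price.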
This topic shall use LSTM as one of the potential models. You may train the LSTM model on the raw
70-dimensional features $\mathbf{X}_t$ with different $W$. You may also extract lower-dimensional features ($M < 70$)
with an autoencoder and then train the LSTM model on the extracted features with different $W$. You can provide
a comparison of these two methods.
This project shall also address the question of feature importance.
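One model-agnostic way to examine feature importance is permutation importance: shuffle one feature at a time and measure how much the forecast error increases. The sketch below is only an illustration under stated assumptions. It uses synthetic data and a least-squares linear model as a stand-in for the forecasting model, and it is not necessarily the interpretability method used in the labs.

```python
import numpy as np

def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Average increase in MSE when each column of X is shuffled."""
    rng = np.random.default_rng(seed)
    base = np.mean((predict(X) - y) ** 2)
    scores = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])   # break the link to y
            scores[j] += np.mean((predict(Xp) - y) ** 2) - base
    return scores / n_repeats

# Synthetic data (not from the project datasets): only feature 0 matters.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=500)

# Stand-in model: ordinary least squares.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
importance = permutation_importance(lambda A: A @ coef, X, y)
print(importance)  # feature 0 should dominate
```

For a sequence model, the same idea applies by shuffling one feature across samples while keeping the time dimension intact. Importance is best measured on held-out data rather than on the training set.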
Forecasting Stock Volatility
Topic
This topic comprises two subtopics, both pertaining to volatility forecasting. These subtopics are as follows:
1) Is stock volatility path-dependent?
2) Is stock volatility past-dependent?
To address these questions, you have the option to employ various machine learning models for forecasting
stock return volatility. This can be achieved either by utilising past returns (path-dependent) or past volatilities
(past-dependent).
Addressing either of the aforementioned sub-questions fulfils the coursework requirements for the
FML course. There is no need to complete work for both questions.
Data
In computer lab_3_1, we show how to download stock prices from Yahoo Finance. This topic uses
adjusted stock prices to calculate volatility. You shall calculate the volatility as the standard deviation of
$N$ daily arithmetic returns; note that this volatility must be computed from returns
within distinct, non-overlapping $N$-day intervals. $N$ can be five or ten days. The following figure shows the
volatility calculation, where $r_i$ is the daily return and $\sigma_i$ is the five-day volatility.
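The non-overlapping volatility calculation described above could be sketched as follows. The function name `block_volatility` is hypothetical, the use of the sample standard deviation (`ddof=1`) is an assumption, and the prices are toy numbers rather than real data.

```python
import numpy as np

def block_volatility(prices, N=5):
    """Daily arithmetic returns and their N-day non-overlapping volatility.

    Returns (returns, vol), where vol[k] is the standard deviation of
    returns[k * N:(k + 1) * N]. An incomplete trailing block is dropped.
    """
    prices = np.asarray(prices, dtype=float)
    returns = prices[1:] / prices[:-1] - 1.0           # arithmetic returns
    n_blocks = len(returns) // N
    blocks = returns[:n_blocks * N].reshape(n_blocks, N)
    return returns, blocks.std(axis=1, ddof=1)         # assumed: sample std

# Toy adjusted prices (hypothetical): 11 prices give 10 returns, 2 blocks.
prices = [100, 101, 99, 102, 103, 101, 104, 105, 103, 106, 107]
returns, vol = block_volatility(prices, N=5)
print(vol.shape)  # (2,)
```

Because each return belongs to exactly one block, consecutive volatilities share no observations.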
To successfully complete the coursework, you must choose a minimum of two stocks to assess one of the
aforementioned questions. The selection of these stocks should align with your personal interests.
Method
This topic investigates whether volatility is path-dependent or past-dependent. However, the appropriate
length $L$ of the path or past is unknown. You can select $L$ as 5, 10, 15, 20, or 40 days in the investigation and
conclude with a best $L$. Please decide for yourself which lengths $L$ to examine in your coursework.
For the question of path-dependence, the input features contain the daily returns of the past $L$ days:
$$\mathbf{X}_t = (r_{t-1}, r_{t-2}, r_{t-3}, \dots, r_{t-L})^T$$
The output is the volatility $y_t = \sigma_t$. Please be aware that the returns in $\mathbf{X}_t$ shall not be included in the
calculation of the output volatility $y_t$. As illustrated in the figure below, to forecast the volatility $\sigma_t$, you can use
the daily returns $r_{t-1}, r_{t-2}, \dots, r_{t-L}$ of the past $L$ days.
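A possible way to build the path-dependent samples is sketched below. It assumes that each target volatility is computed from one non-overlapping $N$-day block of returns and that the features are the $L$ daily returns immediately preceding that block, so no return appears in both input and output. The function name and the toy returns are hypothetical.

```python
import numpy as np

def path_dependent_samples(returns, N=5, L=10):
    """Features: the L daily returns before each N-day block.
    Target: the volatility of the returns inside that block."""
    returns = np.asarray(returns, dtype=float)
    n_blocks = len(returns) // N
    X, y = [], []
    for k in range(n_blocks):
        start = k * N                       # first return of the target block
        if start < L:
            continue                        # not enough history yet
        X.append(returns[start - L:start][::-1])        # r_{t-1}, ..., r_{t-L}
        y.append(returns[start:start + N].std(ddof=1))  # assumed: sample std
    return np.array(X), np.array(y)

# Toy returns (hypothetical): 0.00, 0.01, ..., 0.19.
toy_returns = np.arange(20, dtype=float) / 100
X_path, y_path = path_dependent_samples(toy_returns, N=5, L=10)
print(X_path.shape, y_path.shape)  # (2, 10) (2,)
```

The most recent return comes first in each feature row, matching the order of the feature vector above.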
For the question of past-dependence, the input features contain the previous $L$ volatilities:
$$\mathbf{X}_t = (\sigma_{t-1}, \sigma_{t-2}, \sigma_{t-3}, \dots, \sigma_{t-L})^T$$
The output is the volatility $y_t = \sigma_t$.
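The past-dependent samples can be built in the same spirit from a series of non-overlapping volatilities. This is a minimal sketch; the function name and the toy values are hypothetical, and here $L$ counts volatility periods rather than days.

```python
import numpy as np

def past_dependent_samples(sigma, L=5):
    """Features: the previous L volatilities (most recent first).
    Target: the current volatility."""
    sigma = np.asarray(sigma, dtype=float)
    X = np.array([sigma[t - L:t][::-1] for t in range(L, len(sigma))])
    y = sigma[L:]
    return X, y

# Toy volatility series (hypothetical values).
toy_sigma = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
X_past, y_past = past_dependent_samples(toy_sigma, L=3)
print(X_past[0], y_past[0])  # features 3, 2, 1 and target 4
```

Either set of samples can then be split chronologically into training and test periods before fitting a model.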
You may use any machine learning model for this topic.
This topic may also answer which length $L$ generates the best forecasting results for path-dependence and
past-dependence.
Forecasting High Frequency Cryptocurrency Return
Topic
This topic studies how machine learning models perform in forecasting the 15-minute-ahead return of any of
the 14 popular cryptocurrencies.
Data
A dataset “cryptocurrency_prices.csv” with millions of rows of 1-minute frequency market data dating back to
2018 is provided for building the model. The dataset contains 14 popular cryptocurrencies, distinguished by
asset IDs. The asset IDs and names are listed in the file “asset_details.csv”. You may choose any of the
cryptocurrencies to forecast. The “Weight” in the file is used to calculate the overall cryptocurrency market
return and is explained in the Appendix.
Asset_ID  Weight       Asset_Name
2         2.397895273  Bitcoin Cash
0         4.304065093  Binance Coin
1         6.779921907  Bitcoin
5         1.386294361  EOS.IO
7         2.079441542  Ethereum Classic
6         5.894402834  Ethereum
9         2.397895273  Litecoin
11        1.609437912  Monero
13        1.791759469  TRON
12        2.079441542  Stellar
3         4.406719247  Cardano
8         1.098612289  IOTA
10        1.098612289  Maker
4         3.555348061  Dogecoin
In the file “cryptocurrency_prices.csv”, the target has been calculated and provided as the column “Target”.
For each cryptocurrency asset $a$, the target $\text{Target}_t^a$ is the residual of the log return over the future
15 minutes. Note that, in each row, the “Target” has already been aligned as the future 15-minute return
residual and is the quantity to be forecast. (Target: residual log-returns for the asset over a 15-minute horizon.)
The dataset includes the following features:
timestamp: Unix timestamps in seconds (the number of seconds elapsed since
1970-01-01 00:00:00.000 UTC). Timestamps in this dataset are multiples of 60, indicating minute-by-minute
data.
Asset_ID: The asset ID corresponding to one of the cryptocurrencies (e.g. Asset_ID = 1 for Bitcoin). The mapping
from Asset_ID to crypto asset is contained in asset_details.csv.
Count: Total number of trades in the time interval (last minute).
Open: Opening price of the time interval (in USD).
High: Highest price reached during the time interval (in USD).
Low: Lowest price reached during the time interval (in USD).
Close: Closing price of the time interval (in USD).
Volume: Quantity of asset bought or sold, displayed in base currency USD.
VWAP: The average price of the asset over the time interval, weighted by volume. VWAP is an aggregated
form of trade data.
Method
You may define some additional features, for example the past 5-minute log return, the past 5-minute
absolute log return, the past 5-minute highest price, the past 5-minute lowest price, etc.
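The example features above could be derived with pandas roughly as follows. This is a sketch that assumes the rows belong to a single asset, are sorted by timestamp and have no missing minutes; the function name, the new column names and the toy prices are hypothetical.

```python
import numpy as np
import pandas as pd

def add_features(df, window=5):
    """df: rows for ONE asset, sorted by timestamp, one row per minute."""
    out = df.copy()
    log_close = np.log(out["Close"])
    # Past 5-minute log return and its absolute value.
    out["log_ret_5"] = log_close - log_close.shift(window)
    out["abs_log_ret_5"] = out["log_ret_5"].abs()
    # Highest High and lowest Low over the last 5 bars (current bar included).
    out["high_5"] = out["High"].rolling(window).max()
    out["low_5"] = out["Low"].rolling(window).min()
    return out

# Toy bars (hypothetical prices, not from the dataset).
toy = pd.DataFrame({"Close": [100.0, 101.0, 102.0, 103.0, 104.0, 105.0, 106.0]})
toy["High"] = toy["Close"] + 1.0
toy["Low"] = toy["Close"] - 1.0
feat = add_features(toy)
print(feat[["log_ret_5", "high_5", "low_5"]].tail(3))
```

With the full dataset, the function would be applied per asset, for example after filtering on Asset_ID. The leading rows with missing values would need to be dropped before training.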
You may try simple models (e.g., a linear or tree model) and complex models (e.g., LSTM) and compare their
forecasting performance.
If using LSTM, you may also study which length of the look-back window provides the best forecasting
performance.
In addition, feature importance shall be studied to show which features contribute most to forecasting the
asset's future relative performance.
Appendix
This appendix introduces how the target is calculated.
The log return at time $t$ for asset $a$ is calculated as:
$$R_t^a = \log\left(\frac{P_{t+16}^a}{P_{t+1}^a}\right)$$
As crypto asset returns are highly correlated, forecasting the returns of an individual asset requires removing the
market signal from that asset's returns. Therefore, the weighted average cryptocurrency market return $M_t$
is defined as:
$$M_t = \frac{\sum_a w^a R_t^a}{\sum_a w^a}$$
where $w^a$ is the weight for each cryptocurrency and is defined in the column “Weight” in the file
“asset_details.csv”.
Then, a beta is calculated for each asset:
$$\beta^a = \frac{\langle M_t \cdot R_t^a \rangle}{\langle M_t^2 \rangle}$$
where the bracket $\langle \cdot \rangle$ denotes the rolling average over the past 3750 minutes.
Finally, the regression residual is defined as the target for each asset:
$$\text{Target}_t^a = R_t^a - \beta^a M_t$$
However, you do not need to do this calculation. The target values have already been calculated and provided in the
“Target” column.
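For reference only, the appendix calculation could be sketched with pandas as below. The sketch assumes that the market return is the weight-normalised average of the asset log returns, that beta is the ratio of the rolling mean of $M_t R_t^a$ to the rolling mean of $M_t^2$, and that every asset has a price in every minute. The function name and the wide price layout (one column per asset) are hypothetical, and the provided targets may have been computed with different handling of missing data.

```python
import numpy as np
import pandas as pd

def residual_target(close, weights, window=3750):
    """close:   DataFrame of prices, one row per minute, one column per asset.
    weights: Series of weights indexed by the same asset IDs."""
    # R_t^a = log(P_{t+16}^a / P_{t+1}^a)
    R = np.log(close.shift(-16) / close.shift(-1))
    w = weights.reindex(close.columns)
    # Weighted average market return M_t (NaN if any asset is missing).
    M = R.mul(w, axis=1).sum(axis=1, skipna=False) / w.sum()
    # beta^a = <M * R^a> / <M^2>, with <.> a rolling mean over `window` rows.
    num = R.mul(M, axis=0).rolling(window).mean()
    den = (M ** 2).rolling(window).mean()
    beta = num.div(den, axis=0)
    # Regression residual: Target_t^a = R_t^a - beta^a * M_t
    return R - beta.mul(M, axis=0)

# Sanity check on synthetic prices: with a single asset the market return
# equals the asset return, so beta is 1 and the residual is zero.
rng = np.random.default_rng(0)
toy_close = pd.DataFrame({1: 100 * np.exp(np.cumsum(0.01 * rng.normal(size=30)))})
toy_target = residual_target(toy_close, pd.Series({1: 6.78}), window=3)
print(toy_target[1].dropna().abs().max())  # numerically zero
```

The last 16 rows have no target because the future prices are not available. The first `window - 1` rows have none because the rolling averages are not yet defined.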