Tutorial: Exploratory Data Analysis (EDA) on Quantitative Data
The quanteda package is designed for exploratory data analysis (EDA) of historical stock returns in a time series setting. It calculates return and risk metrics and simulates time series of returns based on historical performance. Its EDA tools visualize where values are missing and how returns are distributed.
Several financial metrics are computed to assess the historical performance of a given stock: total return, annualized return, annualized volatility, and the Sharpe ratio. Users without a financial background can refer to the provided links for a fuller explanation of these metrics. The package can also simulate a time series of returns for a particular stock, using the observed return distribution and the key return and risk metrics.
This tutorial aims to illustrate the practical application of the functions within the quanteda package.
The package’s functions are as follows:
- plot_missing_vals: Visualizes the presence of missing values within the dataset.
- plot_num_dist: Visualizes the distribution of numerical values, providing insight into the characteristics of the data.
- generate_financial_metrics: Computes financial metrics, including total return, annualized return, annualized volatility, and the Sharpe ratio, to summarize a stock's historical performance.
- generate_return_series: Simulates time series returns for a given stock based on the observed returns distribution and specified return and risk metrics.
Import the functions
import requests
import zipfile
import warnings
import pandas as pd
from io import BytesIO
from quanteda.plot_missing_vals import plot_missing_vals
from quanteda.plot_num_dist import plot_num_dist
from quanteda.generate_financial_metrics import generate_financial_metrics
from quanteda.generate_return_series import generate_return_series
The dataset used for this demonstration was compiled by Oguz Akbilgic from Istanbul Stock Exchange data and is available from the UCI Machine Learning Repository. To match the requirements of our functions, we performed some basic data wrangling. In practice, the DataFrame passed to the functions in this package is expected to have a date-based index (a pandas DatetimeIndex).
For our demonstration, we have selected three stock indices, namely SP, FTSE, and NIKKEI, and obtained their daily returns data for the period from January 1, 2009, to December 31, 2009. This timeframe and selection of stock indices were chosen to showcase the functionality of the quanteda package with a specific subset of the data.
Download the Raw Data
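import os

warnings.filterwarnings("ignore", category=UserWarning, module="openpyxl")
zip_url = 'https://archive.ics.uci.edu/static/public/247/istanbul+stock+exchange.zip'
response = requests.get(zip_url)
response.raise_for_status()  # stop early if the download failed
os.makedirs('data', exist_ok=True)  # make sure the target folder exists
with open('data/data.zip', 'wb') as zip_file:
    zip_file.write(response.content)
with zipfile.ZipFile('data/data.zip', 'r') as zip_ref:
    zip_ref.extractall('data/')
# skip the first spreadsheet row so the second row supplies the column names
df = pd.read_excel('data/data_akbilgic.xlsx', skiprows=1)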
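The download step above writes the archive to disk before extracting it. As an alternative, the archive bytes can be opened directly in memory with BytesIO, which is already imported. The sketch below builds a small stand-in archive rather than contacting the server; the member name example.csv and its two rows are made up for illustration, and in the tutorial the bytes would come from response.content.

```python
import zipfile
from io import BytesIO

import pandas as pd

# Build a tiny stand-in archive in memory (hypothetical file name and data);
# in the tutorial these bytes would be response.content.
buffer = BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr(
        "example.csv",
        "date,SP\n2009-01-05,-0.004679\n2009-01-06,0.007787\n",
    )
archive_bytes = buffer.getvalue()

# Open the archive straight from bytes, without writing it to disk.
with zipfile.ZipFile(BytesIO(archive_bytes)) as zf:
    member_names = zf.namelist()
    with zf.open("example.csv") as member:
        sample = pd.read_csv(member, parse_dates=["date"], index_col="date")

print(member_names)
print(sample)
```

This avoids leaving a data.zip file behind, at the cost of holding the whole archive in memory.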
Preprocess Data
df['date'] = pd.to_datetime(df['date'], format='%d-%b-%y')
df.set_index('date', inplace=True)
df.index.name = 'index'
df = df[(df.index >= '2009-01-01') & (df.index <= '2009-12-31')]
index_returns = df[['SP', 'FTSE', 'NIKKEI']].asfreq('D')
index_returns.head()
| index | SP | FTSE | NIKKEI |
|---|---|---|---|
| 2009-01-05 | -0.004679 | 0.003894 | 0.000000 |
| 2009-01-06 | 0.007787 | 0.012866 | 0.004162 |
| 2009-01-07 | -0.030469 | -0.028735 | 0.017293 |
| 2009-01-08 | 0.003391 | -0.000466 | -0.040061 |
| 2009-01-09 | -0.021533 | -0.012710 | -0.004474 |
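The asfreq('D') call in the preprocessing step reindexes the data to calendar days, so days without a trading record appear as rows of NaN. A minimal sketch of that behaviour, using made-up returns for a Friday and the following Monday:

```python
import pandas as pd

# Made-up returns for Friday 2009-01-09 and Monday 2009-01-12 (illustrative only).
trading_days = pd.to_datetime(["2009-01-09", "2009-01-12"])
toy_returns = pd.DataFrame({"SP": [-0.02, 0.01]}, index=trading_days)

# asfreq("D") inserts a row for every calendar day; the weekend rows are NaN.
daily = toy_returns.asfreq("D")
print(daily)

# Forward-filling copies the last observed value into the gaps.
filled = daily.ffill()
print(filled)
```

Forward-filling is what the later calls in this tutorial use. Whether repeating the last return or filling with zero is more appropriate for non-trading days depends on the analysis.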
plot_missing_vals
The plot_missing_vals function takes a pandas DataFrame and visualizes where values are missing. Identifying and handling missing values early is important for an accurate evaluation of historical performance. In financial returns data, missing records typically fall on weekends or statutory holidays, when the stock exchange is closed.
plot_missing_vals(index_returns)
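A numeric summary can complement the plot. The sketch below counts missing values per column with plain pandas; the toy DataFrame is made-up data, and this is not how plot_missing_vals is implemented.

```python
import numpy as np
import pandas as pd

# Made-up daily returns with gaps (illustrative data only).
dates = pd.date_range("2009-01-05", periods=7, freq="D")
toy = pd.DataFrame(
    {
        "SP": [0.01, np.nan, -0.02, 0.00, np.nan, np.nan, 0.03],
        "FTSE": [0.02, 0.01, np.nan, -0.01, 0.00, 0.01, 0.02],
    },
    index=dates,
)

# Number and share of missing values in each column.
missing_count = toy.isna().sum()
missing_share = toy.isna().mean()
print(pd.DataFrame({"missing": missing_count, "share": missing_share}))
```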
plot_num_dist
The plot_num_dist function takes a pandas DataFrame and visualizes the distribution of time series returns for stocks. Understanding the return distribution matters most when simulating future returns, because the randomly generated returns assume that the historical distribution continues to hold.
plot_num_dist(index_returns.ffill())
The plots of the index_returns DataFrame suggest that the daily returns of the three selected indices (SP, FTSE, and NIKKEI) are approximately normally distributed. This observation guides the simulation of future returns by providing a basis for modeling the randomness of return movements.
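A visual impression of normality can be backed up with simple summary statistics. The sketch below computes skewness and excess kurtosis with pandas; both are close to zero for normally distributed data. The sample is randomly generated for illustration and is not the index data used in this tutorial.

```python
import numpy as np
import pandas as pd

# Synthetic daily returns drawn from a normal distribution (illustrative only).
rng = np.random.default_rng(0)
sample = pd.Series(rng.normal(loc=0.0, scale=0.01, size=5000))

# pandas reports excess kurtosis, so a normal sample gives a value near 0.
skewness = sample.skew()
excess_kurtosis = sample.kurt()
print(f"skewness: {skewness:.3f}, excess kurtosis: {excess_kurtosis:.3f}")
```

On real data the same two calls can be applied column by column, for example with index_returns.skew() and index_returns.kurt(). Values far from zero would argue against the normal assumption.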
generate_financial_metrics
The generate_financial_metrics function is used to evaluate stock performance and risk. It takes two parameters: a pandas DataFrame of historical stock returns and the annual risk-free rate, annual_risk_free, as a float. The annual_risk_free parameter defaults to 0.0. The output includes two return metrics (total return and annualized return), one risk metric (annualized volatility), and one risk-adjusted performance metric (Sharpe ratio). These metrics are important indicators of stock performance. Annualized return and annualized volatility are also the key inputs for simulating future returns.
metrics = generate_financial_metrics(index_returns.ffill())
metrics
| | count | total_return | annual_return | annual_volatility | sharpe_ratio |
|---|---|---|---|---|---|
| SP | 361 | 0.233877 | 0.236469 | 0.295960 | 0.798988 |
| FTSE | 361 | 0.362978 | 0.367000 | 0.256957 | 1.428256 |
| NIKKEI | 361 | 0.369559 | 0.373654 | 0.320181 | 1.167006 |
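To make the four metrics concrete, here is a minimal sketch of one common way to compute them from a series of periodic returns. The helper name sketch_metrics, the 365 periods per year, and the compounding convention are assumptions for illustration; generate_financial_metrics may use different conventions internally. The Sharpe ratio line is consistent with the table above, where sharpe_ratio equals annual_return divided by annual_volatility when annual_risk_free is 0.0 (for SP, 0.236469 / 0.295960 ≈ 0.799).

```python
import numpy as np
import pandas as pd


def sketch_metrics(returns, annual_risk_free=0.0, periods_per_year=365):
    """Hypothetical helper: common textbook definitions, not the package's code."""
    r = returns.dropna()
    # Compound the periodic returns into one cumulative return.
    total_return = (1 + r).prod() - 1
    # Rescale the cumulative return to a one-year horizon.
    annual_return = (1 + total_return) ** (periods_per_year / len(r)) - 1
    # Scale the per-period standard deviation by the square root of time.
    annual_volatility = r.std() * np.sqrt(periods_per_year)
    # Excess return per unit of risk.
    sharpe_ratio = (annual_return - annual_risk_free) / annual_volatility
    return {
        "count": len(r),
        "total_return": total_return,
        "annual_return": annual_return,
        "annual_volatility": annual_volatility,
        "sharpe_ratio": sharpe_ratio,
    }


# Made-up returns for three periods (illustrative only).
example = sketch_metrics(pd.Series([0.01, -0.02, 0.03]))
print(example)
```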
generate_return_series
The function generate_return_series is used to simulate time series returns given an expected return, volatility and return distribution of a stock. The parameters passed to this function are the following:
- expected_annual_return: Expected annualized return as a decimal (e.g., 0.05 for 5%).
- annual_volatility: Annualized volatility as a decimal (e.g., 0.2 for 20%).
- n_rows: Number of days, hours, or minutes (rows) to generate.
- num_series: Number of independent time series (columns) to generate.
- freq: The frequency of returns ('D' for daily, 'H' for hourly, 'min' for minute).
- dist: Type of return distribution (only the normal and log-normal distributions are supported).
- start_date: Start date for the series in the format 'YYYY-MM-DD'.
The return and volatility inputs are taken from the metrics computed with generate_financial_metrics, and the choice of a normal distribution follows the plot_num_dist analysis. Below, the function simulates 365 independent daily returns for the SP index based on its historical annualized return, volatility, and distribution. The result is returned as a pandas DataFrame.
# Use the historical SP metrics as simulation inputs
expected_annual_return = metrics.loc['SP', 'annual_return']
annual_volatility = metrics.loc['SP', 'annual_volatility']

generate_return_series(
    expected_annual_return,
    annual_volatility,
    n_rows=365,
    freq='D',
    num_series=1,
    dist='normal',
    random_state=524,
    start_date='2024-01-01')
| | series_1 |
|---|---|
| 2024-01-01 | -0.021539 |
| 2024-01-02 | 0.025164 |
| 2024-01-03 | 0.029528 |
| 2024-01-04 | 0.029991 |
| 2024-01-05 | 0.028716 |
| ... | ... |
| 2024-12-26 | -0.025290 |
| 2024-12-27 | 0.012590 |
| 2024-12-28 | 0.005374 |
| 2024-12-29 | 0.010130 |
| 2024-12-30 | -0.005789 |
365 rows × 1 columns
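The idea behind such a simulation can be sketched with NumPy: convert the annual return and volatility to per-period values, then draw normally distributed returns. The helper sketch_return_series, the 365 periods per year, and the scaling rules are assumptions for illustration; this is not the implementation of generate_return_series, and its numbers will not reproduce the table above.

```python
import numpy as np
import pandas as pd


def sketch_return_series(expected_annual_return, annual_volatility, n_rows,
                         start_date, periods_per_year=365, random_state=None):
    """Hypothetical helper: draws daily returns from a normal distribution."""
    rng = np.random.default_rng(random_state)
    # Per-period mean from geometric de-annualization (an assumed convention).
    period_mean = (1 + expected_annual_return) ** (1 / periods_per_year) - 1
    # Per-period volatility from square-root-of-time scaling.
    period_vol = annual_volatility / np.sqrt(periods_per_year)
    values = rng.normal(loc=period_mean, scale=period_vol, size=n_rows)
    index = pd.date_range(start=start_date, periods=n_rows, freq="D")
    return pd.DataFrame({"series_1": values}, index=index)


# Inputs copied from the SP row of the metrics table above.
simulated = sketch_return_series(0.236469, 0.295960, n_rows=365,
                                 start_date="2024-01-01", random_state=524)
print(simulated.head())
```

Passing a fixed random_state makes the draw repeatable, which helps when comparing scenarios.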
Reference
Akbilgic, Oguz. (2013). ISTANBUL STOCK EXCHANGE. UCI Machine Learning Repository. https://doi.org/10.24432/C54P4J.