Tutorial: Exploratory Data Analysis (EDA) on Quantitative Data

The quanteda package is designed for exploratory data analysis (EDA) of historical stock returns in a time series framework. It calculates return and risk metrics and simulates time series returns based on historical performance. Its EDA tools visualize missing values and the distribution of returns.

Various financial metrics are computed to assess the historical performance of a given stock. These metrics include total return, annualized return, annualized volatility, and the Sharpe ratio. For users without a financial background, please refer to the provided links for a more comprehensive understanding of these metrics. The package also offers the capability to simulate returns on time series for a particular stock, leveraging the observed returns distribution and key return and risk metrics.

This tutorial aims to illustrate the practical application of the functions within the quanteda package.

The package’s functions are as follows:

plot_missing_vals: Visualizes the presence of missing values in the dataset.
plot_num_dist: Visualizes the distribution of numerical values, giving insight into the characteristics of the data.
generate_financial_metrics: Computes total return, annualized return, annualized volatility, and the Sharpe ratio to summarize a stock's historical performance.
generate_return_series: Simulates time series returns for a given stock based on the observed returns distribution and specified return and risk metrics.

Import the functions

import requests
import zipfile
import warnings
import pandas as pd

from io import BytesIO

from quanteda.plot_missing_vals import plot_missing_vals
from quanteda.plot_num_dist import plot_num_dist
from quanteda.generate_financial_metrics import generate_financial_metrics
from quanteda.generate_return_series import generate_return_series

The demonstration dataset, ISTANBUL STOCK EXCHANGE, was curated by Oguz Akbilgic and is sourced from the UCI Machine Learning Repository. To align the dataset with the requirements of our functions, we performed some basic data wrangling. In practice, the DataFrame passed to functions in this package is expected to have a date-based index at a regular frequency, such as one produced by pd.date_range or asfreq.

For our demonstration, we have selected three stock indices, namely SP, FTSE, and NIKKEI, and obtained their daily returns data for the period from January 1, 2009, to December 31, 2009. This timeframe and selection of stock indices were chosen to showcase the functionality of the quanteda package with a specific subset of the data.
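As a minimal sketch of the expected input shape, the snippet below builds a small DataFrame with a daily date index. The column values and the name `toy_returns` are made up for illustration and are not part of the dataset or the package.

```python
import pandas as pd

# Hypothetical toy returns; the values are invented for illustration only.
dates = pd.date_range(start="2009-01-05", periods=5, freq="D")
toy_returns = pd.DataFrame(
    {"SP": [0.001, -0.002, 0.003, 0.0, 0.004]},
    index=dates,
)

# The index is a DatetimeIndex that carries an explicit daily frequency.
is_datetime_index = isinstance(toy_returns.index, pd.DatetimeIndex)
index_freq = toy_returns.index.freqstr
print(is_datetime_index, index_freq)
```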

Download the Raw Data

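import os

# Silence the UserWarnings openpyxl emits while reading the workbook.
warnings.filterwarnings("ignore", category=UserWarning, module="openpyxl")
zip_url = 'https://archive.ics.uci.edu/static/public/247/istanbul+stock+exchange.zip'
response = requests.get(zip_url)
response.raise_for_status()  # stop early if the download failed
os.makedirs('data', exist_ok=True)  # make sure the target directory exists
with open('data/data.zip', 'wb') as zip_file:
    zip_file.write(response.content)
with zipfile.ZipFile('data/data.zip', 'r') as zip_ref:
    zip_ref.extractall('data/')
# Read the extracted workbook, skipping the first row of the sheet.
df = pd.read_excel('data/data_akbilgic.xlsx', skiprows=1)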

Preprocess Data

df['date'] = pd.to_datetime(df['date'], format='%d-%b-%y')
df.set_index('date', inplace=True)
df.index.name = 'index'
df = df[(df.index >= '2009-01-01') & (df.index <= '2009-12-31')]

index_returns = df[['SP', 'FTSE', 'NIKKEI']].asfreq('D')
index_returns.head()

	SP	FTSE	NIKKEI
index
2009-01-05	-0.004679	0.003894	0.000000
2009-01-06	0.007787	0.012866	0.004162
2009-01-07	-0.030469	-0.028735	0.017293
2009-01-08	0.003391	-0.000466	-0.040061
2009-01-09	-0.021533	-0.012710	-0.004474

`plot_missing_vals`

The plot_missing_vals function, designed to accept a Pandas DataFrame as a parameter, serves the purpose of visualizing the presence of missing values. Identifying and addressing missing values in a timely manner is crucial to ensure accuracy in historical performance evaluation. In the context of financial returns, missing records often occur during weekends or statutory holidays when the stock exchange is not trading.

plot_missing_vals(index_returns)
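To see why weekends show up as missing records, the sketch below reindexes business-day data to calendar days with `asfreq('D')` and counts the resulting gaps. The toy values are invented; only standard pandas calls are used, and this is not the internals of plot_missing_vals.

```python
import numpy as np
import pandas as pd

# Hypothetical returns for business days only (Mon 2009-01-05 to Fri 2009-01-16).
business_days = pd.bdate_range("2009-01-05", "2009-01-16")
toy = pd.DataFrame(
    {"SP": np.linspace(-0.01, 0.01, len(business_days))},
    index=business_days,
)

# Reindexing to calendar days inserts NaN rows for the weekend in between.
daily = toy.asfreq("D")

# Count missing values per column and list the dates on which they occur.
missing_per_column = daily.isna().sum()
missing_dates = daily.index[daily["SP"].isna()]
print(missing_per_column)
print(missing_dates.day_name().tolist())
```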

`plot_num_dist`

The plot_num_dist function takes a Pandas DataFrame as a parameter and visualizes the distribution of time series returns for stocks. Understanding the return distribution is crucial, particularly when simulating future returns for a stock, because the randomly generated returns are drawn under the assumption that this historical distribution holds.

plot_num_dist(index_returns.ffill())

From the plots of the index_returns DataFrame, the daily returns of the three selected indices (SP, FTSE, and NIKKEI) appear approximately normally distributed. This observation guides the simulation of future returns by suggesting a distribution for modeling the randomness of return movements.
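A visual impression of normality can be complemented with simple summary statistics. The sketch below, which uses synthetic data rather than the index returns, computes skewness and excess kurtosis with pandas; both are close to zero for a normal sample. The sample parameters and the seed are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic returns drawn from a normal distribution (illustrative only).
rng = np.random.default_rng(0)
sample = pd.Series(rng.normal(loc=0.0005, scale=0.015, size=5000))

# For normally distributed data, skewness and excess kurtosis are both near 0.
skewness = sample.skew()
excess_kurtosis = sample.kurt()  # pandas reports excess (Fisher) kurtosis
print(round(skewness, 3), round(excess_kurtosis, 3))
```

Applied to real data, values far from zero would suggest that a normal assumption is a poor fit for the simulation step.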

`generate_financial_metrics`

The generate_financial_metrics function is used to evaluate stock performance and risk. It takes two parameters: a Pandas DataFrame of historical stock returns and the annual risk-free rate, annual_risk_free, as a float (0.0 by default). The output includes two return metrics (total return and annualized return), one risk metric (annualized volatility), and one risk-adjusted performance metric (Sharpe ratio). These metrics are important indicators of stock performance. Annualized return and annualized volatility are also the key inputs for simulating future returns.

metrics = generate_financial_metrics(index_returns.ffill())
metrics

	count	total_return	annual_return	annual_volatility	sharpe_ratio
SP	361	0.233877	0.236469	0.295960	0.798988
FTSE	361	0.362978	0.367000	0.256957	1.428256
NIKKEI	361	0.369559	0.373654	0.320181	1.167006
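One plausible way to compute these four metrics from a series of daily returns is sketched below. The function name `sketch_metrics`, the `periods_per_year=365` convention, and the toy data are assumptions for illustration; quanteda's own formulas may differ. With a zero risk-free rate, the Sharpe ratios in the table above equal annual_return divided by annual_volatility, which the sketch mirrors.

```python
import numpy as np
import pandas as pd


def sketch_metrics(returns: pd.Series, annual_risk_free: float = 0.0,
                   periods_per_year: int = 365) -> dict:
    """Illustrative metrics; not necessarily the formulas quanteda uses."""
    n = int(returns.count())
    # Compound the periodic returns to get the total return over the sample.
    total_return = float((1 + returns).prod() - 1)
    # Scale the compounded growth to a one-year horizon.
    annual_return = (1 + total_return) ** (periods_per_year / n) - 1
    # Scale the per-period standard deviation by the square root of time.
    annual_volatility = float(returns.std() * np.sqrt(periods_per_year))
    # Excess return per unit of risk.
    sharpe_ratio = (annual_return - annual_risk_free) / annual_volatility
    return {
        "count": n,
        "total_return": total_return,
        "annual_return": annual_return,
        "annual_volatility": annual_volatility,
        "sharpe_ratio": sharpe_ratio,
    }


# Hypothetical daily returns, invented for this example.
toy = pd.Series([0.01, -0.02, 0.015, 0.0, 0.005])
result = sketch_metrics(toy)
print(result)
```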

`generate_return_series`

The function generate_return_series is used to simulate time series returns given an expected return, volatility and return distribution of a stock. The parameters passed to this function are the following:

expected_annual_return: Expected annualized return as a decimal (e.g., 0.05 for 5%).
annual_volatility: Annualized volatility as a decimal (e.g., 0.2 for 20%).
n_rows: Number of days, hours, or minutes (rows) to generate.
num_series: Number of independent time series (columns) to generate.
freq: The frequency of returns ('D' for daily, 'H' for hourly, 'min' for minute).
dist: Type of return distribution (only normal and log-normal distributions are supported).
start_date: Start date for the series in the format 'YYYY-MM-DD'.

The values of expected_annual_return, annual_volatility, freq, and dist follow from the analysis in the previous three sections. Below, the function simulates 365 daily returns for the SP index based on its historical annualized return, volatility, and distribution. The result is returned as a Pandas DataFrame.

# Take the expected return and volatility from the historical SP metrics.
expected_annual_return = metrics.loc['SP', 'annual_return']
annual_volatility = metrics.loc['SP', 'annual_volatility']

generate_return_series(
    expected_annual_return,
    annual_volatility,
    n_rows=365,
    freq='D',
    num_series=1,
    dist='normal',
    random_state=524,
    start_date='2024-01-01')

	series_1
2024-01-01	-0.021539
2024-01-02	0.025164
2024-01-03	0.029528
2024-01-04	0.029991
2024-01-05	0.028716
...	...
2024-12-26	-0.025290
2024-12-27	0.012590
2024-12-28	0.005374
2024-12-29	0.010130
2024-12-30	-0.005789

365 rows × 1 columns
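For intuition, a simulation of this kind can be sketched with NumPy by converting the annual figures to daily parameters and drawing from a normal distribution. The helper `sketch_return_series`, its `periods_per_year` and `seed` parameters, and the conversion formulas are assumptions for illustration, not quanteda's implementation, so the numbers will not match the output above.

```python
import numpy as np
import pandas as pd


def sketch_return_series(expected_annual_return, annual_volatility, n_rows,
                         start_date, periods_per_year=365, seed=0):
    """Draw daily normal returns consistent with the annual inputs (a sketch)."""
    # Convert the annual targets to per-day mean and standard deviation.
    daily_mean = (1 + expected_annual_return) ** (1 / periods_per_year) - 1
    daily_vol = annual_volatility / np.sqrt(periods_per_year)
    # A seeded generator makes the draw reproducible.
    rng = np.random.default_rng(seed)
    values = rng.normal(loc=daily_mean, scale=daily_vol, size=n_rows)
    index = pd.date_range(start=start_date, periods=n_rows, freq="D")
    return pd.DataFrame({"series_1": values}, index=index)


# Inputs taken from the SP row of the metrics table shown earlier.
simulated = sketch_return_series(0.236469, 0.295960, n_rows=365,
                                 start_date="2024-01-01")
print(simulated.head())
```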

Reference

Akbilgic, Oguz. (2013). ISTANBUL STOCK EXCHANGE. UCI Machine Learning Repository. https://doi.org/10.24432/C54P4J.
