Cookbook
All the examples presented here can be found in examples. For the details of each function or class, refer to the API documentation. All examples were tested on a desktop with 12th Gen Intel(R) Core(TM) i9-12900KS and 128GB RAM and 1GB network
Example 1. Generating firm characteristics
Firm characteristics generation
This example shows the full process of i) data download, ii) factor generation, and iii) firm characteristic generation. In this example,
Funda is merged with fundq to create quarterly updated annual accounting data.
Funda and fundq are assumed to be available 4 months later.
The latest market equity (me) is used when generating firm characteristics.
All firm characteristics defined in PyAnomaly are generated.
Generated factors are saved to ‘./output/factors_monthly’ and ‘./output/factors_daily’, and firm characteristics are saved to ‘./output/merge’.
Processing time: approx. 45 minutes to download data and 15 minutes for the rest.
Initialization
Import packages
from pyanomaly.globals import *
from pyanomaly.wrdsdata import WRDS
from pyanomaly.characteristics import FUNDA, FUNDQ, CRSPM, CRSPD, Merge
from pyanomaly.factors import make_all_factor_portfolios
from pyanomaly.analytics import *
Set log file path and start a timer.
# Set log file path as config.log_dir + this module name.
set_log_path(__file__) # You can also give a different log file name: e.g., set_log_path('logfile.log')
# Start timer.
start_timer('Example 1.')
# To replicate JKP's SAS code as closely as possible, uncomment the following line.
# set_config(replicate_jkp=True)
Download data frm WRDS
WRDS is a class to download data from WRDS.
wrds = WRDS(wrds_username) # Use your WRDS user id.
# Download all necessary data asynchronously.
# If network is slow or memory is insufficient, set run_in_executer=False.
wrds.download_all(run_in_executer=True)
# Create crspm(d) from m(d)sf and m(d)seall and add gvkey to them.
wrds.preprocess_crsp()
Factor generation
Make factor portfolios: FF3, FF5, HXZ4, and SY4. These are used in some firm characteristics, such as the residual momentum.
References: make_all_factor_portfolios(), make_factor_portfolios()
make_all_factor_portfolios(monthly=True, daily=True) # Monthly and daily factors
Firm characteristic generation
Firm characteristics are generated by FUNDA, FUNDQ, CRSPM, CRSPD,
and Merge classes.
To generate characteristics in a certain column of mapping.xlsx, set alias = column name.
For instance, alias = 'jkp' will generated firm characteristics defined in ‘jkp’ column.
Setting alias = None will generate all available firm characteristics.
Note that some firm characteristics will be created twice using different implementations.
alias = None
# Start date ('yyyy-mm-dd'). Set to None to create characteristics from as early as possible.
sdate = None
Generate firm characteristics from crspm.
crspm = CRSPM(alias=alias)
# Load crspm.
crspm.load_data(sdate)
# Filter data: shrcd in [10, 11, 12]
crspm.filter_data()
# Fill missing months by populating the data.
crspm.populate(method=None)
# Some preprocessing, e.g., creating frequently used variables.
crspm.update_variables()
# Merge crspm with the monthly factors generated earlier.
crspm.merge_with_factors()
# Display the firm characteristics to be generated.
crspm.show_available_chars()
# Create characteristics.
crspm.create_chars()
# Postprocessing: delete temporary variables and inspect the generated data.
crspm.postprocess()
Generate firm characteristics from crspd.
crspd = CRSPD(alias=alias)
# Load crspd.
crspd.load_data(sdate)
# Filter data: shrcd in [10, 11, 12]
crspd.filter_data()
# Some preprocessing, e.g., creating frequently used variables.
crspd.update_variables()
# Merge crspd with the daily factors generated earlier.
crspd.merge_with_factors()
# Display the firm characteristics to be generated.
crspd.show_available_chars()
# Create characteristics.
crspd.create_chars()
# Postprocessing: delete temporary variables and inspect the generated data.
crspd.postprocess()
# Delete raw crspd data to save memory.
del crspd.cd
Generate firm characteristics from fundq.
fundq = FUNDQ(alias=alias)
# Load fundq.
fundq.load_data(sdate)
# fundq has some duplicates (same datedate/gvkey). Drop duplicates.
fundq.remove_duplicates()
# Convert values in another currency (currently only CAD) to USD.
fundq.convert_currency()
# Populate data to monthly.
# limit=3: Forward fill up to 3 months.
# lag=4: Shift data 4 months, i.e., data is available 4 months later.
# new_date_col='date': Populated data has index 'date'/'gvkey', and 'datadate' becomes a column.
fundq.populate(MONTHLY, limit=3, lag=4, new_date_col='date')
# Make quarterly variables from ytd variables and use them to fill missing quarterly variables.
fundq.create_qitems_from_yitems()
# Some preprocessing, e.g., creating frequently used variables.
fundq.update_variables()
# Display the firm characteristics to be generated.
fundq.show_available_chars()
# Create characteristics.
fundq.create_chars()
# Postprocessing: delete temporary variables and inspect the generated data.
fundq.postprocess()
Generate firm characteristics from funda.
funda = FUNDA(alias=alias)
# Load funda.
funda.load_data(sdate)
# Convert values in another currency (currently only CAD) to USD.
funda.convert_currency()
# Populate data to monthly.
# limit=12: Forward fill up to 12 months.
# lag=4: Shift data 4 months, i.e., data is available 4 months later.
# new_date_col='date': Populated data has index 'date'/'gvkey', and 'datadate' becomes a column.
funda.populate(MONTHLY, limit=12, lag=4, new_date_col='date')
# Generate quarterly-updated funda data from fundq and merge them with funda.
funda.merge_with_fundq(fundq)
# Some preprocessing, e.g., creating frequently used variables.
funda.update_variables()
# Add the market equity of crspm to funda.
funda.add_crsp_me(crspm)
# Display the firm characteristics to be generated.
funda.show_available_chars()
# Create characteristics.
funda.create_chars()
# Postprocessing: delete temporary variables and inspect the generated data.
funda.postprocess()
Merge FUNDA, FUNDQ, CRSPM, and CRSPD and generate firm characteristics that require data from multiple data sources. Merge.data contains all firm characteristics and raw data from FUNDA, FUNDQ, CRSPM, AND CRSPD.
merge = Merge(alias)
# Merge all data together.
merge.preprocess(crspm, crspd, funda, fundq)
# Display the firm characteristics to be generated.
merge.show_available_chars()
# Create characteristics.
merge.create_chars()
# Postprocessing: delete temporary variables and inspect the generated data.
merge.postprocess()
# Save the results to config.output_dir + 'merge'.
# The default behavior of Panel.save() is to save all the columns of Panel.data.
# You can reduce file size by choosing columns to save. The following code saves firm characteristics + columns.
columns = ['gvkey', 'datadate', 'primary', 'exchcd', 'me', 'ret', 'exret', 'rf']
merge.save(other_columns=columns)
elapsed_time('End of Example 1.')
Example 2. Sorting-based portfolio analysis
Sorting-based portfolio analysis.
This example demonstrates how to construct quantile portfolios and carry out 1-D or 2-D sorts.
Quintile portfolios will be generated from the 12-month momentum and their mean returns and t-values will be computed.
5x5 2-D sort will be conducted on momentum and size.
Firm-level cross-sectional regression will be conducted.
Factor regressions will be carried out using FF 3-factors.
It is assumed that the firm characteristic has been created and saved to config.output_dir + ‘crspm’.
See also make_char_portfolios() for characteristic portfolio creation replicating JKP’s SAS code.
Processing time: approx. 10 seconds.
Initialization
Import packages
from pyanomaly.fileio import read_from_file
from pyanomaly.analytics import *
Set log file path and start a timer.
set_log_path(__file__)
start_timer('Example 2')
# Jitted functions are slow when first called due to compile time. This example runs faster by disabling jit.
set_config(disable_jit=True)
Set variables
char = 'ret_12_1' # 12-month momentum
char_class = char + '_class' # Momentum class column
ret_col = 'exret' # Return column
weight_col = 'me' # Market equity column
split = 5 # Qintile. You can also do split = [0.2, 0.4, 0.6, 0.8].
labels = ['H', 2, 3, 4, 'L', 'H-L'] # Momentum portfolio names: 'H-L' for long short.
Load and prepare data
References: datatools.shift(), filter().
# Read crspm data (not raw data but the output data).
data = read_from_file('crspm')
# Shift data except for return so that all the variables at t are as of t-1 and the return is over t-1 to t.
data = shift(data, 1, excl_cols=[ret_col])
# Drop nan returns and weights.
data = data.dropna(subset=[ret_col, weight_col])
# If you want to exclude bottom 20% based on NYSE size...
log(f'crspm shape before size filtering: {data.shape}')
data['nyse_me'] = np.where(data.exchcd == 1, data[weight_col], np.nan) # NYSE me
data = filter(data, weight_col, (0.2, None), ginfo='date', by='nyse_me')
log(f'crspm shape after size filtering: {data.shape}')
[2024/02/22 14:52] crspm shape before size filtering: (3958767, 76)
[2024/02/22 14:52] crspm shape after size filtering: (1752862, 77)
1-D sort
References: classify(), one_dim_sort(), time_series_average()
elapsed_time('1D sort start.')
# Classify stocks on each date on momentum. class 0: highest momentum, 5: lowest momentum.
data[char_class] = classify(data[char], split, ascending=False, ginfo='date')
# Make value-weighted quantile portfolios.
# Set weight_col to None for equal-weight portfolios.
# qpfs will have index = date/char_class and columns = [ret_col]
qpfs = one_dim_sort(data, char_class, ret_col, weight_col=weight_col)
# You can add more columns to the portfolios. Here, I add signal (mean of the characteristic) and
# n_firms (number of firms).
qpfs[['signal', 'n_firms']] = one_dim_sort(data, char_class, char, function=['mean', 'count'])
# If you want to see the average size and price of the portfolios...
qpfs[['size', 'prc']] = one_dim_sort(data, char_class, [weight_col, 'prc'])
# Rename the portfolios. Without this, the names are 0, 1, ...
relabel_class(qpfs, labels)
# Check the result.
print(f'Quantile portfolios sorted on {char}')
print(qpfs)
Quantile portfolios sorted on ret_12_1
exret signal n_firms size prc
date ret_12_1_class
1927-01-31 H -0.008 0.479 72 83.725 73.075
2 0.006 0.145 72 137.217 92.345
3 0.001 0.010 71 85.660 81.256
4 -0.004 -0.107 72 53.240 54.126
L -0.007 -0.333 72 24.329 34.560
H-L -0.001 0.812 0 59.396 38.515
1927-02-28 H 0.055 0.447 74 93.941 81.642
... ... ... ... ...
H-L 0.038 1.041 0 20643.687 81.385
2023-12-31 H 0.040 0.575 449 33467.685 119.837
2 0.035 0.083 449 31027.949 1333.695
3 0.045 -0.067 449 22789.652 101.212
4 0.074 -0.207 449 14269.428 64.336
L 0.118 -0.425 450 5509.468 35.076
H-L -0.077 1.000 -1 27958.216 84.760
# Calculate the time-series mean and t-value of the returns of the quantile portfolios.
mean, tval = time_series_average(qpfs[ret_col], cov_type='HAC', cov_kwds={'maxlags': 12})
print(f'\n Mean of portfolio returns sorted on {char}')
print(pd.concat([mean, tval], axis=1))
elapsed_time('1D sort end.')
Mean of portfolio returns sorted on ret_12_1
mean t-val
ret_12_1_class
H 0.011 5.503
2 0.008 4.899
3 0.006 3.727
4 0.006 3.269
L 0.003 1.478
H-L 0.007 4.335
2-D sort
References: two_dim_sort()
elapsed_time('2D sort start.')
# Make 5x5 portfolios on momentum and size.
char2 = 'me' # Market equity
char_class2 = char2 + '_class' # Size class column
split2 = 5 # Quintiles
labels2 = ['S', 2, 3, 4, 'B', 'S-B'] # Size portfolio names
# Classify stocks on each date on size. class 0: smallest, 5: biggest.
data[char_class2] = classify(data[char2], split2, ascending=True, ginfo='date')
# Make 5x5 value-weighted portfolios.
# qpfs2 will have index = date/char_class and columns = char_class2.
qpfs2 = two_dim_sort(data, char_class, char_class2, ret_col, weight_col=weight_col, output_dim=2)
# Rename the portfolios.
relabel_class(qpfs2, labels)
relabel_class(qpfs2, labels2, axis=1)
# Calculate the time-series mean and t-value of the returns of the quantile portfolios.
mean, tval = time_series_average(qpfs2, cov_type='HAC', cov_kwds={'maxlags': 12})
print(f'\nMean of portfolio returns sorted on {char} and {char2}')
print('\nMean')
print(mean)
print('\nt-stat')
print(tval)
elapsed_time('2D sort end.')
Mean of portfolio returns sorted on ret_12_1 and me
Mean
me_class S 2 3 4 B S-B
ret_12_1_class
H 0.014 0.014 0.014 0.012 0.010 0.004
2 0.012 0.010 0.009 0.009 0.008 0.004
3 0.011 0.009 0.008 0.008 0.006 0.005
4 0.009 0.008 0.007 0.006 0.006 0.004
L 0.003 0.004 0.004 0.004 0.003 -0.000
H-L 0.011 0.010 0.009 0.008 0.007 0.004
t-stat
me_class S 2 3 4 B S-B
ret_12_1_class
H 5.564 6.146 6.057 5.805 5.101 2.633
2 4.923 4.948 4.532 5.241 4.778 2.738
3 4.462 4.504 4.023 4.115 3.538 3.278
4 3.569 3.490 3.192 3.348 3.298 2.411
L 1.012 1.491 1.540 1.707 1.394 -0.111
H-L 6.516 5.822 5.286 4.877 3.885 2.873
Firm-level cross-sectional regression
References: crosssectional_regression()
Cross-sectionally regress return at t on t-1 momentum and size.
Take the time-series average of the coefficients and examine their significance.
elapsed_time('CS regression start.')
y_col = ret_col
X_cols = [char, char2]
# Run the regressions. Output: mean coefficients, t-values, and coefficient time-series.
mean, tval, coef = crosssectional_regression(data, y_col, X_cols, add_constant=True, cov_type='HAC',
cov_kwds={'maxlags': 12})
print('\nMean of the coefficients')
print(pd.concat([mean, tval], axis=1))
print('\nCoefficients time-series')
print(coef)
elapsed_time('CS regression end.')
Mean of the coefficients
mean t-stat
const 0.006 3.273
ret_12_1 0.008 4.742
me -0.000 -1.627
Coefficients time-series
const ret_12_1 me
1927-01-31 0.012 0.009 -0.000
1927-02-28 0.059 -0.011 -0.000
1927-03-31 -0.029 0.055 0.000
1927-04-30 -0.012 0.050 0.000
... ... ...
2023-09-30 -0.063 0.005 0.000
2023-10-31 -0.072 0.004 0.000
2023-11-30 0.090 0.018 -0.000
2023-12-31 0.105 0.000 -0.000
Factor regression
Regress quantile portfolios’ returns on the FF-3 factors.
elapsed_time('Factor regression start.')
# Sample period.
sdate = '1952-01'
edate = '2020-12'
qpfs = qpfs[sdate:edate]
# Read factors.
# Here, I use the factors generated by PyAnomaly. You can use different sources if you like.
factors = read_from_file(config.factors_monthly_fname)
# factors = WRDS.read_data('factors_monthly') # factors from K.French website.
# X: Constant + FF3.
X = factors.loc[sdate:edate, ['mktrf', 'smb_ff', 'hml']].values
# X = factors.loc[sdate:edate, ['mktrf', 'smb', 'hml']].values # If you use K. French data.
X = sm.add_constant(X)
# Run the regression for each portfolio.
result = {}
for pf in labels:
# y: Returns of a quantile portfolio
y = qpfs.loc[(slice(None), pf), ret_col].values
model = sm.OLS(y, X).fit()
# Make an output table.
result[pf] = pd.DataFrame({'coefs': model.params, 'tval': model.tvalues}, index=['const', 'mktrf', 'smb', 'hml'])
result = pd.concat(result, axis=1)
print('\nFactor regression: FF3')
print(result)
elapsed_time('Factor regression end.')
Factor regression: FF3
H 2 3 4 L H-L
coefs tval coefs tval coefs tval coefs tval coefs tval coefs tval
const 0.005 6.662 0.002 3.585 0.000 0.258 -0.002 -3.292 -0.007 -7.505 0.011 8.052
mktrf 1.013 58.519 0.956 85.346 0.941 93.837 1.000 83.939 1.220 58.471 -0.207 -6.128
smb 0.191 7.170 -0.116 -6.736 -0.098 -6.386 -0.043 -2.377 0.233 7.265 -0.042 -0.813
hml -0.494 -23.610 -0.116 -8.538 0.082 6.757 0.294 20.425 0.568 22.526 -1.062 -26.020
Example 3. Quantile portfolio construction and evaluation
Quantile portfolio construction and evaluation.
This example demonstrates how to construct quantile portfolios (and the long-short) based on a firm characteristic and evaluate their performance.
Tercile portfolios will be generated from the 12-month momentum.
Different ways of setting the transaction cost will be demonstrated.
It is assumed that the firm characteristic has been created and saved to config.output_dir + ‘crspm’.
Processing time: approx. 10 seconds.
Initialization
Import packages
import matplotlib.pyplot as plt
from pyanomaly.globals import *
from pyanomaly.fileio import read_from_file
from pyanomaly.tcost import TransactionCost, TimeVaryingCost
from pyanomaly.analytics import *
from pyanomaly.wrdsdata import WRDS
Set log file path and start a timer.
set_log_path(__file__)
start_timer('Example 3')
# Jitted functions are slow when first called due to compile time. This example runs faster by disabling jit.
set_config(disable_jit=True)
Set variables
char = 'ret_12_1' # 12-month momentum
char_class = char + '_class' # Momentum class column
ret_col = 'exret' # Return column
weight_col = 'me' # Market equity column
split = 3 # Tercile. You can also do split = [0.3, 0.7] for 3:4:3 split.
labels = ['H', 'M', 'L', 'H-L'] # Portfolio names: 'H-L' for long short.
Load and prepare data
# Read crspm data (not raw data but the output data).
data = read_from_file('crspm')
# NYSE me
data['nyse_me'] = np.where(data.exchcd == 1, data[weight_col], np.nan)
# Delete unnecessary columns to save memory (optional).
keep_columns(data, [char, ret_col, weight_col, 'nyse_me'])
# Shift data except for return so that all the variables at t are as of t-1 and the return is over t-1 to t.
data = shift(data, 1, excl_cols=[ret_col])
# Risk-free rates.
# Set `month_end=True` since the crspm date has also been shifted to month end.
rf = WRDS.get_risk_free_rate(month_end=True)
Transaction costs
PyAnomaly provides a powerful way to set transaction costs via TransactionCost and TimeVaryingCost
classes. Below are some examples of setting transaction costs.
# No transaction costs.
costfcn = None
# Transaction cost of 30 basis points.
costfcn = 0.003
# 20 (30) bps when buying (selling).
costfcn = TransactionCost(buy_linear=0.002, sell_linear=0.002)
In this example, we assume transaction cost varies over time and with size, following Brandt et al. (2009).
costfcn = TimeVaryingCost(data[weight_col])
Make tercile portfolios
In PyAnomaly, Portfolio class is used to make a portfolio and Portfolios class is used to
handle a group of portfolios. make_quantile_portfolios() uses these classes to construct quantile portfolios.
# Exclude bottom 20% based on NYSE size. This should come after setting the time varying cost
# because `TimeVaryingCost` requires all firms' me as the input.
data = filter(data, weight_col, (0.2, None), by='nyse_me', ginfo='date')
# Classify data on char. The highest momentum stock will be labeled 0 and the lowest 2.
data[char_class] = classify(data[char], split, ascending=False, ginfo='date')
portfolios = make_quantile_portfolios(data, char_class, ret_col, weight_col, rf=rf, costfcn=costfcn, names=labels)
Evaluate the portfolios
Portfolios can be evaluated using Portfolio.eval() for a single portfolio or Portfolios.eval() for a
group of portfolios.
# `annualize_factor=12` will annualize the results as crspm is monthly data.
pfperfs, pfvals = portfolios.eval(annualize_factor=12)
# Performance metrics.
print('\nPerformance')
print(pfperfs)
# Time-series of portfolio value, return, ...
print('\nValues')
print(pfvals)
Performance
H M L H-L
mean -0.006 -0.080 -0.079 -0.135
std 0.195 0.191 0.247 0.167
sharpe -0.029 -0.419 -0.319 -0.809
cum 0.692 -6.404 -7.451 -14.777
mdd 0.877 1.000 1.000 1.000
mdd start 1929-08-31 1929-08-31 1928-04-30 1932-06-30
mdd end 1984-07-31 2009-02-28 2009-02-28 2023-12-31
msd 0.509 0.557 0.618 0.783
msd start 2008-05-31 1973-09-30 1932-02-29 1932-06-30
msd end 2009-02-28 1974-09-30 1932-06-30 1932-09-30
turnover 5.592 7.716 6.272 12.238
lposition 472.351 471.740 472.158 472.351
sposition 0.000 0.000 0.000 472.158
Values
grossret grossexret val1 val lposition sposition cost tover ...
H H H H H H H H ...
1927-01-31 -0.002 -0.004 0.998 1.000 121 0.000 0.000 0.000 ...
1927-02-28 0.046 0.044 1.044 0.998 123 0.000 0.009 0.415 ...
1927-03-31 0.019 0.016 1.065 1.044 123 0.000 0.016 0.720 ...
1927-04-30 0.019 0.017 1.085 1.065 127 0.000 0.013 0.535 ...
... ... ... ... ... ... ... ... ...
2023-09-30 -0.062 -0.066 9274.546 9884.677 786 0.000 15.306 0.265 ...
2023-10-31 -0.025 -0.030 9039.294 9274.546 771 0.000 20.416 0.393 ...
2023-11-30 0.109 0.104 10023.511 9039.294 759 0.000 17.488 0.401 ...
2023-12-31 0.035 0.031 10378.316 10023.511 749 0.000 23.490 0.462 ...
Plot cumulative returns.
# Plot cumulative returns.
pfvals['cumret'].plot()
plt.show()
# Plot selected portfolios' cumulative returns.
pfvals['cumret'][['H', 'L', 'H-L']].plot()
plt.show()
Evaluate the portfolios for a sub-period and ignoring transaction costs.
pfperfs1, pfvals1 = portfolios.eval(sdate='2001-01-01', edate='2010-12-31', annualize_factor=12,
return_type='gross')
print('\nPerformance: 2001-01 - 2010-12')
print(pfperfs1)
# Plot cumulative returns.
pfvals1['cumret'][['H', 'L', 'H-L']].plot()
plt.show()
Performance: 2001-01 - 2010-12
H M L H-L
mean 0.026 -0.001 -0.001 0.027
std 0.169 0.149 0.256 0.200
sharpe 0.152 -0.005 -0.005 0.134
cum 0.325 0.093 -0.126 0.057
mdd 0.518 0.462 0.689 0.450
mdd start 2007-10-31 2007-10-31 2001-01-31 2008-06-30
mdd end 2009-02-28 2009-02-28 2009-02-28 2010-01-31
msd 0.402 0.259 0.404 0.396
msd start 2008-05-31 2008-08-31 2008-08-31 2009-02-28
msd end 2008-11-30 2008-11-30 2008-11-30 2009-05-31
turnover 6.289 7.606 6.150 12.898
lposition 677.483 676.883 677.208 677.483
sposition 0.000 0.000 0.000 677.208
Access a member portfolio.
# To access portfolio 'H'
pf_h = portfolios['H']
# You can also access the returns of `eval()` as follows.
print('\nPerformance: H')
print(pf_h.performance) # pf_h_perf
print('\nValues: H')
print(pf_h.value) # pf_h_val
# To see the positions (constituents)
print('\nPositions: H')
print(pf_h.position)
elapsed_time('End of Example 6.')
Performance: H
H
mean 0.026
std 0.169
sharpe 0.152
cum 0.325
mdd 0.518
mdd start 2007-10-31
mdd end 2009-02-28
msd 0.402
msd start 2008-05-31
msd end 2008-11-30
turnover 6.289
lposition 677.483
sposition 0.000
Values: H
grossret grossexret val1 val lposition sposition cost tover netret netexret cumret drawdown drawdur drawstart succdown succdur succstart
1927-01-31 -0.002 -0.004 0.998 1.000 121 0.000 0.000 0.000 -0.002 -0.004 NaN NaN NaN NaT NaN NaN NaT
1927-02-28 0.046 0.044 1.044 0.998 123 0.000 0.009 0.415 0.037 0.035 NaN NaN NaN NaT NaN NaN NaT
1927-03-31 0.019 0.016 1.065 1.044 123 0.000 0.016 0.720 0.004 0.002 NaN NaN NaN NaT NaN NaN NaT
1927-04-30 0.019 0.017 1.085 1.065 127 0.000 0.013 0.535 0.007 0.005 NaN NaN NaN NaT NaN NaN NaT
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2023-09-30 -0.062 -0.066 9274.546 9884.677 786 0.000 15.306 0.265 -0.063 -0.068 NaN NaN NaN NaT NaN NaN NaT
2023-10-31 -0.025 -0.030 9039.294 9274.546 771 0.000 20.416 0.393 -0.028 -0.032 NaN NaN NaN NaT NaN NaN NaT
2023-11-30 0.109 0.104 10023.511 9039.294 759 0.000 17.488 0.401 0.107 0.103 NaN NaN NaN NaT NaN NaN NaT
2023-12-31 0.035 0.031 10378.316 10023.511 749 0.000 23.490 0.462 0.033 0.029 NaN NaN NaN NaT NaN NaN NaT
Positions: H
id ret wgt rf exret val val1 val0 cost
date
1927-01-31 10049 -0.030 0.004 0.002 -0.032 0.004 0.003 0.004 0.000
1927-01-31 10102 -0.020 0.002 0.002 -0.023 0.002 0.002 0.002 0.000
1927-01-31 10129 -0.009 0.003 0.002 -0.012 0.003 0.003 0.003 0.000
1927-01-31 10145 0.021 0.024 0.002 0.019 0.024 0.024 0.024 0.000
... ... ... ... ... ... ... ... ...
2023-12-31 93369 -0.005 0.000 0.004 -0.009 1.821 1.811 0.000 0.011
2023-12-31 93374 0.000 0.000 0.000 0.000 0.000 0.000 2.853 0.017
2023-12-31 93427 0.171 0.000 0.004 0.167 2.416 2.830 2.728 0.002
2023-12-31 93429 -0.024 0.001 0.004 -0.028 7.898 7.707 8.944 0.006
Example 4. Defining a new characteristic
Define a new characteristic
This example demonstrates how to define a new characteristic by inheriting CRSPM.
A new characteristic, ‘excess_ret_change’, will be defined, which is the excess return over the market return divided by the one-year average excess return (I have no idea if this has any predictive power).
It is assumed that data has been downloaded from WRDS.
Processing time: approx. 18 seconds.
Initialization
Import packages
from pyanomaly.globals import *
from pyanomaly.characteristics import CRSPM
from pyanomaly.analytics import *
Set log file path and start a timer.
set_log_path(__file__)
start_timer('Example 4')
Define a new characteristic method
Inherit CRSPM and add a new method c_excess_ret_change() that defines the new characteristic.
Method name should be of the form ‘c_[characteristic name]’.
class MyCRSPM(CRSPM):
def c_excess_ret_change(self):
"""Change of excess return over the market return."""
# Make a short (one-line) docstring for description.
# This is displayed when 'show_available_chars()' is called.
cm = self.data
# One-year average excess return
avg_exret = self.rolling(cm.exret - cm.mktrf, 12, 'mean')
# Excess return.
exret = cm.exret - cm.mktrf
# Characteristic
char = exret / avg_exret
return char
Generate the firm characteristic
# Create a MyCRSPM instance.
crspm = MyCRSPM()
# Load crspm.
crspm.load_data()
# Filter data: shrcd in [10, 11, 12]
crspm.filter_data()
# Fill missing months by populating the data.
crspm.populate(method=None)
# Some preprocessing, e.g., creating frequently used variables.
crspm.update_variables()
# Merge crspm with the monthly factors generated earlier.
# Need this as excess_ret_change uses the market return.
crspm.merge_with_factors()
# Display the firm characteristics to be generated.
crspm.show_available_chars()
Notice that excess_ret_change has been added.
beta_60m Market beta (Org, JKP). Fama and MacBeth (1973)
... ...
excess_ret_change Change of excess return over the market return.
... ...
turn Share turnover (Org, GHZ). Datar, Naik, and Radcliffe (1998)
Generate firm characteristics
# Generate excess_ret_change.
# Without any argument, all characteristics defined in CRSPM and MyCRSPM will be generated.
crspm.create_chars('excess_ret_change')
# Some other characteristics defined in CRSPM can also be generated.
crspm.create_chars(['ret_12_1', 'ret_1_0']) # 12M momentum and short-term reversal.
# Check the result
print(crspm[['excess_ret_change', 'ret_12_1', 'ret_1_0']])
excess_ret_change ret_12_1 ret_1_0
date permno
1986-01-31 10000 NaN NaN NaN
1986-02-28 10000 NaN NaN -0.257
1986-03-31 10000 NaN NaN 0.365
1986-04-30 10000 NaN NaN -0.099
... ... ...
2023-09-30 93436 8.384 -0.027 -0.030
2023-10-31 93436 -24.680 0.100 -0.197
2023-11-30 93436 3.204 0.032 0.195
2023-12-31 93436 -0.337 0.949 0.035
# Postprocessing: delete temporary variables and inspect the generated data.
crspm.postprocess()
[2024/02/22 17:53] No. unique dates: 1177, Date range: 1925-12-31 00:00:00 - 2023-12-31 00:00:00
[2024/02/22 17:53] No. unique ids: 29055
[2024/02/22 17:53] Nans, Infs
no. nans percentage no. infs percentage
excess_ret_change 471639 0.116 0 0.000
ret_12_1 467584 0.115 0 0.000
ret_1_0 114462 0.028 0 0.000
[2024/02/22 17:53] Replacing inf with nan...
[2024/02/22 17:53] Descriptive statistics
count mean std min 0.1% 1% 25% 50% 75% 99% 99.9% max
excess_ret_change 3605208.000 3.712 3781.662 -1261260.427 -969.774 -95.979 -1.973 0.945 4.066 98.078 952.120 6076629.096
ret_12_1 3609263.000 0.135 0.773 -1.000 -0.956 -0.834 -0.208 0.049 0.324 2.513 6.902 436.685
ret_1_0 3962385.000 0.011 0.185 -1.000 -0.692 -0.400 -0.066 0.000 0.070 0.579 1.500 24.000
# Save the data (raw data + firm characteristics) to config.output_dir + 'mycrspm'.
# To save only firm characteristics and a few columns of the raw data, use 'other_columns' argument.
crspm.save()
# The saved file can be loaded using crspm.load().
crspm = MyCRSPM().load()
print(crspm[['excess_ret_change', 'ret_12_1', 'ret_1_0']])
elapsed_time('End of Example 4.')
excess_ret_change ret_12_1 ret_1_0
date permno
1986-01-31 10000 NaN NaN NaN
1986-02-28 10000 NaN NaN -0.257
1986-03-31 10000 NaN NaN 0.365
1986-04-30 10000 NaN NaN -0.099
... ... ...
2023-09-30 93436 8.384 -0.027 -0.030
2023-10-31 93436 -24.680 0.100 -0.197
2023-11-30 93436 3.204 0.032 0.195
2023-12-31 93436 -0.337 0.949 0.035
Example 5. Firm characteristics using year-end ME
Firm characteristics using year-end ME
This example demonstrates how to generate funda firm characteristics using year-end ME instead of the latest ME.
All firm characteristics defined in FUNDA are generated.
Only USD-denominated stocks are sampled: currency conversion is unnecessary.
Only funda is used without merging with fundq: merging the two datasets as in Example 1 is also possible.
It is assumed that funda data is available 6 months later.
It is assumed that data has been downloaded from WRDS.
Processing time: approx. 105 seconds.
Initialization
Import packages
from pyanomaly.characteristics import FUNDA, CRSPM
from pyanomaly.analytics import *
Set log file path and start a timer.
set_log_path(__file__)
start_timer('Example 5')
Market equity (from CRSP)
# Load crspm.
crspm = CRSPM()
crspm.load_data()
# Filter data: shrcd in [10, 11, 12]
crspm.filter_data()
# Fill missing months by populating the data.
crspm.populate(method=None)
# Some preprocessing, e.g., creating frequently used variables.
# 'me' and 'me_company' (firm-level me) are generated here.
crspm.update_variables()
FUNDA firm characteristics
# Load funda.
funda = FUNDA()
funda.load_data()
# USD stocks only.
funda.filter(('curcd', '==', 'USD'))
# Populate data to monthly.
# limit=12: Forward fill up to 12 months.
# lag=6: Shift data 6 months, i.e., data is available 6 months later.
funda.populate(MONTHLY, limit=12, lag=6, new_date_col='date')
# Some preprocessing, e.g., creating frequently used variables.
funda.update_variables()
# Add year-end market equity to funda.
funda.add_crsp_me(crspm, method='year_end')
# Generate firm characteristics.
funda.show_available_chars()
funda.create_chars()
# Postprocess and save results to './output/funda_yearend_me'.
funda.postprocess()
funda.save('funda_yearend_me')
Example 6. Playing with Panel
Playing with Panel
This example demonstrates how to use Panel class.
Import packages
from pyanomaly.panel import Panel
from pyanomaly.analytics import *
Monthly data
data1 = pd.DataFrame({
'id': [100, 100, 100, 200, 200, 200, 300, 300, 300],
'date': ['2023-03-31', '2023-04-30', '2023-05-31', '2023-03-31', '2023-04-30', '2023-05-31', '2023-03-31',
'2023-04-30', '2023-05-31'],
'ret': [0.01, 0.05, 0.02, 0.03, 0.11, -0.03, np.nan, 0.03, -0.05],
'me': [100, 120, 110, 5000, 4500, 4700, 2000, 2100, 1900],
'me_nyse': [np.nan, np.nan, np.nan, 5000, 4500, 4700, 2000, 2100, 1900],
})
data1['date'] = pd.to_datetime(data1['date'])
data1 = data1.sort_values(['id', 'date']).set_index(['date', 'id'])
# Instantiate Panel.
panel1 = Panel(data1, freq=MONTHLY)
del data1
print(panel1.data)
ret me me_nyse
date id
2023-03-31 100 0.010 100 NaN
2023-04-30 100 0.050 120 NaN
2023-05-31 100 0.020 110 NaN
2023-03-31 200 0.030 5000 5000.000
2023-04-30 200 0.110 4500 4500.000
2023-05-31 200 -0.030 4700 4700.000
2023-03-31 300 NaN 2000 2000.000
2023-04-30 300 0.030 2100 2100.000
2023-05-31 300 -0.050 1900 1900.000
- Inspect data. Available options are:
‘summary’: Data shape, number of unique dates, and number of unique ids.
‘id_count`: Number of ids per date.
‘nans’: Number of nans and infs per column.
‘stats’: Descriptive statistics. Same as
data.describe().
panel1.inspect_data(option=['summary', 'id_count', 'nans', 'stats'])
[2024/02/23 17:17] Shape: (9, 3)
[2024/02/23 17:17] No. unique dates: 3, Date range: 2023-03-31 00:00:00 - 2023-05-31 00:00:00
[2024/02/23 17:17] No. unique ids: 3
[2024/02/23 17:17] No. ids per date:
date
2023-03-31 3
2023-04-30 3
2023-05-31 3
dtype: int64
[2024/02/23 17:17] Nans, Infs
no. nans percentage no. infs percentage
ret 1 0.111 0 0.000
me 0 0.000 0 0.000
me_nyse 3 0.333 0 0.000
[2024/02/23 17:17] Descriptive statistics
count mean std min 0.1% 1% 25% 50% 75% 99% 99.9% max
ret 8.000 0.021 0.049 -0.050 -0.050 -0.049 0.000 0.025 0.035 0.106 0.110 0.110
me 9.000 2281.111 2017.588 100.000 100.080 100.800 120.000 2000.000 4500.000 4976.000 4997.600 5000.000
me_nyse 6.000 3366.667 1506.873 1900.000 1900.500 1905.000 2025.000 3300.000 4650.000 4985.000 4998.500 5000.000
Quarterly data
data2 = pd.DataFrame({
'id': [100, 100, 100, 200, 200, 200],
'date': ['2023-03-31', '2023-06-30', '2023-09-30', '2023-01-31', '2023-04-30', '2023-07-31'],
'prc': [100, 120, 110, 5000, 4500, 4700],
})
data2['date'] = pd.to_datetime(data2['date'])
data2 = data2.sort_values(['id', 'date']).set_index(['date', 'id'])
# Instantiate Panel.
panel2 = Panel(data2, freq=QUARTERLY)
del data2
print(panel2.data)
prc
date id
2023-03-31 100 100
2023-06-30 100 120
2023-09-30 100 110
2023-01-31 200 5000
2023-04-30 200 4500
2023-07-31 200 4700
Populate data to monthly.
panel2.populate(MONTHLY, method='ffill', limit=3)
print(panel2.data)
prc
date id
2023-03-31 100 100
2023-04-30 100 100
2023-05-31 100 100
2023-06-30 100 120
2023-07-31 100 120
2023-08-31 100 120
2023-09-30 100 110
2023-01-31 200 5000
2023-02-28 200 5000
2023-03-31 200 5000
2023-04-30 200 4500
2023-05-31 200 4500
2023-06-30 200 4500
2023-07-31 200 4700
Id-level operations
The methods below perform the operation per id and accounting for the data frequency.
For instance, if a data is a quarterly data populated monthly, shift(1) will shift the data by 3 (one quarter)
within each id.
Shift panel1.data by one month.
data1_lag1 = panel1.shift(n=1)
print(data1_lag1)
ret me me_nyse
date id
2023-03-31 100 NaN NaN NaN
2023-04-30 100 0.010 100.000 NaN
2023-05-31 100 0.050 120.000 NaN
2023-03-31 200 NaN NaN NaN
2023-04-30 200 0.030 5000.000 5000.000
2023-05-31 200 0.110 4500.000 4500.000
2023-03-31 300 NaN NaN NaN
2023-04-30 300 NaN 2000.000 2000.000
2023-05-31 300 0.030 2100.000 2100.000
Shift panel2.data by one quarter.
data2_lag1 = panel2.shift(n=1)
print(data2_lag1)
prc
date id
2023-03-31 100 NaN
2023-04-30 100 NaN
2023-05-31 100 NaN
2023-06-30 100 100.000
2023-07-31 100 100.000
2023-08-31 100 100.000
2023-09-30 100 120.000
2023-01-31 200 NaN
2023-02-28 200 NaN
2023-03-31 200 NaN
2023-04-30 200 5000.000
2023-05-31 200 5000.000
2023-06-30 200 5000.000
2023-07-31 200 4500.000
Shift panel1.data['ret'] by 2 months.
The two lines below are equivalent.
ret_lag2 = panel1.shift(panel1['ret'], 2)
ret_lag2 = panel1.shift('ret', 2)
print(ret_lag2)
date id
2023-03-31 100 NaN
2023-04-30 100 NaN
2023-05-31 100 0.010
2023-03-31 200 NaN
2023-04-30 200 NaN
2023-05-31 200 0.030
2023-03-31 300 NaN
2023-04-30 300 NaN
2023-05-31 300 NaN
Quarterly price difference. The two lines below are equivalent.
prc_diff = panel2.diff(panel2.data['prc'], 1)
prc_diff = panel2.diff('prc', 1)
print(prc_diff)
date id
2023-03-31 100 NaN
2023-04-30 100 NaN
2023-05-31 100 NaN
2023-06-30 100 20.000
2023-07-31 100 20.000
2023-08-31 100 20.000
2023-09-30 100 -10.000
2023-01-31 200 NaN
2023-02-28 200 NaN
2023-03-31 200 NaN
2023-04-30 200 -500.000
2023-05-31 200 -500.000
2023-06-30 200 -500.000
2023-07-31 200 200.000
Quarterly price percentage change. The two lines below are equivalent.
prc_change = panel2.pct_change(panel2.data['prc'], 1)
prc_change = panel2.pct_change('prc', 1)
print(prc_change)
date id
2023-03-31 100 NaN
2023-04-30 100 NaN
2023-05-31 100 NaN
2023-06-30 100 0.200
2023-07-31 100 0.200
2023-08-31 100 0.200
2023-09-30 100 -0.083
2023-01-31 200 NaN
2023-02-28 200 NaN
2023-03-31 200 NaN
2023-04-30 200 -0.100
2023-05-31 200 -0.100
2023-06-30 200 -0.100
2023-07-31 200 0.044
Two-period cumulative returns.
cum_ret = panel1.cumret('ret', 2)
print(cum_ret)
date id
2023-03-31 100 NaN
2023-04-30 100 0.060
2023-05-31 100 0.071
2023-03-31 200 NaN
2023-04-30 200 0.143
2023-05-31 200 0.077
2023-03-31 300 NaN
2023-04-30 300 NaN
2023-05-31 300 -0.021
One-period ahead returns.
fut_ret = panel1.futret('ret', 1)
print(fut_ret)
date id
2023-03-31 100 0.050
2023-04-30 100 0.020
2023-05-31 100 NaN
2023-03-31 200 0.110
2023-04-30 200 -0.030
2023-05-31 200 NaN
2023-03-31 300 0.030
2023-04-30 300 -0.050
2023-05-31 300 NaN
Rolling-apply a function to each id. Supported functions: ‘sum’, ‘mean’, ‘std’, ‘var’.
rolling_sum = panel1.rolling(panel1[['ret', 'me']], 2, 'sum', min_n=1)
print(rolling_sum)
ret me
date id
2023-03-31 100 0.010 100.000
2023-04-30 100 0.060 220.000
2023-05-31 100 0.070 230.000
2023-03-31 200 0.030 5000.000
2023-04-30 200 0.140 9500.000
2023-05-31 200 0.080 9200.000
2023-03-31 300 NaN 2000.000
2023-04-30 300 0.030 4100.000
2023-05-31 300 -0.020 4000.000
To apply a user-defined function to each id, use apply_to_ids().
Below is the same as the above except that the return value is ndarray.
rolling_sum = panel1.apply_to_ids(panel1[['ret', 'me']], roll_sum, None, 2, 1)
print(rolling_sum)
[[ 1.0e-02 1.0e+02]
[ 6.0e-02 2.2e+02]
[ 7.0e-02 2.3e+02]
[ 3.0e-02 5.0e+03]
[ 1.4e-01 9.5e+03]
[ 8.0e-02 9.2e+03]
[ nan 2.0e+03]
[ 3.0e-02 4.1e+03]
[-2.0e-02 4.0e+03]]
Date-level operations
To apply a user-defined function to each date, use apply_to_dates().
# Functino to apply: sum of each column.
def sum0(data):
return np.sum(data, axis=0).reshape(1, -1)
column_sum = panel1.apply_to_dates(panel1[['ret', 'me']], sum0, None)
print(column_sum)
[[ nan 7.10e+03]
[ 1.90e-01 6.72e+03]
[-6.00e-02 6.71e+03]]
Rolling regression
The first column of the first argument is used as the dependent variable and the rest as the independent variables. First K (number of independent varialbes) columns of the return are coefficients, the next column is R-squared, and the last column is idiosyncratic volatility (standard deviation of residuals).
retval = panel1.rolling_regression(panel1[['ret', 'me']], 3, min_n=2, add_const=True)
retval = pd.DataFrame(retval, columns=['alpha', 'me', 'R2', 'Idio vol'], index=panel1.data.index)
print(retval)
alpha me R2 Idio vol
date id
2023-03-31 100 NaN NaN NaN NaN
2023-04-30 100 -0.190 0.002 1.000 0.000
2023-05-31 100 -0.193 0.002 0.923 0.006
2023-03-31 200 NaN NaN NaN NaN
2023-04-30 200 0.830 -0.000 1.000 0.000
2023-05-31 200 0.659 -0.000 0.222 0.062
2023-03-31 300 NaN NaN NaN NaN
2023-04-30 300 NaN NaN NaN NaN
2023-05-31 300 -0.810 0.000 1.000 0.000
Panel merge
Two panels can be merged as follows.
panel1.merge(panel2, on=['date', 'id'], how='left')
print(panel1.data)
ret me me_nyse prc
date id
2023-03-31 100 0.010 100 NaN 100.000
2023-04-30 100 0.050 120 NaN 100.000
2023-05-31 100 0.020 110 NaN 100.000
2023-03-31 200 0.030 5000 5000.000 5000.000
2023-04-30 200 0.110 4500 4500.000 4500.000
2023-05-31 200 -0.030 4700 4700.000 4500.000
2023-03-31 300 NaN 2000 2000.000 NaN
2023-04-30 300 0.030 2100 2100.000 NaN
2023-05-31 300 -0.050 1900 1900.000 NaN
Example 7. Table download
Table download
This example demonstrates how to download a new table from WRDS. Different ways of downloading comp.secm table are demonstrated.
The table will be saved to ‘./input/comp/secm’.
Initialization
Import packages
from pyanomaly.globals import *
from pyanomaly.wrdsdata import WRDS
Set log file path and start a timer.
set_log_path(__file__)
start_timer('Example 7')
WRDS instantiation
References: WRDS.
wrds = WRDS(wrds_usename)
Data download
Method 1: Download the entire table at once.
wrds.download_table('comp', 'secm', date_cols=['datadate']) # 'datadate's type will be converted to datetime.
Method 2: Download the entire table asynchronously.
Download data every interval years. For a small size data, this can be slower than download_table().
wrds.download_table_async('comp', 'secm', date_col='datadate', interval=5)
Method 3: Download only some fields.
sql = 'datadate, gvkey, cshoq, prccm' # fields to select
wrds.download_table_async('comp', 'secm', sql=sql, date_col='datadate')
Method 4: Download data using a complete query statement.
Below is equivalent to the above. Note that the query statement must contain ‘WHERE [date_col] BETWEEN {} and {}’.
sql = f"""
SELECT datadate, gvkey, cshoq, prccm
FROM comp.secm
WHERE datadate between '{{}}' and '{{}}'
"""
wrds.download_table_async('comp', 'secm', sql=sql, date_cols=['datadate'])