Cookbook

All the examples presented here can be found in examples. For the details of each function or class, refer to the API documentation. All examples were tested on a desktop with 12th Gen Intel(R) Core(TM) i9-12900KS and 128GB RAM and 1GB network

Example 1. Generating firm characteristics

Firm characteristics generation

This example shows the full process of i) data download, ii) factor generation, and iii) firm characteristic generation. In this example,

  • Funda is merged with fundq to create quarterly updated annual accounting data.

  • Funda and fundq are assumed to be available 4 months later.

  • The latest market equity (me) is used when generating firm characteristics.

  • All firm characteristics defined in PyAnomaly are generated.

Generated factors are saved to ‘./output/factors_monthly’ and ‘./output/factors_daily’, and firm characteristics are saved to ‘./output/merge’.

Processing time: approx. 45 minutes to download data and 15 minutes for the rest.

Initialization

Import packages

from pyanomaly.globals import *
from pyanomaly.wrdsdata import WRDS
from pyanomaly.characteristics import FUNDA, FUNDQ, CRSPM, CRSPD, Merge
from pyanomaly.factors import make_all_factor_portfolios
from pyanomaly.analytics import *

Set log file path and start a timer.

# Set log file path as config.log_dir + this module name.
set_log_path(__file__)  # You can also give a different log file name: e.g., set_log_path('logfile.log')

# Start timer.
start_timer('Example 1.')

# To replicate JKP's SAS code as closely as possible, uncomment the following line.
# set_config(replicate_jkp=True)

Download data frm WRDS

WRDS is a class to download data from WRDS.

wrds = WRDS(wrds_username)  # Use your WRDS user id.

# Download all necessary data asynchronously.
# If network is slow or memory is insufficient, set run_in_executer=False.
wrds.download_all(run_in_executer=True)

# Create crspm(d) from m(d)sf and m(d)seall and add gvkey to them.
wrds.preprocess_crsp()

Factor generation

Make factor portfolios: FF3, FF5, HXZ4, and SY4. These are used in some firm characteristics, such as the residual momentum.

References: make_all_factor_portfolios(), make_factor_portfolios()

make_all_factor_portfolios(monthly=True, daily=True)  # Monthly and daily factors

Firm characteristic generation

Firm characteristics are generated by FUNDA, FUNDQ, CRSPM, CRSPD, and Merge classes.

To generate characteristics in a certain column of mapping.xlsx, set alias = column name. For instance, alias = 'jkp' will generated firm characteristics defined in ‘jkp’ column. Setting alias = None will generate all available firm characteristics. Note that some firm characteristics will be created twice using different implementations.

alias = None
# Start date ('yyyy-mm-dd'). Set to None to create characteristics from as early as possible.
sdate = None

Generate firm characteristics from crspm.

crspm = CRSPM(alias=alias)

# Load crspm.
crspm.load_data(sdate)

# Filter data: shrcd in [10, 11, 12]
crspm.filter_data()

# Fill missing months by populating the data.
crspm.populate(method=None)

# Some preprocessing, e.g., creating frequently used variables.
crspm.update_variables()

# Merge crspm with the monthly factors generated earlier.
crspm.merge_with_factors()

# Display the firm characteristics to be generated.
crspm.show_available_chars()

# Create characteristics.
crspm.create_chars()

# Postprocessing: delete temporary variables and inspect the generated data.
crspm.postprocess()

Generate firm characteristics from crspd.

crspd = CRSPD(alias=alias)

# Load crspd.
crspd.load_data(sdate)

# Filter data: shrcd in [10, 11, 12]
crspd.filter_data()

# Some preprocessing, e.g., creating frequently used variables.
crspd.update_variables()

# Merge crspd with the daily factors generated earlier.
crspd.merge_with_factors()

# Display the firm characteristics to be generated.
crspd.show_available_chars()

# Create characteristics.
crspd.create_chars()

# Postprocessing: delete temporary variables and inspect the generated data.
crspd.postprocess()

# Delete raw crspd data to save memory.
del crspd.cd

Generate firm characteristics from fundq.

fundq = FUNDQ(alias=alias)

# Load fundq.
fundq.load_data(sdate)

# fundq has some duplicates (same datedate/gvkey). Drop duplicates.
fundq.remove_duplicates()

# Convert values in another currency (currently only CAD) to USD.
fundq.convert_currency()

# Populate data to monthly.
# limit=3: Forward fill up to 3 months.
# lag=4: Shift data 4 months, i.e., data is available 4 months later.
# new_date_col='date': Populated data has index 'date'/'gvkey', and 'datadate' becomes a column.
fundq.populate(MONTHLY, limit=3, lag=4, new_date_col='date')

# Make quarterly variables from ytd variables and use them to fill missing quarterly variables.
fundq.create_qitems_from_yitems()

# Some preprocessing, e.g., creating frequently used variables.
fundq.update_variables()

# Display the firm characteristics to be generated.
fundq.show_available_chars()

# Create characteristics.
fundq.create_chars()

# Postprocessing: delete temporary variables and inspect the generated data.
fundq.postprocess()

Generate firm characteristics from funda.

funda = FUNDA(alias=alias)

# Load funda.
funda.load_data(sdate)

# Convert values in another currency (currently only CAD) to USD.
funda.convert_currency()

# Populate data to monthly.
# limit=12: Forward fill up to 12 months.
# lag=4: Shift data 4 months, i.e., data is available 4 months later.
# new_date_col='date': Populated data has index 'date'/'gvkey', and 'datadate' becomes a column.
funda.populate(MONTHLY, limit=12, lag=4, new_date_col='date')

# Generate quarterly-updated funda data from fundq and merge them with funda.
funda.merge_with_fundq(fundq)

# Some preprocessing, e.g., creating frequently used variables.
funda.update_variables()

# Add the market equity of crspm to funda.
funda.add_crsp_me(crspm)

# Display the firm characteristics to be generated.
funda.show_available_chars()

# Create characteristics.
funda.create_chars()

# Postprocessing: delete temporary variables and inspect the generated data.
funda.postprocess()

Merge FUNDA, FUNDQ, CRSPM, and CRSPD and generate firm characteristics that require data from multiple data sources. Merge.data contains all firm characteristics and raw data from FUNDA, FUNDQ, CRSPM, AND CRSPD.

merge = Merge(alias)

# Merge all data together.
merge.preprocess(crspm, crspd, funda, fundq)

# Display the firm characteristics to be generated.
merge.show_available_chars()

# Create characteristics.
merge.create_chars()

# Postprocessing: delete temporary variables and inspect the generated data.
merge.postprocess()

# Save the results to config.output_dir + 'merge'.
# The default behavior of Panel.save() is to save all the columns of Panel.data.
# You can reduce file size by choosing columns to save. The following code saves firm characteristics + columns.
columns = ['gvkey', 'datadate', 'primary', 'exchcd', 'me', 'ret', 'exret', 'rf']
merge.save(other_columns=columns)

elapsed_time('End of Example 1.')

Example 2. Sorting-based portfolio analysis

Sorting-based portfolio analysis.

This example demonstrates how to construct quantile portfolios and carry out 1-D or 2-D sorts.

  1. Quintile portfolios will be generated from the 12-month momentum and their mean returns and t-values will be computed.

  2. 5x5 2-D sort will be conducted on momentum and size.

  3. Firm-level cross-sectional regression will be conducted.

  4. Factor regressions will be carried out using FF 3-factors.

  • It is assumed that the firm characteristic has been created and saved to config.output_dir + ‘crspm’.

See also make_char_portfolios() for characteristic portfolio creation replicating JKP’s SAS code.

Processing time: approx. 10 seconds.

Initialization

Import packages

from pyanomaly.fileio import read_from_file
from pyanomaly.analytics import *

Set log file path and start a timer.

set_log_path(__file__)
start_timer('Example 2')

# Jitted functions are slow when first called due to compile time. This example runs faster by disabling jit.
set_config(disable_jit=True)

Set variables

char = 'ret_12_1'  # 12-month momentum
char_class = char + '_class'  # Momentum class column
ret_col = 'exret'  # Return column
weight_col = 'me'  # Market equity column
split = 5  # Qintile. You can also do split = [0.2, 0.4, 0.6, 0.8].
labels = ['H', 2, 3, 4, 'L', 'H-L']  # Momentum portfolio names: 'H-L' for long short.

Load and prepare data

References: datatools.shift(), filter().

# Read crspm data (not raw data but the output data).
data = read_from_file('crspm')

# Shift data except for return so that all the variables at t are as of t-1 and the return is over t-1 to t.
data = shift(data, 1, excl_cols=[ret_col])

# Drop nan returns and weights.
data = data.dropna(subset=[ret_col, weight_col])

# If you want to exclude bottom 20% based on NYSE size...
log(f'crspm shape before size filtering: {data.shape}')
data['nyse_me'] = np.where(data.exchcd == 1, data[weight_col], np.nan)  # NYSE me
data = filter(data, weight_col, (0.2, None), ginfo='date', by='nyse_me')
log(f'crspm shape after size filtering: {data.shape}')
[2024/02/22 14:52] crspm shape before size filtering: (3958767, 76)
[2024/02/22 14:52] crspm shape after size filtering: (1752862, 77)

1-D sort

References: classify(), one_dim_sort(), time_series_average()

elapsed_time('1D sort start.')

# Classify stocks on each date on momentum. class 0: highest momentum, 5: lowest momentum.
data[char_class] = classify(data[char], split, ascending=False, ginfo='date')

# Make value-weighted quantile portfolios.
# Set weight_col to None for equal-weight portfolios.
# qpfs will have index = date/char_class and columns = [ret_col]
qpfs = one_dim_sort(data, char_class, ret_col, weight_col=weight_col)

# You can add more columns to the portfolios. Here, I add signal (mean of the characteristic) and
# n_firms (number of firms).
qpfs[['signal', 'n_firms']] = one_dim_sort(data, char_class, char, function=['mean', 'count'])

# If you want to see the average size and price of the portfolios...
qpfs[['size', 'prc']] = one_dim_sort(data, char_class, [weight_col, 'prc'])

# Rename the portfolios. Without this, the names are 0, 1, ...
relabel_class(qpfs, labels)

# Check the result.
print(f'Quantile portfolios sorted on {char}')
print(qpfs)
Quantile portfolios sorted on ret_12_1
                       exret  signal  n_firms      size      prc
date       ret_12_1_class
1927-01-31 H              -0.008   0.479       72    83.725   73.075
           2               0.006   0.145       72   137.217   92.345
           3               0.001   0.010       71    85.660   81.256
           4              -0.004  -0.107       72    53.240   54.126
           L              -0.007  -0.333       72    24.329   34.560
           H-L            -0.001   0.812        0    59.396   38.515
1927-02-28 H               0.055   0.447       74    93.941   81.642
                          ...     ...      ...       ...      ...
           H-L             0.038   1.041        0 20643.687   81.385
2023-12-31 H               0.040   0.575      449 33467.685  119.837
           2               0.035   0.083      449 31027.949 1333.695
           3               0.045  -0.067      449 22789.652  101.212
           4               0.074  -0.207      449 14269.428   64.336
           L               0.118  -0.425      450  5509.468   35.076
           H-L            -0.077   1.000       -1 27958.216   84.760
# Calculate the time-series mean and t-value of the returns of the quantile portfolios.
mean, tval = time_series_average(qpfs[ret_col], cov_type='HAC', cov_kwds={'maxlags': 12})
print(f'\n Mean of portfolio returns sorted on {char}')
print(pd.concat([mean, tval], axis=1))

elapsed_time('1D sort end.')
 Mean of portfolio returns sorted on ret_12_1
                mean  t-val
ret_12_1_class
H              0.011  5.503
2              0.008  4.899
3              0.006  3.727
4              0.006  3.269
L              0.003  1.478
H-L            0.007  4.335

2-D sort

References: two_dim_sort()

elapsed_time('2D sort start.')

# Make 5x5 portfolios on momentum and size.
char2 = 'me'  # Market equity
char_class2 = char2 + '_class'  # Size class column
split2 = 5  # Quintiles
labels2 = ['S', 2, 3, 4, 'B', 'S-B']  # Size portfolio names

# Classify stocks on each date on size. class 0: smallest, 5: biggest.
data[char_class2] = classify(data[char2], split2, ascending=True, ginfo='date')

# Make 5x5 value-weighted portfolios.
# qpfs2 will have index = date/char_class and columns = char_class2.
qpfs2 = two_dim_sort(data, char_class, char_class2, ret_col, weight_col=weight_col, output_dim=2)

# Rename the portfolios.
relabel_class(qpfs2, labels)
relabel_class(qpfs2, labels2, axis=1)

# Calculate the time-series mean and t-value of the returns of the quantile portfolios.
mean, tval = time_series_average(qpfs2, cov_type='HAC', cov_kwds={'maxlags': 12})
print(f'\nMean of portfolio returns sorted on {char} and {char2}')
print('\nMean')
print(mean)
print('\nt-stat')
print(tval)

elapsed_time('2D sort end.')
Mean of portfolio returns sorted on ret_12_1 and me

Mean
me_class           S     2     3     4     B    S-B
ret_12_1_class
H              0.014 0.014 0.014 0.012 0.010  0.004
2              0.012 0.010 0.009 0.009 0.008  0.004
3              0.011 0.009 0.008 0.008 0.006  0.005
4              0.009 0.008 0.007 0.006 0.006  0.004
L              0.003 0.004 0.004 0.004 0.003 -0.000
H-L            0.011 0.010 0.009 0.008 0.007  0.004

t-stat
me_class           S     2     3     4     B    S-B
ret_12_1_class
H              5.564 6.146 6.057 5.805 5.101  2.633
2              4.923 4.948 4.532 5.241 4.778  2.738
3              4.462 4.504 4.023 4.115 3.538  3.278
4              3.569 3.490 3.192 3.348 3.298  2.411
L              1.012 1.491 1.540 1.707 1.394 -0.111
H-L            6.516 5.822 5.286 4.877 3.885  2.873

Firm-level cross-sectional regression

References: crosssectional_regression()

  • Cross-sectionally regress return at t on t-1 momentum and size.

  • Take the time-series average of the coefficients and examine their significance.

elapsed_time('CS regression start.')

y_col = ret_col
X_cols = [char, char2]

# Run the regressions. Output: mean coefficients, t-values, and coefficient time-series.
mean, tval, coef = crosssectional_regression(data, y_col, X_cols, add_constant=True, cov_type='HAC',
                                            cov_kwds={'maxlags': 12})

print('\nMean of the coefficients')
print(pd.concat([mean, tval], axis=1))
print('\nCoefficients time-series')
print(coef)

elapsed_time('CS regression end.')
Mean of the coefficients
           mean  t-stat
const     0.006   3.273
ret_12_1  0.008   4.742
me       -0.000  -1.627

Coefficients time-series
            const  ret_12_1     me
1927-01-31  0.012     0.009 -0.000
1927-02-28  0.059    -0.011 -0.000
1927-03-31 -0.029     0.055  0.000
1927-04-30 -0.012     0.050  0.000
           ...       ...    ...
2023-09-30 -0.063     0.005  0.000
2023-10-31 -0.072     0.004  0.000
2023-11-30  0.090     0.018 -0.000
2023-12-31  0.105     0.000 -0.000

Factor regression

Regress quantile portfolios’ returns on the FF-3 factors.

elapsed_time('Factor regression start.')

# Sample period.
sdate = '1952-01'
edate = '2020-12'

qpfs = qpfs[sdate:edate]

# Read factors.
# Here, I use the factors generated by PyAnomaly. You can use different sources if you like.
factors = read_from_file(config.factors_monthly_fname)
# factors = WRDS.read_data('factors_monthly')  # factors from K.French website.

# X: Constant + FF3.
X = factors.loc[sdate:edate, ['mktrf', 'smb_ff', 'hml']].values
# X = factors.loc[sdate:edate, ['mktrf', 'smb', 'hml']].values  # If you use K. French data.
X = sm.add_constant(X)

# Run the regression for each portfolio.
result = {}
for pf in labels:
    # y: Returns of a quantile portfolio
    y = qpfs.loc[(slice(None), pf), ret_col].values

    model = sm.OLS(y, X).fit()

    # Make an output table.
    result[pf] = pd.DataFrame({'coefs': model.params, 'tval': model.tvalues}, index=['const', 'mktrf', 'smb', 'hml'])

result = pd.concat(result, axis=1)
print('\nFactor regression: FF3')
print(result)

elapsed_time('Factor regression end.')
Factor regression: FF3
           H              2             3             4             L           H-L
       coefs    tval  coefs   tval  coefs   tval  coefs   tval  coefs   tval  coefs    tval
const  0.005   6.662  0.002  3.585  0.000  0.258 -0.002 -3.292 -0.007 -7.505  0.011   8.052
mktrf  1.013  58.519  0.956 85.346  0.941 93.837  1.000 83.939  1.220 58.471 -0.207  -6.128
smb    0.191   7.170 -0.116 -6.736 -0.098 -6.386 -0.043 -2.377  0.233  7.265 -0.042  -0.813
hml   -0.494 -23.610 -0.116 -8.538  0.082  6.757  0.294 20.425  0.568 22.526 -1.062 -26.020

Example 3. Quantile portfolio construction and evaluation

Quantile portfolio construction and evaluation.

This example demonstrates how to construct quantile portfolios (and the long-short) based on a firm characteristic and evaluate their performance.

  1. Tercile portfolios will be generated from the 12-month momentum.

  2. Different ways of setting the transaction cost will be demonstrated.

  • It is assumed that the firm characteristic has been created and saved to config.output_dir + ‘crspm’.

Processing time: approx. 10 seconds.

Initialization

Import packages

import matplotlib.pyplot as plt

from pyanomaly.globals import *
from pyanomaly.fileio import read_from_file
from pyanomaly.tcost import TransactionCost, TimeVaryingCost
from pyanomaly.analytics import *
from pyanomaly.wrdsdata import WRDS

Set log file path and start a timer.

set_log_path(__file__)
start_timer('Example 3')

# Jitted functions are slow when first called due to compile time. This example runs faster by disabling jit.
set_config(disable_jit=True)

Set variables

char = 'ret_12_1'  # 12-month momentum
char_class = char + '_class'  # Momentum class column
ret_col = 'exret'  # Return column
weight_col = 'me'  # Market equity column
split = 3  # Tercile. You can also do split = [0.3, 0.7] for 3:4:3 split.
labels = ['H', 'M', 'L', 'H-L']  # Portfolio names: 'H-L' for long short.

Load and prepare data

# Read crspm data (not raw data but the output data).
data = read_from_file('crspm')

# NYSE me
data['nyse_me'] = np.where(data.exchcd == 1, data[weight_col], np.nan)

# Delete unnecessary columns to save memory (optional).
keep_columns(data, [char, ret_col, weight_col, 'nyse_me'])

# Shift data except for return so that all the variables at t are as of t-1 and the return is over t-1 to t.
data = shift(data, 1, excl_cols=[ret_col])

# Risk-free rates.
# Set `month_end=True` since the crspm date has also been shifted to month end.
rf = WRDS.get_risk_free_rate(month_end=True)

Transaction costs

PyAnomaly provides a powerful way to set transaction costs via TransactionCost and TimeVaryingCost classes. Below are some examples of setting transaction costs.

# No transaction costs.
costfcn = None

# Transaction cost of 30 basis points.
costfcn = 0.003

# 20 (30) bps when buying (selling).
costfcn = TransactionCost(buy_linear=0.002, sell_linear=0.002)

In this example, we assume transaction cost varies over time and with size, following Brandt et al. (2009).

costfcn = TimeVaryingCost(data[weight_col])

Make tercile portfolios

In PyAnomaly, Portfolio class is used to make a portfolio and Portfolios class is used to handle a group of portfolios. make_quantile_portfolios() uses these classes to construct quantile portfolios.

# Exclude bottom 20% based on NYSE size. This should come after setting the time varying cost
# because `TimeVaryingCost` requires all firms' me as the input.
data = filter(data, weight_col, (0.2, None), by='nyse_me', ginfo='date')

# Classify data on char. The highest momentum stock will be labeled 0 and the lowest 2.
data[char_class] = classify(data[char], split, ascending=False, ginfo='date')

portfolios = make_quantile_portfolios(data, char_class, ret_col, weight_col, rf=rf, costfcn=costfcn, names=labels)

Evaluate the portfolios

Portfolios can be evaluated using Portfolio.eval() for a single portfolio or Portfolios.eval() for a group of portfolios.

# `annualize_factor=12` will annualize the results as crspm is monthly data.
pfperfs, pfvals = portfolios.eval(annualize_factor=12)

# Performance metrics.
print('\nPerformance')
print(pfperfs)

# Time-series of portfolio value, return, ...
print('\nValues')
print(pfvals)
Performance
                    H           M           L         H-L
mean           -0.006      -0.080      -0.079      -0.135
std             0.195       0.191       0.247       0.167
sharpe         -0.029      -0.419      -0.319      -0.809
cum             0.692      -6.404      -7.451     -14.777
mdd             0.877       1.000       1.000       1.000
mdd start  1929-08-31  1929-08-31  1928-04-30  1932-06-30
mdd end    1984-07-31  2009-02-28  2009-02-28  2023-12-31
msd             0.509       0.557       0.618       0.783
msd start  2008-05-31  1973-09-30  1932-02-29  1932-06-30
msd end    2009-02-28  1974-09-30  1932-06-30  1932-09-30
turnover        5.592       7.716       6.272      12.238
lposition     472.351     471.740     472.158     472.351
sposition       0.000       0.000       0.000     472.158

Values
           grossret grossexret      val1       val lposition sposition   cost tover ...
                  H          H         H         H         H         H      H     H ...
1927-01-31   -0.002     -0.004     0.998     1.000       121     0.000  0.000 0.000 ...
1927-02-28    0.046      0.044     1.044     0.998       123     0.000  0.009 0.415 ...
1927-03-31    0.019      0.016     1.065     1.044       123     0.000  0.016 0.720 ...
1927-04-30    0.019      0.017     1.085     1.065       127     0.000  0.013 0.535 ...
             ...        ...       ...       ...       ...       ...    ...   ...    ...
2023-09-30   -0.062     -0.066  9274.546  9884.677       786     0.000 15.306 0.265 ...
2023-10-31   -0.025     -0.030  9039.294  9274.546       771     0.000 20.416 0.393 ...
2023-11-30    0.109      0.104 10023.511  9039.294       759     0.000 17.488 0.401 ...
2023-12-31    0.035      0.031 10378.316 10023.511       749     0.000 23.490 0.462 ...

Plot cumulative returns.

# Plot cumulative returns.
pfvals['cumret'].plot()
plt.show()

# Plot selected portfolios' cumulative returns.
pfvals['cumret'][['H', 'L', 'H-L']].plot()
plt.show()
example3_plot1.png example3_plot2.png

Evaluate the portfolios for a sub-period and ignoring transaction costs.

pfperfs1, pfvals1 = portfolios.eval(sdate='2001-01-01', edate='2010-12-31', annualize_factor=12,
                                    return_type='gross')
print('\nPerformance: 2001-01 - 2010-12')
print(pfperfs1)

# Plot cumulative returns.
pfvals1['cumret'][['H', 'L', 'H-L']].plot()
plt.show()
Performance: 2001-01 - 2010-12
                    H           M           L         H-L
mean            0.026      -0.001      -0.001       0.027
std             0.169       0.149       0.256       0.200
sharpe          0.152      -0.005      -0.005       0.134
cum             0.325       0.093      -0.126       0.057
mdd             0.518       0.462       0.689       0.450
mdd start  2007-10-31  2007-10-31  2001-01-31  2008-06-30
mdd end    2009-02-28  2009-02-28  2009-02-28  2010-01-31
msd             0.402       0.259       0.404       0.396
msd start  2008-05-31  2008-08-31  2008-08-31  2009-02-28
msd end    2008-11-30  2008-11-30  2008-11-30  2009-05-31
turnover        6.289       7.606       6.150      12.898
lposition     677.483     676.883     677.208     677.483
sposition       0.000       0.000       0.000     677.208
example3_plot3.png

Access a member portfolio.

# To access portfolio 'H'
pf_h = portfolios['H']

# You can also access the returns of `eval()` as follows.
print('\nPerformance: H')
print(pf_h.performance)  # pf_h_perf
print('\nValues: H')
print(pf_h.value)  # pf_h_val

# To see the positions (constituents)
print('\nPositions: H')
print(pf_h.position)

elapsed_time('End of Example 6.')
Performance: H
                    H
mean            0.026
std             0.169
sharpe          0.152
cum             0.325
mdd             0.518
mdd start  2007-10-31
mdd end    2009-02-28
msd             0.402
msd start  2008-05-31
msd end    2008-11-30
turnover        6.289
lposition     677.483
sposition       0.000

Values: H
            grossret  grossexret      val1       val  lposition  sposition   cost  tover  netret  netexret  cumret  drawdown  drawdur drawstart  succdown  succdur succstart
1927-01-31    -0.002      -0.004     0.998     1.000        121      0.000  0.000  0.000  -0.002    -0.004     NaN       NaN      NaN       NaT       NaN      NaN       NaT
1927-02-28     0.046       0.044     1.044     0.998        123      0.000  0.009  0.415   0.037     0.035     NaN       NaN      NaN       NaT       NaN      NaN       NaT
1927-03-31     0.019       0.016     1.065     1.044        123      0.000  0.016  0.720   0.004     0.002     NaN       NaN      NaN       NaT       NaN      NaN       NaT
1927-04-30     0.019       0.017     1.085     1.065        127      0.000  0.013  0.535   0.007     0.005     NaN       NaN      NaN       NaT       NaN      NaN       NaT
              ...         ...       ...       ...        ...        ...    ...    ...     ...       ...     ...       ...      ...       ...       ...      ...       ...
2023-09-30    -0.062      -0.066  9274.546  9884.677        786      0.000 15.306  0.265  -0.063    -0.068     NaN       NaN      NaN       NaT       NaN      NaN       NaT
2023-10-31    -0.025      -0.030  9039.294  9274.546        771      0.000 20.416  0.393  -0.028    -0.032     NaN       NaN      NaN       NaT       NaN      NaN       NaT
2023-11-30     0.109       0.104 10023.511  9039.294        759      0.000 17.488  0.401   0.107     0.103     NaN       NaN      NaN       NaT       NaN      NaN       NaT
2023-12-31     0.035       0.031 10378.316 10023.511        749      0.000 23.490  0.462   0.033     0.029     NaN       NaN      NaN       NaT       NaN      NaN       NaT

Positions: H
               id    ret   wgt    rf  exret     val    val1    val0  cost
date
1927-01-31  10049 -0.030 0.004 0.002 -0.032   0.004   0.003   0.004 0.000
1927-01-31  10102 -0.020 0.002 0.002 -0.023   0.002   0.002   0.002 0.000
1927-01-31  10129 -0.009 0.003 0.002 -0.012   0.003   0.003   0.003 0.000
1927-01-31  10145  0.021 0.024 0.002  0.019   0.024   0.024   0.024 0.000
           ...    ...   ...   ...    ...     ...     ...     ...   ...
2023-12-31  93369 -0.005 0.000 0.004 -0.009   1.821   1.811   0.000 0.011
2023-12-31  93374  0.000 0.000 0.000  0.000   0.000   0.000   2.853 0.017
2023-12-31  93427  0.171 0.000 0.004  0.167   2.416   2.830   2.728 0.002
2023-12-31  93429 -0.024 0.001 0.004 -0.028   7.898   7.707   8.944 0.006

Example 4. Defining a new characteristic

Define a new characteristic

This example demonstrates how to define a new characteristic by inheriting CRSPM.

A new characteristic, ‘excess_ret_change’, will be defined, which is the excess return over the market return divided by the one-year average excess return (I have no idea if this has any predictive power).

  • It is assumed that data has been downloaded from WRDS.

Processing time: approx. 18 seconds.

Initialization

Import packages

from pyanomaly.globals import *
from pyanomaly.characteristics import CRSPM
from pyanomaly.analytics import *

Set log file path and start a timer.

set_log_path(__file__)
start_timer('Example 4')

Define a new characteristic method

Inherit CRSPM and add a new method c_excess_ret_change() that defines the new characteristic. Method name should be of the form ‘c_[characteristic name]’.

class MyCRSPM(CRSPM):

    def c_excess_ret_change(self):
        """Change of excess return over the market return."""
        # Make a short (one-line) docstring for description.
        # This is displayed when 'show_available_chars()' is called.

        cm = self.data

        # One-year average excess return
        avg_exret = self.rolling(cm.exret - cm.mktrf, 12, 'mean')

        # Excess return.
        exret = cm.exret - cm.mktrf

        # Characteristic
        char = exret / avg_exret

        return char

Generate the firm characteristic

# Create a MyCRSPM instance.
crspm = MyCRSPM()

# Load crspm.
crspm.load_data()

# Filter data: shrcd in [10, 11, 12]
crspm.filter_data()

# Fill missing months by populating the data.
crspm.populate(method=None)

# Some preprocessing, e.g., creating frequently used variables.
crspm.update_variables()

# Merge crspm with the monthly factors generated earlier.
# Need this as excess_ret_change uses the market return.
crspm.merge_with_factors()

# Display the firm characteristics to be generated.
crspm.show_available_chars()

Notice that excess_ret_change has been added.

beta_60m             Market beta (Org, JKP). Fama and MacBeth (1973)
...                  ...
excess_ret_change    Change of excess return over the market return.
...                  ...
turn                 Share turnover (Org, GHZ). Datar, Naik, and Radcliffe (1998)

Generate firm characteristics

# Generate excess_ret_change.
# Without any argument, all characteristics defined in CRSPM and MyCRSPM will be generated.
crspm.create_chars('excess_ret_change')

# Some other characteristics defined in CRSPM can also be generated.
crspm.create_chars(['ret_12_1', 'ret_1_0'])  # 12M momentum and short-term reversal.

# Check the result
print(crspm[['excess_ret_change', 'ret_12_1', 'ret_1_0']])
                  excess_ret_change  ret_12_1  ret_1_0
date       permno
1986-01-31 10000                 NaN       NaN      NaN
1986-02-28 10000                 NaN       NaN   -0.257
1986-03-31 10000                 NaN       NaN    0.365
1986-04-30 10000                 NaN       NaN   -0.099
                              ...       ...      ...
2023-09-30 93436               8.384    -0.027   -0.030
2023-10-31 93436             -24.680     0.100   -0.197
2023-11-30 93436               3.204     0.032    0.195
2023-12-31 93436              -0.337     0.949    0.035
# Postprocessing: delete temporary variables and inspect the generated data.
crspm.postprocess()
[2024/02/22 17:53] No. unique dates: 1177, Date range: 1925-12-31 00:00:00 - 2023-12-31 00:00:00
[2024/02/22 17:53] No. unique ids: 29055
[2024/02/22 17:53] Nans, Infs
                   no. nans  percentage  no. infs  percentage
excess_ret_change    471639       0.116         0       0.000
ret_12_1             467584       0.115         0       0.000
ret_1_0              114462       0.028         0       0.000
[2024/02/22 17:53] Replacing inf with nan...
[2024/02/22 17:53] Descriptive statistics
                        count  mean      std          min     0.1%      1%    25%   50%   75%    99%   99.9%         max
excess_ret_change 3605208.000 3.712 3781.662 -1261260.427 -969.774 -95.979 -1.973 0.945 4.066 98.078 952.120 6076629.096
ret_12_1          3609263.000 0.135    0.773       -1.000   -0.956  -0.834 -0.208 0.049 0.324  2.513   6.902     436.685
ret_1_0           3962385.000 0.011    0.185       -1.000   -0.692  -0.400 -0.066 0.000 0.070  0.579   1.500      24.000
# Save the data (raw data + firm characteristics) to config.output_dir + 'mycrspm'.
# To save only firm characteristics and a few columns of the raw data, use 'other_columns' argument.
crspm.save()

# The saved file can be loaded using crspm.load().
crspm = MyCRSPM().load()
print(crspm[['excess_ret_change', 'ret_12_1', 'ret_1_0']])

elapsed_time('End of Example 4.')
                  excess_ret_change  ret_12_1  ret_1_0
date       permno
1986-01-31 10000                 NaN       NaN      NaN
1986-02-28 10000                 NaN       NaN   -0.257
1986-03-31 10000                 NaN       NaN    0.365
1986-04-30 10000                 NaN       NaN   -0.099
                              ...       ...      ...
2023-09-30 93436               8.384    -0.027   -0.030
2023-10-31 93436             -24.680     0.100   -0.197
2023-11-30 93436               3.204     0.032    0.195
2023-12-31 93436              -0.337     0.949    0.035

Example 5. Firm characteristics using year-end ME

Firm characteristics using year-end ME

This example demonstrates how to generate funda firm characteristics using year-end ME instead of the latest ME.

  1. All firm characteristics defined in FUNDA are generated.

  2. Only USD-denominated stocks are sampled: currency conversion is unnecessary.

  3. Only funda is used without merging with fundq: merging the two datasets as in Example 1 is also possible.

  4. It is assumed that funda data is available 6 months later.

  • It is assumed that data has been downloaded from WRDS.

Processing time: approx. 105 seconds.

Initialization

Import packages

from pyanomaly.characteristics import FUNDA, CRSPM
from pyanomaly.analytics import *

Set log file path and start a timer.

set_log_path(__file__)
start_timer('Example 5')

Market equity (from CRSP)

# Load crspm.
crspm = CRSPM()
crspm.load_data()

# Filter data: shrcd in [10, 11, 12]
crspm.filter_data()

# Fill missing months by populating the data.
crspm.populate(method=None)

# Some preprocessing, e.g., creating frequently used variables.
# 'me' and 'me_company' (firm-level me) are generated here.
crspm.update_variables()

FUNDA firm characteristics

# Load funda.
funda = FUNDA()
funda.load_data()

# USD stocks only.
funda.filter(('curcd', '==', 'USD'))

# Populate data to monthly.
# limit=12: Forward fill up to 12 months.
# lag=6: Shift data 6 months, i.e., data is available 6 months later.
funda.populate(MONTHLY, limit=12, lag=6, new_date_col='date')

# Some preprocessing, e.g., creating frequently used variables.
funda.update_variables()

# Add year-end market equity to funda.
funda.add_crsp_me(crspm, method='year_end')

# Generate firm characteristics.
funda.show_available_chars()
funda.create_chars()

# Postprocess and save results to './output/funda_yearend_me'.
funda.postprocess()
funda.save('funda_yearend_me')

Example 6. Playing with Panel

Playing with Panel

This example demonstrates how to use Panel class.

Import packages

from pyanomaly.panel import Panel
from pyanomaly.analytics import *

Monthly data

data1 = pd.DataFrame({
    'id': [100, 100, 100, 200, 200, 200, 300, 300, 300],
    'date': ['2023-03-31', '2023-04-30', '2023-05-31', '2023-03-31', '2023-04-30', '2023-05-31', '2023-03-31',
             '2023-04-30', '2023-05-31'],
    'ret': [0.01, 0.05, 0.02, 0.03, 0.11, -0.03, np.nan, 0.03, -0.05],
    'me': [100, 120, 110, 5000, 4500, 4700, 2000, 2100, 1900],
    'me_nyse': [np.nan, np.nan, np.nan, 5000, 4500, 4700, 2000, 2100, 1900],
})

data1['date'] = pd.to_datetime(data1['date'])
data1 = data1.sort_values(['id', 'date']).set_index(['date', 'id'])

# Instantiate Panel.
panel1 = Panel(data1, freq=MONTHLY)
del data1
print(panel1.data)
                  ret    me  me_nyse
date       id
2023-03-31 100  0.010   100      NaN
2023-04-30 100  0.050   120      NaN
2023-05-31 100  0.020   110      NaN
2023-03-31 200  0.030  5000 5000.000
2023-04-30 200  0.110  4500 4500.000
2023-05-31 200 -0.030  4700 4700.000
2023-03-31 300    NaN  2000 2000.000
2023-04-30 300  0.030  2100 2100.000
2023-05-31 300 -0.050  1900 1900.000
Inspect data. Available options are:
  • ‘summary’: Data shape, number of unique dates, and number of unique ids.

  • ‘id_count`: Number of ids per date.

  • ‘nans’: Number of nans and infs per column.

  • ‘stats’: Descriptive statistics. Same as data.describe().

panel1.inspect_data(option=['summary', 'id_count', 'nans', 'stats'])
[2024/02/23 17:17] Shape: (9, 3)
[2024/02/23 17:17] No. unique dates: 3, Date range: 2023-03-31 00:00:00 - 2023-05-31 00:00:00
[2024/02/23 17:17] No. unique ids: 3
[2024/02/23 17:17] No. ids per date:
 date
2023-03-31    3
2023-04-30    3
2023-05-31    3
dtype: int64
[2024/02/23 17:17] Nans, Infs
         no. nans  percentage  no. infs  percentage
ret             1       0.111         0       0.000
me              0       0.000         0       0.000
me_nyse         3       0.333         0       0.000
[2024/02/23 17:17] Descriptive statistics
         count     mean      std      min     0.1%       1%      25%      50%      75%      99%    99.9%      max
ret      8.000    0.021    0.049   -0.050   -0.050   -0.049    0.000    0.025    0.035    0.106    0.110    0.110
me       9.000 2281.111 2017.588  100.000  100.080  100.800  120.000 2000.000 4500.000 4976.000 4997.600 5000.000
me_nyse  6.000 3366.667 1506.873 1900.000 1900.500 1905.000 2025.000 3300.000 4650.000 4985.000 4998.500 5000.000

Quarterly data

data2 = pd.DataFrame({
    'id': [100, 100, 100, 200, 200, 200],
    'date': ['2023-03-31', '2023-06-30', '2023-09-30', '2023-01-31', '2023-04-30', '2023-07-31'],
    'prc': [100, 120, 110, 5000, 4500, 4700],
})

data2['date'] = pd.to_datetime(data2['date'])
data2 = data2.sort_values(['id', 'date']).set_index(['date', 'id'])

# Instantiate Panel.
panel2 = Panel(data2, freq=QUARTERLY)
del data2
print(panel2.data)
                 prc
date       id
2023-03-31 100   100
2023-06-30 100   120
2023-09-30 100   110
2023-01-31 200  5000
2023-04-30 200  4500
2023-07-31 200  4700

Populate data to monthly.

panel2.populate(MONTHLY, method='ffill', limit=3)
print(panel2.data)
                 prc
date       id
2023-03-31 100   100
2023-04-30 100   100
2023-05-31 100   100
2023-06-30 100   120
2023-07-31 100   120
2023-08-31 100   120
2023-09-30 100   110
2023-01-31 200  5000
2023-02-28 200  5000
2023-03-31 200  5000
2023-04-30 200  4500
2023-05-31 200  4500
2023-06-30 200  4500
2023-07-31 200  4700

Id-level operations

The methods below perform the operation per id and accounting for the data frequency. For instance, if a data is a quarterly data populated monthly, shift(1) will shift the data by 3 (one quarter) within each id.

Shift panel1.data by one month.

data1_lag1 = panel1.shift(n=1)
print(data1_lag1)
                ret       me  me_nyse
date       id
2023-03-31 100   NaN      NaN      NaN
2023-04-30 100 0.010  100.000      NaN
2023-05-31 100 0.050  120.000      NaN
2023-03-31 200   NaN      NaN      NaN
2023-04-30 200 0.030 5000.000 5000.000
2023-05-31 200 0.110 4500.000 4500.000
2023-03-31 300   NaN      NaN      NaN
2023-04-30 300   NaN 2000.000 2000.000
2023-05-31 300 0.030 2100.000 2100.000

Shift panel2.data by one quarter.

data2_lag1 = panel2.shift(n=1)
print(data2_lag1)
                   prc
date       id
2023-03-31 100      NaN
2023-04-30 100      NaN
2023-05-31 100      NaN
2023-06-30 100  100.000
2023-07-31 100  100.000
2023-08-31 100  100.000
2023-09-30 100  120.000
2023-01-31 200      NaN
2023-02-28 200      NaN
2023-03-31 200      NaN
2023-04-30 200 5000.000
2023-05-31 200 5000.000
2023-06-30 200 5000.000
2023-07-31 200 4500.000

Shift panel1.data['ret'] by 2 months. The two lines below are equivalent.

ret_lag2 = panel1.shift(panel1['ret'], 2)
ret_lag2 = panel1.shift('ret', 2)
print(ret_lag2)
date        id
2023-03-31  100     NaN
2023-04-30  100     NaN
2023-05-31  100   0.010
2023-03-31  200     NaN
2023-04-30  200     NaN
2023-05-31  200   0.030
2023-03-31  300     NaN
2023-04-30  300     NaN
2023-05-31  300     NaN

Quarterly price difference. The two lines below are equivalent.

prc_diff = panel2.diff(panel2.data['prc'], 1)
prc_diff = panel2.diff('prc', 1)
print(prc_diff)
date        id
2023-03-31  100        NaN
2023-04-30  100        NaN
2023-05-31  100        NaN
2023-06-30  100     20.000
2023-07-31  100     20.000
2023-08-31  100     20.000
2023-09-30  100    -10.000
2023-01-31  200        NaN
2023-02-28  200        NaN
2023-03-31  200        NaN
2023-04-30  200   -500.000
2023-05-31  200   -500.000
2023-06-30  200   -500.000
2023-07-31  200    200.000

Quarterly price percentage change. The two lines below are equivalent.

prc_change = panel2.pct_change(panel2.data['prc'], 1)
prc_change = panel2.pct_change('prc', 1)
print(prc_change)
date        id
2023-03-31  100      NaN
2023-04-30  100      NaN
2023-05-31  100      NaN
2023-06-30  100    0.200
2023-07-31  100    0.200
2023-08-31  100    0.200
2023-09-30  100   -0.083
2023-01-31  200      NaN
2023-02-28  200      NaN
2023-03-31  200      NaN
2023-04-30  200   -0.100
2023-05-31  200   -0.100
2023-06-30  200   -0.100
2023-07-31  200    0.044

Two-period cumulative returns.

cum_ret = panel1.cumret('ret', 2)
print(cum_ret)
date        id
2023-03-31  100      NaN
2023-04-30  100    0.060
2023-05-31  100    0.071
2023-03-31  200      NaN
2023-04-30  200    0.143
2023-05-31  200    0.077
2023-03-31  300      NaN
2023-04-30  300      NaN
2023-05-31  300   -0.021

One-period ahead returns.

fut_ret = panel1.futret('ret', 1)
print(fut_ret)
date        id
2023-03-31  100    0.050
2023-04-30  100    0.020
2023-05-31  100      NaN
2023-03-31  200    0.110
2023-04-30  200   -0.030
2023-05-31  200      NaN
2023-03-31  300    0.030
2023-04-30  300   -0.050
2023-05-31  300      NaN

Rolling-apply a function to each id. Supported functions: ‘sum’, ‘mean’, ‘std’, ‘var’.

rolling_sum = panel1.rolling(panel1[['ret', 'me']], 2, 'sum', min_n=1)
print(rolling_sum)
                  ret       me
date       id
2023-03-31 100  0.010  100.000
2023-04-30 100  0.060  220.000
2023-05-31 100  0.070  230.000
2023-03-31 200  0.030 5000.000
2023-04-30 200  0.140 9500.000
2023-05-31 200  0.080 9200.000
2023-03-31 300    NaN 2000.000
2023-04-30 300  0.030 4100.000
2023-05-31 300 -0.020 4000.000

To apply a user-defined function to each id, use apply_to_ids(). Below is the same as the above except that the return value is ndarray.

rolling_sum = panel1.apply_to_ids(panel1[['ret', 'me']], roll_sum, None, 2, 1)
print(rolling_sum)
[[ 1.0e-02  1.0e+02]
 [ 6.0e-02  2.2e+02]
 [ 7.0e-02  2.3e+02]
 [ 3.0e-02  5.0e+03]
 [ 1.4e-01  9.5e+03]
 [ 8.0e-02  9.2e+03]
 [     nan  2.0e+03]
 [ 3.0e-02  4.1e+03]
 [-2.0e-02  4.0e+03]]

Date-level operations

To apply a user-defined function to each date, use apply_to_dates().

# Functino to apply: sum of each column.
def sum0(data):
    return np.sum(data, axis=0).reshape(1, -1)

column_sum = panel1.apply_to_dates(panel1[['ret', 'me']], sum0, None)
print(column_sum)
[[      nan  7.10e+03]
 [ 1.90e-01  6.72e+03]
 [-6.00e-02  6.71e+03]]

Rolling regression

The first column of the first argument is used as the dependent variable and the rest as the independent variables. First K (number of independent varialbes) columns of the return are coefficients, the next column is R-squared, and the last column is idiosyncratic volatility (standard deviation of residuals).

retval = panel1.rolling_regression(panel1[['ret', 'me']], 3, min_n=2, add_const=True)
retval = pd.DataFrame(retval, columns=['alpha', 'me', 'R2', 'Idio vol'], index=panel1.data.index)
print(retval)
                alpha     me    R2  Idio vol
date       id
2023-03-31 100    NaN    NaN   NaN       NaN
2023-04-30 100 -0.190  0.002 1.000     0.000
2023-05-31 100 -0.193  0.002 0.923     0.006
2023-03-31 200    NaN    NaN   NaN       NaN
2023-04-30 200  0.830 -0.000 1.000     0.000
2023-05-31 200  0.659 -0.000 0.222     0.062
2023-03-31 300    NaN    NaN   NaN       NaN
2023-04-30 300    NaN    NaN   NaN       NaN
2023-05-31 300 -0.810  0.000 1.000     0.000

Panel merge

Two panels can be merged as follows.

panel1.merge(panel2, on=['date', 'id'], how='left')
print(panel1.data)
                  ret    me  me_nyse      prc
date       id
2023-03-31 100  0.010   100      NaN  100.000
2023-04-30 100  0.050   120      NaN  100.000
2023-05-31 100  0.020   110      NaN  100.000
2023-03-31 200  0.030  5000 5000.000 5000.000
2023-04-30 200  0.110  4500 4500.000 4500.000
2023-05-31 200 -0.030  4700 4700.000 4500.000
2023-03-31 300    NaN  2000 2000.000      NaN
2023-04-30 300  0.030  2100 2100.000      NaN
2023-05-31 300 -0.050  1900 1900.000      NaN

Example 7. Table download

Table download

This example demonstrates how to download a new table from WRDS. Different ways of downloading comp.secm table are demonstrated.

The table will be saved to ‘./input/comp/secm’.

Initialization

Import packages

from pyanomaly.globals import *
from pyanomaly.wrdsdata import WRDS

Set log file path and start a timer.

set_log_path(__file__)
start_timer('Example 7')

WRDS instantiation

References: WRDS.

wrds = WRDS(wrds_usename)

Data download

Method 1: Download the entire table at once.

wrds.download_table('comp', 'secm', date_cols=['datadate'])  # 'datadate's type will be converted to datetime.

Method 2: Download the entire table asynchronously.

Download data every interval years. For a small size data, this can be slower than download_table().

wrds.download_table_async('comp', 'secm', date_col='datadate', interval=5)

Method 3: Download only some fields.

sql = 'datadate, gvkey, cshoq, prccm'  # fields to select
wrds.download_table_async('comp', 'secm', sql=sql, date_col='datadate')

Method 4: Download data using a complete query statement.

Below is equivalent to the above. Note that the query statement must contain ‘WHERE [date_col] BETWEEN {} and {}’.

sql = f"""
    SELECT datadate, gvkey, cshoq, prccm
    FROM comp.secm
    WHERE datadate between '{{}}' and '{{}}'
"""
wrds.download_table_async('comp', 'secm', sql=sql, date_cols=['datadate'])