pyanomaly

PyAnomaly is a Python library for asset pricing research.

pyanomaly.analytics

This module defines analytic functions.

pyanomaly.characteristics

This module defines classes for firm characteristic generation.

pyanomaly.config

This module defines functions to set/get package configuration.

pyanomaly.datatools

This module defines functions for data handling.

pyanomaly.factors

This module defines functions to generate factor portfolios and characteristic portfolios.

pyanomaly.ff

This module defines functions to generate Fama-French factors.

pyanomaly.fileio

This module defines functions for file IO.

pyanomaly.globals

This module defines global constants.

pyanomaly.log

This module defines logging functions.

pyanomaly.numba_support

This module defines jitted functions.

pyanomaly.panel

This module defines classes for panel data analysis.

pyanomaly.portfolio

This module defines classes for portfolio analysis.

pyanomaly.tcost

This module defines classes for transaction costs.

pyanomaly.util

This module defines utility functions.

pyanomaly.wrdsdata

This module defines the WRDS class, which is used to download and handle WRDS data.

analytics

This module defines analytic functions.

Sorting

one_dim_sort

One-dimensional sort.

two_dim_sort

Two-dimensional sort.

Auxiliary functions

relabel_class

Relabel classes.

weighted_mean

Calculate weighted means.

append_long_short

Add long-short to quantile data.

Time-Series Analysis

time_series_average

Calculate time-series mean and t-statistic.

grs_test

Run GRS (Gibbons, Ross, and Shanken, 1989) test.

Auxiliary functions

t_stat

Calculate t-statistic.

Cross-sectional Analysis

crosssectional_regression

Run cross-sectional OLS.

Portfolio Analysis

make_position

Make portfolio position data.

make_portfolio

Make a portfolio.

make_long_short_portfolio

Make a long-short portfolio.

make_quantile_portfolios

Make quantile portfolios.

Auxiliary functions

future_return

Compute future returns.

pyanomaly.analytics.append_long_short(data, level=-1, l_label=None, s_label=None, ls_label=None)

Add long-short to quantile data.

Long-short is defined as (first group - last group) on each date. If labels are not given, long-short will be (class 0 - class N-1), where N is the number of classes, and the class label of the long-short is set to N.

Parameters:
  • data – DataFrame with index = date/class1/class2/….

  • level – Index level to make long-short on. Defaults to the last level.

  • l_label – Label of the long class. If None, l_label = 0.

  • s_label – Label of the short class. If None, s_label = N-1.

  • ls_label – Label of the long-short. If None, ls_label = N.

Returns:

The data with long-short appended.
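The default rule above can be sketched with plain pandas. This is only an illustration of the arithmetic, not the library's implementation; the column name ret and the sample values are made up.

```python
import pandas as pd

# Hypothetical quantile data: index = date/class, three classes labelled 0..2.
idx = pd.MultiIndex.from_product(
    [pd.to_datetime(["2020-01-31", "2020-02-29"]), [0, 1, 2]],
    names=["date", "class"],
)
data = pd.DataFrame({"ret": [0.03, 0.02, 0.01, 0.05, 0.01, -0.02]}, index=idx)

n_classes = data.index.get_level_values("class").nunique()  # N = 3

# Long-short = class 0 minus class N-1 on each date, labelled N.
long_leg = data.xs(0, level="class")
short_leg = data.xs(n_classes - 1, level="class")
long_short = long_leg - short_leg
long_short["class"] = n_classes
long_short = long_short.set_index("class", append=True)

# Append the long-short rows to the original quantile data.
result = pd.concat([data, long_short]).sort_index()
```

Each date now carries one extra row with class label 3 that holds the class-0 value minus the class-2 value.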

pyanomaly.analytics.crosssectional_regression(data, endo_col, exog_cols, add_constant=True, cov_type='nonrobust', cov_kwds=None)

Run cross-sectional OLS.

Run cross-sectional OLS on each date and calculate the time-series means and t-stats of the coefficients.

Parameters:
  • data – DataFrame with index = date/id.

  • endo_col – y column.

  • exog_cols – List of X columns.

  • add_constant (bool) – Add constant to X.

  • cov_type – Covariance estimator. See t_stat().

  • cov_kwds – Parameters required for the chosen covariance estimator. See t_stat().

Returns:

  • mean (DataFrame). Time-series means of coefficients with index = (‘const’ +) exog_cols and columns = ‘mean’.

  • t-stat (DataFrame). t-statistics of coefficients with index = (‘const’ +) exog_cols and columns = ‘t-stat’.

  • coefs (DataFrame). Coefficient time-series with index = dates and columns = (‘const’ +) exog_cols.
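The procedure described above (one OLS per date, then time-series means and t-stats of the coefficients) can be sketched as follows. This is a minimal illustration with numpy and simulated data, not the library's code; the column names x and y and the helper ols_one_date are hypothetical, and only the non-robust t-stat is shown.

```python
import numpy as np
import pandas as pd

# Simulated panel with index = date/id and a known relation y = 0.5 + 2 x + noise.
rng = np.random.default_rng(0)
dates = pd.date_range("2020-01-01", periods=24, freq="MS")
idx = pd.MultiIndex.from_product([dates, range(50)], names=["date", "id"])
data = pd.DataFrame({"x": rng.normal(size=len(idx))}, index=idx)
data["y"] = 0.5 + 2.0 * data["x"] + rng.normal(scale=0.1, size=len(idx))


def ols_one_date(group):
    # Regress y on a constant and x for a single cross-section.
    X = np.column_stack([np.ones(len(group)), group["x"].to_numpy()])
    beta, *_ = np.linalg.lstsq(X, group["y"].to_numpy(), rcond=None)
    return pd.Series(beta, index=["const", "x"])


# Coefficient time series: index = dates, columns = const, x.
coefs = pd.DataFrame(
    {date: ols_one_date(group) for date, group in data.groupby(level="date")}
).T

# Time-series mean and non-robust t-statistic of each coefficient.
mean = coefs.mean()
t_stat = mean / (coefs.std(ddof=1) / np.sqrt(len(coefs)))
```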

pyanomaly.analytics.future_return(ret, period=1)

Compute future returns.

Compute period-period-ahead returns, where period is the number of periods to look ahead. If ret has a MultiIndex of date/id, future returns are calculated for each id.

Parameters:
  • ret – Series of returns with index = date or date/id. If index = date/id, ret must be sorted on id/date.

  • period – Target period.

Returns:

Series of future returns.
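For period = 1, the per-id calculation can be pictured as a grouped shift. The sketch below is an assumption about the behavior, not the library's code: it simply moves the next observed return of each id back by one row, and it does not show how longer horizons are handled.

```python
import pandas as pd

# Hypothetical return series with index = date/id, sorted on id/date.
idx = pd.MultiIndex.from_arrays(
    [
        pd.to_datetime(["2020-01-31", "2020-02-29", "2020-03-31"] * 2),
        ["A", "A", "A", "B", "B", "B"],
    ],
    names=["date", "id"],
)
ret = pd.Series([0.01, 0.02, 0.03, 0.10, 0.20, 0.30], index=idx, name="ret")

period = 1
# Shift within each id so that row t holds the return observed at t + period.
future_ret = ret.groupby(level="id").shift(-period)
```

The last row of each id has no future return and is therefore NaN, which is why shifting within each id matters: without the grouping, the last row of A would pick up the first return of B.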

pyanomaly.analytics.grs_test(assets, factors)

Run GRS (Gibbons, Ross, and Shanken, 1989) test.

Parameters:
  • assets – T x N DataFrame or ndarray of asset returns.

  • factors – T x F DataFrame or ndarray of factor returns.

Returns:

  • pricing error (alpha.T * inv(Sigma) * alpha)

  • squared Sharpe ratio of the factors

  • GRS statistic

  • p value
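The first three returned quantities can be illustrated with the textbook form of the GRS statistic. This is a sketch with numpy under the usual finite-sample conventions (unbiased residual covariance, maximum-likelihood factor covariance), which may differ in detail from the library; the function name grs_statistic is hypothetical. The p-value is omitted because it needs the F(N, T - N - L) distribution, which could be taken from, e.g., scipy.stats.f.sf.

```python
import numpy as np


def grs_statistic(assets, factors):
    """Return (pricing error, squared Sharpe ratio, GRS statistic)."""
    assets = np.asarray(assets, dtype=float)
    factors = np.asarray(factors, dtype=float)
    T, N = assets.shape
    L = factors.shape[1]  # number of factors (F in the parameter list)

    # Time-series regression of every asset on a constant and the factors.
    X = np.column_stack([np.ones(T), factors])
    coef, *_ = np.linalg.lstsq(X, assets, rcond=None)
    alpha = coef[0]  # N intercepts
    resid = assets - X @ coef

    sigma = resid.T @ resid / (T - L - 1)  # residual covariance (N x N)
    mu = factors.mean(axis=0)
    omega = np.atleast_2d(np.cov(factors, rowvar=False, ddof=0))  # factor covariance (L x L)

    pricing_error = alpha @ np.linalg.solve(sigma, alpha)  # alpha.T * inv(Sigma) * alpha
    sharpe_sq = mu @ np.linalg.solve(omega, mu)
    stat = (T / N) * ((T - N - L) / (T - L - 1)) * pricing_error / (1 + sharpe_sq)
    return pricing_error, sharpe_sq, stat


# Simulated example: 120 periods, 5 assets, 3 factors.
rng = np.random.default_rng(0)
factors = rng.normal(0.005, 0.04, size=(120, 3))
betas = rng.normal(1.0, 0.3, size=(3, 5))
assets = factors @ betas + rng.normal(0, 0.02, size=(120, 5))
pricing_error, sharpe_sq, grs = grs_statistic(assets, factors)
```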

pyanomaly.analytics.make_long_short_portfolio(lposition, sposition, rf=None, costfcn=None, keep_position=True, name='H-L', ls_wgt=(1, -1))

Make a long-short portfolio.

Parameters:
  • lposition – DataFrame. Long position data. See make_position() for the data format.

  • sposition – DataFrame. Short position data.

  • rf – Series or DataFrame of risk-free rates. The index should be date.

  • costfcn – Transaction cost. See Portfolio.costfcn().

  • keep_position – If True, keep the position information in the returned value. If position information is not needed, set this to False to save memory.

  • name – Portfolio name.

  • ls_wgt – Long-short weights. (1, -1) means 1:1 long-short.

Returns:

Portfolio object.

pyanomaly.analytics.make_portfolio(data, ret_col, weight_col=None, rf=None, costfcn=None, keep_position=True, name='')

Make a portfolio.

This function creates portfolio position data from data and constructs a portfolio from it.

Parameters:
  • data – DataFrame with index = date/id.

  • ret_col – Return column of data. Return should be over t to t+1.

  • weight_col – Weight column of data. If None, constituents are equally weighted.

  • rf – Series or DataFrame of risk-free rates. The index should be date.

  • costfcn – Transaction cost. See Portfolio.costfcn().

  • keep_position – If True, keep the position information in the returned value. If position information is not needed, set this to False to save memory.

  • name – Portfolio name.

Returns:

Portfolio object.

pyanomaly.analytics.make_position(data, ret_col, weight_col=None, pf_col=None, other_cols=None)

Make portfolio position data.

To construct and evaluate a portfolio using Portfolio, position data is required. This function makes the position data from data, which is a panel of securities. The position data is generated via the following operations:

  • Change column names as assumed in Portfolio:

    • ‘date’: Date column.

    • ‘id’: Security id column.

    • ‘ret’: Return column.

    • ‘wgt’: Weight column.

  • Normalize weights so that their cross-sectional sum becomes 1 within each portfolio.

Parameters:
  • data – DataFrame with index = date/id.

  • ret_col – Return column of data. Return should be over t to t+1.

  • weight_col – Weight column of data. If None, constituents are equally weighted.

  • pf_col – Portfolio column of data, i.e., a column that maps securities to portfolios. This can be None if the input data is for one portfolio.

  • other_cols – Other columns of data to include in the position attribute of Portfolio.

Returns:

Position DataFrame with index = ‘date’ and columns = [‘id’, ‘ret’, ‘wgt’] + other_cols.
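The two operations listed above (renaming columns and normalizing weights within each portfolio) can be sketched with pandas. This is an illustration only; the input column names permno, pf, ret1, and me are made up, and the real function may handle more cases.

```python
import pandas as pd

# Hypothetical panel with index = date/permno and two portfolios (pf = 0, 1).
data = pd.DataFrame(
    {
        "date": pd.to_datetime(["2020-01-31"] * 4),
        "permno": [1, 2, 3, 4],
        "pf": [0, 0, 1, 1],
        "ret1": [0.01, 0.02, 0.03, 0.04],
        "me": [100.0, 300.0, 50.0, 150.0],
    }
).set_index(["date", "permno"])

# Normalize weights so that they sum to 1 within each date/portfolio.
weights = data["me"] / data.groupby(["date", "pf"])["me"].transform("sum")

# Rename to the column names assumed by Portfolio and index by date.
position = (
    data.assign(wgt=weights)
    .rename(columns={"ret1": "ret"})
    .reset_index()
    .rename(columns={"permno": "id"})
    .set_index("date")[["id", "ret", "wgt", "pf"]]
)
```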

pyanomaly.analytics.make_quantile_portfolios(data, q_col, ret_col, weight_col=None, rf=None, costfcn=None, keep_position=True, names=None, ls_wgt=(1, -1))

Make quantile portfolios.

This function makes quantile portfolios and the long-short portfolio from data.

Parameters:
  • data – DataFrame with index = date/id.

  • q_col – Column of data that maps securities to quantiles (portfolios). The values should be integers starting from 0.

  • ret_col – Return column of data. Return should be over t to t+1.

  • weight_col – Weight column of data. If None, constituents are equally weighted.

  • rf – Series or DataFrame of risk-free rates. The index should be date.

  • costfcn – Transaction cost. See Portfolio.costfcn().

  • keep_position – If True, keep the position information in the returned value. If position information is not needed, set this to False to save memory.

  • names – Portfolio names. If None, the values in q_col are used.

  • ls_wgt – Long-short weights. (1, -1) means 1:1 long-short. If None, long-short portfolio is not constructed.

Returns:

Portfolios object.

pyanomaly.analytics.one_dim_sort(data, class_col, target_cols=None, weight_col=None, function='mean', add_long_short=True)

One-dimensional sort.

This function assumes that data has already been sorted/grouped and class labels are given in class_col column. Aggregate target_cols values using class_col and return aggregated results. Class labels in class_col should be 0, 1, …

Parameters:
  • data – DataFrame to be grouped. Index must be date/id.

  • class_col – Class label column.

  • target_cols – (List of) column(s) to aggregate. If None, target_cols = all numeric columns of data.

  • weight_col – Weight column. If given, weighted mean is returned. Applicable only when function = ‘mean’.

  • function – Aggregate function, e.g., ‘sum’, ‘mean’, ‘count’, or a list of functions.

  • add_long_short (bool) – Add long-short to the output.

Returns:

Aggregated data with index = date/class, columns = target_cols. If function is a list of functions, the columns have two levels: first level = target_cols and second level = function.
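The shape of the output can be illustrated with a plain pandas groupby. This is a sketch of the column layout only, not the library's implementation; the column names q and ret are made up and the long-short row is left out.

```python
import pandas as pd

# Hypothetical pre-sorted panel: class labels 0, 1 in column q.
idx = pd.MultiIndex.from_product(
    [pd.to_datetime(["2020-01-31"]), range(4)], names=["date", "id"]
)
data = pd.DataFrame({"q": [0, 0, 1, 1], "ret": [0.01, 0.03, 0.05, 0.07]}, index=idx)

# Single function: index = date/class, columns = target columns.
single = data.groupby(["date", "q"])[["ret"]].agg("mean")

# List of functions: columns get two levels (target column, function).
multi = data.groupby(["date", "q"])[["ret"]].agg(["mean", "count"])
```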

pyanomaly.analytics.relabel_class(data, labels=None, axis=0, level=-1, col=None)

Relabel classes.

Relabel (rename) columns, indexes, or column values of data. The existing labels (values) should be continuous integers starting from 0. The data is relabeled in-place.

Parameters:
  • data – DataFrame to be relabeled.

  • labels (list) – New class labels. Label 0 is replaced by the first element of labels, and so forth.

  • axis – 0: index, 1: column.

  • level – Level of index/column to be relabeled.

  • col – Column name. If column name is given, axis and level are ignored.

pyanomaly.analytics.t_stat(data, cov_type='nonrobust', cov_kwds=None)

Calculate t-statistic.

Calculate t-statistic for each column of data under H0: x = 0.

Parameters:
  • data – Series, DataFrame, or ndarray with each column containing samples.

  • cov_type – Covariance estimator: e.g., ‘HAC’ for Newey-West.

  • cov_kwds – Parameters required for the chosen covariance estimator: e.g., {‘maxlags’: 12} for cov_type = ‘HAC’.

Returns:

t-stat. Float (if data is one dimensional) or (1 x K) ndarray, where K is the number of columns of data.

Note

See statsmodels.api.OLS.fit for possible values of cov_type and cov_kwds.
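For the default cov_type = ‘nonrobust’, the statistic under H0: x = 0 reduces to the sample mean divided by its standard error. The sketch below shows that special case with numpy; the function name simple_t_stat is hypothetical, and robust estimators such as ‘HAC’ are not covered.

```python
import numpy as np


def simple_t_stat(x):
    # Non-robust t-statistic of the mean for each column of x.
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    return x.mean(axis=0) / (x.std(axis=0, ddof=1) / np.sqrt(n))


sample = np.array([0.02, -0.01, 0.03, 0.01, 0.00])
t = simple_t_stat(sample)
```

For this sample the mean is 0.01 and the standard error is about 0.00707, so t is sqrt(2), roughly 1.414.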

pyanomaly.analytics.time_series_average(data, cov_type='nonrobust', cov_kwds=None)

Calculate time-series mean and t-statistic.

Time-series mean and t-statistic are calculated for each column of data. The data can be either time-series data (index = date) or panel data (index = date/id). If it is a panel, time-series mean and t-statistic are calculated for each id.

Parameters:
  • data – DataFrame, Series, or ndarray. Data to analyze. If MultiIndex, the first index must be date.

  • cov_type – Covariance estimator. See t_stat().

  • cov_kwds – Parameters required for the chosen covariance estimator. See t_stat().

Returns:

  • mean (DataFrame).

  • t-stat (DataFrame).

If data has MultiIndex, mean (t-stat) has index = level 1 index of data and columns = data.columns. Otherwise, mean (t-stat) has index = data.columns and columns = ‘mean’ (‘t-stat’).

pyanomaly.analytics.two_dim_sort(data, class_col1, class_col2, target_cols=None, weight_col=None, function='mean', add_long_short=True, output_dim=1)

Two-dimensional sort.

This function assumes that data has already been sorted/grouped and class labels are given in class_col1 and class_col2 columns. Aggregate target_cols values using class_col1 and class_col2 and return aggregated results. Class labels in class_col1(2) should be 0, 1, …

Parameters:
  • data – Data to be grouped. Index must be date/id.

  • class_col1 – Class label column for the 1st dimension.

  • class_col2 – Class label column for the 2nd dimension.

  • target_cols – (List of) column(s) to aggregate. If None, target_cols = all numeric columns of data.

  • weight_col – Weight column. If given, weighted mean is returned. Applicable only when function = ‘mean’.

  • function – Aggregate function, e.g., ‘sum’, ‘mean’, ‘count’.

  • add_long_short (bool) – Add long-short to the output.

  • output_dim – If 1, output is a DataFrame with index = date/class1/class2; if 2, output is a DataFrame with index = date/class1 and column = class2.

Returns:

Aggregated data (DataFrame or dict of DataFrames).

  • If output_dim = 1, index = date/class1/class2 and columns = target_cols.

  • If output_dim = 2 and len(target_cols) = 1, index = date/class1 and columns = class2.

  • If output_dim = 2 and len(target_cols) > 1, output is dict with keys = target_cols and values = DataFrames (as in the second case).
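The difference between output_dim = 1 and output_dim = 2 can be pictured with a pandas groupby followed by unstack. This is a sketch of the two layouts for a single target column, not the library's code; the column names size_class, bm_class, and ret are made up and the long-short rows are left out.

```python
import pandas as pd

# Hypothetical pre-sorted panel with two class columns (2 x 2 sort).
idx = pd.MultiIndex.from_product(
    [pd.to_datetime(["2020-01-31"]), range(8)], names=["date", "id"]
)
data = pd.DataFrame(
    {
        "size_class": [0, 0, 0, 0, 1, 1, 1, 1],
        "bm_class": [0, 0, 1, 1, 0, 0, 1, 1],
        "ret": [0.01, 0.03, 0.02, 0.04, 0.05, 0.07, 0.06, 0.08],
    },
    index=idx,
)

# Layout for output_dim = 1: index = date/class1/class2, columns = target column.
out_1d = data.groupby(["date", "size_class", "bm_class"])["ret"].mean().to_frame()

# Layout for output_dim = 2: index = date/class1, columns = class2.
out_2d = out_1d["ret"].unstack("bm_class")
```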

pyanomaly.analytics.weighted_mean(data, target_cols, weight_col, group_cols)

Calculate weighted means.

Calculate weighted means of each column in target_cols within each group defined by group_cols.

Parameters:
  • data – DataFrame.

  • target_cols – (List of) column(s) to calculate weighted-mean.

  • weight_col – Weight column name or Series or ndarray of weights.

  • group_cols – (List of) grouping column(s).

Returns:

DataFrame of weighted means with index = group_cols and columns = target_cols.

Examples

If data is a panel with index = date/permno and column ‘ret’ contains returns and ‘me’ contains market equity at t-1, value-weighted returns can be obtained as follows:

>>> wmean = weighted_mean(data, 'ret', 'me', 'date')
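The value-weighted return in the example is the sum of ret * me divided by the sum of me on each date. A minimal pandas sketch of that arithmetic, with made-up data and without calling the library, looks like this:

```python
import pandas as pd

# Hypothetical panel with index = date/permno.
idx = pd.MultiIndex.from_product(
    [pd.to_datetime(["2020-01-31", "2020-02-29"]), [10001, 10002]],
    names=["date", "permno"],
)
data = pd.DataFrame(
    {"ret": [0.10, 0.00, 0.02, 0.06], "me": [100.0, 300.0, 100.0, 100.0]},
    index=idx,
)

# Value-weighted return per date: sum(ret * me) / sum(me).
weighted = data["ret"] * data["me"]
vw_ret = weighted.groupby(level="date").sum() / data["me"].groupby(level="date").sum()
```

In January the larger firm has a zero return, so the value-weighted return (0.025) is well below the equal-weighted one (0.05).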

characteristics

This module defines classes for firm characteristic generation.

FUNDA

Class to generate firm characteristics from funda.

FUNDQ

Class to generate firm characteristics from fundq.

CRSPM

Class to generate firm characteristics from crspm.

CRSPDRaw

Class that handles crspd data.

CRSPD

Class to generate firm characteristics from crspd.

Merge

Class to generate firm characteristics from a combined dataset of crspm, crspd, funda, and fundq.

FUNDA

class pyanomaly.characteristics.FUNDA(alias=None, data=None)

Bases: FCPanel

Class to generate firm characteristics from funda.

The firm characteristics defined in this class can be viewed using show_available_chars().

Parameters:
  • alias (str, list, or dict) –

    Firm characteristics to generate and their aliases. If None, all available firm characteristics are generated.

    • str: A column name in the mapping file (config.mapping_file_path). The firm characteristics defined in this class and in the alias column of the mapping file are generated. See Mapping File.

    • list: List of firm characteristics (method names).

    • dict: Dict of firm characteristics and their aliases of the form {method name: alias}.

    If aliases are not specified, method names are used as aliases.

  • data – DataFrame of funda data with index = datadate/gvkey, sorted on gvkey/datadate. The funda data can be given at initialization or loaded later using load_data().

Methods

load_data

Load funda data from file.

convert_currency

Convert currency to USD.

merge_with_fundq

Merge funda with fundq.

add_crsp_me

Add crsp market equity.

update_variables

Preprocess data before creating firm characteristics.

postprocess

Postprocess data.

Firm characteristic generation methods have a name like c_characteristic().

add_crsp_me(crspm, method='latest')

Add crsp market equity.

In funda, market equity (‘me’) and fiscal market equity (‘me_fiscal’) are both defined as (prcc_f * csho). This method replaces them with crspm’s firm-level market equity (‘me_company’). If method = ‘latest’, ‘me’ is the latest ‘me_company’ and ‘me_fiscal’ is the ‘me_company’ on datadate. If method = ‘year_end’, both ‘me’ and ‘me_fiscal’ are the ‘me_company’ in December of datadate year.

Parameters:
  • crspm – CRSPM instance.

  • method – How to merge crsp me with funda. ‘latest’: latest me; ‘year_end’: December me.

c_absacc()

Absolute accruals. Bandyopadhyay, Huang, and Wirjanto (2010)

c_acc()

Operating accruals (GHZ, Org). Sloan (1996)

c_age()

Firm age. Jiang, Lee, and Zhang (2005)

c_aliq_at()

Asset liquidity to book assets. Ortiz-Molina and Phillips (2014)

c_aliq_mat()

Asset liquidity to market assets. Ortiz-Molina and Phillips (2014)

c_at_be()

Book leverage. Fama and French (1992)

c_at_gr1()

Asset growth. Cooper, Gulen, and Schill (2008)

c_at_me()

Assets-to-market. Fama and French (1992)

c_at_turnover()

Capital turnover. Haugen and Baker (1996)

c_be_gr1a()

Change in common equity. Richardson et al. (2005)

c_be_me()

Book-to-market (December ME). Rosenberg, Reid, and Lanstein (1985)

c_bev_mev()

Book-to-market enterprise value. Penman, Richardson, and Tuna (2007)

c_bm_ia()

Industry-adjusted book-to-market. Asness, Porter, and Stevens (2000)

c_capex_abn()

Abnormal corporate investment. Titman, Wei, and Xie (2004)

c_capx_gr1()

CAPEX growth (1 year). Xie (2001)

c_capx_gr2()

Two-year investment growth. Anderson and Garcia-Feijoo (2006)

c_capx_gr3()

Three-year investment growth. Anderson and Garcia-Feijoo (2006)

c_cash_at()

Cash-to-assets. Palazzo (2012)

c_cashdebt()

Cash flow-to-debt. Ou and Penman (1989)

c_cashpr()

Cash productivity. Chandrashekar and Rao (2009)

c_cfp()

Operating Cash flows to price (Org, GHZ). Desai, Rajgopal, and Venkatachalam (2004)

c_cfp_ia()

Industry-adjusted cash flow-to-price. Asness, Porter, and Stevens (2000)

c_chatoia()

Industry-adjusted change in asset turnover. Soliman (2008)

c_chcsho()

Net stock issues (GHZ). Pontiff and Woodgate (2008)

c_chempia()

Industry-adjusted change in employees. Asness, Porter, and Stevens (2000)

c_chpmia()

Change in profit margin. Soliman (2008)

c_coa_gr1a()

Change in current operating assets. Richardson et al. (2005)

c_col_gr1a()

Change in current operating liabilities. Richardson et al. (2005)

c_convind()

Convertible debt indicator. Valta (2016)

c_cop_at()

Cash-based operating profitability. Ball et al. (2016)

c_cop_atl1()

Cash-based operating profits to lagged assets. Ball et al. (2016)

c_cowc_gr1a()

Change in net non-cash working capital. Richardson et al. (2005)

c_currat()

Current ratio. Ou and Penman (1989)

c_dbnetis_at()

Net debt finance. Bradshaw, Richardson, and Sloan (2006)

c_debt_gr3()

Composite debt issuance. Lyandres, Sun, and Zhang (2008)

c_debt_me()

Debt to market. Bhandari (1988)

c_depr()

Depreciation to PP&E. Holthausen and Larcker (1992)

c_dgp_dsale()

Gross margin growth to sales growth. Abarbanell and Bushee (1998)

c_dsale_dinv()

Sales growth to inventory growth. Abarbanell and Bushee (1998)

c_dsale_drec()

Sales growth to receivable growth. Abarbanell and Bushee (1998)

c_dsale_dsga()

Sales growth to SG&A growth. Abarbanell and Bushee (1998)

c_dy()

Dividend yield (GHZ). Litzenberger and Ramaswamy (1979)

c_earnings_variability()

Earnings smoothness. Francis et al. (2004)

c_ebit_bev()

Return on net operating assets. Soliman (2008)

c_ebit_sale()

Profit margin. Soliman (2008)

c_ebitda_mev()

Enterprise multiple. Loughran and Wellman (2011)

c_emp_gr1()

Employee growth. Belo, Lin, and Bazdresch (2014)

c_enterprise_multiple()

Enterprise multiple. Loughran and Wellman (2011)

c_eq_dur()

Equity duration. Dechow, Sloan, and Soliman (2004)

c_eqnetis_at()

Net equity finance. Bradshaw, Richardson, and Sloan (2006)

c_eqnpo_me()

Net payout yield. Boudoukh et al. (2007)

c_eqpo_me()

Payout yield. Boudoukh et al. (2007)

c_f_score()

Piotroski F-Score (JKP). Piotroski (2000)

c_fcf_me()

Cash flow-to-price. Lakonishok, Shleifer, and Vishny (1994)

c_fnl_gr1a()

Change in financial liabilities. Richardson et al. (2005)

c_gp_at()

Gross profits-to-assets. Novy-Marx (2013)

c_gp_atl1()

Gross profits-to-lagged assets. Novy-Marx (2013)

c_herf_at()

Industry concentration (total assets). Hou and Robinson (2006)

c_herf_be()

Industry concentration (book equity). Hou and Robinson (2006)

c_herf_sale()

Industry concentration (sales). Hou and Robinson (2006)

c_intrinsic_value()

Intrinsic value-to-market. Frankel and Lee (1998)

c_inv_gr1()

Inventory growth. Belo and Lin (2012)

c_inv_gr1a()

Inventory change. Thomas and Zhang (2002)

c_invest()

CAPEX and inventory. Chen and Zhang (2010)

c_kz_index()

Kaplan-Zingales Index. Lamont, Polk, and Saa-Requejo (2001)

c_lgr()

Change in long-term debt. Richardson et al. (2005)

c_lnoa_gr1a()

Change in long-term net operating assets. Fairfield, Whisenant, and Yohn (2003)

c_lti_gr1a()

Change in long-term investments. Richardson et al. (2005)

c_mve_ia()

Industry-adjusted firm size. Asness, Porter, and Stevens (2000)

c_ncoa_gr1a()

Change in non-current operating assets. Richardson et al. (2005)

c_ncol_gr1a()

Change in non-current operating liabilities. Richardson et al. (2005)

c_netdebt_me()

Net debt-to-price. Penman, Richardson, and Tuna (2007)

c_netis_at()

Net external finance. Bradshaw, Richardson, and Sloan (2006)

c_nfna_gr1a()

Change in net financial assets. Richardson et al. (2005)

c_ni_ar1()

Earnings persistence. Francis et al. (2004)

c_ni_be()

Return on equity. Haugen and Baker (1996)

c_ni_ivol()

Earnings predictability. Francis et al. (2004)

c_ni_me()

Earnings to price. Basu (1983)

c_nncoa_gr1a()

Change in net non-current operating assets. Richardson et al. (2005)

c_noa_at()

Net operating assets. Hirshleifer et al. (2004)

c_noa_gr1a()

Change in net operating assets. Hirshleifer et al. (2004)

c_o_score()

Ohlson O-Score. Dichev (1998)

c_oaccruals_at()

Operating Accruals (JKP). Sloan (1996)

c_oaccruals_ni()

Percent Operating Accruals (JKP). Hafzalla, Lundholm, and Van Winkle (2011)

c_ocf_at()

Operating cash flow to assets. Bouchard et al. (2019)

c_ocf_at_chg1()

Change in operating cash flow to assets. Bouchard et al. (2019)

c_ocf_me()

Operating Cash flows to price (JKP). Desai, Rajgopal, and Venkatachalam (2004)

c_op_at()

Operating profits-to-assets. Ball et al. (2016)

c_op_atl1()

Operating profits-to-lagged assets. Ball et al. (2016)

c_ope_be()

Operating profits to book equity (JKP). Fama and French (2015)

c_ope_bel1()

Operating profits to lagged book equity. Fama and French (2015)

c_operprof()

Operating profits to book equity (GHZ, Org). Fama and French (2015)

c_opex_at()

Operating leverage. Novy-Marx (2011)

c_pchcapx_ia()

Industry-adjusted change in capital investment. Abarbanell and Bushee (1998)

c_pchcurrat()

Change in current ratio. Ou and Penman (1989)

c_pchdepr()

Change in depreciation to PP&E. Holthausen and Larcker (1992)

c_pchquick()

Change in quick ratio. Ou and Penman (1989)

c_pchsaleinv()

Change in sales to inventory. Ou and Penman (1989)

c_pctacc()

Percent operating accruals (GHZ, Org). Hafzalla, Lundholm, and Van Winkle (2011)

c_pi_nix()

Taxable income to income (JKP). Lev and Nissim (2004)

c_ppeinv_gr1a()

Changes in PPE and inventory/assets. Lyandres, Sun, and Zhang (2008)

c_ps()

Piotroski score (GHZ, Org). Piotroski (2000)

c_quick()

Quick ratio. Ou and Penman (1989)

c_rd()

Unexpected R&D increase. Eberhart, Maxwell, and Siddique (2004)

c_rd5_at()

R&D capital-to-assets. Li (2011)

c_rd_me()

R&D to market. Chan, Lakonishok, and Sougiannis (2001) (Guo, Lev, and Shi (2006) in GHZ)

c_rd_sale()

R&D to sales. Chan, Lakonishok, and Sougiannis (2001) (Guo, Lev, and Shi (2006) in GHZ)

c_realestate()

Real estate holdings. Tuzel (2010)

c_roic()

Return on invested capital. Brown and Rowe (2007)

c_sale_bev()

Asset turnover. Soliman (2008)

c_sale_emp_gr1()

Labor force efficiency. Abarbanell and Bushee (1998)

c_sale_gr1()

Annual sales growth. Lakonishok, Shleifer, and Vishny (1994)

c_sale_gr3()

Three-year sales growth. Lakonishok, Shleifer, and Vishny (1994)

c_sale_me()

Sales to price. Barbee, Mukherji, and Raines (1996)

c_salecash()

Sales-to-cash. Ou and Penman (1989)

c_saleinv()

Sales-to-inventory. Ou and Penman (1989)

c_salerec()

Sales-to-receivables. Ou and Penman (1989)

c_secured()

Secured debt-to-total debt. Valta (2016)

c_securedind()

Secured debt indicator. Valta (2016)

c_sin()

Sin stocks. Hong and Kacperczyk (2009)

c_sti_gr1a()

Change in short-term investments. Richardson et al. (2005)

c_taccruals_at()

Total Accruals. Richardson et al. (2005)

c_taccruals_ni()

Percent total accruals. Hafzalla, Lundholm, and Van Winkle (2011)

c_tangibility()

Tangibility. Hahn and Lee (2009)

c_tb()

Taxable income to income (Org, GHZ). Lev and Nissim (2004)

c_z_score()

Altman Z-Score. Dichev (1998)

convert_currency()

Convert currency to USD.

Convert the currency of funda to USD. This method needs to be called if

  1. the data contains non-USD-denominated firms, e.g., CAD; and

  2. CRSP’s market equity is used, which is always in USD.

load_data(sdate=None, edate=None, fname='funda')

Load funda data from file.

This method loads funda data from config.input_dir/comp/funda and stores it in the data attribute. The data has index = datadate/gvkey and is sorted on gvkey/datadate.

Parameters:
  • sdate – Start date (‘yyyy-mm-dd’). If None, from the earliest.

  • edate – End date (‘yyyy-mm-dd’). If None, to the latest.

  • fname – The funda file name.

merge_with_fundq(fundq)

Merge funda with fundq.

Merge funda with quarterly-updated annual data generated from fundq. If a value exists in both datasets, funda takes priority.

Note

JKP create characteristics in funda and fundq separately and merge them, whereas we merge the raw data first and then generate characteristics. Since some variables in funda are not available in fundq, e.g., ebitda, JKP synthesize those unavailable variables with other variables and create characteristics, even when they are available in funda. We prefer to merge funda with fundq at the raw data level and create characteristics from the merged data.

Columns in both funda and fundq:

datadate, cusip, cik, sic, naics, sale, revt, cogs, xsga, dp, xrd, ib, nopi, spi, pi, txp, ni, txt, xint, capx, oancf, gdwlia, gdwlip, rect, act, che, ppegt, invt, at, aco, intan, ao, ppent, gdwl, lct, dlc, dltt, lt, pstk, ap, lco, lo, drc, drlt, txdi, ceq, scstkc, csho, prcc_f, oibdp, oiadp, mii, xopr, xi, do, xido, ibc, dpc, xidoc, fincf, fiao, txbcof, dltr, dlcch, prstkc, sstk, dv, ivao, ivst, re, txditc, txdb, seq, mib, icapt, ajex, curcd, exratd

Columns in funda but not in fundq:

xad, gp, ebitda, ebit, txfed, txfo, dvt, ob, gwo, fatb, fatl, dm, dcvt, cshrc, dcpstk, emp, xlr, ds, dvc, itcb, pstkrv, pstkl, dltis, ppenb, ppenls

Parameters:
  • fundq – FUNDQ instance.

postprocess()

Postprocess data.

This method deletes temporary variable columns (columns starting with ‘_’) and replaces infinite values with nan.
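The two clean-up steps described above can be sketched with pandas. This is an illustration on a made-up DataFrame, not the method itself; the column names at_gr1 and _tmp are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one characteristic and one temporary column.
data = pd.DataFrame({"at_gr1": [0.1, np.inf, -np.inf], "_tmp": [1.0, 2.0, 3.0]})

# Drop temporary columns (names starting with '_').
keep = [col for col in data.columns if not col.startswith("_")]

# Replace infinite values with nan.
cleaned = data[keep].replace([np.inf, -np.inf], np.nan)
```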

update_variables()

Preprocess data before creating firm characteristics.

  1. Synthesize missing values with other variables.

  2. Create frequently used variables.

FUNDQ

class pyanomaly.characteristics.FUNDQ(alias=None, data=None)

Bases: FCPanel

Class to generate firm characteristics from fundq.

The firm characteristics defined in this class can be viewed using show_available_chars().

Parameters:
  • alias (str, list, or dict) –

    Firm characteristics to generate and their aliases. If None, all available firm characteristics are generated.

    • str: A column name in the mapping file (config.mapping_file_path). The firm characteristics defined in this class and in the alias column of the mapping file are generated. See Mapping File.

    • list: List of firm characteristics (method names).

    • dict: Dict of firm characteristics and their aliases of the form {method name: alias}.

    If aliases are not specified, method names are used as aliases.

  • data – DataFrame of fundq data with index = datadate/gvkey, sorted on gvkey/datadate. The fundq data can be given at initialization or loaded later using load_data().

Methods

load_data

Load fundq data from file.

remove_duplicates

Drop duplicates.

convert_currency

Convert currency to USD.

create_qitems_from_yitems

Quarterize ytd items.

update_variables

Preprocess data before creating firm characteristics.

postprocess

Postprocess data.

Firm characteristic generation methods have a name like c_characteristic().

c_chtx()

Tax expense surprise. Thomas and Zhang (2011)

c_ni_inc8q()

Number of consecutive quarters with earnings increases. Barth, Elliott, and Finn (1999)

c_niq_at()

Quarterly return on assets. Balakrishnan, Bartov, and Faurel (2010)

c_niq_at_chg1()

Change in quarterly return on assets. Balakrishnan, Bartov, and Faurel (2010)

c_niq_be()

Return on equity (quarterly). Hou, Xue, and Zhang (2015)

c_niq_be_chg1()

Change in quarterly return on equity. Balakrishnan, Bartov, and Faurel (2010)

c_niq_su()

Earnings surprise. Foster, Olsen, and Shevlin (1984)

c_ocfq_saleq_std()

Cash flow volatility. Huang (2009)

c_roavol()

ROA volatility. Francis et al. (2004)

c_rsup()

Revenue surprise (Kama). Kama (2009)

c_saleq_su()

Revenue surprise. Jegadeesh and Livnat (2006)

c_stdacc()

Accrual volatility. Bandyopadhyay, Huang, and Wirjanto (2010)

convert_currency()

Convert currency to USD.

Convert the currency of fundq to USD. This method needs to be called if

  1. the data contains non-USD-denominated firms, e.g., CAD; and

  2. CRSP’s market equity is used, which is always in USD.

create_qitems_from_yitems()

Quarterize ytd items.

Quarterize ytd variables, Xy’s, and use them to fill missing Xq’s (if Xq exists) or to create new quarterly variables (if Xq does not exist).
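One common way to quarterize a year-to-date item is to difference it within each fiscal year and keep the first quarter as is. The sketch below assumes that approach and assumes every quarter of a fiscal year is present; it is not taken from the library, and capxq is a hypothetical quarterly column used only to show the gap filling.

```python
import numpy as np
import pandas as pd

# Hypothetical fundq extract, sorted on gvkey/fyearq/fqtr.
fundq = pd.DataFrame(
    {
        "gvkey": ["001"] * 5,
        "fyearq": [2020, 2020, 2020, 2020, 2021],
        "fqtr": [1, 2, 3, 4, 1],
        "capxy": [10.0, 25.0, 45.0, 70.0, 12.0],  # year-to-date values
        "capxq": [10.0, np.nan, 20.0, np.nan, np.nan],  # quarterly values with gaps
    }
)

# Previous ytd value within the same firm and fiscal year (nan in the first quarter).
prev_ytd = fundq.groupby(["gvkey", "fyearq"])["capxy"].shift(1)

# Quarterly value implied by the ytd item; the first quarter equals the ytd value.
from_ytd = fundq["capxy"] - prev_ytd.fillna(0.0)

# Fill missing quarterly values with the quarterized ytd values.
fundq["capxq"] = fundq["capxq"].fillna(from_ytd)
```

If no quarterly column existed, from_ytd itself would serve as the new quarterly variable.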

generate_funda_vars()

Generate quarterly-updated annual data from fundq.

The following variables are annualized by cumulating over the past 4 quarters.

‘cogs’, ‘xsga’, ‘xint’, ‘dp’, ‘txt’, ‘xrd’, ‘spi’, ‘sale’, ‘revt’, ‘xopr’, ‘oibdp’, ‘oiadp’, ‘ib’, ‘ni’, ‘xido’, ‘nopi’, ‘mii’, ‘pi’, ‘xi’, ‘oancf’, ‘dv’, ‘sstk’, ‘dlcch’, ‘capx’, ‘dltr’, ‘txbcof’, ‘xidoc’, ‘dpc’, ‘fiao’, ‘ibc’, ‘prstkc’, ‘fincf’.

Returns:

DataFrame of quarterly-updated annual data.
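Cumulating over the past 4 quarters amounts to a rolling sum within each firm. The following sketch shows this for a single item under the assumption that quarters are consecutive with no gaps; the column names saleq and sale are illustrative and the code is not the library's implementation.

```python
import pandas as pd

# Hypothetical quarterly sales for one firm, sorted on gvkey/datadate.
fundq = pd.DataFrame(
    {
        "gvkey": ["001"] * 6,
        "datadate": pd.to_datetime(
            ["2020-03-31", "2020-06-30", "2020-09-30",
             "2020-12-31", "2021-03-31", "2021-06-30"]
        ),
        "saleq": [10.0, 12.0, 11.0, 15.0, 13.0, 14.0],
    }
).sort_values(["gvkey", "datadate"])

# Annualize by summing the latest 4 quarters within each firm.
fundq["sale"] = fundq.groupby("gvkey")["saleq"].transform(
    lambda s: s.rolling(4).sum()
)
```

The first three quarters of each firm are nan because fewer than 4 quarters are available; from then on, the annual figure is refreshed every quarter.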

load_data(sdate=None, edate=None, fname='fundq')

Load fundq data from file.

This method loads fundq data from config.input_dir/comp/fundq and stores it in the data attribute. The data has index = datadate/gvkey and is sorted on gvkey/datadate.

Parameters:
  • sdate – Start date (‘yyyy-mm-dd’). If None, from the earliest.

  • edate – End date (‘yyyy-mm-dd’). If None, to the latest.

  • fname – The fundq file name.

postprocess()

Postprocess data.

This method deletes temporary variable columns (columns starting with ‘_’) and replaces infinite values with nan.

remove_duplicates()

Drop duplicates.

In fundq, there are duplicate rows (rows with the same datadate and gvkey). Remove duplicates in the following order:

  1. Remove records with missing fqtr.

  2. Choose records with the maximum fyearq.

  3. Choose records with the minimum fqtr.
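The three steps above can be sketched with pandas as a filter followed by a sort and drop_duplicates. This is an illustration on made-up rows, not the method itself, and it assumes that step 1 applies only to rows that are actually duplicated.

```python
import numpy as np
import pandas as pd

# Hypothetical fundq rows: four share the same datadate/gvkey.
fundq = pd.DataFrame(
    {
        "datadate": pd.to_datetime(["2020-03-31"] * 4 + ["2020-06-30"]),
        "gvkey": ["001"] * 5,
        "fyearq": [2019, 2020, 2020, 2020, 2020],
        "fqtr": [4.0, 2.0, 1.0, np.nan, 2.0],
        "saleq": [1.0, 2.0, 3.0, 4.0, 5.0],
    }
)

key = ["datadate", "gvkey"]
is_dup = fundq.duplicated(key, keep=False)

# 1. Among duplicated rows, remove those with missing fqtr.
step1 = fundq[~(is_dup & fundq["fqtr"].isna())]

# 2. and 3. Sort so that the maximum fyearq, then the minimum fqtr, comes
# first within each datadate/gvkey, and keep that first row.
deduped = (
    step1.sort_values(key + ["fyearq", "fqtr"], ascending=[True, True, False, True])
    .drop_duplicates(key, keep="first")
    .reset_index(drop=True)
)
```

For 2020-03-31 the surviving row is the one with fyearq = 2020 and fqtr = 1.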

update_variables()

Preprocess data before creating firm characteristics.

  1. Synthesize missing values with other variables.

  2. Create frequently used variables.

CRSPM

class pyanomaly.characteristics.CRSPM(alias=None, data=None)

Bases: FCPanel

Class to generate firm characteristics from crspm.

The firm characteristics defined in this class can be viewed using show_available_chars().

Parameters:
  • alias (str, list, or dict) –

    Firm characteristics to generate and their aliases. If None, all available firm characteristics are generated.

    • str: A column name in the mapping file (config.mapping_file_path). The firm characteristics defined in this class and in the alias column of the mapping file are generated. See Mapping File.

    • list: List of firm characteristics (method names).

    • dict: Dict of firm characteristics and their aliases of the form {method name: alias}.

    If aliases are not specified, method names are used as aliases.

  • data – DataFrame of crspm data with index = date/permno, sorted on permno/date. The crspm data can be given at initialization or loaded later using load_data().

Methods

load_data

Load crspm data from file.

filter_data

Filter data.

update_variables

Preprocess data before creating firm characteristics.

merge_with_factors

Merge crspm with factors.

postprocess

Postprocess data.

Firm characteristic generation methods have a name like c_characteristic().

c_beta_60m()

Market beta (Org, JKP). Fama and MacBeth (1973)

c_chcsho_12m()

Net stock issues (JKP). Pontiff and Woodgate (2008)

c_chmom()

Change in 6-month momentum. Gettleman and Marks (2006)

c_div12m_me()

Dividend yield (JKP). Litzenberger and Ramaswamy (1979)

c_divi()

Dividend initiation. Michaely, Thaler, and Womack (1995)

c_divo()

Dividend omission. Michaely, Thaler, and Womack (1995)

c_dolvol()

Dollar trading volume (Org, GHZ). Brennan, Chordia, and Subrahmanyam (1998)

c_eqnpo_12m()

Composite equity issuance (JKP, 12 months). Daniel and Titman (2006)

c_eqnpo_60m()

Composite equity issuance (Org). Daniel and Titman (2006)

c_indmom()

Industry momentum. Moskowitz and Grinblatt (1999)

c_ipo()

Initial public offerings. Loughran and Ritter (1995)

c_market_equity()

Market equity. Banz (1981)

c_price()

Share price. Miller and Scholes (1982)

c_resff3_12_1()

12 month residual momentum. Blitz, Huij, and Martens (2011)

c_resff3_6_1()

6 month residual momentum. Blitz, Huij, and Martens (2011)

c_ret_12_1()

Momentum (12 months). Jegadeesh and Titman (1993)

c_ret_12_6()

Intermediate momentum (7-12). Novy-Marx (2012)

c_ret_1_0()

Short-term reversal. Jegadeesh (1990)

c_ret_36_12()

Long-term reversal (12-36). De Bondt and Thaler (1985)

c_ret_3_1()

Momentum (3 months). Jegadeesh and Titman (1993)

c_ret_60_12()

Long-term reversal (12-60). De Bondt and Thaler (1985)

c_ret_6_1()

Momentum (6 months). Jegadeesh and Titman (1993)

c_ret_9_1()

Momentum (9 months). Jegadeesh and Titman (1993)

c_seas_11_15an()

Years 11-15 lagged returns, annual. Heston and Sadka (2008)

c_seas_11_15na()

Years 11-15 lagged returns, nonannual. Heston and Sadka (2008)

c_seas_16_20an()

Years 16-20 lagged returns, annual. Heston and Sadka (2008)

c_seas_16_20na()

Years 16-20 lagged returns, nonannual. Heston and Sadka (2008)

c_seas_1_1an()

Year 1-lagged return, annual. Heston and Sadka (2008)

c_seas_1_1na()

Year 1-lagged return, nonannual. Heston and Sadka (2008)

c_seas_2_5an()

Years 2-5 lagged returns, annual. Heston and Sadka (2008)

c_seas_2_5na()

Years 2-5 lagged returns, nonannual. Heston and Sadka (2008)

c_seas_6_10an()

Years 6-10 lagged returns, annual. Heston and Sadka (2008)

c_seas_6_10na()

Years 6-10 lagged returns, nonannual. Heston and Sadka (2008)

c_turn()

Share turnover (Org, GHZ). Datar, Naik, and Radcliffe (1998)

filter_data()

Filter data.

The data is filtered using the following filters:

  • shrcd in [10, 11, 12]

Note

We do not filter the data using exchange code (exchcd in [1 (NYSE), 2 (ASE), 3 (NASDAQ)]) because exchcd can change when a stock is delisted: If the data is filtered using exchcd, the data of the delist month can be lost.
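
The toy example below illustrates the point of the note. The exchcd value in the delisting month is a made-up placeholder, and the code is a plain pandas sketch rather than the method's implementation.

```python
import pandas as pd

# Toy monthly records for one stock. The exchcd value in the last
# (delisting) month is a hypothetical placeholder, not taken from CRSP.
toy = pd.DataFrame({
    'date': pd.to_datetime(['2020-01-31', '2020-02-29', '2020-03-31']),
    'permno': [10000, 10000, 10000],
    'shrcd': [11, 11, 11],
    'exchcd': [3, 3, 0],
})

# Filter on share code only: all three months survive.
by_shrcd = toy[toy['shrcd'].isin([10, 11, 12])]

# Adding an exchange-code filter would drop the delisting month.
by_both = by_shrcd[by_shrcd['exchcd'].isin([1, 2, 3])]
```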

load_data(sdate=None, edate=None, fname='crspm')

Load crspm data from file.

This method loads crspm data from config.input_dir/crspm and stores it in the data attribute. The data has index = date/permno and is sorted on permno/date.

Note

In CRSP monthly tables, date is the last business day of the month, whereas datadate in Compustat is the end-of-month date. To make the two dates consistent, crspm dates are shifted to the end of the month.
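
A minimal pandas sketch of such a shift is shown below. It uses pd.offsets.MonthEnd(0), which rolls a date forward to the end of its month and leaves month-end dates unchanged; whether load_data() uses this exact call is not stated here.

```python
import pandas as pd

# Last business days of three months (2023-04-28 and 2023-09-29 are Fridays).
crsp_dates = pd.Series(pd.to_datetime(['2023-03-31', '2023-04-28', '2023-09-29']))

# Roll each date forward to the calendar month end.
month_end = crsp_dates + pd.offsets.MonthEnd(0)
```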

Parameters:
  • sdate – Start date (‘yyyy-mm-dd’). If None, from the earliest.

  • edate – End date (‘yyyy-mm-dd’). If None, to the latest.

  • fname – The crspm file name.

merge_with_factors(factors=None)

Merge crspm with factors.

The factors should contain Fama-French 3 factors with column names as defined in config.factor_names.

Parameters:

factors – DataFrame of factors with index = date, or list of factors. If None or a list, factor data are read from config.factors_monthly_fname; if a list is given, only the factors in the list are merged.

postprocess()

Postprocess data.

This method deletes temporary variable columns (columns starting with ‘_’) and replaces infinite values with nan.

update_variables()

Preprocess data before creating firm characteristics.

  • Convert negative prices (quotes) to positive.

  • Convert shares outstanding (shrout) unit from thousands to millions (divide by 1000).

  • Convert trading volume (vol) unit from 100 shares to shares (multiply by 100).

  • Adjust trading volume following Gao and Ritter (2010).

  • Create frequently used variables.
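
The unit conversions in this list can be sketched as follows. The toy values are hypothetical, and the Gao and Ritter (2010) volume adjustment and the derived variables are deliberately left out.

```python
import pandas as pd

toy = pd.DataFrame({
    'prc': [-12.5, 30.0],         # a negative price marks a bid/ask average
    'shrout': [1500.0, 20000.0],  # thousands of shares
    'vol': [250.0, 1000.0],       # hundreds of shares
})

toy['prc'] = toy['prc'].abs()          # negative quotes to positive prices
toy['shrout'] = toy['shrout'] / 1000   # thousands -> millions
toy['vol'] = toy['vol'] * 100          # hundreds of shares -> shares
```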

CRSPD

class pyanomaly.characteristics.CRSPDRaw(alias=None, data=None)

Bases: FCPanel

Class that handles crspd data.

This class contains daily crspd data and is used to generate monthly firm characteristics in CRSPD.

Parameters:
  • alias (str, list, or dict) –

    Firm characteristics to generate and their aliases. If None, all available firm characteristics are generated.

    • str: A column name in the mapping file (config.mapping_file_path). The firm characteristics defined in this class and in the alias column of the mapping file are generated. See Mapping File.

    • list: List of firm characteristics (method names).

    • dict: Dict of firm characteristics and their aliases of the form {method name: alias}.

    If aliases are not specified, method names are used as aliases.

  • data – DataFrame of crspd data with index = date/permno, sorted on permno/date. The crspd data can be given at initialization or loaded later using load_data().

Methods

load_data

Load crspd data from file.

filter_data

Filter data.

update_variables

Preprocess data before creating firm characteristics.

merge_with_factors

Merge crspd with factors.

get_idym_group

Get id-month group.

get_idym_group_size

Get id-month group sizes.

apply_to_idyms

Apply a function to each id-month group.

apply_to_idyms(data, function, n_ret, *args, data2=None)

Apply a function to each id-month group.

This method groups data by id-month and applies function to each group. This method can be used when the function is a reduce function and requires only the data within the month.

Parameters:
  • data – DataFrame, Series, ndarray, or (list of) columns. It should have the same length and order as the data attribute.

  • function – Jitted reduce function to apply to groups. Its arguments should be (gbdata, *args) or (gbdata, gbdata2, *args), where gbdata (gbdata2) is a group of data (data2).

  • n_ret – Number of returns of function. If None, it is assumed to be the same as the column size of data.

  • *args – Additional arguments of function.

  • data2 – DataFrame, Series, ndarray, str, or int. Optional argument when function requires two sets of input data.

Returns:

Concatenated value of the outputs of function. Size = (number of id-month groups) x n_ret.

filter_data()

Filter data.

The data is filtered using the following filters:

  • shrcd in [10, 11, 12]

Note

We do not filter the data using exchange code (exchcd in [1 (NYSE), 2 (ASE), 3 (NASDAQ)]) because exchcd can change when a stock is delisted: If the data is filtered using exchcd, the data of the delist month can be lost.

get_idym_group()

Get id-month group.

Group data by permno and year-month and return the GroupBy object.

Returns:

Pandas GroupBy object.
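
For readers unfamiliar with id-month grouping, the following plain pandas sketch shows one way to group daily data by permno and year-month and to obtain the group sizes. The toy data and variable names are hypothetical, and the library's own implementation may differ.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [
        (pd.Timestamp('2023-03-30'), 10000),
        (pd.Timestamp('2023-03-31'), 10000),
        (pd.Timestamp('2023-04-03'), 10000),
        (pd.Timestamp('2023-03-30'), 20000),
        (pd.Timestamp('2023-03-31'), 20000),
    ],
    names=['date', 'permno'],
)
daily = pd.DataFrame({'ret': [0.01, -0.02, 0.03, 0.00, 0.04]}, index=index)

# Grouping keys: the id and the year-month of each daily observation.
permno = daily.index.get_level_values('permno')
year_month = daily.index.get_level_values('date').to_period('M')

idym_group = daily.groupby([permno, year_month])
group_sizes = idym_group.size().to_numpy()   # one entry per id-month
monthly_sum = idym_group['ret'].sum()        # a reduce function per id-month
```

Because the data is sorted on permno/date, each id-month group is a contiguous block of rows, which is what makes an array of group sizes sufficient to describe the grouping.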

get_idym_group_size()

Get id-month group sizes.

Returns:

Ndarray of id-month group sizes.

load_data(sdate=None, edate=None, fname='crspd')

Load crspd data from file.

This method loads crspd data from config.input_dir/crspd and stores it in the data attribute. The data has index = date/permno and is sorted on permno/date.

Parameters:
  • sdate – Start date (‘yyyy-mm-dd’). If None, from the earliest.

  • edate – End date (‘yyyy-mm-dd’). If None, to the latest.

  • fname – The crspd file name.

merge_with_factors(factors=None)

Merge crspd with factors.

The factors should contain Fama-French 3 factors and Hou-Xue-Zhang 4 factors with column names as defined in config.factor_names.

Parameters:

factors – DataFrame of factors with index = date, or list of factors. If None or a list, factor data are read from config.factors_daily_fname; if a list is given, only the factors in the list are merged.
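
Conceptually, merging factors attaches the factor values of each date to every stock observed on that date. The sketch below shows this with plain pandas on toy data; the variable names are hypothetical and this is not the method's actual code.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [pd.to_datetime(['2023-03-30', '2023-03-31']), [10000, 20000]],
    names=['date', 'permno'],
)
daily = pd.DataFrame({'ret': [0.01, 0.00, -0.02, 0.04]}, index=index)

factors = pd.DataFrame(
    {'mktrf': [0.005, 0.012], 'smb_ff': [0.001, -0.002], 'hml': [0.000, 0.003]},
    index=pd.DatetimeIndex(['2023-03-30', '2023-03-31'], name='date'),
)

# Merge only a subset of the factors.
wanted = ['mktrf', 'hml']

# Look up the factor row of each observation's date (dates missing from
# factors would give nan, as in a left join), then align with daily.
dates = daily.index.get_level_values('date')
aligned = factors[wanted].reindex(dates)
aligned.index = daily.index
merged = pd.concat([daily, aligned], axis=1)
```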

update_variables()

Preprocess data before creating firm characteristics.

  • Convert negative prices (quotes) to positive.

  • Set askhi and bidlo to nan if they are negative, the price is negative, or the volume is 0.

  • Convert cfacpr of 0 to nan.

  • Convert shares outstanding (shrout) unit from thousands to millions (divide by 1000).

  • Adjust trading volume following Gao and Ritter (2010).

  • Create frequently used variables.
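
A hedged sketch of the cleaning steps follows. It assumes that the askhi/bidlo conditions are checked on the raw (signed) price, so the flags are built before the price is made positive, and that each of askhi and bidlo is also checked for its own sign. The toy values are hypothetical, and the Gao and Ritter (2010) adjustment is omitted.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'prc': [-10.0, 20.0, 30.0],
    'askhi': [10.5, -20.5, 31.0],
    'bidlo': [9.5, 19.5, 29.0],
    'vol': [100.0, 200.0, 0.0],
    'cfacpr': [1.0, 0.0, 2.0],
    'shrout': [1500.0, 2500.0, 3500.0],
})

# Flag rows before prc is made positive; otherwise the sign is lost.
unreliable = (toy['prc'] < 0) | (toy['vol'] == 0)
toy.loc[unreliable | (toy['askhi'] < 0), 'askhi'] = np.nan
toy.loc[unreliable | (toy['bidlo'] < 0), 'bidlo'] = np.nan

toy['prc'] = toy['prc'].abs()                     # negative quotes to positive
toy['cfacpr'] = toy['cfacpr'].replace(0, np.nan)  # zero adjustment factor to nan
toy['shrout'] = toy['shrout'] / 1000              # thousands -> millions
```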

class pyanomaly.characteristics.CRSPD(alias=None, data=None)

Bases: FCPanel

Class to generate firm characteristics from crspd.

This class has a CRSPDRaw object as a member attribute and uses it to generate monthly firm characteristics. CRSPDRaw.data contains daily crspd data and CRSPD.data contains monthly firm characteristics. The firm characteristics defined in this class can be viewed using show_available_chars().

Parameters:
  • alias (str, list, or dict) –

    Firm characteristics to generate and their aliases. If None, all available firm characteristics are generated.

    • str: A column name in the mapping file (config.mapping_file_path). The firm characteristics defined in this class and in the alias column of the mapping file are generated. See Mapping File.

    • list: List of firm characteristics (method names).

    • dict: Dict of firm characteristics and their aliases of the form {method name: alias}.

    If aliases are not specified, method names are used as aliases.

  • data – DataFrame of crspd data with index = date/permno, sorted on permno/date. The crspd data can be given at initialization or loaded later using load_data().

Attributes

cd

CRSPDRaw object that stores daily crspd data.

Methods

load_data

Load crspd data from file.

filter_data

Filter data.

update_variables

Preprocess data before creating firm characteristics.

merge_with_factors

Merge crspd with factors.

postprocess

Postprocess data.

get_idym_group

Get id-month group.

get_idym_group_size

Get id-month group sizes.

Firm characteristic generation methods have a name like c_characteristic().

c_ami_126d()

Illiquidity. Amihud (2002)

c_baspread()

Bid-ask spread. Amihud and Mendelson (1986)

c_beta()

Market beta (GHZ). Fama and MacBeth (1973)

c_beta_dimson_21d()

Dimson Beta. Dimson (1979)

c_betabab_1260d()

Frazzini-Pedersen beta. Frazzini and Pedersen (2014)

c_betadown_252d()

Downside beta. Ang, Chen, and Xing (2006)

c_betasq()

Beta squared (GHZ). Fama and MacBeth (1973)

c_bidaskhl_21d()

High-low bid-ask spread. Corwin and Schultz (2012)

c_corr_1260d()

Market correlation. Asness et al. (2020)

c_coskew_21d()

Coskewness. Harvey and Siddique (2000)

c_dolvol_126d()

Dollar trading volume (JKP). Brennan, Chordia, and Subrahmanyam (1998)

c_dolvol_var_126d()

Volatility of dollar trading volume (JKP). Chordia, Subrahmanyam, and Anshuman (2001)

c_idiovol()

Idiosyncratic volatility (GHZ). Ali, Hwang, and Trombley (2003)

c_iskew_capm_21d()

Idiosyncratic skewness (CAPM). Bali, Engle, and Murray (2016)

c_iskew_ff3_21d()

Idiosyncratic skewness (FF3). Bali, Engle, and Murray (2016)

c_iskew_hxz4_21d()

Idiosyncratic skewness (HXZ). Bali, Engle, and Murray (2016)

c_ivol_capm_21d()

Idiosyncratic volatility (CAPM). Ang et al. (2006)

c_ivol_capm_252d()

Idiosyncratic volatility (Org, JKP). Ali, Hwang, and Trombley (2003)

c_ivol_ff3_21d()

Idiosyncratic volatility (FF3). Ang et al. (2006)

c_ivol_hxz4_21d()

Idiosyncratic volatility (HXZ). Ang et al. (2006)

c_prc_highprc_252d()

52-week high. George and Hwang (2004)

c_pricedelay()

Price delay based on R-squared. Hou and Moskowitz (2005)

c_pricedelay_slope()

Price delay based on slopes. Hou and Moskowitz (2005)

c_retvol()

Return volatility. Ang et al. (2006)

c_rmax1_21d()

Maximum daily return. Bali, Cakici, and Whitelaw (2011)

c_rmax5_21d()

Highest 5 days of return. Bali, Brown, and Tang (2017)

c_rmax5_rvol_21d()

Highest 5 days of return to volatility. Asness et al. (2020)

c_rskew_21d()

Return skewness. Bali, Engle, and Murray (2016)

c_std_dolvol()

Volatility of dollar trading volume (GHZ). Chordia, Subrahmanyam, and Anshuman (2001)

c_std_turn()

Volatility of share turnover (GHZ). Chordia, Subrahmanyam, and Anshuman (2001)

c_trend_factor()

Trend factor. Han, Zhou, and Zhu (2016)

c_turnover_126d()

Share turnover (JKP). Datar, Naik, and Radcliffe (1998)

c_turnover_var_126d()

Volatility of share turnover (JKP). Chordia, Subrahmanyam, and Anshuman (2001)

c_zero_trades_126d()

Zero-trading days (6 months). Liu (2006)

c_zero_trades_21d()

Zero-trading days (1 month). Liu (2006)

c_zero_trades_252d()

Zero-trading days (12 months). Liu (2006)

filter_data()

Filter data.

This is a wrapping method of CRSPDRaw.filter_data().

get_idym_group()

Get id-month group.

This is a wrapping method of CRSPDRaw.get_idym_group().

get_idym_group_size()

Get id-month group sizes.

This is a wrapping method of CRSPDRaw.get_idym_group_size().

load_data(sdate=None, edate=None, fname='crspd')

Load crspd data from file.

This is a wrapping method of CRSPDRaw.load_data().

merge_with_factors(factors=None)

Merge crspd with factors.

This is a wrapping method of CRSPDRaw.merge_with_factors().

postprocess()

Postprocess data.

This method deletes temporary variable columns (columns starting with ‘_’) and replaces infinite values with nan.

update_variables()

Preprocess data before creating firm characteristics.

This method calls CRSPDRaw.update_variables() to update variables and initializes the data attribute.

Merge

class pyanomaly.characteristics.Merge(alias=None)

Bases: FCPanel

Class to generate firm characteristics from a combined dataset of crspm, crspd, funda, and fundq.

The firm characteristics defined in this class can be viewed using show_available_chars().

Methods

preprocess

Merge crspm, crspd, funda, and fundq.

postprocess

Postprocess data.

Firm characteristic generation methods have a name like c_characteristic().

c_age()

Firm age. Jiang, Lee, and Zhang (2005)

c_mispricing_mgmt()

Mispricing factor: Management. Stambaugh and Yuan (2016)

c_mispricing_perf()

Mispricing factor: Performance. Stambaugh and Yuan (2016)

c_qmj()

Quality minus Junk: Composite. Asness, Frazzini, and Pedersen (2018)

c_qmj_growth()

Quality minus Junk: Growth. Asness, Frazzini, and Pedersen (2018)

c_qmj_prof()

Quality minus Junk: Profitability. Asness, Frazzini, and Pedersen (2018)

c_qmj_safety()

Quality minus Junk: Safety. Asness, Frazzini, and Pedersen (2018)

postprocess()

Postprocess data.

This method deletes temporary variable columns (columns starting with ‘_’) and replaces infinite values with nan.

preprocess(crspm=None, crspd=None, funda=None, fundq=None, delete_data=True)

Merge crspm, crspd, funda, and fundq.

The crspd, funda, and fundq are left-joined to crspm, and the resulting data has index = date/permno. The frequency of the final data is the same as the frequency of crspm data. This method also checks if “ingredient” firm characteristics have been generated and generates them if necessary.

Parameters:
  • crspm – CRSPM instance.

  • crspd – CRSPD instance.

  • funda – FUNDA instance.

  • fundq – FUNDQ instance.

  • delete_data – True to delete the data of crspm, crspd, funda, and fundq after merge to save memory.

config

This module defines functions to set/get package configuration.

A configuration can be accessed by either get_config(attr) or config.attr.

Configurations

| Attribute | Description | Default value |
|---|---|---|
| input_dir | Input file top-level directory | ‘./input/’ |
| output_dir | Output file top-level directory | ‘./output/’ |
| log_dir | Log file directory | ‘./log/’ |
| mapping_file_path | Function-characteristic mapping file path | ‘./mapping.xlsx’ |
| factors_monthly_fname | Monthly factor file name | ‘factors_monthly’ |
| factors_daily_fname | Daily factor file name | ‘factors_daily’ |
| factor_names | Factor name mapping dictionary. The keys are the factor names used in PyAnomaly and the values are the factor names in the monthly (daily) factor file. This configuration can be used when factor files are obtained externally and have different factor names. | {‘rf’: ‘rf’, ‘mktrf’: ‘mktrf’, ‘smb_ff’: ‘smb_ff’, ‘hml’: ‘hml’, ‘smb_ff5’: ‘smb_ff5’, ‘rmw’: ‘rmw’, ‘cma’: ‘cma’, ‘smb_hxz’: ‘smb_hxz’, ‘inv’: ‘inv’, ‘roe’: ‘roe’, ‘smb_sy’: ‘smb_sy’, ‘mgmt’: ‘mgmt’, ‘perf’: ‘perf’} |
| replicate_jkp | Whether to replicate JKP version. True or False | False |
| float_type | Float data type. ‘float32’ or ‘float64’ | ‘float64’ |
| file_format | File format used to save data. ‘pickle’ or ‘parquet’ | ‘pickle’ |
| disable_jit | Disable jitting. Applicable only to non-cached jitted functions. | False |
| jit_parallel | Enable parallel looping in jitted functions. | True |
| numba_num_threads | Number of threads used in Numba parallel mode. Should be fewer than the number of CPU cores. | Number of CPU cores |
| debug | Print debugging messages. | False |
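
To illustrate the direction of the factor_names mapping (keys are PyAnomaly names, values are names in the factor file), the sketch below renames the columns of an externally obtained factor table. The external column names are hypothetical, and how PyAnomaly applies the mapping internally is not shown here.

```python
import pandas as pd

# Hypothetical column names in an externally obtained factor file.
external = pd.DataFrame(
    {'RF': [0.001], 'Mkt-RF': [0.012], 'SMB': [0.003], 'HML': [-0.002]}
)

# Keys: names used in PyAnomaly. Values: names in the factor file.
factor_names = {'rf': 'RF', 'mktrf': 'Mkt-RF', 'smb_ff': 'SMB', 'hml': 'HML'}

# Invert the mapping to rename file columns to the PyAnomaly names.
to_internal = {file_name: name for name, file_name in factor_names.items()}
factors = external.rename(columns=to_internal)
```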

| Factor model | Factors |
|---|---|
| Fama and French 3-factor model | mktrf, smb_ff, hml |
| Fama and French 5-factor model | mktrf, smb_ff5, hml, rmw, cma |
| Hou, Xue, and Zhang 4-factor model | mktrf, smb_hxz, inv, roe |
| Stambaugh and Yuan 4-factor model | mktrf, smb_sy, mgmt, perf |

Methods

set_config

Set configuration.

get_config

Get configuration.

pyanomaly.config.get_config(attr)

Get configuration.

Parameters:

attr – String. A configuration attribute.

Returns:

The value of the attribute.

Examples

>>> get_config('input_dir')
'./input/'

pyanomaly.config.set_config(**kwargs)

Set configuration.

Parameters:

**kwargs – Keyword arguments of configuration attributes and their values.

Examples

Change the float type to ‘float32’.

>>> set_config(float_type='float32')

Set input and output directories to ‘./my_input/’ and ‘./my_output/’, respectively.

>>> set_config(input_dir='./my_input/', output_dir='./my_output/')

datatools

This module defines functions for data handling.

Group-and-Apply

apply_to_groups

Group data and apply a function to each group.

apply_to_groups_jit

Group data and apply a function to each group (jitted version).

apply_to_groups_reduce_jit

Group data and apply a reduce function to each group (jitted version).

Classify/Trim/Filter/Winsorize

classify

Classify array.

trim

Trim array.

filter

Filter data.

winsorize

Winsorize array.

Merge/Populate/Shift

merge

Merge two data sets.

populate

Populate data.

shift

Shift data.

Data Inspection/Comparison

inspect_data

Inspect data.

compare_data

Compare two data sets.

Auxiliary Functions

to_month_end

Shift dates to the last dates of the same month.

add_months

Add months to dates.

pyanomaly.datatools.add_months(date, months, to_month_end=True)

Add months to dates.

Parameters:
  • date – Datetime Series.

  • months – Months to add. Can be negative.

  • to_month_end – If True, returned dates are end-of-month dates.

Returns:

Datetime Series of (date + months). Dates are end-of-month dates if to_month_end = True.
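
The behavior described above can be approximated with standard pandas offsets, as in the sketch below. The function name add_months_sketch is hypothetical and the code is not the library's implementation.

```python
import pandas as pd


def add_months_sketch(date: pd.Series, months: int, to_month_end: bool = True) -> pd.Series:
    """Add (possibly negative) months; optionally roll forward to month end."""
    shifted = date + pd.DateOffset(months=months)
    if to_month_end:
        # MonthEnd(0) leaves month-end dates unchanged and rolls others forward.
        shifted = shifted + pd.offsets.MonthEnd(0)
    return shifted


dates = pd.Series(pd.to_datetime(['2023-01-31', '2023-02-28', '2023-03-15']))
plus_one = add_months_sketch(dates, 1)
minus_one = add_months_sketch(dates, -1, to_month_end=False)
```

Note that without the month-end roll, 2023-02-28 plus one month is 2023-03-28, which is why to_month_end matters for month-end data.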

pyanomaly.datatools.apply_to_groups(data, ginfo, function, *args, data2=None)

Group data and apply a function to each group.

This function can be used for a complex groupby operation. The data (and data2) is grouped using the grouping information, ginfo, and function is applied to each group. The function can be either jitted or unjitted. If it is jitted, consider using apply_to_groups_jit() instead, which runs the for loop along the groups in parallel. This function is faster than groupby().apply(function) when the size of data is large.

Parameters:
  • data – DataFrame or ndarray (values of a DataFrame) to be grouped.

  • ginfo – Grouping information: integer (index level), str (column name), pandas GroupBy object, ndarray of group sizes, or list of group indexes. See the note below.

  • function – Function to apply to groups. Its arguments should be (gbdata, *args) or (gbdata, gbdata2, *args), where gbdata (gbdata2) is a group of data (data2).

  • *args – Additional arguments of function.

  • data2 – DataFrame or ndarray (values of a DataFrame) to be grouped. Optional argument when function requires two sets of input data.

Returns:

Concatenated value of the outputs of function.

Note

Suppose data is a DataFrame with index = date/id, sorted on id/date.

To apply a function to each id, ginfo can be set to any of the following.

  • ginfo = ‘id’ (index name)

  • ginfo = 1 (index level)

  • ginfo = data.groupby(‘id’) (GroupBy object)

  • ginfo = data.groupby(‘id’).size().to_numpy() (group size)

  • ginfo = list(data.groupby(‘id’).indices.values()) (group index)

To apply a function to each date, ginfo can be set to any of the following.

  • ginfo = ‘date’ (index name)

  • ginfo = 0 (index level)

  • ginfo = data.groupby(‘date’) (GroupBy object)

  • ginfo = list(data.groupby(‘date’).indices.values()) (group index)

Since data is sorted on id/date, group sizes can be used only when grouping by id. Providing group sizes is the most efficient method, followed by group indexes. Hence, if this function is called repeatedly, performance can be improved by generating group sizes (if data is sorted on the grouping index or column) or group indexes once, outside the call, and passing them as ginfo.
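
The difference between group sizes and group indexes can be seen in the NumPy sketch below. It is only an illustration of the idea, with hypothetical helper names (apply_by_sizes, apply_by_indexes), not the code used by apply_to_groups(). Group sizes describe contiguous blocks and therefore require data sorted on the grouping key, whereas group indexes list the row positions of each group and work for any ordering.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [['2023-03-31', '2023-04-30'], [10000, 20000, 30000]], names=['date', 'permno']
)
data = pd.DataFrame({'ret': [0.01, 0.05, 0.02, -0.03, -0.01, 0.03]}, index=index)
data = data.sort_index(level=['permno', 'date'])
values = data['ret'].to_numpy()


def apply_by_sizes(values, sizes, func):
    """Apply func to consecutive blocks of the given sizes."""
    out, start = [], 0
    for size in sizes:
        out.append(func(values[start:start + size]))
        start += size
    return np.array(out)


def apply_by_indexes(values, indexes, func):
    """Apply func to the rows selected by each array of positions."""
    return np.array([func(values[idx]) for idx in indexes])


# Data is sorted on permno, so each permno is a contiguous block.
sizes = data.groupby('permno').size().to_numpy()
sum_by_id = apply_by_sizes(values, sizes, np.sum)

# Rows of a given date are not contiguous, so positions are needed.
indexes = list(data.groupby('date').indices.values())
mean_by_date = apply_by_indexes(values, indexes, np.mean)
```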

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from pyanomaly.datatools import apply_to_groups
>>> data = pd.DataFrame(
...     {'ret': [0.01, 0.05, 0.02, 0.03, 0.11, -0.03, -0.01, 0.03, -0.05, 0.07],
...      'me': [100, 5000, 2000, 300, 150, 120, 4500, 2100, 305, 140],
...      'me_nyse': [np.nan, 5000, 2000, 300, 150, np.nan, 4500, 2100, 305, 140],
...      },
...     index=pd.MultiIndex.from_product([['2023-03-31', '2023-04-30'], [10000, 20000, 30000, 40000, 50000]],
...                                      names=['date', 'permno'])
... ).sort_index(level=['permno', 'date'])
>>> data
                     ret    me  me_nyse
date       permno
2023-03-31 10000   0.010   100      NaN
2023-04-30 10000  -0.030   120      NaN
2023-03-31 20000   0.050  5000 5000.000
2023-04-30 20000  -0.010  4500 4500.000
2023-03-31 30000   0.020  2000 2000.000
2023-04-30 30000   0.030  2100 2100.000
2023-03-31 40000   0.030   300  300.000
2023-04-30 40000  -0.050   305  305.000
2023-03-31 50000   0.110   150  150.000
2023-04-30 50000   0.070   140  140.000

Define the function to apply.

>>> def rolling_sum(x, n):
...    return x.rolling(n).sum()

Group data by ‘permno’ and calculate rolling sum of ‘ret’ and ‘me’.

>>> apply_to_groups(data[['ret', 'me']], 'permno', rolling_sum, 2)
                     ret       me
date       permno
2023-03-31 10000     NaN      NaN
2023-04-30 10000  -0.020  220.000
2023-03-31 20000     NaN      NaN
2023-04-30 20000   0.040 9500.000
2023-03-31 30000     NaN      NaN
2023-04-30 30000   0.050 4100.000
2023-03-31 40000     NaN      NaN
2023-04-30 40000  -0.020  605.000
2023-03-31 50000     NaN      NaN
2023-04-30 50000   0.180  290.000

The following are equivalent to the above.

>>> gb = data.groupby('permno')
>>> apply_to_groups(data[['ret', 'me']], gb, rolling_sum, 2)
>>> gsize = gb.size().to_numpy()
>>> apply_to_groups(data[['ret', 'me']], gsize, rolling_sum, 2)

Group data by ‘date’ and calculate cross-sectional mean of ‘ret’.

>>> apply_to_groups(data['ret'], 'date', np.mean)
[[0.044]
 [0.002]]

The following are equivalent to the above.

>>> gb = data.groupby('date')
>>> apply_to_groups(data['ret'], gb, np.mean)
>>> gidx = list(gb.indices.values())
>>> apply_to_groups(data['ret'], gidx, np.mean)

pyanomaly.datatools.apply_to_groups_jit(data, ginfo, function, n_ret, *args, data2=None)

Group data and apply a function to each group (jitted version).

This function is similar to apply_to_groups(). The for loop along the groups is jitted for faster performance. The first call of this function can be slow as jitting takes place when first called. The function should be jitted, and the row size of its returns should be the same as the row size of the input data. For reduce functions, e.g., sum and mean, use apply_to_groups_reduce_jit().

Parameters:
  • data – DataFrame or ndarray (values of a DataFrame) to be grouped.

  • ginfo – Grouping information: integer (index level), str (column name), pandas GroupBy object, ndarray of group sizes, or list of group indexes. See apply_to_groups() for more details.

  • function – Jitted function to apply to groups. Its arguments should be (gbdata, *args) or (gbdata, gbdata2, *args), where gbdata (gbdata2) is a group of data (data2).

  • n_ret – Number of returns of function. If None, it is assumed to be the same as the column size of data.

  • *args – Additional arguments of function.

  • data2 – DataFrame or ndarray (values of a DataFrame) to be grouped. Optional argument when function requires two sets of input data.

Returns:

Concatenated value of the outputs of function.

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from pyanomaly.datatools import apply_to_groups_jit
>>> from pyanomaly.numba_support import roll_sum
>>> data = pd.DataFrame(
...     {'ret': [0.01, 0.05, 0.02, 0.03, 0.11, -0.03, -0.01, 0.03, -0.05, 0.07],
...      'me': [100, 5000, 2000, 300, 150, 120, 4500, 2100, 305, 140],
...      'me_nyse': [np.nan, 5000, 2000, 300, 150, np.nan, 4500, 2100, 305, 140],
...      },
...     index=pd.MultiIndex.from_product([['2023-03-31', '2023-04-30'], [10000, 20000, 30000, 40000, 50000]],
...                                      names=['date', 'permno'])
... ).sort_index(level=['permno', 'date'])
>>> data
                     ret    me  me_nyse
date       permno
2023-03-31 10000   0.010   100      NaN
2023-04-30 10000  -0.030   120      NaN
2023-03-31 20000   0.050  5000 5000.000
2023-04-30 20000  -0.010  4500 4500.000
2023-03-31 30000   0.020  2000 2000.000
2023-04-30 30000   0.030  2100 2100.000
2023-03-31 40000   0.030   300  300.000
2023-04-30 40000  -0.050   305  305.000
2023-03-31 50000   0.110   150  150.000
2023-04-30 50000   0.070   140  140.000

Group data by ‘permno’ and calculate rolling sum of ‘ret’ and ‘me’ using numba_support.roll_sum().

>>> apply_to_groups_jit(data[['ret', 'me']], 'permno', roll_sum, 2, 2)
array([[      nan,       nan],
       [-2.00e-02,  2.20e+02],
       [      nan,       nan],
       [ 4.00e-02,  9.50e+03],
       [      nan,       nan],
       [ 5.00e-02,  4.10e+03],
       [      nan,       nan],
       [-2.00e-02,  6.05e+02],
       [      nan,       nan],
       [ 1.80e-01,  2.90e+02]])

The following are equivalent to the above.

>>> gb = data.groupby('permno')
>>> apply_to_groups_jit(data[['ret', 'me']], gb, roll_sum, 2, 2)
>>> gsize = gb.size().to_numpy()
>>> apply_to_groups_jit(data[['ret', 'me']], gsize, roll_sum, 2, 2)

pyanomaly.datatools.apply_to_groups_reduce_jit(data, ginfo, function, n_ret, *args, data2=None)

Group data and apply a reduce function to each group (jitted version).

This function is similar to apply_to_groups(). The for loop along the groups is jitted for faster performance. The first call of this function can be slow as jitting takes place when first called. The function should be jitted and a reduce function such as mean or std: the row size of the returns should be 1.

Parameters:
  • data – DataFrame or ndarray (values of a DataFrame) to be grouped.

  • ginfo – Grouping information: integer (index level), str (column name), pandas GroupBy object, ndarray of group sizes, or list of group indexes. See apply_to_groups() for more details.

  • function – Jitted function to apply to groups. Its arguments should be (gbdata, *args) or (gbdata, gbdata2, *args), where gbdata (gbdata2) is a group of data (data2).

  • n_ret – Number of returns of function. If None, it is assumed to be the same as the column size of data.

  • *args – Additional arguments of function.

  • data2 – DataFrame or ndarray (values of a DataFrame) to be grouped. Optional argument when function requires two sets of input data.

Returns:

Concatenated value of the outputs of function.

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from pyanomaly.datatools import apply_to_groups_reduce_jit
>>> from pyanomaly.numba_support import nanstd
>>> data = pd.DataFrame(
...     {'ret': [0.01, 0.05, 0.02, 0.03, 0.11, -0.03, -0.01, 0.03, -0.05, 0.07],
...      'me': [100, 5000, 2000, 300, 150, 120, 4500, 2100, 305, 140],
...      'me_nyse': [np.nan, 5000, 2000, 300, 150, np.nan, 4500, 2100, 305, 140],
...      },
...     index=pd.MultiIndex.from_product([['2023-03-31', '2023-04-30'], [10000, 20000, 30000, 40000, 50000]],
...                                      names=['date', 'permno'])
... ).sort_index(level=['permno', 'date'])
>>> data
                     ret    me  me_nyse
date       permno
2023-03-31 10000   0.010   100      NaN
2023-04-30 10000  -0.030   120      NaN
2023-03-31 20000   0.050  5000 5000.000
2023-04-30 20000  -0.010  4500 4500.000
2023-03-31 30000   0.020  2000 2000.000
2023-04-30 30000   0.030  2100 2100.000
2023-03-31 40000   0.030   300  300.000
2023-04-30 40000  -0.050   305  305.000
2023-03-31 50000   0.110   150  150.000
2023-04-30 50000   0.070   140  140.000

Group data by ‘date’ and calculate cross-sectional standard deviation of ‘ret’ using numba_support.nanstd().

>>> apply_to_groups_reduce_jit(data['ret'], 'date', nanstd, None)
array([0.03974921, 0.04816638])

The following is equivalent to the above.

>>> gidx = list(data.groupby('date').indices.values())
>>> apply_to_groups_reduce_jit(data['ret'], gidx, nanstd, None)

pyanomaly.datatools.classify(array, split, ascending=True, ginfo=None, by_array=None)

Classify array.

Classify (group) array into split classes based on its value. Class labels are set to 0, 1, … where 0 corresponds to the lowest (highest) value group if ascending = True (False). If array contains nan, their classes are set to nan.

Parameters:
  • array – Nx1 ndarray or Series to be classified.

  • split – Number of classes (for equal-size quantiles) or list of quantiles, e.g., (0.3, 0.7).

  • ascending (bool) – Sorting order.

  • ginfo – Grouping information: integer (index level), str (column name), pandas GroupBy object, ndarray of group sizes, or list of group indexes. If given, array is classified within each group. See apply_to_groups() for more details.

  • by_array – Array based on which cut points are determined. If None, by_array = array. E.g., array can be set to ME and by_array to NYSE-ME to group firms on size with NYSE-size cut points.

Returns:

Nx1 ndarray of classes.

Note

If the array has one unique value, the class will be set to 0, and if the array has two unique values (binary variable), the class of the smaller value will be 0 and that of the larger value will be (number of quantiles - 1), when ascending = True. If the number of unique values is greater than 2 and smaller than the number of quantiles, the classes are not deterministic.
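
The role of by_array can be illustrated with a small NumPy sketch: cut points are computed from by_array and then used to bin array. This is a simplified stand-in under stated assumptions (ascending order only, no ginfo, a value equal to a cut point falls in the lower class) with a hypothetical function name, not the actual implementation of classify().

```python
import numpy as np


def classify_sketch(array, quantiles, by_array=None):
    """Bin array using quantile cut points taken from by_array (or array itself)."""
    array = np.asarray(array, dtype=float)
    by = array if by_array is None else np.asarray(by_array, dtype=float)
    cuts = np.nanquantile(by, quantiles)           # cut points ignore nan
    # Class = number of cut points strictly below the value.
    classes = np.searchsorted(cuts, array, side='left').astype(float)
    classes[np.isnan(array)] = np.nan              # nan values get class nan
    return classes


# One cross-section: size of all firms and size of NYSE firms only.
me = [100, 5000, 2000, 300, 150]
me_nyse = [np.nan, 5000, 2000, 300, 150]
me_cls = classify_sketch(me, [0.3, 0.7], by_array=me_nyse)
```

In this toy cross-section the cut points come from the four non-missing me_nyse values, so the first firm is still classified even though it has no NYSE size.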

Examples

>>> import numpy as np
>>> from pyanomaly.datatools import classify
>>> array = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
>>> array
[ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11]

Classify array into 5 equal-size groups.

>>> classify(array, 5)
[0., 0., 0., 1., 1., 2., 2., 3., 3., 4., 4.]

Classify array into three groups that correspond to the 0-0.3, 0.3-0.7, and 0.7-1.0 quantile ranges.

>>> classify(array, [0.3, 0.7])
[0., 0., 0., 0., 1., 1., 1., 1., 2., 2., 2.]

Classify array into three groups in descending order.

>>> classify(array, [0.3, 0.7], ascending=False)
[2., 2., 2., 2., 1., 1., 1., 1., 0., 0., 0.]

>>> import pandas as pd
>>> data = pd.DataFrame(
...     {'ret': [0.01, 0.05, 0.02, 0.03, 0.11, -0.03, -0.01, 0.03, -0.05, 0.07],
...      'me': [100, 5000, 2000, 300, 150, 120, 4500, 2100, 305, 140],
...      'me_nyse': [np.nan, 5000, 2000, 300, 150, np.nan, 4500, 2100, 305, 140],
...      },
...     index=pd.MultiIndex.from_product([['2023-03-31', '2023-04-30'], [10000, 20000, 30000, 40000, 50000]],
...                                      names=['date', 'permno'])
... ).sort_index(level=['permno', 'date'])
>>> data
                     ret    me  me_nyse
date       permno
2023-03-31 10000   0.010   100      NaN
2023-04-30 10000  -0.030   120      NaN
2023-03-31 20000   0.050  5000 5000.000
2023-04-30 20000  -0.010  4500 4500.000
2023-03-31 30000   0.020  2000 2000.000
2023-04-30 30000   0.030  2100 2100.000
2023-03-31 40000   0.030   300  300.000
2023-04-30 40000  -0.050   305  305.000
2023-03-31 50000   0.110   150  150.000
2023-04-30 50000   0.070   140  140.000

Cross-sectionally classify the data into three groups ([0.3, 0.7, 1.0]) on ‘me’ using ‘me_nyse’ cut points.

>>> data['me_cls'] = classify(data['me'], [0.3, 0.7], ginfo='date', by_array=data['me_nyse'])
... data
                     ret    me  me_nyse  me_cls
date       permno
2023-03-31 10000   0.010   100      NaN       0
2023-04-30 10000  -0.030   120      NaN       0
2023-03-31 20000   0.050  5000 5000.000       2
2023-04-30 20000  -0.010  4500 4500.000       2
2023-03-31 30000   0.020  2000 2000.000       1
2023-04-30 30000   0.030  2100 2100.000       1
2023-03-31 40000   0.030   300  300.000       1
2023-04-30 40000  -0.050   305  305.000       1
2023-03-31 50000   0.110   150  150.000       0
2023-04-30 50000   0.070   140  140.000       0
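
The cut-point logic can be sketched with plain NumPy. This is an illustration under stated assumptions, not PyAnomaly's implementation: the helper name classify_sketch is hypothetical, cut points are taken with np.nanquantile (linear interpolation), and a value equal to a cut point falls into the lower class. Under these assumptions the sketch gives the same classes as the three one-dimensional examples above.

```python
import numpy as np

def classify_sketch(array, split, ascending=True, by_array=None):
    """Hypothetical helper: assign quantile classes 0..K-1 to `array`."""
    array = np.asarray(array, dtype=float)
    by_array = array if by_array is None else np.asarray(by_array, dtype=float)
    # An integer K means K equal-probability groups; a list gives the quantiles directly.
    if isinstance(split, int):
        quantiles = np.arange(1, split) / split
    else:
        quantiles = np.asarray(split, dtype=float)
    # Cut points come from by_array (e.g., NYSE firms only); nan is ignored (assumption).
    cuts = np.nanquantile(by_array, quantiles)
    # side='left': a value equal to a cut point stays in the lower class (assumption).
    classes = np.searchsorted(cuts, array, side='left').astype(float)
    if not ascending:
        classes = len(quantiles) - classes
    return classes

array = np.arange(1, 12)
five_groups = classify_sketch(array, 5)
three_groups = classify_sketch(array, [0.3, 0.7])
three_groups_desc = classify_sketch(array, [0.3, 0.7], ascending=False)
```

Passing by_array separates "which values are classified" from "which values define the cut points", which is what allows NYSE-based breakpoints to be applied to all firms.
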
pyanomaly.datatools.compare_data(data1, data2=None, on=None, how='inner', tolerance=0.01, suffixes=('_x', '_y'), returns=False)

Compare two data sets.

This function compares the common columns of data1 and data2. This is similar to data1.compare(data2), but data1 and data2 are not required to have the same index and columns: Data sets are first merged and only common columns are compared. Also, a tolerance can be set to determine whether two values are identical. Whether two columns are identical within the tolerance (‘match’), their correlation (‘corr’), and the number of nans in data1 and data2 (‘nan_x’, ‘nan_y’) are printed.

Parameters:
  • data1 – DataFrame for comparison.

  • data2 – DataFrame for comparison. If None, data1 is assumed to be a merged dataset of data1 and data2. If data1 is a merged dataset, on and how have no effect.

  • on – A column or a list of columns to merge data sets on. If None, data sets will be merged on index.

  • how – How to merge: ‘inner’, ‘outer’, ‘left’, or ‘right’. If ‘inner’, only common indexes are compared.

  • tolerance – Tolerance level to determine equality. Two values, val1 and val2, are considered to be identical if abs((val1 - val2) / val2) < tolerance.

  • suffixes – Suffixes to add to common columns, or the suffixes used in the merged dataset. suffixes[0]: suffix for data1, suffixes[1]: suffix for data2.

  • returns – If True, return the comparison results and merged data.

Returns:

  • Comparison result. DataFrame with index = compared columns and columns = [‘match’, ‘corr’, ‘nan_x’, ‘nan_y’].

  • Merged DataFrame of data1 and data2.

Examples

>>> data1 = pd.DataFrame(
...     {'ret': [0.01, 0.05, 0.02, 0.03, 0.11, -0.03, -0.01, 0.03, -0.05, 0.07],
...      'me': [100, 5000, 2000, 300, 150, 120, 4500, 2100, 305, 140],
...      },
...     index=pd.MultiIndex.from_product([['2023-03-31', '2023-04-30'], [10000, 20000, 30000, 40000, 50000]],
...                                      names=['date', 'permno'])
... )
... data2 = pd.DataFrame(
...     {'ret': [0.00, np.nan, 0.02, 0.03, 0.11, -0.03, -0.01, 0.03, -0.05, 0.07],
...      },
...     index=pd.MultiIndex.from_product([['2023-03-31', '2023-04-30'], [10000, 20000, 30000, 40000, 50000]],
...                                      names=['date', 'permno'])
... )
... compare_data(data1, data2)
column                matched     corr    nan_x    nan_y
ret                   0.88889  0.99773       0       1
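
The relative-tolerance rule can be sketched as follows. The helper name match_rate is hypothetical, and skipping pairs that contain a nan is an assumption (it is consistent with the 0.88889 shown above: nine non-nan pairs, of which the first fails because its data2 value is zero).

```python
import numpy as np

def match_rate(values1, values2, tolerance=0.01):
    """Hypothetical helper: share of non-nan pairs within the relative tolerance."""
    v1 = np.asarray(values1, dtype=float)
    v2 = np.asarray(values2, dtype=float)
    valid = ~(np.isnan(v1) | np.isnan(v2))  # assumption: nan pairs are skipped
    with np.errstate(divide='ignore', invalid='ignore'):
        rel_diff = np.abs((v1[valid] - v2[valid]) / v2[valid])
    # Division by zero yields inf (or nan for 0/0), which never passes the test.
    return float(np.mean(rel_diff < tolerance))

ret1 = [0.01, 0.05, 0.02, 0.03, 0.11, -0.03, -0.01, 0.03, -0.05, 0.07]
ret2 = [0.00, np.nan, 0.02, 0.03, 0.11, -0.03, -0.01, 0.03, -0.05, 0.07]
rate = match_rate(ret1, ret2)  # 8 of 9 valid pairs match
```
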
pyanomaly.datatools.filter(data, on, limits, ginfo=None, by=None)

Filter data.

Remove rows of data, where data[on] is outside limits.

Parameters:
  • data – DataFrame to be filtered.

  • on – Column of data to filter data on.

  • limits – A pair of quantiles, e.g., (0.1, 0.1) to remove top and bottom 10%. An element of limits can be set to None for one-sided filtering.

  • ginfo – Grouping information: integer (index level), str (column name), pandas GroupBy object, ndarray of group sizes, or list of group indexes. If given, data is filtered within each group. See apply_to_groups() for more details.

  • by – Column of data on which cut points are determined. If None, by = on. E.g., on can be set to ‘me’ and by to ‘nyse_me’ to remove small firms based on NYSE-size cut points.

Returns:

DataFrame. Filtered data.

Examples

>>> data = pd.DataFrame(
...     {'ret': [0.01, 0.05, 0.02, 0.03, 0.11, -0.03, -0.01, 0.03, -0.05, 0.07],
...      'me': [100, 5000, 2000, 300, 150, 120, 4500, 2100, 305, 140],
...      'me_nyse': [np.nan, 5000, 2000, 300, 150, np.nan, 4500, 2100, 305, 140],
...      },
...     index=pd.MultiIndex.from_product([['2023-03-31', '2023-04-30'], [10000, 20000, 30000, 40000, 50000]],
...                                      names=['date', 'permno'])
... ).sort_index(level=['permno', 'date'])
... data
                     ret    me  me_nyse
date       permno
2023-03-31 10000   0.010   100      NaN
2023-04-30 10000  -0.030   120      NaN
2023-03-31 20000   0.050  5000 5000.000
2023-04-30 20000  -0.010  4500 4500.000
2023-03-31 30000   0.020  2000 2000.000
2023-04-30 30000   0.030  2100 2100.000
2023-03-31 40000   0.030   300  300.000
2023-04-30 40000  -0.050   305  305.000
2023-03-31 50000   0.110   150  150.000
2023-04-30 50000   0.070   140  140.000

Remove smallest 20% using ‘me_nyse’ cut points.

>>> filter(data, 'me', [0.2, None], ginfo='date', by='me_nyse')
                     ret    me  me_nyse
date       permno
2023-03-31 20000   0.050  5000 5000.000
2023-04-30 20000  -0.010  4500 4500.000
2023-03-31 30000   0.020  2000 2000.000
2023-04-30 30000   0.030  2100 2100.000
2023-03-31 40000   0.030   300  300.000
2023-04-30 40000  -0.050   305  305.000
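
A rough pandas equivalent of the grouped, one-sided filter above, for readers who want to see the mechanics. It is a sketch, not the library's code: it assumes linearly interpolated quantiles (the pandas default) and that rows at or above the cut point are kept.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    {'me': [100, 5000, 2000, 300, 150, 120, 4500, 2100, 305, 140],
     'me_nyse': [np.nan, 5000, 2000, 300, 150, np.nan, 4500, 2100, 305, 140]},
    index=pd.MultiIndex.from_product(
        [['2023-03-31', '2023-04-30'], [10000, 20000, 30000, 40000, 50000]],
        names=['date', 'permno']),
)

# Lower cut point per date: the 0.2 quantile of 'me_nyse' (pandas skips nan).
lower = data.groupby(level='date')['me_nyse'].transform(lambda s: s.quantile(0.2))

# Keep rows whose 'me' is at or above the cut point of their own date.
filtered = data[data['me'] >= lower]
```
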
pyanomaly.datatools.inspect_data(data, option=['summary'], date_col=None, id_col=None)

Inspect data.

This function inspects the panel data, data, and prints the results.

Parameters:
  • data – DataFrame. It should have date and id columns or index = date/id.

  • option – List of items to display. See below for available options.

  • date_col – Date column. If None, the first index level of data is assumed to be the date.

  • id_col – ID column. If None, the second index level of data is assumed to be the id.

Options and the items they display:

  • ‘summary’ – Data shape, number of unique dates, and number of unique ids.

  • ‘id_count’ – Number of ids per date.

  • ‘nans’ – Number of nans and infs per column.

  • ‘stats’ – Descriptive statistics. Same as data.describe().

pyanomaly.datatools.merge(left, right, on=None, right_on=None, how='left', drop_duplicates='right', suffixes=None, method=None)

Merge two data sets.

This is similar to pd.merge(), but is often much faster and uses less memory for left merges. The index of left is always retained.

Parameters:
  • left – Series or DataFrame. Left data to merge.

  • right – Series or DataFrame. Right data to merge.

  • on – (List of) column(s) to merge on. If None, merge on index.

  • right_on – (List of) column(s) of right to merge on. If None, right_on = on.

  • how – Merge method: ‘inner’, ‘outer’, ‘left’, or ‘right’.

  • drop_duplicates – How to handle duplicate columns. ‘left’: drop the duplicate columns of left (keep those of right), ‘right’: drop the duplicate columns of right (keep those of left), None: keep both. If None, suffixes should be provided.

  • suffixes – A tuple of suffixes for duplicate columns, e.g., suffixes=(‘_x’, ‘_y’) will add ‘_x’ and ‘_y’ to the left and right duplicate columns, respectively.

  • method – None or ‘pandas’. None uses an internal merge algorithm for left-merge; ‘pandas’ uses pd.merge() internally. If how is not ‘left’, this option is ignored and pd.merge() is always used.

Returns:

Merged DataFrame.

Note

The internal algorithm is much faster and more memory-efficient than pd.merge(), especially when how = ‘left’ and the right data does not have many columns. In other cases, it could be slower. Try both method = None and method = ‘pandas’, and choose the faster one.

Warning

Both left and right may be modified internally.

Examples

>>> data1 = pd.DataFrame(
...     {'ret': [0.01, 0.05, 0.02, 0.03, 0.11, -0.03, -0.01, 0.03, -0.05, 0.07],
...      'me': [100, 5000, 2000, 300, 150, 120, 4500, 2100, 305, 140],
...      },
...     index=pd.MultiIndex.from_product([['2023-03-31', '2023-04-30'], [10000, 20000, 30000, 40000, 50000]],
...                                      names=['date', 'permno'])
... )
... data1
                     ret    me
date       permno
2023-03-31 10000   0.010   100
           20000   0.050  5000
           30000   0.020  2000
           40000   0.030   300
           50000   0.110   150
2023-04-30 10000  -0.030   120
           20000  -0.010  4500
           30000   0.030  2100
           40000  -0.050   305
           50000   0.070   140
>>> data2 = pd.DataFrame(
...     {'me': [120, 4500, 2100, 305, 140],
...      'me_nyse': [np.nan, 4500, 2100, 305, 140],
...      },
...     index=pd.MultiIndex.from_product([['2023-04-30'], [10000, 20000, 30000, 40000, 50000]],
...                                      names=['date', 'permno'])
... )
... data2
                     me  me_nyse
date       permno
2023-04-30 10000    120      NaN
           20000   4500 4500.000
           30000   2100 2100.000
           40000    305  305.000
           50000    140  140.000
>>> merge(data1, data2)
                     ret    me  me_nyse
date       permno
2023-03-31 10000   0.010   100      NaN
           20000   0.050  5000      NaN
           30000   0.020  2000      NaN
           40000   0.030   300      NaN
           50000   0.110   150      NaN
2023-04-30 10000  -0.030   120      NaN
           20000  -0.010  4500 4500.000
           30000   0.030  2100 2100.000
           40000  -0.050   305  305.000
           50000   0.070   140  140.000
>>> merge(data1, data2, how='inner')
                     ret    me  me_nyse
date       permno
2023-04-30 10000  -0.030   120      NaN
           20000  -0.010  4500 4500.000
           30000   0.030  2100 2100.000
           40000  -0.050   305  305.000
           50000   0.070   140  140.000
>>> merge(data1, data2, drop_duplicates='left')
                     ret       me  me_nyse
date       permno
2023-03-31 10000   0.010      NaN      NaN
           20000   0.050      NaN      NaN
           30000   0.020      NaN      NaN
           40000   0.030      NaN      NaN
           50000   0.110      NaN      NaN
2023-04-30 10000  -0.030  120.000      NaN
           20000  -0.010 4500.000 4500.000
           30000   0.030 2100.000 2100.000
           40000  -0.050  305.000  305.000
           50000   0.070  140.000  140.000
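
The two drop_duplicates choices can be mimicked with plain pandas joins. This sketch only illustrates the documented behavior (which copy of a duplicate column survives, and that the index of left is retained); it says nothing about the internal merge algorithm. The data are made up, with slightly different 'me' values on the right so that the surviving copy is visible.

```python
import numpy as np
import pandas as pd

dates = ['2023-03-31', '2023-04-30']
ids = [10000, 20000]
left = pd.DataFrame(
    {'ret': [0.01, 0.05, -0.03, -0.01], 'me': [100, 5000, 120, 4500]},
    index=pd.MultiIndex.from_product([dates, ids], names=['date', 'permno']),
)
right = pd.DataFrame(
    {'me': [121, 4501], 'me_nyse': [np.nan, 4500]},
    index=pd.MultiIndex.from_product([dates[1:], ids], names=['date', 'permno']),
)

dup_cols = left.columns.intersection(right.columns)  # here: ['me']

# Like drop_duplicates='right': right's duplicate column is dropped, left's 'me' survives.
keep_left = left.join(right.drop(columns=dup_cols), how='left')

# Like drop_duplicates='left': left's duplicate column is dropped, right's 'me' survives.
keep_right = left.drop(columns=dup_cols).join(right, how='left')
```
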
pyanomaly.datatools.populate(data, freq, method='ffill', limit=None, new_date_idx=None)

Populate data.

Populate data to freq frequency.

Parameters:
  • data – DataFrame with index = date/id, sorted on id/date.

  • freq – Frequency to populate: ANNUAL, QUARTERLY, MONTHLY, or DAILY.

  • method – Filling method for newly added rows. ‘ffill’: forward fill, None: nan.

  • limit – Maximum number of rows to forward-fill. Default to None (no fill).

  • new_date_idx – Name of the new (populated) date index. If None, use the current date index name. If given, the original date index is kept as a column.

Returns:

Populated data with index = new_date/id.

Examples

>>> data = pd.DataFrame(
...     {'ret': [0.01, 0.05, 0.02, 0.03],
...      'me': [100, 5000, 120, 4500],
...      },
...     index=pd.MultiIndex.from_product([pd.to_datetime(['2023-03-31', '2024-03-31']), [10000, 20000]],
...                                      names=['date', 'permno'])
... )
... data
                    ret    me
date       permno
2023-03-31 10000  0.010   100
           20000  0.050  5000
2024-03-31 10000  0.020   120
           20000  0.030  4500

Populate to monthly and forward-fill up to 12 months.

>>> populate(data, MONTHLY, limit=12)
                    ret       me
date       permno
2023-03-31 10000  0.010  100.000
2023-04-30 10000  0.010  100.000
2023-05-31 10000  0.010  100.000
2023-06-30 10000  0.010  100.000
2023-07-31 10000  0.010  100.000
2023-08-31 10000  0.010  100.000
2023-09-30 10000  0.010  100.000
2023-10-31 10000  0.010  100.000
2023-11-30 10000  0.010  100.000
2023-12-31 10000  0.010  100.000
2024-01-31 10000  0.010  100.000
2024-02-29 10000  0.010  100.000
2024-03-31 10000  0.020  120.000
2023-03-31 20000  0.050 5000.000
2023-04-30 20000  0.050 5000.000
2023-05-31 20000  0.050 5000.000
2023-06-30 20000  0.050 5000.000
2023-07-31 20000  0.050 5000.000
2023-08-31 20000  0.050 5000.000
2023-09-30 20000  0.050 5000.000
2023-10-31 20000  0.050 5000.000
2023-11-30 20000  0.050 5000.000
2023-12-31 20000  0.050 5000.000
2024-01-31 20000  0.050 5000.000
2024-02-29 20000  0.050 5000.000
2024-03-31 20000  0.030 4500.000
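
For intuition, the monthly population above can be approximated in plain pandas by reindexing each id to month ends and forward-filling. The helper name populate_monthly_sketch is hypothetical, and the approach (per-id reindex plus ffill) is an assumption about what "populate" means, chosen because it yields the same rows as the example.

```python
import pandas as pd

def populate_monthly_sketch(data, limit=None):
    """Hypothetical helper: expand each id to month-end frequency and forward-fill."""
    pieces = {}
    for permno, grp in data.groupby(level='permno'):
        grp = grp.droplevel('permno')
        # All month ends between the first and last observation of this id.
        months = pd.date_range(grp.index.min(), grp.index.max(),
                               freq=pd.offsets.MonthEnd())
        pieces[permno] = grp.reindex(months).ffill(limit=limit)
    out = pd.concat(pieces)
    out.index = out.index.set_names(['permno', 'date'])
    return out.swaplevel()  # index = date/permno, rows ordered by permno then date

data = pd.DataFrame(
    {'ret': [0.01, 0.05, 0.02, 0.03], 'me': [100, 5000, 120, 4500]},
    index=pd.MultiIndex.from_product(
        [pd.to_datetime(['2023-03-31', '2024-03-31']), [10000, 20000]],
        names=['date', 'permno']),
)
populated = populate_monthly_sketch(data, limit=12)
```
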
pyanomaly.datatools.shift(data, n, cols=None, excl_cols=None)

Shift data.

Shift data by n. If cols is given, only cols columns are shifted. If excl_cols is given, columns excluding excl_cols are shifted. Either cols or excl_cols should be None. The shifted data contains both shifted and not-shifted columns. If data has a MultiIndex of date/id, data is shifted within each id.

Parameters:
  • data – Series or DataFrame with index = date or date/id. If index = date/id, data must be sorted on id/date.

  • n – Shift size.

  • cols – Columns to shift.

  • excl_cols – Columns to not shift.

Returns:

Series or DataFrame. Shifted data.
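
The "shift within each id" behavior corresponds to a grouped shift in pandas. The sketch below uses made-up data and plain pandas calls, not the library function, and lags only the 'me' column.

```python
import pandas as pd

data = pd.DataFrame(
    {'ret': [0.01, -0.03, 0.05, -0.01], 'me': [100, 120, 5000, 4500]},
    index=pd.MultiIndex.from_tuples(
        [('2023-03-31', 10000), ('2023-04-30', 10000),
         ('2023-03-31', 20000), ('2023-04-30', 20000)],
        names=['date', 'permno']),
)  # sorted on id/date

# Lag 'me' by one period within each id; 'ret' is left unshifted.
shifted = data.copy()
shifted['me'] = data.groupby(level='permno')['me'].shift(1)
```

The first row of each id becomes nan because no earlier observation of that id exists; values never leak from one id to the next.
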

pyanomaly.datatools.to_month_end(date)

Shift dates to the last day of the same month.

Parameters:

date – Datetime Series.

Returns:

Datetime Series shifted to month end.
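
In plain pandas the same transformation is commonly written with a MonthEnd(0) offset. This is a sketch of the idea, not necessarily how to_month_end() is implemented.

```python
import pandas as pd

date = pd.Series(pd.to_datetime(['2023-03-15', '2023-04-30', '2024-02-01']))

# MonthEnd(0) rolls forward to the month end and leaves month-end dates unchanged.
month_end = date + pd.offsets.MonthEnd(0)
```
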

pyanomaly.datatools.trim(array, limits, ginfo=None, by_array=None)

Trim array.

Trim array values that are outside limits. The return value is a bool array that indicates which elements are kept: True to keep and False to remove (trim).

Parameters:
  • array – Nx1 ndarray or Series to be trimmed.

  • limits – A pair of quantiles, e.g., (0.1, 0.1) to trim top and bottom 10%. An element of limits can be set to None for one-sided trim.

  • ginfo – Grouping information: integer (index level), str (column name), pandas GroupBy object, ndarray of group sizes, or list of group indexes. If given, array is trimmed within each group. See apply_to_groups() for more details.

  • by_array – Array based on which cut points are determined. If None, by_array = array. E.g., array can be set to ME and by_array to NYSE-ME to remove small firms based on NYSE-size cut points.

Returns:

Nx1 bool ndarray. Elements corresponding to trimmed values are set to False.

Examples

>>> array = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
... array
[ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11]

Trim top/bottom 10% values.

>>> trim(array, [0.1, 0.1])
[False,  True,  True,  True,  True,  True,  True,  True,  True, True, False]
>>> data = pd.DataFrame(
...     {'ret': [0.01, 0.05, 0.02, 0.03, 0.11, -0.03, -0.01, 0.03, -0.05, 0.07],
...      'me': [100, 5000, 2000, 300, 150, 120, 4500, 2100, 305, 140],
...      'me_nyse': [np.nan, 5000, 2000, 300, 150, np.nan, 4500, 2100, 305, 140],
...      },
...     index=pd.MultiIndex.from_product([['2023-03-31', '2023-04-30'], [10000, 20000, 30000, 40000, 50000]],
...                                      names=['date', 'permno'])
... ).sort_index(level=['permno', 'date'])
... data
                     ret    me  me_nyse
date       permno
2023-03-31 10000   0.010   100      NaN
2023-04-30 10000  -0.030   120      NaN
2023-03-31 20000   0.050  5000 5000.000
2023-04-30 20000  -0.010  4500 4500.000
2023-03-31 30000   0.020  2000 2000.000
2023-04-30 30000   0.030  2100 2100.000
2023-03-31 40000   0.030   300  300.000
2023-04-30 40000  -0.050   305  305.000
2023-03-31 50000   0.110   150  150.000
2023-04-30 50000   0.070   140  140.000

Trim smallest 20% cross-sectionally using ‘me_nyse’ cut points.

>>> data['trimmed'] = trim(data['me'], [0.2, None], ginfo='date', by_array=data['me_nyse'])
... data
                     ret    me  me_nyse  trimmed
date       permno
2023-03-31 10000   0.010   100      NaN    False
2023-04-30 10000  -0.030   120      NaN    False
2023-03-31 20000   0.050  5000 5000.000     True
2023-04-30 20000  -0.010  4500 4500.000     True
2023-03-31 30000   0.020  2000 2000.000     True
2023-04-30 30000   0.030  2100 2100.000     True
2023-03-31 40000   0.030   300  300.000     True
2023-04-30 40000  -0.050   305  305.000     True
2023-03-31 50000   0.110   150  150.000    False
2023-04-30 50000   0.070   140  140.000    False
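
A minimal NumPy sketch of the keep mask. The helper name trim_sketch is hypothetical; linearly interpolated cut points (np.nanquantile) and inclusive comparisons are assumptions. With them, the sketch gives the same mask as the first example above and the same result for the 2023-03-31 cross-section of the second.

```python
import numpy as np

def trim_sketch(array, limits, by_array=None):
    """Hypothetical helper: True for values inside the quantile limits."""
    array = np.asarray(array, dtype=float)
    by_array = array if by_array is None else np.asarray(by_array, dtype=float)
    lower, upper = limits
    keep = np.ones(array.shape, dtype=bool)
    if lower is not None:  # None means no trimming on this side
        keep &= array >= np.nanquantile(by_array, lower)
    if upper is not None:
        keep &= array <= np.nanquantile(by_array, 1 - upper)
    return keep

keep = trim_sketch(np.arange(1, 12), [0.1, 0.1])

# One cross-section (2023-03-31) with NYSE-based cut points.
keep_march = trim_sketch([100, 5000, 2000, 300, 150], [0.2, None],
                         by_array=[np.nan, 5000, 2000, 300, 150])
```
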
pyanomaly.datatools.winsorize(array, limits, ginfo=None, by_array=None)

Winsorize array.

Winsorize array values that are outside limits.

Parameters:
  • array – Nx1 ndarray or Series to be winsorized.

  • limits – A pair of quantiles, e.g., (0.1, 0.1) to winsorize top and bottom 10%. An element of limits can be set to None for one-sided winsorization.

  • ginfo – Grouping information: integer (index level), str (column name), pandas GroupBy object, ndarray of group sizes, or list of group indexes. If given, array is winsorized within each group. See apply_to_groups() for more details.

  • by_array – Array based on which cut points are determined. If None, by_array = array. E.g., array can be set to ME and by_array to NYSE-ME to winsorize large firms’ weights based on NYSE-size cut points.

Returns:

Nx1 ndarray of winsorized values.

Examples

>>> array = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
... array
[ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11]

Winsorize top/bottom 10% values.

>>> winsorize(array, [0.1, 0.1])
[ 2,  2,  3,  4,  5,  6,  7,  8,  9, 10, 10]
>>> data = pd.DataFrame(
...     {'ret': [0.01, 0.05, 0.02, 0.03, 0.11, -0.03, -0.01, 0.03, -0.05, 0.07],
...      'me': [100, 5000, 2000, 300, 150, 120, 4500, 2100, 305, 140],
...      'me_nyse': [np.nan, 5000, 2000, 300, 150, np.nan, 4500, 2100, 305, 140],
...      },
...     index=pd.MultiIndex.from_product([['2023-03-31', '2023-04-30'], [10000, 20000, 30000, 40000, 50000]],
...                                      names=['date', 'permno'])
... ).sort_index(level=['permno', 'date'])
... data
                     ret    me  me_nyse
date       permno
2023-03-31 10000   0.010   100      NaN
2023-04-30 10000  -0.030   120      NaN
2023-03-31 20000   0.050  5000 5000.000
2023-04-30 20000  -0.010  4500 4500.000
2023-03-31 30000   0.020  2000 2000.000
2023-04-30 30000   0.030  2100 2100.000
2023-03-31 40000   0.030   300  300.000
2023-04-30 40000  -0.050   305  305.000
2023-03-31 50000   0.110   150  150.000
2023-04-30 50000   0.070   140  140.000

Winsorize top 10% returns cross-sectionally.

>>> data['ret_winsorized'] = winsorize(data['ret'], [None, 0.1], ginfo='date')
... data[['ret', 'ret_winsorized']].sort_index()
                     ret  ret_winsorized
date       permno
2023-03-31 10000   0.010           0.010
           20000   0.050           0.050
           30000   0.020           0.020
           40000   0.030           0.030
           50000   0.110           0.050
2023-04-30 10000  -0.030          -0.030
           20000  -0.010          -0.010
           30000   0.030           0.030
           40000  -0.050          -0.050
           50000   0.070           0.030
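
The outputs above are consistent with replacing each out-of-range value by the nearest observation that lies inside the cut points, rather than by the interpolated cut point itself (0.11 becomes 0.05, not an intermediate value). The sketch below assumes that rule, inferred from the examples only; the helper name winsorize_sketch is hypothetical and by_array is omitted for brevity.

```python
import numpy as np

def winsorize_sketch(array, limits):
    """Hypothetical helper: pull tail values to the nearest in-range observation."""
    array = np.asarray(array, dtype=float)
    out = array.copy()
    lower, upper = limits
    if lower is not None:
        cut = np.nanquantile(array, lower)
        inside = array[array >= cut]
        out[array < cut] = inside.min()
    if upper is not None:
        cut = np.nanquantile(array, 1 - upper)
        inside = array[array <= cut]
        out[array > cut] = inside.max()
    return out

both_tails = winsorize_sketch(np.arange(1, 12), [0.1, 0.1])

# Returns of the 2023-03-31 cross-section, top 10% only.
march_ret = winsorize_sketch([0.01, 0.05, 0.02, 0.03, 0.11], [None, 0.1])
```
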

factors

This module defines functions to generate factor portfolios and characteristic portfolios.

prepare_data_for_factors

Prepare data for factor portfolio generation.

make_factor_portfolio

Make a factor portfolio.

make_factor_portfolios

Make factor portfolios.

make_all_factor_portfolios

Make all factor portfolios.

make_char_portfolios

Make characteristic portfolios.

pyanomaly.factors.make_all_factor_portfolios(monthly=True, daily=True, sdate=None)

Make all factor portfolios.

Currently, this function generates the following factors:

  • Fama-French 3 factors: mktrf, smb_ff, hml

  • Fama-French 5 factors: mktrf, smb_ff5, hml, rmw, cma

  • Hou-Xue-Zhang 4 factors: mktrf, smb_hxz, inv, roe

  • Stambaugh-Yuan 4 factors: mktrf, smb_sy, mgmt, perf

The DataFrame of factors is saved to config.factors_monthly(daily)_fname in config.output_dir.

Parameters:
  • monthly – If True, generate monthly factors.

  • daily – If True, generate daily factors.

  • sdate – Start date (‘yyyy-mm-dd’).

pyanomaly.factors.make_char_portfolios(panel, char_list, weighting)

Make characteristic portfolios.

Make characteristic portfolios using the method of JKP.

  1. Split stocks into terciles (1:1:1) based on a firm characteristic. Use only NYSE stocks (excluding bottom 20%) to determine the cut points.

  2. Make a characteristic portfolio: First Quantile - Last Quantile.

Parameters:
  • panel – FCPanel that contains firm characteristics in char_list. It should also have ‘exret’, ‘me’, ‘primary’, and ‘exchcd’ columns.

  • char_list – List of firm characteristics to generate.

  • weighting – ‘ew’ (equal-weight), ‘vw’ (value-weight), or ‘vw_cap’ (value-weight capped at 0.8 NYSE-size quantile).

Returns:

Characteristic portfolio DataFrame with index = ‘date’ and columns = [‘group’, ‘char’, ‘ret’, ‘signal’, ‘n_firms’].

  • group: ‘h’, ‘m’, ‘l’, or ‘hml’.

  • char: characteristic name

  • ret: characteristic portfolio return

  • signal: average characteristic value

  • n_firms: number of firms

pyanomaly.factors.make_factor_portfolio(panel, ret_col, char, char_split=(0.3, 0.7), nyse=True, ascending=False, size_class=None, weight_col=None)

Make a factor portfolio.

The procedure is as follows.

  1. Split stocks into terciles based on the values of char column.

    • If nyse = True, cut points are determined by char of NYSE stocks.

    • If ascending is True, the first quantile contains stocks with the lowest char values.

  2. Make the factor portfolio.

    • If size_class is given,

      Factor portfolio (hml) = 1/2(Small High + Big High) - 1/2(Small Low + Big Low)

      Size portfolio (smb) = 1/3(Small High + Small Mid + Small Low) - 1/3(Big High + Big Mid + Big Low)

    • Otherwise,

      Factor portfolio (hml) = High - Low.

    • High (Low) is the first (last) quantile, i.e., the factor portfolio is (high - low) when ascending = False and (low - high) when ascending = True.

    • If weight_col is given, the factor portfolio is a weight_col-weighted portfolio; otherwise, it is an equal-weight portfolio.
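
The two size-class formulas in step 2 can be written out directly. The sketch uses made-up returns for the six size/characteristic portfolios, with the column names listed in the Returns section of this function; it illustrates the arithmetic only and does not build the portfolios from stock data.

```python
import pandas as pd

# Made-up returns of the six ingredient portfolios (illustration only).
ports = pd.DataFrame(
    {'sh': [0.03, 0.01], 'sm': [0.02, 0.00], 'sl': [0.01, -0.01],
     'bh': [0.02, 0.02], 'bm': [0.01, 0.01], 'bl': [0.00, 0.00]},
    index=pd.to_datetime(['2023-03-31', '2023-04-30']).rename('date'),
)

# hml = 1/2(Small High + Big High) - 1/2(Small Low + Big Low)
ports['hml'] = (ports['sh'] + ports['bh']) / 2 - (ports['sl'] + ports['bl']) / 2

# smb = 1/3(Small High + Small Mid + Small Low) - 1/3(Big High + Big Mid + Big Low)
ports['smb'] = (ports[['sh', 'sm', 'sl']].sum(axis=1) / 3
                - ports[['bh', 'bm', 'bl']].sum(axis=1) / 3)
```
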

Parameters:
  • panel – FCPanel that contains data for factor generation. It should have char, ret_col, size_class, and weight_col (optional) columns.

  • ret_col – Return column.

  • char – Firm characteristic column to make a factor portfolio from.

  • char_split – Tuple of splits for terciles. (0.3, 0.7) means 3:4:3 split.

  • nyse – If True, cut points are determined by char of NYSE stocks.

  • ascending – If True, the first quantile contains stocks with the lowest char values.

  • size_class – Size class column. If given, a factor portfolio is constructed in each size group and the factor portfolio is the average of them.

  • weight_col – Weight column. If None, stocks are equally weighted.

Returns:

DataFrame of the factor and its ingredient portfolios. Index = ‘date’.

  • If size_class is given,

    columns: [‘sh’, ‘sm’, ‘sl’, ‘bh’, ‘bm’, ‘bl’, ‘hml’, ‘smb’].

  • Otherwise,

    columns: [‘h’, ‘m’, ‘l’, ‘hml’].

pyanomaly.factors.make_factor_portfolios(panel, factor_groups)

Make factor portfolios.

Generate factor portfolios defined in factor_groups, which is a “factor group” or a list of them.

Factor group

A “factor group” is a dictionary that defines a factor model and has the following structure.

{
  'factor_name': {
      'char'(str): Firm characteristic to use to make the factor
      'ascending'(bool): If True (False), the factor portfolio is low-high (high-low). Default to False
      'char_split'(tuple): How to split stocks into three groups. Default to (0.3, 0.7)
      }
}

Example (Fama-French 5 factors):

ff5 = {
    'smb_ff5': {},
    'hml': dict(char='be_me'),
    'rmw': dict(char='ope_be'),
    'cma': dict(char='at_gr1', ascending=True),
}

Note that

  • Any firm characteristic defined in FUNDA, FUNDQ, CRSPM, CRSPD, and Merge can be used to generate a factor portfolio.

  • Size factor should have an empty dict.

  • Default value items can be omitted.
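
To make the "default value items can be omitted" rule concrete, the sketch below expands a factor group into its fully specified form. The helper expand_factor_group and the DEFAULTS dict are hypothetical and exist only for illustration; the default values themselves are the ones documented above.

```python
# Documented defaults of a factor definition.
DEFAULTS = {'ascending': False, 'char_split': (0.3, 0.7)}

def expand_factor_group(factor_group):
    """Hypothetical helper: fill in omitted items; the size factor stays an empty dict."""
    return {name: ({**DEFAULTS, **spec} if spec else {})
            for name, spec in factor_group.items()}

ff5 = {
    'smb_ff5': {},
    'hml': dict(char='be_me'),
    'rmw': dict(char='ope_be'),
    'cma': dict(char='at_gr1', ascending=True),
}
expanded = expand_factor_group(ff5)
# expanded['hml'] now also carries ascending=False and char_split=(0.3, 0.7).
```
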

Procedure

  1. Split stocks into two size groups (50:50) using NYSE-size cut points.

  2. Make market factor portfolio: weighted mean excess returns of all stocks.

  3. Make factor portfolios. See make_factor_portfolio().

  4. Make size factor portfolio: average of the size factor portfolios (smb) from each factor.

Parameters:
  • panel – FCPanel that contains data for factor generation, generated by prepare_data_for_factors().

  • factor_groups – (List of) factor group(s).

Returns:

Factor portfolio DataFrame with index = ‘date’ and columns = [‘mktrf’, ‘rf’] + factor names in factor_groups.

pyanomaly.factors.prepare_data_for_factors(chars, monthly=True, daily=True, sdate=None)

Prepare data for factor portfolio generation.

Generate firm characteristics needed to make factors. The output data are used as input for make_factor_portfolios().

Parameters:
  • chars – List of firm characteristics to generate.

  • monthly – If True, generate firm characteristics monthly.

  • daily – If True, generate firm characteristics daily.

  • sdate – Start date (‘yyyy-mm-dd’).

Returns:

  • mdata. FCPanel of monthly data. None if monthly = False.

  • ddata. FCPanel of daily data. None if daily = False.

ff

This module defines functions to generate Fama-French factors. This module is for validation only. The Fama-French factors used in this library are generated by factors.make_all_factor_portfolios().

make_ff_factors

Generate Fama-French 3 factors.

make_ff_factors_wrds

Generate Fama-French 3 factors (copy of the WRDS code).

pyanomaly.ff.make_ff_factors()

Generate Fama-French 3 factors.

This function refers to the WRDS code, but the results are slightly different as the code is written under the architecture of PyAnomaly. Compared to the data from the K. French website, HML has a correlation of 0.967, and SMB has a correlation of 0.989. Compared to the WRDS code, HML has a correlation of 0.991, and SMB has a correlation of 0.993.

Major differences from WRDS code:

  1. Primary stock identification: for our method, refer to WRDS.add_gvkey_to_crsp().

  2. Delist return: for our method, refer to WRDS.merge_sf_with_seall().

Returns:

  • Factors. DataFrame with index = ‘date’, columns = [‘bh’, ‘bm’, ‘bl’, ‘sh’, ‘sm’, ‘sl’, ‘hml’, ‘smb’].

  • Number of firms. DataFrame with index = ‘date’, columns = [‘bh’, ‘bm’, ‘bl’, ‘sh’, ‘sm’, ‘sl’, ‘hml’, ‘smb’].

References

WRDS code: https://wrds-www.wharton.upenn.edu/pages/support/applications/python-replications/fama-french-factors-python/

pyanomaly.ff.make_ff_factors_wrds()

Generate Fama-French 3 factors (copy of the WRDS code).

Returns:

  • Factors. DataFrame with index = ‘date’, columns = [‘bh’, ‘bm’, ‘bl’, ‘sh’, ‘sm’, ‘sl’, ‘hml’, ‘smb’].

  • Number of firms. DataFrame with index = ‘date’, columns = [‘bh’, ‘bm’, ‘bl’, ‘sh’, ‘sm’, ‘sl’, ‘hml’, ‘smb’].

References

WRDS code: https://wrds-www.wharton.upenn.edu/pages/support/applications/python-replications/fama-french-factors-python/

fileio

This module defines functions for file IO.

write_to_file

Write data to a file.

read_from_file

Read data from a file.

pyanomaly.fileio.read_from_file(fname, fdir=None, typecast=True)

Read data from a file.

Parameters:
  • fname – File name without extension.

  • fdir – Directory. If None, config.output_dir.

  • typecast – If True, cast float to config.float_type and object to string after reading from a file.

Returns:

DataFrame read from fdir/fname.

pyanomaly.fileio.write_to_file(data, fname, fdir=None, typecast=True)

Write data to a file.

The data is saved to fdir/fname. The file format is determined by config.file_format.

Parameters:
  • data – DataFrame to save.

  • fname – File name without extension.

  • fdir – Directory. If None, config.output_dir.

  • typecast – If True, cast float to config.float_type and object to string before writing to a file.

globals

This module defines global constants. Importing this module will also import frequently used modules: config, log, and util.

Constants

Weight scheme
  • VW: value-weight

  • EW: equal-weight

Data frequency
  • ANNUAL

  • QUARTERLY

  • MONTHLY

  • DAILY

log

This module defines logging functions.

set_log_path

Set log file path.

log

Write log message.

err

Write error message.

warn

Write warning message.

debug

Write debugging message.

drawline

Draw line in the log file.

start_timer

Start timer.

elapsed_time

Get elapsed time.

pyanomaly.log.debug(msg='')

Write debugging message.

Parameters:

msg – Debugging message.

pyanomaly.log.drawline(level=0, width=80)

Draw line in the log file.

Parameters:
  • level – Shape of line. 0: ‘#’, 1: ‘*’, 2: ‘-’

  • width – Line width.

pyanomaly.log.elapsed_time(msg='', headmsg=None)

Get elapsed time.

The total elapsed time (‘hh:mm:ss’) since the timer started and the elapsed time (in seconds) since the last call of this function are printed. If start_timer() has not been called before, this function starts the timer.

Parameters:
  • msg – Message to print.

  • headmsg – Message header.

Examples

>>> start_timer('start')
[2024/02/06 01:14] Timer On: start
>>> elapsed_time('check1')
[2024/02/06 01:15] Elapsed [0:00:35.564, 35.564]: check1
>>> elapsed_time('check2', 'parallel1')
[2024/02/06 01:15: parallel1] Elapsed [0:00:53.802, 18.238]: check2
pyanomaly.log.err(msg)

Write error message.

Parameters:

msg – Error message.

pyanomaly.log.log(msg, headmsg=None, header=True)

Write log message.

The format is [yyyy/mm/dd hh:mm: headmsg] msg.

Parameters:
  • msg – Log message.

  • headmsg – Message header.

  • header – If True, write the header.
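
The documented line layout can be reproduced with a few lines of standard-library Python. The function format_log_line and its now argument are hypothetical (added so the example is deterministic); the sketch only mirrors the format string described above and does not write to any log file.

```python
from datetime import datetime

def format_log_line(msg, headmsg=None, header=True, now=None):
    """Hypothetical helper mirroring the '[yyyy/mm/dd hh:mm: headmsg] msg' layout."""
    if not header:
        return msg
    stamp = (now or datetime.now()).strftime('%Y/%m/%d %H:%M')
    head = f'{stamp}: {headmsg}' if headmsg else stamp
    return f'[{head}] {msg}'

line = format_log_line('Timer On: start', now=datetime(2024, 2, 6, 1, 14))
line_with_head = format_log_line('check2', headmsg='parallel1',
                                 now=datetime(2024, 2, 6, 1, 15))
```
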

pyanomaly.log.set_log_path(fpath=None, append=True)

Set log file path.

Parameters:
  • fpath – Log file name or path or __file__. If it’s a name, the path becomes config.log_dir + fpath. If __file__ , the name of the module calling this function is retrieved from __file__ and the path is set to config.log_dir + module name. If None, the path is config.log_dir + ‘log.log’.

  • append – If True, append to the existing log file. Otherwise, delete the current log file.

Examples

>>> set_log_path('./log/example.log')  # Full file path.
>>> set_log_path('example.log')  # file name. Path = config.log_dir + 'example.log'
>>> set_log_path(__file__)  # Make path from a module name. Path = config.log_dir + module name
pyanomaly.log.start_timer(msg='', headmsg=None)

Start timer.

Parameters:
  • msg – Message to print.

  • headmsg – Message header.

pyanomaly.log.warn(msg)

Write warning message.

Parameters:

msg – Warning message.

numba_support

This module defines jitted functions.

nansum

Sum excluding nan.

nanmean

Mean excluding nan.

nanvar

Variance excluding nan.

nanstd

Standard deviation excluding nan.

nanskew

Skewness excluding nan.

shift

Shift.

roll_sum

Rolling sum.

roll_mean

Rolling mean.

roll_var

Rolling variance.

roll_std

Rolling standard deviation.

rank

Rank.

set_to_nan

Set rows to nan.

isnan1

Check nan along columns.

add_constant

Add a constant column to a matrix.

bivariate_regression

Bivariate regression.

regression

Multivariate regression.

rolling_regression

Rolling regression.

pyanomaly.numba_support.add_constant(x)

Add a constant column to a matrix.

A vector of 1’s is prepended to x.

Parameters:

x – N x K ndarray.

Returns:

N x (K+1) ndarray (x with a vector of 1’s prepended).

Examples

>>> x = np.array([[1, 2], [3, 4]])
... x
[[1 2]
 [3 4]]
>>> add_constant(x)
[[1 1 2]
 [1 3 4]]
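
For comparison, the same result in plain (non-jitted) NumPy. This is an equivalent formulation, not the library's jitted code.

```python
import numpy as np

x = np.array([[1, 2], [3, 4]])

# Prepend a column of ones with the same dtype as x.
ones = np.ones((x.shape[0], 1), dtype=x.dtype)
x_with_const = np.hstack([ones, x])
```
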
pyanomaly.numba_support.bivariate_regression(y, x)

Bivariate regression.

A constant is added internally.

Parameters:
  • y – Nx1 ndarray. Dependent variable.

  • x – Nx1 ndarray. Independent variable.

Returns:

  • Coefficients. 2x1 ndarray of [constant, beta].

  • R-squared.

  • Residuals. Nx1 ndarray.
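
The three return values correspond to the textbook closed-form solution of a one-regressor OLS with an intercept. The sketch below is plain NumPy with a hypothetical helper name, bivariate_ols; it assumes no nan in the inputs and makes no claim about how the jitted version is written.

```python
import numpy as np

def bivariate_ols(y, x):
    """Hypothetical helper: OLS of y on a constant and x."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    x_dev = x - x.mean()
    beta = (x_dev * (y - y.mean())).sum() / (x_dev ** 2).sum()  # cov(x, y) / var(x)
    const = y.mean() - beta * x.mean()
    resid = y - const - beta * x
    r2 = 1 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()
    return np.array([const, beta]), r2, resid

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.1, 4.9, 7.0])
coef, r2, resid = bivariate_ols(y, x)  # coef is [constant, beta]
```
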

pyanomaly.numba_support.isnan1(x)

Check nan along columns.

Parameters:

x – NxK ndarray.

Returns:

Nx1 bool ndarray. True if any value in the corresponding row of x is nan, False otherwise.

Examples

>>> x = np.array([[np.nan, 1, 2], [1, 2, 3]])
... x
[[nan  1.  2.]
 [ 1.  2.  3.]]
>>> isnan1(x)
[ True, False]
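
In plain NumPy the same row-wise check is a one-liner, shown here for comparison.

```python
import numpy as np

x = np.array([[np.nan, 1, 2], [1, 2, 3]])

# True for every row that contains at least one nan.
row_has_nan = np.isnan(x).any(axis=1)
```
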
pyanomaly.numba_support.nanmean(x)

Mean excluding nan.

Parameters:

x – 1D ndarray.

Returns:

Mean of x excluding nan.

pyanomaly.numba_support.nanskew(x, dof=1)

Skewness excluding nan.

Parameters:
  • x – 1D ndarray.

  • dof – Degrees of freedom. 0 (1) for biased (unbiased) estimate. Default to 1.

Returns:

Skewness of x excluding nan.

pyanomaly.numba_support.nanstd(x, dof=1)

Standard deviation excluding nan.

Parameters:
  • x – 1D ndarray.

  • dof – Degrees of freedom. 0 (1) for biased (unbiased) estimate. Default to 1.

Returns:

Standard deviation of x excluding nan.

pyanomaly.numba_support.nansum(x)

Sum excluding nan.

Parameters:

x – 1D ndarray.

Returns:

Sum of x excluding nan.

pyanomaly.numba_support.nanvar(x, dof=1)

Variance excluding nan.

Parameters:
  • x – 1D ndarray.

  • dof – Degrees of freedom. 0 (1) for biased (unbiased) estimate. Default to 1.

Returns:

Variance of x excluding nan.
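
The effect of dof can be illustrated with NumPy's own nanvar, whose ddof argument plays the same role in this sketch. Treating the two as equivalent is an assumption for illustration, not a statement about how PyAnomaly computes the value.

```python
import numpy as np

x = np.array([1.0, 2.0, np.nan, 4.0])

# Three non-nan values remain after the nan is excluded.
biased = np.nanvar(x, ddof=0)     # sum of squared deviations divided by 3
unbiased = np.nanvar(x, ddof=1)   # sum of squared deviations divided by 2
```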

pyanomaly.numba_support.rank(x, ascending=True, pct=False)

Rank.

Calculate ranks of the elements of x within each column. Nan values are excluded.

Parameters:
  • x – Ndarray.

  • ascending – If True, rank increases with value starting from 1.

  • pct – If True, percentile ranks are returned.

Returns:

(Percentile) ranks of x. Ndarray of the same size as x.

Examples

>>> x = np.array([[1, 2, 0, np.nan, 3, -1], [1, 2, 3, 4, 5, 6]]).T
>>> x
array([[ 1.,  1.],
       [ 2.,  2.],
       [ 0.,  3.],
       [nan,  4.],
       [ 3.,  5.],
       [-1.,  6.]])
>>> rank(x)
array([[ 3.,  1.],
       [ 4.,  2.],
       [ 2.,  3.],
       [nan,  4.],
       [ 5.,  5.],
       [ 1.,  6.]])
>>> rank(x, pct=True)
array([[0.6       , 0.16666667],
       [0.8       , 0.33333333],
       [0.4       , 0.5       ],
       [       nan, 0.66666667],
       [1.        , 0.83333333],
       [0.2       , 1.        ]])
pyanomaly.numba_support.regression(y, X)

Multivariate regression.

Parameters:
  • y – Nx1 ndarray. Dependent variable.

  • X – NxK ndarray. Independent variables (including constant).

Returns:

  • Coefficients. Kx1 ndarray.

  • R-squared.

  • Residuals. Nx1 ndarray.
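
Unlike bivariate_regression(), the caller supplies the constant column here. The following is a minimal NumPy sketch of the same inputs and outputs, using np.linalg.lstsq. The helper name ols is hypothetical, the arrays are 1D rather than column vectors, and the centered R-squared is an assumption.

```python
import numpy as np

def ols(y, X):
    """Hypothetical reference OLS; X must already contain the constant column."""
    y = np.asarray(y, dtype=float).ravel()
    X = np.asarray(X, dtype=float)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares coefficients
    resid = y - X @ coef
    tss = ((y - y.mean()) ** 2).sum()              # centered total sum of squares
    r2 = 1.0 - (resid @ resid) / tss
    return coef, r2, resid

x = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0]])
X = np.column_stack([np.ones(len(x)), x])          # prepend the constant by hand
y = X @ np.array([0.5, 1.0, -2.0])                 # noiseless data
coef, r2, resid = ols(y, X)
```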

pyanomaly.numba_support.roll_mean(x, n, min_n=-1)

Rolling mean.

The x is rolled along the first axis with the window size n, and the mean of each window is calculated. Nan values are excluded: The result will be nan if the number of not-nan values is smaller than min_n.

Parameters:
  • x – Ndarray.

  • n – Window size.

  • min_n – Minimum number of observations. Defaults to n (signalled by min_n = -1).

Returns:

Rolling mean. Ndarray of the same size as x.

pyanomaly.numba_support.roll_std(x, n, min_n=-1, dof=1)

Rolling standard deviation.

The x is rolled along the first axis with the window size n, and the standard deviation of each window is calculated. Nan values are excluded: The result will be nan if the number of not-nan values is smaller than min_n.

Parameters:
  • x – Ndarray.

  • n – Window size.

  • min_n – Minimum number of observations. Defaults to n (signalled by min_n = -1).

  • dof – Degrees of freedom. 0 (1) for biased (unbiased) estimate. Default to 1.

Returns:

Rolling standard deviation. Ndarray of the same size as x.

pyanomaly.numba_support.roll_sum(x, n, min_n=-1)

Rolling sum.

The x is rolled along the first axis with the window size n, and the sum of each window is calculated. Nan values are excluded: The result will be nan if the number of not-nan values is smaller than min_n.

Parameters:
  • x – Ndarray.

  • n – Window size.

  • min_n – Minimum number of observations. Defaults to n (signalled by min_n = -1).

Returns:

Rolling sum. Ndarray of the same size as x.

Examples

>>> x = np.array([[1, 2, 0, np.nan, 3, -1], [1, 2, 3, 4, 5, 6]]).T
>>> x
array([[ 1.,  1.],
       [ 2.,  2.],
       [ 0.,  3.],
       [nan,  4.],
       [ 3.,  5.],
       [-1.,  6.]])
>>> roll_sum(x, 3)
array([[nan, nan],
       [nan, nan],
       [ 3.,  6.],
       [nan,  9.],
       [nan, 12.],
       [nan, 15.]])
>>> roll_sum(x, 3, 2)
array([[nan, nan],
       [ 3.,  3.],
       [ 3.,  6.],
       [ 2.,  9.],
       [ 3., 12.],
       [ 2., 15.]])
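
The window and min_n rules can be restated as a slow, non-jitted NumPy reference. The name roll_sum_ref is hypothetical, and reading min_n = -1 as "use n" is an assumption based on the parameter description. The sketch is checked against the example arrays shown above.

```python
import numpy as np

def roll_sum_ref(x, n, min_n=-1):
    """Hypothetical reference: rolling sum along axis 0, ignoring nan."""
    x = np.asarray(x, dtype=float)
    if min_n < 0:
        min_n = n                                   # assumed meaning of the default
    out = np.full(x.shape, np.nan)
    for i in range(x.shape[0]):
        window = x[max(0, i - n + 1): i + 1]        # up to n most recent rows
        count = (~np.isnan(window)).sum(axis=0)     # non-nan values per column
        total = np.nansum(window, axis=0)
        out[i] = np.where(count >= min_n, total, np.nan)
    return out

x = np.array([[1, 2, 0, np.nan, 3, -1], [1, 2, 3, 4, 5, 6]]).T
res_default = roll_sum_ref(x, 3)
res_min2 = roll_sum_ref(x, 3, 2)
```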
pyanomaly.numba_support.roll_var(x, n, min_n=-1, dof=1)

Rolling variance.

The x is rolled along the first axis with the window size n, and the variance of each window is calculated. Nan values are excluded: The result will be nan if the number of not-nan values is smaller than min_n.

Parameters:
  • x – Ndarray.

  • n – Window size.

  • min_n – Minimum number of observations. Defaults to n (signalled by min_n = -1).

  • dof – Degrees of freedom. 0 (1) for biased (unbiased) estimate. Default to 1.

Returns:

Rolling variance. Ndarray of the same size as x.

pyanomaly.numba_support.rolling_regression(data, n, min_n=-1, add_const=True)

Rolling regression.

The data is rolled along the first axis with the window size n, and the regression is conducted for each window. Nan values are excluded: The result will be nan if the number of not-nan values is smaller than min_n.

Parameters:
  • data – NxK ndarray. The first column is the dependent variable and the rest are the independent variables.

  • n – Window size.

  • min_n – Minimum number of observations. Defaults to n (signalled by min_n = -1).

  • add_const – If True, add a constant to the independent variables.

Returns:

Nx(K’+2) ndarray. K’ = K if add_const is True, else K’ = K-1.

  • First K’ columns: Coefficients.

  • K’+1-th column: R-squared.

  • K’+2-th column: Idiosyncratic volatility (standard deviation of residuals).

pyanomaly.numba_support.set_to_nan(x, n)

Set rows to nan.

Set the first n (last -n if n < 0) rows of x to nan. The x is changed in-place.

Parameters:
  • x – Ndarray.

  • n – Number of rows to set to nan.

pyanomaly.numba_support.shift(x, n)

Shift.

x is shifted by n along the first axis. A negative n shifts x backward.

Parameters:
  • x – Ndarray.

  • n – Shift size.

Returns:

Shifted x. Ndarray of the same size as x.
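
A minimal NumPy sketch of the shift semantics follows. It assumes that a positive n moves values to later rows (as pandas shift does) and that vacated rows are filled with nan; the name shift_ref is hypothetical.

```python
import numpy as np

def shift_ref(x, n):
    """Hypothetical reference: shift x by n rows along the first axis."""
    x = np.asarray(x, dtype=float)
    out = np.full(x.shape, np.nan)
    if n > 0:
        out[n:] = x[:-n]      # forward: row t receives the value from row t-n
    elif n < 0:
        out[:n] = x[-n:]      # backward: row t receives the value from row t+|n|
    else:
        out[:] = x
    return out

x = np.array([1.0, 2.0, 3.0, 4.0])
fwd = shift_ref(x, 1)
bwd = shift_ref(x, -1)
```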

panel

This module defines classes for panel data analysis.

Panel

Base class for panel data analysis.

FCPanel

Base class for firm characteristic generation.

class pyanomaly.panel.FCPanel(alias=None, data=None, freq=12, base_freq=None)

Bases: Panel

Base class for firm characteristic generation.

For each firm characteristic, there should be one method to generate it and the method name should start with ‘c_’. Generated firm characteristics are added to the data attribute with column names equal to their method names (without ‘c_’).

Parameters:
  • alias (str, list, or dict) –

    Firm characteristics to generate and their aliases. If None, all available firm characteristics are generated.

    • str: A column name in the mapping file (config.mapping_file_path). The firm characteristics defined in this class and in the alias column of the mapping file are generated. See Mapping File.

    • list: List of firm characteristics (method names).

    • dict: Dict of firm characteristics and their aliases of the form {method name: alias}.

    If aliases are not specified, method names are used as aliases.

  • data – Raw data DataFrame with index = date/id, sorted on id/date.

  • freq – Frequency of data. ANNUAL, QUARTERLY, MONTHLY, or DAILY.

  • base_freq – Frequency of data values. ANNUAL, QUARTERLY, MONTHLY, or DAILY. For example, if funda is populated monthly, freq = MONTHLY and base_freq = ANNUAL. If None, base_freq = freq.

Note

Firm characteristics are stored in the data attribute with column names equal to their method names (without ‘c_’). When an FCPanel is saved to a file using save(), the firm characteristic columns are renamed by the aliases, and when a saved FCPanel is loaded using load(), the firm characteristic columns are renamed by the method names.
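
The naming convention can be illustrated with a toy class. ToyChars and its methods are hypothetical stand-ins written only for this sketch; they do not reproduce FCPanel's implementation and return placeholder strings instead of data.

```python
class ToyChars:
    """Hypothetical stand-in for an FCPanel subclass (not PyAnomaly code)."""

    def __init__(self, alias=None):
        # alias maps method name (without 'c_') to alias; default alias = name.
        self.char_map = alias or {name: name for name in self.available()}
        self.columns = {}

    @classmethod
    def available(cls):
        # Every method whose name starts with 'c_' defines one characteristic.
        return [m[2:] for m in dir(cls) if m.startswith("c_")]

    def c_size(self):
        return "size values"

    def c_book_to_market(self):
        return "book-to-market values"

    def create_chars(self):
        # Columns are stored under method names without the 'c_' prefix.
        for name in self.char_map:
            self.columns[name] = getattr(self, "c_" + name)()

    def renamed(self):
        # Column names as they would look after renaming to aliases.
        return {self.char_map[k]: v for k, v in self.columns.items()}

toy = ToyChars(alias={"size": "me", "book_to_market": "be_me"})
toy.create_chars()
```

Here toy.columns is keyed by 'size' and 'book_to_market', while toy.renamed() is keyed by the aliases 'me' and 'be_me'.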

Attributes

data

DataFrame with index = date/id, sorted on id/date that stores a panel data.

freq

Frequency of data. ANNUAL, QUARTERLY, MONTHLY, or DAILY.

base_freq

Frequency of data values. ANNUAL, QUARTERLY, MONTHLY, or DAILY.

char_map

Dictionary to map firm characteristic generation methods with aliases: char_map['method'] returns the alias.

Methods

show_available_chars

Display firm characteristics available in this class.

get_available_chars

Get all firm characteristics available in this class.

get_char_list

Get the firm characteristics to generate.

create_chars

Generate firm characteristics.

created

Check if a firm characteristic has already been created.

prepare

Prepare characteristics that are required to generate a characteristic.

rename_chars

Rename the firm characteristic columns.

remove_rawdata

Remove raw data.

save

Save this object to a file.

load

Load an FCPanel object from a file.

create_chars(char_list=None)

Generate firm characteristics.

Generated firm characteristics are added to data using their method names as the column names.

Parameters:

char_list – List or dict of firm characteristics (method names) to generate. If dict, keys should be method names and values aliases. If None, all firm characteristics available in this class and specified by the alias argument are generated.

created(char)

Check if a firm characteristic has already been created.

Parameters:

char – Firm characteristic (the method name without ‘c_’) to check.

Returns:

True if the firm characteristic has been created.

get_available_chars()

Get all firm characteristics available in this class.

Returns:

List of characteristics (method names)

get_char_list()

Get the firm characteristics to generate.

Among all the firm characteristics to generate, the firm characteristics defined in this class are returned.

Returns:

A dictionary of firm characteristics with keys equal to method names and values equal to aliases.

load(fname=None, fdir=None)

Load an FCPanel object from a file.

Firm characteristic columns are renamed by method names after loading.

Parameters:
  • fname – File name without extension. If None, fname = lower-case class name.

  • fdir – Directory to read the file from. If None, fdir = config.output_dir.

prepare(char_list)

Prepare characteristics that are required to generate a characteristic.

If a characteristic is defined as a function of other characteristics, this function will check whether they already exist and generate them if they don’t.

Parameters:

char_list – list of firm characteristics (method names without ‘c_’) to prepare.

remove_rawdata(excl_columns=None)

Remove raw data.

Raw data columns of data except excl_columns are deleted. If raw data are not needed after generating firm characteristics, calling this method can reduce memory and hard disc usage.

Parameters:

excl_columns – List of columns to exclude.

rename_chars(to_alias=True)

Rename the firm characteristic columns.

Parameters:

to_alias – If True, rename firm characteristic columns from method names to aliases, and vice versa.

save(fname=None, fdir=None, other_columns='all')

Save this object to a file.

The data attribute is saved to a config.file_format file and the other attributes are saved to a json file. Firm characteristic columns are renamed by aliases before saving.

Parameters:
  • fname – File name without extension. If None, fname = lower-case class name.

  • fdir – Directory to save the file. If None, fdir = config.output_dir.

  • other_columns – List of columns other than firm characteristic columns to save. If None, only firm characteristics are saved; if ‘all’, all columns are saved; if list, firm characteristic columns plus other_columns are saved.

show_available_chars(all=False)

Display firm characteristics available in this class.

Parameters:

all – If True, display all firm characteristics available in this class, otherwise, display only the firm characteristics to generate.

class pyanomaly.panel.Panel(data=None, freq=12, base_freq=None)

Bases: object

Base class for panel data analysis.

This class stores panel data in the data attribute and offers various tools to handle and analyze it. The data should be a Pandas DataFrame with a MultiIndex = date/id, i.e., the first index level should be a time-series identifier (Pandas datetime type) and the second a cross-section identifier. It should be sorted on id/date.

Parameters:
  • data – Panel data DataFrame with index = date/id, sorted on id/date.

  • freq – Frequency of data. ANNUAL, QUARTERLY, MONTHLY, or DAILY.

  • base_freq – Frequency of data values. For example, if data has annual values populated monthly, freq = MONTHLY and base_freq = ANNUAL. If None, base_freq = freq.

Attributes

data

DataFrame with index = date/id, sorted on id/date, that stores the panel data. Its items can be accessed through Panel.__getitem__(): for a Panel instance panel, panel[cols] and panel[rows, cols] are equivalent to panel.data[cols] and panel.data.loc[rows, cols], respectively.

freq

Frequency of data. ANNUAL, QUARTERLY, MONTHLY, or DAILY.

base_freq

Frequency of data values. ANNUAL, QUARTERLY, MONTHLY, or DAILY.

Methods

is_sorted

Check if a panel data is sorted on id/date.

id_idx

Get id index name.

date_idx

Get date index name.

id_values

Get id index values.

date_values

Get date index values.

get_id_group

Get id group.

get_date_group

Get date group.

get_id_group_size

Get id group sizes.

get_date_group_size

Get date group sizes.

get_id_group_index

Get id group indices.

get_date_group_index

Get date group indices.

apply_to_ids

Apply a function to each id group.

apply_to_dates

Apply a function to each date group.

populate

Populate data.

merge

Merge with another data.

inspect_data

Inspect data.

filter

Filter data.

get_row_count

Get row count per id.

remove_rows

Remove rows.

shift

Shift data.

diff

Get the difference of data.

pct_change

Get the percentage change of data.

cumret

Cumulative returns.

futret

Get future returns.

rolling

Apply a function to a rolling window.

rolling_regression

Conduct rolling regression.

copy

Make a copy of this object and return it.

copy_from

Copy from another Panel object.

save

Save this object to a file.

load

Load a Panel object from a file.

clean_memory

Clean memory.

apply_to_dates(data, function, n_ret, *args, data2=None, looper=None)

Apply a function to each date group.

This method groups data by date and applies function to each date group.

Parameters:
  • data – DataFrame, Series, ndarray, or (list of) columns. It should have the same length and order as the data attribute.

  • function – Function to apply to groups. Its arguments should be (gbdata, *args) or (gbdata, gbdata2, *args), where gbdata (gbdata2) is a group of data (data2).

  • n_ret – Number of returns of function. If None, it is assumed to be the same as the column size of data. If looper is apply_to_groups, this argument is ignored.

  • *args – Additional arguments of function.

  • data2 – DataFrame, Series, ndarray, str, or int. Optional argument when function requires two sets of input data.

  • looper – Looping function: apply_to_groups(), apply_to_groups_jit(), or apply_to_groups_reduce_jit(). If looper is None, apply_to_groups_jit is used if function is jitted, otherwise, apply_to_groups is used.

Returns:

Concatenated value of the outputs of function.

apply_to_ids(data, function, n_ret, *args, data2=None, looper=None)

Apply a function to each id group.

This method groups data by id and applies function to each id group.

Parameters:
  • data – DataFrame, Series, ndarray, or (list of) columns. It should have the same length and order as the data attribute.

  • function – Function to apply to groups. Its arguments should be (gbdata, *args) or (gbdata, gbdata2, *args), where gbdata (gbdata2) is a group of data (data2).

  • n_ret – Number of returns of function. If None, it is assumed to be the same as the column size of data. If looper is apply_to_groups, this argument is ignored.

  • *args – Additional arguments of function.

  • data2 – DataFrame, Series, ndarray, str, or int. Optional argument when function requires two sets of input data.

  • looper – Looping function: apply_to_groups(), apply_to_groups_jit(), or apply_to_groups_reduce_jit(). If looper is None, apply_to_groups_jit is used if function is jitted, otherwise, apply_to_groups is used.

Returns:

Concatenated value of the outputs of function.

clean_memory()

Clean memory.

Deleting rows/columns of the data attribute may not release memory immediately. If a Panel object consumes unusually large memory, call this function to release memory.

copy(columns=None, deep=False)

Make a copy of this object and return it.

Parameters:
  • columns – Columns to copy. If None, copy all columns.

  • deep – If True, deep copy.

Returns:

Copy of this object.

copy_from(panel, columns=None, deep=False)

Copy from another Panel object.

Parameters:
  • panel – A Panel object to be copied.

  • columns – Columns of panel to copy. If None, copy all columns.

  • deep – If True, deep copy.

cumret(ret, period=1, lag=0)

Cumulative returns.

Compute cumulative returns between t-period and t-lag. If ret is monthly returns, 12-month momentum can be obtained by setting period = 12 and lag = 1. A negative period will generate future returns: e.g., period = -1 and lag = 0 for one-period ahead return; period = -3 and lag = -1 for two-period ahead return starting from t+1.

Parameters:
  • ret – Series of returns or a return column name. If ret is a Series, it should have the same length and order as the data attribute.

  • period – Target horizon (in base frequency). (+) for past returns and (-) for future returns.

  • lag – Period (in base frequency) to calculate returns from.

Returns:

Series of cumulative returns.
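
The period and lag arguments can be illustrated for a single security with a NumPy sketch. It assumes that "between t-period and t-lag" means compounding the returns dated t-period+1 through t-lag, and it handles only past returns (positive period). The name cumret_ref is hypothetical.

```python
import numpy as np

def cumret_ref(ret, period=1, lag=0):
    """Hypothetical reference: compound returns from t-period to t-lag."""
    ret = np.asarray(ret, dtype=float)
    out = np.full(ret.shape, np.nan)
    for t in range(len(ret)):
        start, end = t - period + 1, t - lag     # first and last row compounded
        if start >= 0 and end >= start:
            out[t] = np.prod(1.0 + ret[start:end + 1]) - 1.0
    return out

ret = np.array([0.10, -0.05, 0.02, 0.03])
mom = cumret_ref(ret, period=3, lag=1)   # return from t-3 to t-1, skipping t
```

With monthly returns, period = 12 and lag = 1 in the same sketch compounds eleven monthly returns, the usual 12-month momentum that skips the most recent month.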

date_idx()

Get date index name.

Returns:

Date index name.

date_values()

Get date index values.

Returns:

Date Index.

diff(data, n=1)

Get the difference of data.

Calculate difference within each id accounting for data frequency. The period between two data points is determined by freq and base_freq. See shift() for details.

Parameters:
  • data – DataFrame, Series, ndarray, or (list of) columns. It should have the same length and order as the data attribute. If None, the data attribute is used.

  • n – Shift size in the base frequency.

Returns:

Series or DataFrame of differenced data.

filter(filters, keep_row=False)

Filter data.

Filter the panel data using filters. A filter is a tuple of three elements:

  • filter[0]: column to apply the filter to.

  • filter[1]: filter condition: ‘==’, ‘!=’, ‘>’, ‘<’, ‘>=’, ‘<=’, ‘in’, or ‘not in’.

  • filter[2]: rhs value.

If filters is a list of filters, they are applied sequentially.

Parameters:
  • filters – A filter or list of filters.

  • keep_row – Whether to keep the filtered-out rows. If True, they are kept with their values set to nan; if False, they are removed.

Examples

To remove rows where the value of column ‘x’ is less than 10:

>>> panel.filter(('x', '>=', 10))

This is equivalent to

>>> panel.data = panel.data[panel.data['x'] >= 10]
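
The filter tuples can be illustrated on a plain DataFrame. The sketch below is an assumption-laden stand-in for Panel.filter(): apply_filters and OPS are hypothetical names, and the keep_row branch simply masks every column of the failing rows with nan.

```python
import operator
import numpy as np
import pandas as pd

# Map each documented condition string to a function of (column, rhs).
OPS = {
    "==": operator.eq, "!=": operator.ne,
    ">": operator.gt, "<": operator.lt,
    ">=": operator.ge, "<=": operator.le,
    "in": lambda s, v: s.isin(v),
    "not in": lambda s, v: ~s.isin(v),
}

def apply_filters(df, filters, keep_row=False):
    """Hypothetical helper: apply (column, condition, rhs) filters sequentially."""
    if isinstance(filters, tuple):
        filters = [filters]
    for col, op, rhs in filters:
        mask = OPS[op](df[col], rhs).to_numpy()
        if keep_row:
            # Keep failing rows but blank out their values.
            df = df.where(np.tile(mask[:, None], (1, df.shape[1])))
        else:
            df = df[mask]
    return df

df = pd.DataFrame({"x": [5, 10, 15, 20], "exch": ["N", "A", "Q", "N"]})
kept = apply_filters(df, [("x", ">=", 10), ("exch", "in", ["N", "Q"])])
masked = apply_filters(df, ("x", ">=", 10), keep_row=True)
```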
futret(ret, period=1)

Get future returns.

Parameters:
  • ret – Series of returns or a return column name. If ret is a Series, it should have the same length and order as the data attribute.

  • period – Target horizon (in base frequency).

Returns:

Series of future returns.

get_date_group()

Get date group.

Same as Panel.data.groupby(level=0).

Returns:

Pandas GroupBy object.

get_date_group_index()

Get date group indices.

Returns:

List of date group indices.

get_date_group_size()

Get date group sizes.

Returns:

Ndarray of date group sizes.

get_id_group()

Get id group.

Same as Panel.data.groupby(level=1).

Returns:

Pandas GroupBy object.

get_id_group_index()

Get id group indices.

Returns:

List of id group indices.

get_id_group_size()

Get id group sizes.

Returns:

Ndarray of id group sizes.

get_row_count(ascending=True)

Get row count per id.

The row count is a sequential number starting from 0 for each id. If ascending is True (False), its value is 0 on the first (last) date. It can be used, e.g., to remove first (last) n rows of data.

Parameters:

ascending – If True, the row count increases with date.

Returns:

Ndarray of row counts.
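
The row count described above behaves like pandas' GroupBy.cumcount() applied per id. The sketch below uses a small hand-made panel; it is an illustration of the idea, not the method's implementation.

```python
import pandas as pd

dates = pd.to_datetime(
    ["2020-01-31", "2020-02-29", "2020-03-31", "2020-01-31", "2020-02-29"]
)
ids = ["A", "A", "A", "B", "B"]
idx = pd.MultiIndex.from_arrays([dates, ids], names=["date", "id"])
df = pd.DataFrame({"ret": [0.01, 0.02, 0.03, 0.04, 0.05]}, index=idx)

by_id = df.groupby(level="id")
count_up = by_id.cumcount().to_numpy()                    # 0 on each id's first date
count_down = by_id.cumcount(ascending=False).to_numpy()   # 0 on each id's last date

# Drop the first row of every id, e.g. where a lagged value is undefined.
trimmed = df[count_up >= 1]
```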

id_idx()

Get id index name.

Returns:

Id index name.

id_values()

Get id index values.

Returns:

Id Index.

inspect_data(columns=None, option=['summary'])

Inspect data.

See datatools.inspect_data().

Parameters:
  • columns – List of columns to inspect.

  • option – List of items to display.

is_sorted(data=None)

Check if a panel data is sorted on id/date.

Parameters:

data – Panel DataFrame with index = date/id. If None, the data attribute is used.

Returns:

Bool. True if data is sorted on id/date.

load(fname=None, fdir=None)

Load a Panel object from a file.

Parameters:
  • fname – File name without extension. If None, fname = lower-case class name.

  • fdir – Directory to load the file from. If None, fdir = config.output_dir.

merge(right, on=None, right_on=None, how='left', drop_duplicates='right', suffixes=None, method=None)

Merge with another data.

Merge the data attribute with right.

Parameters:
  • right – Panel, Series, or DataFrame to merge with.

  • on – (List of) column(s) to merge on. If None, merge on index.

  • right_on – (List of) column(s) of right to merge on. If None, right_on = on.

  • how – Merge method: ‘inner’, ‘outer’, ‘left’, or ‘right’.

  • drop_duplicates – How to handle duplicate columns. ‘left’: drop the left columns (keep right); ‘right’: drop the right columns (keep left); None: keep both, in which case suffixes should be provided.

  • suffixes – A tuple of suffixes for duplicate columns, e.g., suffixes=(‘_x’, ‘_y’) will add ‘_x’ and ‘_y’ to the left and right duplicate columns, respectively.

  • method – None or ‘pandas’. None uses an internal merge algorithm for left-merge; ‘pandas’ uses pd.merge() internally. If how is not ‘left’, this option is ignored and pd.merge() is always used.

pct_change(data, n=1, allow_negative_denom=False)

Get the percentage change of data.

Calculate percentage change within each id accounting for data frequency. The period between two data points is determined by freq and base_freq. See shift() for details.

Parameters:
  • data – DataFrame, Series, ndarray, or (list of) columns. It should have the same length and order as the data attribute. If None, the data attribute is used.

  • n – Shift size in the base frequency.

  • allow_negative_denom – If False, set the output to nan when the denominator is not positive.

Returns:

Series or DataFrame of percentage changes.
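
The role of allow_negative_denom can be illustrated for a single series. The sketch assumes the change is computed as x / x_lag - 1 and ignores the frequency handling; pct_change_ref is a hypothetical name.

```python
import numpy as np

def pct_change_ref(x, n=1, allow_negative_denom=False):
    """Hypothetical reference: percentage change over n rows (n >= 1)."""
    x = np.asarray(x, dtype=float)
    lagged = np.full(x.shape, np.nan)
    lagged[n:] = x[:-n]                       # value n rows earlier
    with np.errstate(divide="ignore", invalid="ignore"):
        out = x / lagged - 1.0
    if not allow_negative_denom:
        out[~(lagged > 0)] = np.nan           # zero, negative, or missing denominator
    return out

earnings = np.array([-2.0, 1.0, 3.0])
strict = pct_change_ref(earnings)
loose = pct_change_ref(earnings, allow_negative_denom=True)
```

In the strict case, the change from -2 to 1 is reported as nan because a growth rate from a negative base is hard to interpret.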

populate(freq=None, method='ffill', limit=None, lag=0, new_date_col=None)

Populate data.

Populate data to freq frequency and shift the populated data by lag period(s).

Parameters:
  • freq – Frequency to populate: ANNUAL, QUARTERLY, MONTHLY, or DAILY. If None, freq is set to the data frequency (freq), and missing dates are added.

  • method – Filling method for newly added rows. ‘ffill’: forward fill, None: nan.

  • limit – Maximum number of rows to forward-fill.

  • lag – Minimum periods between new date and original date. If freq = MONTHLY, lag = 4 shifts data by 4 months, which means data are available at least 4 months later.

  • new_date_col – Name of the new (populated) date index. If None, use the current date index name. If given, the original date index is kept as a column.

remove_rows(data, n=1)

Remove rows.

Remove (set to nan) the first n periods of data per id. If n < 0, the last -n periods are removed. Data frequency is accounted for: e.g., if freq = MONTHLY and base_freq = ANNUAL, n = 1 removes the first 12 rows per id.

Parameters:
  • data – DataFrame, Series, or ndarray. It should have the same length and order as the data attribute.

  • n – Number of rows (in base frequency) to remove (set to nan).

Returns:

The data with removed rows set to nan.

rolling(data, n, function, min_n=None, lag=0)

Apply a function to a rolling window.

For each id, apply function to rolling windows of size n. The rolling window is determined by freq and base_freq. For instance, if freq = MONTHLY and base_freq = ANNUAL, n = 3 means three-year rolling window and the first window consists of the 1st, 13th, and 25th rows.

Parameters:
  • data – DataFrame, Series, ndarray, or (list of) columns. It should have the same length and order as the data attribute. If None, the data attribute is used.

  • function – Function to apply: ‘sum’, ‘mean’, ‘std’, or ‘var’.

  • n – Window size in the base frequency.

  • min_n – Minimum number of observations in a window. If observations < min_n, result is nan. Default to n.

  • lag – Lag size in the base frequency. The data is shifted by lag before rolling.

Returns:

Series or DataFrame. Rolling calculation result.

Note

For other user-defined functions, use apply_to_ids().

Examples

Suppose funda is a Panel object with freq = MONTHLY and base_freq = ANNUAL (annual data populated monthly), and funda.data has a return column, ‘ret’. The past three-year average return starting from one year ago can be calculated (with a condition of at least two non-missing values within the sample window) and saved as ‘avg_ret3y’ as follows:

>>> funda['avg_ret3y'] = funda.rolling('ret', 3, 'mean', 2, 1)
rolling_regression(data, n, min_n=None, add_const=True)

Conduct rolling regression.

Run rolling OLS for each id accounting for the data frequency.

Parameters:
  • data – DataFrame, ndarray, or (list of) columns. It should have the same length and order as the data attribute. The first column is used as the dependent variable and the rest as the independent variables.

  • n – Window size in the base frequency.

  • min_n – Minimum number of observations in a window. If observations < min_n, result is nan. Default to n.

  • add_const – If True, add a constant to the independent variables.

Returns:

Nx(K+2) ndarray, where N is the length of data and K is the number of independent variables.

  • First K columns: Coefficients.

  • K+1-th column: R-squared.

  • K+2-th column: Idiosyncratic volatility (standard deviation of residuals).

save(fname=None, fdir=None, columns=None)

Save this object to a file.

The data attribute is saved to a config.file_format file and the other attributes are saved to a json file.

Parameters:
  • fname – File name without extension. If None, fname = lower-case class name.

  • fdir – Directory to save the file. If None, fdir = config.output_dir.

  • columns – List of columns to save. If None, the entire dataset is saved.

shift(data=None, n=1)

Shift data.

Shift data within each id accounting for data frequency. The shift period is determined by freq and base_freq. E.g., if freq = MONTHLY and base_freq = ANNUAL, n = 1 means 1-year shift, and data is shifted by 12 (12 months). That is, if base_freq = ANNUAL (QUARTERLY), shift(data, n) will always shift data by n years (quarters) regardless of the data frequency.

Parameters:
  • data – DataFrame, Series, ndarray, or (list of) columns. It should have the same length and order as the data attribute. If None, the data attribute is used.

  • n – Shift size in the base frequency.

Returns:

Series or DataFrame of shifted data.
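
The frequency rule can be illustrated with pandas. The sketch assumes the frequency constants count periods per year (MONTHLY = 12 matches the default freq=12 in the signature; ANNUAL = 1 is a guess) and that the row offset is n * freq / base_freq. panel_shift is a hypothetical helper, not the library method.

```python
import pandas as pd

ANNUAL, MONTHLY = 1, 12   # assumed values: number of periods per year

def panel_shift(df, col, n, freq, base_freq):
    """Hypothetical helper: shift col by n base-frequency periods within each id."""
    step = n * (freq // base_freq)              # rows to move, e.g. 12 for one year
    return df.groupby(level="id")[col].shift(step)

# Two years of annual values populated monthly for one firm.
dates = pd.to_datetime(
    [f"{y}-{m:02d}-01" for y in (2020, 2021) for m in range(1, 13)]
)
idx = pd.MultiIndex.from_arrays([dates, ["A"] * 24], names=["date", "id"])
df = pd.DataFrame({"at": [100.0] * 12 + [120.0] * 12}, index=idx)

lag_at = panel_shift(df, "at", 1, MONTHLY, ANNUAL)   # value one year earlier
```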

portfolio

This module defines classes for portfolio analysis.

Portfolio

Portfolio class.

Portfolios

Class for a group of portfolios.

class pyanomaly.portfolio.Portfolio(name=None, position=None, rf=None, pfval0=1, costfcn=None, keep_position=False)

Portfolio class.

This class makes a portfolio from positions and evaluates it.

Position information is saved in position and portfolio information is saved in value. If the positive weights of the positions on a date don’t add up to 1, (1 - sum(positive weights)) is assumed to be invested in a risk-free asset, and its information is saved in fposition. The transaction cost is assumed to be 0 for risk-free assets. Once the portfolio is evaluated by calling eval(), the performance attribute is generated.

Parameters:
  • name – Portfolio name.

  • position – Position DataFrame. It should have index = ‘date’ and columns = ‘id’ (security ID), ‘ret’ (return), and ‘wgt’ (portfolio weight). If it has other columns such as price, they will be kept in position. If it has ‘rf’ (risk-free rate) column, its values are used as risk-free rates.

  • rf – Risk-free rate DataFrame with index = ‘date’ and columns = [‘rf’]. The rf has priority over ‘rf’ column in position. If rf = None and position does not have ‘rf’ column, the risk-free rate is assumed to be 0.

  • pfval0 – Initial portfolio value. Default to 1.

  • costfcn – TransactionCost object, a transaction cost function, or a value. See costfcn.

  • keep_position – If False, position information (position) is deleted after the portfolio is constructed.

Attributes

name

Portfolio name.

position

Position DataFrame. Its index is ‘date’ and has the following columns:

Columns from the input:

  • id – Security ID.

  • ret – Return between t-1 and t.

  • wgt – Weight at the beginning of t (at t-1 after rebalancing).

  • Other columns – Any other columns that are in the input position data.

Columns generated internally:

  • exret – Excess return over risk-free rate between t-1 and t.

  • val1 – Value at t.

  • val – Value at the beginning of t.

  • val0 – Value at t-1 (before rebalancing). val1 at t-1 = val0 at t.

  • cost – Transaction cost incurred at the beginning of t.

value

Portfolio value DataFrame. Its index is ‘date’ and has the following columns:

  • ret – Return between t-1 and t. This can be either net (excess) return or gross (excess) return depending on the return type. See eval(). ret = netexret by default.

  • val1 – Value at t.

  • val – Value at the beginning of t.

  • cost – Transaction cost incurred at the beginning of t.

  • tover – Turnover incurred at the beginning of t: tover = sum(abs(position.val - position.val0)) / value.val.

  • netret – Return between t-1 and t, net of transaction cost.

  • grossret – Gross return between t-1 and t.

  • netexret – Excess return between t-1 and t, net of transaction cost.

  • grossexret – Gross excess return between t-1 and t.

  • lposition – Number of long positions.

  • sposition – Number of short positions.

The following columns are added to value once the portfolio is evaluated by calling eval().

  • cumret – Cumulative return since the first date.

  • drawdown – Drawdown.

  • drawdur – Duration of drawdown in the frequency of data, e.g., if monthly, 3 means 3 months.

  • drawstart – Beginning date of drawdown.

  • succdown – Successive down; down without any up during a period.

  • succdur – Duration of successive down.

  • succstart – Beginning date of successive down.

fposition

Risk-free asset position DataFrame. Its index is ‘date’ and has the following columns:

  • ret – Return between t-1 and t.

  • wgt – Weight at the beginning of t.

  • val1 – Value at t.

  • val – Value at the beginning of t.

performance

Portfolio performance DataFrame. Its column is equal to the portfolio name and has the following indexes:

  • mean – Mean return.

  • std – Standard deviation.

  • sharpe – Sharpe ratio.

  • cum – Cumulative return.

  • mdd – Maximum drawdown.

  • mdd start – Maximum drawdown start date.

  • mdd end – Maximum drawdown end date.

  • msd – Maximum successive down.

  • msd start – Maximum successive down start date.

  • msd end – Maximum successive down end date.

  • turnover – Average turnover.

  • lposition – Average number of long positions.

  • sposition – Average number of short positions.

costfcn

TransactionCost object, a transaction cost function, or a value. For example, if a constant transaction cost of 20 basis points is assumed, costfcn can be set to 0.002.

If costfcn is a TransactionCost object, TransactionCost.get_cost() is called to get transaction costs. TransactionCost allows transaction costs varying across time and securities.

When costfcn is defined as a function, it should have arguments val (value after rebalancing) and val0 (value before rebalancing) and return the transaction cost. For example, if the transaction cost to buy (sell) is 20 (30) bps, the function can be defined as follows:

def cost_fcn(val, val0):
    dval = np.abs(val - val0)
    return 0.002 * dval if val > val0 else 0.003 * dval
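
As a quick check of the rule above, the function can be called directly on a buy and a sell. The vectorized variant cost_fcn_vec is a hypothetical addition for array inputs; whether Portfolio passes scalars or arrays to costfcn is not stated here, so it is shown only as an option.

```python
import numpy as np

def cost_fcn(val, val0):
    # Scalar version from the text: 20 bps to buy, 30 bps to sell.
    dval = np.abs(val - val0)
    return 0.002 * dval if val > val0 else 0.003 * dval

def cost_fcn_vec(val, val0):
    # Hypothetical array-friendly variant of the same rule.
    val = np.asarray(val, dtype=float)
    val0 = np.asarray(val0, dtype=float)
    dval = np.abs(val - val0)
    return np.where(val > val0, 0.002, 0.003) * dval

buy = cost_fcn(150.0, 100.0)     # position grows by 50, charged at 20 bps
sell = cost_fcn(70.0, 100.0)     # position shrinks by 30, charged at 30 bps
both = cost_fcn_vec([150.0, 70.0], [100.0, 100.0])
```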

Note

If a position exists at t-1 but not at t, it will be added at t with 0 weight. This is to compute the transaction cost.

The ‘val’, ‘val1’, and ‘val0’ in position and value do not take transaction costs into account. For the value changes net of transaction costs, use the cumulative return.

Methods

set_position

Set positions.

from_portfolo_return

Make a portfolio given portfolio returns.

copy

Copy this object.

set_return_type

Set the return to use for portfolio evaluation.

returns

Get returns.

cum_returns

Get cumulative returns.

mean_return

Get mean return.

std_return

Get standard deviation.

cum_return

Get cumulative return.

sharpe_ratio

Get Sharpe ratio.

succdown

Get successive downs.

max_succdown

Get maximum successive down.

drawdown

Get drawdowns.

max_drawdown

Get maximum drawdown.

eval

Evaluate the portfolio.

eval_series

Evaluate the portfolio repeatedly over a period.

copy(sdate=None, edate=None)

Copy this object.

Copy this object for the given period.

Parameters:
  • sdate – Start date (‘yyyy-mm-dd’).

  • edate – End date (‘yyyy-mm-dd’).

Returns:

Portfolio object.

cum_return(sdate=None, edate=None, logscale=True)

Get cumulative return.

Parameters:
  • sdate – Start date.

  • edate – End date.

  • logscale – If True, return log-scale cumulative return.

Returns:

Cumulative return over the period.

cum_returns(sdate=None, edate=None, logscale=True, zero_start=False)

Get cumulative returns.

Both sdate and edate are inclusive, i.e., the first cumulative return is the return between sdate-1 and sdate, and the last cumulative return is the return between sdate-1 and edate.

Parameters:
  • sdate – Start date.

  • edate – End date.

  • logscale – If True, return log-scale cumulative returns.

  • zero_start – If True, a cumulative return of 0 is prepended with date = sdate - 1 day. This is useful when plotting several cumulative returns as all curves will start at 0.

Returns:

Cumulative return Series with index = ‘date’.
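
The effect of logscale and zero_start can be illustrated with a minimal sketch in plain Python. It is not the pyanomaly implementation: it assumes that a log-scale cumulative return is the running sum of log(1 + r) and that a simple cumulative return is obtained by compounding. The function name cum_returns_sketch and the (date, return) input layout are hypothetical.

```python
import math
from datetime import date, timedelta


def cum_returns_sketch(returns, logscale=True, zero_start=False):
    """Cumulate per-period simple returns given as (date, return) pairs.

    Assumption: log-scale cumulation is the running sum of log(1 + r);
    otherwise the returns are compounded. Not the pyanomaly implementation.
    """
    out = []
    if zero_start and returns:
        # Prepend a zero dated one day before the first date.
        out.append((returns[0][0] - timedelta(days=1), 0.0))
    log_sum = 0.0
    for d, r in returns:
        log_sum += math.log1p(r)
        out.append((d, log_sum if logscale else math.expm1(log_sum)))
    return out


rets = [(date(2020, 1, 31), 0.10), (date(2020, 2, 29), -0.05)]
curve = cum_returns_sketch(rets, logscale=False, zero_start=True)
log_curve = cum_returns_sketch(rets)
```

With zero_start=True the curve gains one leading point at 0, so several curves plotted together share a common origin.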

drawdown(sdate=None, edate=None)

Get drawdowns.

Parameters:
  • sdate – Start date.

  • edate – End date.

Returns:

Drawdowns. DataFrame with index = ‘date’ and columns = [‘value’, ‘duration’, ‘start’].
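
A minimal sketch of how a drawdown table with ‘value’, ‘duration’, and ‘start’ could be derived from a return series. The definitions used here are assumptions, not taken from the pyanomaly source: ‘value’ is the decline from the running peak of the compounded portfolio value, ‘duration’ counts the periods since that peak, and ‘start’ is the peak date. The function name drawdown_sketch is hypothetical.

```python
def drawdown_sketch(dates, returns):
    """Return a list of (date, value, duration, start) tuples.

    Assumptions: the portfolio value starts at 1, 'value' is the decline
    from the running peak, 'duration' counts periods since the peak, and
    'start' is the peak date (None while the peak is the initial value).
    """
    rows = []
    value, peak = 1.0, 1.0
    peak_date, duration = None, 0
    for d, r in zip(dates, returns):
        value *= 1.0 + r
        if value >= peak:
            # New high: the drawdown resets.
            peak, peak_date, duration = value, d, 0
        else:
            duration += 1
        rows.append((d, value / peak - 1.0, duration, peak_date))
    return rows


dates = ['2020-01', '2020-02', '2020-03', '2020-04']
rows = drawdown_sketch(dates, [0.10, -0.10, -0.10, 0.30])
# Under these assumptions, the maximum drawdown is the row with the lowest value.
max_dd = min(rows, key=lambda row: row[1])
```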

eval(sdate=None, edate=None, logscale=True, annualize_factor=1, return_type='net', percentage=False)

Evaluate the portfolio.

This method evaluates the portfolio over a period and creates performance. It also adds performance-related columns to value.

Parameters:
  • sdate – Start date.

  • edate – End date.

  • logscale – If True, cumulative returns are in log-scale.

  • annualize_factor – ‘mean’, ‘std’, ‘sharpe’, and ‘turnover’ are annualized by this factor, e.g., ‘mean’ is multiplied by annualize_factor and ‘std’ by its square-root. If the data is monthly, the results can be annualized by setting annualize_factor = 12. Default to 1.

  • return_type – Which return to use for evaluation. See set_return_type() for available options. Default to ‘net’.

  • percentage – If True, ‘mean’, ‘std’, ‘cum’, ‘mdd’, ‘msd’, and ‘turnover’ are multiplied by 100. Default to False.

Returns:

performance, value.
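
The annualize_factor scaling can be sketched as follows. The snippet assumes a sample standard deviation and a Sharpe ratio computed as annualized mean over annualized standard deviation, which makes the Sharpe ratio scale with the square root of the factor. The helper annualized_stats is hypothetical and only mirrors the description of annualize_factor given above.

```python
import math
import statistics


def annualized_stats(excess_returns, annualize_factor=1):
    """Return (mean, std, sharpe) scaled by annualize_factor.

    The mean is multiplied by the factor and the standard deviation by its
    square root; the Sharpe ratio is their ratio (an assumption).
    """
    mean = statistics.mean(excess_returns) * annualize_factor
    std = statistics.stdev(excess_returns) * math.sqrt(annualize_factor)
    return mean, std, mean / std


monthly = [0.02, -0.01, 0.03, 0.00]
mean_m, std_m, sharpe_m = annualized_stats(monthly)
mean_a, std_a, sharpe_a = annualized_stats(monthly, annualize_factor=12)
```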

eval_series(sdate=None, edate=None, logscale=True, annualize_factor=1, return_type='net', percentage=False)

Evaluate the portfolio repeatedly over a period.

Evaluate the portfolio repeatedly for the periods [sdate-1, sdate], [sdate-1, sdate+1], …, [sdate-1, edate]. For the description of the arguments, see eval().

Returns:

Performance for each period. DataFrame with index values equal to the period end dates and columns equal to the indexes of performance, i.e., a row with index t contains the performance up to t.

static from_portfolo_return(pfret, val=1)

Make a portfolio given portfolio returns.

If portfolio returns are already known, this method can be used to construct a Portfolio object without position information and evaluate the portfolio.

Parameters:

pfret – Portfolio returns. DataFrame with index = ‘date’ and columns = ‘ret’.

Returns:

Portfolio object.

max_drawdown(sdate=None, edate=None)

Get maximum drawdown.

Parameters:
  • sdate – Start date.

  • edate – End date.

Returns:

Maximum drawdown. Series with index = [‘value’, ‘duration’, ‘start’].

max_succdown(sdate=None, edate=None)

Get maximum successive down.

Parameters:
  • sdate – Start date.

  • edate – End date.

Returns:

Maximum successive down. Series with index = [‘value’, ‘duration’, ‘start’].

mean_return(sdate=None, edate=None)

Get mean return.

Parameters:
  • sdate – Start date.

  • edate – End date.

Returns:

Mean return over the period.

returns(sdate=None, edate=None)

Get returns.

Both sdate and edate are inclusive, i.e., the first return is the return between sdate-1 and sdate.

Parameters:
  • sdate – Start date.

  • edate – End date.

Returns:

Return Series with index = ‘date’.

set_position(position, rf=None, pfval0=1, keep_position=False)

Set positions.

This method sets position from position. Any existing positions will be deleted. For the details of the input arguments, see the class parameters.

set_return_type(return_type='net')

Set the return to use for portfolio evaluation.

The ‘ret’ and ‘exret’ of value are set according to return_type and used to compute portfolio evaluation metrics. Mean, std, and Sharpe ratio use value.exret and cumulative return, mdd, and msd use value.ret.

Parameters:

return_type – Return to use. ‘net’, ‘gross’, ‘netexret’, ‘grossexret’, ‘netret’, or ‘grossret’. value.ret and value.exret are set as follows.

Return_type     Ret                    Exret
‘net’           net return             net excess return
‘gross’         gross return           gross excess return
‘netexret’      net excess return      net excess return
‘grossexret’    gross excess return    gross excess return
‘netret’        net return             net return
‘grossret’      gross return           gross return

sharpe_ratio(sdate=None, edate=None)

Get Sharpe ratio.

Parameters:
  • sdate – Start date.

  • edate – End date.

Returns:

Sharpe ratio over the period.

std_return(sdate=None, edate=None)

Get standard deviation.

Parameters:
  • sdate – Start date.

  • edate – End date.

Returns:

Standard deviation of the returns over the period.

succdown(sdate=None, edate=None)

Get successive downs.

Parameters:
  • sdate – Start date.

  • edate – End date.

Returns:

Successive downs. DataFrame with index = ‘date’ and columns = [‘value’, ‘duration’, ‘start’].
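
A minimal sketch of the successive-down idea (a run of down periods without any up). It assumes that a run continues while returns are negative, that ‘value’ is the compounded return over the run, and that a non-negative return ends the run. These details and the name succdown_sketch are assumptions, not the library's code.

```python
def succdown_sketch(dates, returns):
    """Return a list of (date, value, duration, start) tuples.

    Assumptions: a run consists of consecutive negative returns, 'value' is
    the compounded return over the current run, 'duration' is the run length,
    and 'start' is the date of the first negative return in the run.
    """
    rows = []
    run_value, duration, start = 1.0, 0, None
    for d, r in zip(dates, returns):
        if r < 0:
            if duration == 0:
                start = d  # a new run begins here
            run_value *= 1.0 + r
            duration += 1
        else:
            # Any non-negative return ends the run.
            run_value, duration, start = 1.0, 0, None
        rows.append((d, run_value - 1.0, duration, start))
    return rows


rows = succdown_sketch(['t1', 't2', 't3', 't4', 't5'],
                       [0.02, -0.01, -0.03, 0.01, -0.02])
# Under these assumptions, the maximum successive down is the lowest value.
worst = min(rows, key=lambda row: row[1])
```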

class pyanomaly.portfolio.Portfolios(portfolios=None)

Class for a group of portfolios.

This class can have several portfolios as its members and evaluate them together. This class facilitates portfolio comparison by evaluating them and saving the results in a single DataFrame.

Parameters:

portfolios – List or dict of Portfolio objects to add. If it is a dict, its keys are used as portfolio names. Portfolios can also be added later using add() or set().

Attributes

members

Dict of portfolio members. A Portfolio object is added to members with its name as the key. A member portfolio can be accessed using __getitem__():

>>> pf1 = Portfolio('pf1')
>>> portfolios = Portfolios()
>>> portfolios.add(pf1)
>>> pf1 = portfolios['pf1']  # This and the next line are equivalent.
>>> pf1 = portfolios.members['pf1']

value

Portfolios’ values. This is a concatenated (outer-joined) DataFrame of the value attributes of the members. Its index is ‘date’ and columns are two-level: the first level is the same as the columns of Portfolio.value and the second level is the portfolio names.

performance

Portfolios’ performances. This is a concatenated DataFrame of the performance attributes of the members. Its index is the same as the index of Portfolio.performance and columns are portfolio names.

Methods

add

Add a portfolio.

set

Set portfolios.

eval

Evaluate the portfolios.

add(portfolio, alias=None)

Add a portfolio.

Add portfolio to members.

Parameters:
  • portfolio (Portfolio) – Portfolio to add.

  • alias – Portfolio alias. If not None, this is used as the portfolio name.

eval(sdate=None, edate=None, logscale=True, annualize_factor=1, return_type='net', percentage=False)

Evaluate the portfolios.

For the arguments, see Portfolio.eval().

Returns:

performance, value.

set(portfolios)

Set portfolios.

Any existing portfolios are deleted.

Parameters:

portfolios – List or dict of Portfolio objects. If it is a dict, its keys are used as portfolio names.

tcost

This module defines classes for transaction costs.

TransactionCost

Transaction cost class.

TimeVaryingCost

Transaction cost that varies with time and firm size.

class pyanomaly.tcost.TimeVaryingCost(me=None, normalize=True)

Bases: TransactionCost

Transaction cost that varies with time and firm size.

This class implements the time-varying transaction costs used in, e.g., Brandt et al. (2009), Hand and Green (2011), DeMiguel et al. (2020), and Han (2021). Transaction cost parameter k is defined as k = y * z, where y decreases linearly from 4.0 in 1974.01 to 1.0 in 2002.01 and remains at 1.0 thereafter, and z = 0.006 - 0.0025 * nme, where nme is a cross-sectionally normalized market equity that has a value between 0 and 1.

The maximum transaction cost is 240 basis points (the smallest firm before 1974) and the minimum transaction cost is 35 basis points (the largest firm after 2002).

We find this assumption is too conservative since the normalized me is sensitive to the largest firm. In 1974, the mean of the normalized me is only 0.0045 and most firms have the transaction cost of 240 basis points, and in 2002, the mean of the normalized me is only 0.0059 and most firms have the transaction cost of 60 basis points.

Using the logarithm of the market equity or capping the me of the largest firms may make more sense.
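
The parameter k = y * z described above can be sketched in a few lines. The snippet assumes that y is interpolated linearly in months between 1974.01 and 2002.01 and is held constant outside that window; cost_parameter is a hypothetical helper, not part of pyanomaly.

```python
def cost_parameter(year, month, nme):
    """k = y * z for a given month and normalized market equity in [0, 1]."""
    # Months elapsed since 1974.01, clipped to the 1974.01-2002.01 window.
    total_months = (2002 - 1974) * 12
    elapsed = (year - 1974) * 12 + (month - 1)
    elapsed = min(max(elapsed, 0), total_months)
    y = 4.0 - 3.0 * elapsed / total_months  # 4.0 in 1974.01, 1.0 from 2002.01
    z = 0.006 - 0.0025 * nme                # smaller for larger firms
    return y * z


k_max = cost_parameter(1974, 1, 0.0)  # smallest firm at the start of the window
k_min = cost_parameter(2002, 1, 1.0)  # largest firm at the end of the window
```

Here k_max and k_min reproduce the 240 and 35 basis point bounds stated above.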

Parameters:

me, normalize – See set_params().

Methods

set_params

Set transaction cost parameters.

set_params(me, normalize=True)

Set transaction cost parameters.

Parameters:
  • me – DataFrame or Series of market equity with index = date/id.

  • normalize – If True, cross-sectionally normalize me so that its values are between 0 and 1. If me is already normalized, set normalize to False.

Note

If the input contains only a subset of all listed stocks, normalizing the market equity can result in over- or underestimation of the transaction costs. For example, if me contains only top 80% of the stocks, the transaction costs will be overestimated. Use all listed stocks in the market or normalize the market equity outside and set normalize = False.

class pyanomaly.tcost.TransactionCost(**kwargs)

Bases: object

Transaction cost class.

Transaction cost can be set at the security level. It can also vary over time.

Parameters:

**kwargs – Transaction cost parameters. See set_params().

Attributes

params

Transaction cost parameters. This can be a float number, dict, or DataFrame. See set_params() for details.

Methods

set_params

Set transaction cost parameters.

get_cost

Get transaction costs.

get_cost(position)

Get transaction costs.

Parameters:

position – Portfolio positions. Portfolio object calls this function with the argument, Portfolio.position, to get transaction costs. The position argument should have index = ‘date’ and columns = [id, ‘val’, ‘val0’], where ‘val’ is the value after rebalancing and ‘val0’ is the value before rebalancing.

Returns:

Transaction costs. An ndarray with the same length as position.

set_params(**kwargs)

Set transaction cost parameters.

Parameters can be set either by this method or at class initialization.

Parameters:

**kwargs

Transaction cost parameters. The kwargs can have the following keywords:

  • ’cost’: For a constant (scalar) transaction cost.

  • ’buy_fixed’, ‘buy_linear’, ‘buy_quad’, ‘sell_fixed’, ‘sell_linear’, ‘sell_quad’: For a quadratic transaction cost.

  • ’params’: DataFrame. For a transaction cost that varies across securities (and over time).

Examples

A constant transaction cost of 20 basis points.

>>> tc = TransactionCost(cost=0.002)

Asymmetric quadratic cost function.

cost = fixed + linear * Amount + quad * Amount^2

  • To buy: fixed cost = $5, linear cost = 0.002, and quadratic cost = 0.001

  • To sell: fixed cost = $0, linear cost = 0.003, and quadratic cost = 0.001

>>> tc = TransactionCost(buy_fixed=5, buy_linear=0.002, buy_quad=0.001, sell_linear=0.003, sell_quad=0.001)

  • Only non-zero parameters need to be provided.
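
The asymmetric quadratic cost formula above can be checked by hand with a small stand-alone sketch. It assumes that Amount is the absolute change in position value and that no cost is charged when nothing is traded; quadratic_cost and trade_cost are hypothetical helpers and do not show how TransactionCost computes costs internally.

```python
def quadratic_cost(amount, fixed=0.0, linear=0.0, quad=0.0):
    """cost = fixed + linear * Amount + quad * Amount^2 for a traded amount."""
    amount = abs(amount)
    if amount == 0:
        return 0.0  # assumption: no trade, no cost (not even the fixed part)
    return fixed + linear * amount + quad * amount ** 2


def trade_cost(val, val0):
    """Buy if val > val0, otherwise sell, using the parameters listed above."""
    traded = val - val0
    if traded > 0:
        return quadratic_cost(traded, fixed=5, linear=0.002, quad=0.001)
    return quadratic_cost(traded, linear=0.003, quad=0.001)


buy_cost = trade_cost(200, 100)   # buy 100: 5 + 0.2 + 10
sell_cost = trade_cost(100, 200)  # sell 100: 0 + 0.3 + 10
```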

Transaction costs that vary across securities.

  • Security 1 (id: 0001): 0.002, security 2 (id: 0002): 0.003

>>> params = pd.DataFrame({'cost': [0.002, 0.003]}, index=pd.Index(['0001', '0002'], name='id'))
>>> params
      cost
id
0001 0.002
0002 0.003
>>> tc = TransactionCost(params=params)

Transaction costs that vary across securities and dates.

  • Security 1 (id: 0001): 0.004 on ‘2000-03-31’, 0.003 on ‘2000-04-30’

  • Security 2 (id: 0002): 0.005 on ‘2000-03-31’, 0.004 on ‘2000-04-30’

>>> dates = ['2000-03-31', '2000-04-30']
>>> ids = ['0001', '0002']
>>> params = pd.DataFrame(index=pd.MultiIndex.from_product([dates, ids], names=('date', 'id')))
>>> params['cost'] = [0.004, 0.005, 0.003, 0.004]
>>> params
                 cost
date       id
2000-03-31 0001 0.004
           0002 0.005
2000-04-30 0001 0.003
           0002 0.004
>>> tc = TransactionCost(params=params)

  • The params DataFrame must have index = id or date/id. It can have columns such as ‘buy_fixed’ instead of ‘cost’ for a more complex transaction cost structure.

util

This module defines utility functions.

is_iterable

Check if a variable is iterable.

to_list

Convert a variable to a list.

unique_list

Get a list of unique elements of a list.

drop_columns

Delete columns of a DataFrame.

keep_columns

Keep columns of a DataFrame.

is_zero

Check if a variable or its element is zero.

is_float

Check if a variable's data type is float.

is_numeric

Check if a variable's data type is numeric (int or float).

is_bool_array

Check if an array is a bool array.

nansum1

Summation treating nan values as zero.

pyanomaly.util.drop_columns(data, cols)

Delete columns of a DataFrame.

Columns are deleted in-place.

Parameters:
  • data – DataFrame.

  • cols – List of columns to drop.

pyanomaly.util.is_bool_array(array)

Check if an array is a bool array.

The array is identified as a bool array if:

  • its dtype is ‘bool’ or ‘boolean’; or

  • it contains only True, False, or None.

Parameters:

array – Series or ndarray.

Returns:

True if array is a bool array.

pyanomaly.util.is_float(x)

Check if a variable’s data type is float.

Parameters:

x – A scalar or array.

Returns:

Bool. True if the data type of x is float.

pyanomaly.util.is_int(x)

Check if a variable’s data type is int.

Parameters:

x – A scalar or array.

Returns:

Bool. True if the data type of x is int.

pyanomaly.util.is_iterable(x)

Check if a variable is iterable.

Check if x is iterable. A string, although technically iterable, is treated as not iterable.

Parameters:

x – A variable to check.

Returns:

Bool. True if x is iterable.
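
One way to obtain this behavior is sketched below. It is an illustration only: the actual check inside pyanomaly may differ, and is_iterable_sketch is a hypothetical name.

```python
from collections.abc import Iterable


def is_iterable_sketch(x):
    """True for iterables other than strings (assumed behavior)."""
    return isinstance(x, Iterable) and not isinstance(x, str)


checks = [is_iterable_sketch(v) for v in ([1, 2], (1,), 'abc', 1)]
```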

pyanomaly.util.is_numeric(x)

Check if a variable’s data type is numeric (int or float).

Parameters:

x – A scalar or array.

Returns:

Bool. True if the data type of x is numeric.

pyanomaly.util.is_zero(x)

Check if a variable or its element is zero.

A value is considered 0 if it is in the range (-1.e-7, 1.e-7).

Parameters:

x – A scalar or array.

Returns:

Bool or an array of bool. True if 0.
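
The tolerance rule can be sketched with plain Python. The sketch uses lists instead of arrays and a hypothetical name, is_zero_sketch; the open interval (-1.e-7, 1.e-7) follows the description above.

```python
def is_zero_sketch(x, tol=1e-7):
    """True if x lies strictly inside (-tol, tol); element-wise for lists."""
    if isinstance(x, (list, tuple)):
        return [abs(v) < tol for v in x]
    return abs(x) < tol


flags = is_zero_sketch([0.0, 5e-8, -5e-8, 1e-7, 0.1])
```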

pyanomaly.util.keep_columns(data, cols)

Keep columns of a DataFrame.

Keep the cols columns of data and delete the rest in-place. This is much more memory-efficient than the following two approaches, which momentarily consume a lot of memory when data is large:

>>> data = data[cols]
>>> data.drop(columns=data.columns.difference(cols), inplace=True)

Use this function when handling a large dataset.

Parameters:
  • data – DataFrame.

  • cols – List of columns to keep.

pyanomaly.util.nansum1(*args)

Summation treating nan values as zero.

This is similar to sum() of SAS: nan’s of args are replaced by 0. If all elements are nan, the result is nan.

Parameters:

args – List of Series.

Returns:

Series. Sum of args.

Examples

>>> x = pd.Series([np.nan, 1, 1])
>>> y = pd.Series([np.nan, np.nan, 1])
>>> nansum1(x, y)
0     NaN
1   1.000
2   2.000
dtype: float64
>>> nansum1(x, y, y)
0     NaN
1   1.000
2   3.000
dtype: float64

pyanomaly.util.to_list(x)

Convert a variable to a list.

Parameters:

x – A scalar or an iterable item.

Returns:

x converted to list.

Examples

>>> x = 1
>>> to_list(x)
[1]
>>> x = [1, 2]
>>> to_list(x)
[1, 2]
>>> x = (1, 2)
>>> to_list(x)
[1, 2]

pyanomaly.util.unique_list(x, exclude_nan=True)

Get a list of unique elements of a list.

If x contains a list, its elements are considered elements of x, not the list itself.

Parameters:
  • x – A list.

  • exclude_nan – If True, exclude None and np.nan elements.

Returns:

List of unique elements of x.

Examples

>>> x = [1, 2, 2, None]
>>> unique_list(x)
[1, 2]
>>> x = [1, 2, [1, 3], None]
>>> unique_list(x)
[1, 2, 3]
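
The flattening and de-duplication shown in these examples can be reproduced with a short stand-alone sketch. It assumes one level of nesting, hashable elements, and first-appearance ordering. None of these details are confirmed by the source, and unique_list_sketch is a hypothetical name.

```python
import math


def unique_list_sketch(x, exclude_nan=True):
    """Unique elements of x, treating nested lists as their elements."""
    seen, out = set(), []
    for item in x:
        # One level of flattening: a nested list contributes its elements.
        for v in (item if isinstance(item, (list, tuple)) else [item]):
            is_missing = v is None or (isinstance(v, float) and math.isnan(v))
            if exclude_nan and is_missing:
                continue
            if v not in seen:
                seen.add(v)
                out.append(v)
    return out


flat = unique_list_sketch([1, 2, [1, 3], None])
```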

wrdsdata

This module defines WRDS class that is used to download and handle WRDS data.

WRDS

Class to download/handle WRDS data.

class pyanomaly.wrdsdata.WRDS(wrds_username=None)

Class to download/handle WRDS data.

Parameters:

wrds_username – WRDS username. Required only when downloading data: can be set to None when reading data from files.

Attributes

db

A wrds object to connect to WRDS database.

Methods for Data Download

download_table

Download a table from WRDS library.

download_table_async

Asynchronous download of a WRDS table.

download_sf

Download crsp.m(d)sf joined with crsp.m(d)senames.

download_seall

Download delist and dividend info from crsp.m(d)seall.

download_funda

Download comp.funda.

download_fundq

Download comp.fundq.

download_secd

Download comp.secd.

download_g_secd

Download comp.g_secd.

download_all

Download all tables.

Other Methods

create_pgpass_file

Create pgpass file.

merge_sf_with_seall

Merge m(d)sf with m(d)seall.

add_gvkey_to_crsp

Add gvkey to m(d)sf and identify primary stocks.

create_crsp_comp_linktable

Create a CRSP-Compustat link table using cusip.

add_gvkey_to_crsp_cusip

Add gvkey to m(d)sf and identify primary stocks using internal link table.

preprocess_crsp

Create crspm and crspd files.

get_risk_free_rate

Get risk-free rate.

convert_fund_currency_to_usd

Convert non-USD values of funda(q) to USD values.

save_data

Save downloaded table to a file.

read_data

Read data from a saved table.

save_as_csv

Read a file and save it to a csv file.

References

CRSP overview: https://wrds-www.wharton.upenn.edu/pages/support/data-overview/wrds-overview-crsp-us-stock-database/

CRSP-Compustat merge: https://wrds-www.wharton.upenn.edu/pages/support/manuals-and-overviews/crsp/crspcompustat-merged-ccm/wrds-overview-crspcompustat-merged-ccm/

CRSP annual update tables: https://wrds-www.wharton.upenn.edu/data-dictionary/crsp_a_indexes/

shrcd: https://wrds-www.wharton.upenn.edu/data-dictionary/form_metadata/crsp_a_stock_msf_identifyinginformation/shrcd/

exchcd: https://wrds-www.wharton.upenn.edu/data-dictionary/form_metadata/crsp_a_stock_msf_identifyinginformation/exchcd/

static add_gvkey_to_crsp(sf)

Add gvkey to m(d)sf and identify primary stocks.

The permno and gvkey are mapped using crsp.ccmxpf_linktable.

Primary stocks are identified in the following order.

  1. If linkprim = ‘P’ or ‘C’, set the security as primary.

  2. If permno and gvkey have 1:1 mapping, set the security as primary.

  3. Among the securities with the same gvkey, set the one with the maximum trading volume as primary.

  4. Among the securities with the same permco and missing gvkey, set the one with the maximum trading volume as primary.

Parameters:

sf – m(d)sf DataFrame with index = date/permno.

Returns:

m(d)sf with ‘gvkey’ and ‘primary’ (primary stock indicator) columns added.

References

https://wrds-www.wharton.upenn.edu/pages/support/research-wrds/macros/wrds-macros-cvccmlnksas/

https://wrds-www.wharton.upenn.edu/pages/support/applications/linking-databases/linking-crsp-and-compustat/

static add_gvkey_to_crsp_cusip(sf)

Add gvkey to m(d)sf and identify primary stocks using internal link table.

The permno and gvkey are mapped using crsp_comp_linktable.

Primary stocks are identified in the following order.

  1. If linkprim = True, set the security as primary.

  2. If permno and gvkey have 1:1 mapping, set the security as primary.

  3. Among the securities with the same gvkey, set the one with the maximum trading volume as primary.

  4. Among the securities with the same permco and missing gvkey, set the one with the maximum trading volume as primary.

Parameters:

sf – m(d)sf DataFrame with index = date/permno.

Note

Compared to using ccmxpf_linktable, about 13% of gvkey’s and 3% of primary’s are different.

Returns:

m(d)sf with ‘gvkey’ and ‘primary’ (primary stock indicator) columns added.

static convert_fund_currency_to_usd(fund, table='funda')

Convert non-USD values of funda(q) to USD values.

Parameters:
  • fund – funda(q) DataFrame with index = datadate/gvkey.

  • table – ‘funda’ or ‘fundq’: indicator whether fund is funda or fundq.

Returns:

Converted fund DataFrame.

Note

In Compustat North America, the accounting data can be in either USD or CAD. This is not a problem if firm characteristics are generated using only Compustat. However, if data from different sources are mixed, e.g., if CRSP’s market equity (in USD) is combined with Compustat, Compustat data should be converted to USD.

Following JKP, we use compustat.exrt_dly to obtain exchange rates. The exrt_dly starts from 1982-02-01.

static create_crsp_comp_linktable()

Create a CRSP-Compustat link table using cusip.

This method creates a CRSP-Compustat link table by merging crsp.msf with comp.security on cusip. This method can be used if the user does not have a WRDS subscription for ccmxpf_linktable. The link table has the columns [‘cusip’, ‘gvkey’, ‘permno’, ‘linkdt’, ‘linkenddt’, ‘linkprim’] and is saved to config.input_dir/crsp_comp_linktable. The linkprim column value is True if a security is primary.

Note

The sql in the reference uses historical cusip (ncusip in msenames). However, we use cusip in msf as it renders more matches.

A security is considered primary (linkprim = True) if its cusip is in funda or fundq.

References

https://wrds-www.wharton.upenn.edu/pages/support/applications/linking-databases/linking-crsp-and-compustat/

create_pgpass_file()

Create pgpass file.

This method needs to be called only once (after logging in to WRDS for the first time using a password). Once the pgpass file is created, a password is not required when connecting to WRDS.

download_all(run_in_executer=True)

Download all tables.

Currently, this method downloads the following tables:

  • comp.funda

  • comp.fundq

  • comp.exrt_dly

  • crsp.msf (merged with crsp.msenames)

  • crsp.dsf (merged with crsp.dsenames)

  • crsp.mseall

  • crsp.dseall

  • crsp.ccmxpf_linktable

  • crsp.mcti

  • ff.factors_monthly

  • ff.factors_daily

Parameters:

run_in_executer – If True, download concurrently. Faster (if network speed is high) but memory hungrier.

download_funda(sdate=None, edate=None, run_in_executer=True)

Download comp.funda.

Downloaded data has index = datadate/gvkey and is sorted on gvkey/datadate.

Parameters:
  • sdate – Start date, e.g., ‘2000-01-01’. Set to None to download all data.

  • edate – End date. Set to None to download all data.

  • run_in_executer – If True, download concurrently. Faster but memory hungrier.

download_fundq(sdate=None, edate=None, run_in_executer=True)

Download comp.fundq.

Downloaded data has index = datadate/gvkey and is sorted on gvkey/datadate.

Parameters:
  • sdate – Start date, e.g., ‘2000-01-01’. Set to None to download all data.

  • edate – End date. Set to None to download all data.

  • run_in_executer – If True, download concurrently. Faster but memory hungrier.

download_g_secd(sdate=None, edate=None, run_in_executer=True)

Download comp.g_secd.

Downloaded data has index = datadate/gvkey and is sorted on gvkey/datadate.

Parameters:
  • sdate – Start date, e.g., ‘2000-01-01’. Set to None to download all data.

  • edate – End date. Set to None to download all data.

  • run_in_executer – If True, download concurrently. Faster but memory hungrier.

download_seall(sdate=None, edate=None, monthly=True, run_in_executer=True)

Download delist and dividend info from crsp.m(d)seall.

Delist can be obtained from either mseall or msedelist. We use mseall since it contains exchcd, which is used when replacing missing dlret. The shrcd and exchcd in mseall are usually those before halting/suspension. If a stock in NYSE is halted, exchcd in msenames can be -2, whereas that in mseall is 1. The downloaded fields are: permno, date, dlret, dlstcd, shrcd, exchcd, distcd, divamt. Downloaded data has index = date/permno and is sorted on permno/date.

Parameters:
  • sdate – Start date, e.g., ‘2000-01-01’. Set to None to download all data.

  • edate – End date. Set to None to download all data.

  • monthly – If True, download mseall; otherwise, download dseall.

  • run_in_executer – If True, download concurrently. Faster but memory hungrier.

download_secd(sdate=None, edate=None, run_in_executer=True)

Download comp.secd.

Downloaded data has index = datadate/gvkey and is sorted on gvkey/datadate.

Parameters:
  • sdate – Start date, e.g., ‘2000-01-01’. Set to None to download all data.

  • edate – End date. Set to None to download all data.

  • run_in_executer – If True, download concurrently. Faster but memory hungrier.

download_sf(sdate=None, edate=None, monthly=True, run_in_executer=True)

Download crsp.m(d)sf joined with crsp.m(d)senames.

Downloaded data has index = date/permno and is sorted on permno/date.

Parameters:
  • sdate – Start date, e.g., ‘2000-01-01’. Set to None to download all data.

  • edate – End date. Set to None to download all data.

  • monthly – If True, download msf; otherwise, download dsf.

  • run_in_executer – If True, download concurrently. Faster but memory hungrier.

download_table(library, table, obs=-1, offset=0, columns=None, coerce_float=None, date_cols=None, index_col=None, sort_col=None)

Download a table from WRDS library.

This is a wrapper around wrds.get_table(). The queried table is saved to config.input_dir/library/table.

Parameters:
  • library – WRDS library, e.g., crsp, comp, …

  • table – A table in library.

  • obs – See wrds.get_table().

  • offset – See wrds.get_table().

  • columns – See wrds.get_table().

  • coerce_float – See wrds.get_table().

  • date_cols – See wrds.get_table().

  • index_col – (List of) column(s) to be set as index.

  • sort_col – (List of) column(s) to sort data on.

download_table_async(library, table, sql=None, date_col=None, sdate=None, edate=None, interval=5, src_tables=None, run_in_executer=True, index_col=None, sort_col=None)

Asynchronous download of a WRDS table.

This method splits the total period into sub-periods of interval years and downloads the data of each sub-period asynchronously. If a download fails, it can be restarted from the failed date: already downloaded files will be gathered together. This method allows us to download a large table, e.g., crsp.dsf, reliably without connection timeout and consumes much less memory than download_table(). The queried table is saved to config.input_dir/library/table.

Parameters:
  • library – WRDS library.

  • table – WRDS table. If a complete query is given in sql, this can be any name: used only as the file name when saving the data.

  • sql – String of the fields to select or a complete query statement. See below.

  • date_col – Date field on which downloading will be split. Ignored if sql is a complete query statement.

  • sdate – Start date (‘yyyy-mm-dd’). If None, ‘1900-01-01’.

  • edate – End date. If None, today.

  • interval – Sub-period size in years.

  • src_tables – List of (library, table) that are used in the query. The src_tables are used to get data types of the fields. When data is selected from a single table, library.table, this can be set to None.

  • run_in_executer – If True, download concurrently. Faster but memory hungrier.

  • index_col – (List of) column(s) to be set as index.

  • sort_col – (List of) column(s) to sort data on.

Note

For a small table, this can be slower than download_table(). Table should have a date field (date_col) to split the period.
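
The period splitting can be pictured with a minimal sketch that cuts [sdate, edate] into interval-year sub-periods and fills a query template for each. It only illustrates the idea: the boundary conventions, the helper split_period, and the query template are assumptions, and the sketch neither connects to WRDS nor reflects the library's internal code.

```python
from datetime import date, timedelta


def split_period(sdate, edate, interval=5):
    """Split [sdate, edate] into consecutive sub-periods of `interval` years.

    Assumption: sdate is not February 29 (date.replace would fail for a
    non-leap target year).
    """
    periods = []
    start = sdate
    while start <= edate:
        # The next sub-period starts `interval` years later.
        next_start = start.replace(year=start.year + interval)
        end = min(next_start - timedelta(days=1), edate)
        periods.append((start, end))
        start = next_start
    return periods


periods = split_period(date(1990, 1, 1), date(2001, 6, 30))
template = "SELECT permno, prc FROM crsp.msf WHERE date BETWEEN '{}' and '{}'"
queries = [template.format(s, e) for s, e in periods]
```

Each query covers one sub-period, so a failed download only needs to be repeated from the sub-period that failed.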

Examples

Instantiate WRDS.

>>> wrds = WRDS('user_name')

Download crsp.msf.

>>> wrds.download_table_async('crsp', 'msf', date_col='date')

Download ‘permno’, ‘prc’, and ‘exchcd’ fields from crsp.msf.

>>> sql = 'permno, prc, exchcd'
>>> wrds.download_table_async('crsp', 'msf', sql, 'date')

Download crsp.msf merged with crsp.msenames. When sql is a complete query statement as below, it should contain ‘WHERE [date_col] BETWEEN {} and {}’ for asynchronous download.

>>> sql = '''
...        SELECT a.*, b.shrcd, b.exchcd, b.siccd
...        FROM crsp.msf as a
...        LEFT JOIN crsp.msenames as b
...        ON a.permno = b.permno
...        AND b.namedt <= a.date
...        AND a.date <= b.nameendt
...        WHERE a.date BETWEEN '{}' and '{}'
...        ORDER BY a.permno, a.date
...     '''
>>> src_tables = [('crsp', 'msf'), ('crsp', 'msenames')]
>>> wrds.download_table_async('crsp', 'msf', sql, src_tables=src_tables)

static get_risk_free_rate(sdate=None, edate=None, src='mcti', month_end=False)

Get risk-free rate.

The risk-free rate can be obtained either from crsp.mcti or ff.factors_monthly. The mcti is preferred since the values in factors_monthly have only 4 decimal places. Both risk-free rates are in decimal (not percentage values).

Parameters:
  • sdate – Start date.

  • edate – End date.

  • src – Data source. ‘mcti’: crsp.mcti, ‘ff’: ff.factors_monthly.

  • month_end – If True, shift dates to the end of the month.

Returns:

DataFrame of risk-free rates with index = ‘date’ and columns = [‘rf’].

static merge_sf_with_seall(monthly=True, fill_method=1)

Merge m(d)sf with m(d)seall.

This method adjusts the m(d)sf return (‘ret’) using the m(d)seall delist return (‘dlret’). The adjusted return replaces ‘ret’, and a ‘dlret’ column is added to m(d)sf. For msf, this method also adds a cash dividend column, ‘cash_div’.

Parameters:
  • monthly – If True, merge msf with mseall; else, merge dsf with dseall.

  • fill_method

    Method to fill missing dlret. 0: don’t fill, 1: JKP code, or 2: GHZ code. Default to 1.

    • fill_method = 1:

      • dlret = -0.30 if dlstcd is 500 or between 520 and 584.

    • fill_method = 2:

      • dlret = -0.35 if dlstcd is 500 or between 520 and 584, and exchcd is 1 or 2.

      • dlret = -0.55 if dlstcd is 500 or between 520 and 584, and exchcd is 3.

Returns:

m(d)sf with adjusted return (and cash dividend).
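
The fill_method rules listed under Parameters can be expressed as a small lookup. The sketch assumes that the bounds 520 and 584 are inclusive and that no value is filled when none of the listed conditions applies; fill_missing_dlret is a hypothetical helper, not the method's internal code.

```python
def fill_missing_dlret(dlstcd, exchcd, fill_method=1):
    """Return a replacement for a missing dlret, or None if it is not filled."""
    if fill_method not in (0, 1, 2):
        raise ValueError(f"unknown fill_method: {fill_method}")
    # Delisting codes covered by the rules (bounds assumed inclusive).
    qualifies = dlstcd == 500 or 520 <= dlstcd <= 584
    if fill_method == 0 or not qualifies:
        return None
    if fill_method == 1:
        return -0.30
    # fill_method == 2: the fill value depends on the exchange code.
    return {1: -0.35, 2: -0.35, 3: -0.55}.get(exchcd)


examples = [
    fill_missing_dlret(500, 1),                 # method 1
    fill_missing_dlret(550, 3, fill_method=2),  # method 2, exchcd 3
    fill_missing_dlret(100, 1),                 # code not covered by the rules
]
```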

Note

The msenames can be missing when a firm is delisted, resulting in missing shrcd/exchcd in m(d)sf. Missing shrcd and exchcd are filled with the latest values.

References

Delist codes: http://www.crsp.com/products/documentation/delisting-codes

static preprocess_crsp(use_ccmxpf_linktable=None)

Create crspm and crspd files.

This method calls merge_sf_with_seall() and add_gvkey_to_crsp() to add delist return, gvkey, and primary indicator to m(d)sf. The result is saved to config.input_dir/crspm(d).

Parameters:

use_ccmxpf_linktable – If True, use crsp.ccmxpf_linktable to link CRSP and Compustat; if False, use internally created link table, crsp_comp_linktable. If None, use crsp.ccmxpf_linktable if it exists, otherwise, use crsp_comp_linktable.

static read_data(table, library=None, index_col=None, sort_col=None, typecast=True)

Read data from a saved table.

The file path is config.input_dir/library/table. The library argument is optional: if it is None, all folders under config.input_dir are searched.

Parameters:
  • table – File name without extension.

  • library – Directory.

  • index_col – (List of) column(s) to be set as index.

  • sort_col – (List of) column(s) to sort data on.

  • typecast – If True, cast float to config.float_type and object to string after reading from the file.

Returns:

DataFrame. Data read. Index = index_col.

static save_as_csv(table, library=None, fpath=None)

Read a file and save it to a csv file.

Parameters:
  • table – File name without extension.

  • library – Directory.

  • fpath – File path for the csv file. If None, the file is saved to config.input_dir/library/table.csv.

static save_data(data, table, library=None, index_col=None, sort_col=None, typecast=True)

Save downloaded table to a file.

The file format can be either pickle (default) or parquet and is configured via set_config(). The data is saved in the following location:

  • If library = None, config.input_dir/table.

  • Otherwise, config.input_dir/library/table.

Parameters:
  • data – Data to save (DataFrame).

  • table – File name without extension.

  • library – Directory.

  • index_col – (List of) column(s) to be set as index.

  • sort_col – (List of) column(s) to sort data on.

  • typecast – If True, cast float to config.float_type and object to string before saving to a file.

Note

A parquet file can be significantly smaller, especially when columns contain many duplicate values. However, it tends to be slower to read and write and, in some cases, takes significantly more memory for unknown reasons. To change the file format to parquet, use set_config(file_format='parquet'). We store data in parquet or pickle format because both preserve data types and are much faster to read than a csv file. To convert a file to csv, use save_as_csv().