pyanomaly

Python library for asset pricing research.

pyanomaly.analytics

This module defines analytic functions.

Sorting

one_dim_sort(): One-dimensional sort.
two_dim_sort(): Two-dimensional sort.

Time-Series Analysis

time_series_average(): Calculate time-series mean and t-value.
rolling_beta(): Run rolling OLS on panel data.

Cross-sectional Analysis

crosssectional_regression(): Run cross-sectional OLS and calculate the time-series means and t-values of the coefficients.

Portfolio Analysis

make_future_return(): Compute future returns.
weighted_mean(): Calculate weighted mean.
make_quantile_portfolios(): Make quantile portfolios.

pyanomaly.analytics.append_long_short(data, level=-1, l_label=None, s_label=None, ls_label=None)

Add long-short to data.

Long-short is defined as (first group - last group) on each date. If labels are not given, long-short is (class 0 - class N-1), where N is the number of classes, and the class label of the long-short is set to N.

Parameters:

data – DataFrame with index = date/class1/class2/….
level – Index level to make long-short on. Default to the last level.
l_label – Label of the long class. If None, l_label = 0.
s_label – Label of the short class. If None, s_label = num. classes - 1.
ls_label – Label of the long-short. If None, ls_label = num. classes.

Returns:

data with long-short appended.
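
The default behaviour described above can be sketched in plain pandas. This is only an illustration under stated assumptions, not the library's implementation: the helper name append_long_short_sketch and the index level names "date" and "class" are hypothetical, and class labels are assumed to be 0, ..., N-1.

```python
import pandas as pd


def append_long_short_sketch(data):
    """Append (class 0 - class N-1) on each date under class label N.

    Assumes index = date/class with level names "date" and "class"
    and class labels 0, ..., N-1 (hypothetical names for illustration).
    """
    n_classes = data.index.get_level_values("class").nunique()
    long_leg = data.xs(0, level="class")               # index = date
    short_leg = data.xs(n_classes - 1, level="class")  # index = date
    long_short = (long_leg - short_leg).assign(**{"class": n_classes})
    long_short = long_short.set_index("class", append=True)
    return pd.concat([data, long_short]).sort_index()


dates = pd.to_datetime(["2020-01-31", "2020-02-29"])
index = pd.MultiIndex.from_product([dates, [0, 1, 2]], names=["date", "class"])
portfolios = pd.DataFrame(
    {"ret": [0.03, 0.02, 0.01, 0.05, 0.00, -0.01]}, index=index
)
with_ls = append_long_short_sketch(portfolios)
print(with_ls)
```

With three classes (0, 1, 2), the appended rows carry the class label 3 and hold the class 0 return minus the class 2 return on each date.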

pyanomaly.analytics.crosssectional_regression(data, endo_col, exog_cols, add_constant=True, cov_type='nonrobust', cov_kwds=None)

Run cross-sectional OLS on each date and calculate the time-series means and t-values of the coefficients.

Parameters:

data – DataFrame with index = date/id.
endo_col – y column.
exog_cols – List of X columns.
add_constant (bool) – Add constant to x.
cov_type – See sm.OLS.fit(), eg, ‘HAC’ for Newey-West.
cov_kwds – See sm.OLS.fit(), eg, {‘maxlags’: 12} for cov_type = ‘HAC’.

Returns:

mean, tval, coefs.

mean (DataFrame): Time-series means of coefficients with index = (‘const’) + exog_cols and columns = ‘mean’.
tval (DataFrame): t-values of coefficients with index = (‘const’) + exog_cols and columns = ‘t-stat’.
coefs (DataFrame): Coefficient time-series with index = dates and columns = (‘const’) + exog_cols.
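
The two-step procedure (an OLS on each date, then time-series statistics of the coefficients) can be sketched with NumPy and pandas alone. This is a rough illustration, not the library's code: fama_macbeth_sketch is a hypothetical name, the index level names "date" and "id" are assumed, and only the plain (nonrobust) t-value is computed, so cov_type and cov_kwds are not modelled.

```python
import numpy as np
import pandas as pd


def fama_macbeth_sketch(data, endo_col, exog_cols, add_constant=True):
    """Per-date OLS, then mean and nonrobust t-value of each coefficient."""
    names = (["const"] if add_constant else []) + list(exog_cols)
    dates, betas = [], []
    for date, block in data.groupby(level="date"):
        X = block[exog_cols].to_numpy(dtype=float)
        if add_constant:
            X = np.column_stack([np.ones(len(X)), X])
        y = block[endo_col].to_numpy(dtype=float)
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # OLS coefficients
        dates.append(date)
        betas.append(beta)
    coefs = pd.DataFrame(np.vstack(betas), index=dates, columns=names)
    mean = coefs.mean().to_frame("mean")
    # Nonrobust t-value: mean / (sample std / sqrt(number of dates)).
    stderr = coefs.std(ddof=1) / np.sqrt(len(coefs))
    tval = (coefs.mean() / stderr).to_frame("t-stat")
    return mean, tval, coefs


# Toy panel: y = 1 + slope * x with a different slope on each date.
dates = pd.to_datetime(["2020-01-31", "2020-02-29", "2020-03-31"])
slopes = [2.0, 2.1, 2.2]
x = np.arange(5.0)
panel = pd.concat(
    [pd.DataFrame({"date": d, "id": range(5), "x": x, "y": 1.0 + b * x})
     for d, b in zip(dates, slopes)]
).set_index(["date", "id"])
mean, tval, coefs = fama_macbeth_sketch(panel, "y", ["x"])
print(mean)
```

A HAC (Newey-West) t-value would replace the stderr line with a robust standard error of the coefficient time series; that part is left out of this sketch.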

pyanomaly.analytics.make_future_return(ret, period=1)

Compute future returns.

This is a simple function to compute the period-period-ahead return, where period is the number of periods to look ahead.

pyanomaly.characteristics

This module defines classes for firm characteristic generation. Refer to the cookbook for use cases.

FUNDA
Class to generate firm characteristics from funda.

FUNDQ
Class to generate firm characteristics from fundq.

CRSPM
Class to generate firm characteristics from crspm.

CRSPD
Class to generate firm characteristics from crspd.

Merge
Class to generate firm characteristics from a merged dataset of funda, fundq, crspm, and crspd.

pyanomaly.characteristics.FUNDA

class pyanomaly.characteristics.FUNDA(alias=None, data=None, freq=1)

Bases: Panel

Class to generate firm characteristics from funda.

The firm characteristics generated in this class can be viewed using FUNDA.show_available_functions():

>>> FUNDA().show_available_functions()

Parameters:

alias – Characteristic column name in mapping.xlsx. If None, function names (without ‘c_’) are used as the characteristic names.
data – DataFrame with index = date/id, sorted on id/date.
freq – Frequency of data. ANNUAL, QUARTERLY, MONTHLY, or DAILY. Default to ANNUAL.

add_crsp_me(crspm, method='latest')

Replace funda’s market equity (‘me’) with crspm’s firm level market equity (‘me_company’).

This method also adds a column ‘me_fiscal’ to data. ‘me_fiscal’ is the me on datadate if method = ‘latest’, and it is the same as me (December me) if method = ‘year_end’. Without calling this method, me and me_fiscal are set to the me of funda (prcc_f * csho).

Parameters:

crspm – CRSPM instance.
method – How to merge crsp me with funda. ‘latest’: latest me; ‘year_end’: December me.

c_absacc(): Absolute accruals. Bandyopadhyay, Huang, and Wirjanto (2010)

c_acc(): Operating accruals (GHZ, Org). Sloan (1996)

c_age(): Firm age. Jiang, Lee, and Zhang (2005)

c_aliq_at(): Asset liquidity to book assets. Ortiz-Molina and Phillips (2014)

c_aliq_mat(): Asset liquidity to market assets. Ortiz-Molina and Phillips (2014)

c_at_be(): Book leverage. Fama and French (1992)

c_at_gr1(): Asset growth. Cooper, Gulen, and Schill (2008)

c_at_me(): Assets-to-market. Fama and French (1992)

c_at_turnover(): Capital turnover. Haugen and Baker (1996)

c_be_gr1a(): Change in common equity. Richardson et al. (2005)

c_be_me(): Book-to-market (December ME). Rosenberg, Reid, and Lanstein (1985)

c_bev_mev(): Book-to-market enterprise value. Penman, Richardson, and Tuna (2007)

c_bm_ia(): Industry-adjusted book-to-market. Asness, Porter, and Stevens (2000)

c_capex_abn(): Abnormal corporate investment. Titman, Wei, and Xie (2004)

c_capx_gr1(): CAPEX growth (1 year). Xie (2001)

c_capx_gr2(): Two-year investment growth. Anderson and Garcia-Feijoo (2006)

c_capx_gr3(): Three-year investment growth. Anderson and Garcia-Feijoo (2006)

c_cash_at(): Cash-to-assets. Palazzo (2012)

c_cashdebt(): Cash flow-to-debt. Ou and Penman (1989)

c_cashpr(): Cash productivity. Chandrashekar and Rao (2009)

c_cfp(): Operating Cash flows to price (Org, GHZ). Desai, Rajgopal, and Venkatachalam (2004)

c_cfp_ia(): Industry-adjusted cash flow-to-price. Asness, Porter, and Stevens (2000)

c_chatoia(): Industry-adjusted change in asset turnover. Soliman (2008)

c_chcsho(): Net stock issues (GHZ). Pontiff and Woodgate (2008)

c_chempia(): Industry-adjusted change in employees. Asness, Porter, and Stevens (2000)

c_chpmia(): Change in profit margin. Soliman (2008)

c_coa_gr1a(): Change in current operating assets. Richardson et al. (2005)

c_col_gr1a(): Change in current operating liabilities. Richardson et al. (2005)

c_convind(): Convertible debt indicator. Valta (2016)

c_cop_at(): Cash-based operating profitability. Ball et al. (2016)

c_cop_atl1(): Cash-based operating profits to lagged assets. Ball et al. (2016)

c_cowc_gr1a(): Change in net non-cash working capital. Richardson et al. (2005)

c_currat(): Current ratio. Ou and Penman (1989)

c_dbnetis_at(): Net debt finance. Bradshaw, Richardson, and Sloan (2006)

c_debt_gr3(): Composite debt issuance. Lyandres, Sun, and Zhang (2008)

c_debt_me(): Debt to market. Bhandari (1988)

c_depr(): Depreciation to PP&E. Holthausen and Larcker (1992)

c_dgp_dsale(): Gross margin growth to sales growth. Abarbanell and Bushee (1998)

c_dsale_dinv(): Sales growth to inventory growth. Abarbanell and Bushee (1998)

c_dsale_drec(): Sales growth to receivable growth. Abarbanell and Bushee (1998)

c_dsale_dsga(): Sales growth to SG&A growth. Abarbanell and Bushee (1998)

c_dy(): Dividend yield (GHZ). Litzenberger and Ramaswamy (1979)

c_earnings_variability(): Earnings smoothness. Francis et al. (2004)

c_ebit_bev(): Return on net operating assets. Soliman (2008)

c_ebit_sale(): Profit margin. Soliman (2008)

c_ebitda_mev(): Enterprise multiple. Loughran and Wellman (2011)

c_emp_gr1(): Employee growth. Belo, Lin, and Bazdresch (2014)

c_enterprise_multiple(): Enterprise multiple. Loughran and Wellman (2011)

c_eq_dur(): Equity duration. Dechow, Sloan, and Soliman (2004)

c_eqnetis_at(): Net equity finance. Bradshaw, Richardson, and Sloan (2006)

c_eqnpo_me(): Net payout yield. Boudoukh et al. (2007)

c_eqpo_me(): Payout yield. Boudoukh et al. (2007)

c_f_score(): Piotroski F-Score (JKP). Piotroski (2000)

c_fcf_me(): Cash flow-to-price. Lakonishok, Shleifer, and Vishny (1994)

c_fnl_gr1a(): Change in financial liabilities. Richardson et al. (2005)

c_gp_at(): Gross profits-to-assets. Novy-Marx (2013)

c_gp_atl1(): Gross profits-to-lagged assets. Novy-Marx (2013)

c_herf_at(): Industry concentration (total assets). Hou and Robinson (2006)

c_herf_be(): Industry concentration (book equity). Hou and Robinson (2006)

c_herf_sale(): Industry concentration (sales). Hou and Robinson (2006)

c_intrinsic_value(): Intrinsic value-to-market. Frankel and Lee (1998)

c_inv_gr1(): Inventory growth. Belo and Lin (2012)

c_inv_gr1a(): Inventory change. Thomas and Zhang (2002)

c_invest(): CAPEX and inventory. Chen and Zhang (2010)

c_kz_index(): Kaplan-Zingales Index. Lamont, Polk, and Saa-Requejo (2001)

c_lgr(): Change in long-term debt. Richardson et al. (2005)

c_lnoa_gr1a(): Change in long-term net operating assets. Fairfield, Whisenant, and Yohn (2003)

c_lti_gr1a(): Change in long-term investments. Richardson et al. (2005)

c_mve_ia(): Industry-adjusted firm size. Asness, Porter, and Stevens (2000)

c_ncoa_gr1a(): Change in non-current operating assets. Richardson et al. (2005)

c_ncol_gr1a(): Change in non-current operating liabilities. Richardson et al. (2005)

c_netdebt_me(): Net debt-to-price. Penman, Richardson, and Tuna (2007)

c_netis_at(): Net external finance. Bradshaw, Richardson, and Sloan (2006)

c_nfna_gr1a(): Change in net financial assets. Richardson et al. (2005)

c_ni_ar1(): Earnings persistence. Francis et al. (2004)

c_ni_be(): Return on equity. Haugen and Baker (1996)

c_ni_ivol(): Earnings predictability. Francis et al. (2004)

c_ni_me(): Earnings to price. Basu (1983)

c_nncoa_gr1a(): Change in net non-current operating assets. Richardson et al. (2005)

c_noa_at(): Net operating assets. Hirshleifer et al. (2004)

c_noa_gr1a(): Change in net operating assets. Hirshleifer et al. (2004)

c_o_score(): Ohlson O-Score. Dichev (1998)

c_oaccruals_at(): Operating Accruals (JKP). Sloan (1996)

c_oaccruals_ni(): Percent Operating Accruals (JKP). Hafzalla, Lundholm, and Van Winkle (2011)

c_ocf_at(): Operating cash flow to assets. Bouchard et al. (2019)

c_ocf_at_chg1(): Change in operating cash flow to assets. Bouchard et al. (2019)

c_ocf_me(): Operating Cash flows to price (JKP). Desai, Rajgopal, and Venkatachalam (2004)

c_op_at(): Operating profits-to-assets. Ball et al. (2016)

c_op_atl1(): Operating profits-to-lagged assets. Ball et al. (2016)

c_ope_be(): Operating profits to book equity (JKP). Fama and French (2015)

c_ope_bel1(): Operating profits to lagged book equity. Fama and French (2015)

c_operprof(): Operating profits to book equity (GHZ, Org). Fama and French (2015)

c_opex_at(): Operating leverage. Novy-Marx (2011)

c_pchcapx_ia(): Industry-adjusted change in capital investment. Abarbanell and Bushee (1998)

c_pchcurrat(): Change in current ratio. Ou and Penman (1989)

c_pchdepr(): Change in depreciation to PP&E. Holthausen and Larcker (1992)

c_pchquick(): Change in quick ratio. Ou and Penman (1989)

c_pchsaleinv(): Change in sales to inventory. Ou and Penman (1989)

c_pctacc(): Percent operating accruals (GHZ, Org). Hafzalla, Lundholm, and Van Winkle (2011)

c_pi_nix(): Taxable income to income (JKP). Lev and Nissim (2004)

c_ppeinv_gr1a(): Changes in PPE and inventory/assets. Lyandres, Sun, and Zhang (2008)

c_ps(): Piotroski score (GHZ, Org). Piotroski (2000)

c_quick(): Quick ratio. Ou and Penman (1989)

c_rd(): Unexpected R&D increase. Eberhart, Maxwell, and Siddique (2004)

c_rd5_at(): R&D capital-to-assets. Li (2011)

c_rd_me(): R&D to market. Chan, Lakonishok, and Sougiannis (2001) (Guo, Lev, and Shi (2006) in GHZ)

c_rd_sale(): R&D to sales. Chan, Lakonishok, and Sougiannis (2001) (Guo, Lev, and Shi (2006) in GHZ)

c_realestate(): Real estate holdings. Tuzel (2010)

c_roic(): Return on invested capital. Brown and Rowe (2007)

c_sale_bev(): Asset turnover. Soliman (2008)

c_sale_emp_gr1(): Labor force efficiency. Abarbanell and Bushee (1998)

c_sale_gr1(): Annual sales growth. Lakonishok, Shleifer, and Vishny (1994)

c_sale_gr3(): Three-year sales growth. Lakonishok, Shleifer, and Vishny (1994)

c_sale_me(): Sales to price. Barbee, Mukherji, and Raines (1996)

c_salecash(): Sales-to-cash. Ou and Penman (1989)

c_saleinv(): Sales-to-inventory. Ou and Penman (1989)

c_salerec(): Sales-to-receivables. Ou and Penman (1989)

c_secured(): Secured debt-to-total debt. Valta (2016)

c_securedind(): Secured debt indicator. Valta (2016)

c_sin(): Sin stocks. Hong and Kacperczyk (2009)

c_sti_gr1a(): Change in short-term investments. Richardson et al. (2005)

c_taccruals_at(): Total Accruals. Richardson et al. (2005)

c_taccruals_ni(): Percent total accruals. Hafzalla, Lundholm, and Van Winkle (2011)

c_tangibility(): Tangibility. Hahn and Lee (2009)

c_tb(): Taxable income to income (Org, GHZ). Lev and Nissim (2004)

c_z_score(): Altman Z-Score. Dichev (1998)

convert_currency()

Convert the currency of data to USD.

This method needs to be called if

the data contains non USD-denominated firms, e.g., CAD, and

CRSP’s market equity is used, which is always in USD.

convert_to_monthly(lag=4, limit=12)

Populate data to monthly frequency.

The index of data changes from datadate/gvkey to date/gvkey and datadate is kept as a column. The date and datadate have a gap of at least lag months.

Parameters:

lag – Minimum months between date and datadate: lag = 4 assumes funda is available 4 months after datadate.
limit – Maximum months to forward-fill the data.
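
The effect of lag and limit can be illustrated with a small pandas sketch. It is written under assumptions and is not the library's implementation: to_monthly_sketch is a hypothetical helper, datadate is assumed to be a month-end date, and limit is read as the number of months a record is carried forward after it first becomes available.

```python
import pandas as pd


def to_monthly_sketch(annual, lag=4, limit=12):
    """Expand datadate/gvkey records to a monthly date/gvkey panel.

    A record first appears lag months after datadate and is carried
    forward for up to limit further months. When two records cover the
    same month, the one with the later datadate wins.
    """
    flat = annual.reset_index()
    frames = []
    for k in range(limit + 1):
        shifted = flat.copy()
        # Month-end date that lies lag + k months after datadate.
        shifted["date"] = shifted["datadate"] + pd.offsets.MonthEnd(lag + k)
        frames.append(shifted)
    out = pd.concat(frames, ignore_index=True)
    out = out.sort_values(["gvkey", "date", "datadate"])
    out = out.drop_duplicates(["gvkey", "date"], keep="last")
    return out.set_index(["date", "gvkey"]).sort_index()


annual = pd.DataFrame(
    {"at": [100.0, 120.0]},
    index=pd.MultiIndex.from_arrays(
        [pd.to_datetime(["2020-12-31", "2021-12-31"]), ["A", "A"]],
        names=["datadate", "gvkey"],
    ),
)
monthly = to_monthly_sketch(annual, lag=4, limit=12)
print(monthly.head())
```

In this toy example the fiscal-2020 record is used from April 2021 through March 2022, and the fiscal-2021 record takes over in April 2022.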

load_data(sdate=None, edate=None)

Load funda data from file.

The loaded data is sorted on gvkey/datadate, has index = datadate/gvkey, and stored in the data attribute.

merge_with_fundq(fundq)

Merge funda with fundq.

If variable X is available in fundq, i.e., Xq or Xy exists, Xq(y) replaces X if:

X is missing, or

Xq(y) is not missing and fundq.datadate > funda.datadate.
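
The replacement rule above can be written as a single boolean mask. The sketch below only illustrates the rule on an already aligned table; the helper name and the columns datadate_a and datadate_q are hypothetical and are not names used by the library.

```python
import numpy as np
import pandas as pd


def replace_with_fundq_sketch(x, xq, funda_datadate, fundq_datadate):
    """Use the fundq value if X is missing, or if the fundq value is
    available and comes from a later datadate than the funda value."""
    use_fundq = x.isna() | (xq.notna() & (fundq_datadate > funda_datadate))
    return x.where(~use_fundq, xq)


merged = pd.DataFrame({
    "sale": [1.0, np.nan, 3.0, 4.0],
    "saleq": [10.0, 20.0, np.nan, 40.0],
    "datadate_a": pd.to_datetime(["2020-12-31"] * 4),
    "datadate_q": pd.to_datetime(
        ["2021-03-31", "2020-09-30", "2021-03-31", "2020-09-30"]
    ),
})
merged["sale_merged"] = replace_with_fundq_sketch(
    merged["sale"], merged["saleq"], merged["datadate_a"], merged["datadate_q"]
)
print(merged[["sale", "saleq", "sale_merged"]])
```

Row by row: the first value is replaced because fundq is newer, the second because sale is missing, and the last two are kept because the fundq value is missing or older.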

Note

JKP create characteristics in funda and fundq separately and merge them, whereas we merge the raw data first and then generate characteristics. Since some variables in funda are not available in fundq, eg, ebitda, JKP make those unavailable variables from other variables and create characteristics, even when they are available in funda. We prefer to merge funda with fundq at the raw data level and create characteristics from the merged data.

Columns in both funda and fundq:

datadate, cusip, cik, sic, naics, sale, revt, cogs, xsga, dp, xrd, ib, nopi, spi, pi, txp, ni, txt, xint, capx, oancf, gdwlia, gdwlip, rect, act, che, ppegt, invt, at, aco, intan, ao, ppent, gdwl, lct, dlc, dltt, lt, pstk, ap, lco, lo, drc, drlt, txdi, ceq, scstkc, csho, prcc_f, oibdp, oiadp, mii, xopr, xi, do, xido, ibc, dpc, xidoc, fincf, fiao, txbcof, dltr, dlcch, prstkc, sstk, dv, ivao, ivst, re, txditc, txdb, seq, mib, icapt, ajex, curcd, exratd, rcount

Columns in funda but not in fundq:

xad, gp, ebitda, ebit, txfed, txfo, dvt, ob, gwo, fatb, fatl, dm, dcvt, cshrc, dcpstk, emp, xlr, ds, dvc, itcb, pstkrv, pstkl, dltis, ppenb, ppenls

Parameters:

fundq – FUNDQ instance.

postprocess()

Postprocess data.

This method deletes temporary variables (variables starting with ‘_’) and replaces infinite values with nan. This method can be overridden to add code to trim or winsorize characteristics.
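
The two clean-up steps can be sketched as a standalone function. This is an illustrative equivalent in plain pandas, not the method's source; postprocess_sketch is a hypothetical name.

```python
import numpy as np
import pandas as pd


def postprocess_sketch(data):
    """Drop temporary columns (names starting with '_') and map +/-inf to NaN."""
    keep = [col for col in data.columns if not str(col).startswith("_")]
    return data[keep].replace([np.inf, -np.inf], np.nan)


raw = pd.DataFrame({
    "be_me": [0.5, np.inf],
    "_tmp": [1.0, 2.0],
    "ni_be": [-np.inf, 0.1],
})
clean = postprocess_sketch(raw)
print(clean)
```

An overriding method could apply trimming or winsorization to the result of these two steps.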

update_variables()

Preprocess data before creating characteristics.

Replace missing values with other variables.
Create frequently used variables.

pyanomaly.characteristics.FUNDQ

class pyanomaly.characteristics.FUNDQ(alias=None, data=None, freq=4)

Bases: Panel

Class to generate firm characteristics from fundq.

The firm characteristics generated in this class can be viewed using FUNDQ.show_available_functions():

>>> FUNDQ().show_available_functions()

Parameters:

alias – Characteristic column name in mapping.xlsx. If None, function names (without ‘c_’) are used as the characteristic names.
data – DataFrame with index = date/id, sorted on id/date.
freq – Frequency of data. ANNUAL, QUARTERLY, MONTHLY, or DAILY. Default to QUARTERLY.

c_chtx(): Tax expense surprise. Thomas and Zhang (2011)

c_ni_inc8q(): Number of consecutive quarters with earnings increases. Barth, Elliott, and Finn (1999)

c_niq_at(): Quarterly return on assets. Balakrishnan, Bartov, and Faurel (2010)

c_niq_at_chg1(): Change in quarterly return on assets. Balakrishnan, Bartov, and Faurel (2010)

c_niq_be(): Return on equity (quarterly). Hou, Xue, and Zhang (2015)

c_niq_be_chg1(): Change in quarterly return on equity. Balakrishnan, Bartov, and Faurel (2010)

c_niq_su(): Earnings surprise. Foster, Olsen, and Shevlin (1984)

c_ocfq_saleq_std(): Cash flow volatility. Huang (2009)

c_roavol(): ROA volatility. Francis et al. (2004)

c_rsup(): Revenue surprise (Kama). Kama (2009)

c_saleq_su(): Revenue surprise. Jegadeesh and Livnat (2006)

c_stdacc(): Accrual volatility. Bandyopadhyay, Huang, and Wirjanto (2010)

convert_currency()

Convert the currency of data to USD.

This method needs to be called if

the data contains non USD-denominated firms, e.g., CAD, and

CRSP’s market equity is used, which is always in USD.

convert_to_monthly(lag=4, limit=3)

Populate data to monthly frequency.

The index of data changes from datadate/gvkey to date/gvkey and datadate is kept as a column. The date and datadate have a gap of at least lag months.

Parameters:

lag – Minimum months between date and datadate: lag = 4 assumes fundq is available 4 months after datadate.
limit – Maximum months to forward-fill the data.

create_qitems_from_yitems()

Quarterize ytd items.

Quarterize a ytd column, Xy, and use the quarterized values to fill missing Xq if Xq exists or to create a new column Xq.
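
One way to quarterize a year-to-date column is to difference it within each fiscal year. The sketch below assumes this is what quarterizing means here; quarterize_sketch is a hypothetical helper, and the use of gvkey, fyearq, and fqtr as grouping columns is an assumption rather than a description of the library's code.

```python
import numpy as np
import pandas as pd


def quarterize_sketch(df, ycol, qcol):
    """Turn ytd column ycol into quarterly values and fill or create qcol."""
    df = df.sort_values(["gvkey", "fyearq", "fqtr"]).copy()
    by_year = df.groupby(["gvkey", "fyearq"])
    prev_ytd = by_year[ycol].shift(1)
    prev_fqtr = by_year["fqtr"].shift(1)
    # Difference of consecutive quarters within the same fiscal year;
    # NaN when the previous quarter is not in the data.
    quarterly = (df[ycol] - prev_ytd).where(prev_fqtr == df["fqtr"] - 1)
    # In the first fiscal quarter the ytd value is the quarterly value.
    quarterly = quarterly.where(df["fqtr"] != 1, df[ycol])
    if qcol in df.columns:
        df[qcol] = df[qcol].fillna(quarterly)  # fill missing Xq only
    else:
        df[qcol] = quarterly                   # create a new Xq
    return df


fundq = pd.DataFrame({
    "gvkey": ["A"] * 4,
    "fyearq": [2020] * 4,
    "fqtr": [1, 2, 3, 4],
    "capxy": [10.0, 25.0, 45.0, 70.0],
    "capxq": [10.0, np.nan, 20.0, np.nan],
    "oancfy": [5.0, 8.0, 12.0, 20.0],
})
fundq = quarterize_sketch(fundq, "capxy", "capxq")    # fills missing capxq
fundq = quarterize_sketch(fundq, "oancfy", "oancfq")  # creates oancfq
print(fundq)
```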

generate_funda_vars()

Generate quarterly-updated annual data from fundq.

The output can be merged with funda using FUNDA.merge_with_fundq().

Returns:

DataFrame of quarterly-updated annual data.

load_data(sdate=None, edate=None)

Load fundq data from file.

The loaded data is sorted on gvkey/datadate, has index = datadate/gvkey, and stored in the data attribute.

postprocess()

Postprocess data.

This method deletes temporary variables (variables starting with ‘_’) and replaces infinite values with nan. This method can be overridden to add code to trim or winsorize characteristics.

remove_duplicates()

Drop duplicates.

In fundq, there are duplicate rows (rows with the same datadate and gvkey). Remove duplicates in the following manner:

Remove records with missing fqtr.

Choose the latest record in the sense that it has the maximum fyearq and the minimum fqtr.
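
The two rules translate naturally into a sort followed by drop_duplicates. The following is a minimal sketch of that idea on a flat table, assuming gvkey and datadate are ordinary columns; remove_duplicates_sketch is a hypothetical name and the library may implement the step differently.

```python
import numpy as np
import pandas as pd


def remove_duplicates_sketch(df):
    """Keep one row per gvkey/datadate: drop rows with missing fqtr, then
    prefer the maximum fyearq and, within it, the minimum fqtr."""
    df = df.dropna(subset=["fqtr"])
    df = df.sort_values(
        ["gvkey", "datadate", "fyearq", "fqtr"],
        ascending=[True, True, False, True],
    )
    return df.drop_duplicates(["gvkey", "datadate"], keep="first")


fundq = pd.DataFrame({
    "gvkey": ["A", "A", "A", "B"],
    "datadate": pd.to_datetime(["2020-03-31"] * 4),
    "fyearq": [2019, 2020, 2020, 2020],
    "fqtr": [4.0, 1.0, np.nan, 1.0],
    "saleq": [1.0, 2.0, 3.0, 4.0],
})
deduped = remove_duplicates_sketch(fundq)
print(deduped)
```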

update_variables()

Preprocess data before creating characteristics.

Replace missing values with other variables.
Create frequently used variables.

pyanomaly.characteristics.CRSPM

class pyanomaly.characteristics.CRSPM(alias=None, data=None, freq=12)

Bases: Panel

Class to generate firm characteristics from crspm.

The firm characteristics generated in this class can be viewed using CRSPM.show_available_functions():

>>> CRSPM().show_available_functions()

Parameters:

alias – Characteristic column name in mapping.xlsx. If None, function names (without ‘c_’) are used as the characteristic names.
data – DataFrame with index = date/id, sorted on id/date.
freq – Frequency of data. ANNUAL, QUARTERLY, MONTHLY, or DAILY. Default to MONTHLY.

c_beta_60m(): Market beta (Org, JKP). Fama and MacBeth (1973)

c_chcsho_12m(): Net stock issues (JKP). Pontiff and Woodgate (2008)

c_chmom(): Change in 6-month momentum. Gettleman and Marks (2006)

c_div12m_me(): Dividend yield (JKP). Litzenberger and Ramaswamy (1979)

c_divi(): Dividend initiation. Michaely, Thaler, and Womack (1995)

c_divo(): Dividend omission. Michaely, Thaler, and Womack (1995)

c_dolvol(): Dollar trading volume (Org, GHZ). Brennan, Chordia, and Subrahmanyam (1998)

c_eqnpo_12m(): Composite equity issuance (JKP, 12 months). Daniel and Titman (2006)

c_eqnpo_60m(): Composite equity issuance (Org). Daniel and Titman (2006)

c_indmom(): Industry momentum. Moskowitz and Grinblatt (1999)

c_ipo(): Initial public offerings. Loughran and Ritter (1995)

c_market_equity(): Market equity. Banz (1981)

c_price(): Share price. Miller and Scholes (1982)

c_resff3_12_1(mp=True): 12 month residual momentum. Blitz, Huij, and Martens (2011)

c_resff3_6_1(): 6 month residual momentum. Blitz, Huij, and Martens (2011)

c_ret_12_1(): Momentum (12 months). Jegadeesh and Titman (1993)

c_ret_12_6(): Intermediate momentum (7-12). Novy-Marx (2012)

c_ret_1_0(): Short-term reversal. Jegadeesh (1990)

c_ret_36_12(): Long-term reversal (12-36). De Bondt and Thaler (1985)

c_ret_3_1(): Momentum (3 months). Jegadeesh and Titman (1993)

c_ret_60_12(): Long-term reversal (12-60). De Bondt and Thaler (1985)

c_ret_6_1(): Momentum (6 months). Jegadeesh and Titman (1993)

c_ret_9_1(): Momentum (9 months). Jegadeesh and Titman (1993)

c_seas_11_15an(): Years 11-15 lagged returns, annual. Heston and Sadka (2008)

c_seas_11_15na(): Years 11-15 lagged returns, nonannual. Heston and Sadka (2008)

c_seas_16_20an(): Years 16-20 lagged returns, annual. Heston and Sadka (2008)

c_seas_16_20na(): Years 16-20 lagged returns, nonannual. Heston and Sadka (2008)

c_seas_1_1an(): Year 1-lagged return, annual. Heston and Sadka (2008)

c_seas_1_1na(): Year 1-lagged return, nonannual. Heston and Sadka (2008)

c_seas_2_5an(): Years 2-5 lagged returns, annual. Heston and Sadka (2008)

c_seas_2_5na(): Years 2-5 lagged returns, nonannual. Heston and Sadka (2008)

c_seas_6_10an(): Years 6-10 lagged returns, annual. Heston and Sadka (2008)

c_seas_6_10na(): Years 6-10 lagged returns, nonannual. Heston and Sadka (2008)

c_turn(): Share turnover (Org, GHZ). Datar, Naik, and Radcliffe (1998)

filter_data()

Filter data.

Currently, we filter data only on shrcd.

load_data(sdate=None, edate=None)

Load crspm data from file.

The loaded data is sorted on permno/date, has index = date/permno, and stored in the data attribute.

Note

In CRSP monthly tables, date is the last business day of the month, whereas datadate in Compustat is the end-of-month date. To make the two dates consistent, crspm.date is shifted to the end of the month.

merge_with_factors(factors=None)

Merge crspm with factors.

The factors should contain Fama-French 3 factors with column names as defined in config.factor_names.

Parameters:

factors – DataFrame of factors with index = date. If None, it will be read from config.monthly_factors_fname.

postprocess()

Postprocess data.

This method deletes temporary variables (variables starting with ‘_’) and replaces infinite values with nan. This method can be overridden to add code to trim or winsorize characteristics.

update_variables()

Preprocess data before creating characteristics.

Replace missing values with other variables.
Create frequently used variables.

pyanomaly.characteristics.CRSPD

class pyanomaly.characteristics.CRSPD(alias=None, data=None, freq=365)

Bases: Panel

Class to generate firm characteristics from crspd.

CRSPD generates firm characteristics monthly and stores them in the chars attribute instead of the data attribute, which holds the daily data.

The firm characteristics generated in this class can be viewed using CRSPD.show_available_functions():

>>> CRSPD().show_available_functions()

Parameters:

alias – Characteristic column name in mapping.xlsx. If None, function names (without ‘c_’) are used as the characteristic names.
data – DataFrame with index = date/id, sorted on id/date.
freq – Frequency of data. ANNUAL, QUARTERLY, MONTHLY, or DAILY. Default to DAILY.

chars: DataFrame to store firm characteristics. Index = date/permno.

c_ami_126d(): Illiquidity. Amihud (2002)

c_baspread(): Bid-ask spread. Amihud and Mendelson (1986)

c_beta(): Market beta (GHZ). Fama and MacBeth (1973)

c_beta_dimson_21d(): Dimson Beta. Dimson (1979)

c_betabab_1260d(): Frazzini-Pedersen beta. Frazzini and Pedersen (2014)

c_betadown_252d(): Downside beta. Ang, Chen, and Xing (2006)

c_betasq(): Beta squared (GHZ). Fama and MacBeth (1973)

c_bidaskhl_21d(): High-low bid-ask spread. Corwin and Schultz (2012)

c_corr_1260d(): Market correlation. Asness et al. (2020)

c_coskew_21d(): Coskewness. Harvey and Siddique (2000)

c_dolvol_126d(): Dollar trading volume (JKP). Brennan, Chordia, and Subrahmanyam (1998)

c_dolvol_var_126d(): Volatility of dollar trading volume (JKP). Chordia, Subrahmanyam, and Anshuman (2001)

c_idiovol(): Idiosyncratic volatility (GHZ). Ali, Hwang, and Trombley (2003)

c_iskew_capm_21d(): Idiosyncratic skewness (CAPM). Bali, Engle, and Murray (2016)

c_iskew_ff3_21d(): Idiosyncratic skewness (FF3). Bali, Engle, and Murray (2016)

c_iskew_hxz4_21d(): Idiosyncratic skewness (HXZ). Bali, Engle, and Murray (2016)

c_ivol_capm_21d(): Idiosyncratic volatility (CAPM). Ang et al. (2006)

c_ivol_capm_252d(): Idiosyncratic volatility (Org, JKP). Ali, Hwang, and Trombley (2003)

c_ivol_ff3_21d(): Idiosyncratic volatility (FF3). Ang et al. (2006)

c_ivol_hxz4_21d(): Idiosyncratic volatility (HXZ). Ang et al. (2006)

c_prc_highprc_252d(): 52-week high. George and Hwang (2004)

c_pricedelay(): Price delay based on R-squared. Hou and Moskowitz (2005)

c_pricedelay_slope(): Price delay based on slopes. Hou and Moskowitz (2005)

c_retvol(): Return volatility. Ang et al. (2006)

c_rmax1_21d(): Maximum daily return. Bali, Cakici, and Whitelaw (2011)

c_rmax5_21d(): Highest 5 days of return. Bali, Brown, and Tang (2017)

c_rmax5_rvol_21d(): Highest 5 days of return to volatility. Asness et al. (2020)

c_rskew_21d(): Return skewness. Bali, Engle, and Murray (2016)

c_std_dolvol(): Volatility of dollar trading volume (GHZ). Chordia, Subrahmanyam, and Anshuman (2001)

c_std_turn(): Volatility of share turnover (GHZ). Chordia, Subrahmanyam, and Anshuman (2001)

c_trend_factor(): Trend factor. Han, Zhou, and Zhu (2016)

c_turnover_126d(): Share turnover (JKP). Datar, Naik, and Radcliffe (1998)

c_turnover_var_126d(): Volatility of share turnover (JKP). Chordia, Subrahmanyam, and Anshuman (2001)

c_zero_trades_126d(): Zero-trading days (6 months). Liu (2006)

c_zero_trades_21d(): Zero-trading days (1 month). Liu (2006)

c_zero_trades_252d(): Zero-trading days (12 months). Liu (2006)

create_chars(char_list=None)

Generate firm characteristics.

This method overrides Panel.create_chars() to store characteristics in chars instead of data.

filter_data()

Filter data.

Currently, we filter data only on shrcd.

load(fname=None, fdir=None)

Load data and parameters.

This method overrides Panel.load() to load chars instead of data.

load_data(sdate=None, edate=None)

Load crspd data from file.

The loaded data is sorted on permno/date, has index = date/permno, and stored in the data attribute.

merge_with_factors(factors=None)

Merge crspd with factors.

The factors should contain Fama-French 3 factors and Hou-Xue-Zhang 4 factors with column names as defined in config.factor_names.

Parameters:

factors – DataFrame of factors with index = date. If None, it will be read from config.daily_factors_fname.

postprocess()

Postprocess data.

This method deletes temporary variables (variables starting with ‘_’) and replaces infinite values with nan. This method can be overridden to add code to trim or winsorize characteristics.

prepare(char_fcns)

Prepare “ingredient” characteristics that are required to generate a characteristic.

This method overrides Panel.prepare() to store characteristics in chars instead of data.

static rolling_apply_m(cd, data_columns, fcn, n_retval)

Apply fcn to cd[data_columns] in every month for each stock.

This function runs a loop over permno-month and can be used when the calculation inside fcn requires only the data within the month.

Parameters:

cd – crspd.data.
data_columns – Columns of cd that are used as input to fcn.
fcn – Function to apply. The first argument (data) should be cd[data_columns] of a permno-month and the second argument (isnan) should be a nan indicator of data: isnan[i] = True if any column of data[i, :] has nan. fcn should return a tuple of size n_retval.
n_retval – Number of returns from fcn.

Returns:

Concatenated ndarray of the returns from fcn. Size = permno/month x n_retval.
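
The calling convention for fcn can be illustrated with a pandas groupby over permno-months. This is a slow but readable sketch and not the library's loop: rolling_apply_m_sketch and mean_and_count are hypothetical names, and the index level names "date" and "permno" are assumed.

```python
import numpy as np
import pandas as pd


def rolling_apply_m_sketch(cd, data_columns, fcn, n_retval):
    """Call fcn(data, isnan) once per permno-month and stack the results."""
    dates = cd.index.get_level_values("date")
    permno = cd.index.get_level_values("permno").to_numpy()
    yearmonth = (dates.year * 100 + dates.month).to_numpy()
    results = []
    for _, block in cd[data_columns].groupby([permno, yearmonth], sort=True):
        values = block.to_numpy(dtype=float)
        isnan = np.isnan(values).any(axis=1)  # True if any column is nan
        results.append(fcn(values, isnan))
    return np.asarray(results, dtype=float).reshape(-1, n_retval)


def mean_and_count(data, isnan):
    """Example fcn: mean of the first column and number of valid rows."""
    valid = data[~isnan, 0]
    return valid.mean(), float(valid.size)


index = pd.MultiIndex.from_tuples(
    [
        (pd.Timestamp("2020-01-02"), 1),
        (pd.Timestamp("2020-01-03"), 1),
        (pd.Timestamp("2020-02-03"), 1),
        (pd.Timestamp("2020-01-02"), 2),
        (pd.Timestamp("2020-01-03"), 2),
    ],
    names=["date", "permno"],
)
crspd_data = pd.DataFrame({"ret": [0.01, 0.03, 0.02, np.nan, 0.05]}, index=index)
result = rolling_apply_m_sketch(crspd_data, ["ret"], mean_and_count, 2)
print(result)
```

The output has one row per permno-month, ordered by permno and then month, and n_retval columns.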

save(fname=None, fdir=None, other_columns=None)

Save this object.

This method overrides Panel.save() to save chars instead of data.

update_variables()

Preprocess data before creating characteristics.

Replace missing values with other variables.
Create frequently used variables.

pyanomaly.characteristics.Merge

class pyanomaly.characteristics.Merge

Bases: Panel

Class to generate firm characteristics from a merged dataset of funda, fundq, crspm, and crspd.

The firm characteristics generated in this class can be viewed using Merge.show_available_functions():

>>> Merge().show_available_functions()

c_age(): Firm age. Jiang, Lee, and Zhang (2005)

c_mispricing_mgmt(): Mispricing factor: Management. Stambaugh and Yuan (2016)

c_mispricing_perf(): Mispricing factor: Performance. Stambaugh and Yuan (2016)

c_qmj(): Quality minus Junk: Composite. Asness, Frazzini, and Pedersen (2018)

c_qmj_growth(): Quality minus Junk: Growth. Asness, Frazzini, and Pedersen (2018)

c_qmj_prof(): Quality minus Junk: Profitability. Asness, Frazzini, and Pedersen (2018)

c_qmj_safety(): Quality minus Junk: Safety. Asness, Frazzini, and Pedersen (2018)

postprocess()

Postprocess data.

This method deletes temporary variables (variables starting with ‘_’) and replaces infinite values with nan. This method can be overridden to add code to trim or winsorize characteristics.

preprocess(crspm=None, crspd=None, funda=None, fundq=None, delete_data=True)

Merge crspm, crspd, funda, and fundq by left-joining crspd, funda, and fundq to crspm. The resulting data has index = date/permno.

Parameters:

crspm – CRSPM instance.
crspd – CRSPD instance.
funda – FUNDA instance.
fundq – FUNDQ instance.
delete_data – True to delete the data of crspm, crspd, funda, and fundq after merge to save memory.

pyanomaly.datatools

This module defines various toolkits for data-handling.

Rolling Application of Function

The functions below apply a function to each group of grouped data. For the differences, see the documentation of each function.

groupby_apply()
groupby_apply_np()
rolling_apply()
rolling_apply_np()

Grouping/Filtering/Trimming/Winsorization

classify(): Classify data cross-sectionally.
filter_n(): Filter data on a column.
filter(): Filter data on a column with cut points based on another column.
trim(): Trim data.
winsorize(): Winsorize data.

Data Inspection/Comparison

inspect_data(): Inspect a dataset.
compare_data(): Compare two data sets.

Auxiliary Functions

populate(): Populate data.
to_month_end(): Shift dates to the last dates of the same month.
add_months(): Add months to dates.

pyanomaly.datatools.add_months(date, months, to_month_end=True)

Add months to dates.

Parameters:

date – datetime Series.
months – Months to add. Can be negative.
to_month_end – If True, returned dates are end-of-month dates.

Returns:

datetime Series of (date + months). Dates are end-of-month dates if to_month_end = True.
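
The documented behaviour can be approximated with pandas date offsets. The sketch below is an assumption about equivalent behaviour, not the function's source; add_months_sketch is a hypothetical name.

```python
import pandas as pd


def add_months_sketch(date, months, to_month_end=True):
    """Add months to a datetime Series, optionally snapping to month end."""
    shifted = date + pd.DateOffset(months=months)  # clips to valid days
    if to_month_end:
        # MonthEnd(0) rolls forward to the month end and leaves
        # dates that are already month ends unchanged.
        shifted = shifted + pd.offsets.MonthEnd(0)
    return shifted


dates = pd.Series(pd.to_datetime(["2020-01-31", "2020-11-15"]))
plus_one = add_months_sketch(dates, 1)
plus_three_raw = add_months_sketch(dates, 3, to_month_end=False)
minus_one = add_months_sketch(dates, -1)
print(plus_one.tolist())
```

Adding one month to 2020-01-31 yields 2020-02-29 because the day is clipped to the last valid day of the target month.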

pyanomaly.datatools.classify(array, split, ascending=True, by_array=None)

Classify array cross-sectionally.

Class labels are set to 0, …, no. quantiles-1, where 0 corresponds to the lowest (highest) group if ascending = True (False). If array contains nan, their classes are set to nan.

Parameters:

array – Data to classify. Series with date/id index.
split – Number of classes or list of quantiles (see classify_array()).
ascending (bool) – Sorting order.
by_array – Data based on which quantile breakpoints are determined. If None, by_array = array.

Returns:

Series of class labels with index = array.index.

Examples

Suppose data is a panel dataset that has columns ‘me’ (market equity) and ‘me_nyse’ (same as ‘me’ but has a value only when the stock is listed on NYSE). The following code groups stocks into three size groups with NYSE-size breakpoints and stores the output in data[‘me_class’].

split = [0.2, 0.7, 1.0]  # bottom 20%, 20-70%, top 30%
data['me_class'] = classify(data['me'], split, by_array=data['me_nyse'])

pyanomaly.datatools.classify_array(array, split, ascending=True, by_array=None)

Classify array into classes.

Class labels are set to 0, …, no. quantiles-1, where 0 corresponds to the lowest (highest) group if ascending = True (False). If array contains nan, their classes are set to nan.

Parameters:

array – Nx1 ndarray or Series to be grouped.
split – Number of classes (for equal-size quantiles) or list of quantiles. (0.3, 0.7) is equivalent to (0.3, 0.7, 1.0).
ascending (bool) – Sorting order.
by_array – Array based on which cut points are determined. If None, by_array = array. Eg, array can be set to ME and by_array to NYSE-ME to group firms on size with NYSE-size cut points.

Returns:

Nx1 ndarray or Series of classes. The output type corresponds to the input type.

Examples

split = 10  # Classify array into 10 equal-size groups (0, ..., 9).
split = [0.3, 0.7, 1.0]  # Classify array into three groups (0, 1, 2) that correspond to the 0-0.3, 0.3-0.7, and
                         # 0.7-1.0 quantile ranges.
ascending = False  # Label 0 represents the largest group.
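The classification rule can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the library's code: classify_sketch is a hypothetical name, cut points use NumPy's default linear-interpolation quantiles, and a value equal to a cut point is placed in the lower class.

```python
import numpy as np


def classify_sketch(array, split, ascending=True, by_array=None):
    """Hypothetical quantile classifier; labels run 0, ..., n_classes - 1."""
    array = np.asarray(array, dtype=float)
    by = array if by_array is None else np.asarray(by_array, dtype=float)
    if isinstance(split, int):
        quantiles = np.arange(1, split + 1) / split  # equal-size classes
    else:
        quantiles = list(split)
        if quantiles[-1] < 1.0:  # (0.3, 0.7) is treated as (0.3, 0.7, 1.0)
            quantiles.append(1.0)
    # Interior cut points come from by_array, ignoring nan.
    cuts = np.nanquantile(by, quantiles[:-1])
    labels = np.searchsorted(cuts, array, side="left").astype(float)
    if not ascending:
        labels = len(quantiles) - 1 - labels  # label 0 becomes the highest group
    labels[np.isnan(array)] = np.nan  # nan inputs get nan classes
    return labels


values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, np.nan]
labels = classify_sketch(values, [0.3, 0.7, 1.0])
labels_desc = classify_sketch(values, (0.3, 0.7), ascending=False)
```

Passing a different by_array (for example, NYSE-only market equity) changes only where the cut points fall; every element of array is still classified.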

pyanomaly.datatools.compare_data(data1, data2=None, on=None, how='inner', tolerance=0.01, suffixes=('_x', '_y'), returns=True)

Compare data1 with data2.

This function compares the common columns of data1 and data2. This is similar to data1.compare(data2), but data1 and data2 are not required to have the same index and columns. Also, a tolerance can be set to determine whether two values are the same.

Parameters:

data1 – Dataframe for comparison.
data2 – Dataframe for comparison. If None, data1 is assumed to be a merged dataset of data1 and data2. If data1 is a merged dataset, on and how have no effect.
on – A column or a list of columns to merge data sets on. If None, data sets will be merged on index.
how – How to merge: ‘inner’, ‘outer’, ‘left’, or ‘right’
tolerance – Tolerance level to determine equality. Two values, val1 and val2, are considered to be the same if abs((val1 - val2) / val2) < tolerance.
suffixes – suffixes to add to common columns or suffixes used in the merged dataset. suffixes[0]: suffix for data1, suffixes[1]: suffix for data2.
returns – If True, return the merged data.
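The tolerance rule above is a relative-difference test. A minimal sketch of that test alone is shown below; is_same is a hypothetical helper, and the handling of val2 = 0 is an assumption because the description does not cover it.

```python
def is_same(val1, val2, tolerance=0.01):
    """Relative-difference test: abs((val1 - val2) / val2) < tolerance."""
    if val2 == 0:
        # Assumption: with a zero denominator, only an exact match counts as equal.
        return val1 == 0
    return abs((val1 - val2) / val2) < tolerance


close_enough = is_same(100.5, 100.0)  # relative difference 0.005
too_far = is_same(102.0, 100.0)       # relative difference 0.02
```

Because the difference is scaled by val2, the test is not symmetric in its arguments when the two values differ substantially.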

pyanomaly.datatools.filter(data, on, limits, by=None)

Filter data on on with cut points determined by by.

Filter out data if data[on] is below limits[0] quantile or above (1 - limits[1]) quantile. This function is similar to filter_n() but the cut points can be determined by another column values instead of the values of on.

Parameters:

data – DataFrame or Series. The first index should be date.
on – Column to filter on.
limits – Pair of quantiles, eg, (0.1, 0.1), (0.1, None).
by – Column based on which cut points are determined. If None, by = on.

Returns:

Filtered data.

Examples

Suppose data is a panel data (index = date/id) and has columns ‘me’ (market equity) and ‘nyse_me’ (market equity of NYSE stocks). The stocks whose sizes are smaller than the 0.2 NYSE-size quantile can be removed as follows:

>>> data = filter(data, 'me', (0.2, None), by='nyse_me')

pyanomaly.datatools.filter_n(data, on, n=None, q=None, ascending=True)

Filter data on on.

If n is given, choose n smallest (if ascending = True) or largest (if ascending = False) within each date. If q is given, choose q quantile (from the bottom if ascending = True) within each date.

Parameters:

data – DataFrame or Series. The first index should be date.
on – Column to filter on.
n – Number of rows to keep.
q – Quantile to keep.
ascending (bool) – If True, keep the smallest.

Returns:

Filtered data.

Examples

Suppose data is panel data (index = date/id) with a column ‘me’ (market equity). The smallest 20% can be filtered out (keeping the largest 80%) as follows:

>>> data = filter_n(data, 'me', q=0.8, ascending=False)

pyanomaly.datatools.groupby_apply(gb, fcn, *args)

Apply fcn to groups in gb.

Parameters:

gb – pd.GroupBy object.
fcn – Function to apply to groups. Function arguments should be (gbdata, *args), where gbdata is an element of gb.

Returns:

Concatenated outputs of fcn.

pyanomaly.datatools.groupby_apply_np(gb, fcn, *args)

Apply fcn to the values of the groups in gb.

This is the same as groupby_apply() except that the first argument of fcn is the values of the grouped data (ndarray). Use this function if fcn cannot receive a DataFrame as an argument, e.g., when fcn is wrapped by numba.jit.

Note

The groupby operation must not change the order of the data as the input to fcn, data.values, does not have an index. E.g., if data is sorted on id/date, grouping data by id is fine but not by date.

Parameters:

gb – pd.GroupBy object.
fcn – Function to apply to groups. Function arguments should be (gbdata_values, *args), where gbdata_values is the values of an element of gb.

Returns:

Concatenated outputs of fcn.
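The ordering requirement in the note can be illustrated with a small sketch. It is not the library's implementation: groupby_apply_np_sketch is a hypothetical name, and the outputs of fcn are assumed to be ndarrays that can be concatenated row-wise.

```python
import numpy as np
import pandas as pd


def groupby_apply_np_sketch(gb, fcn, *args):
    """Hypothetical: pass each group's values (ndarray) to fcn and stack the outputs."""
    return np.concatenate([fcn(group.values, *args) for _, group in gb])


# Panel with index = date/id, sorted on id/date.
dates = pd.to_datetime(["2020-01-31", "2020-02-29", "2020-03-31"])
index = pd.MultiIndex.from_product([["A", "B"], dates], names=["id", "date"]).swaplevel()
data = pd.DataFrame({"x": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]}, index=index)

# Grouping by id keeps the row order, so the stacked output lines up with data.
out = groupby_apply_np_sketch(data.groupby(level="id"), np.cumsum, 0)
data["cum_x"] = out[:, 0]
```

Grouping the same data by date would visit rows in a different order than they are stored, so the stacked output could no longer be assigned back positionally.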

pyanomaly.datatools.inspect_data(data, option=['summary'], date_col=None, id_col=None)

Inspect data.

This function inspects panel data, data, and prints the results.

‘summary’: data shape, number of unique dates, and number of unique ids.
‘id_count’: Number of ids per date.
‘nans’: Number of nans and infs per column.
‘stats’: Descriptive statistics. Same as data.describe().

Parameters:

data – Dataframe. It should have date and id columns or index = date/id.
option – List of items to display. Available options: ‘summary’, ‘id_count’, ‘nans’, ‘stats’.
date_col – Date column. If None, the first index level of data is assumed to be date.
id_col – ID column. If None, the second index level of data is assumed to be id.

pyanomaly.datatools.populate(data, freq, method='ffill', limit=None)

Populate data.

References

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.resample.html

Parameters:

data – Dataframe with index = date/id.
freq – Frequency to populate: QUARTERLY, MONTHLY, or DAILY.
method – Filling method for newly added rows. ‘ffill’: forward fill, None: None.
limit – Maximum number of rows to forward-fill.

Returns:

Populated data.

pyanomaly.datatools.rolling_apply(data, by=None, fcn=None, *args)

Apply fcn to each group of data grouped by by.

This is similar to groupby_apply(): groupby_apply() receives pd.GroupBy object as input whereas this function performs pd.groupby() inside.

Parameters:

data – DataFrame to be grouped.
by – Column by which data is grouped. If None, data is grouped by the last index.
fcn – Function to apply to groups. Function arguments should be (gbdata, *args), where gbdata is an element of the grouped data.

Returns:

Concatenated outputs of fcn.

pyanomaly.datatools.rolling_apply_np(data, by=None, fcn=None, *args)

Apply fcn to each group of data grouped by by.

This is the same as rolling_apply() except that the first argument of fcn is the values of the grouped data (ndarray). Use this function if fcn cannot receive a DataFrame as an argument, e.g., when fcn is wrapped by numba.jit.

Note

The groupby operation must not change the order of the data as the input to fcn, data.values, does not have an index. E.g., if data is sorted on id/date, grouping data by id is fine but not by date.

Parameters:

data – DataFrame to be grouped.
by – Column by which data is grouped. If None, data is grouped by the last index, i.e., groupby(level=-1).
fcn – Function to apply to groups. Function arguments should be (gbdata_values, *args), where gbdata_values is the values of an element of the grouped data.

Returns:

Concatenated outputs of fcn.

pyanomaly.datatools.to_month_end(date)

Shift dates to the last dates of the same month.

Parameters:

date – datetime Series.

Returns:

datetime Series shifted to month end.

pyanomaly.datatools.trim(array, limits, by_array=None)

Trim the values of array that fall outside the quantile values defined by limits, within each date.

E.g., if limits = (0.1, 0.1), trim data below 0.1 quantile and above 0.9 quantile. The elements of limits can be set to None for one-sided trim, e.g., limits = (0.1, None).

Parameters:

array (Series) – Data to trim. The first index should be date.
limits – Pair of quantiles, eg, (0.1, 0.1), (0.1, None).
by_array (Series) – Data based on which cut points are determined. If None, by_array = array.

Returns:

Trimmed data.

pyanomaly.datatools.trim_data(data, limits)

Trim the values in each column of data that fall outside the quantile values defined by limits, within each date.

The trimmed values are set to nan and the output shape is equal to the input shape.

Parameters:

data (DataFrame) – Data to trim. The first index should be date.
limits – Pair of quantiles, eg, (0.1, 0.1), (0.1, None).

Returns:

Trimmed data.

pyanomaly.datatools.winsorize(array, limits, by_array=None)

Winsorize the values of array that fall outside the quantile values defined by limits, within each date.

Parameters:

array (Series) – Data to winsorize. The first index should be date.
limits – Pair of quantiles, eg, (0.1, 0.1), (0.1, None).
by_array (Series) – Data based on which cut points are determined. If None, by_array = array.

Returns:

Winsorized data.
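A minimal pandas sketch of per-date winsorization is given below. It assumes limits = (lower, upper) means clipping below the lower quantile and above the (1 - upper) quantile, as in the trim() description; winsorize_sketch is a hypothetical name and the by_array option is omitted.

```python
import pandas as pd


def winsorize_sketch(array, limits):
    """Hypothetical per-date winsorization of a Series with index = date/id."""
    lower, upper = limits

    def clip_one_date(x):
        lo = x.quantile(lower) if lower is not None else None
        hi = x.quantile(1 - upper) if upper is not None else None
        return x.clip(lower=lo, upper=hi)  # None leaves that side unclipped

    # transform() keeps the original index and row order.
    return array.groupby(level=0).transform(clip_one_date)


dates = pd.to_datetime(["2020-01-31", "2020-02-29"])
index = pd.MultiIndex.from_product([dates, range(5)], names=["date", "id"])
values = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0, 10.0, 20.0, 30.0, 40.0, 50.0], index=index)

two_sided = winsorize_sketch(values, (0.25, 0.25))
one_sided = winsorize_sketch(values, (0.25, None))
```

The outlier 100.0 on the first date is pulled down to that date's 0.75 quantile in two_sided and left untouched in one_sided.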

pyanomaly.datatools.winsorize_data(data, limits)

Winsorize the values in each column of data that fall outside the quantile values defined by limits, within each date.

Parameters:

data (DataFrame) – Data to winsorize. The first index should be date.
limits – Pair of quantiles, eg, (0.1, 0.1), (0.1, None).

Returns:

Winsorized data.

pyanomaly.ff

Fama-French Factors.

This module defines a function to generate Fama-French factors.

pyanomaly.ff.make_ff_factors()

Generate Fama-French 3 factors.

This function refers to the WRDS code, but the results are slightly different as the code is written under the architecture of PyAnomaly. Compared to the data from the K. French website, HML has a correlation of 0.967, and SMB has a correlation of 0.989. Compared to the WRDS code, HML has a correlation of 0.991, and SMB has a correlation of 0.993.

WRDS code: https://wrds-www.wharton.upenn.edu/pages/support/applications/python-replications/fama-french-factors-python/

Major differences from WRDS code:

Primary stock identification: for our method, refer to wrds.add_gvkey_to_crsp().
Delist return: for our method, refer to wrds.merge_sf_with_seall().

Returns:

factors, number of firms.

factors (DataFrame): index = ‘date’, columns: ‘bh’, ‘bm’, ‘bl’, ‘sh’, ‘sm’, ‘sl’, ‘hml’, ‘smb’.
number of firms (DataFrame): index = ‘date’, columns: ‘bh’, ‘bm’, ‘bl’, ‘sh’, ‘sm’, ‘sl’, ‘hml’, ‘smb’.

pyanomaly.ff.make_ff_factors_wrds()

This is a simple copy of the WRDS code.

https://wrds-www.wharton.upenn.edu/pages/support/applications/python-replications/fama-french-factors-python/

Returns:

factors, number of firms.

pyanomaly.fileio

This module defines functions for file IO.

pyanomaly.fileio.read_from_file(fname, fdir=None)

Read data from a pickle file.

Parameters:

fname – File name without extension.
fdir – Directory. If None, config.output_dir.

Returns:

DataFrame read from fdir/fname.pickle.

pyanomaly.fileio.write_to_file(data, fname, fdir=None)

Write data to a pickle file.

Parameters:

data (Dataframe) – Data to save.
fname – File name without extension.
fdir – Directory. If None, config.output_dir.

pyanomaly.jkp

This module defines functions to generate firm characteristics, factors, and characteristic portfolios replicating JKP’s SAS code.

pyanomaly.jkp.generate_firm_characterisitcs(fname=None)

Firm characteristics generation (Replication of JKP’s SAS code).

This function generates the firm characteristics that appear in JKP (2021). We replicate JKP’s SAS code as closely as possible. The output is saved to ‘config.output_dir/fname.pickle’.

It is assumed that

the raw data have been downloaded from WRDS. If not, call WRDS.download_all();

factor portfolios have been created. If not, call make_factor_portfolios().

Parameters:

fname – Output file name. If None, fname = ‘merge’.

pyanomaly.jkp.make_char_portfolios(data, char_list, weighting)

Make characteristic portfolios.

Parameters:

data – DataFrame of firm characteristics.
char_list – List of characteristics to generate.
weighting – ‘ew’ (Equal-weight), ‘vw’ (Value-weight), or ‘vw_cap’ (Value-weight capped at 0.8 NYSE-size quantile).

Returns:

Characteristic portfolio DataFrame with index = date/class and columns = [char, ‘ret’, ‘signal’, ‘n_firms’]. The class values are one of ‘h’, ‘m’, ‘l’, ‘hml’.

pyanomaly.jkp.make_factor_portfolio(data, char, ret_col, size_class, weight_col=None)

Make a factor portfolio.

Parameters:

data – Dataframe with index = date/id. Its columns should include char, ret_col, size_class, and weight_col (optional).
char – Characteristic column to make a factor portfolio from.
ret_col – Return column.
size_class – Size class column.
weight_col – Weight column. If None, stocks are equally weighted.

Returns:

Dataframe of (size x char) factor portfolios. Index = ‘date’, columns: [‘bh’, ‘bl’, ‘bm’, ‘sh’, ‘sl’, ‘sm’, ‘hml’, ‘smb’].

pyanomaly.jkp.make_factor_portfolios(monthly=True, sdate=None)

Make factor portfolios.

FF 3 and HXZ 4 factors are generated. The result is saved to config.factors_monthly(daily)_fname.

Parameters:

monthly – If True (False), generate data monthly (daily).
sdate – Start date (‘yyyy-mm-dd’).

Returns:

Factor portfolio dataframe with index = ‘date’ and columns = [‘mktrf’, ‘rf’, ‘smb_ff’, ‘hml’, ‘inv’, ‘roe’, ‘smb_hxz’].

pyanomaly.jkp.prepare_data_for_factors(monthly=True, sdate=None)

Create characteristics needed to make factors.

Parameters:

monthly – If True (False), generate data monthly (daily).
sdate – Start date (‘yyyy-mm-dd’).

pyanomaly.panel

This module defines Panel class, which serves as the base class for panel data analysis.

class pyanomaly.panel.Panel(alias=None, data=None, freq=12, base_freq=None)

Base class for panel data.

Data is stored in the attribute data. data should have index = date/id, i.e., the first index should be time-series identifier and the second index should be cross-section identifier, and always be sorted on id/date. The date index must be of datetime type.

Parameters:

alias – Characteristic column name in mapping.xlsx. If None, function names (without ‘c_’) are used as the characteristic names.
data – DataFrame with index = date/id, sorted on id/date.
freq – Frequency of data. ANNUAL, QUARTERLY, MONTHLY, or DAILY.
base_freq – Frequency of data values. funda: ANNUAL, fundq: QUARTERLY, crspm: MONTHLY, crspd: DAILY. For example, if funda is populated monthly, freq = MONTHLY and base_freq = ANNUAL. If None, base_freq = freq.

data: DataFrame with index = date/id, sorted on id/date that stores a panel data.

freq: Frequency of data. ANNUAL, QUARTERLY, MONTHLY, or DAILY.

base_freq: Frequency of data values. ANNUAL, QUARTERLY, MONTHLY, or DAILY.

char_map: Dictionary to map functions with aliases. char_map[‘alias’] returns its corresponding function name.

reverse_map: Function to alias mapping. reverse_map[‘function’] returns its corresponding alias.

add_row_count()

Add row count column (‘rcount’) to data.

The rcount has values 0, 1, … for each ID starting from the first date. This can be used to remove the first n rows of each ID.

Note

If the rows of data are changed, e.g., by filtering, this function should be called again as rcount may no longer be continuous.

copy(columns=None, deep=False)

Make a copy of this object and return it.

Parameters:

columns – Columns of data to copy.
deep – If True, copy data, otherwise, only the address of data is copied.

Returns:

Copy of this object.

copy_from(panel, columns=None, deep=False)

Copy panel to this object.

Parameters:

panel – Panel object to be copied.
columns – Columns of panel.data to copy.
deep – If True, copy panel.data, otherwise, only the address of panel.data is copied.

create_chars(char_list=None)

Generate firm characteristics.

Created firm characteristics are added to data using their function names as the column names.

Parameters:

char_list – List of characteristics (aliases) to create. If None, all available characteristics are created.

cumret(ret, period=1, lag=0)

Compute cumulative returns between t-period and t-lag.

E.g., 12-month momentum can be obtained by setting period = 12 and lag = 1. A negative period will generate future returns: eg, period = -1 and lag = 0 for one-period ahead return; period = -3 and lag = -1 for two-period ahead return starting from t+1.

Note

Returns ret should be in base_freq. If base_freq = ANNUAL, the returns in ret should be annual returns regardless of the value of freq.

Parameters:

ret – Series of returns in base_freq.
period – Target horizon. (+) for past returns and (-) for future returns.
lag – Period to calculate returns from.

Returns:

Series of cumulative returns.
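For past returns (positive period), the window arithmetic can be sketched in plain Python. This is an illustration under an assumed convention, not the library's code: the cumulative return between t-period and t-lag is taken to compound the single-period returns dated t-period+1 through t-lag, and cumret_sketch is a hypothetical name.

```python
def cumret_sketch(returns, t, period, lag=0):
    """Hypothetical: compound returns[t-period+1 .. t-lag] for one ID.

    returns[s] is the return between s-1 and s; requires t - period + 1 >= 0.
    """
    growth = 1.0
    for s in range(t - period + 1, t - lag + 1):
        growth *= 1.0 + returns[s]
    return growth - 1.0


returns = [0.0, 0.1, 0.1, -0.5]
# Return between t-3 and t-1 at t = 3: compounds returns[1] and returns[2].
skip_last = cumret_sketch(returns, t=3, period=3, lag=1)
# Return between t-3 and t at t = 3: also includes returns[3].
full = cumret_sketch(returns, t=3, period=3, lag=0)
```

Under this convention, setting lag = 1 simply drops the most recent return from the product, which is how the 12-month momentum example skips the latest month.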

date_col(): Get date index name.

date_values(): Get date index values.

diff(data, n=1)

Get the difference of the elements of data accounting for frequency.

This is similar to data.diff() but the actual period between two data points is determined by freq and base_freq. See Panel.shift() for details.

Parameters:

data – Series or Dataframe.
n – Shift size in terms of base frequency.

Returns:

Differenced data.

filter(filters, keep_row=False)

Filter data using the conditions defined by filters.

A filter is a tuple of three elements:

filter[0]: column to apply the filter to.

filter[1]: filter condition: ‘==’, ‘!=’, ‘>’, ‘<’, ‘>=’, ‘<=’, ‘in’, or ‘not in’.

filter[2]: rhs value.

If filters is a list of filters, they are applied sequentially.

For example, to remove data[‘x’] < 10,

>>> panel.filter(('x', '>=', 10))

This is equivalent to

>>> panel.data = panel.data[panel.data['x'] >= 10]

Parameters:

filters – A filter or list of filters.
keep_row – Whether to keep or remove the filtered out rows. If True, the values of the filtered rows are set to None.
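The three-element filter format can be sketched on a plain DataFrame as follows. This is a simplified illustration, not the method itself: apply_filters is a hypothetical function, and the keep_row option is left out, so filtered rows are always removed.

```python
import operator

import pandas as pd

# Comparison conditions mapped to their operator functions.
OPS = {"==": operator.eq, "!=": operator.ne, ">": operator.gt,
       "<": operator.lt, ">=": operator.ge, "<=": operator.le}


def apply_filters(data, filters):
    """Hypothetical: apply (column, condition, rhs) filters sequentially."""
    if isinstance(filters, tuple):  # a single filter was given
        filters = [filters]
    for column, condition, rhs in filters:
        if condition == "in":
            mask = data[column].isin(rhs)
        elif condition == "not in":
            mask = ~data[column].isin(rhs)
        else:
            mask = OPS[condition](data[column], rhs)
        data = data[mask]  # keep only rows that satisfy the condition
    return data


frame = pd.DataFrame({"x": [5, 10, 15], "exch": ["N", "A", "N"]})
single = apply_filters(frame, ("x", ">=", 10))
chained = apply_filters(frame, [("x", ">=", 10), ("exch", "in", ["N"])])
```

Applying the filters one after another is equivalent to combining their conditions with a logical AND.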

futret(ret, period=1): Get period-period ahead return.

get_available_chars()

Get the list of available characteristics.

This function returns the aliases of the characteristics defined in alias column of mapping.xlsx.

Returns:

A list of characteristics.

id_col(): Get ID index name.

id_values(): Get ID index values.

inspect_data(option=['summary'])

Inspect data.

See datatools.inspect_data() for details.

load(fname=None, fdir=None)

Load this object from a file.

Parameters:

fname – File name without extension. If None, fname = class name.
fdir – Directory to read the file from. If None, fdir = config.output_dir.

merge(right, on=None, how='inner', drop_duplicates='right')

Merge data with right.

Unlike pd.merge(), the index of data is always retained.

Parameters:

right – Series or Dataframe to merge with
on – Columns to merge on. If None, merge on index.
how – Merge method: ‘inner’, ‘outer’, ‘left’, or ‘right’.
drop_duplicates – how to handle duplicate columns. ‘left’: keep right, ‘right’: keep left, None: keep both. If None, ‘_x’ (‘_y’) are added to the duplicate column names of left (right).

pct_change(data, n=1, allow_negative_denom=False)

Get the percentage change of data accounting for frequency.

This is similar to data.pct_change() but the actual period between two data points is determined by freq and base_freq. See Panel.shift() for details.

Parameters:

data – Series or Dataframe.
n – Shift size in terms of base frequency.
allow_negative_denom – If False, set the output to nan when the denominator is not positive.

Returns:

Series or Dataframe of percentage changes.

populate(freq, method='ffill', limit=None)

Populate data.

See datatools.populate() for details.

prepare(char_fcns)

Prepare “ingredient” characteristics that are required to generate a characteristic.

If a characteristic is defined as a function of other characteristics, this function will check whether they already exist and generate them if they do not exist.

Parameters:

char_fcns – List of function names (without ‘c_’) of the ingredient characteristics.

remove_rows(data, nrow=1)

Remove (set to nan) the first nrow rows of data for each ID.

Parameters:

data – Dataframe, Series, or ndarray. Its length should be the same as the length of self.data.
nrow – Number of rows to remove (set to nan).

Returns:

data with rows removed.

rename(to_alias=True)

Rename data columns.

Parameters:

to_alias – If True, rename firm characteristic columns from function names to aliases, and vice-versa.

rolling(data, n, function, min_n=None, lag=0)

Apply function to a rolling window of data.

This is similar to data.rolling() but the actual rolling window is determined by freq and base_freq. See Panel.shift() for details.

Parameters:

data – Series or Dataframe.
n – Window size.
function – Function to apply: ‘mean’, ‘std’, ‘var’, ‘sum’, ‘min’, or ‘max’.
min_n – Minimum number of observations in the window. If observations < min_n, result is nan. If min_n = None, min_n = n.
lag – Lag size before the rolling window starts.

Returns:

(Series or Dataframe) Rolling calculation result.

Examples

Suppose funda is a Panel object with freq = MONTHLY and base_freq = ANNUAL (annual data populated monthly), and funda.data has a return column, ‘ret’. The past three-year average return starting from one year ago can be calculated (with a condition of at least two non-missing values within the sample window) and saved as ‘avg_ret3y’ as follows:

>>> funda.data['avg_ret3y'] = funda.rolling(funda.data['ret'], 3, 'mean', 2, 1)

rolling_beta(data, n, min_n=None)

Rolling OLS accounting for frequency.

This is similar to Panel.rolling_regression(): this function is faster but the output is limited to coefficients, R2, and idiosyncratic volatility.

Parameters:

data – DataFrame with index = date/id, sorted on id/date. The first column must be y and the rest X (not including constant).
n – Window size in terms of base frequency.
min_n – Minimum number of observations in the window. If observations < min_n, result is nan. If min_n = None, min_n = n.

Returns:

Coefficients, R2, idiosyncratic volatility. These are NxK, Nx1, and Nx1 ndarrays, respectively, where N = len(data) and K = number of X + 1.

rolling_regression(y, X, n, min_n=None, add_constant=True, output=None)

Rolling OLS accounting for frequency.

This function internally uses statsmodels.RollingOLS(), which is slow when output is not None. Use Panel.rolling_beta() if possible.

Note

When this function is used, the first n-1 rows of each ID should be removed. This means that when min_n < n, some results can be lost. E.g., if n = 5 and min_n = 3, the rows with data.rcount < 4 should be removed, and the results for data.rcount = 2, 3 will be lost.

Parameters:

y – Series of y with index = date/id.
X – Series or Dataframe of X with index = date/id.
n – Window size in terms of base frequency.
min_n – Minimum number of observations in the window. If observations < min_n, result is nan. If min_n = None, min_n = n.
add_constant – If True, add constant to X.
output – List of outputs supported by statsmodels.RollingOLS(). If None, estimate coefficients only.

Returns:

DataFrame of the coefficients if output = None, otherwise, DataFrames of the outputs defined by output.

save(fname=None, fdir=None, other_columns=None)

Save this object.

Parameters are saved to a json file and data is saved to a pickle file.

Parameters:

fname – File name without extension. If None, fname = class name.
fdir – Directory to save the file. If None, fdir = config.output_dir.
other_columns – List of columns to save other than firm characteristics. If given, characteristic columns plus other_columns are saved. If None, all columns are saved. Note that unsaved columns will be deleted from self.data.

shift(data, n=1)

Shift data accounting for frequency.

This is similar to data.shift() but the actual shift period is determined by freq and base_freq. E.g., if freq = MONTHLY, base_freq = ANNUAL, n = 1 means 1-year shift, and data is shifted by 12 (12 months). That is, if base_freq = ANNUAL (QUARTERLY), shift(n) will always shift data by n years (quarters) regardless of the data frequency, freq.

Parameters:

data – Series or Dataframe.
n – Shift size in terms of base frequency.

Returns:

Shifted data.
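The frequency adjustment can be sketched with pandas. The sketch assumes that the frequency constants are periods per year (ANNUAL = 1, MONTHLY = 12, in line with the default freq=12 in the class signature) and that the row shift is n * freq / base_freq within each ID; shift_sketch is a hypothetical name.

```python
import pandas as pd

ANNUAL, MONTHLY = 1, 12  # assumed values: periods per year


def shift_sketch(data, n, freq, base_freq):
    """Hypothetical: shift by n base periods within each ID (last index level)."""
    rows = n * freq // base_freq  # e.g., 1 year = 12 rows for monthly data
    return data.groupby(level=-1).shift(rows)


# Two IDs with 24 monthly rows each, index = date/id, sorted on id/date.
dates = pd.date_range("2020-01-01", periods=24, freq="MS")
index = pd.MultiIndex.from_product([["A", "B"], dates], names=["id", "date"]).swaplevel()
data = pd.Series(list(range(24)) + list(range(100, 124)), index=index)

shifted = shift_sketch(data, 1, freq=MONTHLY, base_freq=ANNUAL)
```

Grouping by ID keeps the first year of "B" empty instead of filling it with the last values of "A".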

show_available_functions(): Display all functions for firm characteristics.

sum(*args)

Sum of args.

This is similar to sum() of SAS: nan’s of args are replaced by 0.

Parameters:

args – List of Series.

Returns:

(Series) Sum of args.
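A minimal pandas sketch of the nan-as-zero sum described above follows; sas_sum is a hypothetical name and the Series are assumed to share the same index.

```python
import numpy as np
import pandas as pd


def sas_sum(*args):
    """Hypothetical: element-wise sum with nan replaced by 0 in every input."""
    return sum(series.fillna(0) for series in args)


a = pd.Series([1.0, np.nan, 2.0])
b = pd.Series([np.nan, np.nan, 3.0])
total = sas_sum(a, b)
plain = a + b  # ordinary addition propagates nan
```

Ordinary addition yields nan wherever either input is missing, whereas the nan-as-zero sum yields a number in every row.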

pyanomaly.portfolio

This module defines Portfolio and Portfolios classes that are used for portfolio analysis.

class pyanomaly.portfolio.Portfolio(name=None, position=None, rf=None, pfval0=1, costfcn=None, keep_position=False)

Portfolio class.

This class makes a portfolio from positions and evaluates it.

Position information is saved in position attribute and portfolio information is saved in value attribute. If positive weights of the positions on a date don’t add up to 1, (1 - sum(positive weights)) will be assumed to be invested in a risk free asset, and its information is saved in fposition attribute. The transaction cost is assumed to be 0 for risk-free assets. Once the portfolio is evaluated by calling Portfolio.eval(), performance attribute is generated.

Parameters:

name – Portfolio name.
position – Position DataFrame. It should have index = ‘date’ and columns = ‘id’ (security ID), ‘ret’ (return), and ‘wgt’ (portfolio weight). If it has other columns such as price, they will be kept in the position attribute. If it has ‘rf’ (risk-free rate) column, its values are used as risk-free rates.
rf – Risk-free rate DataFrame with index = ‘date’ and columns = ‘rf’. The rf has priority over ‘rf’ column in position. If rf = None and position does not have ‘rf’ column, the risk-free is assumed to be 0.
pfval0 – Initial portfolio value. Default to 1.
costfcn – TransactionCost class, a transaction cost function, or value. If a constant transaction cost of 20 basis points is assumed, this can be set to 0.002.
keep_position – If False, position information (position) is deleted after the portfolio is created.

name: Portfolio name.

position

Position DataFrame. Its index is ‘date’ and has the following columns:

‘id’: security ID provided by the user.
‘ret’: Return between t-1 and t.
‘exret’: Excess return over risk-free rate between t-1 and t.
‘wgt’: Weight at the beginning of t.
‘val1’: Value at t.
‘val’: Value at the beginning of t.
‘val0’: Value at t-1.
‘cost’: Transaction cost incurred at the beginning of t.
Other columns: Any columns that are in the input position data.

In the description above, ‘at the beginning of t’ means at t-1 after rebalancing.

value

Portfolio value DataFrame. Its index is ‘date’ and has the following columns:

‘ret’: Return between t-1 and t. This can be either net return or gross return depending on the evaluation method: see Portfolio.eval().
‘exret’: Excess return over risk-free rate between t-1 and t. This can be either net excess return or gross excess return depending on the evaluation method.
‘val1’: Value at t.
‘val’: Value at the beginning of t.
‘cost’: Transaction cost incurred at the beginning of t.
‘tover’: Turnover incurred at the beginning of t. tover = sum(pos.val - pos.val0) / sum(pf.val), where pos.val is the position values and pf.val is the portfolio value.
‘netret’: Return between t-1 and t, net of transaction cost.
‘grossret’: Gross return between t-1 and t.
‘netexret’: Excess return between t-1 and t, net of transaction cost.
‘grossexret’: Excess gross return between t-1 and t.
‘lposition’: Number of long positions.
‘sposition’: Number of short positions.

Columns added once the portfolio is evaluated by calling Portfolio.eval():

‘cumret’: Cumulative return since the first date.
‘drawdown’: Drawdown.
‘drawdur’: Duration of drawdown in the frequency of data, e.g., if rebalanced monthly, 3 means 3 months.
‘drawstart’: Beginning date of drawdown.
‘succdown’: Successive down; down without any up during the period.
‘succdur’: Duration of successive down.
‘succstart’: Beginning date of successive down.

fposition

Risk-free asset position DataFrame. Its index is ‘date’ and has the following columns:

‘ret’: Return between t-1 and t.
‘wgt’: Weight at the beginning of t.
‘val1’: Value at t.
‘val’: Value at the beginning of t.

performance

Portfolio performance DataFrame. Its column is equal to the portfolio name and has the following indexes:

‘mean’: Mean excess return over the evaluation period.
‘std’: Standard deviation of the excess returns over the evaluation period.
‘sharpe’: Sharpe ratio.
‘cum’: Cumulative return over the evaluation period.
‘mdd’: Maximum drawdown.
‘mdd start’: Maximum drawdown start date.
‘mdd end’: Maximum drawdown end date.
‘msd’: Maximum successive down.
‘msd start’: Maximum successive down start date.
‘msd end’: Maximum successive down end date.
‘turnover’: Average turnover.
‘lposotion’: Average number of long positions.
‘sposition’: Average number of short positions.

costfcn

TransactionCost object, a transaction cost function, or value. For example, if a constant transaction cost of 20 basis points is assumed, costfcn can be simply set to 0.002.

If costfcn is a TransactionCost object, TransactionCost.get_cost(position) is called to get transaction costs. TransactionCost allows transaction costs varying across time and securities. See pyanomaly.tcost module for more details.

When costfcn is defined as a function, the transaction cost function should have arguments val (value after rebalancing) and val0 (value before rebalancing). For example, if the transaction cost to buy (sell) is 20 (30) bps, a function can be defined as follows:

def cost_fcn(val, val0):
    return 0.002 if val > val0 else 0.003

Note

If a position exists at t-1 but not at t, it will be added at t with 0 weight. This is to compute the transaction cost.

‘val’, ‘val1’, and ‘val0’ in position and value are calculated without taking transaction costs into account. For the value increase after transaction costs, use the cumulative return.
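The risk-free allocation rule stated in the class description, (1 - sum(positive weights)), can be written out as a small sketch. The function name risk_free_weight is hypothetical; the rule itself is taken from the description.

```python
def risk_free_weight(weights):
    """Weight assumed to be invested in the risk-free asset on one date."""
    positive = sum(w for w in weights if w > 0)  # short positions are ignored
    return 1.0 - positive


partly_invested = risk_free_weight([0.5, 0.3, -0.2])  # long side sums to 0.8
fully_invested = risk_free_weight([0.6, 0.4, -1.0])   # long side sums to 1.0
```

Only the long side enters the calculation, so a dollar-neutral long-short portfolio whose long weights sum to 1 holds no risk-free position.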

add_position(position, rf=None)

Add new positions.

This method assumes positions are added in order of time. All positions of a given date must be added together. The arguments have the same formats as those of the initializer.

copy(sdate=None, edate=None)

Copy this object for the given period.

Parameters:

sdate – Start date (‘yyyy-mm-dd’).
edate – End date.

Returns:

Portfolio object.

cum_return(sdate=None, edate=None, logscale=True)

Get the cumulative return over the period.

Parameters:

sdate – Start date.
edate – End date.
logscale – If True, return log-scale cumulative return.

Returns:

Cumulative return.

cum_returns(sdate=None, edate=None, logscale=True, zero_start=False)

Get portfolio cumulative returns.

Both sdate and edate are inclusive, i.e., the first return is the return over sdate-1 and sdate.

Parameters:

sdate – Start date.
edate – End date.
logscale – If True, return log-scale cumulative returns.
zero_start – If True, the return at sdate is forced to 0. That is, it is assumed that trading starts at the end of sdate. This is useful when plotting cumulative returns as all curves will start at the same point, 0.

Returns:

Cumulative return Series with index = ‘date’.
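The logscale and zero_start options can be sketched in plain Python. The sketch assumes that log-scale cumulative returns are running sums of log(1 + r), which the description does not spell out; cum_returns_sketch is a hypothetical name.

```python
import math


def cum_returns_sketch(returns, logscale=True, zero_start=False):
    """Hypothetical cumulative returns for returns dated sdate, ..., edate."""
    returns = list(returns)
    if zero_start and returns:
        returns[0] = 0.0  # trading is assumed to start at the end of sdate
    out, total = [], 0.0
    for r in returns:
        total += math.log1p(r)  # running sum of log(1 + r)
        out.append(total if logscale else math.expm1(total))
    return out


simple = cum_returns_sketch([0.1, 0.1], logscale=False)
from_zero = cum_returns_sketch([0.1, 0.1], logscale=False, zero_start=True)
log_scale = cum_returns_sketch([0.1, 0.1])
```

With zero_start = True every series begins at 0, which is what makes curves for different portfolios comparable on one plot.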

drawdown(sdate=None, edate=None, logscale=True)

Get the drawdowns over the period.

Parameters:

sdate – Start date.
edate – End date.
logscale – If True, return log-scale values.

Returns:

DataFrame of drawdowns with index = ‘date’ and columns = [‘value’, ‘duration’, ‘start’].
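As an illustration of the drawdown values only, the sketch below uses a common definition: the percentage decline of the portfolio value from its running peak, in simple (not log) scale. This definition and the name drawdown_sketch are assumptions; the library's sign convention, duration, and start-date columns are not reproduced.

```python
def drawdown_sketch(returns):
    """Hypothetical: decline from the running peak after each return."""
    value, peak, out = 1.0, 1.0, []
    for r in returns:
        value *= 1.0 + r
        peak = max(peak, value)         # highest value reached so far
        out.append(value / peak - 1.0)  # 0 at a new peak, negative below it
    return out


drawdowns = drawdown_sketch([0.1, -0.5, 1.0])
max_drawdown = min(drawdowns)
```

The most negative entry of the series corresponds to the maximum drawdown over the period.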

eval(sdate=None, edate=None, logscale=True, annualize_factor=1, consider_cost=True, percentage=False)

Evaluate the portfolio over the period.

This method evaluates the portfolio and creates the performance attribute. It also adds performance-related columns to the value attribute.

performance: DataFrame with columns = [name] and the following indexes:

‘mean’: Mean excess return over the evaluation period.

‘std’: Standard deviation of the excess returns over the evaluation period.

‘sharpe’: Sharpe ratio.

‘cum’: Cumulative return over the evaluation period.

‘mdd’: Maximum drawdown.

‘mdd start’: Maximum drawdown start date.

‘mdd end’: Maximum drawdown end date.

‘msd’: Maximum successive down.

‘msd start’: Maximum successive down start date.

‘msd end’: Maximum successive down end date.

‘turnover’: Average turnover.

‘lposotion’: Average number of long positions.

‘sposition’: Average number of short positions.

Columns added to value:

‘cumret’: Cumulative return since sdate.

‘drawdown’: Drawdown.

‘drawdur’: Duration of drawdown in the frequency of data, e.g., if rebalanced monthly, 3 means 3 months.

‘drawstart’: Beginning date of drawdown.

‘succdown’: Successive down; down without any up during the period.

‘succdur’: Duration of successive down.

‘succstart’: Beginning date of successive down.

Parameters:

sdate – Start date.
edate – End date.
logscale – If True, ‘cum’, ‘mdd’, ‘msd’ are in log-scale.
annualize_factor – ‘mean’, ‘std’, ‘sharpe’, and ‘turnover’ are annualized by this factor, e.g., mean is multiplied by annualize_factor and std by its square-root. If the data is monthly, the results can be annualized by setting annualize_factor = 12. Default to 1.
consider_cost – If True, the results are calculated using net returns, otherwise, they are calculated using gross returns. If True (False), value.ret will be set to value.netret (value.grossret).
percentage – If True, ‘mean’, ‘std’, ‘cum’, ‘mdd’, ‘msd’, and ‘turnover’ are multiplied by 100. Default to False.

Returns:

performance, value.
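The effect of annualize_factor and percentage on ‘mean’, ‘std’, and ‘sharpe’ can be sketched as follows. The helper annualized_stats is hypothetical, and the use of the sample standard deviation (the pandas default) is an assumption; the scaling rules themselves follow the parameter descriptions above.

```python
import numpy as np
import pandas as pd

def annualized_stats(exret, annualize_factor=1, percentage=False):
    """Mean, std, and Sharpe ratio of excess returns (hypothetical helper)."""
    # Mean scales with the factor, std with its square root.
    mean = exret.mean() * annualize_factor
    std = exret.std() * np.sqrt(annualize_factor)
    # The Sharpe ratio therefore scales with the square root of the factor.
    sharpe = mean / std
    # 'mean' and 'std' are expressed in percent if requested; 'sharpe' is not.
    scale = 100 if percentage else 1
    return pd.Series({'mean': mean * scale, 'std': std * scale, 'sharpe': sharpe})

exret = pd.Series([0.01, 0.02, -0.01, 0.03])   # monthly excess returns
monthly = annualized_stats(exret)
annual = annualized_stats(exret, annualize_factor=12)
annual_pct = annualized_stats(exret, annualize_factor=12, percentage=True)
```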

eval_series(sdate=None, edate=None, logscale=True, annualize_factor=1, consider_cost=True, percentage=False)

Evaluate the portfolio repeatedly over the period.

Evaluate the portfolio repeatedly for the period [sdate, sdate+1], [sdate, sdate+2], …, [sdate, edate]. For the description of the arguments, see Portfolio.eval().

Returns:: Performance for each period. DataFrame with index values equal to the period end dates and columns equal to the indexes of performance attribute, i.e., a row with index t contains the performance up to t.

static from_portfolo_return(pfret, val=1)

Make a portfolio given portfolio returns.

If portfolio returns are already known, this method can be used to construct a Portfolio object and evaluate the portfolio.

Parameters:: pfret – DataFrame with index = ‘date’ and columns = ‘ret’.
Returns:: Portfolio object.

max_drawdown(sdate=None, edate=None, logscale=True)

Get the maximum drawdown over the period.

Parameters:

sdate – Start date.
edate – End date.
logscale – If True, return log-scale value.

Returns:

Maximum drawdown. Series with index = [‘value’, ‘duration’, ‘start’].

max_succdown(sdate=None, edate=None, logscale=True)

Get the maximum successive down over the period.

Parameters:

sdate – Start date.
edate – End date.
logscale – If True, return log-scale value.

Returns:

Maximum successive down. Series with index = [‘value’, ‘duration’, ‘start’].

mean_return(sdate=None, edate=None)

Get portfolio mean return over the period.

Parameters:

sdate – Start date.
edate – End date.

Returns:

Mean return.

returns(sdate=None, edate=None)

Get portfolio returns.

Both sdate and edate are inclusive, i.e., the first return is the return from sdate-1 to sdate.

Parameters:

sdate – Start date.
edate – End date.

Returns:

Return Series with index = ‘date’.

set_position(position, rf=None, pfval0=1, keep_position=False)

Set positions.

This method sets the position attribute from the input arguments. For the details of the input arguments, see the arguments of the initializer.

set_return_type(consider_cost=True)

Determine which return (gross vs. net) to use for portfolio evaluation.

If consider_cost is True, value.ret and value.exret are respectively set to net returns (value.netret) and net excess returns (value.netexret). Otherwise, they are set to gross returns (value.grossret) and gross excess returns (value.grossexret).

Parameters:: consider_cost – True to use returns net of transaction costs for evaluation.

sharpe_ratio(sdate=None, edate=None)

Get the Sharpe ratio over the period.

If risk-free rates are not set, they are assumed to be 0.

Parameters:

sdate – Start date.
edate – End date.

Returns:

Sharpe ratio.

std_return(sdate=None, edate=None)

Get the standard deviation of the returns over the period.

Parameters:

sdate – Start date.
edate – End date.

Returns:

Standard deviation of the returns.

succdown(sdate=None, edate=None, logscale=True)

Get the successive downs over the period.

Parameters:

sdate – Start date.
edate – End date.
logscale – If True, return log-scale values.

Returns:

Successive downs. DataFrame with index = ‘date’ and columns = [‘value’, ‘duration’, ‘start’].
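A successive down is a run of consecutive negative returns with no up in between. The sketch below shows one way to build the three output columns from a return Series. It assumes log-scale values, that a zero return ends a run, and that rows outside a run get value 0, duration 0, and a missing start; succdown_sketch is a hypothetical name, not the library's code.

```python
import numpy as np
import pandas as pd

def succdown_sketch(ret):
    """Successive-down table from a simple-return Series (hypothetical helper)."""
    down = ret < 0
    # The counter increases on every non-down row, so it is constant within a run.
    run_id = (~down).cumsum()
    # Cumulative log return within the current run of downs.
    value = np.log1p(ret).where(down, 0.0).groupby(run_id).cumsum()
    # Number of consecutive downs so far.
    duration = down.astype(int).groupby(run_id).cumsum()
    # Date of the first down in the run, carried forward within the run.
    first_down = down & ~down.shift(fill_value=False)
    start = ret.index.to_series().where(first_down).groupby(run_id).ffill()
    return pd.DataFrame({'value': value, 'duration': duration, 'start': start})

dates = pd.Index(pd.to_datetime(['2000-01-31', '2000-02-29', '2000-03-31',
                                 '2000-04-30', '2000-05-31']), name='date')
ret = pd.Series([0.01, -0.02, -0.03, 0.04, -0.01], index=dates)
sd = succdown_sketch(ret)
```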

class pyanomaly.portfolio.Portfolios(portfolios=None)

Class for a group of portfolios.

This class can have several portfolios as its members and evaluate them together. This class facilitates portfolio comparison by evaluating them and saving the results in a single DataFrame.

Parameters:: portfolios – List or dict of Portfolio objects to add. If it is a dict, its keys are used as portfolio names. Portfolios can be added later using Portfolios.add() or Portfolios.set().

members

Portfolio members. A Portfolio object is added to members with its name as the key. A member portfolio can be accessed by __getitem__():

>>> pf1 = Portfolio('pf1')
>>> portfolios = Portfolios()
>>> portfolios.add(pf1)
>>> pf1 = portfolios['pf1']  # This and the next line are equivalent.
>>> pf1 = portfolios.members['pf1']

Type:: dict

value: Portfolios’ values. This is a concatenated DataFrame of the value attributes of the members. It has the same structure as Portfolio.value except that its columns are two-level: the first level is the same as the columns of Portfolio.value and the second level is the member names.

performance: Portfolios’ performances. This is a concatenated DataFrame of the performance attributes of the members. It has the same structure as Portfolio.performance except that it has multiple columns, each corresponding to a member. The column names are the same as the portfolio names.

add(portfolio, alias=None)

Add a portfolio.

Parameters:

portfolio (Portfolio) – Portfolio to add.
alias – Portfolio alias. If not None, this is used as the portfolio name.

eval(sdate=None, edate=None, logscale=True, annualize_factor=1, consider_cost=True, percentage=False)

Evaluate the portfolios.

For the arguments, see Portfolio.eval().

Returns:: performance, value.

set(portfolios)

Add portfolios.

Parameters:: portfolios – List or dict of Portfolio objects. If it is a dict, its keys are used as portfolio names.

pyanomaly.tcost

This module defines classes for transaction costs.

class pyanomaly.tcost.TimeVaryingCost(me=None, normalize=True)

Bases: TransactionCost

Transaction costs that vary over time and across firm sizes.

This class implements the time-varying transaction costs used in, e.g., Brandt et al. (2009), Hand and Green (2011), DeMiguel et al. (2020), and Han (2021). Transaction cost parameter k is defined as k = y * z, where y decreases linearly from 4.0 in 1974.01 to 1.0 in 2002.01 and remains at 1.0 thereafter, and z = 0.006 - 0.0025 * nme, where nme is the normalized market equity that has a value between 0 and 1.

The maximum transaction cost is 240 basis points (the smallest firm before 1974) and the minimum transaction cost is 35 basis points (the largest firm after 2002).

We find this assumption is too conservative since the normalized me is sensitive to the largest company. In 1974, the mean of the normalized me is only 0.0045 and most firms have the transaction cost of 240 basis points, and in 2002, the mean of the normalized me is only 0.0059 and most firms have the transaction cost of 60 basis points.

Using a logarithm of the market equity or capping the me of the largest firms may make more sense.
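The formula k = y * z can be checked with a few lines of Python. The helper below is a hypothetical sketch: it assumes that y is interpolated linearly in months between 1974.01 and 2002.01 and that y stays at 4.0 before 1974, which matches the stated maximum of 240 basis points but is not taken from the class's source.

```python
def time_varying_cost(year, month, nme):
    """Transaction cost parameter k = y * z (hypothetical helper)."""
    # Months elapsed since 1974.01; the decline ends 336 months later, in 2002.01.
    elapsed = (year - 1974) * 12 + (month - 1)
    total = (2002 - 1974) * 12
    frac = min(max(elapsed / total, 0.0), 1.0)
    y = 4.0 - 3.0 * frac          # 4.0 in 1974.01 -> 1.0 in 2002.01, flat outside
    z = 0.006 - 0.0025 * nme      # nme: normalized market equity in [0, 1]
    return y * z

k_max = time_varying_cost(1974, 1, nme=0.0)      # smallest firm, 1974.01
k_min = time_varying_cost(2002, 1, nme=1.0)      # largest firm, 2002.01
k_typical = time_varying_cost(2002, 1, nme=0.0059)
```

Here k_max is 0.024 (240 basis points) and k_min is 0.0035 (35 basis points). A firm at the 2002 mean of the normalized me, 0.0059, gets a cost just under 60 basis points, which illustrates why most firms end up near the upper bound.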

set_params(me, normalize=True)

Set transaction cost parameters.

Parameters:

me – DataFrame or Series of market equity with index = date/id.
normalize – If True, normalize me so that its values are between 0 and 1.

Note

If the input contains only a subset of all listed stocks, normalizing the market equity can result in over- or underestimation of the transaction costs. For example, if me contains only the top 80% of the stocks, the transaction costs will be overestimated. Use all listed stocks in the market or normalize the market equity outside and set normalize = False.

class pyanomaly.tcost.TransactionCost(**kwargs)

Bases: object

Transaction cost class.

Transaction cost can be set at the security level. It can also vary over time. See set_params() for details.

params: Transaction cost parameters. This can be a float number, dict, or DataFrame. See set_params().

get_cost(position)

Get transaction costs.

Parameters:: position – Portfolio positions. Portfolio object calls this function with the argument, Portfolio.position, to get transaction costs. The position argument should have index = ‘date’ and columns = [id, ‘val’, ‘val0’], where ‘val’ is the value after rebalancing and ‘val0’ is the value before rebalancing.
Returns:: Transaction costs. A vector with the same length as position.

static get_cost_(val, val0, params)

Get a quadratic transaction cost.

Parameters:

val – Value after rebalancing.
val0 – Value before rebalancing.
params (dict) – Transaction cost parameters.

Returns:

Transaction cost.

set_params(**kwargs)

Set transaction cost parameters.

Parameters can be set either by this method or by __init__().

Parameters:

kwargs –

kwargs can have the following keywords:

‘cost’: For a constant (scalar) transaction cost.
‘buy_fixed’, ‘buy_linear’, ‘buy_quad’, ‘sell_fixed’, ‘sell_linear’, ‘sell_quad’: For a quadratic transaction cost.
‘params’: For a transaction cost that varies across securities (and over time).

Below are some examples.

For a constant transaction cost of 20 basis points:

>>> tc = TransactionCost(cost=0.002)

Asymmetric quadratic cost function:

cost = fixed + linear * Amount + quad * Amount^2

To buy: fixed cost = $5, linear cost = 0.002, and quadratic cost = 0.001

To sell: fixed cost = 0, linear cost = 0.003, and quadratic cost = 0.001

>>> tc = TransactionCost(buy_fixed=5, buy_linear=0.002, buy_quad=0.001, sell_linear=0.003, sell_quad=0.001)

Only non-zero parameters need to be provided.
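To make the asymmetric quadratic cost function concrete, the sketch below evaluates it for a buy and a sell of the same size. The helpers quadratic_cost and asymmetric_cost are hypothetical. Taking Amount as the absolute change in position value and charging nothing when there is no trade are assumptions, and the library's get_cost_() may treat these cases differently.

```python
def quadratic_cost(amount, fixed=0.0, linear=0.0, quad=0.0):
    """cost = fixed + linear * Amount + quad * Amount^2."""
    return fixed + linear * amount + quad * amount ** 2

def asymmetric_cost(val, val0, buy, sell):
    """Cost of moving a position from val0 to val (hypothetical helper)."""
    amount = abs(val - val0)
    if amount == 0:
        return 0.0                       # assumption: no trade, no cost
    # Use the buy parameters when the position grows, the sell parameters otherwise.
    side = buy if val > val0 else sell
    return quadratic_cost(amount, **side)

buy = {'fixed': 5, 'linear': 0.002, 'quad': 0.001}
sell = {'linear': 0.003, 'quad': 0.001}   # only non-zero parameters
buy_cost = asymmetric_cost(val=1100, val0=1000, buy=buy, sell=sell)    # buy 100
sell_cost = asymmetric_cost(val=900, val0=1000, buy=buy, sell=sell)    # sell 100
```

Buying 100 costs 5 + 0.2 + 10 = 15.2, while selling 100 costs 0.3 + 10 = 10.3.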

Transaction costs that vary across securities:

Security 1 (id: 0001): 0.002, security 2 (id: 0002): 0.003

>>> params = pd.DataFrame({'cost': [0.002, 0.003]}, index=pd.Index(['0001', '0002'], name='id'))
>>> params
      cost
id
0001 0.002
0002 0.003
>>> tc = TransactionCost(params=params)

Transaction costs that vary across securities and dates:

Security 1 (id: 0001): 0.004 on ‘2000-03-31’, 0.003 on ‘2000-04-30’

Security 2 (id: 0002): 0.005 on ‘2000-03-31’, 0.004 on ‘2000-04-30’

>>> dates = ['2000-03-31', '2000-04-30']
>>> ids = ['0001', '0002']
>>> params = pd.DataFrame(index=pd.MultiIndex.from_product([dates, ids], names=('date', 'id')))
>>> params['cost'] = [0.004, 0.005, 0.003, 0.004]
>>> params
                 cost
date       id
2000-03-31 0001 0.004
           0002 0.005
2000-04-30 0001 0.003
           0002 0.004
>>> tc = TransactionCost(params=params)

The params DataFrame must have index = ‘id’ or ‘date’/’id’. It can have columns such as ‘buy_fixed’ instead of ‘cost’ for a more complex transaction cost structure.

pyanomaly.util

This module defines utility functions.

pyanomaly.util.drop_columns(data, cols): Delete cols columns from data.

pyanomaly.util.is_iterable(x)

Check if x is iterable.

A string, although it is an iterable object, is not considered iterable by this function.

Returns:: True if x is iterable.
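The behavior described above can be approximated as follows. This is a minimal sketch under the assumption that only str is excluded; the function name is hypothetical and the library's check may differ.

```python
from collections.abc import Iterable

def is_iterable_sketch(x):
    """True for iterables other than strings (hypothetical helper)."""
    return isinstance(x, Iterable) and not isinstance(x, str)

results = [is_iterable_sketch(v) for v in ([1, 2], (1, 2), 'abc', 3)]
```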

pyanomaly.util.is_zero(x)

Check if x is zero.

A value is considered 0 if it is in the range (-1.e-7, 1.e-7).

Parameters:: x – A scalar or array.
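A tolerance check of this kind might look like the sketch below, which works for both scalars and NumPy arrays. The function name and the tol argument are hypothetical; the open interval (-1.e-7, 1.e-7) follows the description above.

```python
import numpy as np

def is_zero_sketch(x, tol=1.e-7):
    """Element-wise test |x| < tol (hypothetical helper)."""
    return np.abs(x) < tol

scalar_result = bool(is_zero_sketch(1e-8))
array_result = is_zero_sketch(np.array([0.0, 1e-3, -5e-8])).tolist()
```

A tolerance is used instead of x == 0 because floating-point arithmetic rarely produces an exact zero.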

pyanomaly.util.keep_columns(data, cols)

Delete columns of data except cols.

Equivalent to data = data[cols], but much more memory efficient. For reasons that are unclear, the following two methods seem to consume a lot of memory momentarily:

>>> data = data[cols]
>>> data.drop(columns=data.columns.difference(cols), inplace=True)

Use this function when handling a large dataset.

Parameters:

data – DataFrame.
cols – List of columns to keep.
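One way to drop columns in place is to delete them one at a time, as in the sketch below. This only illustrates the in-place semantics. Whether it matches the library's implementation or its memory behavior has not been verified, and keep_columns_sketch is a hypothetical name.

```python
import pandas as pd

def keep_columns_sketch(data, cols):
    """Drop every column of data that is not in cols, in place (hypothetical helper)."""
    keep = set(cols)
    for col in list(data.columns):
        if col not in keep:
            # Deleting a single column modifies the DataFrame itself.
            del data[col]

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
keep_columns_sketch(df, ['a', 'c'])
```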

pyanomaly.util.to_list(x)

Convert x to a list.

Parameters:: x – A list, tuple, or scalar.
Returns:: x converted to list.

pyanomaly.util.unique_list(x, exclude_nan=True)

Get a list of unique elements of x.

If x contains a list, its elements are considered elements of x, not the list itself.

>>> x = [1, 2, 2, None]
>>> unique_list(x)
[1, 2]
>>> x = [1, 2, [1, 3], None]
>>> unique_list(x)
[1, 2, 3]

Parameters:

x – A list.
exclude_nan – If True, exclude None elements.

Returns:

List of unique elements of x.
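The flattening rule can be reproduced with a short loop. The sketch below assumes that the order of first appearance is preserved and that only one level of nesting is expanded, which is consistent with the examples above but not confirmed by them; the function name is hypothetical.

```python
def unique_list_sketch(x, exclude_nan=True):
    """Unique elements of x, expanding nested lists (hypothetical helper)."""
    out = []
    for item in x:
        # A nested list contributes its elements, not itself.
        for elem in (item if isinstance(item, list) else [item]):
            if exclude_nan and elem is None:
                continue
            if elem not in out:
                out.append(elem)
    return out

flat = unique_list_sketch([1, 2, 2, None])
nested = unique_list_sketch([1, 2, [1, 3], None])
with_none = unique_list_sketch([1, 2, 2, None], exclude_nan=False)
```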

pyanomaly.wrdsdata

class pyanomaly.wrdsdata.WRDS(wrds_username=None)

Class to download/handle WRDS data.

Parameters:: wrds_username – WRDS username. Necessary only when downloading data: can be set to None when reading data from files.

Methods for Data Download

download_table()

download_table_async()

download_funda()

download_fundq()

download_sf()

download_seall()

download_secd()

download_g_secd()

download_all()

Other Methods

create_pgpass_file()

merge_sf_with_seall()

add_gvkey_to_crsp()

preprocess_crsp()

convert_fund_currency_to_usd()

get_risk_free_rate()

save_data()

read_data()

save_as_csv()

References

CRSP overview: https://wrds-www.wharton.upenn.edu/pages/support/data-overview/wrds-overview-crsp-us-stock-database/

CRSP-Compustat merge: https://wrds-www.wharton.upenn.edu/pages/support/manuals-and-overviews/crsp/crspcompustat-merged-ccm/wrds-overview-crspcompustat-merged-ccm/

CRSP annual update tables: https://wrds-www.wharton.upenn.edu/data-dictionary/crsp_a_indexes/

shrcd: https://wrds-www.wharton.upenn.edu/data-dictionary/form_metadata/crsp_a_stock_msf_identifyinginformation/shrcd/

exchcd: https://wrds-www.wharton.upenn.edu/data-dictionary/form_metadata/crsp_a_stock_msf_identifyinginformation/exchcd/

db: A wrds object to connect to WRDS database.

static add_gvkey_to_crsp(sf)

Add gvkey to m(d)sf using ccmxpf_linktable and identify primary stocks.

There are two tables we can use to link permno with gvkey: ccmxpf_linktable and ccmxpf_lnkhist. ccmxpf_linktable is simply a merged table of ccmxpf_lnkhist and ccmxpf_lnkused.

We identify primary stocks in the following order.

If linkprim = ‘P’ or ‘C’, set the security as primary.

If permno and gvkey have 1:1 mapping, set the security as primary.

Among the securities with the same gvkey, set the one with the maximum trading volume as primary.

Among the securities with the same permco and missing gvkey, set the one with the maximum trading volume as primary.

References

https://wrds-www.wharton.upenn.edu/pages/support/research-wrds/macros/wrds-macros-cvccmlnksas/

https://wrds-www.wharton.upenn.edu/pages/support/applications/linking-databases/linking-crsp-and-compustat/

Parameters:: sf – m(d)sf DataFrame with index = ‘date’/’permno’.
Returns:: m(d)sf with ‘gvkey’ and ‘primary’ (primary stock indicator) columns added.

static convert_fund_currency_to_usd(fund, table='funda')

Convert non-USD values of funda(q) to USD values.

Note

In Compustat North America, the accounting data can be in either USD or CAD. This is no problem if firm characteristics are generated using only Compustat. However, if you mix data from different databases, e.g., if you use market equity of CRSP, which is in USD, Compustat data should be converted to USD.

Following JKP, we use compustat.exrt_dly to obtain exchange rates. exrt_dly starts from 1982-02-01.

Parameters:

fund – funda(q) DataFrame with index = ‘datadate’/’gvkey’.
table – ‘funda’ or ‘fundq’: indicator whether fund is funda or fundq.

Returns:

Converted fund DataFrame.

create_pgpass_file(): Create pgpass file. Need to be called only once. Once created, you don’t need to enter password when connecting to WRDS.

download_all(run_in_executer=True)

Download all tables.

Currently, this function downloads the following tables:

comp.funda
comp.fundq
crsp.msf (merged with crsp.msenames)
crsp.dsf (merged with crsp.dsenames)
crsp.mseall
crsp.dseall
crsp.ccmxpf_linktable
crsp.mcti
ff.factors_monthly
ff.factors_daily
comp.exrt_dly

Parameters:: run_in_executer – If True, download concurrently. Faster but memory hungrier.

download_funda(sdate=None, edate=None, run_in_executer=True)

Download comp.funda.

Parameters:

sdate – Start date. e.g., ‘2000-01-01’. Set to None to download all data.
edate – End date. Set to None to download all data.
run_in_executer – If True, download concurrently. Faster but memory hungrier.

download_fundq(sdate=None, edate=None, run_in_executer=True)

Download comp.fundq.

Parameters:

sdate – Start date. e.g., ‘2000-01-01’. Set to None to download all data.
edate – End date. Set to None to download all data.
run_in_executer – If True, download concurrently. Faster but memory hungrier.

download_g_secd(sdate=None, edate=None, run_in_executer=True)

Download comp.g_secd.

Parameters:

sdate – Start date. e.g., ‘2000-01-01’. Set to None to download all data.
edate – End date. Set to None to download all data.
run_in_executer – If True, download concurrently. Faster but memory hungrier.

download_seall(sdate=None, edate=None, monthly=True, run_in_executer=True)

Download delist info from m(d)seall.

Delisting information can be obtained from either mseall or msedelist. We use mseall since it contains exchcd, which is used when replacing missing dlret. The shrcd and exchcd in mseall are usually those before halting/suspension. For example, if a stock on NYSE is halted, exchcd in msenames can be -2, whereas that in mseall is 1.

Parameters:

sdate – Start date. e.g., ‘2000-01-01’. Set to None to download all data.
edate – End date. Set to None to download all data.
monthly – If True, download mseall; otherwise, download dseall.
run_in_executer – If True, download concurrently. Faster but memory hungrier.

download_secd(sdate=None, edate=None, run_in_executer=True)

Download comp.secd.

Parameters:

sdate – Start date. e.g., ‘2000-01-01’. Set to None to download all data.
edate – End date. Set to None to download all data.
run_in_executer – If True, download concurrently. Faster but memory hungrier.

download_sf(sdate=None, edate=None, monthly=True, run_in_executer=True)

Download crsp.m(d)sf joined with crsp.m(d)senames.

Parameters:

sdate – Start date. e.g., ‘2000-01-01’. Set to None to download all data.
edate – End date. Set to None to download all data.
monthly – If True, download msf; otherwise, download dsf.
run_in_executer – If True, download concurrently. Faster but memory hungrier.

download_table(library, table, obs=-1, offset=0, columns=None, coerce_float=None, index_col=None, date_cols=None)

Download a table from WRDS library.

This is a wrapper around wrds.get_table(). The queried table is saved in config.input_dir/library/table.pickle. Useful when downloading an entire table.

Parameters:

library – WRDS library. e.g., crsp, comp, …
table – A table in library.
obs – See wrds.get_table().
offset – See wrds.get_table().
columns – See wrds.get_table().
coerce_float – See wrds.get_table().
index_col – See wrds.get_table().
date_cols – See wrds.get_table().

download_table_async(library, table, sql=None, date_col=None, sdate=None, edate=None, interval=5, src_tables=None, type=<class 'float'>, date_cols=None, run_in_executer=True)

Asynchronous download.

This method splits the total period into sub-periods of interval years and downloads the data for each sub-period asynchronously. If a download fails, it can be restarted from the failed date: already downloaded files will be gathered together. This method allows us to download a large table, e.g., crsp.dsf, reliably without connection timeouts, and it consumes much less memory than WRDS.download_table().

Note

For a small table, this can be slower than WRDS.download_table(). The table should have a date field (date_col) on which to split the period.

Parameters:

library – WRDS library.
table – WRDS table. If a complete query is given, this can be any name: this is used only as the file name for the data.
sql – String of the fields to select or a complete query statement. See below.
date_col – Date field on which downloading will be split.
sdate – Start date (‘yyyy-mm-dd’). If None, ‘1900-01-01’.
edate – End date. If None, today.
interval – Sub-period size in years.
src_tables – List of (library, table) that are used in the query. src_tables are used to get data types and convert the types of double precision fields to float. When data is selected from a single table, library.table, this can be set to None.
type – Type for numeric fields: float or np.float32. For a large dataset, converting to float32 can save disk space and read/write time.
date_cols – List of date columns. The dtype of these columns will be converted to datetime.
run_in_executer – If True, download data concurrently using an executor. Setting this to True will increase download speed but can use much more memory.

Note

To download library.table, library, table, and date_col should be given.
To download selected fields from library.table, library, table, sql, and date_col should be given, where sql is a string of the fields to select, e.g.,
```
>>> wrds = WRDS('user_name')
>>> sql = 'permno, prc, exchcd'
>>> wrds.download_table_async('crsp', 'msf', sql, 'date')
```
To download data using a complete query, library, table, and sql should be given, where table is a table name for the data (can be any name), and sql is a query statement. The sql should contain ‘WHERE [date_col] BETWEEN {} and {}’. See, e.g., the code of WRDS.download_sf().
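The splitting of the total period into sub-periods, and the way the two {} placeholders of a complete query could be filled for each sub-period, can be sketched without a WRDS connection. The helper split_period, the example query, and the quoting of the dates are assumptions for illustration; the library may build its sub-periods and queries differently.

```python
from datetime import date, timedelta

def split_period(sdate, edate, interval=5):
    """Split [sdate, edate] into sub-periods of at most `interval` years (hypothetical helper)."""
    periods = []
    start = sdate
    while start <= edate:
        try:
            nxt = start.replace(year=start.year + interval)
        except ValueError:
            # start is Feb 29 and the target year is not a leap year.
            nxt = start.replace(year=start.year + interval, day=28)
        # Sub-periods are inclusive and do not overlap.
        end = min(nxt - timedelta(days=1), edate)
        periods.append((start, end))
        start = end + timedelta(days=1)
    return periods

periods = split_period(date(2000, 1, 1), date(2010, 12, 31), interval=5)

# A complete query with the two placeholders required by the note above.
template = "SELECT permno, prc FROM crsp.msf WHERE date BETWEEN {} and {}"
queries = [template.format(f"'{s}'", f"'{e}'") for s, e in periods]
```

Each query covers one sub-period, so a failed download only needs to be repeated for that sub-period.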

static get_risk_free_rate(sdate=None, edate=None, src='mcti', month_end=False)

Get risk-free rate.

The risk-free rate can be obtained either from crsp.mcti or ff.factors_monthly. mcti is preferred since the values in factors_monthly have only 4 decimal places. Both risk-free rates are in decimal (not percentage values).

Parameters:

sdate – Start date.
edate – End date.
src – data source. ‘mcti’: crsp.mcti, ‘ff’: ff.factors_monthly.
month_end – Shift dates to the end of the month.

Returns:

DataFrame of risk-free rates with index = ‘date’ and columns = [‘rf’].
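Shifting dates to the end of the month, as month_end does, can be done in pandas with a MonthEnd offset. The sketch below uses made-up rates and dates and is an assumption about the mechanics, not the method's actual code.

```python
import pandas as pd

raw_dates = pd.to_datetime(['2000-03-01', '2000-04-28'])
rf = pd.DataFrame({'rf': [0.004, 0.005]}, index=pd.Index(raw_dates, name='date'))

rf_month_end = rf.copy()
# MonthEnd(0) rolls each date forward to its month end and leaves month-end dates unchanged.
rf_month_end.index = (rf_month_end.index + pd.offsets.MonthEnd(0)).rename('date')
```

Aligning on month-end dates makes it easier to merge the risk-free rate with other monthly data.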

static merge_sf_with_seall(sf, monthly=True, fill_method=1)

Merge m(d)sf with m(d)seall.

This method adjusts the m(d)sf return (‘ret’) using the m(d)seall delist return (‘dlret’). The adjusted return replaces ‘ret’, and a ‘dlret’ column is added to m(d)sf. For msf, this method also adds a cash dividend column, ‘cash_div’.

Parameters:

sf – m(d)sf DataFrame.
monthly – sf = msf if True, else sf = dsf.
fill_method –
Method to fill missing dlret. 0: don’t fill, 1: JKP code, or 2: GHZ code.
- fill_method = 1:
  - dlret = -0.30 if dlstcd is 500 or between 520 and 584.
- fill_method = 2:
  - dlret = -0.35 if dlstcd is 500 or between 520 and 584, and exchcd is 1 or 2.
  - dlret = -0.55 if dlstcd is 500 or between 520 and 584, and exchcd is 3.

Returns:

m(d)sf with adjusted return (and cash dividend).

References

Delist codes: http://www.crsp.com/products/documentation/delisting-codes
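The fill_method rules for missing dlret can be expressed with boolean masks. The sketch below applies them to a toy DataFrame; the helper name is hypothetical, the range 520 to 584 is treated as inclusive by assumption, and the subsequent adjustment of ‘ret’ is not shown.

```python
import numpy as np
import pandas as pd

def fill_missing_dlret(df, fill_method=1):
    """Fill missing delist returns following the rules above (hypothetical helper)."""
    out = df.copy()
    missing = out['dlret'].isna()
    # Delist codes that trigger a fill: 500 or 520-584 (inclusive by assumption).
    coded = (out['dlstcd'] == 500) | out['dlstcd'].between(520, 584)
    if fill_method == 1:
        out.loc[missing & coded, 'dlret'] = -0.30
    elif fill_method == 2:
        out.loc[missing & coded & out['exchcd'].isin([1, 2]), 'dlret'] = -0.35
        out.loc[missing & coded & (out['exchcd'] == 3), 'dlret'] = -0.55
    # fill_method = 0 leaves missing values untouched.
    return out

toy = pd.DataFrame({'dlstcd': [500, 550, 100, 584],
                    'exchcd': [1, 3, 1, 2],
                    'dlret': [np.nan, np.nan, np.nan, 0.1]})
filled1 = fill_missing_dlret(toy, fill_method=1)
filled2 = fill_missing_dlret(toy, fill_method=2)
```

Only missing values are replaced: the last row keeps its reported dlret of 0.1, and the row with dlstcd 100 stays missing under both methods.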

static preprocess_crsp()

Create crspm and crspd.

This method calls WRDS.merge_sf_with_seall() and WRDS.add_gvkey_to_crsp() to add delist return, gvkey, and primary indicator to m(d)sf. The result is saved to config.input_dir/crspm(d).pickle.

static read_data(table, library=None, index_col=None, sort_col=None)

Read data from config.input_dir/library/table.pickle.

The library argument is not required: if it is None, all folders under config.input_dir are searched.

Parameters:

table – File name without extension.
library – Directory.
index_col – (List of) column(s) to be set as index.
sort_col – (List of) column(s) to sort data.

Returns:

(DataFrame) data read. Index = index_col.

static save_as_csv(table, library=None, fpath=None)

Read table and save it to a csv file.

Parameters:

table – File name without extension.
library – Directory.
fpath – File path for the csv file. If None, the file is saved to config.input_dir/library/table.csv

static save_data(data, table, library=None)

Save data in pickle format.

We use pickle file format to store data as a pickle file preserves data types and is much faster to read compared to a csv file. To convert a pickle file to a csv file, WRDS.save_as_csv() can be used. The data is saved in the following location:

If library = None, config.input_dir/table.pickle

Otherwise, config.input_dir/library/table.pickle

Parameters:

data – Data to save (DataFrame).
table – File name without extension.
library – Directory.
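The save location rule can be written as a small path helper. In the sketch below, input_dir stands in for config.input_dir, and pickle_path is a hypothetical function that only builds the path; it does not write any file.

```python
from pathlib import Path

def pickle_path(input_dir, table, library=None):
    """Location of table.pickle under input_dir (hypothetical helper)."""
    base = Path(input_dir)
    if library is not None:
        # With a library, the file goes into a sub-directory of that name.
        base = base / library
    return base / f'{table}.pickle'

path_with_library = pickle_path('input', 'msf', library='crsp')
path_without_library = pickle_path('input', 'crspm')
```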