pyanomaly
PyAnomaly is a Python library for asset pricing research.
This module defines analytic functions.
This module defines classes for firm characteristic generation.
This module defines functions to set/get package configuration.
This module defines functions for data handling.
This module defines functions to generate factor portfolios and characteristic portfolios.
This module defines functions to generate Fama-French factors.
This module defines functions for file IO.
This module defines global constants.
This module defines logging functions.
This module defines jitted functions.
This module defines classes for panel data analysis.
This module defines classes for portfolio analysis.
This module defines classes for transaction costs.
This module defines utility functions.
This module defines the WRDS class that is used to download and handle WRDS data.
analytics
This module defines analytic functions.
- Sorting
One-dimensional sort.
Two-dimensional sort.
Auxiliary functions
Relabel classes.
Calculate weighted means.
Add long-short to quantile data.
- Time-Series Analysis
Calculate time-series mean and t-statistic.
Run GRS (Gibbons, Ross, and Shanken, 1989) test.
Auxiliary functions
Calculate t-statistic.
- Cross-sectional Analysis
Run cross-sectional OLS.
- Portfolio Analysis
Make portfolio position data.
Make a portfolio.
Make a long-short portfolio.
Make quantile portfolios.
Auxiliary functions
Compute future returns.
- pyanomaly.analytics.append_long_short(data, level=-1, l_label=None, s_label=None, ls_label=None)
Add long-short to quantile data.
Long-short is defined as (first group - last group) on each date. If labels are not given, long-short will be (class 0 - class N-1), where N is the number of classes, and the class label of the long-short is set to N.
- Parameters:
data – DataFrame with index = date/class1/class2/….
level – Index level to make long-short on. Default to the last level.
l_label – Label of the long class. If None, l_label = 0.
s_label – Label of the short class. If None, s_label = N-1.
ls_label – Label of the long-short. If None, ls_label = N.
- Returns:
The data with long-short appended.
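The default behaviour described above (class 0 minus class N-1, labelled N) can be sketched in plain pandas. This is a minimal illustration, not the library's implementation: `append_long_short_sketch` is a hypothetical helper, and it assumes a two-level index (date/class) with integer class labels 0..N-1.

```python
import pandas as pd


def append_long_short_sketch(data: pd.DataFrame) -> pd.DataFrame:
    """Append (class 0 - class N-1) as class N on each date (hypothetical helper)."""
    lvl = data.index.nlevels - 1                      # last index level holds the class
    n = int(data.index.get_level_values(lvl).max()) + 1
    # Long leg minus short leg, aligned on the remaining index level (the date)
    ls = data.xs(0, level=lvl) - data.xs(n - 1, level=lvl)
    # Label the long-short rows with class N and put them back into the panel
    ls.index = pd.MultiIndex.from_arrays(
        [ls.index, [n] * len(ls)], names=data.index.names
    )
    return pd.concat([data, ls]).sort_index()


idx = pd.MultiIndex.from_product(
    [["2020-01", "2020-02"], [0, 1, 2]], names=["date", "class"]
)
quantile_ret = pd.DataFrame(
    {"ret": [0.03, 0.02, 0.01, 0.05, 0.00, -0.01]}, index=idx
)
with_ls = append_long_short_sketch(quantile_ret)
```

Here `with_ls` contains one extra row per date, labelled 3, holding the class 0 return minus the class 2 return.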
- pyanomaly.analytics.crosssectional_regression(data, endo_col, exog_cols, add_constant=True, cov_type='nonrobust', cov_kwds=None)
Run cross-sectional OLS.
Run cross-sectional OLS on each date and calculate the time-series means and t-stats of the coefficients.
- Parameters:
- Returns:
mean (DataFrame). Time-series means of coefficients with index = (‘const’ +) exog_cols and columns = ‘mean’.
t-stat (DataFrame). t-statistics of coefficients with index = (‘const’ +) exog_cols and columns = ‘t-stat’.
coefs (DataFrame). Coefficient time-series with index = dates and columns = (‘const’ +) exog_cols.
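The procedure described above (OLS on each date, then time-series means and t-statistics of the coefficients) can be sketched with NumPy and pandas. `fama_macbeth_sketch` is a hypothetical helper, not the library function; it assumes index = date/id, always adds a constant, and uses the classical (nonrobust) standard error of the mean.

```python
import numpy as np
import pandas as pd


def fama_macbeth_sketch(data, endo_col, exog_cols):
    """Per-date OLS, then mean and t-stat of the coefficient time series."""
    coefs = {}
    for date, grp in data.groupby(level=0):
        X = np.column_stack([np.ones(len(grp)), grp[exog_cols].to_numpy()])
        y = grp[endo_col].to_numpy()
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # cross-sectional OLS
        coefs[date] = beta
    coefs = pd.DataFrame.from_dict(
        coefs, orient="index", columns=["const"] + list(exog_cols)
    )
    mean = coefs.mean()
    # Classical t-statistic of the time-series mean under H0: mean = 0
    tstat = mean / (coefs.std(ddof=1) / np.sqrt(len(coefs)))
    return mean, tstat, coefs


# Toy panel: y = a_t + b_t * x exactly, with date-specific a_t and b_t
rows = []
for t, (a, b) in enumerate([(0.1, 1.0), (0.2, 2.0), (0.3, 3.0), (0.4, 2.0)]):
    for i, x in enumerate([0.0, 1.0, 2.0]):
        rows.append({"date": t, "id": i, "x": x, "y": a + b * x})
panel = pd.DataFrame(rows).set_index(["date", "id"])
mean, tstat, coefs = fama_macbeth_sketch(panel, "y", ["x"])
```

In this toy panel the slope series is 1, 2, 3, 2, so its time-series mean is 2.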
- pyanomaly.analytics.future_return(ret, period=1)
Compute future returns.
Compute period-period-ahead returns, where period is the target horizon. If ret has a MultiIndex of date/id, future returns are calculated for each id.
- Parameters:
ret – Series of returns with index = date or date/id. If index = date/id, ret must be sorted on id/date.
period – Target period.
- Returns:
Series of future returns.
- pyanomaly.analytics.grs_test(assets, factors)
Run GRS (Gibbons, Ross, and Shanken, 1989) test.
- Parameters:
assets – T x N DataFrame or ndarray of asset returns.
factors – T x F DataFrame or ndarray of factor returns.
- Returns:
pricing error (alpha.T * inv(Sigma) * alpha)
squared Sharpe ratio of the factors
GRS statistic
p value
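For orientation, the first three returned quantities can be sketched with NumPy as follows. `grs_sketch` is a hypothetical helper and may differ from the library in its covariance conventions: this sketch assumes maximum-likelihood (divide-by-T) covariance estimates and the scaling (T - N - F) / N. The p-value would follow from the F(N, T - N - F) distribution and is omitted to keep the sketch dependency-light.

```python
import numpy as np


def grs_sketch(assets, factors):
    """Return (pricing error, squared Sharpe ratio, GRS statistic).

    Hypothetical helper; covariance matrices use the divide-by-T
    convention, which is an assumption of this sketch.
    """
    R = np.asarray(assets, dtype=float)        # T x N asset returns
    F = np.asarray(factors, dtype=float)       # T x F factor returns
    n_obs, n_assets = R.shape
    n_factors = F.shape[1]
    X = np.column_stack([np.ones(n_obs), F])
    coef, *_ = np.linalg.lstsq(X, R, rcond=None)   # time-series regressions
    alpha = coef[0]                                # intercepts
    resid = R - X @ coef
    sigma = resid.T @ resid / n_obs                # residual covariance
    mu = F.mean(axis=0)
    omega = np.atleast_2d(np.cov(F, rowvar=False, ddof=0))
    pricing_error = float(alpha @ np.linalg.solve(sigma, alpha))
    sharpe2 = float(mu @ np.linalg.solve(omega, mu))
    scale = (n_obs - n_assets - n_factors) / n_assets
    grs = scale * pricing_error / (1.0 + sharpe2)
    return pricing_error, sharpe2, grs


rng = np.random.default_rng(0)
factors = rng.normal(0.005, 0.04, size=(120, 2))
loadings = rng.normal(1.0, 0.3, size=(2, 3))
assets = factors @ loadings + rng.normal(0.0, 0.02, size=(120, 3))
pricing_error, sharpe2, grs = grs_sketch(assets, factors)
```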
- pyanomaly.analytics.make_long_short_portfolio(lposition, sposition, rf=None, costfcn=None, keep_position=True, name='H-L', ls_wgt=(1, -1))
Make a long-short portfolio.
- Parameters:
lposition – DataFrame. Long position data. See make_position() for the data format.
sposition – DataFrame. Short position data.
rf – Series or DataFrame of risk-free rates. The index should be date.
costfcn – Transaction cost. See Portfolio.costfcn().
keep_position – If True, keep the position information in the returned value. If position information is not needed, set this to False to save memory.
name – Portfolio name.
ls_wgt – Long-short weights. (1, -1) means 1:1 long-short.
- Returns:
Portfolio object.
- pyanomaly.analytics.make_portfolio(data, ret_col, weight_col=None, rf=None, costfcn=None, keep_position=True, name='')
Make a portfolio.
This function creates portfolio position data from data and constructs a portfolio from it.
- Parameters:
data – DataFrame with index = date/id.
ret_col – Return column of data. Return should be over t to t+1.
weight_col – Weight column of data. If None, constituents are equally weighted.
rf – Series or DataFrame of risk-free rates. The index should be date.
costfcn – Transaction cost. See Portfolio.costfcn().
keep_position – If True, keep the position information in the returned value. If position information is not needed, set this to False to save memory.
name – Portfolio name.
- Returns:
Portfolio object.
- pyanomaly.analytics.make_position(data, ret_col, weight_col=None, pf_col=None, other_cols=None)
Make portfolio position data.
To construct and evaluate a portfolio using Portfolio, position data is required. This function makes the position data from data, which is a panel of securities. The position data is generated via the following operations:
Change column names as assumed in Portfolio:
‘date’: Date column.
‘id’: Security id column.
‘ret’: Return column.
‘wgt’: Weight column.
Normalize weights so that their cross-sectional sum becomes 1 within each portfolio.
- Parameters:
data – DataFrame with index = date/id.
ret_col – Return column of data. Return should be over t to t+1.
weight_col – Weight column of data. If None, constituents are equally weighted.
pf_col – Portfolio column of data, i.e., a column that maps securities with portfolios. This can be None if the input data is for one portfolio.
other_cols – Other columns of data to include in the position attribute of Portfolio.
- Returns:
Position DataFrame with index = ‘date’ and columns = [‘id’, ‘ret’, ‘wgt’] + other_cols.
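The two operations listed above (renaming columns and normalizing weights within each portfolio) can be sketched in pandas. `make_position_sketch` is a hypothetical helper, not the library function; it assumes a two-level date/id index and keeps the portfolio column in the output so the normalization can be checked.

```python
import pandas as pd


def make_position_sketch(data, ret_col, weight_col=None, pf_col=None):
    """Rename columns to date/id/ret/wgt and normalize weights per portfolio."""
    date_col, id_col = data.index.names
    pos = data.reset_index().rename(
        columns={date_col: "date", id_col: "id", ret_col: "ret"}
    )
    if weight_col is None:
        pos["wgt"] = 1.0                      # equal weights before normalization
    else:
        pos = pos.rename(columns={weight_col: "wgt"})
    keys = ["date"] if pf_col is None else ["date", pf_col]
    # Cross-sectional weights sum to 1 within each date (and portfolio)
    pos["wgt"] = pos["wgt"] / pos.groupby(keys)["wgt"].transform("sum")
    cols = ["id", "ret", "wgt"] + ([] if pf_col is None else [pf_col])
    return pos.set_index("date")[cols]


idx = pd.MultiIndex.from_product(
    [["2020-01", "2020-02"], [1, 2, 3, 4]], names=["date", "permno"]
)
panel = pd.DataFrame(
    {
        "ret": [0.01, 0.02, -0.01, 0.03, 0.00, 0.01, 0.02, -0.02],
        "me": [10, 30, 20, 20, 10, 10, 5, 15],
        "pf": [0, 0, 1, 1, 0, 0, 1, 1],
    },
    index=idx,
)
position = make_position_sketch(panel, "ret", weight_col="me", pf_col="pf")
```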
- pyanomaly.analytics.make_quantile_portfolios(data, q_col, ret_col, weight_col=None, rf=None, costfcn=None, keep_position=True, names=None, ls_wgt=(1, -1))
Make quantile portfolios.
This function makes quantile portfolios and the long-short portfolio from data.
- Parameters:
data – DataFrame with index = date/id.
q_col – Column of data that maps a security with quantiles (portfolios). The values should be integers starting from 0.
ret_col – Return column of data. Return should be over t to t+1.
weight_col – Weight column of data. If None, constituents are equally weighted.
rf – Series or DataFrame of risk-free rates. The index should be date.
costfcn – Transaction cost. See Portfolio.costfcn().
keep_position – If True, keep the position information in the returned value. If position information is not needed, set this to False to save memory.
names – Portfolio names. If None, the values in q_col are used.
ls_wgt – Long-short weights. (1, -1) means 1:1 long-short. If None, long-short portfolio is not constructed.
- Returns:
Portfolios object.
- pyanomaly.analytics.one_dim_sort(data, class_col, target_cols=None, weight_col=None, function='mean', add_long_short=True)
One-dimensional sort.
This function assumes that data has already been sorted/grouped and class labels are given in class_col column. Aggregate target_cols values using class_col and return aggregated results. Class labels in class_col should be 0, 1, …
- Parameters:
data – DataFrame to be grouped. Index must be date/id.
class_col – Class label column.
target_cols – (List of) column(s) to aggregate. If None, target_cols = all numeric columns of data.
weight_col – Weight column. If given, weighted mean is returned. Applicable only when function = ‘mean’.
function – Aggregate function, e.g., ‘sum’, ‘mean’, ‘count’, or a list of functions.
add_long_short (bool) – Add long-short to the output.
- Returns:
Aggregated data with index = date/class, columns = target_cols. If function is a list of functions, the columns have two levels: first level = target_cols and second level = function.
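The output layouts described above can be illustrated with a plain pandas group-by. This is a sketch of the shape of the result only, not a call to the library; the column names `q` and `ret` are hypothetical.

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [["2020-01", "2020-02"], [1, 2, 3, 4]], names=["date", "id"]
)
panel = pd.DataFrame(
    {
        "q": [0, 0, 1, 1, 0, 1, 1, 1],                       # class labels 0, 1, ...
        "ret": [0.01, 0.03, 0.02, 0.04, 0.05, 0.01, 0.02, 0.03],
    },
    index=idx,
)

# One aggregate function: index = date/class, columns = target columns
single = panel.groupby(["date", "q"])[["ret"]].mean()

# A list of functions: columns gain a second level holding the function name
multi = panel.groupby(["date", "q"])[["ret"]].agg(["mean", "count"])
```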
- pyanomaly.analytics.relabel_class(data, labels=None, axis=0, level=-1, col=None)
Relabel classes.
Relabel (rename) columns, indexes, or column values of data. The existing labels (values) should be continuous integers starting from 0. The data is relabeled in-place.
- Parameters:
data – DataFrame to be relabeled.
labels (list) – New class labels. Label 0 is replaced by the first element of labels, and so forth.
axis – 1: index, 2: column.
level – Level of index/column to be relabeled.
col – Column name. If column name is given, axis and level are ignored.
- pyanomaly.analytics.t_stat(data, cov_type='nonrobust', cov_kwds=None)
Calculate t-statistic.
Calculate t-statistic for each column of data under H0: x = 0.
- Parameters:
data – Series, DataFrame, or ndarray with each column containing samples.
cov_type – Covariance estimator: e.g., ‘HAC’ for Newey-West.
cov_kwds – Parameters required for the chosen covariance estimator: e.g., {‘maxlags’: 12} for cov_type = ‘HAC’.
- Returns:
t-stat. Float (if data is one dimensional) or (1 x K) ndarray, where K is the number of columns of data.
Note
See statsmodels.api.OLS.fit for possible values of cov_type and cov_kwds.
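To show what a Newey-West t-statistic of a sample mean involves, here is a NumPy-only sketch. `t_stat_sketch` is a hypothetical helper that uses Bartlett weights and no small-sample correction; it is not the library's code, which delegates to statsmodels, and the two may differ by finite-sample scaling.

```python
import numpy as np


def t_stat_sketch(x, maxlags=0):
    """t-statistic of the mean under H0: mean = 0 with a Newey-West variance."""
    x = np.asarray(x, dtype=float)
    n_obs = len(x)
    mean = x.mean()
    e = x - mean
    s = e @ e                                   # lag-0 term
    for lag in range(1, maxlags + 1):
        w = 1.0 - lag / (maxlags + 1.0)         # Bartlett kernel weight
        s += 2.0 * w * (e[lag:] @ e[:-lag])     # weighted autocovariance term
    se = np.sqrt(s) / n_obs
    return mean / se


x = np.array([0.02, -0.01, 0.03, 0.01, 0.00, 0.04, -0.02, 0.01])
t_hac0 = t_stat_sketch(x, maxlags=0)
t_hac2 = t_stat_sketch(x, maxlags=2)
t_classic = x.mean() / (x.std(ddof=1) / np.sqrt(len(x)))
```

With `maxlags=0` the result differs from the classical t-statistic only by the factor sqrt(T / (T - 1)), because no degrees-of-freedom correction is applied.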
- pyanomaly.analytics.time_series_average(data, cov_type='nonrobust', cov_kwds=None)
Calculate time-series mean and t-statistic.
Time-series mean and t-statistic are calculated for each column of data. The data can be either a time-series data (index = date) or a panel data (index = date/id). If it is a panel, time-series mean and t-statistic are calculated for each id.
- Parameters:
- Returns:
mean (DataFrame).
t-stat (DataFrame).
If data has MultiIndex, mean (t-stat) has index = level 1 index of data and columns = data.columns. Otherwise, mean (t-stat) has index = data.columns and columns = ‘mean’ (‘t-stat’).
- pyanomaly.analytics.two_dim_sort(data, class_col1, class_col2, target_cols=None, weight_col=None, function='mean', add_long_short=True, output_dim=1)
Two-dimensional sort.
This function assumes that data has already been sorted/grouped and class labels are given in class_col1 and class_col2 columns. Aggregate target_cols values using class_col1 and class_col2 and return aggregated results. Class labels in class_col1(2) should be 0, 1, …
- Parameters:
data – Data to be grouped. Index must be date/id.
class_col1 – Class label column for the 1st dimension.
class_col2 – Class label column for the 2nd dimension.
target_cols – (List of) column(s) to aggregate. If None, target_cols = all numeric columns of data.
weight_col – Weight column. If given, weighted mean is returned. Applicable only when function = ‘mean’.
function – Aggregate function, e.g., ‘sum’, ‘mean’, ‘count’.
add_long_short (bool) – Add long-short to the output.
output_dim – If 1, output is a DataFrame with index = date/class1/class2; if 2, output is a DataFrame with index = date/class1 and column = class2.
- Returns:
Aggregated data (DataFrame or dict of DataFrames).
If output_dim = 1, index = date/class1/class2 and columns = target_cols.
If output_dim = 2 and len(target_cols) = 1, index = date/class1 and columns = class2.
If output_dim = 2 and len(target_cols) > 1, output is dict with keys = target_cols and values = DataFrames (as in the second case).
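The difference between the two output layouts can be illustrated with pandas. This is only a sketch of the shapes; the column names `c1`, `c2`, and `ret` are hypothetical and the library is not called.

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [["2020-01", "2020-02"], [1, 2, 3, 4]], names=["date", "id"]
)
panel = pd.DataFrame(
    {
        "c1": [0, 0, 1, 1] * 2,    # class labels for the 1st dimension
        "c2": [0, 1, 0, 1] * 2,    # class labels for the 2nd dimension
        "ret": [0.01, 0.02, 0.03, 0.04, 0.00, 0.01, -0.01, 0.02],
    },
    index=idx,
)

# output_dim = 1 layout: index = date/class1/class2, columns = target columns
long_form = panel.groupby(["date", "c1", "c2"])[["ret"]].mean()

# output_dim = 2 layout: index = date/class1, columns = class2
wide_form = long_form["ret"].unstack("c2")
```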
- pyanomaly.analytics.weighted_mean(data, target_cols, weight_col, group_cols)
Calculate weighted means.
Calculate weighted means of each column in target_cols within each group defined by group_cols.
- Parameters:
data – DataFrame.
target_cols – (List of) column(s) to calculate weighted-mean.
weight_col – Weight column name or Series or ndarray of weights.
group_cols – (List of) grouping column(s).
- Returns:
DataFrame of weighted means with index = group_cols and columns = target_cols.
Examples
If data is a panel with index = date/permno and column ‘ret’ contains returns and ‘me’ contains market equity at t-1, value-weighted returns can be obtained as follows:
>>> wmean = weighted_mean(data, 'ret', 'me', 'date')
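For reference, the value-weighted return in this example corresponds to the following pandas computation. This is a sketch of the arithmetic only; the toy panel is hypothetical and missing values are not handled.

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [["2020-01", "2020-02"], [10001, 10002]], names=["date", "permno"]
)
panel = pd.DataFrame(
    {"ret": [0.1, 0.2, -0.1, 0.3], "me": [1.0, 3.0, 2.0, 2.0]}, index=idx
)

# sum(ret * me) / sum(me) within each date
weighted = panel["ret"] * panel["me"]
vw_ret = (
    weighted.groupby(level="date").sum()
    / panel["me"].groupby(level="date").sum()
)
```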
characteristics
This module defines classes for firm characteristic generation.
Class to generate firm characteristics from funda.
Class to generate firm characteristics from fundq.
Class to generate firm characteristics from crspm.
Class that handles crspd data.
Class to generate firm characteristics from crspd.
Class to generate firm characteristics from a combined dataset of crspm, crspd, funda, and fundq.
FUNDA
- class pyanomaly.characteristics.FUNDA(alias=None, data=None)
Bases: FCPanel
Class to generate firm characteristics from funda.
The firm characteristics defined in this class can be viewed using show_available_chars().
- Parameters:
alias (str, list, or dict) – Firm characteristics to generate and their aliases. If None, all available firm characteristics are generated.
str: A column name in the mapping file (config.mapping_file_path). The firm characteristics defined in this class and in the alias column of the mapping file are generated. See Mapping File.
list: List of firm characteristics (method names).
dict: Dict of firm characteristics and their aliases of the form {method name: alias}.
If aliases are not specified, method names are used as aliases.
data – DataFrame of funda data with index = datadate/gvkey, sorted on gvkey/datadate. The funda data can be given at initialization or loaded later using load_data().
Methods
Load funda data from file.
Convert currency to USD.
Merge funda with fundq.
Add crsp market equity.
Preprocess data before creating firm characteristics.
Postprocess data.
Firm characteristic generation methods have a name like c_characteristic().
- add_crsp_me(crspm, method='latest')
Add crsp market equity.
In funda, market equity (‘me’) and fiscal market equity (‘me_fiscal’) are both defined as (prcc_f * csho). This method replaces them with crspm’s firm-level market equity (‘me_company’). If method = ‘latest’, ‘me’ is the latest ‘me_company’ and ‘me_fiscal’ is the ‘me_company’ on datadate. If method = ‘year_end’, both ‘me’ and ‘me_fiscal’ are the ‘me_company’ in December of datadate year.
- Parameters:
crspm – CRSPM instance.
method – How to merge crsp me with funda. ‘latest’: latest me; ‘year_end’: December me.
- c_absacc()
Absolute accruals. Bandyopadhyay, Huang, and Wirjanto (2010)
- c_acc()
Operating accruals (GHZ, Org). Sloan (1996)
- c_age()
Firm age. Jiang, Lee, and Zhang (2005)
- c_aliq_at()
Asset liquidity to book assets. Ortiz-Molina and Phillips (2014)
- c_aliq_mat()
Asset liquidity to market assets. Ortiz-Molina and Phillips (2014)
- c_at_be()
Book leverage. Fama and French (1992)
- c_at_gr1()
Asset growth. Cooper, Gulen, and Schill (2008)
- c_at_me()
Assets-to-market. Fama and French (1992)
- c_at_turnover()
Capital turnover. Haugen and Baker (1996)
- c_be_gr1a()
Change in common equity. Richardson et al. (2005)
- c_be_me()
Book-to-market (December ME). Rosenberg, Reid, and Lanstein (1985)
- c_bev_mev()
Book-to-market enterprise value. Penman, Richardson, and Tuna (2007)
- c_bm_ia()
Industry-adjusted book-to-market. Asness, Porter, and Stevens (2000)
- c_capex_abn()
Abnormal corporate investment. Titman, Wei, and Xie (2004)
- c_capx_gr1()
CAPEX growth (1 year). Xie (2001)
- c_capx_gr2()
Two-year investment growth. Anderson and Garcia-Feijoo (2006)
- c_capx_gr3()
Three-year investment growth. Anderson and Garcia-Feijoo (2006)
- c_cash_at()
Cash-to-assets. Palazzo (2012)
- c_cashdebt()
Cash flow-to-debt. Ou and Penman (1989)
- c_cashpr()
Cash productivity. Chandrashekar and Rao (2009)
- c_cfp()
Operating Cash flows to price (Org, GHZ). Desai, Rajgopal, and Venkatachalam (2004)
- c_cfp_ia()
Industry-adjusted cash flow-to-price. Asness, Porter, and Stevens (2000)
- c_chatoia()
Industry-adjusted change in asset turnover. Soliman (2008)
- c_chcsho()
Net stock issues (GHZ). Pontiff and Woodgate (2008)
- c_chempia()
Industry-adjusted change in employees. Asness, Porter, and Stevens (2000)
- c_chpmia()
Change in profit margin. Soliman (2008)
- c_coa_gr1a()
Change in current operating assets. Richardson et al. (2005)
- c_col_gr1a()
Change in current operating liabilities. Richardson et al. (2005)
- c_convind()
Convertible debt indicator. Valta (2016)
- c_cop_at()
Cash-based operating profitability. Ball et al. (2016)
- c_cop_atl1()
Cash-based operating profits to lagged assets. Ball et al. (2016)
- c_cowc_gr1a()
Change in net non-cash working capital. Richardson et al. (2005)
- c_currat()
Current ratio. Ou and Penman (1989)
- c_dbnetis_at()
Net debt finance. Bradshaw, Richardson, and Sloan (2006)
- c_debt_gr3()
Composite debt issuance. Lyandres, Sun, and Zhang (2008)
- c_debt_me()
Debt to market. Bhandari (1988)
- c_depr()
Depreciation to PP&E. Holthausen and Larcker (1992)
- c_dgp_dsale()
Gross margin growth to sales growth. Abarbanell and Bushee (1998)
- c_dsale_dinv()
Sales growth to inventory growth. Abarbanell and Bushee (1998)
- c_dsale_drec()
Sales growth to receivable growth. Abarbanell and Bushee (1998)
- c_dsale_dsga()
Sales growth to SG&A growth. Abarbanell and Bushee (1998)
- c_dy()
Dividend yield (GHZ). Litzenberger and Ramaswamy (1979)
- c_earnings_variability()
Earnings smoothness. Francis et al. (2004)
- c_ebit_bev()
Return on net operating assets. Soliman (2008)
- c_ebit_sale()
Profit margin. Soliman (2008)
- c_ebitda_mev()
Enterprise multiple. Loughran and Wellman (2011)
- c_emp_gr1()
Employee growth. Belo, Lin, and Bazdresch (2014)
- c_enterprise_multiple()
Enterprise multiple. Loughran and Wellman (2011)
- c_eq_dur()
Equity duration. Dechow, Sloan, and Soliman (2004)
- c_eqnetis_at()
Net equity finance. Bradshaw, Richardson, and Sloan (2006)
- c_eqnpo_me()
Net payout yield. Boudoukh et al. (2007)
- c_eqpo_me()
Payout yield. Boudoukh et al. (2007)
- c_f_score()
Piotroski F-Score (JKP). Piotroski (2000)
- c_fcf_me()
Cash flow-to-price. Lakonishok, Shleifer, and Vishny (1994)
- c_fnl_gr1a()
Change in financial liabilities. Richardson et al. (2005)
- c_gp_at()
Gross profits-to-assets. Novy-Marx (2013)
- c_gp_atl1()
Gross profits-to-lagged assets. Novy-Marx (2013)
- c_herf_at()
Industry concentration (total assets). Hou and Robinson (2006)
- c_herf_be()
Industry concentration (book equity). Hou and Robinson (2006)
- c_herf_sale()
Industry concentration (sales). Hou and Robinson (2006)
- c_intrinsic_value()
Intrinsic value-to-market. Frankel and Lee (1998)
- c_inv_gr1()
Inventory growth. Belo and Lin (2012)
- c_inv_gr1a()
Inventory change. Thomas and Zhang (2002)
- c_invest()
CAPEX and inventory. Chen and Zhang (2010)
- c_kz_index()
Kaplan-Zingales Index. Lamont, Polk, and Saa-Requejo (2001)
- c_lgr()
Change in long-term debt. Richardson et al. (2005)
- c_lnoa_gr1a()
Change in long-term net operating assets. Fairfield, Whisenant, and Yohn (2003)
- c_lti_gr1a()
Change in long-term investments. Richardson et al. (2005)
- c_mve_ia()
Industry-adjusted firm size. Asness, Porter, and Stevens (2000)
- c_ncoa_gr1a()
Change in non-current operating assets. Richardson et al. (2005)
- c_ncol_gr1a()
Change in non-current operating liabilities. Richardson et al. (2005)
- c_netdebt_me()
Net debt-to-price. Penman, Richardson, and Tuna (2007)
- c_netis_at()
Net external finance. Bradshaw, Richardson, and Sloan (2006)
- c_nfna_gr1a()
Change in net financial assets. Richardson et al. (2005)
- c_ni_ar1()
Earnings persistence. Francis et al. (2004)
- c_ni_be()
Return on equity. Haugen and Baker (1996)
- c_ni_ivol()
Earnings predictability. Francis et al. (2004)
- c_ni_me()
Earnings to price. Basu (1983)
- c_nncoa_gr1a()
Change in net non-current operating assets. Richardson et al. (2005)
- c_noa_at()
Net operating assets. Hirshleifer et al. (2004)
- c_noa_gr1a()
Change in net operating assets. Hirshleifer et al. (2004)
- c_o_score()
Ohlson O-Score. Dichev (1998)
- c_oaccruals_at()
Operating Accruals (JKP). Sloan (1996)
- c_oaccruals_ni()
Percent Operating Accruals (JKP). Hafzalla, Lundholm, and Van Winkle (2011)
- c_ocf_at()
Operating cash flow to assets. Bouchard et al. (2019)
- c_ocf_at_chg1()
Change in operating cash flow to assets. Bouchard et al. (2019)
- c_ocf_me()
Operating Cash flows to price (JKP). Desai, Rajgopal, and Venkatachalam (2004)
- c_op_at()
Operating profits-to-assets. Ball et al. (2016)
- c_op_atl1()
Operating profits-to-lagged assets. Ball et al. (2016)
- c_ope_be()
Operating profits to book equity (JKP). Fama and French (2015)
- c_ope_bel1()
Operating profits to lagged book equity. Fama and French (2015)
- c_operprof()
Operating profits to book equity (GHZ, Org). Fama and French (2015)
- c_opex_at()
Operating leverage. Novy-Marx (2011)
- c_pchcapx_ia()
Industry-adjusted change in capital investment. Abarbanell and Bushee (1998)
- c_pchcurrat()
Change in current ratio. Ou and Penman (1989)
- c_pchdepr()
Change in depreciation to PP&E. Holthausen and Larcker (1992)
- c_pchquick()
Change in quick ratio. Ou and Penman (1989)
- c_pchsaleinv()
Change in sales to inventory. Ou and Penman (1989)
- c_pctacc()
Percent operating accruals (GHZ, Org). Hafzalla, Lundholm, and Van Winkle (2011)
- c_pi_nix()
Taxable income to income (JKP). Lev and Nissim (2004)
- c_ppeinv_gr1a()
Changes in PPE and inventory/assets. Lyandres, Sun, and Zhang (2008)
- c_ps()
Piotroski score (GHZ, Org). Piotroski (2000)
- c_quick()
Quick ratio. Ou and Penman (1989)
- c_rd()
Unexpected R&D increase. Eberhart, Maxwell, and Siddique (2004)
- c_rd5_at()
R&D capital-to-assets. Li (2011)
- c_rd_me()
R&D to market. Chan, Lakonishok, and Sougiannis (2001) (Guo, Lev, and Shi (2006) in GHZ)
- c_rd_sale()
R&D to sales. Chan, Lakonishok, and Sougiannis (2001) (Guo, Lev, and Shi (2006) in GHZ)
- c_realestate()
Real estate holdings. Tuzel (2010)
- c_roic()
Return on invested capital. Brown and Rowe (2007)
- c_sale_bev()
Asset turnover. Soliman (2008)
- c_sale_emp_gr1()
Labor force efficiency. Abarbanell and Bushee (1998)
- c_sale_gr1()
Annual sales growth. Lakonishok, Shleifer, and Vishny (1994)
- c_sale_gr3()
Three-year sales growth. Lakonishok, Shleifer, and Vishny (1994)
- c_sale_me()
Sales to price. Barbee, Mukherji, and Raines (1996)
- c_salecash()
Sales-to-cash. Ou and Penman (1989)
- c_saleinv()
Sales-to-inventory. Ou and Penman (1989)
- c_salerec()
Sales-to-receivables. Ou and Penman (1989)
- c_secured()
Secured debt-to-total debt. Valta (2016)
- c_securedind()
Secured debt indicator. Valta (2016)
- c_sin()
Sin stocks. Hong and Kacperczyk (2009)
- c_sti_gr1a()
Change in short-term investments. Richardson et al. (2005)
- c_taccruals_at()
Total Accruals. Richardson et al. (2005)
- c_taccruals_ni()
Percent total accruals. Hafzalla, Lundholm, and Van Winkle (2011)
- c_tangibility()
Tangibility. Hahn and Lee (2009)
- c_tb()
Taxable income to income (Org, GHZ). Lev and Nissim (2004)
- c_z_score()
Altman Z-Score. Dichev (1998)
- convert_currency()
Convert currency to USD.
Convert the currency of funda to USD. This method needs to be called if
the data contains non USD-denominated firms, e.g., CAD; and
CRSP’s market equity is used, which is always in USD.
See also
- load_data(sdate=None, edate=None, fname='funda')
Load funda data from file.
This method loads funda data from config.input_dir/comp/funda and stores it in the data attribute. The data has index = datadate/gvkey and is sorted on gvkey/datadate.
- Parameters:
sdate – Start date (‘yyyy-mm-dd’). If None, from the earliest.
edate – End date (‘yyyy-mm-dd’). If None, to the latest.
fname – The funda file name.
- merge_with_fundq(fundq)
Merge funda with fundq.
Merge funda with quarterly-updated annual data generated from fundq. If a value exists in both datasets, the funda value takes priority.
Note
JKP create characteristics in funda and fundq separately and merge them, whereas we merge the raw data first and then generate characteristics. Since some variables in funda are not available in fundq, e.g., ebitda, JKP synthesize those unavailable variables with other variables and create characteristics, even when they are available in funda. We prefer to merge funda with fundq at the raw data level and create characteristics from the merged data.
Columns in both funda and fundq:
datadate, cusip, cik, sic, naics, sale, revt, cogs, xsga, dp, xrd, ib, nopi, spi, pi, txp, ni, txt, xint, capx, oancf, gdwlia, gdwlip, rect, act, che, ppegt, invt, at, aco, intan, ao, ppent, gdwl, lct, dlc, dltt, lt, pstk, ap, lco, lo, drc, drlt, txdi, ceq, scstkc, csho, prcc_f, oibdp, oiadp, mii, xopr, xi, do, xido, ibc, dpc, xidoc, fincf, fiao, txbcof, dltr, dlcch, prstkc, sstk, dv, ivao, ivst, re, txditc, txdb, seq, mib, icapt, ajex, curcd, exratd
Columns in funda but not in fundq:
xad, gp, ebitda, ebit, txfed, txfo, dvt, ob, gwo, fatb, fatl, dm, dcvt, cshrc, dcpstk, emp, xlr, ds, dvc, itcb, pstkrv, pstkl, dltis, ppenb, ppenls
- Parameters:
fundq – FUNDQ instance.
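The priority rule described above (funda wins where both sources have a value) resembles pandas' `combine_first`. The following is a sketch with toy data, not the library's merge logic; `fundq_annual` is a hypothetical name for the quarterly-updated annual data.

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [("2020-12-31", "001"), ("2021-03-31", "001")], names=["datadate", "gvkey"]
)
# Annual data: the second row has not been reported yet
funda = pd.DataFrame({"sale": [100.0, np.nan], "emp": [5.0, np.nan]}, index=idx)
# Quarterly-updated annual data: has 'sale' and 'at' but no 'emp'
fundq_annual = pd.DataFrame({"sale": [101.0, 110.0], "at": [50.0, 55.0]}, index=idx)

# funda values take priority; gaps are filled from fundq_annual
merged = funda.combine_first(fundq_annual)
```

In `merged`, the 2020 sale stays at the funda value of 100, while the missing 2021 sale is filled with 110 from the quarterly data.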
- postprocess()
Postprocess data.
This method deletes temporary variable columns (columns starting with ‘_’) and replaces infinite values with nan.
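In pandas terms, the two steps amount to something like the following sketch. `postprocess_sketch` is a hypothetical helper and the column names are illustrative.

```python
import numpy as np
import pandas as pd


def postprocess_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Drop columns whose name starts with '_' and turn +/-inf into NaN."""
    keep = [c for c in df.columns if not str(c).startswith("_")]
    return df[keep].replace([np.inf, -np.inf], np.nan)


raw = pd.DataFrame(
    {"be_me": [0.5, np.inf], "_tmp": [1.0, 2.0], "ni_me": [-np.inf, 0.1]}
)
clean = postprocess_sketch(raw)
```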
- update_variables()
Preprocess data before creating firm characteristics.
Synthesize missing values with other variables.
Create frequently used variables.
FUNDQ
- class pyanomaly.characteristics.FUNDQ(alias=None, data=None)
Bases: FCPanel
Class to generate firm characteristics from fundq.
The firm characteristics defined in this class can be viewed using show_available_chars().
- Parameters:
alias (str, list, or dict) – Firm characteristics to generate and their aliases. If None, all available firm characteristics are generated.
str: A column name in the mapping file (config.mapping_file_path). The firm characteristics defined in this class and in the alias column of the mapping file are generated. See Mapping File.
list: List of firm characteristics (method names).
dict: Dict of firm characteristics and their aliases of the form {method name: alias}.
If aliases are not specified, method names are used as aliases.
data – DataFrame of fundq data with index = datadate/gvkey, sorted on gvkey/datadate. The fundq data can be given at initialization or loaded later using load_data().
Methods
Load fundq data from file.
Drop duplicates.
Convert currency to USD.
Quarterize ytd items.
Preprocess data before creating firm characteristics.
Postprocess data.
Firm characteristic generation methods have a name like c_characteristic().
- c_chtx()
Tax expense surprise. Thomas and Zhang (2011)
- c_ni_inc8q()
Number of consecutive quarters with earnings increases. Barth, Elliott, and Finn (1999)
- c_niq_at()
Quarterly return on assets. Balakrishnan, Bartov, and Faurel (2010)
- c_niq_at_chg1()
Change in quarterly return on assets. Balakrishnan, Bartov, and Faurel (2010)
- c_niq_be()
Return on equity (quarterly). Hou, Xue, and Zhang (2015)
- c_niq_be_chg1()
Change in quarterly return on equity. Balakrishnan, Bartov, and Faurel (2010)
- c_niq_su()
Earnings surprise. Foster, Olsen, and Shevlin (1984)
- c_ocfq_saleq_std()
Cash flow volatility. Huang (2009)
- c_roavol()
ROA volatility. Francis et al. (2004)
- c_rsup()
Revenue surprise (Kama). Kama (2009)
- c_saleq_su()
Revenue surprise. Jegadeesh and Livnat (2006)
- c_stdacc()
Accrual volatility. Bandyopadhyay, Huang, and Wirjanto (2010)
- convert_currency()
Convert currency to USD.
Convert the currency of fundq to USD. This method needs to be called if
the data contains non USD-denominated firms, e.g., CAD; and
CRSP’s market equity is used, which is always in USD.
See also
- create_qitems_from_yitems()
Quarterize ytd items.
Quarterize ytd variables, Xy’s, and use them to fill missing Xq’s (if Xq exists) or to create new quarterly variables (if Xq does not exist).
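One way to quarterize a year-to-date item is to difference it within each fiscal year and keep the first quarter as is. The sketch below makes that assumption explicit; `quarterize_sketch`, the flat column layout, and the names `xy`/`xq` (standing in for Xy/Xq) are hypothetical, and the library's handling of gaps may differ.

```python
import numpy as np
import pandas as pd


def quarterize_sketch(df, ytd_col, q_col):
    """Quarterly values from a ytd column, used to fill or create q_col.

    Assumes rows are sorted by gvkey and date with no missing quarters.
    """
    prev = df.groupby(["gvkey", "fyearq"])[ytd_col].shift(1)
    from_ytd = df[ytd_col] - prev
    # In fiscal quarter 1 the ytd value is already the quarterly value
    from_ytd = from_ytd.where(df["fqtr"] != 1, df[ytd_col])
    if q_col in df:
        return df[q_col].fillna(from_ytd)   # fill missing Xq
    return from_ytd                         # create a new Xq


fundq = pd.DataFrame(
    {
        "gvkey": ["001"] * 4,
        "fyearq": [2020] * 4,
        "fqtr": [1, 2, 3, 4],
        "xy": [10.0, 25.0, 45.0, 70.0],      # year-to-date item
        "xq": [10.0, np.nan, 20.0, np.nan],  # quarterly item with gaps
    }
)
filled = quarterize_sketch(fundq, "xy", "xq")
created = quarterize_sketch(fundq, "xy", "x_new")
```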
- generate_funda_vars()
Generate quarterly-updated annual data from fundq.
The following variables are annualized by cumulating over the past 4 quarters.
‘cogs’, ‘xsga’, ‘xint’, ‘dp’, ‘txt’, ‘xrd’, ‘spi’, ‘sale’, ‘revt’, ‘xopr’, ‘oibdp’, ‘oiadp’, ‘ib’, ‘ni’, ‘xido’, ‘nopi’, ‘mii’, ‘pi’, ‘xi’, ‘oancf’, ‘dv’, ‘sstk’, ‘dlcch’, ‘capx’, ‘dltr’, ‘txbcof’, ‘xidoc’, ‘dpc’, ‘fiao’, ‘ibc’, ‘prstkc’, ‘fincf’.
- Returns:
DataFrame of quarterly-updated annual data.
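Cumulating a quarterly item over the past four quarters can be sketched as a per-firm rolling sum. This is an illustration under assumptions, not the library's code: the flat layout and the quarterly column name `saleq` are assumed, and quarters are taken to be consecutive within each gvkey.

```python
import pandas as pd

fundq = pd.DataFrame(
    {
        "gvkey": ["001"] * 5 + ["002"] * 4,
        "saleq": [1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 10.0, 10.0, 10.0],
    }
)

# Trailing four-quarter sum within each firm; the first three rows of each
# firm are NaN because fewer than four quarters are available
fundq["sale"] = fundq.groupby("gvkey")["saleq"].transform(
    lambda s: s.rolling(4).sum()
)
```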
- load_data(sdate=None, edate=None, fname='fundq')
Load fundq data from file.
This method loads fundq data from config.input_dir/comp/fundq and stores it in the data attribute. The data has index = datadate/gvkey and is sorted on gvkey/datadate.
- Parameters:
sdate – Start date (‘yyyy-mm-dd’). If None, from the earliest.
edate – End date (‘yyyy-mm-dd’). If None, to the latest.
fname – The fundq file name.
- postprocess()
Postprocess data.
This method deletes temporary variable columns (columns starting with ‘_’) and replaces infinite values with nan.
- remove_duplicates()
Drop duplicates.
In fundq, there are duplicate rows (rows with the same datadate and gvkey). Remove duplicates in the following order:
Remove records with missing fqtr.
Choose records with the maximum fyearq.
Choose records with the minimum fqtr.
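The three steps can be sketched in pandas as follows. `remove_duplicates_sketch` is a hypothetical helper working on a flat frame, and it assumes the missing-fqtr filter applies to all rows rather than only to duplicated ones, which the description above does not specify.

```python
import numpy as np
import pandas as pd


def remove_duplicates_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Keep one row per datadate/gvkey following the three rules above."""
    keys = ["datadate", "gvkey"]
    out = df.dropna(subset=["fqtr"])                                   # rule 1
    out = out[out["fyearq"] == out.groupby(keys)["fyearq"].transform("max")]  # rule 2
    out = out[out["fqtr"] == out.groupby(keys)["fqtr"].transform("min")]      # rule 3
    return out


fundq = pd.DataFrame(
    {
        "datadate": ["2020-03-31"] * 4 + ["2020-06-30"],
        "gvkey": ["001"] * 5,
        "fyearq": [2019, 2020, 2020, 2020, 2020],
        "fqtr": [4, 1, 2, np.nan, 2],
    }
)
deduped = remove_duplicates_sketch(fundq)
```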
- update_variables()
Preprocess data before creating firm characteristics.
Synthesize missing values with other variables.
Create frequently used variables.
CRSPM
- class pyanomaly.characteristics.CRSPM(alias=None, data=None)
Bases: FCPanel
Class to generate firm characteristics from crspm.
The firm characteristics defined in this class can be viewed using show_available_chars().
- Parameters:
alias (str, list, or dict) – Firm characteristics to generate and their aliases. If None, all available firm characteristics are generated.
str: A column name in the mapping file (config.mapping_file_path). The firm characteristics defined in this class and in the alias column of the mapping file are generated. See Mapping File.
list: List of firm characteristics (method names).
dict: Dict of firm characteristics and their aliases of the form {method name: alias}.
If aliases are not specified, method names are used as aliases.
data – DataFrame of crspm data with index = date/permno, sorted on permno/date. The crspm data can be given at initialization or loaded later using load_data().
Methods
Load crspm data from file.
Filter data.
Preprocess data before creating firm characteristics.
Merge crspm with factors.
Postprocess data.
Firm characteristic generation methods have a name like c_characteristic().
- c_beta_60m()
Market beta (Org, JKP). Fama and MacBeth (1973)
- c_chcsho_12m()
Net stock issues (JKP). Pontiff and Woodgate (2008)
- c_chmom()
Change in 6-month momentum. Gettleman and Marks (2006)
- c_div12m_me()
Dividend yield (JKP). Litzenberger and Ramaswamy (1979)
- c_divi()
Dividend initiation. Michaely, Thaler, and Womack (1995)
- c_divo()
Dividend omission. Michaely, Thaler, and Womack (1995)
- c_dolvol()
Dollar trading volume (Org, GHZ). Brennan, Chordia, and Subrahmanyam (1998)
- c_eqnpo_12m()
Composite equity issuance (JKP, 12 months). Daniel and Titman (2006)
- c_eqnpo_60m()
Composite equity issuance (Org). Daniel and Titman (2006)
- c_indmom()
Industry momentum. Moskowitz and Grinblatt (1999)
- c_ipo()
Initial public offerings. Loughran and Ritter (1995)
- c_market_equity()
Market equity. Banz (1981)
- c_price()
Share price. Miller and Scholes (1982)
- c_resff3_12_1()
12 month residual momentum. Blitz, Huij, and Martens (2011)
- c_resff3_6_1()
6 month residual momentum. Blitz, Huij, and Martens (2011)
- c_ret_12_1()
Momentum (12 months). Jegadeesh and Titman (1993)
- c_ret_12_6()
Intermediate momentum (7-12). Novy-Marx (2012)
- c_ret_1_0()
Short-term reversal. Jegadeesh (1990)
- c_ret_36_12()
Long-term reversal (12-36). De Bondt and Thaler (1985)
- c_ret_3_1()
Momentum (3 months). Jegadeesh and Titman (1993)
- c_ret_60_12()
Long-term reversal (12-60). De Bondt and Thaler (1985)
- c_ret_6_1()
Momentum (6 months). Jegadeesh and Titman (1993)
- c_ret_9_1()
Momentum (9 months). Jegadeesh and Titman (1993)
- c_seas_11_15an()
Years 11-15 lagged returns, annual. Heston and Sadka (2008)
- c_seas_11_15na()
Years 11-15 lagged returns, nonannual. Heston and Sadka (2008)
- c_seas_16_20an()
Years 16-20 lagged returns, annual. Heston and Sadka (2008)
- c_seas_16_20na()
Years 16-20 lagged returns, nonannual. Heston and Sadka (2008)
- c_seas_1_1an()
Year 1-lagged return, annual. Heston and Sadka (2008)
- c_seas_1_1na()
Year 1-lagged return, nonannual. Heston and Sadka (2008)
- c_seas_2_5an()
Years 2-5 lagged returns, annual. Heston and Sadka (2008)
- c_seas_2_5na()
Years 2-5 lagged returns, nonannual. Heston and Sadka (2008)
- c_seas_6_10an()
Years 6-10 lagged returns, annual. Heston and Sadka (2008)
- c_seas_6_10na()
Years 6-10 lagged returns, nonannual. Heston and Sadka (2008)
- c_turn()
Share turnover (Org, GHZ). Datar, Naik, and Radcliffe (1998)
- filter_data()
Filter data.
The data is filtered using the following filters:
shrcd in [10, 11, 12]
Note
We do not filter the data using exchange code (exchcd in [1 (NYSE), 2 (ASE), 3 (NASDAQ)]) because exchcd can change when a stock is delisted: If the data is filtered using exchcd, the data of the delist month can be lost.
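The share-code filter described above can be sketched with plain pandas. This is only an illustration on a made-up table, not the library's implementation; the column names shrcd and exchcd follow CRSP conventions, and the values are invented.

```python
import pandas as pd

# Toy CRSP-like table; all values are made up for illustration.
crspm = pd.DataFrame({
    "permno": [10000, 10001, 10002, 10003],
    "shrcd": [10, 11, 31, 12],
    "exchcd": [1, 3, 2, 0],  # exchcd may change in the delisting month
})

# Keep share codes 10, 11, and 12. exchcd is deliberately not used as a
# filter, so the row with exchcd = 0 survives.
filtered = crspm[crspm["shrcd"].isin([10, 11, 12])]
print(filtered["permno"].tolist())
```

In this sketch, the row with exchcd = 0 would have been dropped by an exchange-code filter even though its share code qualifies.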
- load_data(sdate=None, edate=None, fname='crspm')
Load crspm data from file.
This method loads crspm data from config.input_dir/crspm and stores it in the data attribute. The data has index = date/permno and is sorted on permno/date.
Note
In CRSP monthly tables, date is the last business day of the month, whereas datadate in Compustat is the end-of-month date. To make the two dates consistent, crspm dates are shifted to the end of the month.
- Parameters:
sdate – Start date (‘yyyy-mm-dd’). If None, from the earliest.
edate – End date (‘yyyy-mm-dd’). If None, to the latest.
fname – The crspm file name.
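The date alignment mentioned in the note can be illustrated as follows. This is a minimal sketch that assumes the shift is equivalent to rolling each date forward to its calendar month end; it uses only pandas and does not call PyAnomaly.

```python
import pandas as pd

# Last business days of three months (CRSP style).
dates = pd.Series(pd.to_datetime(["2023-03-31", "2023-04-28", "2023-09-29"]))

# MonthEnd(0) leaves month-end dates unchanged and rolls other dates
# forward to the end of the same month (Compustat datadate style).
month_end = dates + pd.offsets.MonthEnd(0)
print(month_end.dt.strftime("%Y-%m-%d").tolist())
```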
- merge_with_factors(factors=None)
Merge crspm with factors.
The factors should contain Fama-French 3 factors with column names as defined in config.factor_names.
- Parameters:
factors – DataFrame of factors with index = date or list of factors. If None or list, factor data will be read from config.monthly_factors_fname and only the factors in factors will be merged.
- postprocess()
Postprocess data.
This method deletes temporary variable columns (columns starting with ‘_’) and replaces infinite values with nan.
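The two clean-up steps can be reproduced with plain pandas. The sketch below is not the library's code; the column names are hypothetical and only the leading underscore convention comes from the description above.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "ret_12_1": [0.10, np.inf, -0.05],
    "_tmp_me": [1.0, 2.0, 3.0],      # temporary column (leading underscore)
    "turn": [0.3, -np.inf, 0.2],
})

# Drop temporary columns, then replace +/- infinity with nan.
temp_cols = [c for c in data.columns if c.startswith("_")]
clean = data.drop(columns=temp_cols).replace([np.inf, -np.inf], np.nan)
print(clean)
```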
- update_variables()
Preprocess data before creating firm characteristics.
Convert negative prices (quotes) to positive.
Convert shares outstanding (shrout) unit from thousands to millions (divide by 1000).
Convert trading volume (vol) unit from 100 shares to shares (multiply by 100).
Adjust trading volume following Gao and Ritter (2010).
Create frequently used variables.
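The unit conversions listed above amount to simple column arithmetic. The following sketch uses invented values and omits the Gao and Ritter (2010) volume adjustment; it illustrates the conversions only and is not the library's implementation.

```python
import pandas as pd

raw = pd.DataFrame({
    "prc": [-12.5, 30.0],           # a negative price marks a quote
    "shrout": [1500.0, 250000.0],   # thousands of shares
    "vol": [120.0, 4000.0],         # hundreds of shares
})

out = raw.copy()
out["prc"] = out["prc"].abs()         # negative quotes to positive prices
out["shrout"] = out["shrout"] / 1000  # thousands to millions
out["vol"] = out["vol"] * 100         # hundreds of shares to shares
print(out)
```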
CRSPD
- class pyanomaly.characteristics.CRSPDRaw(alias=None, data=None)
Bases: FCPanel
Class that handles crspd data.
This class contains daily crspd data and is used to generate monthly firm characteristics in CRSPD.
- Parameters:
alias (str, list, or dict) –
Firm characteristics to generate and their aliases. If None, all available firm characteristics are generated.
str: A column name in the mapping file (config.mapping_file_path). The firm characteristics defined in this class and in the alias column of the mapping file are generated. See Mapping File.
list: List of firm characteristics (method names).
dict: Dict of firm characteristics and their aliases of the form {method name: alias}.
If aliases are not specified, method names are used as aliases.
data – DataFrame of crspd data with index = date/permno, sorted on permno/date. The crspd data can be given at initialization or loaded later using load_data().
Methods
Load crspd data from file.
Filter data.
Preprocess data before creating firm characteristics.
Merge crspd with factors.
Get id-month group.
Get id-month group sizes.
Apply a function to each id-month group.
- apply_to_idyms(data, function, n_ret, *args, data2=None)
Apply a function to each id-month group.
This method groups data by id-month and applies function to each group. This method can be used when the function is a reduce function and requires only the data within the month.
- Parameters:
data – DataFrame, Series, ndarray, or (list of) columns. It should have the same length and order as the data attribute.
function – Jitted reduce function to apply to groups. Its arguments should be (gbdata, *args) or (gbdata, gbdata2, *args), where gbdata (gbdata2) is a group of data (data2).
n_ret – Number of returns of function. If None, it is assumed to be the same as the column size of data.
*args – Additional arguments of function.
data2 – DataFrame, Series, ndarray, str, or int. Optional argument when function requires two sets of input data.
- Returns:
Concatenated value of the outputs of function. Size = (number of id-month groups) x n_ret.
See also
- filter_data()
Filter data.
The data is filtered using the following filters:
shrcd in [10, 11, 12]
Note
We do not filter the data using exchange code (exchcd in [1 (NYSE), 2 (ASE), 3 (NASDAQ)]) because exchcd can change when a stock is delisted: If the data is filtered using exchcd, the data of the delist month can be lost.
- get_idym_group()
Get id-month group.
Group data by permno and year-month and return the GroupBy object.
- Returns:
Pandas GroupBy object.
- get_idym_group_size()
Get id-month group sizes.
- Returns:
Ndarray of id-month group sizes.
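The idea of id-month groups can be sketched with a plain pandas groupby on permno and a year-month key. This illustrates the grouping only, under the assumption that a group is identified by (permno, calendar month); the variable names are hypothetical and the library's internals may differ.

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [
        (pd.Timestamp("2023-03-30"), 10000),
        (pd.Timestamp("2023-03-31"), 10000),
        (pd.Timestamp("2023-04-03"), 10000),
        (pd.Timestamp("2023-03-31"), 20000),
        (pd.Timestamp("2023-04-03"), 20000),
        (pd.Timestamp("2023-04-04"), 20000),
    ],
    names=["date", "permno"],
)
crspd = pd.DataFrame({"ret": [0.01, -0.02, 0.03, 0.00, 0.01, 0.02]}, index=idx)

# Build the grouping keys: permno and an integer year-month such as 202303.
permno = crspd.index.get_level_values("permno").to_numpy()
dates = crspd.index.get_level_values("date")
ym = (dates.year * 100 + dates.month).to_numpy()

gb = crspd.groupby([permno, ym], sort=False)
sizes = gb.size().to_numpy()             # number of daily rows per id-month
monthly_sum = gb["ret"].sum().to_numpy() # a reduce function applied per group
print(sizes, monthly_sum)
```

Each id-month group yields one output row, which is why a reduce function that needs only within-month data fits this pattern.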
- load_data(sdate=None, edate=None, fname='crspd')
Load crspd data from file.
This method loads crspd data from config.input_dir/crspd and stores it in the data attribute. The data has index = date/permno and is sorted on permno/date.
- Parameters:
sdate – Start date (‘yyyy-mm-dd’). If None, from the earliest.
edate – End date (‘yyyy-mm-dd’). If None, to the latest.
fname – The crspd file name.
- merge_with_factors(factors=None)
Merge crspd with factors.
The factors should contain Fama-French 3 factors and Hou-Xue-Zhang 4 factors with column names as defined in config.factor_names.
- Parameters:
factors – DataFrame of factors with index = date or list of factors. If None or list, factor data will be read from config.daily_factors_fname and only the factors in factors will be merged.
- update_variables()
Preprocess data before creating firm characteristics.
Convert negative prices (quotes) to positive.
Set askhi and bidlo to nan if either is negative, the price is negative, or the volume is 0.
Convert cfacpr of 0 to nan.
Convert shares outstanding (shrout) unit from thousands to millions (divide by 1000).
Adjust trading volume following Gao and Ritter (2010).
Create frequently used variables.
- class pyanomaly.characteristics.CRSPD(alias=None, data=None)
Bases: FCPanel
Class to generate firm characteristics from crspd.
This class has a CRSPDRaw object as a member attribute and uses it to generate monthly firm characteristics. CRSPDRaw.data contains daily crspd data and CRSPD.data contains monthly firm characteristics. The firm characteristics defined in this class can be viewed using show_available_chars().
- Parameters:
alias (str, list, or dict) –
Firm characteristics to generate and their aliases. If None, all available firm characteristics are generated.
str: A column name in the mapping file (config.mapping_file_path). The firm characteristics defined in this class and in the alias column of the mapping file are generated. See Mapping File.
list: List of firm characteristics (method names).
dict: Dict of firm characteristics and their aliases of the form {method name: alias}.
If aliases are not specified, method names are used as aliases.
data – DataFrame of crspd data with index = date/permno, sorted on permno/date. The crspd data can be given at initialization or loaded later using load_data().
Attributes
Methods
Load crspd data from file.
Filter data.
Preprocess data before creating firm characteristics.
Merge crspd with factors.
Postprocess data.
Get id-month group.
Get id-month group sizes.
Firm characteristic generation methods have a name like c_characteristic().
- c_ami_126d()
Illiquidity. Amihud (2002)
- c_baspread()
Bid-ask spread. Amihud and Mendelson (1986)
- c_beta()
Market beta (GHZ). Fama and MacBeth (1973)
- c_beta_dimson_21d()
Dimson Beta. Dimson (1979)
- c_betabab_1260d()
Frazzini-Pedersen beta. Frazzini and Pedersen (2014)
- c_betadown_252d()
Downside beta. Ang, Chen, and Xing (2006)
- c_betasq()
Beta squared (GHZ). Fama and MacBeth (1973)
- c_bidaskhl_21d()
High-low bid-ask spread. Corwin and Schultz (2012)
- c_corr_1260d()
Market correlation. Asness et al. (2020)
- c_coskew_21d()
Coskewness. Harvey and Siddique (2000)
- c_dolvol_126d()
Dollar trading volume (JKP). Brennan, Chordia, and Subrahmanyam (1998)
- c_dolvol_var_126d()
Volatility of dollar trading volume (JKP). Chordia, Subrahmanyam, and Anshuman (2001)
- c_idiovol()
Idiosyncratic volatility (GHZ). Ali, Hwang, and Trombley (2003)
- c_iskew_capm_21d()
Idiosyncratic skewness (CAPM). Bali, Engle, and Murray (2016)
- c_iskew_ff3_21d()
Idiosyncratic skewness (FF3). Bali, Engle, and Murray (2016)
- c_iskew_hxz4_21d()
Idiosyncratic skewness (HXZ). Bali, Engle, and Murray (2016)
- c_ivol_capm_21d()
Idiosyncratic volatility (CAPM). Ang et al. (2006)
- c_ivol_capm_252d()
Idiosyncratic volatility (Org, JKP). Ali, Hwang, and Trombley (2003)
- c_ivol_ff3_21d()
Idiosyncratic volatility (FF3). Ang et al. (2006)
- c_ivol_hxz4_21d()
Idiosyncratic volatility (HXZ). Ang et al. (2006)
- c_prc_highprc_252d()
52-week high. George and Hwang (2004)
- c_pricedelay()
Price delay based on R-squared. Hou and Moskowitz (2005)
- c_pricedelay_slope()
Price delay based on slopes. Hou and Moskowitz (2005)
- c_retvol()
Return volatility. Ang et al. (2006)
- c_rmax1_21d()
Maximum daily return. Bali, Cakici, and Whitelaw (2011)
- c_rmax5_21d()
Highest 5 days of return. Bali, Brown, and Tang (2017)
- c_rmax5_rvol_21d()
Highest 5 days of return to volatility. Asness et al. (2020)
- c_rskew_21d()
Return skewness. Bali, Engle, and Murray (2016)
- c_std_dolvol()
Volatility of dollar trading volume (GHZ). Chordia, Subrahmanyam, and Anshuman (2001)
- c_std_turn()
Volatility of share turnover (GHZ). Chordia, Subrahmanyam, and Anshuman (2001)
- c_trend_factor()
Trend factor. Han, Zhou, and Zhu (2016)
- c_turnover_126d()
Share turnover (JKP). Datar, Naik, and Radcliffe (1998)
- c_turnover_var_126d()
Volatility of share turnover (JKP). Chordia, Subrahmanyam, and Anshuman (2001)
- c_zero_trades_126d()
Zero-trading days (6 months). Liu (2006)
- c_zero_trades_21d()
Zero-trading days (1 month). Liu (2006)
- c_zero_trades_252d()
Zero-trading days (12 months). Liu (2006)
- filter_data()
Filter data.
This is a wrapping method of
CRSPDRaw.filter_data().
- get_idym_group()
Get id-month group.
This is a wrapping method of
CRSPDRaw.get_idym_group().
- get_idym_group_size()
Get id-month group sizes.
This is a wrapping method of
CRSPDRaw.get_idym_group_size().
- load_data(sdate=None, edate=None, fname='crspd')
Load crspd data from file.
This is a wrapping method of
CRSPDRaw.load_data().
- merge_with_factors(factors=None)
Merge crspd with factors.
This is a wrapping method of
CRSPDRaw.merge_with_factors().
- postprocess()
Postprocess data.
This method deletes temporary variable columns (columns starting with ‘_’) and replaces infinite values with nan.
- update_variables()
Preprocess data before creating firm characteristics.
This method calls CRSPDRaw.update_variables() to update variables and initializes the data attribute.
Merge
- class pyanomaly.characteristics.Merge(alias=None)
Bases: FCPanel
Class to generate firm characteristics from a combined dataset of crspm, crspd, funda, and fundq.
The firm characteristics defined in this class can be viewed using show_available_chars().
Methods
Merge crspm, crspd, funda, and fundq.
Postprocess data.
Firm characteristic generation methods have a name like c_characteristic().
- c_age()
Firm age. Jiang, Lee, and Zhang (2005)
- c_mispricing_mgmt()
Mispricing factor: Management. Stambaugh and Yuan (2016)
- c_mispricing_perf()
Mispricing factor: Performance. Stambaugh and Yuan (2016)
- c_qmj()
Quality minus Junk: Composite. Asness, Frazzini, and Pedersen (2018)
- c_qmj_growth()
Quality minus Junk: Growth. Asness, Frazzini, and Pedersen (2018)
- c_qmj_prof()
Quality minus Junk: Profitability. Asness, Frazzini, and Pedersen (2018)
- c_qmj_safety()
Quality minus Junk: Safety. Asness, Frazzini, and Pedersen (2018)
- postprocess()
Postprocess data.
This method deletes temporary variable columns (columns starting with ‘_’) and replaces infinite values with nan.
- preprocess(crspm=None, crspd=None, funda=None, fundq=None, delete_data=True)
Merge crspm, crspd, funda, and fundq.
The crspd, funda, and fundq are left-joined to crspm, and the resulting data has index = date/permno. The frequency of the final data is the same as the frequency of crspm data. This method also checks if “ingredient” firm characteristics have been generated and generates them if necessary.
config
This module defines functions to set/get package configuration.
A configuration can be accessed by either get_config(attr) or config.attr.
Configurations
| Attribute | Description | Default value |
|---|---|---|
| input_dir | Input file top-level directory | ‘./input/’ |
| output_dir | Output file top-level directory | ‘./output/’ |
| log_dir | Log file directory | ‘./log/’ |
| mapping_file_path | Function-characteristic mapping file path | ‘./mapping.xlsx’ |
| factors_monthly_fname | Monthly factor file name | ‘factors_monthly’ |
| factors_daily_fname | Daily factor file name | ‘factors_daily’ |
| factor_names | Factor name mapping dictionary. The keys are the factor names used in PyAnomaly and the values are the factor names in the monthly (daily) factor file. This configuration can be used when factor files are obtained externally and have different factor names. | {‘rf’: ‘rf’, ‘mktrf’: ‘mktrf’, ‘smb_ff’: ‘smb_ff’, ‘hml’: ‘hml’, ‘smb_ff5’: ‘smb_ff5’, ‘rmw’: ‘rmw’, ‘cma’: ‘cma’, ‘smb_hxz’: ‘smb_hxz’, ‘inv’: ‘inv’, ‘roe’: ‘roe’, ‘smb_sy’: ‘smb_sy’, ‘mgmt’: ‘mgmt’, ‘perf’: ‘perf’} |
| replicate_jkp | Whether to replicate JKP version. True or False | False |
| float_type | Float data type. ‘float32’ or ‘float64’ | ‘float64’ |
| file_format | File format used to save data. ‘pickle’ or ‘parquet’ | ‘pickle’ |
| disable_jit | Disable jitting. Applicable only to non-cached jitted functions. | False |
| jit_parallel | Enable parallel looping in jitted functions. | True |
| numba_num_threads | Number of threads used in Numba parallel mode. Should be fewer than the number of CPU cores. | Number of CPU cores |
| debug | Print debugging messages. | False |
| Factor model | Factors |
|---|---|
| Fama and French 3-factor model | mktrf, smb_ff, hml |
| Fama and French 5-factor model | mktrf, smb_ff5, hml, rmw, cma |
| Hou, Xue, and Zhang 4-factor model | mktrf, smb_hxz, inv, roe |
| Stambaugh and Yuan 4-factor model | mktrf, smb_sy, mgmt, perf |
Methods
Set configuration. |
|
Get configuration. |
- pyanomaly.config.get_config(attr)
Get configuration.
- Parameters:
attr – String. A configuration attribute.
- Returns:
The value of the attribute.
Examples
>>> get_config('input_dir')
'./input/'
- pyanomaly.config.set_config(**kwargs)
Set configuration.
- Parameters:
**kwargs – Keyword arguments of configuration attributes and their values.
Examples
Change the float type to ‘float32’.
>>> set_config(float_type='float32')
Set input and output directories to ‘./my_input/’ and ‘./my_output/’, respectively.
>>> set_config(input_dir='./my_input/', output_dir='./my_output/')
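For readers who want a mental model of a keyword-based configuration store, here is a minimal, hypothetical stand-in. It is not PyAnomaly's code: the _config dictionary is invented, and rejecting unknown attribute names is a design choice of this sketch, not documented library behavior.

```python
# Hypothetical stand-in for a configuration store; not PyAnomaly's code.
_config = {
    "input_dir": "./input/",
    "output_dir": "./output/",
    "float_type": "float64",
}


def get_config(attr):
    """Return the value of a configuration attribute."""
    return _config[attr]


def set_config(**kwargs):
    """Update configuration attributes, rejecting unknown names."""
    unknown = set(kwargs) - set(_config)
    if unknown:
        raise KeyError(f"Unknown configuration attribute(s): {sorted(unknown)}")
    _config.update(kwargs)


set_config(float_type="float32")
print(get_config("float_type"))
```

Validating names in this way makes a misspelled attribute fail loudly instead of being silently ignored.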
datatools
This module defines functions for data handling.
Group-and-Apply
Group data and apply a function to each group.
Group data and apply a function to each group (jitted version).
Group data and apply a reduce function to each group (jitted version).
Classify/Trim/Filter/Winsorize
Merge/Populate/Shift
Data Inspection/Comparison
Inspect data.
Compare two data sets.
Auxiliary Functions
Shift dates to the last dates of the same month.
Add months to dates.
- pyanomaly.datatools.add_months(date, months, to_month_end=True)
Add months to dates.
- Parameters:
date – Datetime Series.
months – Months to add. Can be negative.
to_month_end – If True, returned dates are end-of-month dates.
- Returns:
Datetime Series of (date + months). Dates are end-of-month dates if to_month_end = True.
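The described behavior can be approximated with pandas date offsets. The function add_months_sketch below is a hypothetical illustration, not the library's implementation; it assumes that month arithmetic clips to the last valid day of the target month.

```python
import pandas as pd


def add_months_sketch(date, months, to_month_end=True):
    """Add months to a datetime Series, optionally snapping to month end."""
    shifted = date + pd.DateOffset(months=months)
    if to_month_end:
        # MonthEnd(0) keeps month-end dates and rolls others forward.
        shifted = shifted + pd.offsets.MonthEnd(0)
    return shifted


dates = pd.Series(pd.to_datetime(["2023-01-31", "2023-03-15"]))
plus_one = add_months_sketch(dates, 1)
minus_two = add_months_sketch(dates, -2)
no_snap = add_months_sketch(dates, 1, to_month_end=False)
print(plus_one.tolist(), minus_two.tolist(), no_snap.tolist())
```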
- pyanomaly.datatools.apply_to_groups(data, ginfo, function, *args, data2=None)
Group data and apply a function to each group.
This function can be used for a complex groupby operation. The data (and data2) is grouped using the grouping information, ginfo, and function is applied to each group. The function can be either jitted or unjitted. If it is jitted, consider using apply_to_groups_jit() instead, which runs the for loop along the groups in parallel. This function is faster than groupby().apply(function) when the size of data is large.
- Parameters:
data – DataFrame or ndarray (values of a DataFrame) to be grouped.
ginfo – Grouping information: integer (index level), str (column name), pandas GroupBy object, ndarray of group sizes, or list of group indexes. See the note below.
function – Function to apply to groups. Its arguments should be (gbdata, *args) or (gbdata, gbdata2, *args), where gbdata (gbdata2) is a group of data (data2).
*args – Additional arguments of function.
data2 – DataFrame or ndarray (values of a DataFrame) to be grouped. Optional argument when function requires two sets of input data.
- Returns:
Concatenated value of the outputs of function.
Note
Suppose data is a DataFrame with index = date/id, sorted on id/date.
To apply a function to each id, ginfo can be set to any of the following.
ginfo = ‘id’ (index name)
ginfo = 1 (index level)
ginfo = data.groupby(‘id’) (GroupBy object)
ginfo = data.groupby(‘id’).size().to_numpy() (group size)
ginfo = list(data.groupby(‘id’).indices.values()) (group index)
To apply a function to each date, ginfo can be set to any of the following.
ginfo = ‘date’ (index name)
ginfo = 0 (index level)
ginfo = data.groupby(‘date’) (GroupBy object)
ginfo = list(data.groupby(‘date’).indices.values()) (group index)
Since data is sorted on id/date, group sizes can be used only when grouping by id. The most efficient option is to provide group sizes, followed by group indexes. Hence, if this function needs to be called repeatedly, performance can be improved by generating group sizes (if data is sorted on the grouping index or column) or group indexes once in advance and passing them as ginfo.
Examples
>>> data = pd.DataFrame(
...     {'ret': [0.01, 0.05, 0.02, 0.03, 0.11, -0.03, -0.01, 0.03, -0.05, 0.07],
...      'me': [100, 5000, 2000, 300, 150, 120, 4500, 2100, 305, 140],
...      'me_nyse': [np.nan, 5000, 2000, 300, 150, np.nan, 4500, 2100, 305, 140],
...      },
...     index=pd.MultiIndex.from_product([['2023-03-31', '2023-04-30'], [10000, 20000, 30000, 40000, 50000]],
...                                      names=['date', 'permno'])
... ).sort_index(level=['permno', 'date'])
>>> data
                     ret    me   me_nyse
date       permno
2023-03-31 10000   0.010   100       NaN
2023-04-30 10000  -0.030   120       NaN
2023-03-31 20000   0.050  5000  5000.000
2023-04-30 20000  -0.010  4500  4500.000
2023-03-31 30000   0.020  2000  2000.000
2023-04-30 30000   0.030  2100  2100.000
2023-03-31 40000   0.030   300   300.000
2023-04-30 40000  -0.050   305   305.000
2023-03-31 50000   0.110   150   150.000
2023-04-30 50000   0.070   140   140.000
Define the function to apply.
>>> def rolling_sum(x, n):
...     return x.rolling(n).sum()
Group data by ‘permno’ and calculate rolling sum of ‘ret’ and ‘me’.
>>> apply_to_groups(data[['ret', 'me']], 'permno', rolling_sum, 2)
                     ret        me
date       permno
2023-03-31 10000     NaN       NaN
2023-04-30 10000  -0.020   220.000
2023-03-31 20000     NaN       NaN
2023-04-30 20000   0.040  9500.000
2023-03-31 30000     NaN       NaN
2023-04-30 30000   0.050  4100.000
2023-03-31 40000     NaN       NaN
2023-04-30 40000  -0.020   605.000
2023-03-31 50000     NaN       NaN
2023-04-30 50000   0.180   290.000
The following are equivalent to the above.
>>> gb = data.groupby('permno')
>>> apply_to_groups(data[['ret', 'me']], gb, rolling_sum, 2)
>>> gsize = gb.size().to_numpy()
>>> apply_to_groups(data[['ret', 'me']], gsize, rolling_sum, 2)
Group data by ‘date’ and calculate cross-sectional mean of ‘ret’.
>>> apply_to_groups(data['ret'], 'date', np.mean)
[[0.044]
 [0.002]]
The following are equivalent to the above.
>>> gb = data.groupby('date')
>>> apply_to_groups(data['ret'], gb, np.mean)
>>> gidx = list(gb.indices.values())
>>> apply_to_groups(data['ret'], gidx, np.mean)
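The note's point that group sizes are the cheapest form of grouping information can be illustrated with a small NumPy sketch. When rows are already sorted on the grouping key, each group is a contiguous block, so the block boundaries follow from a cumulative sum of the sizes. The helper apply_by_sizes is hypothetical and is not the library's implementation.

```python
import numpy as np


def apply_by_sizes(values, sizes, function):
    """Apply function to consecutive row blocks whose lengths are in sizes."""
    ends = np.cumsum(sizes)
    starts = ends - sizes
    pieces = [np.atleast_1d(function(values[s:e])) for s, e in zip(starts, ends)]
    return np.concatenate(pieces)


# Returns sorted on id, two rows per id (same 'ret' values as in the example).
ret = np.array([0.01, -0.03, 0.05, -0.01, 0.02, 0.03])
sizes = np.array([2, 2, 2])

sums = apply_by_sizes(ret, sizes, np.sum)        # one value per group
cumsums = apply_by_sizes(ret, sizes, np.cumsum)  # one value per row
print(sums, cumsums)
```

No hashing or index lookup is needed in this approach, which is one plausible reason why precomputed group sizes are efficient for repeated calls.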
- pyanomaly.datatools.apply_to_groups_jit(data, ginfo, function, n_ret, *args, data2=None)
Group data and apply a function to each group (jitted version).
This function is similar to apply_to_groups(). The for loop along the groups is jitted for faster performance. The first call of this function can be slow as jitting takes place when first called. The function should be jitted, and the row size of its returns should be the same as the row size of the input data. For reduce functions, e.g., sum and mean, use apply_to_groups_reduce_jit().
- Parameters:
data – DataFrame or ndarray (values of a DataFrame) to be grouped.
ginfo – Grouping information: integer (index level), str (column name), pandas GroupBy object, ndarray of group sizes, or list of group indexes. See apply_to_groups() for more details.
function – Jitted function to apply to groups. Its arguments should be (gbdata, *args) or (gbdata, gbdata2, *args), where gbdata (gbdata2) is a group of data (data2).
n_ret – Number of returns of function. If None, it is assumed to be the same as the column size of data.
*args – Additional arguments of function.
data2 – DataFrame or ndarray (values of a DataFrame) to be grouped. Optional argument when function requires two sets of input data.
- Returns:
Concatenated value of the outputs of function.
Examples
>>> data = pd.DataFrame(
...     {'ret': [0.01, 0.05, 0.02, 0.03, 0.11, -0.03, -0.01, 0.03, -0.05, 0.07],
...      'me': [100, 5000, 2000, 300, 150, 120, 4500, 2100, 305, 140],
...      'me_nyse': [np.nan, 5000, 2000, 300, 150, np.nan, 4500, 2100, 305, 140],
...      },
...     index=pd.MultiIndex.from_product([['2023-03-31', '2023-04-30'], [10000, 20000, 30000, 40000, 50000]],
...                                      names=['date', 'permno'])
... ).sort_index(level=['permno', 'date'])
>>> data
                     ret    me   me_nyse
date       permno
2023-03-31 10000   0.010   100       NaN
2023-04-30 10000  -0.030   120       NaN
2023-03-31 20000   0.050  5000  5000.000
2023-04-30 20000  -0.010  4500  4500.000
2023-03-31 30000   0.020  2000  2000.000
2023-04-30 30000   0.030  2100  2100.000
2023-03-31 40000   0.030   300   300.000
2023-04-30 40000  -0.050   305   305.000
2023-03-31 50000   0.110   150   150.000
2023-04-30 50000   0.070   140   140.000
Group data by ‘permno’ and calculate rolling sum of ‘ret’ and ‘me’ using numba_support.roll_sum().
>>> apply_to_groups_jit(data[['ret', 'me']], 'permno', roll_sum, 2, 2)
array([[      nan,       nan],
       [-2.00e-02,  2.20e+02],
       [      nan,       nan],
       [ 4.00e-02,  9.50e+03],
       [      nan,       nan],
       [ 5.00e-02,  4.10e+03],
       [      nan,       nan],
       [-2.00e-02,  6.05e+02],
       [      nan,       nan],
       [ 1.80e-01,  2.90e+02]])
The following are equivalent to the above.
>>> gb = data.groupby('permno')
>>> apply_to_groups_jit(data[['ret', 'me']], gb, roll_sum, 2, 2)
>>> gsize = gb.size().to_numpy()
>>> apply_to_groups_jit(data[['ret', 'me']], gsize, roll_sum, 2, 2)
- pyanomaly.datatools.apply_to_groups_reduce_jit(data, ginfo, function, n_ret, *args, data2=None)
Group data and apply a reduce function to each group (jitted version).
This function is similar to apply_to_groups(). The for loop along the groups is jitted for faster performance. The first call of this function can be slow as jitting takes place when first called. The function should be a jitted reduce function, such as mean or std; the row size of its return should be 1.
- Parameters:
data – DataFrame or ndarray (values of a DataFrame) to be grouped.
ginfo – Grouping information: integer (index level), str (column name), pandas GroupBy object, ndarray of group sizes, or list of group indexes. See apply_to_groups() for more details.
function – Jitted function to apply to groups. Its arguments should be (gbdata, *args) or (gbdata, gbdata2, *args), where gbdata (gbdata2) is a group of data (data2).
n_ret – Number of returns of function. If None, it is assumed to be the same as the column size of data.
*args – Additional arguments of function.
data2 – DataFrame or ndarray (values of a DataFrame) to be grouped. Optional argument when function requires two sets of input data.
- Returns:
Concatenated value of the outputs of function.
Examples
>>> data = pd.DataFrame(
...     {'ret': [0.01, 0.05, 0.02, 0.03, 0.11, -0.03, -0.01, 0.03, -0.05, 0.07],
...      'me': [100, 5000, 2000, 300, 150, 120, 4500, 2100, 305, 140],
...      'me_nyse': [np.nan, 5000, 2000, 300, 150, np.nan, 4500, 2100, 305, 140],
...      },
...     index=pd.MultiIndex.from_product([['2023-03-31', '2023-04-30'], [10000, 20000, 30000, 40000, 50000]],
...                                      names=['date', 'permno'])
... ).sort_index(level=['permno', 'date'])
>>> data
                     ret    me   me_nyse
date       permno
2023-03-31 10000   0.010   100       NaN
2023-04-30 10000  -0.030   120       NaN
2023-03-31 20000   0.050  5000  5000.000
2023-04-30 20000  -0.010  4500  4500.000
2023-03-31 30000   0.020  2000  2000.000
2023-04-30 30000   0.030  2100  2100.000
2023-03-31 40000   0.030   300   300.000
2023-04-30 40000  -0.050   305   305.000
2023-03-31 50000   0.110   150   150.000
2023-04-30 50000   0.070   140   140.000
Group data by ‘date’ and calculate cross-sectional standard deviation of ‘ret’ using numba_support.nanstd().
>>> apply_to_groups_reduce_jit(data['ret'], 'date', nanstd, None)
array([0.03974921, 0.04816638])
The following is equivalent to the above.
>>> gidx = list(data.groupby('date').indices.values())
>>> apply_to_groups_reduce_jit(data['ret'], gidx, nanstd, None)
- pyanomaly.datatools.classify(array, split, ascending=True, ginfo=None, by_array=None)
Classify array.
Classify (group) array into split classes based on its value. Class labels are set to 0, 1, … where 0 corresponds to the lowest (highest) value group if ascending = True (False). If array contains nan, their classes are set to nan.
- Parameters:
array – Nx1 ndarray or Series to be classified.
split – Number of classes (for equal-size quantiles) or list of quantiles, e.g., (0.3, 0.7).
ascending (bool) – Sorting order.
ginfo – Grouping information: integer (index level), str (column name), pandas GroupBy object, ndarray of group sizes, or list of group indexes. If given, array is classified within each group. See apply_to_groups() for more details.
by_array – Array based on which cut points are determined. If None, by_array = array. E.g., array can be set to ME and by_array to NYSE-ME to group firms on size with NYSE-size cut points.
- Returns:
Nx1 ndarray of classes.
Note
If the array has one unique value, the class will be set to 0, and if the array has two unique values (binary variable), the class of the smaller value will be 0 and that of the larger value will be (number of quantiles - 1), when ascending = True. If the number of unique values is greater than 2 and smaller than the number of quantiles, the classes are not deterministic.
Examples
>>> array = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
>>> array
[ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
Classify array into 5 equal-size groups.
>>> classify(array, 5)
[0., 0., 0., 1., 1., 2., 2., 3., 3., 4., 4.]
Classify array into three groups that correspond to the 0-0.3, 0.3-0.7, and 0.7-1.0 quantiles.
>>> classify(array, [0.3, 0.7])
[0., 0., 0., 0., 1., 1., 1., 1., 2., 2., 2.]
Classify array into three groups in descending order.
>>> classify(array, [0.3, 0.7], ascending=False)
[2., 2., 2., 2., 1., 1., 1., 1., 0., 0., 0.]
>>> data = pd.DataFrame(
...     {'ret': [0.01, 0.05, 0.02, 0.03, 0.11, -0.03, -0.01, 0.03, -0.05, 0.07],
...      'me': [100, 5000, 2000, 300, 150, 120, 4500, 2100, 305, 140],
...      'me_nyse': [np.nan, 5000, 2000, 300, 150, np.nan, 4500, 2100, 305, 140],
...      },
...     index=pd.MultiIndex.from_product([['2023-03-31', '2023-04-30'], [10000, 20000, 30000, 40000, 50000]],
...                                      names=['date', 'permno'])
... ).sort_index(level=['permno', 'date'])
>>> data
                     ret    me   me_nyse
date       permno
2023-03-31 10000   0.010   100       NaN
2023-04-30 10000  -0.030   120       NaN
2023-03-31 20000   0.050  5000  5000.000
2023-04-30 20000  -0.010  4500  4500.000
2023-03-31 30000   0.020  2000  2000.000
2023-04-30 30000   0.030  2100  2100.000
2023-03-31 40000   0.030   300   300.000
2023-04-30 40000  -0.050   305   305.000
2023-03-31 50000   0.110   150   150.000
2023-04-30 50000   0.070   140   140.000
Cross-sectionally classify the data into three groups ([0.3, 0.7, 1.0]) on ‘me’ using ‘me_nyse’ cut points.
>>> data['me_cls'] = classify(data['me'], [0.3, 0.7], ginfo='date', by_array=data['me_nyse'])
>>> data
                     ret    me   me_nyse  me_cls
date       permno
2023-03-31 10000   0.010   100       NaN       0
2023-04-30 10000  -0.030   120       NaN       0
2023-03-31 20000   0.050  5000  5000.000       2
2023-04-30 20000  -0.010  4500  4500.000       2
2023-03-31 30000   0.020  2000  2000.000       1
2023-04-30 30000   0.030  2100  2100.000       1
2023-03-31 40000   0.030   300   300.000       1
2023-04-30 40000  -0.050   305   305.000       1
2023-03-31 50000   0.110   150   150.000       0
2023-04-30 50000   0.070   140   140.000       0
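The role of by_array can be illustrated with a small NumPy sketch for a single cross-section: cut points are computed from one array and applied to another. The function classify_sketch is hypothetical, uses linear-interpolated quantiles, and may treat ties and edge cases differently from classify().

```python
import numpy as np


def classify_sketch(values, quantiles, by=None):
    """Assign classes 0, 1, ... using cut points computed from `by`."""
    values = np.asarray(values, dtype=float)
    by = values if by is None else np.asarray(by, dtype=float)
    cuts = np.nanquantile(by, quantiles)  # nan entries of `by` are ignored
    # A value v gets class i when cuts[i-1] < v <= cuts[i].
    classes = np.searchsorted(cuts, values, side="left").astype(float)
    classes[np.isnan(values)] = np.nan
    return classes


# The 2023-03-31 cross-section of the example data.
me = [100, 5000, 2000, 300, 150]
me_nyse = [np.nan, 5000, 2000, 300, 150]
classes = classify_sketch(me, [0.3, 0.7], by=me_nyse)
print(classes)
```

Here the firm with me = 100 has no me_nyse value, yet it is still classified because the cut points come from the other firms.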
- pyanomaly.datatools.compare_data(data1, data2=None, on=None, how='inner', tolerance=0.01, suffixes=('_x', '_y'), returns=False)
Compare two data sets.
This function compares the common columns of data1 and data2. This is similar to data1.compare(data2), but data1 and data2 are not required to have the same index and columns: Data sets are first merged and only common columns are compared. Also, a tolerance can be set to determine whether two values are identical. Whether two columns are identical within the tolerance (‘match’), their correlation (‘corr’), and the number of nans in data1 and data2 (‘nan_x’, ‘nan_y’) are printed.
- Parameters:
data1 – DataFrame for comparison.
data2 – DataFrame for comparison. If None, data1 is assumed to be a merged dataset of data1 and data2. If data1 is a merged dataset, on and how have no effect.
on – A column or a list of columns to merge data sets on. If None, data sets will be merged on index.
how – How to merge: ‘inner’, ‘outer’, ‘left’, or ‘right’. If ‘inner’, only common indexes are compared.
tolerance – Tolerance level to determine equality. Two values, val1 and val2, are considered to be identical if abs((val1 - val2) / val2) < tolerance.
suffixes – suffixes to add to common columns or suffixes used in the merged dataset. suffixes[0]: suffix for data1, suffixes[1]: suffix for data2.
returns – If True, return the comparison results and merged data.
- Returns:
Comparison result. DataFrame with index = compared columns and columns = [‘match’, ‘corr’, ‘nan_x’, ‘nan_y’].
Merged DataFrame of data1 and data2.
Examples
>>> data1 = pd.DataFrame(
...     {'ret': [0.01, 0.05, 0.02, 0.03, 0.11, -0.03, -0.01, 0.03, -0.05, 0.07],
...      'me': [100, 5000, 2000, 300, 150, 120, 4500, 2100, 305, 140],
...      },
...     index=pd.MultiIndex.from_product([['2023-03-31', '2023-04-30'], [10000, 20000, 30000, 40000, 50000]],
...                                      names=['date', 'permno'])
... )
>>> data2 = pd.DataFrame(
...     {'ret': [0.00, np.nan, 0.02, 0.03, 0.11, -0.03, -0.01, 0.03, -0.05, 0.07],
...      },
...     index=pd.MultiIndex.from_product([['2023-03-31', '2023-04-30'], [10000, 20000, 30000, 40000, 50000]],
...                                      names=['date', 'permno'])
... )
>>> compare_data(data1, data2)
column  matched     corr  nan_x  nan_y
ret     0.88889  0.99773      0      1
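The tolerance rule abs((val1 - val2) / val2) < tolerance can be sketched in NumPy as below. The helper match_ratio is hypothetical; it assumes that pairs containing nan are excluded from the ratio and that a division by zero counts as a mismatch, which may differ from how compare_data() handles these cases.

```python
import numpy as np


def match_ratio(x, y, tolerance=0.01):
    """Fraction of non-nan pairs whose relative difference is below tolerance."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    valid = ~(np.isnan(x) | np.isnan(y))
    # Division by zero yields inf or nan; both compare as "not below tolerance".
    with np.errstate(divide="ignore", invalid="ignore"):
        rel_diff = np.abs((x[valid] - y[valid]) / y[valid])
    return float(np.mean(rel_diff < tolerance))


ret1 = [0.01, 0.05, 0.02, 0.03, 0.11, -0.03, -0.01, 0.03, -0.05, 0.07]
ret2 = [0.00, np.nan, 0.02, 0.03, 0.11, -0.03, -0.01, 0.03, -0.05, 0.07]
ratio = match_ratio(ret1, ret2)
print(round(ratio, 5))
```

Under these assumptions, eight of the nine comparable pairs match.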
- pyanomaly.datatools.filter(data, on, limits, ginfo=None, by=None)
Filter data.
Remove rows of data, where data[on] is outside limits.
- Parameters:
data – DataFrame to be filtered.
on – Column of data to filter data on.
limits – A pair of quantiles, e.g., (0.1, 0.1) to remove top and bottom 10%. An element of limits can be set to None for one-sided filtering.
ginfo – Grouping information: integer (index level), str (column name), pandas GroupBy object, ndarray of group sizes, or list of group indexes. If given, data is filtered within each group. See apply_to_groups() for more details.
by – Column of data on which cut points are determined. If None, by = on. E.g., on can be set to ‘me’ and by to ‘nyse_me’ to remove small firms based on NYSE-size cut points.
- Returns:
DataFrame. Filtered data.
Examples
>>> data = pd.DataFrame(
...     {'ret': [0.01, 0.05, 0.02, 0.03, 0.11, -0.03, -0.01, 0.03, -0.05, 0.07],
...      'me': [100, 5000, 2000, 300, 150, 120, 4500, 2100, 305, 140],
...      'me_nyse': [np.nan, 5000, 2000, 300, 150, np.nan, 4500, 2100, 305, 140],
...      },
...     index=pd.MultiIndex.from_product([['2023-03-31', '2023-04-30'], [10000, 20000, 30000, 40000, 50000]],
...                                      names=['date', 'permno'])
... ).sort_index(level=['permno', 'date'])
>>> data
                     ret    me   me_nyse
date       permno
2023-03-31 10000   0.010   100       NaN
2023-04-30 10000  -0.030   120       NaN
2023-03-31 20000   0.050  5000  5000.000
2023-04-30 20000  -0.010  4500  4500.000
2023-03-31 30000   0.020  2000  2000.000
2023-04-30 30000   0.030  2100  2100.000
2023-03-31 40000   0.030   300   300.000
2023-04-30 40000  -0.050   305   305.000
2023-03-31 50000   0.110   150   150.000
2023-04-30 50000   0.070   140   140.000
Remove smallest 20% using ‘me_nyse’ cut points.
>>> filter(data, 'me', [0.2, None], ginfo='date', by='me_nyse') ret me me_nyse date permno 2023-03-31 20000 0.050 5000 5000.000 2023-04-30 20000 -0.010 4500 4500.000 2023-03-31 30000 0.020 2000 2000.000 2023-04-30 30000 0.030 2100 2100.000 2023-03-31 40000 0.030 300 300.000 2023-04-30 40000 -0.050 305 305.000
- pyanomaly.datatools.inspect_data(data, option=['summary'], date_col=None, id_col=None)
Inspect data.
This function inspects panel data, data, and prints the results.
- Parameters:
data – DataFrame. It should have date and id columns or index = date/id.
option – List of items to display. See below for available options.
date_col – Date column. If None, data.index[0] is assumed to be date.
id_col – ID column. If None, data.index[1] is assumed to be id.
Option | Items displayed
‘summary’ | Data shape, number of unique dates, and number of unique ids.
‘id_count’ | Number of ids per date.
‘nans’ | Number of nans and infs per column.
‘stats’ | Descriptive statistics. Same as data.describe().
- pyanomaly.datatools.merge(left, right, on=None, right_on=None, how='left', drop_duplicates='right', suffixes=None, method=None)
Merge two data sets.
This is similar to pd.merge(), but often much faster and less memory-hungry when merging left. Also, the index of left is always retained.
- Parameters:
left – Series or DataFrame. Left data to merge.
right – Series or DataFrame. Right data to merge.
on – (List of) column(s) to merge on. If None, merge on index.
right_on – (List of) column(s) of right to merge on. If None, right_on = on.
how – Merge method: ‘inner’, ‘outer’, ‘left’, or ‘right’.
drop_duplicates – How to handle duplicate columns. ‘left’: drop the left columns (keep right), ‘right’: drop the right columns (keep left), None: keep both. If None, suffixes should be provided.
suffixes – A tuple of suffixes for duplicate columns, e.g., suffixes=(‘_x’, ‘_y’) will add ‘_x’ and ‘_y’ to the left and right duplicate columns, respectively.
method – None or ‘pandas’. None uses an internal merge algorithm for left-merge; ‘pandas’ uses pd.merge() internally. If how is not ‘left’, this option is ignored and pd.merge() is always used.
- Returns:
Merged DataFrame.
Note
The internal algorithm is much faster and more memory-efficient than pd.merge(), especially when how = ‘left’ and the right data does not have many columns. In other cases, it could be slower. Try both method = None and ‘pandas’, and choose the faster one.
Warning
The left and right data could be modified internally.
Examples
>>> data1 = pd.DataFrame(
...     {'ret': [0.01, 0.05, 0.02, 0.03, 0.11, -0.03, -0.01, 0.03, -0.05, 0.07],
...      'me': [100, 5000, 2000, 300, 150, 120, 4500, 2100, 305, 140],
...      },
...     index=pd.MultiIndex.from_product([['2023-03-31', '2023-04-30'], [10000, 20000, 30000, 40000, 50000]],
...                                      names=['date', 'permno'])
... )
>>> data1
                    ret    me
date       permno
2023-03-31 10000  0.010   100
           20000  0.050  5000
           30000  0.020  2000
           40000  0.030   300
           50000  0.110   150
2023-04-30 10000 -0.030   120
           20000 -0.010  4500
           30000  0.030  2100
           40000 -0.050   305
           50000  0.070   140
>>> data2 = pd.DataFrame(
...     {'me': [120, 4500, 2100, 305, 140],
...      'me_nyse': [np.nan, 4500, 2100, 305, 140],
...      },
...     index=pd.MultiIndex.from_product([['2023-04-30'], [10000, 20000, 30000, 40000, 50000]],
...                                      names=['date', 'permno'])
... )
>>> data2
                     me  me_nyse
date       permno
2023-04-30 10000    120      NaN
           20000   4500 4500.000
           30000   2100 2100.000
           40000    305  305.000
           50000    140  140.000
>>> merge(data1, data2)
                    ret    me  me_nyse
date       permno
2023-03-31 10000  0.010   100      NaN
           20000  0.050  5000      NaN
           30000  0.020  2000      NaN
           40000  0.030   300      NaN
           50000  0.110   150      NaN
2023-04-30 10000 -0.030   120      NaN
           20000 -0.010  4500 4500.000
           30000  0.030  2100 2100.000
           40000 -0.050   305  305.000
           50000  0.070   140  140.000
>>> merge(data1, data2, how='inner')
                    ret    me  me_nyse
date       permno
2023-04-30 10000 -0.030   120      NaN
           20000 -0.010  4500 4500.000
           30000  0.030  2100 2100.000
           40000 -0.050   305  305.000
           50000  0.070   140  140.000
>>> merge(data1, data2, drop_duplicates='left')
                    ret       me  me_nyse
date       permno
2023-03-31 10000  0.010      NaN      NaN
           20000  0.050      NaN      NaN
           30000  0.020      NaN      NaN
           40000  0.030      NaN      NaN
           50000  0.110      NaN      NaN
2023-04-30 10000 -0.030  120.000      NaN
           20000 -0.010 4500.000 4500.000
           30000  0.030 2100.000 2100.000
           40000 -0.050  305.000  305.000
           50000  0.070  140.000  140.000
- pyanomaly.datatools.populate(data, freq, method='ffill', limit=None, new_date_idx=None)
Populate data.
Populate data to freq frequency.
- Parameters:
data – DataFrame with index = date/id, sorted on id/date.
freq – Frequency to populate: ANNUAL, QUARTERLY, MONTHLY, or DAILY.
method – Filling method for newly added rows. ‘ffill’: forward fill, None: nan.
limit – Maximum number of rows to forward-fill. Default to None (no fill).
new_date_idx – Name of the new (populated) date index. If None, use the current date index name. If given, the original date index is kept as a column.
- Returns:
Populated data with index = new_date/id.
Examples
>>> data = pd.DataFrame(
...     {'ret': [0.01, 0.05, 0.02, 0.03],
...      'me': [100, 5000, 120, 4500],
...      },
...     index=pd.MultiIndex.from_product([pd.to_datetime(['2023-03-31', '2024-03-31']), [10000, 20000]],
...                                      names=['date', 'permno'])
... )
>>> data
                    ret    me
date       permno
2023-03-31 10000  0.010   100
           20000  0.050  5000
2024-03-31 10000  0.020   120
           20000  0.030  4500
Populate to monthly and forward-fill up to 12 months.
>>> populate(data, MONTHLY, limit=12)
                    ret       me
date       permno
2023-03-31 10000  0.010  100.000
2023-04-30 10000  0.010  100.000
2023-05-31 10000  0.010  100.000
2023-06-30 10000  0.010  100.000
2023-07-31 10000  0.010  100.000
2023-08-31 10000  0.010  100.000
2023-09-30 10000  0.010  100.000
2023-10-31 10000  0.010  100.000
2023-11-30 10000  0.010  100.000
2023-12-31 10000  0.010  100.000
2024-01-31 10000  0.010  100.000
2024-02-29 10000  0.010  100.000
2024-03-31 10000  0.020  120.000
2023-03-31 20000  0.050 5000.000
2023-04-30 20000  0.050 5000.000
2023-05-31 20000  0.050 5000.000
2023-06-30 20000  0.050 5000.000
2023-07-31 20000  0.050 5000.000
2023-08-31 20000  0.050 5000.000
2023-09-30 20000  0.050 5000.000
2023-10-31 20000  0.050 5000.000
2023-11-30 20000  0.050 5000.000
2023-12-31 20000  0.050 5000.000
2024-01-31 20000  0.050 5000.000
2024-02-29 20000  0.050 5000.000
2024-03-31 20000  0.030 4500.000
- pyanomaly.datatools.shift(data, n, cols=None, excl_cols=None)
Shift data.
Shift data by n. If cols is given, only cols columns are shifted. If excl_cols is given, columns excluding excl_cols are shifted. Either cols or excl_cols should be None. The shifted data contains both shifted and not-shifted columns. If data has a MultiIndex of date/id, data is shifted within each id.
- Parameters:
data – Series or DataFrame with index = date or date/id. If index = date/id, data must be sorted on id/date.
n – Shift size
cols – Columns to shift.
excl_cols – Columns to not shift.
- Returns:
Series or DataFrame. Shifted data.
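The within-id shifting described above can be illustrated with plain pandas. This is only a sketch of the documented behavior, not the library's implementation; `shift_sketch` is a hypothetical name, it handles DataFrames only, and it assumes the id is the last index level.

```python
import pandas as pd


def shift_sketch(data, n, cols=None, excl_cols=None):
    """Shift selected columns by n rows, within each id if the index is date/id."""
    if cols is not None and excl_cols is not None:
        raise ValueError('Either cols or excl_cols should be None.')
    if cols is None:
        cols = [c for c in data.columns if excl_cols is None or c not in excl_cols]
    out = data.copy()
    if isinstance(data.index, pd.MultiIndex):
        # Group on the last index level (assumed to be the id) so that values
        # never leak from one id into the next.
        id_level = data.index.nlevels - 1
        out[cols] = data[cols].groupby(level=id_level).shift(n)
    else:
        out[cols] = data[cols].shift(n)
    return out


# Panel with index = date/permno, sorted on permno/date.
idx = pd.MultiIndex.from_product(
    [[10000, 20000], pd.to_datetime(['2023-03-31', '2023-04-30'])],
    names=['permno', 'date']).swaplevel()
data = pd.DataFrame({'ret': [0.01, -0.03, 0.05, -0.01],
                     'me': [100.0, 120.0, 5000.0, 4500.0]}, index=idx)
shifted = shift_sketch(data, 1, cols=['me'])  # lag 'me' by one period, keep 'ret'
```

The first observation of each permno has no lagged value, so its shifted 'me' is nan, while 'ret' is returned unchanged.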
- pyanomaly.datatools.to_month_end(date)
Shift dates to the last dates of the same month.
- Parameters:
date – Datetime Series.
- Returns:
Datetime Series shifted to month end.
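For reference, the same month-end alignment can be expressed with a pandas date offset. This is a sketch of the behavior described above, not the library's code; `to_month_end_sketch` is a hypothetical name.

```python
import pandas as pd


def to_month_end_sketch(date: pd.Series) -> pd.Series:
    # MonthEnd(0) rolls each date forward to the last day of its month and
    # leaves dates that are already month ends unchanged.
    return date + pd.offsets.MonthEnd(0)


dates = pd.Series(pd.to_datetime(['2023-03-15', '2023-03-31', '2024-02-01']))
month_ends = to_month_end_sketch(dates)
```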
- pyanomaly.datatools.trim(array, limits, ginfo=None, by_array=None)
Trim array.
Trim array values that are outside limits. The returned value is a bool array that indicates which elements to keep (True) and which to remove (False).
- Parameters:
array – Nx1 ndarray or Series to be trimmed.
limits – A pair of quantiles, e.g., (0.1, 0.1) to trim top and bottom 10%. An element of limits can be set to None for one-sided trim.
ginfo – Grouping information: integer (index level), str (column name), pandas GroupBy object, ndarray of group sizes, or list of group indexes. If given, array is trimmed within each group. See apply_to_groups() for more details.
by_array – Array based on which cut points are determined. If None, by_array = array. E.g., array can be set to ME and by_array to NYSE-ME to remove small firms based on NYSE-size cut points.
- Returns:
Nx1 bool ndarray. Elements corresponding to trimmed values are set to False.
Examples
>>> array = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
>>> array
[ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
Trim top/bottom 10% values.
>>> trim(array, [0.1, 0.1])
[False, True, True, True, True, True, True, True, True, True, False]
>>> data = pd.DataFrame(
...     {'ret': [0.01, 0.05, 0.02, 0.03, 0.11, -0.03, -0.01, 0.03, -0.05, 0.07],
...      'me': [100, 5000, 2000, 300, 150, 120, 4500, 2100, 305, 140],
...      'me_nyse': [np.nan, 5000, 2000, 300, 150, np.nan, 4500, 2100, 305, 140],
...      },
...     index=pd.MultiIndex.from_product([['2023-03-31', '2023-04-30'], [10000, 20000, 30000, 40000, 50000]],
...                                      names=['date', 'permno'])
... ).sort_index(level=['permno', 'date'])
>>> data
                    ret    me  me_nyse
date       permno
2023-03-31 10000  0.010   100      NaN
2023-04-30 10000 -0.030   120      NaN
2023-03-31 20000  0.050  5000 5000.000
2023-04-30 20000 -0.010  4500 4500.000
2023-03-31 30000  0.020  2000 2000.000
2023-04-30 30000  0.030  2100 2100.000
2023-03-31 40000  0.030   300  300.000
2023-04-30 40000 -0.050   305  305.000
2023-03-31 50000  0.110   150  150.000
2023-04-30 50000  0.070   140  140.000
Trim smallest 20% cross-sectionally using ‘me_nyse’ cut points.
>>> data['trimmed'] = trim(data['me'], [0.2, None], ginfo='date', by_array=data['me_nyse'])
>>> data
                    ret    me  me_nyse  trimmed
date       permno
2023-03-31 10000  0.010   100      NaN    False
2023-04-30 10000 -0.030   120      NaN    False
2023-03-31 20000  0.050  5000 5000.000     True
2023-04-30 20000 -0.010  4500 4500.000     True
2023-03-31 30000  0.020  2000 2000.000     True
2023-04-30 30000  0.030  2100 2100.000     True
2023-03-31 40000  0.030   300  300.000     True
2023-04-30 40000 -0.050   305  305.000     True
2023-03-31 50000  0.110   150  150.000    False
2023-04-30 50000  0.070   140  140.000    False
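The ungrouped case can be approximated with NumPy quantiles. The sketch below is not the library's implementation: `trim_sketch` is a hypothetical name, grouping and by_array are omitted, and the quantile convention (NumPy's default linear interpolation) is an assumption that may differ from the one PyAnomaly uses.

```python
import numpy as np


def trim_sketch(array, limits):
    """Return a bool mask: True for values to keep, False for trimmed values."""
    array = np.asarray(array, dtype=float)
    lower, upper = limits
    keep = np.ones(array.shape, dtype=bool)
    if lower is not None:
        # Drop values below the lower cut point.
        keep &= array >= np.nanquantile(array, lower)
    if upper is not None:
        # Drop values above the upper cut point.
        keep &= array <= np.nanquantile(array, 1 - upper)
    return keep


mask = trim_sketch(np.arange(1, 12), [0.1, 0.1])  # only 1 and 11 are trimmed
```

Applying the mask with `array[mask]` then gives the trimmed array; keeping the mask separate makes it easy to store it as a column, as in the example above.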
- pyanomaly.datatools.winsorize(array, limits, ginfo=None, by_array=None)
Winsorize array.
Winsorize array values that are outside limits.
- Parameters:
array – Nx1 ndarray or Series to be winsorized.
limits – A pair of quantiles, e.g., (0.1, 0.1) to winsorize top and bottom 10%. An element of limits can be set to None for one-sided winsorization.
ginfo – Grouping information: integer (index level), str (column name), pandas GroupBy object, ndarray of group sizes, or list of group indexes. If given, array is winsorized within each group. See apply_to_groups() for more details.
by_array – Array based on which cut points are determined. If None, by_array = array. E.g., array can be set to ME and by_array to NYSE-ME to winsorize large firms’ weights based on NYSE-size cut points.
- Returns:
Nx1 ndarray of winsorized values.
Examples
>>> array = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
>>> array
[ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
Winsorize top/bottom 10% values.
>>> winsorize(array, [0.1, 0.1])
[ 2, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10]
>>> data = pd.DataFrame(
...     {'ret': [0.01, 0.05, 0.02, 0.03, 0.11, -0.03, -0.01, 0.03, -0.05, 0.07],
...      'me': [100, 5000, 2000, 300, 150, 120, 4500, 2100, 305, 140],
...      'me_nyse': [np.nan, 5000, 2000, 300, 150, np.nan, 4500, 2100, 305, 140],
...      },
...     index=pd.MultiIndex.from_product([['2023-03-31', '2023-04-30'], [10000, 20000, 30000, 40000, 50000]],
...                                      names=['date', 'permno'])
... ).sort_index(level=['permno', 'date'])
>>> data
                    ret    me  me_nyse
date       permno
2023-03-31 10000  0.010   100      NaN
2023-04-30 10000 -0.030   120      NaN
2023-03-31 20000  0.050  5000 5000.000
2023-04-30 20000 -0.010  4500 4500.000
2023-03-31 30000  0.020  2000 2000.000
2023-04-30 30000  0.030  2100 2100.000
2023-03-31 40000  0.030   300  300.000
2023-04-30 40000 -0.050   305  305.000
2023-03-31 50000  0.110   150  150.000
2023-04-30 50000  0.070   140  140.000
Winsorize top 10% returns cross-sectionally.
>>> data['ret_winsorized'] = winsorize(data['ret'], [None, 0.1], ginfo='date')
>>> data[['ret', 'ret_winsorized']].sort_index()
                    ret  ret_winsorized
date       permno
2023-03-31 10000  0.010           0.010
           20000  0.050           0.050
           30000  0.020           0.020
           40000  0.030           0.030
           50000  0.110           0.050
2023-04-30 10000 -0.030          -0.030
           20000 -0.010          -0.010
           30000  0.030           0.030
           40000 -0.050          -0.050
           50000  0.070           0.030
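The ungrouped case amounts to clipping at two quantile cut points. The sketch below is an illustration under assumptions, not the library's code: `winsorize_sketch` is a hypothetical name, grouping and by_array are omitted, and NumPy's default linear-interpolation quantile is used, which may not reproduce the grouped output above exactly.

```python
import numpy as np


def winsorize_sketch(array, limits):
    """Clip values outside the quantile cut points given by limits."""
    array = np.asarray(array, dtype=float)
    lower, upper = limits
    out = array.copy()
    if lower is not None:
        # Raise values below the lower cut point to the cut point.
        out = np.maximum(out, np.nanquantile(array, lower))
    if upper is not None:
        # Lower values above the upper cut point to the cut point.
        out = np.minimum(out, np.nanquantile(array, 1 - upper))
    return out


two_sided = winsorize_sketch(np.arange(1, 12), [0.1, 0.1])
top_only = winsorize_sketch(np.arange(1, 12), [None, 0.1])
```

Unlike trimming, winsorizing keeps every observation and only replaces extreme values with the cut points.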
factors
This module defines functions to generate factor portfolios and characteristic portfolios.
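prepare_data_for_factors | Prepare data for factor portfolio generation.
make_factor_portfolio | Make a factor portfolio.
make_factor_portfolios | Make factor portfolios.
make_all_factor_portfolios | Make all factor portfolios.
make_char_portfolios | Make characteristic portfolios.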
- pyanomaly.factors.make_all_factor_portfolios(monthly=True, daily=True, sdate=None)
Make all factor portfolios.
Currently, this function generates the following factors:
Fama-French 3 factors: mktrf, smb_ff, hml
Fama-French 5 factors: mktrf, smb_ff5, hml, rmw, cma
Hou-Xue-Zhang 4 factors: mktrf, smb_hxz, inv, roe
Stambaugh-Yuan 4 factors: mktrf, smb_sy, mgmt, perf
The DataFrame of factors is saved to config.factors_monthly(daily)_fname in config.output_dir.
- Parameters:
monthly – If True, generate monthly factors.
daily – If True, generate daily factors.
sdate – Start date (‘yyyy-mm-dd’).
- pyanomaly.factors.make_char_portfolios(panel, char_list, weighting)
Make characteristic portfolios.
Make characteristic portfolios using the method of JKP.
Split stocks into terciles (1:1:1) based on a firm characteristic. Use only NYSE stocks (excluding bottom 20%) to determine the cut points.
Make a characteristic portfolio: First Quantile - Last Quantile.
- Parameters:
panel – FCPanel that contains firm characteristics in char_list. It should also have ‘exret’, ‘me’, ‘primary’, and ‘exchcd’ columns.
char_list – List of firm characteristics to generate.
weighting – ‘ew’ (equal-weight), ‘vw’ (value-weight), or ‘vw_cap’ (value-weight capped at 0.8 NYSE-size quantile).
- Returns:
Characteristic portfolio DataFrame with index = ‘date’ and columns = [‘group’, ‘char’, ‘ret’, ‘signal’, ‘n_firms’].
group: ‘h’, ‘m’, ‘l’, or ‘hml’.
char: characteristic name
ret: characteristic portfolio return
signal: average characteristic value
n_firms: number of firms
- pyanomaly.factors.make_factor_portfolio(panel, ret_col, char, char_split=(0.3, 0.7), nyse=True, ascending=False, size_class=None, weight_col=None)
Make a factor portfolio.
The procedure is as follows.
Split stocks into terciles based on the values of char column.
If nyse = True, cut points are determined by char of NYSE stocks.
If ascending is True, the first quantile contains stocks with the lowest char values.
Make the factor portfolio.
If size_class is given,
Factor portfolio (hml) = 1/2(Small High + Big High) − 1/2(Small Low + Big Low)
Size portfolio (smb) = 1/3(Small High + Small Mid + Small Low) - 1/3(Big High + Big Mid + Big Low)
Otherwise,
Factor portfolio (hml) = High - Low.
High (Low) is the first (last) quantile, i.e., the factor portfolio is (high - low) when ascending = False and (low - high) when ascending = True.
If weight_col is given, the factor portfolio is a weight_col-weighted portfolio; otherwise, it is an equal-weight portfolio.
- Parameters:
panel – FCPanel that contains data for factor generation. It should have char, ret_col, size_class, and weight_col (optional) columns.
ret_col – Return column.
char – Firm characteristic column to make a factor portfolio from.
char_split – Tuple of splits for terciles. (0.3, 0.7) means 3:4:3 split.
nyse – If True, cut points are determined by char of NYSE stocks.
ascending – If True, the first quantile contains stocks with the lowest char values.
size_class – Size class column. If given, a factor portfolio is constructed in each size group and the factor portfolio is the average of them.
weight_col – Weight column. If None, stocks are equally weighted.
- Returns:
DataFrame of the factor and its ingredient portfolios. Index = ‘date’.
If size_class is given,
columns: [‘sh’, ‘sm’, ‘sl’, ‘bh’, ‘bm’, ‘bl’, ‘hml’, ‘smb’].
Otherwise,
columns: [‘h’, ‘m’, ‘l’, ‘hml’].
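The two size-sorted formulas in the procedure above can be checked with a few lines of arithmetic. The function name and the return values below are made up for illustration; only the formulas come from the description.

```python
def combine_size_sorted_portfolios(sh, sm, sl, bh, bm, bl):
    """Combine six size/characteristic portfolio returns into hml and smb."""
    # hml = 1/2 (Small High + Big High) - 1/2 (Small Low + Big Low)
    hml = 0.5 * (sh + bh) - 0.5 * (sl + bl)
    # smb = 1/3 (Small High + Small Mid + Small Low) - 1/3 (Big High + Big Mid + Big Low)
    smb = (sh + sm + sl) / 3 - (bh + bm + bl) / 3
    return hml, smb


# Hypothetical one-period returns of the six ingredient portfolios.
hml, smb = combine_size_sorted_portfolios(sh=0.03, sm=0.02, sl=0.01,
                                          bh=0.02, bm=0.01, bl=0.00)
```

With these inputs, hml is 0.025 - 0.005 = 0.02 and smb is 0.02 - 0.01 = 0.01. The same function works element-wise on pandas Series of returns indexed by date.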
- pyanomaly.factors.make_factor_portfolios(panel, factor_groups)
Make factor portfolios.
Generate factor portfolios defined in factor_groups, which is a “factor group” or a list of them.
Factor group
A “factor group” is a dictionary that defines a factor model and has the following structure.
{
    'factor_name': {
        'char' (str): Firm characteristic to use to make the factor.
        'ascending' (bool): If True (False), the factor portfolio is low-high (high-low). Default to False.
        'char_split' (tuple): How to split stocks into three groups. Default to (0.3, 0.7).
    }
}
Example (Fama-French 5 factors):
ff5 = {
    'smb_ff5': {},
    'hml': dict(char='be_me'),
    'rmw': dict(char='ope_be'),
    'cma': dict(char='at_gr1', ascending=True),
}
Note that
Any firm characteristic defined in FUNDA, FUNDQ, CRSPM, CRSPD, and Merge can be used to generate a factor portfolio.
Size factor should have an empty dict.
Default value items can be omitted.
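To make the role of the default items concrete, the sketch below expands a factor group into its fully specified form. `FACTOR_DEFAULTS` and `with_defaults` are hypothetical helpers written for this illustration; the library may resolve defaults differently.

```python
# Defaults stated in the factor group description above.
FACTOR_DEFAULTS = {'ascending': False, 'char_split': (0.3, 0.7)}


def with_defaults(factor_group):
    """Fill in omitted default items; the size factor keeps its empty dict."""
    return {
        name: ({**FACTOR_DEFAULTS, **spec} if spec else {})
        for name, spec in factor_group.items()
    }


ff5 = {
    'smb_ff5': {},
    'hml': dict(char='be_me'),
    'rmw': dict(char='ope_be'),
    'cma': dict(char='at_gr1', ascending=True),
}
ff5_full = with_defaults(ff5)
```

Here `ff5_full['hml']` carries ascending=False and char_split=(0.3, 0.7), while `ff5_full['cma']` keeps its explicit ascending=True.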
Procedure
Split stocks into two size groups (50:50) using NYSE-size cut points.
Make market factor portfolio: weighted mean excess returns of all stocks.
Make factor portfolios. See make_factor_portfolio().
Make size factor portfolio: average of the size factor portfolios (smb) from each factor.
- Parameters:
panel – FCPanel that contains data for factor generation, generated by prepare_data_for_factors().
factor_groups – (List of) factor group(s).
- Returns:
Factor portfolio dataframe with index = ‘date’ and columns = [‘mktrf’, ‘rf’] + factor names in factor_groups.
- pyanomaly.factors.prepare_data_for_factors(chars, monthly=True, daily=True, sdate=None)
Prepare data for factor portfolio generation.
Generate firm characteristics needed to make factors. The output data are used as input for make_factor_portfolios().
- Parameters:
chars – List of firm characteristics to generate.
monthly – If True, generate firm characteristics monthly.
daily – If True, generate firm characteristics daily.
sdate – Start date (‘yyyy-mm-dd’).
- Returns:
mdata. FCPanel of monthly data. None if monthly = False.
ddata. FCPanel of daily data. None if daily = False.
ff
This module defines functions to generate Fama-French factors. This module is for validation only.
The Fama-French factors used in this library are generated by factors.make_all_factor_portfolios().
make_ff_factors | Generate Fama-French 3 factors.
make_ff_factors_wrds | Generate Fama-French 3 factors (copy of the WRDS code).
- pyanomaly.ff.make_ff_factors()
Generate Fama-French 3 factors.
This function refers to the WRDS code, but the results are slightly different as the code is written under the architecture of PyAnomaly. Compared to the data from the K. French website, HML has a correlation of 0.967, and SMB has a correlation of 0.989. Compared to the WRDS code, HML has a correlation of 0.991, and SMB has a correlation of 0.993.
Major differences from WRDS code:
Primary stock identification: for our method, refer to WRDS.add_gvkey_to_crsp().
Delist return: for our method, refer to WRDS.merge_sf_with_seall().
- Returns:
Factors. DataFrame with index = ‘date’, columns = [‘bh’, ‘bm’, ‘bl’, ‘sh’, ‘sm’, ‘sl’, ‘hml’, ‘smb’].
Number of firms. DataFrame with index = ‘date’, columns = [‘bh’, ‘bm’, ‘bl’, ‘sh’, ‘sm’, ‘sl’, ‘hml’, ‘smb’].
References
- pyanomaly.ff.make_ff_factors_wrds()
Generate Fama-French 3 factors (copy of the WRDS code).
- Returns:
Factors. DataFrame with index = ‘date’, columns = [‘bh’, ‘bm’, ‘bl’, ‘sh’, ‘sm’, ‘sl’, ‘hml’, ‘smb’].
Number of firms. DataFrame with index = ‘date’, columns = [‘bh’, ‘bm’, ‘bl’, ‘sh’, ‘sm’, ‘sl’, ‘hml’, ‘smb’].
References
fileio
This module defines functions for file IO.
write_to_file | Write data to a file.
read_from_file | Read data from a file.
- pyanomaly.fileio.read_from_file(fname, fdir=None, typecast=True)
Read data from a file.
- Parameters:
fname – File name without extension.
fdir – Directory. If None, config.output_dir.
typecast – If True, cast float to config.float_type and object to string after reading from a file.
- Returns:
DataFrame read from fdir/fname.
- pyanomaly.fileio.write_to_file(data, fname, fdir=None, typecast=True)
Write data to a file.
The data is saved to fdir/fname. The file format is determined by config.file_format.
- Parameters:
data – DataFrame to save.
fname – File name without extension.
fdir – Directory. If None, config.output_dir.
typecast – If True, cast float to config.float_type and object to string before writing to a file.
globals
This module defines global constants. Importing this module will also import frequently used modules:
config, log, and util.
Constants
- Weight_scheme
VW: value-weight
EW: equal-weight
- Data frequency
ANNUAL
QUARTERLY
MONTHLY
DAILY
log
This module defines logging functions.
set_log_path | Set log file path.
log | Write log message.
err | Write error message.
warn | Write warning message.
debug | Write debugging message.
drawline | Draw line in the log file.
start_timer | Start timer.
elapsed_time | Get elapsed time.
- pyanomaly.log.debug(msg='')
Write debugging message.
- Parameters:
msg – Debugging message.
- pyanomaly.log.drawline(level=0, width=80)
Draw line in the log file.
- Parameters:
level – Shape of line. 0: ‘#’, 1: ‘*’, 2: ‘-’
width – Line width.
- pyanomaly.log.elapsed_time(msg='', headmsg=None)
Get elapsed time.
The total elapsed time (‘hh:mm:ss’) since the timer started and the elapsed time (in seconds) since the last call of this function are printed. If start_timer() was never called before, this function starts the timer.
- Parameters:
msg – Message to print.
headmsg – Message header.
Examples
>>> start_timer('start')
[2024/02/06 01:14] Timer On: start
>>> elapsed_time('check1')
[2024/02/06 01:15] Elapsed [0:00:35.564, 35.564]: check1
>>> elapsed_time('check2', 'parallel1')
[2024/02/06 01:15: parallel1] Elapsed [0:00:53.802, 18.238]: check2
- pyanomaly.log.err(msg)
Write error message.
- Parameters:
msg – Error message.
- pyanomaly.log.log(msg, headmsg=None, header=True)
Write log message.
The format is [yyyy/mm/dd hh:mm: headmsg] msg.
- Parameters:
msg – Log message.
headmsg – Message header.
header – If True, write the header.
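The header format can be reproduced with the standard library. The helper below is a hypothetical sketch that only builds the string; it does not write to the log file, and the `now` argument is added here purely so the output is reproducible.

```python
from datetime import datetime


def format_log_line(msg, headmsg=None, header=True, now=None):
    """Build a line of the form '[yyyy/mm/dd hh:mm: headmsg] msg'."""
    if not header:
        return msg
    now = now or datetime.now()
    stamp = now.strftime('%Y/%m/%d %H:%M')
    # The ': headmsg' part is only present when a message header is given.
    head = f'[{stamp}: {headmsg}]' if headmsg else f'[{stamp}]'
    return f'{head} {msg}'


fixed_time = datetime(2024, 2, 6, 1, 14)
plain_line = format_log_line('Timer On: start', now=fixed_time)
headed_line = format_log_line('done', headmsg='parallel1', now=fixed_time)
```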
- pyanomaly.log.set_log_path(fpath=None, append=True)
Set log file path.
- Parameters:
fpath – Log file name or path or __file__. If it’s a name, the path becomes config.log_dir + fpath. If __file__, the name of the module calling this function is retrieved from __file__ and the path is set to config.log_dir + module name. If None, the path is config.log_dir + ‘log.log’.
append – If True, append to the existing log file. Otherwise, delete the current log file.
Examples
>>> set_log_path('./log/example.log')  # Full file path.
>>> set_log_path('example.log')  # File name. Path = config.log_dir + 'example.log'
>>> set_log_path(__file__)  # Make path from a module name. Path = config.log_dir + module name
- pyanomaly.log.start_timer(msg='', headmsg=None)
Start timer.
- Parameters:
msg – Message to print.
headmsg – Message header.
- pyanomaly.log.warn(msg)
Write warning message.
- Parameters:
msg – Warning message.
numba_support
This module defines jitted functions.
nansum | Sum excluding nan.
nanmean | Mean excluding nan.
nanvar | Variance excluding nan.
nanstd | Standard deviation excluding nan.
nanskew | Skewness excluding nan.
shift | Shift.
roll_sum | Rolling sum.
roll_mean | Rolling mean.
roll_var | Rolling variance.
roll_std | Rolling standard deviation.
rank | Rank.
set_to_nan | Set rows to nan.
isnan1 | Check nan along columns.
add_constant | Add a constant column to a matrix.
bivariate_regression | Bivariate regression.
regression | Multivariate regression.
rolling_regression | Rolling regression.
- pyanomaly.numba_support.add_constant(x)
Add a constant column to a matrix.
A vector of 1’s is prepended to x.
- Parameters:
x – N x K ndarray.
- Returns:
N x (K+1) ndarray (x with a vector of 1’s prepended).
Examples
>>> x = np.array([[1, 2], [3, 4]])
>>> x
[[1 2]
 [3 4]]
>>> add_constant(x)
[[1 1 2]
 [1 3 4]]
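An equivalent operation in plain NumPy looks as follows. This is a non-jitted sketch for illustration (`add_constant_sketch` is a hypothetical name), not the library's Numba implementation.

```python
import numpy as np


def add_constant_sketch(x):
    """Prepend a column of ones to an N x K array."""
    x = np.asarray(x)
    ones = np.ones(x.shape[0], dtype=x.dtype)
    return np.column_stack((ones, x))


x = np.array([[1, 2], [3, 4]])
x_const = add_constant_sketch(x)
```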
- pyanomaly.numba_support.bivariate_regression(y, x)
Bivariate regression.
A constant is added internally.
- Parameters:
y – Nx1 ndarray. Dependent variable.
x – Nx1 ndarray. Independent variable.
- Returns:
Coefficients. 2x1 ndarray of [constant, beta].
R-squared.
Residuals. Nx1 ndarray.
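The three return values can be illustrated with the closed-form OLS solution for one regressor. The sketch below uses plain NumPy and a centered R-squared; both the function name and the R-squared convention are assumptions, since the source does not state how R-squared is defined.

```python
import numpy as np


def bivariate_regression_sketch(y, x):
    """Regress y on a constant and x; return coefficients, R-squared, residuals."""
    y = np.asarray(y, dtype=float).ravel()
    x = np.asarray(x, dtype=float).ravel()
    x_dev = x - x.mean()
    y_dev = y - y.mean()
    beta = (x_dev @ y_dev) / (x_dev @ x_dev)   # slope = cov(x, y) / var(x)
    const = y.mean() - beta * x.mean()         # intercept
    resid = y - const - beta * x
    r2 = 1.0 - (resid @ resid) / (y_dev @ y_dev)
    return np.array([const, beta]), r2, resid


x_obs = np.array([0.0, 1.0, 2.0, 3.0])
y_obs = 1.0 + 2.0 * x_obs                      # exact line: constant 1, beta 2
coef, r2, resid = bivariate_regression_sketch(y_obs, x_obs)
```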
- pyanomaly.numba_support.isnan1(x)
Check nan along columns.
- Parameters:
x – NxK ndarray.
- Returns:
Nx1 bool ndarray. True if any value in the corresponding row of x is nan, False otherwise.
Examples
>>> x = np.array([[np.nan, 1, 2], [1, 2, 3]])
>>> x
[[nan 1. 2.]
 [ 1. 2. 3.]]
>>> isnan1(x)
[ True, False]
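The same row-wise check can be written in one line of plain NumPy, shown here as a non-jitted sketch (`isnan1_sketch` is a hypothetical name).

```python
import numpy as np


def isnan1_sketch(x):
    """Return True for each row of x that contains at least one nan."""
    return np.isnan(x).any(axis=1)


x = np.array([[np.nan, 1, 2], [1, 2, 3]])
row_has_nan = isnan1_sketch(x)
```

A mask like this is handy for dropping incomplete observations before a regression, e.g. `x[~row_has_nan]`.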
- pyanomaly.numba_support.nanmean(x)
Mean excluding nan.
- Parameters:
x – 1D ndarray.
- Returns:
Mean of x excluding nan.
- pyanomaly.numba_support.nanskew(x, dof=1)
Skewness excluding nan.
- Parameters:
x – 1D ndarray.
dof – Degrees of freedom. 0 (1) for biased (unbiased) estimate. Default to 1.
- Returns:
Skewness of x excluding nan.
- pyanomaly.numba_support.nanstd(x, dof=1)
Standard deviation excluding nan.
- Parameters:
x – 1D ndarray.
dof – Degrees of freedom. 0 (1) for biased (unbiased) estimate. Default to 1.
- Returns:
Standard deviation of x excluding nan.
- pyanomaly.numba_support.nansum(x)
Sum excluding nan.
- Parameters:
x – 1D ndarray.
- Returns:
Sum of x excluding nan.
- pyanomaly.numba_support.nanvar(x, dof=1)
Variance excluding nan.
- Parameters:
x – 1D ndarray.
dof – Degrees of freedom. 0 (1) for biased (unbiased) estimate. Default to 1.
- Returns:
Variance of x excluding nan.
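The dof argument of nanvar() and nanstd() reads like the ddof argument of NumPy's nan-aware functions. Assuming that correspondence holds, the numbers below show the difference between the biased and unbiased estimates; this is a NumPy illustration, not a call into the library.

```python
import numpy as np

x = np.array([1.0, 2.0, np.nan, 3.0])

# Three valid observations with mean 2 and squared deviations 1, 0, 1.
var_unbiased = np.nanvar(x, ddof=1)   # 2 / (3 - 1)
var_biased = np.nanvar(x, ddof=0)     # 2 / 3
std_unbiased = np.nanstd(x, ddof=1)   # square root of the unbiased variance
```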
- pyanomaly.numba_support.rank(x, ascending=True, pct=False)
Rank.
Calculate ranks of the elements of x within each column. Nan values are excluded.
- Parameters:
x – Ndarray.
ascending – If True, rank increases with value starting from 1.
pct – If True, percentile ranks are returned.
- Returns:
(Percentile) ranks of x. Ndarray of the same size as x.
Examples
>>> x = np.array([[1, 2, 0, np.nan, 3, -1], [1, 2, 3, 4, 5, 6]]).T
>>> x
array([[ 1.,  1.],
       [ 2.,  2.],
       [ 0.,  3.],
       [nan,  4.],
       [ 3.,  5.],
       [-1.,  6.]])
>>> rank(x)
array([[ 3.,  1.],
       [ 4.,  2.],
       [ 2.,  3.],
       [nan,  4.],
       [ 5.,  5.],
       [ 1.,  6.]])
>>> rank(x, pct=True)
array([[0.6       , 0.16666667],
       [0.8       , 0.33333333],
       [0.4       , 0.5       ],
       [       nan, 0.66666667],
       [1.        , 0.83333333],
       [0.2       , 1.        ]])
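The column-wise ranking with nan exclusion can be sketched in plain NumPy as below. `rank_sketch` is a hypothetical, non-jitted illustration; in particular, ties are broken by position here, which is an assumption because the source does not say how the library treats ties.

```python
import numpy as np


def rank_sketch(x, ascending=True, pct=False):
    """Rank the elements of x within each column, leaving nan as nan."""
    x = np.asarray(x, dtype=float)
    out = np.full(x.shape, np.nan)
    cols = x.reshape(len(x), -1)        # treat 1D input as a single column
    out_cols = out.reshape(len(x), -1)  # view on out, so writes go through
    for j in range(cols.shape[1]):
        valid = ~np.isnan(cols[:, j])
        vals = cols[valid, j]
        order = np.argsort(vals if ascending else -vals, kind='stable')
        ranks = np.empty(len(vals))
        ranks[order] = np.arange(1, len(vals) + 1)  # ranks start from 1
        if pct:
            ranks /= len(vals)          # percentile rank among valid values
        out_cols[valid, j] = ranks
    return out


x = np.array([[1, 2, 0, np.nan, 3, -1], [1, 2, 3, 4, 5, 6]]).T
ranks = rank_sketch(x)
pct_ranks = rank_sketch(x, pct=True)
```

For the example array above, the sketch reproduces the documented output: the first column ranks as 3, 4, 2, nan, 5, 1.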
- pyanomaly.numba_support.regression(y, X)
Multivariate regression.
- Parameters:
y – Nx1 ndarray. Dependent variable.
X – NxK ndarray. Independent variables (including constant).
- Returns:
Coefficients. Kx1 ndarray.
R-squared.
Residuals. Nx1 ndarray.
- pyanomaly.numba_support.roll_mean(x, n, min_n=-1)
Rolling mean.
The x is rolled along the first axis with the window size n, and the mean of each window is calculated. Nan values are excluded: The result will be nan if the number of not-nan values is smaller than min_n.
- Parameters:
x – Ndarray.
n – Window size.
min_n – Minimum number of observations. Default to n.
- Returns:
Rolling mean. Ndarray of the same size as x.
- pyanomaly.numba_support.roll_std(x, n, min_n=-1, dof=1)
Rolling standard deviation.
The x is rolled along the first axis with the window size n, and the standard deviation of each window is calculated. Nan values are excluded: The result will be nan if the number of not-nan values is smaller than min_n.
- Parameters:
x – Ndarray.
n – Window size.
min_n – Minimum number of observations. Default to n.
dof – Degrees of freedom. 0 (1) for biased (unbiased) estimate. Default to 1.
- Returns:
Rolling standard deviation. Ndarray of the same size as x.
- pyanomaly.numba_support.roll_sum(x, n, min_n=-1)
Rolling sum.
The x is rolled along the first axis with the window size n, and the sum of each window is calculated. Nan values are excluded: The result will be nan if the number of not-nan values is smaller than min_n.
- Parameters:
x – Ndarray.
n – Window size.
min_n – Minimum number of observations. Default to n.
- Returns:
Rolling sum. Ndarray of the same size as x.
Examples
>>> x = np.array([[1, 2, 0, np.nan, 3, -1], [1, 2, 3, 4, 5, 6]]).T
>>> x
array([[ 1.,  1.],
       [ 2.,  2.],
       [ 0.,  3.],
       [nan,  4.],
       [ 3.,  5.],
       [-1.,  6.]])
>>> roll_sum(x, 3)
array([[nan, nan],
       [nan, nan],
       [ 3.,  6.],
       [nan,  9.],
       [nan, 12.],
       [nan, 15.]])
>>> roll_sum(x, 3, 2)
array([[nan, nan],
       [ 3.,  3.],
       [ 3.,  6.],
       [ 2.,  9.],
       [ 3., 12.],
       [ 2., 15.]])
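The interaction between the window size n and min_n can be seen in a simple loop-based version. The sketch below is a plain NumPy illustration (`roll_sum_sketch` is a hypothetical name) and makes no attempt to match the speed of the jitted function.

```python
import numpy as np


def roll_sum_sketch(x, n, min_n=-1):
    """Rolling sum along the first axis, ignoring nan within each window."""
    x = np.asarray(x, dtype=float)
    if min_n < 0:
        min_n = n                                    # default: require a full window
    out = np.full(x.shape, np.nan)
    for i in range(x.shape[0]):
        window = x[max(0, i - n + 1): i + 1]         # up to n rows ending at row i
        count = (~np.isnan(window)).sum(axis=0)      # valid observations per column
        total = np.nansum(window, axis=0)
        out[i] = np.where(count >= min_n, total, np.nan)
    return out


x = np.array([[1, 2, 0, np.nan, 3, -1], [1, 2, 3, 4, 5, 6]]).T
full_window = roll_sum_sketch(x, 3)
min_two = roll_sum_sketch(x, 3, 2)
```

With min_n = 2, a window that contains one nan still produces a sum, which is why the first column of the second documented output has values where the first has nan.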
- pyanomaly.numba_support.roll_var(x, n, min_n=-1, dof=1)
Rolling variance.
The x is rolled along the first axis with the window size n, and the variance of each window is calculated. Nan values are excluded: The result will be nan if the number of not-nan values is smaller than min_n.
- Parameters:
x – Ndarray.
n – Window size.
min_n – Minimum number of observations. Default to n.
dof – Degrees of freedom. 0 (1) for biased (unbiased) estimate. Default to 1.
- Returns:
Rolling variance. Ndarray of the same size as x.
- pyanomaly.numba_support.rolling_regression(data, n, min_n=-1, add_const=True)
Rolling regression.
The data is rolled along the first axis with the window size n, and the regression is conducted for each window. Nan values are excluded: The result will be nan if the number of not-nan values is smaller than min_n.
- Parameters:
data – NxK ndarray. The first column is the dependent variable and the rest are the independent variables.
n – Window size.
min_n – Minimum number of observations. Default to n.
add_const – If True, add a constant to the independent variables.
- Returns:
Nx(K’+2) ndarray. K’ = K if add_const is True, else K’ = K-1.
First K’ columns: Coefficients.
K’+1-th column: R-squared.
K’+2-th column: Idiosyncratic volatility (standard deviation of residuals).
- pyanomaly.numba_support.set_to_nan(x, n)
Set rows to nan.
Set the first n (last -n if n < 0) rows of x to nan. The x is changed in-place.
- Parameters:
x – Ndarray.
n – Number of rows to set to nan.
- pyanomaly.numba_support.shift(x, n)
Shift.
x is shifted by n along the first axis. A negative n shifts x backward.
- Parameters:
x – Ndarray.
n – Shift size.
- Returns:
Shifted x. Ndarray of the same size as x.
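A plain NumPy version of such a shift might look like the sketch below. Two points are assumptions not confirmed by the source: that a positive n moves values toward later rows (as in pandas), and that vacated rows are filled with nan. `shift_rows_sketch` is a hypothetical name.

```python
import numpy as np


def shift_rows_sketch(x, n):
    """Shift x by n rows along the first axis, filling vacated rows with nan."""
    x = np.asarray(x, dtype=float)
    out = np.full(x.shape, np.nan)
    if n > 0:
        out[n:] = x[:-n]      # values move toward later rows
    elif n < 0:
        out[:n] = x[-n:]      # values move toward earlier rows
    else:
        out[:] = x
    return out


x = np.array([1.0, 2.0, 3.0, 4.0])
lagged = shift_rows_sketch(x, 1)
led = shift_rows_sketch(x, -1)
```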
panel
This module defines classes for panel data analysis.
Panel | Base class for panel data analysis.
FCPanel | Base class for firm characteristic generation.
- class pyanomaly.panel.FCPanel(alias=None, data=None, freq=12, base_freq=None)
Bases: Panel
Base class for firm characteristic generation.
For each firm characteristic, there should be one method to generate it and the method name should start with ‘c_’. Generated firm characteristics are added to the data attribute with column names equal to their method names (without ‘c_’).
- Parameters:
alias (str, list, or dict) –
Firm characteristics to generate and their aliases. If None, all available firm characteristics are generated.
str: A column name in the mapping file (config.mapping_file_path). The firm characteristics defined in this class and in the alias column of the mapping file are generated. See Mapping File.
list: List of firm characteristics (method names).
dict: Dict of firm characteristics and their aliases of the form {method name: alias}.
If aliases are not specified, method names are used as aliases.
data – Raw data DataFrame with index = date/id, sorted on id/date.
freq – Frequency of data. ANNUAL, QUARTERLY, MONTHLY, or DAILY.
base_freq – Frequency of data values. ANNUAL, QUARTERLY, MONTHLY, or DAILY. For example, if funda is populated monthly, freq = MONTHLY and base_freq = ANNUAL. If None, base_freq = freq.
Note
Firm characteristics are stored in the data attribute with column names equal to their method names (without ‘c_’). When an FCPanel is saved to a file using save(), the firm characteristic columns are renamed by the aliases, and when a saved FCPanel is loaded using load(), the firm characteristic columns are renamed by the method names.
Attributes
- data
DataFrame that stores the panel data, with index = date/id, sorted on id/date.
- freq
Frequency of data. ANNUAL, QUARTERLY, MONTHLY, or DAILY.
- base_freq
Frequency of data values. ANNUAL, QUARTERLY, MONTHLY, or DAILY.
- char_map
Dictionary to map firm characteristic generation methods with aliases: char_map['method'] returns the alias.
Methods
Display firm characteristics available in this class.
Get all firm characteristics available in this class.
Get the firm characteristics to generate.
Generate firm characteristics.
Check if a firm characteristic has already been created.
Prepare characteristics that are required to generate a characteristic.
Rename the firm characteristic columns.
Remove raw data.
Save this object to a file.
Load a FCPanel object from a file.
- create_chars(char_list=None)
Generate firm characteristics.
Generated firm characteristics are added to data using their method names as the column names.
- Parameters:
char_list – List or dict of firm characteristics (method names) to generate. If dict, keys should be method names and values aliases. If None, all firm characteristics available in this class and specified by the alias argument are generated.
- created(char)
Check if a firm characteristic has already been created.
- Parameters:
char – Firm characteristic (the method name without ‘c_’) to check.
- Returns:
True if the firm characteristic has been created.
- get_available_chars()
Get all firm characteristics available in this class.
- Returns:
List of characteristics (method names)
- get_char_list()
Get the firm characteristics to generate.
Among all the firm characteristics to generate, the firm characteristics defined in this class are returned.
- Returns:
A dictionary of firm characteristics with keys equal to method names and values equal to aliases.
- load(fname=None, fdir=None)
Load a FCPanel object from a file.
Firm characteristic columns are renamed by method names after loading.
- Parameters:
fname – File name without extension. If None, fname = lower-case class name.
fdir – Directory to read the file from. If None, fdir =
config.output_dir.
- prepare(char_list)
Prepare characteristics that are required to generate a characteristic.
If a characteristic is defined as a function of other characteristics, this function will check whether they already exist and generate them if they don’t.
- Parameters:
char_list – list of firm characteristics (method names without ‘c_’) to prepare.
- remove_rawdata(excl_columns=None)
Remove raw data.
Raw data columns of data except excl_columns are deleted. If raw data are not needed after generating firm characteristics, calling this method can reduce memory and hard disk usage.
- Parameters:
excl_columns – List of columns to exclude.
- rename_chars(to_alias=True)
Rename the firm characteristic columns.
- Parameters:
to_alias – If True, rename firm characteristic columns from method names to aliases; if False, rename them from aliases back to method names.
- save(fname=None, fdir=None, other_columns='all')
Save this object to a file.
The data attribute is saved to a config.file_format file and the other attributes are saved to a json file. Firm characteristic columns are renamed by aliases before saving.
- Parameters:
fname – File name without extension. If None, fname = lower-case class name.
fdir – Directory to save the file. If None, fdir = config.output_dir.
other_columns – List of columns other than firm characteristic columns to save. If None, only firm characteristics are saved; if ‘all’, all columns are saved; if list, firm characteristic columns plus other_columns are saved.
- show_available_chars(all=False)
Display firm characteristics available in this class.
- Parameters:
all – If True, display all firm characteristics available in this class, otherwise, display only the firm characteristics to generate.
- class pyanomaly.panel.Panel(data=None, freq=12, base_freq=None)
Bases: object
Base class for panel data analysis.
This class stores a panel data, data, and offers various tools to handle and analyze it. The data should be a Pandas DataFrame with a MultiIndex = date/id, i.e., the first index should be a time-series identifier (Pandas datetime type) and the second index a cross-section identifier. It should be sorted on id/date.
- Parameters:
data – Panel data DataFrame with index = date/id, sorted on id/date.
freq – Frequency of data. ANNUAL, QUARTERLY, MONTHLY, or DAILY.
base_freq – Frequency of data values. For example, if data has annual values populated monthly, freq = MONTHLY and base_freq = ANNUAL. If None, base_freq = freq.
Attributes
- data
DataFrame with index = date/id, sorted on id/date, that stores the panel data. The items of data can be accessed by Panel.__getitem__(): for a Panel instance panel, panel[cols] and panel[rows, cols] are equivalent to panel.data[cols] and panel.data.loc[rows, cols], respectively.
- freq
Frequency of
data. ANNUAL, QUARTERLY, MONTHLY, or DAILY.
- base_freq
Frequency of
datavalues. ANNUAL, QUARTERLY, MONTHLY, or DAILY.
Methods
Check if a panel data is sorted on id/date.
Get id index name.
Get date index name.
Get id index values.
Get date index values.
Get id group.
Get date group.
Get id group sizes.
Get date group sizes.
Get id group indices.
Get date group indices.
Apply a function to each id group.
Apply a function to each date group.
Populate data.
Merge with another data.
Inspect data.
Filter data.
Get row count per id.
Remove rows.
Shift data.
Get the difference of data.
Get the percentage change of data.
Cumulative returns.
Get future returns.
Apply a function to a rolling window.
Conduct rolling regression.
Make a copy of this object and return it.
Copy from another Panel object.
Save this object to a file.
Load a Panel object from a file.
Clean memory.
- apply_to_dates(data, function, n_ret, *args, data2=None, looper=None)
Apply a function to each date group.
This method groups data by date and applies function to each date group.
- Parameters:
data – DataFrame, Series, ndarray, or (list of) columns. It should have the same length and order as the data attribute.
function – Function to apply to groups. Its arguments should be (gbdata, *args) or (gbdata, gbdata2, *args), where gbdata (gbdata2) is a group of data (data2).
n_ret – Number of returns of function. If None, it is assumed to be the same as the column size of data. If looper is apply_to_groups, this argument is ignored.
*args – Additional arguments of function.
data2 – DataFrame, Series, ndarray, str, or int. Optional argument when function requires two sets of input data.
looper – Looping function: apply_to_groups(), apply_to_groups_jit(), or apply_to_groups_reduce_jit(). If looper is None, apply_to_groups_jit is used if function is jitted; otherwise, apply_to_groups is used.
- Returns:
Concatenated value of the outputs of function.
- apply_to_ids(data, function, n_ret, *args, data2=None, looper=None)
Apply a function to each id group.
This method groups data by id and applies function to each id group.
- Parameters:
data – DataFrame, Series, ndarray, or (list of) columns. It should have the same length and order as the data attribute.
function – Function to apply to groups. Its arguments should be (gbdata, *args) or (gbdata, gbdata2, *args), where gbdata (gbdata2) is a group of data (data2).
n_ret – Number of returns of function. If None, it is assumed to be the same as the column size of data. If looper is apply_to_groups, this argument is ignored.
*args – Additional arguments of function.
data2 – DataFrame, Series, ndarray, str, or int. Optional argument when function requires two sets of input data.
looper – Looping function: apply_to_groups(), apply_to_groups_jit(), or apply_to_groups_reduce_jit(). If looper is None, apply_to_groups_jit is used if function is jitted; otherwise, apply_to_groups is used.
- Returns:
Concatenated value of the outputs of function.
- clean_memory()
Clean memory.
Deleting rows/columns of the data attribute may not release memory immediately. If a Panel object consumes an unusually large amount of memory, call this function to release memory.
- copy(columns=None, deep=False)
Make a copy of this object and return it.
- Parameters:
columns – Columns to copy. If None, copy all columns.
deep – If True, deep copy.
- Returns:
Copy of this object.
- copy_from(panel, columns=None, deep=False)
Copy from another Panel object.
- Parameters:
panel – A Panel object to be copied.
columns – Columns of panel to copy. If None, copy all columns.
deep – If True, deep copy.
- cumret(ret, period=1, lag=0)
Cumulative returns.
Compute cumulative returns between t-period and t-lag. If ret is monthly returns, 12-month momentum can be obtained by setting period = 12 and lag = 1. A negative period will generate future returns: e.g., period = -1 and lag = 0 for one-period ahead return; period = -3 and lag = -1 for two-period ahead return starting from t+1.
- Parameters:
ret – Series of returns or a return column name. If ret is a Series, it should have the same length and order as the data attribute.
period – Target horizon (in base frequency). (+) for past returns and (-) for future returns.
lag – Period (in base frequency) to calculate returns from.
- Returns:
Series of cumulative returns.
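The window arithmetic can be illustrated with a minimal sketch for a single id. It assumes simple (not log) returns and that the return between t-period and t-lag compounds the per-period returns dated t-period+1 through t-lag. The helper `cumret_sketch` is hypothetical, handles past returns only, and is not PyAnomaly's implementation.

```python
import math

def cumret_sketch(ret, period=1, lag=0):
    """Compound returns between t-period and t-lag for one id (past returns only).

    ret is a list of per-period simple returns ordered by date.
    Entries without a full window are nan.
    """
    if lag < 0 or period <= lag:
        raise ValueError("this sketch only handles period > lag >= 0")
    out = []
    for t in range(len(ret)):
        start = t - period + 1   # first return inside the window
        end = t - lag            # last return inside the window
        if start < 0:
            out.append(math.nan)
            continue
        growth = 1.0
        for r in ret[start:end + 1]:
            growth *= 1.0 + r
        out.append(growth - 1.0)
    return out

monthly = [0.10, -0.05, 0.02, 0.03]
print(cumret_sketch(monthly, period=1, lag=0))  # reproduces the inputs up to rounding
print(cumret_sketch(monthly, period=3, lag=1))  # window skips the most recent month
```

With period = 12 and lag = 1 the same loop compounds the eleven monthly returns from t-11 to t-1, which is the momentum example given above.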
- date_idx()
Get date index name.
- Returns:
Date index name.
- date_values()
Get date index values.
- Returns:
Date Index.
- diff(data, n=1)
Get the difference of data.
Calculate difference within each id accounting for data frequency. The period between two data points is determined by freq and base_freq. See shift() for details.
- Parameters:
data – DataFrame, Series, ndarray, or (list of) columns. It should have the same length and order as the data attribute. If None, data = data.
n – Shift size in the base frequency.
- Returns:
Series or DataFrame of differenced data.
- filter(filters, keep_row=False)
Filter data.
Filter the panel data using filters. A filter is a tuple of three elements:
filter[0]: column to apply the filter to.
filter[1]: filter condition: ‘==’, ‘!=’, ‘>’, ‘<’, ‘>=’, ‘<=’, ‘in’, or ‘not in’.
filter[2]: rhs value.
If filters is a list of filters, they are applied sequentially.
- Parameters:
filters – A filter or list of filters.
keep_row – Whether to keep or remove the filtered out rows. If True, the values of the filtered rows are set to nan.
Examples
To remove rows, where the value of column ‘x’ is less than 10,
>>> panel.filter(('x', '>=', 10))
This is equivalent to
>>> panel.data = panel.data[panel.data['x'] >= 10]
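The filter tuples can be sketched on plain Python rows. This is an illustration of the documented semantics only: `apply_filters` and `_CONDITIONS` are hypothetical names, rows are dicts rather than a DataFrame, and the real method operates on the data attribute.

```python
import math
import operator

# Map the documented condition strings to plain Python predicates.
_CONDITIONS = {
    "==": operator.eq, "!=": operator.ne,
    ">": operator.gt, "<": operator.lt,
    ">=": operator.ge, "<=": operator.le,
    "in": lambda value, rhs: value in rhs,
    "not in": lambda value, rhs: value not in rhs,
}

def apply_filters(rows, filters, keep_row=False):
    """Apply (column, condition, rhs) filters to a list of dict rows."""
    if isinstance(filters, tuple):           # a single filter was given
        filters = [filters]
    for column, condition, rhs in filters:   # filters are applied sequentially
        passes = _CONDITIONS[condition]
        if keep_row:
            # Keep every row but set the values of failing rows to nan.
            rows = [row if passes(row[column], rhs)
                    else {key: math.nan for key in row}
                    for row in rows]
        else:
            rows = [row for row in rows if passes(row[column], rhs)]
    return rows

rows = [{"x": 5, "y": 1}, {"x": 10, "y": 2}, {"x": 15, "y": 3}]
print(apply_filters(rows, ("x", ">=", 10)))
print(apply_filters(rows, [("x", ">=", 10), ("y", "in", [3])]))
```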
- futret(ret, period=1)
Get future returns.
- Parameters:
ret – Series of returns or a return column name. If ret is a Series, it should have the same length and order as the data attribute.
period – Target horizon (in base frequency).
- Returns:
Series of future returns.
- get_date_group()
Get date group.
Same as Panel.data.groupby(level=0).
- Returns:
Pandas GroupBy object.
- get_date_group_index()
Get date group indices.
- Returns:
List of date group indices.
- get_date_group_size()
Get date group sizes.
- Returns:
Ndarray of date group sizes.
- get_id_group()
Get id group.
Same as Panel.data.groupby(level=1).
- Returns:
Pandas GroupBy object.
- get_id_group_index()
Get id group indices.
- Returns:
List of id group indices.
- get_id_group_size()
Get id group sizes.
- Returns:
Ndarray of id group sizes.
- get_row_count(ascending=True)
Get row count per id.
The row count is a sequential number starting from 0 for each id. If ascending is True (False), its value is 0 on the first (last) date. It can be used, e.g., to remove the first (last) n rows of data.
- Parameters:
ascending – If True, the row count increases with date.
- Returns:
Ndarray of row counts.
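A minimal sketch of the row count, assuming the ids are given as a list that is already sorted on id/date. The function `row_count` is a hypothetical stand-in that returns a list rather than an ndarray.

```python
def row_count(ids, ascending=True):
    """Sequential row number within each id; ids must be sorted on id/date."""
    counts = []
    position = 0
    for i, current in enumerate(ids):
        # Restart at 0 whenever a new id begins.
        position = position + 1 if i > 0 and ids[i - 1] == current else 0
        counts.append(position)
    if ascending:
        return counts
    # Descending: count from the last row of each id instead.
    sizes = {}
    for current in ids:
        sizes[current] = sizes.get(current, 0) + 1
    return [sizes[current] - 1 - c for current, c in zip(ids, counts)]

ids = ["A", "A", "A", "B", "B"]
print(row_count(ids))                   # [0, 1, 2, 0, 1]
print(row_count(ids, ascending=False))  # [2, 1, 0, 1, 0]
```

Keeping only the rows whose ascending count is at least n drops the first n rows of every id.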
- id_idx()
Get id index name.
- Returns:
Id index name.
- id_values()
Get id index values.
- Returns:
Id Index.
- inspect_data(columns=None, option=['summary'])
Inspect data.
- Parameters:
columns – List of columns to inspect.
option – List of items to display.
- is_sorted(data=None)
Check if a panel data is sorted on id/date.
- Parameters:
data – Panel DataFrame with index = date/id. If None, data = data.
- Returns:
Bool. True if data is sorted on id/date.
- load(fname=None, fdir=None)
Load a Panel object from a file.
- Parameters:
fname – File name without extension. If None, fname = lower-case class name.
fdir – Directory to load the file from. If None, fdir =
config.output_dir.
- merge(right, on=None, right_on=None, how='left', drop_duplicates='right', suffixes=None, method=None)
Merge with another data.
Merge the data attribute with right.
- Parameters:
right – Panel, Series, or DataFrame to merge with.
on – (List of) column(s) to merge on. If None, merge on index.
right_on – (List of) column(s) of right to merge on. If None, right_on = on.
how – Merge method: ‘inner’, ‘outer’, ‘left’, or ‘right’.
drop_duplicates – How to handle duplicate columns. ‘left’: drop the duplicate columns of the left data (keep right); ‘right’: drop the duplicate columns of right (keep left); None: keep both. If None, suffixes should be provided.
suffixes – A tuple of suffixes for duplicate columns, e.g., suffixes=(‘_x’, ‘_y’) will add ‘_x’ and ‘_y’ to the left and right duplicate columns, respectively.
method – None or ‘pandas’. None uses an internal merge algorithm for left-merge; ‘pandas’ uses pd.merge() internally. If how is not ‘left’, this option is ignored and pd.merge() is always used.
- pct_change(data, n=1, allow_negative_denom=False)
Get the percentage change of data.
Calculate percentage change within each id accounting for data frequency. The period between two data points is determined by freq and base_freq. See shift() for details.
- Parameters:
data – DataFrame, Series, ndarray, or (list of) columns. It should have the same length and order as the data attribute. If None, data = data.
n – Shift size in the base frequency.
allow_negative_denom – If False, set the output to nan when the denominator is not positive.
- Returns:
Series or DataFrame of percentage changes.
- populate(freq=None, method='ffill', limit=None, lag=0, new_date_col=None)
Populate data.
Populate data to freq frequency and shift the populated data by lag period(s).
- Parameters:
freq – Frequency to populate: ANNUAL, QUARTERLY, MONTHLY, or DAILY. If None, freq is set to the data frequency (freq), and missing dates are added.
method – Filling method for newly added rows. ‘ffill’: forward fill, None: nan.
limit – Maximum number of rows to forward-fill.
lag – Minimum periods between new date and original date. If freq = MONTHLY, lag = 4 shifts data by 4 months, which means data are available at least 4 months later.
new_date_col – Name of the new (populated) date index. If None, use the current date index name. If given, the original date index is kept as a column.
- remove_rows(data, n=1)
Remove rows.
Remove (set to nan) the first n periods of data per id. If n < 0, the last -n periods are removed. Data frequency is accounted for: e.g., if freq = MONTHLY and base_freq = ANNUAL, n = 1 removes the first 12 rows per id.
- Parameters:
data – DataFrame, Series, or ndarray. It should have the same length and order as the data attribute.
n – Number of rows (in base frequency) to remove (set to nan).
- Returns:
The data with removed rows set to nan.
- rolling(data, n, function, min_n=None, lag=0)
Apply a function to a rolling window.
For each id, apply function to rolling windows of size n. The rolling window is determined by freq and base_freq. For instance, if freq = MONTHLY and base_freq = ANNUAL, n = 3 means a three-year rolling window and the first window consists of the 1st, 13th, and 25th rows.
- Parameters:
data – DataFrame, Series, ndarray, or (list of) columns. It should have the same length and order as the data attribute. If None, data = data.
function – Function to apply: ‘sum’, ‘mean’, ‘std’, or ‘var’.
n – Window size in the base frequency.
min_n – Minimum number of observations in a window. If observations < min_n, result is nan. Default to n.
lag – Lag size in the base frequency. The data is shifted by lag before rolling.
- Returns:
Series or DataFrame. Rolling calculation result.
Note
For other user-defined functions, use apply_to_ids().
Examples
Suppose funda is a Panel object with freq = MONTHLY and base_freq = ANNUAL (annual data populated monthly), and funda.data has a return column, ‘ret’. The past three-year average return starting from one year ago can be calculated (with a condition of at least two non-missing values within the sample window) and saved as ‘avg_ret3y’ as follows (arguments follow the signature order data, n, function, min_n, lag):
>>> funda['avg_ret3y'] = funda.rolling('ret', 3, 'mean', 2, 1)
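The strided window described for rolling() can be sketched for one id as follows. The `step` argument (rows per base period, e.g. 12 for annual values stored monthly) and the function name `rolling_mean_sketch` are assumptions of this illustration, not PyAnomaly names.

```python
import math

def rolling_mean_sketch(values, n, step, min_n=None, lag=0):
    """Rolling mean over every `step`-th row of one id.

    The window at row t uses rows t - lag*step, t - (lag+1)*step, ...,
    n rows in total; missing or out-of-range rows are skipped.
    """
    min_n = n if min_n is None else min_n
    out = []
    for t in range(len(values)):
        window = []
        for k in range(n):
            idx = t - (k + lag) * step          # row k + lag base periods back
            if idx >= 0 and not math.isnan(values[idx]):
                window.append(values[idx])
        enough = window and len(window) >= min_n
        out.append(sum(window) / len(window) if enough else math.nan)
    return out

values = [float(i) for i in range(36)]           # three years of monthly rows
result = rolling_mean_sketch(values, n=3, step=12)
print(result[24])   # mean of rows 0, 12, and 24 (the 1st, 13th, and 25th rows)
```

Calling it with min_n=2 and lag=1 mirrors the ‘avg_ret3y’ example: the window starts one year back and needs at least two observations.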
- rolling_regression(data, n, min_n=None, add_const=True)
Conduct rolling regression.
Run rolling OLS for each id accounting for the data frequency.
- Parameters:
data – DataFrame, ndarray, or (list of) columns. It should have the same length and order as the data attribute. The first column is used as the dependent variable and the rest as the independent variables.
n – Window size in the base frequency.
min_n – Minimum number of observations in a window. If observations < min_n, result is nan. Default to n.
add_const – If True, add a constant to the independent variables.
- Returns:
Nx(K+2) ndarray, where N is the length of data and K is the number of independent variables.
First K columns: Coefficients.
K+1-th column: R-squared.
K+2-th column: Idiosyncratic volatility (standard deviation of residuals).
- save(fname=None, fdir=None, columns=None)
Save this object to a file.
The data attribute is saved to a config.file_format file and the other attributes are saved to a json file.
- Parameters:
fname – File name without extension. If None, fname = lower-case class name.
fdir – Directory to save the file. If None, fdir = config.output_dir.
columns – List of columns to save. If None, the entire dataset is saved.
- shift(data=None, n=1)
Shift data.
Shift data within each id accounting for data frequency. The shift period is determined by freq and base_freq. E.g., if freq = MONTHLY and base_freq = ANNUAL, n = 1 means 1-year shift, and data is shifted by 12 (12 months). That is, if base_freq = ANNUAL (QUARTERLY), shift(data, n) will always shift data by n years (quarters) regardless of the data frequency.
- Parameters:
data – DataFrame, Series, ndarray, or (list of) columns. It should have the same length and order as the data attribute. If None, data = data.
n – Shift size in the base frequency.
- Returns:
Series or DataFrame of shifted data.
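The frequency-aware shift can be sketched for one id as below. The numeric coding of frequencies as observations per year (12 = monthly, 4 = quarterly, 1 = annual) is an assumption of this sketch, suggested by the default freq=12 in the Panel signature; `shift_sketch` is a hypothetical helper.

```python
import math

def shift_sketch(values, n=1, freq=12, base_freq=1):
    """Shift one id's values by n base periods.

    freq and base_freq are observations per year (assumed coding), so one
    base period spans freq // base_freq rows.
    """
    rows = n * (freq // base_freq)
    size = len(values)
    return [values[t - rows] if 0 <= t - rows < size else math.nan
            for t in range(size)]

annual_in_monthly = list(range(24))             # two years of monthly rows
shifted = shift_sketch(annual_in_monthly, n=1, freq=12, base_freq=1)
print(shifted[12])   # 0: the value from twelve rows (one year) earlier
```

With freq equal to base_freq the shift reduces to an ordinary one-row shift per unit of n.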
portfolio
This module defines classes for portfolio analysis.
Portfolio class.
Class for a group of portfolios.
- class pyanomaly.portfolio.Portfolio(name=None, position=None, rf=None, pfval0=1, costfcn=None, keep_position=False)
Portfolio class.
This class makes a portfolio from positions and evaluates it.
Position information is saved in position and portfolio information is saved in value. If positive weights of the positions on a date don’t add up to 1, (1 - sum(positive weights)) will be assumed to be invested in a risk-free asset, and its information is saved in fposition. The transaction cost is assumed to be 0 for risk-free assets. Once the portfolio is evaluated by calling eval(), the performance attribute is generated.
- Parameters:
name – Portfolio name.
position – Position DataFrame. It should have index = ‘date’ and columns = ‘id’ (security ID), ‘ret’ (return), and ‘wgt’ (portfolio weight). If it has other columns such as price, they will be kept in position. If it has ‘rf’ (risk-free rate) column, its values are used as risk-free rates.
rf – Risk-free rate DataFrame with index = ‘date’ and columns = [‘rf’]. The rf has priority over ‘rf’ column in position. If rf = None and position does not have ‘rf’ column, the risk-free rate is assumed to be 0.
pfval0 – Initial portfolio value. Default to 1.
costfcn – TransactionCost class, a transaction cost function, or value. See costfcn.
keep_position – If False, position information (position) is deleted after the portfolio is constructed.
Attributes
- name
Portfolio name.
- position
Position DataFrame. Its index is ‘date’ and it has the following columns.
Columns from the input:
id – Security ID.
ret – Return between t-1 and t.
wgt – Weight at the beginning of t (at t-1 after rebalancing).
Other columns – Any other columns that are in the input position data.
Columns generated internally:
exret – Excess return over risk-free rate between t-1 and t.
val1 – Value at t.
val – Value at the beginning of t.
val0 – Value at t-1 (before rebalancing). val1 at t-1 = val0 at t.
cost – Transaction cost incurred at the beginning of t.
- value
Portfolio value DataFrame. Its index is ‘date’ and it has the following columns:
ret – Return between t-1 and t. This can be either net (excess) return or gross (excess) return depending on the return type. See eval(). ret = netexret by default.
val1 – Value at t.
val – Value at the beginning of t.
cost – Transaction cost incurred at the beginning of t.
tover – Turnover incurred at the beginning of t.
netret – Return between t-1 and t, net of transaction cost.
grossret – Gross return between t-1 and t.
netexret – Excess return between t-1 and t, net of transaction cost.
grossexret – Gross excess return between t-1 and t.
lposition – Number of long positions.
sposition – Number of short positions.
tover = sum(abs(position.val - position.val0)) / value.val
The following columns are added to value once the portfolio is evaluated by calling eval():
cumret – Cumulative return since the first date.
drawdown – Drawdown.
drawdur – Duration of drawdown in the frequency of data, e.g., if monthly, 3 means 3 months.
drawstart – Beginning date of drawdown.
succdown – Successive down; down without any up during a period.
succdur – Duration of successive down.
succstart – Beginning date of successive down.
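The turnover formula given for the value attribute, tover = sum(abs(position.val - position.val0)) / value.val, can be checked by hand with a small sketch. The function `turnover` and its list-of-pairs input are illustrative only.

```python
def turnover(position_rows, portfolio_val):
    """Turnover on one date: traded value relative to portfolio value.

    position_rows holds (val, val0) pairs, i.e. the value after and before
    rebalancing for every security held on that date.
    """
    traded = sum(abs(val - val0) for val, val0 in position_rows)
    return traded / portfolio_val

# Rebalancing a 60/40 portfolio worth 100 to 50/50 trades 10 + 10 = 20.
print(turnover([(50.0, 60.0), (50.0, 40.0)], 100.0))  # 0.2
```

Note that both the buy and the sell leg are counted, so a full replacement of the portfolio gives a turnover of 2 under this definition.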
- fposition
Risk-free asset position DataFrame. Its index is ‘date’ and it has the following columns:
ret – Return between t-1 and t.
wgt – Weight at the beginning of t.
val1 – Value at t.
val – Value at the beginning of t.
- performance
Portfolio performance DataFrame. Its column is the portfolio name and its index has the following entries:
mean – Mean return.
std – Standard deviation.
sharpe – Sharpe ratio.
cum – Cumulative return.
mdd – Maximum drawdown.
mdd start – Maximum drawdown start date.
mdd end – Maximum drawdown end date.
msd – Maximum successive down.
msd start – Maximum successive down start date.
msd end – Maximum successive down end date.
turnover – Average turnover.
lposition – Average number of long positions.
sposition – Average number of short positions.
- costfcn
TransactionCost object, a transaction cost function, or a value. For example, if a constant transaction cost of 20 basis points is assumed, costfcn can be set to 0.002.
If costfcn is a TransactionCost object, TransactionCost.get_cost() is called to get transaction costs. TransactionCost allows transaction costs varying across time and securities.
When costfcn is defined as a function, it should have arguments val (value after rebalancing) and val0 (value before rebalancing) and return the transaction cost. For example, if the transaction cost to buy (sell) is 20 (30) bps, the function can be defined as follows:
import numpy as np

def cost_fcn(val, val0):
    dval = np.abs(val - val0)  # traded amount
    return 0.002 * dval if val > val0 else 0.003 * dval
Note
If a position exists at t-1 but not at t, it will be added at t with 0 weight. This is to compute the transaction cost.
The ‘val’, ‘val1’, and ‘val0’ in position and value do not take transaction costs into account. For the value changes net of transaction costs, use the cumulative return.
Methods
Set positions.
Make a portfolio given portfolio returns.
Copy this object.
Set the return to use for portfolio evaluation.
Get returns.
Get cumulative returns.
Get mean return.
Get standard deviation.
Get cumulative return.
Get Sharpe ratio.
Get successive downs.
Get maximum successive down.
Get drawdowns.
Get maximum drawdown.
Evaluate the portfolio.
Evaluate the portfolio repeatedly over a period.
- copy(sdate=None, edate=None)
Copy this object.
Copy this object for the given period.
- Parameters:
sdate – Start date (‘yyyy-mm-dd’).
edate – End date (‘yyyy-mm-dd’).
- Returns:
Portfolio object.
- cum_return(sdate=None, edate=None, logscale=True)
Get cumulative return.
- Parameters:
sdate – Start date.
edate – End date.
logscale – If True, return log-scale cumulative return.
- Returns:
Cumulative return over the period.
- cum_returns(sdate=None, edate=None, logscale=True, zero_start=False)
Get cumulative returns.
Both sdate and edate are inclusive, i.e., the first cumulative return is the return between sdate-1 and sdate and the last cumulative return is the return between sdate-1 and edate.
- Parameters:
sdate – Start date.
edate – End date.
logscale – If True, return log-scale cumulative returns.
zero_start – If True, a cumulative return of 0 is prepended with date = sdate - 1 day. This is useful when plotting several cumulative returns as all curves will start at 0.
- Returns:
Cumulative return Series with index = ‘date’.
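A minimal sketch of the two scales, assuming that the log-scale cumulative return is the running sum of log(1 + r) and that the plain cumulative return is the compounded growth minus one. The function `cum_returns_sketch` is hypothetical and works on a list without dates, so zero_start simply prepends a 0.

```python
import math

def cum_returns_sketch(returns, logscale=True, zero_start=False):
    """Cumulative returns from the first element of `returns` onward."""
    out, total = [], 0.0
    for r in returns:
        total += math.log1p(r)                  # accumulate log growth
        out.append(total if logscale else math.expm1(total))
    return [0.0] + out if zero_start else out

rets = [0.10, -0.10]
print(cum_returns_sketch(rets, logscale=False))   # compounded: ends slightly below 0
print(cum_returns_sketch(rets, zero_start=True))  # log scale with a leading 0
```

The log scale makes curves additive over time, which is convenient when several portfolios are plotted together from a common zero start.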
- drawdown(sdate=None, edate=None)
Get drawdowns.
- Parameters:
sdate – Start date.
edate – End date.
- Returns:
Drawdowns. DataFrame with index = ‘date’ and columns = [‘value’, ‘duration’, ‘start’].
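The three output columns can be illustrated with a sketch under assumed definitions: value is the loss relative to the running peak (0 at a new peak), duration counts periods since that peak, and start is the position of the peak. The source does not spell out these conventions (sign, log scale, or whether start is the peak date), so `drawdown_sketch` and `max_drawdown_sketch` are hypothetical and PyAnomaly may differ.

```python
def drawdown_sketch(returns):
    """Return (value, duration, start) per period for a list of simple returns."""
    level, peak, peak_idx = 1.0, 1.0, -1   # index -1 stands for the starting value
    rows = []
    for t, r in enumerate(returns):
        level *= 1.0 + r
        if level >= peak:                  # a new high resets the drawdown
            peak, peak_idx = level, t
        rows.append((level / peak - 1.0, t - peak_idx, peak_idx))
    return rows

def max_drawdown_sketch(returns):
    """The row with the deepest (most negative) drawdown."""
    return min(drawdown_sketch(returns), key=lambda row: row[0])

rets = [0.10, -0.20, 0.05, 0.30]
for row in drawdown_sketch(rets):
    print(row)
print(max_drawdown_sketch(rets))   # deepest loss occurs one period after the peak
```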
- eval(sdate=None, edate=None, logscale=True, annualize_factor=1, return_type='net', percentage=False)
Evaluate the portfolio.
This method evaluates the portfolio over a period and creates performance. It also adds performance-related columns to value.
- Parameters:
sdate – Start date.
edate – End date.
logscale – If True, cumulative returns are in log-scale.
annualize_factor – ‘mean’, ‘std’, ‘sharpe’, and ‘turnover’ are annualized by this factor, e.g., ‘mean’ is multiplied by annualize_factor and ‘std’ by its square-root. If the data is monthly, the results can be annualized by setting annualize_factor = 12. Default to 1.
return_type – Which return to use for evaluation. See set_return_type() for available options. Default to ‘net’.
percentage – If True, ‘mean’, ‘std’, ‘cum’, ‘mdd’, ‘msd’, and ‘turnover’ are multiplied by 100. Default to False.
- Returns:
- eval_series(sdate=None, edate=None, logscale=True, annualize_factor=1, return_type='net', percentage=False)
Evaluate the portfolio repeatedly over a period.
Evaluate the portfolio repeatedly for the periods [sdate-1, sdate], [sdate-1, sdate+1], …, [sdate-1, edate]. For the description of the arguments, see eval().
- Returns:
Performance for each period. DataFrame with index values equal to the period end dates and columns equal to the indexes of performance, i.e., a row with index t contains the performance up to t.
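The annualize_factor and percentage arguments of eval() can be illustrated with a sketch for ‘mean’, ‘std’, and ‘sharpe’. It assumes the sample standard deviation and takes the annualized Sharpe ratio as the ratio of the annualized mean to the annualized standard deviation; `annualized_stats` is a hypothetical helper, not the library's code.

```python
import math
import statistics

def annualized_stats(excess_returns, annualize_factor=1, percentage=False):
    """Mean, standard deviation, and Sharpe ratio scaled as described for eval()."""
    mean = statistics.mean(excess_returns) * annualize_factor
    # Sample standard deviation is an assumption of this sketch.
    std = statistics.stdev(excess_returns) * math.sqrt(annualize_factor)
    sharpe = mean / std                  # computed before any percentage scaling
    scale = 100.0 if percentage else 1.0
    return {"mean": mean * scale, "std": std * scale, "sharpe": sharpe}

monthly_excess = [0.01, 0.03, -0.01, 0.02]
print(annualized_stats(monthly_excess))                       # monthly figures
print(annualized_stats(monthly_excess, annualize_factor=12))  # annualized figures
```

Under these assumptions the annualized Sharpe ratio equals the per-period ratio times the square root of annualize_factor.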
- static from_portfolo_return(pfret, val=1)
Make a portfolio given portfolio returns.
If portfolio returns are already known, this method can be used to construct a Portfolio object without position information and evaluate the portfolio.
- Parameters:
pfret – Portfolio returns. DataFrame with index = ‘date’ and columns = ‘ret’.
- Returns:
Portfolio object.
- max_drawdown(sdate=None, edate=None)
Get maximum drawdown.
- Parameters:
sdate – Start date.
edate – End date.
- Returns:
Maximum drawdown. Series with index = [‘value’, ‘duration’, ‘start’].
- max_succdown(sdate=None, edate=None)
Get maximum successive down.
- Parameters:
sdate – Start date.
edate – End date.
- Returns:
Maximum successive down. Series with index = [‘value’, ‘duration’, ‘start’].
- mean_return(sdate=None, edate=None)
Get mean return.
- Parameters:
sdate – Start date.
edate – End date.
- Returns:
Mean return over the period.
- returns(sdate=None, edate=None)
Get returns.
Both sdate and edate are inclusive, i.e., the first return is the return between sdate-1 and sdate.
- Parameters:
sdate – Start date.
edate – End date.
- Returns:
Return Series with index = ‘date’.
- set_position(position, rf=None, pfval0=1, keep_position=False)
Set positions.
This method sets position from position. Any existing positions will be deleted. For the details of the input arguments, see the class parameters.
- set_return_type(return_type='net')
Set the return to use for portfolio evaluation.
The ‘ret’ and ‘exret’ of value are set according to return_type and used to compute portfolio evaluation metrics. Mean, std, and Sharpe ratio use value.exret, and cumulative return, mdd, and msd use value.ret.
- Parameters:
return_type – Return to use. ‘net’, ‘gross’, ‘netexret’, ‘grossexret’, ‘netret’, or ‘grossret’.
value.ret and value.exret are set as follows.
‘net’ – ret: net return; exret: net excess return.
‘gross’ – ret: gross return; exret: gross excess return.
‘netexret’ – ret: net excess return; exret: net excess return.
‘grossexret’ – ret: gross excess return; exret: gross excess return.
‘netret’ – ret: net return; exret: net return.
‘grossret’ – ret: gross return; exret: gross return.
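The mapping above can be written as a small lookup table. The sketch assumes the four return series are stored under the value column names netret, grossret, netexret, and grossexret listed for the value attribute; `RETURN_TYPE_MAP` and `select_returns` are hypothetical names.

```python
# (ret, exret) source columns for each return_type, following the mapping above.
RETURN_TYPE_MAP = {
    "net": ("netret", "netexret"),
    "gross": ("grossret", "grossexret"),
    "netexret": ("netexret", "netexret"),
    "grossexret": ("grossexret", "grossexret"),
    "netret": ("netret", "netret"),
    "grossret": ("grossret", "grossret"),
}

def select_returns(value_row, return_type="net"):
    """Pick the (ret, exret) pair for one row of portfolio values."""
    ret_col, exret_col = RETURN_TYPE_MAP[return_type]
    return value_row[ret_col], value_row[exret_col]

row = {"netret": 0.010, "grossret": 0.012, "netexret": 0.008, "grossexret": 0.010}
print(select_returns(row, "net"))       # (0.01, 0.008)
print(select_returns(row, "grossret"))  # (0.012, 0.012)
```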
- sharpe_ratio(sdate=None, edate=None)
Get Sharpe ratio.
- Parameters:
sdate – Start date.
edate – End date.
- Returns:
Sharpe ratio over the period.
- std_return(sdate=None, edate=None)
Get standard deviation.
- Parameters:
sdate – Start date.
edate – End date.
- Returns:
Standard deviation of the returns over the period.
- succdown(sdate=None, edate=None)
Get successive downs.
- Parameters:
sdate – Start date.
edate – End date.
- Returns:
Successive downs. DataFrame with index = ‘date’ and columns = [‘value’, ‘duration’, ‘start’].
- class pyanomaly.portfolio.Portfolios(portfolios=None)
Class for a group of portfolios.
This class can have several portfolios as its members and evaluate them together. This class facilitates portfolio comparison by evaluating them and saving the results in a single DataFrame.
- Parameters:
portfolios – List or dict of Portfolio objects to add. If it is a dict, its keys are used as portfolio names. Portfolios can also be added later using add() or set().
Attributes
- members
Dict of portfolio members. A Portfolio object is added to members with its name as the key. A member portfolio can be accessed using __getitem__():
>>> pf1 = Portfolio('pf1')
>>> portfolios = Portfolios()
>>> portfolios.add(pf1)
>>> pf1 = portfolios['pf1']  # This and the next line are equivalent.
>>> pf1 = portfolios.members['pf1']
- value
Portfolios’ values. This is a concatenated (outer-joined) DataFrame of the value attributes of the members. Its index is ‘date’ and columns are two-level: the first level is the same as the columns of Portfolio.value and the second level is the portfolio names.
- performance
Portfolios’ performances. This is a concatenated DataFrame of the performance attributes of the members. Its index is the same as the index of Portfolio.performance and columns are portfolio names.
Methods
Add a portfolio.
Set portfolios.
Evaluate the portfolios.
- add(portfolio, alias=None)
Add a portfolio.
Add portfolio to members.
- Parameters:
portfolio (Portfolio) – Portfolio to add.
alias – Portfolio alias. If not None, this is used as the portfolio name.
- eval(sdate=None, edate=None, logscale=True, annualize_factor=1, return_type='net', percentage=False)
Evaluate the portfolios.
For the arguments, see Portfolio.eval().
- Returns:
tcost
This module defines classes for transaction costs.
Transaction cost class.
Transaction cost that varies with time and firm size.
- class pyanomaly.tcost.TimeVaryingCost(me=None, normalize=True)
Bases: TransactionCost
Transaction cost that varies with time and firm size.
This class implements the time-varying transaction costs used in, e.g., Brandt et al. (2009), Hand and Green (2011), DeMiguel et al. (2020), and Han (2021). Transaction cost parameter k is defined as k = y * z, where y decreases linearly from 4.0 in 1974.01 to 1.0 in 2002.01 and remains at 1.0 thereafter, and z = 0.006 - 0.0025 * nme, where nme is a cross-sectionally normalized market equity that has a value between 0 and 1.
The maximum transaction cost is 240 basis points (the smallest firm before 1974) and the minimum transaction cost is 35 basis points (the largest firm after 2002).
We find this assumption is too conservative since the normalized me is sensitive to the largest firm. In 1974, the mean of the normalized me is only 0.0045 and most firms have the transaction cost of 240 basis points, and in 2002, the mean of the normalized me is only 0.0059 and most firms have the transaction cost of 60 basis points.
Using the logarithm of the market equity or capping the me of the largest firms may make more sense.
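The definition k = y * z can be sketched as below. The month-by-month linear interpolation of y and the flat value of 4.0 before 1974.01 are assumptions consistent with the description above; `cost_parameter` is a hypothetical function that takes an already normalized market equity.

```python
def cost_parameter(year, month, nme):
    """k = y * z for a date and a normalized market equity nme in [0, 1]."""
    months_since_start = (year - 1974) * 12 + (month - 1)
    total_months = (2002 - 1974) * 12           # 1974.01 to 2002.01
    # y falls linearly from 4.0 to 1.0 and is held flat outside that range.
    progress = min(max(months_since_start / total_months, 0.0), 1.0)
    y = 4.0 - 3.0 * progress
    z = 0.006 - 0.0025 * nme
    return y * z

print(cost_parameter(1974, 1, 0.0))   # smallest firm in 1974.01: about 240 bps
print(cost_parameter(2002, 1, 1.0))   # largest firm from 2002.01 on: about 35 bps
print(cost_parameter(2002, 1, 0.0))   # smallest firm from 2002.01 on: about 60 bps
```

Because z moves only between 0.0035 and 0.006, a normalized market equity close to 0 for most firms leaves nearly all of them at the upper bound for the date, which is the sensitivity noted above.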
- Parameters:
me – See set_params().
normalize – See set_params().
Methods
Set transaction cost parameters.
- set_params(me, normalize=True)
Set transaction cost parameters.
- Parameters:
me – DataFrame or Series of market equity with index = date/id.
normalize – If True, cross-sectionally normalize me so that its values are between 0 and 1. If me is already normalized, set normalize to False.
Note
If the input contains only a subset of all listed stocks, normalizing the market equity can result in over- or underestimation of the transaction costs. For example, if me contains only top 80% of the stocks, the transaction costs will be overestimated. Use all listed stocks in the market or normalize the market equity outside and set normalize = False.
- class pyanomaly.tcost.TransactionCost(**kwargs)
Bases: object
Transaction cost class.
Transaction cost can be set at the security level. It can also vary over time.
- Parameters:
**kwargs – Transaction cost parameters. See
set_params().
Attributes
- params
Transaction cost parameters. This can be a float number, dict, or DataFrame. See
set_params()for details.
Methods
Set transaction cost parameters.
Get transaction costs.
- get_cost(position)
Get transaction costs.
- Parameters:
position – Portfolio positions. The Portfolio object calls this function with the argument Portfolio.position to get transaction costs. The position argument should have index = ‘date’ and columns = [id, ‘val’, ‘val0’], where ‘val’ is the value after rebalancing and ‘val0’ is the value before rebalancing.
- Returns:
Transaction costs. An ndarray with the same length as position.
- set_params(**kwargs)
Set transaction cost parameters.
Parameters can be set either by this method or at class initialization.
- Parameters:
**kwargs –
Transaction cost parameters. The kwargs can have the following keywords:
’cost’: For a constant (scalar) transaction cost.
‘buy_fixed’, ‘buy_linear’, ‘buy_quad’, ‘sell_fixed’, ‘sell_linear’, ‘sell_quad’: For a quadratic transaction cost.
’params’: DataFrame. For a transaction cost that varies across securities (and over time).
Examples
A constant transaction cost of 20 basis points.
>>> tc = TransactionCost(cost=0.002)
Asymmetric quadratic cost function.
cost = fixed + linear * Amount + quad * Amount^2
To buy: fixed cost = $5, linear cost = 0.002, and quadratic cost = 0.001
To sell: fixed cost = $0, linear cost = 0.003, and quadratic cost = 0.001
>>> tc = TransactionCost(buy_fixed=5, buy_linear=0.002, buy_quad=0.001, sell_linear=0.003, sell_quad=0.001)
Only non-zero parameters need to be provided.
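The asymmetric quadratic cost formula above can be sketched as a plain function. This is only an illustration: `quad_cost` is a hypothetical name, and treating Amount as the absolute change from ‘val0’ to ‘val’, with no cost charged when nothing is traded, are assumptions rather than documented behavior of get_cost().

```python
def quad_cost(val, val0, buy_fixed=0.0, buy_linear=0.0, buy_quad=0.0,
              sell_fixed=0.0, sell_linear=0.0, sell_quad=0.0):
    """cost = fixed + linear * Amount + quad * Amount^2, with separate buy/sell terms."""
    amount = abs(val - val0)      # traded amount (assumed definition)
    if amount == 0:
        return 0.0                # assumption: no trade, no cost
    if val > val0:                # position increased: buy
        return buy_fixed + buy_linear * amount + buy_quad * amount ** 2
    return sell_fixed + sell_linear * amount + sell_quad * amount ** 2


params = dict(buy_fixed=5, buy_linear=0.002, buy_quad=0.001,
              sell_linear=0.003, sell_quad=0.001)
print(quad_cost(200.0, 100.0, **params))   # buying 100
print(quad_cost(100.0, 200.0, **params))   # selling 100
```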
Transaction costs that vary across securities.
Security 1 (id: 0001): 0.002, security 2 (id: 0002): 0.003
>>> params = pd.DataFrame({'cost': [0.002, 0.003]}, index=pd.Index(['0001', '0002'], name='id'))
>>> params
       cost
id
0001  0.002
0002  0.003
>>> tc = TransactionCost(params=params)
Transaction costs that vary across securities and dates.
Security 1 (id: 0001): 0.004 on ‘2000-03-31’, 0.003 on ‘2000-04-30’
Security 2 (id: 0002): 0.005 on ‘2000-03-31’, 0.004 on ‘2000-04-30’
>>> dates = ['2000-03-31', '2000-04-30']
>>> ids = ['0001', '0002']
>>> params = pd.DataFrame(index=pd.MultiIndex.from_product([dates, ids], names=('date', 'id')))
>>> params['cost'] = [0.004, 0.005, 0.003, 0.004]
>>> params
                  cost
date       id
2000-03-31 0001  0.004
           0002  0.005
2000-04-30 0001  0.003
           0002  0.004
>>> tc = TransactionCost(params=params)
The params DataFrame must have index = id or date/id. It can have columns such as ‘buy_fixed’ instead of ‘cost’ for a more complex transaction cost structure.
util
This module defines utility functions.
Check if a variable is iterable.
Convert a variable to a list.
Get a list of unique elements of a list.
Delete columns of a DataFrame.
Keep columns of a DataFrame.
Check if a variable or its element is zero.
Check if a variable's data type is float.
Check if a variable's data type is numeric (int or float).
Check if an array is a bool array.
Summation treating nan values as zero.
- pyanomaly.util.drop_columns(data, cols)
Delete columns of a DataFrame.
Columns are deleted in-place.
- Parameters:
data – DataFrame.
cols – List of columns to drop.
- pyanomaly.util.is_bool_array(array)
Check if an array is a bool array.
The array is identified as a bool array if:
its dtype is ‘bool’ or ‘boolean’; or
it contains only True, False, or None.
- Parameters:
array – Series or ndarray.
- Returns:
True if array is a bool array.
- pyanomaly.util.is_float(x)
Check if a variable’s data type is float.
- Parameters:
x – A scalar or array.
- Returns:
Bool. True if the data type of x is float.
- pyanomaly.util.is_int(x)
Check if a variable’s data type is int.
- Parameters:
x – A scalar or array.
- Returns:
Bool. True if the data type of x is int.
- pyanomaly.util.is_iterable(x)
Check if a variable is iterable.
Check if x is iterable. A string, although technically an iterable object, is treated as not iterable.
- Parameters:
x – A variable to check.
- Returns:
Bool. True if x is iterable.
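A minimal sketch of the rule described above, assuming the check is based on `collections.abc.Iterable`; the name `is_iterable_sketch` is hypothetical and the actual pyanomaly implementation may differ.

```python
from collections.abc import Iterable


def is_iterable_sketch(x):
    """True for iterable objects, except that strings count as not iterable."""
    return isinstance(x, Iterable) and not isinstance(x, str)


print(is_iterable_sketch([1, 2]), is_iterable_sketch('abc'), is_iterable_sketch(1))
```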
- pyanomaly.util.is_numeric(x)
Check if a variable’s data type is numeric (int or float).
- Parameters:
x – A scalar or array.
- Returns:
Bool. True if the data type of x is numeric.
- pyanomaly.util.is_zero(x)
Check if a variable or its element is zero.
A value is considered 0 if it is in the range (-1.e-7, 1.e-7).
- Parameters:
x – A scalar or array.
- Returns:
Bool or an array of bool. True if 0.
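The tolerance rule above can be sketched as follows. `is_zero_sketch` is a hypothetical name, and this version handles only scalars, lists, and tuples, whereas the documented function also accepts arrays.

```python
def is_zero_sketch(x, tol=1e-7):
    """True if a value lies in the open interval (-tol, tol); element-wise for sequences."""
    if isinstance(x, (list, tuple)):
        return [abs(v) < tol for v in x]
    return abs(x) < tol


print(is_zero_sketch(5e-8))
print(is_zero_sketch([0.0, 1e-3, -5e-8]))
```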
- pyanomaly.util.keep_columns(data, cols)
Keep columns of a DataFrame.
Keep cols columns of data and delete the rest in-place. This is much more memory-efficient than either of the following two methods, which momentarily consume a lot of memory when data is large.
>>> data = data[cols]
>>> data.drop(columns=data.columns.difference(cols), inplace=True)
Use this function when handling a large dataset.
- Parameters:
data – DataFrame.
cols – List of columns to keep.
- pyanomaly.util.nansum1(*args)
Summation treating nan values as zero.
This is similar to sum() of SAS: nan’s of args are replaced by 0. If all elements are nan, the result is nan.
- Parameters:
args – List of Series.
- Returns:
Series. Sum of args
Examples
>>> x = pd.Series([np.nan, 1, 1])
>>> y = pd.Series([np.nan, np.nan, 1])
>>> nansum1(x, y)
0     NaN
1   1.000
2   2.000
dtype: float64
>>> nansum1(x, y, y)
0     NaN
1   1.000
2   3.000
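The summation rule can also be sketched without pandas. `nansum_sketch` is a hypothetical stand-in that works on plain lists of floats rather than Series; it only mirrors the documented semantics (nan treated as 0, all-nan rows stay nan).

```python
import math


def nansum_sketch(*args):
    """Element-wise sum of equal-length lists, ignoring nan unless every value is nan."""
    result = []
    for values in zip(*args):
        valid = [v for v in values if not math.isnan(v)]
        # If nothing is valid, keep nan instead of returning 0.
        result.append(sum(valid) if valid else math.nan)
    return result


x = [math.nan, 1.0, 1.0]
y = [math.nan, math.nan, 1.0]
print(nansum_sketch(x, y))
print(nansum_sketch(x, y, y))
```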
- pyanomaly.util.to_list(x)
Convert a variable to a list.
- Parameters:
x – A scalar or an iterable item.
- Returns:
x converted to list.
Examples
>>> x = 1
>>> to_list(x)
[1]
>>> x = [1, 2]
>>> to_list(x)
[1, 2]
>>> x = (1, 2)
>>> to_list(x)
[1, 2]
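One way to obtain the behavior in the examples above is sketched below. `to_list_sketch` is a hypothetical name, and wrapping a string as a single element is an assumption that is not stated in the documentation.

```python
def to_list_sketch(x):
    """Convert a scalar or an iterable to a list."""
    if isinstance(x, list):
        return x                  # already a list
    if isinstance(x, (tuple, set, range)):
        return list(x)            # unpack common iterables
    return [x]                    # scalars (and, by assumption, strings) are wrapped


print(to_list_sketch(1), to_list_sketch([1, 2]), to_list_sketch((1, 2)))
```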
- pyanomaly.util.unique_list(x, exclude_nan=True)
Get a list of unique elements of a list.
If x contains a list, its elements are considered elements of x, not the list itself.
- Parameters:
x – A list.
exclude_nan – If True, exclude None and np.nan elements.
- Returns:
List of unique elements of x.
Examples
>>> x = [1, 2, 2, None]
>>> unique_list(x)
[1, 2]
>>> x = [1, 2, [1, 3], None]
>>> unique_list(x)
[1, 2, 3]
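The flattening and de-duplication shown in the examples can be sketched as follows. `unique_list_sketch` is a hypothetical name; flattening only one level of nesting and preserving first-seen order are assumptions consistent with the examples but not confirmed by the documentation.

```python
import math


def unique_list_sketch(x, exclude_nan=True):
    """Unique elements of x, treating elements of nested lists as elements of x."""
    flat = []
    for item in x:
        flat.extend(item if isinstance(item, list) else [item])
    result = []
    for item in flat:
        is_missing = item is None or (isinstance(item, float) and math.isnan(item))
        if exclude_nan and is_missing:
            continue              # drop None and nan
        if item not in result:
            result.append(item)   # keep first occurrence only
    return result


print(unique_list_sketch([1, 2, 2, None]))
print(unique_list_sketch([1, 2, [1, 3], None]))
```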
wrdsdata
This module defines WRDS class that is used to download and handle WRDS data.
Class to download/handle WRDS data.
- class pyanomaly.wrdsdata.WRDS(wrds_username=None)
Class to download/handle WRDS data.
- Parameters:
wrds_username – WRDS username. Required only when downloading data: can be set to None when reading data from files.
Attributes
- db
A wrds object to connect to the WRDS database.
Methods for Data Download
Download a table from WRDS library.
Asynchronous download of a WRDS table.
Download crsp.m(d)sf joined with crsp.m(d)senames.
Download delist and dividend info from crsp.m(d)seall.
Download comp.funda.
Download comp.fundq.
Download comp.secd.
Download comp.g_secd.
Download all tables.
Other Methods
Create pgpass file.
Merge m(d)sf with m(d)seall.
Add gvkey to m(d)sf and identify primary stocks.
Create a CRSP-Compustat link table using cusip.
Add gvkey to m(d)sf and identify primary stocks using internal link table.
Create crspm and crspd files.
Get risk-free rate.
Convert non-USD values of funda(q) to USD values.
Save downloaded table to a file.
Read data from a saved table.
Read a file and save it to a csv file.
References
CRSP overview: https://wrds-www.wharton.upenn.edu/pages/support/data-overview/wrds-overview-crsp-us-stock-database/
CRSP-Compustat merge: https://wrds-www.wharton.upenn.edu/pages/support/manuals-and-overviews/crsp/crspcompustat-merged-ccm/wrds-overview-crspcompustat-merged-ccm/
CRSP annual update tables: https://wrds-www.wharton.upenn.edu/data-dictionary/crsp_a_indexes/
- static add_gvkey_to_crsp(sf)
Add gvkey to m(d)sf and identify primary stocks.
The permno and gvkey are mapped using crsp.ccmxpf_linktable.
Primary stocks are identified in the following order.
If linkprim = ‘P’ or ‘C’, set the security as primary.
If permno and gvkey have 1:1 mapping, set the security as primary.
Among the securities with the same gvkey, set the one with the maximum trading volume as primary.
Among the securities with the same permco and missing gvkey, set the one with the maximum trading volume as primary.
- Parameters:
sf – m(d)sf DataFrame with index = date/permno.
- Returns:
m(d)sf with ‘gvkey’ and ‘primary’ (primary stock indicator) columns added.
References
https://wrds-www.wharton.upenn.edu/pages/support/research-wrds/macros/wrds-macros-cvccmlnksas/
- static add_gvkey_to_crsp_cusip(sf)
Add gvkey to m(d)sf and identify primary stocks using internal link table.
The permno and gvkey are mapped using crsp_comp_linktable.
Primary stocks are identified in the following order.
If linkprim = True, set the security as primary.
If permno and gvkey have 1:1 mapping, set the security as primary.
Among the securities with the same gvkey, set the one with the maximum trading volume as primary.
Among the securities with the same permco and missing gvkey, set the one with the maximum trading volume as primary.
- Parameters:
sf – m(d)sf DataFrame with index = date/permno.
Note
Compared to using ccmxpf_linktable, about 13% of gvkey’s and 3% of primary’s are different.
- Returns:
m(d)sf with ‘gvkey’ and ‘primary’ (primary stock indicator) columns added.
- static convert_fund_currency_to_usd(fund, table='funda')
Convert non-USD values of funda(q) to USD values.
- Parameters:
fund – funda(q) DataFrame with index = datadate/gvkey.
table – ‘funda’ or ‘fundq’: indicator whether fund is funda or fundq.
- Returns:
Converted fund DataFrame.
Note
In Compustat North America, the accounting data can be in either USD or CAD. This is not a problem if firm characteristics are generated using only Compustat. However, if data from different sources are mixed, e.g., if CRSP’s market equity (in USD) is combined with Compustat, Compustat data should be converted to USD.
Following JKP, we use compustat.exrt_dly to obtain exchange rates. The exrt_dly starts from 1982-02-01.
- static create_crsp_comp_linktable()
Create a CRSP-Compustat link table using cusip.
This method creates a CRSP-Compustat link table by merging crsp.msf with comp.security on cusip. This method can be used if the user does not have a WRDS subscription for ccmxpf_linktable. The link table has the columns [‘cusip’, ‘gvkey’, ‘permno’, ‘linkdt’, ‘linkenddt’, ‘linkprim’] and is saved to config.input_dir/crsp_comp_linktable. The linkprim column value is True if a security is primary.
Note
The sql in the reference uses historical cusip (ncusip in msenames). However, we use cusip in msf as it renders more matches.
A security is considered primary (linkprim = True) if its cusip is in funda or fundq.
References
- create_pgpass_file()
Create pgpass file.
This method needs to be called only once (after logging in to WRDS for the first time with a password). Once the pgpass file is created, a password is not required when connecting to WRDS.
- download_all(run_in_executer=True)
Download all tables.
Currently, this method downloads the following tables:
comp.funda
comp.fundq
comp.exrt_dly
crsp.msf (merged with crsp.msenames)
crsp.dsf (merged with crsp.dsenames)
crsp.mseall
crsp.dseall
crsp.ccmxpf_linktable
crsp.mcti
ff.factors_monthly
ff.factors_daily
- Parameters:
run_in_executer – If True, download concurrently. Faster (if network speed is high) but memory hungrier.
- download_funda(sdate=None, edate=None, run_in_executer=True)
Download comp.funda.
Downloaded data has index = datadate/gvkey and is sorted on gvkey/datadate.
- Parameters:
sdate – Start date, e.g., ‘2000-01-01’. Set to None to download all data.
edate – End date. Set to None to download all data.
run_in_executer – If True, download concurrently. Faster but memory hungrier.
- download_fundq(sdate=None, edate=None, run_in_executer=True)
Download comp.fundq.
Downloaded data has index = datadate/gvkey and is sorted on gvkey/datadate.
- Parameters:
sdate – Start date, e.g., ‘2000-01-01’. Set to None to download all data.
edate – End date. Set to None to download all data.
run_in_executer – If True, download concurrently. Faster but memory hungrier.
- download_g_secd(sdate=None, edate=None, run_in_executer=True)
Download comp.g_secd.
Downloaded data has index = datadate/gvkey and is sorted on gvkey/datadate.
- Parameters:
sdate – Start date, e.g., ‘2000-01-01’. Set to None to download all data.
edate – End date. Set to None to download all data.
run_in_executer – If True, download concurrently. Faster but memory hungrier.
- download_seall(sdate=None, edate=None, monthly=True, run_in_executer=True)
Download delist and dividend info from crsp.m(d)seall.
Delist can be obtained from either mseall or msedelist. We use mseall since it contains exchcd, which is used when replacing missing dlret. The shrcd and exchcd in mseall are usually those before halting/suspension. If a stock in NYSE is halted, exchcd in msenames can be -2, whereas that in mseall is 1. The downloaded fields are: permno, date, dlret, dlstcd, shrcd, exchcd, distcd, divamt. Downloaded data has index = date/permno and is sorted on permno/date.
- Parameters:
sdate – Start date, e.g., ‘2000-01-01’. Set to None to download all data.
edate – End date. Set to None to download all data.
monthly – If True, download mseall; otherwise, download dseall.
run_in_executer – If True, download concurrently. Faster but memory hungrier.
- download_secd(sdate=None, edate=None, run_in_executer=True)
Download comp.secd.
Downloaded data has index = datadate/gvkey and is sorted on gvkey/datadate.
- Parameters:
sdate – Start date, e.g., ‘2000-01-01’. Set to None to download all data.
edate – End date. Set to None to download all data.
run_in_executer – If True, download concurrently. Faster but memory hungrier.
- download_sf(sdate=None, edate=None, monthly=True, run_in_executer=True)
Download crsp.m(d)sf joined with crsp.m(d)senames.
Downloaded data has index = date/permno and is sorted on permno/date.
- Parameters:
sdate – Start date, e.g., ‘2000-01-01’. Set to None to download all data.
edate – End date. Set to None to download all data.
monthly – If True, download msf; otherwise, download dsf.
run_in_executer – If True, download concurrently. Faster but memory hungrier.
- download_table(library, table, obs=-1, offset=0, columns=None, coerce_float=None, date_cols=None, index_col=None, sort_col=None)
Download a table from WRDS library.
This is a wrapper function of wrds.get_table(). The queried table is saved to config.input_dir/library/table.
- Parameters:
library – WRDS library. e.g., crsp, comp, …
table – A table in library.
obs – See wrds.get_table().
offset – See wrds.get_table().
columns – See wrds.get_table().
coerce_float – See wrds.get_table().
date_cols – See wrds.get_table().
index_col – (List of) column(s) to be set as index.
sort_col – (List of) column(s) to sort data on.
- download_table_async(library, table, sql=None, date_col=None, sdate=None, edate=None, interval=5, src_tables=None, run_in_executer=True, index_col=None, sort_col=None)
Asynchronous download of a WRDS table.
This method splits the total period into sub-periods of interval years and downloads the data of each sub-period asynchronously. If a download fails, it can be restarted from the failed date: already downloaded files will be gathered together. This method allows us to download a large table, e.g., crsp.dsf, reliably without connection timeout, and it consumes much less memory than download_table(). The queried table is saved to config.input_dir/library/table.
- Parameters:
library – WRDS library.
table – WRDS table. If a complete query is given in sql, this can be any name: used only as the file name when saving the data.
sql – String of the fields to select or a complete query statement. See below.
date_col – Date field on which downloading will be split. Ignored if sql is a complete query statement.
sdate – Start date (‘yyyy-mm-dd’). If None, ‘1900-01-01’.
edate – End date. If None, today.
interval – Sub-period size in years.
src_tables – List of (library, table) that are used in the query. The src_tables are used to get data types of the fields. When data is selected from a single table, library.table, this can be set to None.
run_in_executer – If True, download concurrently. Faster but memory hungrier.
index_col – (List of) column(s) to be set as index.
sort_col – (List of) column(s) to sort data on.
Note
For a small table, this can be slower than download_table(). Table should have a date field (date_col) to split the period.
Examples
Instantiate WRDS.
>>> wrds = WRDS('user_name')
Download crsp.msf.
>>> wrds.download_table_async('crsp', 'msf', date_col='date')
Download ‘permno’, ‘prc’, and ‘exchcd’ fields from crsp.msf.
>>> sql = 'permno, prc, exchcd'
>>> wrds.download_table_async('crsp', 'msf', sql, 'date')
Download crsp.msf merged with crsp.msenames. When sql is a complete query statement as below, it should contain ‘WHERE [date_col] BETWEEN {} and {}’ for asynchronous download.
>>> sql = '''
...     SELECT a.*, b.shrcd, b.exchcd, b.siccd
...     FROM crsp.msf as a
...     LEFT JOIN crsp.msenames as b
...     ON a.permno = b.permno
...     AND b.namedt <= a.date
...     AND a.date <= b.nameendt
...     WHERE a.date BETWEEN '{}' and '{}'
...     ORDER BY a.permno, a.date
... '''
>>> src_tables = [('crsp', 'msf'), ('crsp', 'msenames')]
>>> wrds.download_table_async('crsp', 'msf', sql, src_tables=src_tables)
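The period splitting described for this method can be sketched as follows. `split_period` is a hypothetical helper, not part of the WRDS class, and filling the two `{}` placeholders with `str.format` is an assumption about how the sub-period dates enter the query.

```python
from datetime import date, timedelta


def split_period(sdate, edate, interval=5):
    """Split [sdate, edate] into consecutive sub-periods of `interval` years."""
    start = date.fromisoformat(sdate)
    end = date.fromisoformat(edate)
    periods = []
    while start <= end:
        try:
            nxt = start.replace(year=start.year + interval)
        except ValueError:
            # Feb 29 has no counterpart in a non-leap year; move to Mar 1.
            nxt = date(start.year + interval, 3, 1)
        sub_end = min(nxt - timedelta(days=1), end)
        periods.append((start.isoformat(), sub_end.isoformat()))
        start = nxt
    return periods


# Hypothetical query template with two date placeholders.
template = "SELECT permno, prc FROM crsp.msf WHERE date BETWEEN '{}' and '{}'"
periods = split_period('2000-01-01', '2012-06-30', interval=5)
queries = [template.format(s, e) for s, e in periods]
for q in queries:
    print(q)
```

Each query in the sketch covers one sub-period, so a failed download only needs to be repeated from the sub-period where it stopped.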
- static get_risk_free_rate(sdate=None, edate=None, src='mcti', month_end=False)
Get risk-free rate.
The risk-free rate can be obtained either from crsp.mcti or ff.factors_monthly. The mcti is preferred since the values in factors_monthly have only 4 decimal places. Both risk-free rates are in decimal (not percentage values).
- Parameters:
sdate – Start date.
edate – End date.
src – Data source. ‘mcti’: crsp.mcti, ‘ff’: ff.factors_monthly.
month_end – If True, shift dates to the end of the month.
- Returns:
DataFrame of risk-free rates with index = ‘date’ and columns = [‘rf’].
- static merge_sf_with_seall(monthly=True, fill_method=1)
Merge m(d)sf with m(d)seall.
This method adjusts m(d)sf return (‘ret’) using m(d)seall delist return (‘dlret’). The adjusted return replaces ‘ret’ and ‘dlret’ column is added to m(d)sf. For msf, this method also adds cash dividend column, ‘cash_div’, to msf.
- Parameters:
monthly – If True, merge msf with mseall; else, merge dsf with dseall.
fill_method –
Method to fill missing dlret. 0: don’t fill, 1: JKP code, or 2: GHZ code. Default to 1.
fill_method = 1:
dlret = -0.30 if dlstcd is 500 or between 520 and 584.
fill_method = 2:
dlret = -0.35 if dlstcd is 500 or between 520 and 584, and exchcd is 1 or 2.
dlret = -0.55 if dlstcd is 500 or between 520 and 584, and exchcd is 3.
- Returns:
m(d)sf with adjusted return (and cash dividend).
Note
The msenames can be missing when a firm is delisted, resulting in missing shrcd/exchcd in m(d)sf. Missing shrcd and exchcd are filled with the latest values.
References
Delist codes: http://www.crsp.com/products/documentation/delisting-codes
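The fill_method rules of merge_sf_with_seall() can be sketched for a single observation. `fill_dlret` is a hypothetical helper that uses None for a missing value; leaving dlret missing for exchange codes other than 1, 2, or 3 under fill_method = 2 is an assumption.

```python
def fill_dlret(dlret, dlstcd, exchcd, fill_method=1):
    """Fill a missing delisting return following the documented rules."""
    if dlret is not None or fill_method == 0 or dlstcd is None:
        return dlret                       # nothing to fill
    if not (dlstcd == 500 or 520 <= dlstcd <= 584):
        return None                        # delist code outside the filled range
    if fill_method == 1:                   # JKP rule
        return -0.30
    if fill_method == 2:                   # GHZ rule, depends on exchange code
        if exchcd in (1, 2):
            return -0.35
        if exchcd == 3:
            return -0.55
    return None                            # assumption: otherwise left missing


print(fill_dlret(None, 500, 1, fill_method=1))
print(fill_dlret(None, 550, 3, fill_method=2))
```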
- static preprocess_crsp(use_ccmxpf_linktable=None)
Create crspm and crspd files.
This method calls merge_sf_with_seall() and add_gvkey_to_crsp() to add delist return, gvkey, and primary indicator to m(d)sf. The result is saved to config.input_dir/crspm(d).
- Parameters:
use_ccmxpf_linktable – If True, use crsp.ccmxpf_linktable to link CRSP and Compustat; if False, use internally created link table, crsp_comp_linktable. If None, use crsp.ccmxpf_linktable if it exists, otherwise, use crsp_comp_linktable.
- static read_data(table, library=None, index_col=None, sort_col=None, typecast=True)
Read data from a saved table.
The file path is config.input_dir/library/table. The library argument is optional: if it is None, all folders under config.input_dir are searched.
- Parameters:
table – File name without extension.
library – Directory.
index_col – (List of) column(s) to be set as index.
sort_col – (List of) column(s) to sort data on.
typecast – If True, cast float to config.float_type and object to string after reading from the file.
- Returns:
DataFrame. Data read. Index = index_col.
- static save_as_csv(table, library=None, fpath=None)
Read a file and save it to a csv file.
- Parameters:
table – File name without extension.
library – Directory.
fpath – File path for the csv file. If None, the file is saved to config.input_dir/library/table.csv.
- static save_data(data, table, library=None, index_col=None, sort_col=None, typecast=True)
Save downloaded table to a file.
The file format can be either pickle (default) or parquet and is configured by set_config(). The data is saved in the following location:
If library = None, config.input_dir/table.
Otherwise, config.input_dir/library/table.
- Parameters:
data – Data to save (DataFrame).
table – File name without extension.
library – Directory.
index_col – (List of) column(s) to be set as index.
sort_col – (List of) column(s) to sort data on.
typecast – If True, cast float to config.float_type and object to string before saving to a file.
Note
A parquet file can be significantly smaller, especially when there are many duplicate values in columns. However, it tends to be slower to read and write and takes significantly more memory in some cases for unknown reasons. To change the file format to parquet, use set_config(file_format='parquet'). We use parquet or pickle file format to store data as they preserve data types and are much faster to read compared to a csv file. To convert a file to a csv file, use save_as_csv().