IDCeMPy Package¶
Description¶
IDCeMPy is a Python package that provides functions to fit and assess the performance of the following distinct sets of “inflated” discrete choice models:
Fit the Zero-Inflated Ordered Probit (ZiOP) model without and with correlated errors (ZiOPC model) to evaluate zero-inflated ordered choice outcomes that results from a dual data generating process (d.g.p.).
Fit the Middle-Inflated Ordered Probit (MiOP) model without and with correlated errors (MiOPC) to account for the inflated middle-category in ordered choice measures that relates to a dual d.g.p.
Fit Generalized Inflated Multinomial Logit (GiMNL) models that account for the preponderant and heterogeneous share of observations in the baseline or any lower category in unordered polytomous choice outcomes.
Compute AIC and Log-likelihood statistics and the Vuong Test statistic to assess the performance of each inflated discrete choice model in the package.
IDCeMPy uses Newton numerical optimization methods to estimate the models listed above via Maximum Likelihood Estimation (MLE).
When Should You use IDCeMPy?¶
An excessive (“inflated”) share of observations—stemming from two distinct d.g.p’s—fall into a single choice category in many ordered and unordered polytomous outcome variables. Standard Ordered Probit and Multinomial Logit models cannot account for such category inflation which leads to biased inferences. Examples for such d.g.p’s include:
The inflated zero-category of “no smoking” in ordered measures of self-reported smoking behavior is generated from nonsmokers who never smoke cigarettes and those who smoked previously but temporarily stopped smoking because of high cigarette prices.
The inflated “indifference” middle-category in ordered measures of immigration attitudes includes respondents truly indifferent to immigration and those that choose indifference for social desirability reasons.
The inflated baseline or other lower outcome categories of unordered polytomous outcome measures of vote choice include nonvoters who temporarily abstain from voting and routine nonvoters who always abstain.
IDCeMPy includes the ZIOP(C) models for evaluating zero-inflated ordered choice outcomes that results from a dual d.g.p, the MIOP(C) models that address inflated middle-category ordered outcome measures arising from distinct d.g.p’s, and GIMNL models that account for inflated baseline or other categories for unordered polytomous outcomes.
Each inflated discrete choice model in this package addresses category inflation in one’s discrete outcome—unordered or unordered polytomous—of interest by jointly estimating a binary split-stage equation and an ordered or multinomial discrete choice outcome equation.
Installation¶
The package can be installed in two different ways:
From PyPi:
$ pip install idcempy
From its GitHub Repository:
$ git clone https://github.com/hknd23/idcempy.git
$ cd idcempy
$ python setup.py install
Examples¶
The examples below demonstrate how to use the IDCeMPy package to estimate the inflated discrete choice models ZiOP(C), MiOP(C), and GiMNL.
The example code files, with rough calculation of model run time, are available in the /examples directory.
For each model example below, the run time is available as reference point. The specification used to record the times is Intel Core i7-2600 (3.40GHz Quad core), 16GB RAM.
Please note that for models in the zmiopc module, the run-time for models with correlated errors estimated with zmiopc.iopcmod()
is substantially higher
than their without correlated errors counterparts using zmiopc.iopmod()
. Other factors affecting run-time are the number of observations and the number of covariates.
The examples use the pandas, urllib, and matplotlib packages for importing and visualizing data:
$ pip install pandas
$ pip install matplotlib
$ pip install urllib
The Standard Ordered Probit (OP) model¶
The package also includes zmiopc.opmod()
that estimates a standard Ordered Probit (OP) model.
The OP model does not account for “zero inflation” or “middle inflation,” so it does not have a split-probit stage.
First, import the required libraries and data:
# Import the necessary libraries and package
import pandas as pd
import urllib
from idcempy import zmiopc
# Import the "Youth Tobacco Consumption" dataset.
url='https://github.com/hknd23/zmiopc/blob/main/data/tobacco_cons.csv'
# Define a `Pandas` DataFrame
data = pd.read_csv(url)
The list of variable names for the Independent and Dependent variables needs to be specified:
# Define a list of variable names (strings) X,Y:
# X = Column names of covariates (from `DataFrame`) in the OP equation
# Y = Column name of outcome variable (from `DataFrame`).
X = ['age', 'grade', 'gender_dum']
Y = ['cig_count']
After importing the data and specifying the model, the following code fits the OP model:
# Model estimation:
op_tob = zmiopc.opmod(data, X, Y, method = 'bfgs', weights = 1, offsetx =0)
# data = name of pandas DataFrame
# X = variables in the ordered probit stage.
# Y = dependent variable.
# method = method for optimization. By default set to 'bfgs'
# weights = weights.
# offsetx = offset of X. By Default is zero.
# offsetz = offset of z
The following message will appear when the model has converged:
Warning: Desired error not necessarily achieved due to precision loss.
Current function value: 4411.710049
Iterations: 10
Function evaluations: 976
Gradient evaluations: 121
The model’s run-time is 37.694 seconds (N= 9624). zmiopc.OpModel
stores results from model estimation and other information in its attributes.
The following line of code to see the estimates of coefficients:
# Print coefficients of the models
print(op_tob.coefs)
Coef SE tscore p 2.5% 97.5%
cut1 1.696175 0.047320 35.844532 0.000000 1.603427 1.788922
cut2 -0.705037 0.031650 -22.276182 0.000000 -0.767071 -0.643004
cut3 -2.304405 0.121410 -18.980329 0.000000 -2.542369 -2.066441
cut4 2.197381 0.235338 9.337141 0.000000 1.736119 2.658643
age -0.070615 0.007581 -9.314701 0.000000 -0.085474 -0.055756
grade 0.233741 0.010336 22.614440 0.000000 0.213483 0.254000
gender_dum 0.020245 0.032263 0.627501 0.530331 -0.042991 0.083482
Log-likelihood, AIC, and Variance-Covariance matrix can be extracted with:
# Print Log-Likelihood
print(op_tob.llik)
# Print AIC
print(op_tob.AIC)
# Print VCOV matrix
print(op_tob.vcov)
The Vuong Test¶
Harris and Zhao (2007) suggest that a variant of the Vuong (1989) Test (with a v statistic) can be used to compare the performance of the ZiOP versus the standard Ordered Probit (OP) model. The Vuong’s test formula is:
where v < -1.96 favors the more general (ZiOP/ZiOPC) model, -1.96 < v < 1.96 lends no support to either model, and v > 1.96 supports the simpler (OP) model.
The OP and ZiOP models must have the same number of observations, and the OP must have the same number of covariates as ZiOP’s OP stage. The statistic below reveals that the OP model is preferred over the ZiOP model.
# Estimate Vuong test. OP model first, ZIOP model specified next in this case
zmiopc.vuong_opiop(op_tob, ziop_tob)
6.624742132792222
The Vuong test can also be implemented to compare the ZiOPC, MiOP and MiOPC models with the OP model.
Generalized Inflated Multinomial Logit (GiMNL) Model¶
The gimnl
module provides gimnl.gimnlmod()
to estimate the General “inflated” Multinomial Logit models (GiMNL) with three outcomes in the dependent variable.
The GiMNL model minimize issues present when unordered polytomous outcome variables have an excessive share and heterogeneous pool of observations in the lower category.
Similar to the models in the zmiopc
module, the first step is to import the libraries and 2004 presidential vote choice dataset.
# Import the module
import pandas as pd
import urllib
from idcempy import gimnl
# Load the dataset
url= 'https://github.com/hknd23/zmiopc/raw/main/data/replicationdata.dta'
# Define a `Pandas` DataFrame
data = pd.read_stata(url)
We the define the list of covariates in the split-stage (z), the multinomial logit-stage (x) and the outcome variable (y). The values of the dependent variable must be represented numerically as “0”, “1”, and “2” to represent each category. To specify the baseline/reference category, users provide a three-element list for the reference argument (e.g [0,1,2]). The first element of the list is the baseline/reference category.
# x = Column names of covariates (from `DataFrame`) in the outcome-stage.
# z = Column names of covariates (from `DataFrame`) in the split-stage.
# y = Column names of outcome variable (from `DataFrame`).
x = ['educ', 'party7', 'agegroup2']
z = ['educ', 'agegroup2']
y = ['vote_turn']
The flexibility of gimnl.gimnlmod()
allows users to customize the baseline and inflated categories.
Users can employ the argument inflatecat with ‘baseline’, ‘second’, or ‘third’ to specify any unordered category as the inflated category (dictated by the distribution) in their unordered-polytomous outcome measure.
If ‘baseline’ is selected, the first element (baseline/reference category) in reference is the inflated outcome.
Likewise, if ‘second’ or ‘third’ is selection, the second or third element will be the inflated outcome. The following code specifies the outcome ‘0’ (Abstain) as both the baseline and inflated category.
# Define order of variables
order = [0, 1, 2]
# Define "inflation" category
inflatecat = "baseline"
# Estimate the model
gimnl_2004vote = gimnl.gimnlmod(data, x, y, z, method = 'bfgs', order, inflatecat)
# data = name of pandas DataFrame.
# x = variables in the ordered stage.
# y = dependent variable.
# z = variables in the inflation stage.
# method = optimization method. Default is 'bfgs'
# order = order of variables.
# inflatecat = inflated category.
The following line of code prints the coefficients of the covariates:
# Print coefficients
print(gimnl_2004vote.coefs)
Coef SE tscore p 2.5% 97.5%
Inflation: int -4.935 2.777 -1.777 0.076 -10.379 0.508
Inflation: educ 1.886 0.293 6.441 0.000 1.312 2.460
Inflation: agegroup2 1.295 0.768 1.685 0.092 -0.211 2.800
1: int -4.180 1.636 -2.556 0.011 -7.387 -0.974
1: educ 0.334 0.185 1.803 0.071 -0.029 0.697
1: party7 0.454 0.057 7.994 0.000 0.343 0.566
1: agegroup2 0.954 0.248 3.842 0.000 0.467 1.441
2: int 0.900 1.564 0.576 0.565 -2.166 3.966
2: educ 0.157 0.203 0.772 0.440 -0.241 0.554
2: party7 -0.577 0.058 -9.928 0.000 -0.691 -0.463
2: agegroup2 0.916 0.235 3.905 0.000 0.456 1.376
The model’s run-time is 16.646 seconds (N= 1341). The results from the model are stored in a gimnlModel
with the following attributes:
coefs: Model coefficients and standard errors.
llik: Log-likelihood.
AIC: Akaike information criterion.
vcov: Variance-covariance matrix.
For example, AIC can be printed as follows.
# Print Log_Likelihood
print(gimnl_2004vote.llik)
# Print AIC
print(gimnl_2004vote.AIC)
# Print VCOV matrix
print(gimnl_2004vote.vcov)
Users can fit a standard three-category Multinomial Logit Model (MNL) by specifying the list of x, y, and baseline (using reference).
#Estimate the model
mnl_2004vote = gimnl.mnlmod(data, x, y, method = 'bfgs')
# data = name of Pandas DataFrame.
# x = variables in MNL stage.
# y = dependent variable
# method = optimization method. Default is 'bfgs'
# Print the coefficients
print(mnl_2004vote.coefs)
Coef SE tscore p 2.5% 97.5%
1: int -4.914 0.164 -29.980 0.000 -5.235 -4.593
1: educ 0.455 0.043 10.542 0.000 0.371 0.540
1: party7 0.462 0.083 5.571 0.000 0.300 0.625
1: agegroup2 0.951 0.029 32.769 0.000 0.894 1.008
2: int 0.172 0.082 2.092 0.036 0.011 0.334
2: educ 0.282 0.031 9.011 0.000 0.221 0.343
2: party7 -0.567 0.085 -6.641 0.000 -0.734 -0.399
2: agegroup2 0.899 0.138 6.514 0.000 0.629 1.170
The MNL model’s run-time is 8.276 seconds. Similar to the GiMNL model, the AIC for the MNL model can also be given by:
# Print Log-Likelihood
print(mnl_2004vote.AIC)
# Print AIC
print(mnl_2004vote.AIC)
# Print VCOV matrix
print(mnl_2004vote.vcov)
Contributions¶
The authors welcome and encourage new contributors to help test IDCeMPy and add new functionality. You can find detailed instructions on “how to contribute” to IDCeMPy here.