--- title: "DisImpact Tutorial" author: "Vinh Nguyen" date: "`r Sys.Date()`" output: prettydoc::html_pretty: theme: architect highlight: github toc: true vignette: > %\VignetteIndexEntry{DisImpact Tutorial} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- # Introduction The `DisImpact` R package contains functions that help in determining disproportionate impact (DI) based on the following methodologies: 1. [percentage point gap](https://www.cccco.edu/-/media/CCCCO-Website/About-Us/Divisions/Digital-Innovation-and-Infrastructure/Research/Files/PercentagePointGapMethod2017.ashx) (PPG) method, 2. [proportionality index](https://www.cccco.edu/-/media/CCCCO-Website/Files/DII/guidelines-for-measuring-disproportionate-impact-in-equity-plans-tfa-ada.pdf) method (method #1 in reference), and 3. [80% index](https://www.cccco.edu/-/media/CCCCO-Website/Files/DII/guidelines-for-measuring-disproportionate-impact-in-equity-plans-tfa-ada.pdf) method (method #2 in reference). # Install Package ```R # From CRAN (Official) install.packages('DisImpact') # From github (Development) devtools::install_github('vinhdizzo/DisImpact') ``` # Load Packages ```{r, message=FALSE, warning=FALSE} library(DisImpact) library(dplyr) # Ease in manipulations with data frames ``` # Load toy student equity data To illustrate the functionality of the package, let's load a toy data set: ```{r} # Load fake data set data(student_equity) # Print first few observations head(student_equity) # For description of data set ## ?student_equity ``` For a description of the `student_equity` data set, type `?student_equity` in the R console. The toy data set can be summarized as follows: ```{r} # Summarize toy data dim(student_equity) dSumm <- student_equity %>% group_by(Cohort, Ethnicity) %>% summarize(n=n(), Transfer_Rate=mean(Transfer)) dSumm ## This is a summarized version of the data set ``` # Percentage point gap (PPG) method `di_ppg` is the main work function, and it can take on vectors or column names the tidy way: ```{r} # Vector di_ppg(success=student_equity$Transfer, group=student_equity$Ethnicity) %>% as.data.frame # Tidy and column reference di_ppg(success=Transfer, group=Ethnicity, data=student_equity) %>% as.data.frame ``` For a description of the `di_ppg` function, including both function arguments and returned results, type `?di_ppg` in the R console. Sometimes, one might want to break out the DI calculation by cohort: ```{r} # Cohort di_ppg(success=Transfer, group=Ethnicity, cohort=Cohort, data=student_equity) %>% as.data.frame ``` `di_ppg` is also applicable to summarized data; just pass the counts to `success` and group size to `weight`. For example, we use the summarized data set, `dSumm`, and sample size `n`, in the following: ```{r} di_ppg(success=Transfer_Rate*n, group=Ethnicity, cohort=Cohort, weight=n, data=dSumm) %>% as.data.frame ``` By default, `di_ppg` uses the overall success rate as the reference rate for comparison (default: `reference='overall'`). The `reference` argument also accepts `'hpg'` (highest performing group success rate as the reference rate), `'all but current'` (success rate of all groups combined excluding the comparison group), or a group value from `group`. ```{r} # Reference: Highest performing group di_ppg(success=Transfer, group=Ethnicity, cohort=Cohort, reference='hpg', data=student_equity) %>% as.data.frame # Reference: All but current (PPG minus 1) di_ppg(success=Transfer, group=Ethnicity, cohort=Cohort, reference='all but current', data=student_equity) %>% as.data.frame # Reference: custom group di_ppg(success=Transfer, group=Ethnicity, cohort=Cohort, reference='White', data=student_equity) %>% as.data.frame di_ppg(success=Transfer, group=Ethnicity, cohort=Cohort, reference='Asian', data=student_equity) %>% as.data.frame ``` The user could also pass in custom reference points for comparison (e.g., a state-wide rate). `di_ppg` accepts either a single reference point to be used or a vector of reference points, one for each cohort. For the latter, the vector of reference points will be taken to correspond to the `cohort` variable, alphabetically ordered. ```{r} # With custom reference (single) di_ppg(success=Transfer, group=Ethnicity, reference=0.54, data=student_equity) %>% as.data.frame # With custom reference (multiple) di_ppg(success=Transfer, group=Ethnicity, cohort=Cohort, reference=c(0.5, 0.55), data=student_equity) %>% as.data.frame ``` Disproportionate impact using the PPG relies on calculating the margine margin of error (MOE) pertaining around the success rate. The MOE calculated in `di_ppg` has 2 underlying assumptions (defaults): 1. the minimum MOE returned is 0.03, and 2. using 0.50 as the proportion in the margin of error formula, $1.96 \times \sqrt{\hat{p} (1-\hat{p}) / n}$. To override 1, the user could specify `min_moe` in `di_ppg`. To override 2, the user could specify `use_prop_in_moe=TRUE` in `di_ppg`. ```{r} # min_moe di_ppg(success=Transfer, group=Ethnicity, data=student_equity, min_moe=0.02) %>% as.data.frame # use_prop_in_moe di_ppg(success=Transfer, group=Ethnicity, data=student_equity, min_moe=0.02, use_prop_in_moe=TRUE) %>% as.data.frame ``` In cases where the proportion is used in calculating MOE, an observed proportion of 0 or 1 would lead to a zero MOE. To account for these scenarios, the user could leverage the `prop_sub_0` and `prop_sub_1` parameters in `di_ppg` and `ppg_moe` as substitutes. These parameters default to `0.5`, which maximizes the MOE (making it more difficult to declare disproportionate impact). ```{r} # Set Native American to have have zero transfers and see what the results di_ppg(success=Transfer, group=Ethnicity, data=student_equity %>% mutate(Transfer=ifelse(Ethnicity=='Native American', 0, Transfer)), use_prop_in_moe=TRUE, prop_sub_0=0.1, prop_sub_1=0.9) %>% as.data.frame ``` # Proportionality index method `di_prop_index` is the main work function for this method, and it can take on vectors or column names the tidy way: ```{r} # Without cohort ## Vector di_prop_index(success=student_equity$Transfer, group=student_equity$Ethnicity) %>% as.data.frame ## Tidy and column reference di_prop_index(success=Transfer, group=Ethnicity, data=student_equity) %>% as.data.frame # With cohort ## Vector di_prop_index(success=student_equity$Transfer, group=student_equity$Ethnicity, cohort=student_equity$Cohort) %>% as.data.frame ## Tidy and column reference di_prop_index(success=Transfer, group=Ethnicity, cohort=Cohort, data=student_equity) %>% as.data.frame ``` For a description of the `di_prop_index` function, including both function arguments and returned results, type `?di_prop_index` in the R console. Note that the referenced document describing this method does not recommend a threshold on the proportionality index for declaring disproportionate impact. The `di_prop_index` function uses `di_prop_index_cutoff=0.8` as the default threshold, which the user could change. ```{r} # Changing threshold for DI di_prop_index(success=student_equity$Transfer, group=student_equity$Ethnicity, cohort=student_equity$Cohort, di_prop_index_cutoff=0.5) %>% as.data.frame ``` # 80% index method `di_80_index` is the main work function for this method, and it can take on vectors or column names the tidy way: ```{r} # Without cohort ## Vector di_80_index(success=student_equity$Transfer, group=student_equity$Ethnicity) %>% as.data.frame ## Tidy and column reference di_80_index(success=Transfer, group=Ethnicity, data=student_equity) %>% as.data.frame # With cohort ## Vector di_80_index(success=student_equity$Transfer, group=student_equity$Ethnicity, cohort=student_equity$Cohort) %>% as.data.frame ## Tidy and column reference di_80_index(success=Transfer, group=Ethnicity, cohort=Cohort, data=student_equity) %>% as.data.frame ``` For a description of the `di_80_index` function, including both function arguments and returned results, type `?di_80_index` in the R console. By default, `di_80_index` uses the group with the highest success rate as reference in calculating the index. One could specify the the comparison group using the `reference_group` argument (a value from `group`). ```{r} # Changing reference group di_80_index(success=student_equity$Transfer, group=student_equity$Ethnicity, cohort=student_equity$Cohort, reference_group='White') %>% as.data.frame ``` By default, `di_80_index` uses 80% (`di_80_index_cutoff=0.80`) as the default threshold for declaring disproportionate impact. One could override this using another threshold via the `di_80_index_cutoff` argument. ```{r} # Changing threshold for DI di_80_index(success=student_equity$Transfer, group=student_equity$Ethnicity, cohort=student_equity$Cohort, di_80_index_cutoff=0.50) %>% as.data.frame ``` # When dealing with a non-success variable like drop-out or probation All methods and functions implemented in the `DisImpact` package treat outcomes as positive: 1 is desired over 0 (higher rate is better, lower rate indicates disparity). The choice of the name `success` in the functions' arguments is intentional to remind the user of this. Suppose we have a variable that indicates something negative (e.g., a flag for students on academic probation). We could calculate DI on the converse of it by using the `!` (logical negation) operator: ```{r} ## di_ppg(success=!Probation, group=Ethnicity, data=student_equity) %>% ## as.data.frame ## If there were a Probation variable di_ppg(success=!Transfer, group=Ethnicity, data=student_equity) %>% as.data.frame ## Illustrating the point with `!` ``` # Transformations on the fly We can compute the success, group, and cohort variables on the fly: ```{r} # Transform success a <- sample(0:1, size=nrow(student_equity), replace=TRUE, prob=c(0.95, 0.05)) mean(a) di_ppg(success=pmax(Transfer, a), group=Ethnicity, data=student_equity) %>% as.data.frame # Collapse Black and Hispanic di_ppg(success=Transfer, group=ifelse(Ethnicity %in% c('Black', 'Hispanic'), 'Black/Hispanic', Ethnicity), data=student_equity) %>% as.data.frame ``` # Calculate DI for many variables and groups It is often the case that the user desires to calculate disproportionate impact across many outcome variables and many disaggregation/group variables. The function `di_iterate` allows the user to specify a data set and the various variables to iterate across: ```{r} # Multiple group variables di_iterate(data=student_equity, success_vars=c('Transfer'), group_vars=c('Ethnicity', 'Gender'), cohort_vars=c('Cohort'), ppg_reference_groups='overall') %>% as.data.frame # Multiple group variables and different reference groups bind_rows( di_iterate(data=student_equity, success_vars=c('Transfer'), group_vars=c('Ethnicity', 'Gender'), cohort_vars=c('Cohort'), ppg_reference_groups='overall') , di_iterate(data=student_equity, success_vars=c('Transfer'), group_vars=c('Ethnicity', 'Gender'), cohort_vars=c('Cohort'), ppg_reference_groups=c('White', 'Male'), include_non_disagg_results=FALSE) # include_non_disagg_results = FALSE: Already have this scenario in Overall run ) ``` There is a separate [vignette](Scaling-DI-Calculations.html) that explains how one might leverage `di_iterate` for rapid dashboard development and deployment with disaggregation and disproportionate impact features. # Appendix: R and R Package Versions This vignette was generated using an R session with the following packages. There may be some discrepancies when the reader replicates the code caused by version mismatch. ```{r} sessionInfo() ```