Package 'DisImpact'

Title: Calculates Disproportionate Impact When Binary Success Data are Disaggregated by Subgroups
Description: Implements methods for calculating disproportionate impact: the percentage point gap, proportionality index, and the 80% index. California Community Colleges Chancellor's Office (2017). Percentage Point Gap Method. <https://www.cccco.edu/-/media/CCCCO-Website/About-Us/Divisions/Digital-Innovation-and-Infrastructure/Research/Files/PercentagePointGapMethod2017.ashx>. California Community Colleges Chancellor's Office (2014). Guidelines for Measuring Disproportionate Impact in Equity Plans. <https://www.cccco.edu/-/media/CCCCO-Website/Files/DII/guidelines-for-measuring-disproportionate-impact-in-equity-plans-tfa-ada.pdf>.
Authors: Vinh Nguyen [aut, cre]
Maintainer: Vinh Nguyen <[email protected]>
License: GPL-3
Version: 0.0.22.9000
Built: 2025-01-27 04:47:31 UTC
Source: https://github.com/vinhdizzo/disimpact



Calculate disproportionate impact per the 80% index

Description

Calculate disproportionate impact per the 80% index method.

Usage

di_80_index(
  success,
  group,
  cohort,
  weight,
  data,
  di_80_index_cutoff = 0.8,
  reference_group = "hpg",
  check_valid_reference = TRUE
)

Arguments

success

A vector of success indicators (1/0 or TRUE/FALSE) or an unquoted reference (name) to a column in data if it is specified. It could also be a vector of counts, in which case weight should also be specified (group size).

group

A vector of group names of the same length as success or an unquoted reference (name) to a column in data if it is specified.

cohort

(Optional) A vector of cohort names of the same length as success or an unquoted reference (name) to a column in data if it is specified. Disproportionate impact is calculated for every group within each cohort. When cohort is not specified, the analysis assumes a single cohort.

weight

(Optional) A vector of case weights of the same length as success or an unquoted reference (name) to a column in data if it is specified. If success consists of counts instead of success indicators (1/0), then weight should also be specified to indicate the group size.

data

(Optional) A data frame containing the variables of interest. If data is specified, then success, group, and cohort will be searched within it.

di_80_index_cutoff

A numeric value between 0 and 1 used as the threshold for disproportionate impact: a group is flagged when the ratio of its success rate to the reference group's success rate falls below this value. Defaults to 0.80.

reference_group

The reference group value in group that each group should be compared to in order to determine disproportionate impact. By default (='hpg'), the group with the highest success rate is used as reference. The user could also specify a value of 'overall' to use the overall rate as the reference for comparison, or 'all but current' to use the combined success rate of all other groups excluding the current group for each comparison.

check_valid_reference

Check whether reference_group is a valid value; defaults to TRUE. This argument exists for use by di_iterate: when iterating DI calculations, a specified reference group may contain no students in some scenarios.

Details

This function determines disproportionate impact based on the 80% index method, as described in the California Community Colleges Chancellor's Office (2014) guidelines listed under References. It assumes that a higher rate is good ("success"). For rates where high is bad (e.g., a drop-out rate), consider analyzing the complement instead (e.g., the rate of non-drop-outs, where high is good) so that the function applies properly.
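The calculation described above can be sketched in base R. This is a minimal illustration on made-up data, not the package's implementation: the toy vectors and the names reference_hpg, reference_overall, and reference_all_but_current are assumptions introduced here to show how the three reference_group options lead to different reference rates.

```r
# Toy data (hypothetical): 12 students in three groups.
success <- c(1, 0, 1, 1,  0, 0, 1, 0,  1, 1, 0, 0)
group   <- rep(c("A", "B", "C"), each = 4)
cutoff  <- 0.8  # plays the role of di_80_index_cutoff

n         <- tapply(success, group, length)  # group sizes
successes <- tapply(success, group, sum)     # successes per group
pct       <- successes / n                   # success rate per group

# reference_group = 'hpg': the highest group success rate is the reference.
reference_hpg <- max(pct)

result <- data.frame(
  group       = names(pct),
  n           = as.vector(n),
  success     = as.vector(successes),
  pct         = as.vector(pct),
  reference   = reference_hpg,
  di_80_index = as.vector(pct) / reference_hpg
)
# Flag groups whose ratio to the reference falls below the cutoff.
result$di_indicator <- as.integer(result$di_80_index < cutoff)
print(result)

# Alternative reference rates for the other two options described above.
reference_overall <- sum(successes) / sum(n)
reference_all_but_current <-
  as.vector((sum(successes) - successes) / (sum(n) - n))
```

With the 'hpg' reference, the highest-performing group always has an index of 1 and is never flagged; under 'overall' or 'all but current', every group is compared against a pooled rate instead.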

Value

A data frame consisting of:

  • cohort (if used),

  • group,

  • n (sample size),

  • success (number of successes for the cohort-group),

  • pct (proportion of successes for the cohort-group),

  • reference_group (the reference group used to compare and determine disproportionate impact),

  • reference (the reference rate used for comparison, corresponding to reference_group),

  • di_80_index (ratio of pct to the reference),

  • di_indicator (1 if di_80_index < di_80_index_cutoff, 0 otherwise),

  • success_needed_not_di (the number of additional successes needed in order to no longer be considered disproportionately impacted as compared to the reference), and

  • success_needed_full_parity (the number of additional successes needed in order to achieve full parity with the reference).

References

California Community Colleges Chancellor's Office (2014). Guidelines for Measuring Disproportionate Impact in Equity Plans.

Examples

library(dplyr)
data(student_equity)
di_80_index(success=Transfer, group=Ethnicity, data=student_equity) %>%
  as.data.frame

Calculates disproportionate impact using multiple methods for data stored in a data.table object.

Description

Calculate disproportionate impact via the percentage point gap (PPG), proportionality index, and 80% index methods for data stored in a data.table object. This is the workhorse function leveraged by the di_iterate_dt function.

Usage

di_calc_dt(
  dt,
  success_var,
  group_var,
  cohort_var = "",
  weight_var = NULL,
  ppg_reference_group = "overall",
  min_moe = 0.03,
  use_prop_in_moe = FALSE,
  prop_sub_0 = 0.5,
  prop_sub_1 = 0.5,
  di_prop_index_cutoff = 0.8,
  di_80_index_cutoff = 0.8,
  di_80_index_reference_group = "hpg",
  filter_subset = ""
)

Arguments

dt

A data frame of class data.table. If the object is not a data table, one could surround the object with as.data.table.

success_var

A character value specifying the success variable name.

group_var

A character value specifying the group (disaggregation) variable name.

cohort_var

(Optional) A character value specifying the cohort variable. If not specified, then a single cohort is assumed (defaults to an empty string, '').

weight_var

(Optional) A character value specifying the weight variable if the input data set is summarized (i.e., the success variable specified in success_var contains counts of successes). Weight here corresponds to the denominator when calculating the success rate. Defaults to NULL for an input data set where each row describes an individual.

ppg_reference_group

Either 'overall', 'hpg', 'all but current', or a character value specifying a group from group_var to be used as the reference group for comparison using the percentage point gap method.

min_moe

The minimum margin of error to be used in the PPG calculation; see di_ppg.

use_prop_in_moe

(TRUE or FALSE) Whether the estimated proportions should be used in the margin of error calculation by the PPG; see di_ppg.

prop_sub_0

Default is 0.50; see di_ppg.

prop_sub_1

Default is 0.50; see di_ppg.

di_prop_index_cutoff

Threshold used for determining disproportionate impact using the proportionality index; see di_prop_index; defaults to 0.80.

di_80_index_cutoff

Threshold used for determining disproportionate impact using the 80% index; see di_80_index; defaults to 0.80.

di_80_index_reference_group

Either 'overall', 'hpg', 'all but current', or a character value specifying a group from group_var to be used as the reference group for comparison using the 80% index.

filter_subset

A character value such as "Ethnicity == 'White' & Gender == 'M'" used in the i argument (filtering rows via dt[i, j, by]) to filter data in dt. The character value is parsed using eval(parse(text=filter_subset)). Defaults to '' for no filtering.
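To illustrate how a filter string of this kind is evaluated, here is a minimal sketch using a plain data frame rather than a data.table, so that it runs without extra packages. The helper apply_filter and the toy data are hypothetical; only the eval(parse(text = ...)) idiom comes from the description above.

```r
# Toy data (hypothetical).
df <- data.frame(
  Ethnicity = c("White", "White", "Asian", "White"),
  Gender    = c("M", "F", "M", "M"),
  Transfer  = c(1, 0, 1, 0),
  stringsAsFactors = FALSE
)

# Hypothetical helper: evaluate a filter string against the columns of d.
apply_filter <- function(d, filter_subset = "") {
  if (!nzchar(filter_subset)) {
    return(d)  # an empty string means no filtering
  }
  # Parse the text into an expression and evaluate it with the
  # data frame's columns in scope, yielding a logical vector.
  keep <- eval(parse(text = filter_subset), envir = d)
  d[keep, , drop = FALSE]
}

white_men <- apply_filter(df, "Ethnicity == 'White' & Gender == 'M'")
print(white_men)
```

Because the string is parsed and evaluated as R code, it should only ever come from a trusted source.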

Value

A data.table object with summarized results.


Generate SQL code that calculates disproportionate impact using multiple methods for a specified table.

Description

Generate SQL code that calculates disproportionate impact via the percentage point gap (PPG), proportionality index, and 80% index methods for a specified table name, success variable, group variable, and cohort variable. This is the workhorse function leveraged by the di_iterate_sql function.

Usage

di_calc_sql(
  db_table_name,
  success_var,
  group_var,
  cohort_var = "",
  weight_var = 1,
  ppg_reference_group = "overall",
  min_moe = 0.03,
  use_prop_in_moe = FALSE,
  prop_sub_0 = 0.5,
  prop_sub_1 = 0.5,
  di_prop_index_cutoff = 0.8,
  di_80_index_cutoff = 0.8,
  di_80_index_reference_group = "hpg",
  before_with_statement = "",
  after_with_statement = "",
  end_of_select_statement = "",
  where_statement = "",
  select_statement_add = ""
)

Arguments

db_table_name

A character value specifying a database table name.

success_var

A character value specifying the success variable name.

group_var

A character value specifying the group (disaggregation) variable name.

cohort_var

(Optional) A character value specifying the cohort variable. If not specified, then a single cohort is assumed (defaults to an empty string, '').

weight_var

(Optional) A character value specifying the weight variable if the input data set is summarized (i.e., the success variable specified in success_var contains counts of successes). Weight here corresponds to the denominator when calculating the success rate. Defaults to a numeric 1, which treats each row as an individual.

ppg_reference_group

Either 'overall', 'hpg', 'all but current', or a character value specifying a group from group_var to be used as the reference group for comparison using the percentage point gap method.

min_moe

The minimum margin of error to be used in the PPG calculation; see di_ppg.

use_prop_in_moe

(TRUE or FALSE) Whether the estimated proportions should be used in the margin of error calculation by the PPG; see di_ppg.

prop_sub_0

Default is 0.50; see di_ppg.

prop_sub_1

Default is 0.50; see di_ppg.

di_prop_index_cutoff

Threshold used for determining disproportionate impact using the proportionality index; see di_prop_index; defaults to 0.80.

di_80_index_cutoff

Threshold used for determining disproportionate impact using the 80% index; see di_80_index; defaults to 0.80.

di_80_index_reference_group

Either 'overall', 'hpg', 'all but current', or a character value specifying a group from group_var to be used as the reference group for comparison using the 80% index.

before_with_statement

Character value to be added to the SQL query to allow for modification. Defaults to '' (empty string).

after_with_statement

Character value to be added to the SQL query to allow for modification. Defaults to '' (empty string).

end_of_select_statement

Character value to be added to the SQL query to allow for modification. Defaults to '' (empty string).

where_statement

Character value to be added to the SQL query to allow for modification. Defaults to '' (empty string).

select_statement_add

Character value to be added to the SQL query to allow for modification. Defaults to '' (empty string).

Value

A character value (SQL query) that could be executed on a database.
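The role of weight_var as the denominator can be shown with a small arithmetic sketch in R (not SQL, and not the query this function generates). The toy data and the helper rate are hypothetical; the point is only that a weight of 1 per row on individual-level data and a group-size weight on summarized data yield the same success rates.

```r
# Individual-level data (hypothetical): one row per student, weight = 1.
rows <- data.frame(
  group   = c("A", "A", "A", "B", "B"),
  success = c(1, 0, 1, 0, 1)
)
rows$weight <- 1

# Summarized data (hypothetical): success holds counts, weight holds group size.
summarized <- data.frame(
  group   = c("A", "B"),
  success = c(2, 1),
  weight  = c(3, 2)
)

# Hypothetical helper: success rate = sum of successes / sum of weights.
rate <- function(d) {
  as.vector(tapply(d$success, d$group, sum) / tapply(d$weight, d$group, sum))
}

rate_rows       <- rate(rows)
rate_summarized <- rate(summarized)
print(rate_rows)
print(rate_summarized)
```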


Iteratively calculate disproportionate impact using multiple methods for many variables.

Description

Iteratively calculate disproportionate impact via the percentage point gap (PPG), proportionality index, and 80% index methods for many success variables, disaggregation variables, and scenarios.

Usage

di_iterate(
  data,
  success_vars,
  group_vars,
  cohort_vars = NULL,
  scenario_repeat_by_vars = NULL,
  exclude_scenario_df = NULL,
  weight_var = NULL,
  include_non_disagg_results = TRUE,
  ppg_reference_groups = "overall",
  min_moe = 0.03,
  use_prop_in_moe = FALSE,
  prop_sub_0 = 0.5,
  prop_sub_1 = 0.5,
  di_prop_index_cutoff = 0.8,
  di_80_index_cutoff = 0.8,
  di_80_index_reference_groups = "hpg",
  check_valid_reference = TRUE,
  parallel = FALSE,
  parallel_n_cores = parallel::detectCores(),
  parallel_split_to_disk = FALSE
)

Arguments

data

A data frame for which to iterate DI calculations for a set of variables.

success_vars

A character vector of success variable names to iterate across.

group_vars

A character vector of group (disaggregation) variable names to iterate across.

cohort_vars

(Optional) A character vector of the same length as success_vars to indicate the cohort variable to be used for each variable specified in success_vars. A vector of length 1 could be specified, in which case the same cohort variable is used for each success variable. If not specified, then a single cohort is assumed for all success variables.

scenario_repeat_by_vars

(Optional) A character vector of variables to repeat DI calculations for across all combinations of these variables. For example, the following variables could be specified:

  • Ed Goal: Degree/Transfer, Short-term Career, Non-credit

  • First time college student: Yes, No

  • Full-time status: Yes, No

Each combination of these variables (e.g., full-time, first-time college students with an ed goal of degree/transfer as one combination) would constitute an iteration / sample for which to calculate disproportionate impact for outcomes listed in success_vars and for the disaggregation variables listed in group_vars. The overall rate of success for full-time, first-time college students with an ed goal of degree/transfer would include only these students and not others. Each variable specified is also collapsed to an '- All' group so that the combinations also reflect all students of a particular category. The total number of combinations for the previous example would be (+1 representing the all category): (3 + 1) x (2 + 1) x (2 + 1) = 36.

exclude_scenario_df

(Optional) A data frame with variables that match scenario_repeat_by_vars for specifying the combinations to exclude from DI calculations. Following the example specified above, one could choose to exclude part-time non-credit students from consideration.
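The scenario count and the exclusion step described for scenario_repeat_by_vars and exclude_scenario_df can be sketched in base R. The column names (ed_goal, first_time, full_time) are hypothetical, and the exact-match exclusion shown here is an assumption about how matching works, not a statement of the package's internals.

```r
# Values of each scenario variable from the example above (column names are hypothetical).
scenario_values <- list(
  ed_goal    = c("Degree/Transfer", "Short-term Career", "Non-credit"),
  first_time = c("Yes", "No"),
  full_time  = c("Yes", "No")
)

# Add the collapsed '- All' level to every variable, then cross them.
with_all  <- lapply(scenario_values, function(v) c(v, "- All"))
scenarios <- expand.grid(with_all, stringsAsFactors = FALSE)
nrow(scenarios)  # (3 + 1) x (2 + 1) x (2 + 1) = 36

# Exclude part-time non-credit students, whatever their first-time status.
exclude <- expand.grid(
  ed_goal    = "Non-credit",
  first_time = c("Yes", "No", "- All"),
  full_time  = "No",
  stringsAsFactors = FALSE
)

# Build one key per row and drop scenarios whose key appears in exclude.
key  <- function(df) do.call(paste, c(df[names(scenario_values)], sep = "|"))
kept <- scenarios[!(key(scenarios) %in% key(exclude)), ]
nrow(kept)
```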

weight_var

(Optional) A character value specifying the weight variable if the input data set is summarized (i.e., the success variables specified in success_vars contain counts of successes). Weight here corresponds to the denominator when calculating the success rate. Defaults to NULL for an input data set where each row describes an individual.

include_non_disagg_results

A logical value specifying whether or not the non-disaggregated results should be returned; defaults to TRUE. When TRUE, a new variable `- None` is added to the data set with a single data value '- All', and this variable is added to group_vars as a disaggregation/group variable. The user would want these results returned to review non-disaggregated results.
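As a rough sketch of the behavior just described (toy data, not the package's code), adding a constant `- None` column lets the same group-wise machinery produce the overall, non-disaggregated rate:

```r
# Toy data (hypothetical).
data <- data.frame(
  Ethnicity = c("Asian", "White", "Asian"),
  Transfer  = c(1, 0, 1),
  stringsAsFactors = FALSE
)
group_vars <- c("Ethnicity")

# Add a constant column so every row falls into the single group '- All'.
data[["- None"]] <- "- All"
group_vars <- c(group_vars, "- None")

# Grouping by the constant column yields the overall success rate.
overall_rate <- tapply(data$Transfer, data[["- None"]], mean)
print(overall_rate)
```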

ppg_reference_groups

Either 'overall', 'hpg', 'all but current', or a character vector of the same length as group_vars that indicates the reference group value for each group variable in group_vars when determining disproportionate impact using the percentage point gap method.

min_moe

The minimum margin of error to be used in the PPG calculation, passed to di_ppg.

use_prop_in_moe

Whether the estimated proportions should be used in the margin of error calculation by the PPG, passed to di_ppg.

prop_sub_0

Passed to di_ppg; defaults to 0.50.

prop_sub_1

Passed to di_ppg; defaults to 0.50.

di_prop_index_cutoff

Threshold used for determining disproportionate impact using the proportionality index; passed to di_prop_index; defaults to 0.80.

di_80_index_cutoff

Threshold used for determining disproportionate impact using the 80% index; passed to di_80_index; defaults to 0.80.

di_80_index_reference_groups

Either 'overall', 'hpg', 'all but current', or a character vector of the same length as group_vars that indicates the reference group value for each group variable in group_vars when determining disproportionate impact using the 80% index.

check_valid_reference

Check whether ppg_reference_groups and di_80_index_reference_groups contain valid values; defaults to TRUE.

parallel

If TRUE, then perform calculations in parallel based on the scenarios specified by scenario_repeat_by_vars. Defaults to FALSE. Parallel execution is based on the parallel package included in base R, using parLapply on Windows and mclapply on POSIX-based systems (Linux/Mac).

parallel_n_cores

The number of CPU cores to use if parallel=TRUE. Defaults to the maximum number of CPU cores on the system.

parallel_split_to_disk

If TRUE and parallel=TRUE, then create intermediate data sets for each scenario generated by scenario_repeat_by_vars, write them to disk, and import the required data set when necessary for each scenario executing in parallel. This feature is useful when the data set specified by data is very large and parallel execution is desired for speed in order to reduce the likelihood of consuming all the system's memory and crashing. Note that there is an overhead I/O cost on speed when this feature is used. Defaults to FALSE.
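The platform split mentioned for the parallel argument (parLapply on Windows, mclapply elsewhere) can be sketched as follows. The helper run_in_parallel and the toy scenarios are hypothetical; this shows the general pattern with the base parallel package, not the package's own dispatch code.

```r
library(parallel)

# Hypothetical helper: apply fun to each item using n_cores workers.
run_in_parallel <- function(items, fun, n_cores = 2) {
  if (.Platform$OS.type == "windows") {
    # Windows cannot fork, so start a socket cluster and clean it up on exit.
    cl <- makeCluster(n_cores)
    on.exit(stopCluster(cl), add = TRUE)
    parLapply(cl, items, fun)
  } else {
    # POSIX systems (Linux/Mac) can fork the current R process.
    mclapply(items, fun, mc.cores = n_cores)
  }
}

# Toy "scenarios" (hypothetical): one numeric vector per scenario.
scenarios <- list(a = 1:10, b = 11:20, c = 21:30)
means <- run_in_parallel(scenarios, function(x) mean(x))
print(unlist(means))
```

With a socket cluster, each worker starts as a fresh R session, so any data or packages the function needs must be passed in or loaded on the workers; forked workers inherit the parent session's state.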

Details

Iteratively calculate disproportionate impact via the percentage point gap (PPG), proportionality index, and 80% index methods for all combinations of success_vars, group_vars, and cohort_vars, for each combination of subgroups specified by scenario_repeat_by_vars.

Value

A summarized data set (data frame) consisting of:

  • success_variable (elements of success_vars),

  • disaggregation (elements of group_vars),

  • cohort (values corresponding to the variables specified in cohort_vars),

  • di_indicator_ppg (1 if there is disproportionate impact per the percentage point gap method, 0 otherwise),

  • di_indicator_prop_index (1 if there is disproportionate impact per the proportionality index, 0 otherwise),

  • di_indicator_80_index (1 if there is disproportionate impact per the 80% index, 0 otherwise), and

  • other relevant fields returned from di_ppg, di_prop_index, and di_80_index.

Examples

library(dplyr)
data(student_equity)
# Multiple group variables
di_iterate(data=student_equity, success_vars=c('Transfer')
  , group_vars=c('Ethnicity', 'Gender'), cohort_vars=c('Cohort')
  , ppg_reference_groups='overall')

Iteratively calculate disproportionate impact using multiple methods for many variables, using data.table and collapse.

Description

Iteratively calculate disproportionate impact via the percentage point gap (PPG), proportionality index, and 80% index methods for many success variables, disaggregation variables, and scenarios, using data.table and collapse.

Usage

di_iterate_dt(
  dt,
  success_vars,
  group_vars,
  cohort_vars = NULL,
  scenario_repeat_by_vars = NULL,
  exclude_scenario_df = NULL,
  weight_var = NULL,
  include_non_disagg_results = TRUE,
  ppg_reference_groups = "overall",
  min_moe = 0.03,
  use_prop_in_moe = FALSE,
  prop_sub_0 = 0.5,
  prop_sub_1 = 0.5,
  di_prop_index_cutoff = 0.8,
  di_80_index_cutoff = 0.8,
  di_80_index_reference_groups = "hpg",
  check_valid_reference = TRUE,
  parallel = FALSE,
  parallel_n_cores = parallel::detectCores()/2
)

Arguments

dt

A data frame of class data.table. If the object is not a data table, one could surround the object with as.data.table.

success_vars

A character vector of success variable names to iterate across.

group_vars

A character vector of group (disaggregation) variable names to iterate across.

cohort_vars

(Optional) A character vector of the same length as success_vars to indicate the cohort variable to be used for each variable specified in success_vars. A vector of length 1 could be specified, in which case the same cohort variable is used for each success variable. If not specified, then a single cohort is assumed for all success variables (defaults to NULL).

scenario_repeat_by_vars

(Optional) A character vector of variables to repeat DI calculations for across all combinations of these variables. For example, the following variables could be specified:

  • Ed Goal: Degree/Transfer, Short-term Career, Non-credit

  • First time college student: Yes, No

  • Full-time status: Yes, No

Each combination of these variables (e.g., full-time, first-time college students with an ed goal of degree/transfer as one combination) would constitute an iteration / sample for which to calculate disproportionate impact for outcomes listed in success_vars and for the disaggregation variables listed in group_vars. The overall rate of success for full-time, first-time college students with an ed goal of degree/transfer would include only these students and not others. Each variable specified is also collapsed to an '- All' group so that the combinations also reflect all students of a particular category. The total number of combinations for the previous example would be (+1 representing the all category): (3 + 1) x (2 + 1) x (2 + 1) = 36.

exclude_scenario_df

(Optional) A data frame with variables that match scenario_repeat_by_vars for specifying the combinations to exclude from DI calculations. Following the example specified above, one could choose to exclude part-time non-credit students from consideration.

weight_var

(Optional) A character value specifying the weight variable if the input data set is summarized (i.e., the success variables specified in success_vars contain counts of successes). Weight here corresponds to the denominator when calculating the success rate. Defaults to NULL for an input data set where each row describes an individual.

include_non_disagg_results

A logical variable specifying whether or not the non-disaggregated results should be returned; defaults to TRUE. When TRUE, a new variable `- None` is added to the data set with a single data value '- All', and this variable is added to group_vars as a disaggregation/group variable. The user would want these results returned to review non-disaggregated results.

ppg_reference_groups

Either 'overall', 'hpg', 'all but current', or a character vector of the same length as group_vars that indicates the reference group value for each group variable in group_vars when determining disproportionate impact using the percentage point gap method.

min_moe

The minimum margin of error to be used in the PPG calculation; see di_ppg.

use_prop_in_moe

(TRUE or FALSE) Whether the estimated proportions should be used in the margin of error calculation by the PPG; see di_ppg.

prop_sub_0

Default is 0.50; see di_ppg.

prop_sub_1

Default is 0.50; see di_ppg.

di_prop_index_cutoff

Threshold used for determining disproportionate impact using the proportionality index; see di_prop_index; defaults to 0.80.

di_80_index_cutoff

Threshold used for determining disproportionate impact using the 80% index; see di_80_index; defaults to 0.80.

di_80_index_reference_groups

Either 'overall', 'hpg', 'all but current', or a character vector of the same length as group_vars that indicates the reference group value for each group variable in group_vars when determining disproportionate impact using the 80% index.

check_valid_reference

(TRUE or FALSE) Check whether ppg_reference_groups and di_80_index_reference_groups contain valid values; defaults to TRUE.

parallel

If TRUE, then perform calculations in parallel. Defaults to FALSE. Parallel execution is based on the parallel package included in base R, using parLapply on Windows and mclapply on POSIX-based systems (Linux/Mac).

parallel_n_cores

The number of CPU cores to use if parallel=TRUE. Defaults to half of the maximum number of CPU cores on the system.

Details

Iteratively calculate disproportionate impact via the percentage point gap (PPG), proportionality index, and 80% index methods for all combinations of success_vars, group_vars, and cohort_vars, for each combination of subgroups specified by scenario_repeat_by_vars, using data.table and collapse.

Value

A summarized data set of class data.table, with variables as described in di_iterate.


Iteratively calculate disproportionate impact using multiple methods for a long and summarized data set

Description

Calculate disproportionate impact via the percentage point gap (PPG), proportionality index, and 80% index methods for a "long" and summarized data set with many success variables and disaggregation variables, where the success counts and disaggregation groups are stored in a single column or variable for each.

Usage

di_iterate_on_long(
  data,
  num_var,
  denom_var,
  disagg_var_col,
  group_var_col,
  disagg_var_col_2 = NULL,
  group_var_col_2 = NULL,
  cohort_var_col = NULL,
  summarize_by_vars = NULL,
  custom_reference_group_flag_var = NULL,
  ...
)

Arguments

data

A data frame for which to iterate DI calculations for a set of variables.

num_var

A variable name (character value) from data where the variable stores success counts (the numerator in success rates). Success rates are calculated by aggregating num_var and denom_var for each unique combination of values in disagg_var_col, group_var_col, disagg_var_col_2, group_var_col_2, cohort_var_col, and summarize_by_vars. If such combinations are unique (single row), then rows are not collapsed.

denom_var

A variable name (character value) from data where the variable stores the group size (the denominator in success rates).

disagg_var_col

A variable name (character value) from data where the variable stores the different disaggregation scenarios. The disaggregation variable could include such values as 'Ethnicity', 'Age Group', and 'Foster Youth', corresponding to three disaggregation scenarios.

group_var_col

A variable name (character value) from data where the variable stores the group name for each group within a level of disaggregation specified in disagg_var_col. For example, the group names could include 'Asian', 'White', 'Black', 'Latinx', 'Native American', and 'Other' for a disaggregation on ethnicity; 'Under 18', '18-21', '22-25', and '25+' for an age group disaggregation; and 'Yes' and 'No' for a foster youth status disaggregation.

disagg_var_col_2

(Optional) A variable name (character value) from data where the variable stores an optional second disaggregation variable, which allows for the intersectionality of variables listed in disagg_var_col and disagg_var_col_2. The second disaggregation variable could describe something not in disagg_var_col, such as 'Gender', which would require all groups described in group_var_col to be broken out by gender.

group_var_col_2

(Optional) A variable name (character value) from data where the variable stores the group name for each group within a second level of disaggregation specified in disagg_var_col_2. For example, the group names could include 'Male', 'Female', 'Non-binary', and 'Unknown' if 'Gender' is a value in the variable disagg_var_col_2.

cohort_var_col

(Optional) A variable name (character value) from data where the variable stores the cohort label for the data described in each row.

summarize_by_vars

(Optional) A character vector of variable names in data for which num_var and denom_var are used for aggregation to calculate success rates for the disproportionate impact (DI) analysis set up by disagg_var_col, group_var_col, disagg_var_col_2, and group_var_col_2. For example, summarize_by_vars=c('Outcome') could specify a single variable/column that describes the outcome or metric in num_var, where the outcome values might include 'Completion of Transfer-Level Math', 'Completion of Transfer-Level English', 'Transfer', and 'Associate Degree'.

custom_reference_group_flag_var

(Optional) A variable name (character value) from data where the variable flags the row or group that should be used as the reference group (1 if row is a reference group, 0 otherwise) for comparison in the percentage point gap method and the 80% index method. When this argument is used, then the ppg_reference_groups and di_80_index_reference_groups arguments should not be specified.

...

(Optional) Other arguments such as ppg_reference_groups, min_moe, use_prop_in_moe, prop_sub_0, prop_sub_1, di_prop_index_cutoff, di_80_index_cutoff, di_80_index_reference_groups, and check_valid_reference from di_iterate.

Details

Calculate disproportionate impact via the percentage point gap (PPG), proportionality index, and 80% index methods for every disaggregation scenario in disagg_var_col (and disagg_var_col_2, if specified), within each cohort in cohort_var_col and each combination of the variables in summarize_by_vars.
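The aggregation step described for num_var and denom_var can be sketched in base R on a small made-up "long" data set. The column names mirror those used in the example for this function, but the values are invented for illustration and the code is a sketch rather than the package's implementation.

```r
# Toy long data (hypothetical values): one row per outcome x disaggregation x group.
# The two 'Asian' rows are collapsed by the aggregation below.
long <- data.frame(
  categoryLabel = rep("Transfer", 5),
  disagg1       = c("Ethnicity", "Ethnicity", "Ethnicity",
                    "Foster Youth", "Foster Youth"),
  subgroup1     = c("Asian", "Asian", "White", "Yes", "No"),
  value         = c(10, 5, 12, 3, 24),   # success counts (num_var)
  denom         = c(20, 10, 40, 10, 60), # group sizes (denom_var)
  stringsAsFactors = FALSE
)

# Sum numerators and denominators within each unique combination.
agg <- aggregate(
  cbind(value, denom) ~ categoryLabel + disagg1 + subgroup1,
  data = long, FUN = sum
)
# Success rate per combination.
agg$pct <- agg$value / agg$denom
print(agg)
```

Combinations that already occupy a single row, such as 'White' here, pass through the aggregation unchanged.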

Value

A summarized data set (data frame) consisting of:

  • variables specified by summarize_by_vars, disagg_var_col, group_var_col, disagg_var_col_2, and group_var_col_2,

  • di_indicator_ppg (1 if there is disproportionate impact per the percentage point gap method, 0 otherwise),

  • di_indicator_prop_index (1 if there is disproportionate impact per the proportionality index, 0 otherwise),

  • di_indicator_80_index (1 if there is disproportionate impact per the 80% index, 0 otherwise), and

  • other relevant fields returned from di_ppg, di_prop_index, and di_80_index.

Examples

library(dplyr)
data(ssm_cohort)
di_iterate_on_long(data=ssm_cohort %>% filter(missingFlag==0) # remove missing data
  , num_var='value', denom_var='denom'
  , disagg_var_col='disagg1', group_var_col='subgroup1'
  , cohort_var_col='academicYear', summarize_by_vars=c('categoryLabel')
  , ppg_reference_groups='all but current' # PPG-1
  , di_80_index_reference_groups='all but current')

Iteratively calculate disproportionate impact using multiple methods for many variables, using SQL.

Description

Iteratively calculate disproportionate impact via the percentage point gap (PPG), proportionality index, and 80% index methods for many success variables, disaggregation variables, and scenarios, using SQL (for data stored in a database or in a parquet data file).

Usage

di_iterate_sql(
  db_conn,
  db_table_name,
  success_vars,
  group_vars,
  cohort_vars = NULL,
  scenario_repeat_by_vars = NULL,
  exclude_scenario_df = NULL,
  weight_var = NULL,
  include_non_disagg_results = TRUE,
  ppg_reference_groups = "overall",
  min_moe = 0.03,
  use_prop_in_moe = FALSE,
  prop_sub_0 = 0.5,
  prop_sub_1 = 0.5,
  di_prop_index_cutoff = 0.8,
  di_80_index_cutoff = 0.8,
  di_80_index_reference_groups = "hpg",
  check_valid_reference = TRUE,
  parallel = FALSE,
  parallel_n_cores = parallel::detectCores()/2,
  mssql_flag = FALSE,
  return_what = "data",
  staging_table = paste0("DisImpact_Staging_", paste0(sample(1:9, size = 5, replace =
    TRUE), collapse = "")),
  drop_staging_table = TRUE
)
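The default for staging_table in the usage above is an expression rather than a fixed string. Evaluated on its own, it produces a table name with a random five-digit suffix, presumably so that separate calls do not reuse the same staging table (that motivation is an assumption, not stated in the source).

```r
# Evaluate the default expression from the usage block directly.
staging_table <- paste0(
  "DisImpact_Staging_",
  paste0(sample(1:9, size = 5, replace = TRUE), collapse = "")
)
print(staging_table)  # the five-digit suffix varies from run to run
```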

Arguments

db_conn

A database connection object, returned by dbConnect.

db_table_name

A character value specifying a database table name.

success_vars

A character vector of success variable names to iterate across.

group_vars

A character vector of group (disaggregation) variable names to iterate across.

cohort_vars

(Optional) A character vector of the same length as success_vars to indicate the cohort variable to be used for each variable specified in success_vars. A vector of length 1 could be specified, in which case the same cohort variable is used for each success variable. If not specified, then a single cohort is assumed for all success variables (defaults to NULL).

scenario_repeat_by_vars

(Optional) A character vector of variables to repeat DI calculations for across all combinations of these variables. For example, the following variables could be specified:

  • Ed Goal: Degree/Transfer, Short-term Career, Non-credit

  • First time college student: Yes, No

  • Full-time status: Yes, No

Each combination of these variables (e.g., full-time, first-time college students with an ed goal of degree/transfer as one combination) would constitute an iteration / sample for which to calculate disproportionate impact for outcomes listed in success_vars and for the disaggregation variables listed in group_vars. The overall rate of success for full-time, first-time college students with an ed goal of degree/transfer would include only these students and not others. Each variable specified is also collapsed to an '- All' group so that the combinations also reflect all students of a particular category. The total number of combinations for the previous example would be (+1 representing the all category): (3 + 1) x (2 + 1) x (2 + 1) = 36.

exclude_scenario_df

(Optional) A data frame with variables that match scenario_repeat_by_vars for specifying the combinations to exclude from DI calculations. Following the example specified above, one could choose to exclude part-time non-credit students from consideration.

weight_var

(Optional) A character variable specifying the weight variable if the input data set is summarized (i.e., the success variables specified in success_vars contain counts of successes). Weight here corresponds to the denominator when calculating the success rate. Defaults to NULL for an input data set where each row describes an individual.

include_non_disagg_results

A logical variable specifying whether or not the non-disaggregated results should be returned; defaults to TRUE. When TRUE, a new variable `- None` is added to the data set with a single data value '- All', and this variable is added to group_vars as a disaggregation/group variable. The user would want these results returned to review non-disaggregated results.

ppg_reference_groups

Either 'overall', 'hpg', 'all but current', or a character vector of the same length as group_vars that indicates the reference group value for each group variable in group_vars when determining disproportionate impact using the percentage point gap method.

min_moe

The minimum margin of error to be used in the PPG calculation; see di_ppg.

use_prop_in_moe

(TRUE or FALSE) Whether the estimated proportions should be used in the margin of error calculation by the PPG; see di_ppg.

prop_sub_0

Default is 0.50; see di_ppg.

prop_sub_1

Default is 0.50; see di_ppg.

di_prop_index_cutoff

Threshold used for determining disproportionate impact using the proportionality index; see di_prop_index; defaults to 0.80.

di_80_index_cutoff

Threshold used for determining disproportionate impact using the 80% index; see di_80_index; defaults to 0.80.

di_80_index_reference_groups

Either 'overall', 'hpg', 'all but current', or a character vector of the same length as group_vars that indicates the reference group value for each group variable in group_vars when determining disproportionate impact using the 80% index.

check_valid_reference

(TRUE or FALSE) Check whether ppg_reference_groups and di_80_index_reference_groups contain valid values; defaults to TRUE.

parallel

If TRUE, then perform calculations in parallel. The parallel feature is only supported when db_table_name is a path to a parquet file ('/path/to/data.parquet') and db_conn is a connection to a duckdb database (e.g., dbConnect(duckdb(), dbdir=':memory:')). Defaults to FALSE.

parallel_n_cores

The number of CPU cores to use if parallel=TRUE. Defaults to half of the maximum number of CPU cores on the system.

mssql_flag

User-specified logical flag (TRUE or FALSE) that indicates if the MS SQL Server variant of the SQL language should be used.

return_what

A character value specifying the return value for the function call. For 'data', the function will return a long data frame with the disproportionate impact calculations and relevant statistics, after the calculations are performed on the SQL database engine. For 'SQL', a list object of individual queries will be returned for the user to execute elsewhere. Defaults to 'data'.

staging_table

A character value indicating the name of the staging or results table in the database for storing the disproportionate impact calculations.

drop_staging_table

(TRUE or FALSE) A logical flag indicating whether or not the staging table specified in staging_table should be dropped in the database after the results are returned to R; defaults to TRUE.

Details

Iteratively calculate disproportionate impact via the percentage point gap (PPG), proportionality index, and 80% index methods for all combinations of success_vars, group_vars, and cohort_vars, for each combination of subgroups specified by scenario_repeat_by_vars, using SQL (calculations done on the database engine or duckdb for parquet files).
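As a rough illustration of how the number of scenarios grows, the sketch below enumerates the combinations for the three example variables described under scenario_repeat_by_vars, with '- All' appended to each, and then drops the combinations an exclude_scenario_df-style data frame would name. It uses base R only. The column names (ed_goal, first_time, full_time) are hypothetical, and the function itself may build and filter its scenarios differently.

```r
# Hypothetical levels taken from the example; '- All' is the collapsed group.
ed_goal    <- c("Degree/Transfer", "Short-term Career", "Non-credit", "- All")
first_time <- c("Yes", "No", "- All")
full_time  <- c("Yes", "No", "- All")

# Every combination of the three variables is one scenario.
scenarios <- expand.grid(ed_goal = ed_goal, first_time = first_time,
                         full_time = full_time, stringsAsFactors = FALSE)
n_scenarios <- nrow(scenarios)  # (3 + 1) * (2 + 1) * (2 + 1)

# A data frame shaped like exclude_scenario_df: part-time non-credit students.
exclude <- data.frame(ed_goal = "Non-credit", full_time = "No",
                      stringsAsFactors = FALSE)

# Drop the scenarios whose (ed_goal, full_time) pair appears in 'exclude'.
drop_key <- paste(exclude$ed_goal, exclude$full_time)
kept <- scenarios[!(paste(scenarios$ed_goal, scenarios$full_time) %in% drop_key), ]
```

Here n_scenarios matches the count given in the scenario_repeat_by_vars description, and kept has three fewer rows (one per level of first_time).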

Value

When return_what='data' (default), a long data frame is returned (see the return value for di_iterate). When return_what='SQL', a list object where each element is a query (character value) is returned.


Calculate disproportionate impact per the percentage point gap (PPG) method.

Description

Calculate disproportionate impact per the percentage point gap (PPG) method.

Usage

di_ppg(
  success,
  group,
  cohort,
  weight,
  reference = c("overall", "hpg", "all but current", unique(group)),
  data,
  min_moe = 0.03,
  use_prop_in_moe = FALSE,
  prop_sub_0 = 0.5,
  prop_sub_1 = 0.5,
  check_valid_reference = TRUE
)

Arguments

success

A vector of success indicators (1/0 or TRUE/FALSE) or an unquoted reference (name) to a column in data if it is specified. It could also be a vector of counts, in which case weight (group size) should also be specified.

group

A vector of group names of the same length as success or an unquoted reference (name) to a column in data if it is specified.

cohort

(Optional) A vector of cohort names of the same length as success or an unquoted reference (name) to a column in data if it is specified. Disproportionate impact is calculated for every group within each cohort. When cohort is not specified, then the analysis assumes a single cohort.

weight

(Optional) A vector of case weights of the same length as success or an unquoted reference (name) to a column in data if it is specified. If success consists of counts instead of success indicators (1/0), then weight should also be specified to indicate the group size.

reference

Either 'overall' (default), 'hpg' (highest performing group), 'all but current' (success rate of everyone excluding the comparison group; also known as 'ppg minus 1'), a value from group (specifying a reference group), a single proportion (eg, 0.50), or a vector of proportions (one for each cohort). Reference is used as a point of comparison for disproportionate impact for each group. When cohort is specified:

  • 'overall' will use the overall success rate of each cohort group as the reference;

  • 'hpg' will use the highest performing group in each cohort as reference;

  • 'all but current' will use the calculated success rate of each cohort group excluding the comparison group;

  • the success rate of the specified reference group from group in each cohort will be used;

  • the specified proportion will be used for all cohorts;

  • the specified vector of proportions will refer to the reference point for each cohort in alphabetical order (so the number of proportions should equal the number of unique cohorts).
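To make the reference options concrete, the following base R sketch computes each kind of reference rate by hand for a single cohort. The data and object names are made up for illustration; this is not the package's internal code.

```r
# Toy individual-level data (made up for illustration).
toy <- data.frame(
  group   = c("A", "A", "A", "A", "B", "B", "B", "B", "C", "C"),
  success = c(1, 1, 1, 0, 1, 0, 0, 0, 1, 0)
)

# Success rate per group: A = 0.75, B = 0.25, C = 0.50.
rate <- tapply(toy$success, toy$group, mean)

# 'overall': success rate of everyone.
ref_overall <- mean(toy$success)

# 'hpg': rate of the highest performing group.
ref_hpg <- max(rate)

# 'all but current': rate of everyone outside the group being compared.
ref_all_but_current <- sapply(names(rate), function(g) {
  mean(toy$success[toy$group != g])
})

# A specific group (here "C") or a fixed proportion can also be the reference.
ref_group <- rate[["C"]]
ref_fixed <- 0.50
```

When cohort is specified, the same quantities would be computed separately within each cohort.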

data

(Optional) A data frame containing the variables of interest. If data is specified, then success, group, and cohort will be searched within it.

min_moe

The minimum margin of error (MOE) to be used in the calculation of disproportionate impact and is passed to ppg_moe. Defaults to 0.03.

use_prop_in_moe

A logical value indicating whether or not the MOE formula should use the observed success rates (TRUE). Defaults to FALSE, which uses 0.50 as the proportion in the MOE formula. If TRUE, the success rates are passed to the proportion argument of ppg_moe.

prop_sub_0

For cases where proportion is 0, substitute with prop_sub_0 (defaults to 0.5) to account for the zero MOE. This is relevant only when use_prop_in_moe=TRUE.

prop_sub_1

For cases where proportion is 1, substitute with prop_sub_1 (defaults to 0.5) to account for the zero MOE. This is relevant only when use_prop_in_moe=TRUE.

check_valid_reference

Check whether reference is a valid value; defaults to TRUE. This argument exists to be used in di_iterate as when iterating DI calculations, there may be some scenarios where a specified reference group does not contain any students.

Details

This function determines disproportionate impact based on the percentage point gap (PPG) method, as described in the reference below from the California Community Colleges Chancellor's Office. It assumes that a higher rate is good ("success"). For rates where a high value is bad (eg, rate of drop-outs), consider analyzing the complementary outcome instead (eg, non drop-outs, where high is good) in order to leverage this function properly. Note that by default the margin of error (MOE) is calculated using 1.96*sqrt(0.5*0.5/n), with min_moe applied as the minimum.
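The decision rule can be sketched by hand in base R. The example assumes the usual normal-approximation MOE for a proportion with p = 0.5, floored at min_moe, and flags a group when its upper limit (pct + moe) does not exceed the reference, as in the di_indicator description under Value. All numbers are made up, and the package's own implementation may differ in details.

```r
n         <- c(A = 400, B = 50, C = 2500)     # group sizes (made up)
pct       <- c(A = 0.45, B = 0.45, C = 0.58)  # observed success rates (made up)
reference <- 0.55                             # e.g. an overall success rate
min_moe   <- 0.03

# Normal-approximation MOE with p = 0.5, never smaller than min_moe.
moe <- pmax(1.96 * sqrt(0.5 * 0.5 / n), min_moe)

pct_lo <- pct - moe
pct_hi <- pct + moe

# Flag a group when even its upper limit does not reach the reference.
di_indicator <- as.integer(pct_hi <= reference)
```

Groups A and B have the same observed rate, but only A is flagged: B's small sample gives it a wide margin of error, so its upper limit still exceeds the reference. Group C is large enough that min_moe, rather than the formula, sets its margin.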

Value

A data frame consisting of:

  • cohort (if used),

  • group,

  • n (sample size),

  • success (number of successes for the cohort-group),

  • pct (proportion of successes for the cohort-group),

  • reference_group (reference group used in DI calculation),

  • reference (reference value used in DI calculation),

  • moe (margin of error),

  • pct_lo (lower 95% confidence limit for pct),

  • pct_hi (upper 95% confidence limit for pct),

  • di_indicator (1 if there is disproportionate impact, ie, when pct_hi <= reference),

  • success_needed_not_di (the number of additional successes needed in order to no longer be considered disproportionately impacted as compared to the reference), and

  • success_needed_full_parity (the number of additional successes needed in order to achieve full parity with the reference).

References

California Community Colleges Chancellor's Office (2017). Percentage Point Gap Method.

Examples

library(dplyr)
data(student_equity)
# Vector
di_ppg(success=student_equity$Transfer
  , group=student_equity$Ethnicity) %>% as.data.frame
# Tidy and column reference
di_ppg(success=Transfer, group=Ethnicity, data=student_equity) %>%
  as.data.frame
# Cohort
di_ppg(success=Transfer, group=Ethnicity, cohort=Cohort
 , data=student_equity) %>%
  as.data.frame
# With custom reference (single)
di_ppg(success=Transfer, group=Ethnicity, reference=0.54
  , data=student_equity) %>%
  as.data.frame
# With custom reference (multiple)
di_ppg(success=Transfer, group=Ethnicity, cohort=Cohort
  , reference=c(0.5, 0.55), data=student_equity) %>%
  as.data.frame
# min_moe
di_ppg(success=Transfer, group=Ethnicity, data=student_equity
  , min_moe=0.02) %>%
  as.data.frame
# use_prop_in_moe
di_ppg(success=Transfer, group=Ethnicity, data=student_equity
  , min_moe=0.02
  , use_prop_in_moe=TRUE) %>%
  as.data.frame

Iteratively calculate disproportionate impact via the percentage point gap (PPG) method for many variables.

Description

Iteratively calculate disproportionate impact via the percentage point gap (PPG) method for many disaggregation variables.

Usage

di_ppg_iterate(
  data,
  success_vars,
  group_vars,
  cohort_vars,
  reference_groups,
  repeat_by_vars = NULL,
  weight_var = NULL,
  min_moe = 0.03,
  use_prop_in_moe = FALSE,
  prop_sub_0 = 0.5,
  prop_sub_1 = 0.5
)

Arguments

data

A data frame for which to iterate DI calculation for a set of variables.

success_vars

A character vector of success variable names to iterate across.

group_vars

A character vector of group (disaggregation) variable names to iterate across.

cohort_vars

A character vector of cohort variable names to iterate across.

reference_groups

Either 'overall', 'hpg', or a character vector of the same length as 'group_vars' that indicates the reference group value for each group variable in 'group_vars'.

repeat_by_vars

A character vector of variables to repeat DI calculations for across all combinations of these variables, including '- All' as a group for each variable. The reference rate used for DI comparison differs for every combination of the variables listed here.

weight_var

A character scalar specifying the weight variable if the input data set is summarized (ie, the success variables specified in 'success_vars' contain counts of successes). Weight here corresponds to the denominator when calculating the success rate. Defaults to 'NULL' for an input data set where each row describes an individual.

min_moe

The minimum margin of error to be used in the PPG calculation, passed to 'di_ppg'.

use_prop_in_moe

Whether the estimated proportions should be used in the margin of error calculation by the PPG, passed to 'di_ppg'.

prop_sub_0

Passed to 'di_ppg'.

prop_sub_1

Passed to 'di_ppg'.

Details

Iteratively calculate disproportionate impact via the percentage point gap (PPG) method for all combinations of 'success_vars', 'group_vars', and 'cohort_vars', for each combination of subgroups specified by 'repeat_by_vars'.

Value

A data frame with all relevant returned fields from 'di_ppg' plus 'success_variable' (elements of 'success_vars'), 'disaggregation' (elements of 'group_vars'), and 'reference_group' (elements of 'reference_groups').

Examples

library(dplyr)
data(student_equity)
# Multiple group variables
di_ppg_iterate(data=student_equity, success_vars=c('Transfer')
  , group_vars=c('Ethnicity', 'Gender'), cohort_vars=c('Cohort')
  , reference_groups='overall')

Calculate disproportionate impact per the proportionality index (PI) method.

Description

Calculate disproportionate impact per the proportionality index (PI) method.

Usage

di_prop_index(success, group, cohort, weight, data, di_prop_index_cutoff = 0.8)

Arguments

success

A vector of success indicators (1/0 or TRUE/FALSE) or an unquoted reference (name) to a column in data if it is specified. It could also be a vector of counts, in which case weight should also be specified (group size).

group

A vector of group names of the same length as success or an unquoted reference (name) to a column in data if it is specified.

cohort

(Optional) A vector of cohort names of the same length as success or an unquoted reference (name) to a column in data if it is specified. Disproportionate impact is calculated for every group within each cohort. When cohort is not specified, then the analysis assumes a single cohort.

weight

(Optional) A vector of case weights of the same length as success or an unquoted reference (name) to a column in data if it is specified. If success consists of counts instead of success indicators (1/0), then weight should also be specified to indicate the group size.

data

(Optional) A data frame containing the variables of interest. If data is specified, then success, group, and cohort will be searched within it.

di_prop_index_cutoff

A numeric value between 0 and 1 that is used to determine disproportionate impact if the proportionality index falls below this threshold; defaults to 0.80.

Details

This function determines disproportionate impact based on the proportionality index (PI) method, as described in the reference below from the California Community Colleges Chancellor's Office. It assumes that a higher rate is good ("success"). For rates where a high value is bad (eg, rate of drop-outs), consider analyzing the complementary outcome instead (eg, non drop-outs, where high is good) in order to leverage this function properly.

Value

A data frame consisting of:

  • cohort (if used),

  • group,

  • n (sample size),

  • success (number of successes for the cohort-group),

  • pct_success (proportion of successes attributed to the group within the cohort),

  • pct_group (proportion of sample attributed to the group within the cohort),

  • di_prop_index (ratio of pct_success to pct_group),

  • di_indicator (1 if di_prop_index < di_prop_index_cutoff),

  • success_needed_not_di (the number of additional successes needed in order to no longer be considered disproportionately impacted as compared to the reference), and

  • success_needed_full_parity (the number of additional successes needed in order to achieve full parity with the reference).

When di_prop_index < 1, then there are signs of disproportionate impact.
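The index itself is simple to reproduce by hand. The base R sketch below follows the definitions of pct_success, pct_group, and di_prop_index listed above, using made-up counts for a single cohort; object names other than those output columns are illustrative.

```r
n       <- c(A = 500, B = 300, C = 200)  # group sizes (made up)
success <- c(A = 300, B = 150, C = 50)   # success counts (made up)

pct_group   <- n / sum(n)                # each group's share of the cohort
pct_success <- success / sum(success)    # each group's share of all successes

# Proportionality index: share of successes relative to share of the cohort.
di_prop_index <- pct_success / pct_group

di_prop_index_cutoff <- 0.8
di_indicator <- as.integer(di_prop_index < di_prop_index_cutoff)
```

Group C makes up 20% of the cohort but only 10% of the successes, so its index of 0.5 falls below the 0.8 cutoff and it is the only group flagged.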

References

California Community Colleges Chancellor's Office (2014). Guidelines for Measuring Disproportionate Impact in Equity Plans.

Examples

library(dplyr)
data(student_equity)
di_prop_index(success=Transfer, group=Ethnicity, data=student_equity) %>%
  as.data.frame

Margin of error for the PPG

Description

Calculate the margin of error (MOE) for the percentage point gap (PPG) method.

Usage

ppg_moe(n, proportion, min_moe = 0.03, prop_sub_0 = 0.5, prop_sub_1 = 0.5)

Arguments

n

Sample size for the group of interest.

proportion

(Optional) The proportion of successes for the group of interest. If specified, then the proportion is used in the MOE formula. Otherwise, a default proportion of 0.50 is used (conservative and yields the maximum MOE).

min_moe

The minimum MOE returned even if the sample size is large. Defaults to 0.03. This equates to a minimum threshold gap for declaring disproportionate impact.

prop_sub_0

For cases where 'proportion' is 0, substitute with prop_sub_0 (defaults to 0.5) to account for the zero MOE.

prop_sub_1

For cases where 'proportion' is 1, substitute with prop_sub_1 (defaults to 0.5) to account for the zero MOE.

Value

The margin of error for the PPG given the specified sample size.
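The arguments above can be read as the following base R sketch. It assumes the standard normal-approximation formula 1.96*sqrt(p*(1-p)/n); moe_sketch is a hypothetical stand-in written for illustration and is not the package's source code.

```r
# Hypothetical re-implementation for illustration; not the package's code.
moe_sketch <- function(n, proportion = NULL, min_moe = 0.03,
                       prop_sub_0 = 0.5, prop_sub_1 = 0.5) {
  if (is.null(proportion)) {
    p <- rep(0.5, length(n))          # conservative default, maximum MOE
  } else {
    p <- rep_len(proportion, length(n))
    is_0 <- p == 0
    is_1 <- p == 1
    p[is_0] <- prop_sub_0             # p = 0 would otherwise give a zero MOE
    p[is_1] <- prop_sub_1             # p = 1 would otherwise give a zero MOE
  }
  # The MOE is never allowed to fall below min_moe.
  pmax(1.96 * sqrt(p * (1 - p) / n), min_moe)
}

moe_sketch(n = 800)                                  # about 0.0346
moe_sketch(n = 800, proportion = 0.20)               # 0.03, the min_moe floor
moe_sketch(n = 800, proportion = 0.20, min_moe = 0)  # about 0.0277
```

With proportion = 0.20 the raw margin (about 0.0277) is below the default min_moe, so the floor of 0.03 is returned unless min_moe is lowered.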

References

California Community Colleges Chancellor's Office (2017). Percentage Point Gap Method.

Examples

ppg_moe(n=800)
ppg_moe(n=c(200, 800, 1000, 2000))
ppg_moe(n=800, proportion=0.20)
ppg_moe(n=800, proportion=0.20, min_moe=0)
ppg_moe(n=c(200, 800, 1000, 2000), min_moe=0.01)

Long summarized disaggregated data set

Description

Sample data downloaded from the California Community Colleges Chancellor's Office Student Success Metrics dashboard.

Usage

data(ssm_cohort)

Format

A data frame with summarized data:

value

Success count (numerator).

denom

Group size (denominator).

categoryLabel

Metric or outcome.

academicYear

Academic year for given data.

disagg1

Different levels of disaggregation.

subgroup1

Groups corresponding to each disaggregation in disagg1.

disagg2

Second level of disaggregation: 'None' or 'Gender'.

subgroup2

Groups corresponding to each disaggregation in disagg2.

cohort

Not actually a cohort, but the time-window for the outcome in categoryLabel.

localeName

College name.

metricID

ID for current metric.

title

Title of visualization.

categoryID

ID for categoryLabel.

perc

value / denom.

dataType

All are 'Percent'.

missingFlag

1 if missing.

ferpaFlag

1 if FERPA-suppressed.

X20

Ignore.

description

Ignore.

source

Ignore.
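Because value and denom hold counts rather than one row per student, data in this layout correspond to the "vector of counts" case described for the success and weight arguments. The base R sketch below uses a made-up two-row data frame with the same column names (not actual ssm_cohort values) to show how perc relates to them and how summarized rows pool into a single rate.

```r
# Made-up rows mimicking the summarized layout (not actual ssm_cohort values).
toy <- data.frame(
  subgroup1 = c("Group X", "Group Y"),
  value     = c(120, 45),   # success counts (numerators)
  denom     = c(300, 150)   # group sizes (denominators)
)

# Per-group success rate, as in the perc column.
toy$perc <- toy$value / toy$denom

# Pooled rate across groups: sum the counts first, then divide.
overall <- sum(toy$value) / sum(toy$denom)
```

Note that the pooled rate is weighted by group size, so it differs from the simple average of the two perc values.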

Examples

data(ssm_cohort)

Fake data on student equity

Description

Data randomly generated to illustrate the use of the package.

Usage

data(student_equity)

Format

A data frame with 20,000 rows:

Ethnicity

ethnicity (one of: Asian, Black, Hispanic, Multi-Ethnicity, Native American, White).

Gender

gender (one of: Male, Female, Other).

Cohort

year student first enrolled in any credit course at the institution (one of: 2017, 2018).

Transfer

1 or 0 indicating whether or not a student transferred within 2 years of first enrollment (Cohort).

Cohort_Math

year student first enrolled in a math course at the institution; could be NA if the student has not attempted math.

Math

1 or 0 indicating whether or not a student completed transfer-level math within 1 year of their first math attempt (Cohort_Math); could be NA if the student has not attempted math.

Cohort_English

year student first enrolled in an English course at the institution; could be NA if the student has not attempted English.

English

1 or 0 indicating whether or not a student completed transfer-level English within 1 year of their first English attempt (Cohort_English); could be NA if the student has not attempted English.

Ed_Goal

student's educational goal (one of: Deg/Transfer, Other).

College_Status

student's educational status (one of: First-time College, Other).

Student_ID

student's unique identifier.

EthnicityFlag_Asian

1 (yes) or 0 (no) indicating whether or not a student self-identifies as Asian.

EthnicityFlag_Black

1 (yes) or 0 (no) indicating whether or not a student self-identifies as Black.

EthnicityFlag_Hispanic

1 (yes) or 0 (no) indicating whether or not a student self-identifies as Hispanic.

EthnicityFlag_NativeAmerican

1 (yes) or 0 (no) indicating whether or not a student self-identifies as Native American.

EthnicityFlag_PacificIslander

1 (yes) or 0 (no) indicating whether or not a student self-identifies as Pacific Islander.

EthnicityFlag_White

1 (yes) or 0 (no) indicating whether or not a student self-identifies as White.

EthnicityFlag_Carribean

1 (yes) or 0 (no) indicating whether or not a student self-identifies as Caribbean.

EthnicityFlag_EastAsian

1 (yes) or 0 (no) indicating whether or not a student self-identifies as East Asian.

EthnicityFlag_SouthEastAsian

1 (yes) or 0 (no) indicating whether or not a student self-identifies as Southeast Asian.

EthnicityFlag_SouthWestAsianNorthAfrican

1 (yes) or 0 (no) indicating whether or not a student self-identifies as Southwest Asian / North African (SWANA).

EthnicityFlag_AANAPI

1 (yes) or 0 (no) indicating whether or not a student self-identifies as Asian-American or Native American Pacific Islander (AANAPI).

EthnicityFlag_Unknown

1 (yes) or 0 (no) indicating whether or not a student self-identifies as Unknown.

EthnicityFlag_TwoorMoreRaces

1 (yes) or 0 (no) indicating whether or not a student self-identifies as two or more races.

Examples

data(student_equity)

Helper function: Surround character values with double quotes if not present.

Description

Function used internally by di_calc_sql and di_iterate_sql to surround variable names by double quotes in SQL queries in order to support non-alphanumeric characters in variable names.

Usage

surround_quote_if_needed(value)

Arguments

value

A character vector.

Value

A character vector with double quotes surrounding value if the first and last characters of value aren't yet double quotes. For value that is already surrounded by double quotes, nothing is changed.
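The described behavior can be approximated with a few lines of base R. quote_if_needed_sketch below is a hypothetical stand-in written from the description above, not the package's implementation.

```r
# Hypothetical sketch of the described behavior; not the package's code.
quote_if_needed_sketch <- function(value) {
  # TRUE where the string already starts and ends with a double quote.
  already_quoted <- startsWith(value, '"') & endsWith(value, '"')
  # Leave quoted values alone; wrap everything else in double quotes.
  ifelse(already_quoted, value, paste0('"', value, '"'))
}

quoted <- quote_if_needed_sketch(c('Ethnicity', '"Ed Goal"', 'First-Time'))
```

The second element is returned unchanged, while the first and third gain surrounding double quotes, which is what allows names containing spaces or hyphens to be used as identifiers in a SQL query.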