--- title: "Variable Types" author: "Win-Vector LLC" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Variable Types} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- 'vtreat' is a package that prepares arbitrary data frames into clean data frames that are ready for analysis (usually supervised learning). A clean data frame: - Only has numeric columns (other than the outcome). - Has no Infinite/NA/NaN values in the effective variable columns. To effect this encoding 'vtreat' replaces original variables or columns with new derived variables. In this note we will use variables and columns as interchangeable concepts. This note describes the current family of 'vtreat' derived variable types. 'vtreat' usage splits into three main cases: * When the target to predict is categorical. * When the target to predict is numeric. * When there is no supplied target to predict. In all cases vtreat variable names are built by appending a notation onto the original user supplied column name. In all cases the easiest way to examine the derived variables is to look at the `scoreFrame` component of the returned treatment plan. We will outline each of these situations below: ## When the target to predict is categorical An example categorical variable treatment is demonstrated below: ```{r categoricalexample, tidy=FALSE} library(vtreat) dTrainC <- data.frame(x=c('a','a','a','b','b',NA), z=c(1,2,3,4,NA,6),y=c(FALSE,FALSE,TRUE,FALSE,TRUE,TRUE), stringsAsFactors = FALSE) treatmentsC <- designTreatmentsC(dTrainC,colnames(dTrainC),'y',TRUE) scoreColsToPrint <- c('origName','varName','code','rsq','sig','extraModelDegrees') print(treatmentsC$scoreFrame[,scoreColsToPrint]) ``` For each user supplied variable or column (in this case `x` and `z`) 'vtreat' proposes derived or treated variables. The mapping from original variable name to derived variable name is given by comparing the columns `origName` and `varName`. One can map facts about the new variables back to the original variables as follows: ```{r map} # Build a map from vtreat names back to reasonable display names vmap <- as.list(treatmentsC$scoreFrame$origName) names(vmap) <- treatmentsC$scoreFrame$varName print(vmap['x_catB']) # Map significances back to original variables aggregate(sig~origName,data=treatmentsC$scoreFrame,FUN=min) ``` In the `scoreFrame` the `sig` column is the significance of the single variable logistic regression using the named variable (plus a constant term), and the `rsq` column is the "pseudo-r-squared" or portion of deviance explained (please see [here](https://win-vector.com/2011/09/14/the-simpler-derivation-of-logistic-regression/) for some notes). Essentially a derived variable name is built by concatenating an original variable name and a treatment type (also recorded in the `code` column for convenience). The codes give the different 'vtreat' variable types (or really meanings, as all derived variables are numeric). For categorical targets the possible variable types are as follows: * **clean** : a numeric variable passed through with all NA/NaN/infinite values replaced with either zero or mean value of the non-NA/NaN/infinite examples of the variable. * **is\_Bad** : a companion to the 'clean' treatment. 'is\_Bad' is an indicator that indicates a value replacement has occurred. For many noisy datasets this column can be more informative than the clean column! * **lev** : a 0/1 indicator indicating a particular value of a categorical variable was present. 
## When the target to predict is numeric

An example numeric variable treatment is demonstrated below:

```{r numericexample, tidy=FALSE}
library(vtreat)
dTrainN <- data.frame(x=c('a','a','a','b','b',NA),
                      z=c(1,2,3,4,NA,6),
                      y=as.numeric(c(FALSE,FALSE,TRUE,FALSE,TRUE,TRUE)),
                      stringsAsFactors = FALSE)
treatmentsN <- designTreatmentsN(dTrainN,colnames(dTrainN),'y')
print(treatmentsN$scoreFrame[,scoreColsToPrint])
```

The treatment of numeric targets is similar to that of categorical targets. In the numeric case the possible derived variable types are:

  * **clean** : a numeric variable passed through with all NA/NaN/infinite values replaced with either zero or the mean value of the non-NA/NaN/infinite examples of the variable.
  * **is\_Bad** : a companion to the 'clean' treatment. 'is\_Bad' is an indicator that a value replacement has occurred. For many noisy datasets this column can be more informative than the clean column!
  * **lev** : a 0/1 indicator showing that a particular value of a categorical variable was present. For example `x_lev_x.a` is 1 when the original `x` variable had a value of "a". These indicators are essentially variables representing explicit encoding of levels as dummy variables. In some cases a special level code is used to represent pooled rare values.
  * **cat\_N** : a single-variable regression model of the difference in outcome expectation conditioned on the observed value of the original variable. In our example: `x_catN = E[y|x] - E[y]`. This encoding is especially useful for categorical variables that have a large number of levels, but be aware it can obscure degrees of freedom if not used properly (a rough numeric check appears at the end of this section).
  * **cat\_P** : a "prevalence fact" about a categorical level. Tells us if the original level was rare or common. Probably not good for direct use in a model, but possibly useful for meta-analysis on the variable.
  * **cat\_D** : a "deviation fact" about a categorical level: tells us if 'y' is concentrated or diffuse when conditioned on the observed level of the original categorical variable. Probably not good for direct use in a model, but possibly useful for meta-analysis on the variable.

Note: for categorical targets we don't need `cat\_D` variables, as this information is already encoded in the `cat\_B` variables.

In the `scoreFrame` the `sig` column is the significance of the single-variable linear regression using the named variable (plus a constant term), and the `rsq` column is the "r-squared" or portion of variance explained (please see [here](https://win-vector.com/2011/11/21/correlation-and-r-squared/) for some notes).
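As a rough check of the `cat\_N` interpretation above (an illustrative sketch added here, not part of the treatment design itself), one can compare the prepared `x_catN` column against a naive in-sample estimate of `E[y|x] - E[y]`. The values are typically close but not identical, because 'vtreat' regularizes and cross-validates its impact codes.

```{r catNcheck}
# Illustration only: compare vtreat's x_catN column to a naive
# in-sample estimate of E[y|x] - E[y].  Differences are expected
# because vtreat smooths and cross-validates its estimates.
dTreatedN <- prepare(treatmentsN, dTrainN)
xLevel <- ifelse(is.na(dTrainN$x), 'NA', dTrainN$x)
naiveImpact <- ave(dTrainN$y, xLevel, FUN = mean) - mean(dTrainN$y)
print(data.frame(x = dTrainN$x,
                 naiveImpact = naiveImpact,
                 x_catN = dTreatedN$x_catN))
```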
## When there is no supplied target to predict

An example "no target" variable treatment is demonstrated below:

```{r notargetexample, tidy=FALSE}
library(vtreat)
dTrainZ <- data.frame(x=c('a','a','a','b','b',NA),
                      z=c(1,2,3,4,NA,6),
                      stringsAsFactors = FALSE)
treatmentsZ <- designTreatmentsZ(dTrainZ,colnames(dTrainZ))
print(treatmentsZ$scoreFrame[, c('origName','varName','code','extraModelDegrees')])
```

Note: because there is no user-supplied target, the `scoreFrame` significance columns are not meaningful and are populated only for regularity of the code interface. Also, indicator variables are only formed by `designTreatmentsZ` for 'vtreat' 0.5.28 or newer.

Beyond that the no-target treatments are similar to the earlier treatments. Possible derived variable types in this case are:

  * **clean** : a numeric variable passed through with all NA/NaN/infinite values replaced with either zero or the mean value of the non-NA/NaN/infinite examples of the variable.
  * **is\_Bad** : a companion to the 'clean' treatment. 'is\_Bad' is an indicator that a value replacement has occurred. For many noisy datasets this column can be more informative than the clean column!
  * **lev** : a 0/1 indicator showing that a particular value of a categorical variable was present. For example `x_lev_x.a` is 1 when the original `x` variable had a value of "a". These indicators are essentially variables representing explicit encoding of levels as dummy variables. In some cases a special level code is used to represent pooled rare values.
  * **cat\_P** : a "prevalence fact" about a categorical level. Tells us if the original level was rare or common. Probably not good for direct use in a model, but possibly useful for meta-analysis on the variable.

## Restricting to Specific Variable Types

Both the `designTreatmentsX` functions and `prepare()` take an argument called `codeRestriction` that restricts the types of variables that are created. For example, you may not want to create `catP` and `catD` variables for a regression problem.

```{r restrict1}
dTrainN <- data.frame(x=c('a','a','a','b','b',NA),
                      z=c(1,2,3,4,NA,6),
                      y=as.numeric(c(FALSE,FALSE,TRUE,FALSE,TRUE,TRUE)),
                      stringsAsFactors = FALSE)
treatmentsN <- designTreatmentsN(dTrainN,colnames(dTrainN),'y',
                                 codeRestriction = c('lev', 'catN', 'clean', 'isBAD'),
                                 verbose=FALSE)

# no catP or catD variables
print(treatmentsN$scoreFrame[,scoreColsToPrint])
```

Conversely, even if you have created a treatment plan for a particular type of variable, you may subsequently decide not to use it. For example, perhaps you only want to use indicator variables and not the `catN` variable for modeling. You can use `codeRestriction` in `prepare()`.

```{r restrict2}
dTreated <- prepare(treatmentsN, dTrainN,
                    codeRestriction = c('lev','clean', 'isBAD'))

# no catN variables
head(dTreated)
```

`varRestriction` works similarly, except you must list the explicit derived variables to use. See the example below.

## Overall

Variables that "do not move" (don't take on at least two values during treatment design) or don't achieve at least a minimal significance are suppressed. The `catB`/`catN` variables are essentially single-variable models and are very useful for re-encoding categorical variables that take on a very large number of values (such as zip-codes).

The intended use of 'vtreat' is as follows (a minimal sketch of this workflow follows the list):

  * Data is split into three non-overlapping portions.
  * One portion is used to "design treatments" (we sometimes informally call this calibration).
  * Another portion is used to train a model.
  * The remaining portion is used to evaluate the model.
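Below is a minimal sketch of that three-way workflow on a purely synthetic data set. The data, the even three-way split, and the `glm` model are illustrative assumptions, not recommendations from the package.

```{r threewaysplit}
set.seed(2016)
# Purely synthetic data, used only to illustrate the workflow.
d <- data.frame(x = sample(c('a', 'b', 'c', NA), 100, replace = TRUE),
                z = rnorm(100),
                stringsAsFactors = FALSE)
d$y <- (d$z + ifelse(!is.na(d$x) & d$x == 'a', 1, 0) + rnorm(100)) > 0.5

# Split into non-overlapping calibration/train/test portions.
grp <- sample(c('cal', 'train', 'test'), nrow(d), replace = TRUE)

# Design treatments on the calibration portion only.
tplan <- designTreatmentsC(d[grp == 'cal', , drop = FALSE],
                           c('x', 'z'), 'y', TRUE, verbose = FALSE)

# Apply the fixed treatment plan to the other portions.
dModel <- prepare(tplan, d[grp == 'train', , drop = FALSE])
dEval  <- prepare(tplan, d[grp == 'test', , drop = FALSE])

# Fit a simple model on the treated training frame and score the treated test frame.
model <- glm(y ~ ., data = dModel, family = binomial)
dEval$pred <- predict(model, newdata = dEval, type = 'response')
head(dEval)
```

Because `designTreatmentsC` only sees the calibration rows, the model fit on the treated training portion is not contaminated by the impact-coding step, and the held-out portion gives an honest evaluation.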
'vtreat' attempts to compute "out of sample" significances for each variable effect (the `sig` column in `scoreFrame`) through cross-validation techniques.

'vtreat' is primarily intended for "y-aware" processing. Of particular interest is using `vtreat::prepare()` with `scale=TRUE`, which tries to put most columns in 'y-effect' units. This can be an important pre-processing step before attempting dimension reduction (such as principal components methods).

The 'vtreat' user should pick which sorts of variables they want and also filter on estimated significance. Doing this looks like the following:

```{r selectvars}
dTrainN <- data.frame(x=c('a','a','a','b','b',NA),
                      z=c(1,2,3,4,NA,6),
                      y=as.numeric(c(FALSE,FALSE,TRUE,FALSE,TRUE,TRUE)),
                      stringsAsFactors = FALSE)
treatmentsN <- designTreatmentsN(dTrainN,colnames(dTrainN),'y',
                                 codeRestriction = c('lev', 'catN', 'clean', 'isBAD'),
                                 verbose=FALSE)
print(treatmentsN$scoreFrame[,scoreColsToPrint])

pruneSig <- 1.0 # don't filter on significance for this tiny example
vScoreFrame <- treatmentsN$scoreFrame
varsToUse <- vScoreFrame$varName[(vScoreFrame$sig<=pruneSig)]
print(varsToUse)

origVarNames <- sort(unique(vScoreFrame$origName[vScoreFrame$varName %in% varsToUse]))
print(origVarNames)

# prepare a treated data frame using only the "significant" variables
dTreated <- prepare(treatmentsN, dTrainN,
                    varRestriction = varsToUse)
head(dTreated)
```

We strongly suggest using the standard variables coded as 'lev', 'clean', and 'isBAD'; and the "y-aware" variables coded as 'catN' and 'catB'. The non-sub-model variables ('catP' and 'catD') can be useful (possibly as interactions or guards on the corresponding 'catN' and 'catB' variables), but they also encode distributional facts about the data that may or may not be appropriate depending on your problem domain.

When displaying variables to end users we suggest using the original names and the minimum significance seen on any derived variable:

```{r displayvars}
origVarNames <- sort(unique(vScoreFrame$origName[vScoreFrame$varName %in% varsToUse]))
print(origVarNames)

origVarSigs <- vScoreFrame[vScoreFrame$varName %in% varsToUse,]
aggregate(sig~origName,data=origVarSigs,FUN=min)
```

## Links

  * ['vtreat' on GitHub](https://github.com/WinVector/vtreat)
  * ['vtreat' on CRAN](https://cran.r-project.org/package=vtreat)