---
title: "vtreat significance"
author: "John Mount, Nina Zumel"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{vtreat significance}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

`vtreat::prepare` includes a required argument `pruneSig` that (if not NULL) is used to prune variables by significance: variables whose estimated significance is larger than `pruneSig` are suppressed.  Significance depends on training set size (so it is not an intrinsic property of the variables alone), and the estimate can be biased (vtreat attempts to eliminate this bias by estimating the significance of complex sub-model variables on cross-validated or out-of-sample data). As always, the question is what value to set the significance control to.

Our pragmatic advice is the following:

Use variable filtering on wide datasets (datasets with many columns or variables).  Most machine learning algorithms cannot defend themselves against large numbers of noise variables (including those algorithms that have cross-validation procedures built in).  Examples are given [here](https://win-vector.com/2014/02/01/bad-bayes-an-example-of-why-you-need-hold-out-testing/).

As an upper bound think of setting `pruneSig` below _1/numberOfColumns_.  Setting `pruneSig` to _1/numberOfColumns_ means that (in expectation) only a constant number of pure noise variables (variables with no actual relation to the outcome we are trying to predict) should create columns.  This means (under some assumptions, and in expectation) we expect only a bounded number of noisy columns to be exposed to downstream statistical and machine learning algorithms (which they can presumably handle).
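
As a rough illustration of the expectation argument above, the following sketch simulates pure noise columns and counts how many would pass a threshold of _1/numberOfColumns_.  It uses plain single-variable `lm` p-values as a stand-in for vtreat's significance estimate (an assumption made to keep the example self-contained); `nNoiseCols`, `noiseY`, `noiseSig`, and `noiseThreshold` are illustrative names, not part of vtreat.

```{r}
# Illustrative simulation only; not vtreat's own significance calculation.
set.seed(2024)
nNoiseRows <- 500
nNoiseCols <- 100
noiseY <- rnorm(nNoiseRows)

# p-value of a single-variable linear regression of the outcome
# on a freshly generated pure noise column
noiseSig <- vapply(seq_len(nNoiseCols), function(i) {
  x <- rnorm(nNoiseRows)
  summary(lm(noiseY ~ x))$coefficients['x', 'Pr(>|t|)']
}, numeric(1))

noiseThreshold <- 1/nNoiseCols
# observed number of noise columns that would survive pruning
sum(noiseSig < noiseThreshold)
# expected number of survivors (p-values of noise are roughly uniform)
nNoiseCols * noiseThreshold
```

Because the p-values of pure noise variables are approximately uniform on the unit interval, the expected number of survivors is `nNoiseCols * noiseThreshold`, which stays at 1 no matter how many noise columns are present.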

As a lower bound, think about what sort of good variables get thrown out at a given setting of `pruneSig`.  For example, suppose our problem is classification on a data set with _n/2_ positive examples and _n/2_ negative examples.   Consider the observed significance of a rare indicator variable that is on _k_ times in training and is on only for positive instances.  A random variable that is on _k_ times would achieve this purity with probability of about $2^{-k}$, so we expect it to have a _-log(significance)_ in the ballpark of _k_.  So a `pruneSig` of $2^{-k}$ will filter all such variables out (be they good or bad).  Thus, if you want to keep levels or indicators that are on only a _z_ fraction of the time in a training set of size _n_, you want `pruneSig` >> $2^{-z*n}$.
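
The $2^{-k}$ figure can be checked with a short counting argument: if the _k_ "on" rows are chosen at random from _n_ rows, half of which are positive, the chance that all of them are positive is `choose(n/2,k)/choose(n,k)`.  The sketch below compares that exact value to $2^{-k}$.  The helper `purityProb` is a hypothetical name introduced here, and this one-sided counting probability is not the same calculation as the significance vtreat reports, so treat it only as a ballpark check.  Note also that with the natural logarithm, $-\log(2^{-k}) = k \log 2 \approx 0.69 k$.

```{r}
# Hypothetical helper: probability that kOn randomly placed "on" rows
# all fall among the nTotal/2 positive rows (computed on the log scale
# to avoid overflow in choose()).
purityProb <- function(nTotal, kOn) {
  exp(lchoose(nTotal/2, kOn) - lchoose(nTotal, kOn))
}

kDemo <- c(1, 2, 3, 4, 5, 10, 20)
purityTab <- data.frame(
  k = kDemo,
  exactProb = purityProb(1000, kDemo),
  approxProb = 2^(-kDemo))
purityTab$minusLogExact <- -log(purityTab$exactProb)
print(purityTab)
```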

Example:

```{r}
signk <- function(n,k) {
  sigTab <- data.frame(y=c(rep(TRUE,n/2),rep(FALSE,n/2)),v=FALSE)
  sigTab[seq_len(k),'v'] <- TRUE
  vtreat::designTreatmentsC(sigTab,'v','y',TRUE,verbose=FALSE)$scoreFrame[1,'sig']
}
sigTab <- data.frame(k=c(1,2,3,4,5,10,20,50,100))
# If you want to see a rare but perfect indicator of positive class
# that's only on k times out of 1000, this is the lower bound on pruneSig
sigTab$sigEst <- vapply(sigTab$k,function(k) signk(1000,k),numeric(1))
sigTab$minusLogSig = -log(sigTab$sigEst) # we expect this to be approximately k
print(sigTab)
```

For a data set with 100 variables (and 1000 rows), you might want to set `pruneSig` <= 0.01 to limit the number of pure noise variables that enter the model. Note that this value is smaller than the lower bounds given above for $k < 5$. This means that in a data set of this width and length, you may not be able to detect rare but perfect indicators that occur fewer than 5 times.  You would have a chance of using such rare indicators in a _catN_ or _catB_ effects coded variable.

Below we design a data frame with a perfect categorical variable (completely determines
the outcome y) where each level occurs only about 2 times on average.  The individual levels are
insignificant, but we can still extract a significant _catB_ effect coded variable.

```{r}
set.seed(3346)
n <- 1000
k <- 4
d <- data.frame(y=rbinom(n,size=1,prob=0.5)>0)
d$catVarNoise <- rep(paste0('lev',sprintf("%03d",1:floor(n/k))),(k+1))[1:n]
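# build the perfect variable from the noise levels plus the first letter of y
# (use the full column name rather than relying on partial matching of `$`)
d$catVarPerfect <- paste0(d$catVarNoise,substr(as.character(d$y),1,1))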
d <- d[order(d$catVarPerfect),]
head(d)

treatmentsC <- vtreat::designTreatmentsC(d,c('catVarNoise','catVarPerfect'),'y',TRUE)


# Estimate effect significance (not coefficient significance).
estSigGLM <- function(xVar,yVar,numberOfHiddenDegrees=0) {
  d <- data.frame(x=xVar,y=yVar,stringsAsFactors = FALSE)
  model <- stats::glm(stats::as.formula('y~x'),
                      data=d,
                      family=stats::binomial(link='logit'))
  delta_deviance <- model$null.deviance - model$deviance
  delta_df <- model$df.null - model$df.residual + numberOfHiddenDegrees
  pRsq <- 1.0 - model$deviance/model$null.deviance
  sig <- stats::pchisq(delta_deviance, delta_df, lower.tail=FALSE)
  sig
}

# pruneSig=NULL turns off significance pruning, so all variables are kept
prepD <- vtreat::prepare(treatmentsC,d,pruneSig=NULL)
```

vtreat produces good variable significances using out of sample simulation
(cross frames).

```{r scoreframe}
print(treatmentsC$scoreFrame[,c('varName','rsq','sig','extraModelDegrees')])
```

For categorical targets, the `sig` column of the `scoreFrame` is the significance of the single variable logistic regression using the named variable (plus a constant term), and the `rsq` column is the "pseudo-r-squared" or portion of deviance explained (please see [here](https://win-vector.com/2011/09/14/the-simpler-derivation-of-logistic-regression/) for some notes).  For numeric targets, the `sig` column is the significance of the single variable linear regression using the named variable (plus a constant term), and the `rsq` column is the "r-squared" or portion of variance explained (please see [here](https://win-vector.com/2011/11/21/correlation-and-r-squared/) for some notes).
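
The code in this vignette works through the categorical target case.  For the numeric target case, the quantities described above can be sketched with base R as follows.  This is a minimal illustration on made-up data, using the overall F-test of a single variable `lm` fit; the names `numDemo`, `rsqDemo`, and `sigDemo` are introduced here for illustration, and vtreat's internal calculation may differ in its details.

```{r}
# Made-up data: a numeric outcome weakly related to one numeric variable.
set.seed(5)
numDemo <- data.frame(x = rnorm(200))
numDemo$y <- 0.3*numDemo$x + rnorm(200)

numFitSummary <- summary(lm(y ~ x, data = numDemo))

# portion of variance explained by the single variable model
rsqDemo <- numFitSummary$r.squared

# significance of the fit from the F statistic and its degrees of freedom
fstat <- numFitSummary$fstatistic
sigDemo <- pf(fstat[['value']], fstat[['numdf']], fstat[['dendf']],
              lower.tail = FALSE)

print(c(rsq = rsqDemo, sig = sigDemo))
```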

Signal-carrying complex variables can score as significant, even those composed of
rare levels.

```{r scoresignal}
summary(glm(y~d$catVarPerfect=='lev001T',data=d,family=binomial))
estSigGLM(prepD$catVarPerfect_catB,prepD$y,0) # wrong estimate: ignores the hidden degrees of freedom
estSigGLM(prepD$catVarPerfect_catB,prepD$y,
          numberOfHiddenDegrees=length(unique(d$catVarPerfect))-1)
```

Noise variables (those without a relation to outcome) are also scored
correctly as long as we account for the degrees of freedom.

```{r scorenoise}
summary(glm(y~d$catVarNoise=='lev001',data=d,family=binomial))
estSigGLM(prepD$catVarNoise_catB,prepD$y,0) # wrong estimate: ignores the hidden degrees of freedom
estSigGLM(prepD$catVarNoise_catB,prepD$y,
          numberOfHiddenDegrees=length(unique(d$catVarNoise))-1)
```