Title: | A Statistically Sound 'data.frame' Processor/Conditioner |
---|---|
Description: | A 'data.frame' processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner. 'vtreat' prepares variables so that data has fewer exceptional cases, making it easier to safely use models in production. Common problems 'vtreat' defends against: 'Inf', 'NA', too many categorical levels, rare categorical levels, and new categorical levels (levels seen during application, but not during training). Reference: "'vtreat': a data.frame Processor for Predictive Modeling", Zumel, Mount, 2016, <DOI:10.5281/zenodo.1173313>. |
Authors: | John Mount [aut, cre], Nina Zumel [aut], Win-Vector LLC [cph] |
Maintainer: | John Mount <[email protected]> |
License: | GPL-2 \| GPL-3 |
Version: | 1.6.5 |
Built: | 2024-11-10 03:44:51 UTC |
Source: | https://github.com/winvector/vtreat |
A 'data.frame' processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner. 'vtreat' prepares variables so that data has fewer exceptional cases, making it easier to safely use models in production. Common problems 'vtreat' defends against: 'Inf', 'NA', too many categorical levels, rare categorical levels, and new categorical levels (levels seen during application, but not during training). 'vtreat::prepare' should be used as you would use 'model.matrix'.
For more information:
vignette('vtreat', package='vtreat')
vignette(package='vtreat')
Website: https://github.com/WinVector/vtreat
Maintainer: John Mount [email protected]
Authors:
Nina Zumel [email protected]
Other contributors:
Win-Vector LLC [copyright holder]
Useful links:
Report bugs at https://github.com/WinVector/vtreat/issues
Apply first argument to second as a transform.
apply_transform(vps, dframe, ..., parallelCluster = NULL)
vps |
vtreat pipe step, object defining transform. |
dframe |
data.frame, data to transform |
... |
not used, forces later arguments to bind by name. |
parallelCluster |
optional, parallel cluster to run on. |
transformed dframe
Convert vtreatment plans into a sequence of rquery operations.
as_rquery_plan(treatmentplans, ..., var_restriction = NULL)
treatmentplans |
vtreat treatment plan or list of vtreat treatment plan sharing same outcome and outcome type. |
... |
not used, force any later arguments to bind to names. |
var_restriction |
character, if not null restrict to producing these variables. |
list(optree_generator (ordered list of functions), temp_tables (named list of tables))
if (requireNamespace("rquery", quietly = TRUE)) {
  dTrainC <- data.frame(x = c('a', 'a', 'a', 'b', NA, 'b'),
                        z = c(1, 2, NA, 4, 5, 6),
                        y = c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE),
                        stringsAsFactors = FALSE)
  dTrainC$id <- seq_len(nrow(dTrainC))
  treatmentsC <- designTreatmentsC(dTrainC, c("x", "z"), 'y', TRUE)
  print(prepare(treatmentsC, dTrainC))
  rqplan <- as_rquery_plan(list(treatmentsC))
  ops <- flatten_fn_list(rquery::local_td(dTrainC), rqplan$optree_generators)
  cat(format(ops))
  if (requireNamespace("rqdatatable", quietly = TRUE)) {
    treated <- rqdatatable::ex_data_table(ops, tables = rqplan$tables)
    print(treated[])
  }
  if (requireNamespace("DBI", quietly = TRUE) &&
      requireNamespace("RSQLite", quietly = TRUE)) {
    db <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
    source_data <- rquery::rq_copy_to(db, "dTrainC", dTrainC,
                                      overwrite = TRUE, temporary = TRUE)
    rest <- rquery_prepare(db, rqplan, source_data, "dTreatedC",
                           extracols = "id")
    resd <- DBI::dbReadTable(db, rest$table_name)
    print(resd)
    rquery::rq_remove_table(db, source_data$table_name)
    rquery::rq_remove_table(db, rest$table_name)
    DBI::dbDisconnect(db)
  }
}
Hold settings and results for binomial classification data preparation.
BinomialOutcomeTreatment( ..., var_list, outcome_name, outcome_target = TRUE, cols_to_copy = NULL, params = NULL, imputation_map = NULL )
... |
not used, force arguments to be specified by name. |
var_list |
Names of columns to treat (effective variables). |
outcome_name |
Name of column holding outcome variable. |
outcome_target |
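Value/level of outcome to be considered "success", and there must be a cut such that dframe[[outcome_name]]==outcome_target at least twice and dframe[[outcome_name]]!=outcome_target at least twice. |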
cols_to_copy |
list of extra columns to copy. |
params |
parameters list from classification_parameters. |
imputation_map |
map from column names to functions of signature f(values: numeric, weights: numeric), simple missing value imputers. |
Please see https://github.com/WinVector/vtreat/blob/main/Examples/fit_transform/fit_transform_api.md, mkCrossFrameCExperiment, designTreatmentsC, and prepare.treatmentplan for details.
Return a carve-up of seq_len(nRows). Very useful for any sort of nested model situation (such as data prep, stacking, or super-learning).
buildEvalSets( nRows, ..., dframe = NULL, y = NULL, splitFunction = NULL, nSplits = 3 )
nRows |
scalar, >=1 number of rows to sample from. |
... |
no additional arguments, declared to force named binding of later arguments. |
dframe |
(optional) original data.frame, passed to user splitFunction. |
y |
(optional) numeric vector, outcome variable (possibly to stratify on), passed to user splitFunction. |
splitFunction |
(optional) function taking arguments nRows, nSplits, dframe, and y; returning a user desired split. |
nSplits |
integer, target number of splits. |
Also sets attribute "splitmethod" on return value that describes how the split was performed. attr(returnValue,'splitmethod') is one of: 'notsplit' (data was not split; corner cases like single row data sets), 'oneway' (leave one out holdout), 'kwaycross' (a simple partition), 'userfunction' (user supplied function was actually used), or a user specified attribute. Any user desired properties (such as stratification on y, or preservation of groups designated by original data row numbers) may not apply unless you see that 'userfunction' has been used.
The intent is the user splitFunction only needs to handle "easy cases" and maintain user invariants. If the user splitFunction returns NULL, throws, or returns an unacceptable carve-up then vtreat::buildEvalSets returns its own eval set plan. The signature of splitFunction should be splitFunction(nRows,nSplits,dframe,y) where nSplits is the number of pieces we want in the carve-up, nRows is the number of rows to split, dframe is the original dataframe (useful for any group control variables), and y is a numeric vector representing outcome (useful for outcome stratification).
Note that buildEvalSets may not always return a partition, for example on one-row data frames or when the user split function makes rows eligible for application a different number of times.
list of lists where the app portion of the sub-lists is a disjoint carve-up of seq_len(nRows) and each list has a train portion disjoint from app.
kWayCrossValidation, kWayStratifiedY, and makekWayCrossValidationGroupedByColumn
# use
buildEvalSets(200)
# longer example
# helper fns
# fit models using experiment plan to estimate out of sample behavior
fitModelAndApply <- function(trainData, applicationData) {
  model <- lm(y~x, data=trainData)
  predict(model, newdata=applicationData)
}
simulateOutOfSampleTrainEval <- function(d, fitApplyFn) {
  eSets <- buildEvalSets(nrow(d))
  evals <- lapply(eSets,
                  function(ei) { fitApplyFn(d[ei$train,], d[ei$app,]) })
  pred <- numeric(nrow(d))
  for(eii in seq_len(length(eSets))) {
    pred[eSets[[eii]]$app] <- evals[[eii]]
  }
  pred
}
# run the experiment
set.seed(2352356)
# example data
d <- data.frame(x=rnorm(5), y=rnorm(5),
                outOfSampleEst=NA, inSampleEst=NA)
# fit model on all data
d$inSampleEst <- fitModelAndApply(d, d)
# compute in-sample R^2 (above zero, falsely shows a
# relation until we adjust for degrees of freedom)
1-sum((d$y-d$inSampleEst)^2)/sum((d$y-mean(d$y))^2)
d$outOfSampleEst <- simulateOutOfSampleTrainEval(d, fitModelAndApply)
# compute out-sample R^2 (not positive,
# evidence of no relation)
1-sum((d$y-d$outOfSampleEst)^2)/sum((d$y-mean(d$y))^2)
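The splitFunction contract described for buildEvalSets can be illustrated with a small user-written splitter. This is only a sketch: the name blockSplit and the contiguous-block scheme are illustrative assumptions, not something vtreat supplies. It follows the splitFunction(nRows, nSplits, dframe, y) signature and returns a list of lists with train and app entries, as the return-value description requires.

```r
# Hypothetical user split function (name and block scheme are assumptions):
# carve seq_len(nRows) into nSplits contiguous blocks.
blockSplit <- function(nRows, nSplits, dframe, y) {
  # assign each row index to one of nSplits contiguous groups
  group <- ceiling(seq_len(nRows) * nSplits / nRows)
  lapply(seq_len(nSplits), function(g) {
    # app rows are the block; train rows are everything else
    list(train = which(group != g), app = which(group == g))
  })
}

plan <- blockSplit(10, 3, NULL, NULL)
str(plan)

# If vtreat is installed, hand the splitter to buildEvalSets and inspect
# which split method was actually used.
if (requireNamespace("vtreat", quietly = TRUE)) {
  evalSets <- vtreat::buildEvalSets(10, splitFunction = blockSplit, nSplits = 3)
  print(attr(evalSets, "splitmethod"))
}
```

Checking the "splitmethod" attribute is the way to confirm the user function was accepted rather than silently replaced by vtreat's own plan.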
Center and scale a set of variables. Other columns are passed through.
center_scale(d, center, scale)
d |
data.frame to work with |
center |
named vector of variables to center |
scale |
named vector of variables to scale |
d with centered and scaled columns altered
d <- data.frame(x = 1:5, y = c('a', 'a', 'b', 'b', 'b'))
vars_to_transform = "x"
t <- base::scale(as.matrix(d[, vars_to_transform, drop = FALSE]), center = TRUE, scale = TRUE)
t
centering <- attr(t, "scaled:center")
scaling <- attr(t, "scaled:scale")
center_scale(d, center = centering, scale = scaling)
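The arithmetic behind centering and scaling named columns can be sketched in base R. The helper manual_center_scale below is a hypothetical stand-in written for illustration; it is not vtreat's implementation, and it assumes the usual order of subtracting the center and then dividing by the scale.

```r
# Hypothetical helper (illustration only): subtract the named centers,
# divide by the named scales, and pass all other columns through.
manual_center_scale <- function(d, center, scale) {
  for (v in names(center)) d[[v]] <- d[[v]] - center[[v]]
  for (v in names(scale)) d[[v]] <- d[[v]] / scale[[v]]
  d
}

d <- data.frame(x = 1:5, y = c('a', 'a', 'b', 'b', 'b'))
centering <- c(x = mean(d$x))
scaling <- c(x = stats::sd(d$x))
res <- manual_center_scale(d, center = centering, scale = scaling)
print(res)
```

The transformed x column should agree with base::scale applied to the same values, while y is left unchanged.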
A list of settings and values for vtreat binomial classification fitting.
Please see https://github.com/WinVector/vtreat/blob/main/Examples/fit_transform/fit_transform_api.md, mkCrossFrameCExperiment, designTreatmentsC, and prepare.treatmentplan for details.
classification_parameters(user_params = NULL)
user_params |
list of user overrides. |
filled out parameter list
Design a simple treatment plan to indicate missingness and perform simple imputation.
design_missingness_treatment( dframe, ..., varlist = colnames(dframe), invalid_mark = "_invalid_", drop_constant_columns = FALSE, missingness_imputation = NULL, imputation_map = NULL )
dframe |
data.frame to drive design. |
... |
not used, forces later arguments to bind by name. |
varlist |
character, names of columns to process. |
invalid_mark |
character, name to use for NA levels and novel levels. |
drop_constant_columns |
logical, if TRUE drop columns that do not vary from the treatment plan. |
missingness_imputation |
function of signature f(values: numeric), simple missing value imputer. |
imputation_map |
map from column names to functions of signature f(values: numeric), simple missing value imputers. |
simple treatment plan.
d <- wrapr::build_frame(
  "x1", "x2", "x3" |
  1   , 4   , "A"  |
  NA  , 5   , "B"  |
  3   , 6   , NA   )
plan <- design_missingness_treatment(d)
prepare(plan, d)
prepare(plan, data.frame(x1=NA, x2=NA, x3="E"))
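The missingness_imputation and imputation_map arguments take simple imputer functions. A minimal sketch of one such function follows. The name median_imputer and the defaulted weights argument are assumptions made for illustration, chosen so the function can be called either as f(values) or as f(values, weights).

```r
# Hypothetical imputer (name and defaulted 'weights' are assumptions):
# returns the median of the observed values and ignores the weights.
median_imputer <- function(values, weights = NULL) {
  stats::median(values, na.rm = TRUE)
}

median_imputer(c(1, NA, 3, 10))

# If vtreat is installed, use the imputer in place of the default.
if (requireNamespace("vtreat", quietly = TRUE)) {
  d <- data.frame(x1 = c(1, NA, 3, 10), x2 = c(4, 5, 6, 7))
  plan <- vtreat::design_missingness_treatment(
    d, missingness_imputation = median_imputer)
  print(vtreat::prepare(plan, d))
}
```

To use different imputers per column, the same kind of function could be supplied through imputation_map as a named list keyed by column name.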
Function to design variable treatments for binary prediction of a categorical outcome. Data frame is assumed to have only atomic columns except for dates (which are converted to numeric). Note: re-encoding high cardinality categorical variables can introduce undesirable nested model bias; for such data consider using mkCrossFrameCExperiment.
designTreatmentsC( dframe, varlist, outcomename, outcometarget = TRUE, ..., weights = c(), minFraction = 0.02, smFactor = 0, rareCount = 0, rareSig = NULL, collarProb = 0, codeRestriction = NULL, customCoders = NULL, splitFunction = NULL, ncross = 3, forceSplit = FALSE, catScaling = TRUE, verbose = TRUE, parallelCluster = NULL, use_parallel = TRUE, missingness_imputation = NULL, imputation_map = NULL )
dframe |
Data frame to learn treatments from (training data), must have at least 1 row. |
varlist |
Names of columns to treat (effective variables). |
outcomename |
Name of column holding outcome variable. dframe[[outcomename]] must be only finite non-missing values. |
outcometarget |
Value/level of outcome to be considered "success", and there must be a cut such that dframe[[outcomename]]==outcometarget at least twice and dframe[[outcomename]]!=outcometarget at least twice. |
... |
no additional arguments, declared to force named binding of later arguments |
weights |
optional training weights for each row |
minFraction |
optional minimum frequency a categorical level must have to be converted to an indicator column. |
smFactor |
optional smoothing factor for impact coding models. |
rareCount |
optional integer, allow levels with this count or below to be pooled into a shared rare-level. Defaults to 0 or off. |
rareSig |
optional numeric, suppress levels from pooling at this significance value or greater. Defaults to NULL or off. |
collarProb |
what fraction of the data (pseudo-probability) to collar data at if doCollar is set during prepare.treatmentplan. |
codeRestriction |
what types of variables to produce (character array of level codes, NULL means no restriction). |
customCoders |
map from code names to custom categorical variable encoding functions (please see https://github.com/WinVector/vtreat/blob/main/extras/CustomLevelCoders.md). |
splitFunction |
(optional) see vtreat::buildEvalSets . |
ncross |
optional scalar >=2 number of cross validation splits used in rescoring complex variables. |
forceSplit |
logical, if TRUE force cross-validated significance calculations on all variables. |
catScaling |
optional, if TRUE use glm() linkspace, if FALSE use lm() for scaling. |
verbose |
if TRUE print progress. |
parallelCluster |
(optional) a cluster object created by package parallel or package snow. |
use_parallel |
logical, if TRUE use parallel methods (when parallel cluster is set). |
missingness_imputation |
function of signature f(values: numeric, weights: numeric), simple missing value imputer. |
imputation_map |
map from column names to functions of signature f(values: numeric, weights: numeric), simple missing value imputers. |
The main fields are mostly vectors with names (all with the same names in the same order):
- vars : (character array without names) names of variables (in same order as names on the other diagnostic vectors)
- varMoves : logical TRUE if the variable varied during hold out scoring, only variables that move will be in the treated frame
- sig : an estimated significance of effect
See the vtreat vignette for a bit more detail and a worked example.
Columns that do not vary are not passed through.
Note: re-encoding high cardinality categorical variables on training data can introduce nested model bias; consider using mkCrossFrameCExperiment instead.
treatment plan (for use with prepare)
prepare.treatmentplan, designTreatmentsN, designTreatmentsZ, mkCrossFrameCExperiment
dTrainC <- data.frame(x=c('a','a','a','b','b','b'),
                      z=c(1,2,3,4,5,6),
                      y=c(FALSE,FALSE,TRUE,FALSE,TRUE,TRUE))
dTestC <- data.frame(x=c('a','b','c',NA), z=c(10,20,30,NA))
treatmentsC <- designTreatmentsC(dTrainC,colnames(dTrainC),'y',TRUE)
dTestCTreated <- prepare(treatmentsC,dTestC,pruneSig=0.99)
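The minFraction rule (a categorical level needs a minimum frequency to be converted to an indicator column) can be illustrated in base R. This is a conceptual sketch under stated assumptions, not vtreat's actual encoding code; the variable names are illustrative.

```r
# Illustrative data: 'c' and NA are each only 10% of the rows.
x <- c("a", "a", "a", "a", "a", "b", "b", "b", "c", NA)
minFraction <- 0.2

# observed frequency of each level, counting NA as its own level
freq <- table(x, useNA = "ifany") / length(x)

# levels frequent enough to earn an indicator column
keep <- names(freq)[as.vector(freq >= minFraction)]

# one 0/1 column per kept level; rows holding rare or missing
# values are zero in every indicator column
indicators <- sapply(keep, function(level) as.integer(!is.na(x) & x == level))
print(indicators)
```

In this sketch a level never seen during design would also map to an all-zero row, which is one simple way of tolerating new categorical levels at application time.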
Function to design variable treatments for prediction of a numeric outcome. Data frame is assumed to have only atomic columns except for dates (which are converted to numeric). Note: each column is processed independently of all others. Note: re-encoding high cardinality categorical variables on training data can introduce undesirable nested model bias; for such data consider using mkCrossFrameNExperiment.
designTreatmentsN( dframe, varlist, outcomename, ..., weights = c(), minFraction = 0.02, smFactor = 0, rareCount = 0, rareSig = NULL, collarProb = 0, codeRestriction = NULL, customCoders = NULL, splitFunction = NULL, ncross = 3, forceSplit = FALSE, verbose = TRUE, parallelCluster = NULL, use_parallel = TRUE, missingness_imputation = NULL, imputation_map = NULL )
dframe |
Data frame to learn treatments from (training data), must have at least 1 row. |
varlist |
Names of columns to treat (effective variables). |
outcomename |
Name of column holding outcome variable. dframe[[outcomename]] must be only finite non-missing values and there must be a cut such that dframe[[outcomename]] is both above the cut at least twice and below the cut at least twice. |
... |
no additional arguments, declared to force named binding of later arguments |
weights |
optional training weights for each row |
minFraction |
optional minimum frequency a categorical level must have to be converted to an indicator column. |
smFactor |
optional smoothing factor for impact coding models. |
rareCount |
optional integer, allow levels with this count or below to be pooled into a shared rare-level. Defaults to 0 or off. |
rareSig |
optional numeric, suppress levels from pooling at this significance value or greater. Defaults to NULL or off. |
collarProb |
what fraction of the data (pseudo-probability) to collar data at if doCollar is set during prepare.treatmentplan. |
codeRestriction |
what types of variables to produce (character array of level codes, NULL means no restriction). |
customCoders |
map from code names to custom categorical variable encoding functions (please see https://github.com/WinVector/vtreat/blob/main/extras/CustomLevelCoders.md). |
splitFunction |
(optional) see vtreat::buildEvalSets . |
ncross |
optional scalar >=2 number of cross validation splits used in rescoring complex variables. |
forceSplit |
logical, if TRUE force cross-validated significance calculations on all variables. |
verbose |
if TRUE print progress. |
parallelCluster |
(optional) a cluster object created by package parallel or package snow. |
use_parallel |
logical, if TRUE use parallel methods (when parallel cluster is set). |
missingness_imputation |
function of signature f(values: numeric, weights: numeric), simple missing value imputer. |
imputation_map |
map from column names to functions of signature f(values: numeric, weights: numeric), simple missing value imputers. |
The main fields are mostly vectors with names (all with the same names in the same order):
- vars : (character array without names) names of variables (in same order as names on the other diagnostic vectors)
- varMoves : logical TRUE if the variable varied during hold out scoring, only variables that move will be in the treated frame
- sig : an estimated significance of effect
See the vtreat vignette for a bit more detail and a worked example.
Columns that do not vary are not passed through.
treatment plan (for use with prepare)
prepare.treatmentplan, designTreatmentsC, designTreatmentsZ, mkCrossFrameNExperiment
dTrainN <- data.frame(x=c('a','a','a','a','b','b','b'),
                      z=c(1,2,3,4,5,6,7),
                      y=c(0,0,0,1,0,1,1))
dTestN <- data.frame(x=c('a','b','c',NA), z=c(10,20,30,NA))
treatmentsN = designTreatmentsN(dTrainN,colnames(dTrainN),'y')
dTestNTreated <- prepare(treatmentsN,dTestN,pruneSig=0.99)
Data frame is assumed to have only atomic columns except for dates (which are converted to numeric). Note: each column is processed independently of all others.
designTreatmentsZ( dframe, varlist, ..., minFraction = 0, weights = c(), rareCount = 0, collarProb = 0, codeRestriction = NULL, customCoders = NULL, verbose = TRUE, parallelCluster = NULL, use_parallel = TRUE, missingness_imputation = NULL, imputation_map = NULL )
dframe |
Data frame to learn treatments from (training data), must have at least 1 row. |
varlist |
Names of columns to treat (effective variables). |
... |
no additional arguments, declared to force named binding of later arguments |
minFraction |
optional minimum frequency a categorical level must have to be converted to an indicator column. |
weights |
optional training weights for each row |
rareCount |
optional integer, allow levels with this count or below to be pooled into a shared rare-level. Defaults to 0 or off. |
collarProb |
what fraction of the data (pseudo-probability) to collar data at if doCollar is set during prepare.treatmentplan. |
codeRestriction |
what types of variables to produce (character array of level codes, NULL means no restriction). |
customCoders |
map from code names to custom categorical variable encoding functions (please see https://github.com/WinVector/vtreat/blob/main/extras/CustomLevelCoders.md). |
verbose |
if TRUE print progress. |
parallelCluster |
(optional) a cluster object created by package parallel or package snow. |
use_parallel |
logical, if TRUE use parallel methods (if parallel cluster is set). |
missingness_imputation |
function of signature f(values: numeric, weights: numeric), simple missing value imputer. |
imputation_map |
map from column names to functions of signature f(values: numeric, weights: numeric), simple missing value imputers. |
The main fields are mostly vectors with names (all with the same names in the same order):
- vars : (character array without names) names of variables (in same order as names on the other diagnostic vectors)
- varMoves : logical TRUE if the variable varied during hold out scoring, only variables that move will be in the treated frame
See the vtreat vignette for a bit more detail and a worked example.
Columns that do not vary are not passed through.
treatment plan (for use with prepare)
prepare.treatmentplan, designTreatmentsC, designTreatmentsN
dTrainZ <- data.frame(x=c('a','a','a','a','b','b',NA,'e','e'),
                      z=c(1,2,3,4,5,6,7,NA,9))
dTestZ <- data.frame(x=c('a','x','c',NA), z=c(10,20,30,NA))
treatmentsZ = designTreatmentsZ(dTrainZ, colnames(dTrainZ), rareCount=0)
dTrainZTreated <- prepare(treatmentsZ, dTrainZ)
dTestZTreated <- prepare(treatmentsZ, dTestZ)
Update the state of first argument to have learned or fit from second argument.
fit(vps, dframe, ..., weights = NULL, parallelCluster = NULL)
vps |
vtreat pipe step, object specifying fit |
dframe |
data.frame, data to fit from. |
... |
not used, forces later arguments to bind by name. |
weights |
optional, per-dframe data weights. |
parallelCluster |
optional, parallel cluster to run on. |
Note: input vps is not altered, fit is in returned value.
new fit object
Update the state of first argument to have learned or fit from second argument, and compute a cross validated example of such a transform.
fit_prepare(vps, dframe, ..., weights = NULL, parallelCluster = NULL)
vps |
vtreat pipe step, object specifying fit. |
dframe |
data.frame, data to fit from. |
... |
not used, forces later arguments to bind by name. |
weights |
optional, per-dframe data weights. |
parallelCluster |
optional, parallel cluster to run on. |
Note: input vps is not altered, fit is in returned list.
named list containing: treatments and cross_frame
Update the state of first argument to have learned or fit from second argument, and compute a cross validated example of such a transform.
fit_transform(vps, dframe, ..., weights = NULL, parallelCluster = NULL)
vps |
vtreat pipe step, object specifying fit. |
dframe |
data.frame, data to fit from. |
... |
not used, forces later arguments to bind by name. |
weights |
optional, per-dframe data weights. |
parallelCluster |
optional, parallel cluster to run on. |
Note: input vps is not altered, fit is in returned list.
named list containing: treatments and cross_frame
Display treatment plan.
## S3 method for class 'vtreatment'
format(x, ...)
x |
treatment plan |
... |
additional args (to match general signature). |
Return previously fit feature names.
get_feature_names(vps)
vps |
vtreat pipe step, mutable object to read from. |
feature names
Return previously fit score frame.
get_score_frame(vps)
vps |
vtreat pipe step, mutable object to read from. |
score frame
Return previously fit transform.
get_transform(vps)
vps |
vtreat pipe step, mutable object to read from. |
transform
Read application labels off a split plan.
getSplitPlanAppLabels(nRow, plan)
nRow |
number of rows in original data.frame. |
plan |
split plan |
vector of labels
kWayCrossValidation, kWayStratifiedY, and makekWayCrossValidationGroupedByColumn
plan <- kWayStratifiedY(3,2,NULL,NULL)
getSplitPlanAppLabels(3,plan)
k-fold cross validation, a splitFunction in the sense of vtreat::buildEvalSets
kWayCrossValidation(nRows, nSplits, dframe, y)
nRows |
number of rows to split (>1). |
nSplits |
number of groups to split into (>1,<=nRows). |
dframe |
original data frame (ignored). |
y |
numeric outcome variable (ignored). |
split plan
kWayCrossValidation(7,2,NULL,NULL)
k-fold cross validation stratified on y, a splitFunction in the sense of vtreat::buildEvalSets
kWayStratifiedY(nRows, nSplits, dframe, y)
nRows |
number of rows to split (>1) |
nSplits |
number of groups to split into (<nRows,>1). |
dframe |
original data frame (ignored). |
y |
numeric outcome variable; the split tries to keep it equidistributed across groups. |
split plan
set.seed(23255)
d <- data.frame(y=sin(1:100))
pStrat <- kWayStratifiedY(nrow(d),5,d,d$y)
problemAppPlan(nrow(d),5,pStrat,TRUE)
d$stratGroup <- vtreat::getSplitPlanAppLabels(nrow(d),pStrat)
pSimple <- kWayCrossValidation(nrow(d),5,d,d$y)
problemAppPlan(nrow(d),5,pSimple,TRUE)
d$simpleGroup <- vtreat::getSplitPlanAppLabels(nrow(d),pSimple)
summary(tapply(d$y,d$simpleGroup,mean))
summary(tapply(d$y,d$stratGroup,mean))
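One simple way to get groups in which y is roughly equidistributed is to sort by y and deal rows out round-robin. The sketch below shows that idea in base R; it is an assumption-laden illustration (the helper name stratified_groups is hypothetical) and is not claimed to be the algorithm kWayStratifiedY uses.

```r
# Hypothetical helper: deal rows into nSplits groups in order of y,
# so each group receives values from across the whole range of y.
stratified_groups <- function(y, nSplits) {
  ord <- order(y)
  group <- integer(length(y))
  group[ord] <- rep(seq_len(nSplits), length.out = length(y))
  group
}

y <- sin(1:100)
g <- stratified_groups(y, 5)

# group sizes and per-group means of y
print(table(g))
means <- tapply(y, g, mean)
print(means)
```

Because neighbouring values in sorted order land in different groups, the per-group means stay close to each other, which is the property the summaries in the example above compare.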
Build a k-fold cross validation sample where training sets are the same size as the original data, and built by sampling disjoint from test/application sets (sampled with replacement).
kWayStratifiedYReplace(nRows, nSplits, dframe, y)
nRows |
number of rows to split (>1) |
nSplits |
number of groups to split into (<nRows,>1). |
dframe |
original data frame (ignored). |
y |
numeric outcome variable; the split tries to keep it equidistributed across groups. |
split plan
set.seed(23255)
d <- data.frame(y=sin(1:100))
pStrat <- kWayStratifiedYReplace(nrow(d),5,d,d$y)
Make a categorical input custom coder.
makeCustomCoderCat( ..., customCode, coder, codeSeq, v, vcolin, zoY, zC, zTarget, weights = NULL, catScaling = FALSE )
... |
not used, force arguments to be set by name |
customCode |
code name |
coder |
user supplied variable re-coder (see vignette for type signature) |
codeSeq |
arguments to custom coder |
v |
variable name |
vcolin |
data column, character |
zoY |
outcome column as numeric |
zC |
if classification, outcome column as character |
zTarget |
if classification, target class |
weights |
per-row weights |
catScaling |
optional, if TRUE use glm() linkspace, if FALSE use lm() for scaling. |
wrapped custom coder
Make a numeric input custom coder.
makeCustomCoderNum( ..., customCode, coder, codeSeq, v, vcolin, zoY, zC, zTarget, weights = NULL, catScaling = FALSE )
... |
not used, force arguments to be set by name |
customCode |
code name |
coder |
user supplied variable re-coder (see vignette for type signature) |
codeSeq |
arguments to custom coder |
v |
variable name |
vcolin |
data column, numeric |
zoY |
outcome column as numeric |
zC |
if classification, outcome column as character |
zTarget |
if classification, target class |
weights |
per-row weights |
catScaling |
optional, if TRUE use glm() linkspace, if FALSE use lm() for scaling. |
wrapped custom coder
Build a k-fold cross validation splitter, respecting (never splitting) groupingColumn.
makekWayCrossValidationGroupedByColumn(groupingColumnName)
groupingColumnName |
name of column to group by. |
splitting function in the sense of vtreat::buildEvalSets.
d <- data.frame(y=sin(1:100))
d$group <- floor(seq_len(nrow(d))/5)
splitter <- makekWayCrossValidationGroupedByColumn('group')
split <- splitter(nrow(d),5,d,d$y)
d$splitLabel <- vtreat::getSplitPlanAppLabels(nrow(d),split)
rowSums(table(d$group,d$splitLabel)>0)
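The "never splitting" property can also be checked programmatically. The sketch below is a minimal check rather than part of the package: it asserts that every group lands in exactly one application fold. The seed value and the variable name foldsPerGroup are illustrative choices.

```r
library(vtreat)

set.seed(2019)  # illustrative seed
d <- data.frame(y = sin(1:100))
d$group <- floor(seq_len(nrow(d)) / 5)

splitter <- makekWayCrossValidationGroupedByColumn("group")
plan <- splitter(nrow(d), 5, d, d$y)
d$splitLabel <- getSplitPlanAppLabels(nrow(d), plan)

# Count how many distinct application folds each group appears in.
foldsPerGroup <- rowSums(table(d$group, d$splitLabel) > 0)

# A grouped splitter should place every group in exactly one fold.
stopifnot(all(foldsPerGroup == 1))
```

This kind of grouping matters when rows from the same group are correlated (for example repeated measurements), since splitting a group across folds would leak information between training and application sets.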
Builds a designTreatmentsC treatment plan and a data frame prepared from dframe that is "cross" in the sense that each row is treated using a treatment plan built from a subset of dframe disjoint from the given row. The goal is to supply a method of breaking nested model bias other than splitting into calibration, training, and test sets.
mkCrossFrameCExperiment( dframe, varlist, outcomename, outcometarget, ..., weights = c(), minFraction = 0.02, smFactor = 0, rareCount = 0, rareSig = 1, collarProb = 0, codeRestriction = NULL, customCoders = NULL, scale = FALSE, doCollar = FALSE, splitFunction = NULL, ncross = 3, forceSplit = FALSE, catScaling = TRUE, verbose = TRUE, parallelCluster = NULL, use_parallel = TRUE, missingness_imputation = NULL, imputation_map = NULL )
dframe |
Data frame to learn treatments from (training data), must have at least 1 row. |
varlist |
Names of columns to treat (effective variables). |
outcomename |
Name of column holding outcome variable. dframe[[outcomename]] must be only finite non-missing values. |
outcometarget |
Value/level of outcome to be considered "success"; dframe[[outcomename]]==outcometarget must hold at least twice and dframe[[outcomename]]!=outcometarget must hold at least twice. |
... |
no additional arguments, declared to force named binding of later arguments |
weights |
optional training weights for each row |
minFraction |
optional minimum frequency a categorical level must have to be converted to an indicator column. |
smFactor |
optional smoothing factor for impact coding models. |
rareCount |
optional integer, allow levels with this count or below to be pooled into a shared rare-level. Defaults to 0 or off. |
rareSig |
optional numeric, suppress levels from pooling at this significance value or greater. Defaults to 1 (off). |
collarProb |
what fraction of the data (pseudo-probability) to collar data at if doCollar is set during |
codeRestriction |
what types of variables to produce (character array of level codes, NULL means no restriction). |
customCoders |
map from code names to custom categorical variable encoding functions (please see https://github.com/WinVector/vtreat/blob/main/extras/CustomLevelCoders.md). |
scale |
optional if TRUE replace numeric variables with regression ("move to outcome-scale"). |
doCollar |
optional if TRUE collar numeric variables by cutting off after a tail-probability specified by collarProb during treatment design. |
splitFunction |
(optional) see vtreat::buildEvalSets . |
ncross |
optional scalar>=2 number of cross-validation rounds to design. |
forceSplit |
logical, if TRUE force cross-validated significance calculations on all variables. |
catScaling |
optional, if TRUE use glm() linkspace, if FALSE use lm() for scaling. |
verbose |
if TRUE print progress. |
parallelCluster |
(optional) a cluster object created by package parallel or package snow. |
use_parallel |
logical, if TRUE use parallel methods. |
missingness_imputation |
function of signature f(values: numeric, weights: numeric), simple missing value imputer. |
imputation_map |
map from column names to functions of signature f(values: numeric, weights: numeric), simple missing value imputers. |
named list containing: treatments, crossFrame, crossWeights, method, and evalSets
designTreatmentsC, designTreatmentsN, prepare.treatmentplan
# categorical example
set.seed(23525)

# we set up our raw training and application data
dTrainC <- data.frame(
  x = c('a', 'a', 'a', 'b', 'b', NA, NA),
  z = c(1, 2, 3, 4, NA, 6, NA),
  y = c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE))

dTestC <- data.frame(
  x = c('a', 'b', 'c', NA),
  z = c(10, 20, 30, NA))

# we perform a vtreat cross frame experiment
# and unpack the results into treatmentsC
# and dTrainCTreated
unpack[
  treatmentsC = treatments,
  dTrainCTreated = crossFrame
  ] <- mkCrossFrameCExperiment(
    dframe = dTrainC,
    varlist = setdiff(colnames(dTrainC), 'y'),
    outcomename = 'y',
    outcometarget = TRUE,
    verbose = FALSE)

# the treatments include a score frame relating new
# derived variables to original columns
treatmentsC$scoreFrame[, c('origName', 'varName', 'code', 'rsq', 'sig', 'extraModelDegrees')] %.>%
  print(.)

# the treated frame is a "cross frame" which
# is a transform of the training data built
# as if the treatment were learned on a different
# disjoint training set to avoid nested model
# bias and over-fit.
dTrainCTreated %.>%
  head(.) %.>%
  print(.)

# Any future application data is prepared with
# the prepare method.
dTestCTreated <- prepare(treatmentsC, dTestC, pruneSig=NULL)

dTestCTreated %.>%
  head(.) %.>%
  print(.)
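The example above relies on the unpack[] notation and the %.>% pipe. As a sketch that avoids both, the returned value can be used as an ordinary named list. The names treatments and crossFrame are taken from the return-value description; the variable name cfe is an illustrative choice.

```r
library(vtreat)

set.seed(23525)

dTrainC <- data.frame(
  x = c('a', 'a', 'a', 'b', 'b', NA, NA),
  z = c(1, 2, 3, 4, NA, 6, NA),
  y = c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE))

# Run the cross frame experiment and keep the whole result list.
cfe <- mkCrossFrameCExperiment(
  dframe = dTrainC,
  varlist = c('x', 'z'),
  outcomename = 'y',
  outcometarget = TRUE,
  verbose = FALSE)

# Pull the pieces out by name.
treatmentsC <- cfe$treatments
dTrainCTreated <- cfe$crossFrame

print(names(cfe))
print(dim(dTrainCTreated))
```

The cross frame has one row per training row, so it can be handed to a downstream model in place of the raw training data.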
Please see vignette("MultiClassVtreat", package = "vtreat")
https://winvector.github.io/vtreat/articles/MultiClassVtreat.html.
mkCrossFrameMExperiment( dframe, varlist, outcomename, ..., weights = c(), minFraction = 0.02, smFactor = 0, rareCount = 0, rareSig = 1, collarProb = 0, codeRestriction = NULL, customCoders = NULL, scale = FALSE, doCollar = FALSE, splitFunction = vtreat::kWayCrossValidation, ncross = 3, forceSplit = FALSE, catScaling = FALSE, y_dependent_treatments = c("catB"), verbose = FALSE, parallelCluster = NULL, use_parallel = TRUE, missingness_imputation = NULL, imputation_map = NULL )
dframe |
data to learn from |
varlist |
character, vector of independent variable column names. |
outcomename |
character, name of outcome column. |
... |
not used, declared to force named binding of later arguments |
weights |
optional training weights for each row |
minFraction |
optional minimum frequency a categorical level must have to be converted to an indicator column. |
smFactor |
optional smoothing factor for impact coding models. |
rareCount |
optional integer, allow levels with this count or below to be pooled into a shared rare-level. Defaults to 0 or off. |
rareSig |
optional numeric, suppress levels from pooling at this significance value or greater. Defaults to 1 (off). |
collarProb |
what fraction of the data (pseudo-probability) to collar data at if doCollar is set during |
codeRestriction |
what types of variables to produce (character array of level codes, NULL means no restriction). |
customCoders |
map from code names to custom categorical variable encoding functions (please see https://github.com/WinVector/vtreat/blob/main/extras/CustomLevelCoders.md). |
scale |
optional if TRUE replace numeric variables with regression ("move to outcome-scale"). |
doCollar |
optional if TRUE collar numeric variables by cutting off after a tail-probability specified by collarProb during treatment design. |
splitFunction |
(optional) see vtreat::buildEvalSets . |
ncross |
optional scalar>=2 number of cross-validation rounds to design. |
forceSplit |
logical, if TRUE force cross-validated significance calculations on all variables. |
catScaling |
optional, if TRUE use glm() linkspace, if FALSE use lm() for scaling. |
y_dependent_treatments |
character what treatment types to build per-outcome level. |
verbose |
if TRUE print progress. |
parallelCluster |
(optional) a cluster object created by package parallel or package snow. |
use_parallel |
logical, if TRUE use parallel methods. |
missingness_imputation |
function of signature f(values: numeric, weights: numeric), simple missing value imputer. |
imputation_map |
map from column names to functions of signature f(values: numeric, weights: numeric), simple missing value imputers. |
a named list containing cross_frame, treat_m, score_frame, and fit_obj_id
# multinomial example
set.seed(23525)

# we set up our raw training and application data
dTrainM <- data.frame(
  x = c('a', 'a', 'a', 'a', 'b', 'b', NA, NA),
  z = c(1, 2, 3, 4, 5, NA, 7, NA),
  y = c(0, 0, 0, 1, 0, 1, 2, 1))

dTestM <- data.frame(
  x = c('a', 'b', 'c', NA),
  z = c(10, 20, 30, NA))

# we perform a vtreat cross frame experiment
# and unpack the results into treatmentsM,
# dTrainMTreated, and score_frame
unpack[
  treatmentsM = treat_m,
  dTrainMTreated = cross_frame,
  score_frame = score_frame
  ] <- mkCrossFrameMExperiment(
    dframe = dTrainM,
    varlist = setdiff(colnames(dTrainM), 'y'),
    outcomename = 'y',
    verbose = FALSE)

# the score_frame relates new
# derived variables to original columns
score_frame[, c('origName', 'varName', 'code', 'rsq', 'sig', 'outcome_level')] %.>%
  print(.)

# the treated frame is a "cross frame" which
# is a transform of the training data built
# as if the treatment were learned on a different
# disjoint training set to avoid nested model
# bias and over-fit.
dTrainMTreated %.>%
  head(.) %.>%
  print(.)

# Any future application data is prepared with
# the prepare method.
dTestMTreated <- prepare(treatmentsM, dTestM, pruneSig=NULL)

dTestMTreated %.>%
  head(.) %.>%
  print(.)
Builds a designTreatmentsN treatment plan and a data frame prepared from dframe that is "cross" in the sense that each row is treated using a treatment plan built from a subset of dframe disjoint from the given row. The goal is to supply a method of breaking nested model bias other than splitting into calibration, training, and test sets.
mkCrossFrameNExperiment( dframe, varlist, outcomename, ..., weights = c(), minFraction = 0.02, smFactor = 0, rareCount = 0, rareSig = 1, collarProb = 0, codeRestriction = NULL, customCoders = NULL, scale = FALSE, doCollar = FALSE, splitFunction = NULL, ncross = 3, forceSplit = FALSE, verbose = TRUE, parallelCluster = NULL, use_parallel = TRUE, missingness_imputation = NULL, imputation_map = NULL )
dframe |
Data frame to learn treatments from (training data), must have at least 1 row. |
varlist |
Names of columns to treat (effective variables). |
outcomename |
Name of column holding outcome variable. dframe[[outcomename]] must be only finite non-missing values and there must be a cut such that dframe[[outcomename]] is both above the cut at least twice and below the cut at least twice. |
... |
no additional arguments, declared to force named binding of later arguments |
weights |
optional training weights for each row |
minFraction |
optional minimum frequency a categorical level must have to be converted to an indicator column. |
smFactor |
optional smoothing factor for impact coding models. |
rareCount |
optional integer, allow levels with this count or below to be pooled into a shared rare-level. Defaults to 0 or off. |
rareSig |
optional numeric, suppress levels from pooling at this significance value or greater. Defaults to 1 (off). |
collarProb |
what fraction of the data (pseudo-probability) to collar data at if doCollar is set during |
codeRestriction |
what types of variables to produce (character array of level codes, NULL means no restriction). |
customCoders |
map from code names to custom categorical variable encoding functions (please see https://github.com/WinVector/vtreat/blob/main/extras/CustomLevelCoders.md). |
scale |
optional if TRUE replace numeric variables with regression ("move to outcome-scale"). |
doCollar |
optional if TRUE collar numeric variables by cutting off after a tail-probability specified by collarProb during treatment design. |
splitFunction |
(optional) see vtreat::buildEvalSets . |
ncross |
optional scalar>=2 number of cross-validation rounds to design. |
forceSplit |
logical, if TRUE force cross-validated significance calculations on all variables. |
verbose |
if TRUE print progress. |
parallelCluster |
(optional) a cluster object created by package parallel or package snow. |
use_parallel |
logical, if TRUE use parallel methods. |
missingness_imputation |
function of signature f(values: numeric, weights: numeric), simple missing value imputer. |
imputation_map |
map from column names to functions of signature f(values: numeric, weights: numeric), simple missing value imputers. |
named list containing: treatments, crossFrame, crossWeights, method, and evalSets
designTreatmentsC, designTreatmentsN, prepare.treatmentplan
# numeric example
set.seed(23525)

# we set up our raw training and application data
dTrainN <- data.frame(
  x = c('a', 'a', 'a', 'a', 'b', 'b', NA, NA),
  z = c(1, 2, 3, 4, 5, NA, 7, NA),
  y = c(0, 0, 0, 1, 0, 1, 1, 1))

dTestN <- data.frame(
  x = c('a', 'b', 'c', NA),
  z = c(10, 20, 30, NA))

# we perform a vtreat cross frame experiment
# and unpack the results into treatmentsN
# and dTrainNTreated
unpack[
  treatmentsN = treatments,
  dTrainNTreated = crossFrame
  ] <- mkCrossFrameNExperiment(
    dframe = dTrainN,
    varlist = setdiff(colnames(dTrainN), 'y'),
    outcomename = 'y',
    verbose = FALSE)

# the treatments include a score frame relating new
# derived variables to original columns
treatmentsN$scoreFrame[, c('origName', 'varName', 'code', 'rsq', 'sig', 'extraModelDegrees')] %.>%
  print(.)

# the treated frame is a "cross frame" which
# is a transform of the training data built
# as if the treatment were learned on a different
# disjoint training set to avoid nested model
# bias and over-fit.
dTrainNTreated %.>%
  head(.) %.>%
  print(.)

# Any future application data is prepared with
# the prepare method.
dTestNTreated <- prepare(treatmentsN, dTestN, pruneSig=NULL)

dTestNTreated %.>%
  head(.) %.>%
  print(.)
A list of settings and values for vtreat multinomial classification fitting. Please see https://github.com/WinVector/vtreat/blob/main/Examples/fit_transform/fit_transform_api.md, mkCrossFrameMExperiment and prepare.multinomial_plan for details.
multinomial_parameters(user_params = NULL)
user_params |
list of user overrides. |
filled out parameter list
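A minimal sketch of overriding one setting follows. It assumes the override names match the argument names of mkCrossFrameMExperiment (here ncross) and that the result can be indexed like an ordinary list; neither point is stated in this entry.

```r
library(vtreat)

# Assumption: override keys use the same names as the
# mkCrossFrameMExperiment arguments, e.g. `ncross`.
params <- multinomial_parameters(list(ncross = 5))

# The user-supplied value should appear in the filled-out list.
print(params$ncross)
```

Settings that are not overridden keep the package defaults.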
Hold settings and results for multinomial classification data preparation.
MultinomialOutcomeTreatment( ..., var_list, outcome_name, cols_to_copy = NULL, params = NULL, imputation_map = NULL )
... |
not used, force arguments to be specified by name. |
var_list |
Names of columns to treat (effective variables). |
outcome_name |
Name of column holding outcome variable. |
cols_to_copy |
list of extra columns to copy. |
params |
parameters list from |
imputation_map |
map from column names to functions of signature f(values: numeric, weights: numeric), simple missing value imputers. |
Please see https://github.com/WinVector/vtreat/blob/main/Examples/fit_transform/fit_transform_api.md, mkCrossFrameMExperiment and prepare.multinomial_plan for details.
Note: there currently is no designTreatmentsM, so MultinomialOutcomeTreatment$fit() is implemented in terms of MultinomialOutcomeTreatment$fit_transform().
Report new/novel appearances of character values.
novel_value_summary(dframe, trackedValues)
dframe |
Data frame to inspect. |
trackedValues |
optional named list mapping variables to known values, allows warnings upon novel level appearances (see |
frame of novel occurrences
prepare.treatmentplan, track_values
set.seed(23525)
zip <- c(NA, paste('z', 1:10, sep = "_"))
N <- 10
d <- data.frame(zip = sample(zip, N, replace=TRUE),
                zip2 = sample(zip, N, replace=TRUE),
                y = runif(N))
dSample <- d[1:5, , drop = FALSE]
trackedValues <- track_values(dSample, c("zip", "zip2"))
novel_value_summary(d, trackedValues)
Hold settings and results for regression data preparation.
NumericOutcomeTreatment( ..., var_list, outcome_name, cols_to_copy = NULL, params = NULL, imputation_map = NULL )
... |
not used, force arguments to be specified by name. |
var_list |
Names of columns to treat (effective variables). |
outcome_name |
Name of column holding outcome variable. |
cols_to_copy |
list of extra columns to copy. |
params |
parameters list from |
imputation_map |
map from column names to functions of signature f(values: numeric, weights: numeric), simple missing value imputers. |
Please see https://github.com/WinVector/vtreat/blob/main/Examples/fit_transform/fit_transform_api.md, mkCrossFrameNExperiment, designTreatmentsN, and prepare.treatmentplan for details.
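The following is a hedged sketch of how such a treatment object might be used on synthetic data. It assumes the object exposes a fit_transform() method (named in the MultinomialOutcomeTreatment notes) that accepts the training data frame and returns the cross-frame as a data.frame. The data and variable names are illustrative; consult the linked fit_transform_api.md for the authoritative interface.

```r
library(vtreat)

set.seed(23525)

# Illustrative synthetic data: one categorical and one numeric input.
d <- data.frame(
  x = sample(c('a', 'b', NA), 100, replace = TRUE),
  z = rnorm(100))
d$y <- ifelse(is.na(d$x), 0, ifelse(d$x == 'a', 1, -1)) + d$z + rnorm(100)

# Specify which columns to treat and which column is the outcome.
treatment <- NumericOutcomeTreatment(
  var_list = c('x', 'z'),
  outcome_name = 'y')

# Assumption: fit_transform() learns the treatment from d and
# returns the cross-frame version of the training data.
dPrepared <- treatment$fit_transform(d)

print(head(dPrepared))
```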
Note: one-way holdout can leak target expected values, so it should not be preferred in nested modeling situations. It also ignores nSplits.
oneWayHoldout(nRows, nSplits, dframe, y)
nRows |
number of rows to split (integer >1). |
nSplits |
number of groups to split into (ignored). |
dframe |
original data frame (ignored). |
y |
numeric outcome variable (ignored). |
split plan
oneWayHoldout(3,NULL,NULL,NULL)
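To see what the returned plan looks like, the sketch below prints each split. The field names train and app, and the expectation that each row forms its own single-row application set, are assumptions about the plan layout rather than statements from this entry.

```r
library(vtreat)

plan <- oneWayHoldout(3, NULL, NULL, NULL)

# Assumption: each split is a list with index vectors `train` and `app`.
for (i in seq_along(plan)) {
  cat("split", i,
      "train:", plan[[i]]$train,
      "app:", plan[[i]]$app, "\n")
}

# Collect all application indices across the splits.
appRows <- sort(unlist(lapply(plan, function(s) s$app)))
print(appRows)
```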
Add columns from new_frame into orig_frame, replacing any columns with matching names in orig_frame with values from new_frame.
patch_columns_into_frame(orig_frame, new_frame)
orig_frame |
data.frame to patch into. |
new_frame |
data.frame to take replacement columns from. |
patched data.frame
orig_frame <- data.frame(x = 1, y = 2)
new_frame <- data.frame(y = 3, z = 4)
patch_columns_into_frame(orig_frame, new_frame)
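The described semantics (overwrite matching columns, append new ones) can be sketched in base R. The helper patch_sketch below is hypothetical and is not the package's implementation; in particular, the column order it produces may differ from that of patch_columns_into_frame.

```r
# Hypothetical base-R helper illustrating the described semantics;
# not the implementation used by vtreat.
patch_sketch <- function(orig_frame, new_frame) {
  for (nm in colnames(new_frame)) {
    # Overwrites the column if it already exists, appends it otherwise.
    orig_frame[[nm]] <- new_frame[[nm]]
  }
  orig_frame
}

orig_frame <- data.frame(x = 1, y = 2)
new_frame <- data.frame(y = 3, z = 4)

patched <- patch_sketch(orig_frame, new_frame)
print(patched)
```

In the sketch, x is kept from orig_frame, y is replaced by the value from new_frame, and z is added.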
Pre-computed cross-plan (so same split happens each time).
pre_comp_xval(nRows, nSplits, splitplan)
nRows |
number of rows to split (integer >1). |
nSplits |
number of groups to split into (ignored). |
splitplan |
split plan to actually use |
splitplan
p1 <- oneWayHoldout(3,NULL,NULL,NULL)
p2 <- pre_comp_xval(3, 3, p1)
p2(3, 3)
Apply treatments and restrict to useful variables.
prepare(treatmentplan, dframe, ...)
treatmentplan |
Plan built by designTreatmentsC() or designTreatmentsN() |
dframe |
Data frame to be treated |
... |
no additional arguments, declared to force named binding of later arguments |
prepare.treatmentplan, prepare.simple_plan, prepare.multinomial_plan
Please see vignette("MultiClassVtreat", package = "vtreat")
https://winvector.github.io/vtreat/articles/MultiClassVtreat.html.
## S3 method for class 'multinomial_plan' prepare( treatmentplan, dframe, ..., pruneSig = NULL, scale = FALSE, doCollar = FALSE, varRestriction = NULL, codeRestriction = NULL, trackedValues = NULL, extracols = NULL, parallelCluster = NULL, use_parallel = TRUE, check_for_duplicate_frames = TRUE )
treatmentplan |
multinomial_plan from mkCrossFrameMExperiment. |
dframe |
new data to process. |
... |
not used, declared to force named binding of later arguments |
pruneSig |
suppress variables with significance above this level |
scale |
optional, if TRUE replace numeric variables with single variable model regressions ("move to outcome-scale"). These have mean zero and (for variables with significance less than 1) slope 1 when regressed (lm for regression problems/glm for classification problems) against outcome. |
doCollar |
optional if TRUE collar numeric variables by cutting off after a tail-probability specified by collarProb during treatment design. |
varRestriction |
optional list of treated variable names to restrict to |
codeRestriction |
optional list of treated variable codes to restrict to |
trackedValues |
optional named list mapping variables to known values, allows warnings upon novel level appearances (see |
extracols |
extra columns to copy. |
parallelCluster |
(optional) a cluster object created by package parallel or package snow. |
use_parallel |
logical, if TRUE use parallel methods. |
check_for_duplicate_frames |
logical, if TRUE check if we called prepare on same data.frame as design step. |
prepared data frame.
mkCrossFrameMExperiment, prepare
Prepare a simple treatment.
## S3 method for class 'simple_plan' prepare(treatmentplan, dframe, ...)
treatmentplan |
A simple treatment plan. |
dframe |
data.frame to be treated. |
... |
not used, present for S3 signature consistency. |
design_missingness_treatment, prepare
d <- wrapr::build_frame(
  "x1", "x2", "x3" |
  1   , 4   , "A"  |
  NA  , 5   , "B"  |
  3   , 6   , NA   )

plan <- design_missingness_treatment(d)
prepare(plan, d)

prepare(plan, data.frame(x1=NA, x2=NA, x3="E"))
Use a treatment plan to prepare a data frame for analysis. The resulting frame will have new effective variables that are numeric and free of NaN/NA. If the outcome column is present it will be copied over. The intent is that these frames are compatible with more machine learning techniques, and avoid a lot of corner cases (NA, NaN, novel levels, too many levels).
Note: each column is processed independently of all others.
Note: treatment plans are not meant for long-term storage; a warning is issued if the version of vtreat that produced the plan differs from the version running prepare().
## S3 method for class 'treatmentplan' prepare( treatmentplan, dframe, ..., pruneSig = NULL, scale = FALSE, doCollar = FALSE, varRestriction = NULL, codeRestriction = NULL, trackedValues = NULL, extracols = NULL, parallelCluster = NULL, use_parallel = TRUE, check_for_duplicate_frames = TRUE )
treatmentplan |
Plan built by designTreatmentsC() or designTreatmentsN() |
dframe |
Data frame to be treated |
... |
no additional arguments, declared to force named binding of later arguments |
pruneSig |
suppress variables with significance above this level |
scale |
optional, if TRUE replace numeric variables with single variable model regressions ("move to outcome-scale"). These have mean zero and (for variables with significance less than 1) slope 1 when regressed (lm for regression problems/glm for classification problems) against outcome. |
doCollar |
optional if TRUE collar numeric variables by cutting off after a tail-probability specified by collarProb during treatment design. |
varRestriction |
optional list of treated variable names to restrict to |
codeRestriction |
optional list of treated variable codes to restrict to |
trackedValues |
optional named list mapping variables to known values, allows warnings upon novel level appearances (see |
extracols |
extra columns to copy. |
parallelCluster |
(optional) a cluster object created by package parallel or package snow. |
use_parallel |
logical, if TRUE use parallel methods. |
check_for_duplicate_frames |
logical, if TRUE check if we called prepare on same data.frame as design step. |
treated data frame (all columns numeric, without NA or NaN)
mkCrossFrameCExperiment, mkCrossFrameNExperiment, designTreatmentsC, designTreatmentsN, designTreatmentsZ, prepare
# categorical example
set.seed(23525)

# we set up our raw training and application data
dTrainC <- data.frame(
  x = c('a', 'a', 'a', 'b', 'b', NA, NA),
  z = c(1, 2, 3, 4, NA, 6, NA),
  y = c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE))

dTestC <- data.frame(
  x = c('a', 'b', 'c', NA),
  z = c(10, 20, 30, NA))

# we perform a vtreat cross frame experiment
# and unpack the results into treatmentsC
# and dTrainCTreated
unpack[
  treatmentsC = treatments,
  dTrainCTreated = crossFrame
  ] <- mkCrossFrameCExperiment(
    dframe = dTrainC,
    varlist = setdiff(colnames(dTrainC), 'y'),
    outcomename = 'y',
    outcometarget = TRUE,
    verbose = FALSE)

# the treatments include a score frame relating new
# derived variables to original columns
treatmentsC$scoreFrame[, c('origName', 'varName', 'code', 'rsq', 'sig', 'extraModelDegrees')] %.>%
  print(.)

# the treated frame is a "cross frame" which
# is a transform of the training data built
# as if the treatment were learned on a different
# disjoint training set to avoid nested model
# bias and over-fit.
dTrainCTreated %.>%
  head(.) %.>%
  print(.)

# Any future application data is prepared with
# the prepare method.
dTestCTreated <- prepare(treatmentsC, dTestC, pruneSig=NULL)

dTestCTreated %.>%
  head(.) %.>%
  print(.)
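The stated guarantee (all columns numeric, no NA or NaN) can be checked directly, including on application rows that carry a level never seen in training ('c') and missing values. The sketch below is a minimal check under the same toy data; the variable names cfe and allNumeric are illustrative.

```r
library(vtreat)

set.seed(23525)

dTrainC <- data.frame(
  x = c('a', 'a', 'a', 'b', 'b', NA, NA),
  z = c(1, 2, 3, 4, NA, 6, NA),
  y = c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE))

# Application data with a novel level ('c') and missing values.
dTestC <- data.frame(
  x = c('a', 'b', 'c', NA),
  z = c(10, 20, 30, NA))

cfe <- mkCrossFrameCExperiment(
  dframe = dTrainC,
  varlist = c('x', 'z'),
  outcomename = 'y',
  outcometarget = TRUE,
  verbose = FALSE)

dTestCTreated <- prepare(cfe$treatments, dTestC, pruneSig = NULL)

# Check the documented guarantees on the treated frame.
allNumeric <- all(vapply(dTestCTreated, is.numeric, logical(1)))
print(allNumeric)
print(anyNA(dTestCTreated))
```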
Print treatmentplan.
## S3 method for class 'multinomial_plan'
print(x, ...)
x |
treatmentplan |
... |
additional args (to match general signature). |
Print treatmentplan.
## S3 method for class 'simple_plan'
print(x, ...)
x |
treatmentplan |
... |
additional args (to match general signature). |
Print treatmentplan.
## S3 method for class 'treatmentplan'
print(x, ...)
x |
treatmentplan |
... |
additional args (to match general signature). |
designTreatmentsC, designTreatmentsN, designTreatmentsZ, prepare.treatmentplan
Print treatmentplan.
## S3 method for class 'vtreatment'
print(x, ...)
x |
treatmentplan |
... |
additional args (to match general signature). |
designTreatmentsC, designTreatmentsN, designTreatmentsZ, prepare.treatmentplan
Check whether appPlan is a good carve-up of 1:nRows into nSplits groups.
problemAppPlan(nRows, nSplits, appPlan, strictCheck)
nRows |
number of rows to carve-up |
nSplits |
number of sets to carve-up into |
appPlan |
carve-up to critique |
strictCheck |
logical, if true expect application data to be a carve-up and training data to be a maximal partition and to match nSplits. |
Problem with the carve-up (NULL if good).
kWayCrossValidation, kWayStratifiedY, and makekWayCrossValidationGroupedByColumn
plan <- kWayStratifiedY(3,2,NULL,NULL)
problemAppPlan(3,3,plan,TRUE)
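To make the idea of a "carve-up" concrete, here is a minimal base-R sketch of the kind of check involved. It assumes a plan is a list of splits, each carrying `train` and `app` row-index vectors. The helper name `is_partition_sketch` and the hand-built plans are hypothetical, and this is not vtreat's own implementation of problemAppPlan.

```r
# Illustrative only: check that the `app` sets of a plan cover 1:n_rows
# exactly once, and that no split trains on rows it also applies to.
is_partition_sketch <- function(n_rows, plan) {
  app <- unlist(lapply(plan, function(s) s$app))
  disjoint <- all(vapply(
    plan,
    function(s) length(intersect(s$train, s$app)) == 0,
    logical(1)))
  disjoint && length(app) == n_rows && setequal(app, seq_len(n_rows))
}

# Hypothetical hand-built 3-way plan over 6 rows.
good_plan <- list(
  list(train = c(3, 4, 5, 6), app = c(1, 2)),
  list(train = c(1, 2, 5, 6), app = c(3, 4)),
  list(train = c(1, 2, 3, 4), app = c(5, 6)))

# Same plan, but row 6 is never used as application data.
bad_plan <- good_plan
bad_plan[[3]]$app <- c(5)

print(is_partition_sketch(6, good_plan))
print(is_partition_sketch(6, bad_plan))
```

The strictCheck option described above asks for more than this sketch does, for example that the number of splits matches nSplits.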
A list of settings and values for vtreat regression fitting.
Please see https://github.com/WinVector/vtreat/blob/main/Examples/fit_transform/fit_transform_api.md, mkCrossFrameCExperiment, designTreatmentsC, mkCrossFrameNExperiment, designTreatmentsN, and prepare.treatmentplan for details.
regression_parameters(user_params = NULL)
user_params |
list of user overrides. |
filled out parameter list
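A short usage sketch of overriding one default follows. It assumes that `minFraction` is one of the keys accepted in `user_params`; the documentation above does not list the keys, so treat that name as an assumption and inspect `names(regression_parameters())` to see what your installed version accepts.

```r
library(vtreat)

# Start from the defaults and override a single setting.
# `minFraction` as an accepted key is an assumption of this sketch.
params <- regression_parameters(list(minFraction = 0.05))

# The result is the filled-out parameter list.
print(params$minFraction)
```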
Materialize a treated data frame remotely.
rquery_prepare(
  db,
  rqplan,
  data_source,
  result_table_name,
  ...,
  extracols = NULL,
  temporary = FALSE,
  overwrite = TRUE,
  attempt_nan_inf_mapping = FALSE,
  col_sample = NULL,
  return_ops = FALSE
)

materialize_treated(
  db,
  rqplan,
  data_source,
  result_table_name,
  ...,
  extracols = NULL,
  temporary = FALSE,
  overwrite = TRUE,
  attempt_nan_inf_mapping = FALSE,
  col_sample = NULL,
  return_ops = FALSE
)
db |
a db handle. |
rqplan |
a query plan produced by as_rquery_plan(). |
data_source |
relop, data source (usually a relop_table_source). |
result_table_name |
character, table name to land result in |
... |
force later arguments to bind by name. |
extracols |
extra columns to copy. |
temporary |
logical, if TRUE try to make result temporary. |
overwrite |
logical, if TRUE try to overwrite result. |
attempt_nan_inf_mapping |
logical, if TRUE attempt to map NaN and Infinity to NA/NULL (good on PostgreSQL, not on Spark). |
col_sample |
sample of data to determine column types. |
return_ops |
logical, if TRUE return operator tree instead of materializing. |
description of treated table.
materialize_treated(): old name for rquery_prepare function
as_rquery_plan, rqdatatable_prepare
Return a vector of length y that is a piecewise function of x. This vector is picked as close to y (by square-distance) as possible for a set of x-only determined cut-points. Cross-validates for a good number of segments.
solve_piecewise(varName, x, y, w = NULL)
varName |
character, name of variable |
x |
numeric input (not empty, no NAs). |
y |
numeric or castable to such (same length as x no NAs), output to match |
w |
numeric positive, same length as x (weights, can be NULL) |
segmented y prediction
Return a vector of length y that is a piecewise function of x. This vector is picked as close to y (by square-distance) as possible for a set of x-only determined cut-points. Cross-validates for a good number of segments.
solve_piecewisec(varName, x, y, w = NULL)
varName |
character, name of variable |
x |
numeric input (not empty, no NAs). |
y |
numeric or castable to such (same length as x no NAs), output to match |
w |
numeric positive, same length as x (weights, can be NULL) |
segmented y prediction
Return a spline approximation of data.
spline_variable(varName, x, y, w = NULL)
varName |
character, name of variable |
x |
numeric input (not empty, no NAs). |
y |
numeric or castable to such (same length as x no NAs), output to match |
w |
numeric positive, same length as x (weights, can be NULL) |
spline y prediction
Return a spline approximation of the change in log odds.
spline_variablec(varName, x, y, w = NULL)
varName |
character, name of variable |
x |
numeric input (not empty, no NAs). |
y |
numeric or castable to such (same length as x no NAs), output to match |
w |
numeric positive, same length as x (weights, can be NULL) |
spline y prediction
Build a square moving average window (KNN in 1d). This is a high-frequency feature.
square_window(varName, x, y, w = NULL)
varName |
character, name of variable |
x |
numeric input (not empty, no NAs). |
y |
numeric or castable to such (same length as x no NAs), output to match |
w |
numeric positive, same length as x (weights, can be NULL) IGNORED |
segmented y prediction
d <- data.frame(x = c(NA, 1:6), y = c(0, 0, 0, 1, 1, 0, 0))
square_window("v", d$x, d$y)
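For intuition about what a "square moving average window" does, here is a minimal base-R sketch: sort by x, then replace each y with the unweighted mean of its neighbours in a fixed-width window of ranks. The function name `square_window_sketch` and the window width `k` are hypothetical, and this is not vtreat's implementation (which chooses its own window size).

```r
# Illustrative only: a flat ("square") moving average over the k nearest
# points in x-rank order, returned in the original row order.
square_window_sketch <- function(x, y, k = 3) {
  ord <- order(x)
  ys <- y[ord]
  n <- length(ys)
  half <- k %/% 2
  smoothed_sorted <- vapply(seq_len(n), function(i) {
    lo <- max(1, i - half)
    hi <- min(n, i + half)
    mean(ys[lo:hi])   # every point in the window gets equal weight
  }, numeric(1))
  out <- numeric(n)
  out[ord] <- smoothed_sorted
  out
}

x <- 1:6
y <- c(0, 0, 1, 1, 0, 0)
smoothed <- square_window_sketch(x, y, k = 3)
print(smoothed)
```

A narrow window tracks local bumps in y closely, which is why the description calls this a high-frequency feature.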
Build a square moving average window (KNN in 1d). This is a high-frequency feature. Approximation of the change in log odds.
square_windowc(varName, x, y, w = NULL)
varName |
character, name of variable |
x |
numeric input (not empty, no NAs). |
y |
numeric or castable to such (same length as x no NAs), output to match |
w |
numeric positive, same length as x (weights, can be NULL) IGNORED |
segmented y prediction
d <- data.frame(x = c(NA, 1:6), y = c(0, 0, 0, 1, 1, 0, 0))
square_window("v", d$x, d$y)
Builds lists of observed unique character values of varlist variables from the data frame.
track_values(dframe, varlist)
dframe |
Data frame to learn treatments from (training data), must have at least 1 row. |
varlist |
Names of columns to treat (effective variables). |
named list of values seen.
prepare.treatmentplan, novel_value_summary
set.seed(23525)
zip <- c(NA, paste('z', 1:100, sep = "_"))
N <- 500
d <- data.frame(zip = sample(zip, N, replace=TRUE),
                zip2 = sample(zip, N, replace=TRUE),
                y = runif(N))
dSample <- d[1:300, , drop = FALSE]
tplan <- designTreatmentsN(dSample,
                           c("zip", "zip2"), "y",
                           verbose = FALSE)
trackedValues <- track_values(dSample, c("zip", "zip2"))
# don't normally want to catch warnings,
# doing it here as this is an example
# and must not have unhandled warnings.
tryCatch(
  prepare(tplan, d, trackedValues = trackedValues),
  warning = function(w) { cat(paste(w, collapse = "\n")) })
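The idea behind tracking values can be sketched in base R: record the distinct values seen per column at training time, then compare application data against that record to spot novel levels. The data frames and variable names below are hypothetical, and the exact structure returned by track_values may differ from this plain named list.

```r
# Illustrative only: record values seen in training, then find values
# in new data that were never seen.
train <- data.frame(zip = c("z_1", "z_2", NA, "z_1"),
                    stringsAsFactors = FALSE)
new_data <- data.frame(zip = c("z_1", "z_3", NA),
                       stringsAsFactors = FALSE)

# Named list: one character vector of observed values per column.
seen <- lapply(train["zip"], function(col) unique(as.character(col)))

# Values present in the new data but absent from training.
novel <- setdiff(unique(as.character(new_data$zip)), seen$zip)
print(novel)
```

Here NA counts as a seen value because it occurred in training, so only the genuinely new level is reported.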
A list of settings and values for vtreat unsupervised fitting.
Please see https://github.com/WinVector/vtreat/blob/main/Examples/fit_transform/fit_transform_api.md, designTreatmentsZ, and prepare.treatmentplan for details.
unsupervised_parameters(user_params = NULL)
user_params |
list of user overrides. |
filled out parameter list
Hold settings and results for unsupervised data preparation.
UnsupervisedTreatment(
  ...,
  var_list,
  cols_to_copy = NULL,
  params = NULL,
  imputation_map = NULL
)
... |
not used, force arguments to be specified by name. |
var_list |
Names of columns to treat (effective variables). |
cols_to_copy |
list of extra columns to copy. |
params |
parameters list from |
imputation_map |
map from column names to functions of signature f(values: numeric, weights: numeric), simple missing value imputers. |
Please see https://github.com/WinVector/vtreat/blob/main/Examples/fit_transform/fit_transform_api.md, designTreatmentsZ, and prepare.treatmentplan for details.
Note: for UnsupervisedTreatment, fit_transform(d) is implemented as fit(d)$transform(d).
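A minimal usage sketch of the fit-then-transform pattern in the note above. It assumes the object returned by UnsupervisedTreatment exposes `fit` and `transform` as `$`-accessible methods, as the note suggests; the small data frame is made up for illustration, and the exact derived column names depend on the vtreat version.

```r
library(vtreat)

# Hypothetical data with a missing value in each column.
d <- data.frame(
  x = c("a", "b", NA, "a"),
  z = c(1, NA, 3, 4),
  stringsAsFactors = FALSE)

# Describe the treatment (no outcome column is needed).
treatment <- UnsupervisedTreatment(var_list = c("x", "z"))

# Fit on the data, then transform the same data, mirroring
# fit(d)$transform(d) from the note above.
fitted <- treatment$fit(d)
d_treated <- fitted$transform(d)

print(colnames(d_treated))
```

Because no outcome is used when learning the treatment, applying it to the same rows it was fit on does not raise the nested-model-bias concern that motivates cross frames in the supervised case.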
Value variables for predicting a categorical outcome.
value_variables_C(
  dframe,
  varlist,
  outcomename,
  outcometarget,
  ...,
  weights = c(),
  minFraction = 0.02,
  smFactor = 0,
  rareCount = 0,
  rareSig = 1,
  collarProb = 0,
  scale = FALSE,
  doCollar = FALSE,
  splitFunction = NULL,
  ncross = 3,
  forceSplit = FALSE,
  catScaling = TRUE,
  verbose = FALSE,
  parallelCluster = NULL,
  use_parallel = TRUE,
  customCoders = list(c.PiecewiseV.num = vtreat::solve_piecewisec,
    n.PiecewiseV.num = vtreat::solve_piecewise,
    c.knearest.num = vtreat::square_windowc,
    n.knearest.num = vtreat::square_window),
  codeRestriction = c("PiecewiseV", "knearest", "clean", "isBAD", "catB", "catP"),
  missingness_imputation = NULL,
  imputation_map = NULL
)
dframe |
Data frame to learn treatments from (training data), must have at least 1 row. |
varlist |
Names of columns to treat (effective variables). |
outcomename |
Name of column holding outcome variable. dframe[[outcomename]] must be only finite non-missing values. |
outcometarget |
Value/level of outcome to be considered "success", and there must be a cut such that dframe[[outcomename]]==outcometarget at least twice and dframe[[outcomename]]!=outcometarget at least twice. |
... |
no additional arguments, declared to force named binding of later arguments |
weights |
optional training weights for each row |
minFraction |
optional minimum frequency a categorical level must have to be converted to an indicator column. |
smFactor |
optional smoothing factor for impact coding models. |
rareCount |
optional integer, allow levels with this count or below to be pooled into a shared rare-level. Defaults to 0 or off. |
rareSig |
optional numeric, suppress levels from pooling at this significance value or greater. Defaults to NULL or off. |
collarProb |
what fraction of the data (pseudo-probability) to collar data at if doCollar is set during |
scale |
optional if TRUE replace numeric variables with regression ("move to outcome-scale"). |
doCollar |
optional if TRUE collar numeric variables by cutting off after a tail-probability specified by collarProb during treatment design. |
splitFunction |
(optional) see vtreat::buildEvalSets . |
ncross |
optional scalar>=2 number of cross-validation rounds to design. |
forceSplit |
logical, if TRUE force cross-validated significance calculations on all variables. |
catScaling |
optional, if TRUE use glm() linkspace, if FALSE use lm() for scaling. |
verbose |
if TRUE print progress. |
parallelCluster |
(optional) a cluster object created by package parallel or package snow. |
use_parallel |
logical, if TRUE use parallel methods. |
customCoders |
additional coders to use for variable importance estimate. |
codeRestriction |
codes to restrict to for variable importance estimate. |
missingness_imputation |
function of signature f(values: numeric, weights: numeric), simple missing value imputer. |
imputation_map |
map from column names to functions of signature f(values: numeric, weights: numeric), simple missing value imputers. |
table of variable valuations
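The doCollar and collarProb arguments describe clipping numeric values at tail quantiles. A minimal base-R sketch of that idea follows; the helper name `collar_sketch` is hypothetical and this is not vtreat's internal code, which learns the cut points during treatment design and reuses them at application time.

```r
# Illustrative only: clip a numeric vector at its collar_prob and
# (1 - collar_prob) empirical quantiles.
collar_sketch <- function(v, collar_prob) {
  cuts <- stats::quantile(v,
                          probs = c(collar_prob, 1 - collar_prob),
                          na.rm = TRUE, names = FALSE)
  pmin(pmax(v, cuts[1]), cuts[2])
}

# One extreme value dominates the raw range.
v <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 1000)
collared <- collar_sketch(v, collar_prob = 0.1)
print(range(v))
print(range(collared))
```

In this sketch the cut points are computed from the same vector being clipped, which is only appropriate for illustration; in practice they would be learned on training data and applied unchanged to new data.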
Value variables for predicting a numeric outcome.
value_variables_N(
  dframe,
  varlist,
  outcomename,
  ...,
  weights = c(),
  minFraction = 0.02,
  smFactor = 0,
  rareCount = 0,
  rareSig = 1,
  collarProb = 0,
  scale = FALSE,
  doCollar = FALSE,
  splitFunction = NULL,
  ncross = 3,
  forceSplit = FALSE,
  verbose = FALSE,
  parallelCluster = NULL,
  use_parallel = TRUE,
  customCoders = list(c.PiecewiseV.num = vtreat::solve_piecewisec,
    n.PiecewiseV.num = vtreat::solve_piecewise,
    c.knearest.num = vtreat::square_windowc,
    n.knearest.num = vtreat::square_window),
  codeRestriction = c("PiecewiseV", "knearest", "clean", "isBAD", "catB", "catP"),
  missingness_imputation = NULL,
  imputation_map = NULL
)
dframe |
Data frame to learn treatments from (training data), must have at least 1 row. |
varlist |
Names of columns to treat (effective variables). |
outcomename |
Name of column holding outcome variable. dframe[[outcomename]] must be only finite non-missing values and there must be a cut such that dframe[[outcomename]] is both above the cut at least twice and below the cut at least twice. |
... |
no additional arguments, declared to force named binding of later arguments |
weights |
optional training weights for each row |
minFraction |
optional minimum frequency a categorical level must have to be converted to an indicator column. |
smFactor |
optional smoothing factor for impact coding models. |
rareCount |
optional integer, allow levels with this count or below to be pooled into a shared rare-level. Defaults to 0 or off. |
rareSig |
optional numeric, suppress levels from pooling at this significance value or greater. Defaults to NULL or off. |
collarProb |
what fraction of the data (pseudo-probability) to collar data at if doCollar is set during |
scale |
optional if TRUE replace numeric variables with regression ("move to outcome-scale"). |
doCollar |
optional if TRUE collar numeric variables by cutting off after a tail-probability specified by collarProb during treatment design. |
splitFunction |
(optional) see vtreat::buildEvalSets . |
ncross |
optional scalar>=2 number of cross-validation rounds to design. |
forceSplit |
logical, if TRUE force cross-validated significance calculations on all variables. |
verbose |
if TRUE print progress. |
parallelCluster |
(optional) a cluster object created by package parallel or package snow. |
use_parallel |
logical, if TRUE use parallel methods. |
customCoders |
additional coders to use for variable importance estimate. |
codeRestriction |
codes to restrict to for variable importance estimate. |
missingness_imputation |
function of signature f(values: numeric, weights: numeric), simple missing value imputer. |
imputation_map |
map from column names to functions of signature f(values: numeric, weights: numeric), simple missing value imputers. |
table of variable valuations
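The rareCount argument describes pooling infrequent categorical levels into one shared level. A minimal base-R sketch of that idea follows; the helper name `pool_rare_sketch` and the label `"rare"` are hypothetical, and this is not how vtreat names or encodes its pooled level.

```r
# Illustrative only: replace every level seen rare_count times or fewer
# with a single shared label. Missing values are left as NA.
pool_rare_sketch <- function(x, rare_count, rare_label = "rare") {
  x <- as.character(x)
  counts <- table(x)   # NA is not counted by default
  rare_levels <- names(counts)[counts <= rare_count]
  x[x %in% rare_levels] <- rare_label
  x
}

x <- c("a", "a", "a", "b", "b", "c", "d")
pooled <- pool_rare_sketch(x, rare_count = 1)
print(pooled)
```

Pooling trades away detail about individual rare levels in exchange for a single level with enough observations to estimate an effect.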
Return variable evaluations.
variable_values(sf)
sf |
scoreFrame from vtreat treatments |
per-original variable evaluations
New treated variable names from a treatmentplan$treatment item.
vnames(x)
x |
vtreatment item |
designTreatmentsC
designTreatmentsN
designTreatmentsZ
Original variable name from a treatmentplan$treatment item.
vorig(x)
x |
vtreatment item. |
designTreatmentsC
designTreatmentsN
designTreatmentsZ