--- title: "vtreat grouping example" author: "Nina Zumel, Nate Sutton" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{vtreat grouping example} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(fig.width = 7) ``` ```{r echo=FALSE, message=FALSE, warning=FALSE} library(vtreat) set.seed(23255) have_rqdatatable = requireNamespace("rqdatatable", quietly=TRUE) if(have_rqdatatable) { library(rqdatatable) } ``` ```{r echo=FALSE, message=FALSE, warning=FALSE} # # takes the frame (d) and the outcome column (d$conc) # from the global environment # showGroupingBehavior = function(groupcol, title) { print(title) # display means of each group print("Group means:") means = tapply(d$conc, d[[groupcol]], mean) print(means) print(paste("Standard deviation of group means:", sd(means))) } ``` This vignette shows an example use of _y_-stratified sampling with a grouping restriction in `vtreat`. For this example, we will use the `Theosph` dataset: data from an experiment on the pharmacokinetics of theophylline. We will demonstrate the desired effects of _y_-stratification while also respecting a grouping constraint. ## The Data First, let's look at the data. ```{r data} # panel data for concentration in multiple subjects d <- datasets::Theoph head(d) summary(d) ``` We have twelve subjects, who each received a dose of the anti-asthma drug theophylline. The theophylline concentration in the patients' blood was then measured at eleven points during the next 25 hours. Most of the patients got about the same dose, although the dose information reported in the dataset is normalized by weight. ## Partitioning the Data for Modeling Suppose we wanted to fit a model to analyze how a patient's weight affects how theophylline is metabolized, and validate that model with three-fold cross-validation. It would be important that all readings from a given patient stay in the same fold. We might also want the population in each fold to have similar distributions of theophylline concentrations curves. Recall that the goal of _y_-stratification is to insure that all samples from the data have as close to identical _y_ distributions as possible. This becomes more difficult when we also have to obey a grouping constraint. Let's look at three ways of splitting the data into folds. First, we will split the data arbitrarily into three groups, using the modulo of the Subject id to do the splitting. ```{r} # a somewhat arbitrary split of patients subnum = as.numeric(as.character(d$Subject)) d$modSplit = as.factor(subnum %% 3) ``` We can verify that this split preserves groups, by looking at the table of subject observations in each fold. Each subject should only appear in a single fold. ```{r} print(table(Subject=d$Subject, groupid=d$modSplit)) ``` Now let's try the standard _y_ stratification in `vtreat`. ```{r} # stratify by outcome only # forces concentration to be equivalent pStrat <- kWayStratifiedY(nrow(d),3,d,d$conc) attr(pStrat, "splitmethod") d$stratSplit <- vtreat::getSplitPlanAppLabels(nrow(d),pStrat) print(table(Subject=d$Subject, groupid=d$stratSplit)) ``` We can see this partition didn't preserve the `Subject` grouping. Finally, we can try `vtreat`'s group-preserving split, which also tries to _y_-stratify as much as possible (by stratifying on the mean *y* observation from each group). ```{r} # stratify by patient and outcome # allows concentration to vary amoung individual patients splitter <- makekWayCrossValidationGroupedByColumn('Subject') split <- splitter(nrow(d),3,d,d$conc) attr(split, "splitmethod") d$subjectSplit <- vtreat::getSplitPlanAppLabels(nrow(d),split) print(table(Subject=d$Subject, groupid=d$subjectSplit)) ``` This is again a subject-preserving partition. We can compare the mean theophylline concentration and the average pharmacokinetic profile for each fold, for both of the subject-preserving partitions. We see that the stratification reduces some of the variation between folds. ### Arbitrary Partition ```{r echo=FALSE} showGroupingBehavior("modSplit", "Arbitrary grouping") ``` ### Group-preserving, _y_-stratified Partition ```{r echo=FALSE} showGroupingBehavior("subjectSplit", "Group by patient, stratify on y") ```