Package 'WVPlots' reference manual

Title:	Common Plots for Analysis
Description:	Select data analysis plots, under a standardized calling interface implemented on top of 'ggplot2' and 'plotly'. Plots of interest include: 'ROC', gain curve, scatter plot with marginal distributions, conditioned scatter plot with marginal densities, box and stem with matching theoretical distribution, and density with matching theoretical distribution.
Authors:	John Mount [aut, cre], Nina Zumel [aut], Win-Vector LLC [cph]
Maintainer:	John Mount <[email protected]>
License:	GPL-2 | GPL-3
Version:	1.3.8
Built:	2025-03-03 05:04:23 UTC
Source:	https://github.com/winvector/wvplots

WVPlots: Common Plots for Analysis

Description

Select data analysis plots, under a standardized calling interface implemented on top of ggplot2 and plotly. Plots of interest include: ROC, gain curve, scatter plot with marginal distributions, conditioned scatter plot with marginal densities, box and stem with matching theoretical distribution, and density with matching theoretical distribution.

Details

For more information:

vignette(package='WVPlots')
RShowDoc('WVPlots_examples',package='WVPlots')
Website: https://github.com/WinVector/WVPlots

Author(s)

Maintainer: John Mount [email protected]

Authors:

Nina Zumel [email protected]

Other contributors:

Win-Vector LLC [copyright holder]

Plot a scatter plot of a binary variable with smoothing curve.

Description

Plot the scatter plot of a binary variable with a smoothing curve.

Usage

BinaryYScatterPlot(
  frame,
  xvar,
  yvar,
  title,
  ...,
  se = FALSE,
  use_glm = TRUE,
  point_color = "black",
  smooth_color = "blue"
)

Arguments

`frame`	data frame to get values from
`xvar`	name of the independent column in frame
`yvar`	name of the dependent (output or result to be modeled) column in frame
`title`	title to place on plot
`...`	no unnamed argument, added to force named binding of later arguments.
`se`	if TRUE, add error bars (defaults to FALSE). Ignored if use_glm is TRUE
`use_glm`	if TRUE, "smooths" with a one-variable logistic regression (defaults to TRUE)
`point_color`	color for points
`smooth_color`	color for smoothing line

Details

The points are jittered for legibility. By default, a logistic regression fit is used, so that the smoothing curve represents the probability of y == 1 (as fit by the logistic regression). If use_glm is set to FALSE, a standard smoothing curve (either loess or a spline fit) is used.
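
The default smoothing described above can be approximated outside the package with base R. This is a minimal sketch, assuming the smoother is an ordinary one-variable logistic regression; it is not the package's internal code, and the names `model` and `grid` are illustrative only.

```r
set.seed(34903490)
x <- rnorm(50)
y <- 0.5 * x^2 + 2 * x + rnorm(length(x))
frm <- data.frame(x = x, posY = y > 0)

# one-variable logistic regression of the binary outcome on x
model <- glm(posY ~ x, data = frm, family = binomial(link = "logit"))

# evaluate the fitted probability that posY is TRUE on a grid of x values
grid <- data.frame(x = seq(min(frm$x), max(frm$x), length.out = 100))
grid$prob <- predict(model, newdata = grid, type = "response")
head(grid)
```

Because the fit has a single predictor, the resulting curve is monotone in x, which is one reason to switch to use_glm = FALSE when the relationship is not expected to be monotone.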

Examples


if (requireNamespace('data.table', quietly = TRUE)) {
	# don't multi-thread during CRAN checks
		data.table::setDTthreads(1)
}

set.seed(34903490)
x = rnorm(50)
y = 0.5*x^2 + 2*x + rnorm(length(x))
frm = data.frame(x=x,y=y,yC=y>=as.numeric(quantile(y,probs=0.8)))
frm$absY <- abs(frm$y)
frm$posY = frm$y > 0
frm$costX = 1
WVPlots::BinaryYScatterPlot(frm, "x", "posY",
   title="Example 'Probability of Y' Plot")

Plot a Cleveland dot plot.

Description

Plot counts of a categorical variable.

Usage

ClevelandDotPlot(
  frm,
  xvar,
  title,
  ...,
  sort = -1,
  limit_n = NULL,
  stem = TRUE,
  color = "black"
)

Arguments

`frm`	data frame to get values from
`xvar`	name of the independent (input or model) column in frame
`title`	title to place on plot
`...`	no unnamed argument, added to force named binding of later arguments.
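`sort`	sort order for the factor levels: negative sorts by decreasing count, positive by increasing count, 0 keeps the natural order (defaults to -1)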
`limit_n`	if not NULL number of items to plot
`stem`	if TRUE add stems/whiskers to plot
`color`	color for points and stems

Details

Assumes that xvar is a factor or can be coerced to one (character or integral).

sort < 0 sorts the factor levels in decreasing order (most frequent level first)
sort > 0 sorts the factor levels in increasing order (good when used in conjunction with coord_flip())
sort = 0 leaves the factor levels in "natural order" – usually alphabetical
stem = FALSE will plot only the dots, without the stem to the y=0 line.
limit_n = NULL plots all the levels, N an integer limits to the top N most populous levels
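
The level orderings listed above can be illustrated with base R counts. This is a sketch of the ordering rules only, on a made-up vector `letter`; it does not call the package and makes no claim about how ClevelandDotPlot computes its counts internally.

```r
letter <- c("b", "a", "c", "a", "b", "a", "d", "d", "d", "d")
counts <- table(letter)   # a = 3, b = 2, c = 1, d = 4

natural    <- names(counts)                           # like sort = 0
decreasing <- names(sort(counts, decreasing = TRUE))  # like sort < 0
increasing <- names(sort(counts, decreasing = FALSE)) # like sort > 0
top2       <- head(decreasing, 2)                     # like limit_n = 2

decreasing
increasing
top2
```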

Examples


if (requireNamespace('data.table', quietly = TRUE)) {
	# don't multi-thread during CRAN checks
		data.table::setDTthreads(1)
}

set.seed(34903490)
# discrete variable: letters of the alphabet
# frequencies of letters in English
# source: http://en.algoritmy.net/article/40379/Letter-frequency-English
letterFreqs = c(8.167, 1.492, 2.782, 4.253, 12.702, 2.228,
                2.015, 6.094, 6.966, 0.153, 0.772, 4.025, 2.406, 6.749, 7.507, 1.929,
                0.095, 5.987, 6.327, 9.056, 2.758, 0.978, 2.360, 0.150, 1.974, 0.074)
letterFreqs = letterFreqs/100
letterFrame = data.frame(letter = letters, freq=letterFreqs)
# now let's generate letters according to their letter frequencies
N = 1000
randomDraws = data.frame(draw=1:N,
  letter=sample(letterFrame$letter, size=N,
  replace=TRUE, prob=letterFrame$freq))
WVPlots::ClevelandDotPlot(randomDraws, "letter",
  title = "Example Cleveland-style dot plot")

# # Note the use of sort = 0. Also note that the graph omits counts
# # with no occurrences (5, and 7)
# WVPlots::ClevelandDotPlot(mtcars, "carb", sort = 0, "Example of counting integer values")

# # For counting integer values while including counts with no occurrences,
# # use Discrete Distribution.
# WVPlots::DiscreteDistribution(mtcars, "carb", "Better way to count integer values")

Plot a scatter plot with smoothing line.

Description

Plot a scatter plot with a smoothing line; the smoothing window is aligned either left, center or right.

Usage

ConditionalSmoothedScatterPlot(
  frame,
  xvar,
  yvar,
  groupvar = NULL,
  title = "ConditionalSmoothedScatterPlot",
  ...,
  k = 3,
  align = "center",
  point_color = "black",
  point_alpha = 0.2,
  smooth_color = "black",
  palette = "Dark2"
)

Arguments

`frame`	data frame to get values from
`xvar`	name of the independent column in frame. Assumed to be regularly spaced
`yvar`	name of the dependent (output or result to be modeled) column in frame
`groupvar`	name of the grouping column in frame. Can be NULL for an unconditional plot
`title`	title for plot
`...`	no unnamed argument, added to force named binding of later arguments.
`k`	width of smoothing window. Must be odd for a center-aligned plot. Defaults to 3
`align`	smoothing window alignment: 'center', 'left', or 'right'. Defaults to 'center'
`point_color`	color of points, when groupvar is NULL. Set to NULL to turn off points.
`point_alpha`	alpha/opaqueness of points.
`smooth_color`	color of smoothing line, when groupvar is NULL
`palette`	name of Brewer palette, when groupvar is non-NULL (can be NULL)

Details

xvar is the continuous independent variable and yvar is the dependent binary variable. Smoothing is by a square window of width k.

If palette is NULL, and groupvar is non-NULL, plot colors will be chosen from the default ggplot2 palette. Setting palette to NULL allows the user to choose a non-Brewer palette, for example with scale_fill_manual.
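
To make the square-window smoothing concrete, here is a hedged base R sketch. The helper `square_window_mean` is hypothetical and not part of WVPlots. It assumes that a 'left' window starts at the current point and extends forward, that a 'right' window ends at the current point, and that windows are truncated at the ends of the data; the package may handle alignment and edges differently.

```r
# hypothetical helper: moving average over a square window of width k
square_window_mean <- function(y, k = 3, align = "center") {
  if (align == "center" && k %% 2 == 0) {
    stop("k must be odd for a center-aligned window")
  }
  offsets <- switch(align,
    center = seq(-(k - 1) / 2, (k - 1) / 2),
    left   = seq(0, k - 1),
    right  = seq(-(k - 1), 0),
    stop("align must be 'center', 'left', or 'right'"))
  n <- length(y)
  vapply(seq_len(n), function(i) {
    idx <- i + offsets
    idx <- idx[idx >= 1 & idx <= n]  # assumption: truncate window at the edges
    mean(y[idx])
  }, numeric(1))
}

y <- c(1, 2, 3, 4, 5, 10, 15, 18, 20, 25)
square_window_mean(y, k = 3, align = "left")
square_window_mean(y, k = 3, align = "center")
square_window_mean(y, k = 3, align = "right")
```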

Examples


if (requireNamespace('data.table', quietly = TRUE)) {
	# don't multi-thread during CRAN checks
		data.table::setDTthreads(1)
}

y = c(1,2,3,4,5,10,15,18,20,25)
x = seq_len(length(y))
df = data.frame(x=x, y=y, group=x>5)
WVPlots::ConditionalSmoothedScatterPlot(df, "x", "y", NULL,
   title="left smooth, one group", align="left")
# WVPlots::ConditionalSmoothedScatterPlot(df, "x", "y", "group",
#    title="left smooth, two groups", align="left")

Plot distribution of a single discrete numerical variable.

Description

Similar to calling ClevelandDotPlot with sort = 0 on a numerical x variable that takes on a discrete set of values.

Usage

DiscreteDistribution(frm, xvar, title, ..., stem = TRUE, color = "black")

Arguments

`frm`	data frame to get values from
`xvar`	numeric: name of the variable whose distribution is to be plotted
`title`	title to place on plot
`...`	no unnamed argument, added to force named binding of later arguments.
`stem`	if TRUE add whisker/stems to plot
`color`	color of points and stems

Examples


if (requireNamespace('data.table', quietly = TRUE)) {
	# don't multi-thread during CRAN checks
		data.table::setDTthreads(1)
}

frmx = data.frame(x = rbinom(1000, 20, 0.5))
WVPlots::DiscreteDistribution(frmx, "x","Discrete example")

Plot two density plots conditioned on an outcome variable.

Description

Plot two density plots conditioned on a binary outcome variable.

Usage

DoubleDensityPlot(
  frame,
  xvar,
  truthVar,
  title,
  ...,
  truth_target = NULL,
  palette = "Dark2"
)

Arguments

`frame`	data frame to get values from
`xvar`	name of the independent (input or model) column in frame
`truthVar`	name of the dependent (output or result to be modeled) column in frame
`title`	title to place on plot
`...`	no unnamed argument, added to force named binding of later arguments.
`truth_target`	if not NULL compare to this scalar value.
`palette`	name of Brewer palette (can be NULL)

Details

The use case for this visualization is to plot the distribution of a predictive model score (usually the predicted probability of a desired outcome) conditioned on the actual outcome. However, you can use it to compare the distribution of any numerical quantity conditioned on a binary feature. See the examples.

The plot will degrade gracefully in degenerate conditions, for example when only one category is present.

If palette is NULL, plot colors will be chosen from the default ggplot2 palette. Setting palette to NULL allows the user to choose a non-Brewer palette, for example with scale_fill_manual.
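
The idea of "a score distribution conditioned on the outcome" can be sketched without the package by estimating one density per outcome value. This uses base R's `density()` with its default bandwidth, which is an assumption; the plot's own density estimate may use different settings.

```r
set.seed(34903490)
x <- rnorm(50)
y <- 0.5 * x^2 + 2 * x + rnorm(length(x))
frm <- data.frame(score = x,
                  truth = y >= as.numeric(quantile(y, probs = 0.8)))

# one kernel density estimate of the score per outcome value
dens_by_outcome <- lapply(split(frm$score, frm$truth), density)

# where each conditional density peaks
sapply(dens_by_outcome, function(d) d$x[which.max(d$y)])
```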

Examples


if (requireNamespace('data.table', quietly = TRUE)) {
	# don't multi-thread during CRAN checks
		data.table::setDTthreads(1)
}

mpg = ggplot2::mpg
mpg$trans = gsub("\\(.*$", '', mpg$trans)
WVPlots::DoubleDensityPlot(mpg, "cty", "trans", "City driving mpg by transmission type")

if (FALSE) {
# redo the last plot with a custom palette
cmap = c("auto" = "#b2df8a", "manual" = "#1f78b4")
plt = WVPlots::DoubleDensityPlot(mpg, "cty", "trans",
              palette = NULL,
              title="City driving mpg by transmission type")
plt + ggplot2::scale_color_manual(values=cmap) +
      ggplot2::scale_fill_manual(values=cmap)

set.seed(34903490)
x = rnorm(50)
y = 0.5*x^2 + 2*x + rnorm(length(x))
frm = data.frame(score=x,
   truth=(y>=as.numeric(quantile(y,probs=0.8))),
   stuck=TRUE,
   rare=FALSE)
frm[1,'rare'] = TRUE
WVPlots::DoubleDensityPlot(frm, "score", "truth", title="Example double density plot")
}

Plot two histograms conditioned on an outcome variable.

Description

Plot two histograms conditioned on a binary outcome variable.

Usage

DoubleHistogramPlot(
  frame,
  xvar,
  truthVar,
  title,
  ...,
  palette = "Dark2",
  breaks = 40
)

Arguments

`frame`	data frame to get values from
`xvar`	name of the independent (input or model) column in frame
`truthVar`	name of the dependent (output or result to be modeled) column in frame
`title`	title to place on plot
`...`	no unnamed argument, added to force named binding of later arguments.
`palette`	name of Brewer palette (can be NULL)
`breaks`	breaks to pass to histogram

Details

To distinguish the two conditions, one histogram is plotted upside-down.

The use case for this visualization is to plot a predictive model score (usually the predicted probability of a desired outcome) conditioned on the actual outcome. However, you can use it to compare any numerical quantity conditioned on a binary feature.

If palette is NULL, plot colors will be chosen from the default ggplot2 palette. Setting palette to NULL allows the user to choose a non-Brewer palette, for example with scale_fill_manual.
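
The mirrored layout can be sketched in base R by binning both conditions on shared breaks and negating one set of counts. Which condition is drawn upside-down, and the reading of breaks = 40 as 40 equal-width bins, are assumptions made for illustration.

```r
set.seed(34903490)
x <- rnorm(50)
y <- 0.5 * x^2 + 2 * x + rnorm(length(x))
yC <- y >= as.numeric(quantile(y, probs = 0.8))

# shared bin edges so the two histograms line up
breaks <- seq(min(x), max(x), length.out = 41)
h_true  <- hist(x[yC],  breaks = breaks, plot = FALSE)
h_false <- hist(x[!yC], breaks = breaks, plot = FALSE)

mirrored <- data.frame(
  mid = h_true$mids,
  count_true = h_true$counts,             # drawn above the axis
  count_false_mirrored = -h_false$counts  # drawn below the axis
)
head(mirrored)
```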

Examples


if (requireNamespace('data.table', quietly = TRUE)) {
	# don't multi-thread during CRAN checks
		data.table::setDTthreads(1)
}

set.seed(34903490)
x = rnorm(50)
y = 0.5*x^2 + 2*x + rnorm(length(x))
frm = data.frame(x=x,y=y,yC=y>=as.numeric(quantile(y,probs=0.8)))
frm$absY <- abs(frm$y)
frm$posY = frm$y > 0
frm$costX = 1
WVPlots::DoubleHistogramPlot(frm, "x", "yC", title="Example double histogram plot")

if (FALSE) {
# redo the plot with a custom palette
plt = WVPlots::DoubleHistogramPlot(frm, "x", "yC", palette=NULL,
                              title="Example double histogram plot")
cmap = c("TRUE" = "#b2df8a", "FALSE" = "#1f78b4")
plt + ggplot2::scale_color_manual(values=cmap) +
      ggplot2::scale_fill_manual(values=cmap)
}

Plot the cumulative gain curve of a sort-order.

Description

Plot the cumulative gain curve of a sort-order.

Usage

GainCurvePlot(
  frame,
  xvar,
  truthVar,
  title,
  ...,
  estimate_sig = FALSE,
  large_count = 1000,
  truth_target = NULL,
  model_color = "darkblue",
  wizard_color = "darkgreen",
  shadow_color = "darkgray"
)

Arguments

`frame`	data frame to get values from
`xvar`	name of the independent (input or model score) column in frame
`truthVar`	name of the dependent (output or result to be modeled) column in frame
`title`	title to place on plot
`...`	no unnamed argument, added to force named binding of later arguments.
`estimate_sig`	logical, if TRUE compute significance.
`large_count`	numeric, upper bound target for number of plotting points.
`truth_target`	if not NULL compare to this scalar value.
`model_color`	color for the model curve
`wizard_color`	color for the "wizard" (best possible) curve
`shadow_color`	color for the shaded area under the curve

Details

The use case for this visualization is to compare a predictive model score to an actual outcome (either binary (0/1) or continuous). In this case the gain curve plot measures how well the model score sorts the data compared to the true outcome value.

The x-axis represents the fraction of items seen when sorted by score, and the y-axis represents the cumulative summed true outcome represented by the items seen so far. See, for example, https://www.ibm.com/docs/SSLVMB_24.0.0/spss/tutorials/mlp_bankloan_outputtype_02.html.

For comparison, GainCurvePlot also plots the "wizard curve": the gain curve when the data is sorted according to its true outcome.

To improve presentation quality, the plot is limited to approximately large_count points (default: 1000). Larger data sets are randomly down-sampled before plotting.
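
The gain curve and the wizard curve described above can be computed by hand. This is a minimal sketch with a hypothetical helper `gain_curve`; it assumes both axes are reported as fractions of their totals and omits the down-sampling step, so it may differ in detail from what GainCurvePlot draws.

```r
set.seed(34903490)
y <- abs(rnorm(20)) + 0.1
x <- abs(y + 0.5 * rnorm(20))
frm <- data.frame(model = x, value = y)

# hypothetical helper: cumulative share of value when sorted by a key
gain_curve <- function(sort_key, value) {
  ord <- order(sort_key, decreasing = TRUE)
  data.frame(
    frac_items = seq_along(ord) / length(ord),
    frac_value = cumsum(value[ord]) / sum(value)
  )
}

model_curve  <- gain_curve(frm$model, frm$value)  # sorted by model score
wizard_curve <- gain_curve(frm$value, frm$value)  # sorted by true outcome
head(cbind(model_curve, wizard_value = wizard_curve$frac_value))
```

Sorting by the true outcome maximizes every cumulative sum, so the wizard curve is never below the model curve.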

Examples


if (requireNamespace('data.table', quietly = TRUE)) {
	# don't multi-thread during CRAN checks
		data.table::setDTthreads(1)
}

set.seed(34903490)
y = abs(rnorm(20)) + 0.1
x = abs(y + 0.5*rnorm(20))
frm = data.frame(model=x, value=y)
WVPlots::GainCurvePlot(frm, "model", "value",
   title="Example Continuous Gain Curve")

Plot the cumulative gain curve of a sort-order with costs.

Description

Plot the cumulative gain curve of a sort-order with costs.

Usage

GainCurvePlotC(
  frame,
  xvar,
  costVar,
  truthVar,
  title,
  ...,
  estimate_sig = FALSE,
  large_count = 1000,
  model_color = "darkblue",
  wizard_color = "darkgreen",
  shadow_color = "darkgray"
)

Arguments

`frame`	data frame to get values from
`xvar`	name of the independent (input or model score) column in frame
`costVar`	cost of each item (drives x-axis sum)
`truthVar`	name of the dependent (output or result to be modeled) column in frame
`title`	title to place on plot
`...`	no unnamed argument, added to force named binding of later arguments.
`estimate_sig`	logical, if TRUE compute significance
`large_count`	numeric, upper bound target for number of plotting points
`model_color`	color for the model curve
`wizard_color`	color for the "wizard" (best possible) curve
`shadow_color`	color for the shaded area under the curve

Details

GainCurvePlotC plots a cumulative gain curve for the case where each item has a cost in addition to an outcome value.

The x-axis represents the fraction of total cost experienced when items are sorted by score, and the y-axis represents the cumulative summed true outcome represented by the items seen so far.

For comparison, GainCurvePlotC also plots the "wizard curve": the gain curve when the data is sorted according to its true outcome/cost (the optimal sort order).

To improve presentation quality, the plot is limited to approximately large_count points (default: 1000). Larger data sets are randomly down-sampled before plotting.
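
A cost-aware variant of the calculation can be sketched as follows. The helper `gain_curve_cost` is hypothetical; it assumes the x-axis is the cumulative share of total cost and that the wizard ordering sorts by outcome divided by cost, as the description above suggests.

```r
set.seed(34903490)
y <- abs(rnorm(20)) + 0.1
x <- abs(y + 0.5 * rnorm(20))
frm <- data.frame(model = x, value = y)
frm$costs <- 1
frm$costs[1] <- 5

# hypothetical helper: cumulative value against cumulative cost
gain_curve_cost <- function(sort_key, cost, value) {
  ord <- order(sort_key, decreasing = TRUE)
  data.frame(
    frac_cost  = cumsum(cost[ord]) / sum(cost),
    frac_value = cumsum(value[ord]) / sum(value)
  )
}

model_curve  <- gain_curve_cost(frm$model, frm$costs, frm$value)
wizard_curve <- gain_curve_cost(frm$value / frm$costs, frm$costs, frm$value)
tail(model_curve, 3)
```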

Examples


if (requireNamespace('data.table', quietly = TRUE)) {
	# don't multi-thread during CRAN checks
		data.table::setDTthreads(1)
}

set.seed(34903490)
y = abs(rnorm(20)) + 0.1
x = abs(y + 0.5*rnorm(20))
frm = data.frame(model=x, value=y)
frm$costs=1
frm$costs[1]=5
WVPlots::GainCurvePlotC(frm, "model", "costs", "value",
   title="Example Continuous Gain CurveC")

Plot the cumulative gain curves of a sort-order.

Description

Plot the cumulative gain curves of a sort-order.

Usage

GainCurvePlotList(
  frame,
  xvars,
  truthVar,
  title,
  ...,
  truth_target = NULL,
  palette = "Dark2"
)

GainCurveListPlot(
  frame,
  xvars,
  truthVar,
  title,
  ...,
  truth_target = NULL,
  palette = "Dark2"
)

Arguments

`frame`	data frame to get values from
`xvars`	names of the independent (input or model score) columns in frame
`truthVar`	name of the dependent (output or result to be modeled) column in frame
`title`	title to place on plot
`...`	no unnamed argument, added to force named binding of later arguments.
`truth_target`	if not NULL compare to this scalar value.
`palette`	color palette for the model curves

Details

The x-axis represents the fraction of items seen when sorted by score, and the y-axis represents the gain seen so far (cumulative value of model over cumulative value of random selection).

Examples


if (requireNamespace('data.table', quietly = TRUE)) {
	# don't multi-thread during CRAN checks
		data.table::setDTthreads(1)
}

set.seed(34903490)
y = abs(rnorm(20)) + 0.1
x = abs(y + 0.5*rnorm(20))
frm = data.frame(model=x, value=y)
WVPlots::GainCurvePlotList(frm, c("model", "value"), "value",
   title="Example Continuous gain Curves")

Plot the cumulative gain curve of a sort-order with extra notation

Description

Plot the cumulative gain curve of a sort-order with extra notation.

Usage

GainCurvePlotWithNotation(
  frame,
  xvar,
  truthVar,
  title,
  gainx,
  labelfun,
  ...,
  sort_by_model = TRUE,
  estimate_sig = FALSE,
  large_count = 1000,
  model_color = "darkblue",
  wizard_color = "darkgreen",
  shadow_color = "darkgray",
  crosshair_color = "red",
  text_color = "black"
)

Arguments

`frame`	data frame to get values from
`xvar`	name of the independent (input or model score) column in frame
`truthVar`	name of the dependent (output or result to be modeled) column in frame
`title`	title to place on plot
`gainx`	the point on the x axis corresponding to the desired label
`labelfun`	a function to return a label for the marked point
`...`	no unnamed argument, added to force named binding of later arguments.
`sort_by_model`	logical, if TRUE use the model to calculate gainy, else use wizard.
`estimate_sig`	logical, if TRUE compute significance
`large_count`	numeric, upper bound target for number of plotting points
`model_color`	color for the model curve
`wizard_color`	color for the "wizard" (best possible) curve
`shadow_color`	color for the shaded area under the curve
`crosshair_color`	color for the annotation location lines
`text_color`	color for the annotation text

Details

This is the standard gain curve plot (see GainCurvePlot) with a label attached to a particular value of x. The label is created by a function labelfun, which takes as inputs the x and y coordinates of a label and returns a string (the label).

By default, the model curve is used to calculate the y value of the labeled point; to use the wizard curve instead, set sort_by_model = FALSE.
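
To show what labelfun receives, the sketch below computes a y value for a chosen gainx by hand and passes both to a label function. The helper `gain_curve`, the use of linear interpolation via `approx()`, and the rounding inside `labelfun` are assumptions for illustration; the package may locate the point differently.

```r
set.seed(34903490)
y <- abs(rnorm(20)) + 0.1
x <- abs(y + 0.5 * rnorm(20))
frm <- data.frame(model = x, value = y)

# hypothetical helper: cumulative share of value when sorted by a key
gain_curve <- function(sort_key, value) {
  ord <- order(sort_key, decreasing = TRUE)
  data.frame(
    frac_items = seq_along(ord) / length(ord),
    frac_value = cumsum(value[ord]) / sum(value)
  )
}

labelfun <- function(gx, gy) {
  paste("The predicted top ", gx * 100, "% most valuable points by the model\n",
        "are ", round(gy * 100, 1), "% of total actual value", sep = "")
}

gainx <- 0.25
model_curve <- gain_curve(frm$model, frm$value)
# interpolate the curve (anchored at the origin) at the requested x position
gainy <- approx(x = c(0, model_curve$frac_items),
                y = c(0, model_curve$frac_value),
                xout = gainx)$y
label <- labelfun(gainx, gainy)
cat(label, "\n")
```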

Examples


if (requireNamespace('data.table', quietly = TRUE)) {
	# don't multi-thread during CRAN checks
		data.table::setDTthreads(1)
}

set.seed(34903490)
y = abs(rnorm(20)) + 0.1
x = abs(y + 0.5*rnorm(20))
frm = data.frame(model=x, value=y)
gainx = 0.25  # get the predicted top 25% most valuable points as sorted by the model
# make a function to calculate the label for the annotated point
labelfun = function(gx, gy) {
  pctx = gx*100
  pcty = gy*100

  paste("The predicted top ", pctx, "% most valuable points by the model\n",
        "are ", pcty, "% of total actual value", sep='')
}
WVPlots::GainCurvePlotWithNotation(frm, "model", "value",
   title="Example Gain Curve with annotation",
   gainx=gainx,labelfun=labelfun)

# now get the top 25% actual most valuable points

labelfun = function(gx, gy) {
  pctx = gx*100
  pcty = gy*100

  paste("The actual top ", pctx, "% most valuable points\n",
        "are ", pcty, "% of total actual value", sep='')
}

WVPlots::GainCurvePlotWithNotation(frm, "model", "value",
   title="Example Gain Curve with annotation",
   gainx=gainx,labelfun=labelfun, sort_by_model=FALSE)

Build a hex bin plot

Description

Build a hex bin plot with rational color coding.

Usage

HexBinPlot(
  d,
  xvar,
  yvar,
  title,
  ...,
  lightcolor = "#deebf7",
  darkcolor = "#000000",
  bins = 30,
  binwidth = NULL,
  na.rm = FALSE
)

Arguments

`d`	data frame
`xvar`	name of x variable column
`yvar`	name of y variable column
`title`	plot title
`...`	not used, forces later arguments to bind by name
`lightcolor`	light color for least dense areas
`darkcolor`	dark color for most dense areas
`bins`	passed to geom_hex
`binwidth`	passed to geom_hex
`na.rm`	passed to geom_hex

Details

Builds a standard ggplot2 hexbin plot, with a color scale such that dense areas are colored darker (the default ggplot2 fill scales will color dense areas lighter).

The user can choose an alternate color scale with endpoints lightcolor and darkcolor; it is up to the user to make sure that lightcolor is lighter than darkcolor.
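
A quick way to check that a chosen pair of endpoints really runs from light to dark is to interpolate between them and compare brightness. This sketch uses `grDevices::colorRampPalette` with the default endpoint colors; the simple sum of RGB channels is an illustrative brightness measure, not something the package computes.

```r
lightcolor <- "#deebf7"
darkcolor  <- "#000000"

# interpolate five shades from the light endpoint to the dark endpoint
ramp <- grDevices::colorRampPalette(c(lightcolor, darkcolor))
shades <- ramp(5)

# crude brightness: sum of the red, green and blue channels (0 to 765)
brightness <- colSums(grDevices::col2rgb(shades))
data.frame(shade = shades, brightness = brightness)
```

If brightness does not decrease from the first shade to the last, the two endpoints are probably swapped.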

Requires the hexbin package.

Value

a ggplot2 hexbin plot

Examples


if(requireNamespace("hexbin", quietly = TRUE)) {
   if (requireNamespace('data.table', quietly = TRUE)) {
	   # don't multi-thread during CRAN checks
		   data.table::setDTthreads(1)
   }
   set.seed(634267)
   dframe = data.frame(x = rnorm(1000), y = rnorm(1000))
   print(HexBinPlot(dframe, "x", "y", "Example hexbin"))

   diamonds = ggplot2::diamonds
   print(HexBinPlot(diamonds, "carat", "price", "Diamonds example"))

   # change the colorscale
    print(HexBinPlot(diamonds, "carat", "price", "Diamonds example",
                     lightcolor="#fed98e",
                     darkcolor="#993404"))
}

Plot the cumulative lift curve of a sort-order.

Description

Plot the cumulative lift curve of a sort-order.

Usage

LiftCurvePlot(
  frame,
  xvar,
  truthVar,
  title,
  ...,
  large_count = 1000,
  include_wizard = TRUE,
  truth_target = NULL,
  model_color = "darkblue",
  wizard_color = "darkgreen"
)

Arguments

`frame`	data frame to get values from
`xvar`	name of the independent (input or model score) column in frame
`truthVar`	name of the dependent (output or result to be modeled) column in frame
`title`	title to place on plot
`...`	no unnamed argument, added to force named binding of later arguments.
`large_count`	numeric, upper bound target for number of plotting points
`include_wizard`	logical, if TRUE plot the ideal or wizard plot.
`truth_target`	if not NULL compare to this scalar value.
`model_color`	color for the model curve
`wizard_color`	color for the "wizard" (best possible) curve

Details

The use case for this visualization is to compare a predictive model score to an actual outcome (either binary (0/1) or continuous). In this case the lift curve plot measures how well the model score sorts the data compared to the true outcome value.

The x-axis represents the fraction of items seen when sorted by score, and the y-axis represents the lift seen so far (cumulative value of model over cumulative value of random selection).

For comparison, LiftCurvePlot also plots the "wizard curve": the lift curve when the data is sorted according to its true outcome.

To improve presentation quality, the plot is limited to approximately large_count points (default: 1000). For larger data sets, the data is appropriately randomly sampled down before plotting.
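
As a rough illustration of the quantity on the y-axis, the following base-R sketch computes a cumulative lift curve by hand. It is not the package's implementation: the helper name `cumulative_lift` and the tie handling (plain `order()`) are assumptions made for this example.

```r
# Hypothetical helper (not part of WVPlots): cumulative lift of a score.
cumulative_lift <- function(score, outcome) {
  ord <- order(score, decreasing = TRUE)             # best-scored items first
  n <- length(outcome)
  frac_items <- seq_len(n) / n                       # x-axis: fraction of items seen
  frac_value <- cumsum(outcome[ord]) / sum(outcome)  # share of total outcome captured
  data.frame(fraction = frac_items,
             lift = frac_value / frac_items)         # random selection captures frac_items
}

set.seed(34903490)
y <- abs(rnorm(20)) + 0.1
x <- abs(y + 0.5 * rnorm(20))

model_lift  <- cumulative_lift(x, y)   # sorted by model score
wizard_lift <- cumulative_lift(y, y)   # sorted by the true outcome ("wizard")
print(head(model_lift))
```

Under this definition the lift always ends at 1 once every item has been seen, and the wizard ordering is an upper bound on the lift of any score.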

Examples


if (requireNamespace('data.table', quietly = TRUE)) {
	# don't multi-thread during CRAN checks
		data.table::setDTthreads(1)
}

set.seed(34903490)
y = abs(rnorm(20)) + 0.1
x = abs(y + 0.5*rnorm(20))
frm = data.frame(model=x, value=y)
WVPlots::LiftCurvePlot(frm, "model", "value",
   title="Example Continuous Lift Curve")

Plot the cumulative lift curves of a sort-order.

Description

Plot the cumulative lift curves of a sort-order.

Usage

LiftCurvePlotList(
  frame,
  xvars,
  truthVar,
  title,
  ...,
  truth_target = NULL,
  palette = "Dark2"
)

LiftCurveListPlot(
  frame,
  xvars,
  truthVar,
  title,
  ...,
  truth_target = NULL,
  palette = "Dark2"
)

Arguments

`frame`	data frame to get values from
`xvars`	names of the independent (input or model score) columns in frame
`truthVar`	name of the dependent (output or result to be modeled) column in frame
`title`	title to place on plot
`...`	no unnamed argument, added to force named binding of later arguments.
`truth_target`	if not NULL compare to this scalar value.
`palette`	color palette for the model curves

Details

The x-axis represents the fraction of items seen when sorted by score, and the y-axis represents the lift seen so far (cumulative value of model over cumulative value of random selection).

Examples


if (requireNamespace('data.table', quietly = TRUE)) {
	# don't multi-thread during CRAN checks
		data.table::setDTthreads(1)
}

set.seed(34903490)
y = abs(rnorm(20)) + 0.1
x = abs(y + 0.5*rnorm(20))
frm = data.frame(model=x, value=y)
WVPlots::LiftCurvePlotList(frm, c("model", "value"), "value",
   title="Example Continuous Lift Curves")

Log-log plot

Description

Plot a trend on log-log paper.

Usage

LogLogPlot(
  frame,
  xvar,
  yvar,
  title,
  ...,
  use_coord_trans = FALSE,
  point_color = "black",
  linear_color = "#018571",
  quadratic_color = "#a6611a",
  smoothing_color = "blue"
)

Arguments

`frame`	data frame to get values from
`xvar`	name of the independent (input or model) column in frame
`yvar`	name of the dependent (output or result to be modeled) column in frame
`title`	title to place on plot
`...`	no unnamed argument, added to force named binding of later arguments.
`use_coord_trans`	logical, if TRUE use `coord_trans(x = "log10", y = "log10")` instead of `scale_x_log10() + scale_y_log10()` (useful when there is not enough range to show ticks).
`point_color`	the color of the data points
`linear_color`	the color of the linear growth lines
`quadratic_color`	the color of the quadratic growth lines
`smoothing_color`	the color of the smoothing line through the data

Details

This plot is intended for plotting functions that are observed costs or durations as a function of problem size. In this case we expect the ideal or expected cost function to be non-decreasing. Any negative trends are assumed to arise from the noise model. The graph is specialized to compare non-decreasing linear and non-decreasing quadratic growth.

Some care must be taken in drawing conclusions from log-log plots, as the transform is fairly violent. Please see: "(Mar's Law) Everything is linear if plotted log-log with a fat magic marker" (from Akin's Laws of Spacecraft Design https://spacecraft.ssl.umd.edu/akins_laws.html), and "So You Think You Have a Power Law" http://bactra.org/weblog/491.html.
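
One way to read the linear and quadratic reference lines: on log-log axes a power law y = c * x^k is a straight line with slope k, so linear growth has slope 1 and quadratic growth has slope 2. The base-R sketch below checks this on noiseless, made-up data; the helper `loglog_slope` is hypothetical and does not reproduce the package's fitting procedure.

```r
x <- 1:20
linear_cost    <- 4 * x        # illustrative linear growth
quadratic_cost <- 0.5 * x^2    # illustrative quadratic growth

# Slope of log10(y) against log10(x) estimates the growth exponent.
loglog_slope <- function(x, y) {
  unname(coef(lm(log10(y) ~ log10(x)))[2])
}

slope_linear    <- loglog_slope(x, linear_cost)
slope_quadratic <- loglog_slope(x, quadratic_cost)
print(c(linear = slope_linear, quadratic = slope_quadratic))
```

With noisy data, or a cost that mixes constant, linear, and quadratic terms, the fitted slope falls between these reference values, which is one reason the caution above applies.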

Examples


if (requireNamespace('data.table', quietly = TRUE)) {
	# don't multi-thread during CRAN checks
		data.table::setDTthreads(1)
}

set.seed(5326)
frm = data.frame(x = 1:20)
frm$y <- 5 + frm$x + 0.2 * frm$x * frm$x + 0.1*abs(rnorm(nrow(frm)))
WVPlots::LogLogPlot(frm, "x", "y", title="Example Trend")

Plot the relationship between two metrics.

Description

Plot the relationship between two metrics.

Usage

MetricPairPlot(
  frame,
  xvar,
  truthVar,
  title,
  ...,
  x_metric = "false_positive_rate",
  y_metric = "true_positive_rate",
  truth_target = TRUE,
  points_to_plot = NULL,
  linecolor = "black"
)

Arguments

`frame`	data frame to get values from
`xvar`	name of the independent (input or model) column in frame
`truthVar`	name of the column to be predicted
`title`	title to place on plot
`...`	no unnamed argument, added to force named binding of later arguments.
`x_metric`	metric to be plotted. See Details for the list of allowed metrics
`y_metric`	metric to be plotted. See Details for the list of allowed metrics
`truth_target`	truth value considered to be positive.
`points_to_plot`	how many data points to use for plotting. Defaults to NULL (all data)
`linecolor`	character: name of line color

Details

Plots two classifier metrics against each other, showing achievable combinations of performance metrics. For example, plotting true_positive_rate vs false_positive_rate recreates the ROC plot.

MetricPairPlot can plot a number of metrics. Some of the metrics are redundant, in keeping with the customary terminology of various analysis communities.

sensitivity: fraction of true positives that were predicted to be true (also known as the true positive rate)
specificity: fraction of true negatives to all negatives (or 1 - false_positive_rate)
precision: fraction of predicted positives that are true positives
recall: same as sensitivity or true positive rate
accuracy: fraction of items correctly decided
false_positive_rate: fraction of negatives predicted to be true over all negatives
true_positive_rate: fraction of positives predicted to be true over all positives
false_negative_rate: fraction of positives predicted to be false over all positives
true_negative_rate: fraction of negatives predicted to be false over all negatives

points_to_plot specifies the approximate number of data points used to create the plot, as an absolute count; for example, setting points_to_plot = 200 uses approximately 200 points rather than the entire data set. This can be useful when visualizing very large data sets.
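
To make the metric definitions concrete, here is a small base-R sketch that evaluates them at a single threshold on toy data. The helper `metrics_at` is hypothetical, and the convention that scores greater than or equal to the threshold count as predicted positive is an assumption; the package may handle thresholds and ties differently.

```r
# Toy scores and outcomes (made up for illustration).
score <- c(0.9, 0.8, 0.7, 0.4, 0.3, 0.1)
truth <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)

# Hypothetical helper: classifier metrics at one threshold.
# Assumes items with score >= threshold are predicted positive.
metrics_at <- function(score, truth, threshold) {
  pred <- score >= threshold
  tp <- sum(pred & truth);  fp <- sum(pred & !truth)
  fn <- sum(!pred & truth); tn <- sum(!pred & !truth)
  c(true_positive_rate  = tp / (tp + fn),   # sensitivity, recall
    false_positive_rate = fp / (fp + tn),
    true_negative_rate  = tn / (fp + tn),   # specificity
    false_negative_rate = fn / (tp + fn),
    precision           = tp / (tp + fp),
    accuracy            = (tp + tn) / length(truth))
}

m <- metrics_at(score, truth, threshold = 0.5)
print(round(m, 3))
```

Sweeping the threshold over all observed scores and plotting one metric against another gives the kind of curve described above.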

Examples


if (requireNamespace('data.table', quietly = TRUE)) {
	# don't multi-thread during CRAN checks
		data.table::setDTthreads(1)
}

# data with two different regimes of behavior
d <- rbind(
  data.frame(
    x =  rnorm(1000),
    y = sample(c(TRUE, FALSE), prob = c(0.02, 0.98), size = 1000, replace = TRUE)),
  data.frame(
    x =  rnorm(200) + 5,
    y = sample(c(TRUE, FALSE), size = 200, replace = TRUE))
)

# Sensitivity/Specificity examples
MetricPairPlot(d, 'x', 'y',
   x_metric = 'false_positive_rate',
   y_metric = 'true_positive_rate',
   truth_target = TRUE,
   title = 'ROC equivalent')
if(FALSE) {
ThresholdPlot(d, 'x', 'y',
   title = 'Sensitivity/Specificity',
   metrics = c('sensitivity', 'specificity'),
   truth_target = TRUE)
ROCPlot(d, 'x', 'y',
   truthTarget = TRUE,
   title = 'ROC example')

# Precision/Recall examples
ThresholdPlot(d, 'x', 'y',
   title = 'precision/recall',
   metrics = c('recall', 'precision'),
   truth_target = TRUE)
MetricPairPlot(d, 'x', 'y',
   x_metric = 'recall',
   y_metric = 'precision',
   title = 'recall/precision',
   truth_target = TRUE)
PRPlot(d, 'x', 'y',
   truthTarget = TRUE,
   title = 'p/r plot')
}

Build a pair plot

Description

Creates a matrix of scatterplots, one for each possible pair of variables.

Usage

PairPlot(
  d,
  meas_vars,
  title,
  ...,
  group_var = NULL,
  alpha = 1,
  palette = "Dark2",
  point_color = "darkgray"
)

Arguments

`d`	data frame
`meas_vars`	the variables to be plotted
`title`	plot title
`...`	not used, forces later arguments to bind by name
`group_var`	variable for grouping and color-coding
`alpha`	alpha for points on plot
`palette`	name of a brewer palette (NULL for ggplot2 default coloring)
`point_color`	point color for monochrome plots (no grouping)

Details

If palette is NULL, and group_var is non-NULL, plot colors will be chosen from the default ggplot2 palette. Setting palette to NULL allows the user to choose a non-Brewer palette, for example with scale_color_manual.

Value

a ggplot2 pair plot

Examples


if (requireNamespace('data.table', quietly = TRUE)) {
	# don't multi-thread during CRAN checks
		data.table::setDTthreads(1)
}

# PairPlot(iris, colnames(iris)[1:4], "Example plot", group_var = "Species")

# custom palette
colormap = c('#a6611a', '#dfc27d', '#018571')
PairPlot(iris, colnames(iris)[1:4], "Example plot",
         group_var = "Species", palette=NULL) +
         ggplot2::scale_color_manual(values=colormap)

# # no color-coding
# PairPlot(iris, colnames(iris)[1:4], "Example plot")

Plot the trajectory of a model fit.

Description

Plot a history of model fit performance over a trajectory of times.

Usage

plot_fit_trajectory(
  d,
  column_description,
  title,
  ...,
  epoch_name = "epoch",
  needs_flip = c(),
  pick_metric = NULL,
  discount_rate = NULL,
  draw_ribbon = FALSE,
  draw_segments = FALSE,
  val_color = "#d95f02",
  train_color = "#1b9e77",
  pick_color = "#e6ab02"
)

Arguments

`d`	data frame to get values from.
`column_description`	description of column measures (data.frame with columns measure, validation, and training).
`title`	character title for plot.
`...`	force later arguments to be bound by name
`epoch_name`	name for epoch or trajectory column.
`needs_flip`	character array of measures that need to be flipped.
`pick_metric`	character metric to maximize.
`discount_rate`	numeric what fraction of over-fit to subtract from validation performance.
`draw_ribbon`	present the difference in training and validation performance as a ribbon rather than two curves? (default FALSE)
`draw_segments`	logical if TRUE draw over-fit/under-fit segments.
`val_color`	color for validation performance curve
`train_color`	color for training performance curve
`pick_color`	color for indicating optimal stopping point

Details

This visualization can be applied to any staged machine learning algorithm. For example one could plot the performance of a gradient boosting machine as a function of the number of trees added. The fit history data should be in the form given in the example below.

The example below gives a fit plot for a history report from the Keras R package. Please see https://win-vector.com/2017/12/23/plotting-deep-learning-model-performance-trajectories/ for some examples and details.
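
The description of discount_rate ("what fraction of over-fit to subtract from validation performance") can be read as the calculation sketched below. This is one plausible reading, not the package's code: treating over-fit as the amount by which training performance exceeds validation performance is an assumption, as are the variable names.

```r
# Loss values from the example below, flipped so that larger is better.
train <- -c(0.5067290, 0.3002033, 0.2165675, 0.1738829, 0.1410933)
val   <- -c(0.3769818, 0.2996994, 0.2963943, 0.2779052, 0.2842501)
discount_rate <- 0.1

# Assumed definition: over-fit is training performance in excess of validation.
over_fit   <- pmax(0, train - val)
discounted <- val - discount_rate * over_fit

pick_epoch <- which.max(discounted)  # epoch with the best discounted validation score
print(pick_epoch)
```

The discount penalizes epochs where the training curve has pulled far ahead of the validation curve, so a larger discount_rate tends to favor earlier stopping points.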

Value

ggplot2 plot

Examples


if (requireNamespace('data.table', quietly = TRUE)) {
	# don't multi-thread during CRAN checks
		data.table::setDTthreads(1)
}

d <- data.frame(
  epoch    = c(1,         2,         3,         4,         5),
  val_loss = c(0.3769818, 0.2996994, 0.2963943, 0.2779052, 0.2842501),
  val_acc  = c(0.8722000, 0.8895000, 0.8822000, 0.8899000, 0.8861000),
  loss     = c(0.5067290, 0.3002033, 0.2165675, 0.1738829, 0.1410933),
  acc      = c(0.7852000, 0.9040000, 0.9303333, 0.9428000, 0.9545333) )

cT <- data.frame(
  measure =    c("minus binary cross entropy", "accuracy"),
  training =   c("loss",                       "acc"),
  validation = c("val_loss",                   "val_acc"),
  stringsAsFactors = FALSE)

plt <- plot_fit_trajectory(
  d,
  column_description = cT,
  needs_flip = "minus binary cross entropy",
  title = "model performance by epoch, dataset, and measure",
  epoch_name = "epoch",
  pick_metric = "minus binary cross entropy",
  discount_rate = 0.1)

print(plt)

Plot the trajectory of a Keras model fit.

Description

Plot a history of model fit performance over the number of training epochs.

Usage

plot_Keras_fit_trajectory(
  d,
  title,
  ...,
  epoch_name = "epoch",
  lossname = "loss",
  loss_pretty_name = "minus binary cross entropy",
  perfname = "acc",
  perf_pretty_name = "accuracy",
  pick_metric = loss_pretty_name,
  fliploss = TRUE,
  discount_rate = NULL,
  draw_ribbon = FALSE,
  val_color = "#d95f02",
  train_color = "#1b9e77",
  pick_color = "#e6ab02"
)

Arguments

`d`	data frame to get values from.
`title`	character title for plot.
`...`	force later arguments to be bound by name
`epoch_name`	name for epoch or trajectory column.
`lossname`	name of training loss column (default 'loss')
`loss_pretty_name`	name for loss on graph (default 'minus binary cross entropy')
`perfname`	name of training performance column (default 'acc')
`perf_pretty_name`	name for performance metric on graph (default 'accuracy')
`pick_metric`	character: metric to maximize (NULL for no pick line - default loss_pretty_name)
`fliploss`	flip the loss so that "larger is better"? (default TRUE)
`discount_rate`	numeric: what fraction of over-fit to subtract from validation performance.
`draw_ribbon`	present the difference in training and validation performance as a ribbon rather than two curves? (default FALSE)
`val_color`	color for validation performance curve
`train_color`	color for training performance curve
`pick_color`	color for indicating optimal stopping point

Details

Assumes a performance matrix that carries information for both training and validation loss, and an additional training and validation performance metric, in the format that a Keras history object returns.

By default, flips the loss so that better performance is larger for both the loss and the performance metric, and then draws a vertical line at the minimum validation loss (maximum flipped validation loss). If you choose not to flip the loss, you should not use the loss as the pick_metric.

The example below gives a fit plot for a history report from the Keras R package. Please see https://winvector.github.io/FluidData/PlotExample/KerasPerfPlot.html for some details.
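
The equivalence stated above (the minimum validation loss is the maximum of the flipped validation loss) can be checked directly in base R with the validation losses from the example data. This only illustrates the flip; it does not call the package.

```r
val_loss <- c(0.3769818, 0.2996994, 0.2963943, 0.2779052, 0.2842501)

flipped <- -val_loss                   # "larger is better" version of the loss
best_by_loss    <- which.min(val_loss) # epoch with the smallest validation loss
best_by_flipped <- which.max(flipped)  # same epoch, found on the flipped scale
print(c(best_by_loss, best_by_flipped))
```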

Value

ggplot2 plot

Examples


if (requireNamespace('data.table', quietly = TRUE)) {
	# don't multi-thread during CRAN checks
		data.table::setDTthreads(1)
}

# example data (from Keras)
d <- data.frame(
  val_loss = c(0.3769818, 0.2996994, 0.2963943, 0.2779052, 0.2842501),
  val_acc  = c(0.8722000, 0.8895000, 0.8822000, 0.8899000, 0.8861000),
  loss     = c(0.5067290, 0.3002033, 0.2165675, 0.1738829, 0.1410933),
  acc      = c(0.7852000, 0.9040000, 0.9303333, 0.9428000, 0.9545333) )

plt <- plot_Keras_fit_trajectory(
  d,
  title = "model performance by epoch, dataset, and measure")

print(plt)

Plot count data with a theoretical binomial

Description

Compares empirical count data to a binomial distribution

Usage

PlotDistCountBinomial(
  frm,
  xvar,
  trial_size,
  title,
  ...,
  p = NULL,
  limit_to_observed_range = FALSE,
  count_color = "black",
  binom_color = "blue"
)

Arguments

`frm`	data frame to get values from
`xvar`	column of frm that counts the number of successes for each trial
`trial_size`	the number of "coin flips" in a trial
`title`	title to place on plot
`...`	no unnamed argument, added to force named binding of later arguments.
`p`	mean of the binomial. If NULL, use empirical mean
`limit_to_observed_range`	If TRUE, limit plot to observed counts
`count_color`	color of empirical distribution
`binom_color`	color of theoretical binomial

Details

This function is useful for comparing the number of successes that occur in a series of trials, all of the same size, to a binomial of a given success-probability.

Plots the empirical distribution of successes, and a theoretical matching binomial. If the mean of the binomial, p, is given, the binomial with success-probability p is plotted. Otherwise, p is taken to be the pooled success rate of the data: sum(frm[[xvar]]) / (trial_size*nrow(frm)). The mean of the binomial is reported in the subtitle of the plot (to three significant figures).

If limit_to_observed_range is TRUE, the range of the plot will only cover the range of the empirical data. Otherwise, the range of the plot will be 0:trial_size (the default).
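
The pooled success rate and the matching binomial can be sketched in base R as follows. The variable names are illustrative, and scaling the binomial probabilities by the number of trials to get expected counts is an assumption about how the two distributions are put on a common scale.

```r
set.seed(23590)
class_size <- 35
nclasses <- 100
n_female <- rbinom(nclasses, class_size, 0.4)

# Pooled success rate, as in the formula above.
p_hat <- sum(n_female) / (class_size * nclasses)

# Expected number of trials at each possible count under Binomial(class_size, p_hat).
counts <- 0:class_size
expected <- nclasses * dbinom(counts, size = class_size, prob = p_hat)

# Observed number of trials at each count, for comparison.
observed <- as.vector(table(factor(n_female, levels = counts)))
print(signif(p_hat, 3))
```

Restricting `counts` to `min(n_female):max(n_female)` would correspond to the limit_to_observed_range = TRUE view.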

Examples


if (requireNamespace('data.table', quietly = TRUE)) {
	# don't multi-thread during CRAN checks
		data.table::setDTthreads(1)
}

set.seed(23590)
class_size = 35
nclasses = 100
true_frate = 0.4
fdata = data.frame(n_female = rbinom(nclasses, class_size, true_frate), stringsAsFactors = FALSE)

title = paste("Distribution of count of female students, class size =", class_size)
# compare to empirical p
PlotDistCountBinomial(fdata, "n_female", class_size, title)

if(FALSE) {
  # compare to theoretical p of 0.5
  PlotDistCountBinomial(fdata, "n_female", class_size, title,
                        p = 0.5)

  # Example where the distribution is not of a true single binomial
  fdata2 = rbind(data.frame(n_female = rbinom(50, class_size, 0.25)),
                data.frame(n_female = rbinom(10, class_size, 0.60)),
                stringsAsFactors = FALSE )
  PlotDistCountBinomial(fdata2, "n_female", class_size, title)
}

Plot distribution details as a histogram plus matching normal

Description

Compares empirical data to a normal distribution with the same mean and standard deviation.

Usage

PlotDistCountNormal(
  frm,
  xvar,
  title,
  ...,
  binWidth = c(),
  hist_color = "black",
  normal_color = "blue",
  mean_color = "blue",
  sd_color = "blue"
)

Arguments

`frm`	data frame to get values from
`xvar`	name of the independent (input or model) column in frame
`title`	title to place on plot
`...`	no unnamed argument, added to force named binding of later arguments.
`binWidth`	width of histogram bins
`hist_color`	color of empirical histogram
`normal_color`	color of matching theoretical normal
`mean_color`	color of mean line
`sd_color`	color of 1-standard deviation lines (can be NULL)

Details

Plots the histograms of the empirical distribution and of the matching normal distribution. Also plots the mean and plus/minus one standard deviation.

Bin width for the histogram is calculated automatically to yield approximately 50 bins across the range of the data, unless the binWidth argument is explicitly passed in. binWidth is reported in the subtitle of the plot.
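
A minimal base-R sketch of the idea, assuming the automatic bin width is simply the data range divided by 50 (the package's exact rule may differ), and assuming the matching normal is turned into per-bin expected counts:

```r
set.seed(52523)
wt <- 100 * rnorm(100)

# Assumed rule: about 50 equal-width bins across the observed range.
n_bins <- 50
bin_width <- diff(range(wt)) / n_bins
breaks <- min(wt) + bin_width * (0:n_bins)

# Expected count per bin under a normal with the sample mean and sd.
expected <- length(wt) * diff(pnorm(breaks, mean = mean(wt), sd = sd(wt)))

# Reference lines: mean and mean +/- one standard deviation.
ref_lines <- mean(wt) + c(-1, 0, 1) * sd(wt)
print(bin_width)
```

The expected counts sum to slightly less than the number of observations, because the normal places some mass outside the observed range.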

Examples


if (requireNamespace('data.table', quietly = TRUE)) {
	# don't multi-thread during CRAN checks
		data.table::setDTthreads(1)
}

set.seed(52523)
d <- data.frame(wt=100*rnorm(100))
PlotDistCountNormal(d,'wt','example')

# # no sd lines
# PlotDistCountNormal(d, 'wt', 'example', sd_color=NULL)

Plot empirical rate data as a density with the matching beta distribution

Description

Compares empirical rate data to a beta distribution with the same mean and standard deviation.

Usage

PlotDistDensityBeta(
  frm,
  xvar,
  title,
  ...,
  curve_color = "lightgray",
  beta_color = "blue",
  mean_color = "blue",
  sd_color = "darkgray"
)

Arguments

`frm`	data frame to get values from
`xvar`	name of the independent (input or model) column in frame
`title`	title to place on plot
`...`	force later arguments to bind by name
`curve_color`	color for empirical density curve
`beta_color`	color for matching theoretical beta
`mean_color`	color for mean line
`sd_color`	color for 1-standard deviation lines (can be NULL)

Details

Plots the empirical density, the theoretical matching beta, the mean value, and plus/minus one standard deviation from the mean.
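
A beta distribution with a given mean and standard deviation can be found by the method of moments. The base-R sketch below shows that calculation on data shaped like the example; whether the package uses exactly this estimator (for instance, the n - 1 sample variance from `var()`) is an assumption.

```r
set.seed(52523)
N <- 100
herd_size <- round(runif(N, min = 25, max = 50))
rate_gray <- rbinom(N, herd_size, 0.1) / herd_size

m <- mean(rate_gray)
v <- var(rate_gray)

# Method of moments: solve mean = a/(a+b), var = ab/((a+b)^2 (a+b+1)).
# Requires v < m * (1 - m), otherwise no beta distribution matches.
k <- m * (1 - m) / v - 1
shape1 <- m * k
shape2 <- (1 - m) * k

# Density of the matching beta on a grid, ready to overlay on the data.
grid <- seq(0.001, 0.999, length.out = 200)
beta_density <- dbeta(grid, shape1, shape2)
print(c(shape1 = shape1, shape2 = shape2))
```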

Examples


if (requireNamespace('data.table', quietly = TRUE)) {
	# don't multi-thread during CRAN checks
		data.table::setDTthreads(1)
}

set.seed(52523)
N = 100
pgray = 0.1  # rate of gray horses in the population
herd_size = round(runif(N, min=25, 50))
ngray = rbinom(N, herd_size, pgray)
hdata = data.frame(n_gray=ngray, herd_size=herd_size)

# observed rate of gray horses in each herd
hdata$rate_gray = with(hdata, n_gray/herd_size)

title = "Observed prevalence of gray horses in population"

PlotDistDensityBeta(hdata, "rate_gray", title) +
  ggplot2::geom_vline(xintercept = pgray, linetype=4, color="maroon") +
  ggplot2::annotate("text", x=pgray+0.01, y=0.01, hjust="left",
                    label = paste("True prevalence =", pgray))

# # no sd lines
# PlotDistDensityBeta(hdata, "rate_gray", title,
#                     sd_color=NULL)

Plot an empirical density with the matching normal distribution

Description

Compares empirical data to a normal distribution with the same mean and standard deviation.

Usage

PlotDistDensityNormal(
  frm,
  xvar,
  title,
  ...,
  adjust = 0.5,
  curve_color = "lightgray",
  normal_color = "blue",
  mean_color = "blue",
  sd_color = "darkgray"
)

Arguments

`frm`	data frame to get values from
`xvar`	name of the independent (input or model) column in frame
`title`	title to place on plot
`...`	no unnamed argument, added to force named binding of later arguments.
`adjust`	passed to geom_density; controls smoothness of density plot
`curve_color`	color for empirical density curve
`normal_color`	color for theoretical matching normal
`mean_color`	color of mean line
`sd_color`	color for 1-standard deviation lines (can be NULL)

Details

Plots the empirical density, the theoretical matching normal, the mean value, and plus/minus one standard deviation from the mean.
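
The pieces of this plot can be approximated in base R as sketched below. This assumes the density estimate behaves like `stats::density()`, where `adjust` multiplies the default bandwidth; it is an illustration, not the package's plotting code.

```r
set.seed(52523)
wt <- 100 * rnorm(100)

# Kernel density estimate; adjust scales the default bandwidth.
# adjust = 0.5 halves the bandwidth, giving a less smooth curve.
emp <- density(wt, adjust = 0.5)

# Matching normal evaluated on the same grid.
normal_density <- dnorm(emp$x, mean = mean(wt), sd = sd(wt))

# Reference lines: mean and mean +/- one standard deviation.
ref_lines <- mean(wt) + c(-1, 0, 1) * sd(wt)
print(ref_lines)
```

Smaller values of adjust follow the data more closely and show more bumps; larger values smooth the empirical curve toward a single mode.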

Examples


if (requireNamespace('data.table', quietly = TRUE)) {
	# don't multi-thread during CRAN checks
		data.table::setDTthreads(1)
}

set.seed(52523)
d <- data.frame(wt=100*rnorm(100))
PlotDistDensityNormal(d,'wt','example')

# # no sd lines
# PlotDistDensityNormal(d, 'wt', 'example', sd_color=NULL)

Plot empirical rate data as a histogram plus matching beta

Description

Compares empirical rate data to a beta distribution with the same mean and standard deviation.

Usage

PlotDistHistBeta(
  frm,
  xvar,
  title,
  ...,
  bins = 30,
  hist_color = "darkgray",
  beta_color = "blue",
  mean_color = "blue",
  sd_color = "darkgray"
)

Arguments

`frm`	data frame to get values from
`xvar`	name of the independent (input or model) column in frame
`title`	title to place on plot
`...`	force later arguments to bind by name
`bins`	passed to geom_histogram(). Default: 30
`hist_color`	color of empirical histogram
`beta_color`	color of matching theoretical beta
`mean_color`	color of mean line
`sd_color`	color of 1-standard deviation lines (can be NULL)

Details

Plots the histogram of the empirical distribution and the density of the matching beta distribution. Also plots the mean and plus/minus one standard deviation.

The number of bins for the histogram defaults to 30. The binwidth can also be passed in instead of the number of bins.

Value

ggplot2 plot

Examples


if (requireNamespace('data.table', quietly = TRUE)) {
	# don't multi-thread during CRAN checks
		data.table::setDTthreads(1)
}

set.seed(52523)
N = 100
pgray = 0.1  # rate of gray horses in the population
herd_size = round(runif(N, min=25, 50))
ngray = rbinom(N, herd_size, pgray)
hdata = data.frame(n_gray=ngray, herd_size=herd_size)

# observed rate of gray horses in each herd
hdata$rate_gray = with(hdata, n_gray/herd_size)

title = "Observed prevalence of gray horses in population"

PlotDistHistBeta(hdata, "rate_gray", title) +
  ggplot2::geom_vline(xintercept = pgray, linetype=4, color="maroon") +
  ggplot2::annotate("text", x=pgray+0.01, y=0.01, hjust="left",
                    label = paste("True prevalence =", pgray))

# # no sd lines
# PlotDistHistBeta(hdata, "rate_gray", title,
#                     sd_color=NULL)

Use `plotly` to produce a ROC plot.

Description

Produce a ROC plot using plotly. Note: any arrange_ warning is caused by a version incompatibility between plotly and dplyr.

Usage

plotlyROC(
  d,
  predCol,
  outcomeCol,
  outcomeTarget,
  title,
  ...,
  estimate_sig = FALSE
)

Arguments

`d`	data frame
`predCol`	name of column with numeric predictions
`outcomeCol`	name of column with truth
`outcomeTarget`	value considered true
`title`	character title for plot
`...`	no unnamed argument, added to force named binding of later arguments.
`estimate_sig`	logical, if TRUE estimate and display significance of difference from AUC 0.5.

Value

plotly plot

Examples


if(FALSE && requireNamespace("plotly", quietly = TRUE)) {
   if (requireNamespace('data.table', quietly = TRUE)) {
	   # don't multi-thread during CRAN checks
		   data.table::setDTthreads(1)
   }
   set.seed(34903490)
   x = rnorm(50)
   y = 0.5*x^2 + 2*x + rnorm(length(x))
   frm = data.frame(x=x,yC=y>=as.numeric(quantile(y,probs=0.8)))
   plotlyROC(frm, 'x', 'yC', TRUE, 'example plot', estimate_sig = TRUE)
}


Plot Precision-Recall plot.

Description

Plot Precision-Recall plot.

Usage

PRPlot(frame, xvar, truthVar, truthTarget, title, ..., estimate_sig = FALSE)

Arguments

`frame`	data frame to get values from
`xvar`	name of the independent (input or model) column in frame
`truthVar`	name of the dependent (output or result to be modeled) column in frame
`truthTarget`	value we consider to be positive
`title`	title to place on plot
`...`	no unnamed argument, added to force named binding of later arguments.
`estimate_sig`	logical, if TRUE compute significance

Details

See https://www.nature.com/articles/nmeth.3945 for a discussion of precision and recall, and how the precision/recall plot relates to the ROC plot.

In addition to plotting precision versus recall, PRPlot reports the best achieved F1 score, and plots an isoline corresponding to that F1 score.
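
The quantities described above can be sketched in base R. The helpers pr_curve and iso_precision are hypothetical names, and the convention "predict positive when score >= threshold" is an assumption; PRPlot's internals may differ.

```r
# Precision, recall, and F1 at every distinct score threshold
# (illustrative sketch; assumes "positive" means score >= threshold).
pr_curve <- function(score, truth) {
  thresholds <- sort(unique(score), decreasing = TRUE)
  res <- lapply(thresholds, function(t) {
    pred <- score >= t
    tp <- sum(pred & truth)
    data.frame(
      threshold = t,
      precision = tp / sum(pred),
      recall = tp / sum(truth))
  })
  res <- do.call(rbind, res)
  res$f1 <- with(res, ifelse(precision + recall > 0,
                             2 * precision * recall / (precision + recall),
                             0))
  res
}

score <- c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4)
truth <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)
pr <- pr_curve(score, truth)
best <- pr[which.max(pr$f1), ]

# Precision along the isoline of constant F1:
# F1 = 2PR / (P + R)  implies  P = F1 * R / (2R - F1), for R > F1 / 2.
iso_precision <- function(recall, f1) f1 * recall / (2 * recall - f1)
```

The best-F1 point lies on its own isoline, so iso_precision(best$recall, best$f1) returns best$precision.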

Examples


if (requireNamespace('data.table', quietly = TRUE)) {
	# don't multi-thread during CRAN checks
		data.table::setDTthreads(1)
}

set.seed(34903490)
x = rnorm(50)
y = 0.5*x^2 + 2*x + rnorm(length(x))
frm = data.frame(x=x,y=y,yC=y>=as.numeric(quantile(y,probs=0.8)))
frm$absY <- abs(frm$y)
frm$posY = frm$y > 0
frm$costX = 1
WVPlots::PRPlot(frm, "x", "yC", TRUE, title="Example Precision-Recall plot")

Plot Precision-Recall or Enrichment-Recall as a function of threshold.

Description

Plot classifier performance metrics as a function of threshold.

Usage

PRTPlot(
  frame,
  predVar,
  truthVar,
  truthTarget,
  title,
  ...,
  plotvars = c("precision", "recall"),
  thresholdrange = c(-Inf, Inf),
  linecolor = "black"
)

Arguments

`frame`	data frame to get values from
`predVar`	name of the column of predicted scores
`truthVar`	name of the column of actual outcomes in frame
`truthTarget`	value we consider to be positive
`title`	title to place on plot
`...`	no unnamed argument, added to force named binding of later arguments.
`plotvars`	variables to plot, must be at least one of the measures listed below. Defaults to c("precision", "recall")
`thresholdrange`	range of thresholds to plot.
`linecolor`	line color for the plot

Details

For a classifier, the precision is what fraction of predicted positives are true positives; the recall is what fraction of true positives the classifier finds, and the enrichment is the ratio of classifier precision to the average rate of positives. Plotting precision-recall or enrichment-recall as a function of classifier score helps identify a score threshold that achieves an acceptable tradeoff between precision and recall, or enrichment and recall.

In addition to precision/recall, PRTPlot can plot a number of other metrics:

precision: fraction of predicted positives that are true positives
recall: fraction of true positives that were predicted to be true
enrichment: ratio of classifier precision to prevalence of positive class
sensitivity: the same as recall (also known as the true positive rate)
specificity: fraction of true negatives to all negatives (or 1 - false_positive_rate)
false_positive_rate: fraction of negatives predicted to be true over all negatives

For example, plotting sensitivity/false_positive_rate as functions of threshold will "unroll" an ROC Plot.

Plots are in a single column, in the order specified by plotvars.
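
To make the metric definitions concrete, here is a small base R sketch that evaluates all six measures at one threshold and then sweeps the threshold. The function threshold_metrics is a hypothetical helper, and the rule "predict positive when score >= threshold" is an assumption rather than a statement about PRTPlot's implementation.

```r
# Classifier metrics at a single score threshold
# (illustrative sketch; assumes "positive" means score >= threshold).
threshold_metrics <- function(score, truth, threshold) {
  pred <- score >= threshold
  tp <- sum(pred & truth)
  fp <- sum(pred & !truth)
  tn <- sum(!pred & !truth)
  fn <- sum(!pred & truth)
  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)
  c(
    precision = precision,
    recall = recall,
    enrichment = precision / mean(truth),  # precision relative to prevalence
    sensitivity = recall,
    specificity = tn / (tn + fp),
    false_positive_rate = fp / (fp + tn))
}

score <- c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2)
truth <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE)
m <- threshold_metrics(score, truth, threshold = 0.65)

# Sweeping the threshold traces out sensitivity and false_positive_rate,
# the two coordinates of an ROC curve, one row per threshold.
metric_sweep <- t(sapply(sort(unique(score)),
                         function(t) threshold_metrics(score, truth, t)))
```

Each row of metric_sweep corresponds to one threshold, which is the tabular form of the "unrolled" ROC plot mentioned above.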

Examples


if (requireNamespace('data.table', quietly = TRUE)) {
	# don't multi-thread during CRAN checks
		data.table::setDTthreads(1)
}

df <- iris
df$isVersicolor <- with(df, Species=='versicolor')
model = glm(isVersicolor ~ Petal.Length + Petal.Width + Sepal.Length + Sepal.Width,
            data=df, family=binomial)
df$pred = predict(model, newdata=df, type="response")

WVPlots::PRTPlot(df, "pred", "isVersicolor", TRUE, title="Example Precision-Recall threshold plot")

if (FALSE) {
WVPlots::PRTPlot(df, "pred", "isVersicolor", TRUE,
                 plotvars = c("sensitivity", "specificity", "false_positive_rate"),
                 title="Sensitivity/specificity/FPR as functions of threshold")
}

Plot receiver operating characteristic plot.

Description

Plot receiver operating characteristic plot.

Usage

ROCPlot(
  frame,
  xvar,
  truthVar,
  truthTarget,
  title,
  ...,
  estimate_sig = FALSE,
  returnScores = FALSE,
  nrep = 100,
  parallelCluster = NULL,
  curve_color = "darkblue",
  fill_color = "black",
  diag_color = "black",
  add_beta_ideal_curve = FALSE,
  beta_ideal_curve_color = "#fd8d3c",
  add_beta1_ideal_curve = FALSE,
  beta1_ideal_curve_color = "#f03b20",
  add_symmetric_ideal_curve = FALSE,
  symmetric_ideal_curve_color = "#bd0026",
  add_convex_hull = FALSE,
  convex_hull_color = "#404040",
  ideal_plot_step_size = 0.001
)

Arguments

`frame`	data frame to get values from
`xvar`	name of the independent (input or model) column in frame
`truthVar`	name of the dependent (output or result to be modeled) column in frame
`truthTarget`	value we consider to be positive
`title`	title to place on plot
`...`	no unnamed argument, added to force named binding of later arguments.
`estimate_sig`	logical, if TRUE estimate and display significance of difference from AUC 0.5.
`returnScores`	logical if TRUE return detailed permutedScores
`nrep`	number of permutation repetitions to estimate p values.
`parallelCluster`	(optional) a cluster object created by package parallel or package snow.
`curve_color`	color of the ROC curve
`fill_color`	shading color for the area under the curve
`diag_color`	color for the AUC=0.5 line (x=y)
`add_beta_ideal_curve`	logical, if TRUE add the beta(a, b), beta(c, d) ideal curve found by moment matching.
`beta_ideal_curve_color`	color for ideal curve.
`add_beta1_ideal_curve`	logical, if TRUE add the beta(1, a), beta(b, 2) ideal curve defined in doi:10.1177/0272989X15582210
`beta1_ideal_curve_color`	color for ideal curve.
`add_symmetric_ideal_curve`	logical, if TRUE add the ideal curve as discussed in https://win-vector.com/2020/09/13/why-working-with-auc-is-more-powerful-than-one-might-think/.
`symmetric_ideal_curve_color`	color for ideal curve.
`add_convex_hull`	logical, if TRUE add convex hull to plot
`convex_hull_color`	color for convex hull curve
`ideal_plot_step_size`	step size used in ideal plots

Details

See https://www.nature.com/articles/nmeth.3945 for a discussion of true positive and false positive rates, and how the ROC plot relates to the precision/recall plot.
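
For intuition about the AUC and the permutation-based significance controlled by nrep, the following base R sketch computes the AUC from ranks and estimates a one-sided permutation p-value. Both auc_rank and auc_perm_p are hypothetical helpers; the exact test ROCPlot runs is not specified here, so treat this as one plausible reading rather than the package's procedure.

```r
# AUC as the probability that a random positive outscores a random
# negative (ties count one half), computed from ranks.
auc_rank <- function(score, truth) {
  n_pos <- sum(truth)
  n_neg <- sum(!truth)
  r <- rank(score)  # average ranks handle ties
  (sum(r[truth]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Fraction of label shuffles whose AUC reaches the observed AUC
# (assumed one-sided test; add-one correction avoids a p-value of zero).
auc_perm_p <- function(score, truth, nrep = 100) {
  observed <- auc_rank(score, truth)
  permuted <- replicate(nrep, auc_rank(score, sample(truth)))
  (1 + sum(permuted >= observed)) / (1 + nrep)
}

set.seed(2020)
truth <- rep(c(TRUE, FALSE), each = 50)
score <- ifelse(truth, rbeta(100, 6, 5), rbeta(100, 1, 2))
auc <- auc_rank(score, truth)
p <- auc_perm_p(score, truth, nrep = 100)
```

A larger nrep gives a finer-grained p-value estimate at the cost of more computation, which is where a parallel cluster can help.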

Examples


if (requireNamespace('data.table', quietly = TRUE)) {
	# don't multi-thread during CRAN checks
		data.table::setDTthreads(1)
}

beta_example <- function(
  n,
  shape1_pos, shape2_pos,
  shape1_neg, shape2_neg) {
  d <- data.frame(
    y = sample(
      c(TRUE, FALSE),
      size = n,
      replace = TRUE),
    score = 0.0
  )
  d$score[d$y] <- rbeta(sum(d$y), shape1 = shape1_pos, shape2 = shape2_pos)
  d$score[!d$y] <- rbeta(sum(!d$y), shape1 = shape1_neg, shape2 = shape2_neg)
  d
}

d1 <- beta_example(
  100,
  shape1_pos = 6,
  shape2_pos = 5,
  shape1_neg = 1,
  shape2_neg = 2)

ROCPlot(
   d1,
   xvar = "score",
   truthVar = "y", truthTarget = TRUE,
   title="Example ROC plot",
   estimate_sig = TRUE,
   add_beta_ideal_curve = TRUE,
   add_convex_hull = TRUE)

Compare multiple ROC plots.

Description

Plot multiple receiver operating characteristic curves from the same data.frame.

Usage

ROCPlotList(
  frame,
  xvar_names,
  truthVar,
  truthTarget,
  title,
  ...,
  palette = "Dark2"
)

ROCPlotPairList(
  frame,
  xvar_names,
  truthVar,
  truthTarget,
  title,
  ...,
  palette = "Dark2"
)

ROCListPlot(
  frame,
  xvar_names,
  truthVar,
  truthTarget,
  title,
  ...,
  palette = "Dark2"
)

Arguments

`frame`	data frame to get values from
`xvar_names`	names of the independent (input or model) columns in frame
`truthVar`	name of the dependent (output or result to be modeled) column in frame
`truthTarget`	value we consider to be positive
`title`	title to place on plot
`...`	no unnamed argument, added to force named binding of later arguments.
`palette`	name of a brewer palette (NULL for ggplot2 default coloring)

Details

The use case for this function is to compare the performance of several models when applied to a data set, where the predictions from all of the models are columns of the same data frame.

If palette is NULL, plot colors will be chosen from the default ggplot2 palette. Setting palette to NULL allows the user to choose a non-Brewer palette, for example with scale_color_manual.

Examples


if (requireNamespace('data.table', quietly = TRUE)) {
	# don't multi-thread during CRAN checks
		data.table::setDTthreads(1)
}

set.seed(34903490)
x1 = rnorm(50)
x2 = rnorm(length(x1))
x3 = rnorm(length(x1))
y = 0.2*x2^2 + 0.5*x2 + x1 + rnorm(length(x1))
frm = data.frame(
   x1 = x1,
   x2 = x2,
   x3 = x3,
   yC = y >= as.numeric(quantile(y,probs=0.8)))
WVPlots::ROCPlotList(
   frame = frm,
   xvar_names = c("x1", "x2", "x3"),
   truthVar = "yC", truthTarget = TRUE,
   title = "Example ROC list plot")

Compare two ROC plots.

Description

Plot two receiver operating characteristic curves from the same data.frame.

Usage

ROCPlotPair(
  frame,
  xvar1,
  xvar2,
  truthVar,
  truthTarget,
  title,
  ...,
  estimate_sig = FALSE,
  returnScores = FALSE,
  nrep = 100,
  parallelCluster = NULL,
  palette = "Dark2"
)

Arguments

`frame`	data frame to get values from
`xvar1`	name of the first independent (input or model) column in frame
`xvar2`	name of the second independent (input or model) column in frame
`truthVar`	name of the dependent (output or result to be modeled) column in frame
`truthTarget`	value we consider to be positive
`title`	title to place on plot
`...`	no unnamed argument, added to force named binding of later arguments.
`estimate_sig`	logical, if TRUE estimate and display significance of difference from AUC 0.5.
`returnScores`	logical if TRUE return detailed permutedScores
`nrep`	number of permutation repetitions to estimate p values.
`parallelCluster`	(optional) a cluster object created by package parallel or package snow.
`palette`	name of a brewer palette (NULL for ggplot2 default coloring)

Details

The use case for this function is to compare the performance of two models when applied to a data set, where the predictions from both models are columns of the same data frame.

If palette is NULL, plot colors will be chosen from the default ggplot2 palette. Setting palette to NULL allows the user to choose a non-Brewer palette, for example with scale_color_manual.

Examples


if (requireNamespace('data.table', quietly = TRUE)) {
	# don't multi-thread during CRAN checks
		data.table::setDTthreads(1)
}

set.seed(34903490)
x1 = rnorm(50)
x2 = rnorm(length(x1))
y = 0.2*x2^2 + 0.5*x2 + x1 + rnorm(length(x1))
frm = data.frame(x1=x1,x2=x2,yC=y>=as.numeric(quantile(y,probs=0.8)))
# WVPlots::ROCPlot(frm, "x1", "yC", TRUE, title="Example ROC plot")
# WVPlots::ROCPlot(frm, "x2", "yC", TRUE, title="Example ROC plot")
WVPlots::ROCPlotPair(frm, "x1", "x2", "yC", TRUE,
   title="Example ROC pair plot", estimate_sig = TRUE)

Compare two ROC plots.

Description

Plot two receiver operating characteristic curves from different data frames.

Usage

ROCPlotPair2(
  nm1,
  frame1,
  xvar1,
  truthVar1,
  truthTarget1,
  nm2,
  frame2,
  xvar2,
  truthVar2,
  truthTarget2,
  title,
  ...,
  estimate_sig = TRUE,
  returnScores = FALSE,
  nrep = 100,
  parallelCluster = NULL,
  palette = "Dark2"
)

Arguments

`nm1`	name of first model
`frame1`	data frame to get values from
`xvar1`	name of the first independent (input or model) column in frame
`truthVar1`	name of the dependent (output or result to be modeled) column in frame
`truthTarget1`	value we consider to be positive
`nm2`	name of second model
`frame2`	data frame to get values from
`xvar2`	name of the second independent (input or model) column in frame
`truthVar2`	name of the dependent (output or result to be modeled) column in frame
`truthTarget2`	value we consider to be positive
`title`	title to place on plot
`...`	no unnamed argument, added to force named binding of later arguments.
`estimate_sig`	logical, if TRUE estimate and display significance of difference from AUC 0.5.
`returnScores`	logical if TRUE return detailed permutedScores
`nrep`	number of permutation repetitions to estimate p values.
`parallelCluster`	(optional) a cluster object created by package parallel or package snow.
`palette`	name of Brewer palette to color curves (can be NULL)

Details

Use this curve to compare model predictions to true outcome from two data frames, each of which has its own model predictions and true outcome columns.

If palette is NULL, plot colors will be chosen from the default ggplot2 palette. Setting palette to NULL allows the user to choose a non-Brewer palette, for example with scale_color_manual.

Examples


if (requireNamespace('data.table', quietly = TRUE)) {
	# don't multi-thread during CRAN checks
		data.table::setDTthreads(1)
}

set.seed(34903490)
x1 = rnorm(50)
x2 = rnorm(length(x1))
y = 0.2*x2^2 + 0.5*x2 + x1 + rnorm(length(x1))
frm = data.frame(x1=x1,x2=x2,yC=y>=as.numeric(quantile(y,probs=0.8)))
# WVPlots::ROCPlot(frm, "x1", "yC", TRUE, title="Example ROC plot")
# WVPlots::ROCPlot(frm, "x2", "yC", TRUE, title="Example ROC plot")
WVPlots::ROCPlotPair2('train',frm, "x1", "yC", TRUE,
                      'test', frm, "x2", "yC", TRUE,
                      title="Example ROC pair plot", estimate_sig = TRUE)

Plot a scatter box plot.

Description

Plot a boxplot with the data points superimposed.

Usage

ScatterBoxPlot(
  frm,
  xvar,
  yvar,
  title,
  ...,
  pt_alpha = 0.3,
  pt_color = "black",
  box_color = "black",
  box_fill = "lightgray"
)

Arguments

`frm`	data frame to get values from
`xvar`	name of the independent column in frame; assumed discrete
`yvar`	name of the continuous column in frame
`title`	plot title
`...`	(doesn't take additional arguments, used to force later arguments by name)
`pt_alpha`	transparency of points in scatter plot
`pt_color`	point color
`box_color`	boxplot line color
`box_fill`	boxplot fill color (can be NA for no fill)

Details

xvar is a discrete variable and yvar is a continuous variable.

Examples


if (requireNamespace('data.table', quietly = TRUE)) {
	# don't multi-thread during CRAN checks
		data.table::setDTthreads(1)
}

classes = c("a", "b", "c")
means = c(2, 4, 3)
names(means) = classes
label = sample(classes, size=1000, replace=TRUE)
meas = means[label] + rnorm(1000)
frm2 = data.frame(label=label,
                  meas = meas)
WVPlots::ScatterBoxPlot(frm2, "label", "meas", pt_alpha=0.2, title="Example Scatter/Box plot")

Plot a scatter box plot in horizontal mode.

Description

Plot a boxplot with the data points superimposed. Box plots are aligned horizontally.

Usage

ScatterBoxPlotH(
  frm,
  xvar,
  yvar,
  title,
  ...,
  pt_alpha = 0.3,
  pt_color = "black",
  box_color = "black",
  box_fill = "lightgray"
)

Arguments

`frm`	data frame to get values from
`xvar`	name of the continuous column in frame
`yvar`	name of the independent column in frame; assumed discrete
`title`	plot title
`...`	(doesn't take additional arguments, used to force later arguments by name)
`pt_alpha`	transparency of points in scatter plot
`pt_color`	point color
`box_color`	boxplot line color
`box_fill`	boxplot fill color (can be NA for no fill)

Details

xvar is a continuous variable and yvar is a discrete variable.

Examples


if (requireNamespace('data.table', quietly = TRUE)) {
	# don't multi-thread during CRAN checks
		data.table::setDTthreads(1)
}

classes = c("a", "b", "c")
means = c(2, 4, 3)
names(means) = classes
label = sample(classes, size=1000, replace=TRUE)
meas = means[label] + rnorm(1000)
frm2 = data.frame(label=label,
                  meas = meas)
WVPlots::ScatterBoxPlotH(frm2, "meas", "label",  pt_alpha=0.2, title="Example Scatter/Box plot")


Plot a scatter plot with marginals.

Description

Plot a scatter plot with optional smoothing curves or contour lines, and marginal histogram/density plots. Based on https://win-vector.com/2015/06/11/wanted-a-perfect-scatterplot-with-marginals/. See also ggExtra::ggMarginal.

Usage

ScatterHist(
  frame,
  xvar,
  yvar,
  title,
  ...,
  smoothmethod = "lm",
  estimate_sig = FALSE,
  minimal_labels = TRUE,
  binwidth_x = NULL,
  binwidth_y = NULL,
  adjust_x = 1,
  adjust_y = 1,
  point_alpha = 0.5,
  contour = FALSE,
  point_color = "black",
  hist_color = "gray",
  smoothing_color = "blue",
  density_color = "blue",
  contour_color = "blue"
)

Arguments

`frame`	data frame to get values from
`xvar`	name of the independent (input or model) column in frame
`yvar`	name of the dependent (output or result to be modeled) column in frame
`title`	title to place on plot
`...`	no unnamed argument, added to force named binding of later arguments.
`smoothmethod`	(optional) one of 'auto', 'loess', 'gam', 'lm', 'identity', or 'none'.
`estimate_sig`	logical if TRUE and smoothmethod is 'identity' or 'lm', report goodness of fit and significance of relation.
`minimal_labels`	logical drop some annotations
`binwidth_x`	numeric binwidth for x histogram
`binwidth_y`	numeric binwidth for y histogram
`adjust_x`	numeric adjust x density plot
`adjust_y`	numeric adjust y density plot
`point_alpha`	numeric opaqueness of the plot points
`contour`	logical if TRUE add a 2d contour plot
`point_color`	color for scatter plots
`hist_color`	fill color for marginal histograms
`smoothing_color`	color for smoothing line
`density_color`	color for marginal density plots
`contour_color`	color for contour plots

Details

If smoothmethod is:

'auto', 'loess' or 'gam': the appropriate smoothing curve is added to the scatterplot.
'lm' (the default): the best fit line is added to the scatterplot.
'identity': the line x = y is added to the scatterplot. This is useful for comparing model predictions to true outcome.
'none': no smoothing line is added to the scatterplot.

If estimate_sig is TRUE and smoothmethod is:

'lm': the R-squared of the linear fit is reported.
'identity': the R-squared of the exact relation between xvar and yvar is reported.

Note that the identity R-squared is NOT the square of the correlation between xvar and yvar (which includes an implicit shift and scale). It is the coefficient of determination between xvar and yvar, and can be negative. See https://en.wikipedia.org/wiki/Coefficient_of_determination for more details. If xvar is the output of a model to predict yvar, then the identity R-squared, not the lm R-squared, is the correct measure.

If smoothmethod is neither 'lm' nor 'identity' then estimate_sig is ignored.
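
The difference between the two R-squared values can be seen in a small base R sketch. The helper identity_r2 is a hypothetical name that implements the standard coefficient of determination for the exact relation y = x; it is meant to illustrate the definition, not to reproduce ScatterHist's internals.

```r
# Coefficient of determination of the exact relation y = x (no refitting).
identity_r2 <- function(x, y) {
  1 - sum((y - x)^2) / sum((y - mean(y))^2)
}

# A "prediction" that is perfectly correlated with y, but shifted and scaled.
y <- c(1, 2, 3, 4, 5)
pred <- 2 * y + 10

# Squared correlation ignores the shift and scale.
cor_r2 <- cor(pred, y)^2

# The identity R-squared penalizes them, and here it is negative.
id_r2 <- identity_r2(pred, y)
```

Here cor_r2 is 1 while id_r2 is -84.5, which is why the identity R-squared is the appropriate measure when xvar is a model's prediction of yvar.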

Value

plot grid

Examples


if (requireNamespace('data.table', quietly = TRUE)) {
	# don't multi-thread during CRAN checks
		data.table::setDTthreads(1)
}

set.seed(34903490)
x = rnorm(50)
y = 0.5*x^2 + 2*x + rnorm(length(x))
frm = data.frame(x=x,y=y)
WVPlots::ScatterHist(frm, "x", "y",
  title= "Example Fit",
  smoothmethod = "gam",
  contour = TRUE)

if (FALSE) {
# Same plot with custom colors
WVPlots::ScatterHist(frm, "x", "y",
  title= "Example Fit",
  smoothmethod = "gam",
  contour = TRUE,
  point_color = "#006d2c", # dark green
  hist_color = "#6baed6", # medium blue
  smoothing_color = "#54278f", # dark purple
  density_color = "#08519c", # darker blue
  contour_color = "#9e9ac8") # lighter purple
}

Plot a conditional scatter plot with marginals.

Description

Plot a scatter plot conditioned on a discrete variable, with marginal conditional density plots.

Usage

ScatterHistC(
  frame,
  xvar,
  yvar,
  cvar,
  title,
  ...,
  annot_size = 3,
  colorPalette = "Dark2",
  adjust_x = 1,
  adjust_y = 1
)

Arguments

`frame`	data frame to get values from
`xvar`	name of the x variable
`yvar`	name of the y variable
`cvar`	name of condition variable
`title`	title to place on plot
`...`	no unnamed argument, added to force named binding of later arguments.
`annot_size`	numeric scale annotation text (if present)
`colorPalette`	name of a Brewer palette (see https://colorbrewer2.org/ )
`adjust_x`	numeric: adjust x density plot
`adjust_y`	numeric: adjust y density plot

Details

xvar and yvar are the coordinates of the points, and cvar is the discrete conditioning variable that indicates which category each point (x,y) belongs to.

Value

plot grid

Examples


if (requireNamespace('data.table', quietly = TRUE)) {
	# don't multi-thread during CRAN checks
		data.table::setDTthreads(1)
}

set.seed(34903490)
frm = data.frame(x=rnorm(50),y=rnorm(50))
frm$cat <- frm$x+frm$y>0
WVPlots::ScatterHistC(frm, "x", "y", "cat",
                      title="Example Conditional Distribution")

Plot a height scatter plot with marginals.

Description

Plot a scatter plot conditioned on a continuous variable, with marginal conditional density plots.

Usage

ScatterHistN(
  frame,
  xvar,
  yvar,
  zvar,
  title,
  ...,
  annot_size = 3,
  colorPalette = "RdYlBu",
  nclus = 3,
  adjust_x = 1,
  adjust_y = 1
)

Arguments

`frame`	data frame to get values from
`xvar`	name of the x variable
`yvar`	name of the y variable
`zvar`	name of height variable
`title`	title to place on plot
`...`	no unnamed argument, added to force named binding of later arguments.
`annot_size`	numeric: scale annotation text (if present)
`colorPalette`	name of a Brewer palette (see https://colorbrewer2.org/ )
`nclus`	scalar: number of z-clusters to plot
`adjust_x`	numeric: adjust x density plot
`adjust_y`	numeric: adjust y density plot

Details

xvar and yvar are the coordinates of the points, and zvar is the continuous conditioning variable. zvar is partitioned into nclus disjoint ranges (by default, 3), which are then treated as discrete categories. The scatter plot and marginal density plots are color-coded by these categories.
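
The partitioning step can be illustrated in base R. This is a minimal sketch, assuming equal-count ranges built from quantiles; the binning rule that ScatterHistN actually applies may differ, and the column name zbin is hypothetical.

```r
# Sketch only: split a continuous z into nclus disjoint ranges.
# Assumption: quantile-based (equal-count) ranges; ScatterHistN may bin differently.
set.seed(34903490)
frm <- data.frame(x = rnorm(50), y = rnorm(50))
frm$z <- frm$x + frm$y

nclus <- 3
breaks <- quantile(frm$z, probs = seq(0, 1, length.out = nclus + 1))

# include.lowest = TRUE keeps the minimum z value inside the first range
frm$zbin <- cut(frm$z, breaks = breaks, include.lowest = TRUE)

# each row now carries a discrete category that could drive color-coding
table(frm$zbin)
```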

Examples


if (requireNamespace('data.table', quietly = TRUE)) {
  # don't multi-thread during CRAN checks
  data.table::setDTthreads(1)
}

set.seed(34903490)
frm = data.frame(x=rnorm(50),y=rnorm(50))
frm$z <- frm$x+frm$y
WVPlots::ScatterHistN(frm, "x", "y", "z", title="Example Joint Distribution")

Plot the distribution of a variable with a tail shaded

Description

Plot the distribution of a variable with a tail shaded. Annotate with the area of the shaded region.

Usage

ShadedDensity(
  frame,
  xvar,
  threshold,
  title,
  ...,
  tail = "left",
  linecolor = "darkgray",
  shading = "darkblue",
  annotate_area = TRUE
)

Arguments

`frame`	data frame to get values from
`xvar`	name of the variable to be density plotted
`threshold`	boundary value for the tail
`title`	title to place on plot
`...`	no unnamed argument, added to force named binding of later arguments.
`tail`	which tail to shade, 'left' (default) or 'right'
`linecolor`	color of density curve
`shading`	color of shaded region and boundaries
`annotate_area`	if TRUE (default), report the area of the shaded region
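
As a rough cross-check on the annotated area, the share of observations beyond the threshold can be computed directly. This is a sketch under the assumption that the empirical fraction is an acceptable stand-in for the area under the density curve; ShadedDensity may compute its annotation differently, and the names left_frac and right_frac are hypothetical.

```r
# Sketch only: empirical counterpart of the shaded tail area.
# Assumption: share of rows beyond the threshold approximates the shaded area.
set.seed(52523)
d <- data.frame(meas = rnorm(100))
threshold <- -1.5

left_frac <- mean(d$meas < threshold)     # corresponds to tail = "left"
right_frac <- mean(d$meas > -threshold)   # corresponds to tail = "right", mirrored threshold
```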

Examples


if (requireNamespace('data.table', quietly = TRUE)) {
  # don't multi-thread during CRAN checks
  data.table::setDTthreads(1)
}

set.seed(52523)
d = data.frame(meas=rnorm(100))
threshold = -1.5
WVPlots::ShadedDensity(d, "meas", threshold,
                       title="Example shaded density plot, left tail")
if (FALSE) {
WVPlots::ShadedDensity(d, "meas", -threshold, tail="right",
                       title="Example shaded density plot, right tail")
}

Plot the distribution of a variable with a center region shaded

Description

Plot the distribution of a variable with a center region shaded. Annotate with the area of the shaded region.

Usage

ShadedDensityCenter(
  frame,
  xvar,
  boundaries,
  title,
  ...,
  linecolor = "darkgray",
  shading = "darkblue",
  annotate_area = TRUE
)

Arguments

`frame`	data frame to get values from
`xvar`	name of the variable to be density plotted
`boundaries`	vector of the min and max boundaries of the shaded region
`title`	title to place on plot
`...`	no unnamed argument, added to force named binding of later arguments.
`linecolor`	color of density curve
`shading`	color of shaded region and boundaries
`annotate_area`	if TRUE (default), report the area of the shaded region
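
The area of a center region can also be approximated from a kernel density estimate. The sketch below uses stats::density() and a simple Riemann sum; it is an illustration of the idea only, not necessarily how ShadedDensityCenter derives its annotation, and center_area is a hypothetical name.

```r
# Sketch only: approximate the area of a density curve between two boundaries.
# Assumption: default stats::density() settings and a Riemann sum on its grid.
set.seed(52523)
d <- data.frame(meas = rnorm(100))
boundaries <- c(-1.5, 1.5)

dens <- density(d$meas)        # kernel density estimate on an evenly spaced grid
step <- diff(dens$x[1:2])      # grid spacing
inside <- dens$x >= min(boundaries) & dens$x <= max(boundaries)

center_area <- sum(dens$y[inside]) * step
```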

Examples


if (requireNamespace('data.table', quietly = TRUE)) {
  # don't multi-thread during CRAN checks
  data.table::setDTthreads(1)
}

set.seed(52523)
d = data.frame(meas=rnorm(100))
boundaries = c(-1.5, 1.5)
WVPlots::ShadedDensityCenter(d, "meas", boundaries,
                       title="Example center-shaded density plot")

Plot a Shadow Histogram Plot

Description

Plot a histogram of a continuous variable xvar, faceted on a categorical conditioning variable, condvar. Each faceted plot also shows a "shadow plot" of the unconditioned histogram for comparison.

Usage

ShadowHist(
  frm,
  xvar,
  condvar,
  title,
  ...,
  ncol = 1,
  monochrome = FALSE,
  palette = "Dark2",
  fillcolor = "darkblue",
  bins = 30,
  binwidth = NULL
)

Arguments

`frm`	data frame to get values from.
`xvar`	name of the primary continuous variable
`condvar`	name of conditioning variable (categorical variable, controls faceting).
`title`	title to place on plot.
`...`	no unnamed argument, added to force named binding of later arguments.
`ncol`	numeric: number of columns in facet_wrap.
`monochrome`	logical: if TRUE, all facets filled with same color
`palette`	character: if monochrome==FALSE, name of brewer color palette (can be NULL)
`fillcolor`	character: if monochrome==TRUE, name of fill color
`bins`	number of bins. Defaults to thirty.
`binwidth`	width of the bins. Overrides bins.

Details

Currently supports only the bins and binwidth arguments (see geom_histogram), but not the center, boundary, or breaks arguments.

By default, the facet plots are arranged in a single column. This can be changed with the optional ncol argument.

If palette is NULL, and monochrome is FALSE, plot colors will be chosen from the default ggplot2 palette. Setting palette to NULL allows the user to choose a non-Brewer palette, for example with scale_fill_manual. For consistency with previous releases, ShadowHist defaults to monochrome = FALSE, while ShadowPlot defaults to monochrome = TRUE.

For further discussion of plotting background data behind each facet, see https://drsimonj.svbtle.com/plotting-background-data-for-groups-with-ggplot2
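
The general "shadow" idea can be sketched directly in ggplot2: a layer whose data lacks the faceting column is repeated in every facet. This is an illustrative sketch of that technique, not the ShadowHist implementation; the object names shadow and p are hypothetical.

```r
library(ggplot2)

# Sketch only: background ("shadow") histogram behind each faceted histogram.
# Dropping the faceting column makes ggplot2 draw this layer in every facet.
shadow <- iris[, setdiff(names(iris), "Species"), drop = FALSE]

p <- ggplot(iris, aes(x = Petal.Length)) +
  geom_histogram(data = shadow, fill = "gray", alpha = 0.5, bins = 30) +  # unconditioned data
  geom_histogram(aes(fill = Species), bins = 30) +                        # per-species data
  facet_wrap(~Species, ncol = 1)
```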

Value

a ggplot2 histogram plot

Examples


if (requireNamespace('data.table', quietly = TRUE)) {
  # don't multi-thread during CRAN checks
  data.table::setDTthreads(1)
}

ShadowHist(iris, "Petal.Length", "Species",
           title = "Petal Length distribution by Species")

if (FALSE) {
# make all the facets the same color
ShadowHist(iris, "Petal.Length", "Species",
           monochrome=TRUE,
           title = "Petal Length distribution by Species")
}

Plot a Shadow Bar Plot

Description

Plot a bar chart of row counts conditioned on the categorical variable condvar, faceted on a second categorical variable, refinevar. Each faceted plot also shows a "shadow plot" of the totals conditioned on condvar alone.

Usage

ShadowPlot(
  frm,
  condvar,
  refinevar,
  title,
  ...,
  monochrome = TRUE,
  palette = "Dark2",
  fillcolor = "darkblue",
  ncol = 1
)

Arguments

`frm`	data frame to get values from.
`condvar`	name of the primary conditioning variable (a categorical variable, controls x-axis).
`refinevar`	name of the second or refining conditioning variable (also a categorical variable, controls faceting).
`title`	title to place on plot.
`...`	no unnamed argument, added to force named binding of later arguments.
`monochrome`	logical: if TRUE, all facets filled with same color
`palette`	character: if monochrome==FALSE, name of brewer color palette (can be NULL)
`fillcolor`	character: if monochrome==TRUE, name of fill color for bars
`ncol`	numeric: number of columns in facet_wrap.

Details

This plot enables comparisons of subpopulation totals across both condvar and refinevar simultaneously.
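
The quantities being compared can be tabulated in base R. This sketch only reproduces the counts that such a plot would display (per-facet bars and their shadow totals); it does not draw anything, and the names counts and shadow_totals are hypothetical.

```r
# Sketch only: the counts behind a shadow bar plot of mtcars.
# Rows are condvar (carb); columns are refinevar (cyl).
counts <- table(carb = mtcars$carb, cyl = mtcars$cyl)

# shadow bars: totals conditioned on carb alone, identical in every facet
shadow_totals <- rowSums(counts)

# foreground bars for one facet, here cyl == 4
counts[, "4"]
```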

By default, the facet plots are arranged in a single column. This can be changed with the optional ncol argument.

If palette is NULL, and monochrome is FALSE, plot colors will be chosen from the default ggplot2 palette. Setting palette to NULL allows the user to choose a non-Brewer palette, for example with scale_fill_manual. For consistency with previous releases, ShadowPlot defaults to monochrome = TRUE, while ShadowHist defaults to monochrome = FALSE.

For further discussion of plotting background data behind each facet, see https://drsimonj.svbtle.com/plotting-background-data-for-groups-with-ggplot2

Value

a ggplot2 bar chart counting examples grouped by condvar, faceted by refinevar.

Examples


if (requireNamespace('data.table', quietly = TRUE)) {
  # don't multi-thread during CRAN checks
  data.table::setDTthreads(1)
}

ShadowPlot(mtcars, "carb", "cyl",
           title = "Number of example cars by carb and cyl counts")

if (FALSE) {
# color-code the facets
ShadowPlot(mtcars, "carb", "cyl",
           monochrome = FALSE,
           title = "Number of example cars by carb and cyl counts")
}

Simulate the deprecated `ggplot2::aes_string()`.

Description

Allows code of the form ggplot2::aes_string(...) to be replaced with code of the form ggplot2::aes(!!!simulate_aes_string(...)). The purpose is to get out of the way of the deprecation and possible future removal of ggplot2::aes_string(). Inspired by the research at https://stackoverflow.com/a/74424353/6901725.

Usage

simulate_aes_string(...)

Arguments

`...`	named string arguments to turn into symbols using 'rlang::data_sym()'.

Value

rlang NSE expressions that simulate string-valued aesthetics, suitable for splicing into ggplot2::aes() with !!! (complex, but needed for newer versions of ggplot2).
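
To make the mechanism concrete, here is a minimal sketch of a helper in the same spirit. The function strings_to_data_syms is hypothetical and is not the WVPlots implementation; it only assumes that rlang::data_sym() turns a column name into a .data$ reference that can be spliced into ggplot2::aes().

```r
# Sketch only: a hypothetical stand-in, NOT the WVPlots implementation.
# Each named string becomes a `.data$<name>` expression via rlang::data_sym().
strings_to_data_syms <- function(...) {
  args <- list(...)
  lapply(args, rlang::data_sym)
}

syms <- strings_to_data_syms(x = "x", y = "y")

# splice the named list of expressions into aes()
mapping <- ggplot2::aes(!!!syms)
```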

Examples


if (requireNamespace('data.table', quietly = TRUE)) {
  # don't multi-thread during CRAN checks
  data.table::setDTthreads(1)
}

d <- data.frame(x = c(1, 2, 3), y = c(4, 5, 6))
xvar <- 'x'  # the idea is, this is passed in and not known at coding time
yvar <- 'y'
# what we want:
#  ggplot2::ggplot(data = d, mapping = ggplot2::aes_string(x = xvar, y = yvar)) +
#     ggplot2::geom_point()
# The required "tidy evaluation ideoms[sic] with `aes()`".
ggplot2::ggplot(data = d, mapping = ggplot2::aes(!!!simulate_aes_string(x = xvar, y = yvar))) +
   ggplot2::geom_point()

Plot classifier metrics as a function of thresholds.

Description

Plot classifier metrics as a function of thresholds.

Usage

ThresholdPlot(
  frame,
  xvar,
  truthVar,
  title,
  ...,
  metrics = c("sensitivity", "specificity"),
  truth_target = TRUE,
  points_to_plot = NULL,
  monochrome = TRUE,
  palette = "Dark2",
  linecolor = "black"
)

Arguments

`frame`	data frame to get values from
`xvar`	column of scores
`truthVar`	column of true outcomes
`title`	title to place on plot
`...`	no unnamed argument, added to force named binding of later arguments.
`metrics`	metrics to be computed. See Details for the list of allowed metrics
`truth_target`	truth value considered to be positive.
`points_to_plot`	how many data points to use for plotting. Defaults to NULL (all data)
`monochrome`	logical: if TRUE, all subgraphs plotted in same color
`palette`	character: if monochrome==FALSE, name of brewer color palette (can be NULL)
`linecolor`	character: if monochrome==TRUE, name of line color

Details

By default, ThresholdPlot plots sensitivity and specificity of a classifier as a function of the decision threshold. Plotting sensitivity-specificity (or other metrics) as a function of classifier score helps identify a score threshold that achieves an acceptable tradeoff among desirable properties.

ThresholdPlot can plot a number of metrics. Some of the metrics are redundant, in keeping with the customary terminology of various analysis communities.

sensitivity: fraction of actual positives that were predicted to be positive (also known as the true positive rate)
specificity: fraction of actual negatives that were predicted to be negative (or 1 - false_positive_rate)
precision: fraction of predicted positives that are true positives
recall: same as sensitivity or true positive rate
accuracy: fraction of items correctly decided
false_positive_rate: fraction of all negatives that were predicted to be positive
true_positive_rate: fraction of all positives that were predicted to be positive
false_negative_rate: fraction of all positives that were predicted to be negative
true_negative_rate: fraction of all negatives that were predicted to be negative

For example, plotting sensitivity/false_positive_rate as functions of threshold will "unroll" an ROC Plot.
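
The metric definitions can be checked by hand at a single threshold. The sketch below is base R only; it assumes an item is predicted positive when its score is at or above the threshold, which may not match the tie-handling ThresholdPlot uses, and the function name threshold_metrics is hypothetical.

```r
# Sketch only: classifier metrics at one decision threshold.
# Assumption: predicted positive means score >= threshold.
threshold_metrics <- function(score, truth, threshold) {
  pred <- score >= threshold
  tp <- sum(pred & truth)     # true positives
  fp <- sum(pred & !truth)    # false positives
  tn <- sum(!pred & !truth)   # true negatives
  fn <- sum(!pred & truth)    # false negatives
  c(sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    precision = tp / (tp + fp),
    accuracy = (tp + tn) / length(truth),
    false_positive_rate = fp / (fp + tn))
}

score <- c(0.1, 0.4, 0.35, 0.8, 0.7, 0.2)
truth <- c(FALSE, FALSE, TRUE, TRUE, TRUE, FALSE)

m_high <- threshold_metrics(score, truth, threshold = 0.5)
m_low <- threshold_metrics(score, truth, threshold = 0.3)
```

Lowering the threshold from 0.5 to 0.3 raises sensitivity and lowers specificity on this toy data, which is the tradeoff the plot is meant to expose.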

ThresholdPlot can also plot distribution diagnostics about the scores:

fraction: the fraction of data points that scored greater than a given threshold
cdf: CDF or 1 - fraction; the fraction of data points that scored less than a given threshold
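
These two diagnostics are complements of each other, as a short base R sketch shows. It assumes a strict "greater than" comparison for fraction, so scores exactly at the threshold fall on the cdf side; the package's handling of ties is not documented here.

```r
# Sketch only: score-distribution diagnostics at one threshold.
# Assumption: fraction counts scores strictly greater than the threshold.
score <- c(0.1, 0.4, 0.35, 0.8, 0.7, 0.2)
threshold <- 0.5

fraction <- mean(score > threshold)   # share of scores above the threshold
cdf <- 1 - fraction                   # share of scores at or below the threshold

# stats::ecdf() gives the same value for the empirical CDF
cdf_check <- ecdf(score)(threshold)
```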

Plots are in a single column, in the order specified by metrics.

Examples


if (requireNamespace('data.table', quietly = TRUE)) {
  # don't multi-thread during CRAN checks
  data.table::setDTthreads(1)
}

# data with two different regimes of behavior
d <- rbind(
  data.frame(
    x =  rnorm(1000),
    y = sample(c(TRUE, FALSE), prob = c(0.02, 0.98), size = 1000, replace = TRUE)),
  data.frame(
    x =  rnorm(200) + 5,
    y = sample(c(TRUE, FALSE), size = 200, replace = TRUE))
)

# Sensitivity/Specificity examples
ThresholdPlot(d, 'x', 'y',
   title = 'Sensitivity/Specificity',
   metrics = c('sensitivity', 'specificity'),
   truth_target = TRUE)
if(FALSE) {
MetricPairPlot(d, 'x', 'y',
   x_metric = 'false_positive_rate',
   y_metric = 'true_positive_rate',
   truth_target = TRUE,
   title = 'ROC equivalent')
ROCPlot(d, 'x', 'y',
   truthTarget = TRUE,
   title = 'ROC example')

# Precision/Recall examples
ThresholdPlot(d, 'x', 'y',
   title = 'precision/recall',
   metrics = c('recall', 'precision'),
   truth_target = TRUE)
MetricPairPlot(d, 'x', 'y',
   x_metric = 'recall',
   y_metric = 'precision',
   title = 'recall/precision',
   truth_target = TRUE)
PRPlot(d, 'x', 'y',
   truthTarget = TRUE,
   title = 'p/r plot')
}

Package 'WVPlots'

Help Index

WVPlots: Common Plots for Analysis

Description

Details

Author(s)

See Also

Plot a scatter plot of a binary variable with smoothing curve.

Description

Usage

Arguments

Details

Examples

Plot a Cleveland dot plot.

Description

Usage

Arguments

Details

Examples

Plot a scatter plot with smoothing line.

Description

Usage

Arguments

Details

Examples

Plot distribution of a single discrete numerical variable.

Description

Usage

Arguments

Examples

Plot two density plots conditioned on an outcome variable.

Description

Usage

Arguments

Details

Examples

Plot two histograms conditioned on an outcome variable.

Description

Usage

Arguments

Details

Examples

Plot the cumulative gain curve of a sort-order.

Description

Usage

Arguments

Details

Examples

Plot the cumulative gain curve of a sort-order with costs.

Description

Usage

Arguments

Details

See Also

Examples

Plot the cumulative gain curves of a sort-order.

Description

Usage

Arguments

Details

Examples

Plot the cumulative gain curve of a sort-order with extra notation

Description

Usage

Arguments

Details

See Also

Examples

Build a hex bin plot

Description

Usage

Arguments

Details

Value

See Also

Examples

Plot the cumulative lift curve of a sort-order.

Description

Usage

Arguments