--- title: "rquery Many Columns" author: "John Mount, Win-Vector LLC" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{rquery Many Columns} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- A common data manipulation need is: applying the same operation to a number of columns. In the [`rquery`](https://github.com/WinVector/rquery) [`R`](https://www.r-project.org) package we strongly recommend using value-oriented (or standard evaluation) for this task. Here is what this looks like. For our data lets build a simple data set similar to `mtcars`: ```{r} library(wrapr) library(rquery) have_rqdatatable <- requireNamespace('rqdatatable', quietly = TRUE) example_data <- wrapr::build_frame( "mpg" , "cyl", "disp", "car" | 21 , 6 , 160 , "Mazda RX4" | 21 , 6 , 160 , "Mazda RX4 Wag" | 22.8 , 4 , 108 , "Datsun 710" | 21.4 , 6 , 258 , "Hornet 4 Drive" | 18.7 , 8 , 360 , "Hornet Sportabout" | 18.1 , 6 , 225 , "Valiant" ) knitr::kable(example_data) ``` Now suppose for a number of columns we wish to perform a calculation, such centering it with respect to the grand average. This is easily handled by first specifying the set of variables we wish to work with. ```{r} vars <- setdiff(colnames(example_data), 'car') print(vars) ``` Now we build up what we want as a name-vector of strings using the [`:=` named map builder](https://winvector.github.io/wrapr/articles/named_map_builder.html). ```{r} expressions <- vars := paste0(vars, ' - mean(', vars, ')') print(expressions) ``` The idea is: the `:=` operator fits into `R` idiom by looking very much like a vectorized version of "names get assigned expressions". These expressions can then be used in an `rquery` pipeline using the `_se`-variant of `extend()`: `extend_se()`. ```{r} ops <- local_td(example_data) %.>% extend_se(., expressions) cat(format(ops)) ``` And this operator pipeline is ready to use (assuming we have `rqdatatable` available): ```{r} if(have_rqdatatable) { example_data %.>% ops %.>% knitr::kable(.) } ``` The expression construction can also be done inside the `extend_se()` operator. ```{r} ops <- local_td(example_data) %.>% extend_se(., vars := paste0(vars, ' - mean(', vars, ')')) cat(format(ops)) ``` Note: the above is only a notional example to demonstrate the operations; for supervised machine learning we would probably use base::scale(), which saves the learned centering for later re-use on new data. The point is: we use standard `R` tools to build up the lists of names and operations. We are not restricted to any single argument manipulation grammar. For example we could build all interaction terms as follows. ```{r} combos <- t(combn(vars, 2)) interactions <- paste0(combos[, 1], '_', combos[, 2]) := paste0(combos[, 1], ' * ', combos[, 2]) print(interactions) ``` ```{r} ops <- local_td(example_data) %.>% extend_se(., interactions) cat(format(ops)) ``` It is a critical advantage to work with sets of variables as standard values. In this case what we can do is limited only by the power of `R` itself. Note: we also supply an alias for `:=` as `%:=%` for those that don't want to confuse this assignment with how the symbol is used in `data.table`. Take care that `:=` has the precedence-level of an assignment and `%:=%` has the precedence-level of a user defined operator. As, as always, our queries can be used on data. ```{r} if(have_rqdatatable) { example_data %.>% ops %.>% knitr::kable(.) } ``` Or even in a database. ```{r} have_db <- requireNamespace("DBI", quietly = TRUE) && requireNamespace("RSQLite", quietly = TRUE) if(have_db) { raw_connection <- DBI::dbConnect(RSQLite::SQLite(), ":memory:") RSQLite::initExtension(raw_connection) db <- rquery_db_info( connection = raw_connection, is_dbi = TRUE, connection_options = rq_connection_tests(raw_connection)) rq_copy_to(db, 'example_data', example_data, temporary = TRUE, overwrite = TRUE) sql <- to_sql(ops, db) cat(format(sql)) } ``` ```{r} if(have_db) { res_table <- materialize(db, ops) DBI::dbReadTable(raw_connection, res_table$table_name) %.>% knitr::kable(.) } ``` ```{r} if(have_db) { DBI::dbDisconnect(raw_connection) } ```