Title: 'rquery' for 'data.table'
Description: Implements the 'rquery' piped Codd-style query algebra using 'data.table'. This allows for a high-speed in-memory implementation of Codd-style data manipulation tools.
Authors: John Mount [aut, cre], Win-Vector LLC [cph]
Maintainer: John Mount <[email protected]>
License: GPL-2 | GPL-3
Version: 1.3.3
Built: 2024-10-13 03:02:22 UTC
Source: https://github.com/winvector/rqdatatable
rqdatatable: Relational Query Generator for Data Manipulation Implemented by data.table

Implements the rquery piped query algebra using data.table. This allows for a high-speed in-memory implementation of Codd-style data manipulation tools.

Maintainer: John Mount <[email protected]>
Other contributors: Win-Vector LLC [copyright holder]
Useful links: report bugs at https://github.com/WinVector/rqdatatable/issues
ex_data_table: execute an rquery pipeline with data.table sources.

data.tables are looked for by name in the tables argument and in the execution environment. Main external execution interface.
ex_data_table(
  optree,
  ...,
  tables = list(),
  source_usage = NULL,
  source_limit = NULL,
  env = parent.frame()
)
optree: relop operations tree.
...: not used; forces later arguments to bind by name.
tables: named list mapping table names used in nodes to data.tables and data.frames.
source_usage: list mapping source table names to vectors of columns used.
source_limit: if not NULL, limit all table sources to no more than this many rows (used for debugging).
env: environment to work in.
ex_data_table_step.relop_drop_columns: implement drop columns
ex_data_table_step.relop_extend: implement extend/assign operator
ex_data_table_step.relop_natural_join: implement natural join
ex_data_table_step.relop_non_sql: direct function (non-sql) operator (not implemented for data.table)
ex_data_table_step.relop_null_replace: implement NA/NULL replacement
ex_data_table_step.relop_orderby: implement row ordering
ex_data_table_step.relop_project: implement projection (grouped aggregation)
ex_data_table_step.relop_rename_columns: implement column renaming
ex_data_table_step.relop_select_columns: implement select columns
ex_data_table_step.relop_select_rows: implement select rows
ex_data_table_step.relop_sql: direct sql operator (not implemented for data.table)
ex_data_table_step.relop_table_source: implement data source
ex_data_table_step.relop_theta_join: implement theta join (not implemented for data.table)
ex_data_table_step.relop_unionall: implement row binding
Value: the resulting data.table (intermediate tables can sometimes be mutated, as is common practice with data.table).
a <- data.table::data.table(x = c(1, 2),
                            y = c(20, 30),
                            z = c(300, 400))
optree <- local_td(a) %.>%
  select_columns(., c("x", "y")) %.>%
  select_rows_nse(., x < 2 & y < 30)
cat(format(optree))
ex_data_table(optree)
# other ways to execute the pipeline include
data.frame(x = 0, y = 4, z = 400) %.>% optree
ex_data_table_parallel: execute an rquery pipeline with data.table in parallel, partitioned by a given column.

Note: usually the overhead of partitioning and distributing the work will far outweigh any parallel speedup. Also, data.table itself already exploits some thread-level parallelism (one often sees user time > elapsed time). Requires the parallel package. For a worked example with significant speedup please see https://github.com/WinVector/rqdatatable/blob/master/extras/Parallel_rqdatatable.md.
ex_data_table_parallel(
  optree,
  partition_column,
  cl = NULL,
  ...,
  tables = list(),
  source_limit = NULL,
  debug = FALSE,
  env = parent.frame()
)
optree: relop operations tree.
partition_column: character, name of column to partition work by.
cl: a cluster object, created by package parallel or by package snow. If NULL, use the registered default cluster.
...: not used; forces later arguments to bind by name.
tables: named list mapping table names used in nodes to data.tables and data.frames.
source_limit: if not NULL, limit all table sources to no more than this many rows (used for debugging).
debug: logical; if TRUE, use lapply instead of parallel::clusterApplyLB.
env: environment to look for values in.
Care must be taken that the calculation partitioning is coarse enough to ensure a correct calculation. For example: anything one is joining on, aggregating over, or ranking over must be grouped so that all rows affecting a given result row are in the same level of the partition.
resulting data.table (intermediate tables can sometimes be mutated as is practice with data.table).
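As a small sketch (the cluster size, table, and column names here are illustrative; on a toy table like this the partitioning overhead will dominate any speedup), a partitioned run might look like:

```r
library(rquery)
library(rqdatatable)
library(parallel)

cl <- parallel::makeCluster(2)

d <- data.frame(
  g = rep(c("a", "b"), each = 3),
  v = 1:6)

# partition by "g": all rows of a given group land in the same chunk,
# so the per-group sum below is computed correctly within each partition
optree <- local_td(d) %.>%
  extend_nse(., total %:=% sum(v), partitionby = "g")

ex_data_table_parallel(optree, partition_column = "g", cl = cl)

parallel::stopCluster(cl)
```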
Throws an error if this method is called, signalling that a specific data.table implementation is needed for this method.
## Default S3 method:
ex_data_table_step(
  optree,
  ...,
  tables = list(),
  source_usage = NULL,
  source_limit = NULL,
  env = parent.frame()
)
optree: relop operations tree.
...: not used; forces later arguments to bind by name.
tables: named list mapping table names used in nodes to data.tables and data.frames.
source_usage: list mapping source table names to vectors of columns used.
source_limit: if not NULL, limit all table sources to no more than this many rows (used for debugging).
env: environment to work in.
data.table based implementation.
## S3 method for class 'relop_drop_columns'
ex_data_table_step(
  optree,
  ...,
  tables = list(),
  source_limit = NULL,
  source_usage = NULL,
  env = parent.frame()
)
optree: relop operations tree.
...: not used; forces later arguments to bind by name.
tables: named list mapping table names used in nodes to data.tables and data.frames.
source_limit: if not NULL, limit all table sources to no more than this many rows (used for debugging).
source_usage: list mapping source table names to vectors of columns used.
env: environment to work in.
dL <- data.frame(x = 1, y = 2, z = 3)
rquery_pipeline <- local_td(dL) %.>%
  drop_columns(., "y")
dL %.>% rquery_pipeline
data.table based implementation.
## S3 method for class 'relop_extend'
ex_data_table_step(
  optree,
  ...,
  tables = list(),
  source_usage = NULL,
  source_limit = NULL,
  env = parent.frame()
)
optree: relop operations tree.
...: not used; forces later arguments to bind by name.
tables: named list mapping table names used in nodes to data.tables and data.frames.
source_usage: list mapping source table names to vectors of columns used.
source_limit: if not NULL, limit all table sources to no more than this many rows (used for debugging).
env: environment to work in.
Will re-order columns if there are ordering terms.
dL <- build_frame(
    "subjectID", "surveyCategory"     , "assessmentTotal", "one" |
    1          , "withdrawal behavior", 5                , 1     |
    1          , "positive re-framing", 2                , 1     |
    2          , "withdrawal behavior", 3                , 1     |
    2          , "positive re-framing", 4                , 1     )
rquery_pipeline <- local_td(dL) %.>%
  extend_nse(.,
             probability %:=% exp(assessmentTotal * 0.237) /
                              sum(exp(assessmentTotal * 0.237)),
             count %:=% sum(one),
             rank %:=% rank(),
             orderby = c("assessmentTotal", "surveyCategory"),
             reverse = c("assessmentTotal"),
             partitionby = 'subjectID') %.>%
  orderby(., c("subjectID", "probability"))
dL %.>% rquery_pipeline
data.table based implementation.
## S3 method for class 'relop_natural_join'
ex_data_table_step(
  optree,
  ...,
  tables = list(),
  source_usage = NULL,
  source_limit = NULL,
  env = parent.frame()
)
optree: relop operations tree.
...: not used; forces later arguments to bind by name.
tables: named list mapping table names used in nodes to data.tables and data.frames.
source_usage: list mapping source table names to vectors of columns used.
source_limit: if not NULL, limit all table sources to no more than this many rows (used for debugging).
env: environment to work in.
d1 <- build_frame(
    "key", "val", "val1" |
    "a"  , 1    , 10     |
    "b"  , 2    , 11     |
    "c"  , 3    , 12     )
d2 <- build_frame(
    "key", "val", "val2" |
    "a"  , 5    , 13     |
    "b"  , 6    , 14     |
    "d"  , 7    , 15     )

# key matching join
optree <- natural_join(local_td(d1), local_td(d2),
                       jointype = "FULL", by = 'key')
ex_data_table(optree)

# full cross-product join
# (usually with jointype = "FULL", but "LEFT" is more
#  compatible with rquery field merge semantics).
optree2 <- natural_join(local_td(d1), local_td(d2),
                        jointype = "LEFT", by = NULL)
ex_data_table(optree2)
# notice ALL non-"by" fields coalesce to the left table.
data.table case. Passes a single table to a function that takes a single data.frame as its argument, and returns a single data.frame.
## S3 method for class 'relop_non_sql'
ex_data_table_step(
  optree,
  ...,
  tables = list(),
  source_usage = NULL,
  source_limit = NULL,
  env = parent.frame()
)
optree: relop operations tree.
...: not used; forces later arguments to bind by name.
tables: named list mapping table names used in nodes to data.tables and data.frames.
source_usage: list mapping source table names to vectors of columns used.
source_limit: if not NULL, limit all table sources to no more than this many rows (used for debugging).
env: environment to work in.
See also: rq_df_funciton_node, rq_df_grouped_funciton_node
set.seed(3252)
d <- data.frame(a = rnorm(1000), b = rnorm(1000))

optree <- local_td(d) %.>%
  quantile_node(.)
d %.>% optree

p2 <- local_td(d) %.>%
  rsummary_node(.)
d %.>% p2

summary(d)
data.table based implementation.
## S3 method for class 'relop_null_replace'
ex_data_table_step(
  optree,
  ...,
  tables = list(),
  source_usage = NULL,
  source_limit = NULL,
  env = parent.frame()
)
optree: relop operations tree.
...: not used; forces later arguments to bind by name.
tables: named list mapping table names used in nodes to data.tables and data.frames.
source_usage: list mapping source table names to vectors of columns used.
source_limit: if not NULL, limit all table sources to no more than this many rows (used for debugging).
env: environment to work in.
dL <- build_frame(
    "x", "y" |
    2L , 5   |
    NA , 7   |
    NA , NA  )
rquery_pipeline <- local_td(dL) %.>%
  null_replace(., c("x", "y"), 0, note_col = "nna")
dL %.>% rquery_pipeline
data.table based implementation.
## S3 method for class 'relop_order_expr'
ex_data_table_step(
  optree,
  ...,
  tables = list(),
  source_usage = NULL,
  source_limit = NULL,
  env = parent.frame()
)
optree: relop operations tree.
...: not used; forces later arguments to bind by name.
tables: named list mapping table names used in nodes to data.tables and data.frames.
source_usage: list mapping source table names to vectors of columns used.
source_limit: if not NULL, limit all table sources to no more than this many rows (used for debugging).
env: environment to work in.
dL <- build_frame(
    "x", "y" |
    2L , "b" |
    -4L, "a" |
    3L , "c" )
rquery_pipeline <- local_td(dL) %.>%
  order_expr(., abs(x))
dL %.>% rquery_pipeline
data.table based implementation.
## S3 method for class 'relop_orderby'
ex_data_table_step(
  optree,
  ...,
  tables = list(),
  source_usage = NULL,
  source_limit = NULL,
  env = parent.frame()
)
optree: relop operations tree.
...: not used; forces later arguments to bind by name.
tables: named list mapping table names used in nodes to data.tables and data.frames.
source_usage: list mapping source table names to vectors of columns used.
source_limit: if not NULL, limit all table sources to no more than this many rows (used for debugging).
env: environment to work in.
dL <- build_frame(
    "x", "y" |
    2L , "b" |
    1L , "a" |
    3L , "c" )
rquery_pipeline <- local_td(dL) %.>%
  orderby(., "y")
dL %.>% rquery_pipeline
data.table based implementation.
## S3 method for class 'relop_project'
ex_data_table_step(
  optree,
  ...,
  tables = list(),
  source_usage = NULL,
  source_limit = NULL,
  env = parent.frame()
)
optree: relop operations tree.
...: not used; forces later arguments to bind by name.
tables: named list mapping table names used in nodes to data.tables and data.frames.
source_usage: list mapping source table names to vectors of columns used.
source_limit: if not NULL, limit all table sources to no more than this many rows (used for debugging).
env: environment to work in.
dL <- build_frame(
    "subjectID", "surveyCategory"     , "assessmentTotal" |
    1          , "withdrawal behavior", 5                 |
    1          , "positive re-framing", 2                 |
    2          , "withdrawal behavior", 3                 |
    2          , "positive re-framing", 4                 )
test_p <- local_td(dL) %.>%
  project(.,
          maxscore := max(assessmentTotal),
          count := n(),
          groupby = 'subjectID')
cat(format(test_p))
dL %.>% test_p
data.table based implementation.
## S3 method for class 'relop_rename_columns'
ex_data_table_step(
  optree,
  ...,
  tables = list(),
  source_usage = NULL,
  source_limit = NULL,
  env = parent.frame()
)
optree: relop operations tree.
...: not used; forces later arguments to bind by name.
tables: named list mapping table names used in nodes to data.tables and data.frames.
source_usage: list mapping source table names to vectors of columns used.
source_limit: if not NULL, limit all table sources to no more than this many rows (used for debugging).
env: environment to work in.
dL <- build_frame(
    "x", "y" |
    2L , "b" |
    1L , "a" |
    3L , "c" )
rquery_pipeline <- local_td(dL) %.>%
  rename_columns(., c("x" = "y", "y" = "x"))
dL %.>% rquery_pipeline
data.table based implementation.
## S3 method for class 'relop_select_columns'
ex_data_table_step(
  optree,
  ...,
  tables = list(),
  source_usage = NULL,
  source_limit = NULL,
  env = parent.frame()
)
optree: relop operations tree.
...: not used; forces later arguments to bind by name.
tables: named list mapping table names used in nodes to data.tables and data.frames.
source_usage: list mapping source table names to vectors of columns used.
source_limit: if not NULL, limit all table sources to no more than this many rows (used for debugging).
env: environment to work in.
dL <- data.frame(x = 1, y = 2, z = 3)
rquery_pipeline <- local_td(dL) %.>%
  select_columns(., "y")
dL %.>% rquery_pipeline
data.table based implementation.
## S3 method for class 'relop_select_rows'
ex_data_table_step(
  optree,
  ...,
  tables = list(),
  source_usage = NULL,
  source_limit = NULL,
  env = parent.frame()
)
optree: relop operations tree.
...: not used; forces later arguments to bind by name.
tables: named list mapping table names used in nodes to data.tables and data.frames.
source_usage: list mapping source table names to vectors of columns used.
source_limit: if not NULL, limit all table sources to no more than this many rows (used for debugging).
env: environment to work in.
dL <- build_frame(
    "x", "y" |
    2L , "b" |
    1L , "a" |
    3L , "c" )
rquery_pipeline <- local_td(dL) %.>%
  select_rows_nse(., x <= 2)
dL %.>% rquery_pipeline
data.table based implementation.
## S3 method for class 'relop_set_indicator'
ex_data_table_step(
  optree,
  ...,
  tables = list(),
  source_usage = NULL,
  source_limit = NULL,
  env = parent.frame()
)
optree: relop operations tree.
...: not used; forces later arguments to bind by name.
tables: named list mapping table names used in nodes to data.tables and data.frames.
source_usage: list mapping source table names to vectors of columns used.
source_limit: if not NULL, limit all table sources to no more than this many rows (used for debugging).
env: environment to work in.
d <- data.frame(a = c("1", "2", "1", "3"),
                b = c("1", "1", "3", "2"),
                q = 1,
                stringsAsFactors = FALSE)
set <- c("1", "2")
op_tree <- local_td(d) %.>%
  set_indicator(., "one_two", "a", set) %.>%
  set_indicator(., "z", "a", c())
d %.>% op_tree
Execute one step using the rquery.rquery_db_executor SQL supplier. Note: it is not good practice to use SQL nodes in pipelines intended for data.table (loss of class information and the cost of data transfer). This implementation is only here for completeness.
## S3 method for class 'relop_sql'
ex_data_table_step(
  optree,
  ...,
  tables = list(),
  source_usage = NULL,
  source_limit = NULL,
  env = parent.frame()
)
optree: relop operations tree.
...: not used; forces later arguments to bind by name.
tables: named list mapping table names used in nodes to data.tables and data.frames.
source_usage: list mapping source table names to vectors of columns used.
source_limit: if not NULL, limit all table sources to no more than this many rows (used for debugging).
env: environment to work in.
# WARNING: example tries to change the rquery.rquery_db_executor option
# to RSQLite and back.
if (requireNamespace("DBI", quietly = TRUE) &&
    requireNamespace("RSQLite", quietly = TRUE)) {
  # example database connection
  my_db <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
  old_o <- options(list("rquery.rquery_db_executor" = list(db = my_db)))

  # example data
  d <- data.frame(v1 = c(1, 2, NA, 3),
                  v2 = c(NA, "b", NA, "c"),
                  v3 = c(NA, NA, 7, 8),
                  stringsAsFactors = FALSE)

  # example xform
  vars <- column_names(d)
  # build a NA/NULLs per-row counting expression.
  # names are "quoted" by wrapping them with as.name().
  # constants can be quoted by an additional list wrapping.
  expr <- lapply(vars,
                 function(vi) {
                   list("+ (CASE WHEN (",
                        as.name(vi),
                        "IS NULL ) THEN 1.0 ELSE 0.0 END)")
                 })
  expr <- unlist(expr, recursive = FALSE)
  expr <- c(list(0.0), expr)

  # instantiate the operator node
  op_tree <- local_td(d) %.>%
    sql_node(., "num_missing" %:=% list(expr))
  cat(format(op_tree))

  d %.>% op_tree

  options(old_o)
  DBI::dbDisconnect(my_db)
}
data.table based implementation. Looks for tables first in tables and then in env. Will accept any data.frame that can be converted to data.table.
## S3 method for class 'relop_table_source'
ex_data_table_step(
  optree,
  ...,
  tables = list(),
  source_usage = NULL,
  source_limit = NULL,
  env = parent.frame()
)
optree: relop operations tree.
...: not used; forces later arguments to bind by name.
tables: named list mapping table names used in nodes to data.tables and data.frames.
source_usage: list mapping source table names to vectors of columns used.
source_limit: if not NULL, limit all table sources to no more than this many rows (used for debugging).
env: environment to work in.
dL <- build_frame(
    "x", "y" |
    2L , "b" |
    1L , "a" |
    3L , "c" )
rquery_pipeline <- local_td(dL)
dL %.>% rquery_pipeline
Limited implementation. All terms must be of the form: "(table1.col CMP table2.col) (, (table1.col CMP table2.col) )".
## S3 method for class 'relop_theta_join'
ex_data_table_step(
  optree,
  ...,
  tables = list(),
  source_usage = NULL,
  source_limit = NULL,
  env = parent.frame()
)
optree: relop operations tree.
...: not used; forces later arguments to bind by name.
tables: named list mapping table names used in nodes to data.tables and data.frames.
source_usage: list mapping source table names to vectors of columns used.
source_limit: if not NULL, limit all table sources to no more than this many rows (used for debugging).
env: environment to work in.
d1 <- data.frame(AUC = 0.6, R2 = 0.2)
d2 <- data.frame(AUC2 = 0.4, R2 = 0.3)
optree <- theta_join_se(local_td(d1), local_td(d2),
                        "AUC >= AUC2")
ex_data_table(optree,
              tables = list(d1 = d1, d2 = d2)) %.>%
  print(.)
data.table based implementation.
## S3 method for class 'relop_unionall'
ex_data_table_step(
  optree,
  ...,
  tables = list(),
  source_usage = NULL,
  source_limit = NULL,
  env = parent.frame()
)
optree: relop operations tree.
...: not used; forces later arguments to bind by name.
tables: named list mapping table names used in nodes to data.tables and data.frames.
source_usage: list mapping source table names to vectors of columns used.
source_limit: if not NULL, limit all table sources to no more than this many rows (used for debugging).
env: environment to work in.
dL <- build_frame(
    "x", "y" |
    2L , "b" |
    1L , "a" |
    3L , "c" )
rquery_pipeline <- unionall(list(local_td(dL), local_td(dL)))
dL %.>% rquery_pipeline
Build data.table implementation of lookup_by_column. We do this here as rqdatatable is a data.table aware package (and rquery is not).
make_dt_lookup_by_column(pick, result)
pick: character scalar, name of column to control value choices.
result: character scalar, name of column to place values in.

Value: f_dt() function.
df <- data.frame(x = c(1, 2, 3, 4),
                 y = c(5, 6, 7, 8),
                 choice = c("x", "y", "x", "z"),
                 stringsAsFactors = FALSE)
make_dt_lookup_by_column("choice", "derived")(df)

# # base-R implementation
# df %.>% lookup_by_column(., "choice", "derived")
#
# # data.table implementation (requires rquery 1.1.0, or newer)
# # df %.>% lookup_by_column(., "choice", "derived",
# #                          f_dt_factory = rqdatatable::make_dt_lookup_by_column)
Helper to build data.table capable non-sql nodes.
rq_df_funciton_node(
  .,
  f,
  ...,
  f_db = NULL,
  columns_produced,
  display_form,
  orig_columns = FALSE
)
.: data.table or data.frame input.
f: function that takes a data.table to a data.frame (or data.table).
...: force later arguments to bind by name.
f_db: implementation signature: f_db(db, incoming_table_name, outgoing_table_name, nd, ...) (db being a database handle). NULL defaults to using f.
columns_produced: character, columns produced by f.
display_form: display form for the node.
orig_columns: if TRUE, assume all input columns are present in the derived table.
relop non-sql node implementation.
See also: ex_data_table_step.relop_non_sql, rq_df_grouped_funciton_node
# a node generator is something an expert can
# write and part-time R users can use.
grouped_regression_node <- function(., group_col = "group", xvar = "x", yvar = "y") {
  force(group_col)
  formula_str <- paste(yvar, "~", xvar)
  f <- function(df, nd = NULL) {
    dlist <- split(df, df[[group_col]])
    clist <- lapply(dlist,
                    function(di) {
                      mi <- lm(as.formula(formula_str), data = di)
                      ci <- as.data.frame(summary(mi)$coefficients)
                      ci$Variable <- rownames(ci)
                      rownames(ci) <- NULL
                      ci[[group_col]] <- di[[group_col]][[1]]
                      ci
                    })
    data.table::rbindlist(clist)
  }
  columns_produced <- c("Variable",
                        "Estimate", "Std. Error", "t value", "Pr(>|t|)",
                        group_col)
  rq_df_funciton_node(
    ., f,
    columns_produced = columns_produced,
    display_form = paste0(yvar, "~", xvar, " grouped by ", group_col))
}

# work an example
set.seed(3265)
d <- data.frame(x = rnorm(1000),
                y = rnorm(1000),
                group = sample(letters[1:5], 1000, replace = TRUE),
                stringsAsFactors = FALSE)
rquery_pipeline <- local_td(d) %.>%
  grouped_regression_node(.)
cat(format(rquery_pipeline))
d %.>% rquery_pipeline
Helper to build data.table capable non-sql nodes.
rq_df_grouped_funciton_node(
  .,
  f,
  ...,
  f_db = NULL,
  columns_produced,
  group_col,
  display_form
)
.: data.table or data.frame input.
f: function that takes a data.table to a data.frame (or data.table).
...: force later arguments to bind by name.
f_db: implementation signature: f_db(db, incoming_table_name, outgoing_table_name) (db being a database handle). NULL defaults to using f.
columns_produced: character, columns produced by f.
group_col: character, column to split by.
display_form: display form for the node.
relop non-sql node implementation.
See also: ex_data_table_step.relop_non_sql, rq_df_funciton_node
# a node generator is something an expert can
# write and part-time R users can use.
grouped_regression_node <- function(., group_col = "group", xvar = "x", yvar = "y") {
  force(group_col)
  formula_str <- paste(yvar, "~", xvar)
  f <- function(di) {
    mi <- lm(as.formula(formula_str), data = di)
    ci <- as.data.frame(summary(mi)$coefficients)
    ci$Variable <- rownames(ci)
    rownames(ci) <- NULL
    colnames(ci) <- c("Estimate", "Std_Error", "t_value", "p_value", "Variable")
    ci
  }
  columns_produced <- c("Estimate", "Std_Error", "t_value", "p_value", "Variable",
                        group_col)
  rq_df_grouped_funciton_node(
    ., f,
    columns_produced = columns_produced,
    group_col = group_col,
    display_form = paste0(yvar, "~", xvar, " grouped by ", group_col))
}

# work an example
set.seed(3265)
d <- data.frame(x = rnorm(1000),
                y = rnorm(1000),
                group = sample(letters[1:5], 1000, replace = TRUE),
                stringsAsFactors = FALSE)
rquery_pipeline <- local_td(d) %.>%
  grouped_regression_node(.)
cat(format(rquery_pipeline))
d %.>% rquery_pipeline
Sets rqdatatable (and hence data.table) as the default executor for rquery.
set_rqdatatable_as_executor()
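A minimal sketch of typical use (the data and pipeline below are illustrative): after registering the executor, piping a data.frame into an rquery pipeline executes in memory via data.table:

```r
library(rquery)
library(rqdatatable)

# register rqdatatable (and hence data.table) as rquery's default executor
set_rqdatatable_as_executor()

d <- data.frame(x = c(1, 2, 3), y = c(10, 20, 30))
pipeline <- local_td(d) %.>%
  select_rows_nse(., x > 1)

# with the executor set, this runs via the data.table implementation
d %.>% pipeline
```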