Title: | extra functionality for dplyr like mutate_rows for mutation of a subset of rows |
---|---|
Description: | Some extra functionality that is not (yet) in dplyr, e.g. mutate_rows (mutation of subset of rows) xsummarise_each (summarise_each with more flexible alignment of results), or s_filter, s_arrange ,... that allow string arguments. |
Authors: | Sebastian Kranz |
Maintainer: | Sebastian Kranz <[email protected]> |
License: | GPL>=2.0 |
Version: | 0.1.4 |
Built: | 2024-11-28 02:52:31 UTC |
Source: | https://github.com/skranz/dplyrExtras |
If dt is a data table, then modify is essentially just a wrapper for data.table syntax that allows modification or creation of new columns. If dt is not a data table, it will by default be converted to a data table and then transformed and returned as the original data frame. Unlike mutate from dplyr, one can use the .SD argument in function calls, which can quite useful sometimes.
modify(.dt, .if, .by = NULL, ..., .envir = parent.frame(), .inplace = is.data.table(.dt), .as.data.table = is.data.table(.dt))
modify(.dt, .if, .by = NULL, ..., .envir = parent.frame(), .inplace = is.data.table(.dt), .as.data.table = is.data.table(.dt))
.dt |
a data.table |
.if |
optional a boolean conditions that specifies the rows that shall be modifed |
.by |
optional a vector of column names used for computations that are splitted by groups |
... |
formulas for columns that are modified or newly created |
.envir |
optional an environment in which the expressions shall be evaluated if variables are not found in .dt |
.inplace |
allows .dt inplace modification (TRUE if .dt is a data table) |
.as.data.table |
shall result be a data.table (only true if .dt is a data.table) |
## Not run: library(microbenchmark) K = 3 n = 10 dt = data.table(a= sample(1:3,n,replace=TRUE), b= sample(1:100,n,replace=TRUE), x=rnorm(n)) df = as.data.frame(dt) # Set x to 100 where a==2 modify(dt,a==2, y=x+100, z=y+K) modify(df,a==2,y=200,z=y*5+K) dt[,y:=x+2] # Set x to the mean value of b*100 in each group of a modify(dt,.by=c("a"), x=mean(b)*100) dt # Call with strings com = "x=200" s_modify(dt,"a==2", com) dt ## End(Not run)
## Not run: library(microbenchmark) K = 3 n = 10 dt = data.table(a= sample(1:3,n,replace=TRUE), b= sample(1:100,n,replace=TRUE), x=rnorm(n)) df = as.data.frame(dt) # Set x to 100 where a==2 modify(dt,a==2, y=x+100, z=y+K) modify(df,a==2,y=200,z=y*5+K) dt[,y:=x+2] # Set x to the mean value of b*100 in each group of a modify(dt,.by=c("a"), x=mean(b)*100) dt # Call with strings com = "x=200" s_modify(dt,"a==2", com) dt ## End(Not run)
Old pseudonym for mutate_rows. Originally the function mutate_rows was called mutate_if, however, dplyr 5.0 introduced also a function called mutate_if that does something different however. I keep the synonym mutate_if in order to reduce the probability that old code breaks, but it is recommended to use mutate_rows instead.#'
mutate_if(.data, .if, ...)
mutate_if(.data, .if, ...)
.data |
the data |
.if |
a logical condition that selects rows, e.g. a=="B" |
... |
the command to mutate existing columns |
change values of columns only in rows that satisfy the .if condition Note: you cannot create new columns with mutate_rows but only change values in selected rows of existing columns
mutate_rows(.data, .if, ...)
mutate_rows(.data, .if, ...)
.data |
the data |
.if |
a logical condition that selects rows, e.g. a=="B" |
... |
the command to mutate existing columns |
## Not run: library(microbenchmark) library(dplyr) library(data.table) library(dplyrExtras) # create a data set.seed(123456) n = 10 df = data.frame(a= sample(1:3,n,replace=TRUE), b= sample(1:100,n,replace=TRUE), x=rnorm(n)) dt = as.data.table(df) # different calls to mutate_rows mutate_rows(df,a==3,y=100) mutate_rows(tbl_df(df),a==1,x=200) mutate_rows(as.tbl(df),a==1,x=300,b=400) mutate_rows(dt,a==1 | a==2,x=400) mutate_rows(group_by(dt,a),a==1 | a==2,x=mean(b)) mutate_rows(group_by(df,a),a==1 | a==2,x=mean(b)) # if you create a new column rows that don't # match the if condition have an NA mutate_rows(df,a==3,z=100) # You can only have one if condition in a mutate_rows call # So multiple changes require nesting or piping library(magrittr) df mutate_rows(a==2,z=200) # For historical reasons there is also still the synonym # mutate_if. But not that dplyr 5.0 has introduced its own # mutate_if function with quite different functionality mutate_if(df,a==3,y=99) # Small benchmark: compare with mutate + ifelse n = 1e6 df = data.frame(a= sample(1:3,n,replace=TRUE), b= sample(1:100,n,replace=TRUE), x=rnorm(n)) microbenchmark(times = 5L, mutate(df, x=ifelse(a==2,x+100,x)), mutate(df, x=if_else(a==2,x+100,x)), mutate_rows(df, a==2, x=x+100) ) # mean # ifelse 540.7145 # if_else 360.4928 # mutate_rows 114.3891 ## End(Not run)
## Not run: library(microbenchmark) library(dplyr) library(data.table) library(dplyrExtras) # create a data set.seed(123456) n = 10 df = data.frame(a= sample(1:3,n,replace=TRUE), b= sample(1:100,n,replace=TRUE), x=rnorm(n)) dt = as.data.table(df) # different calls to mutate_rows mutate_rows(df,a==3,y=100) mutate_rows(tbl_df(df),a==1,x=200) mutate_rows(as.tbl(df),a==1,x=300,b=400) mutate_rows(dt,a==1 | a==2,x=400) mutate_rows(group_by(dt,a),a==1 | a==2,x=mean(b)) mutate_rows(group_by(df,a),a==1 | a==2,x=mean(b)) # if you create a new column rows that don't # match the if condition have an NA mutate_rows(df,a==3,z=100) # You can only have one if condition in a mutate_rows call # So multiple changes require nesting or piping library(magrittr) df mutate_rows(a==2,z=200) # For historical reasons there is also still the synonym # mutate_if. But not that dplyr 5.0 has introduced its own # mutate_if function with quite different functionality mutate_if(df,a==3,y=99) # Small benchmark: compare with mutate + ifelse n = 1e6 df = data.frame(a= sample(1:3,n,replace=TRUE), b= sample(1:100,n,replace=TRUE), x=rnorm(n)) microbenchmark(times = 5L, mutate(df, x=ifelse(a==2,x+100,x)), mutate(df, x=if_else(a==2,x+100,x)), mutate_rows(df, a==2, x=x+100) ) # mean # ifelse 540.7145 # if_else 360.4928 # mutate_rows 114.3891 ## End(Not run)
Inplace remove columns from a data.table
remove_cols(dt, cols)
remove_cols(dt, cols)
dt |
a data.table |
cols |
a string with column names |
## Not run: dt = data.table(a=1:3, b=1:3, c=4:6) dt remove_cols(dt,c("a" ,"b")) dt ## End(Not run)
## Not run: dt = data.table(a=1:3, b=1:3, c=4:6) dt remove_cols(dt,c("a" ,"b")) dt ## End(Not run)
Modified version of dplyr's arrange that uses string arguments
s_arrange(.data, ...)
s_arrange(.data, ...)
.data |
the data frame (or similar object) |
... |
string version of arguments to original dplyr function |
## Not run: library(dplyrExtras) d = mtcars cols = c("mpg","cyl","hp:vs") s_select(d,cols) # also works and yields identical result... cols = c("mpg","cyl, hp:vs") s_select(d,cols) s_filter(d,"gear == 3","cyl == 8") s_arrange(d, "-mpg, gear, carb") gd = s_group_by(d,"cyl") s_summarise(gd, "mean(disp), max(disp)") s_mutate_if(d, "cyl==6, new.col=100") ## End(Not run)
## Not run: library(dplyrExtras) d = mtcars cols = c("mpg","cyl","hp:vs") s_select(d,cols) # also works and yields identical result... cols = c("mpg","cyl, hp:vs") s_select(d,cols) s_filter(d,"gear == 3","cyl == 8") s_arrange(d, "-mpg, gear, carb") gd = s_group_by(d,"cyl") s_summarise(gd, "mean(disp), max(disp)") s_mutate_if(d, "cyl==6, new.col=100") ## End(Not run)
Modified version of dplyr's filter that uses string arguments
s_filter(.data, ...)
s_filter(.data, ...)
.data |
the data frame (or similar object) |
... |
string version of arguments to original dplyr function |
## Not run: library(dplyrExtras) d = mtcars cols = c("mpg","cyl","hp:vs") s_select(d,cols) # also works and yields identical result... cols = c("mpg","cyl, hp:vs") s_select(d,cols) s_filter(d,"gear == 3","cyl == 8") s_arrange(d, "-mpg, gear, carb") gd = s_group_by(d,"cyl") s_summarise(gd, "mean(disp), max(disp)") s_mutate_if(d, "cyl==6, new.col=100") ## End(Not run)
## Not run: library(dplyrExtras) d = mtcars cols = c("mpg","cyl","hp:vs") s_select(d,cols) # also works and yields identical result... cols = c("mpg","cyl, hp:vs") s_select(d,cols) s_filter(d,"gear == 3","cyl == 8") s_arrange(d, "-mpg, gear, carb") gd = s_group_by(d,"cyl") s_summarise(gd, "mean(disp), max(disp)") s_mutate_if(d, "cyl==6, new.col=100") ## End(Not run)
Modified version of dplyr's group_by that uses string arguments
s_group_by(.data, ...)
s_group_by(.data, ...)
.data |
the data frame (or similar object) |
... |
string version of arguments to original dplyr function |
Modified version of modify that uses string arguments
s_modify(.dt, ..., .envir = parent.frame())
s_modify(.dt, ..., .envir = parent.frame())
.dt |
the data.table / similar object that shall be modified |
... |
string version of arguments of modify (see example) |
.envir |
environment in which arguments will be evaluated (relevant if variables outside .dt are used) |
Modified version of dplyr's mutate that uses string arguments
s_mutate(.data, ...)
s_mutate(.data, ...)
.data |
the data frame (or similar object) |
... |
string version of arguments to original dplyr function |
## Not run: library(dplyrExtras) d = mtcars cols = c("mpg","cyl","hp:vs") s_select(d,cols) # also works and yields identical result... cols = c("mpg","cyl, hp:vs") s_select(d,cols) s_filter(d,"gear == 3","cyl == 8") s_arrange(d, "-mpg, gear, carb") gd = s_group_by(d,"cyl") s_summarise(gd, "mean(disp), max(disp)") s_mutate_if(d, "cyl==6, new.col=100") ## End(Not run)
## Not run: library(dplyrExtras) d = mtcars cols = c("mpg","cyl","hp:vs") s_select(d,cols) # also works and yields identical result... cols = c("mpg","cyl, hp:vs") s_select(d,cols) s_filter(d,"gear == 3","cyl == 8") s_arrange(d, "-mpg, gear, carb") gd = s_group_by(d,"cyl") s_summarise(gd, "mean(disp), max(disp)") s_mutate_if(d, "cyl==6, new.col=100") ## End(Not run)
Modified version of mutate_if that uses string arguments
s_mutate_if(.data, ...)
s_mutate_if(.data, ...)
.data |
the data frame (or similar object) |
... |
string version of arguments to mutate_if |
## Not run: library(dplyrExtras) d = mtcars cols = c("mpg","cyl","hp:vs") s_select(d,cols) # also works and yields identical result... cols = c("mpg","cyl, hp:vs") s_select(d,cols) s_filter(d,"gear == 3","cyl == 8") s_arrange(d, "-mpg, gear, carb") gd = s_group_by(d,"cyl") s_summarise(gd, "mean(disp), max(disp)") s_mutate_if(d, "cyl==6, new.col=100") ## End(Not run)
## Not run: library(dplyrExtras) d = mtcars cols = c("mpg","cyl","hp:vs") s_select(d,cols) # also works and yields identical result... cols = c("mpg","cyl, hp:vs") s_select(d,cols) s_filter(d,"gear == 3","cyl == 8") s_arrange(d, "-mpg, gear, carb") gd = s_group_by(d,"cyl") s_summarise(gd, "mean(disp), max(disp)") s_mutate_if(d, "cyl==6, new.col=100") ## End(Not run)
Modified version of dplyr's select that uses string arguments
s_select(.data, ...)
s_select(.data, ...)
.data |
the data frame (or similar object) |
... |
string version of arguments to original dplyr function |
## Not run: library(dplyrExtras) d = mtcars cols = c("mpg","cyl","hp:vs") s_select(d,cols) # also works and yields identical result... cols = c("mpg","cyl, hp:vs") s_select(d,cols) s_filter(d,"gear == 3","cyl == 8") s_arrange(d, "-mpg, gear, carb") gd = s_group_by(d,"cyl") s_summarise(gd, "mean(disp), max(disp)") s_mutate_if(d, "cyl==6, new.col=100") ## End(Not run)
## Not run: library(dplyrExtras) d = mtcars cols = c("mpg","cyl","hp:vs") s_select(d,cols) # also works and yields identical result... cols = c("mpg","cyl, hp:vs") s_select(d,cols) s_filter(d,"gear == 3","cyl == 8") s_arrange(d, "-mpg, gear, carb") gd = s_group_by(d,"cyl") s_summarise(gd, "mean(disp), max(disp)") s_mutate_if(d, "cyl==6, new.col=100") ## End(Not run)
Modified version of summarise that uses string arguments
s_summarise(.data, ...)
s_summarise(.data, ...)
.data |
the data frame (or similar object) |
... |
string version of arguments to original dplyr function |
## Not run: library(dplyrExtras) d = mtcars cols = c("mpg","cyl","hp:vs") s_select(d,cols) # also works and yields identical result... cols = c("mpg","cyl, hp:vs") s_select(d,cols) s_filter(d,"gear == 3","cyl == 8") s_arrange(d, "-mpg, gear, carb") gd = s_group_by(d,"cyl") s_summarise(gd, "mean(disp), max(disp)") s_mutate_if(d, "cyl==6, new.col=100") ## End(Not run)
## Not run: library(dplyrExtras) d = mtcars cols = c("mpg","cyl","hp:vs") s_select(d,cols) # also works and yields identical result... cols = c("mpg","cyl, hp:vs") s_select(d,cols) s_filter(d,"gear == 3","cyl == 8") s_arrange(d, "-mpg, gear, carb") gd = s_group_by(d,"cyl") s_summarise(gd, "mean(disp), max(disp)") s_mutate_if(d, "cyl==6, new.col=100") ## End(Not run)
extended version of summarise_each that allows alternative arrangement of outputs
xsummarise_each(tbl, funs, ..., .long = "funs", .wide = NA, .fun.var = ".fun", .var.var = ".var")
xsummarise_each(tbl, funs, ..., .long = "funs", .wide = NA, .fun.var = ".fun", .var.var = ".var")
tbl |
the date frame |
funs |
a list of functions, see the help of summarize_each, best use the helper function funs() to create them |
... |
columns in tbl on which the functions should be applied |
.long |
either "funs" or "vars". Shall variables or functions be put in different rows? |
.wide |
currently ignored |
## Not run: library(dplyr) library(dplyrExtras) by_species <- group_by(iris,Species) summarise_each(by_species,funs(min=min, max=max, mean=mean), Petal.Width, Sepal.Width) xsummarise_each(by_species,funs(min=min, max=max(.,na.rm=TRUE)), Petal.Width, Sepal.Width, .long="vars", .var.var = "Variable") xsummarise_each(by_species,funs(min=min, max=max), Petal.Width, Sepal.Width, .long="funs") xsummarise_each_(by_species,funs(min, max), vars=c("Petal.Width", "Sepal.Width")) xsummarise_each_(by_species,funs(min, max), vars=c("Petal.Width", "Sepal.Width"), .long="vars", .var.var = "Variable") ## End(Not run)
## Not run: library(dplyr) library(dplyrExtras) by_species <- group_by(iris,Species) summarise_each(by_species,funs(min=min, max=max, mean=mean), Petal.Width, Sepal.Width) xsummarise_each(by_species,funs(min=min, max=max(.,na.rm=TRUE)), Petal.Width, Sepal.Width, .long="vars", .var.var = "Variable") xsummarise_each(by_species,funs(min=min, max=max), Petal.Width, Sepal.Width, .long="funs") xsummarise_each_(by_species,funs(min, max), vars=c("Petal.Width", "Sepal.Width")) xsummarise_each_(by_species,funs(min, max), vars=c("Petal.Width", "Sepal.Width"), .long="vars", .var.var = "Variable") ## End(Not run)
extended version of summarise_each_q that allows alternative arrangement of outputs
xsummarise_each_(tbl, funs, vars, .long = "funs", .wide = NA, fun.var = ".fun", var.var = ".var")
xsummarise_each_(tbl, funs, vars, .long = "funs", .wide = NA, fun.var = ".fun", var.var = ".var")
tbl |
the date frame |
funs |
a list of functions, see the help of summarize_each, best use the helper function funs() to create them |
vars |
columns in tbl on which the functions should be applied |
.long |
either "funs" or "vars". Shall variables or functions be put in different rows? |
.wide |
currently ignored |