Creating multiple columns by different functions of different variables

Creating multiple columns by different functions of different variables - r

I'm having trouble transitioning to data.table. I am trying to group by some categorical variables, and apply a list of functions that each target different variables in order to create new columns. This is something that seems like it should be easy with mapply or Map, but I can't figure out to assemble the proper subset in the to pass to the functions.
Here is what it looks like,
set.seed(2015)
dat <- data.table(cat1 = factor('Total'),
cat2 = factor(rep(letters[1:4], 5)),
cat3 = factor(rep(1:4, each=5)),
var1 = sample(20),
var2 = sample(20),
var3 = sample(20))
## I have list of factor columns to group by
groups <- c(paste0("cat", 1:3))
setkeyv(dat, groups)
## List of functions, and corresponding list of column names that
## they are to be applied to. So, in this example I should get
## two new columns: V1=sum(var1) and V2=mean(var2, var3)
thing <- function(...) mean(c(...), na.rm=TRUE) # arbitrary function
funs <- list("sum", "thing") # named functions
targets <- list("var1", c("var2", "var3")) # variables
outnames <- funs # names or result columns
## Can't get this part
f <- function(fn, vars) do.call(fn, vars)
dat[, outnames := Map(f, funs, targets), by=groups]
The result for this example should be like this
dat[, `:=`(sum=sum(var1), thing=thing(var2, var3)), by=groups]

We need to subset the dataset columns based on the column names in the 'targets' list. One way would to loop through the list elements of 'targets' and subset the data.table (.SD[, x, with=FALSE]), and then apply the function.
dat[, unlist(outnames) := Map(f, funs, lapply(targets, function(x)
.SD[, ..x])), by = groups]

Related

Obtaining a vector with sapply and use it to remove rows from dataframes in a list with lapply

I have a list with dataframes:
df1 <- data.frame(id = seq(1:10), name = LETTERS[1:10])
df2 <- data.frame(id = seq(11:20), name = LETTERS[11:20])
mylist <- list(df1, df2)
I want to remove rows from each dataframe in the list based on a condition (in this case, the value stored in column id). I create an empty vector where I will store the ids:
ids_to_remove <- c()
Then I apply my function:
sapply(mylist, function(df) {
rows_above_th <- df[(df$id > 8),] # select the rows from each df above a threshold
a <- rows_above_th$id # obtain the ids of the rows above the threshold
ids_to_remove <- append(ids_to_remove, a) # append each id to the vector
},
simplify = T
)
However, with or without simplify = T, this returns a matrix, while my desired output (ids_to_remove) would be a vector containing the ids, like this:
ids_to_remove <- c(9,10,9,10)
Because lastly I would use it in this way on single dataframes:
for(i in 1:length(ids_to_remove)){
mylist[[1]] <- mylist[[1]] %>%
filter(!id == ids_to_remove[i])
}
And like this on the whole list (which is not working and I don´t get why):
i = 1
lapply(mylist,
function(df) {
for(i in 1:length(ids_to_remove)){
df <- df %>%
filter(!id == ids_to_remove[i])
i = i + 1
}
} )
I get the errors may be in the append part of the sapply and maybe in the indexing of the lapply. I played around a bit but couldn´t still find the errors (or a better way to do this).
EDIT: original data has 70 dataframes (in a list) for a total of 2 million rows

If you are using sapply/lapply you want to avoid trying to change the values of global variables. Instead, you should return the values you want. For example generate a vector if IDs to remove for each item in the list as a list
ids_to_remove <- lapply(mylist, function(df) {
rows_above_th <- df[(df$id > 8),] # select the rows from each df above a threshold
rows_above_th$id # obtain the ids of the rows above the threshold
})
And then you can use that list with your data list and mapply to iterate the two lists together
mapply(function(data, ids) {
data %>% dplyr::filter(!id %in% ids)
}, mylist, ids_to_remove, SIMPLIFY=FALSE)

Using base R
Map(\(x, y) subset(x, !id %in% y), mylist, ids_to_remove)

fast replacement of data.table values by labels stored in another data.table

It is related to this question and this other one, although to a larger scale.
I have two data.tables:
The first one with market research data, containing answers stored as integers;
The second one being what can be called a dictionary, with category labels associated to the integers mentioned above.
See reproducible example :
EDIT: Addition of a new variable to include the '0' case.
EDIT 2: Modification of 'age_group' variable to include cases where all unique levels of a factor do not appear in data.
library(data.table)
library(magrittr)
# Table with survey data :
# - each observation contains the answers of a person
# - variables describe the sample population characteristics (gender, age...)
# - numeric variables (like age) are also stored as character vectors
repex_DT <- data.table (
country = as.character(c(1,3,4,2,NA,1,2,2,2,4,NA,2,1,1,3,4,4,4,NA,1)),
gender = as.character(c(NA,2,2,NA,1,1,1,2,2,1,NA,2,1,1,1,2,2,1,2,NA)),
age = as.character(c(18,40,50,NA,NA,22,30,52,64,24,NA,38,16,20,30,40,41,33,59,NA)),
age_group = as.character(c(2,2,2,NA,NA,2,2,2,2,2,NA,2,2,2,2,2,2,2,2,NA)),
status = as.character(c(1,NA,2,9,2,1,9,2,2,1,9,2,1,1,NA,2,2,1,2,9)),
children = as.character(c(0,2,3,1,6,1,4,2,4,NA,NA,2,1,1,NA,NA,3,5,2,1))
)
# Table of the labels associated to categorical variables, plus 'label_id' to match the values
labels_DT <- data.table (
label_id = as.character(c(1:9)),
country = as.character(c("COUNTRY 1","COUNTRY 2","COUNTRY 3","COUNTRY 4",NA,NA,NA,NA,NA)),
gender = as.character(c("Male","Female",NA,NA,NA,NA,NA,NA,NA)),
age_group = as.character(c("Less than 35","35 and more",NA,NA,NA,NA,NA,NA,NA)),
status = as.character(c("Employed","Unemployed",NA,NA,NA,NA,NA,NA,"Do not want to say")),
children = as.character(c("0","1","2","3","4","5 and more",NA,NA,NA))
)
# Identification of the variable nature (numeric or character)
var_type <- c("character","character","numeric","character","character","character")
# Identification of the categorical variable names
categorical_var <- names(repex_DT)[which(var_type == "character")]
You can see that the dictionary table is smaller to the survey data table, this is expected.
Also, despite all variables being stored as character, some are true numeric variables like age, and consequently do not appear in the dictionary table.
My objective is to replace the values of all variables of the first data.table with a matching name in the dictionary table by its corresponding label.
I have actually achieved it using a loop, like the one below:
result_DT1 <- copy(repex_DT)
for (x in categorical_var){
if(length(which(repex_DT[[x]]=="0"))==0){
values_vector <- labels_DT$label_id
labels_vector <- labels_DT[[x]]
}else{
values_vector <- c("0",labels_DT$label_id)
labels_vector <- c(labels_DT[[x]][1:(length(labels_DT[[x]])-1)], NA, labels_DT[[x]][length(labels_DT[[x]])])}
result_DT1[, (c(x)) := plyr::mapvalues(x=get(x), from=values_vector, to=labels_vector, warn_missing = F)]
}
What I want is a faster method (the fastest if one exists), since I have thousands of variables to qualify for dozens of thousands of records.
Any performance improvements would be more than welcome. I battled with stringi but could not have the function running without errors unless using hard-coded variable names. See example:
test_stringi <- copy(repex_DT) %>%
.[, (c("country")) := lapply(.SD, function(x) stringi::stri_replace_all_fixed(
str=x, pattern=unique(labels_DT$label_id)[!is.na(labels_DT[["country"]])],
replacement=unique(na.omit(labels_DT[["country"]])), vectorize_all=FALSE)),
.SDcols = c("country")]

Columns of your 2nd data.table are just look up vectors:
same_cols <- intersect(names(repex_DT), names(labels_DT))
repex_DT[
,
(same_cols) := mapply(
function(x, y) y[as.integer(x)],
repex_DT[, same_cols, with = FALSE],
labels_DT[, same_cols, with = FALSE],
SIMPLIFY = FALSE
)
]
edit
you can add NA on first position in columns of labels_DT (similar like you did for other missing values) or better yet you can keep labels in list:
labels_list <- list(
country = c("COUNTRY 1","COUNTRY 2","COUNTRY 3","COUNTRY 4"),
gender = c("Male","Female"),
age_group = c("Less than 35","35 and more"),
status = c("Employed","Unemployed","Do not want to say"),
children = c("0","1","2","3","4","5 and more")
)
same_cols <- names(labels_list)
repex_DT[
,
(same_cols) := mapply(
function(x, y) y[factor(as.integer(x))],
repex_DT[, same_cols, with = FALSE],
labels_list,
SIMPLIFY = FALSE
)
]
Notice that this way it is necessary to convert to factor first because values in repex_DT can be are not sequance 1, 2, 3...

a very computationally effective way would be to melt your tables first, match them and cast again:
repex_DT[, idx:= .I] # Create an index used for melting
# Melt
repex_melt <- melt(repex_DT, id.vars = "idx")
labels_melt <- melt(labels_DT, id.vars = "label_id")
# Match variables and value/label_id
repex_melt[labels_melt, value2:= i.value, on= c("variable", "value==label_id")]
# Put the data back into its original shape
result <- dcast(repex_melt, idx~variable, value.var = "value2")

I finally found time to work on an answer to this matter.
I changed my approach and used fastmatch::fmatch to identify labels to update.
As pointed out by #det, it is not possible to consider variables with a starting '0' label in the same loop than other standard categorical variables, so the instruction is basically repeated twice.
Still, this is much faster than my initial for loop approach.
The answer below:
library(data.table)
library(magrittr)
library(stringi)
library(fastmatch)
#Selection of variable names depending on the presence of '0' labels
same_cols_with0 <- intersect(names(repex_DT), names(labels_DT))[
which(intersect(names(repex_DT), names(labels_DT)) %fin%
names(repex_DT)[which(unlist(lapply(repex_DT, function(x)
sum(stri_detect_regex(x, pattern="^0$", negate=FALSE), na.rm=TRUE)),
use.names=FALSE)>=1)])]
same_cols_standard <- intersect(names(repex_DT), names(labels_DT))[
which(!(intersect(names(repex_DT), names(labels_DT)) %fin% same_cols_with0))]
labels_std <- labels_DT[, same_cols_standard, with=FALSE]
labels_0 <- labels_DT[, same_cols_with0, with=FALSE]
levels_id <- as.integer(labels_DT$label_id)
#Update joins via matching IDs (credit to #det for mapply syntax).
result_DT <- data.table::copy(repex_DT) %>%
.[, (same_cols_standard) := mapply(
function(x, y) y[fastmatch::fmatch(x=as.integer(x), table=levels_id, nomatch=NA)],
repex_DT[, same_cols_standard, with=FALSE], labels_std, SIMPLIFY=FALSE)] %>%
.[, (same_cols_with0) := mapply(
function(x, y) y[fastmatch::fmatch(x=as.integer(x), table=(levels_id - 1), nomatch=NA)],
repex_DT[, same_cols_with0, with=FALSE], labels_0, SIMPLIFY=FALSE)]

Repeating calculation with different DF

I have around 10 DFs and would like to perform the following calculations on all of them and then have out as 10 new DFs.
I have been able to get this to work for 1 DF, but rather than copying the code and changing the names, 10 times, I wanted to see if there is a way to do this in. Ideally, I end up with 1 DF and 10 different columns, but am happy with anything
The calculations I am trying to do are:
temp <- merge (x=DF1, y=temp1, by = c("name"), all.x= TRUE)
asset_column <-grep("^Assets_", names(DF1))
return_column <-grep("^Return_", names(DF1))
OutputDF <-
stack(colSums(t(t(temp[asset_column])/colSums(temp[asset_column],
na.rm=TRUE)) * US_only[return_column],na.rm =TRUE))
OutputDF['values'] = OutputDF['values']/100

If these are repeatable calculations in a list, loop through the list with lapply and do the same code where we specify the first dataset from the anonymous function call (function(x) x)
out <- lapply(lst1, function(x) {
temp <- merge (x, y=temp1, by = c("name"), all.x= TRUE)
asset_column <-grep("^Assets_", names(x))
return_column <-grep("^Return_", names(x))
OutputDF <-
stack(colSums(t(t(temp[asset_column])/colSums(temp[asset_column],
na.rm=TRUE)) * US_only[return_column],na.rm =TRUE))
OutputDF['values'] = OutputDF['values']/100
OutputDF
})
Here, the output is also a list of data.frames which can be kept in the list as such or extract with [[

for loop and a function combined to calculate a formula and then a regression

A little confused with how I am trying to acheive the results I want.
I have an environment in R which consists of 5 data.frames called df[i]
So;
df1
df2
df3
df4
df5
Inside of these df´s I have 5 columns called col[j]
col1
col2
col3
col4
col5
In total I have 25 columns across 5 data frames (5 df x 5 col).
I also have a static variable called R which is a vector of numbers
I am trying to calculate for each column of each dataframe a basic formula using a function/loop. The formula for column 1 of df1 would be;
Y = df1$col1 - R
I am trying to calculate this and repeat for each colum[j:5] in df[i:5] and store it in a new data.frame
j <- 1:5
i <- 1:5
fun <- function(x){
for(i in 1:col[j](df[i])){
Y[j] <- col[j] - R
}
}
EDIT: Added comment below for easier reading.
Y1a = df1$col1 - R
Y2a = df1$col2 - R
Y3a = df1$col3 - R
.....
.....
Y1b = df2$col1 - R
Y2b = df2$col2 - R
Y3b = df2$col3 - R
..... etc

# Put your data in a list:
dflist = mget(paste0("df", 1:5))
# Apply your function to every data frame
ylist = lapply(dflist, function(x) x - R)
# Name the resulting columns y1:y5
ylist = lapply(ylist, setNames, paste0("y", 1:5))
Have a look at How to make a list of data frames for examples and discussion of why using lists is better.

tidyverse version
dplyr::mutate_all apply a fonction to each column of a data.frame.
So I would do like that:
all_df <- list(df1, df2, df3, df4, df5)
map(all_df, function(x) mutate_all(x, function(y) y - R))
It should return you a list of length 5. Each df contains your desired statistic.

Mapply to Add Column to Each Dataframe in a List

Implemented some code from previous question:
Lapply to Add Columns to Each Dataframe in a List
Using the method above, I receive corrupt data. While I cannot provide actual data, I am wondering if additional arguments need to be implemented to prevent shuffling.
Basically, this:
Require: data.table
df1 <- data.frame(x = runif(3), y = runif(3))
df2 <- data.frame(x = runif(3), y = runif(3))
dfs <- list(df1, df2)
years <- list(2013, 2014)
a<-Map(cbind, dfs, year = years)
final<-rbindlist(a)
But applied to a list of thousands of data frame lists has incorrect results. Assume that some data frames, say df 1.5 somewhere between two above data frames, are empty. Would that affect the order in which the Map binds the years to the dfs? Essentially, I have an output with some data belonging to different years than the Map attached it to. I tested the length and order of years list, and compared it to the output year in final. They are identical. Any thoughts?

We create a logical index based on the length of each element in 'dfs', use that to subset both the 'dfs' and the 'years' and then do the cbind with Map
i1 <- sapply(dfs, length)>1
Or to make it more stringent
i1 <- sapply(dfs, function(x) is.data.frame(x) & !is.null(x) & length(x) >0 )
a <- Map(cbind, dfs[i1], year = years[i1])
and then do the rbindlist with fill = TRUE in case the number of columns are not the same in all the data.frames in the `list.
rbindlist(a, fill = TRUE)
data
dfs[[3]] <- list(NULL)
dfs[[4]] <- data.frame()
years <- 2013:2016

Use the idcol argument to rbindlist and add the year column afterwards:
res = rbindlist(dfs, idcol=TRUE)
res[.(.id = 1:2, year = 2013:2014), on=".id", year := i.year]
X[i, on=cols, z := i.z] merges X with i on cols and then copies z from i to X.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Creating multiple columns by different functions of different variables - r

Related

Obtaining a vector with sapply and use it to remove rows from dataframes in a list with lapply

fast replacement of data.table values by labels stored in another data.table

Repeating calculation with different DF

for loop and a function combined to calculate a formula and then a regression

Mapply to Add Column to Each Dataframe in a List

Categories

Resources