Create multiple smaller dataframes from a larger one by IDs in R

I have a big dataframe of almost 2 million rows and 11 columns. I want to split it into multiple smaller dataframes by filtering on the first two columns. Here is an example of the data:
investor  asset  price  col4  col5  etc.
44KL      TSLA
451L      F
4639L     AAPL
44KL      UBI
44KL      F
I want to create a new single dataframe for each investor paired with a single asset.
This means I want the investor '44KL' to be split into three different dataframes called TSLA, UBI and F, and the same must apply to all the investors in my dataset.
I've tried a parallel approach. I first used unique() on the dataframe to create 'investor_ids' and 'asset_list', then I tried:
file_names <- investors %>%
  dplyr::filter(investor %in% investor_ids) %>%
  dplyr::filter(asset %in% asset_list) %>%
  dplyr::arrange(investor) %>%
  dplyr::mutate(name = stringr::str_c("INV", investor, asset, num_trx, stat, sep = "_")) %>%
  purrr::pluck("name")

for_asset <- function(df) {
  for (inv in investor) {
    for (ass in assets) {
      df <- subset(df, subset = asset == ass)
    }
  }
}
# Parallel --------------------------------------------------------------
cl <- parallel::makeCluster(parallel::detectCores())
doParallel::registerDoParallel(cl)
tictoc::tic()
foreach::foreach(i = seq_along(file_names), .errorhandling = "pass") %dopar% {
  df <- for_asset(db_test)
  nm <- paste0("dev/test-data/investors-rdata-assetbased/", file_names[i], ".RData")
  save(df, file = nm)
}
time <- tictoc::toc()
parallel::stopCluster(cl)
But I end up with the correct number of files, all containing just NULL values. Can you help me?
I then want to apply computations to the newly formed dataframes, so I need something easy to work with. I also tried split(), but I get a list of lists that I don't know how to work with.

Your for_asset() returns NULL because a for loop in R evaluates to NULL and the function never explicitly returns the subset, so every parallel task saves a NULL df. You don't need the nested loops at all; you can do this:
dfs <- split(df, df[c(1, 2)], drop = TRUE)
purrr::walk(names(dfs), function(d) {
  readr::write_csv(dfs[[d]], paste0("dev/test-data/investors-rdata-assetbased/", d, ".csv"))
})
A better option, by far, is to convert your df to a data.table, i.e.
library(data.table)
setDT(df)
and then work with each investor/asset subgroup directly, using df[i, j, by = .(investor, asset)], as in the sketch below.
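A minimal sketch of that idea, assuming your columns are named investor, asset and price as in the example (the summary statistics are purely illustrative; the output directory is the one from your question):

library(data.table)
setDT(df)

# one summary row per investor/asset pair, computed in a single grouped call
stats <- df[, .(n_trx = .N, mean_price = mean(price, na.rm = TRUE)), by = .(investor, asset)]

# or, if you really need one file per pair, write them from within the grouped call
df[, {
  readr::write_csv(.SD, paste0("dev/test-data/investors-rdata-assetbased/",
                               investor, "_", asset, ".csv"))
  NULL
}, by = .(investor, asset)]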

Related

Obtaining a vector with sapply and use it to remove rows from dataframes in a list with lapply

I have a list with dataframes:
df1 <- data.frame(id = seq(1:10), name = LETTERS[1:10])
df2 <- data.frame(id = seq(11:20), name = LETTERS[11:20])
mylist <- list(df1, df2)
I want to remove rows from each dataframe in the list based on a condition (in this case, the value stored in column id). I create an empty vector where I will store the ids:
ids_to_remove <- c()
Then I apply my function:
sapply(mylist, function(df) {
  rows_above_th <- df[(df$id > 8), ] # select the rows from each df above a threshold
  a <- rows_above_th$id # obtain the ids of the rows above the threshold
  ids_to_remove <- append(ids_to_remove, a) # append each id to the vector
},
simplify = T
)
However, with or without simplify = T, this returns a matrix, while my desired output (ids_to_remove) would be a vector containing the ids, like this:
ids_to_remove <- c(9,10,9,10)
Lastly, I would use it in this way on single dataframes:
for (i in 1:length(ids_to_remove)) {
  mylist[[1]] <- mylist[[1]] %>%
    filter(!id == ids_to_remove[i])
}
And like this on the whole list (which is not working and I don't get why):
i = 1
lapply(mylist,
       function(df) {
         for (i in 1:length(ids_to_remove)) {
           df <- df %>%
             filter(!id == ids_to_remove[i])
           i = i + 1
         }
       })
I guess the errors may be in the append part of the sapply, and maybe in the indexing of the lapply. I played around a bit but still couldn't find the errors (or a better way to do this).
EDIT: original data has 70 dataframes (in a list) for a total of 2 million rows
If you are using sapply/lapply you want to avoid trying to change the values of global variables. Instead, you should return the values you want. For example, generate a vector of IDs to remove for each item in the list, returned as a list:
ids_to_remove <- lapply(mylist, function(df) {
  rows_above_th <- df[(df$id > 8), ] # select the rows from each df above a threshold
  rows_above_th$id # obtain the ids of the rows above the threshold
})
And then you can use that list with your data list and mapply to iterate over the two lists together:
mapply(function(data, ids) {
  data %>% dplyr::filter(!id %in% ids)
}, mylist, ids_to_remove, SIMPLIFY = FALSE)
Using base R:
Map(\(x, y) subset(x, !id %in% y), mylist, ids_to_remove)
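For a quick end-to-end check with the question's example data, here is a self-contained sketch combining the two steps; with these inputs each cleaned dataframe should keep the eight rows with id <= 8:

df1 <- data.frame(id = seq(1:10), name = LETTERS[1:10])
df2 <- data.frame(id = seq(11:20), name = LETTERS[11:20]) # seq(11:20) reproduces the question's data, which also yields ids 1:10
mylist <- list(df1, df2)

ids_to_remove <- lapply(mylist, function(df) df$id[df$id > 8])
cleaned <- Map(\(x, y) subset(x, !id %in% y), mylist, ids_to_remove)
sapply(cleaned, nrow) # 8 8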

Speed up the processing time of for loop for big data in R

I have very large datasets: bdd_cases has 150,000 rows and bdd_control contains 15 million rows. Here I have reduced the size of these datasets and provided them as Drive links for simplicity. Among other things, I am trying to add matching rows from bdd_control to bdd_cases based on the cluster_case and subset variables.
I have the following for loop written for this purpose and it works perfectly for the small dataset example given here. It takes around 13 secs even for this small dataset.
#import data
id1 <- "199TNlYFwqzzWpi1iY5qX1-M11UoC51Cp"
id2 <- "1TeFCkqLDtEBz0JMBHh8goNWEjYol4O2z"
bdd_cases <- as.data.frame(read.csv(sprintf("https://docs.google.com/uc?id=%s&export=download", id1)))
bdd_control <- as.data.frame(read.csv(sprintf("https://docs.google.com/uc?id=%s&export=download", id2)))
#declare empty dataframe
bdd_temp <- NULL
list_p <- unique(bdd_cases$cluster_case)
#for loop
for (i in 1:length(list_p)) {
  temp <- bdd_cases %>%
    filter(cluster_case == list_p[i]) # select the ith case from bdd_cases
  temp0 <- bdd_control %>%
    filter(subset == temp$subset) # select the rows from bdd_control that match this case on the subset variable
  temp <- rbind(temp, temp0) # bind the two
  temp$cluster_case <- list_p[i] # add the ith cluster_case to all the rows
  temp <- temp %>%
    group_by(cluster_case) %>% # group by cluster case
    mutate(age_diff = abs(age - age[case_control == "case"]), # difference in age between case and controls
           fup_diff = foll_up - foll_up[case_control == "case"], # difference in foll_up between case and controls
           age_fup = ifelse(age_diff <= 2 & fup_diff == 0, "accept", "delete")) %>% # keep the matching controls, drop the rest for the ith cluster_case
    filter(age_fup == "accept") %>%
    select(-age_fup)
  bdd_temp <- bdd_temp %>% # finally add this matched case and its controls to the growing dataframe
    bind_rows(temp)
}
My problem arises when I try the same for loop on the original datasets with millions of rows: my program has now been running for 2 days on RStudio Server with 64 cores and 270 GB RAM.
I have referred to previous posts like this one (Speed up the loop operation in R), which talk about vectorisation and the use of lists instead of dataframes. However, I am not able to apply those ideas to my specific situation.
Are there any specific improvements I can make to the commands within my for loop that would speed up the execution?
Any little improvement in speed would mean a lot. Thanks.
This should speed things up considerably. On my system, the speed gain is about a factor of 5.
#import data
id1 <- "199TNlYFwqzzWpi1iY5qX1-M11UoC51Cp"
id2 <- "1TeFCkqLDtEBz0JMBHh8goNWEjYol4O2z"
library(data.table)
# use fread for reading, fast and get a nice progress bar as bonus
bdd_cases <- fread(sprintf("https://docs.google.com/uc?id=%s&export=download", id1))
bdd_control <- fread(sprintf("https://docs.google.com/uc?id=%s&export=download", id2))
#Put everything in a list
L <- lapply(unique(bdd_cases$cluster_case), function(x) {
  temp <- rbind(bdd_cases[cluster_case == x, ],
                bdd_control[subset == bdd_cases[cluster_case == x, ]$subset])
  temp[, cluster_case := x]
  temp[, `:=`(age_diff = abs(age - age[case_control == "case"]),
              fup_diff = foll_up - foll_up[case_control == "case"])]
  temp[age_diff <= 2 & fup_diff == 0, ]
})
#Rowbind the list
final <- rbindlist(L, use.names = TRUE, fill = TRUE)
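If the subset lookup is still the slow part on the full 15-million-row bdd_control, one further refinement (a hedged sketch; the actual gain on your data is an assumption) is to key the control table on subset, so each per-cluster lookup is a binary search instead of a full vector scan:

# assumes bdd_cases/bdd_control as loaded above, and one case row per cluster_case
setkey(bdd_control, subset)

L <- lapply(unique(bdd_cases$cluster_case), function(x) {
  case_row <- bdd_cases[cluster_case == x, ]
  ctrl_rows <- bdd_control[.(unique(case_row$subset)), nomatch = NULL] # keyed (binary-search) lookup
  temp <- rbind(case_row, ctrl_rows, fill = TRUE)
  temp[, cluster_case := x]
  temp[abs(age - age[case_control == "case"]) <= 2 &
         (foll_up - foll_up[case_control == "case"]) == 0, ]
})
final <- rbindlist(L, use.names = TRUE, fill = TRUE)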

for loops nested in R

I have a dataset dt that stores a list of dataset names. I need to use them to create new datasets by selecting some variables, then use the dataset I just created and repeat the same process:
The first and second rows are the data already available.
Then the available data is used to create a new dataset.
Then the dataset just created is used to create another new dataset.
The final output is a list of datasets.
I'd appreciate any help or suggestions.
dt <- data.frame(name = c("mtcars", "iris", "mtcars_new", "mtcars_new_1"),
                 data_source = c("mtcars", "iris", "mtcars", "mtcars_new"),
                 variable = c("", "", "mpg,cyl,am,hp", "mpg,cyl"),
                 stringsAsFactors = FALSE)
> dt
          name data_source      variable
1       mtcars      mtcars
2         iris        iris
3   mtcars_new      mtcars mpg,cyl,am,hp
4 mtcars_new_1  mtcars_new       mpg,cyl
dt_list <- list(mtcars, iris)
names(dt_list) <- c("mtcars", "iris")
# The final list of datasets
final_dt <- list(mtcars, iris, mtcars_new, mtcars_new_1)
So far I have written the loop below, but I get only the mtcars_new dataset; I don't know how to go back to the list and keep looping to get mtcars_new_1 and so on. I have many datasets, and I don't know in advance how many times I should loop through the nested data.
mtcars_new <- data.frame()
for (i in 1:nrow(dt)) {
  if (dt$data_source[[i]] %in% names(dt_list) && !dt$name[[i]] %in% names(dt_list)) {
    check <- eval(parse(text = dt$data_source[[i]]))
    var <- c(unlist(strsplit(dt$variable[[i]], ",")))
    mtcars_new <- check[, colnames(check) %in% var]
  }
}
This will produce the desired output shown. Since the fourth iteration uses the data created in the third iteration, you need a way to append the result of each iteration to a growing list of available data sets, and then, within each iteration, find the right starting data set in that list.
dt <- data.frame(name = c("mtcars", "iris", "mtcars_new", "mtcars_new_1"),
                 data_source = c("mtcars", "iris", "mtcars", "mtcars_new"),
                 variable = c("", "", "mpg,cyl,am,hp", "mpg,cyl"),
                 stringsAsFactors = FALSE)
input_data_sets <- list(mtcars, iris)
names(input_data_sets) <- c("mtcars", "iris")
final_data_sets <- list()
for (i in 1:nrow(dt)) {
  available_data_sets <- c(input_data_sets, final_data_sets) # grows a list of all available data sets
  num_to_use <- which(dt$data_source[[i]] == names(available_data_sets)) # finds the right list member to use
  temp <- available_data_sets[num_to_use][[1]]
  var <- c(unlist(strsplit(dt$variable[[i]], ",")))
  if (length(var) > 0) temp <- subset(temp, select = var) # keep only the desired variables; keep all columns when none are listed
  temp <- list(temp)
  names(temp) <- dt$name[i] # assign the name provided
  final_data_sets <- c(final_data_sets, temp) # add to the output list; it becomes part of the available list in the next iteration
}
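A quick way to sanity-check the result (a hypothetical inspection, not part of the original answer):

names(final_data_sets)
# should print "mtcars" "iris" "mtcars_new" "mtcars_new_1"
head(final_data_sets$mtcars_new_1) # should contain only the mpg and cyl columns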

How to read and use the dataframes with the different names in a loop?

I'm struggling with the following issue: I have many data frames with different names (for instance Beverage, Construction, Electronic, etc.; dim 540x1000). I need to clean each of them, do some calculations, and save each as a zoo object and an R data file. The cleaning is the same for all of them: deleting the empty columns and the columns with some specific names.
For example:
Beverages <- Beverages[, colSums(is.na(Beverages)) < nrow(Beverages)] # removing empty columns
Beverages_OK <- Beverages %>% select(-starts_with("X.ERROR")) # dropping the X.ERROR column
Beverages_OK[, 1] <- NULL # dropping the first column
Beverages_OK <- cbind(data[1], Beverages_OK) # adding a date column
Beverages_zoo <- read.zoo(Beverages_OK, header = FALSE, format = "%Y-%m-%d")
save(Beverages_OK, file = "StatisticsInRFormat/Beverages.RData")
I tried to use the 'lapply' function like this:
list <- ls() # the list of all the dataframes
lapply(list, function(X) {
  temp <- X
  temp <- temp[, colSums(is.na(temp)) < nrow(temp)] # removing empty columns
  temp <- temp %>% select(-starts_with("X.ERROR")) # dropping the X.ERROR column
  temp[, 1] <- NULL
  temp <- cbind(data[1], temp)
  X_zoo <- read.zoo(X, header = FALSE, format = "%Y-%m-%d") # I don't know how to keep the same name as X has.
  save(X, file = "StatisticsInRFormat/X.RData")
})
but it doesn't work. Is there any way to do such a job? Is there an R package that facilitates it?
Thanks a lot.
If you are sure that you have only the needed data frames in the environment, this should get you started:
df1 <- mtcars
df2 <- mtcars
df3 <- mtcars
list <- ls()
lapply(list, function(x) {
  tmp <- get(x)
})
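Building on get(), a hedged sketch of the full pipeline; the cleaning steps are copied from the question, and the date column data[1] and the output directory are the question's own assumptions:

library(dplyr)
library(zoo)

nms <- ls() # assumes only the sector data frames are in the environment
zoo_list <- lapply(nms, function(x) {
  temp <- get(x) # fetch the data frame by its name
  temp <- temp[, colSums(is.na(temp)) < nrow(temp)] # remove empty columns
  temp <- temp %>% select(-starts_with("X.ERROR")) # drop the X.ERROR columns
  temp[, 1] <- NULL # drop the first column
  temp <- cbind(data[1], temp) # add the date column (from the question)
  assign(x, temp) # rebind the cleaned frame to its original name, locally
  save(list = x, file = paste0("StatisticsInRFormat/", x, ".RData"))
  read.zoo(temp, header = FALSE, format = "%Y-%m-%d")
})
names(zoo_list) <- nms # one zoo object per original data frame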

Most efficient way of avoiding loops to create data.frame

I have a data.frame which includes the runs scored in each innings of baseball games as a character vector.
I want to create a new data.frame which lists the number of runs in each innings for each game. I can do this with a loop but appreciate that this is too slow for any reasonable number of observations and that the rbind method shown is also not ideal.
The number of innings may vary, and an "x" indicates that the team did not need to bat in the 9th inning as the game was already won.
library(stringr)
data <- data.frame(gameID=c("a","b","c"),innings=c("002100000","30000000x","10101010101"))
for (i in 1:nrow(data)) {
  box <- as.integer(str_split(data$innings[i], "")[[1]])
  tempdf <- data.frame(box, id = data$gameID[i])
  if (i != 1) {
    df <- rbind(df, tempdf)
  } else {
    df <- tempdf
  }
}
This helps a bit (30%):
res <- vector("list", nrow(data))
for (i in seq_along(res))
  res[[i]] <- data.frame(box = as.integer(str_split(data$innings[i], "")[[1]]),
                         id = data$gameID[i])
do.call(rbind, res)
Not sure if this is faster:
library(splitstackshape)
data$innings <- gsub('', ' ', data$innings)
cSplit(data, 'innings', ' ', 'long')
Here's a way using lists with lapply:
library(dplyr) # for bind_rows -- you can also use do.call(rbind, list)
innings <- str_split(data$innings, "")
names(innings) <- data$gameID
innings <- lapply(innings, function(x) data.frame(box = x))
bind_rows(innings, .id = "id")
This should be pretty fast:
## Defined these separately just for readability
innings <- as.character(data$innings) # or use 'stringsAsFactors=FALSE' when defining the data frame
box <- unlist(strsplit(innings, ""))
id <- rep(data$gameID, nchar(innings))
## To get a character matrix back
cbind(box, id)
## To get a data frame back
data.frame(box=box, id=id, stringsAsFactors=FALSE)
Using a matrix is faster, but if you want to have mixed classes, use a data frame. Also, for a data frame it's faster to use characters than factors (hence the stringsAsFactors=FALSE argument). If you want box to be numeric, you can wrap it in as.integer (but then the matrix option won't work, of course).
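To see which approach wins at scale on your own data, here is a hedged benchmark sketch using the microbenchmark package (the wrapper function names are made up for this comparison; relative timings will vary with data size):

library(stringr)
library(microbenchmark)

data <- data.frame(gameID = c("a", "b", "c"),
                   innings = c("002100000", "30000000x", "10101010101"),
                   stringsAsFactors = FALSE)

vectorised <- function(d) { # the rep/strsplit approach from the last answer
  box <- unlist(strsplit(d$innings, ""))
  id <- rep(d$gameID, nchar(d$innings))
  data.frame(box = box, id = id, stringsAsFactors = FALSE)
}

listwise <- function(d) { # the preallocated-list approach from the first answer
  res <- vector("list", nrow(d))
  for (i in seq_along(res))
    res[[i]] <- data.frame(box = str_split(d$innings[i], "")[[1]],
                           id = d$gameID[i], stringsAsFactors = FALSE)
  do.call(rbind, res)
}

microbenchmark(vectorised(data), listwise(data), times = 100)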
