I have written a for loop that takes a group of 5 rows from a data frame and passes it to a function. The function then returns a single row after doing some operations on those 5 rows. Below is the code:
start <- 1          # start of the current group; initialization assumed here
final_data <- NULL  # accumulator for the results; initialization assumed here
for (i in 1:nrow(features_data1)) {
  # once 5 rows have accumulated, process them as one group
  if (i - start == 4) {
    group <- as.data.frame(features_data1[start:i, ])
    start <- i + 1
    sub_data <- feature_calculation(group)
    final_data <- rbind(final_data, sub_data)
  }
}
Can anyone suggest an alternative? The for loop takes a long time, and the function feature_calculation is huge.
Try this for a base R approach:
# convert features to data frame in advance so we only have to do this once
features_df <- as.data.frame(features_data1)
# assign each observation (row) to a group of 5 rows and split the data frame into a list of data frames
group_assignments <- as.factor(rep(1:ceiling(nrow(features_df) / 5), each = 5, length.out = nrow(features_df)))
groups <- split(features_df, group_assignments)
# apply your function to each group individually (i.e. to each element in the list)
sub_data <- lapply(X = groups, FUN = feature_calculation)
# bind your list of data frames into a single data frame
final_data <- do.call(rbind, sub_data)
You might be able to use the purrr and dplyr packages for a speed-up. The latter has a function, bind_rows, that is much quicker than do.call(rbind, list_of_data_frames) if the list is likely to be very large.
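To see the grouping logic on something small, here is a minimal sketch. The 12-row toy_df and toy_feature_calculation (which just averages each column) are stand-ins invented for illustration, not the asker's data or function.

```r
# Toy stand-ins (assumptions for illustration only)
toy_df <- data.frame(a = 1:12, b = seq(10, 120, by = 10))
toy_feature_calculation <- function(group) {
  # collapse a group of rows into a single summary row
  data.frame(mean_a = mean(group$a), mean_b = mean(group$b), n_rows = nrow(group))
}

# Rows 1-5 -> group 1, rows 6-10 -> group 2, rows 11-12 -> group 3
group_id <- (seq_len(nrow(toy_df)) - 1) %/% 5 + 1
toy_groups <- split(toy_df, group_id)

# One summary row per group, stacked into a single data frame
toy_result <- do.call(rbind, lapply(toy_groups, toy_feature_calculation))
print(toy_result)
```

Note that the for loop in the question only fires when a full group of 5 has accumulated, so it skips a trailing partial group, whereas the split-based approach keeps it. If the function cannot handle short groups, drop the last element of the list before applying it.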
I have some large data frames that are big enough to push the limits of R on my machine; e.g., the one on which I'm currently working is 2 columns by 70 million rows. The contents aren't important, but just in case, column 1 is a string and column 2 is an integer.
What I would like to do is split that data frame into n parts (say, 20, but preferably something that could change on a case-by-case basis) so that I can work on each of the smaller data frames one at a time. That means that (a) the result has to produce things that are named (e.g., "newdf_1", "newdf_2", ... "newdf_20" or something), and (b) each line in the original data frame needs to be in one (and only one) of the new "sub" data frames. The order does not matter, but doing it sequentially by rows makes sense to me.
Once I do the work, I will start to recombine them (using rbind()) one pair at a time.
I've looked at split(), but from what I can tell, it is designed to work with factors (which I don't have).
Any ideas?
You can create a new column and split the data frame based on that column. The column does not need to be a factor, but it needs to be a data type that the split function can convert to a factor.
# Number of groups
N <- 20
dat$group <- 1:nrow(dat) %% N
# Add 1 to group
dat$group <- dat$group + 1
# Split dat by group (the formula form of f needs R >= 4.1.0;
# on older versions use f = dat$group instead)
dat_list <- split(dat, f = ~group)
# Set the name of the list
names(dat_list) <- paste0("newdf_", 1:N)
Data
set.seed(123)
# Create example data frame
dat <- data.frame(
A = sample(letters, size = 70000000, replace = TRUE),
B = rpois(70000000, lambda = 1)
)
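If you would rather keep consecutive rows together, as the question suggests, one possible variation assigns contiguous blocks instead of cycling through the groups. This is only a sketch on a small made-up data frame; small_dat, n_parts and block are illustrative names.

```r
set.seed(123)
# Small stand-in for the 70-million-row data frame (illustrative only)
small_dat <- data.frame(
  A = sample(letters, size = 103, replace = TRUE),
  B = rpois(103, lambda = 1)
)
n_parts <- 20

# ceiling(i * n_parts / n) maps row i to a block number in 1..n_parts,
# so rows that are next to each other end up in the same block
block <- ceiling(seq_len(nrow(small_dat)) * n_parts / nrow(small_dat))
small_list <- split(small_dat, block)
names(small_list) <- paste0("newdf_", seq_along(small_list))

# Every row lands in exactly one piece
print(sum(sapply(small_list, nrow)) == nrow(small_dat))

# Recombine in one step once the work on each piece is done
recombined <- do.call(rbind, small_list)
```

Binding the whole list at once with do.call(rbind, ...) avoids the repeated copying that comes with recombining one pair at a time.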
Here's a tidyverse based solution. Try using read_csv_chunked().
library(tidyverse)
# practice data
tibble(string = sample(letters, 1e6, replace = TRUE),
       value = rnorm(1e6)) %>%
  write_csv("test.csv")
# here's the solution: only rows matching the filter are kept from each chunk
partial_data <- read_csv_chunked("test.csv",
  DataFrameCallback$new(function(x, pos) filter(x, string == "a")),
  chunk_size = 1000)
You can wrap the call to read_csv_chunked in a function where you change the string that you subset on.
This is more or less a repeat of this question:
How to read only lines that fulfil a condition from a csv into R?
x is a character vector that holds the names of several data frames. I am trying to print the first column of each data frame named in the vector. So far I have tried the code below, but it doesn't work.
x = (c('Ethereum,another Df..., another DF...,'))
for (i in x){
print(i[,1])
}
sapply(toString(Ethereum), function(i) print(i[1]))
You can try this
x <- c('Ethereum','anotherDf',...)
for (i in x){
print(get(i)[,1])
}
You can use mget to get data in a list and using lapply extract the first column of each dataframe in the list.
data <- lapply(mget(x), `[`, 1)
#Use `[[` to get it as vector.
#data <- lapply(mget(x), `[[`, 1)
Similar solution using purrr::map :
data <- purrr::map(mget(x), `[`, 1)
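As a small, self-contained illustration of the mget approach: the two data frames below are made up, and Bitcoin is a placeholder name that does not come from the question.

```r
# Two made-up data frames; contents and the Bitcoin name are placeholders
Ethereum <- data.frame(price = c(10, 20, 30), volume = c(1, 2, 3))
Bitcoin  <- data.frame(price = c(100, 200), volume = c(5, 6))

# Character vector holding the *names* of the data frames
x <- c("Ethereum", "Bitcoin")

# mget() looks up each name and returns a named list of the objects
dfs <- mget(x)

# `[[` with index 1 pulls the first column of each data frame as a vector
first_cols <- lapply(dfs, `[[`, 1)
print(first_cols)
```

The original attempt failed because each i in the loop is just a string, and a string has no columns; get() or mget() is what turns the name into the object.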
What I am trying to do is filter a larger data frame into 78 unique data frames based on the value of the first column in the larger data frame. The only way I can think of doing it properly is by applying the filter() function inside a for() loop:
for (i in 1:nrow(plantline))
{x1 = filter(rawdta.df, Plant_Line == plantline$Plant_Line[i])}
The issue is I don't know how to create a new data frame, say x2, x3, x4... every time the loop runs.
Can someone tell me if that is possible or if I should be trying to do this some other way?
There must be many duplicates of this question.
split(rawdta.df, rawdta.df$Plant_Line)
will create a list of data.frames, one per value of Plant_Line.
However, depending on your use case, splitting the large data.frame into pieces might not be necessary as grouping can be used.
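As a sketch of what grouping instead of splitting can look like in base R, the example below computes a per-group summary directly. The data frame and the height column are invented for illustration; only the Plant_Line column name comes from the question.

```r
# Made-up data: 3 plant lines with 4 measurements each
raw <- data.frame(
  Plant_Line = rep(c("A", "B", "C"), each = 4),
  height     = c(10, 12, 11, 13, 20, 22, 21, 23, 30, 32, 31, 33)
)

# One mean per plant line, with no intermediate data frames
line_means <- aggregate(height ~ Plant_Line, data = raw, FUN = mean)
print(line_means)

# The same summary as a named vector
line_means_vec <- tapply(raw$height, raw$Plant_Line, mean)
print(line_means_vec)
```

If the per-group work is a summary like this, the grouped call is usually all that is needed; splitting is mainly useful when each piece has to be handled as a data frame in its own right.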
You could use split -
# splits the large data frame into a list of 78 data frames, one per
# unique value in the first column
lst = split(large_data_frame, large_data_frame$first_column)
# copies the data frames out of the list into the global environment,
# although this is not recommended since 78 separate data frames are
# difficult to work with
list2env(lst, envir = .GlobalEnv)
The names of the data frames will be the values found in the first column.
It would be easier if we could see the dataframes....
I propose something nevertheless. You can create a list of dataframes:
dataframes <- vector("list", nrow(plantline))
for (i in 1:nrow(plantline)){
dataframes[[i]] = filter(rawdta.df, Plant_Line == plantline$Plant_Line[i])
}
You can use assign to create x1, x2, x3, ... (note the quotes around "x"):
for (i in 1:nrow(plantline))
{assign(paste0("x", i), filter(rawdta.df, Plant_Line == plantline$Plant_Line[i]))}
Alternatively, you can save your results in a list:
X <- list()
for (i in 1:nrow(plantline))
{X[[i]] = filter(rawdta.df, Plant_Line == plantline$Plant_Line[i])}
This would be easier with sample data. by() would be my favorite.
d <- data.frame(plantline = rep(LETTERS[1:3], 4),
x = 1:12,
stringsAsFactors = FALSE)
l <- by(d, d$plantline, data.frame)
print(l$A)
print(l$B)
Solution using plyr:
ma <- cbind(x = 1:10, y = (-4:5)^2, z = 1:2)
ma <- as.data.frame(ma)
library(plyr)
dlply(ma, "z") # you split ma by the column named z
I have n data frames named "s.dfx", where x = 1:n. All the data frames have 7 columns with different names. Now I want to cbind all the data frames.
I know the command
t<-cbind.data.frame(s.df1,s.df2,...,s.dfn)
But I want to cbind them in a loop rather than type every name, since n is a large number.
I have tried
for(t2 in 1:n){
t<-cbind.data.frame(s.df[t2])
}
But I get this error "Error in [.data.frame(s.df, t2) : undefined columns selected"
Can anyone help?
I don't think a for loop would be any faster than do.call(cbind, dfs), but it wasn't clear to me that you actually have such a list yet; you might need to build it from a character vector of names. This answer assumes you don't have a list yet, but that your data frames are numbered in an ascending sequence ending in n, where n may have more than one digit.
t <- do.call( cbind, mget( paste0("s.df", 1:n) ) )
Pasqui uses ls inside mget with a pattern to capture all the numbered data frames. I would use a slightly different pattern, since you suggested that n is higher than 9, which is all that his pattern would capture:
ls(pattern = "^s\\.df[0-9]+") # any number of digits
# the double backslash escapes '.' so that it matches a literal period
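One thing to watch with the ls() route: ls() returns names sorted alphabetically, so s.df10 comes before s.df2, and the columns can end up in a different order than with the paste0() version. A small sketch with made-up one-column data frames (the scratch environment env and the col1, col2, ... names are illustrative):

```r
# Build 11 one-column data frames named s.df1 ... s.df11 in a scratch environment
env <- new.env()
n <- 11
for (i in seq_len(n)) {
  assign(paste0("s.df", i), setNames(data.frame(i), paste0("col", i)), envir = env)
}

# Names found by pattern come back in alphabetical order
by_pattern <- ls(envir = env, pattern = "^s\\.df[0-9]+$")
print(by_pattern)

# Names built with paste0() follow the numeric order
by_number <- paste0("s.df", seq_len(n))

# Bind the data frames column-wise in numeric order
combined <- do.call(cbind, mget(by_number, envir = env))
print(combined)
```

If column order matters, building the names with paste0() (or reordering the ls() result by its numeric suffix) keeps the data frames in sequence.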
library(purrr) #to be redundant
#generating dummy data frames
df1 <- data.frame(x = c(1,2), y = letters[1:2])
df2 <- data.frame(x = c(10,20), y = letters[c(10, 20)])
df3 <- data.frame(x = c(100, 200), y = letters[c(11, 22)])
#' DEMO [to be adapted]: capturing the EXAMPLE data frames in a list
dfs <- mget(ls(pattern = "^df[1-3]"))
#A Tidyverse solution (reduce is from purrr, bind_cols is from dplyr)
t <- purrr::reduce(.x = dfs, .f = dplyr::bind_cols)
#Base R
do.call(cbind,dfs)
# or
Reduce(cbind,dfs)
If I have a list of data frames in R, such as:
x<-c(1:10)
y<-2*x
z<-3*x
df.list <- list(data.frame(x),data.frame(y),data.frame(z))
And I'd like to average over a specific column (this is a simplified example) of all these data frames, is there any easy way to do it?
The length of the list is known but dynamic (i.e. it can change depending on run conditions).
For example:
dfone<-data.frame(c(1:10))
dftwo<-data.frame(c(11:20))
dfthree<-data.frame(c(21:30))
(Assume all the column names are val)
row, output
1, (1+11+21)/3 = 11
2, (2+12+22)/3 = 12
3, (3+13+23)/3 = 13
etc
So output[i,1] = (dfone[i,1]+dftwo[i,1]+dfthree[i,1])/3
To do this in a for loop would be trivial:
for (i in seq_len(nrow(dfone)))
{
dfoutput[i,'value']=(dfone[i,'val']+dftwo[i,'val']+dfthree[i,'val'])/3
}
But I'm sure there must be a more elegant way?
Edit after the question turned out to be something else. Does this answer your question?
dfs <- list(dfone, dftwo, dfthree)
#oneliner
res <- rowMeans(sapply(dfs,function(x){
return(x[,"val"])
}))
#in steps
#step one: extract wanted column from all data
#this returns a matrix with one val-column for each df in the list
step1 <- sapply(dfs,function(x){
return(x[,"val"])
})
#step two: calculate the rowmeans. this is self-explanatory
step2 <- rowMeans(step1)
#or an even shorter one-liner, with thanks to @davidarenburg:
rowMeans(sapply(dfs, `[[`, "val"))
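For completeness, here is a runnable sketch of the same idea using the numbers from the question. It assumes, as the question states, that every data frame has a column named val; the Reduce variant is an alternative added here for comparison, not part of the original answer.

```r
# Three data frames that share a column named "val" (as assumed in the question)
dfone   <- data.frame(val = 1:10)
dftwo   <- data.frame(val = 11:20)
dfthree <- data.frame(val = 21:30)
dfs <- list(dfone, dftwo, dfthree)

# sapply() builds a 10 x 3 matrix, one column per data frame
val_matrix <- sapply(dfs, `[[`, "val")
row_avg <- rowMeans(val_matrix)

# Alternative without the intermediate matrix: add the columns, then divide
row_avg2 <- Reduce(`+`, lapply(dfs, `[[`, "val")) / length(dfs)

print(row_avg)
```

Both versions work for a list of any length, which covers the case where the number of data frames changes between runs.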