This question already has answers here: Split a large dataframe into a list of data frames based on common value in column (3 answers). Closed 4 years ago.
What I am trying to do is filter a larger data frame into 78 unique data frames based on the value of the first column in the larger data frame. The only way I can think of doing it properly is by applying the filter() function inside a for() loop:
for (i in 1:nrow(plantline)) {
  # x1 is overwritten on every iteration, so only the last subset survives
  x1 <- filter(rawdta.df, Plant_Line == plantline$Plant_Line[i])
}
The issue is I don't know how to create a new data frame, say x2, x3, x4... every time the loop runs.
Can someone tell me if that is possible or if I should be trying to do this some other way?
There must be many duplicates of this question.
split(rawdta.df, rawdta.df$Plant_Line)
will create a list of data frames, one per value of Plant_Line.
However, depending on your use case, splitting the large data.frame into pieces might not be necessary as grouping can be used.
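For example, a minimal dplyr sketch using the question's column names; the row count here is just a stand-in for whatever per-group computation is actually needed:
library(dplyr)

# operate per plant line without splitting the data frame
rawdta.df %>%
  group_by(Plant_Line) %>%
  summarise(n_rows = n())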
You could use split -
# split the large data frame into a list of 78 data frames,
# based on the value of the first column
lst = split(large_data_frame, large_data_frame$first_column)

# copy the data frames out of the list into the global environment --
# not recommended, since 78 loose data frames are difficult to work with
list2env(lst, envir = .GlobalEnv)
The names of the dataframes will be the same as the value of the variables in the first column.
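An individual piece can then be pulled out by name; the value "A" here is hypothetical:
lst[["A"]]  # the sub-data-frame whose first column equals "A"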
It would be easier if we could see the dataframes....
I propose something nevertheless. You can create a list of dataframes:
library(dplyr)  # for filter()

dataframes <- vector("list", nrow(plantline))
for (i in 1:nrow(plantline)) {
  dataframes[[i]] <- filter(rawdta.df, Plant_Line == plantline$Plant_Line[i])
}
You can use assign:
for (i in 1:nrow(plantline)) {
  # creates objects x1, x2, x3, ... in the global environment
  assign(paste0("x", i), filter(rawdta.df, Plant_Line == plantline$Plant_Line[i]))
}
Alternatively, you can save your results in a list:
X <- list()
for (i in 1:nrow(plantline)) {
  X[[i]] <- filter(rawdta.df, Plant_Line == plantline$Plant_Line[i])
}
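If you go the assign() route and later want the numbered objects gathered back into one place, mget() can collect them; a small sketch, assuming x1 through xn exist:
X <- mget(paste0("x", 1:n))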
This would be easier with sample data. by() would be my favorite:
d <- data.frame(plantline = rep(LETTERS[1:3], 4),
                x = 1:12,
                stringsAsFactors = FALSE)
l <- by(d, d$plantline, data.frame)
print(l$A)
print(l$B)
Solution using plyr:
ma <- cbind(x = 1:10, y = (-4:5)^2, z = 1:2)
ma <- as.data.frame(ma)
library(plyr)
dlply(ma, "z")  # splits ma by the column named z
I have some large data frames that are big enough to push the limits of R on my machine; e.g., the one on which I'm currently working is 2 columns by 70 million rows. The contents aren't important, but just in case, column 1 is a string and column 2 is an integer.
What I would like to do is split that data frame into n parts (say, 20, but preferably something that could change on a case-by-case basis) so that I can work on each of the smaller data frames one at a time. That means that (a) the result has to produce things that are named (e.g., "newdf_1", "newdf_2", ... "newdf_20" or something), and (b) each line in the original data frame needs to be in one (and only one) of the new "sub" data frames. The order does not matter, but doing it sequentially by rows makes sense to me.
Once I do the work, I will start to recombine them (using rbind()) one pair at a time.
I've looked at split(), but from what I can tell, it is designed to work with factors (which I don't have).
Any ideas?
You can create a new column and split the data frame based on that column. The column does not need to be a factor, but it does need to be a type that the split function can coerce to a factor.
# number of groups
N <- 20

# assign each row to one of the N groups, round-robin (groups 1..N)
dat$group <- (1:nrow(dat) %% N) + 1

# split dat by group
dat_list <- split(dat, f = ~ group)

# name the list elements
names(dat_list) <- paste0("newdf_", 1:N)
Data
set.seed(123)
# Create example data frame
dat <- data.frame(
  A = sample(letters, size = 70000000, replace = TRUE),
  B = rpois(70000000, lambda = 1)
)
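From there, the per-piece work and the recombination the question describes could look like this sketch, where process_piece() is a hypothetical placeholder:
# hypothetical placeholder for the real per-piece computation
process_piece <- function(df) {
  df  # replace with the actual work
}

# work on the pieces one at a time, then recombine them
results <- lapply(dat_list, process_piece)
recombined <- do.call(rbind, results)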
Here's a tidyverse-based solution. Try using read_csv_chunked().
library(tidyverse)

# practice data
tibble(string = sample(letters, 1e6, replace = TRUE),
       value = rnorm(1e6)) %>%
  write_csv("test.csv")
# here's the solution
partial_data <- read_csv_chunked("test.csv",
                                 DataFrameCallback$new(function(x, pos) filter(x, string == "a")),
                                 chunk_size = 1000)
You can wrap the call to read_csv_chunked in a function where you change the string that you subset on.
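A sketch of such a wrapper (read_matching is a made-up name; everything else comes from the code above):
read_matching <- function(file, letter, chunk_size = 1000) {
  read_csv_chunked(file,
                   DataFrameCallback$new(function(x, pos) filter(x, string == letter)),
                   chunk_size = chunk_size)
}

a_rows <- read_matching("test.csv", "a")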
This is more or less a repeat of this question:
How to read only lines that fulfil a condition from a csv into R?
This question already has answers here: Combine a list of data frames into one data frame by row (10 answers). Closed 5 years ago.
I am sorry if this question has been answered already. Also, this is my first time on stackoverflow.
I have a beginner R question concerning lists, data frames, and merge() and/or rbind().
I started with a panel that looks like this:
COUNTRY  YEAR  VAR
A        1
A        2
B        1
B        2
For efficiency purposes, I created a list that consists of one data frame for each country and performed a variety of calculations on each individual data.frame. However, I cannot seem to combine the individual data frames into one large frame again.
rbind() and merge() both tell me that only replacement of elements is allowed.
Could someone tell me what I am doing wrong, and how to actually recombine the data frames?
Thank you
Maybe you want to do something like:
do.call("rbind", my.df.list)
dplyr lets you use the bind_rows function for that:
library(dplyr)
foo <- list(df1 = data.frame(x = c('a', 'b', 'c'), y = c(1, 2, 3)),
            df2 = data.frame(x = c('d', 'e', 'f'), y = c(4, 5, 6)))
bind_rows(foo)
Note that the basic solution
do.call("rbind", my.df.list)
will be slow if we have many dataframes. A scalable solution is:
library(data.table)
rbindlist(my.df.list)
which, from the docs, is the same as do.call("rbind", l) on data.frames, but much faster.
plyr is probably best. Another useful approach, if the data frames can differ in structure, is to use reshape:
library(reshape)
data <- merge_recurse(listofdataframes)
Look at my answer to this related question on merging data frames.
There might be a better way to do this, but this seems to work and it's straightforward. (My code has four lines so that it's easier to see the steps; these four could easily be combined.)
# first re-create your data frame:
A = matrix( ceiling(10*runif(8)), nrow=4)
colnames(A) = c("country", "year_var")
dfa = data.frame(A)
# now re-create the list you made from the individual rows of the data frame:
df1 = dfa[1,]
df2 = dfa[2,]
df3 = dfa[3,]
df4 = dfa[4,]
df_all = list(df1, df2, df3, df4)
# to recreate your original data frame:
x = unlist(df_all)                 # flatten the list into a single vector (row by row)
A = matrix(x, nrow=4, byrow=TRUE)  # refill row-wise so each original row stays intact
colnames(A) = c("country", "year_var")  # put the column names back on
dfa = data.frame(A)                # from the matrix, recreate your original data frame
This question already has answers here: Using cbind on an arbitrarily long list of objects (4 answers). Closed 4 years ago.
I have n data frames named "s.dfx", where x = 1:n. All the data frames have 7 columns with different names. Now I want to cbind all of them.
I know the command
t <- cbind.data.frame(s.df1, s.df2, ..., s.dfn)
But I want to optimize and cbind them in a loop, since n is a large number.
I have tried
for (t2 in 1:n) {
  t <- cbind.data.frame(s.df[t2])
}
But I get this error "Error in [.data.frame(s.df, t2) : undefined columns selected"
Can anyone help?
I don't think a for loop would be any faster than do.call(cbind, dfs), but it wasn't clear to me that you actually had such a list yet; I thought you might need to build one from a character vector. This answer assumes you don't have a list yet, but that your data frames are numbered in an ascending sequence ending at n, where the decimal representation might have multiple digits.
t <- do.call( cbind, mget( paste0("s.df", 1:n) ) )
Pasqui uses ls inside mget and a pattern to capture all the numbered dataframes. I would have used a slightly different one, since you suggested that the number was higher than 9 which is all that his pattern would capture:
ls(pattern = "^s\\.df[0-9]+") # any number of digits
# ^ need double escapes to make '.' a literal period or fixed=TRUE
library(purrr)  # for the tidyverse solution below (the base R version does not need it)
#generating dummy data frames
df1 <- data.frame(x = c(1,2), y = letters[1:2])
df2 <- data.frame(x = c(10,20), y = letters[c(10, 20)])
df3 <- data.frame(x = c(100, 200), y = letters[c(11, 22)])
#' DEMO [to be adapted]: capturing the EXAMPLE data frames in a list
dfs <- mget(ls(pattern = "^df[1-3]"))
#A Tidyverse (purrr) Solution
library(dplyr)  # bind_cols() comes from dplyr
t <- purrr::reduce(.x = dfs, .f = bind_cols)
#Base R
do.call(cbind,dfs)
# or
Reduce(cbind,dfs)
This question already has answers here: Group Data in R for consecutive rows (3 answers). Closed 6 years ago.
I have written a for loop that takes a group of 5 rows from a data frame and passes it to a function; the function then returns just one row after doing some operations on those 5 rows. Below is the code:
start <- 1  # initialisation implied by the loop logic; omitted in the original snippet
for (i in 1:nrow(features_data1)) {
  if (i - start == 4) {
    group <- as.data.frame(features_data1[start:i, ])
    start <- i + 1
    sub_data <- feature_calculation(group)
    final_data <- rbind(final_data, sub_data)  # growing a data frame inside a loop is slow
  }
}
Can anyone please suggest an alternative, as the for loop is taking a lot of time? The feature_calculation function is huge.
Try this for a base R approach:
# convert features to data frame in advance so we only have to do this once
features_df <- as.data.frame(features_data1)
# assign each observation (row) to a group of 5 rows and split the data frame into a list of data frames
group_assignments <- as.factor(rep(1:ceiling(nrow(features_df) / 5), each = 5, length.out = nrow(features_df)))
groups <- split(features_df, group_assignments)
# apply your function to each group individually (i.e. to each element in the list)
sub_data <- lapply(X = groups, FUN = feature_calculation)
# bind your list of data frames into a single data frame
final_data <- do.call(rbind, sub_data)
You might be able to use the purrr and dplyr packages for a speed-up. The latter has a function bind_rows that is much quicker than do.call(rbind, list_of_data_frames) if this is likely to be very large.
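A sketch of that variant, reusing the groups list from above (feature_calculation is the asker's own function):
library(purrr)
library(dplyr)

# apply the function to each group, then bind the results in one fast step
sub_data <- map(groups, feature_calculation)
final_data <- bind_rows(sub_data)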
I have two data frames and I would like to do independent two-group t-tests on the rows (i.e., t.test(y1, y2), where y1 is a row in dataframe1 and y2 is the matching row in dataframe2).
What's the best way of accomplishing this?
EDIT:
I just found the format dataframe1[i, ] and dataframe2[i, ], which will work in a loop. Is that the best solution?
The approach you outlined is reasonable; just make sure to preallocate your storage vector. I'd also double-check that you really want to compare the rows instead of the columns: most datasets I work with have each row as a unit of observation and the columns as the separate responses of interest. Regardless, it's your data, so if that's what you need to do, here's an approach:
#Fake data
df1 <- data.frame(matrix(runif(100),10))
df2 <- data.frame(matrix(runif(100),10))
#Preallocate results
testresults <- vector("list", nrow(df1))
#For loop
for (j in seq(nrow(df1))){
testresults[[j]] <- t.test(df1[j,], df2[j,])
}
You now have a list that is as long as you have rows in df1. I would then recommend using lapply and sapply to easily extract things out of the list object.
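For example, each element is a standard htest object, so all the p-values can be pulled out in one line:
# extract the p-value from every test result
p_values <- sapply(testresults, function(res) res$p.value)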
It would make more sense to have your data stored as columns.
You can transpose a data.frame by
df1_t <- as.data.frame(t(df1))
df2_t <- as.data.frame(t(df2))
Then you can use mapply to cycle through the two data.frames a column at a time
t.test_results <- mapply(t.test, x= df1_t, y = df2_t, SIMPLIFY = F)
Or you could use Map which is a simple wrapper for mapply with SIMPLIFY = F (Thus saving key strokes!)
t.test_results <- Map(t.test, x = df1_t, y = df2_t)