Split a large file into equal rows in R


Here is my sample data
library(dplyr)
Singer <- c("A","B","C","A","B","D")
Rank <- c(1,2,3,3,2,1)
data <- data_frame(Singer,Rank)
I would like to split the data into three separate CSV files, each containing two rows. I tried the split function, but it did not work out as I expected.
d <- split(data,rep(1:2,each=2))

Group first, then use do to apply the writing function to each pair of rows.
library(dplyr)
library(readr)
data %>%
group_by(g = ceiling(row_number() / 2)) %>%
do(write_csv(., paste0(.$g[1], '.csv')))
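The attempt in the question fails because rep(1:2, each=2) has only four elements, so split recycles it over the six rows and produces two uneven groups instead of three pairs. A minimal base R sketch of the same chunking idea follows; rows_per_file, out_dir and the chunk_ file prefix are illustrative assumptions, not part of the original answer.

```r
# Sample data from the question, as a plain data.frame
data <- data.frame(Singer = c("A", "B", "C", "A", "B", "D"),
                   Rank   = c(1, 2, 3, 3, 2, 1))

rows_per_file <- 2        # assumed chunk size
out_dir <- tempdir()      # assumed output folder

# One chunk id per row: 1, 1, 2, 2, 3, 3
chunk_id <- ceiling(seq_len(nrow(data)) / rows_per_file)
chunks <- split(data, chunk_id)

# Write each chunk to its own CSV file
paths <- file.path(out_dir, paste0("chunk_", names(chunks), ".csv"))
for (i in seq_along(chunks)) {
  write.csv(chunks[[i]], paths[i], row.names = FALSE)
}
```

Because the chunk id is derived from the row count, the same code should handle a file of any length; the last chunk is simply shorter when the row count is not a multiple of rows_per_file.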

Related

Stack tibbles in a loop

I have various spss-datasets (survey data) and for each dataset there are a number of waves (one wave for each month):
Let's assume that I have four datasets (1 to 4) and two waves for each (_W1 and _W2):
datasets <- c("dataset1_W1.sav", "dataset1_W2.sav",
"dataset2_W1.sav", "dataset2_W2.sav",
"dataset3_W1.sav", "dataset3_W2.sav",
"dataset4_W1.sav", "dataset4_W2.sav")
datasets
My goal is to stack all waves of each dataset (dataset1_W1 and dataset1_W2; dataset2_W1 and dataset2_W2; etc.). In order to do so I read the files using haven::read_spss(filename) and then I stack them using dplyr::bind_rows(df1, df2).
Now, I'd like to create a tibble for each dataset:
library(dplyr)
library(haven)
ds1_1 <- haven::read_spss("dataset1_W1.sav")
ds1_2 <- haven::read_spss("dataset1_W2.sav")
dataset1_all <- dplyr::bind_rows(ds1_1, ds1_2)
ds2_1 <- haven::read_spss("dataset2_W1.sav")
ds2_2 <- haven::read_spss("dataset2_W2.sav")
dataset2_all <- dplyr::bind_rows(ds2_1, ds2_2)
etc.
But how can I create those tibbles (dataset1_all, dataset2_all, etc.) automatically? I've read that I should avoid dynamic variable names.

This will create a named list of data frames, where each element is the row-bound data from both waves of one dataset:
library(tidyverse)
datasets <- c("dataset1_W1.sav", "dataset1_W2.sav",
"dataset2_W1.sav", "dataset2_W2.sav",
"dataset3_W1.sav", "dataset3_W2.sav",
"dataset4_W1.sav", "dataset4_W2.sav")
dataset_id <- str_extract(datasets, "[^0-9]*[0-9]")
list_of_dfs <- datasets %>%
split(dataset_id) %>%
map_depth(.depth = 2, .f = haven::read_spss) %>%
map(bind_rows)
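To make the grouping step easier to inspect, here is a base R sketch of the same split-then-stack idea. The read_wave function is a hypothetical stand-in for haven::read_spss so that the example runs without any .sav files. The id is derived by stripping the wave suffix, which is an assumed variation: the pattern "[^0-9]*[0-9]" above stops at the first digit, so a file such as dataset10_W1.sav would be grouped under dataset1.

```r
datasets <- c("dataset1_W1.sav", "dataset1_W2.sav",
              "dataset2_W1.sav", "dataset2_W2.sav",
              "dataset3_W1.sav", "dataset3_W2.sav",
              "dataset4_W1.sav", "dataset4_W2.sav")

# Dataset id = file name without the "_W<n>.sav" suffix
dataset_id <- sub("_W[0-9]+\\.sav$", "", datasets)

# Named list: one character vector of file names per dataset
files_by_dataset <- split(datasets, dataset_id)

# Hypothetical reader used only for illustration;
# replace with haven::read_spss for real files
read_wave <- function(path) {
  data.frame(source_file = path, value = 1:2)
}

# Read every wave of a dataset and stack the results
stacked <- lapply(files_by_dataset, function(paths) {
  do.call(rbind, lapply(paths, read_wave))
})

names(stacked)
```

Each dataset is then available as stacked$dataset1, stacked$dataset2 and so on, which avoids creating dynamically named variables.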

How to create a for loop for combining several data frames and df subsets into one data frame?

I'm trying to modify data frames and struggle with combining my operations into a for loop. I want to subset a data frame according to one particular column, attach different rows to each subset and combine the modified subsets into one single data frame again. Let's use the iris data as an example:
#Create data frame subsets based on Species column
iris_subs <- split(iris, iris$Species)
#create an empty data frame with the same columns as in iris and one empty row
emptydf <- iris[FALSE,]
emptydf[nrow(emptydf)+1,] <- NA
#create a data frame with sums for each species
iris %>% group_by(Species) %>% summarise_all(sum) -> iris_sums
iris_sums <- iris_sums[,-c(1)] #delete column with species names
#Combine data frames into one data frame with original data, sum for this species and an empty row for each subset
iris_setosa <- bind_rows(iris_subs[1], iris_sums[1,], emptydf)
iris_versicolor <- bind_rows(iris_subs[2], iris_sums[2,], emptydf)
iris_virginica <- bind_rows(iris_subs[3], iris_sums[3,], emptydf)
new_iris <- bind_rows(iris_setosa, iris_versicolor, iris_virginica)
This code does the job. However, I have a couple of hundreds of data frames which I want to process in this way and the number of different species varies for each data frame. How can I automate the last part in a for loop?
I would like something like this
#empty data frame to store output
new_iris <- iris[FALSE,]
for (i in iris_subs) {
new_iris[i] <- bind_rows(iris_subs[i], iris_sums[i,], emptydf)
new_iris <- merge(new_iris[i])
}
Error in iris_subs[i] : invalid subscript type 'list'
Apart from the error, this is probably way too simple... I'm an R beginner and have searched the net for days now, but cannot find any answer to this. Does anyone have a suggestion for how to achieve this? Thank you for any hints!

We can create a function and apply it to every data frame. Here is a shorter version of what you were trying to do (it reuses emptydf from the question):
library(dplyr)
repeat_process <- function(df) {
iris_sums <- df %>% group_by(Species) %>% summarise_all(sum) %>% select(-Species)
df %>% bind_rows(iris_sums, emptydf[rep(1:nrow(emptydf), n_distinct(df$Species)), ])
}
Now let's assume you have a list of dataframes
list_df <- list(iris, iris)
You can apply this function to each dataframe in the list
lapply(list_df, repeat_process)

You can define a function that will sum up all numeric columns of a data.frame, and leave other columns as NA, append this to original data frame:
numericCols = sapply(iris,is.numeric)
func = function(df,numCols){
iris_sums <- colSums(df[,numCols])
result <- rep(NA,ncol(df))
names(result) <- colnames(df)
result[names(iris_sums)] <- iris_sums
rbind(df,result,rep(NA,ncol(df)))
}
Then we use purrr to map each subset:
library(dplyr) # for %>%
library(purrr) # for map_dfr
split(iris,iris$Species) %>% map_dfr(func,numCols=numericCols)
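For readers who want the explicit for loop asked about in the question, the sketch below may help. The error in the question arises because for (i in iris_subs) iterates over the data frames themselves, so i cannot be used as an index. Looping over the names and collecting the pieces in a list avoids that. The names pieces, sums and empty_row are illustrative, and across() assumes dplyr 1.0 or later.

```r
library(dplyr)

iris_subs <- split(iris, iris$Species)

# Pre-allocate a named list to hold one modified subset per species
pieces <- vector("list", length(iris_subs))
names(pieces) <- names(iris_subs)

for (sp in names(iris_subs)) {
  sub <- iris_subs[[sp]]
  # Column sums of the numeric columns for this species
  sums <- summarise(sub, across(where(is.numeric), sum))
  # One all-NA row with the same columns as the subset
  empty_row <- sub[NA_integer_, ]
  # Original rows, then the sums row, then the empty row
  pieces[[sp]] <- bind_rows(sub, sums, empty_row)
}

# Combine all modified subsets into a single data frame
new_iris <- bind_rows(pieces)
```

The loop does not depend on the number of species, so the same body could be wrapped in a function and applied to each of several data frames.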

R dplyr group_by subject appears to use entire dataframe instead of subject

Background
I am working with a large dataset from a repeated measures clinical trial in R, where I want to do some data manipulations for each subject. This could be extraction of the max value in column x for each subject or the mean of column y for each subject.
Problem
I am fond of using the dplyr package and pipes, which led me to the group_by function. But when I try to apply it, the data that I want to extract does not seem to group by subject as it is supposed to, but rather extracts data based on the entire dataset.
Code
This is what I have done so far:
data <- read.csv(file="group_by_question.csv", header=TRUE, sep=",")
library(dplyr)
library(plyr)
data <- tbl_df(data)
test <- data %>%
filter(!is.na(wght)) %>%
dplyr::group_by(subject_id) %>%
mutate(maxwght=max(wght),meanwght=mean(wght)) %>%
ungroup()
Sample of the test dataframe:
Find a .csv sample of my dataset here:
https://drive.google.com/file/d/1wGkSQyJXqSswThiNsqC26qaP7d3catyX/view?usp=sharing

Is this what you want? In the example below, the output shows the maximum weight for each subject id. Note that group_by must come before mutate; otherwise max and mean are computed over the entire dataset. You could replace max() with mean(), for example, if you require the mean value for each subject id.
library(dplyr)
data <- read.csv(file="group_by_question.csv", header=TRUE, sep=",")
test <- data %>%
filter(!is.na(wght)) %>%
group_by(subject_id) %>%
mutate(maxwght=max(wght),meanwght=mean(wght)) %>%
summarise(value = max(maxwght)) %>%
ungroup()
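One plausible cause of the symptom in the question, stated here as an assumption since the original session cannot be inspected, is that plyr is attached after dplyr. In that case an unqualified mutate() resolves to plyr::mutate(), which ignores grouping. The sketch below uses made-up toy data and qualifies every verb with dplyr:: so that the grouped versions are used regardless of what else is attached.

```r
library(dplyr)

# Toy data, invented for illustration only
toy <- data.frame(subject_id = c(1, 1, 2, 2),
                  wght       = c(70, 72, 80, NA))

per_subject <- toy %>%
  dplyr::filter(!is.na(wght)) %>%
  dplyr::group_by(subject_id) %>%
  # Computed within each subject because the data are grouped
  dplyr::mutate(maxwght = max(wght), meanwght = mean(wght)) %>%
  dplyr::ungroup()

per_subject
```

If both packages are really needed, a common recommendation is to attach plyr before dplyr, or to avoid attaching plyr and call its functions with the plyr:: prefix.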

R filter rows based on multiple partial strings applied to multiple columns

Sample of dataset:
diag01 <- as.factor(c("S7211","J47","J47","K729","M2445","Z509","Z488","R13","L893","N318","L0311","S510","A047","D649"))
diag02 <- as.factor(c("K590","D761","J961","T501","M8580","R268","T831","G8240","B9688","G550","E162","T8902","E86","I849"))
diag03 <- as.factor(c("F058","M0820","E877","E86","G712","R32","A408","E888","G8220","C794","T68","L0310","M1094","D469"))
diag04 <- as.factor(c("E86","C845","R790","I420","G4732","R600","L893","R509","T913","C795","M8412","G8212","L891","L0311"))
diag05 <- as.factor(c("R001","N289","E876","E871","H659","R4589","N508","B99","I209","C773","T921","Q070","H919","L033"))
diag06 <- as.factor(c("I951","E877","S7240","I500","H901","E119","Z223","K590","I959","C509","G819","F719","Z290","R13"))
df <- data.frame(diag01, diag02, diag03, diag04, diag05, diag06)
I want to filter the entire rows that have a partial string match anywhere in a given list of columns (e.g. diag01, diag02, ...). I can achieve this on a single column e.g.
junk <- filter(df, grepl(pattern="^E11|^E16|^E86|^E87|^E88", diag02))
but I need to apply this to multiple columns (the original dataset has 216 columns and >1,000,000 rows). Among other options, I have tried
junk <- filter(df, grepl(pattern="^E11|^E16|^E86|^E87|^E88", df[,c(1:6)]))
junk <- apply(df, 1, function(r) any(r %in% grepl(pattern="^E11|^E16|^E86|^E87|^E88")))
I need the entire row and ideally I would like the filtering criteria to be restricted to a given list of columns as it is likely values in other columns may begin with the declared partial strings.
Made a genuine effort to search for a solution but obviously my knowledge of R is lacking.

Perhaps we need
df %>%
filter_all(any_vars(grepl(pattern="^(E11|E16|E86|E87|E88)", .)))
Or with purrr and dplyr
library(dplyr)
library(purrr)
df %>%
map(~grepl(pattern="^E11|^E16|^E86|^E87|^E88", .)) %>%
reduce(`|`) %>%
df[.,]
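The question also asks for the match to be restricted to a given list of columns. A minimal sketch of that restriction follows, assuming dplyr 1.0.4 or later for if_any(); the toy data frame and the target_cols vector are invented for illustration.

```r
library(dplyr)

# Toy data: the third row matches only in a column we do not want to search
df <- data.frame(diag01 = c("E119", "J47", "K729"),
                 diag02 = c("K590", "E162", "T501"),
                 other  = c("A1", "B2", "E860"),
                 stringsAsFactors = TRUE)

target_cols <- c("diag01", "diag02")   # assumed list of columns to search
pattern <- "^(E11|E16|E86|E87|E88)"

# Keep rows where at least one of the target columns matches the pattern
matched <- df %>%
  filter(if_any(all_of(target_cols), ~ grepl(pattern, .x)))

matched
```

Only the first two rows are kept here; the third row is dropped because its only match is in the column named other, which is outside target_cols.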

Split dataframe, then select random observations from list, and the lists merge back into a dataframe

I have a data frame with 3 variables (subject, trialtype, and RT), and I need to select randomly half of the RT observations for the each subject, and then re-create the data frame from that selection.
In browsing the list I've got up to here
split_df <- split(bucnidata_rt,
list(bucnidata_rt$Subject, bucnidata_rt$trialtype))
(this gives a series of split_df[1], split_df[2], ....)
But then I can not subset using this
split_df[1] <- sample(nrow(split_df[1]), 24), ]
I think because sample only works on data frames and this split_df[1] is a list.
To re-merge I would do:
remerged_df <- unsplit(split_df[1],
list(bucnidata_rt$Subject, bucnidata_rt$trialtype))
Could you please help me with step 2?

I propose a slightly different approach using dplyr if you don't mind. You can group by subject and then randomly select 50% of observations of each group:
library(dplyr)
bucnidata_rt %>%
group_by(Subject) %>%
sample_frac(size = 0.5)
Edit
Here's another way, closer to what you started. I use the mtcars dataset in this case:
split_df <- split(mtcars, mtcars$cyl) #split by `cyl`
#randomly select 50% of rows per group, without replacement
split_df <- lapply(split_df, function(x) x[sample(seq_len(nrow(x)), nrow(x)/2, replace=FALSE),])
#merge the randomly selected list elements back into one data.frame
remerged_df <- do.call(rbind, split_df)
#check the result
nrow(remerged_df)
#[1] 15
Edit #2: corrected the dplyr method after a comment by @Gregor.
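In dplyr 1.0.0 and later, sample_frac is superseded by slice_sample. A minimal sketch of the grouped 50% sample with the newer function follows; the trials data frame is made up for illustration and stands in for bucnidata_rt.

```r
library(dplyr)

set.seed(1)  # only to make the random selection repeatable

# Toy stand-in for bucnidata_rt: two subjects with four trials each
trials <- data.frame(Subject = rep(c("s1", "s2"), each = 4),
                     RT      = c(510, 480, 620, 555, 430, 470, 490, 505))

# Randomly keep half of the rows within each subject
half <- trials %>%
  group_by(Subject) %>%
  slice_sample(prop = 0.5) %>%
  ungroup()

half
```

To sample within each combination of subject and trial type instead, group_by(Subject, trialtype) could be used in the same pipeline.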
