Here is my sample data
library(dplyr)
Singer <- c("A","B","C","A","B","D")
Rank <- c(1,2,3,3,2,1)
data <- tibble(Singer, Rank) # data_frame() is deprecated in favour of tibble()
I would like to split the data into three separate CSV files, each containing two rows. I tried to use the split function, but it did not work out as I expected.
d <- split(data,rep(1:2,each=2))
Group first, then use do to apply the writing function to each pair of rows.
library(dplyr)
library(readr)
data %>%
  group_by(g = ceiling(row_number() / 2)) %>%
  do(write_csv(., paste0(.$g[1], '.csv')))
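For comparison, here is a base-R sketch of the same idea. The split() call in the question uses rep(1:2, each = 2), which produces only four group labels for six rows. Building the labels with ceiling(seq_len(nrow(data)) / 2) gives one label per pair of rows. The temporary output directory and the chunk_1.csv style file names are assumptions made for illustration.

```r
# Sample data from the question
data <- data.frame(Singer = c("A", "B", "C", "A", "B", "D"),
                   Rank   = c(1, 2, 3, 3, 2, 1))

# One group label per pair of rows: 1 1 2 2 3 3
chunk_id <- ceiling(seq_len(nrow(data)) / 2)
chunks <- split(data, chunk_id)

# Write each chunk to its own file (a temporary directory is assumed here)
out_dir <- tempdir()
paths <- file.path(out_dir, paste0("chunk_", names(chunks), ".csv"))
for (i in seq_along(chunks)) {
  write.csv(chunks[[i]], paths[i], row.names = FALSE)
}
```

Changing the divisor from 2 to another number changes the number of rows per file.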
Related
I have various spss-datasets (survey data) and for each dataset there are a number of waves (one wave for each month):
Let's assume that I have four datasets (1 to 4) and two waves for each (_W1 and _W2):
datasets <- c("dataset1_W1.sav", "dataset1_W2.sav",
"dataset2_W1.sav", "dataset2_W2.sav",
"dataset3_W1.sav", "dataset3_W2.sav",
"dataset4_W1.sav", "dataset4_W2.sav")
datasets
My goal is to stack all waves of each dataset (dataset1_W1 and dataset1_W2; dataset2_W1 and dataset2_W2; etc.). In order to do so I read the files using haven::read_spss(filename) and then I stack them using dplyr::bind_rows(df1, df2).
Now, I'd like to create a tibble for each dataset:
library(dplyr)
library(haven)
ds1_1 <- haven::read_spss("dataset1_W1.sav")
ds1_2 <- haven::read_spss("dataset1_W2.sav")
dataset1_all <- dplyr::bind_rows(ds1_1, ds1_2)
ds2_1 <- haven::read_spss("dataset2_W1.sav")
ds2_2 <- haven::read_spss("dataset2_W2.sav")
dataset2_all <- dplyr::bind_rows(ds2_1, ds2_2)
etc.
But how can I create those tibbles (dataset1_all, dataset2_all etc.) automatically? I've read that I should avoid dynamic variable names.
This will create a named list of data frames, where each element is the row-bound dataset from both waves:
library(tidyverse)
datasets <- c("dataset1_W1.sav", "dataset1_W2.sav",
"dataset2_W1.sav", "dataset2_W2.sav",
"dataset3_W1.sav", "dataset3_W2.sav",
"dataset4_W1.sav", "dataset4_W2.sav")
dataset_id <- str_extract(datasets, "[^0-9]*[0-9]+") # "+" so that ids such as dataset10 are not truncated
list_of_dfs <- datasets %>%
  split(dataset_id) %>%
  map_depth(.depth = 2, .f = haven::read_spss) %>%
  map(bind_rows)
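The grouping step can also be sketched in base R so that it runs without any .sav files on disk. In the sketch below, fake_reader is a hypothetical stand-in for haven::read_spss, and stack_waves is a helper name invented for illustration. In real use you would pass haven::read_spss as the reader and would probably prefer dplyr::bind_rows over rbind.

```r
datasets <- c("dataset1_W1.sav", "dataset1_W2.sav",
              "dataset2_W1.sav", "dataset2_W2.sav",
              "dataset3_W1.sav", "dataset3_W2.sav",
              "dataset4_W1.sav", "dataset4_W2.sav")

# Drop the wave suffix to get one id per dataset: "dataset1", "dataset2", ...
dataset_id <- sub("_W[0-9]+\\.sav$", "", datasets)

# Named list: one character vector of file names per dataset
files_by_dataset <- split(datasets, dataset_id)

# Hypothetical reader, used only so the sketch runs without real files
fake_reader <- function(path) {
  data.frame(source_file = path,
             wave = sub(".*_(W[0-9]+)\\.sav$", "\\1", path))
}

# Read every wave of one dataset and stack the results
stack_waves <- function(files, reader) {
  do.call(rbind, lapply(files, reader))
}

list_of_dfs <- lapply(files_by_dataset, stack_waves, reader = fake_reader)
names(list_of_dfs)
```

Individual datasets are then available as list_of_dfs$dataset1, list_of_dfs[["dataset2"]], and so on, which avoids dynamic variable names.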
I'm trying to modify data frames and struggle with combining my operations into a for loop. I want to subset a data frame according to one particular column, attach different rows to each subset and combine the modified subsets into one single data frame again. Let's use the iris data as an example:
#Create data frame subsets based on Species column
iris_subs <- split(iris, iris$Species)
#create an empty data frame with the same columns as in iris and one empty row
emptydf <- iris[FALSE,]
emptydf[nrow(emptydf)+1,] <- NA
#create a data frame with sums for each species
iris %>% group_by(Species) %>% summarise_all(sum) -> iris_sums
iris_sums <- iris_sums[,-c(1)] #delete column with species names
#Combine data frames into one data frame with original data, sum for this species and an empty row for each subset
iris_setosa <- bind_rows(iris_subs[1], iris_sums[1,], emptydf)
iris_versicolor <- bind_rows(iris_subs[2], iris_sums[2,], emptydf)
iris_virginica <- bind_rows(iris_subs[3], iris_sums[3,], emptydf)
new_iris <- bind_rows(iris_setosa, iris_versicolor, iris_virginica)
This code does the job. However, I have a couple of hundred data frames that I want to process in this way, and the number of different species varies between data frames. How can I automate the last part in a for loop?
I would like something like this
#empty data frame to store output
new_iris <- iris[FALSE,]
for (i in iris_subs) {
new_iris[i] <- bind_rows(iris_subs[i], iris_sums[i,], emptydf)
new_iris <- merge(new_iris[i])
}
Error in iris_subs[i] : invalid subscript type 'list'
Apart from the error, this is probably way too simple... I'm an R beginner and have searched the net for days now, but cannot find any answer to this. Does anyone have a suggestion for how to achieve this? Thank you for any hints!
We can create a function and repeat it for all the dataframes. Here is a shorter version of what you were trying to do
library(dplyr)
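repeat_process <- function(df) {
  # per-species sums of all numeric columns, without the Species column
  iris_sums <- df %>% group_by(Species) %>% summarise_all(sum) %>% select(-Species)
  # emptydf is the one-row, all-NA data frame created in the question
  df %>% bind_rows(iris_sums, emptydf[rep(1:nrow(emptydf), n_distinct(df$Species)), ])
}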
Now let's assume you have a list of dataframes
list_df <- list(iris, iris)
You can apply this function to each dataframe in the list
lapply(list_df, repeat_process)
You can define a function that will sum up all numeric columns of a data.frame, and leave other columns as NA, append this to original data frame:
numericCols = sapply(iris,is.numeric)
func = function(df,numCols){
  iris_sums <- colSums(df[,numCols])
  result <- rep(NA,ncol(df))
  names(result) <- colnames(df)
  result[names(iris_sums)] <- iris_sums
  rbind(df,result,rep(NA,ncol(df)))
}
Then we use purrr to map each subset:
library(purrr)
split(iris,iris$Species) %>% map_dfr(func,numCols=numericCols)
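If the sum row and the empty row should follow each species block directly, as in the manual iris_setosa / iris_versicolor / iris_virginica code from the question, a base-R sketch could look like this. The function names add_sum_and_blank and process_df, and the by argument, are made up for illustration. The sketch assumes every data frame has the grouping column and at least one numeric column.

```r
# Append a row of column sums and an all-NA row to one subset
add_sum_and_blank <- function(sub) {
  numeric_cols <- vapply(sub, is.numeric, logical(1))
  # Indexing rows with NA_integer_ yields all-NA rows with the same columns
  extra <- sub[c(NA_integer_, NA_integer_), ]
  extra[1, numeric_cols] <- colSums(sub[, numeric_cols, drop = FALSE])
  rbind(sub, extra)
}

# Split by the grouping column, extend each subset, and stack them again
process_df <- function(df, by = "Species") {
  pieces <- lapply(split(df, df[[by]]), add_sum_and_blank)
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

new_iris <- process_df(iris)

# The same function works on a list of data frames
results <- lapply(list(iris, iris), process_df)
```

Because split() creates one subset per level found in the data, the number of species does not need to be known in advance.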
Background
I am working with a large dataset from a repeated measures clinical trial in R, where I want to do some data manipulations for each subject. This could be extraction of the max value in column x for each subject or the mean of column y for each subject.
Problem
I am fond of using the dplyr package and pipes, which led me to the group_by function. But when I try to apply it, the data that I want to extract does not seem to group by subject as it is supposed to, but rather extracts data based on the entire dataset.
Code
This is what I have done so far:
data <- read.csv(file="group_by_question.csv", header=TRUE, sep=",")
library(dplyr)
library(plyr)
data <- tbl_df(data)
test <- data %>%
filter(!is.na(wght)) %>%
dplyr::group_by(subject_id) %>%
mutate(maxwght=max(wght),meanwght=mean(wght)) %>%
ungroup()
Sample of the test dataframe:
Find a .csv sample of my dataset here:
https://drive.google.com/file/d/1wGkSQyJXqSswThiNsqC26qaP7d3catyX/view?usp=sharing
Is this what you want? In my example below, the output shows the maximum of wght for each subject_id. You could replace max() with mean() if you need the mean of wght for each subject_id. Note that group_by() has to come before the calculation; otherwise max() is computed over the entire dataset.
library(dplyr)
data <- read.csv(file="group_by_question.csv", header=TRUE, sep=",")
test <- data %>%
  filter(!is.na(wght)) %>%
  group_by(subject_id) %>%
  summarise(value = max(wght)) %>%
  ungroup()
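One likely cause of the behaviour described in the question, assuming the script runs as posted: plyr is attached after dplyr, so plyr::mutate() masks dplyr::mutate(), and the plyr version ignores the grouping. A minimal sketch that keeps every row, as the question's mutate() call does, and namespaces each verb explicitly is shown below. The toy values are invented for illustration; only the column names subject_id and wght come from the question.

```r
library(dplyr)

# Invented toy data with the question's column names
toy <- data.frame(subject_id = c(1, 1, 2, 2, 2),
                  wght       = c(70, 72, NA, 80, 84))

per_subject <- toy %>%
  dplyr::filter(!is.na(wght)) %>%
  dplyr::group_by(subject_id) %>%
  dplyr::mutate(maxwght = max(wght), meanwght = mean(wght)) %>%
  dplyr::ungroup()

per_subject
```

Dropping library(plyr), or attaching it before dplyr, should have the same effect as the explicit dplyr:: prefixes.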
Sample of dataset:
diag01 <- as.factor(c("S7211","J47","J47","K729","M2445","Z509","Z488","R13","L893","N318","L0311","S510","A047","D649"))
diag02 <- as.factor(c("K590","D761","J961","T501","M8580","R268","T831","G8240","B9688","G550","E162","T8902","E86","I849"))
diag03 <- as.factor(c("F058","M0820","E877","E86","G712","R32","A408","E888","G8220","C794","T68","L0310","M1094","D469"))
diag04 <- as.factor(c("E86","C845","R790","I420","G4732","R600","L893","R509","T913","C795","M8412","G8212","L891","L0311"))
diag05 <- as.factor(c("R001","N289","E876","E871","H659","R4589","N508","B99","I209","C773","T921","Q070","H919","L033"))
diag06 <- as.factor(c("I951","E877","S7240","I500","H901","E119","Z223","K590","I959","C509","G819","F719","Z290","R13"))
df <- data.frame(diag01, diag02, diag03, diag04, diag05, diag06)
I want to filter the entire rows that have a partial string match anywhere in a given list of columns (e.g. diag01, diag02, ...). I can achieve this on a single column e.g.
junk <- filter(df, grepl(pattern="^E11|^E16|^E86|^E87|^E88", diag02))
but I need to apply this to multiple columns (the original dataset has 216 columns and >1,000,000 rows). Among other options, I have tried
junk <- filter(df, grepl(pattern="^E11|^E16|^E86|^E87|^E88", df[,c(1:6)]))
junk <- apply(df, 1, function(r) any(r %in% grepl(pattern="^E11|^E16|^E86|^E87|^E88")))
I need the entire row and ideally I would like the filtering criteria to be restricted to a given list of columns as it is likely values in other columns may begin with the declared partial strings.
Made a genuine effort to search for a solution but obviously my knowledge of R is lacking.
Perhaps we need
df %>%
filter_all(any_vars(grepl(pattern="^(E11|E16|E86|E87|E88)", .)))
Or with purrr and dplyr
library(dplyr)
library(purrr)
df %>%
map(~grepl(pattern="^E11|^E16|^E86|^E87|^E88", .)) %>%
reduce(`|`) %>%
df[.,]
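The question also asks for the match to be restricted to a given list of columns. A base-R sketch of that restriction follows, using the first four rows of the sample data. The name target_cols and the choice of columns are assumptions for illustration.

```r
# First four rows of the sample data from the question
df <- data.frame(
  diag01 = c("S7211", "J47", "J47", "K729"),
  diag02 = c("K590", "D761", "J961", "T501"),
  diag03 = c("F058", "M0820", "E877", "E86"),
  diag04 = c("E86", "C845", "R790", "I420")
)

pattern <- "^(E11|E16|E86|E87|E88)"
target_cols <- c("diag01", "diag02", "diag03")  # diag04 is deliberately left out

# One logical vector per target column, then combine them with OR
hits <- lapply(df[target_cols], function(col) grepl(pattern, as.character(col)))
keep <- Reduce(`|`, hits)

junk <- df[keep, ]
junk
```

Row 1 has E86 in diag04 only, so it is not selected. With dplyr 1.0.4 or later, filter(df, if_any(all_of(target_cols), ~ grepl(pattern, .x))) should express the same restriction.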
I have a data frame with 3 variables (subject, trialtype, and RT), and I need to randomly select half of the RT observations for each subject, and then re-create the data frame from that selection.
In browsing the list I've got up to here
split_df <- split(bucnidata_rt,
list(bucnidata_rt$Subject, bucnidata_rt$trialtype))
(this gives a series of split_df[1], split_df[2], ....)
But then I can not subset using this
split_df[1] <- sample(nrow(split_df[1]), 24), ]
I think because sample only works on data frames and this split_df[1] is a list.
To re-merge I would do:
remerged_df <- unsplit(split_df[1],
list(bucnidata_rt$Subject, bucnidata_rt$trialtype))
Could you please help me with step 2?
I propose a slightly different approach using dplyr if you don't mind. You can group by subject and then randomly select 50% of observations of each group:
library(dplyr)
bucnidata_rt %>%
group_by(Subject) %>%
sample_frac(size = 0.5)
Edit
Here's another way, closer to what you started. I use the mtcars dataset in this case:
split_df <- split(mtcars, mtcars$cyl) #split by `cyl`
#randomly select 50% of rows per group, without replacement
split_df <- lapply(split_df, function(x) x[sample(seq_len(nrow(x)), nrow(x)/2, replace=FALSE),])
#merge the randomly selected list elements back into one data.frame
remerged_df <- do.call(rbind, split_df)
#check the result
nrow(remerged_df)
#[1] 15
Edit #2: corrected the dplyr method after a comment by Gregor.
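The split/lapply approach carries over to the two grouping variables from the question. Note that split_df[1] is a list of length one, whereas split_df[[1]] is the data frame itself, which is why nrow(split_df[1]) returns NULL. The sketch below uses invented toy data in place of bucnidata_rt; only the column names Subject, trialtype and RT come from the question.

```r
set.seed(1)  # only to make the random selection reproducible

# Invented toy data: 2 subjects x 2 trial types x 4 observations
toy <- data.frame(
  Subject   = rep(c("s1", "s2"), each = 8),
  trialtype = rep(c("a", "b"), times = 8),
  RT        = round(runif(16, 300, 900))
)

# One data frame per Subject/trialtype combination
groups <- split(toy, list(toy$Subject, toy$trialtype), drop = TRUE)

# Keep a random half of the rows in each group, without replacement
half <- lapply(groups, function(g) {
  g[sample(seq_len(nrow(g)), floor(nrow(g) / 2)), , drop = FALSE]
})

# Stack the sampled groups back into one data frame
remerged <- do.call(rbind, half)
rownames(remerged) <- NULL
nrow(remerged)
```

Using do.call(rbind, ...) instead of unsplit() avoids having to supply a grouping factor that matches the reduced number of rows.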