I have several SPSS datasets (survey data), and each dataset has a number of waves (one wave per month).
Let's assume that I have four datasets (1 to 4) and two waves for each (_W1 and _W2):
datasets <- c("dataset1_W1.sav", "dataset1_W2.sav",
"dataset2_W1.sav", "dataset2_W2.sav",
"dataset3_W1.sav", "dataset3_W2.sav",
"dataset4_W1.sav", "dataset4_W2.sav")
datasets
My goal is to stack all waves of each dataset (dataset1_W1 and dataset1_W2; dataset2_W1 and dataset2_W2; etc.). In order to do so I read the files using haven::read_spss(filename) and then I stack them using dplyr::bind_rows(df1, df2).
Now, I'd like to create a tibble for each dataset:
library(dplyr)
library(haven)
ds1_1 <- haven::read_spss("dataset1_W1.sav")
ds1_2 <- haven::read_spss("dataset1_W2.sav")
dataset1_all <- dplyr::bind_rows(ds1_1, ds1_2)
ds2_1 <- haven::read_spss("dataset2_W1.sav")
ds2_2 <- haven::read_spss("dataset2_W2.sav")
dataset2_all <- dplyr::bind_rows(ds2_1, ds2_2)
etc.
But how can I create those tibbles (dataset1_all, dataset2_all, etc.) automatically? I've read that I should avoid dynamic variable names.
This will create a named list of data frames, where each element is one dataset with both waves bound together:
library(tidyverse)
datasets <- c("dataset1_W1.sav", "dataset1_W2.sav",
"dataset2_W1.sav", "dataset2_W2.sav",
"dataset3_W1.sav", "dataset3_W2.sav",
"dataset4_W1.sav", "dataset4_W2.sav")
dataset_id <- str_extract(datasets, "[^0-9]*[0-9]")
list_of_dfs <- datasets %>%
  split(dataset_id) %>%
  map_depth(.depth = 2, .f = haven::read_spss) %>%
  map(bind_rows)
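Note that the pattern "[^0-9]*[0-9]" stops at the first digit, so dataset1 and dataset10 would end up in the same group. A minimal base R sketch of the grouping step that avoids this, assuming every file name ends in _W<number>.sav; the names stack_waves and fake_reader are hypothetical, and fake_reader only stands in for haven::read_spss so the example runs without real files:

```r
datasets <- c("dataset1_W1.sav", "dataset1_W2.sav",
              "dataset2_W1.sav", "dataset2_W2.sav",
              "dataset10_W1.sav", "dataset10_W2.sav")

# Strip the wave suffix so that all digits of the dataset id are kept
dataset_id <- sub("_W[0-9]+\\.sav$", "", datasets)
files_by_dataset <- split(datasets, dataset_id)

# Hypothetical helper: read every file of one dataset and stack the rows
stack_waves <- function(files, reader) {
  do.call(rbind, lapply(files, reader))
}

# Stand-in reader (assumption); with real files pass haven::read_spss instead
fake_reader <- function(path) data.frame(file = path, value = 1)

stacked <- lapply(files_by_dataset, stack_waves, reader = fake_reader)
names(stacked)
```

Unlike dplyr::bind_rows, rbind requires all waves to have the same columns, so bind_rows is probably the safer choice for real survey files.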
This is a minimal reproducible example of my data frame:
value <- c(rnorm(39, 5, 2))
Date <- seq(as.POSIXct('2021-01-18'), as.POSIXct('2021-10-15'), by = "7 days")
df <- data.frame(Date, value)
# This is the vector I have to compare with the Date of the dataframe
dates_tour <- as.POSIXct(c('2021-01-18', '2021-05-18', '2021-08-18', '2021-10-15'))
library(dplyr)
df <- df %>%
  mutate(
    tour = cut(Date, breaks = dates_tour, labels = seq_along(dates_tour[-1]))
  )
Now that each row of the data frame is labelled with a group based on dates_tour, I want to split the data frame by the tour factor, but each list element must also contain the rows of all previous groups.
For instance, df_list[[1]] contains the rows with tour == 1. The second element needs to contain the first and second groups (tour == 1 | tour == 2), the third needs to contain the first, second, and third groups, and so on. The code has to be general, because dates_tour can have a different length each time.
This code creates a list based on the tour value
df_list = split(df, df$tour)
But it does not produce what I need.
You could also do:
Reduce(rbind, split(df, ~tour), accumulate = TRUE)
If you have a version of R older than 4.1.0, where split() does not accept a formula:
Reduce(rbind, split(df, df$tour), accumulate = TRUE)
You could also use accumulate from purrr:
library(purrr)
accumulate(split(df, ~tour), rbind)
We may use lapply for that:
df_list <- lapply(unique(df$tour), function(x) subset(df, tour %in% seq_len(x)))
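To see what the cumulative split returns, here is a small sketch on toy data. The data frame is made up for illustration; only split, rbind and Reduce from base R are used:

```r
# Toy data: three tours with two rows each
df <- data.frame(id = 1:6, tour = factor(c(1, 1, 2, 2, 3, 3)))

# accumulate = TRUE keeps every intermediate result of the running rbind
cumulative <- Reduce(rbind, split(df, df$tour), accumulate = TRUE)

# Element k holds the rows of tours 1..k
sapply(cumulative, nrow)
# 2 4 6
```

Rows whose tour is NA (for example dates outside the breaks) are dropped by split, so they appear in none of the elements.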
I am looking for a way to run through a series of data frames in R in order to restructure them before pushing them through multiple linear regression models. Here is the basic structure.
Let's say you have 3 data frames:
StateList <- c(AL, AR, AZ)
Where each state represents a different data frame (same columns with varying record counts). I want to restructure all 3 data frames from their RAW form to an ETL version in which I select only certain columns, in a different order than in the RAW format. I can easily do this by running this:
AL <- AL[var5,var3,var2]
AR <- AR[var5,var3,var2]
AZ <- AZ[var5,var3,var2]
Is there an easy way to loop through all the data frames (which have different names) using a list like StateList above and update all 3 data frames into the ETL format?
I tried doing the below but it doesn't seem to work:
VariableList <- c(var5,var3,var2)
for (df in StateList) {
df[VariableList]}
Something like this?
library(dplyr)
data(mtcars)
df1 <- mtcars %>% filter(cyl == 4)
df2 <- mtcars %>% filter(cyl == 6)
df3 <- mtcars %>% filter(cyl == 8)
df_names <- c("df1", "df2", "df3")
df_list <- lapply(df_names, get)
names(df_list) <- df_names
You can then use lapply or map functions to apply whatever function you require to each of the list elements (which are your data frames).
Put the data frames in a list; you can then iterate over them with lapply, select the columns in the required order, and perform whatever other tasks you need on each one.
StateList <- list(AL, AR, AZ)
VariableList <- c("var5","var3","var2")
result <- lapply(StateList, function(x) {
  new_data <- x[, VariableList]
  # Add code to perform on each dataframe
  # ....
  new_data
})
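A runnable sketch of the same idea with a named list, so that each result can still be looked up by state. The helper make_state and the column contents are invented for illustration; only the names AL, AR, AZ and var2, var3, var5 come from the question:

```r
# Hypothetical stand-ins for the three RAW state data frames
make_state <- function(n) {
  data.frame(var1 = seq_len(n), var2 = n, var3 = letters[seq_len(n)],
             var4 = TRUE, var5 = rev(seq_len(n)))
}
AL <- make_state(2)
AR <- make_state(3)
AZ <- make_state(4)

# A named list keeps the state label attached to each data frame
StateList <- list(AL = AL, AR = AR, AZ = AZ)
VariableList <- c("var5", "var3", "var2")

# Select and reorder the columns of every data frame in one pass
etl <- lapply(StateList, function(d) d[, VariableList, drop = FALSE])

names(etl$AL)
```

Keeping the results in the list (etl$AL, etl$AR, ...) rather than overwriting AL, AR and AZ makes it easier to run the regression step over all states with another lapply.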
I'm trying to modify data frames and struggle with combining my operations into a for loop. I want to subset a data frame according to one particular column, attach different rows to each subset and combine the modified subsets into one single data frame again. Let's use the iris data as an example:
#Create data frame subsets based on Species column
iris_subs <- split(iris, iris$Species)
#create an empty data frame with the same columns as in iris and one empty row
emptydf <- iris[FALSE,]
emptydf[nrow(emptydf)+1,] <- NA
#create a data frame with sums for each species
iris %>% group_by(Species) %>% summarise_all(sum) -> iris_sums
iris_sums <- iris_sums[,-c(1)] #delete column with species names
#Combine data frames into one data frame with original data, sum for this species and an empty row for each subset
iris_setosa <- bind_rows(iris_subs[1], iris_sums[1,], emptydf)
iris_versicolor <- bind_rows(iris_subs[2], iris_sums[2,], emptydf)
iris_virginica <- bind_rows(iris_subs[3], iris_sums[3,], emptydf)
new_iris <- bind_rows(iris_setosa, iris_versicolor, iris_virginica)
This code does the job. However, I have a couple of hundred data frames that I want to process in this way, and the number of different species varies between data frames. How can I automate the last part in a for loop?
I would like something like this
#empty data frame to store output
new_iris <- iris[FALSE,]
for (i in iris_subs) {
new_iris[i] <- bind_rows(iris_subs[i], iris_sums[i,], emptydf)
new_iris <- merge(new_iris[i])
}
Error in iris_subs[i] : invalid subscript type 'list'
Apart from the error, this is probably way too simple... I'm an R beginner and have searched the net for days now, but cannot find any answer to this. Does anyone have a suggestion for how to achieve this? Thank you for any hints!
We can create a function and repeat it for all the dataframes. Here is a shorter version of what you were trying to do
library(dplyr)
repeat_process <- function(df) {
iris_sums <- df %>% group_by(Species) %>% summarise_all(sum) %>% select(-Species)
df %>% bind_rows(iris_sums, emptydf[rep(1:nrow(emptydf), n_distinct(df$Species)), ])
}
Now let's assume you have a list of dataframes
list_df <- list(iris, iris)
You can apply this function to each dataframe in the list
lapply(list_df, repeat_process)
You can define a function that will sum up all numeric columns of a data.frame, and leave other columns as NA, append this to original data frame:
numericCols = sapply(iris,is.numeric)
func = function(df,numCols){
iris_sums <- colSums(df[,numCols])
result <- rep(NA,ncol(df))
names(result) <- colnames(df)
result[names(iris_sums)] <- iris_sums
rbind(df,result,rep(NA,ncol(df)))
}
Then we use purrr to map each subset:
library(purrr)
split(iris, iris$Species) %>% map_dfr(func, numCols = numericCols)
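For comparison, a base R sketch that places the sum row and the empty row directly after each group and takes the grouping column as an argument, so it can be reused on data frames with other group columns. The function name add_sum_and_blank is hypothetical:

```r
add_sum_and_blank <- function(df, group_col) {
  num_cols <- names(df)[sapply(df, is.numeric)]
  pieces <- lapply(split(df, df[[group_col]], drop = TRUE), function(part) {
    # Indexing with NA_integer_ gives one all-NA row with the same columns
    sum_row <- part[NA_integer_, , drop = FALSE]
    sum_row[1, num_cols] <- as.list(colSums(part[num_cols]))
    blank_row <- part[NA_integer_, , drop = FALSE]
    rbind(part, sum_row, blank_row)
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

new_iris <- add_sum_and_blank(iris, "Species")
nrow(new_iris)
```

To process many data frames, something like lapply(list_of_data_frames, add_sum_and_blank, group_col = "Species") should work, assuming they all share the grouping column.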
Here is my sample data
library(dplyr)
Singer <- c("A","B","C","A","B","D")
Rank <- c(1,2,3,3,2,1)
data <- tibble(Singer, Rank)
I would like to split the data into three separate csv files, each with two rows. I tried to use the split function, but it did not work out as I expected.
d <- split(data,rep(1:2,each=2))
Group first, then use do to apply the writing function to each pair of rows.
library(dplyr)
library(readr)
data %>%
group_by(g = ceiling(row_number() / 2)) %>%
do(write_csv(., paste0(.$g[1], '.csv')))
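The attempt in the question yields only two groups because rep(1:2, each = 2) has four elements and is recycled across the six rows. A base R sketch that builds one group id per pair of rows instead; the chunk_ file names and the use of tempdir() are assumptions made for the example:

```r
data <- data.frame(Singer = c("A", "B", "C", "A", "B", "D"),
                   Rank = c(1, 2, 3, 3, 2, 1))

# Rows 1-2 get id 1, rows 3-4 get id 2, rows 5-6 get id 3
chunk_id <- ceiling(seq_len(nrow(data)) / 2)
chunks <- split(data, chunk_id)

# Write each chunk to its own file in a temporary directory
out_dir <- tempdir()
paths <- file.path(out_dir, paste0("chunk_", names(chunks), ".csv"))
invisible(Map(function(chunk, path) write.csv(chunk, path, row.names = FALSE),
              chunks, paths))

length(chunks)
```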
I have a data frame with 3 variables (subject, trialtype, and RT), and I need to randomly select half of the RT observations for each subject, and then re-create the data frame from that selection.
From browsing the list, I've got this far:
split_df <- split(bucnidata_rt,
list(bucnidata_rt$Subject, bucnidata_rt$trialtype))
(this gives a series of split_df[1], split_df[2], ....)
But then I can not subset using this
split_df[1] <- sample(nrow(split_df[1]), 24), ]
I think because sample only works on data frames and this split_df[1] is a list.
To re-merge I would do:
remerged_df <- unsplit(split_df[1],
list(bucnidata_rt$Subject, bucnidata_rt$trialtype))
Could you please help me with step 2?
I propose a slightly different approach using dplyr if you don't mind. You can group by subject and then randomly select 50% of observations of each group:
library(dplyr)
bucnidata_rt %>%
group_by(Subject) %>%
sample_frac(size = 0.5)
Edit
Here's another way, closer to what you started. I use the mtcars dataset in this case:
split_df <- split(mtcars, mtcars$cyl) #split by `cyl`
#randomly select 50% of rows per group, without replacement
split_df <- lapply(split_df, function(x) x[sample(seq_len(nrow(x)), nrow(x)/2, replace=FALSE),])
#merge the randomly selected list elements back into one data.frame
remerged_df <- do.call(rbind, split_df)
#check the result
nrow(remerged_df)
#[1] 15
Edit #2: corrected the dplyr method after a comment by @Gregor.
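If half of the observations should be drawn within every Subject and trialtype combination, as in the split from the question, the same lapply idea might be sketched like this. The toy data frame is invented; only the column names come from the question:

```r
set.seed(1)  # only to make the sketch reproducible

# Toy stand-in for bucnidata_rt: 2 subjects x 2 trial types x 4 trials
toy <- data.frame(Subject = rep(1:2, each = 8),
                  trialtype = rep(c("a", "b"), times = 8),
                  RT = runif(16))

# One list element per Subject/trialtype combination
groups <- split(toy, list(toy$Subject, toy$trialtype), drop = TRUE)

# Keep a random half of the rows of each element (rounded down)
half <- lapply(groups, function(g) {
  g[sample(nrow(g), floor(nrow(g) / 2)), , drop = FALSE]
})

remerged_df <- do.call(rbind, half)
nrow(remerged_df)
```

unsplit cannot be used for the final step here, because after sampling the groups no longer match the original grouping vector in length; do.call(rbind, ...) simply stacks whatever rows remain.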