I'm having a small issue binding some data frames together.
I used the following procedure to create simplified dfs:
PWI_1 <- read.xlsx("/Users/myname/Desktop/PrelimPWI/PWI_1_V1A.xlsx", colNames = TRUE, startRow = 2)
Slim_1 <- data.frame(PWI_1$ExperimentName,PWI_1$Subject, PWI_1$error,PWI_1$Block, PWI_1$Category, PWI_1$Trial,PWI_1$prompt1.RT)
...etc for the following files.
I then used this code to try and bind the dfs:
merged <- bind_rows(list(Slim_1,Slim_10,Slim_11...))
However, the dfs are concatenated to the right, instead of appended on the end in one long format df.
*Note the PWI_V1x is the name of the experiment version, which needs to be lined up
I think the error is caused by the variable trimming process (i.e. creating a 'slim' df), but unfortunately the untrimmed files have different numbers of columns, so I get an error when trying to bind the original dfs. Any advice would be appreciated!
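bind_rows requires column names to be the same. data.frame(PWI_1$Subject, ...) names each column after the expression that produced it (PWI_1.Subject, PWI_1.error, and so on), so every slim data frame ends up with different column names and bind_rows places them side by side. Instead of "slimming" your data frame that way, use dplyr::select so you pick out the same column names every time.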
Slim_PWI_1 <- read.xlsx("/Users/myname/Desktop/PrelimPWI/PWI_1_V1A.xlsx", colNames = TRUE, startRow = 2) %>%
select(ExperimentName, Subject, error, Block, Category, Trial, prompt1.RT)
then this should work:
merged <- bind_rows(Slim_PWI_1, ...)
Edit:
If you have multiple files as you indicate, you can read and slim them all together like this:
Slim_PWI_list <- dir(path = "/Users/myname/Desktop/PrelimPWI/", pattern = "PWI.*xlsx", full.names = TRUE) %>%
map(~read.xlsx(., colNames = TRUE, startRow = 2)) %>%
map(~select(., ExperimentName, Subject, error, Block, Category, Trial, prompt1.RT))
merged <- bind_rows(Slim_PWI_list)
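To see why the original approach spread the data to the right, here is a minimal sketch with two made-up data frames standing in for the Excel files (the values and the `extra` column are assumptions for illustration only):

```r
library(dplyr)

# Hypothetical stand-ins for two files read with read.xlsx()
PWI_1 <- data.frame(Subject = 1:2, Block = c(1, 1), extra = c("a", "b"))
PWI_2 <- data.frame(Subject = 3:4, Block = c(2, 2))

# data.frame(PWI_1$Subject, ...) derives column names from the expressions
bad_1 <- data.frame(PWI_1$Subject, PWI_1$Block)
bad_2 <- data.frame(PWI_2$Subject, PWI_2$Block)
names(bad_1)  # "PWI_1.Subject" "PWI_1.Block"

# No names match, so bind_rows keeps four columns and pads with NA
bad <- bind_rows(bad_1, bad_2)

# select() keeps the original names, so the rows stack as intended
good <- bind_rows(
  select(PWI_1, Subject, Block),
  select(PWI_2, Subject, Block)
)
```

Here `bad` has four columns padded with NA, while `good` has two columns and four rows.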
I am using fread() from the data.table package with map_df() from purrr. I have tens of thousands of csv files with thousands of rows each. They are all mostly correct, except that occasionally the system started writing a new row before the previous row had finished. So roughly one row out of hundreds of thousands is glitched, and there is no pattern to it.
I understand what the issue is and why mapping doesn't work, but I have no idea how to solve it given the large number of files I have. Finding these odd rows and removing them manually is not possible.
I am not too sure how to create an example through code, so I have included a link to a DropBox folder. There are five csv files named test1 to test5; test5 has the error in it.
data <- fs::dir_ls(path = your_path, recurse = TRUE) %>%
map_df(~fread(., header = TRUE, fill = TRUE))
When mapping the data I get this error message:
Error: Can't combine `..1$b` <integer> and `..5$b` <character>.
I hope I have made things clear. Please let me know if any more information is needed.
Any help would be appreciated.
The issue is that your column b has differing data types across files. In the case of your example data, that's because there is a date string in test5.csv. One option would be to read your files as a list, convert all or just the problematic columns to character, and then apply bind_rows to bind them into one data.frame. Afterwards you can figure out what's wrong with the problematic column(s) and how to deal with the issue.
library(data.table)
library(purrr)
library(dplyr)
data <- fs::dir_ls(path = "fread_error", recurse = TRUE) %>%
map(~fread(., header = TRUE, fill = TRUE)) %>%
map(~mutate(.x, across("b", as.character))) %>%
bind_rows()
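Once everything is bound as character, the glitched rows can be located by checking which values fail to convert back. The following is only a sketch under assumptions: the two small files written to a temporary folder are invented stand-ins for the real csv files, and column b is assumed to hold integers in clean rows.

```r
library(data.table)
library(purrr)
library(dplyr)

# Hypothetical demo files; test2.csv contains a stray date string in column b
tmp <- tempfile("fread_demo_")
dir.create(tmp)
writeLines(c("a,b", "1,2", "3,4"), file.path(tmp, "test1.csv"))
writeLines(c("a,b", "5,6", "7,2020-01-01"), file.path(tmp, "test2.csv"))

files <- list.files(tmp, pattern = "\\.csv$", full.names = TRUE)

combined <- files %>%
  map(~fread(.x, header = TRUE)) %>%
  map(~mutate(.x, across("b", as.character))) %>%
  bind_rows() %>%
  as.data.frame()

# Rows whose b value does not parse as an integer are candidates for the glitch
is_bad <- is.na(suppressWarnings(as.integer(combined$b)))
suspect <- combined[is_bad, ]
clean <- combined[!is_bad, ]
clean$b <- as.integer(clean$b)
```

Inspecting `suspect` before discarding it is a cheap way to confirm that only glitched rows are being dropped.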
This approach reads the frames into a list, and then for each data.table, it retains only the rows where any extra columns have all NAs. It assumes there is a base number of columns that you expect (in your example this is 5)
library(data.table)
rbindlist(lapply(lapply(dir(path="fread_error/",recursive = T,full.names = T), fread), function(x) {
if(ncol(x)>5) x[rowSums(x[, lapply(.SD, function(x) !is.na(x)), .SDcols = -c(1:5)])==0][,c(1:5)]
else x
}))
Given that you have tens of thousands of files, you might want to do this in parallel; one option is here:
library(data.table)
library(doParallel)
library(foreach)
registerDoParallel(cores=detectCores())
rbindlist(foreach(fname = dir(path="fread_error/",recursive = T,full.names = T)) %dopar% {
x=fread(fname)
if(ncol(x)>5) x[rowSums(x[, lapply(.SD, function(x) !is.na(x)), .SDcols = -c(1:5)])==0][,c(1:5)]
else x
})
I'm sure there are much better ways of doing this; I'm open to suggestions.
I have these vectors:
vkt1 <- c("df1", "df2", "df3")
vector2 <- paste("sample", vkt1, sep="_")
The first vector contains the names of data frames stored in the environment. These are stored as strings, but I'd like to call them as variable names.
The second vector is just the first one with "sample" added at the beginning, equivalent to:
vector2 <- c('sample_df1', 'sample_df2', 'sample_df3')
These strings from vector2 would serve as the names of new data frames to be created.
Alrighty, so now I want to do something like this:
for (i in seq_along(vkt1)) { # meaning for i in 1,2,3
vector2[i] = data.frame(which(eval(parse(text = vkt1[i])) == "Some_String", arr.ind=TRUE))
addStyle(wb, vkt1[i], cols = 1:ncol(eval(parse(text = vkt1[i]))), rows = vector2[[i]][,1]+1, style = duppedStyle, gridExpand = TRUE)
}
It may look complicated, but the idea is to make a data frames named as the strings contained in vector2, being a subset of the data frames from vkt1 when "Some_String" is found.
Then, use that created data frame and add a style to the entire row when said string is present.
vector2[[i]][,1]+1 is intended to deploy as sample_df1[,1]+1 (in the first iteration)
Note that I'm using eval(parse(text = vkt1[i])) to get the variables from the strings of vkt1. So, say, eval(parse(text = vkt1[1])) is equal to df1 (the data frame, not the string).
Like this, the code gives the following error:
In file(filename, "r") :
cannot open file 'noCoinColor_Concat': No such file or directory
Been trying to get it working like so, but I'm beginning to feel this approach might be very wrong.
It is easier to manage code and data when you keep them in a list instead of separate dataframes.
You can use mget to fetch all the data frames named in vkt1 as a named list. Let's say you want to search for 'Some_String' in the first column of each data frame; then you can do:
new_data <- lapply(mget(vkt1), function(df) df[df[[1]] == 'Some_String', ])
I haven't included the addStyle code here because I don't know from which package it is and what it does but you can easily include it in lapply's anonymous function.
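If the match positions are needed rather than the matching rows, the same list-based idea works with which(..., arr.ind = TRUE), as in the question. This is a rough sketch with two invented data frames; the `sample_` names simply follow the naming in the question.

```r
# Hypothetical data frames standing in for the ones in the environment
df1 <- data.frame(x = c("a", "Some_String"), y = c("Some_String", "b"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(x = c("c", "d"), y = c("e", "f"),
                  stringsAsFactors = FALSE)
vkt1 <- c("df1", "df2")

# One matrix of (row, col) positions per data frame, kept in a named list
hits <- lapply(mget(vkt1), function(df) which(df == "Some_String", arr.ind = TRUE))
names(hits) <- paste("sample", vkt1, sep = "_")

# Row numbers containing the string, one integer vector per data frame
hit_rows <- lapply(hits, function(m) unique(unname(m[, "row"])))
```

Each element of `hit_rows` could then be passed as the `rows` argument inside the same loop or lapply call (plus 1 if the sheet has a header row, as in the question).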
Is it not easier to combine your data frames into a list and then use apply or map family functions to adjust your data frames?
library(dplyr)
data(mtcars)
df1 <- mtcars %>% filter(cyl == 4)
df2 <- mtcars %>% filter(cyl == 6)
df3 <- mtcars %>% filter(cyl == 8)
df_old_names <- c("df1", "df2", "df3")
df_new_names <- c("df_cyl_4", "df_cyl_6", "df_cyl_8")
df_list <- lapply(df_old_names, get)
names(df_list) <- df_new_names
I am currently working on combining my data from multiple excel files into one df. Problem is, the number of columns differ across the files (due to different experiment versions), so I need to bind only certain columns/variables from each file (they have the same names).
I tried doing this "manually" at first, using:
library(openxlsx)
PWI <- read.xlsx("/Users/myname/Desktop/PrelimPWI/PWI_1_V1A.xlsx", colNames = TRUE, startRow = 2)
Slim_1 <- data.frame(PWI$Subject, PWI$Block, PWI$Category, PWI$Trial,PWI$prompt1.RT)
#read in and pull out variables of interest for one subject
mergedFullData = merge(mergedDataA, mergedDataB)
#add two together, then add the third to the merged file, add 4th to that merged file, etc
Obviously, it seems like there's a simpler way to combine the files. I've been working on using:
library(openxlsx)
path <- "/Users/myname/Desktop/PrelimPWI"
merge_file_name <- "/Users/myname/Desktop/PrelimPWI/merge_file_name.xlsx"
filenames_list <- list.files(path= path, full.names=TRUE)
All <- lapply(filenames_list,function(filename){
print(paste("Merging",filename,sep = " "))
read.xlsx(filename, colNames=TRUE, startRow = 2)
})
PWI <- do.call(rbind.data.frame, All)
write.xlsx(PWI,merge_file_name)
However, I keep getting the error that the number of columns doesn't match, but I'm not sure where to pull out the specific variables I need (the ones listed in the earlier code). Any other tweaks I've tried has resulted in only the first file being written into the xlsx, or a completely blank df. Any help would be greatly appreciated!
library(tidyverse)
df1 <- tibble(
a = c(1,2,3),
x = c(4,5,6)
)
df2 <- tibble(
x = c(7,8,9),
y = c("d","e","f")
)
bind_rows(df1, df2)
The bind functions from dplyr should be able to help you. They can bind dataframes together by row or column, and are flexible about different column names.
You can then select the actual columns you want to keep.
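Applied to the situation in the question, that could look roughly like the sketch below. The two tibbles are invented stand-ins for files from different experiment versions, and the `version` column and the `V1A`/`V1B` labels are assumptions used only to show how `.id` can record where each row came from.

```r
library(dplyr)

# Hypothetical files with shared variables plus version-specific extras
v1 <- tibble(Subject = c(1, 1), Trial = c(1, 2), prompt1.RT = c(512, 498),
             extraA = c("x", "y"))
v2 <- tibble(Subject = c(2, 2), Trial = c(1, 2), prompt1.RT = c(605, 577),
             extraB = c(TRUE, FALSE))

wanted <- c("Subject", "Trial", "prompt1.RT")

# Bind first (unmatched columns are filled with NA), then keep the shared ones
combined <- bind_rows(list(V1A = v1, V1B = v2), .id = "version") %>%
  select(version, all_of(wanted))
```

Selecting after binding means the list of wanted columns is written once instead of once per file.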
I have an analytics script that processes batches of data with similar structure, but different column names. I need to preserve the column names for later ETL scripts, but we want to do some processing, e.g.:
results <- data.frame();
for (name in names(data[[1]])) {
# Start by combining each column into a single matrix
working <- lapply(data, function(item)item[[name]]);
working <- matrix(unlist(working), ncol = 50, byrow = TRUE);
# Dump the data for the archive
write.csv(working, file = paste(PATH, prefix, name, '.csv', sep = ''), row.names = FALSE);
# Calculate the mean and SD for each year, bind to the results
df <- data.frame(colMeans(working), colSds(working));
names(df) <- c(paste(name, '.mean', sep = ''), paste(name, '.sd', sep = ''));
# Combine the working df with the processing one
}
Per the last comment in the example, how can I combine data frames? I've tried rbind and rbind.fill but neither works, and there may be tens to hundreds of different column names in the data files.
This might have been more of an issue with searching for the right keyword, but the cbind method was actually the way to go, along with a matrix:
# Allocate for the number of rows needed
results <- matrix(nrow = rows)
for (name in names(data[[1]])) {
# Data processing
# Append the results to the working data
results <- cbind(results, df)
}
# Drop the first placeholder column created upon allocation
results <- results[, -1];
Obviously the catch is that the columns need to have the same number of rows, but otherwise it is just a matter of appending the columns to the matrix.
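A variation on the same idea, offered only as a sketch: collect the per-column summaries in a list and call cbind once at the end, which avoids the placeholder column. The `data` list below is randomly generated stand-in data, and `apply(working, 2, sd)` is used in place of colSds so that no extra package is needed.

```r
# Hypothetical stand-in for `data`: a list of batches with identical column names
set.seed(1)
data <- replicate(4, data.frame(height = rnorm(3), weight = rnorm(3)),
                  simplify = FALSE)

summaries <- lapply(names(data[[1]]), function(name) {
  # One row per batch, one column per position within the batch
  working <- do.call(rbind, lapply(data, function(item) item[[name]]))
  df <- data.frame(colMeans(working), apply(working, 2, sd))
  names(df) <- paste0(name, c(".mean", ".sd"))
  df
})

# Every summary has the same number of rows, so they can be bound column-wise
results <- do.call(cbind, summaries)
```

The original column names survive as prefixes (height.mean, height.sd, and so on), which is what the later ETL step relies on.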
I've been working on a for-loop that will automatically pull data from excel sheets (each excel file is one observation) and summarize it into a larger data frame. Eventually I would like to create a data frame where each row contains the summary data of each log. I have written the code to accurately summarize the excel files but hit a problem when joining the rows because the summary data frames don't contain all the same columns so I can't use rbind. Below is an example of the format that I have ended up with for my summarized excel sheets:
final <- data.frame("BCE_2_Dur" = c(92013), "BCE_2_Freq" = c(1), "BCD_1_Dur" = c(228804), "BCD_1_Freq"= c(7), "BSL_3_Dur" = c(100191), "BSL_3_Freq" = c(3))
Where each excel summary may have different codes (behaviors we saw in animals) at the top that match an existing full ethogram but will not necessarily include behaviors from the whole ethogram (if they're not seen).
Since this is in a for-loop I've been trying to solve the problem by just creating an empty data frame that looks like this:
empty <- data.frame("BCE_1_Dur" = c(0), "BCE_1_Freq" = c(0), "BCE_2_Dur" = c(0), "BCE_2_Freq" = c(0), "BCE_3_Dur" = c(0), "BCE_3_Freq" = c(0), "BCD_1_Dur" = c(0), "BCD_1_Freq"= c(0),"BCD_2_Dur" = c(0), "BCD_2_Freq"= c(0),"BCD_3_Dur" = c(0), "BCD_3_Freq"= c(0),"BSL_1_Dur" = c(0), "BSL_1_Freq" = c(0),"BSL_2_Dur" = c(0), "BSL_2_Freq" = c(0),"BSL_3_Dur" = c(0), "BSL_3_Freq" = c(0))
And then trying to bind them together using left_join since I want to keep all the columns in empty but fill in with columns that match in final. To provide values for the "by" argument in left_join I create a list (again this has to function within the for-loop so the list would change for every loop passed) by the column names of final:
namesfinal<-names(final)
namesfinal<-paste("'",as.character(namesfinal),"'",collapse=", ",sep="")
namesfinal<-paste("c","(",namesfinal,")",sep="")
Then I run the list into the left_join code:
Sum_Final <- left_join(x = empty, y = final, by = namesfinal)
This throws an error:
Error: by can't contain join column c('BCE_2_Dur', 'BCE_2_Freq', 'BCD_1_Dur', 'BCD_1_Freq', 'BSL_3_Dur', 'BSL_3_Freq') which is missing from LHS
My intention was to then rbind() Sum_Final to itself at the end of the loop. However, I can't get past the error. I've tried looking it up and running different versions of namesfinal through the code (e.g. 'BCE_2_Dur'='BCE_2_Dur') but get the same errors. Does anyone have a fix and/or another solution that may work within a for-loop?
You don't need a for loop or a join. You can do this using lapply and plyr::rbind.fill() -
filenames <- list.files("path to folder with all files", pattern="\\.csv$", full.names=TRUE)
ldf <- lapply(filenames, read.csv)
final_df <- plyr::rbind.fill(ldf)
rbind.fill will bind all the data frames and fill non-matching columns with NA.
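As for the error in the question: namesfinal ends up as a single string that merely looks like a call to c(), whereas `by` expects a character vector of column names such as names(final). If the goal is one row per log with every ethogram code present and zeros for unseen behaviors, a zero-row template can replace the join entirely. The sketch below uses dplyr::bind_rows instead of plyr::rbind.fill, and the shortened code list and the `other` summary are made up for illustration.

```r
library(dplyr)

# Hypothetical, shortened ethogram; the real one would list every code
all_codes <- c("BCE_1_Dur", "BCE_2_Dur", "BCE_2_Freq", "BCD_1_Dur", "BSL_1_Dur")

# Zero-row template that fixes the full set and order of columns
template <- as.data.frame(
  setNames(rep(list(numeric(0)), length(all_codes)), all_codes)
)

# Two example summaries, each covering only part of the ethogram
final <- data.frame(BCE_2_Dur = 92013, BCE_2_Freq = 1, BCD_1_Dur = 228804)
other <- data.frame(BCE_1_Dur = 500, BCE_2_Dur = 100)

# Bind everything, then turn the NA fill into zeros for unseen behaviors
summary_tbl <- bind_rows(template, final, other) %>%
  mutate(across(everything(), ~ coalesce(.x, 0)))
```

Inside a loop, each per-file summary can be appended to a list and the list passed to bind_rows once at the end, together with the template.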