Error merging multiple data frames in a list - r

I've been trying to merge a list of dataframes and keep getting the error:
"Error in [.data.frame(y, ids$y, y.cols, drop = FALSE) :
undefined columns selected".
Below is the code I've used
read_OO <- function(filename){
read.delim(filename, skip=14)
}# Skip first 14 lines of metadata in data files
filenames <- list.files(folderpath, pattern="*.txt", full.names=TRUE)
filelist <- lapply(filenames, read_OO)
SampleIDs <- stringr::str_remove(str_remove(filenames, folderpath), ".txt")
names(filelist) <- SampleIDs
filelist <- mapply(cbind, filelist, SampleIDs, SIMPLIFY=F)
colnames <- c("Wavelength","Absorbance", "SampleIDs")
filelist <- lapply(filelist, setNames, colnames)
abs2017 <- plyr::join_all(filelist, by = c("Wavelength","Absorbance", "SampleIDs"), type = "full", match = "all")
The error comes on the last line
I've also tried merging by
t <- Reduce(function(x, y) merge(x, y,
by=c("Wavelength","Absorbance", "SampleIDs"),
all = TRUE), filelist)
But it stops the code at an "approximate location" (it doesn't provide a specific error and says it can't find the source)
Is there something I can look for in my file structure that may be the problem? I can't find any inconsistencies between the files (they're all identical outputs from a machine)

There was in fact a single file with a slightly different format than all the other files, so this is now solved. Once that was corrected, the code above worked.
If anyone has any comments on how to scan through a list and check for structure discrepancies that would be appreciated!

Related

Create a list of tibbles with unique names using a for loop

I'm working on a project where I want to create a list of tibbles containing data that I read in from Excel. The idea will be to call on the columns of these different tibbles to perform analyses on them. But I'm stuck on how to name tibbles in a for loop with a name that changes based on the for loop variable. I'm not certain I'm going about this the correct way. Here is the code I've got so far.
filenames <- list.files(path = getwd(), pattern = "xlsx")
RawData <- list()
for(i in filenames) {
RawData <- list(i <- tibble(read_xlsx(path = i, col_names = c('time', 'intesity'))))
}
I've also got the issue where, right now, the for loop overwrites RawData with each turn of the loop but I think that is something I can remedy if I can get the naming convention to work. If there is another method or data structure that would better suite this task, I'm open to suggestions.
Cheers,
Your code overwrites RawData in each iteration. You should use something like this to add the new tibble to the list RawData <- c(RawData, read_xlsx(...)).
A simpler way would be to use lapply instead of a for loop :
RawData <-
lapply(
filenames,
read_xlsx,
col_names = c('time', 'intesity')
)
Here is an approach with map from package purrr
library(tidyverse)
filenames <- list.files(path = getwd(), pattern = "xlsx")
mylist <- map(filenames, ~ read_xlsx(.x, col_names = c('time', 'intesity')) %>%
set_names(filenames)
Similar to the answer by #py_b, but add a column with the original file name to each element of the list.
filenames <- list.files(path = getwd(), pattern = "xlsx")
Raw_Data <- lapply(filenames, function(x) {
out_tibble <- read_xlsx(path = x, col_names = c('time', 'intesity'))
out_tibble$source_file <- basename(x) # add a column with the excel file name
return(out_tibble)
})
If you want to merge the list of tibbles into one big one you can use do.call('rbind', Raw_Data)

Rbind multiple Data Frames in a loop

I have a bunch of data frames that are named in the same pattern "dfX.csv" where X represents a number from 1 to 67. I loaded them into seperate dataframes using following piece of code:
folder <- mypath
file_list <- list.files(path=folder, pattern="*.csv")
for (i in 1:length(file_list)){
assign(file_list[i],
read.csv(paste(folder, file_list[i], sep=',', header=TRUE))
)}
What I'm trying to do is merge/rbind them into a single huge dataframe.
for (i in 1:length(file_list)){
df_main <- rbind(df_main, df[[i]].csv)
}
However using that I'm getting an error:
Error: unexpected symbol in:
"for (i in 1:length(file_list)){
df_main <- rbind(df_main, df[[i]].csv"
Any idea what might be causing an issue & whether there's a simpler way of doing things.
If file_list is a character vector of filenames that have since been loaded into variables in the local environment, then perhaps one of
do.call(rbind.data.frame, mget(ls(pattern = "^df\\s+\\.csv")))
do.call(rbind.data.frame, mget(paste0("df", seq_along(file_list), ".csv")))
The first assumes anything found (as df*.csv) in R's environment is appropriate to grab. It might not grab then in the correct order, so consider using sort or somehow ordering them yourself.
mget takes a string vector and retrieves the value of the object with each name from the given environment (current, by default), returning a list of values.
do.call(rbind.data.frame, ...) does one call to rbind, which is much much faster than iteratively rbinding.
Here I use map() to iterate over your files reading each one into a list of dataframes and bind_rows is used to bind all df together
library(tidyverse)
df <- map(list.files(), read_csv) %>%
bind_rows()
If you have a lot of data (lot of rows), here's a data.table approach that works great:
library(data.table)
basedir <- choose.dir() # directory with all the csv files
file_names <- list.files(path = basedir, pattern= '*.csv', full.names = F, recursive = F)
big_list <- lapply(file_names, function(file_name){
dat <- fread(file = file.path(basedir, file_name), header = T)
# Add a 'filename' column to each data.table to back-track where it was read from
# this is why we set full.names = F in the list.files line above
dat$filename <- gsub('.csv', '', file_name)
return(dat)
})
big_data <- rbindlist(l = big_list, use.names = T, fill = T)
If you want to read only some columns and not all, you can use the select argument in fread - helps improve speed since empty columns are not read in, similarly skip lets you skip reading in a bunch of rows.

Combining .txt files in R using a loop

I'm currently trying to use R to combine dozens of .txt files into one single .txt file. Attached below is the code that I've been experimenting with so far. The files that I'm trying to combine have very similar names, for example: "e20171ny0001000.txt" and "e20171ct0001000.txt". As you can see, the only difference in the file names are the different state abbreviations. This is why I've been trying to use a for loop, in order to try to go through all the state abbreviations.
setwd("/Users/tim/Downloads/All_Geographies")
statelist = c('ak','al','ar','az','ca','co','ct','dc','de','fl','ga','hi','ia','id','il','in','ks','ky','la','ma','md','me','mi','mn','mo','ms','mt','nc','nd','ne','nh','nj','nm','nv','ny','oh','ok','or','pa','ri','sc','sd','tn','tx','ut','va','vt','wa','wi','wv','wy')
for (i in statelist){
file_names <- list.files(getwd())
file_names <- file_names[grepl(paste0("e20171", i, "0001000.txt"),file_names)]
files <- lapply(file_names, read.csv, header=F, stringsAsFactors = F)
files <- do.call(rbind,files)
}
write.table(files, file = "RandomFile.txt", sep="\t")
When I run the code, there isn't a specific error that pops up. Instead the entire code runs and nothing happens. I feel like my code is missing something that is preventing it from running correctly.
We need to create a list to update. In the OP's code,files is a list of data.frame that gets updated in the for loop. Instead, the output needss to be stored in a list. For this, we can create a list of NULL 'out' and then assign the output to each element of 'out'
out <- vector('list', length(statelist))
for (i in seq_along(statelist)){
file_names <- list.files(getwd())
file_names <- file_names[grepl(paste0("e20171", statelist[i],
"0001000.txt"),file_names)]
files <- lapply(file_names, read.csv, header=FALSE, stringsAsFactors = FALSE)
out[[i]] <- do.call(rbind, files)
}
As out is a list of data.frame, we need to loop over the list and then write it back to file
newfilenames <- paste0(statelist, "_new", ".txt")
lapply(seq_along(out), function(i) write.table(out[[i]],
file = newfilenames[i], quote = FALSE, row.names = FALSE))

R- Reading one column, and using qpcr cbind

So I am trying to read several csv files, take their first column and create a new file. I have succeeded using qpcR and data.table using the following code:
FileNames <- dir(pattern = "*.csv")
x <- integer()
for (FileName in FileNames) {
data <- read.csv(file = FileName, header=FALSE, skip=1)
y <- data[,1]
x<-qpcR:::cbind.na(x, y)
rm(data)
}
write.csv(x, file = 'test.csv')
This works fine, however I have discovered that I can read just the first column of my data using the data.table library.
x <- integer()
for (FileName in FileNames) {
data <- fread(FileName,select=1,skip=1, header=FALSE)
y <- data[1:nrow(data),]
x<-qpcR:::cbind.na(x, y)
rm(data)
}
write.csv(x, file = 'test.csv')
However this seems to treat y as a data value or integer, which throws up the error:
Error in data.table::data.table(...) :
Item 2 has no length. Provide at least one item (such as NA, NA_integer_ etc) to be repeated to match the 11 rows in the longest column. Or, all columns can be 0 length, for insert()ing rows into.
Any help on this would be great thanks.
Turns out after investigating using typeof(), that I needed to convert the list generated by fread, to a numeric by adding the following line.
data <- as.numeric(unlist(data))
This then worked

Extracting file numbers from file names in r and looping through files

I have a folder full of .txt files that I want to loop through and compress into one data frame, but each .txt file is data for one subject and there are no columns in the text files that indicate subject number or time point in the study (e.g. 1-5). I need to add a line or two of code into my loop that looks for strings of four numbers (i.e. each file is labeled something like: "4325.5_ERN_No_Startle") and just creates a column with 4325 and another column with 5 that will appear for every data point for that subject until the loop gets to the next one. I have been looking for awhile but am still coming up empty, any suggestions?
I also have not quite gotten the loop to work:
path = "/Users/me/Desktop/Event Codes/ERN task/ERN text files transferred"
out.file <- ""
file <- ""
file.names <- dir(path, pattern =".txt")
for(i in 1:length(file.names)){
file <- read.table(file.names[i],header=FALSE, fill = TRUE)
out.file <- rbind(out.file, file)
}
which runs okay until I get this error message part way through:
Error in read.table(file.names[i], header = FALSE, fill = TRUE) :
no lines available in input
Consider using regex to parse the file name for study period and subject, both of which are then binded in a lapply of list.files:
path = "path/to/text/files"
# ANY TXT FILE WITH PATTERN OF 4 DIGITS FOLLOWED BY A PERIOD AND ONE DIGIT
file.names <- list.files(path, pattern="*[0-9]{4}\\.[0-9]{1}.*txt", full.names=TRUE)
# IMPORT ALL FILES INTO A LIST OF DATAFRAMES AND BINDS THE REGEX EXTRACTS
dfList <- lapply(file.names, function(x) {
if (file.exists(x)) {
data.frame(period=regmatches(x, gregexpr('[0-9]{4}', x))[[1]],
subject=regmatches(x, gregexpr('\\.[0-9]{1}', x))[[1]],
read.table(x, header=FALSE, fill=TRUE),
stringsAsFactors = FALSE)
}
})
# COMBINE EACH DATA FRAME INTO ONE
df <- do.call(rbind, dfList)
# REMOVE PERIOD IN SUBJECT (NEEDED EARLIER FOR SPECIAL DIGIT)
df['subject'] <- sapply(df['subject'],
function(x) gsub("\\.", "", x))
You can try to use tryCatchwhich basically would give you a NULL instead of an error.
file <- tryCatch(read.table(file.names[i],header=FALSE, fill = TRUE), error=function(e) NULL))

Resources