I have merged a bunch of csv files but can't get them to export to one file correctly. What am I doing wrong? The data shows up in my console, but I get an error that says:
Error in as.data.frame.default(x[[i]], optional = TRUE) :
  cannot coerce class '"function"' to a data.frame
setwd("c:/users/adam/documents/r data/NBA/DK/TEMP")
filenames <- list.files("c:/users/adam/documents/r data/NBA/DK/TEMP")
do.call("rbind",lapply(filenames, read.csv, header = TRUE))
write.csv(read.csv, file ='Lineups.csv')
You did not assign the result of the do.call function to anything. This is a fairly common R noob error: a failure to understand the functional programming paradigm. Results need to be assigned to R names or they just get garbage-collected.
The error is actually from the code that you didn't put in a code block:
write.csv(read.csv, file ='Lineups.csv')
The 'read.csv' was presumably your intended name for the result of the do.call operation, except that by default it names a function rather than the data frame you expected. You could assign the do.call result to the name 'read.csv', but doing so is very poor practice. Choose a more descriptive name like 'TEMP_files_appended'.
TEMP_files_appended <- do.call("rbind",lapply(filenames, read.csv, header = TRUE))
write.csv(TEMP_files_appended, file ='Lineups.csv')
(I will observe that using header=TRUE for read.csv is not needed since that is the default for that function.)
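Putting the fix together, the whole script becomes four lines; row.names = FALSE is an addition here that keeps write.csv from prepending a column of row numbers to the output file:

setwd("c:/users/adam/documents/r data/NBA/DK/TEMP")
filenames <- list.files()
TEMP_files_appended <- do.call("rbind", lapply(filenames, read.csv))
write.csv(TEMP_files_appended, file = "Lineups.csv", row.names = FALSE)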
I'm having trouble reading in multiple .csv files from a directory. It's odd, because I read in files from two other directories using the same code with no issues immediately before running this code chunk.
setwd("C:\\Users\\User\\Documents\\College\\MLMLMasters\\Thesis\\TaggingEffectsData\\DiveStat")
my_dive <- list.files(pattern="*.csv")
my_dive
head(my_dive)
if(!require(plyr)){install.packages("plyr")}
DB = do.call(rbind.fill, lapply(my_dive, function(x) read.csv(x, stringsAsFactors = FALSE)))
DB
detach("package:plyr") ### I run this after I have finished creating all the dataframes because I sometimes have issues with plyr and dplyr not playing nice
if(!require(dplyr)){install.packages("dplyr")}
Then it throws this error:
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
no lines available in input
This doesn't make any sense, because list.files works: when I run head(my_dive) I get this output:
head(my_dive)
[1] "2004001_R881_TV3.csv" "2004002_R 57_TV3.csv" "2004002_R57_TV3.csv" "2004003_W1095_TV3.csv"
[5] "2004004_99AB_TV3.csv" "2004005_O176_TV3.csv"
Plus the Environment clearly shows that my list is populated with all 614 files as I would expect it to be.
All of the csv file sets have identical file names but different data, so they have to be read in as separate data frames from separate directories (not my decision; that's just how this dataset was organized). For that reason I can't figure out why this set of files is giving me grief when the other two sets read in fine with no issues. The only differences should be the working directory and the names of the lists and data frames.
I thought it might be something within the actual directory, but I checked and there are only .csv files in it, and the list.files call works fine. I saw a previous question similar to mine, but that poster didn't initially use the pattern = "*.csv" argument, and that was the cause of their error. I always use this argument, so that seems unlikely to be the cause here.
I'm not sure how to go about making this reproducible, but I appreciate any help offered.
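For what it's worth, "no lines available in input" is exactly the error read.table (and therefore read.csv) raises when it reaches the end of a file without finding any data, so one likely culprit is an empty (zero-byte) file hiding among the 614. A minimal diagnostic sketch, reusing my_dive from above:

sizes <- file.size(my_dive)   # file sizes in bytes
my_dive[sizes == 0]           # names of any zero-byte files

# read only the non-empty files
DB <- do.call(rbind.fill,
              lapply(my_dive[sizes > 0],
                     function(x) read.csv(x, stringsAsFactors = FALSE)))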
I used the following function to merge all .csv files in my directory into one dataframe:
library(data.table)  # for fread and rbindlist

multmerge = function(mypath){
  filenames = list.files(path = mypath, full.names = TRUE)
  rbindlist(lapply(filenames, fread), fill = TRUE)
}
dataframe = multmerge(path)
This code produces this error:
Error in rbindlist(lapply(filenames, fread), fill = TRUE) : Internal error: column 25 of result is determined to be integer64 but maxType=='character' != REALSXP
The code has worked on the same csv files before...I'm not sure what's changed and what the error message means.
Looking at the documentation of fread, I notice there is an integer64 option, so: are you dealing with integers greater than 2^31?
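If so, one hedged workaround is fread's integer64 argument, which controls how such large whole numbers are read; forcing them to "character" (or "double") keeps the column type consistent across files so rbindlist can stack them. A one-line sketch, reusing filenames from the multmerge function above:

rbindlist(lapply(filenames, fread, integer64 = "character"), fill = TRUE)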
EDIT: I added a tryCatch that prints a formatted message to the console identifying which files cause an error, along with the actual error message. For rbindlist to still execute over the normal files, the error handler returns a dummy list, which produces an extra column called ERROR; it holds NA in all rows except the bottom one(s), which contain the name(s) of the problem file(s).
I suggest that after you run this code through once, you delete the ERROR column and extra row(s) from the data.table and save the combined file as a .csv. I would then move all the files that combined properly into a different folder, leaving only the current combined file and the ones that didn't load properly in the path. Then rerun the function, this time with colClasses specified. I combined everything into one script so it's hopefully less confusing:
# First run, without colClasses
library(data.table)  # for fread and rbindlist

multmerge = function(mypath){
  filenames = list.files(path = mypath, full.names = TRUE)
  rbindlist(lapply(filenames, function(i) tryCatch(fread(i),
                   error = function(e) {
                     cat("\nError reading in file:", i, "\t")  # Identifies problem files by name
                     message(e)                                # Prints the error message without stopping the loop
                     list(ERROR = i)                           # Adds a placeholder column so rbindlist will execute
                   })),                                        # End of tryCatch and lapply
            fill = TRUE)                                       # rbindlist argument
}                                                              # End of function

# You should get the original error message and the name of each problem file.
dataframe = multmerge(path)
# Delete the placeholder column and extra rows.
# You will get as many extra rows as you have problem files -
# most likely just the one with the column 25 issue, plus any others with the same problem.
# Note: the out-of-bounds error message will probably go away with the colClasses argument pulled out.
# Save this cleaned file with something like: fwrite(dataframe, "CurrentCombinedData.csv")
# Move all files except the problem file(s) into a new folder;
# now you should have only the big combined file and the problem file(s) in your path.

# Second run, to accommodate the problem file(s): rerun the function, adding the colClasses
# argument this time. We know it is the column 25 error now, but in the future you may have
# to adapt this by adding the appropriate column(s).
multmerge = function(mypath){
  filenames = list.files(path = mypath, full.names = TRUE)
  rbindlist(lapply(filenames, function(i) tryCatch(fread(i, colClasses = list(character = c(25))),
                   error = function(e) {
                     cat("\nError reading in file:", i, "\t")  # Identifies problem files by name
                     message(e)                                # Prints the error message without stopping the loop
                     list(ERROR = i)                           # Adds a placeholder column so rbindlist will execute
                   })),                                        # End of tryCatch and lapply
            fill = TRUE)                                       # rbindlist argument
}                                                              # End of function
dataframe2 = multmerge(path)
Now we know the source of the error is column 25, which we can specify in colClasses. If you run the code and get the same error message for a different column, simply add that column's number after the 25. Once you have the data read in, I would check what is going on in that column (and in any others you had to add). Maybe there was a data entry error in one of the files, or a different encoding of an NA value. That's why I say to convert that column to character first: you lose less information than you would by converting to numeric first.
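As a quick hedged check of what actually ended up in that column (assuming the combined dataframe2 from above), the entries that fail numeric conversion usually reveal the problem:

col25 <- dataframe2[[25]]
bad <- !is.na(col25) & is.na(suppressWarnings(as.numeric(col25)))
unique(col25[bad])   # non-missing values that are not valid numbers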
Once you have no errors, always write the cleaned, combined data.table to a csv in your folder, and always move the individual files that have been combined into the other folder. That way, when you add new files, you will only be combining the big file and a few new ones, which makes it much easier to see what is going on. Just keep notes on which files gave you trouble and in which columns. Does that make sense?
Because files are often so idiosyncratic, you will have to be flexible, but this workflow should make it easy to identify problem files and add whatever you need to the fread call to make it work. Basically: archive the files that have been processed, keep track of exceptions like the column 25 one, and keep the most current combined file together with the unprocessed files in the active path. Hope that helps and good luck!
Is it possible to run lapply such that the X list argument is used as the second argument to FUN, with the first argument to FUN skipped?
One example is rjson::fromJSON(json_str, file, [other arguments]). I have a list containing several file paths of json files and would like to read each of them, collapsing the results into a list.
Normally, lapply would be ideal for this. However, in order to read from a file, the json_str argument cannot be given, even a null value. This is because fromJSON uses missing to check whether arguments are given. If both file and json_str are given, an error is thrown.
That means that lapply(files, fromJSON, json_str = NULL) will not work. I'm aware that I could work around this by manually making my own function as follows.
result <- lapply(files, function(file) {
fromJSON(file = file)
})
However, this seems cumbersome and unnecessary. Is there some cleaner way of doing this?
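One compact alternative: since R 4.1.0 the backslash lambda shorthand makes the wrapper barely longer than a bare lapply call. A minimal sketch, assuming files is a character vector of paths:

library(rjson)
result <- lapply(files, \(f) fromJSON(file = f))   # \(f) is shorthand for function(f)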
I'm trying to import a large number of text files and merge them into a single datatable using the script below, so I can parse the text. The files were originally eml files, so the formatting is a mess. I'm not interested in separating the text into fields; it would be perfectly fine if the datatable had only one field holding all the text from the files. When I run the script below, I keep getting the following error.
Error in rbind(deparse.level, ...) :
numbers of columns of arguments do not match
I've tried setting sep to various things, and running it without sep entirely, but it still gives the same error. I've also tried running the same code with read.csv in place of read.table, but again I get the same error. Any tips would be greatly appreciated.
setwd("~/stuff/folder")
file_list <- list.files()
for (file in file_list){
# if the merged dataset doesn't exist, create it
if (!exists("dataset")){
dataset <- read.table(file, header=FALSE,fill=TRUE,comment.char="",strip.white = TRUE)
}
# if the merged dataset does exist, append to it
if (exists("dataset")){
temp_dataset <-read.table(file, header=FALSE,fill=TRUE,comment.char="",strip.white = TRUE)
dataset<-rbind(dataset, temp_dataset)
rm(temp_dataset)
}
}
I think something lighter could work for you and may avoid this specific error:
them.files <- lapply(1:number.of.files, function(x)
  read.table(paste(paste("lolz", x, sep = ""), "txt", sep = "."),
             header = FALSE, fill = TRUE, comment.char = "", strip.white = TRUE))
Adapt the function to whatever your file names are.
Edit:
Actually maybe something like this could be better:
them.files <- lapply(1:length(file_list), function(x)
  read.table(file_list[x], header = FALSE, fill = TRUE, comment.char = "", strip.white = TRUE))
Merging step:
everyday.Im.merging <- do.call(rbind,them.files)
I am sure there are beautiful ways to do it with dplyr or data.table but I am a caveman.
If I may add something, I would also fancy a checking step prior the previous line of code:
sapply(them.files,str)
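If the column counts genuinely differ between files, note that do.call(rbind, them.files) will trip over the same mismatch error; the data.table route alluded to above pads missing columns with NA instead. A minimal sketch, assuming file_list as before:

library(data.table)
them.files <- lapply(file_list, function(x)
  read.table(x, header = FALSE, fill = TRUE, comment.char = "", strip.white = TRUE))
everyday.Im.merging <- rbindlist(them.files, fill = TRUE)   # fill = TRUE pads missing columns with NA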
This code works; however, I wonder if there is a more efficient way. I have a CSV file that contains a single column of ticker symbols. I then read this csv into R and apply functions to each ticker using a for loop.
I read in the csv, and then go into the data frame and pull out the character vector that the for loop needs to run properly.
SymbolListDataFrame = read.csv("DJIA.csv", header = FALSE, stringsAsFactors=F)
SymbolList = SymbolListDataFrame[[1]]
for (Symbol in SymbolList){...}
Is there a way to combine the first two lines I have written into one? Maybe read.csv is not the best command for this?
Thank you.
UPDATE
I am using the readLines method suggested by Jake and Bartek. There is a warning, "incomplete final line found on 'DJIA.csv'", but I ignore it since the data is correct.
SymbolList <- readLines("DJIA.csv")                                            # readLines method
SymbolList <- read.csv("DJIA.csv", header = FALSE, stringsAsFactors = F)[[1]]  # read.csv one-liner
The readLines function is the best solution here.
Please note that read.csv is not only for reading files with a .csv extension: it is simply the read.table function with parameters such as header and sep set differently. Check the documentation for more info.
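To illustrate that last point, these two calls are equivalent (the read.table arguments below mirror read.csv's documented defaults), which is why read.csv happily reads any comma-delimited text file regardless of extension:

df1 <- read.csv("DJIA.csv", header = FALSE)
df2 <- read.table("DJIA.csv", header = FALSE, sep = ",", quote = "\"",
                  dec = ".", fill = TRUE, comment.char = "")
identical(df1, df2)   # TRUE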