merging multiple dataframes in directory - r

I would like to merge multiple data frames in a directory. Some of these data frames have duplicate rows, and all of them have the same column information.
I found the code below online; however, I do not know how to modify it so that the duplicate rows do not cause an error.
I am getting the following response:
Error in read.table(file = file, header = header, sep = sep, quote = quote, : duplicate 'row.names' are not allowed
Here is the code that reads in multiple data frames from a single directory. How can I modify it to circumvent the duplicate rows issue?
multmerge <- function(mypath) {
  # list every file in the directory, with full paths
  filenames <- list.files(path = mypath, full.names = TRUE)
  # read each csv into a list of data frames
  datalist <- lapply(filenames, function(x) read.csv(file = x, header = TRUE))
  # merge the data frames pairwise into one
  Reduce(function(x, y) merge(x, y), datalist)
}
mymergeddata <- multmerge("/Users/Danielle/Desktop/Working Directory/Ecuador/datasets to merge")

The Problem
The problem lies not in the merging, but in one or more of the individual csv files containing duplicate row names. If you try a plain read.csv() on such a file, you get exactly this error:
Error in read.table(file = file, header = header, sep = sep, quote =
quote, : duplicate 'row.names' are not allowed
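To see the mechanism, here is a minimal reproduction with made-up data: when every data row has exactly one more field than the header, read.csv silently uses the first field of each row as its row name, and any duplicates there abort the read.
bad <- "a,b\n1,2,3\n1,5,6\n"   # two column names, but three fields per row
read.csv(text = bad)
Error in read.table(file = file, header = header, sep = sep, quote = quote, : duplicate 'row.names' are not allowed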
The Solution
So how do you circumvent it? You could fix the individual csv files, but that may be more tedious than it sounds if you have, say, 20 of them in that directory. What I would suggest instead is to not use row names during the reading step and, if they are really necessary, to set them after the reading is done. For example:
multmerge <- function(mypath) {
  filenames <- list.files(path = mypath, full.names = TRUE)
  # row.names = NULL stops read.csv from treating any column as row names
  datalist <- lapply(filenames, function(x) read.csv(file = x, header = TRUE, row.names = NULL))
  # stack the data frames on top of each other
  Reduce(function(x, y) rbind(x, y), datalist)
}
mymergeddata <- multmerge("~/Desktop")
mymergeddata[mymergeddata$Day.Index == "2014-01-07", ]
      Day.Index Sessions year
1    2014-01-07       57 2014
1091 2014-01-07       57 2014
See? Two completely identical values in Day.Index, but because they are not row names you do not get an error. If you instead change the code to use the first column (Day.Index) as row names (by specifying row.names = 1), the error is replicated:
multmerge <- function(mypath) {
  filenames <- list.files(path = mypath, full.names = TRUE)
  # row.names = 1 uses the first column (Day.Index) as row names
  datalist <- lapply(filenames, function(x) read.csv(file = x, header = TRUE, row.names = 1))
  Reduce(function(x, y) rbind(x, y), datalist)
}
mymergeddata <- multmerge("~/Desktop")
nrow(mymergeddata)
> Error in read.table(file = file, header = header, sep = sep, quote = quote, : duplicate 'row.names' are not allowed
Trivia
I am using rbind() to append by row, but you could swap in merge() and the point would still stand (keeping in mind that merge() joins on common columns rather than stacking rows).
Essentially: R requires that the row names of a data frame be unique.
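You can see that constraint directly; a minimal demonstration (the exact wording of the error may vary across R versions):
df <- data.frame(x = 1:2)
rownames(df) <- c("a", "a")
Error in `.rowNamesDF<-`(x, value = value) : duplicate 'row.names' are not allowed
And if the duplicate rows themselves are unwanted after stacking, unique(mymergeddata) will drop them.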

Related

Why is read.csv getting wrong classes?

I have to read a big .csv file, and read.csv is taking a while. I read that I should use read.csv to read just a few rows, get the column classes from those, and then read the whole file with the classes specified. I tried to do that:
library(magrittr)  # for the %>% pipe

read.csv(full_path_astro_data,
         header = TRUE,
         sep = ",",
         comment.char = "",
         nrows = 100,
         stringsAsFactors = FALSE) %>%
  sapply(class) -> col.classes

df_astro_data <- read.csv(full_path_astro_data,
                          header = TRUE,
                          sep = ",",
                          colClasses = col.classes,
                          comment.char = "",
                          nrows = 47000,
                          stringsAsFactors = FALSE)
But then I got an error message:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
scan() expected 'an integer', got '0.0776562500000022'
It looks like a column that contains numeric (double?) data was incorrectly classified as integer. This could be because some numeric columns have many zeros at the beginning, so the sampled rows all look like integers. I tried increasing the number of rows read by the first read.csv call, but that did not work. One solution I found was to do:
col.classes %>%
  sapply(function(x) ifelse(x == "integer", "numeric", x)) -> col.classes
With this the file is read much faster than without specifying column classes. Still, it would be best if all columns were classified correctly.
Any insights?
Thanks
I suspect you are correct: in your row sample some columns contain only integer-looking values, while outside the sample they contain non-integers. This is a common problem with large files. You need to either increase your row sample size or explicitly specify the column type for the columns where you see this happening.
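For example, a minimal sketch of the second option with base read.csv, assuming the mis-guessed column is called flux (a hypothetical name):
# guess classes from a sample, then override the known problem column
col.classes <- sapply(read.csv(full_path_astro_data, nrows = 1000), class)
col.classes[["flux"]] <- "numeric"  # "flux" is a placeholder column name
df_astro_data <- read.csv(full_path_astro_data, colClasses = col.classes)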
It should be noted that readr's read_csv does this row sampling automatically. From the docs: "all column types will be imputed from the first 1000 rows on the input. This is convenient (and fast), but not robust. If the imputation fails, you'll need to supply the correct types yourself." You can do that like this:
library(readr)
read_csv(YourPathName,
         col_types = cols(YourProblemColumn1 = col_double(),
                          YourProblemColumn2 = col_double()))
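Alternatively, readr can simply sample more rows when guessing; a minimal sketch, assuming a reasonably recent readr version:
library(readr)
# guess column types from the first 10,000 rows instead of the default 1000
df <- read_csv(YourPathName, guess_max = 10000)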

How to import multiple csv files into R without getting duplicate row names error

I've seen multiple answers to similar questions where people get the duplicate 'row.names' are not allowed error when importing one csv file into R, but I haven't seen a question about getting it while importing multiple csv files into one data frame. So essentially, I'm trying to import 104 files from the same directory and I get duplicate 'row.names' are not allowed. I would be able to solve the problem if I were importing only one file, as that code is extremely simple, but when it comes to multiple files I struggle. I've tried a number of different ways of importing the data properly; here are a couple of them:
setwd("path")
loaddata <- function(file ="directory") {
files <- dir("directory", pattern = '\\.csv', full.names = TRUE)
tables <- lapply(files, read.csv)
dplyr::bind_rows
}
data <- loaddata("PhaseReports")
Error:
Error in read.table(file = file, header = header, sep = sep, quote = quote, : duplicate 'row.names' are not allowed
Another attempt:
path <- "path"
files <- list.files(path=path, pattern="*.csv")
for(file in files)
{
perpos <- which(strsplit(file, "")[[1]]==".")
assign(
gsub(" ","",substr(file, 1, perpos-1)),
read.csv(paste(path,file,sep="")))
}
Error:
Error in read.table(file = file, header = header, sep = sep, quote = quote, : duplicate 'row.names' are not allowed
EDIT: For the second method, when I try read.csv(paste(path, file, sep = ""), row.names = NULL), it changes the title of my first column to row.names and shifts the data one column to the right. I tried putting
colnames(rec) <- c(colnames(rec)[-1],"x")
rec$x <- NULL
under the last line and I get this error:
Error in `colnames<-`(`*tmp*`, value = "x") :
attempt to set 'colnames' on an object with less than two dimensions
If there is a much easier way to import multiple csv files into R and I'm overcomplicating things, don't be afraid to let me know.
I know this is a combination of two questions which have been answered plenty of times here, but I didn't see anyone ask this specific question. Thanks in advance!
EDIT 2:
All of the individual files contain data like this:
Half,Play,Type,Time
1,1,Start,00:00:0
1,2,,0:23:5
1,3,pass,00:03:76
2,4,start,00:04:76
2,5,pass,00:06:92
2,6,end,00:08:00
Although this may not solve your problem, you could try skipping the headers while reading the files and adding them back afterwards. So, in some of your approaches, something like:
read.csv("Your files/file/paste", header = FALSE, skip = 1)
This skips the header row and will hopefully help with the duplicate row names. The full code to do it could be:
my_files <- dir("Your path/folder etc", pattern = '\\.csv', full.names = TRUE)
result <- do.call(rbind, lapply(my_files, read.csv, header = FALSE, skip = 1))
names(result) <- c("Half", "Play", "Type", "Time")
You can add the header back later (the names(result) line does that).
If you still have problems I would suggest creating a loop like this:
for (i in my_files) {
  print(i)    # print the file name before attempting to read it
  read.csv(i)
}
Then see which file name is the last one printed before you get the error; that is the file you should investigate. In particular, check whether any row contains more than three commas (i.e. more than four fields): when a data row has one more field than the header, read.csv treats its first field as a row name, and duplicates there trigger exactly this error. Hope it helps!
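A quick way to automate that check is base R's count.fields(), which counts the fields on every line of a file; a minimal sketch, assuming comma-separated files:
# flag every file whose lines do not all have the same number of fields
for (f in my_files) {
  n <- count.fields(f, sep = ",")
  if (length(unique(n)) > 1) {
    cat(f, "has rows with", paste(unique(n), collapse = "/"), "fields\n")
  }
}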

R - Importing Multiple Tables from a Single CSV file

I was hoping there may be a way to do this, but after trying for a while I have had no luck.
I am working with a datafile (.csv format) that is being supplied with multiple tables in a single file. Each table has its own header row, and data associated with it. Is there a way to import this file and create separate data frames for each header/dataset?
Any help or ideas that can be provided would be greatly appreciated.
A sample of the datafile and its structure can be found here.
When trying to use read.csv I get the following error:
"Error in read.table(file = file, header = header, sep = sep, quote = quote, :
more columns than column names"
Read the help for read.table:
nrows: number of rows to parse
skip: number of rows to skip
You can parse your file as follows:
first <- read.table(myFile, nrows=2)
second <- read.table(myFile, skip=3, nrows=2)
third <- read.table(myFile, skip=6, nrows=8)
You can always automate this by using grep() to search for the table separators.
You can also read the table using fill=TRUE, and then split out the tables afterwards.
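For instance, a minimal sketch of the grep() approach, assuming each table's header line begins with a recognizable marker (the string "Table" here is hypothetical):
lines <- readLines(myFile)
# line numbers where each table starts, and where each one ends
starts <- grep("^Table", lines)
ends <- c(starts[-1] - 1, length(lines))
# read each chunk of lines as its own data frame
tables <- Map(function(s, e) read.csv(text = lines[s:e]), starts, ends)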

Error when reading in a .txt file and splitting it into columns in R

I would like to read a .txt file into R, and have done so numerous times.
At the moment, however, I am not getting the desired output.
I have a .txt file which contains the data X that I want, plus other data that I do not want before and after it.
Here is a screenshot of the .txt file.
I am able to read in the txt file as followed:
read.delim("C:/Users/toxicologie/Cobalt/WB1", header=TRUE,skip=88, nrows=266)
This gives me a dataframe with 266 obs of 1 variable.
But I want these 266 observations in 4 columns (ID, Species, Endpoint, BLM NOEC).
So I tried the following script:
read.delim("C:/Users/toxicologie/Cobalt/WB1", header=TRUE,skip=88, nrows=266, sep = " ")
But then I get the error
Error in read.table(file = file, header = header, sep = sep, quote = quote, : more columns than column names
Using sep = "\t" also gives the same error.
And I am not sure how I can fix this.
Any help is much appreciated!
Try read.fwf and specify the widths of each column. Start reading at the Aelososoma sp. row and add the column names afterwards, with something like:
df <- read.fwf("C:/Users/toxicologie/Cobalt/WB1", header = FALSE, skip = 89, n = 266, widths = c(2, 35, 15, 15))
colnames(df) <- c("ID", "Species", "Endpoint", "BLM NOEC")
(With header = FALSE, the header line itself must be skipped as well, hence skip = 89 rather than 88.)
Provide the txt file for a more complete answer.
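If the exact column widths are hard to pin down, an alternative sketch, assuming the fields are separated by runs of two or more spaces (which may not hold for this file):
raw <- readLines("C:/Users/toxicologie/Cobalt/WB1")
data_x <- raw[90:355]                       # the 266 data rows after the header line
parts <- strsplit(trimws(data_x), " {2,}")  # split on runs of 2+ spaces
df <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
colnames(df) <- c("ID", "Species", "Endpoint", "BLM NOEC")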

Reading csv files

I'm having trouble reading .csv files into R, e.g.
df1991 <- read.csv("http://dl.dropbox.com/s/vwdw2tsmgiiuxfa/1991.csv")
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
more columns than column names
fishdata <- read.csv("http://dl.dropbox.com/s/pin16l691p6j4ll/fishdata.csv", row.names=NULL)
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
more columns than column names
I've tried all sorts of variations of the header and row.names arguments.
I want to import the .csv files from Dropbox for convenience; I have done so in the past without trouble. Any suggestions?
The CSV itself looks acceptable, so perhaps it is your default settings. Could it be a locale issue (a comma being interpreted as the decimal separator)?
Could it be that the error message should be the other way around: that there are more column names than columns?
Grasping at that straw: the first column of data might be interpreted as row labels, for which no column name is required. read.csv would then expect all of the given column names to relate to the columns of data after the first one; hence, more column names than columns. This can be resolved with something like a row.names = 1 import parameter.
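A minimal sketch of that suggestion, using the first of the URLs above:
# treat the first data column as row labels so the remaining column
# names line up with the remaining data columns
df1991 <- read.csv("http://dl.dropbox.com/s/vwdw2tsmgiiuxfa/1991.csv",
                   row.names = 1)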
