Reading .csv files in R

I'm having trouble reading .csv files into R, e.g.
df1991 <- read.csv("http://dl.dropbox.com/s/vwdw2tsmgiiuxfa/1991.csv")
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
more columns than column names
fishdata <- read.csv("http://dl.dropbox.com/s/pin16l691p6j4ll/fishdata.csv", row.names=NULL)
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
more columns than column names
I've tried all sorts of variations of the header & row.names arguments.
I want to import the .csv files from Dropbox for convenience; I have done so in the past without trouble. Any suggestions?

The CSV itself looks acceptable, so perhaps it is your default settings. Could it be a locale issue (a comma being interpreted as the decimal mark)?
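If that is the case and the file is semicolon-separated with comma decimals, read.csv2 is the usual reader; this is a guess, since we cannot see the file:
df1991 <- read.csv2("http://dl.dropbox.com/s/vwdw2tsmgiiuxfa/1991.csv")  # sep = ";", dec = ","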
Could it be that the error message should be the other way around, that there are more column names than columns?
Grasping at this straw: the first column of data might be interpreted as row labels, for which no column name is required. R would then expect all of the given column names to relate to the columns of data that come after the first column, hence "more column names than columns". This can be resolved by an import parameter such as row.names = 1.
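A minimal sketch of that fix, using the URL from the question:
# Treat the first column as row labels so the remaining columns
# line up with the header names
df1991 <- read.csv("http://dl.dropbox.com/s/vwdw2tsmgiiuxfa/1991.csv",
                   row.names = 1)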

Why is read.csv getting wrong classes?

I have to read a big .csv file and read.csv is taking a while. I read that I should use read.csv to read a few rows, get the column classes, and then read the whole file. I tried to do that:
library(magrittr)  # provides the %>% pipe

read.csv(full_path_astro_data,
         header = TRUE,
         sep = ",",
         comment.char = "",
         nrows = 100,
         stringsAsFactors = FALSE) %>%
  sapply(class) -> col.classes
df_astro_data <- read.csv(full_path_astro_data,
                          header = TRUE,
                          sep = ",",
                          colClasses = col.classes,
                          comment.char = "",
                          nrows = 47000,
                          stringsAsFactors = FALSE)
But then I got an error message:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
scan() expected 'an integer', got '0.0776562500000022'
It looks like a column that contains numeric (double?) data was incorrectly classified as integer, perhaps because some numeric columns start with long runs of integer-like values (e.g. zeros). I tried increasing the number of rows in the first read.csv call, but that did not work. One solution I found was to do
col.classes %>%
  sapply(function(x) ifelse(x == "integer", "numeric", x)) -> col.classes
With this the file is read much faster than without specifying column classes. Still, it would be best if all columns were classified correctly.
Any insights?
Thanks
I suspect you are correct that in your row sample some columns contain only integers, but outside your row sample they contain non-integers. This is a common problem with large files. You need to either increase your row sample size or explicitly specify the column type for the columns where you see this happening.
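For example, a minimal base-R sketch (the column name here is hypothetical):
# Override the sampled guess for a known problem column, then re-read
col.classes["problem_column"] <- "numeric"  # hypothetical column name
df_astro_data <- read.csv(full_path_astro_data,
                          colClasses = col.classes,
                          stringsAsFactors = FALSE)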
It should be noted that readr's read_csv does this row sampling automatically. From the docs: "all column types will be imputed from the first 1000 rows on the input. This is convenient (and fast), but not robust. If the imputation fails, you'll need to supply the correct types yourself." You can do that like this:
library(readr)

read_csv(YourPathName,
         col_types = cols(YourProblemColumn1 = col_double(),
                          YourProblemColumn2 = col_double()))
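If you would rather keep the automatic type guessing but base it on more rows, read_csv also accepts a guess_max argument:
read_csv(YourPathName, guess_max = 5000)  # guess column types from 5000 rows instead of 1000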

Set column types for csv with read.csv.ffdf

I am using a payments dataset from Austin Texas Open Data. I am trying to load the data with the following code:
library(ff)
asd <- read.table.ffdf(file = "~/Downloads/Fiscal_Year_2010_eCheckbook_Payments.csv",
                       first.rows = 100, next.rows = 50,
                       FUN = "read.csv", VERBOSE = TRUE)
This shows me the following error:
read.table.ffdf 301..Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
scan() expected 'an integer', got '7AHM'
This happens on the 339th line of the csv file, at the 5th column of the dataset. I think this is because all of the other values in the 5th column are integers, whereas this one is a string; the actual type of the column should be string.
So I wanted to know if there is a way I could set the types of the columns.
Below I am providing the types for all the columns in a vector:
c("character","integer","integer","character","character", "character","character","character","character","character","integer","character","character","character","character","character","character","character","integer","character","character","character","character","character","integer","integer","integer","character","character","character","character","double","character","integer")
You can also find the type of each column from the description of the dataset.
Please also keep in mind that I am very new to this library. Practically just found out about it today.
Maybe you need to transform your data types. The following is just an example that may help you:
# Convert each column to the desired class, then check the result
data <- transform(
  data,
  age = as.integer(age),
  sex = as.factor(sex),
  cp = as.factor(cp),
  trestbps = as.integer(trestbps),
  choi = as.integer(choi),
  fbs = as.factor(fbs),
  restecg = as.factor(restecg),
  thalach = as.integer(thalach),
  exang = as.factor(exang),
  oldpeak = as.numeric(oldpeak),
  slope = as.factor(slope),
  ca = as.factor(ca),
  thai = as.factor(thai),
  num = as.factor(num)
)
sapply(data, class)
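Alternatively, read.table.ffdf forwards extra arguments to the FUN reader (read.csv here accepts colClasses), so passing the vector of classes from the question directly may also work. A sketch under that assumption; note that ff stores strings as factors, so you may need "factor" in place of "character":
# col_types is the 34-element vector of column classes given in the question
col_types <- c("character", "integer", "integer", "character", "character",
               "character", "character", "character", "character", "character",
               "integer", "character", "character", "character", "character",
               "character", "character", "character", "integer", "character",
               "character", "character", "character", "character", "integer",
               "integer", "integer", "character", "character", "character",
               "character", "double", "character", "integer")
asd <- read.table.ffdf(file = "~/Downloads/Fiscal_Year_2010_eCheckbook_Payments.csv",
                       first.rows = 100, next.rows = 50,
                       FUN = "read.csv", colClasses = col_types,
                       VERBOSE = TRUE)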

Merging multiple dataframes in a directory

I would like to merge multiple data frames in a directory. Some of these dataframes have duplicate rows. All dataframes have the same column information.
I found the code below online; however, I do not know how to modify it so that duplicate rows do not cause an error.
I am getting the following response:
Error in read.table(file = file, header = header, sep = sep, quote = quote, : duplicate 'row.names' are not allowed
Here is the code to read in multiple data frames from a single directory. How can I modify it to circumvent the duplicate rows issue?
multmerge <- function(mypath) {
  filenames <- list.files(path = mypath, full.names = TRUE)
  datalist <- lapply(filenames, function(x) read.csv(file = x, header = TRUE))
  Reduce(function(x, y) merge(x, y), datalist)
}
mymergeddata <- multmerge("/Users/Danielle/Desktop/Working Directory/Ecuador/datasets to merge")
The Problem
The problem lies not in the merging, but rather in one or more of the individual csv files, where you have duplicate row names. Essentially, if you try to do a simple read.csv() on a file that contains duplicate row names, you are going to get this exact error:
Error in read.table(file = file, header = header, sep = sep, quote =
quote, : duplicate 'row.names' are not allowed
The Solution
So how do you circumvent it? You could fix the individual csv files, which may be more challenging than it sounds if you have, say, 20 of them in that directory. What I would suggest instead is to not use row names during the reading process and, if it is really necessary, to set the row names after the reading operation is done. For example:
multmerge <- function(mypath) {
  filenames <- list.files(path = mypath, full.names = TRUE)
  datalist <- lapply(filenames, function(x) read.csv(file = x, header = TRUE, row.names = NULL))
  Reduce(function(x, y) rbind(x, y), datalist)
}
mymergeddata <- multmerge("~/Desktop")
mymergeddata[mymergeddata$Day.Index == "2014-01-07", ]
      Day.Index Sessions year
1    2014-01-07       57 2014
1091 2014-01-07       57 2014
See? Two completely identical values in Day.Index, but because they are not row names you do not get an error. If you change the code to use the first column (Day.Index) as row names (by specifying row.names = 1), then I am able to replicate your error:
multmerge <- function(mypath) {
  filenames <- list.files(path = mypath, full.names = TRUE)
  datalist <- lapply(filenames, function(x) read.csv(file = x, header = TRUE, row.names = 1))
  Reduce(function(x, y) rbind(x, y), datalist)
}
mymergeddata <- multmerge("~/Desktop")
nrow(mymergeddata)
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
  duplicate 'row.names' are not allowed
Trivial
I am using rbind() to append by row, but you could swap in merge() and the answer would still be correct.
Essentially: R requires that the row names of a data frame be unique.
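If you do need row names afterwards, one option is to build unique ones yourself; a small sketch using base R's make.unique (an assumption, not part of the original answer):
# make.unique appends .1, .2, ... to duplicated values
rownames(mymergeddata) <- make.unique(as.character(mymergeddata$Day.Index))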

R - Importing Multiple Tables from a Single CSV file

I was hoping there may be a way to do this, but after trying for a while I have had no luck.
I am working with a datafile (.csv format) that is being supplied with multiple tables in a single file. Each table has its own header row, and data associated with it. Is there a way to import this file and create separate data frames for each header/dataset?
Any help or ideas that can be provided would be greatly appreciated.
A sample of the data file and its structure can be found here.
When trying to use read.csv I get the following error:
"Error in read.table(file = file, header = header, sep = sep, quote = quote, :
more columns than column names"
Read the help for read.table:
nrows: number of rows to parse
skip: number of rows to skip
You can parse your file as follows:
# Adjust skip/nrows to match where each sub-table sits in the file
first  <- read.table(myFile, nrows = 2)
second <- read.table(myFile, skip = 3, nrows = 2)
third  <- read.table(myFile, skip = 6, nrows = 8)
You can always automate this by using grep() to search for the table separators, as in the sketch below.
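A sketch of that automation, assuming blank lines separate the tables (adjust the pattern if your file marks table boundaries differently):
lines  <- readLines(myFile)
breaks <- grep("^\\s*$", lines)            # indices of blank separator lines
starts <- c(1, breaks + 1)
ends   <- c(breaks - 1, length(lines))
# Each chunk keeps its own header row, so read.csv picks up the names
tables <- mapply(function(s, e) {
  read.csv(text = paste(lines[s:e], collapse = "\n"))
}, starts, ends, SIMPLIFY = FALSE)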
You can also read the table using fill=TRUE, and then split out the tables afterwards.

Error when reading in a .txt file and splitting it into columns in R

I would like to read a .txt file into R, and have done so numerous times.
At the moment however, I am not getting the desired output.
I have a .txt file which contains the data X that I want, plus other data that I do not, both before and after this data X.
Here is a screenshot of the .txt file.
I am able to read in the txt file as follows:
read.delim("C:/Users/toxicologie/Cobalt/WB1", header = TRUE, skip = 88, nrows = 266)
This gives me a dataframe with 266 obs of 1 variable.
But I want these 266 observations in 4 columns (ID, Species, Endpoint, BLM NOEC).
So I tried the following script:
read.delim("C:/Users/toxicologie/Cobalt/WB1", header = TRUE, skip = 88, nrows = 266, sep = " ")
But then I get the error
Error in read.table(file = file, header = header, sep = sep, quote = quote, : more columns than column names
Using sep = "\t" also gives the same error.
And I am not sure how I can fix this.
Any help is much appreciated!
Try read.fwf and specify the widths of each column. Start reading at the Aelososoma sp. row and add the column names afterwards, with something like:
# header = FALSE because the names are set afterwards; skip = 89 skips the
# 88 preamble lines plus the original header row
df <- read.fwf("C:/Users/toxicologie/Cobalt/WB1",
               header = FALSE, skip = 89, n = 266,
               widths = c(2, 35, 15, 15))
colnames(df) <- c("ID", "Species", "Endpoint", "BLM NOEC")
Provide the txt file for a more complete answer.
