I am looking for a method rather than code as a solution. Any suggestions are welcome.
Here is sample data that is corrupted (the commas should not have been there). By the way, I don't have any control over the csv files I receive.
A B C
1.1 1,859.3 52.1
0 12.2 123
In csv format it looks like:
A,B,C
1.1 ,1,859.3,52.1
0,12.2,123
But then, when I read it in using R, row 1 has an extra column, and that is an error. Is there a convenient way to identify whether a csv file has an error like this extra column? I could write a bunch of nested loops that check the length of each row, but I am talking about 1000 csvs with 100000 rows each. It will take forever. Please help. Any method is appreciated.
Save to csv using a different separator, e.g. ;
Then you would have something like
A;B;C
1.1;1,859.3;52.1
0;12.2;123
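The code is simple. Note that write.csv ignores sep (it warns and keeps the comma), so use write.table for writing:
write.table(..., sep = ";", row.names = FALSE)
read.csv( ..., sep = ";")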
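If the files cannot be re-exported with another separator, one way to find the broken rows without nested loops is base R's count.fields, which counts the fields on each line without building a data frame. This is only a sketch: the temporary file and its contents are made up to mirror the sample in the question, and it assumes the header line has the correct number of fields.

```r
# Hypothetical sample file reproducing the corrupted row from the question
path <- tempfile(fileext = ".csv")
writeLines(c("A,B,C", "1.1,1,859.3,52.1", "0,12.2,123"), path)

# Number of comma-separated fields on every line.
# blank.lines.skip = FALSE keeps the result aligned with line numbers in the file.
n_fields <- count.fields(path, sep = ",", quote = "\"", blank.lines.skip = FALSE)

# Assumption: the header line defines the expected number of fields
expected <- n_fields[1]

# Line numbers (the header is line 1) whose field count differs from the header
bad_lines <- which(n_fields != expected)
bad_lines
```

For many files, the same few lines can be wrapped in a function and applied over list.files(); only the files with a non-empty bad_lines need a closer look.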
I was just going through a tremendous headache caused by read_csv messing up my data by substituting content with NA while reading simple and clean csv files.
I’m iterating over multiple large csv files that add up to millions of observations. Some columns contain a lot of NA values.
When reading a csv that contains NA in a certain column for the first 1000 + x observations, read_csv populates the entire column with NA and thus, the data is lost for further operations.
The warning message “Warning: x parsing failure” is shown, but as I’m reading multiple files I cannot check this file by file. Even then, I would not know an automated fix for the parsing problems that problems(x) reports.
Using read.csv instead of read_csv does not cause the problem, but it is slow and I run into encoding issues (using different encodings requires too much memory for large files).
An option to overcome this bug is to add a first observation (first row) to your data that contains something for each column, but still I need to read the file first somehow.
See a simplified example below:
library(readr)
## create a data frame
df <- data.frame( id = numeric(), string = character(),
stringsAsFactors=FALSE)
## populate columns
df[1:1500,1] <- 1:1500
df[1500,2] <- "something"
# variable string contains its first value in obs. 1500
df[1500,]
## check the number of NA in variable string
sum(is.na(df$string)) # 1499
## write the df
write_csv(df, "df.csv")
## read the df with read_csv and read.csv
df_readr <- read_csv('df.csv')
df_read_standard <- read.csv('df.csv')
## check the number of NA in variable string
sum(is.na(df_readr$string)) #1500
sum(is.na(df_read_standard$string)) #1499
## the data frame read with read_csv is all NA for variable string
problems(df_readr) ## What should that tell me? How to fix it?
Thanks to MrFlick for answering my question in a comment:
The whole reason read_csv can be faster than read.csv is because it can make assumptions about your data. It looks at the first 1000 rows to guess the column types (via guess_max) but if there is no data in a column it can't guess what's in that column. Since you seem to know what's supposed to be in the columns, you should use the col_types= parameter to tell read_csv what to expect rather than making it guess. See the ?readr::cols help page to see how to tell read_csv what it needs to know.
Also guess_max = Inf overcomes the problem, but the speed advantage of read_csv seems to be lost.
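As a rough sketch of the col_types suggestion, applied to the example data from the question: the temporary file path is made up for illustration, and the column types are assumed from how the example data frame was built.

```r
library(readr)

# Rebuild the example: 1500 rows, 'string' is NA except in the last row
df <- data.frame(id = 1:1500, string = NA_character_, stringsAsFactors = FALSE)
df$string[1500] <- "something"

# Hypothetical temporary file, used only for this illustration
path <- tempfile(fileext = ".csv")
write_csv(df, path)

# Declare the column types up front so read_csv does not have to guess them
df_typed <- read_csv(
  path,
  col_types = cols(id = col_double(), string = col_character())
)

# The value in row 1500 survives; only the genuinely missing rows are NA
sum(is.na(df_typed$string))
df_typed$string[1500]
```

Because the types are declared, the number of rows inspected for guessing no longer matters, so there is no need to raise guess_max.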
I'm trying to do redundancy analysis (RDA) on my data in R. The data frame I'm using was uploaded as a Microsoft Excel csv file. The data frame looks something like this:
site biomass index
1 0.001 1.5
2 0.122 2.3
3 0.255 4.9
When trying to create a formula for the RDA, I constantly get the following message: "Error in formula.data.frame(object, env = baseenv()) :
cannot create a formula from a zero-column data frame"
Does anyone know how I can change my data frame so that I no longer get this error message?
Thanks in advance!
You loaded all your data into row.names instead of columns:
you used the default separator ',' while your data is separated by ';', and
you specified row.names = 1, so the first column (the only one, since the separator is wrong) goes to row.names.
That's why your data.frame has no columns. To fix this, use
read.csv('data.csv', sep = ';', row.names=NULL)
Problem solved :) Thanks for your answers. I tried the suggestions, but what ended up working was importing the data with read.csv2("data.csv", row.names=1), so basically using read.csv2 instead of read.csv. read.csv2 defaults to ';' as the separator and ',' as the decimal mark, which is the convention in many European locales.
There is more info on the difference between csv and csv2 here: Difference between read.csv() and read.csv2() in R
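To make the difference concrete, here is a small sketch. The file contents are an assumption: they mimic what Excel typically exports in a European locale (semicolon separator, decimal comma), which is not shown in the question itself.

```r
# Hypothetical file as a European-locale Excel export might write it
path <- tempfile(fileext = ".csv")
writeLines(c("site;biomass;index",
             "1;0,001;1,5",
             "2;0,122;2,3",
             "3;0,255;4,9"), path)

# read.csv2 defaults to sep = ";" and dec = ","
dat <- read.csv2(path, row.names = 1)

# The same result with read.csv, spelling out both settings explicitly
dat_explicit <- read.csv(path, sep = ";", dec = ",", row.names = 1)

str(dat)
```

Either call gives a data frame with the two numeric columns biomass and index, with site used as row names.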
I'm currently trying to read two .csv files, edit the data, and then write it into a new .csv.
Here is the code:
data <- read.csv("file.csv", fill=TRUE, header=TRUE, row.names=NULL, stringsAsFactors=FALSE, sep=",", quote="")
write.csv(data, file="out.csv")
Here's the problem :
Everything is fine with the first file (20 columns, 572 observations)
However, the other file has 163 columns and 1578 lines but when I read it with read.csv, R displays "2301 observations of 163 variables".
I tried to write this dataframe into a new csv file, and it is a total mess :
the rows have not been written entirely, the last values are written on a new row
there is a new column with integers from 1 to 2301
some data which is supposed to be in file$n is written in file$(n-1) or file$(n-2)
I'm a newbie, and I must admit I'm kind of lost : any help would be highly appreciated!
Thanks
Clément
I need to write to csv without the columns names row.
The following snippet:
CSAT <- data[j,1]
Verbatim <- data[j,2]
write.csv (Verbatim, paste(CSAT,'.csv',sep = ""), row.names=FALSE)
CSAT is a variable containing dynamic value
(changes in runtime) for the file name.
writes data to csv, but the csv looks like this (2 rows instead of the desired 1 row):
"x"
"I am very disappointed in your service. I had been with xxxxx for 9 years and even though I had problems a few times I knew they would look after me as a customer. I can't say the same about yyyyy. Only been with you 6 months and already feel let down by "
Where did "x" come from? It's not part of the data.
thanks
You cannot turn off col.names with write.csv: it always writes a header row and ignores attempts to change col.names. So if your data doesn't have column names, write.csv will add one. That's where the "x" came from. Try using the more flexible write.table:
write.table(Verbatim, paste(CSAT, '.csv', sep = ""), row.names=FALSE, col.names=FALSE)
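A self-contained sketch of the same idea, with made-up values for CSAT and Verbatim and a temporary output directory so that it can be run as is:

```r
# Hypothetical stand-ins for data[j, 1] and data[j, 2]
CSAT <- 3
Verbatim <- "I am very disappointed in your service."

# Write into a temporary directory instead of the working directory
out <- file.path(tempdir(), paste0(CSAT, ".csv"))

# col.names = FALSE suppresses the header line that write.csv would insist on
write.table(Verbatim, out, sep = ",", row.names = FALSE, col.names = FALSE)

# The file now holds a single line: the quoted comment text
readLines(out)
```

sep = "," only matters once more than one column is written, but setting it keeps the output a valid csv in that case too.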
I have used R for various things over the past year but due to the number of packages and functions available, I am still sadly a beginner. I believe R would allow me to do what I want to do with minimal code, but I am struggling.
What I want to do:
I have roughly a hundred different excel files containing data on students. Each excel file represents a different school but contains the same variables. I need to:
Import the data into R from Excel
Add a variable to each file containing the filename
Merge all of the data (add observations/rows - do not need to match on variables)
I will need to do this for multiple sets of data, so I am trying to make this as simple and easy to replicate as possible.
What the Data Look Like:
Row 1 Title
Row 2 StudentID Var1 Var2 Var3 Var4 Var5
Row 3 11234 1 9/8/2011 343 159-167 32
Row 4 11235 2 9/16/2011 112 152-160 12
Row 5 11236 1 9/8/2011 325 164-171 44
Row 1 is meaningless and Row 2 contains the variable names. The files have different numbers of rows.
What I have so far:
At first I simply tried to import data from excel. Using the XLSX package, this works nicely:
dat <- read.xlsx2("FILENAME.xlsx", sheetIndex=1,
sheetName=NULL, startRow=2,
endRow=NULL, as.data.frame=TRUE,
header=TRUE)
Next, I focused on figuring out how to merge the files (also thought this is where I should add the filename variable to the datafiles). This is where I got stuck.
setwd("FILE_PATH_TO_EXCEL_DIRECTORY")
filenames <- list.files(pattern=".xls")
do.call("rbind", lapply(filenames, read.xlsx2, sheetIndex=1, colIndex=6, header=TRUE, startrow=2, FILENAMEVAR=filenames));
I set my directory, make a list of all the excel file names in the folder, and then try to merge them in one statement using a variable for the filenames.
When I do this I get the following error:
Error in data.frame(res, ...) :
arguments imply differing number of rows: 616, 1, 5
I know there is a problem with my application of lapply - the startrow is not being recognized as an option and the FILENAMEVAR is trying to merge the list of 5 sample filenames as opposed to adding a column containing the filename.
What next?
If anyone can refer me to a useful resource or function, critique what I have so far, or point me in a new direction, it would be GREATLY appreciated!
I'll post my comment (with bdemerast picking up on the typo). The solution was untested as xlsx will not run happily on my machine
You need to pass a single FILENAMEVAR to read.xlsx2.
lapply(filenames, function(x) read.xlsx2(file=x, sheetIndex=1, colIndex=6, header=TRUE, startRow=2, FILENAMEVAR=x))
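The overall pattern (read each file, tag it with its filename, then stack the pieces) can be sketched as follows. To keep it runnable without the xlsx package, this illustration uses small made-up csv files and read.csv; the helper name read_one and the sample data are hypothetical, and the reader call would be swapped for read.xlsx2 in the real case.

```r
# Hypothetical sample files standing in for the per-school spreadsheets
dir <- tempfile("schools")
dir.create(dir)
write.csv(data.frame(StudentID = c(11234, 11235), Var1 = c(1, 2)),
          file.path(dir, "school_a.csv"), row.names = FALSE)
write.csv(data.frame(StudentID = 11236, Var1 = 1),
          file.path(dir, "school_b.csv"), row.names = FALSE)

filenames <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)

# Read one file and add a column holding the name of the file it came from
read_one <- function(path) {
  dat <- read.csv(path, stringsAsFactors = FALSE)
  dat$FILENAMEVAR <- basename(path)
  dat
}

# One data frame per file, then stack them row-wise
all_data <- do.call(rbind, lapply(filenames, read_one))
all_data
```

Adding the filename inside the per-file function means each call sees exactly one name, which avoids the "differing number of rows" error caused by passing the whole filenames vector.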