R: read.csv adding sub-script "X" in header - r

I have a data frame with headers like this:
Name 0x1 1x2
read.csv changes the headers to
Name X0x1 X1x2
Is there a way to avoid this?
Thanks.

According to @Joshua:
read.csv("filename.csv", check.names = FALSE)
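A minimal sketch of the difference, using an inline CSV in place of a file. The sample data is made up for illustration.

```r
# Inline sample data (hypothetical) whose headers are not valid R names.
csv_text <- "Name,0x1,1x2\na,1,2\nb,3,4"

# Default: check.names = TRUE passes the headers through make.names().
default_names <- names(read.csv(text = csv_text))

# check.names = FALSE keeps the headers exactly as written.
raw_names <- names(read.csv(text = csv_text, check.names = FALSE))

print(default_names)
print(raw_names)

# Non-syntactic names need [[ ]] or backticks when accessed.
df <- read.csv(text = csv_text, check.names = FALSE)
print(df[["0x1"]])
```

The trade-off is that columns such as 0x1 can no longer be reached with plain df$name syntax without backticks.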

I had the same issue on my Mac. There was an "X..." at the beginning of the first variable name. The problem was that the CSV file was actually a CSV UTF-8 (Comma delimited) file. Saving the file as a CSV (Comma separated values) solved it.
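Excel's "CSV UTF-8" format starts the file with a byte-order mark (BOM), which is the likely source of the stray prefix. If re-saving the file is not an option, base R accepts "UTF-8-BOM" as an encoding name that discards the mark while reading. A small sketch, assuming a made-up two-column file written to a temporary path:

```r
# Build a small CSV that starts with the UTF-8 byte-order mark (EF BB BF).
# The file contents are hypothetical sample data.
path <- tempfile(fileext = ".csv")
bom <- as.raw(c(0xEF, 0xBB, 0xBF))
writeBin(c(bom, charToRaw("Name,Value\na,1\n")), path)

# "UTF-8-BOM" tells the connection to drop the mark before parsing the header.
with_bom <- read.csv(path, fileEncoding = "UTF-8-BOM")
print(names(with_bom))
```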

Using the quote="" option will also prepend an X. for each column of your data.frame. If you can, try to remove that from your read.csv options, else add the check.names=F option which will override that behavior.

I ran into the same problem. The solution for me on a Mac was to save the file with fileEncoding = "macintosh" and then read it with check.names = F.

read.csv("file_name.csv", check.names = FALSE)
check.names = FALSE stops R from rewriting the column names, so no "X" is prepended and any spaces in the names are kept as they are.

Related

How to check if CSV file has a comma or a semicolon as separator?

I have to read in a lot of CSV files automatically. Some have a comma as a delimiter, then I use the command read.csv().
Some have a semicolon as a delimiter, then I use read.csv2().
I want to write a piece of code that recognizes whether the CSV file has a comma or a semicolon as a delimiter (before I read it) so that I don't have to change the code every time.
My approach would be something like this:
try to read.csv("xyz")
if error
read.csv2("xyz")
Is something like that possible? Has somebody done this before?
How can I check if there was an error without actually seeing it?
Here are a few approaches assuming that the only difference among the format of the files is whether the separator is semicolon and the decimal is a comma or the separator is a comma and the decimal is a point.
1) fread As mentioned in the comments, fread in the data.table package automatically detects the separator for common separators and then reads the file using the separator it detected. It can also handle certain other changes in format, such as automatically detecting whether the file has a header.
2) grepl Look at the first line and see if it has a comma or semicolon and then re-read the file:
L <- readLines("myfile", n = 1)
if (grepl(";", L)) read.csv2("myfile") else read.csv("myfile")
3) count.fields We can assume semicolon and then count the fields in the first line. If there is one field then it is comma separated and if not then it is semicolon separated.
L <- readLines("myfile", n = 1)
numfields <- count.fields(textConnection(L), sep = ";")
if (numfields == 1) read.csv("myfile") else read.csv2("myfile")
Update Added (3) and made improvements to all three.
A word of caution. read.csv2() is designed to handle commas as decimal point and semicolons as separators (default values). If by any chance, your csv files have semicolons as separators AND points as decimal point, you may get problems because of dec = "," setting. If this is the case and you indeed have separator as the ONLY difference between the files, it is better to change the "sep" option directly using read.table()
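The caution above can be combined with the count.fields idea in one small base R helper. This is only a sketch: count_fields, detect_sep and read_delim_auto are hypothetical names, the sample file is made up, and the detection looks at the header line only while ignoring quoting.

```r
# Count the fields in a single line for a given separator.
# Assumption: quoting and comments are ignored, so a stray quote cannot yield NA.
count_fields <- function(line, sep) {
  con <- textConnection(line)
  on.exit(close(con))
  count.fields(con, sep = sep, quote = "", comment.char = "")
}

# Hypothetical helper: choose the separator that splits the header into more fields.
detect_sep <- function(path) {
  first_line <- readLines(path, n = 1)
  if (count_fields(first_line, ";") > count_fields(first_line, ",")) ";" else ","
}

# Hypothetical helper: read with the detected separator and an explicit decimal mark,
# so a semicolon-separated file with decimal points is not misread.
read_delim_auto <- function(path, dec = ".") {
  read.table(path, header = TRUE, sep = detect_sep(path), dec = dec,
             quote = "\"", fill = TRUE, comment.char = "")
}

# Sample file (made up): semicolon separator AND point as decimal mark.
path <- tempfile(fileext = ".csv")
writeLines(c("a;b", "1.5;2", "3.25;4"), path)
df <- read_delim_auto(path)
print(detect_sep(path))
print(df)
```

Passing dec explicitly keeps the separator and the decimal mark independent, which read.csv2 does not do by default.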

What does the "More Columns than Column Names" error mean?

I'm trying to read in a .csv file from the IRS and it doesn't appear to be formatted in any weird way.
I'm using the read.table() function, which I have used several times in the past but it isn't working this time; instead, I get this error:
data_0910<-read.table("/Users/blahblahblah/countyinflow0910.csv",header=T,stringsAsFactors=FALSE,colClasses="character")
Error in read.table("/Users/blahblahblah/countyinflow0910.csv", :
more columns than column names
Why is it doing this?
For reference, the .csv files can be found at:
http://www.irs.gov/uac/SOI-Tax-Stats-County-to-County-Migration-Data-Files
(The ones I need are under the county to county migration .csv section - either inflow or outflow.)
It uses commas as separators. So you can either set sep="," or just use read.csv:
x <- read.csv(file="http://www.irs.gov/file_source/pub/irs-soi/countyinflow1011.csv")
dim(x)
## [1] 113593 9
The error is caused by spaces in some of the values, and unmatched quotes. There are no spaces in the header, so read.table thinks that there is one column. Then it thinks it sees multiple columns in some of the rows. For example, the first two lines (header and first row):
State_Code_Dest,County_Code_Dest,State_Code_Origin,County_Code_Origin,State_Abbrv,County_Name,Return_Num,Exmpt_Num,Aggr_AGI
00,000,96,000,US,Total Mig - US & For,6973489,12948316,303495582
And unmatched quotes, for example on line 1336 (row 1335) which will confuse read.table with the default quote argument (but not read.csv):
01,089,24,033,MD,Prince George's County,13,30,1040
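A reduced reproduction of the whitespace part of this explanation, as a sketch. The three-column text below is a cut-down, illustrative version of the IRS layout, and quote = "" is passed to read.table only to isolate the whitespace effect from the quote effect.

```r
# Cut-down sample in the spirit of the IRS file (illustrative only).
csv_text <- paste(
  "State_Abbrv,County_Name,Return_Num",
  "US,Total Mig - US & For,6973489",
  "MD,Prince George's County,13",
  sep = "\n"
)

# read.table splits on whitespace by default: the header is one field,
# while the first data row breaks into several fields.
err_msg <- tryCatch(
  {
    read.table(text = csv_text, header = TRUE, quote = "")
    NA_character_
  },
  error = function(e) conditionMessage(e)
)
print(err_msg)

# read.csv uses sep = "," and does not treat the apostrophe as a quote character.
fixed <- read.csv(text = csv_text)
print(dim(fixed))
```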
You may have strange characters in your header, such as #, %, -- or ,
For German users:
Change the decimal commas in your CSV file to full stops (in Excel: File -> Options -> Advanced -> "Decimal separator"); that resolves the error.
Depending on the data (e.g. tsv extension) it may use tab as separators, so you may try sep = '\t' with read.csv.
This error can get thrown if your data frame has sf geometry columns.

Error while reading csv file in R

I am having some problems in reading a csv file with R.
x=read.csv("LorenzoFerrone.csv",header=T)
Error in make.names(col.names, unique = TRUE) :
invalid multibyte string at '<ff><fe>N'
I can read the file using libre office with no problems.
I cannot upload the file because it is full of sensitive information.
What can I do?
Setting the encoding seems to be the solution to the problem.
> x=read.csv("LorenzoFerrone.csv",fileEncoding = "UCS-2LE")
> x[2,1]
[1] Adriano Caruso
100 Levels: Ada Adriano Caruso adriano diaz Adriano Diaz alberto ferrone Alexey ... Zia Tina
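The bytes <ff><fe> in the error message are the byte-order mark of a little-endian UTF-16 file, which is why a 16-bit encoding fixes the read. The sketch below reproduces this with a made-up file; it assumes the platform's iconv supports "UTF-16LE", which is used here in place of "UCS-2LE".

```r
# Write a small UTF-16LE file with a leading FF FE byte-order mark.
# The contents are hypothetical sample data.
path <- tempfile(fileext = ".csv")
payload <- iconv("Name,Value\nAdriano,1\n", from = "UTF-8", to = "UTF-16LE",
                 toRaw = TRUE)[[1]]
writeBin(c(as.raw(c(0xFF, 0xFE)), payload), path)

# Declaring the encoding lets read.csv decode the header instead of failing in make.names().
x <- read.csv(path, fileEncoding = "UTF-16LE")
print(names(x))
print(x)
```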
This will read the column names as-is and won't return any errors:
x = read.csv("LorenzoFerrone.csv", check.names = F)
To remove/replace troublesome characters in column names, use this:
names(x) <- iconv(names(x), to = "ASCII", sub = "")
The cause is an invalid encoding. I solved it by replacing all the "è" with "e".
I found that this problem was caused by the file's encoding. I solved it by opening the file in Windows Notepad, saving it as UTF-8, reopening it in Excel (it looked garbled at first), and saving it again as UTF-8. Then it worked!
You can always use the "Latin1" encoding while reading the csv:
x = read.csv("LorenzoFerrone.csv", fileEncoding = "Latin1", check.names = F)
I am adding check.names = F to avoid replacing spaces by dots within your header.
You need to specify the correct delimiter in the sep argument.
This is typically an encoding issue. You can try changing the encoding or deleting the offending character (just use your favorite editor and replace all instances). In some cases R will report the character location, for example:
invalid multibyte string 1847
Which should make your life easier.
Also note that you may be required to repeat this process several times (deleting all offending characters or trying several encodings).
Change the file format to - CSV UTF-8. It worked for me.
Not sure if this is helpful, but I had a similar problem and figured out that it was because my "csv" file had a .csv suffix, but was actually a .xls file!
Not sure if this helps, just had a similar issue which I solved by removing " from the csv I was trying to import. The first row of the database had the column names written as "colname","colname2","etc" and I removed all the " and the csv was read in R just fine then.
I solved the problem by removing any graphical signs in the writing (i.e. accent marks). My headers were written in Spanish and had some accent marks in there. I replaced with simple words (México=Mexico) and problem was solved.
I know this is an old post, but I just wanted to say to non-native English speakers that if you use "," as the decimal separator,

Special characters in R language

I have a table, which looks like this:
1β 2β
1.0199e-01 2.2545e-01
2.5303e-01 6.5301e-01
1.2151e+00 1.1490e+00
and so on...
I want to make a boxplot of this data. The commands I am using are these:
pdf('rtest.pdf')
w1<-read.table("data_CMR",header=T)
w2<-read.table("data_C",header=T)
boxplot(w1[,], w2[,], w3[,],outline=FALSE,names=c(colnames(w1),colnames(w2),colnames(w3)))
dev.off()
The problem is that instead of the symbol beta (β), I get two dots (..) in the output.
Any suggestions for solving this problem?
Thank you in advance.
The suggestion to use check.names will prevent the prepending of "X" to "1β" and "2β", which would otherwise occur even once the encoding is sorted out (since column names are not supposed to start with numbers). (One could also have just used the "names" argument to boxplot.)
w1<-read.table(text="1β 2β
1.0199e-01 2.2545e-01
2.5303e-01 6.5301e-01
1.2151e+00 1.1490e+00",header=TRUE, check.names=FALSE, fileEncoding="UTF-8")
boxplot(w1)
This also works
pdf('rtest.pdf')
w1<-read.table("data_CMR",header=T)
w2<-read.table("data_C",header=T)
one<-expression(paste("1", beta,sep=""))
two <- expression(paste("2", beta,sep=""))
boxplot(w1[,], w2[,], w3[,],outline=FALSE, names=c(one,two))
dev.off()
This could be an encoding problem. Try adding encoding='UTF-8' to your read.table statements.
w1<-read.table("data_CMR",header=T,encoding='UTF-8')

R - read.table imports half of the dataset - no errors nor warnings

I have a csv file with ~200 columns and ~170K rows. The data has been extensively groomed and I know that it is well-formed. When read.table completes, I see that approximately half of the rows have been imported. There are no warnings nor errors. I set options( warn = 2 ). I'm using 64-bit latest version and I increased the memory limit to 10gig. Scratching my head here...no idea how to proceed debugging this.
Edit
When I said half the file, I don't mean the first half. The last observation read is towards the end of the file, so the missing rows are seemingly random.
You may have a comment character (#) in the file (try setting the option comment.char = "" in read.table). Also, check that the quote option is set correctly.
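A small sketch of how the comment character can drop rows without any warning. The inline data is made up; read.table treats "#" as the start of a comment by default, whereas read.csv sets comment.char = "".

```r
# Hypothetical data in which one record happens to start with '#'.
txt <- "id,note\n1,ok\n#2,flagged\n3,ok"

# Default comment.char = "#": the whole third line is treated as a comment.
lossy <- read.table(text = txt, header = TRUE, sep = ",")

# comment.char = "" disables comment handling, so every line is kept.
safe <- read.table(text = txt, header = TRUE, sep = ",", comment.char = "")

print(nrow(lossy))
print(nrow(safe))
```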
I have had this problem before. My approach was to read a fixed number of lines at a time and then combine the pieces afterwards.
df1 <- read.csv(..., nrows = 85000)
# skip the header line plus the 85000 rows already read; the remaining lines have no header
df2 <- read.csv(..., skip = 85001, nrows = 85000, header = FALSE)
colnames(df2) <- colnames(df1)
df <- rbind(df1, df2)
rm(df1, df2)
I had a similar problem when reading in a large txt file which had a "|" separator. Scattered about the txt file were some text blocks that contained a quote (") which caused the read.xxx function to stop at the prior record without throwing an error. Note that the text blocks mentioned were not encased in double quotes; rather, they just contained one double quote character here and there (") which tripped it up.
I did a global search and replace on the txt file, replacing the double quote (") with a single quote ('), solving the problem (all rows were then read in without aborting).
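An alternative to editing the file is to switch quote handling off while reading, and to compare the number of rows read with the number of lines in the file as a sanity check. A minimal sketch, assuming a made-up pipe-separated file with one stray double quote and no fields that legitimately span lines:

```r
# Hypothetical pipe-separated file with a stray double quote in one record.
path <- tempfile(fileext = ".txt")
writeLines(c("id|note", "1|ok", "2|\"unbalanced", "3|ok", "4|ok"), path)

# Lines in the file minus the header gives the number of rows we expect.
expected_rows <- length(readLines(path)) - 1L

# quote = "" treats the double quote as an ordinary character, so it cannot swallow later lines.
df <- read.table(path, header = TRUE, sep = "|", quote = "", comment.char = "")

print(expected_rows)
print(nrow(df))
```

If the two numbers differ, quoting or comment characters are the first things to check.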
