Special characters with read.csv loading as full stops when header = TRUE - r

I have a tab-delimited .csv file with column names in the first row and numbers that use , as the decimal sign in the other rows. I am trying to read it with the read.csv() command like so:
x = read.csv("Export.csv", header = TRUE, sep = "\t", dec = ",")
in the input (file Export.csv) I have for example
"$\{,}_"
45,2
which gives me
X....._
45.2
I had expected it would interpret quoted values as strings and numbers as numbers.
It correctly interprets 45,2 as a number but messes up all special characters except underscore.
I thought it was an encoding issue, so I tried a few different encoding options with the same result.
Moreover, if I change the header parameter to FALSE, everything is displayed correctly; however, all data are then interpreted as strings and (as expected) the first row is not the header.
How can I load special characters to header in these circumstances?
Issue on: RStudio Version 0.98.501, R Version 3.0.2 x64, OS: Win 7 x64

All elements of one column of a data.frame must have the same type, so when a column is read in, R has to guess which type you want. With header = TRUE, it reads the first line as the header and then guesses that the column is numeric. It then mangles the name because check.names is set to TRUE and your header name isn't a "valid" name (it might cause problems), so it tries to fix it.
With header = FALSE, it reads the first line as data, guesses that it is a character (because it isn't a number), and then the whole column becomes a character.
If you want to read in this column, with $\{,}_ as the header name, you can do:
read.table(text = '"$\\{,}_"\n45,2', header = TRUE, check.names = FALSE, dec = ",")
If you want to read this data in, and convert the elements to a numeric or a character, you'll have to read it in as a character, and then convert it yourself, placing the elements in a list.
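As a rough illustration of what check.names does with the asker's layout (tab separator, decimal comma), the sketch below reads the same inline text twice. The second column "other col" and the variable names tsv_text, mangled and kept are made up for this example.

```r
# Inline stand-in for Export.csv: tab-separated, decimal comma, quoted headers.
# The "other col" column is hypothetical and only there to show a second header.
tsv_text <- '"$\\{,}_"\t"other col"\n45,2\t1,5\n'

# Default check.names = TRUE: headers are passed through make.names().
mangled <- read.csv(text = tsv_text, header = TRUE, sep = "\t", dec = ",")

# check.names = FALSE: headers are kept exactly as written in the file.
kept <- read.csv(text = tsv_text, header = TRUE, sep = "\t", dec = ",",
                 check.names = FALSE)

print(names(mangled))
print(names(kept))
print(kept[[1]])  # still parsed as a number thanks to dec = ","
```

Columns with non-syntactic names have to be accessed with backticks or double brackets, for example kept[["$\\{,}_"]].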

Related

Dealing with quotation marks in a quote-surrounded string

Take this CSV file:
ID,NAME,VALUE
1,Blah,100
2,"Has space",200
3,"Ends with quotes"",300
4,""Surrounded with quotes"",300
It loads just fine in most statistical programs (R, SAS, etc.) but in Excel the third row is misinterpreted because it has two quotation marks. Escaping the last quote as \" will also not work in Excel. The only way I have found so far is to replace the one double quote with two double quotes:
ID,NAME,VALUE
1,Blah,100
2,"Has space",200
3,"Ends with quotes""",300
4,"""Surrounded with quotes""",300
But that would render the file completely useless for all other programs (R, SAS, etc.)
Is there a way to format the CSV file where strings can begin or end with the same characters as that used to surround them, such that it would work in Excel as well as commonly used statistical software?
Your second representation is the normal way to generate a CSV file and so should be easy to work with in any software. See the RFC 4180 specifications. https://www.ietf.org/rfc/rfc4180.txt
So your second example represents this data:
Obs id name value
1 1 Blah 100
2 2 Has space 200
3 3 Ends with quotes" 300
4 4 "Surrounded with quotes" 300
If you want to represent it as a delimited file where none of the values are allowed to contain the delimiter (in other words NOT as a standard CSV file) then it would look like:
id,name,value
1,Blah,100
2,Has space,200
3,Ends with quotes",300
4,"Surrounded with quotes",300
But if you want to allow the values to contain the delimiter then you need some way to distinguish embedded delimiters from real delimiters. So the standard forces values that contain the delimiter to be quoted. But once you do that you also need to add quotes around fields that contain the quote character itself (and double the embedded quotes) to avoid making an ambiguous file. For example the quotes in the 4th observation in your first file look like they are optional quotes around a value instead of part of the value.
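A small sketch of the doubled-quote convention in R, assuming base R's read.csv and write.csv with their default quoting; the variable names rfc_text, parsed and out are only for illustration. It suggests that the second representation is not lost on R: the doubled quotes are read back as single embedded quotes, and write.csv doubles them again on output.

```r
# The doubled-quote file from the question, supplied inline.
rfc_text <- 'ID,NAME,VALUE
1,Blah,100
2,"Has space",200
3,"Ends with quotes""",300
4,"""Surrounded with quotes""",300
'

# read.csv treats "" inside a quoted field as one literal double quote.
parsed <- read.csv(text = rfc_text, stringsAsFactors = FALSE)
print(parsed$NAME)

# write.csv escapes embedded quotes by doubling them.
out <- capture.output(write.csv(parsed, row.names = FALSE))
print(out)
```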
Many programs try to handle ambiguous situations. For example SAS does not allow values to contain embedded line breaks so you will always get four observations with your first example file.
But Excel allows end-of-line character(s) to be embedded inside quoted values. So in your original file the value of the second field in the third observation looks like what you would start to get if you added quotes around this value:
Ends with quotes",300
4,"Surrounded with quotes",300
So instead of four complete observations with three field values each, there are only three observations, and the last observation has only two field values.
This happens because the escape sequence for " in Excel is "" (see "Escaping quotes and delimiters in CSV files with Excel").
A quick and simple workaround that comes to mind in R is to first read the content of the CSV with readLines, then replace each doubled (escaped) double quote with a single double quote, and then pass the result to read.table:
read.table(
text = gsub(pattern = "\"\"", "\"", readLines("data.csv")),
sep = ",",
header = TRUE
)

write_csv - Exporting trailing spaces (no elimination)

I am trying to export a table to CSV format, but one of my columns is special - it's like a number string except that the length of the string needs to be the same every time, so I add trailing spaces to shorter numbers to get it to a certain length (in this case I make it length 5).
library(dplyr)
library(readr)
df <- read.table(text="ID Something
22 Red
55555 Red
123 Blue
",header=T)
df <- mutate(df,ID=str_pad(ID,5,"right"," "))
df
     ID Something
1 22          Red
2 55555       Red
3 123        Blue
Unfortunately, when I try to do write_csv somewhere, the trailing spaces disappear which is not good for what I want to use this for. I think it's because I am downloading the csv from the R server and then opening it in Excel, which messes around with the data. Any tips?
str_pad() appears to be a function from the stringr package, which is not currently available for R 3.5.0, which I am using; this may be the cause of your issues as well. If the function actually works for you, please ignore the next step and skip straight to my Excel comments below.
Adding spaces. Here is how I have accomplished this task with base R:
# a custom function to add arbitrary number of trailing spaces
SpaceAdd <- function(x, desiredLength = 5) {
additionalSpaces <- ifelse(nchar(x) < desiredLength,
paste(rep(" ", desiredLength - nchar(x)), collapse = ""), "")
paste(x, additionalSpaces, sep="")
}
# use the function on your df
df$ID <- mapply(df$ID, FUN = SpaceAdd)
# write csv normally
write.csv(df, "df.csv")
NOTE: When you import into Excel, you should use the 'import from text' wizard rather than just opening the .csv. This is because you need to mark your 'ID' column as text in order to keep the spaces.
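To check that the spaces are lost on the Excel side rather than in R, a round trip through a temporary file can help. The sketch below is only one way to do it: it pads with formatC instead of the custom function, and the data frame and variable names are stand-ins for the ones in the question.

```r
# Stand-in for the question's data frame.
df <- data.frame(ID = c(22L, 55555L, 123L),
                 Something = c("Red", "Red", "Blue"),
                 stringsAsFactors = FALSE)

# Left-justify the IDs in a field of width 5 (pads with trailing spaces).
df$ID <- formatC(df$ID, width = 5, flag = "-")

# Write to a temporary file and inspect what actually lands on disk.
tmp <- tempfile(fileext = ".csv")
write.csv(df, tmp, row.names = FALSE)
file_lines <- readLines(tmp)
print(file_lines)

# Reading the column back as character keeps the padding.
back <- read.csv(tmp, colClasses = "character")
print(nchar(back$ID))
```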
NOTE 2: I learned today that having your first column named 'ID' might actually cause further problems with Excel, since it may misinterpret the nature of the file and treat it as a SYLK file instead. So it may be best to avoid this column name if possible.
Here is a wiki tl;dr:
A commonly encountered (and spurious) 'occurrence' of the SYLK file happens when a comma-separated value (CSV) format is saved with an unquoted first field name of 'ID', that is the first two characters match the first two characters of the SYLK file format. Microsoft Excel (at least to Office 2016) will then emit misleading error messages relating to the format of the file, such as "The file you are trying to open, 'x.csv', is in a different format than specified by the file extension..."
details: https://en.wikipedia.org/wiki/SYmbolic_LinK_(SYLK)

problems replacing €-symbol in strings

I want to replace every "€" in a string with "[euro]". Now this works perfectly fine with
file.column.name <- gsub("€","[euro]", file.column.name, fixed = TRUE)
Now I am looping over column names from a csv-file and suddenly I have trouble with the string "total€".
It works for other special characters (#, ?) but the € sign doesn't get recognized.
grep("€",file.column.name)
also returns 0 and if I extract the last letter it prints "€" but
print(lastletter(file.column.name) == "€")
returns FALSE. (lastletter is just a function to extract the last letter of a string.)
Does anyone have an idea why that happens and maybe an idea to solve it? I checked the class of "file.column.name" and it returns "character", also tried to convert it into a character again and stuff like that but didn't help.
Thank you!
Your encodings are probably mixed. Check the encodings of the files, then add the appropriate encoding to, e.g., read.csv using fileEncoding="…" as an argument.
If you are working under Unix/Linux, the file utility will tell you the encoding of text files. Otherwise, any editor should show you the encoding of the files.
Common encodings are UTF-8, ISO-8859-15 and windows-1252. Try "UTF-8", "windows-1252" and "latin-9" as values for fileEncoding (the latter being a portable name for ISO-8859-15 according to R's documentation).
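The following sketch shows why the comparison can fail even though the character prints as "€". It assumes, purely for illustration, that the column name arrived as windows-1252 bytes (where € is the single byte 0x80) while the "€" typed in the script is UTF-8; the variable names are made up.

```r
# "total" followed by byte 0x80, which is how windows-1252 stores the euro sign.
raw_name <- rawToChar(as.raw(c(0x74, 0x6f, 0x74, 0x61, 0x6c, 0x80)))

# The euro sign as a UTF-8 string (three bytes: e2 82 ac).
euro <- "\u20ac"

# Byte-wise search: the UTF-8 pattern is not present in the windows-1252 bytes.
found_before <- grepl(euro, raw_name, fixed = TRUE, useBytes = TRUE)

# Declare the assumed source encoding and convert to UTF-8.
fixed_name <- iconv(raw_name, from = "windows-1252", to = "UTF-8")
found_after <- grepl(euro, fixed_name, fixed = TRUE)

# The original replacement now works.
result <- gsub(euro, "[euro]", fixed_name, fixed = TRUE)
print(result)
```

Passing the matching fileEncoding to read.csv performs the same conversion while the file is being read, so the column names never need to be repaired afterwards.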

What does the "More Columns than Column Names" error mean?

I'm trying to read in a .csv file from the IRS and it doesn't appear to be formatted in any weird way.
I'm using the read.table() function, which I have used several times in the past but it isn't working this time; instead, I get this error:
data_0910<-read.table("/Users/blahblahblah/countyinflow0910.csv",header=T,stringsAsFactors=FALSE,colClasses="character")
Error in read.table("/Users/blahblahblah/countyinflow0910.csv", :
more columns than column names
Why is it doing this?
For reference, the .csv files can be found at:
http://www.irs.gov/uac/SOI-Tax-Stats-County-to-County-Migration-Data-Files
(The ones I need are under the county to county migration .csv section - either inflow or outflow.)
It uses commas as separators. So you can either set sep="," or just use read.csv:
x <- read.csv(file="http://www.irs.gov/file_source/pub/irs-soi/countyinflow1011.csv")
dim(x)
## [1] 113593 9
The error is caused by spaces in some of the values, and unmatched quotes. There are no spaces in the header, so read.table thinks that there is one column. Then it thinks it sees multiple columns in some of the rows. For example, the first two lines (header and first row):
State_Code_Dest,County_Code_Dest,State_Code_Origin,County_Code_Origin,State_Abbrv,County_Name,Return_Num,Exmpt_Num,Aggr_AGI
00,000,96,000,US,Total Mig - US & For,6973489,12948316,303495582
And unmatched quotes, for example on line 1336 (row 1335) which will confuse read.table with the default quote argument (but not read.csv):
01,089,24,033,MD,Prince George's County,13,30,1040
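A minimal reproduction of the spaces problem, using a cut-down version of the header and first row quoted above (three columns instead of nine, which is a simplification for this sketch; the variable names are made up):

```r
# Cut-down sample: comma-separated, with spaces inside one value.
csv_text <- "State_Abbrv,County_Name,Return_Num\nUS,Total Mig - US & For,6973489\n"

# read.table splits on whitespace by default, so the header looks like one
# field while the data row looks like several.
err <- tryCatch(
  read.table(text = csv_text, header = TRUE),
  error = function(e) conditionMessage(e)
)
print(err)

# read.csv uses sep = "," and parses the same text into three columns.
ok <- read.csv(text = csv_text, stringsAsFactors = FALSE)
print(dim(ok))
```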
You may have strange characters in your header, such as #, %, -- or ,
For the Germans:
You have to change the decimal commas in your CSV file into full stops (in Excel: File -> Options -> Advanced -> "Decimal separator"); then the error is solved.
Depending on the data (e.g. tsv extension) it may use tab as separators, so you may try sep = '\t' with read.csv.
This error can get thrown if your data frame has sf geometry columns.

duplicate 'row.names' are not allowed error

I am trying to load a csv file that has 14 columns like this:
StartDate, var1, var2, var3, ..., var14
when I issue this command:
systems <- read.table("http://getfile.pl?test.csv", header = TRUE, sep = ",")
I get an error message.
duplicate row.names are not allowed
It seems to me that the first column name is causing the issue. When I manually download the file and remove the StartDate name from the file, R successfully reads the file and replaces the first column name with X. Can someone tell me what is going on? The file is a (comma separated) csv file.
Then tell read.table not to use row.names:
systems <- read.table("http://getfile.pl?test.csv",
header=TRUE, sep=",", row.names=NULL)
and now your rows will simply be numbered.
Also look at read.csv which is a wrapper for read.table which already sets the sep=',' and header=TRUE arguments so that your call simplifies to
systems <- read.csv("http://getfile.pl?test.csv", row.names=NULL)
This related question points out a part of the ?read.table documentation that explains your problem:
If there is a header and the first row contains one fewer field
than the number of columns, the first column in the input is used
for the row names. Otherwise if row.names is missing, the rows are numbered.
Your header row likely has 1 fewer column than the rest of the file and so read.table assumes that the first column is the row.names (which must all be unique), not a column (which can contain duplicated values). You can fix this by using one of the following two solutions:
adding a delimiter (i.e. \t or ,) to the front or end of your header row in the source file, or,
removing any trailing delimiters in your data
The choice will depend on the structure of your data.
Example:
Here the header row is interpreted as having one fewer column than the data because the delimiters don't match:
v1,v2,v3 # 3 items!!
a1,a2,a3, # 4 items
b1,b2,b3, # 4 items
This is how it is interpreted by default:
   v1,v2,v3   # 3 items!!
a1,a2,a3,  # 4 items
b1,b2,b3,  # 4 items
The first column (with no header) values are interpreted as row.names: a1 and b1. If this column contains duplicates, which is entirely possible, then you get the duplicate 'row.names' are not allowed error.
If you set row.names = FALSE, the shift doesn't happen, but you still have a mismatching number of items in the header and in the data because the delimiters don't match.
Solution 1
Add trailing delimiter to header:
v1,v2,v3, # 4 items!!
a1,a2,a3, # 4 items
b1,b2,b3, # 4 items
Solution 2
Remove excess trailing delimiter from non-header rows:
v1,v2,v3 # 3 items
a1,a2,a3 # 3 items!!
b1,b2,b3 # 3 items!!
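The mismatch can be reproduced and checked in R with a few inline lines. This is a sketch with made-up values (a1 is repeated on purpose so that the row names collide) and made-up variable names; count.fields is used here only as a diagnostic.

```r
# Header has 3 fields; data rows end with a trailing comma, giving 4 fields.
lines <- c("v1,v2,v3", "a1,a2,a3,", "a1,b2,b3,")

# Diagnostic: number of fields on each line.
tc <- textConnection(lines)
n_fields <- count.fields(tc, sep = ",")
close(tc)
print(n_fields)

# One fewer header field than data fields: the first column becomes
# row names, and the repeated "a1" triggers the error.
err <- tryCatch(
  read.table(text = lines, header = TRUE, sep = ","),
  error = function(e) conditionMessage(e)
)
print(err)

# row.names = NULL avoids the error, but the header no longer lines up
# with the data columns.
shifted <- read.table(text = lines, header = TRUE, sep = ",", row.names = NULL)
print(shifted)

# Solution 2: strip the trailing delimiter from every line, then read.
fixed <- read.csv(text = sub(",$", "", lines), stringsAsFactors = FALSE)
print(fixed)
```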
In my case there was a comma at the end of every line. Removing it fixed the problem.
I had this error when opening a CSV file in which one of the fields had commas embedded in it. The field had quotes around it, and I had copied and pasted a read.table call with quote="" in it. Once I took quote="" out, the default behavior of read.table took over and the problem went away. So I went from this:
systems <- read.table("http://getfile.pl?test.csv", header=TRUE, sep=",", quote="")
to this:
systems <- read.table("http://getfile.pl?test.csv", header=TRUE, sep=",")
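A small reproduction of this case, with invented data (the id and name values and the variable names are hypothetical): disabling quoting makes the embedded comma count as a delimiter, so each data row has one more field than the header.

```r
# Quoted name field with an embedded comma; the id value repeats.
csv_lines <- c('id,name,value',
               '1,"Smith, John",10',
               '1,"Doe, Jane",20')

# quote = "" turns off quote handling, so every data row has 4 fields.
# The first column is then taken as row names, and the repeated 1 fails.
err <- tryCatch(
  read.table(text = csv_lines, header = TRUE, sep = ",", quote = ""),
  error = function(e) conditionMessage(e)
)
print(err)

# With the default quote argument the embedded comma stays inside the field.
ok <- read.table(text = csv_lines, header = TRUE, sep = ",")
print(ok)
```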
I used read_csv from the readr package.
In my experience, the parameter row.names=NULL in the read.csv function will lead to a wrong reading of the file if a column name is missing, i.e. every column will be shifted.
read_csv solves this.
Another possible reason for this error is that you have entire rows duplicated. If that is the case, the problem is solved by removing the duplicate rows.
The answer here (https://stackoverflow.com/a/22408965/2236315) by @adrianoesch should help (e.g., solves "If you know of a solution that does not require the awkward workaround mentioned in your comment (shift the column names, copy the data), that would be great." and "...requiring that the data be copied" proposed by @Frank).
Note that if you open the file in a text editor, you should see that the number of header fields is less than the number of columns below the header row. In my case, the data set had a "," missing at the end of the last header field.
It seems the problem can arise for more than one reason. The following steps worked when I had the same error.
I saved my file as MS-DOS CSV. (Earlier it was saved as plain CSV in Excel Starter 2010.)
I opened the CSV in Notepad++. No comma was inconsistent (consistency as described above by @Brian).
I noticed I was not using the argument sep=",". I added it and it worked (even though that is the default argument!).
