How to recode Unicode characters like "\xe9" and "<e9>" to "é" in R?

I read "csv" file where one field has values like "J\xe9rome" or "Jrome" at the same time. How to read this file to have values like "Jérome" or make characters transformation then?
I tried to use
df <- fread(file_name, encoding = "UTF-8")
but it does not work.
Thanks!
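A note on a likely cause: "\xe9" is the Latin-1 byte for "é", so a minimal sketch, assuming the file is actually Latin-1 encoded rather than UTF-8, would declare that encoding to fread, or convert the affected column afterwards with iconv (the column name "name" is a placeholder):
library(data.table)
df <- fread(file_name, encoding = "Latin-1")  # declare the real source encoding
# or convert a column after reading; "name" is a hypothetical column name
df$name <- iconv(df$name, from = "latin1", to = "UTF-8")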

Related

Import of special characters & NA's from CSV into SAS does not work

I have a csv file with many "NA" values and with special characters such as ä, ö or ß. I want to import this csv file into SAS via proc import, but unfortunately I have two problems:
1) The NA's are read as characters and not as missing values
2) Special characters are changed automatically into something like #!+-~
When I import the csv into R I am able to solve both problems with the encoding "UTF-8": NA's are recognized as missing values and special characters are displayed correctly. My idea was to export the file from R as a dbf file and import that dbf file into SAS. This procedure solves the problem with the NA's; however, the special characters are again displayed in a wrong way. I also tried different encodings in SAS, but that did not work either. Any help is highly appreciated!
I would use a data step instead of proc import. It could look like:
Data MyCSV;
  Infile "C:\MyName\ImportData.CSV"
    Delimiter="," LRecL=1000 DSD Missover Firstobs=2; * Firstobs=2 to skip the header row;
  Informat qty_txt $9. ; * 9 .. length in characters;
  Input qty_txt ; * read the column as character, using the informat above;
  If qty_txt ^= "NA" Then qty=Input(qty_txt,Best15.);
  Drop qty_txt;
Run;
(If you're exporting from R, set na="." in write.csv.)
Regarding the special-characters problem, defining the variable as character in the INFORMAT statement should work.
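For the write.csv step mentioned above, a minimal R-side sketch (the data frame df and the output path are assumptions):
# Export with "." for missing values so SAS reads them as missing,
# and write UTF-8 so the special characters survive the round trip
write.csv(df, "C:/MyName/ImportData.csv", na = ".",
          row.names = FALSE, fileEncoding = "UTF-8")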

Problems replacing the €-symbol in strings

I want to replace every "€" in a string with "[euro]". Now this works perfectly fine with
file.col.name <- gsub("€","[euro]", file.col.name, fixed = TRUE)
Now I am looping over column names from a csv-file and suddenly I have trouble with the string "total€".
It works for other special characters (#, ?), but the € sign doesn't get recognized.
grep("€",file.column.name)
also returns 0 and if I extract the last letter it prints "€" but
print(lastletter(file.column.name) == "€")
returns FALSE. (lastletter is just a function to extract the last letter of a string.)
Does anyone have an idea why this happens, and perhaps how to solve it? I checked the class of file.column.name and it returns "character"; I also tried converting it to character again and similar things, but that didn't help.
Thank you!
Your encodings are probably mixed. Check the encodings of the files, then add the appropriate encoding to, e.g., read.csv using fileEncoding="…" as an argument.
If you are working under Unix/Linux, the file utility will tell you the encoding of text files. Otherwise, any editor should show you the encoding of the files.
Common encodings are UTF-8, ISO-8859-15 and windows-1252. Try "UTF-8", "windows-1252" and "latin-9" as values for fileEncoding (the latter being a portable name for ISO-8859-15 according to R's documentation).
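As an illustration of that advice, a sketch (the file name, the chosen encoding, and the use of column names are assumptions):
# Declare the encoding the file was actually written in
df <- read.csv("data.csv", fileEncoding = "windows-1252", stringsAsFactors = FALSE)
Encoding(names(df))  # inspect what R recorded for the column names
# With the encoding declared correctly, the € sign is matched as expected
gsub("€", "[euro]", names(df), fixed = TRUE)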

Extracting portions of a string of characters

I have a string of characters (311,522 characters long). It is in .txt format and all on one line. The data file can be found here. I tried to import it into R like this:
eya4_lagan_HM_cp <- read.table("C:/Documents and Settings/SS/Desktop/Sequence Segmentation/eya4_lagan_HM_cp.txt", quote="\"")
But I get warning messages and it does not import it.
I need to extract portions of this string of characters. That is, I need to extract from 44184 to 44216, meaning the sequence of characters from the 44184th character (inclusive) to the 44216th character (inclusive), then from 151795 to 151844, and so on.
How can I do this?
See https://stackoverflow.com/questions/9068397/import-text-file-as-single-character-string for information on how to read the file into a string; for example, you could use:
fileName <- "C:/Documents and Settings/SS/Desktop/Sequence Segmentation/eya4_lagan_HM_cp.txt"
theData <- readChar(fileName, file.info(fileName)$size)
Also see the readChar docs.
See substr for information on how to extract substrings.
In your case, you could use for example:
mySubstr <- substr(theData, 44184, 44216)
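Since several ranges need to be extracted ("then from 151795 to 151844, and so on"), note that substring() is vectorised over its start and end positions; a sketch using the two ranges from the question:
starts <- c(44184, 151795)
ends   <- c(44216, 151844)
pieces <- substring(theData, starts, ends)  # one substring per range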

Regarding reading files which contain UTF-8 characters

I have a csv file containing Chinese characters, saved as UTF-8.
项目 价格
电视 5000
The first row is the header, the second row is the data; in other words, it is a 1-by-2 table.
I read this the file as follows:
amatrix<-read.table("test.csv",encoding="UTF-8",sep=",",header=T,row.names=NULL,stringsAsFactors=FALSE)
However, the output includes unknown marks in the header, i.e., X.U.FEFF.
That is the byte order mark sometimes found in Unicode text files. I'm guessing you're on Windows, since that's the only popular OS where files can end up with them.
What you can do is read the file using readLines and remove the first two characters of the first line.
txt <- readLines("test.csv", encoding="UTF-8")
txt[1] <- substr(txt[1], 3, nchar(txt[1]))
amatrix <- read.csv(text=txt, ...)
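Alternatively, more recent versions of R can strip the byte order mark at read time with the "UTF-8-BOM" file encoding; a minimal sketch:
amatrix <- read.csv("test.csv", fileEncoding = "UTF-8-BOM",
                    header = TRUE, stringsAsFactors = FALSE)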

How to handle blank items when converting dates in R

I have a csv download of data from a Management Information system. There are some variables which are dates and are written in the csv as strings of the format "2012/11/16 00:00:00".
After reading in the csv file, I convert the date variables into a date using the function as.Date(). This works fine for all variables that do not contain any blank items.
For those which do contain blank items I get the following error message:
"character string is not in a standard unambiguous format"
How can I get R to replace blank items with something like "0000/00/00 00:00:00" so that the as.Date() function does not break? Are there other approaches you might recommend?
If they're strings, does something as simple as
mystr <- c("2012/11/16 00:00:00"," ","")
mystr[grepl("^ *$",mystr)] <- NA
as.Date(mystr)
work? (The regular expression "^ *$" looks for strings consisting of the start of the string (^), zero or more spaces (*), followed by the end of the string ($). More generally, I think you could use "^[[:space:]]*$" to capture other kinds of whitespace, tabs etc.)
Even better, have the NAs correctly inserted when you read in the CSV:
read.csv(..., na.strings='')
or to specify a vector of all the values which should be read as NA...
read.csv(..., na.strings=c('', ' ', '  '))
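Putting the two steps together, a sketch (the file name, the column name, and the date format are assumptions based on the question):
df <- read.csv("mis_export.csv", na.strings = c("", " "), stringsAsFactors = FALSE)
# NA entries stay NA; the rest parse on the date part of "2012/11/16 00:00:00"
df$created <- as.Date(df$created, format = "%Y/%m/%d")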
