I have exported data from a result grid in SQL Server Management Studio to a csv file.
The csv file looks correct.
But when I read the data into an R dataframe using read.csv, the first column name is prepended with "ï..". How do I get rid of this junk text?
Example:
str(trainData)
'data.frame': 64169 obs. of 20 variables:
$ ï..Column1 : int 3232...
$ Column2 : int 4242...
The data looks something like this (nothing special) :
Column1,Column2
100116577,100116577
100116698,100116702
You've got a Unicode UTF-8 BOM at the start of the file:
http://en.wikipedia.org/wiki/Byte_order_mark
A text editor or web browser interpreting the text as ISO-8859-1 or
CP1252 will display the characters ï»¿ for this.
R is giving you the ï and then converting the other two into dots as they are non-alphanumeric characters.
Here:
http://r.789695.n4.nabble.com/Writing-Unicode-Text-into-Text-File-from-R-in-Windows-td4684693.html
Duncan Murdoch suggests:
You can declare a file to be in encoding "UTF-8-BOM" if you want to
ignore a BOM on input
So try your read.csv with fileEncoding="UTF-8-BOM" or persuade your SQL wotsit to not output a BOM.
Otherwise you may as well test if the first name starts with ï.. and strip it with substr (as long as you know you'll never have a column that does start like that genuinely...)
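A minimal sketch of the fileEncoding suggestion, assuming a file that really does start with the three BOM bytes (EF BB BF). The temporary file below is built only for illustration and reuses the sample rows from the question:

```r
# Build a small CSV that starts with a UTF-8 BOM, as the SSMS export does
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)
writeBin(charToRaw("Column1,Column2\n100116577,100116577\n100116698,100116702\n"), con)
close(con)

# Declaring the encoding as UTF-8-BOM tells R to skip the BOM on input
trainData <- read.csv(tmp, fileEncoding = "UTF-8-BOM")
names(trainData)
```

With a plain read.csv(tmp) the first name may come back mangled, depending on platform and locale, which is the symptom described in the question.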
I have a character vector which includes escape sequences such as \n.
I want to save this vector to a txt-file which shows the \n.
The code below creates a txt file, but when opening the file e.g. in notepad, the \n are translated into linebreaks. I don't want to see the linebreaks, I want to see \n.
library(tidyverse)
my_text <- "hello\n, nice to meet you.\n\n. how are you?"
my_text
readr::write_file(my_text, "my_text.txt")
this is what notepad shows:
hello
, nice to meet you.
. how are you?
but I want to see
"hello\n, nice to meet you.\n\n. how are you?"
I thought there was some kind of 'raw option', but seems like I am wrong. Thx!
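One possible approach, offered as a sketch rather than a definitive answer: base R's encodeString() escapes control characters such as newlines, so the text can be written out with a visible \n. The output path is a temporary file chosen for illustration.

```r
my_text <- "hello\n, nice to meet you.\n\n. how are you?"

# encodeString() turns each newline into the two characters backslash + n;
# quote = '"' also wraps the result in double quotes
escaped <- encodeString(my_text, quote = '"')

out_file <- tempfile(fileext = ".txt")
writeLines(escaped, out_file)

# The file now holds a single line with visible \n sequences
readLines(out_file)
```

Note that encodeString() also escapes existing backslashes and some other characters, which may or may not be wanted.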
I have the following table:
chr [1:1000] "10/16" "1/5" "6/16" "2/5" "7/11" "6/6" "5/5" "2/5" "14/16" "3/5"
"5/5" "7/14" "9/9" "8/9" "7/9" "4/9" "5/5" "1/5" "6/9" "14/16" ...
I want to save this as csv, but when I open the csv file the format automatically changes to "date", so I get Oct-2016, 1-May, June-2016 and so on...
How can I export the table "as is"?
You write,
when I open the csv file ...
Assuming you are doing this in Excel, understand that the csv file is most likely correct, which you can confirm by opening it in Notepad.
That being the case, one fix is to Import the data by using Data ► Get External Data ► From Text. This will open the Text-to-Columns wizard and you can specify the column containing the values as Text.
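Just add a space or any other character (a period in this example) after each item:
chr [1:1000] "10/16." "1/5." "6/16." "2/5." "7/11." ...
Here is the code, which appends a space; pass "." instead to get the output shown above:
Y <- paste0(X, " ")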
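A small sketch of that workaround end to end, assuming the values live in a character vector X. The column name value and the temporary file are made up for illustration. Reading the file back as plain text is a way to confirm what was actually written, independent of how Excel chooses to display it.

```r
X <- c("10/16", "1/5", "6/16", "2/5", "7/11")

# Append a character so the value no longer looks like a date
Y <- paste0(X, ".")

out_file <- tempfile(fileext = ".csv")
write.csv(data.frame(value = Y), out_file, row.names = FALSE)

# Inspect the raw text of the file, as one would in Notepad
csv_lines <- readLines(out_file)
csv_lines
```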
I'm getting the following when I try to read in a fixed width text file using read.fwf.
Here is the output:
invalid multibyte string at 'ETE<52> O 19950207 19031103 537014290 7950 WILLOWS RD
Here are the most relevant lines of code
fieldWidths <- c(10,50,30,40,6,8,8,9,35,30,9,2)
colNames <- c("certNum", "lastN", "firstN", "middleN", "suffix", "daDeath", "daBirth", "namesSSN", "namesResStr", "namesResCity", "namesResZip", "namesStCode")
dmhpNameDF <- read.fwf(fileName, widths = fieldWidths, col.names=colNames, sep="", comment.char="", quote="", fileEncoding="WINDOWS-1258", encoding="WINDOWS-1258")
I'm running R 3.1.1 on Mac OS X 10.9.4.
As you can see, I've experimented with specifying alternative encodings: I've tried latin1 and UTF-8 as well as WINDOWS-1250 through 1258.
When I read this file into Excel, Word, or TextEdit, everything looks good in general. Using the error message text, I can identify the offending line as row 5496, and upon inspection I can see that the offending character shows up as an italic-looking letter 'f'. Searching for that character reveals about 4 instances of it in this file. I have many such files to process, so going through them one by one to delete the offending character is not a good solution.
So far, the offending character always shows up in a name field, which is good for me, as I don't actually need the name data from this file. If it were a numeric field that was corrupted, I'd have to toss out the row.
Since Word and Excel can read the file (apparently rendering the offending character as an italic 'f'), surely there must be a way to read it in with R, but I've not figured out a solution. I have searched through the many examples of questions related to "invalid multibyte string", but have not found any info that resolved my problem.
My goal is to be able to read in the data either ignoring this "character error" or substituting the offending character with something else.
Unfortunately the file in question contains sensitive information so I can not post a copy of it for people to play with.
Thanks
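One way the "substitute the offending character" goal might be approached, as a sketch only: read the raw lines, use iconv() with its sub argument to replace any byte that cannot be converted, then hand the cleaned lines to read.fwf. The two-row file, the byte 0xC4, the shortened widths and the choice of latin1 as the source encoding are all assumptions made for illustration, since the real file cannot be shared.

```r
# Build a tiny fixed-width file containing one non-ASCII byte (assumed 0xC4)
tmp <- tempfile(fileext = ".txt")
con <- file(tmp, open = "wb")
writeBin(c(charToRaw("ETE"), as.raw(0xC4), charToRaw("19950207\n")), con)
writeBin(charToRaw("SMIT19031103\n"), con)
close(con)

# Read lines as-is, then replace every byte that has no ASCII equivalent with "?"
# (one byte in, one character out, so the field widths are preserved)
rawLines <- readLines(tmp, warn = FALSE)
cleanLines <- iconv(rawLines, from = "latin1", to = "ASCII", sub = "?")

# Parse the cleaned text with the usual fixed-width reader
demoDF <- read.fwf(textConnection(cleanLines),
                   widths = c(4, 8),
                   col.names = c("lastN", "daDeath"),
                   colClasses = c("character", "character"))
demoDF
```

Treating each byte as one latin1 character keeps every column at its original offset, which matters for fixed-width data. Whether latin1 is the right source encoding for the real files is not known here.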
I have a csv file containing Chinese characters, saved as UTF-8.
项目 价格
电视 5000
The first row is the header and the second row is data. In other words, it is a one-by-two table.
I read the file as follows:
amatrix<-read.table("test.csv",encoding="UTF-8",sep=",",header=T,row.names=NULL,stringsAsFactors=FALSE)
However, the output includes unknown characters in the header, i.e., X.U.FEFF
That is the byte order mark sometimes found in Unicode text files. I'm guessing you're on Windows, since that's the only popular OS where files can end up with them.
What you can do is read the file using readLines and remove the first two characters of the first line.
txt <- readLines("test.csv", encoding="UTF-8")
txt[1] <- substr(txt[1], 3, nchar(txt[1]))
amatrix <- read.csv(text=txt, ...)
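A variation on the same idea, sketched under assumptions: rather than dropping a fixed number of characters, strip the BOM only if it is actually present by matching its three bytes (EF BB BF) at the start of the first line. The file below uses ASCII stand-ins (item, price) for the Chinese headers so the example behaves the same in any locale; those names are hypothetical.

```r
# Build a small CSV that begins with a UTF-8 BOM
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)
writeBin(charToRaw("item,price\nTV,5000\n"), con)
close(con)

txt <- readLines(tmp, encoding = "UTF-8")

# Remove the BOM bytes from the first line only if they are there
txt[1] <- sub("^\xef\xbb\xbf", "", txt[1], useBytes = TRUE)

amatrix <- read.csv(text = txt, stringsAsFactors = FALSE)
names(amatrix)
```

Matching by pattern avoids cutting into the real header when a file happens to have no BOM, or when the BOM counts as one character rather than several.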