read.csv in R skipping non-standard rows [duplicate] - r

I was reading in a csv file.
Code is:
mydata = read.csv("mycsv.csv", header = TRUE, sep = ",", quote = "\"")
Get the following warning:
Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
embedded nul(s) found in input
Now some cells in my CSV have missing values that are represented by "".
How do I write this code so that I do not get the above warning?

Your CSV might be encoded in UTF-16. This isn't uncommon when working with some Windows-based tools.
You can try loading a UTF-16 CSV like this:
read.csv("mycsv.csv", ..., fileEncoding="UTF-16LE")
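If you want to confirm the encoding first, a minimal check (reusing the file name from the question): a UTF-16LE file typically begins with the byte-order mark 0xFF 0xFE.
first_bytes <- readBin("mycsv.csv", what = "raw", n = 2)  # read the first two bytes
if (identical(first_bytes, as.raw(c(0xFF, 0xFE)))) {  # 0xFF 0xFE marks UTF-16LE
  mydata <- read.csv("mycsv.csv", fileEncoding = "UTF-16LE")
}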

You can try using the skipNul = TRUE option.
mydata = read.csv("mycsv.csv", quote = "\"", skipNul = TRUE)
From ?read.csv
Embedded nuls in the input stream will terminate the field currently being read, with a warning once per call to scan. Setting skipNul = TRUE causes them to be ignored.
It worked for me.

This has nothing to do with the encoding. The problem is the nul characters in the file. To handle them, pass the skipNul = TRUE parameter.
For example:
neg = scan('F:/Natural_Language_Processing/negative-words.txt', what = 'character', comment.char = '', encoding = "UTF-8", skipNul = TRUE)

The file might not have CRLF line endings, only LF. Try checking the hex output of the file.
If so, try running the file through awk:
awk '{printf "%s\r\n", $0}' file > new_log_file
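If you would rather check from within R instead of a hex viewer, a rough sketch (the file name is a placeholder): CR is byte 0x0d, so its absence suggests LF-only line endings.
bytes <- readBin("mycsv.csv", what = "raw", n = 1000)  # sample the first 1000 bytes
any(bytes == as.raw(0x0d))  # FALSE suggests LF-only line endings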

I had the same error message and figured out that although my files had a .csv extension and opened with no problems in a spreadsheet, they were actually saved as "All Formats" rather than "Text CSV (.csv)".

Another quick solution:
Double check that you are, in fact, reading a .csv file!
I was accidentally reading a .rds file instead of .csv and got this "embedded null" error.
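If the file really is an RDS file, the matching reader is readRDS() rather than read.csv() (the file name here is just for illustration):
mydata <- readRDS("mydata.rds")  # RDS files store a single serialized R object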

In those cases, also make sure the data you are importing does not contain "#" characters; if it does, try the option comment.char = "". It worked for me.
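For context, read.csv() already defaults to comment.char = "", but read.table() defaults to "#", so this mainly matters there; a minimal sketch:
mydata <- read.table("mycsv.csv", header = TRUE, sep = ",", comment.char = "")  # disable comment handling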

Related

not able to read file using read.csv in R

I am not able to read a csv file in R. The file I imported needs some cleaning, such as removing text qualifiers like " and '. Still I am unable to read it; it shows the following error.
currency<-read.csv("02prepared data/scraped data kickstarter/film & video1.csv")
Error in type.convert(data[[i]], as.is = as.is[i], dec = dec, numerals = numerals, :
invalid multibyte string at '45,<30>97'
here is the link to the file:- https://drive.google.com/open?id=1ABXPoYxk8b4WCQuRAu-Hhh2OvpJ76PhH
You can try setting fileEncoding = 'latin1', as suggested in this answer:
https://stackoverflow.com/a/14363274/6304113
I tried the method in the link to read your file, and it works for me.
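Applied to the call from the question, that would look like this:
currency <- read.csv("02prepared data/scraped data kickstarter/film & video1.csv", fileEncoding = "latin1")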

R/RStudio changes name of column when read from csv

I am trying to read in a file in R, using the following command (in RStudio):
fileRaw <- read.csv(file = "file.csv", header = TRUE, stringsAsFactors = FALSE)
file.csv looks something like this: [screenshot of file.csv]
However, when it's read into R, I get: [screenshot of the imported data]
As you can see LOCATION is changed to ï..LOCATION for seemingly no reason.
I tried adding check.names = FALSE but this only made it worse, as LOCATION is now replaced with ï»¿LOCATION. What gives?
How do I fix this? Why is R/RStudio doing this?
There is a UTF-8 BOM at the beginning of the file. Try reading as UTF-8, or remove the BOM from the file.
The UTF-8 representation of the BOM is the (hexadecimal) byte sequence
0xEF,0xBB,0xBF. A text editor or web browser misinterpreting the text
as ISO-8859-1 or CP1252 will display the characters ï»¿ for this.
Edit: looks like using fileEncoding = "UTF-8-BOM" fixes the problem in RStudio.
Using fileEncoding = "UTF-8-BOM" fixed my problem and read the file with no issues.
Using fileEncoding = "UTF-8"/encoding = "UTF-8" did not resolve the issue.
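Putting that together with the call from the question:
fileRaw <- read.csv(file = "file.csv", header = TRUE, stringsAsFactors = FALSE, fileEncoding = "UTF-8-BOM")
You can also confirm the BOM is present by inspecting the first three bytes:
readBin("file.csv", what = "raw", n = 3)  # ef bb bf indicates a UTF-8 BOM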

Read csv in r with special characters

I am trying to read a data file into R with several delimited columns. Some columns have entries that are special characters (such as an arrow). read.table comes back with an error:
incomplete final line found by readTableHeader
and does not read the file. I tried the UTF-8 and UTF-16 encoding options, which didn't help either. Here is a small example file.
I am not able to reproduce the arrow in this question box, hence I am attaching the image of the notepad screen of a small file (test1.txt).
Here is what I get when I try to open it.
test <- read.table("test1.TXT", header=T, sep=",", fileEncoding="UTF-8", stringsAsFactor=F)
Warning message: In read.table("test1.TXT", header = T, sep = ",",
fileEncoding = "UTF-8", : incomplete final line found by
readTableHeader on 'test1.TXT'
However, if I remove the second line (with the special character) and try to import the file, R imports it without problem.
test2.txt =
id, ti, comment
1001, 105AB, "All OK"
test <- read.table("test2.TXT", header=T, sep=",", fileEncoding="UTF-8", stringsAsFactor=F)
id ti comment
1 1001 105AB All OK
Although this is a small example, the file I am working with is very large. Is there a way I can import the file to R with those special characters in place?
Thank you.
[attached image: test1.txt]
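One workaround sketch, not from the original thread (it assumes the special characters are valid UTF-8 and the warning comes from a missing final newline): readLines() tolerates an incomplete final line, and read.table() can parse the text it returns.
lines <- readLines("test1.TXT", encoding = "UTF-8", warn = FALSE)  # warn = FALSE suppresses the incomplete-final-line warning
test <- read.table(text = lines, header = TRUE, sep = ",", stringsAsFactors = FALSE)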

Scopus_ReadCSV {CITAN} not working with csv file exported from Scopus

I am using Rstudio with R 3.3.1 on Windows 7 and I have installed CITAN package. I am trying to import bibliography entries from a CSV file that I exported from Scopus (as it is, untouched), choosing to export all available information.
This is the error that I get:
example <- Scopus_ReadCSV("scopus.csv")
Error in Scopus_ReadCSV("scopus.csv") : Column not found: `Source'.
In addition: Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote, :
invalid input found on input connection 'scopus.csv'
2: In read.table(file = file, header = header, sep = sep, quote = quote, :
incomplete final line found by readTableHeader on 'scopus.csv'
Column `Source' is there when I open the file, so I do not know why it says 'not found'.
Eventually I came to the following conclusions:
The encoding of the CSV file as exported from Scopus was UTF-8-BOM, which does not seem to be recognized by R when using Scopus_ReadCSV("file.csv") or read.table("file.csv", header = TRUE, sep = ",", fileEncoding = "UTF-8").
Although the file from Scopus comes with that encoding, it can contain some "strange" non-English characters that the read function in R cannot handle. (I mainly found this problem in names with special characters.)
Solutions for those issues:
Open the CSV file with a text editor like Notepad++ and save it with UTF-8 encoding so that it becomes readable for R as UTF-8.
When running the read function in R you will notice that it stops reading at some point (e.g. at the 40th of 200 records). Check where exactly it stopped to locate the special character, open the CSV in the editor, and erase or change it so that the same issue does not occur in R again.
Another solution that worked for me:
Open the file in Google Sheets, then download it from there again as a *.csv-file. R opens it correctly afterwards.
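For completeness, R's connections also accept the encoding name "UTF-8-BOM", which strips the BOM on read; a sketch, assuming the BOM is the only problem (this was not part of the original answers):
scopus <- read.table("scopus.csv", header = TRUE, sep = ",", fileEncoding = "UTF-8-BOM")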

Get "embedded nul(s) found in input" when reading a csv using read.csv()

I was reading in a csv file.
Code is:
mydata = read.csv("mycsv.csv", header=True, sep=",", quote="\"")
Get the following warning:
Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
embedded nul(s) found in input
Now some cells in my CSV have missing values that are represented by "".
How do I write this code so that I do not get the above warning?
Your CSV might be encoded in UTF-16. This isn't uncommon when working with some Windows-based tools.
You can try loading a UTF-16 CSV like this:
read.csv("mycsv.csv", ..., fileEncoding="UTF-16LE")
You can try using the skipNul = TRUE option.
mydata = read.csv("mycsv.csv", quote = "\"", skipNul = TRUE)
From ?read.csv
Embedded nuls in the input stream will terminate the field currently being read, with a warning once per call to scan. Setting skipNul = TRUE causes them to be ignored.
It worked for me.
This is nothing to do with the encoding. This is the problem with reading of the nulls in the file. To handle that, you need to pass the skipNul = TRUE paramater.
for example:
neg = scan('F:/Natural_Language_Processing/negative-words.txt', what = 'character', comment.char = '', encoding = "UTF-8", skipNul = TRUE)
Might be a file that do not have CRLF, might only have LF. Try to check the HEX output of the file.
If so. Try running the file through awk:
awk '{printf "%s\r\n", $0}' file > new_log_file
I had the same error message and figured out that although my files had a .csv extensions and opened up with no problems in a spreadsheet, they were actually saved as ¨All Formats¨ rather than ¨Text CSV (.csv)¨
Another quick solution:
Double check that you are, in fact, reading a .csv file!
I was accidentally reading a .rds file instead of .csv and got this "embedded null" error.
In those cases be sure the data you are importing does not have "#" characters but if that the case try using the option comment.char="". It worked for me.
