R - Converting a CSV file from Unicode to Windows-1252

I wish to read a CSV file in R that I have downloaded from GCS in the Unicode format.
When I try to read the file, I get this warning:
Warning message: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : invalid input found on input connection 'reviews_report_201605.csv'
The data is read up to line 39, where it hits a special character and cannot read any further:
2 basic features in non logged in sections are not working.
😣,2016-05-03T09:52:06Z,1462269126290
The code truncates when it reaches that smiley face. I do not mind reading the smiley face as question marks either.
My workaround was to save the CSV through Notepad as an ANSI file, which converted the same smiley to ??.
How do I do this in R? I have tried multiple methods but none of them work, and doing it manually is not feasible as there are a lot of files.
Since the file is in Unicode, the code I used was as follows:
reviews1 <- read.csv("reviews_report_201605.csv", header = TRUE, stringsAsFactors = FALSE, fileEncoding = "UTF-16LE")
Please do suggest any ideas as to how to resolve this.
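One way to reproduce the Notepad workaround in R, as a sketch under assumptions: the file really is UTF-16LE, and your iconv build accepts the name "CP1252" for Windows-1252. The helper name read_utf16le_csv and the demo file are made up for illustration. The idea is to decode the raw bytes explicitly instead of going through the native encoding, then replace every byte that Windows-1252 cannot represent with "?".

```r
# Hypothetical helper: read a UTF-16LE CSV, replacing characters that
# Windows-1252 cannot represent (such as emoji) with "?".
read_utf16le_csv <- function(path, ...) {
  raw_bytes <- readBin(path, what = "raw", n = file.size(path))
  # Decode the whole file to UTF-8; iconv() accepts a list of raw vectors.
  txt <- iconv(list(raw_bytes), from = "UTF-16LE", to = "UTF-8")
  txt <- sub("^\uFEFF", "", txt)                # drop a leading BOM, if any
  txt <- gsub("\r\n", "\n", txt, fixed = TRUE)  # normalise line endings
  # Round-trip through Windows-1252: unconvertible bytes become "?".
  txt <- iconv(txt, from = "UTF-8", to = "CP1252", sub = "?")
  txt <- iconv(txt, from = "CP1252", to = "UTF-8")
  read.csv(text = txt, stringsAsFactors = FALSE, ...)
}

# Made-up demo file containing an emoji, written as UTF-16LE.
tmp <- tempfile(fileext = ".csv")
sample_text <- "review,date\nnot working \U0001F623,2016-05-03T09:52:06Z\n"
writeBin(iconv(sample_text, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]], tmp)

reviews <- read_utf16le_csv(tmp)
print(reviews)
```

With many files, something like lapply(list.files(pattern = "\\.csv$"), read_utf16le_csv) should apply the same conversion to each one.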

Related

Why is my txt file not being completely read with read.delim?

I am trying to read this large text file (~3 GB) into R, but unfortunately I am not able to load it fully. I'm missing a lot of rows: I get a data frame of ~700 thousand rows, while I know the file has at least 4-5 million.
The code I was initially using was as follows:
df<-read.delim("file.txt",quote = "",comment.char = "")
However, besides noticing that R wasn't loading all the rows, I was also receiving this warning:
Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
embedded nul(s) found in input
I searched for a bit online, and I found I could solve it by adding the skipNul = TRUE argument. When I included it in the read.delim function, the warning stopped showing, but my file keeps missing a lot of rows, and still returns the same number of rows as before.
I have loaded files of similar size in the past, so I'm not sure why this is happening.
If someone has any idea what might be causing the problem, I would be very thankful.
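The "embedded nul(s)" warning often, though not always, means the file is UTF-16 rather than a single-byte encoding, since ASCII characters in UTF-16 carry a zero byte. That is only a guess about this particular file. A small sketch for checking it: the helper guess_bom is hypothetical and simply looks at the first bytes of the file for a byte-order mark.

```r
# Hypothetical helper: guess a file's encoding from its byte-order mark.
# Returns NA when no BOM is present (the file may still be UTF-16 without one).
# Note: a UTF-32LE BOM also starts with FF FE and would be reported as UTF-16LE.
guess_bom <- function(path) {
  b <- as.integer(readBin(path, what = "raw", n = 3))
  starts <- function(sig) length(b) >= length(sig) && all(b[seq_along(sig)] == sig)
  if (starts(c(0xEF, 0xBB, 0xBF))) {
    "UTF-8-BOM"
  } else if (starts(c(0xFF, 0xFE))) {
    "UTF-16LE"
  } else if (starts(c(0xFE, 0xFF))) {
    "UTF-16BE"
  } else {
    NA_character_
  }
}

# Made-up demo: a tiny UTF-16LE file with a BOM, and a plain ASCII file.
utf16_file <- tempfile()
writeBin(c(as.raw(c(0xFF, 0xFE)),
           iconv("a\tb\n", from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]),
         utf16_file)
ascii_file <- tempfile()
writeLines("a\tb", ascii_file)

guess_bom(utf16_file)
guess_bom(ascii_file)
```

If the result is not NA, passing it as fileEncoding to read.delim (alongside quote = "" and comment.char = "") would be the next thing to try.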

Reading csv file using R and RStudio

I am trying to read a csv file in R, but I am getting some errors.
This is what I have and also I have set the correct path
mydata <- read.csv("food_poisioning.csv")
But I am getting this error
Error in make.names(col.names, unique = TRUE) :
invalid multibyte string at '<ff><fe>Y'
In addition: Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote, :
line 1 appears to contain embedded nulls
2: In read.table(file = file, header = header, sep = sep, quote = quote, :
line 2 appears to contain embedded nulls
I believe I am getting this error because my csv file is actually not separated by commas, but by spaces. This is what it looks like:
I tried using sep=" ", but it didn't work.
If you're having difficulty using read.csv() or read.table() (or writing other import commands), try using the "Import Dataset" button on the Environment panel in RStudio. It is useful especially when you are not sure how to specify the table format or when the table format is complex.
For your .csv file, use "From Text (readr)..."
A window will pop up and allow you to choose a file/URL to upload. You will see a preview of the data table after you select a file/URL. You can click on the column names to change the column class or even "skip" the column(s) you don't need. Use the Import Options to further manage your data.
Here is an example using CreditCard.csv from Vincent Arel-Bundock's GitHub projects:
You can also modify and/or copy and paste the code in Code Preview, or click Import to run the code when you are ready.
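The bytes <ff><fe> in the error message are the UTF-16 little-endian byte-order mark, and the "embedded nulls" warnings are consistent with that, so the encoding may matter more than the separator here. A minimal sketch, assuming the file is UTF-16LE; the real separator is unknown, so sep is something to adjust, and the demo file and its columns are made up.

```r
# Made-up demo file: UTF-16LE with a BOM, tab-separated.
tmp <- tempfile(fileext = ".csv")
body <- iconv("Year\tCases\n2001\t5\n2002\t7\n",
              from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
writeBin(c(as.raw(c(0xFF, 0xFE)), body), tmp)

# Declare the file's encoding so R decodes it before parsing.
# sep = "\t" is an assumption; change it to match the actual file.
mydata <- read.csv(tmp, sep = "\t", fileEncoding = "UTF-16LE")
str(mydata)
```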

not able to read file using read.csv in R

I am not able to read a csv file in R. The file I imported needs some cleaning, such as removing text qualifiers like ", ' etc., but I am unable to read it at all. It shows the following error.
currency<-read.csv("02prepared data/scraped data kickstarter/film & video1.csv")
Error in type.convert(data[[i]], as.is = as.is[i], dec = dec, numerals = numerals, :
invalid multibyte string at '45,<30>97'
here is the link to the file:- https://drive.google.com/open?id=1ABXPoYxk8b4WCQuRAu-Hhh2OvpJ76PhH
You can try setting fileEncoding = 'latin1', as suggested in this answer:
https://stackoverflow.com/a/14363274/6304113
I tried the method in the link to read your file, and it works for me.
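For reference, a small self-contained sketch of what that suggestion looks like. The demo file and its columns are made up; it contains one Latin-1 byte (0xE9, "é") that is not valid UTF-8, which is the kind of input that can trigger an "invalid multibyte string" error when no encoding is declared.

```r
# Made-up demo file encoded as Latin-1: 0xE9 is "é" in that encoding.
tmp <- tempfile(fileext = ".csv")
writeBin(c(charToRaw("name,pledged\nCaf"), as.raw(0xE9), charToRaw(" film,45\n")), tmp)

# Declaring the file's encoding lets R re-encode the text while reading.
currency <- read.csv(tmp, fileEncoding = "latin1", stringsAsFactors = FALSE)
print(currency)
```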

Encoding for read.csv: data is truncated and special characters are not read properly

I want to read a csv file with 5.000 observations into RStudio. If I set the encoding to UTF-8, only 3.500 observations are imported and I get 2 warning messages:
# Example code
options(encoding = "UTF-8")
df <- read.csv("path/data.csv", dec = ".", sep = ",")
1: invalid input found on input connection
2: EOF within quoted string
Following this thread, I found some encodings that are able to read the whole csv file, e.g. windows-1258. However, with this encoding special characters such as ä, ü or ß are not read properly.
My guess is that UTF-8 would be the right encoding, but that something is wrong with the character/factor variables of the csv file. For that reason I need a way to read the whole csv file with UTF-8. Any help is highly appreciated!
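One way to test that guess is to look for lines that are not valid UTF-8. A sketch using base R's validUTF8() (available since R 3.3.0); the helper name find_invalid_utf8 and the demo file are made up. If the flagged lines turn out to hold single bytes such as 0xE4 for ä, the file is more likely Latin-1 or Windows-1252 than UTF-8, though that is an inference and not something the question confirms.

```r
# Hypothetical helper: return the line numbers that are not valid UTF-8.
find_invalid_utf8 <- function(path) {
  lines <- readLines(path, warn = FALSE)  # bytes are kept as they are
  which(!validUTF8(lines))
}

# Made-up demo: line 2 contains 0xE4, which is "ä" in Latin-1 but invalid UTF-8.
tmp <- tempfile(fileext = ".csv")
writeBin(c(charToRaw("id,name\n1,B"), as.raw(0xE4), charToRaw("r\n2,Fox\n")), tmp)

find_invalid_utf8(tmp)
```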

Scopus_ReadCSV {CITAN} not working with csv file exported from Scopus

I am using Rstudio with R 3.3.1 on Windows 7 and I have installed CITAN package. I am trying to import bibliography entries from a CSV file that I exported from Scopus (as it is, untouched), choosing to export all available information.
This is the error that I get:
example <- Scopus_ReadCSV("scopus.csv")
Error in Scopus_ReadCSV("scopus.csv") : Column not found: `Source'.
In addition: Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote, :
invalid input found on input connection 'scopus.csv'
2: In read.table(file = file, header = header, sep = sep, quote = quote, :
incomplete final line found by readTableHeader on 'scopus.csv'
Column `Source' is there when I open the file, so I do not know why it says 'not found'.
Eventually I came to the following conclusions:
The encoding of the CSV file as exported from Scopus was UTF-8-BOM, which does not seem to be recognized by R when using Scopus_ReadCSV("file.csv") or read.table("file.csv", header = TRUE, sep = ",", fileEncoding = "UTF-8").
Although the Scopus file declares an encoding, it can still contain some "strange" non-English characters that the read function in R cannot handle. (I mainly found this problem in names with special characters.)
Solutions for those issues:
Open the CSV file with an editor such as Notepad++ and save it with UTF-8 encoding, so that R can read it as UTF-8.
When running the read function in R you will notice that it stops reading (e.g. at the 40th out of 200 records). See where exactly it stopped; that way you can find the special character by opening the CSV in the editor, and then erase or change it as you wish so that the same issue does not occur in R again.
Another solution that worked for me:
Open the file in Google Sheets, then download it from there again as a *.csv-file. R opens it correctly afterwards.
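For plain read.csv or read.table, R also accepts "UTF-8-BOM" as a fileEncoding value, which discards the byte-order mark so it does not end up glued to the first column name. A minimal sketch with a made-up two-column file; whether Scopus_ReadCSV forwards such an argument is not something I have checked.

```r
# Made-up demo file: UTF-8 with a BOM (EF BB BF) in front of the header.
tmp <- tempfile(fileext = ".csv")
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)), charToRaw("Source,Year\nScopus,2016\n")), tmp)

# "UTF-8-BOM" tells R to skip the BOM before parsing the header line.
with_bom <- read.csv(tmp, fileEncoding = "UTF-8-BOM", stringsAsFactors = FALSE)
names(with_bom)
```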
