I am trying to import a text file into R, saved with TextWrangler as Unicode (UTF-8) with Unix (LF) line endings.
Here is the code I am using:
scan("Testi/PIRANDELLOsigira.txt", fileEncoding='UTF-8', what=character(), sep='\n')
I got the following warning:
Read 6 items
Warning message:
In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
invalid input found on input connection 'Testi/PIRANDELLOsigira.txt'
and a vector that stops at the first accented character.
First, change your locale from Italian to English:
Sys.setlocale(category="LC_ALL", locale = "English_United States.1252")
Then you can read the data with the Italian (latin1) encoding:
df_ch <- read.table("test.utf8",
                    sep = ",",
                    header = TRUE,
                    fileEncoding = "latin1")
If you only want to read the data with UTF-8 encoding, you can simply use the following:
yourdf <- read.table("path/to/your/data.utf8",
                     sep = ",",
                     header = TRUE,
                     fileEncoding = "UTF-8")
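Tying this back to the original scan() call, here is a minimal self-contained sketch (using a throwaway temporary file in place of the asker's path) showing the two ways to handle the UTF-8 input:

```r
# Sketch with a throwaway file in place of the asker's path: write accented
# Italian text as UTF-8, then read it back.
tmp <- tempfile(fileext = ".txt")
con <- file(tmp, open = "w", encoding = "UTF-8")
writeLines(c("Così è (se vi pare)", "perché", "più"), con)
close(con)

# readLines() with encoding = "UTF-8" marks the strings as UTF-8 without
# converting them, so it works regardless of the current locale:
lines <- readLines(tmp, encoding = "UTF-8")
length(lines)  # 3, with accents intact

# scan(..., fileEncoding = "UTF-8") instead converts the input to the native
# encoding, which only succeeds when the locale can represent the accented
# characters -- that is why the answer above changes the locale first:
# scan(tmp, what = character(), sep = "\n", fileEncoding = "UTF-8")
```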
Related
I'm trying to read a large CSV file into R. The file is available at https://github.com/AidData-WM/public_datasets/releases/download/v3.0/AidDataCore_ResearchRelease_Level1_v3.0.zip and the README states that the encoding is UTF-8 and that there should be 1,561,039 rows and 68 columns. I have tried several different ways to read in the data, but cannot get the full dataset. I think some problems might arise because: (i) there are incomplete quotations inside character strings, (ii) there are commas inside character strings while sep="," (so I can't use quote="" to deal with the quotation issue), and (iii) there are unusual characters such as arrows.
Here are my various attempts to read the data and the resulting warnings:
aid <- read.csv("AidDataCoreFull_ResearchRelease_Level1_v3.0.csv", header = TRUE, encoding = "UTF-8")
> dim(aid)
[1] 9960 68
Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
EOF within quoted string
aid <- read.table("AidDataCoreFull_ResearchRelease_Level1_v3.0.csv", header = TRUE, sep = ",", encoding = "UTF-8")
> dim(aid)
[1] 9960 68
Warning messages:
1: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
EOF within quoted string
2: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
number of items read is not a multiple of the number of columns
aid <- read.csv("AidDataCoreFull_ResearchRelease_Level1_v3.0.csv", header = FALSE, skip = 1, quote = "", encoding = "UTF-8")
> dim(aid)
[1] 10956 72
No warning message this time, but nowhere near the full number of rows was read in, and now there are too many columns.
tx <- readLines("AidDataCoreFull_ResearchRelease_Level1_v3.0.csv", encoding = "UTF-8", skipNul = TRUE)
> length(tx)
[1] 9961
Warning message:
In readLines("AidDataCoreFull_ResearchRelease_Level1_v3.0.csv", :
incomplete final line found on 'AidDataCoreFull_ResearchRelease_Level1_v3.0.csv'
I can't find a combination of commands that reads in the full CSV, and I can't open it in Excel to view and try to tidy up the data. Any help would be greatly appreciated!
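A reduced sketch of the quoting failure with synthetic data (not the actual AidData rows): an unmatched quote at the start of a field makes scan() read on to the end of the file, and count.fields() plus quote = "" let you see and recover the rows. Alternatives such as data.table::fread() or readr::read_csv() are often more tolerant of malformed quoting, but the base-R version is:

```r
# Synthetic reproduction of "EOF within quoted string": the quote opening the
# second field of line 3 is never closed.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,note",
             "1,fine",
             "2,\"runaway quote",
             "3,fine"), tmp)

# With default quoting, the unmatched quote swallows everything after it.
# Treating quotes as literal characters restores a regular layout:
table(count.fields(tmp, sep = ",", quote = ""))  # every line has 2 fields

d <- read.csv(tmp, quote = "")
nrow(d)  # 3: no rows lost; the stray quote simply remains in the text
```

This does not solve commas inside quoted strings (the asker's point (ii)); it only shows how to find and read past broken quoting.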
I have a CSV file with non-ASCII characters in it. I simply want to remove those characters and read my CSV file.
> tables <- lapply('/.././abc.csv', read.csv,header=F,stringsAsFactors=FALSE,fileEncoding="UTF-8")
Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
invalid input found on input connection '/.././abc.csv'
> df= suppressWarnings(do.call(rbind, tables))
It is not reading the complete file: it reads only the records before the non-ASCII character and skips all the records after it.
I cannot use iconv('/.././abc.csv', "latin1", "ASCII", sub="") directly, since iconv() expects x to be a character vector, not a file path.
cat '/.././abc.csv'
88036,120,151036.656250,2017-07-17 22:27:49,17-07-17 22:27:49
88036,120,151036.671875,2017-07-17 22:27:53,17-07-17 22:27:53
88036,310,151036.687500,2017-07-17 22:27:58,17-07-17 22:27:58
88036,310,151036.703▒▒F▒▒B▒▒▒D▒%▒▒▒2▒T▒▒K222642,17-07-17 22:28:03,2017-07-17 22:28:03
88036,310,151036.484375,2017-07-17 22:26:54,17-07-17 22:26:54
88036,310,151036.500000,2017-07-17 22:26:59,17-07-17 22:26:59
It is skipping the last 2 records after reading the CSV file. Any help?
What if you read the file first and then do
td <- td[, lapply(.SD, function(x) iconv(x, "latin1", "ASCII", sub = ""))]
assuming that you read your CSV file in as a data.table?
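A base-R sketch of the same idea, with made-up values standing in for the parsed CSV: since iconv() operates on character vectors rather than files, run it over each column after the read.

```r
# Made-up data frame standing in for the parsed CSV; the first note contains
# a non-ASCII character, as in the corrupted row above.
df <- data.frame(id   = c(88036L, 88036L),
                 note = c("caf\u00e9", "plain"),
                 stringsAsFactors = FALSE)

# Strip non-ASCII from every character column; other columns pass through.
df[] <- lapply(df, function(x) {
  if (is.character(x)) iconv(x, from = "UTF-8", to = "ASCII", sub = "") else x
})
df$note  # "caf" "plain": the accented character is dropped
```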
I am appending 100 files in a folder, delimited by "|". Below is the code I used. I am getting an error which I am not able to debug:
library(plyr)
file_list <- list.files()
dataset <- ldply(file_list, read.table, header = TRUE, sep = "|")
ERROR - Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 284 did not have 12 elements
Please help if you have experience with this.
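One way to locate the offending line before appending (a sketch with a tiny placeholder file, since the poster's data is not shown): count.fields() reports how many "|"-separated fields each line has.

```r
# Placeholder file standing in for one of the 100 inputs; the last line is
# one field short, mimicking "line 284 did not have 12 elements".
tmp <- tempfile()
writeLines(c("a|b|c", "1|2|3", "4|5"), tmp)

counts <- count.fields(tmp, sep = "|")
table(counts)                 # distribution of field counts per line
which(counts != max(counts))  # 3: the short line to inspect or repair
```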
I am trying to read a huge file (2GB in size) with this:
data1<-read.table("file1.txt", sep=",",header=F)
I get this error:
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 513836 did not have 8 elements
Is there a way to skip lines with missing data, or to replace the missing values with NA?
This error is most commonly fixed by adding fill = TRUE to your read.table() call. In your case, it would be the following:
data1 <- read.table("file1.txt", sep = ",", fill = TRUE)
Additionally, header = FALSE is the default setting for the header argument in read.table() and therefore unnecessary in your code.
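A self-contained illustration of the fix, with toy data in place of the 2 GB file: fill = TRUE pads short rows with NA instead of aborting the read.

```r
# Toy file where the second line is one element short.
tmp <- tempfile()
writeLines(c("1,2,3", "4,5", "6,7,8"), tmp)

d <- read.table(tmp, sep = ",", fill = TRUE)
is.na(d[2, 3])  # TRUE: the missing third element was filled with NA
```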
I'm trying to read a text file into R using the below code:
d = read.table("test_data.txt")
It returned the following error message:
"Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 2 did not have 119 elements"
I tried this:
read.table("man_cohort9_check.txt", header=T, sep="\t")
but it gave this error:
"Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 43 did not have 116 elements"
I don't understand what's going wrong.
It's because your file has rows with different numbers of columns. To start investigating, you can run:
d = read.table("test_data.txt", fill = TRUE, header = TRUE, sep = "\t")
The usual causes of this are unmatched quotes and/or lurking octothorpes ("#"). I would investigate by seeing which of these settings produces the most regular field counts:
table(count.fields("test_data.txt", quote = "", comment.char = ""))
table(count.fields("test_data.txt", quote = ""))
table(count.fields("test_data.txt", comment.char = ""))
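A self-contained demonstration of the octothorpe case with a synthetic file: the default comment.char = "#" truncates a line at the hash, so the field counts only line up once comment handling is disabled.

```r
# Synthetic file: the "#" in the last line is data, not a comment.
tmp <- tempfile()
writeLines(c("a,b,c", "1,2,3", "4,#5,6"), tmp)

table(count.fields(tmp, sep = ","))                     # counts disagree
table(count.fields(tmp, sep = ",", comment.char = ""))  # all lines: 3 fields

d <- read.table(tmp, sep = ",", header = TRUE, comment.char = "")
dim(d)  # 2 rows, 3 columns, "#5" kept as data
```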