scan and readLines in R

In R, scan and readLines both read text files but return different classes of output. To get a vector for further processing, I use scan. However, one particular text file always produces an error, as shown below:
filt <- "E:/lexicon/wenku_baidu_com/stopwords_cn.txt"
specialfilter <- scan(file = filt, what=character(), nmax = -1, sep = "\n", blank.lines.skip = TRUE, skipNul = TRUE, fileEncoding = "UTF-8")
Read 1 item
Warning message:
In scan(file = filt, what = character(), nmax = -1, sep = "\n", :
invalid input found on input connection 'E:/lexicon/wenku_baidu_com/stopwords_cn.txt'
I have checked the environment several times: the path is correct and the file's encoding really is UTF-8, so there is no directory or encoding error. The salient feature of this file is that it has thousands of lines. If I use readLines instead, there are no errors at all:
specialfilter<-readLines(filt, encoding = "UTF-8", skipNul = FALSE)
My questions are:
1. Does scan have a limit on the number of lines it can read from a file? If the answer is "yes", how many lines can it read from one file?
2. If in this case we can only use readLines, how do I turn the result (specialfilter) into a vector?
PS: the file is uploaded to network storage; it's only 12 KB: https://yunpan.cn/OcMTMXyFXNQzYu Access code is 3c9d
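On the second question, a hedged note, since the file itself isn't reproduced here: readLines() already returns a character vector, so the result can be used directly; at most you may want to drop blank lines, which scan's blank.lines.skip was doing:
specialfilter <- readLines(filt, encoding = "UTF-8", skipNul = TRUE)
is.vector(specialfilter)                              # TRUE: already a character vector
specialfilter <- specialfilter[specialfilter != ""]   # drop blank lines, as blank.lines.skip did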

Related

read.csv warning 'EOF within quoted string' to read whole file

I have a .csv file that contains 285000 observations. When I try to import the dataset, I get the warning below and only 166000 observations are read:
Joint <- read.csv("joint.csv", header = TRUE, sep = ",")
Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
EOF within quoted string
When I added the quote argument, as follows:
Joint2 <- read.csv("joint.csv", header = TRUE, sep = ",", quote="", fill= TRUE)
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
more columns than column names
When I used read.table instead, it reads 483000 observations:
Joint <- read.table("joint.csv", header = TRUE, sep = ",", quote="", fill= TRUE)
What should I do to read the file properly?
I think the problem has to do with file encoding. There are a lot of special characters in the header.
If you know how your file is encoded you can specify using the fileEncoding argument to read.csv.
Otherwise you could try to use fread from data.table. It is able to read the file despite the encoding issues. It will also be significantly faster for reading such a large data file.
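A minimal sketch of both suggestions, using the file name from the question (the actual encoding is unknown, so "latin1" here is only an illustration):
library(data.table)
Joint <- fread("joint.csv")    # fread detects the separator itself and is fast on large files

# or, if the encoding is known, pass it to read.csv explicitly:
Joint <- read.csv("joint.csv", header = TRUE, sep = ",", fileEncoding = "latin1")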

Issues reading data as csv in R

I have a large data set (~20000 x 1). Not all the fields are filled; in other words, the data has missing values. Each feature is a string.
I have tried the following:
Input:
data <- read.csv("data.csv", header=TRUE, quote = "")
datan <- read.table("data.csv", header = TRUE, fill = TRUE)
Output of the second call:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 1 did not have 80 elements
Input:
datar <- read.csv("data.csv", header = TRUE, na.strings = NA)
Output:
Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
EOF within quoted string
As I see it, I run into essentially four problems. Two of them are the error messages stated above. The third: even when no error is emitted, the global environment window shows that not all my rows are accounted for (about 14000 samples are missing), although the number of features is right. The fourth: again, not all the samples are accounted for, and the number of features is not correct either.
How can I solve this?
Try the argument comment.char = "" as well as quote. The hash (#) is otherwise read by R as a comment character and will cut the line short.
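A sketch of that suggestion applied to the read.table call from the question (the argument values are assumptions, since the file itself isn't shown):
datan <- read.table("data.csv", header = TRUE, sep = ",",
                    quote = "", comment.char = "", fill = TRUE)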
Can you open the CSV in Notepad++? That will let you see 'invisible' characters and any other non-printable characters; the file may not contain what you think it contains! When you get the sourcing issue resolved, you can choose the CSV file with a selector tool.
filename <- file.choose()              # pick the file interactively
data <- read.csv(filename, skip = 1)   # skip the first line
name <- basename(filename)             # keep just the file name
Or, hard-code the path, and read the data into R.
# Read CSV into R
MyData <- read.csv(file="c:/your_path_here/Data.csv", header=TRUE, sep=",")

Changing file encoding in R

I was having difficulty importing an Excel sheet into R (as csv). After reading this post, I was able to import it successfully. However, I noticed that some of the numbers in a particular column have been transformed into unwanted characters: "Ï52,386.43" "Ï6,887.61" "Ï32,923.45". Any ideas how I can change these back to numbers?
Here's my code below:
df <- read.csv("data.csv", header = TRUE, strip.white = TRUE,
fileEncoding="latin1", stringsAsFactors=FALSE)
I've also tried fileEncoding = "UTF-8", but that doesn't work either; I get the following warnings:
Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote, :
invalid input found on input connection 'data.csv'
2: In read.table(file = file, header = header, sep = sep, quote = quote
I am using a Mac with "R version 3.2.4 (2016-03-10)" (if that makes any difference). Here are the first ten entries of the affected column:
[1] "Ï52,386.43" "Ï6,887.61" "Ï32,923.45" "" "Ï82,108.44"
[6] "Ï6,378.10" "" "Ï22,467.43" "Ï3,850.14" "Ï5,547.83"
It turns out the issue was a pound sign (£) that got changed into Ï in the process of saving an xls file into csv format (saved on Windows, opened on a Mac). Thanks for your replies.
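A hedged clean-up sketch for such a column: strip the stray Ï (the mangled pound sign) and the thousands separators, then coerce to numeric; empty strings become NA. The column name amount is hypothetical:
df$amount <- as.numeric(gsub("[Ï£,]", "", df$amount))  # "" coerces to NA with a warning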

Run into error "empty beginning of file" when read.csv in R

I have a csv file with only one column, whose first couple of cells are empty. When I try to read it into R, intending to read those blank lines as missing values, I run into the following error. Any help is appreciated!
Test = read.csv("test.csv", header = FALSE, blank.lines.skip = FALSE, nrows = 10)
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
empty beginning of file
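A hedged workaround sketch, assuming the file name from the question: read the raw lines with readLines (which has no problem with a blank start) and build the column yourself, turning empty strings into NA:
lines <- readLines("test.csv", n = 10)
Test <- data.frame(V1 = ifelse(lines == "", NA, lines))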

read.xls() and arguments in R give read.table error "no lines available in input"

I am having a hard time importing some data from .xls files in R.
library(gdata)
file.names <- list.files(path = ".", pattern = "\\.xls$")
file.names
for (file in seq(file.names)) {
  temp <- read.xls(file.names[file],
                   verbose = FALSE, skip = 16, nrows = 14, header = FALSE,
                   check.names = FALSE, sep = "\t", fill = TRUE, fileEncoding = "UTF-8")
  write.csv(temp, "file.csv")
}
The code above fails to do what I want, producing the error I quoted in the title of this question. Similar questions here on SO haven't helped at all.
Is there a conflict between the additional arguments? Could this be a perl script error, or something caused by bad encoding?
Omit the sep= and fileEncoding= arguments; with those removed I get a 14x48 data frame containing the sample data.
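A sketch of that fix applied to one file (all other arguments kept from the question):
library(gdata)
temp <- read.xls(file.names[1],
                 verbose = FALSE, skip = 16, nrows = 14, header = FALSE,
                 check.names = FALSE, fill = TRUE)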