Issues reading data as CSV in R

I have a large data set (~20000 x 1). Not all the fields are filled; in other words, the data has missing values. Each feature is a string.
I have tried the following:
Input:
data <- read.csv("data.csv", header=TRUE, quote = "")
datan <- read.table("data.csv", header = TRUE, fill = TRUE)
Output for the second code:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 1 did not have 80 elements
Input:
datar <- read.csv("data.csv", header = TRUE, na.strings = NA)
Output:
Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
EOF within quoted string
I run into essentially four problems, as far as I can see. Two of them are the error and warning messages shown above. The third is that, even when no message appears, the global environment window shows that not all my rows are accounted for: roughly 14000 samples are missing, although the number of features is right. The fourth is that, again, not all the samples are accounted for, and this time the number of features is wrong as well.
How can I solve this?

Try the argument comment.char = "" as well as quote = "". By default, R treats the hash (#) as a comment character and will cut the line short wherever one appears.
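A minimal sketch applying both arguments to the read.table call from the question (comment.char defaults to "#" in read.table, unlike read.csv):
# disable quote and comment handling so '#' and stray quotes don't truncate lines
datan <- read.table("data.csv", header = TRUE, sep = ",", fill = TRUE,
                    quote = "", comment.char = "")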

Can you open the CSV in Notepad++? This will let you see 'invisible' and other non-printable characters; the file may not contain what you think it contains! Once you have resolved the sourcing issue, you can choose the CSV file with a selector tool:
filename <- file.choose()              # pick the file interactively
data <- read.csv(filename, skip = 1)   # skip the first line of the file
name <- basename(filename)             # file name without the path
Or, hard-code the path, and read the data into R.
# Read CSV into R
MyData <- read.csv(file="c:/your_path_here/Data.csv", header=TRUE, sep=",")

Related

read.csv warning 'EOF within quoted string' to read whole file

I have a .csv file that contains 285000 observations. When I try to import the dataset, I get the following warning, and only 166000 observations are read:
Joint <- read.csv("joint.csv", header = TRUE, sep = ",")
Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
EOF within quoted string
When I added the quote argument, as follows:
Joint2 <- read.csv("joint.csv", header = TRUE, sep = ",", quote="", fill= TRUE)
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
more columns than column names
And when I used read.table like this, it read 483000 observations:
Joint <- read.table("joint.csv", header = TRUE, sep = ",", quote="", fill= TRUE)
What should I do to read the file properly?
I think the problem has to do with file encoding. There are a lot of special characters in the header.
If you know how your file is encoded you can specify using the fileEncoding argument to read.csv.
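For example, a sketch assuming the file turns out to be Latin-1 encoded (swap in the actual encoding if you know it):
Joint <- read.csv("joint.csv", header = TRUE, fileEncoding = "latin1")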
Otherwise you could try to use fread from data.table. It is able to read the file despite the encoding issues. It will also be significantly faster for reading such a large data file.
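A minimal sketch, assuming the data.table package is installed:
library(data.table)
Joint <- fread("joint.csv")     # fast reader, more tolerant of messy input
Joint <- as.data.frame(Joint)   # convert if you need a plain data frame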

Trouble reading content of .txt files in multiple subfolders into R

I have data with this folder structure:
Main_Text/
  Sub1_text/
  Sub2_text/
  etc. (I have several hundred subfolders)
Each subfolder contains multiple .txt files.
I want to read all of the files into R, to create a data frame that looks like this:
Filename | Text
Name of file | Content of .txt file
I've tried the following two approaches, and neither quite works. Any help would be appreciated.
1) Using the readtext package: although this package supposedly loops through subfolders, I cannot get it to do so. According to the readtext vignette, looping through the files should work like this:
dir <- "/Users/Main_Folder"
text = readtext(paste0(dir, "/Main_Text/*.txt"))
This only produces an error:
Error in listMatchingFiles(i, ignoreMissing = ignoreMissing, lastRound = T) : File '' does not exist.
It works, however, if I specify the subfolder, i.e.
text = readtext(paste0(dir, "/Main_Text/Sub1_text/*.txt"))
but given that I have several hundred subfolders, I need a solution that recurses through all of them.
2) I've also tried the following two-step solution, where I create a list of the files first and then attempt to read in the text; this also results in an error.
This generates an accurate list of all my files, but obviously doesn't include a step that reads the content:
setwd("/Users/Main_Folder")
dat = basename(list.files(pattern = ".txt$", recursive = TRUE, full.names=TRUE, include.dirs=TRUE))
So I also tried:
mypath="/Users/Main_Folder/"
txt_files_ls = list.files(path=mypath, recursive=T, pattern="*.txt")
This works. However, the reading step:
txt_files_df <- lapply(txt_files_ls, function(x) {read.table(file = x, header = F, fill=T, sep =",")})
Throws an error:
Error in read.table(file = x, header = F, fill = T, sep = ",") : no lines available in input In addition: There were 42 warnings (use warnings() to see them)
If I specify
header=T
I get a different error:
Error in read.table(file = x, header = T, fill = T, sep = ",") : more columns than column names In addition: Warning message: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
So I can't even get to the final step of combining them using something like
combined_df <- do.call("rbind", lapply(txt_files_df, as.data.frame))
I have a sense of why this is, given that the text files themselves don't have headers, and have random formatting (they're press releases). Here's a sample of one of my .txt files:
cat(readLines("Aderholt_text/Aderholt1-28-11.txt"), sep = "\n")
Friday January 28, 2011 Contact: Darrell "DJ" Jordan 202-225-4876 CONGRESSMAN ROBERT ADERHOLT STATEMENT ON THE VIOLENCE IN ALBANIA Washington, DC - Congressman Robert Aderholt (R-Alabama) today issued th
I'm sure I'm missing something small, but can anyone help illuminate how to correctly read in the filenames + text, either using one of the half-working solutions I've tried, or something else?
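Since the files have no headers and arbitrary formatting, one illustrative way to finish the two-step approach is to read each file as raw text with readLines instead of read.table. This is a sketch under those assumptions, not a tested answer:
# list every .txt file in every subfolder, keeping full paths for reading
txt_files <- list.files("/Users/Main_Folder", pattern = "\\.txt$",
                        recursive = TRUE, full.names = TRUE)
# build the Filename | Text data frame, one row per file
combined_df <- data.frame(
  Filename = basename(txt_files),
  Text = vapply(txt_files,
                function(f) paste(readLines(f, warn = FALSE), collapse = "\n"),
                character(1)),
  stringsAsFactors = FALSE
)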

Changing file encoding in R

I was having difficulty importing an Excel sheet (saved as CSV) into R. However, after reading this post, I was able to import it successfully. I then noticed that some of the numbers in a particular column had been transformed into unwanted character strings: "Ï52,386.43", "Ï6,887.61", "Ï32,923.45". Any ideas how I can change these back to numbers?
Here's my code below:
df <- read.csv("data.csv", header = TRUE, strip.white = TRUE,
fileEncoding="latin1", stringsAsFactors=FALSE)
I've also tried fileEncoding = "UTF-8", but this doesn't work; I get the following warning:
Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote, :
invalid input found on input connection 'data.csv'
2: In read.table(file = file, header = header, sep = sep, quote = quote
I am using a Mac with "R version 3.2.4 (2016-03-10)" (if that makes any difference). Here are the first ten entries from the affected column:
[1] "Ï52,386.43" "Ï6,887.61" "Ï32,923.45" "" "Ï82,108.44"
[6] "Ï6,378.10" "" "Ï22,467.43" "Ï3,850.14" "Ï5,547.83"
It turns out the issue was a pound sign (£) that got changed into Ï in the process of saving an .xls file into CSV format (saved on Windows, then opened on a Mac). Thanks for your replies.
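With the stray character identified, a hedged sketch of the cleanup (the column name amount is hypothetical; the gsub strips both the mangled pound sign and the thousands separators before converting):
# remove 'Ï' and the comma separators, then coerce to numeric
df$amount <- as.numeric(gsub("[Ï,]", "", df$amount))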

R Error in columns and type.convert(data[[i]], specifically on Mac

I am trying to make R read my CSV file (which contains numerical and categorical data). I am able to open this file on a Windows computer (I tried different ones and it always worked) without any issues, but it is not working on my Mac at all. I am using the latest version of R. Originally, the data was in Excel, and then I converted it to CSV.
I have exhausted all my options; I tried recommendations from similar topics but nothing works. One time I sort of succeeded, but the result looked like this: ;32,0;K;;B;50;;;; I tried the advice given in this topic (Import data into R with an unknown number of columns?) and the result was the same. I am a beginner in R and I really know nothing about coding or programming, so I would tremendously appreciate any kind of advice on this issue. Below are my failed attempts to fix this problem:
> file=read.csv("~/Desktop/file.csv", sep = ";")
Error in type.convert(data[[i]], as.is = as.is[i], dec = dec, na.strings = character(0L)) :
invalid multibyte string at '<ca>110'
> file=read.csv("~/Desktop/file.csv", sep = " ")
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
duplicate 'row.names' are not allowed
> ?read.csv
> file=read.csv2("~/Desktop/file.csv", sep = ";")
Error in type.convert(data[[i]], as.is = as.is[i], dec = dec, na.strings = character(0L)) :
invalid multibyte string at '<ca>110'
> file=read.csv2("~/Desktop/file.csv", sep = ";", header=TRUE)
Error in type.convert(data[[i]], as.is = as.is[i], dec = dec, na.strings = character(0L)) :
invalid multibyte string at '<ca>110'
> file=read.csv("~/Desktop/file.csv", sep=" ",row.names=1)
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
duplicate 'row.names' are not allowed
> file=read.csv("~/Desktop/file.csv", row.names=1)
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
more columns than column names
> file=read.csv("~/Desktop/file.csv", sep=";",row.names=1)
Error in type.convert(data[[i]], as.is = as.is[i], dec = dec, na.strings = character(0L)) :
invalid multibyte string at '<ca>110'
This is what the header of the data looks like. Using the advice below, I saved the document in the CSV format for Mac, and once I ran View(file), everything looked OK, except that some rows, like row 1 (Cord Number 1) below, were completely misplaced:
Cord.Number Ply Attch Knots Length Term Thkns Color Value
1,S,U,,37.0,K,,MB,,,"5.5 - 6.5:4, 8.0 - 8.5:2",,UR1031,unknown,
1s1 S U 1S(5.5/Z) 1E(11.5/S) 46.5 K NA W 11
1s2 S U 1S(5.5/Z) 5L(11.0/Z) 21.0 B NA W 15
This is what the spreadsheet looks like in R Studio on Windows (I don't have enough reputation to post an image):
http://imgur.com/zQdJBT2
As a workaround, what you can do is open the CSV file on a Windows machine and then save it to a .rdata file, R's internal storage format. You can then put the file on a USB stick (or Dropbox, Google Drive, or whatever), copy it to your Mac, and work on it there.
# on the Windows PC
dat <- read.csv("<file>", ...)
save(dat, file="<file location>/dat.rdata")
# copy the dat.rdata file over, and then on your Mac:
load("<Mac location>/dat.rdata")
fileEncoding="latin1" is a way to make R read the file, but in my case it came with loss of data and special characters. For example, the symbol € disappeared.
As the workaround that worked best for me for this issue (I'm on a Mac too), I first opened the file in Sublime Text and saved it "with encoding" UTF-8.
When I then imported it again, R read it with no problem, and my special characters were still present.
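After the re-save, a minimal sketch of the read, reusing the path and separator from the question:
file <- read.csv("~/Desktop/file.csv", sep = ";", fileEncoding = "UTF-8")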
I had a similar problem, but it worked once I included fileEncoding = "latin1" after the file name.

Read in multiple .txt files with header in R

Okay, I'm trying to use this method to get my data into R, but I keep getting the error:
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 1 did not have 22 elements
This is the script that I'm running:
library(foreign)
setwd("/Library/A_Intel/")
filelist <-list.files()
#assuming tab separated values with a header
datalist = lapply(filelist, function(xx)read.table(xx, header=T, sep=";"))
#assuming the same header/columns for all files
datafr = do.call("rbind", datalist)
Keep in mind, my priorities are:
To read from a .txt file
To associate the headers with the content
To read from multiple files.
Thanks!!!
It appears that one of the files you are trying to read does not have the same number of columns as its header. To read this file, you may have to alter its header or use a more appropriate column separator. To see which file is causing the problem, try something like:
datalist <- list()
for (filename in filelist) {
  cat(filename, '\n')   # print each name before reading; the failing file is the last one printed
  datalist[[filename]] <- read.table(filename, header = TRUE, sep = ';')
}
Another option is to get the contents of the file and the header separately:
datalist[[filename]] <- read.table(filename, header = FALSE, sep = ';')
thisHeader <- readLines(filename, n=1)
## ... separate columns of thisHeader ...
colnames(datalist[[filename]]) <- processedHeader
If you can't get read.table to work, you can always fall back on readLines and extract the file contents manually (using, for example, strsplit).
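A minimal sketch of that fallback, assuming the same ';' separator and the filename variable from the loop above:
lines <- readLines(filename)
fields <- strsplit(lines, ";", fixed = TRUE)   # one character vector per line
header <- fields[[1]]                          # first line holds the column names
rows <- fields[-1]                             # remaining lines hold the data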
To keep in the spirit of avoiding for loops, an initial sanity check before loading all of the data could be done with:
lapply(filelist, function(xx) {
  print(scan(xx, what = 'character', sep = ';', nlines = 1))
})
(assuming your header is separated with ';', which may not be the case)
