Batch load files to R while excluding bad rows using 'sqldf'

I have a chunk of R code that recursively loads, tidies, and exports all .txt files in a directory (files are tab-delimited but I used read.fwf to drop columns). The code works for .txt files with complete data after 9 lines of unnecessary headers. However, when I expanded the code to the directory with the full set of .txt files (>500), I found that some of the files have bad rows embedded within the data (essentially, automated repeats of a few header lines, sample available here). I have tried just loading all rows, both good and bad, with the intent of removing the bad rows from within R, but get error messages about column numbers.
Original error: Batch load using read.fwf (Note: I only need the first three columns from each .txt file)
setwd("C:/Users/Seth/Documents/testdata")
library(stringr)
filesToProcess <- dir(pattern="*.txt", full.names=T)
listoffiles <- lapply(filesToProcess, function(x)
  read.fwf(x, skip = 9, widths = c(10, 20, 21),
           col.names = c("Point", NA, "Location", NA, "Time"),
           stringsAsFactors = FALSE))
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 344 did not have 5 elements #error from bad rows
Next I tried pre-processing the data to exclude the bad rows using 'sqldf'.
Fix attempt 1: Batch pre-process using 'sqldf'
library(sqldf)
listoffiles <- lapply(filesToProcess, function(x) read.csv.sql(x, sep = "\t",
  skip = 9, field.types = c("Point", "Location", "Time", "V4", "V5", "V6", "V7", "V8", "V9"),
  header = FALSE, sql = "select * from file where Point = 'Trackpoint' "))
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 1 did not have 9 elements
Fix attempt 2: Single file pre-process using 'sqldf'
test.v1 <- read.csv.sql("C:/Users/Seth/Documents/testdata/test/2008NOV28_MORNING_Hunknown.txt",
  sep = "\t", skip = 9, field.types = c("Point", "Location", "Time", "V4", "V5", "V6", "V7", "V8", "V9"),
  header = FALSE, sql = "select * from file where Point = 'Trackpoint' ")
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 1 did not have 9 elements
I'd prefer to do this cleanly with something like 'sqldf' or 'dplyr', but am open to pulling in all rows and then post-processing within R. My questions: How do I exclude the bad rows of data during import? Or, how do I get the full data set imported and then remove the bad rows within R?

Here are a few ways. They all make use of the fact that the good lines all contain the degree symbol (octal 260) and junk lines do not. In all of these we have assumed that columns 1 and 3 are to be dropped.
1) This code assumes you have grep, but you may need to quote the first argument of grep depending on your shell. (On Windows, to get grep you would need to install Rtools; under a normal Rtools install grep is found at C:\Rtools\bin\grep.exe. The Rtools bin directory would have to be placed on your Windows path, or else the full pathname would need to be used when referencing the Rtools grep.) These comments apply only to (1) and (4), as (2) and (3) do not use the system's grep.
File <- "2008NOV28_MORNING_trunc.txt"
library(sqldf)
DF <- read.csv.sql(File, header = FALSE, sep = "\t", eol = "\n",
sql = "select V2, V4, V5, V6, V7, V8, V9 from file",
filter = "grep [\260] ")
2) You may not need sqldf for this:
DF <- read.table(text = grep("\260", readLines(File), value = TRUE),
sep = "\t", as.is = TRUE)[-c(1, 3)]
3) Alternatively, try the following, which is more efficient than (2) but involves specifying the colClasses vector:
colClasses <- c("NULL", NA, "NULL", NA, NA, NA, NA, NA, NA)
DF <- read.table(text = grep("\260", readLines(File), value = TRUE),
sep = "\t", as.is = TRUE, colClasses = colClasses)
4) We can also use the system's grep with read.table, reusing the colClasses vector from (3). The comments in (1) about grep apply here too:
DF <- read.table(pipe(paste("grep [\260]", File)),
sep = "\t", as.is = TRUE, colClasses = colClasses)

Related

My CSV data, when read into R via read.table, stops creating new rows and separating the values by "," after a certain number of lines

When reading my CSV data into R, after reading so many values as normal, the data stops being separated by ",", leaving lots of data missing.
Here is how I load my data into R.
CODATA <- read.table( file.choose("CO2 Emissions per country.cvs"), header = TRUE, sep = "," )
I'm given this error and these warnings.
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 1 did not have 3 elements
Warning messages:
1: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
EOF within quoted string
2: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
number of items read is not a multiple of the number of columns
Then the value is...
"Cote dIvoire,0.42,0.44,0.44,0.43,0.45,0.49,0.46,0.51,0.45,0.4,0.33,0.32,0.38,0.33,0.29,0.28,0.26,0.25,0.23,0.21,0.21,0.2,0.21,0.21,0.22,0.25,0.3,0.3,0.4,0.37,0.36,0.36,0.29,0.31,0.32,0.32,0.3,0.34,0.31,0.29\nNigeria,0.1,0.12,0.15,0.16,0.18,0.22,0.27,0.3,0.32,0.35,0.39,0.42,0.41,0.37,0.38,0.35,0.35,0.36,0.36,0.3,0.34,0.4,0.36,0.29,0.28,0.3,0.34,0.29,0.31,0.34,0.38,0.39,0.36,0.36,0.4,0.35,0.32,0.33,0.27,0.29\nKenya,0.28,0.28.....(and so on)
where the values haven't been separated. The data is meant to start a new line for each country; it reads the previous 100 or so countries normally, up to Cote dIvoire.
Is there any way to fix without editing the csv file and changing the code to load it in?
Thank you for any help given.
You're best off checking over the CSV file again for any problems. You could also try CODATA <- read.csv("CO2 Emissions per country.csv") rather than read.table.
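The "EOF within quoted string" warning usually means there is an unbalanced quote character (a stray apostrophe or double quote) somewhere in the file, which makes R treat everything up to the next quote as one value. A minimal sketch, assuming the file contains no genuinely quoted fields, is to switch quoting off and check the row count (the file name is the one from the question):
CODATA <- read.csv("CO2 Emissions per country.csv", header = TRUE, quote = "")
nrow(CODATA)  # should now match the number of countries in the file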

read.csv warning 'EOF within quoted string' prevents reading the whole file

I have a .csv file that contains 285,000 observations. When I try to import the dataset, I get the warning below and only 166,000 observations are read.
Joint <- read.csv("joint.csv", header = TRUE, sep = ",")
Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
EOF within quoted string
When I add the quote argument, as follows:
Joint2 <- read.csv("joint.csv", header = TRUE, sep = ",", quote="", fill= TRUE)
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
more columns than column names
When I use read.table like that, it shows 483,000 observations:
Joint <- read.table("joint.csv", header = TRUE, sep = ",", quote="", fill= TRUE)
What should I do to read the file properly?
I think the problem has to do with file encoding. There are a lot of special characters in the header.
If you know how your file is encoded you can specify using the fileEncoding argument to read.csv.
Otherwise you could try to use fread from data.table. It is able to read the file despite the encoding issues. It will also be significantly faster for reading such a large data file.
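A minimal sketch of both suggestions; the encoding value is an assumption (check the file's actual encoding) and the file name comes from the question:
# Base R with an explicit encoding (assuming the file is UTF-8):
Joint <- read.csv("joint.csv", header = TRUE, fileEncoding = "UTF-8")
# Or let data.table::fread handle the header and separator; it is also
# much faster on a file of this size:
library(data.table)
Joint2 <- fread("joint.csv")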

Issues reading data as csv in R

I have a large data set (~20000 x 1). Not all the fields are filled; in other words, the data does have missing values. Each feature is a string.
I have tried the following:
Input:
data <- read.csv("data.csv", header=TRUE, quote = "")
datan <- read.table("data.csv", header = TRUE, fill = TRUE)
Output for the second code:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 1 did not have 80 elements
Input:
datar <- read.csv("data.csv", header = TRUE, na.strings = NA)
Output:
Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
EOF within quoted string
I run into essentially four problems. Two of them are the error messages shown above. The third is that, even when no error message appears, the global environment window shows that not all of my rows are accounted for (roughly 14,000 samples are missing), although the number of features is right. The fourth is that, again, not all of the samples are read in and the number of features is also wrong.
How can I solve this??
Try the argument comment.char = "" as well as quote = "". A hash (#) in the data is read by R as a comment character and will cut the line short.
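For example, a sketch only; the file name and the other arguments are carried over from the attempts above, with only comment.char and quote changed:
# comment.char = "" stops a '#' in the data from truncating a line
# (read.table treats '#' as a comment by default), and quote = ""
# stops a stray quote from swallowing later rows.
datan <- read.table("data.csv", header = TRUE, sep = ",",
                    fill = TRUE, quote = "", comment.char = "")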
Can you open the CSV using Notepad++? This will allow you to see 'invisible' characters and any other non-printable characters. That file may not contain what you think it contains! When you get the sourcing issue resolved, you can choose the CSV file with a selector tool.
filename <- file.choose()
data <- read.csv(filename, skip=1)
name <- basename(filename)
Or, hard-code the path, and read the data into R.
# Read CSV into R
MyData <- read.csv(file="c:/your_path_here/Data.csv", header=TRUE, sep=",")

R - reading several data tables from one text file

I am new to R and trying to learn how to read the text below. I am using
data <- read.table("myvertices.txt", stringsAsFactors=TRUE, sep=",")
hoping to convey that each "FID..." should be associated with the comma-separated numbers below it.
The error I get is:
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 13 did not have 2 elements
How would I read the following format
FID001:
-120.9633,51.8496
-121.42749,52.293
-121.25453,52.3195
FID002:
-65.4794,47.69011
-65.4797,47.0401
FID003:
-65.849,47.5215
-65.467,47.515
into something like
FID001 -120.9633 51.8496
FID001 -121.42749 52.293
FID001 -121.25453 52.3195
FID002 -65.4794 47.69011
FID002 -65.4797 47.0401
FID003 -65.849 47.5215
FID003 -65.467 47.515
Here is a possible way to achieve this:
data <- read.table("myvertices.txt") # Read as-is.
fid1 <- c(grep("^FID", data$V1), nrow(data) +1) # Get the row numbers containing "FID.."
df1 <- diff(x = fid1, lag = 1) # Calculate the length+1 rows to read
listdata <- lapply(seq_along(df1),
function(n) cbind(FID = data$V1[fid1[n]],
read.table("myvertices.txt",
skip = fid1[n],
nrows = df1[n] -1,
sep = ",")))
data2 <- do.call(rbind, listdata) # Combine all the read tables into a single data frame.

Read in multiple .txt files with header in R

Okay, I'm trying to use this method to get my data into R, but I keep on getting the error:
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 1 did not have 22 elements
This is the script that I'm running:
library(foreign)
setwd("/Library/A_Intel/")
filelist <-list.files()
#assuming tab separated values with a header
datalist = lapply(filelist, function(xx)read.table(xx, header=T, sep=";"))
#assuming the same header/columns for all files
datafr = do.call("rbind", datalist)
Keep in mind, my priorities are:
To read from a .txt file
To associate the headers with the content
To read from multiple files.
Thanks!!!
It appears that one of the files you are trying to read does not have the same number of columns as its header. To read this file, you may have to alter its header or use a more appropriate column separator. To see which file is causing the problem, try something like:
datalist <- list()
for (filename in filelist) {
  cat(filename, '\n')
  datalist[[filename]] <- read.table(filename, header = TRUE, sep = ';')
}
Another option is to get the contents of the file and the header separately:
datalist[[filename]] <- read.table(filename, header = FALSE, sep = ';')
thisHeader <- readLines(filename, n=1)
## ... separate columns of thisHeader ...
colnames(datalist[[filename]]) <- processedHeader
If you can't get read.table to work, you can always fall back on readLines and extract the file contents manually (using, for example, strsplit).
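A minimal sketch of that fallback, assuming ';'-separated files with a single header line, as in the code above:
rawLines <- readLines(filename)
header   <- strsplit(rawLines[1], ";", fixed = TRUE)[[1]]  # first line holds the column names
fields   <- strsplit(rawLines[-1], ";", fixed = TRUE)      # split the remaining lines
dat      <- as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)
names(dat) <- header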
To keep in the spirit of avoiding for loops, an initial sanity check before loading all of the data could be done with:
lapply(filelist, function(xx) {
  print(scan(xx, what = 'character', sep = ";", nlines = 1))
})
(assuming your header is separated with ';' which may not be the case)
