Okay, I'm trying to use this method to get my data into R, but I keep on getting the error:
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 1 did not have 22 elements
This is the script that I'm running:
library(foreign)
setwd("/Library/A_Intel/")
filelist <-list.files()
#assuming tab separated values with a header
datalist = lapply(filelist, function(xx)read.table(xx, header=T, sep=";"))
#assuming the same header/columns for all files
datafr = do.call("rbind", datalist)
Keep in mind, my priorities are:
To read from a .txt file
To associate the headers with the content
To read from multiple files.
Thanks!!!
It appears that one of the files you are trying to read does not have the same number of columns as the header. To read this file, you may have to fix its header or use a more appropriate column separator. To see which file is causing the problem, try something like:
datalist <- list()
for (filename in filelist) {
  cat(filename, '\n')
  datalist[[filename]] <- read.table(filename, header = TRUE, sep = ';')
}
Another option is to get the contents of the file and the header separately:
datalist[[filename]] <- read.table(filename, header = FALSE, sep = ';')
thisHeader <- readLines(filename, n=1)
## ... separate columns of thisHeader ...
colnames(datalist[[filename]]) <- processedHeader
If you can't get read.table to work, you can always fall back on readLines and extract the file contents manually (using, for example, strsplit).
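For example, a minimal sketch of that manual fallback (assuming ';'-separated fields, a single header line, and the filename variable from the loop above) could look like:
rawLines <- readLines(filename)
# split every data line on ';' (assumes each line splits into the same number of fields)
fields <- strsplit(rawLines[-1], ";", fixed = TRUE)
dat <- as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)
# use the first line as the column names
colnames(dat) <- strsplit(rawLines[1], ";", fixed = TRUE)[[1]]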
To keep in the spirit of avoiding for loops, an initial sanity check before loading all of the data could be done with:
lapply(filelist, function(xx) {
  print(scan(xx, what = 'character', sep = ";", nlines = 1))
})
(assuming your header is separated with ';' which may not be the case)
Related
I have several csv files (say A.csv, B.csv, ...) in several folders (say F1, F2, ...). I want to read all the files in such a way that the A.csv, B.csv, etc. of each folder are cbind-ed together into one main dataframe per folder. That means I need n dataframes for my n folders, each with a unique name based on the folder name.
I have tried this code to get the list of csv files.
files <- dir("/Users/.../.../...", recursive=TRUE, full.names=TRUE, pattern="\\.csv$")
then created a function:
readFun <- function(x) { df <- read.csv(x)}
then sapply:
sapply(files, readFun)
it returns this error:
Error in read.table(file = file, header = header, sep = sep, quote = quote, : no lines available in input
I played around with the code a lot, but did not figure out how to debug it.
Any help is highly appreciated.
Also, any hint on how to create a main dataframe for each folder?
Thanks
You could first get all the directory names, then create a function that reads all the files from one directory, combines them, and writes the combined file back into that directory.
all_directories <- dir('top/directory', full.names = TRUE)
bind_files_into_one <- function(file_path) {
  all_files <- list.files(file_path, full.names = TRUE, pattern = "\\.csv$")
  temp <- do.call(cbind, lapply(all_files, read.csv))
  write.csv(temp, paste0(file_path, '/combined.csv'), row.names = FALSE)
}
We can use lapply to apply this function to every directory.
lapply(all_directories, bind_files_into_one)
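If you would rather keep the combined data in R as one dataframe per folder (as asked) instead of writing combined.csv files, a small variation of the same idea works; read_folder below is just a hypothetical name for the same function without the write.csv step:
read_folder <- function(file_path) {
  all_files <- list.files(file_path, full.names = TRUE, pattern = "\\.csv$")
  do.call(cbind, lapply(all_files, read.csv))
}
# a named list: one combined dataframe per folder, named after the folder
folder_dfs <- lapply(all_directories, read_folder)
names(folder_dfs) <- basename(all_directories)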
I have a large data set (~20000 x 1). Not all the fields are filled; in other words, the data does have missing values. Each feature is a string.
I have run the following code:
Input:
data <- read.csv("data.csv", header=TRUE, quote = "")
datan <- read.table("data.csv", header = TRUE, fill = TRUE)
Output for the second code:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 1 did not have 80 elements
Input:
datar <- read.csv("data.csv", header = TRUE, na.strings = NA)
Output:
Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
EOF within quoted string
I run into essentially four problems, as I see it. Two of them are the error messages stated above. The third is that, even when no error message is produced, the global environment window shows that not all my rows are accounted for: roughly 14000 samples are missing, although the number of features is right. The fourth is that, again, not all the samples are accounted for and the number of features is not correct either.
How can I solve this?
Try the argument comment.char = "" as well as quote = "". The hash (#) is otherwise read by R as a comment character and will cut the line short.
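For instance (a sketch, assuming the file from the question is comma-separated):
datan <- read.table("data.csv", header = TRUE, sep = ",", fill = TRUE,
                    quote = "", comment.char = "")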
Can you open the CSV using Notepad++? This will allow you to see 'invisible' characters and any other non-printable characters. That file may not contain what you think it contains! When you get the sourcing issue resolved, you can choose the CSV file with a selector tool.
filename <- file.choose()
data <- read.csv(filename, skip=1)
name <- basename(filename)
Or, hard-code the path, and read the data into R.
# Read CSV into R
MyData <- read.csv(file="c:/your_path_here/Data.csv", header=TRUE, sep=",")
I have a question of which I cannot find the answer online and on SO. I was wondering if it's possible to import multiple .csv files from multiple directories all at once, but ignore files that only have headers (first line) and no data?
Something like this works, but only if all files have data:
all_csv = dir(dir, recursive=T, full.names=T, pattern="\\.csv$")
myfiles = lapply(all_csv, read.csv, sep = sep, dec = dec, stringsAsFactors = F, header = F, skip = 1)
for (i in 1:length(myfiles)) {
  colnames(myfiles[[i]])[1:2] <- c("A", "B")
}
combi <- do.call("rbind",myfiles)
Update: with .csv files that have 0 data rows (only headers) included in the target directory, an error is returned at the second command line:
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
no lines available in input
thank you.
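One possible approach (a sketch, keeping the arguments from the snippet above and assuming sep and dec are already defined) is to drop files that contain only the header line before reading them:
# keep only files that have at least one data row after the header
has_data <- sapply(all_csv, function(f) length(readLines(f, n = 2)) > 1)
myfiles <- lapply(all_csv[has_data], read.csv, sep = sep, dec = dec,
                  stringsAsFactors = FALSE, header = FALSE, skip = 1)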
I have a chunk of R code that recursively loads, tidies, and exports all .txt files in a directory (files are tab-delimited but I used read.fwf to drop columns). The code works for .txt files with complete data after 9 lines of unnecessary headers. However, when I expanded the code to the directory with the full set of .txt files (>500), I found that some of the files have bad rows embedded within the data (essentially, automated repeats of a few header lines, sample available here). I have tried just loading all rows, both good and bad, with the intent of removing the bad rows from within R, but get error messages about column numbers.
Original error: Batch load using read.fwf (Note: I only need the first three columns from each .txt file)
setwd("C:/Users/Seth/Documents/testdata")
library(stringr)
filesToProcess <- dir(pattern="*.txt", full.names=T)
listoffiles <- lapply(filesToProcess, function(x) read.fwf (x,skip=9, widths=c(10,20,21), col.names=c("Point",NA,"Location",NA,"Time"), stringsAsFactors=FALSE))
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 344 did not have 5 elements #error from bad rows
Next I tried pre-processing the data to exclude the bad rows using 'sqldf'.
Fix attempt 1: Batch pre-process using 'sqldf'
library(sqldf)
listoffiles <- lapply(filesToProcess, function(x)
  read.csv.sql(x, sep = "\t", skip = 9,
               field.types = c("Point", "Location", "Time", "V4", "V5", "V6", "V7", "V8", "V9"),
               header = FALSE, sql = "select * from file where Point = 'Trackpoint' "))
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 1 did not have 9 elements
Fix attempt 2: Single file pre-process using 'sqldf'
test.v1 <- read.csv.sql("C:/Users/Seth/Documents/testdata/test/2008NOV28_MORNING_Hunknown.txt",
                        sep = "\t", skip = 9,
                        field.types = c("Point", "Location", "Time", "V4", "V5", "V6", "V7", "V8", "V9"),
                        header = FALSE, sql = "select * from file where Point = 'Trackpoint' ")
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 1 did not have 9 elements
I'd prefer to do this cleanly and use something like 'sqldf' or 'dplyr', but am open to pulling in all rows and then post-processing within R. My questions: How do I exclude the bad rows of data during import? Or, how do I get the full data set imported and then remove the bad rows within R?
Here are a few ways. They all make use of the fact that the good lines all contain the degree symbol (octal 260) and junk lines do not. In all of these we have assumed that columns 1 and 3 are to be dropped.
1) This code assumes you have grep, but you may need to quote the first argument of grep depending on your shell. (On Windows, to get grep you would need to install Rtools; under a normal Rtools install, grep is found at C:\Rtools\bin\grep.exe. The Rtools bin directory would have to be placed on your Windows path, or else the entire pathname would need to be used when referencing the Rtools grep.) These comments apply only to (1) and (4), as (2) and (3) do not use the system's grep.
File <- "2008NOV28_MORNING_trunc.txt"
library(sqldf)
DF <- read.csv.sql(File, header = FALSE, sep = "\t", eol = "\n",
sql = "select V2, V4, V5, V6, V7, V8, V9 from file",
filter = "grep [\260] ")
2) You may not need sqldf for this:
DF <- read.table(text = grep("\260", readLines(File), value = TRUE),
sep = "\t", as.is = TRUE)[-c(1, 3)]
3) Alternately try the following which is more efficient than (2) but involves specifying the colClasses vector:
colClasses <- c("NULL", NA, "NULL", NA, NA, NA, NA, NA, NA)
DF <- read.table(text = grep("\260", readLines(File), value = TRUE),
sep = "\t", as.is = TRUE, colClasses = colClasses)
4) We can also use the system's grep with read.table. The comments in (1) about grep apply here too:
DF <- read.table(pipe(paste("grep [\260]", File)),
sep = "\t", as.is = TRUE, colClasses = colClasses)
I have almost 3,000 CSV files (containing tweets) with the same format. I want to merge these files into one new file and remove the duplicate tweets. I have come across various topics discussing similar questions, but the number of files involved is usually quite small. I hope you can help me write code in R that does this job both efficiently and effectively.
The CSV files have the following format:
(Image of the CSV format omitted.) I changed (in columns 2 and 3) the usernames (on Twitter) to A-E and the 'actual names' to A1-E1.
Raw text file:
"tweet";"author";"local.time"
"1";"2012-06-05 00:01:45 #A (A1): Cruijff z'n met-zwart-shirt-zijn-ze-onzichtbaar logica is even mooi ontkracht in #bureausport.";"A (A1)";"2012-06-05 00:01:45"
"2";"2012-06-05 00:01:41 #B (B1): Welterusten #BureauSport";"B (B1)";"2012-06-05 00:01:41"
"3";"2012-06-05 00:01:38 #C (C1): Echt ..... eindelijk een origineel sportprogramma #bureausport";"C (C1)";"2012-06-05 00:01:38"
"4";"2012-06-05 00:01:38 #D (D1): LOL. \"Na onderzoek op de Fontys Hogeschool durven wij te stellen dat..\" Want Fontys staat zo hoog aangeschreven? #bureausport";"D (D1)";"2012-06-05 00:01:38"
"5";"2012-06-05 00:00:27 #E (E1): Ik kijk Bureau sport op Nederland 3. #bureausport #kijkes";"E (E1)";"2012-06-05 00:00:27"
Somehow my headers are messed up; they obviously should move one column to the right. Each CSV file contains up to 1500 tweets. I would like to remove the duplicates by checking the 2nd column (containing the tweets), simply because these should be unique while the author columns can repeat (e.g. one author posting multiple tweets).
Is it possible to combine merging the files and removing the duplicates, or is this asking for trouble and should the processes be separated? As a starting point I included two links to blogs from Hayward Godwin that discuss three approaches for merging CSV files.
http://psychwire.wordpress.com/2011/06/03/merge-all-files-in-a-directory-using-r-into-a-single-dataframe/
http://psychwire.wordpress.com/2011/06/05/testing-different-methods-for-merging-a-set-of-files-into-a-dataframe/
Obviously there are some topics related to my question on this site as well (e.g. Merging multiple csv files in R) but I haven't found anything that discusses both merging and removing the duplicates. I really hope you can help me and my limited R knowledge deal with this challenge!
Although I have tried some code I found on the web, it didn't actually result in an output file. The approximately 3,000 CSV files have the format discussed above. I mainly tried the following code (for the merge part):
filenames <- list.files(path = "~/")
do.call("rbind", lapply(filenames, read.csv, header = TRUE))
This results in the following error:
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
cannot open file '..': No such file or directory
Update
I have tried the following code:
# grab our list of filenames
filenames <- list.files(path = ".", pattern='^.*\\.csv$')
# write a special little read.csv function to do exactly what we want
my.read.csv <- function(fnam) { read.csv(fnam, header=FALSE, skip=1, sep=';', col.names=c('ID','tweet','author','local.time'), colClasses=rep('character', 4)) }
# read in all those files into one giant data.frame
my.df <- do.call("rbind", lapply(filenames, my.read.csv))
# remove the duplicate tweets
my.new.df <- my.df[!duplicated(my.df$tweet),]
But I run into the following errors:
After the 3rd line I get:
Error in read.table(file = file, header = header, sep = sep, quote = quote, : more columns than column names
After the 4th line I get:
Error: object 'my.df' not found
I suspect that these errors are caused by failures in the writing process of the csv files, since in some cases the author/local.time end up in the wrong column, either to the left or the right of where they are supposed to be, which results in an extra column. I manually adapted 5 files and tested the code on them; I didn't get any errors, but it seemed like nothing happened at all. I didn't get any output from R.
To solve the extra column problem I adjusted the code slightly:
#grab our list of filenames
filenames <- list.files(path = ".", pattern='^.*\\.csv$')
# write a special little read.csv function to do exactly what we want
my.read.csv <- function(fnam) { read.csv(fnam, header=FALSE, skip=1, sep=';', col.names=c('ID','tweet','author','local.time','extra'), colClasses=rep('character', 5)) }
# read in all those files into one giant data.frame
my.df <- do.call("rbind", lapply(filenames, my.read.csv))
# remove the duplicate tweets
my.new.df <- my.df[!duplicated(my.df$tweet),]
I tried this code on all the files; although R clearly started processing, I eventually got the following errors:
Error in read.table(file = file, header = header, sep = sep, quote = quote, : more columns than column names
In addition: Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote, : incomplete final line found by readTableHeader on 'Twitts - di mei 29 19_22_30 2012 .csv'
2: In read.table(file = file, header = header, sep = sep, quote = quote, : incomplete final line found by readTableHeader on 'Twitts - di mei 29 19_24_31 2012 .csv'
Error: object 'my.df' not found
What did I do wrong?
First, simplify matters by working in the folder where the files are, and set the pattern to read only files ending in '.csv', so something like:
filenames <- list.files(path = ".", pattern='^.*\\.csv$')
my.df <- do.call("rbind", lapply(filenames, read.csv, header = TRUE))
This should get you a data.frame with the contents of all the tweets.
A separate issue is the headers in the csv files. Thankfully you know that all files are identical, so I'd handle those something like this:
read.csv('fred.csv', header=FALSE, skip=1, sep=';',
col.names=c('ID','tweet','author','local.time'),
colClasses=rep('character', 4))
NB: changed so all columns are character, and ';'-separated.
I'd parse out the time later if it was needed...
A further separate issue is the uniqueness of the tweets within the data.frame - but I'm not clear if you want them to be unique to a user or globally unique. For globally unique tweets, something like
my.new.df <- my.df[!duplicated(my.df$tweet),]
For unique by author, I'd append the two fields - hard to know what works without the real data though!
my.new.df <- my.df[!duplicated(paste(my.df$tweet, my.df$author)),]
So bringing it all together and assuming a few things along the way...
# grab our list of filenames
filenames <- list.files(path = ".", pattern='^.*\\.csv$')
# write a special little read.csv function to do exactly what we want
my.read.csv <- function(fnam) { read.csv(fnam, header=FALSE, skip=1, sep=';',
col.names=c('ID','tweet','author','local.time'),
colClasses=rep('character', 4)) }
# read in all those files into one giant data.frame
my.df <- do.call("rbind", lapply(filenames, my.read.csv))
# remove the duplicate tweets
my.new.df <- my.df[!duplicated(my.df$tweet),]
Based on the revised warnings after line 3, it's a problem with files having different numbers of columns. This is not easy to fix in general, except, as you have suggested, by allowing too many columns in the specification. If you remove the specification, then you will run into problems when you try to rbind() the data.frames together...
Here is some code using a for() loop and some debugging cat() statements to make more explicit which files are broken so that you can fix things:
filenames <- list.files(path = ".", pattern='^.*\\.csv$')
n.files.processed <- 0 # how many files did we process?
for (fnam in filenames) {
  cat('about to read from file:', fnam, '\n')
  if (exists('tmp.df')) rm(tmp.df)
  tmp.df <- read.csv(fnam, header=FALSE, skip=1, sep=';',
                     col.names=c('ID','tweet','author','local.time','extra'),
                     colClasses=rep('character', 5))
  if (exists('tmp.df') & (nrow(tmp.df) > 0)) {
    cat('  successfully read:', nrow(tmp.df), ' rows from ', fnam, '\n')
    # count this file as processed so the tally below is meaningful
    n.files.processed <- n.files.processed + 1
    # now lets append a column containing the originating file name
    # so that debugging the file contents is easier
    tmp.df$fnam <- fnam
    # now lets rbind everything together
    if (exists('my.df')) {
      my.df <- rbind(my.df, tmp.df)
    } else {
      my.df <- tmp.df
    }
  } else {
    cat('  read NO rows from ', fnam, '\n')
  }
}
cat('processed ', n.files.processed, ' files\n')
my.new.df <- my.df[!duplicated(my.df$tweet),]
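If you also want to see which files have a shifted or extra column before fixing them by hand, a quick check (a sketch, assuming ';'-separated files) is to look at the range of field counts per file:
# files whose field counts vary, or differ from the rest, are the suspect ones
field_counts <- sapply(filenames, function(f) range(count.fields(f, sep = ";")))
field_counts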