header=FALSE not working when importing multiple CSVs at once

I am trying to import multiple CSVs from a folder at once, but the CSVs do not have column names. The following code works, but the first row is converted into column names:
dat <- list.files(pattern="*.csv") %>% lapply(read.csv)
When I try to use the code below:
dat <- list.files(pattern="*.csv") %>% lapply(read.csv(header = FALSE))
I get the following error message:
Error in read.table(file = file, header = header, sep = sep, quote = quote, : argument "file" is missing, with no default
Any idea how I can avoid this?

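The issue is the way the additional argument is passed to FUN. Writing read.csv(header = FALSE) calls read.csv immediately, with no file, which is exactly what the error message says; lapply needs the function itself, with any extra arguments supplied separately.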
? lapply:
lapply(X, FUN, ...)
... optional arguments to FUN.
You need to make a tiny change to your code to get it to work:
dat <- list.files(pattern="*.csv") %>% lapply(read.csv, header=FALSE)
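To make the difference concrete, here is a minimal sketch using two made-up header-less files in a temporary folder (the file names and contents are assumptions for illustration only). It uses base lapply without the pipe so it runs without extra packages.

```r
# Create two small header-less CSV files in a temporary folder (made-up data)
tmp_dir <- tempfile("csvs")
dir.create(tmp_dir)
writeLines(c("1,2,3", "4,5,6"), file.path(tmp_dir, "a.csv"))
writeLines(c("7,8,9", "10,11,12"), file.path(tmp_dir, "b.csv"))

# pattern is a regular expression, so "\\.csv$" matches names ending in .csv
files <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)

# header = FALSE travels through lapply's ... to every read.csv call
dat <- lapply(files, read.csv, header = FALSE)

# Equivalent form: an anonymous function makes the per-file call explicit
dat2 <- lapply(files, function(f) read.csv(f, header = FALSE))

nrow(dat[[1]])   # both rows are kept as data
names(dat[[1]])  # default column names V1, V2, V3
```

The anonymous-function form is handy when the extra arguments depend on the file being read.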

If you're in the tidyverse you might want
list.files(pattern=".csv") %>%
purrr::map(readr::read_csv, col_names=FALSE)
(watch out for differences in default behaviour between read.csv and readr::read_csv)

Related

EOF within quoted string warning when merging csv files

I have more than 70 CSV files and I am trying to merge them row-wise (they all have same columns). I tried to combine them using this code:
library(tidyverse)
library(plyr)
library(readr)
setwd("*\\data")
myfolder="test"
allfiles= list.files(path=myfolder, pattern="*.csv", full.names = T)
allfiles
combined_csv= ldply(allfiles, read.csv)
Once I run this code I get a warning message:
Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
EOF within quoted string
It looks like I am losing some rows. How can I fix this?
It is possible that the same column is read as a different type in different files, for example when it contains a character value in some files and only numbers in others. One way around this is to read every column as "character", rbind the pieces, and then use type.convert to set each column's class automatically based on the values it holds:
library(data.table)
out <- rbindlist(lapply(list.files(path=myfolder, full.names = TRUE),
fread, colClasses = "character"))
out <- type.convert(out, as.is = TRUE)
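The same idea can be sketched in base R without data.table. The two files below are made up for illustration, and the "n/a" entry is an assumed example of a stray text value in an otherwise numeric column.

```r
# Two made-up CSV files with the same columns; one has a text value in "value"
tmp_dir <- tempfile("merge")
dir.create(tmp_dir)
writeLines(c("id,value", "1,10", "2,20"), file.path(tmp_dir, "a.csv"))
writeLines(c("id,value", "3,n/a", "4,40"), file.path(tmp_dir, "b.csv"))

files <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)

# Read every column as character so all pieces have identical column types
parts <- lapply(files, read.csv, colClasses = "character")

# Stack the pieces row-wise, then let R choose a class for each column
combined <- do.call(rbind, parts)
combined <- type.convert(combined, as.is = TRUE)

sapply(combined, class)  # id becomes integer; value stays character ("n/a")
```

A column that stays character after type.convert is a useful hint about which files contain unexpected values.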
Try this:
library(dplyr)
library(readr)
myfolder="test"
df <- list.files(path=myfolder, full.names = TRUE) %>%
lapply(read_csv) %>%
bind_rows

How can I read a table in a loosely structured text file into a data frame in R?

Take a look at the "Estimated Global Trend daily values" file on this NOAA web page. It is a .txt file with something like 50 header lines (identified with leading #s) followed by several thousand lines of tabular data. The link to download the file is embedded in the code below.
How can I read this file so that I end up with a data frame (or tibble) with the appropriate column names and data?
All the text-to-data functions I know get stymied by those header lines. Here's what I just tried, riffing off of this SO Q&A. My thought was to read the file into a list of lines, then drop the lines that start with # from the list, then do.call(rbind, ...) the rest. The downloading part at the top works fine, but when I run the function, I'm getting back an empty list.
temp <- paste0(tempfile(), ".txt")
download.file("ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_trend_gl.txt",
destfile = temp, mode = "wb")
processFile = function(filepath) {
dat_list <- list()
con = file(filepath, "r")
while ( TRUE ) {
line = readLines(con, n = 1)
if ( length(line) == 0 ) {
break
}
append(dat_list, line)
}
close(con)
return(dat_list)
}
dat_list <- processFile(temp)
Here's a possible alternative
processFile = function(filepath, header=TRUE, ...) {
lines <- readLines(filepath)
comments <- which(grepl("^#", lines))
header_row <- gsub("^#","",lines[tail(comments,1)])
data <- read.table(text=c(header_row, lines[-comments]), header=header, ...)
return(data)
}
processFile(temp)
The idea is that we read in all the lines, find the ones that start with "#" and ignore them except for the last one which will be used as the header. We remove the "#" from the header (otherwise it's usually treated as a comment) and then pass it off to read.table to parse the data.
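As for why the original loop returned an empty list: append() returns a new list rather than modifying dat_list in place, so its result has to be assigned back. A minimal sketch on a made-up file (the contents are assumptions, not the NOAA data) shows both that point and the header idea described above:

```r
# A made-up file with comment lines followed by tabular data
path <- tempfile(fileext = ".txt")
writeLines(c("# made-up description line",
             "# year month value",
             "2020 1 410.5",
             "2020 2 411.0"), path)

# append() does not change its first argument; the result must be assigned
dat_list <- list()
invisible(append(dat_list, "ignored"))  # result discarded
length(dat_list)                        # still 0

dat_list <- append(dat_list, "kept")    # result assigned
length(dat_list)                        # now 1

# Use the last comment line as the header and parse the remaining lines
lines <- readLines(path)
is_comment <- grepl("^#", lines)
header_row <- sub("^#", "", lines[max(which(is_comment))])
dat <- read.table(text = c(header_row, lines[!is_comment]), header = TRUE)
names(dat)
```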
Here are a few options that bypass your function and that you can mix & match.
In the easiest (albeit unlikely) scenario where you know the column names already, you can use read.table and enter the column names manually. The default option of comment.char = "#" means those comment lines will be omitted.
read.table(temp, col.names = c("year", "month", "day", "cycle", "trend"))
More likely is that you don't know those column names, but can get them by figuring out how many comment lines there are, then reading just the last of those lines. That saves you having to read more of the file than you need; this is a small enough file that it shouldn't make a huge difference, but in a larger file it might. I'm doing the counting by accessing the command line, only because that's the way I know how. Note also that I saved the file at an easier path; you could instead paste the command together with the temp variable.
Again, the comments are omitted by default.
n_comments <- as.numeric(system("grep '^# ' co2.txt | wc -l", intern = TRUE))
hdrs <- scan(temp, skip = n_comments - 1, nlines = 1, what = "character")[-1]
read.table(temp, col.names = hdrs)
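If shelling out to grep is not an option, the comment lines can also be counted in R. The sketch below is one possible approach, not the only one: count_leading_comments is a hypothetical helper, it assumes the comment lines all sit at the top of the file, and the sample file is made up.

```r
# Hypothetical helper: count the "#" lines at the top of a file,
# reading in chunks so a large file is not loaded in full
count_leading_comments <- function(path, chunk = 100L) {
  con <- file(path, "r")
  on.exit(close(con))
  n <- 0L
  repeat {
    lines <- readLines(con, n = chunk)
    if (length(lines) == 0L) break
    is_comment <- startsWith(lines, "#")
    if (all(is_comment)) {
      n <- n + length(lines)
    } else {
      # stop at the first non-comment line
      n <- n + which(!is_comment)[1] - 1L
      break
    }
  }
  n
}

# Made-up file in the same shape: comment block, then whitespace-separated data
path <- tempfile(fileext = ".txt")
writeLines(c("# made-up description line",
             "# year month value",
             "2020 1 410.5",
             "2020 2 411.0"), path)

n_comments <- count_leading_comments(path)
# Read only the last comment line and drop the leading "#" token
hdrs <- scan(path, skip = n_comments - 1, nlines = 1,
             what = "character", quiet = TRUE)[-1]
dat <- read.table(path, col.names = hdrs)
```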
Or with dplyr and stringr, read all the lines, separate out the comments to extract column names, then filter to remove the comment lines and separate into fields, assigning the column names you've just pulled out. Again, with a bigger file, this could become burdensome.
library(dplyr)
lines <- data.frame(text = readLines(temp), stringsAsFactors = FALSE)
comments <- lines %>%
filter(stringr::str_detect(text, "^#"))
hdrs <- strsplit(comments[nrow(comments), 1], "\\s+")[[1]][-1]
lines %>%
filter(!stringr::str_detect(text, "^#")) %>%
mutate(text = trimws(text)) %>%
tidyr::separate(text, into = hdrs, sep = "\\s+") %>%
mutate_all(as.numeric)

R unable to detect that I have more than one column in loaded files

What I want to do is take every file in the subdirectory that I am in and essentially just shift the column header names over one left.
I try to accomplish this by using fread in a for loop:
library(data.table)
## I need to write this script to reorder the column headers which are now apparently out of wack
## I just need to shift them over one
filelist <- list.files(pattern = ".*.txt")
for(i in 1:length(filelist)){
assign(filelist[[i]], fread(filelist[[i]], fill = TRUE))
names(filelist[[i]]) <- c("RowID", "rsID", "PosID", "Link", "Link.1","Direction", "Spearman_rho", "-log10(p)")
}
However, I keep getting the following or a variant of the following error message:
Error in names(filelist[[i]]) <- c("RowID", "rsID", "PosID", "Link", "Link.1", :
'names' attribute [8] must be the same length as the vector [1]
Which is confusing to me because, as you can clearly see above, R Studio is able to load the files as having the correct number of columns. However, the error message seems to imply that there is only one column. I have tried different functions, such as colnames, and I have even tried to define the separator as being quotation marks (as my files were previously generated by another R script that quotation-separated the entries), to no luck. In fact, if I try to define the separator as such:
for(i in 1:length(filelist)){
assign(filelist[[i]], fread(filelist[[i]], sep = "\"", fill = TRUE))
names(filelist[[i]]) <- c("RowID", "rsID", "PosID", "Link", "Link.1","Direction", "Spearman_rho", "-log10(p)")
}
I get the following error:
Error in fread(filelist[[i]], sep = "\"", fill = TRUE) :
sep == quote ('"') is not allowed
Any help would be appreciated.
I think the problem is that, despite the name, list.files returns a character vector, not a list, so there is no need for [[ (plain [ would do). Then, with assign, you create objects that have the same names as the files (not good practice; it would be better to use a list). Then you try to modify the names of the object you created, but you pass only the character string holding the object's name. To use an object whose name is in a character string, you need get (which is part of why using a list is better than creating a bunch of objects).
To be more explicit, let's say that filelist = c("data1.txt", "data2.txt"). Then, when i = 1, this code: assign(filelist[[i]], fread(filelist[[i]], fill = TRUE)) creates a data table called data1.txt. But your next line, names(filelist[[i]]) <- ... doesn't modify your data table, it modifies the first element of filelist, which is the string "data1.txt", and that string indeed has length 1.
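A small base-R sketch of that distinction, using a made-up data frame in place of the fread result (the object name and column names are assumptions for illustration):

```r
# The string that holds the object's name, and an object created under that name
obj_name <- "data1.txt"
assign(obj_name, data.frame(a = 1:2, b = 3:4))

length(obj_name)     # the string itself has length 1
ncol(get(obj_name))  # the object it names has 2 columns

# Setting names on the string fails, just like in the question
res <- tryCatch({
  names(obj_name) <- c("x", "y")
  "ok"
}, error = function(e) e)
inherits(res, "error")  # TRUE: 2 names do not fit a length-1 vector

# With a list, the data frame is addressed directly and no get() is needed
datalist <- list(data1.txt = data.frame(a = 1:2, b = 3:4))
names(datalist[["data1.txt"]]) <- c("x", "y")
names(datalist[["data1.txt"]])
```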
I recommend reading your files into a list instead of using assign to create objects.
filelist <- list.files(pattern = ".*.txt")
datalist <- lapply(filelist, fread, fill = TRUE)
names(datalist) <- filelist
For changing the names, you can use data.table::setnames instead:
for(dt in datalist) setnames(dt, c("RowID", "rsID", "PosID", "Link", "Link.1","Direction", "Spearman_rho", "-log10(p)"))
However, fread has a col.names argument, so you can just do it in the read step directly:
my_names <- c("RowID", "rsID", "PosID", "Link", "Link.1","Direction", "Spearman_rho", "-log10(p)")
datalist <- lapply(filelist, fread, fill = TRUE, col.names = my_names)
I would also suggest not using "-log10(p)" as a column name - nonstandard column names (with parens and -) are usually more trouble than they are worth.
Could you run the following code to have a closer look at what you are putting into filelist?
i <- 1
assign(filelist[[i]], fread(filelist[[i]], fill = TRUE))
print(filelist[[i]])
I suspect you may need to use the code below instead of the assign statement
filelist[[i]] <- fread(filelist[[i]], fill = TRUE)

Import data from excel but get warning messages

I am importing data from Excel. I have multiple Excel files, so I read them all at once.
Here is my code:
library(readxl)
library(data.table)
file.list <- dir(path = "path/", pattern='\\.xlsx', full.names = T)
df.list <- lapply(file.list, read_excel)
data <- rbindlist(df.list)
However, I get these warning messages between df.list <- lapply(file.list, read_excel) and data <- rbindlist(df.list).
Warning messages:
1: In read_xlsx_(path, sheet, col_names = col_names, col_types = col_types, :
[3083, 9]: expecting date: got '2015/07/19'
2: In read_xlsx_(path, sheet, col_names = col_names, col_types = col_types, :
[3084, 9]: expecting date: got '2015/07/20'
What's going on? How can I check and correct?
Posting my comment as an answer: have you looked at the rows named in the warnings (3083 and 3084) in your Excel sheet? Something seems to be different there. Maybe there is an empty cell before or after these rows, a stray space, or the dates in these cells are formatted differently from the other cells (for example, stored as text rather than as dates).
It is not an elegant solution, but you can set the guess_max argument of read_excel to the number of rows in your data file, so that the column types are guessed from every row; this eliminates the warnings and their side effects.

Executing function on objects of name 'i' within for-loop in R

I am still pretty new to R and very new to for-loops and functions, but I searched quite a bit on stackoverflow and couldn't find an answer to this question. So here we go.
I'm trying to create a script that will (1) read in multiple .csv files and (2) apply a function that strips Twitter handles from URLs and does some other things to these files. I have developed scripts for these two tasks separately, so I know that most of my code works, but something goes wrong when I try to combine them. I set things up using the following code:
# specify directory for your files and replace 'file' with the first, unique part of the
# files you would like to import
mypath <- "~/Users/you/data/"
mypattern <- "file+.*csv"
# Get a list of the files
file_list <- list.files(path = mypath,
pattern = mypattern)
# List of names to be given to data frames
data_names <- str_match(file_list, "(.*?)\\.")[,2]
# Define function for preparing datasets
handlestripper <- function(data){
data$handle <- str_match(data$URL, "com/(.*?)/status")[,2]
data$rank <- c(1:500)
names(data) <- c("dateGMT", "url", "tweet", "twitterid", "rank")
data <- data[,c(4, 1:3, 5)]
}
That all works fine. The problem comes when I try to execute the function handlestripper() within the for-loop.
# Read in data
for(i in data_names){
filepath <- file.path(mypath, paste(i, ".csv", sep = ""))
assign(i, read.delim(filepath, colClasses = "character", sep = ","))
i <- handlestripper(i)
}
When I execute this code, I get the following error: Error in data$URL : $ operator is invalid for atomic vectors. I know that this means that my function is being applied to the string I called from within the vector data_names, but I don't know how to tell R that, in this last line of my for-loop, I want the function applied to the objects of name i that I just created using the assign command, rather than to i itself.
Inside your loop, you can change this:
assign(i, read.delim(filepath, colClasses = "character", sep = ","))
i <- handlestripper(i)
to
tmp <- read.delim(filepath, colClasses = "character", sep = ",")
assign(i, handlestripper(tmp))
I think you should make as few get and assign calls as you can, but there's nothing wrong with indexing your loop with names as you are doing. I do it all the time, anyway.
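If you would rather avoid assign altogether, one option is to keep the data frames in a named list. This is only a sketch: add_handle is a hypothetical, simplified stand-in for handlestripper(), and the two files are made up.

```r
# Hypothetical, simplified stand-in for handlestripper():
# pulls the handle out of the URL column and returns the data frame
add_handle <- function(data) {
  data$handle <- sub(".*com/(.*?)/status.*", "\\1", data$URL, perl = TRUE)
  data
}

# Two made-up input files in a temporary folder
tmp_dir <- tempfile("tweets")
dir.create(tmp_dir)
write.csv(data.frame(URL = "https://twitter.com/alice/status/1"),
          file.path(tmp_dir, "file_a.csv"), row.names = FALSE)
write.csv(data.frame(URL = "https://twitter.com/bob/status/2"),
          file.path(tmp_dir, "file_b.csv"), row.names = FALSE)

paths <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)
data_names <- tools::file_path_sans_ext(basename(paths))

# Read and process each file, keeping the results in one named list
datasets <- lapply(paths, function(p) {
  add_handle(read.csv(p, colClasses = "character"))
})
names(datasets) <- data_names

datasets[["file_a"]]$handle
```

Each data frame is then reached by name with datasets[["file_a"]], so the function is always applied to the data rather than to the string holding its name.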
