I have 774 XLS files that I would like to merge into one big CSV database. They are roughly similar, but I don't know how to handle the differences...
Some XLS files have more than one sheet, and the extra sheets are useless; I need to get rid of them. The problem is that in some files these extra sheets were moved to the first position, while in others they were not. So I can't rely on the default sheet that R's XLS-reading functions pick, right?
Besides that, the names of the extra sheets (those I don't intend to keep) may vary.
Below I present the script I have so far, hoping someone can help me adapt it to this situation.
setwd("D:/Folder")
library(readxl)
lst = list.files()
df = data.frame()
# Now comes the loop
for (table in lst) {
  dataFromExcel <- read_excel(table)
  df <- rbind(df, dataFromExcel)
}
When I run the loop, I receive the message:
New names:
`` -> ...3
`` -> ...4
`` -> ...5
Error in as.POSIXlt.character(x, tz, ...) : character string is not in a standard unambiguous format
Can someone give me some help?
Try:
for (table in lst) {
  dataFromExcel <- read_excel(table, col_types = "text")  # <- !!
  df <- rbind(df, dataFromExcel)
}
You will have to wrangle the data into the correct types afterwards.
Further, perhaps a more R-like approach:
library(data.table)
DT <- rbindlist(lapply(lst, read_excel, col_types = "text"),
                use.names = TRUE, fill = TRUE)
should do the same as your for-loop (and has some nice extra features; see ?data.table::rbindlist).
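As for the extra sheets, which the snippets above don't address: since their position and names vary, one option is to look at every sheet and keep only the one that looks like data. Below is a minimal sketch, assuming the real data sheet can be recognized by a marker column ("Name" here is purely a hypothetical placeholder; substitute a column you know occurs in your data):
library(readxl)
library(data.table)

# Hypothetical helper: return the first sheet that contains the marker column
read_data_sheet <- function(file, marker = "Name") {
  for (sh in excel_sheets(file)) {
    dat <- read_excel(file, sheet = sh, col_types = "text")
    if (marker %in% names(dat)) return(dat)
  }
  NULL  # no sheet matched; rbindlist() skips NULL entries
}

DT <- rbindlist(lapply(lst, read_data_sheet), use.names = TRUE, fill = TRUE)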
I'm relatively new to R and have been trying to find a working answer here for the last three hours, but I just cannot seem to find a combination that works.
I have a folder that contains 841 CSV files, none of which have column names. The format is the same for every file (although some files might have blank columns, due to there simply being no data available for that column in that file).
I want to be able to read in all 841 CSV files, add the column names, and then bind them by row into a single data frame.
Bringing in a single file and adding the column names is easy enough:
col.names = c("ID", "NAMES_URI", "NAME1", "NAME1_LANG", "NAME2", "NAME2_LANG", "TYPE", "LOCAL_TYPE",
"GEOMETRY_X", "GEOMETRY_Y", "MOST_DETAIL_VIEW_RES", "LEAST_DETAIL_VIEW_RES", "MBR_XMIN",
"MBR_YMIN", "MBR_XMAX", "MBR_YMAX", "POSTCODE_DISTRICT", "POSTCODE_DISTRICT_URI",
"POPULATED_PLACE", "POPULATED_PLACE_URI", "POPULATED_PLACE_TYPE", "DISTRICT_BOROUGH",
"DISTRICT_BOROUGH_URI", "DISTRICT_BOROUGH_TYPE", "COUNTY_UNITARY", "COUNTY_UNITARY_URI",
"COUNTY_UNITARY_TYPE", "REGION", "REGION_URI", "COUNTRY", "COUNTRY_URI", "RELATED_SPATIAL_OBJECT",
"SAME_AS_DBPEDIA", "SAME_AS_GEONAMES")
Single_File <- fread(file = "C:/Users/djr/Desktop/PostCodes/Data/HP40.csv", header = FALSE)
setnames(Single_File, col.names)
My issue comes when I try to read the files in as a list and bind them. I've tried examples using lapply or map_dfr, but they always bring up error messages about the vector sizes not being the same, about not being able to fill, or about the column specifications not being the same.
My current code I am trying is:
dir(pattern = ".csv") %>%
map_dfr(read_csv, col_names = c("ID", "NAMES_URI", "NAME1", "NAME1_LANG", "NAME2", "NAME2_LANG", "TYPE", "LOCAL_TYPE",
"GEOMETRY_X", "GEOMETRY_Y", "MOST_DETAIL_VIEW_RES", "LEAST_DETAIL_VIEW_RES", "MBR_XMIN",
"MBR_YMIN", "MBR_XMAX", "MBR_YMAX", "POSTCODE_DISTRICT", "POSTCODE_DISTRICT_URI",
"POPULATED_PLACE", "POPULATED_PLACE_URI", "POPULATED_PLACE_TYPE", "DISTRICT_BOROUGH",
"DISTRICT_BOROUGH_URI", "DISTRICT_BOROUGH_TYPE", "COUNTY_UNITARY", "COUNTY_UNITARY_URI",
"COUNTY_UNITARY_TYPE", "REGION", "REGION_URI", "COUNTRY", "COUNTRY_URI", "RELATED_SPATIAL_OBJECT",
"SAME_AS_DBPEDIA", "SAME_AS_GEONAMES"))
But this just brings up loads of output in the console that is meaningless to me; it seems to be giving a summary of each file.
Does anyone have simple code to bring in the CSVs, add the column names to each, and then bind them all together?
I am not 100% sure what exactly it is you need, but my best guess would be something like this:
library(data.table)
y_path <- 'C:/your_path/your_folder'
all_csv <- list.files(path = y_path, pattern = '\\.csv$', full.names = TRUE)
open_csv <- lapply(all_csv, \(x) fread(x, ...)) # ... here just signifying other arguments
one_df <- data.table::rbindlist(open_csv)
# or: do.call(rbind, open_csv)
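Note that this doesn't yet attach your column names. Since the files have no header row, one way is to pass header = FALSE along with fread's col.names argument (a sketch, reusing the col.names vector defined in the question above):
all_csv  <- list.files(path = y_path, pattern = '\\.csv$', full.names = TRUE)
open_csv <- lapply(all_csv, fread, header = FALSE, col.names = col.names)
one_df   <- rbindlist(open_csv)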
Take a look at the "Estimated Global Trend daily values" file on this NOAA web page. It is a .txt file with something like 50 header lines (identified with leading #s) followed by several thousand lines of tabular data. The link to download the file is embedded in the code below.
How can I read this file so that I end up with a data frame (or tibble) with the appropriate column names and data?
All the text-to-data functions I know get stymied by those header lines. Here's what I just tried, riffing off of this SO Q&A. My thought was to read the file into a list of lines, then drop the lines that start with # from the list, then do.call(rbind, ...) the rest. The downloading part at the top works fine, but when I run the function, I'm getting back an empty list.
temp <- paste0(tempfile(), ".txt")
download.file("ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_trend_gl.txt",
destfile = temp, mode = "wb")
processFile = function(filepath) {
  dat_list <- list()
  con = file(filepath, "r")
  while (TRUE) {
    line = readLines(con, n = 1)
    if (length(line) == 0) {
      break
    }
    append(dat_list, line)  # note: this result is never stored
  }
  close(con)
  return(dat_list)
}
dat_list <- processFile(temp)
Here's a possible alternative:
processFile = function(filepath, header = TRUE, ...) {
  lines <- readLines(filepath)
  comments <- which(grepl("^#", lines))
  header_row <- gsub("^#", "", lines[tail(comments, 1)])
  data <- read.table(text = c(header_row, lines[-comments]), header = header, ...)
  return(data)
}
processFile(temp)
The idea is that we read in all the lines, find the ones that start with "#" and ignore them except for the last one which will be used as the header. We remove the "#" from the header (otherwise it's usually treated as a comment) and then pass it off to read.table to parse the data.
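Incidentally, the reason your original processFile() came back empty is the line flagged above: append() returns a new list rather than modifying dat_list in place, so the result has to be reassigned:
dat_list <- append(dat_list, line)
With that one change the function does collect the lines, though you would still need to drop the comments and parse the rest, as above.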
Here are a few options that bypass your function and that you can mix & match.
In the easiest (albeit unlikely) scenario where you know the column names already, you can use read.table and enter the column names manually. The default option of comment.char = "#" means those comment lines will be omitted.
read.table(temp, col.names = c("year", "month", "day", "cycle", "trend"))
More likely is that you don't know those column names, but can get them by figuring out how many comment lines there are, then reading just the last of those lines. That saves you having to read more of the file than you need; this is a small enough file that it shouldn't make a huge difference, but in a larger file it might. I'm doing the counting by accessing the command line, only because that's the way I know how. Note also that I saved the file at an easier path; you could instead paste the command together with the temp variable.
Again, the comments are omitted by default.
n_comments <- as.numeric(system("grep '^# ' co2.txt | wc -l", intern = TRUE))
hdrs <- scan(temp, skip = n_comments - 1, nlines = 1, what = "character")[-1]
read.table(temp, col.names = hdrs)
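If you would rather not shell out, the same count can be done in R (a sketch; it reads the whole file once, mirroring the '^# ' pattern used in the grep call above):
n_comments <- sum(grepl("^# ", readLines(temp)))
after which the scan() and read.table() calls above work unchanged.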
Or with dplyr and stringr, read all the lines, separate out the comments to extract column names, then filter to remove the comment lines and separate into fields, assigning the column names you've just pulled out. Again, with a bigger file, this could become burdensome.
library(dplyr)
lines <- data.frame(text = readLines(temp), stringsAsFactors = FALSE)
comments <- lines %>%
  filter(stringr::str_detect(text, "^#"))
hdrs <- strsplit(comments[nrow(comments), 1], "\\s+")[[1]][-1]
lines %>%
  filter(!stringr::str_detect(text, "^#")) %>%
  mutate(text = trimws(text)) %>%
  tidyr::separate(text, into = hdrs, sep = "\\s+") %>%
  mutate_all(as.numeric)
What I want to do is take every file in the subdirectory I am in and essentially just shift the column header names over one to the left.
I try to accomplish this by using fread in a for loop:
library(data.table)
## I need to write this script to reorder the column headers which are now apparently out of wack
## I just need to shift them over one
filelist <- list.files(pattern = ".*.txt")
for (i in 1:length(filelist)) {
  assign(filelist[[i]], fread(filelist[[i]], fill = TRUE))
  names(filelist[[i]]) <- c("RowID", "rsID", "PosID", "Link", "Link.1",
                            "Direction", "Spearman_rho", "-log10(p)")
}
However, I keep getting the following or a variant of the following error message:
Error in names(filelist[[i]]) <- c("RowID", "rsID", "PosID", "Link", "Link.1", :
'names' attribute [8] must be the same length as the vector [1]
Which is confusing to me because, as you can clearly see above, RStudio is able to load the files as having the correct number of columns. However, the error message seems to imply that there is only one column. I have tried different functions, such as colnames, and I have even tried to define the separator as quotation marks (as my files were previously generated by another R script that quote-separated the entries), to no avail. In fact, if I try to define the separator that way:
for (i in 1:length(filelist)) {
  assign(filelist[[i]], fread(filelist[[i]], sep = "\"", fill = TRUE))
  names(filelist[[i]]) <- c("RowID", "rsID", "PosID", "Link", "Link.1",
                            "Direction", "Spearman_rho", "-log10(p)")
}
I get the following error:
Error in fread(filelist[[i]], sep = "\"", fill = TRUE) :
sep == quote ('"') is not allowed
Any help would be appreciated.
I think the problem is that, despite the name, list.files returns a character vector, not a list, so filelist[[i]] is just a file-name string. Then, with assign, you create objects that have the same names as the files (not good practice; it would be better to use a list). Then you try to modify the names of the objects you created, but only using the character strings of the object names. To use an object whose name is stored in a character string, you need to use get (which is part of why using a list is better than creating a bunch of objects).
To be more explicit, let's say that filelist = c("data1.txt", "data2.txt"). Then, when i = 1, this code: assign(filelist[[i]], fread(filelist[[i]], fill = TRUE)) creates a data table called data1.txt. But your next line, names(filelist[[i]]) <- ..., doesn't modify your data table; it modifies the first element of filelist, which is the string "data1.txt", and that string indeed has length 1.
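If the assign/get pairing is unfamiliar, a minimal sketch (object name made up for illustration):
assign("dat1", data.frame(x = 1:3))  # create an object whose name is held in a string
names(get("dat1"))                   # fetch the object by name and inspect it
Note that names(get("dat1")) <- ... still would not work, because get() returns a value, not an assignable reference; that is another reason to prefer a list.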
I recommend reading your files into a list instead of using assign to create objects.
filelist <- list.files(pattern = ".*.txt")
datalist <- lapply(filelist, fread, fill = TRUE)
names(datalist) <- filelist
For changing the names, you can use data.table::setnames instead:
for(dt in datalist) setnames(dt, c("RowID", "rsID", "PosID", "Link", "Link.1","Direction", "Spearman_rho", "-log10(p)"))
However, fread has a col.names argument, so you can just do it in the read step directly:
my_names <- c("RowID", "rsID", "PosID", "Link", "Link.1","Direction", "Spearman_rho", "-log10(p)")
datalist <- lapply(filelist, fread, fill = TRUE, col.names = my_names)
I would also suggest not using "-log10(p)" as a column name; nonstandard column names (with parentheses and a minus sign) are usually more trouble than they are worth.
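For example, a name like that has to be back-quoted every time it is referenced (dt here standing in for one of your tables):
dt$`-log10(p)`       # backticks required with $
dt[, `-log10(p)`]    # and inside data.table expressions
whereas a plain name such as neg_log10_p needs none of that.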
Could you run the following code to have a closer look at what you are putting into filelist?
i <- 1
assign(filelist[[i]], fread(filelist[[i]], fill = TRUE))
print(filelist[[i]])
I suspect you may need to use the code below instead of the assign statement:
filelist[[i]] <- fread(filelist[[i]], fill = TRUE)
So let's say I've defined the below function to read in a set of files:
library(gdata)  # read.xls() comes from the gdata package

read_File <- function(file){
  # read Excel file
  df1 <- read.xls(file, sheet = 1, pattern = "Name", header = T,
                  na.strings = ("##"), stringsAsFactors = F)
  # remove rows where Name is empty
  df2 <- df1[-which(df1$Name == ""), ]
  # remove rows where "Name" is repeated
  df3 <- df2[-which(df2$Name == "Name"), ]
  # remove all empty columns (anything with an auto-generated colname)
  df4 <- df3[, -grep("X\\.{0,1}\\d*$", colnames(df3))]
  row.names(df4) <- NULL
  df4$FileName <- file
  return(df4)
}
It works fine like this, but it feels like bad form to define df1...df4 to represent the intermediate steps. Is there a better way to do this without compromising readability?
I see no reason to save intermediate objects separately unless they need to be used multiple times. This is not the case in your code, so I would replace all your df[0-9] with df:
read_File <- function(file){
  # read Excel file
  df <- read.xls(file, sheet = 1, pattern = "Name", header = T,
                 na.strings = ("##"), stringsAsFactors = F)
  # remove rows where Name is empty
  df <- df[-which(df$Name == ""), ]
  # remove rows where "Name" is repeated
  df <- df[-which(df$Name == "Name"), ]
  # remove all empty columns (anything with an auto-generated colname)
  df <- df[, -grep("X\\.{0,1}\\d*$", colnames(df))]
  row.names(df) <- NULL
  df$FileName <- file
  return(df)
}
df3 is not a nice descriptive variable name; it doesn't tell you anything more about the variable than df does. Sequentially numbering variables like that also creates a maintenance burden: if you need to add a new step in the middle, you will have to rename all subsequent objects to stay consistent, which is both annoying and a potential source of bugs.
(Or you end up with something hacky like df2.5, which is ugly and doesn't generalize well.) Generally, I think sequentially named variables are almost always bad practice, even when they are separate objects that you need saved.
Furthermore, keeping the intermediate objects around is not a good use of memory. In most cases it won't matter, but if your data is large, then saving all the intermediate steps separately will greatly increase the amount of memory used during processing.
The comments are excellent, lots of detail - they tell you all you need to know about what's going on in the code.
If it were me, I would probably combine some steps, something like this:
read_File <- function(file){
  # read Excel file
  df <- read.xls(file, sheet = 1, pattern = "Name", header = T,
                 na.strings = ("##"), stringsAsFactors = F)
  # remove rows where Name is bad:
  bad_names <- c("", "Name")
  df <- df[-which(df$Name %in% bad_names), ]
  # remove all empty columns (anything with an auto-generated colname)
  df <- df[, -grep("X\\.{0,1}\\d*$", colnames(df))]
  row.names(df) <- NULL
  df$FileName <- file
  return(df)
}
Having a bad_names vector to omit saves a line and is more parametric; it would be trivial to promote bad_names to a function argument (perhaps with the default value c("", "Name")) so that the user could customize it, as in the sketch below.
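A sketch of that promotion; the body is as above, except that bad_names is now an argument, and -which() is swapped for a negated %in% (which, unlike -which(), also behaves correctly when no rows match):
read_File <- function(file, bad_names = c("", "Name")){
  # read Excel file
  df <- read.xls(file, sheet = 1, pattern = "Name", header = T,
                 na.strings = ("##"), stringsAsFactors = F)
  # remove rows where Name is bad (user-customizable via bad_names)
  df <- df[!(df$Name %in% bad_names), ]
  # remove all empty columns (anything with an auto-generated colname)
  df <- df[, -grep("X\\.{0,1}\\d*$", colnames(df))]
  row.names(df) <- NULL
  df$FileName <- file
  return(df)
}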
I am still pretty new to R and very new to for-loops and functions, but I searched quite a bit on Stack Overflow and couldn't find an answer to this question. So here we go.
I'm trying to create a script that will (1) read in multiple .csv files and (2) apply a function that strips Twitter handles from URLs in these files and does some other things to them. I have developed scripts for these two tasks separately, so I know that most of my code works, but something goes wrong when I try to combine them. I prepare for doing so using the following code:
library(stringr)  # str_match() below comes from stringr

# specify the directory for your files and replace 'file' with the first,
# unique part of the files you would like to import
mypath <- "~/Users/you/data/"
mypattern <- "file+.*csv"
# Get a list of the files
file_list <- list.files(path = mypath,
pattern = mypattern)
# List of names to be given to data frames
data_names <- str_match(file_list, "(.*?)\\.")[,2]
# Define function for preparing datasets
handlestripper <- function(data){
  data$handle <- str_match(data$URL, "com/(.*?)/status")[, 2]
  data$rank <- c(1:500)
  names(data) <- c("dateGMT", "url", "tweet", "twitterid", "rank")
  data <- data[, c(4, 1:3, 5)]
}
That all works fine. The problem comes when I try to execute the function handlestripper() within the for-loop.
# Read in data
for (i in data_names) {
  filepath <- file.path(mypath, paste(i, ".csv", sep = ""))
  assign(i, read.delim(filepath, colClasses = "character", sep = ","))
  i <- handlestripper(i)
}
When I execute this code, I get the following error: Error in data$URL : $ operator is invalid for atomic vectors. I know this means that my function is being applied to the string I pulled from the vector data_names, but I don't know how to tell R that, in the last line of my for-loop, I want the function applied to the object named i that I just created with assign, rather than to the string i itself.
Inside your loop, you can change this:
assign(i, read.delim(filepath, colClasses = "character", sep = ","))
i <- handlestripper(i)
to
tmp <- read.delim(filepath, colClasses = "character", sep = ",")
assign(i, handlestripper(tmp))
I think you should make as few get and assign calls as you can, but there's nothing wrong with indexing your loop with names as you are doing. I do it all the time, anyway.
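For what it's worth, here is a list-based sketch of the same read-then-strip step that avoids assign and get entirely (reusing mypath, data_names, and handlestripper from the question's setup code):
filepaths <- file.path(mypath, paste(data_names, ".csv", sep = ""))
datalist <- lapply(filepaths, function(p)
  handlestripper(read.delim(p, colClasses = "character", sep = ",")))
names(datalist) <- data_names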