How to read and name different CSV files in R

I would like to work on several csv files to make some comparisons, so I wrote this code to read the different csv files I have:
path <- "C:\\data\\"
files <- list.files(path = path, pattern = "*.csv")
for (file in files) {
  perpos <- which(strsplit(file, "")[[1]] == ".")
  assign(
    gsub(" ", "", substr(file, 1, perpos - 1)),
    read.csv(paste(path, file, sep = "")))
}
My csv files are something like this:
Start Time,End Time,Total,Diffuse,Direct,Reflected
04/09/14 00:01:00,04/09/14 00:01:00,2.221220E-003,5.797364E-004,0.000000E+000,1.641484E-003,
04/09/14 00:02:00,04/09/14 00:02:00,2.221220E-003,5.797364E-004,0.000000E+000,1.641484E-003,
04/09/14 00:03:00,04/09/14 00:03:00,2.221220E-003,5.797364E-004,0.000000E+000,1.641484E-003,
(...)
Using my code, R reads all the files correctly, but for each of them it creates a table with an extra empty space at the beginning and an NA column at the end:
|Start Time |End Time |Total |Diffuse |Direct |Reflected
04/09/14 00:01:00|04/09/14 00:01:00|2.221220E-003|5.797364E-004|0.000000E+000|1.641484E-003|NA
...
How can I fix it?
Moreover, considering that the original name of each file is really long, is it possible to name each data.frame using the last letters of the file? Or just a cardinal number?

I would suggest using the data.table package - it's faster, and in my experience fread converts blank trailing columns like yours to NA. Here's some code I wrote for a similar task:
library(data.table)

read_func <- function(z) {
  dat <- fread(z, stringsAsFactors = FALSE)
  names(dat) <- c("start_time", "end_time", "Total", "Diffuse", "Direct", "Reflect")
  # parse the timestamp column (adjust the time zone to your data)
  dat$start_time <- as.POSIXct(strptime(dat$start_time,
                                        format = "%d/%m/%y %H:%M:%S"), tz = "Pacific/Easter")
  # tag each file's rows with an identifier built from the file name
  patrn <- "([0-9][0-9][0-9])\\.csv"
  dat$type <- paste("Dataset", gsub(".csv", "", regmatches(z, regexpr(patrn, z))), sep = "")
  return(dat)  # fread already returns a data.table
}
path <- "./Data/"
file_list <- dir(path, pattern = "csv")
file_names <- unname(sapply(file_list, function(x) paste(path, x, sep = "")))
data_list <- lapply(file_names, read_func)
dat <- rbindlist(data_list, use.names = TRUE)
rm(path, file_list, file_names)
This gives you a list in which each item is the data.table read from the corresponding file. I assumed that all file names have a three-digit number before the extension, which I used to assign a type variable to each data.table; you can change patrn to match your specific use case. This way, when you combine all of them into a single data.table dat, you can always sort/filter based on type. For example, if you wanted to plot Diffuse vs Direct for Dataset158 and Dataset222, you could do the following:
library(ggplot2)
ggplot(data = dat[type == "Dataset158" | type == "Dataset222"],
       aes(x = Diffuse, y = Direct)) +
  geom_point()
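As a side note, the same filter can be written with %in%, which scales better if you have many datasets (standard data.table syntax):
ggplot(data = dat[type %in% c("Dataset158", "Dataset222")],
       aes(x = Diffuse, y = Direct)) +
  geom_point()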
Hope this helps!

You're having a problem because your csv files have a blank column at the end... which makes your data end in a comma:
04/09/14 00:01:00,04/09/14 00:01:00,2.221220E-003,5.797364E-004,0.000000E+000,1.641484E-003,
This leads R to think your data consists of 7 columns rather than 6. The correct solution is to resave all your csv files correctly. Otherwise, R will see 7 columns but only 6 column names, and will logically assume that the first column holds the row names. Here you can apply the patch that @konradrudolph and I came up with:
library(tibble)
library(dplyr)  # for %>% and select()

df %>%
  rownames_to_column() %>%
  setNames(c(colnames(.)[-1], "DROP")) %>%
  select(-DROP)
where df is the data read from the csv. But patches like this can lead to unexpected results... it's better to save the csv files correctly in the first place.
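If resaving really isn't possible, another workaround is to read the header line and the data rows separately and drop the spurious trailing column yourself. A minimal sketch, assuming every data row ends with a trailing comma as shown above ("myfile.csv" is a placeholder):
raw <- read.csv("myfile.csv", header = FALSE, skip = 1)    # data rows: 7 columns V1..V7
hdr <- read.csv("myfile.csv", header = FALSE, nrows = 1,
                stringsAsFactors = FALSE)                  # header line: the 6 real names
raw <- raw[, 1:6]                                          # drop the all-NA 7th column
names(raw) <- as.character(hdr[1, ])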

Related

How can I read a table in a loosely structured text file into a data frame in R?

Take a look at the "Estimated Global Trend daily values" file on this NOAA web page. It is a .txt file with something like 50 header lines (identified with leading #s) followed by several thousand lines of tabular data. The link to download the file is embedded in the code below.
How can I read this file so that I end up with a data frame (or tibble) with the appropriate column names and data?
All the text-to-data functions I know get stymied by those header lines. Here's what I just tried, riffing off of this SO Q&A. My thought was to read the file into a list of lines, then drop the lines that start with # from the list, then do.call(rbind, ...) the rest. The downloading part at the top works fine, but when I run the function, I'm getting back an empty list.
temp <- paste0(tempfile(), ".txt")
download.file("ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_trend_gl.txt",
              destfile = temp, mode = "wb")

processFile = function(filepath) {
  dat_list <- list()
  con = file(filepath, "r")
  while (TRUE) {
    line = readLines(con, n = 1)
    if (length(line) == 0) {
      break
    }
    append(dat_list, line)
  }
  close(con)
  return(dat_list)
}

dat_list <- processFile(temp)
As an aside, your function returns an empty list because append() does not modify dat_list in place - you would need dat_list <- append(dat_list, line). That said, here's a possible alternative that avoids the line-by-line loop entirely:
processFile = function(filepath, header = TRUE, ...) {
  lines <- readLines(filepath)
  comments <- which(grepl("^#", lines))
  header_row <- gsub("^#", "", lines[tail(comments, 1)])
  data <- read.table(text = c(header_row, lines[-comments]), header = header, ...)
  return(data)
}
processFile(temp)
The idea is that we read in all the lines, find the ones that start with "#" and ignore them except for the last one which will be used as the header. We remove the "#" from the header (otherwise it's usually treated as a comment) and then pass it off to read.table to parse the data.
Here are a few options that bypass your function and that you can mix & match.
In the easiest (albeit unlikely) scenario where you know the column names already, you can use read.table and enter the column names manually. The default option of comment.char = "#" means those comment lines will be omitted.
read.table(temp, col.names = c("year", "month", "day", "cycle", "trend"))
More likely is that you don't know those column names, but can get them by figuring out how many comment lines there are, then reading just the last of those lines. That saves you having to read more of the file than you need; this is a small enough file that it shouldn't make a huge difference, but in a larger file it might. I'm doing the counting by accessing the command line, only because that's the way I know how. Note also that I saved the file at an easier path; you could instead paste the command together with the temp variable.
Again, the comments are omitted by default.
n_comments <- as.numeric(system("grep '^# ' co2.txt | wc -l", intern = TRUE))
hdrs <- scan(temp, skip = n_comments - 1, nlines = 1, what = "character")[-1]
read.table(temp, col.names = hdrs)
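If you would rather not shell out (for instance on Windows, where grep may not be available), the same count can be done in R itself; a small sketch under the same assumption that comment lines start with "#":
n_comments <- sum(grepl("^#", readLines(temp)))  # count the comment lines in R
hdrs <- scan(temp, skip = n_comments - 1, nlines = 1, what = "character")[-1]
read.table(temp, col.names = hdrs)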
Or with dplyr and stringr, read all the lines, separate out the comments to extract column names, then filter to remove the comment lines and separate into fields, assigning the column names you've just pulled out. Again, with a bigger file, this could become burdensome.
library(dplyr)

lines <- data.frame(text = readLines(temp), stringsAsFactors = FALSE)
comments <- lines %>%
  filter(stringr::str_detect(text, "^#"))
hdrs <- strsplit(comments[nrow(comments), 1], "\\s+")[[1]][-1]
lines %>%
  filter(!stringr::str_detect(text, "^#")) %>%
  mutate(text = trimws(text)) %>%
  tidyr::separate(text, into = hdrs, sep = "\\s+") %>%
  mutate_all(as.numeric)

How to rename multiple objects in R? Or how can objects with "-" in their names be referenced in R?

I have read a large number of CSV files into R and, foolishly, the data frames have been saved with the "-" symbol in their names, for instance "new-zealand_kba". Is there a way to replace "-" with "_" for a large number of objects? E.g. "new-zealand_kba" would become "new_zealand_kba".
Alternatively, is there a way to read in large numbers of CSV files so that the "-" in their file names is replaced with "_"?
# reading in all the CSV files in the folder "Oceania"
path <- "C:/Users/Melissa/Desktop/Dissertation/DATA/Oceania/"
files <- list.files(path = path, pattern = "*.csv")
for (file in files) {
  perpos <- which(strsplit(file, "")[[1]] == ".")
  assign(
    gsub(" ", "", substr(file, 1, perpos - 1)),
    read.csv(paste(path, file, sep = "")))
}
# merging all the CSV files for the same country
# is it this section of my code which won't run, as new-zealand_red etc.
# is not recognised properly?
new-zealand1 <- full_join(new-zealand_red, new-zealand_kba, by = "Year")
new-zealand2 <- full_join(new-zealand1, new-zealand_con, by = "Year")
new-zealand3 <- full_join(new-zealand2, new-zealand_rep, by = "Year")
You are using assign to create your data.frames. Here you can change the name:
assign(
  gsub("-", "_", gsub(" ", "", substr(file, 1, perpos - 1))),
  read.csv(paste(path, file, sep = "")))
However, assign may not be the best option; consider storing a list of data frames instead of keeping each data frame separately, for instance like this (untested):
list_of_dataframes <- lapply(files, function(file) read.csv(paste(path, file, sep = "")))
Then all your data frames would sit in a list. You could assign names to this list easily:
names(list_of_dataframes) <- gsub("[ -]", "_", gsub("\\.csv$", "", files))
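With such a named list, the country-wise joins from your question no longer need hyphenated objects. A hedged sketch, assuming dplyr is loaded and that all the New Zealand tables share a "Year" column:
library(dplyr)
# pick out the New Zealand data frames by name and fold them together
nz <- list_of_dataframes[grep("^new_zealand", names(list_of_dataframes))]
new_zealand_all <- Reduce(function(x, y) full_join(x, y, by = "Year"), nz)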
If you do not want to re-read your data and you want to rename existing data frames, you could do the following:
all_obj <- ls()
for (old_name in all_obj) {
  new_name <- make.names(old_name)  # get a syntactically valid name
  if (new_name != old_name) {
    assign(new_name, get(old_name))
    rm(list = old_name)  # rm(old_name) would delete the variable "old_name" itself
  }
}
Note: this code will also rename valid infix operators like %||%, so starting from a better selection for all_obj is recommended.
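For instance, you could restrict the loop to objects whose names actually contain a hyphen:
all_obj <- ls(pattern = "-")  # ls() accepts a regular expression; "-" matches hyphenated names only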

rbind txt files from online directory (R)

I am trying to concatenate text files from a URL, but I don't know how to do this with the HTML and the different folders.
This is the code I tried, but it only lists the text files, mixed in with a lot of HTML code. How do I fix this so that I can combine the text files into one csv file?
library(RCurl)
library(readr)

url <- "http://weather.ggy.uga.edu/data/daily/"
dir <- getURL(url, dirlistonly = TRUE)
filenames <- unlist(strsplit(dir, "\n"))  # split into filenames
# append the files one after another
for (i in 1:length(filenames)) {
  file <- paste(url, filenames[i], sep = "")  # concatenate the full url
  if (i == 1) {
    cp <- read_delim(file, col_names = FALSE, delim = ",")
  } else {
    temp <- read_delim(file, col_names = FALSE, delim = ",")
    cp <- rbind(cp, temp)  # append to the existing data
    rm(temp)  # remove the temporary object
  }
}
Here is a code snippet that worked for me. I like to use rvest over RCurl, just because that's what I've learned. In this case, I was able to use the html_nodes function to isolate each file ending in .txt. The resulting table has the times saved as character strings, but you could fix that later. Let me know if you have any questions.
library(rvest)
library(readr)

url <- "http://weather.ggy.uga.edu/data/daily/"
doc <- xml2::read_html(url)
text <- rvest::html_text(rvest::html_nodes(doc, "tr td a:contains('.txt')"))

# define column types of fwf data ("c" = character, "n" = number)
ctypes <- paste0("c", paste0(rep("n", 11), collapse = ""))

data <- data.frame()
for (i in 1:2) {  # first two files as a demo; use seq_along(text) for all of them
  file <- paste0(url, text[i])
  date <- as.Date(read_lines(file, n_max = 1), "%m/%d/%y")
  # Read file to determine widths
  columns <- fwf_empty(file, skip = 3)
  # Manually expand `solar` column to be 3 spaces wider
  columns$begin[8] <- columns$begin[8] - 3
  data <- rbind(data, cbind(date, read_fwf(file, columns,
                                           skip = 3, col_types = ctypes)))
}
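Since the stated goal was one combined csv file, a final step might look like this (the output file name is a placeholder):
readr::write_csv(data, "combined_daily.csv")  # "combined_daily.csv" is hypothetical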

Extracting file numbers from file names in r and looping through files

I have a folder full of .txt files that I want to loop through and compress into one data frame, but each .txt file holds data for one subject, and there are no columns in the text files that indicate subject number or time point in the study (e.g. 1-5). I need to add a line or two of code to my loop that looks for a string of four numbers (each file is labeled something like "4325.5_ERN_No_Startle") and creates one column with 4325 and another column with 5, repeated for every data point for that subject until the loop gets to the next one. I have been looking for a while but am still coming up empty; any suggestions?
I also have not quite gotten the loop to work:
path = "/Users/me/Desktop/Event Codes/ERN task/ERN text files transferred"
out.file <- ""
file <- ""
file.names <- dir(path, pattern = ".txt")
for (i in 1:length(file.names)) {
  file <- read.table(file.names[i], header = FALSE, fill = TRUE)
  out.file <- rbind(out.file, file)
}
which runs okay until I get this error message part way through:
Error in read.table(file.names[i], header = FALSE, fill = TRUE) :
no lines available in input
Consider using regex to parse the file name for study period and subject, both of which are then bound in a lapply over list.files:
path = "path/to/text/files"

# ANY TXT FILE WITH PATTERN OF 4 DIGITS FOLLOWED BY A PERIOD AND ONE DIGIT
file.names <- list.files(path, pattern = "[0-9]{4}\\.[0-9]{1}.*txt$", full.names = TRUE)

# IMPORT ALL FILES INTO A LIST OF DATAFRAMES AND BIND THE REGEX EXTRACTS
dfList <- lapply(file.names, function(x) {
  if (file.exists(x)) {
    data.frame(period = regmatches(x, gregexpr('[0-9]{4}', x))[[1]],
               subject = regmatches(x, gregexpr('\\.[0-9]{1}', x))[[1]],
               read.table(x, header = FALSE, fill = TRUE),
               stringsAsFactors = FALSE)
  }
})
# COMBINE EACH DATA FRAME INTO ONE
df <- do.call(rbind, dfList)

# REMOVE PERIOD IN SUBJECT (NEEDED EARLIER TO MATCH THE DIGIT)
df['subject'] <- sapply(df['subject'], function(x) gsub("\\.", "", x))
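If you prefer numeric identifiers, the extracted strings can be converted afterwards (an optional step):
# convert the regex captures from character to integer
df$period  <- as.integer(df$period)
df$subject <- as.integer(df$subject)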
You can try to use tryCatch, which would basically give you NULL instead of an error:
file <- tryCatch(read.table(file.names[i], header = FALSE, fill = TRUE),
                 error = function(e) NULL)
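Dropped into your original loop, that could look like the following sketch, which collects the successfully read tables in a list (this also avoids the out.file <- "" initialization, which rbinds a string onto data frames):
out.list <- list()
for (i in 1:length(file.names)) {
  file <- tryCatch(read.table(file.names[i], header = FALSE, fill = TRUE),
                   error = function(e) NULL)
  if (!is.null(file)) out.list[[length(out.list) + 1]] <- file  # skip unreadable/empty files
}
out.file <- do.call(rbind, out.list)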

applying same function on multiple files in R

I am new to R and currently working on a set of financial data. I have around 10 csv files in my working directory, and I want to analyze one of them and then apply the same commands to the rest of the csv files.
Here are all the names of these files: ("US%10y.csv", "UK%10y.csv", "GER%10y.csv","JAP%10y.csv", "CHI%10y.csv", "SWI%10y.csv","SOA%10y.csv", "BRA%10y.csv", "CAN%10y.csv", "AUS%10y.csv")
For example, the Date column in the CSV files is read in as a factor, so I need to convert it to Date format:
CAN <- read.csv("CAN%10y.csv", header = T, sep = ",")
CAN$Date <- as.character(CAN$Date)
CAN$Date <- as.Date(CAN$Date, format ="%m/%d/%y")
CAN_merge <- merge(all.dates.frame, CAN, all = T)
CAN_merge$Bid.Yield.To.Maturity <- NULL
all.dates.frame is a data frame of 731 consecutive days. I want to merge against it so that each file ends up with the same number of rows, which later lets me combine the 10 files into a 731 x 11 master data frame.
Surely I can copy and paste this code and change the file name each time, but is there a simpler approach using apply or a for loop?
Thank you very much for your help.
This should do the trick - leave a comment if a certain part doesn't work; I wrote this blind without testing.
Get a list of files in your current directory whose names end in .csv:
L = list.files(".", ".csv")
Loop through the names, read in each file, perform the actions you want, return the data.frame DF_Merge, and store the results in a list:
O = lapply(L, function(x) {
  DF <- read.csv(x, header = TRUE, sep = ",")
  DF$Date <- as.character(DF$Date)                # Date comes in as a factor
  DF$Date <- as.Date(DF$Date, format = "%m/%d/%y")
  DF_Merge <- merge(all.dates.frame, DF, all = TRUE)
  DF_Merge$Bid.Yield.To.Maturity <- NULL
  return(DF_Merge)
})
Bind all the DF_Merge data.frames into one big data.frame:
do.call(rbind, O)
I'm guessing you need some kind of indicator, so this may be useful: create an indicator column based on the first 3 characters of each file name with rep(substring(L, 1, 3), each = 731).
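Put together, that could look like this (a sketch, assuming each merged file really does contribute 731 rows):
big <- do.call(rbind, O)
big$country <- rep(substring(L, 1, 3), each = 731)  # first 3 characters of each file name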
A dplyr solution (though untested since no reproducible example given):
library(dplyr)

file_list <- c("US%10y.csv", "UK%10y.csv", "GER%10y.csv", "JAP%10y.csv",
               "CHI%10y.csv", "SWI%10y.csv", "SOA%10y.csv", "BRA%10y.csv",
               "CAN%10y.csv", "AUS%10y.csv")

can_l <- lapply(file_list, read.csv)
can_l <- lapply(can_l, function(df) {
  df %>% mutate(Date = as.Date(as.character(Date), format = "%m/%d/%y"))
})

# Rows do need to match when column-binding
can_merge <- left_join(all.dates.frame, bind_cols(can_l))
can_merge <- can_merge %>%
  select(-Bid.Yield.To.Maturity)
One possible solution would be to read all the files into R as a list, and then use lapply to apply a function to every data file. For example:
# Create vector of file names in working directory
files <- list.files()
files <- files[grep("csv", files)]

# Create empty list
lst <- vector("list", length(files))

# Read files into the list
for (i in 1:length(files)) {
  lst[[i]] <- read.csv(files[i])
}

# Apply a function to the list
l <- lapply(lst, function(x) {
  x$Date <- as.Date(as.character(x$Date), format = "%m/%d/%y")
  return(x)
})
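To mirror the remaining per-file steps from the question, the merge against all.dates.frame and the column drop can go through the same list (a sketch using the question's own object names):
l <- lapply(l, function(x) {
  x <- merge(all.dates.frame, x, all = TRUE)  # pad every file to the same 731 dates
  x$Bid.Yield.To.Maturity <- NULL             # drop the unwanted column
  return(x)
})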
Hope it's helpful.
