Error looping in R to read multiple files

So, I'm creating a loop in R that reads multiple CSV files in a directory called "specdata" and then reports the mean of a particular column those files have in common. The function is shown in the next paragraph. The arguments you specify are the directory where the files are located, the column whose mean you want calculated, and an id sequence that tells it which files to read, selected by position through subsetting with [].
I asked a question about this function before and it was solved; now it runs and returns a result, but an incorrect one: it always gives NA or NaN when it should give a number.
pollutantmean <- function(directory, pollutant, id) {
  for (i in id) {
    archivo <- list.files(directory, full.names = TRUE)
    datapollution <- rbind(read.csv(archivo[i], header = TRUE))
    datamatrix <- data.matrix(datapollution)
    resultmean <- mean(datamatrix[pollutant], na.rm = TRUE)
  }
  print(resultmean)
}
Why is it not working? My theory is that I'm applying rbind incorrectly.

It's difficult to provide more specific help due to the lack of sample data/code, but I see a couple of issues with your code.
There is no need to call list.files repeatedly inside the for loop.
In fact, there is no need for a for loop here, and it will be faster to do something like
archive <- list.files(directory, full.names = TRUE)
datapollution <- do.call(rbind, lapply(archive, read.csv))
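Putting those pieces together, a minimal sketch of the whole function might look like the following; this is only a sketch, assuming each csv has a column named after the pollutant and that id selects files by their position in the order list.files() returns them:

pollutantmean <- function(directory, pollutant, id) {
  # List the files once and keep only the ones selected by id
  archive <- list.files(directory, full.names = TRUE)[id]
  # Read each selected file and stack the rows into one data frame
  datapollution <- do.call(rbind, lapply(archive, read.csv))
  # Mean of the requested column, ignoring missing values
  mean(datapollution[[pollutant]], na.rm = TRUE)
}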
PS. For maximum help here on SO, it's always best to provide a minimal & reproducible example including sample data.

Related

For loop data frame error when working with character vectors

I am currently doing data science with R, and I generally write loops to access multiple files or objects at once. Normally this works without any problems, but recently a problem occurred when I tried to run the following code:
setwd(PROJECT_FOLDER)
climate_forcing <- c("cf-1", "cf-2", "cf-3", "cf-4")
# load all mean stacks from IM and create rasterstack
for (i in 1:NROW(climate_forcing)) {
  setwd(PROJECT_FOLDER)
  setwd(paste0("time frames mcor/X variable/IM/", climate_forcing[i], "/ncstack/"))
  file.names <- list.files(pattern = ".nc", recursive = T, full.names = F) # list all files with ".nc"
  stopwords <- c(".nc", "stack", "/dLAI") # stopwords
  names.short <- gsub(paste(stopwords, collapse = "|"), "", file.names)
  assign("names.short", paste0(names.short, climate_forcing[i]))
  for (j in 1:NROW(file.names)) {
    assign(paste0(names.short[j], "_stack"), stack(file.names[j]))
  }
}
Error message returned:
Error in data.frame(values = unlist(unname(x)), ind, stringsAsFactors = FALSE) :
arguments imply differing number of rows: 1, 0
I wrote this a while ago and ran it before, and I think it used to work, since the files created by a similar script are there.
Anyway, I did some testing and it seems that the error occurs in the inner for loop (the one over j). I am unsure what causes this bug, but it has to have something to do with "file.names" and "names.short", right? When I compare them, their properties appear to be identical, which I expected, since I create the latter from the former. The reason I am creating them like this is that I want to create objects reading out the corresponding files from file.names.
The error I get refers to data.frame, which confuses me because I'm working with character vectors here.
Maybe somebody with more experience can figure this issue out.
Thanks for any help and if there are any questions I will try to answer them.
All right, it turns out something was wrong with the R packages; I reinstalled and reloaded them (raster) and now it works. Thanks to everyone for your contributions!
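As a side note, one way to avoid juggling many assign()-created objects (a pattern the answers further down also warn about) is to collect the stacks in a named list. This is only a hypothetical sketch, reusing file.names and names.short from the loop above and assuming the raster package is loaded:

# Build all stacks for the current climate forcing in one named list
stacks <- setNames(
  lapply(file.names, stack),
  paste0(names.short, "_stack")
)
# Individual stacks can then be looked up by name, e.g. stacks[["..._stack"]]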

Calling files with c(x:y)

I have a large number of files (GB in size). I want to run a for loop in which I read some files, do some processing that creates new files, bind them together, and save the result.
AA <- c(1, 6)
BB <- c(5, 10)
for (i in length(AA)) {
  listofnames <- list.files(pattern = "*eng")
  listofnames <- listofnames[c(paste(AA[i], BB[i], sep = ":"))]
  listoffiles <- lapply(listofnames, readRDS)
}
But listofnames ends up as NA. What am I doing wrong?
It took me a while looking at your code to realize that you were actually trying to construct a character representation of the expression 1:5 that was supposed to index a vector by position. This is very wrong; you can't just paste together arbitrary R commands/expressions and expect to drop them into your code wherever. (Technically, there are tools that do that sort of thing, but they are discouraged.)
Probably you're looking to do something closer to:
listofnames <- list.files(pattern = "*eng")
ind <- rep(1:5, each = 5, length.out = length(listofnames))
listofnames_split <- split(listofnames, ind)
for (i in seq_along(listofnames_split)) {
  my_data <- lapply(listofnames_split[[i]], readRDS)
  # Do processing here
  # ...
  rm(my_data) # Assuming memory really is a problem
}
But I'm just sketching out hypothetical code here; I can't really match it to your exact situation since your example isn't fully fleshed out.
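If the intent really is to take files AA[i] through BB[i] on each pass, numeric indexing does that directly. A rough, untested sketch along those lines (assuming the readRDS files sit in the working directory):

AA <- c(1, 6)
BB <- c(5, 10)
listofnames <- list.files(pattern = "eng")
for (i in seq_along(AA)) {
  # Index by position with a numeric sequence, not a pasted string like "1:5"
  batch_names <- listofnames[AA[i]:BB[i]]
  batch_data <- lapply(batch_names, readRDS)
  # ... process and save batch_data here ...
  rm(batch_data)
}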

R program does not output

I'm new to R and programming and am taking a Coursera course. I've asked in their forums, but nobody there seems able to provide an answer. To be clear, I'm trying to determine why this does not produce output.
When I first wrote the program, I was getting accurate output, but after I tried to upload it, something went wonky. Rather than producing any output with [1], [2], etc. when I run the program from RStudio, I only get the blue + continuation prompts, but no errors, and anything I change still does not produce an output.
I tried with a previous version of R, and reinstalled the most recent version 3.2.1 for Windows.
What I've done:
Set the correct working directory through RStudio
pol <- function(directory, pol, id = 1:332) {
  files <- list.files("specdata", full.names = TRUE);
  data <- data.frame();
  for (i in ID) {
    data <- rbind(data, read.csv(files_list[i]))
  }
  subset <- subset(data, ID %in% id);
  polmean <- mean(subset[pol], na.rm = TRUE);
  polmean("specdata", "sulfate", 1:10)
  polmean("specdata", "nitrate", 70:72)
  polmean("specdata", "nitrate", 23)
}
Can someone please provide some direction or debugging help?
When I adjust the code, the following errors tend to appear:
ID not found
Missing or unexpected } (although I've matched them all).
The updated code is as follows, if I'm understanding correctly:
data <- data.frame();
files <- files[grepl(".csv",files)]
pollutantmean <- function(directory, pollutant, id = 1:332) {
pollutantmean <- mean(subset1[[pollutant]], na.rm = TRUE);
}
Looks like you haven't declared what ID is (I assume: a vector of numbers)?
Also, using 'subset' as a variable name while it's also a function, and pol as both a function name and the name of one of that function's arguments, is just asking for trouble...
And I think there is a missing ")" in your for-loop.
EDIT
So the way I understand it now, you want to do a couple of things.
1. Read in a bunch of files, which you'll use multiple times without changing them.
2. Get some mean value out of those files, under different conditions.
Here's how I would do it.
Since you only want to read the data in once, you don't really need a function for this (you can have one, but I think it's overkill for now). You correctly have code that makes a vector of the file names and then loops over them, rbinding them to each other. The problem is that this can become very slow. Check here. Make sure your directory only contains files that you want to read in, so no R scripts or other stuff. A way (not 100% foolproof) to do this is files <- files[grepl(".csv", files)], which makes sure you only keep the csv's (grepl checks whether a certain string is a substring of another and returns a boolean; the [] then only keeps the elements for which TRUE was returned).
Next, there is 'a thing you want to do multiple times', namely getting out the mean values. This is where you'd use a function. Apparently you want to get the mean for different types of pollution, restricted to certain IDs.
Let's assume that step 1 has given you a dataframe df with a column named Type for the type of pollution and a column called Id that somehow represents a sort of ID (substitute the actual names from your script; if you don't have a column for ID, I'll edit the answer later on). Now you want a function
polmean <- function(type, id) {
# some code that returns the mean of a restricted version of df
}
This is all you need. You write the code that generates df, then you write a function that gets what you want out of that dataframe, and then you call it in whatever circumstances you need (the three polmean calls at the end of your original code, but now without the first argument, since you no longer need it).
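To make this concrete, here is a rough sketch of the whole approach, not the course's official solution; it assumes the combined data has an ID column and one column per pollutant (e.g. sulfate, nitrate), as the original code and calls suggest, rather than the Type column assumed above:

# Step 1: read every csv in the directory once into a single data frame
files <- list.files("specdata", full.names = TRUE)
files <- files[grepl(".csv", files)]   # keep only the csv files
df <- do.call(rbind, lapply(files, read.csv))
# Step 2: a small reusable function; the column names ID, sulfate and
# nitrate are assumptions based on the question, not verified here
polmean <- function(pollutant, id) {
  mean(df[df$ID %in% id, pollutant], na.rm = TRUE)
}
polmean("sulfate", 1:10)
polmean("nitrate", 70:72)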
Ok - I finally solved this. Thanks for the help.
I didn't need to call "specdata" in line 2; the directory argument in line 1 already referred to the correct directory.
My for/in statement needed to refer to the id in the first line, not the ID in the dataset. The for/in statement doesn't appear to need to be indented (but it looks cleaner).
I did not need a subset
The last 3 lines calling pollutantmean did not need to be part of the function. They are used in the R console to call the results one by one.

Read, process and export analysis results from multiple .csv files in R

I have a bunch of CSV files and I would like to perform the same analysis (in R) on the data within each file. Firstly, I assume each file must be read into R (as opposed to running a function on the CSV and providing output, like a sed script).
What is the best way to input numerous CSV files to R, in order to perform the analysis and then output separate results for each input?
Thanks (btw I'm a complete R newbie)
You could go for Sean's option, but it's going to lead to several problems:
You'll end up with a lot of unrelated objects in the environment, with the same name as the file they belong to. This is a problem because...
For loops can be pretty slow, and because you've got this big pile of unrelated objects, you're going to have to rely on for loops over the filenames for each subsequent piece of analysis - otherwise, how the heck are you going to remember what the objects are named so that you can call them?
Calling objects by pasting their names in as strings - which you'll have to do, because, again, your only record of what the object is called is in this list of strings - is a real pain. Have you ever tried to call an object when you can't write its name in the code? I have, and it's horrifying.
A better way of doing it might be with lapply().
# List files
filelist <- list.files(pattern = "*.csv")
# Now we use lapply to perform a set of operations
# on each entry in the list of filenames.
to_dispose_of <- lapply(filelist, function(x) {
  # Read in the file specified by 'x' - an entry in filelist
  data.df <- read.csv(x, skip = 1, header = TRUE)
  # Store the filename, minus .csv. This will be important later.
  filename <- substr(x = x, start = 1, stop = (nchar(x) - 4))
  # Your analysis work goes here. You only have to write it out once
  # to perform it on each individual file.
  ...
  # Eventually you'll end up with a data frame or a vector of analysis
  # to write out. Great! Since you've kept the value of x around,
  # you can do that trivially
  write.table(x = data_to_output,
              file = paste0(filename, "_analysis.csv"),
              sep = ",")
})
And done.
You can try the following code after putting all the csv files in the same directory.
names <- list.files(pattern = "*.csv")  # csv file names
for (i in 1:length(names)) {
  assign(names[i], read.csv(names[i], skip = 1, header = TRUE))
}
Hope this helps!

merge multiple files with different rows in R

I know that this question has been asked previously, but answers to the previous posts cannot seem to solve my problem.
I have dozens of tab-delimited .txt files. Each file has two columns ("pos", "score"). I would like to compile all of the "score" columns into one file with multiple columns. The number of rows in each file varies and they are irrelevant for the compilation.
If someone could direct me on how to accomplish this, preferably in R, it would be very helpful.
Alternatively, my ultimate goal is to read the median and mean of the "score" column from each file. So if this could be accomplished, with or without compiling the files, it would be even more helpful.
Thanks.
UPDATE:
As appealing as the idea of personal code ninjas is, I understand this will have to remain a fantasy. Sorry for not being explicit.
I have tried lapply and Reduce, e.g.,
> files <- dir(pattern="X.*\\.txt$")
> File_list <- lapply(filesToProcess,function(score)
+ read.table(score,header=TRUE,row.names=1))
> File_list <- lapply(files,function(z) z[c("pos","score")])
> out_file <- Reduce(function(x,y) {merge(x,y,by=c("pos"))},File_list)
which I know doesn't really make sense, considering the files have varying numbers of rows. I have also tried plyr
> files <- list.files()
> out_list <- llply(files,read.table)
As well as cbind and rbind. Usually I get an error message, because the row numbers don't match up or I just get all the "score" data compiled into one column.
The advice on similar posts (e.g., Merging multiple csv files in R, Simultaneously merge multiple data.frames in a list, and Merge multiple files in a list with different number of rows) has not been helpful.
I hope this clears things up.
This problem could be solved in two steps:
Step 1. Read the data from your files into a list of data frames, where files is a vector of file names. If you need to add extra arguments to read.csv, add them as shown below. See ?lapply for details.
list_of_dataframes <- lapply(files, read.csv, stringsAsFactors = FALSE)
Step 2. Calculate means for each data frame:
means <- sapply(list_of_dataframes, function(df) mean(df$score))
Of course, you can always do it in one step like this:
means <- sapply(files, function(filename) mean(read.csv(filename)$score))
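Since the question also asks for medians, the same pattern extends naturally; a small sketch, assuming each file has a score column as described (this returns a two-row matrix with one column per file):

stats <- sapply(list_of_dataframes, function(df) {
  c(mean = mean(df$score, na.rm = TRUE),
    median = median(df$score, na.rm = TRUE))
})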
I think you want something like this:
all_data <- do.call(rbind, lapply(files, function(f) {
  cbind(read.csv(f), file_name = f)
}))
You can then do whatever "by" type of action you like. Also, don't forget to adjust the various read.csv options to suit your needs.
E.g. once you have the above, you can do the following (and much more):
library(data.table)
dt = data.table(all_data)
dt[, list(mean(score), median(score)), by = file_name]
A small note: you could also use data.table's fread to read the files in, instead of read.table and its derivatives, which would be much faster; and while we're at it, use rbindlist instead of do.call(rbind, ...).
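A rough sketch of that variant, assuming tab-delimited files with pos and score columns as the question describes:

library(data.table)
files <- list.files(pattern = "\\.txt$")
# fread auto-detects the tab delimiter; rbindlist stacks the per-file tables
# and idcol records which file each row came from
dt <- rbindlist(setNames(lapply(files, fread), files), idcol = "file_name")
dt[, .(mean_score = mean(score), median_score = median(score)), by = file_name]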
