Lists and matrix using sapply - r

I have a perhaps basic question, and I have searched the web. I have a problem reading files. I did manage to read my files by following @Konrad's suggestions, which I appreciate: How to get R to read in files from multiple subdirectories under one large directory?
It is a similar problem, but I have not resolved it.
My problem:
I have a large number of files with the same name ("tempo.out") in different folders. Each tempo.out has the same format, with 1048 lines and 5 columns (with headers):
id X Y time temp
setwd("~/Documents/ewat")
dat.files <- list.files(path="./ress",
recursive=T,
pattern="tempo.out"
,full.names=T)
readDatFile <- function(f) {
dat.fl <- read.table(f)
}
data.filesf <- sapply(dat.files, readDatFile)
# I might not have the right syntax in subs5:
subs5 <- sapply(data.filesf,`[`,5)
matr5 <- do.call(rbind, subs5)
probs <- c(0.05,0.1,0.16,0.25,0.5,0.75,0.84,0.90,0.95,0.99)
q <- rowQuantiles(matr5, probs=probs)
print(q)
I want to extract the fifth column (temp) from each of those thousands of files and run calculations such as quantiles on it.
I first tried to read all the files under "ress".
This gave no error, but my main problem is that "data.filesf" is not a matrix but a list, and the 5th column is not what I expected. Then the following:
matr5 <- do.call(rbind, subs5)
also does not give the required results.
What would be the best way to get the columns into what will become a huge matrix?

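Try the following (it assumes data.filesf is a list of data frames, for example one built with lapply rather than sapply):
lapply(data.filesf, `[`, , 5)
Hope this helps.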
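A minimal sketch of that kind of extraction, assuming the files have been read into a plain list of data frames. The two toy data frames below are made up for illustration and only mimic the id/X/Y/time/temp layout from the question.

```r
# Two made-up data frames standing in for the files read from disk
dfs <- list(
  a = data.frame(id = 1:3, X = 0, Y = 0, time = 1:3, temp = c(10.5, 11.0, 11.5)),
  b = data.frame(id = 1:3, X = 1, Y = 1, time = 1:3, temp = c(20.5, 21.0, 21.5))
)

# `[[` with 5 pulls the fifth column of each data frame as a plain vector
temps <- lapply(dfs, `[[`, 5)

# sapply() simplifies equal-length vectors into a matrix, one column per file
temp_matrix <- sapply(dfs, `[[`, 5)
dim(temp_matrix)
```

Using `[[` rather than `[` returns the column itself instead of a one-column data frame, which is what lets sapply build a numeric matrix.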

Consider extending your function, readDatFile, to extract the fifth column, temp, and build the matrix directly with sapply or vapply (since you know the needed structure ahead of time: one numeric vector per file, with length equal to the number of rows, 1048). Then run rowQuantiles:
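library(matrixStats) # provides rowQuantiles()
setwd("~/Documents/ewat")
dat.files <- list.files(path="./ress",
                        recursive=TRUE,
                        pattern="tempo.out",
                        full.names=TRUE)
# header=TRUE is needed so that the temp column can be selected by name
readDatFile <- function(f) read.table(f, header=TRUE)$temp # OR USE read.table(f, header=TRUE)[[5]]
matr5 <- sapply(dat.files, readDatFile, USE.NAMES=FALSE)
# vapply also checks the length; adjust 1048 to the number of data rows per file
# matr5 <- vapply(dat.files, readDatFile, numeric(1048), USE.NAMES=FALSE)
probs <- c(0.05,0.1,0.16,0.25,0.5,0.75,0.84,0.90,0.95,0.99)
q <- rowQuantiles(matr5, probs=probs)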
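The same idea can be tried end to end without the real data. The sketch below writes three small made-up tempo.out files into a temporary folder (the folder name ress_demo, the run1..run3 subfolders, and the helper read_temp are all hypothetical), builds the matrix with vapply, and computes row-wise quantiles with base R's apply and quantile instead of rowQuantiles, so it needs no extra package.

```r
# Write three small made-up files so the sketch is self-contained
tmp_dir <- file.path(tempdir(), "ress_demo")
for (run in c("run1", "run2", "run3")) {
  dir.create(file.path(tmp_dir, run), recursive = TRUE, showWarnings = FALSE)
  demo <- data.frame(id = 1:4, X = 0, Y = 0, time = 1:4, temp = rnorm(4, mean = 15))
  write.table(demo, file.path(tmp_dir, run, "tempo.out"),
              row.names = FALSE, quote = FALSE)
}

files <- list.files(tmp_dir, pattern = "^tempo\\.out$",
                    recursive = TRUE, full.names = TRUE)

# header = TRUE so that the column can be addressed by name
read_temp <- function(f) read.table(f, header = TRUE)$temp

# vapply() fails loudly unless every file yields exactly 4 numeric values
temp_matrix <- vapply(files, read_temp, numeric(4), USE.NAMES = FALSE)

# Row-wise quantiles in base R; t() gives one row per row of temp_matrix
probs <- c(0.05, 0.5, 0.95)
row_q <- t(apply(temp_matrix, 1, quantile, probs = probs))
dim(row_q)
```

The matrix has one column per file and one row per line of data, so a quantile taken across a row summarises the same record position over all files.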


How do I make all my dataframes' variables in my environment numeric?

I have a lot of data frames in my R environment, and I want to apply as.numeric() to all of the variables in those data frames and overwrite them. I do not know how to address all of them.
The following is my attempt, but ls() seemingly just writes the name to x:
for (i in 1:length(ls())){
x <- ls()[i]
for (i in 1:length(x)){
x[i] <- as.numeric(x[i])
}
}
So, there were two helpful answers to my question: one that was later deleted, and another by @Henrik.
The deleted one followed my approach and converts all data frames in the global environment whose names contain a "V" (as in my example) to numeric. This is the code:
res <- lapply(mget(ls(pattern = 'V')), \(x) {
x[] <- lapply(x, as.numeric)
return(x)
})
list2env(res, .GlobalEnv)
# Check
str(VA01.000306__ft2)
The second approach uses a list instead of multiple objects: I store my csv files in a list when importing them. This is the csv-to-list import:
F_EB_names <- list.files(pattern="\\.csv$") # store the data in a list
F_EB <- lapply(F_EB_names, read.csv2)
names(F_EB) <- gsub(".wav.csv","_ft2",F_EB_names, fixed=TRUE)
And this is the conversion to numeric:
F_EB <- type.convert(F_EB, as.is=TRUE) # Conversion
str(F_EB) # Check
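For comparison, here is a minimal sketch of the explicit per-column conversion applied to a list of data frames. The data frames and the helper name to_numeric are made up for illustration; the columns are deliberately stored as character so the effect is visible.

```r
# Made-up data frames whose columns were read in as character
df_list <- list(
  first  = data.frame(a = c("1", "2"), b = c("3.5", "4.5"), stringsAsFactors = FALSE),
  second = data.frame(a = c("10", "20"), b = c("0.1", "0.2"), stringsAsFactors = FALSE)
)

# Hypothetical helper: convert every column of one data frame to numeric
to_numeric <- function(df) {
  df[] <- lapply(df, as.numeric)   # df[] keeps the data.frame structure
  df
}

df_list <- lapply(df_list, to_numeric)
str(df_list)
```

Unlike type.convert, which picks a type for each column, as.numeric forces every column to numeric and turns anything that cannot be parsed into NA with a warning.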
Thank you both for the help.

For loop returns last result

I have a small number of csv files, each containing two columns with numeric values. I want to write a for loop that reads the files, sums the columns, and stores the sum totals for each csv in a numeric vector. This is the closest I've come:
allfiles <- list.files()
for (i in seq(allfiles)) {
total <- numeric()
total[i] <- sum(subset(read.csv(allfiles[i]), select=Gift.1), subset(read.csv(allfiles[i]), select=Gift.2))
total
}
My result is all NAs except for a value for the last file. I understand that I'm overwriting total each time the loop body executes, and I think I need to do something with indexing.
The first problem is that you are not pre-allocating total to the right length (or properly appending to it). Regardless, I recommend against that method.
There are several ways to do this, but the R-onic way (my term, based on pythonic ... I know, it doesn't flow well) is based on vectors/lists.
alldata <- sapply(allfiles, read.csv, simplify = FALSE)
totals <- sapply(alldata, function(a) sum(subset(a, select=Gift.1), subset(a, select=Gift.2)))
I often like to do that, keeping the "raw/unaltered" data in one list and then repeatedly extracting from it. For instance, if the files are huge and reading them takes a non-trivial amount of time, and you later realize you also need Gift.3, then doing it your way means re-reading the entire dataset. Using my method, you just update the second sapply to include the change and rerun it on the already-loaded data. (Most of my rationale is based on untrusted data, portions that are typically unused, or other factors that may not apply to you.)
If you really wanted to reduce the code to a single call, something like:
totals <- sapply(allfiles, function(fn) {
x <- read.csv(fn)
sum(subset(x, select=Gift.1), subset(x, select=Gift.2))
})
allfiles <- list.files()
total <- numeric()
for (i in seq(allfiles)) {
total[i] <- sum(subset(read.csv(allfiles[i]), select=Gift.1), subset(read.csv(allfiles[i]), select=Gift.2))
}
total
If possible, give total a known length beforehand, i.e. total <- numeric(length(allfiles)).
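A self-contained sketch of the pre-allocated loop, assuming two made-up csv files written to a temporary folder (the folder name gift_demo and the file contents are invented for the example). The key point is that total is created once, before the loop, with one slot per file.

```r
# Two made-up csv files in a temporary folder
demo_dir <- file.path(tempdir(), "gift_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(Gift.1 = c(1, 2), Gift.2 = c(3, 4)),
          file.path(demo_dir, "a.csv"), row.names = FALSE)
write.csv(data.frame(Gift.1 = c(10, 20), Gift.2 = c(30, 40)),
          file.path(demo_dir, "b.csv"), row.names = FALSE)

allfiles <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Allocate the result once, outside the loop
total <- numeric(length(allfiles))
for (i in seq_along(allfiles)) {
  x <- read.csv(allfiles[i])             # read each file only once
  total[i] <- sum(x$Gift.1, x$Gift.2)
}
total
```

Reading each file into x first also avoids calling read.csv twice per iteration, as the original loop does.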

apply multiple functions in sapply

I have a set of .stat files in the /tmp directory.
sample:
a.stat=>
abc,10
abc,20
abc,30
b.stat=>
xyz,10
xyz,30
xyz,70
and so on
I need to find the summary of all the .stat files.
Currently I am using
filelist<-list.files(path="/tmp/",pattern=".stat")
data<-sapply(paste("/tmp/",filelist,sep=''), read.csv, header=FALSE)
However, I need to apply summary to all the files being read. Put simply, for n .stat files I need the summary of the 2nd column.
Using
data<-sapply(paste("/tmp/",filelist,sep=''), summary, read.csv, header=FALSE) does not work: it gives me a summary with class character, which is not what I intend.
sapply(filelist, function(filename){df <- read.csv(filename, header=F);print(summary(df[,2]))}) works fine. However, my overall objective is to find values that are more than 2 standard deviations away from the mean on either side (outliers). So I use sd, but at the same time I need to check whether all values in the file currently being read fall within the 2 SD range.
To apply multiple functions at once:
f <- function(x){
list(sum(x),mean(x))
}
sapply(x, f)
In your case you want to apply them sequentially: first read the csv data, then run summary:
sapply(lapply(paste("/tmp/",filelist,sep=''), read.csv, header=FALSE), summary)
To run summary on a particular column only, change the outer sapply function from summary to function(x) summary(x[[2]]).
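For the stated goal of finding values more than 2 standard deviations from the mean, one possible sketch is a small helper that returns the statistics and the flagged values together. The helper name two_sd_check and the sample values are made up; the values stand in for the second column of a single .stat file.

```r
# Made-up values standing in for the second column of one .stat file
values <- c(10, 12, 11, 13, 12, 11, 10, 12, 13, 60)

# Hypothetical helper: statistics plus the values outside mean +/- 2 SD
two_sd_check <- function(x) {
  m <- mean(x)
  s <- sd(x)
  out <- x[abs(x - m) > 2 * s]
  list(mean = m, sd = s, outliers = out, all_within = length(out) == 0)
}

res <- two_sd_check(values)
res$outliers     # values flagged as outliers
res$all_within   # TRUE only if every value is inside the 2 SD range
```

Applied per file, the helper could be passed to lapply over the list of data frames, for example as function(d) two_sd_check(d[[2]]), which keeps the read step and the check step separate.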
For short functions that you don't want to save in the environment, the definition can also go directly inside the sapply call. For @flxflks's example:
sapply(df, function(x) c(min = min(x), avg = mean(x)))
Adding to @Jangorecki's answer: I changed the function to return a vector instead of a list. Only then did it work for me. I am unsure why my function worked and the other did not.
f <- function(x){
c(min = min(x), avg = mean(x))
}
sapply(df, f)
I found the solution at https://www.r-bloggers.com/applying-multiple-functions-to-data-frame/

Applying a set of operations across several data frames in r

I've been learning R for my project and have been unable to google a solution to my current problem.
I have ~100 csv files and need to perform the exact same set of operations on each of them. I've read them in as separate objects (which I assume is probably improper R style), but I've been unable to write a function that can loop through them. Each csv is a data frame that contains information, including a column with dates in decimal-year form. I need to create 2 new columns containing the year and the day of year. I've figured out how to do it manually, but I would like to find a way to automate the process. Here's what I've been doing:
#setup
library(lubridate) #Used to check for leap years
df.00 <- data.frame( site = seq(1:10), date = runif(10,1980,2000 ))
#what I need done
df.00$doy <- NA # make an empty column which I'm going to place the day of the year
df.00$year <- floor(df.00$date) # grabs the year from the date column
df.00$dday <- df.00$date - df.00$year # get the year fraction. intermediate step.
# multiply the fraction year by 365 or 366 if it's a leap year to give me the day of the year
df.00$doy[which(leap_year(df.00$year))] <- round(df.00$dday[which(leap_year(df.00$year))] * 366)
df.00$doy[which(!leap_year(df.00$year))] <- round(df.00$dday[which(!leap_year(df.00$year))] * 365)
The above, while inelegant, does what I would like it to do. However, I need to do this to the other data frames, df.01 - df.99. So far I've been unable to place it in a function or a for loop. If I place it into a function:
funtest <- function(x) {
x$doy <- NA
}
funtest(df.00) does nothing, which is what I would expect from my understanding of how functions work in R. But if I wrap it up in a for loop:
for(i in c(df.00)) {
i$doy <- NA }
I get "In i$doy <- NA : Coercing LHS to a list" several times, which tells me that the loop isn't treating the data frame as a single unit but is perhaps looking at each column in the frame.
I would really appreciate some insight into what I should be doing. I feel that I could have solved this easily using bash and awk, but I would like to be less incompetent using R.
The most efficient and direct way is to use a list.
Put all of your CSVs into one folder
Grab a list of the files in that folder
eg: files <- dir('path/to/folder', full.names=TRUE)
Iteratively read all those files into a list of data.frames
eg: df.list <- lapply(files, read.csv, <additional args>)
Apply your function iteratively over each data.frame
eg: lapply(df.list, myFunc, <additional args>)
Since your data frames are already loaded, and they have nice convenient names, you can grab them easily using the following:
nms <- c(paste0("df.0", 0:9), paste0("df.", 10:99))
df.list <- lapply(nms, get)
Then take everything you have in the "#what I need done" portion and put it inside a function, eg:
myFunc <- function(DF) {
# what you want done to a single DF
return(DF)
}
And then use lapply accordingly:
df.list <- lapply(df.list, myFunc)
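As a sketch of what the filled-in function might look like, the block below wraps the year and day-of-year steps from the question into one function and applies it to a list. The names is_leap and add_year_doy are hypothetical, and the leap-year test is a plain base-R stand-in for lubridate::leap_year so that the example runs without extra packages.

```r
# Hypothetical base-R leap-year test, used here instead of lubridate::leap_year
is_leap <- function(y) (y %% 4 == 0 & y %% 100 != 0) | (y %% 400 == 0)

# Hypothetical version of myFunc: adds year and doy columns to one data frame
add_year_doy <- function(DF) {
  DF$year <- floor(DF$date)                  # calendar year
  frac    <- DF$date - DF$year               # fraction of the year elapsed
  DF$doy  <- round(frac * ifelse(is_leap(DF$year), 366, 365))
  DF                                         # hand back the modified copy
}

# Two made-up data frames in place of df.00 ... df.99
df.list <- list(
  data.frame(site = 1:2, date = c(1980.5, 1981.5)),
  data.frame(site = 1:2, date = c(2000.25, 1999.0))
)
df.list <- lapply(df.list, add_year_doy)
df.list[[1]]
```

The vectorised ifelse replaces the two which() assignments from the question, so leap and non-leap years are handled in a single expression.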
On a separate note, regarding functions:
The reason your funtest "does nothing" is that you are not having it return anything. That is to say, it is doing something, but when it finishes doing that, it does "nothing" with the result.
You need to include a return(.) statement in the function. Alternatively, the value of the last expression in the function is used as the return value, but the value of an assignment is returned invisibly, so one needs to be cautious. The cleanest option (in my opinion) is to use return(.).
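The difference can be seen in a small sketch with a toy data frame (the function names no_return and with_return are made up). R functions work on a copy of their argument, so the caller only sees the change if the modified copy is returned and reassigned.

```r
df <- data.frame(site = 1:3)

# The assignment is the last expression, so its value (NA) is returned
# invisibly and the modified copy of x is discarded
no_return <- function(x) {
  x$doy <- NA
}

# Returning the data frame hands the modified copy back to the caller
with_return <- function(x) {
  x$doy <- NA
  return(x)
}

no_return(df)            # prints nothing; df is unchanged
names(df)
df <- with_return(df)    # reassign to keep the change
names(df)
```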
Regarding the for loop over the data.frame:
As you observed, using for (i in someDataFrame) {...} iterates over the columns of the data.frame.
You can iterate over the rows using apply:
apply(myDF, MARGIN=1, function(x) { x$doy <- ...; return(x) } ) # don't forget to return

R: How to read different files into a two-dim vector?

I have an R newbie question about storing data.
I have 3 different files, each of which contains one column. Now I would like to read them into a structure x so that x[1] is the column of the first file, x[2] is the column of the second file, etc. So x would be a two-dim vector.
I tried this, but it wants x[f] to be a single number rather than a whole vector:
files <- c("dir1/data.txt", "dir2b/data.txt", "dir3/data2.txt")
for(f in 1:length(files)) {
x[f] <- scan(files[f])
}
How can I fix this?
Lists should help. Try
x <- vector(mode="list",length=3)
before the loop and then assign as
x[[f]] <- read.table(files[f])
I would recommend against scan; you should have better luck with read.table() and its cousins like read.csv.
Once you have x filled, you can combine its elements, e.g. via
y <- do.call(cbind, x)
which applies cbind -- a by-column combiner -- to all elements of the list x.
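Put together, the list-then-cbind approach might look like the sketch below. It writes three made-up single-column files to a temporary folder (the folder and file names are invented for the example) so that it can be run as is.

```r
# Three made-up single-column files in a temporary folder
demo_dir <- file.path(tempdir(), "cols_demo")
dir.create(demo_dir, showWarnings = FALSE)
files <- file.path(demo_dir, c("one.txt", "two.txt", "three.txt"))
for (k in seq_along(files)) writeLines(as.character(k * 1:4), files[k])

# Pre-allocate a list with one slot per file
x <- vector(mode = "list", length = length(files))
for (f in seq_along(files)) {
  x[[f]] <- read.table(files[f])   # each element is a one-column data frame
}

y <- do.call(cbind, x)             # 4 rows, one column per file
names(y) <- basename(files)        # label the columns by file name
y
```

Note the double brackets in x[[f]]: single brackets would try to assign a sub-list, whereas [[ ]] stores the whole data frame in slot f.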
