R: How to read different files into a two-dim vector? - r

I have an R newbie question about storing data.
I have 3 different files, each of which contains one column. Now I would like to read them into a structure x so that x[1] is the column of the first file, x[2] is the column of the second file, etc. So x would be a two-dim vector.
I tried this, but it wants x[f] to be a single number rather than a whole vector:
files <- c("dir1/data.txt", "dir2b/data.txt", "dir3/data2.txt")
for(f in 1:length(files)) {
x[f] <- scan(files[f])
}
How can I fix this?

Lists should help. Try
x <- vector(mode="list",length=3)
before the loop and then assign as
x[[f]] <- read.table(files[f])
I would recommend against scan; you should have better luck with read.table() and its cousins like read.csv.
Once you have x filled, you can combine as e.g. via
y <- do.call(cbind, x)
which applies cbind -- a by-column combiner -- to all elements of the list x.

Related

Converting column names to upper case in a list of data frames using lapply explanation

This is probably a very simple problem but I have been struggling to search for this issue. Basically, I am using lapply to convert the column names to upper in a list of dataframes. My first attempt did not work, however adding ;x works. What exactly is going on?
This does not work:
df.list <- lapply(df.list,function(x) colnames(x) <- toupper(colnames(x)))
This does:
df.list <- lapply(df.list,function(x) {colnames(x) <- toupper(colnames(x));x})
Since you are modifying the object x (or in this case only the colnames of x) inside the function definition, you have to return the modified object x. This is happening by using ;x which can be read as a new line only returning the object x

Lists and matrix using sapply

I have a perhaps basic questions and I have searched on the web. I have a problem reading files. Though, I managed to get to read my files, following #Konrad suggestions, which I appreciate: How to get R to read in files from multiple subdirectories under one large directory?
It is a similar problem, however, I have not resolved it.
My problem:
I have large number of files of with same name ("tempo.out") in different folders. This tempo.out has 5 columns/headers. And they are all the same format with 1048 lines and 5 columns:
id X Y time temp
setwd("~/Documents/ewat")
dat.files <- list.files(path="./ress",
recursive=T,
pattern="tempo.out"
,full.names=T)
readDatFile <- function(f) {
dat.fl <- read.table(f)
}
data.filesf <- sapply(dat.files, readDatFile)
# I might not have the right sintax in sub5:
subs5 <- sapply(data.filesf,`[`,5)
matr5 <- do.call(rbind, subs5)
probs <- c(0.05,0.1,0.16,0.25,0.5,0.75,0.84,0.90,0.95,0.99)
q <- rowQuantiles(matr5, probs=probs)
print(q)
I want to extract the fifth column (temp) of each of those thousands of files and make calculations such as quantiles.
I tried first to read all subfiles in "ress"
The latter gave no error, but my main problem is the "data.filesf" is not a matrix but list, and actually the 5th column is not what I expected. Then the following:
matr5 <- do.call(rbind, subs5)
is also not giving the required values/results.
What could be the best way to get columns into what will become a huge matrix?
Try
lapply(data.filef,[,,5)
Hope this will help
Consider extending your defined function, readDatFile, to extract fifth column, temp, and assign directly to matrix with sapply or vapply (since you know ahead the needed structure -numeric matrix length equal to nrows or 1048). Then, run needed rowQuantiles:
setwd("~/Documents/ewat")
dat.files <- list.files(path="./ress",
recursive=T,
pattern="tempo.out",
full.names=T)
readDatFile <- function(f) read.table(f)$temp # OR USE read.csv(f)[[5]]
matr5 <- sapply(dat.files, readDatFile, USE.NAMES=FALSE)
# matr5 <- vapply(dat.files, readDatFile, numeric(1048), USE.NAMES=FALSE)
probs <- c(0.05,0.1,0.16,0.25,0.5,0.75,0.84,0.90,0.95,0.99)
q <- rowQuantiles(matr5, probs=probs)

How to find common variables in a list of datasets & reshape them in R?

setwd("C:\\Users\\DATA")
temp = list.files(pattern="*.dta")
for (i in 1:length(temp)) assign(temp[i], read.dta13(temp[i], nonint.factors = TRUE))
grep(pattern="_m", temp, value=TRUE)
Here I create a list of my datasets and read them into R, I then attempt to use grep in order to find all variable names with pattern _m, obviously this doesn't work because this simply returns all filenames with pattern _m. So essentially what I want, is my code to loop through the list of databases, find variables ending with _m, and return a list of databases that contain these variables.
Now I'm quite unsure how to do this, I'm quite new to coding and R.
Apart from needing to know in which databases these variables are, I also need to be able to make changes (reshape them) to these variables.
First, assign will not work as you think, because it expects a string (or character, as they are called in R). It will use the first element as the variable (see here for more info).
What you can do depends on the structure of your data. read.dta13 will load each file as a data.frame.
If you look for column names, you can do something like that:
myList <- character()
for (i in 1:length(temp)) {
# save the content of your file in a data frame
df <- read.dta13(temp[i], nonint.factors = TRUE))
# identify the names of the columns matching your pattern
varMatch <- grep(pattern="_m", colnames(df), value=TRUE)
# check if at least one of the columns match the pattern
if (length(varMatch)) {
myList <- c(myList, temp[i]) # save the name if match
}
}
If you look for the content of a column, you can have a look at the dplyr package, which is very useful when it comes to data frames manipulation.
A good introduction to dplyr is available in the package vignette here.
Note that in R, appending to a vector can become very slow (see this SO question for more details).
Here is one way to figure out which files have variables with names ending in "_m":
# setup
setwd("C:\\Users\\DATA")
temp = list.files(pattern="*.dta")
# logical vector to be filled in
inFileVec <- logical(length(temp))
# loop through each file
for (i in 1:length(temp)) {
# read file
fileTemp <- read.dta13(temp[i], nonint.factors = TRUE)
# fill in vector with TRUE if any variable ends in "_m"
inFileVec[i] <- any(grepl("_m$", names(fileTemp)))
}
In the final line, names returns the variable names, grepl returns a logical vector for whether each variable name matches the pattern, and any returns a logical vector of length 1 indicating whether or not at least one TRUE was returned from grepl.
# print out these file names
temp[inFileVec]

Assign name to a substring in a loop importing raster files

I'm importing some raster files from a PostgreSQL connection into R in a loop. I want to assign my newly gained rasters automatically to a variable whose name is derived from the input variable like this: substring(crop, 12)
crop <- "efsa_capri_barley"
ras <- readGDAL(sprintf("PG:dbname='' host='' port='' user='' schema='' table='%s' mode=2", crop))
paste0(substring(crop, 12)) <- raster(ras, 1)
What function do I have to use that R recognizes the result of substring() as a character string and not as the function itself? I was thinking about paste() but it doesn't work.
Probably this question has already been asked but I couldn't find a proper answer.
Based on your description, assign is technically correct, but recommending it is bad advice.
If you are pulling in multiple rasters in a loop, best practice in R is to initialize a list to hold all the resulting rasters and name each list element accordingly. You can do this one at a time:
# n is number of rasters
raster_list <- vector("list",n)
for (i in seq_len(n)){
...
#crop[i] is the ith crop name
raster_list[[substring(crop[i],12)]] <- raster(...)
}
You can also set the names of each element of the list all at once via setNames. But you should try to avoid using assign pretty much at all costs.
If I understand your question correctly, you are looking for something like assign. For example you can try this:
assign(substring(crop, 12), raster(ras, 1))
To understand how assign works, you can check this code:
x <- 2
# x is now 2
var_to_assign <- "x"
assign(var_to_assign, 3)
# x is now set to 3
x
# 3
Does that give you what you want?

Applying a set of operations across several data frames in r

I've been learning R for my project and have been unable to google a solution to my current problem.
I have ~ 100 csv files and need to perform an exact set of operations across them. I've read them in as separate objects (which I assume is probably improper r style) but I've been unable to write a function that can loop through. Each csv is a dataframe that contain information, including a column with dates in decimal year form. I need to create 2 new columns containing year and day of year. I've figured out how to do it manually I would like to find a way to automate the process. Here's what I've been doing:
#setup
library(lubridate) #Used to check for leap years
df.00 <- data.frame( site = seq(1:10), date = runif(10,1980,2000 ))
#what I need done
df.00$doy <- NA # make an empty column which I'm going to place the day of the year
df.00$year <- floor(df.00$date) # grabs the year from the date column
df.00$dday <- df.00$date - df.00$year # get the year fraction. intermediate step.
# multiply the fraction year by 365 or 366 if it's a leap year to give me the day of the year
df.00$doy[which(leap_year(df.00$year))] <- round(df.00$dday[which(leap_year(df.00$year))] * 366)
df.00$doy[which(!leap_year(df.00$year))] <- round(df.00$dday[which(!leap_year(df.00$year))] * 365)
The above, while inelegant, does what I would like it to. However, I need to do this to the other data frames, df.01 - df.99. So far I've been unable to place it in a function or for loop. If I place it into a function:
funtest <- function(x) {
x$doy <- NA
}
funtest(df.00) does nothing. Which is what I would expect from my understanding of how functions work in r but if I wrap it up in a for loop:
for(i in c(df.00)) {
i$doy <- NA }
I get "In i$doy <- NA : Coercing LHS to a list" several times which tells me that the loop isn't treat the dataframe as a single unit but perhaps looking at each column in the frame.
I would really appreciate some insight on what I should be doing. I feel that I could have solved this easily using bash and awk but I would like to be less incompetent using r
the most efficient and direct way is to use a list.
Put all of your CSV's into one folder
grab a list of the files in that folder
eg: files <- dir('path/to/folder', full.names=TRUE)
iterativly read in all those files into a list of data.frames
eg: df.list <- lapply(files, read.csv, <additional args>)
apply your function iteratively over each data.frame
eg: lapply(df.list, myFunc, <additional args>)
Since your df's are already loaded, and they have nice convenient names, you can grab them easily using the following:
nms <- c(paste0("df.0", 0:9), paste0("df.", 10:99))
df.list <- lapply(nms, get)
Then take everything you have in the #what I need done portion and put inside a function, eg:
myFunc <- function(DF) {
# what you want done to a single DF
return(DF)
}
And then lapply accordingly
df.list <- lapply(df.list, myFunc)
On a separate notes, regarding functions:
The reason your funTest "does nothing" is that it you are not having it return anything. That is to say, it is doing something, but when it finishes doing that, then it does "nothing".
You need to include a return(.) statement in the function. Alternatively, the output of last line of the function, if not assigned to an object, will be used as the return value -- but this last sentence is only loosely true and hence one needs to be cautious. The cleanest option (in my opinion) is to use return(.)
regarding the for loop over the data.frame
As you observed, using for (i in someDataFrame) {...} iterates over the columns of the data.frame.
You can iterate over the rows using apply:
apply(myDF, MARGIN=1, function(x) { x$doy <- ...; return(x) } ) # dont forget to return

Resources