How to load several files into R without overwriting the existing files? - r

Hi, I would like to load several databases in .sas7bdat format into R. Each time a new database is loaded, I would like to display its name (e.g. file.sas7bdat -> file). I wrote some R code (shown below), but it does not work: I think it overwrites the existing database with the new one. I would be grateful for any suggestions on how to improve it.
getwd()
files<-list.files(pattern="*.sas7bdat")
for (i in 1:length(files)) {
data[i]<-read.sas7bdat(files[i])
}

I don't have any sas7bdat files handy, but this concept should translate across most of the read.* functions. You're on the right track with the for-loop, but you can create the list directly by using lapply() like so:
#Make a few CSV files
x <- matrix(rnorm(10), ncol = 2)
write.csv(x, "a.csv")
write.csv(x, "b.csv")
#Read them into a list
fileList <- lapply(list.files(pattern = "\\.csv$"), read.csv)
#check out what we ended up with
str(fileList)
#---
List of 2
$ :'data.frame': 5 obs. of 3 variables:
..$ X : int [1:5] 1 2 3 4 5
..$ V1: num [1:5] -0.451 -0.317 -1.225 0.445 -1.361
..$ V2: num [1:5] 0.489 -2.8154 0.5147 -0.0561 0.826
$ :'data.frame': 5 obs. of 3 variables:
..$ X : int [1:5] 1 2 3 4 5
..$ V1: num [1:5] -0.451 -0.317 -1.225 0.445 -1.361
..$ V2: num [1:5] 0.489 -2.8154 0.5147 -0.0561 0.826
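To also keep track of which file each element came from (the question asks for file.sas7bdat -> file), the list can be named after the files. This is a minimal sketch that again uses CSV files as a stand-in; the temporary directory and the names files and dataList are made up for illustration, and read.sas7bdat() should slot in where read.csv() is used.

```r
# Work in a temporary directory so nothing in the working directory is touched
tmp <- tempfile("demo_")
dir.create(tmp)
x <- matrix(rnorm(10), ncol = 2)
write.csv(x, file.path(tmp, "a.csv"), row.names = FALSE)
write.csv(x, file.path(tmp, "b.csv"), row.names = FALSE)

files <- list.files(tmp, pattern = "\\.csv$", full.names = TRUE)
# Strip directory and extension: ".../a.csv" -> "a"
names(files) <- tools::file_path_sans_ext(basename(files))

# lapply() carries the names of its input over to the result
dataList <- lapply(files, read.csv)
names(dataList)
```

Each data set is then available as dataList$a, dataList[["b"]] and so on, and nothing is overwritten because every file gets its own list element.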

library(sas7bdat)
setwd(".... WHERE SAS FILES LIVES...")
load.sas <- function(x) {
# Use the file name without its extension as the object name
name <- strsplit(x,"\\.")[[1]][1]
assign(name, read.sas7bdat(x), envir=.GlobalEnv)
TRUE
}
sapply(list.files(path=".", pattern="\\.sas7bdat$", full.names=FALSE), load.sas)
You can add further features to this code, for example to rewrite only some of the data.

Related

Adding a suffix to names when storing results in a loop

I am making some plots in R in a for-loop and would like to store them using a name to describe the function being plotted, but also which data it came from.
So when I have a list of 2 data sets "x" and "y" and the loop has a structure like this:
x = matrix(
c(1,2,4,5,6,7,8,9),
nrow=3,
ncol=2)
y = matrix(
c(20,40,60,80,100,120,140,160,180),
nrow=3,
ncol=2)
data <- list(x,y)
for (i in data){
??? <- boxplot(i)
}
I would like the ??? to be "name" + (i) + "_" separator. In this case the 2 plots would be called "plot_x" and "plot_y".
I tried some stuff with paste("plot", names(i), sep = "_") but I'm not sure if this is what to use, and where and how to use it in this scenario.
We can create an empty list with the length same as that of the 'data' and then store the corresponding output from the for loop by looping over the sequence of 'data'
out <- vector('list', length(data))
for(i in seq_along(data)) {
out[[i]] <- boxplot(data[[i]])
}
str(out)
#List of 2
# $ :List of 6
# ..$ stats: num [1:5, 1:2] 1 1.5 2 3 4 5 5.5 6 6.5 7
# ..$ n : num [1:2] 3 3
# ..$ conf : num [1:2, 1:2] 0.632 3.368 5.088 6.912
# ..$ out : num(0)
# ..$ group: num(0)
# ..$ names: chr [1:2] "1" "2"
# $ :List of 6
# ..$ stats: num [1:5, 1:2] 20 30 40 50 60 80 90 100 110 120
# ..$ n : num [1:2] 3 3
# ..$ conf : num [1:2, 1:2] 21.8 58.2 81.8 118.2
# ..$ out : num(0)
# ..$ group: num(0)
# ..$ names: chr [1:2] "1" "2"
If required, set the names of the list elements with the object names
names(out) <- paste0("plot_", c("x", "y"))
It is better not to create multiple objects in the global environment. Instead, as shown above, place the objects in a list.
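If the input list itself carries names, the "plot_" names can be derived from it with paste(), which is close to what the question was attempting. A minimal sketch, assuming the list is built as list(x = x, y = y) and that drawing the plots is not needed here (plot = FALSE only returns the statistics); the matrices are simplified versions of the ones in the question.

```r
x <- matrix(1:6, nrow = 3, ncol = 2)
y <- matrix(seq(20, 120, by = 20), nrow = 3, ncol = 2)

# Naming the elements lets us reuse those names later
data <- list(x = x, y = y)

# plot = FALSE returns the boxplot statistics without drawing anything
out <- lapply(data, boxplot, plot = FALSE)
names(out) <- paste("plot", names(data), sep = "_")
names(out)
```

The results can then be reached as out$plot_x and out$plot_y.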
akrun is right, you should try to avoid setting names in the global environment. But if you really have to, you can try this,
> y = matrix(c(20,40,60,80,100,120,140,160,180),ncol=1)
> .GlobalEnv[[paste0("plot_","y")]] <- boxplot(y)
> str(plot_y)
List of 6
$ stats: num [1:5, 1] 20 60 100 140 180
$ n : num 9
$ conf : num [1:2, 1] 57.9 142.1
$ out : num(0)
$ group: num(0)
$ names: chr "1"
You can read up on .GlobalEnv by typing ?.GlobalEnv at the R command prompt.

Iterate through a nested list

I am attempting to iterate through a nested list in R, and can't quite get the function/for loop correct.
Sample of my data:
> str(waveforms)
List of 3
$ Sta2_Ev20:List of 7
..$ 1: num [1:10000] 5.88e-05 -2.84e-05 -5.50e-05 7.02e-05 1.90e-06 ...
..$ 2: num [1:10000] 2.61e-05 -2.14e-05 -2.02e-05 2.97e-05 5.94e-06 ...
..$ 3: num [1:10000] 1.08e-05 -4.12e-05 1.95e-05 3.03e-05 -4.55e-05 ...
..$ 4: num [1:10000] 2.45e-05 -1.23e-05 -1.53e-05 2.76e-05 3.07e-06 ...
..$ 5: num [1:10000] 2.29e-05 0.00 5.71e-06 -2.86e-05 5.71e-06 ...
..$ 6: num [1:10000] -1.01e-04 2.37e-05 2.08e-05 -5.93e-06 2.08e-05 ...
..$ 7: num [1:10000] 3.47e-05 -2.75e-05 0.00 1.45e-05 -1.45e-06 ...
$ Sta2_Ev21:List of 34
..$ 1 : num [1:10000] 1.35e-05 -3.46e-05 -3.46e-05 8.65e-05 -2.11e-05 ...
..$ 2 : num [1:10000] 5.68e-05 1.14e-05 -7.38e-05 2.27e-05 4.73e-05 ...
..$ 3 : num [1:10000] 8.21e-06 3.69e-05 -2.46e-05 1.64e-05 -8.21e-06 ...
..$ 4 : num [1:10000] 3.26e-05 -1.34e-05 -1.19e-05 8.90e-06 1.78e-05 ...
..$ 5 : num [1:10000] 2.43e-05 -3.00e-05 1.29e-05 2.86e-06 -1.00e-05 ...
..$ 6 : num [1:10000] -6.87e-06 2.34e-05 -2.34e-05 3.44e-05 -2.20e-05 ...
..$ 7 : num [1:10000] 1.23e-05 -5.75e-05 2.46e-05 1.23e-05 -2.74e-06 ...
..$ 8 : num [1:10000] -2.34e-05 -2.17e-05 1.83e-05 4.17e-05 -4.50e-05 ...
..$ 9 : num [1:10000] 3.34e-05 7.42e-06 -2.04e-05 7.42e-06 0.00 ...
etc...
REPRODUCIBLE DATA
Sta2_Evt1=list(a=runif(10000, min=-12, max=12), b=runif(10000, min=-12, max=12),c=runif(10000, min=-12, max=12))
Sta2_Evt2=list(a=runif(10000, min=-2, max=2), b=runif(10000, min=-2, max=2),c=runif(10000, min=-2, max=2))
...
waveforms=list(Sta2_Evt1,Sta2_Evt2,...)
binsize=5000
And so on. What I need to do is iterate through each list within my list. I tested the data on one of the "Sta#_Evt#" lists. Previously, this code worked:
ch0=list()
for (i in seq_along(Sta2_Evt2)) {
tempobj=head(Sta2_Evt2[[i]],n=binsize)
name <- paste('click',names(Sta2_Evt2)[[i]],sep='')
ch0[[name]] <- tempobj
}
This is simple: it just extracts the first 5000 data points from each element. From this new list of elements (ch0), I was able to run multiple scripts to process my data. However, now that I need to expand to include ALL my data, not just the test set I was originally working with, I can't figure out how to run iterations over nested lists (like waveforms, above). When I run the code for 'ch0', for instance, over my nested 'waveforms' list, it returns the same nested list.
I have tried a few methods: lapply, an additional for loop, and llply. I thought that writing a function to complete my analysis and then applying it with llply might work. However, with this function:
mkChs=function(x,binsize) {for (i in 1:length(x)) {
head(x[[i]],n=binsize)
}}
test=llply(waveforms,mkChs, binsize=5000)
It still does not work. The new list 'test' comes back empty.
I've also tried a nested for loop.
ch0=list()
for (i in seq_along(waveforms)) {
a=list(names(waveforms)[[i]])
b=for (j in seq_along(waveforms[i])) {
tempobj=head(waveforms[[i]][[j]],n=binsize)
name <- paste('click',seq_along(waveforms)[[i]][[j]]-1,sep='')
a[[name]] <- tempobj
}
name1 <- names(waveforms)[[i]]
ch0[[name1]] <- b
}
That returns the following:
str(ch0)
List of 3
$ Sta2_Ev20: num [1:5000] 5.88e-05 -2.84e-05 -5.50e-05 7.02e-05 1.90e-06 ...
$ Sta2_Ev21: num [1:5000] 1.35e-05 -3.46e-05 -3.46e-05 8.65e-05 -2.11e-05 ...
$ Sta2_Ev22: num [1:5000] 2.06e-05 3.44e-06 2.06e-05 -3.44e-05 0.00 ...
Not exactly what I am looking for. I'd rather not have a separate list per "Sta#_Evt#" to get this to run properly.
I tried to create a minimal reproducible example which may get close to what you want
waveform <- list("a" = list('1' = c(1,2,3), '2' = c(4,5,6)),
"b" = list('1' = c(7,8,9), '2' = c(10,11,12)))
# arbitrary function
my_fun <- function(vec) {
return(mean(vec))
}
# return list structure
r1 <- lapply(waveform, function (x) {
lapply(x, my_fun)})
# return a two dimensional array
r2 <- sapply(waveform, function (x) {
sapply(x, my_fun)})
str(r1)
# List of 2
# $ a:List of 2
# ..$ 1: num 2
# ..$ 2: num 5
# $ b:List of 2
# ..$ 1: num 8
# ..$ 2: num 11
r2
# a b
# 1 2 8
# 2 5 11
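The same nested lapply() pattern can be applied to the head() step from the question. The sketch below uses a small made-up waveforms list and a binsize of 5 so it runs quickly; the "click" prefix follows the question's naming. In current versions of R a for loop evaluates to NULL, so a function whose body is only a loop (like mkChs in the question) returns NULL, which is probably why the llply attempt came back empty.

```r
set.seed(1)
# Small stand-in for the real data: a named list of named lists
waveforms <- list(
  Sta2_Evt1 = list(a = runif(20), b = runif(20)),
  Sta2_Evt2 = list(a = runif(20), b = runif(20), c = runif(20))
)
binsize <- 5

# Outer lapply walks the events, inner lapply clips each waveform
ch0 <- lapply(waveforms, function(event) {
  clipped <- lapply(event, head, n = binsize)
  names(clipped) <- paste0("click", names(event))
  clipped
})
str(ch0)
```

Because lapply() returns its results instead of relying on side effects, the nested structure and the outer names are kept without any bookkeeping.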
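I used a nested loop. It turns out my previous loop had a couple of small mistakes rather than a structural problem.
ch0=list()
for (i in seq_along(waveforms)) {
a=list()
# iterate over the elements of the i-th sub-list (note the double brackets)
for (j in seq_along(waveforms[[i]])) {
tempobj=head(waveforms[[i]][[j]],n=binsize)
name <- paste('click',j-1,sep='')
a[[name]] <- tempobj
}
name1 <- names(waveforms)[[i]]
ch0[[name1]] <- a
}
Comparing this with my earlier attempt, the inner loop there ran over seq_along(waveforms[i]) (single brackets, which give a list of length one) instead of seq_along(waveforms[[i]]), and I stored b, the value of the for loop, instead of the list a that the loop fills. The extra parentheses around waveforms[[i]] that I originally added are harmless but not needed.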

Accessing dataframes after splitting a dataframe

I'm splitting a dataframe in multiple dataframes using the command
data <- apply(data, 2, function(x) data.frame(sort(x, decreasing=F)))
I don't know how to access them. I know I can access them using data$`1`, but I have to do that for every dataframe:
df1<- head(data$`1`,k)
df2<- head(data$`2`,k)
Can I get these dataframes in one go (e.g. by storing them in some form)? The indexes of these dataframes shouldn't change, though.
str(data) gives
List of 2
$ 7:'data.frame': 7 obs. of 1 variable:
..$ sort.x..decreasing...F.: num [1:7] 0.265 0.332 0.458 0.51 0.52 ...
$ 8:'data.frame': 7 obs. of 1 variable:
..$ sort.x..decreasing...F.: num [1:7] 0.173 0.224 0.412 0.424 0.5 ...
str(data[1:2])
List of 2
$ 7:'data.frame': 7 obs. of 1 variable:
..$ sort.x..decreasing...F.: num [1:7] 0.265 0.332 0.458 0.51 0.52 ...
$ 8:'data.frame': 7 obs. of 1 variable:
..$ sort.x..decreasing...F.: num [1:7] 0.173 0.224 0.412 0.424 0.5 ...
Thanks to @r2evans I got it done; here is his code from the comments:
Yes. Two short demos: lapply(data, head, n=2), or more generically
sapply(data, function(df) mean(df$x)). – r2evans
and after that fetching the indexes
df<-lapply(df, rownames)
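Put together, the two steps might look like the sketch below. The list data is a hand-made stand-in for the output of the apply() call in the question (the values, the row names and k = 2 are assumptions for illustration); the point is only that head() keeps the original row names, so the indexes survive.

```r
# Stand-in for the list of sorted one-column data frames
data <- list(
  `7` = data.frame(value = c(0.27, 0.33, 0.46),
                   row.names = c("r2", "r3", "r1")),
  `8` = data.frame(value = c(0.17, 0.22, 0.41),
                   row.names = c("r1", "r3", "r2"))
)
k <- 2

# First k rows of every data frame, all in one go
top <- lapply(data, head, n = k)
# Original row names (the "indexes") of those rows
idx <- lapply(top, rownames)
idx
```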

How do I read ragged/implied do data into r

How do I read data like the example below? (My actual files are like ftp://ftp.aoml.noaa.gov/hrd/pub/hwind/Operational/2012/AL182012/1030/0730/AL182012_1030_0730.gz formatted per http://www.aoml.noaa.gov/hrd/Storm_pages/grid.html -- They look like fortran implied-do writes)
The issue I have is that there are multiple headers and vectors within the file having differing numbers of values per line. Scan seems to start from the beginning for .gz files, while I want the reads to parse incrementally through the file.
This is a headerline with a name.
The fourth line has the number of elements in the first vector,
and the next vector is encoded similarly
7
1 2 3
4 5 6
7
8
1 2 3
4 5 6
7 8
This doesn't work as I'd like:
fh<-gzfile("junk.gz")
headers<-readLines(fh,3)
nx<-as.numeric(readLines(fh,1))
x<-scan(fh,n=nx)
ny<-as.numeric(readLines(fh,1))
y<-scan(fh,n=ny)
This sort of works, but I have to then calculate the skip values:
...
x<-scan(fh,skip=3,nx)
...
Ah... I discovered that using the gzfile() to open does not allow seek operations on the data, so the scan()s all rewind and start at the beginning of the file. If I unzip the file and operate on the uncompressed data, I can read the various bits incrementally with readLines(fh,n) and scan(fh,n=n)
readVector<-function(fh,skip=0,what=double()){
if (skip !=0 ){junk<-readLines(fh,skip)}
n<-scan(fh,1)
scan(fh,what=what,n=n)
}
fh<-file("junk")
headers<-readLines(fh,3)
x<-readVector(fh)
y<-readVector(fh)
xl<-readVector(fh)
yl<-readVector(fh)
... # still need to process a parenthesized complex array, but that is a different problem.
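A minimal sketch of the same incremental approach on a compressed file, assuming the connection is opened explicitly before any reads. As far as I can tell from the readLines() and scan() documentation, a connection that is not open is opened for the duration of each call and then closed again, which by itself would make every read start from the top. The file and its contents below are made up to mimic the small example in the question.

```r
# Hypothetical test file mimicking the layout in the question
path <- tempfile(fileext = ".gz")
out <- gzfile(path, "wt")
writeLines(c("This is a headerline with a name.",
             "second header line",
             "third header line",
             "7", "1 2 3", "4 5 6", "7",
             "8", "1 2 3", "4 5 6", "7 8"), out)
close(out)

# Read a count, then that many values, from an already open connection
readVector <- function(con, what = double()) {
  n <- scan(con, n = 1, quiet = TRUE)
  scan(con, what = what, n = n, quiet = TRUE)
}

fh <- gzfile(path, "rt")   # opened explicitly, so the read position is kept
headers <- readLines(fh, 3)
x <- readVector(fh)
y <- readVector(fh)
close(fh)
```

If this behaves the same way on the real files, the unzip step may not be necessary.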
Looking at a few sample files, it looks like you only need to determine the number to be read once, and that can be used for processing all parts of the file.
As I mentioned in a comment, grep would be useful for helping automate the process. Here's a quick function I came up with:
ReadFunky <- function(myFile) {
fh <- gzfile(myFile)
myFile <- readLines(fh)
vecLen <- as.numeric(myFile[5])
startAt <- grep(paste("^\\s+", vecLen), myFile)
T1 <- lapply(startAt[-5], function(x) {
scan(fh, n = vecLen, skip = x)
})
T2 <- gsub("\\(|\\)", "",
unlist(strsplit(myFile[(startAt[5]+1):length(myFile)], ")(",
fixed = TRUE)))
T2 <- read.csv(text = T2, header = FALSE)
T2 <- split(T2, rep(1:vecLen, each = vecLen))
T1[[5]] <- T2
names(T1) <- myFile[startAt-1]
T1
}
You can apply it to a downloaded file. Just replace with the actual path to where you downloaded the file.
temp <- ReadFunky("~/Downloads/AL182012_1030_0730.gz")
The function returns a list. The first four items in the list are the vectors of coordinates.
str(temp[1:4])
# List of 4
# $ MERCATOR X COORDINATES ... KILOMETERS : num [1:159] -476 -470 -464 -458 -452 ...
# $ MERCATOR Y COORDINATES ... KILOMETERS : num [1:159] -476 -470 -464 -458 -452 ...
# $ EAST LONGITUDE COORDINATES ... DEGREES: num [1:159] -81.1 -81 -80.9 -80.9 -80.8 ...
# $ NORTH LATITUDE COORDINATES ... DEGREES: num [1:159] 36.2 36.3 36.3 36.4 36.4 ...
The fifth item is a set of 2-column data.frames that contain the data from your "parenthesized complex array". Not really sure what the best structure for this data was, so I just stuck it in data.frames. You'll get as many data.frames as the expected number of values for the given data set (in this case, 159).
length(temp[[5]])
# [1] 159
str(temp[[5]][1:4])
# List of 4
# $ 1:'data.frame': 159 obs. of 2 variables:
# ..$ V1: num [1:159] 7.59 7.6 7.59 7.59 7.58 ...
# ..$ V2: num [1:159] -1.33 -1.28 -1.22 -1.16 -1.1 ...
# $ 2:'data.frame': 159 obs. of 2 variables:
# ..$ V1: num [1:159] 7.66 7.66 7.65 7.65 7.64 ...
# ..$ V2: num [1:159] -1.29 -1.24 -1.19 -1.13 -1.07 ...
# $ 3:'data.frame': 159 obs. of 2 variables:
# ..$ V1: num [1:159] 7.73 7.72 7.72 7.71 7.7 ...
# ..$ V2: num [1:159] -1.26 -1.21 -1.15 -1.1 -1.04 ...
# $ 4:'data.frame': 159 obs. of 2 variables:
# ..$ V1: num [1:159] 7.8 7.8 7.79 7.78 7.76 ...
# ..$ V2: num [1:159] -1.22 -1.17 -1.12 -1.06 -1.01 ...
Update
If you want to modify the function so you can read directly from the FTP url, change the first two lines to read as the following and continue from the "myFile" line:
ReadFunky <- function(myFile, fromURL = TRUE) {
if (isTRUE(fromURL)) {
x <- strsplit(myFile, "/")[[1]]
y <- download.file(myFile, destfile = x[length(x)])
fh <- gzfile(x[length(x)])
} else {
fh <- gzfile(myFile)
}
Usage would be like: temp <- ReadFunky("ftp://ftp.aoml.noaa.gov/hrd/pub/hwind/Operational/2012/AL182012/1023/1330/AL182012_1023_1330.gz") for a file that you are going to download directly, and temp <- ReadFunky("~/AL182012_1023_1330.gz", fromURL=FALSE) for a file that you already have saved on your system.

How to access parts of a list in R

I've got the optim function in r returning a list of stuff like this:
[[354]]
r k sigma
389.4 354.0 354.0
but when I try accessing say list$sigma it doesn't exist returning NULL.
I've tried attach and I've tried names, and I've tried assigning it to a matrix, but none of these things would work
Anyone got any idea how I can access the lowest or highest value for sigma r or k in my list??
Many thanks!!
str gives me this output:
List of 354
$ : Named num [1:3] -55.25 2.99 119.37
..- attr(*, "names")= chr [1:3] "r" "k" "sigma"
$ : Named num [1:3] -53.91 4.21 119.71
..- attr(*, "names")= chr [1:3] "r" "k" "sigma"
$ : Named num [1:3] -41.7 14.6 119.2
So I've got a double within a list within a list (?) I'm still mystified as to how I can cycle through the list and pick one out meeting my conditions without writing a function from scratch
The key issue is that you have a list of lists (or a list of data.frames, which in fact is also a list).
To confirm this, take a look at is(list[[354]]).
The solution is simply to add an additional level of indexing. Below you have multiple alternatives of how to accomplish this.
You can use a vector as an index to [[, so, for example, if you want to access the third element of the 354th element, you can use
myList[[ c(354, 3) ]]
You can also use character indices; however, all nested levels must have named indices.
names(myList) <- as.character(seq_along(myList))
myList[[ c("5", "sigma") ]]
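Both forms of indexing can be tried on a small stand-in list. The sketch below reuses the first two vectors from the str() output in the question; the names byPosition and byName are only for illustration.

```r
myList <- list(
  c(r = -55.25, k = 2.99, sigma = 119.37),
  c(r = -53.91, k = 4.21, sigma = 119.71)
)

# Position-based: third value of the second element
byPosition <- myList[[c(2, 3)]]

# Name-based: every level needs names, so name the outer list first
names(myList) <- as.character(seq_along(myList))
byName <- myList[[c("2", "sigma")]]
```

Both byPosition and byName should hold the sigma value of the second element.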
Lastly, please try to avoid using names like list, data, df etc. for your objects. This will lead to crashing code and errors which will seem unexplainable and mysterious until one realizes that they've tried to subset a function.
Edit:
In response to your question in the comments above: If you want to see the structure of an object (ie the "makeup" of the object), use str
> str(myList)
List of 5
$ :'data.frame': 1 obs. of 3 variables:
..$ a : num 0.654
..$ b : num -0.0823
..$ sigma: num -31
$ :'data.frame': 1 obs. of 3 variables:
..$ a : num -0.656
..$ b : num -0.167
..$ sigma: num -49
$ :'data.frame': 1 obs. of 3 variables:
..$ a : num 0.154
..$ b : num 0.522
..$ sigma: num -89
$ :'data.frame': 1 obs. of 3 variables:
..$ a : num 0.676
..$ b : num 0.595
..$ sigma: num 145
$ :'data.frame': 1 obs. of 3 variables:
..$ a : num -0.75
..$ b : num 0.772
..$ sigma: num 6
If you want -for example- all the sigmas, you can use sapply:
sapply(list, function(x)x["sigma"])
You can use that to find the minimum and maximum:
range(sapply(list, function(x)x["sigma"]))
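To pick out the whole parameter set with the lowest (or highest) sigma rather than just the value, which.min() and which.max() can be combined with the extracted vector. A minimal sketch, using the three vectors from the str() output in the question and the made-up names fits, sigmas and best; the sigma of the third vector is assumed to be its last value.

```r
fits <- list(
  c(r = -55.25, k = 2.99, sigma = 119.37),
  c(r = -53.91, k = 4.21, sigma = 119.71),
  c(r = -41.7,  k = 14.6, sigma = 119.2)
)

# One sigma per list element; vapply() checks that each result is a single number
sigmas <- vapply(fits, function(v) v[["sigma"]], numeric(1))

best <- which.min(sigmas)   # position of the smallest sigma
fits[[best]]                # the full r / k / sigma vector at that position
```

Swapping which.min() for which.max() gives the element with the largest sigma, and the same pattern works for r or k.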
Using do.call you can do this:
do.call('[[', list(mylist, 354))['sigma']
