Iterate through a nested list - r

I am attempting to iterate through a nested list in R, and can't quite get the function/for loop correct.
Sample of my data:
> str(waveforms)
List of 3
$ Sta2_Ev20:List of 7
..$ 1: num [1:10000] 5.88e-05 -2.84e-05 -5.50e-05 7.02e-05 1.90e-06 ...
..$ 2: num [1:10000] 2.61e-05 -2.14e-05 -2.02e-05 2.97e-05 5.94e-06 ...
..$ 3: num [1:10000] 1.08e-05 -4.12e-05 1.95e-05 3.03e-05 -4.55e-05 ...
..$ 4: num [1:10000] 2.45e-05 -1.23e-05 -1.53e-05 2.76e-05 3.07e-06 ...
..$ 5: num [1:10000] 2.29e-05 0.00 5.71e-06 -2.86e-05 5.71e-06 ...
..$ 6: num [1:10000] -1.01e-04 2.37e-05 2.08e-05 -5.93e-06 2.08e-05 ...
..$ 7: num [1:10000] 3.47e-05 -2.75e-05 0.00 1.45e-05 -1.45e-06 ...
$ Sta2_Ev21:List of 34
..$ 1 : num [1:10000] 1.35e-05 -3.46e-05 -3.46e-05 8.65e-05 -2.11e-05 ...
..$ 2 : num [1:10000] 5.68e-05 1.14e-05 -7.38e-05 2.27e-05 4.73e-05 ...
..$ 3 : num [1:10000] 8.21e-06 3.69e-05 -2.46e-05 1.64e-05 -8.21e-06 ...
..$ 4 : num [1:10000] 3.26e-05 -1.34e-05 -1.19e-05 8.90e-06 1.78e-05 ...
..$ 5 : num [1:10000] 2.43e-05 -3.00e-05 1.29e-05 2.86e-06 -1.00e-05 ...
..$ 6 : num [1:10000] -6.87e-06 2.34e-05 -2.34e-05 3.44e-05 -2.20e-05 ...
..$ 7 : num [1:10000] 1.23e-05 -5.75e-05 2.46e-05 1.23e-05 -2.74e-06 ...
..$ 8 : num [1:10000] -2.34e-05 -2.17e-05 1.83e-05 4.17e-05 -4.50e-05 ...
..$ 9 : num [1:10000] 3.34e-05 7.42e-06 -2.04e-05 7.42e-06 0.00 ...
etc...
REPRODUCIBLE DATA
Sta2_Evt1=list(a=runif(10000, min=-12, max=12), b=runif(10000, min=-12, max=12),c=runif(10000, min=-12, max=12))
Sta2_Evt2=list(a=runif(10000, min=-2, max=2), b=runif(10000, min=-2, max=2),c=runif(10000, min=-2, max=2))
...
waveforms=list(Sta2_Evt1,Sta2_Evt2,...)
binsize=5000
And so on. What I need to do is iterate through each list within my list. I tested the data on one of the "Sta#_Evt#" lists. Previously, this code worked:
ch0=list()
for (i in seq_along(Sta2_Evt2)) {
tempobj=head(Sta2_Evt2[[i]],n=binsize)
name <- paste('click',names(Sta2_Evt2)[[i]],sep='')
ch0[[name]] <- tempobj
}
This is simple: it just extracts the first 5000 data points from each element. From this new list of elements (ch0), I was able to run multiple scripts to process my data. However, now that I need to expand to include ALL my data, not just the test set I was originally working with, I can't figure out how to iterate over nested lists (like waveforms, above). When I run the 'ch0' code over my nested 'waveforms' list, for instance, it returns the same nested list.
I have tried a few methods: lapply, an additional for loop, and llply. I thought that maybe I could write a function to do the processing and then apply it with llply. However, with this function:
mkChs=function(x,binsize) {for (i in 1:length(x)) {
head(x[[i]],n=binsize)
}}
test=llply(waveforms,mkChs, binsize=5000)
It still does not work. The new list 'test' comes back empty.
I've also tried a nested for loop.
ch0=list()
for (i in seq_along(waveforms)) {
a=list(names(waveforms)[[i]])
b=for (j in seq_along(waveforms[i])) {
tempobj=head(waveforms[[i]][[j]],n=binsize)
name <- paste('click',seq_along(waveforms)[[i]][[j]]-1,sep='')
a[[name]] <- tempobj
}
name1 <- names(waveforms)[[i]]
ch0[[name1]] <- b
}
That returns the following:
str(ch0)
List of 3
$ Sta2_Ev20: num [1:5000] 5.88e-05 -2.84e-05 -5.50e-05 7.02e-05 1.90e-06 ...
$ Sta2_Ev21: num [1:5000] 1.35e-05 -3.46e-05 -3.46e-05 8.65e-05 -2.11e-05 ...
$ Sta2_Ev22: num [1:5000] 2.06e-05 3.44e-06 2.06e-05 -3.44e-05 0.00 ...
Not exactly what I am looking for. I'd rather not have a separate list per "Sta#_Evt#" to get this to run properly.

I tried to create a minimal reproducible example which may get close to what you want
waveform <- list("a" = list('1' = c(1,2,3), '2' = c(4,5,6)),
"b" = list('1' = c(7,8,9), '2' = c(10,11,12)))
# arbitrary function
my_fun <- function(vec) {
return(mean(vec))
}
# return list structure
r1 <- lapply(waveform, function (x) {
lapply(x, my_fun)})
# return a two dimensional array
r2 <- sapply(waveform, function (x) {
sapply(x, my_fun)})
str(r1)
# List of 2
# $ a:List of 2
# ..$ 1: num 2
# ..$ 2: num 5
# $ b:List of 2
# ..$ 1: num 8
# ..$ 2: num 11
r2
# a b
# 1 2 8
# 2 5 11
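Applied to the structure in the question, the same nested lapply() idea might look like the sketch below. The data are simulated stand-ins for 'waveforms', and first_bin is a hypothetical helper name, not something from the original post. Note that a for loop evaluates to NULL, which is why a function whose body is only a for loop returns nothing useful; lapply() returns the list it builds.

```r
# Simulated stand-in for the nested 'waveforms' list (assumption, not real data)
set.seed(1)
waveforms <- list(
  Sta2_Evt1 = list(a = runif(10000, -12, 12), b = runif(10000, -12, 12)),
  Sta2_Evt2 = list(a = runif(10000, -2, 2), b = runif(10000, -2, 2),
                   c = runif(10000, -2, 2))
)
binsize <- 5000

# Hypothetical helper: keep the first 'binsize' points of every element
# of one event and prefix the element names with "click"
first_bin <- function(event, binsize) {
  out <- lapply(event, head, n = binsize)
  names(out) <- paste0("click", names(event))
  out
}

# The outer lapply() walks the events, the inner one walks the elements
ch0 <- lapply(waveforms, first_bin, binsize = binsize)
str(ch0, max.level = 2)
```

The result keeps the nesting (one list per event), so later processing steps can be applied with another lapply() in the same way.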

I used a nested loop. Turns out my previous loop was indexing the outer list incorrectly:
ch0=list()
for (i in seq_along(waveforms)) {
  a=list()
  for (j in seq_along(waveforms[[i]])) {
    tempobj=head(waveforms[[i]][[j]],n=binsize)
    name <- paste('click',j-1,sep='')
    a[[name]] <- tempobj
  }
  name1 <- names(waveforms)[[i]]
  ch0[[name1]] <- a
}
The previous version looped over seq_along(waveforms[i]), which is a one-element list, so the inner loop ran only once per event; it needs the double brackets, seq_along(waveforms[[i]]). It also stored b, the value of the for loop (which is always NULL), instead of the list a. Extra parentheses around waveforms[[i]] are harmless but not required.

Related

Adding a suffix to names when storing results in a loop

I am making some plots in R in a for-loop and would like to store them using a name to describe the function being plotted, but also which data it came from.
So when I have a list of 2 data sets "x" and "y" and the loop has a structure like this:
x = matrix(
c(1,2,4,5,6,7,8,9),
nrow=3,
ncol=2)
y = matrix(
c(20,40,60,80,100,120,140,160,180),
nrow=3,
ncol=2)
data <- list(x,y)
for (i in data){
??? <- boxplot(i)
}
I would like the ??? to be "name" + (i) + "_" separator. In this case the 2 plots would be called "plot_x" and "plot_y".
I tried some stuff with paste("plot", names(i), sep = "_") but I'm not sure if this is what to use, and where and how to use it in this scenario.
We can create an empty list with the same length as 'data' and then store the output for each element by looping over the sequence of 'data':
out <- vector('list', length(data))
for(i in seq_along(data)) {
out[[i]] <- boxplot(data[[i]])
}
str(out)
#List of 2
# $ :List of 6
# ..$ stats: num [1:5, 1:2] 1 1.5 2 3 4 5 5.5 6 6.5 7
# ..$ n : num [1:2] 3 3
# ..$ conf : num [1:2, 1:2] 0.632 3.368 5.088 6.912
# ..$ out : num(0)
# ..$ group: num(0)
# ..$ names: chr [1:2] "1" "2"
# $ :List of 6
# ..$ stats: num [1:5, 1:2] 20 30 40 50 60 80 90 100 110 120
# ..$ n : num [1:2] 3 3
# ..$ conf : num [1:2, 1:2] 21.8 58.2 81.8 118.2
# ..$ out : num(0)
# ..$ group: num(0)
# ..$ names: chr [1:2] "1" "2"
If required, set the names of the list elements from the object names:
names(out) <- paste0("plot_", c("x", "y"))
It is better not to create multiple objects in the global environment. Instead, as shown above, place the objects in a list.
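If the input list is itself named, the names can be derived rather than typed by hand. The following is a minimal sketch under that assumption; the matrices are trimmed to six values so they fill a 3 x 2 matrix exactly, and plot = FALSE is used only so the example runs without opening a graphics device.

```r
# Trimmed versions of the matrices from the question (6 values each)
x <- matrix(c(1, 2, 4, 5, 6, 7), nrow = 3, ncol = 2)
y <- matrix(c(20, 40, 60, 80, 100, 120), nrow = 3, ncol = 2)

# Naming the list elements lets the names travel with the data
data <- list(x = x, y = y)

# boxplot() returns its summary statistics; plot = FALSE skips the drawing
out <- lapply(data, boxplot, plot = FALSE)

# Build "plot_x", "plot_y" from the names of the input list
names(out) <- paste("plot", names(data), sep = "_")
names(out)
out$plot_x$stats
```

Individual results are then available as out$plot_x or out[["plot_y"]], without creating separate objects.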
akrun is right, you should try to avoid creating objects in the global environment. But if you really have to, you can try this:
> y = matrix(c(20,40,60,80,100,120,140,160,180),ncol=1)
> .GlobalEnv[[paste0("plot_","y")]] <- boxplot(y)
> str(plot_y)
List of 6
$ stats: num [1:5, 1] 20 60 100 140 180
$ n : num 9
$ conf : num [1:2, 1] 57.9 142.1
$ out : num(0)
$ group: num(0)
$ names: chr "1"
You can read up on .GlobalEnv by typing ?.GlobalEnv at the R command prompt.

NA to replace NULL in list/for loop

I am trying to replace NULL values with NAs in a list pulled from an API, but the lengths are different and therefore can't be replaced.
I have tried using the nullToNA function in the toxboot package (found here), but R can't find the function when I try to call it (I don't know whether the package has changed in a way I can't track down, or whether it is because the list is not pulled from a MongoDB). I have also tried all the function call checks here. My code is below. Any help?
library(httr)
library(toxboot)
library(RJSONIO)
library(lubridate)
library(xlsx)
library(reshape2)
resUrl <- "http://api.eia.gov/series/?api_key=2B5239FA427673D22505DBF45664B12E&series_id=NG.N3010CO3.M"
comUrl <- "http://api.eia.gov/series/?api_key=2B5239FA427673D22505DBF45664B12E&series_id=NG.N3020CO3.M"
indUrl <- "http://api.eia.gov/series/?api_key=2B5239FA427673D22505DBF45664B12E&series_id=NG.N3035CO3.M"
apiList <- list(resUrl, comUrl, indUrl)
results <- vector("list", length(apiList))
for(i in length(apiList)){
raw <- GET(url = as.character(apiList[i]))
char <- rawToChar(raw$content)
list <- fromJSON(char)
for (j in length(list$series[[1]]$data)){
if (is.null(list$series[[1]]$data[[j]][[2]])== TRUE)
##nullToNA(list$series[[1]]$data[[j]][[2]])
##list$series[1]$data[[j]][[2]] <- NA
else
next
}
##seriesData <- list$series[[1]]$data
unlistResult <- lapply(list, unlist)
##unlistResult <- lapply(seriesData, unlist)
##unlist2 <- lapply(unlistResult,unlist)
##results[[i]] <- unlistResult
results[[i]] <- unlistResult
}
The commented-out lines show some of the things that I have tried, but there are a few other methods I haven't tried.
I have seen lapply(list, function(x) ifelse (x == "NULL", NA, x)) but haven't had any luck with that either.
Try this:
library(httr)
resUrl <- "http://api.eia.gov/series/?api_key=2B5239FA427673D22505DBF45664B12E&series_id=NG.N3010CO3.M"
x <- GET(resUrl)
y <- content(x)
str(head(y$series[[1]]$data))
# List of 6
# $ :List of 2
# ..$ : chr "201701"
# ..$ : NULL
# $ :List of 2
# ..$ : chr "201612"
# ..$ : num 6.48
# $ :List of 2
# ..$ : chr "201611"
# ..$ : num 7.42
# $ :List of 2
# ..$ : chr "201610"
# ..$ : num 9.75
# $ :List of 2
# ..$ : chr "201609"
# ..$ : num 12.1
# $ :List of 2
# ..$ : chr "201608"
# ..$ : num 14.3
In this first URL, only the first element within $series[[1]]$data contained a NULL. BTW: be careful to distinguish between NULL (the literal) and "NULL" (a character string with 4 letters).
Here are some ways (with various data types) to check for NULLs:
is.null(NULL)
# [1] TRUE
length(NULL)
# [1] 0
Simple enough so far; let's try a list with NULLs:
l <- list(NULL, 1)
is.null(l)
# [1] FALSE
sapply(l, is.null)
# [1] TRUE FALSE
length(l)
# [1] 2
lengths(l)
# [1] 0 1
sapply(l, length)
# [1] 0 1
(The "0" lengths indicate NULLs.) I'll use lengths here:
y$series[[1]]$data <- lapply(y$series[[1]]$data, function(z) { z[ lengths(z) == 0 ] <- NA; z; })
str(head(y$series[[1]]$data))
# List of 6
# $ :List of 2
# ..$ : chr "201701"
# ..$ : logi NA
# $ :List of 2
# ..$ : chr "201612"
# ..$ : num 6.48
# $ :List of 2
# ..$ : chr "201611"
# ..$ : num 7.42
# $ :List of 2
# ..$ : chr "201610"
# ..$ : num 9.75
# $ :List of 2
# ..$ : chr "201609"
# ..$ : num 12.1
# $ :List of 2
# ..$ : chr "201608"
# ..$ : num 14.3
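Once the NULLs are replaced, the pairs can be flattened into a data frame. The sketch below runs offline on a small hand-made list that only mimics the shape of y$series[[1]]$data; the helper name null_to_na and the column names period and value are assumptions made for illustration.

```r
# Hand-made stand-in for y$series[[1]]$data (assumption: same shape)
series_data <- list(
  list("201701", NULL),
  list("201612", 6.48),
  list("201611", 7.42)
)

# Hypothetical helper: replace zero-length entries of one pair with NA
null_to_na <- function(pair) {
  pair[lengths(pair) == 0] <- NA
  pair
}

cleaned <- lapply(series_data, null_to_na)

# With no NULLs left, every pair has exactly one value per field,
# so vapply() can collect the fields into plain vectors
df <- data.frame(
  period = vapply(cleaned, function(p) p[[1]], character(1)),
  value  = vapply(cleaned, function(p) as.numeric(p[[2]]), numeric(1))
)
df
```

Without the replacement step, unlist() would silently drop the NULL entries and the two columns would end up with different lengths, which is probably the length mismatch described in the question.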

How to use lapply to remove columns with too many missing values in a list in R?

I have a list of data frames called ls.df.val.dcas. Each data frame has various columns, some of which contain missing values (NA). I would like to use lapply() on the list to remove the columns in which more than X% (e.g. 40%) of the values are NA. To show what the data frames in the list look like, here is an example:
$ SK_VALUES_IMV_EU28_INTRA :'data.frame': 74 obs. of 65 variables:
..$ PERIOD : Date[1:74], format: "2010-01-01" "2010-02-01" "2010-03-01" "2010-04-01" ...
..$ 2207 : num [1:74] 1078759 1850083 1872924 1038070 626471 ...
..$ 2208 : num [1:74] 3329179 7061890 1351550 1371469 1557605 ...
..$ 220710 : num [1:74] 1030704 1804495 1831958 972263 574855 ...
..$ 220720 : num [1:74] 48055 45588 40966 65807 51616 ...
..$ 220820 : num [1:74] 380843 1014933 71804 126348 138138 ...
..$ 220830 : num [1:74] 380007 459653 155033 205879 297446 ...
..$ 220840 : num [1:74] 41561 88449 31549 60768 117534 ...
..$ 220850 : num [1:74] 94483 340439 44949 32949 37550 ...
..$ 220860 : num [1:74] 371217 728521 143974 179311 254546 ...
..$ 220870 : num [1:74] 731231 1374532 228087 227772 230129 ...
..$ 22082014: num [1:74] NA 2531 1776 NA NA ...
$ RO_VALUES_IMV_EU28_EXTRA :'data.frame': 74 obs. of 44 variables:
..$ PERIOD : Date[1:74], format: "2010-01-01" "2010-02-01" "2010-03-01" "2010-04-01" ...
..$ 2207 : num [1:74] NA NA NA NA NA 5 NA NA NA NA ...
..$ 2208 : num [1:74] 312035 840540 315008 884357 100836 ...
..$ 220710 : num [1:74] NA NA NA NA NA 5 NA NA NA NA ...
..$ 220720 : num [1:74] NA NA NA NA NA NA NA NA NA NA ...
..$ 220820 : num [1:74] 3570 698 483 1087 1802 ...
My incomplete solution is based on counting the number of NA in each column of each dataframe and calculating the percentage of NA. Then removing those columns that the percentage is more than X%.
# Counting the number of non-NA values in each column
ls.Nan <- lapply(ls.df.val.dcas, function(x) colSums(!is.na(x)))
# Getting the dimensions of each data frame
ls.size <- lapply(ls.df.val.dcas, function(x) dim(x))
# we want the first element of size which shows the number of rows.
ls.percen <- mapply(function(x,y) x/y[1] , x=ls.Nan, y=ls.size)
# keeping those columns that have more than half of the data on that category
mis.list <- sapply(ls.df.val.dcas, "]]" sapply(ls.percen, function(x) x >= NPI))
I get the following error from running the last line.
Error: unexpected symbol in "mis.list <- sapply(ls.df.val.dcas, "]]" sapply"
Ultimately I also like to merge all of these functions into a single functions and then use lapply once. But right now, I am struggling to understand the indexing system of lapply applied to list of dataframes. If any one can demonstrate with an example how to use lapply with different granularity of lists then that would be great. For instance how functions should be written when you want to change an element of a list or a dataframe within list, or a column within a dataframe of a list.
EDIT
Given the comment below about forgetting to put a comma after "]]". I corrected the code but still getting the error
> mis.list <- sapply(ls.df.val.dcas, "]]", sapply(ls.percen, function(x) x >= NPI))
Error in get(as.character(FUN), mode = "function", envir = envir) :
object ']]' of mode 'function' was not found
By the way, the NPI is just a percentage threshold of NAs in the column. For instance I have set it to NPI= 0.35
Since I suspect the error is related to the structure of my data, I have added more information on the structure of ls.percen.
> str(ls.percen)
List of 69
$ AT_VALUES_IMV_EU28_EXTRA : Named num [1:59] 1 0.635 1 0.378 0.338 ...
..- attr(*, "names")= chr [1:59] "PERIOD" "2207" "2208" "220710" ...
$ AT_VALUES_IMV_EU28_INTRA : Named num [1:67] 1 0.986 0.986 0.986 0.986 ...
..- attr(*, "names")= chr [1:67] "PERIOD" "2207" "2208" "220710" ...
$ BE_VALUES_IMV_EU28_EXTRA : Named num [1:57] 1 1 1 1 0.365 ...
..- attr(*, "names")= chr [1:57] "PERIOD" "2207" "2208" "220710" ...
$ BE_VALUES_IMV_EU28_INTRA : Named num [1:69] 1 0.986 0.986 0.986 0.986 ...
..- attr(*, "names")= chr [1:69] "PERIOD" "2207" "2208" "220710" ...
Might be a simple typo (and not a problem with indexing): that message says you are missing a comma, and it should perhaps be:
mis.list <- sapply( ls.df.val.dcas, "]]", sapply(ls.percen, function(x) x >= NPI))
We don't see a definition of 'NPI'. It might be simpler to merge the first two 'lapply' calls (and return the desired list of shortened data frames) with:
mis.lst <- lapply( ls.df.val.dcas,
function(x) x[ , colSums(!is.na(x))/nrow(x) > .40 ] )
You can use logical indexing in the "j" position for the two argument version of "[".
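The same idea can be written in terms of the fraction of missing values, which matches the wording of the question more directly. This is a sketch on made-up data; drop_sparse_cols and its max_na argument are hypothetical names, and drop = FALSE is added so that a data frame is returned even when only one column survives.

```r
# Hypothetical helper: drop columns whose share of NA exceeds max_na
drop_sparse_cols <- function(df, max_na = 0.4) {
  na_frac <- colMeans(is.na(df))        # fraction of NA per column
  df[, na_frac <= max_na, drop = FALSE] # logical index in the "j" position
}

# Made-up list of data frames standing in for ls.df.val.dcas
dfs <- list(
  A = data.frame(PERIOD = 1:5,
                 full   = c(1, 2, 3, 4, 5),
                 sparse = c(NA, NA, NA, 4, 5)),  # 60% NA, should be dropped
  B = data.frame(PERIOD = 1:5,
                 some   = c(NA, 2, 3, 4, 5))     # 20% NA, should be kept
)

# One lapply() call handles every data frame in the list
kept <- lapply(dfs, drop_sparse_cols, max_na = 0.4)
lapply(kept, names)
```

The function passed to lapply() always receives one list element (here, one data frame), so any column-level logic goes inside that function.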

How to load several files into R without overwriting the existing files?

Hi I would like to load into R several databases in .sas7bdat format. Each time a new database is loaded I would like to display its name (e.g. file.sas7bdat -> file). I wrote a code in R (shown below) but it does not work. I think it overwrites the existing database with a new database. I would be grateful for any suggestions how to improve it.
getwd()
files<-list.files(pattern="*.sas7bdat")
for (i in 1:length(files)) {
data[i]<-read.sas7bdat(files[i])
}
I don't have any sas7bdat files handy, but this concept should translate across most of the read.* functions. You're on the right track with the for-loop, but you can create the list directly by using lapply() like so:
#Make a few CSV files
x <- matrix(rnorm(10), ncol = 2)
write.csv(x, "a.csv")
write.csv(x, "b.csv")
#Read them into a list
fileList <- lapply(list.files(pattern = "*.csv"), function(x) read.csv(x))
#check out what we ended up with
str(fileList)
#---
List of 2
$ :'data.frame': 5 obs. of 3 variables:
..$ X : int [1:5] 1 2 3 4 5
..$ V1: num [1:5] -0.451 -0.317 -1.225 0.445 -1.361
..$ V2: num [1:5] 0.489 -2.8154 0.5147 -0.0561 0.826
$ :'data.frame': 5 obs. of 3 variables:
..$ X : int [1:5] 1 2 3 4 5
..$ V1: num [1:5] -0.451 -0.317 -1.225 0.445 -1.361
..$ V2: num [1:5] 0.489 -2.8154 0.5147 -0.0561 0.826
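To keep track of which data set came from which file, the list elements can be named after the files. The sketch below again uses CSV files as a stand-in and writes them to a temporary directory; swapping read.csv for read.sas7bdat and adjusting the pattern is an assumption that has not been tested here.

```r
# Work in a temporary directory so the sketch does not touch real files
tmp <- tempfile("csv_demo")
dir.create(tmp)
x <- matrix(rnorm(10), ncol = 2)
write.csv(x, file.path(tmp, "a.csv"))
write.csv(x, file.path(tmp, "b.csv"))

# 'pattern' is a regular expression, so anchor on the literal extension
paths <- list.files(tmp, pattern = "\\.csv$", full.names = TRUE)

# One list element per file, nothing is overwritten
file_list <- lapply(paths, read.csv)

# Name each element after its file, without the extension (a.csv -> a)
names(file_list) <- tools::file_path_sans_ext(basename(paths))
names(file_list)
```

Each data set is then available as file_list$a or file_list[["b"]].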
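Another option is to assign each file to its own object, named after the file:
library(sas7bdat)
setwd(".... WHERE SAS FILES LIVES...")
# Read one file and assign it to a global object named after the file (file.sas7bdat -> file)
load.sas <- function(x) {
  name <- strsplit(x,"\\.")[[1]][1]
  assign(name, read.sas7bdat(x), envir=.GlobalEnv)
  TRUE
}
sapply(list.files(path=".", pattern="\\.sas7bdat$", full.names=FALSE), load.sas)
You can add further features to this code, for example to rewrite only some of the data.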

How to access parts of a list in R

I've got the optim function in r returning a list of stuff like this:
[[354]]
r k sigma
389.4 354.0 354.0
but when I try accessing, say, list$sigma, it doesn't exist and returns NULL.
I've tried attach, I've tried names, and I've tried assigning it to a matrix, but none of these worked.
Does anyone have any idea how I can access the lowest or highest value for sigma, r or k in my list?
Many thanks!!
str gives me this output:
List of 354
$ : Named num [1:3] -55.25 2.99 119.37
..- attr(*, "names")= chr [1:3] "r" "k" "sigma"
$ : Named num [1:3] -53.91 4.21 119.71
..- attr(*, "names")= chr [1:3] "r" "k" "sigma"
$ : Named num [1:3] -41.7 14.6 119.2
So I've got a double within a list within a list (?) I'm still mystified as to how I can cycle through the list and pick one out meeting my conditions without writing a function from scratch
The key issue is that you have a list of lists (or a list of data.frames, which in fact is also a list).
To confirm this, take a look at is(list[[354]]).
The solution is simply to add an additional level of indexing. Below you have multiple alternatives of how to accomplish this.
You can use a vector as an index to [[, so for example if you want to access the third element of the 354th element, you can use
myList[[ c(354, 3) ]]
You can also use character indices; however, all nested levels must then have names.
names(myList) <- as.character(1:length(myList))
myList[[ c("5", "sigma") ]]
Lastly, please try to avoid using names like list, data, df etc. These shadow built-in functions and lead to crashing code and errors that seem inexplicable until one realizes that one has tried to subset a function.
Edit:
In response to your question in the comments above: If you want to see the structure of an object (ie the "makeup" of the object), use str
> str(myList)
List of 5
$ :'data.frame': 1 obs. of 3 variables:
..$ a : num 0.654
..$ b : num -0.0823
..$ sigma: num -31
$ :'data.frame': 1 obs. of 3 variables:
..$ a : num -0.656
..$ b : num -0.167
..$ sigma: num -49
$ :'data.frame': 1 obs. of 3 variables:
..$ a : num 0.154
..$ b : num 0.522
..$ sigma: num -89
$ :'data.frame': 1 obs. of 3 variables:
..$ a : num 0.676
..$ b : num 0.595
..$ sigma: num 145
$ :'data.frame': 1 obs. of 3 variables:
..$ a : num -0.75
..$ b : num 0.772
..$ sigma: num 6
If you want -for example- all the sigmas, you can use sapply:
sapply(list, function(x)x["sigma"])
You can use that to find the minimum and maximum:
range(sapply(list, function(x)x["sigma"]))
Using do.call, you can do this:
do.call('[[', list(mylist, 354))['sigma']
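Putting the pieces together for the structure shown in the str output (a list of named numeric vectors), a minimal sketch might look like this. The three vectors reuse the rounded values printed in the question and stand in for the full list of 354; the name fits is hypothetical.

```r
# Stand-in for the list returned in the question (rounded values from str)
fits <- list(
  c(r = -55.25, k = 2.99, sigma = 119.37),
  c(r = -53.91, k = 4.21, sigma = 119.71),
  c(r = -41.7,  k = 14.6, sigma = 119.2)
)

# Pull one named value out of every element
sigmas <- vapply(fits, function(p) p[["sigma"]], numeric(1))

# Lowest and highest sigma
range(sigmas)

# Position of the element with the lowest sigma, and that whole element
best <- which.min(sigmas)
fits[[best]]

# Recursive indexing: third value of the second element
fits[[c(2, 3)]]
```

vapply() is used instead of sapply() here only so that the result is guaranteed to be a plain numeric vector; the same pattern works for r and k.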

Resources