I'm trying to improve my R code by removing plenty of for loops.
I would like to apply censtats from the NADA package to all of my data, grouped by several factors.
Here is an example of my code (with the for loops) using a simple data set:
library(NADA) # censtats
library(tidyr) # spread, nest
library(dplyr) # bind_rows, group_by
Data <- data.frame("A"=c("a","a","a","a","b","b","b","b"), "B" = c("c","c","c","d","c","c","d","d"),"X"=c(2,1,3,1,1,2,1,1), "Y"=c(FALSE,TRUE,FALSE,TRUE,TRUE,FALSE,TRUE,TRUE), Z = c(1,1,1,0,1,1,0,0))
Data_calc <- data.frame( #create empty database to increment at each loop
"K-M"=numeric(),check.names=FALSE, #result of censtats
MLE=numeric(), #result of censtats
ROS=numeric(), #result of censtats
A=factor(),
B=factor(),
stringsAsFactors=FALSE)
List_A <- unique(Data$A)
List_B <- unique(Data$B)
for (a in seq_along(List_A)){
for (b in seq_along(List_B)){
Temp <- subset(Data, A == List_A[a] & B == List_B[b]) # subset by A and B
if (nrow(Temp) > 1){ #condition 1 required by censtats
if (all(Temp$Z > 0)) { #condition 2 required by censtats; all() gives if() a single TRUE/FALSE
Temp <- censtats(Temp$X, Temp$Y) #formatting the results
Temp$myNames <- rownames(Temp)
Temp<- spread(Temp[c(2,4)], myNames, mean)
Temp$A <- List_A[a]
Temp$B <- List_B[b]
Data_calc <- bind_rows(Data_calc, Temp)
} else {}
} else {} }}
These are the results we obtain:
> Data_calc
K-M MLE ROS A B
1 2.333333 1.977163 1.991738 a c
2 2.000000 1.369061 2.000000 b c
In order to improve my code, I would like to remove the loops by grouping the factors using nest().
Data_nest <- nest(group_by(Data, A, B))
> Data_nest
# A tibble: 4 x 3
A B data
<fct> <fct> <list>
1 a c <tibble [3 x 3]>
2 a d <tibble [1 x 3]>
3 b c <tibble [2 x 3]>
4 b d <tibble [2 x 3]>
I'm stuck here: before calling censtats I have to check conditions 1 and 2, but I cannot find how to apply those conditions row by row.
Could anyone tell me the best way (with or without nest) to improve the code? In reality my database has 4 factors and almost 2000 rows, each containing a list, and the loop method takes a lot of time.
Thanks in advance.
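One possible loop-free shape, sketched in base R: split the data by the grouping factors, keep only the groups that pass both conditions, then apply the statistic to each remaining group. To keep the sketch runnable without NADA, group_stat below is a hypothetical stand-in; in the real code its body would call censtats(d$X, d$Y) and reshape the result. Condition 2 is assumed here to mean that every Z in the group is positive.

```r
# Example data from the question
Data <- data.frame(
  A = c("a", "a", "a", "a", "b", "b", "b", "b"),
  B = c("c", "c", "c", "d", "c", "c", "d", "d"),
  X = c(2, 1, 3, 1, 1, 2, 1, 1),
  Y = c(FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE),
  Z = c(1, 1, 1, 0, 1, 1, 0, 0)
)

# Hypothetical stand-in for the per-group statistic (not censtats itself)
group_stat <- function(d) data.frame(n = nrow(d), mean_X = mean(d$X))

# One data frame per existing A/B combination
groups <- split(Data, list(Data$A, Data$B), drop = TRUE)

# Conditions 1 and 2, evaluated once per group
keep <- vapply(groups, function(d) nrow(d) > 1 && all(d$Z > 0), logical(1))

# Apply the statistic to the groups that passed and keep their A/B labels
res <- lapply(groups[keep], function(d) cbind(d[1, c("A", "B")], group_stat(d)))
Data_calc2 <- do.call(rbind, res)
rownames(Data_calc2) <- NULL
Data_calc2
```

With this data only the a/c and b/c groups pass both conditions, which matches the two rows produced by the loop version.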
To start: I've seen this post and no, tidyr's unnest doesn't work here. I am doing an lapply where the applied function returns a list with named entries (see the example func at the bottom for clarity):
ls <- lapply(x, func)
Now if I look at ls, it is a list of lists, and in the RStudio data viewer it appears as having Name, Type, and Value columns.
Now, if I use
df <- bind_rows(ls)
I get exactly what I want, except I then need to bind the dataframe containing x to df. This is the problem, because for each x, func will return a variable number of rows, which means I need to run an equivalent of bind_rows after I have already attached ls to my dataframe.
An example is as below:
func <- function(x){
res <- list()
res$name <- 1:x
res$val <- 1:x
return(res)
}
df <- data.frame(nums = 1:3, letters = c("A", "B", "C")) # use = (not <-) so the columns are actually named nums and letters
ls <- lapply(df$nums, func)
bind_rows(ls) gives:
name val
<int> <int>
1 1 1
2 1 1
3 2 2
4 1 1
5 2 2
6 3 3
and the desired output is:
name val nums letters
<int> <int> <dbl> <chr>
1 1 1 1 A
2 1 1 2 B
3 2 2 2 B
4 1 1 3 C
5 2 2 3 C
6 3 3 3 C
Note that func here creates n rows given x = n. This is not the case for my actual function. func(n) can produce any positive number of rows.
Maybe you're looking for something more "canned", but you could write a function that would produce the desired output like this:
out <- function(data, varname){
l <- lapply(data[[varname]], func)
l <- lapply(1:length(l), function(x)do.call(data.frame, c(l[[x]], zz_obs=x)))
l <- do.call(rbind, l)
data$zz_obs <- 1:nrow(data)
if(!all(data$zz_obs %in% l$zz_obs))warning("Not all rows of data in output\n")
data <- dplyr::full_join(l, data, by="zz_obs")
data[,-which(names(data) == "zz_obs")]
}
out(df, "nums")
# name val nums letters
# 1 1 1 1 A
# 2 1 1 2 B
# 3 2 2 2 B
# 4 1 1 3 C
# 5 2 2 3 C
# 6 3 3 3 C
You can try mapply which is similar to lapply, but allows multiple vectors or lists to be passed to iterate over their values:
library(dplyr)
func <- function(x, y){
res <- list()
res$name <- 1:x
res$val <- 1:x
res$let <- rep(y, x)
return(res)
}
df <- data.frame(nums = 1:3, letters = c("A", "B", "C")) # use = (not <-) so the columns are actually named nums and letters
ls <- mapply(
func,
x = df$nums,
y =df$letters,
SIMPLIFY = FALSE
)
bind_rows(ls)
# A tibble: 6 x 3
# name val let
# <int> <int> <chr>
# 1 1 1 A
# 2 1 1 B
# 3 2 2 B
# 4 1 1 C
# 5 2 2 C
# 6 3 3 C
In the interim, the function I will be using to do this is:
merge_and_flatten <- function(x, y){
for (i in 1:nrow(x)){
y[[i]][names(x)] <- lapply(x[i, ], rep, times = length(y[[i]][[1]]))
}
return(bind_rows(y))
}
This is the cleanest solution I could come up with. Here, x serves as my df, and y serves as ls. It works by reducing the problem to bind_rows: it simply adds elements to ls which contain the columns in x. I absolutely want a cleaner solution, but this works for anyone who needs it.
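Another way to look at the same problem, sketched in base R under the assumption that every element returned by func can be turned into a data frame: count how many rows each call produced, then repeat each row of df that many times and bind the two pieces side by side. The names res_list, pieces, n_rows and out are only illustrative.

```r
# Example function and data from the question
func <- function(x) list(name = seq_len(x), val = seq_len(x))
df <- data.frame(nums = 1:3, letters = c("A", "B", "C"),
                 stringsAsFactors = FALSE)

res_list <- lapply(df$nums, func)

# One small data frame per input row
pieces <- lapply(res_list, as.data.frame)

# Number of output rows produced by each input row
n_rows <- vapply(pieces, nrow, integer(1))

# Repeat each row of df to line up with the stacked results
out <- cbind(
  do.call(rbind, pieces),
  df[rep(seq_len(nrow(df)), times = n_rows), , drop = FALSE]
)
rownames(out) <- NULL
out
```

Because the repetition counts come from the results themselves, this does not rely on func(n) returning exactly n rows.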
I have a tibble in which one column is a list containing 2x2 matrices. I want to be able to select a specific element from the matrices across all rows in the tibble. I am able to select a specific element from one tibble row using indexing:
t1 <- tibble(x = 1:2, y = 1, z = x ^ 2 + y)
rM1 <- matrix(c(2,3,1,4), nrow=2, ncol=2, byrow = TRUE)
rM2 <- matrix(c(10,19,9,15), nrow=2, ncol=2, byrow = TRUE)
t1$my.lists <- list(rM1,rM2)
t1[[4]][[2]][[2,2]]
[1] 15
However when I try to access that specific element across multiple rows I get an error:
t1[[4]][1:2][[2,2]]
Error in t1[[4]][1:2][[2, 2]] : incorrect number of subscripts
I have also tried using piping and functions such as slice but still haven't been able to achieve the desired result. In this example I expect a return of:
[1] 4 15
where 4 is the 2x2 element from rM1 and 15 is the 2x2 element from rM2. Of course I could write a loop to achieve this but I assume there is also a more direct way to do this.
We can use sapply to loop over the list column number 4, and extract the elements based on row/column index
sapply(t1[[4]], function(x) x[2, 2])
#[1] 4 15
Or with map
library(dplyr)
library(purrr)
t1 %>%
mutate(new = map_dbl(my.lists, ~ .x[2, 2]))
# A tibble: 2 x 5
# x y z my.lists new
# <int> <dbl> <dbl> <list> <dbl>
#1 1 1 2 <dbl[,2] [2 × 2]> 4
#2 2 1 5 <dbl[,2] [2 × 2]> 15
The OP's code didn't work out because the below is a list
t1[[4]][1:2]
#[[1]]
# [,1] [,2]
#[1,] 2 3
#[2,] 1 4
#[[2]]
# [,1] [,2]
#[1,] 10 19
#[2,] 9 15
and the row/column indexing can be done by selecting each list element one by one or using a loop
t1[[4]][1:2][[2]][2,2]
#[1] 15
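If a type-stable result is preferred, vapply is a possible alternative to sapply, since it fails loudly when an element does not yield a single number. A minimal sketch on a plain list holding the same two matrices (my_lists and elem_22 are illustrative names):

```r
rM1 <- matrix(c(2, 3, 1, 4), nrow = 2, ncol = 2, byrow = TRUE)
rM2 <- matrix(c(10, 19, 9, 15), nrow = 2, ncol = 2, byrow = TRUE)
my_lists <- list(rM1, rM2)

# Extract element [2, 2] from every matrix; numeric(1) declares the
# expected type and length of each result
elem_22 <- vapply(my_lists, function(m) m[2, 2], numeric(1))
elem_22
```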
I'm trying to get the overall distance that an animal traveled by using a function that uses differences in lat long positions to output a distance traveled. Having some issues with nested loops.
The function ComputeDistance takes the arguments Lat1, Lat2, Long1, Long2 in that order. Column 5 of DistTest contains latitude values and column 6 contains longitude values.
So for the object "output", I'm trying to get sequential distances going through all 38 rows.
e.g.
ComputeDistance(DistTest[1,5],DistTest[2,5],DistTest[1,6],DistTest[2,6])
followed by:
ComputeDistance(DistTest[2,5],DistTest[3,5],DistTest[2,6],DistTest[3,6])
followed by:
ComputeDistance(DistTest[3,5],DistTest[4,5],DistTest[3,6],DistTest[4,6])
....
ComputeDistance(DistTest[37,5],DistTest[38,5],DistTest[37,6],DistTest[38,6])
I'm thinking that the problem is that the loop is going through every possible combination of DL and EL, not just going sequentially in order.
Below is the code I'm using currently.
## rows 1-37 and rows 2-38
DL <- 1:37
EL <- 2:38
## subsetting for one tagged animal
DistTest <- subset(Dispsum, Tag.ID == 1658)
## creating blank objects to save output in
testid <- c()
testdistance <- c()
for( j in DL){
for( k in EL){
output <- (ComputeDistance(DistTest[j,5], DistTest[k,5],DistTest[j,6], DistTest[k,6]))
Name <- 1658
testid <- rbind(testid, Name)
testdistance <- rbind(testdistance,output)
}
}
Generally in R, it is better to find functions that do the looping for you, as most of them are set up for that. In this case, you can try using mutate and lead from the dplyr package:
library(dplyr)
df <- dplyr::tibble(lat = 1:5, lon = 5:1)
df
# A tibble: 5 x 2
# lat lon
# <int> <int>
# 1 1 5
# 2 2 4
# 3 3 3
# 4 4 2
# 5 5 1
# this relies on ComputeDistance accepting whole vectors
df %>% mutate(distance = ComputeDistance(lat, lead(lat), lon, lead(lon)))
# A tibble: 5 x 3
# lat lon distance
# <int> <int> <dbl>
# 1 1 5 1.41
# 2 2 4 1.41
# 3 3 3 1.41
# 4 4 2 1.41
# 5 5 1 NA
If you really want to stick with for loops, you only need one for this problem. You were right in saying that you're going through every combination. One alternative would be:
for (i in 1:37) {
output <- ComputeDistance(DistTest[i, 5], DistTest[i + 1, 5],
DistTest[i, 6], DistTest[i + 1, 6])
Name <- 1658
testid <- rbind(testid, Name)
testdistance <- rbind(testdistance, output)
}
One reason to avoid this construct is that you are incrementally growing an object (see here for more about that).
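To illustrate the point about growing objects, here is a small sketch that preallocates the result vector, plus a vectorised variant built with head() and tail(). The real ComputeDistance is not shown in the thread, so the version below is a hypothetical Euclidean stand-in, and the lat/lon values are made up.

```r
# Hypothetical stand-in for the real ComputeDistance (assumed vectorised)
ComputeDistance <- function(lat1, lat2, lon1, lon2) {
  sqrt((lat2 - lat1)^2 + (lon2 - lon1)^2)
}

lat <- c(1, 2, 3, 4, 5)
lon <- c(5, 4, 3, 2, 1)
n <- length(lat)

# Preallocate once instead of calling rbind() on every iteration
step_dist <- numeric(n - 1)
for (i in seq_len(n - 1)) {
  step_dist[i] <- ComputeDistance(lat[i], lat[i + 1], lon[i], lon[i + 1])
}

# Same result without a loop: pair each point with its successor
step_dist_vec <- ComputeDistance(head(lat, -1), tail(lat, -1),
                                 head(lon, -1), tail(lon, -1))

# Overall distance travelled
total_dist <- sum(step_dist)
```

The vectorised form only works if the distance function accepts vectors; otherwise the preallocated loop is the safer choice.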
I want to know to what extent it is possible to use purrr's mapping functions to create objects in general, though at the moment and with the example below I'm looking at data frames.
A<-seq(1:5)
B<-seq(6:10)
C<-c("x","y","x","y","x")
dat <- data.frame(A, B, C)
cols<-names(dat)
create_df<-function(x) {
x<- dat[x]
return(x)
}
A<-create_df("A")
This will create a data frame called A with column A from dat. I want to create data frames A/B/C, each with one column. I have tried different ways of specifying the .f argument as well as different map functions (map, map2, map_dfc, etc.). My original best guess:
map(.x=cols,~create_df(.x))
Clarification: I am asking for help because all of the specifications of map that I have tried have given an error.
Code that worked:
map(names(dat), ~assign(.x, dat[.x], envir = .GlobalEnv))
This creates A/B/C as data frames and prints to the console (which I don't need but does not bother me for now).
Using the purrr package, I think your custom function is not necessary.
The function includes a reference to the data, which is not optimal (especially if it doesn't exist in the environment).
To return a list of single-column data frames:
cols<-names(dat)
map(cols, ~dat[.x])
or alternatively: map(names(dat), ~dat[.x])
returns:
[[1]]
# A tibble: 5 x 1
A
<int>
1 1
2 2
3 3
4 4
5 5
[[2]]
# A tibble: 5 x 1
B
<int>
1 1
2 2
3 3
4 4
5 5
[[3]]
# A tibble: 5 x 1
C
<chr>
1 x
2 y
3 x
4 y
5 x
If you want to stick with tidyverse principles, you can store them within a dataframe as a list-column.
dfs <-
tibble(column = cols) %>% # data_frame() is deprecated in favour of tibble()
mutate(data = map(cols, ~dat[.x]))
# A tibble: 3 x 2
column data
<chr> <list>
1 A <tibble [5 x 1]>
2 B <tibble [5 x 1]>
3 C <tibble [5 x 1]>
You can pull out individual data as needed:
B <- dfs$data[[2]]
# A tibble: 5 x 1
B
<int>
1 1
2 2
3 3
4 4
5 5
Along the lines of your original suggestion, here's an alternative function that uses purrr::map within it. I'm not sure how good of an idea this is, but maybe it has a use:
create_objects_from_df <- function(dat) {
map(names(dat), ~assign(.x, dat[.x], envir = .GlobalEnv))
}
create_objects_from_df(dat)
This creates the objects in your global environment, as individual objects with the column names.
We can use split from base R to get a list of one column data.frames
lst <- split.default(dat, names(dat))
It is better to keep it in a list, but if the intention is to have multiple objects in the global environment
list2env(lst, envir = .GlobalEnv)
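If the list should keep the original column order and carry the column names, a base R sketch along these lines may help; it uses its own small example data frame, and lst is just an illustrative name. Note that split.default orders its result by the sorted grouping values, whereas this version follows names(dat).

```r
dat <- data.frame(A = 1:5, B = 6:10, C = c("x", "y", "x", "y", "x"),
                  stringsAsFactors = FALSE)

# One single-column data frame per column, named after that column
lst <- setNames(lapply(names(dat), function(nm) dat[nm]), names(dat))

names(lst)
lst$B
```

A named list like this can be passed straight to list2env() if separate objects are really needed.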
I have the following code:
generator <- function(n){
nodes <- c()
distances <- c()
for (main in 1:n){
for (i in 1:n) {
for (j in 1:n){
if (main != i & i != j & main != j & i < j){
nodes <- c(nodes, paste(main, i, j, sep=",", collapse=""))
distances <- c(distances, main+i+j)}}}}
data <- data.frame(nodes, distances)
return(data)}
Running generator(4), I get the following output:
nodes distances
1 1,2,3 6
2 1,2,4 7
3 1,3,4 8
4 2,1,3 6
5 2,1,4 7
6 2,3,4 9
7 3,1,2 6
8 3,1,4 8
9 3,2,4 9
10 4,1,2 7
11 4,1,3 8
12 4,2,3 9
What I would like is to have the items in the "nodes" column be actual vectors of values, with the goal of comparing different triplets of nodes and finding the common members. It is currently just a string. So, for instance, I would like a$nodes[1][1], to yield 1, or something along those lines, so I can extract every individual value from the nodes triplets.
I currently have paste(main, i, j, sep=",") and I tried replacing this with c(main, i, j) but what happened is that I got a data frame of 36 rows instead of twelve: each individual node was its own row.
Thank you.
You need to work with list columns to do what you want to do.
I suggest working with tibbles instead of regular data frames, they're much more convenient for this type of work.
If you'd rather stick with base R you can replace data <- tibble::tibble(nodes, distances) by:
data <- data.frame(distances)
data$nodes <- nodes
new function
generator <- function(n){
nodes <- list() # changed this from `c()` to `list()`
distances <- c()
for (main in 1:n){
for (i in 1:n) {
for (j in 1:n){
if (main != i & i != j & main != j & i < j){
nodes <- append(nodes, list(c(main, i, j))) # changed this to append a list instead of a vector
distances <- c(distances, main+i+j)}}}}
data <- tibble::tibble(nodes, distances) # changed `data.frame` to `tibble`
return(data)}
output
df <- generator(4)
df
# A tibble: 12 x 2
# nodes distances
# <list> <int>
# 1 <int [3]> 6
# 2 <int [3]> 7
# 3 <int [3]> 8
# 4 <int [3]> 6
# 5 <int [3]> 7
# 6 <int [3]> 9
# 7 <int [3]> 6
# 8 <int [3]> 8
# 9 <int [3]> 9
# 10 <int [3]> 7
# 11 <int [3]> 8
# 12 <int [3]> 9
df$nodes[[1]]
# [1] 1 2 3
side note
You're growing a list, which can be slow. If you run into performance issues, try defining nodes with nodes <- vector("list", length = max_length) and trimming it at the end with nodes <- nodes[lengths(nodes)!=0].
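A minimal sketch of that side note, in base R. The function name generator_prealloc is hypothetical, and max_length = n^3 is an assumed upper bound (the total number of loop iterations), not something taken from the original code.

```r
generator_prealloc <- function(n) {
  max_length <- n^3                      # assumed upper bound on results
  nodes <- vector("list", length = max_length)
  distances <- integer(max_length)
  k <- 0L                                # number of slots filled so far
  for (main in seq_len(n)) {
    for (i in seq_len(n)) {
      for (j in seq_len(n)) {
        if (main != i && main != j && i < j) {
          k <- k + 1L
          nodes[[k]] <- c(main, i, j)    # fill a slot instead of appending
          distances[k] <- main + i + j
        }
      }
    }
  }
  keep <- lengths(nodes) != 0            # trim the unused slots
  data <- data.frame(distances = distances[keep])
  data$nodes <- nodes[keep]              # list column in a base data frame
  data
}

df <- generator_prealloc(4)
df$nodes[[1]]

# Common members of two triplets
common <- intersect(df$nodes[[1]], df$nodes[[2]])
```

Since each nodes entry is now an integer vector, set operations such as intersect() can compare triplets directly.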