I have sets of weather station data which I wish to compare by site. I need to do this efficiently because each set is large, and I also wish to build my experience with purrr. My issue concerns the use of keep/discard (or list.exclude from rlist) to remove days (id) with incomplete data. It should be simple, but I can't get the syntax right. I have tried to approach the problem by computing the dimensions of each tibble and then extracting the first element (the number of rows) to give a single value per id. I am using R 3.6.1 on a PC running Windows 10. Here is a trivial example. I wish 'mylist' to comprise id = 'a' only in this example.
mylist <- tibble(id = c(rep("a",5),rep("b",4)),
dl = c(seq(1,5,1), seq(1,4,1)),
v = c(seq(0, 40, 10), seq(50, 80, 10))) %>%
group_by(id) %>%
nest() %>%
mutate(ddim = map(data, dim)) %>%
mutate(nn = map(ddim, 1)) # pluck the first element of each dim vector (the row count)
mylist
# A tibble: 2 x 4
# Groups: id [2]
id data ddim nn
<chr> <list<df[,2]>> <list> <list>
1 a [5 x 2] <int [2]> <int [1]>
2 b [4 x 2] <int [2]> <int [1]>
It is not clear how "incomplete data" is defined, but since the question is really about how to filter the rows of a nested tibble on a condition, I have used a placeholder condition: keep the rows where the first value of the v column in the nested tibble is 0. The condition can be changed once the OP clarifies.
We can use filter to select rows and map_lgl to loop over data column for each id.
library(tidyverse)
mylist %>% filter(map_lgl(data, ~first(.x$v) == 0))
# id data
# <chr> <list<df[,2]>>
#1 a [5 × 2]
Similarly, in base R, we can use subset with sapply
subset(mylist, sapply(data, function(x) x$v[1] == 0))
data
mylist <- tibble(id = c(rep("a",5),rep("b",4)),
dl = c(seq(1,5,1), seq(1,4,1)),
v = c(seq(0, 40, 10), seq(50, 80, 10))) %>%
group_by(id) %>% nest()
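Since the question mentions keep/discard, here is a minimal sketch of that route on a plain list of data frames. It assumes, purely for illustration, that a "complete" id is one with exactly 5 observations; the names stations, expected_rows, by_id, complete and incomplete are hypothetical and not from the original post.

```r
library(purrr)

# Toy data mirroring the question: id "a" has 5 rows, id "b" has 4.
stations <- data.frame(
  id = c(rep("a", 5), rep("b", 4)),
  dl = c(1:5, 1:4),
  v  = c(seq(0, 40, 10), seq(50, 80, 10))
)

# ASSUMPTION: an id is "complete" when it has exactly this many rows.
expected_rows <- 5

# One data frame per id, held in a named list.
by_id <- split(stations, stations$id)

# keep() retains elements for which the predicate is TRUE,
# discard() retains those for which it is FALSE.
complete   <- keep(by_id, ~ nrow(.x) == expected_rows)
incomplete <- discard(by_id, ~ nrow(.x) == expected_rows)

names(complete)
names(incomplete)
```

The same predicate can be dropped into filter(map_lgl(data, ...)) when the data stay in a nested tibble rather than a plain list.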
I'm trying to use purrr to summarize a particular column of a nested list column.
library(tidyverse)
z <- tibble(name = c("Bill","Bill","Bill","Sue","Sue"), grade =c(90L,95L,70L,100L,98L), time=c(10L,11L,10L,15L,16L))
summary <- z %>%
group_by(name) %>%
nest %>%
mutate(n = map_int(data,nrow)) %>%
mutate(avg = map(data$grade,mean)) %>%
mutate(ttl_time = map(data$time, sum))
When I run this I get an error: Column y must be length 3 (the number of rows) or one, not 2
My target output is:
name data n avg ttl_time
Bill [3x3] 3 92 31
Sue [2x3] 2 99 31
When I remove the last two mutate calls the script works as anticipated. This leads me to believe that I'm not isolating the grade and time columns within the data column, but I can't figure out what I'm doing wrong.
I watched the RStudio video "Working with List Columns" and I believe I'm doing the same thing I saw in the video.
z %>%
group_by(name) %>%
nest() %>%
mutate(n = map_int(data, nrow),
avg = map_dbl(data, ~ mean(.x$grade)),
ttl_time = map_dbl(data, ~ sum(.x$time)))
# # A tibble: 2 x 5
# name data n avg ttl_time
# <chr> <list> <int> <dbl> <dbl>
# 1 Bill <tibble [3 × 2]> 3 85 31
# 2 Sue <tibble [2 × 2]> 2 99 31
The formula notation with ~ is a shortcut for e.g. function(.x) mean(.x$grade)
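As a quick check of that equivalence, the following sketch runs both forms on a small named list and compares the results. The list grades and the names avg_formula and avg_function are made up for this illustration.

```r
library(purrr)

# Hypothetical stand-in for the nested data column: one data frame per student.
grades <- list(
  Bill = data.frame(grade = c(90L, 95L, 70L)),
  Sue  = data.frame(grade = c(100L, 98L))
)

# Formula shortcut and explicit anonymous function side by side.
avg_formula  <- map_dbl(grades, ~ mean(.x$grade))
avg_function <- map_dbl(grades, function(.x) mean(.x$grade))

identical(avg_formula, avg_function)
```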
OP's error indeed stems from the fact that map cannot iterate directly over each grade element of the data list, at least not with this syntax.
data$grade is understood as an element of the list data that has name grade, and there is no such element.
This alternative syntax might help understand how this is achievable:
z %>%
group_by(name) %>%
nest() %>%
mutate(n = map_int(data, nrow),
avg = map_dbl(map(data, "grade"), mean),
ttl_time = map_dbl(map(data, "time"), sum))
where map(data, "grade") extracts each grade component from the elements of the list column data.
Though this is, in my opinion, less readable than the first suggestion.
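To see why data$grade finds nothing, it may help to try it on a plain list of data frames, which is roughly what a list column is. This is a sketch with made-up names (data_col, grade_list); it uses a base list rather than a real nested tibble.

```r
library(purrr)

# A plain list standing in for the nested `data` column.
data_col <- list(
  data.frame(grade = c(90L, 95L, 70L), time = c(10L, 11L, 10L)),
  data.frame(grade = c(100L, 98L),     time = c(15L, 16L))
)

# `$` looks for an element of the list itself called "grade"; there is none.
is.null(data_col$grade)

# map() with a string extracts "grade" from each data frame in turn.
grade_list <- map(data_col, "grade")
map_dbl(grade_list, mean)
```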
I am trying to create a list of dataframes and then use that list to create another dataframe describing the attributes of each dataframe. I wanted to do this with a loop.
I tried creating a list of dataframes. Then I used that list in a loop that says for each row in my new dataframe, put in the name of the dataframe in one column and the number of rows in that dataframe in another column.
df_Months <- as.list(c(df_Jan2018, df_Feb2018, df_March2018, df_April2018, df_May2018))
for i in 1:length(df_Months) {
Monthly_Size$Month[i] <- paste(df_Months [i])
Monthly_Size$Size[i] <- nrow(df_Months[i])
}
If I do nrow(df_Months[1]) the result is NULL, even though I know that is not the case, because if I just do nrow(df_Jan2018) it gives me back the correct number of rows.
Here is a solution using the purrr and dplyr package that should work on your data. You wouldn't need the for loop anymore.
library("purrr")
library("dplyr")
test_df <- data.frame( a = c(1,2,3,4,NA),
b = c(NA,6,5,7,9))
test_df2 <- data.frame(c = c(1:10),
d = c(11:20))
df_list <- list(test_df = test_df, test_df2 = test_df2)
res <- map_dbl(df_list,nrow)
tibble(df = names(res), nrow = res)
The output looks like this
# A tibble: 2 x 2
df nrow
<chr> <dbl>
1 test_df 5
2 test_df2 10
A slightly different approach would be to put the above list df_list into a tibble, then do operations on that tibble and create new columns with the information you are looking for.
df_tibble <- tibble(name = names(df_list), df = df_list)
df_tibble %>% mutate(nrow = map_dbl(df, ~ nrow(.x)))
# A tibble: 2 x 3
name df nrow
<chr> <list> <dbl>
1 test_df <data.frame [5 × 2]> 5
2 test_df2 <data.frame [10 × 2]> 10
You could go on and include more information in this way. For example the number of columns.
df_tibble %>% mutate(nrow = map_dbl(df, ~ nrow(.x)),
ncol = map_dbl(df, ~ ncol(.x)))
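For completeness, here is a base R sketch of why nrow(df_Months[1]) came back as NULL in the question. Two things are probably at play: c() on data frames flattens them into a single list of columns, and single brackets return a sub-list rather than the data frame itself. The objects jan, feb, months and sizes are hypothetical stand-ins for the monthly data frames.

```r
# Two small stand-ins for the monthly data frames.
jan <- data.frame(x = 1:3, y = 4:6)
feb <- data.frame(x = 1:2, y = 3:4)

# c() concatenates the columns: the result is a list of 4 vectors,
# not a list of 2 data frames.
flattened <- c(jan, feb)
length(flattened)

# list() keeps each data frame intact.
months <- list(jan = jan, feb = feb)

nrow(months[1])    # NULL: `[` returns a list of length 1, which has no dim
nrow(months[[1]])  # `[[` returns the data frame itself

# A loop-free summary table of names and sizes.
sizes <- data.frame(
  Month = names(months),
  Size  = vapply(months, nrow, integer(1))
)
sizes
```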
I would like to conduct a very involved loop. I have multiple regions, each with hundreds of plots in my real data frame. I would like to subset by region and then by plot, and perform various functions on the subsets to ultimately calculate the dissimilarity owed only to species that are shared. I will preface this by saying that each row represents an interaction.
My example df:
set.seed(540)
df<- data.frame(region= c(rep(1, 16), rep(2,8)),
plot= c(rep("A",5), rep("B",9), rep("C", 2), rep("D", 6),rep("E", 2)),
plantsp= sample(1:24,24, replace= TRUE),
lepsp= sample(1:24,24,replace= TRUE),
psitsp= sample(1:24,24,replace= TRUE))
df[] <- lapply(df, as.character)
df$plantsp<-paste('plantsp', df$plantsp, sep='_')
df$lepsp<-paste('lepsp', df$lepsp, sep='_')
df$psitsp<-paste('psitsp', df$psitsp, sep='_')
df$paste1<- paste(df$plantsp, df$lepsp, sep='_')
df$paste2<- paste(df$lepsp, df$psitsp, sep='_')
df$paste3<- paste(df$plantsp,df$lepsp, df$psitsp)
Step1: Subset df by region. Example:
region_sub <- split(df, df$region)
Step2: Subset df by plot. Example:
plot_sub <- split(region_sub[[1]], region_sub[[1]][[2]])
Step3: We will call each subset (each list component) from the step above a plot subset. I will use the first subset (region 1, plot A) as the example for all subsequent outputs and call it plot_sub1. I want to compare plot_sub1 to the original df to make three df subsets: df_sub1, df_sub2 and df_sub3. First, df_sub1 consists of the rows of df whose plantsp and lepsp entries both match entries in plot_sub1. Rows with any unique entries are removed, as are rows where the plantsp matches but the lepsp does not, and vice versa. Example of df_sub1:
df_sub1<- df[c(1,2,3,4,5,22),c(1:4,6)]
Notice, only those rows with shared species remain. Further, only those rows with shared species that also interact remain. Also, I have removed unnecessary columns (e.g. psitsp, paste2, paste3) to draw your attention to the results of this step. These columns do not need to be removed for the code.
Step4: Repeat step3 for lepsp and psitsp columns to make df_sub2. Example:
df_sub2<- df[1:5,c(1:2,4,5,7)]
Step5: Repeat step3 for plantsp,lepsp and psitsp column to make df_sub3. Example:
df_sub3<- df[1:5,c(1:5,8)]
Step6: Now that all subsets are made, I want to count the matching elements in the paste1 column between plot_sub1 and df_sub1 (=5). This count would be stored in a vector match; counts of matching and unique elements go into the match and unique vectors, respectively. Example:
match<- length(intersect(df_sub1$paste1, plot_sub[[1]]$paste1))
match
I also want to count the unique elements (=1). This would be stored in a vector unique. This will be repeated for plot_sub1 and df_sub2, and for plot_sub1 and df_sub3. I am not sure how to count unique elements between two data frames, so I cannot offer example code for that.
unique<- 1
Note: Matches among plot_sub only need to be counted once in the event that df_sub has repeated interactions or matches. This needs to account for the presence or absence of matches, not their abundance.
In summary for this subset, the two vectors would be:
match<- c( length(intersect(df_sub1$paste1, plot_sub[[1]]$paste1)),
length(intersect(df_sub2$paste2, plot_sub[[1]]$paste2)),
length(intersect(df_sub3$paste3, plot_sub[[1]]$paste3)))
match
unique<-c(1,0,0)
The sum will then be totaled for each vector. Example:
sum_match<- 15
sum_unique<- 1
Step7: Lastly, these values would be input into a function:
((a + b)/((2*a + b)/2) - 1) Where a= sum_match and b=sum_unique.
The value is then input into the result vector res_vec.
Step8: This process (step3-7) would be iterated for each plot subset.
Effectively, this will calculate the dissimilarity of shared interactions among plot interactions and the corresponding metaweb (all possible interactions). This is a modification from (Poisot et al 2012) to account for tritrophic interactions.
It's quite pathetic, but to start the for loop I have:
res_vec<- NA
for (i in 1:length(unique(df$region)))
{
for (j in 1:length(unique(df$plot)))
{
I really appreciate any time one is willing to help me realize the arguments within the loop. That is where it gets tricky for me.
Thanks @Gregor for all the clarification you've already done in the comments!
Here is my solution using the tidyverse.
CODE + EXPLANATION
## Load packages
library(tidyverse)
## Nest data
new_df <- df %>%
group_by(region, plot) %>%
nest(.key = plot_sub)
new_df
# A tibble: 5 x 3
# region plot plot_sub
# <dbl> <fctr> <list>
# 1 1 A <tibble [5 x 3]>
# 2 1 B <tibble [9 x 3]>
# 3 1 C <tibble [2 x 3]>
# 4 2 D <tibble [6 x 3]>
# 5 2 E <tibble [2 x 3]>
The column plot_sub contains the same data as the list with the same name in your question. Think of this column as a list of dataframes.
I now write a function to create the df_sub's. This keeps the code cleaner and avoids unnecessary repetition. The function will then be applied to the plot_sub column.
# Function to create the df_sub
# Takes the plot_sub, original dataframe (df) and a list of columns, which should be compared
# Returns the desired df_sub with new interactions of species which are in plot_sub
# Only unique interactions are returned
create_df_sub <- function(plot_sub, df, col_list){
# Filter df such that it only contains species which are in plot_sub
for (x in col_list) {
df <- df[df[[x]] %in% plot_sub[[x]], ]
}
# Combine plot_sub and filtered df
df_sub <- rbind(plot_sub[, col_list], df[, col_list])
# Paste relevant columns together
df_sub$paste_col <- do.call(paste, c(df_sub[, col_list], sep = '_'))
# Exclude duplicated values
df_sub <- df_sub[!duplicated(df_sub$paste_col), ]
return(df_sub)
}
Now I define the columns I want to create the df_sub with and then apply the function to the plot_sub-column
col_list1 <- c('plantsp', 'lepsp')
col_list2 <- c('lepsp', 'psitsp')
col_list3 <- c('plantsp', 'lepsp', 'psitsp')
new_df <- new_df %>%
mutate(df_sub1 = map(plot_sub, create_df_sub, df = df, col_list = col_list1),
df_sub2 = map(plot_sub, create_df_sub, df = df, col_list = col_list2),
df_sub3 = map(plot_sub, create_df_sub, df = df, col_list = col_list3))
map takes a vector or list as argument and applies the specified function to each element (like lapply). Compare the first elements of df_sub1 and plot_sub to see the difference.
new_df$plot_sub[[1]]
# A tibble: 5 x 3
# plantsp lepsp psitsp
# <chr> <chr> <chr>
# 1 plantsp_2 lepsp_19 psitsp_19
# 2 plantsp_21 lepsp_19 psitsp_4
# 3 plantsp_19 lepsp_2 psitsp_11
# 4 plantsp_9 lepsp_13 psitsp_24
# 5 plantsp_24 lepsp_9 psitsp_2
new_df$df_sub1[[1]]
# A tibble: 6 x 3
# plantsp lepsp paste_col
# <chr> <chr> <chr>
# 1 plantsp_2 lepsp_19 plantsp_2_lepsp_19
# 2 plantsp_21 lepsp_19 plantsp_21_lepsp_19
# 3 plantsp_19 lepsp_2 plantsp_19_lepsp_2
# 4 plantsp_9 lepsp_13 plantsp_9_lepsp_13
# 5 plantsp_24 lepsp_9 plantsp_24_lepsp_9
# 6 plantsp_9 lepsp_2 plantsp_9_lepsp_2
The new interaction is added in df_sub1.
To extract matching and unique values, I use inner_join and anti_join on the plot_sub-column and the different df_sub's
new_df <- new_df %>%
mutate(match1 = map2(df_sub1, plot_sub, inner_join, by = col_list1),
match2 = map2(df_sub2, plot_sub, inner_join, by = col_list2),
match3 = map2(df_sub3, plot_sub, inner_join, by = col_list3),
unique1 = map2(df_sub1, plot_sub, anti_join, by = col_list1),
unique2 = map2(df_sub2, plot_sub, anti_join, by = col_list2),
unique3 = map2(df_sub3, plot_sub, anti_join, by = col_list3))
inner_join returns all rows that have matching values in the columns given in the by argument, whereas anti_join returns all rows of df_sub that have no match in plot_sub.
Here I use map2, which takes two vectors or lists and applies the specified function to their elements in parallel.
new_df$match1[[1]]
# A tibble: 5 x 4
# plantsp lepsp psitsp paste_col
# <chr> <chr> <chr> <chr>
# 1 plantsp_2 lepsp_19 psitsp_19 plantsp_2_lepsp_19
# 2 plantsp_21 lepsp_19 psitsp_4 plantsp_21_lepsp_19
# 3 plantsp_19 lepsp_2 psitsp_11 plantsp_19_lepsp_2
# 4 plantsp_9 lepsp_13 psitsp_24 plantsp_9_lepsp_13
# 5 plantsp_24 lepsp_9 psitsp_2 plantsp_24_lepsp_9
new_df$unique1[[1]]
# A tibble: 1 x 3
# plantsp lepsp paste_col
# <chr> <chr> <chr>
# 1 plantsp_9 lepsp_2 plantsp_9_lepsp_2
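The difference between the two joins is easier to see on a toy pair of tables. The sketch below uses made-up data frames (metaweb_toy and plot_toy, not the real df and plot_sub) and only shows how inner_join and anti_join split rows into shared and unshared interactions.

```r
library(dplyr)

# Hypothetical metaweb with three interactions.
metaweb_toy <- data.frame(
  plantsp = c("plantsp_1", "plantsp_2", "plantsp_3"),
  lepsp   = c("lepsp_1",   "lepsp_2",   "lepsp_3")
)

# Hypothetical plot in which only two of them were observed.
plot_toy <- data.frame(
  plantsp = c("plantsp_1", "plantsp_2"),
  lepsp   = c("lepsp_1",   "lepsp_2")
)

key_cols <- c("plantsp", "lepsp")

# Rows of metaweb_toy that also occur in plot_toy.
shared <- inner_join(metaweb_toy, plot_toy, by = key_cols)

# Rows of metaweb_toy with no counterpart in plot_toy.
unshared <- anti_join(metaweb_toy, plot_toy, by = key_cols)

nrow(shared)
nrow(unshared)
```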
In the last step I extract the number of rows of each match and unique and sum it up. I also calculate the res_vec.
new_df <- new_df %>%
mutate(sum_match = map_int(match1, nrow) + map_int(match2, nrow) + map_int(match3, nrow),
sum_unique = map_int(unique1, nrow) + map_int(unique2, nrow) + map_int(unique3, nrow),
res_vec = ((sum_match + sum_unique)/((2*sum_match + sum_unique)/2)) - 1)
Here I use map_int because the return value is an integer and I want to use it directly in a sum. Using plain map would return a list, which I would first have to convert to an integer vector.
new_df %>% select(region, plot, sum_match, sum_unique, res_vec)
# A tibble: 5 x 5
# region plot sum_match sum_unique res_vec
# <dbl> <fctr> <int> <int> <dbl>
# 1 1 A 15 1 0.03225806
# 2 1 B 27 3 0.05263158
# 3 1 C 6 2 0.14285714
# 4 2 D 18 1 0.02702703
# 5 2 E 6 0 0.00000000
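The res_vec values above can be checked by hand by wrapping the formula from Step7 of the question in a small helper. The function name dissimilarity is an assumption made for this sketch; a stands for sum_match and b for sum_unique.

```r
# Formula from the question: ((a + b) / ((2*a + b) / 2)) - 1
# a = number of matching interactions, b = number of unique interactions.
dissimilarity <- function(a, b) {
  (a + b) / ((2 * a + b) / 2) - 1
}

# Plot A in the table above: sum_match = 15, sum_unique = 1.
dissimilarity(15, 1)

# Plot E: no unique interactions, so the dissimilarity is zero.
dissimilarity(6, 0)
```

Because the helper is vectorised, it could also be applied directly to the sum_match and sum_unique columns.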
DATA
set.seed(540)
df <- data.frame(region = c(rep(1, 16), rep(2, 8)),
plot = c(rep('A', 5), rep('B', 9), rep('C', 2), rep('D', 6),rep('E', 2)),
plantsp = sample(1:24, 24, replace = TRUE),
lepsp = sample(1:24, 24, replace = TRUE),
psitsp = sample(1:24, 24, replace = TRUE))
df$plantsp <- paste('plantsp', df$plantsp, sep = '_')
df$lepsp <- paste('lepsp', df$lepsp, sep = '_')
df$psitsp <- paste('psitsp', df$psitsp, sep = '_')
Following from: Use filter() (and other dplyr functions) inside nested data frames with map()
I want to nest on multiple columns, and then filter out rows by the number of items that were nested into that row. For example,
df <- tibble(
a = sample(x = c(rep(c('x','y'),4), 'w', 'z')),
b = sample(c(1:10)),
c = sample(c(91:100))
)
I want to nest on column a, as in:
df_nest <- df %>%
nest(-a)
Then, I want to filter out the rows that only have 1 observation in the data column (where a = w or a = z, in this case.) How can I do that?
You can use map/map_int on the data column to return the nrow in each nested tibble, and construct the filter condition based on it:
df %>%
nest(-a) %>%
filter(map_int(data, nrow) == 1)
# filter(map(data, nrow) == 1) works as well
# A tibble: 2 x 2
# a data
# <chr> <list>
#1 w <tibble [1 x 2]>
#2 z <tibble [1 x 2]>
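If the goal is instead to drop the single-observation groups and keep the rest, the comparison can simply be reversed. The sketch below shows both directions on a deterministic version of the example data; it assumes a tidyr version that accepts the nest(data = -a) form (1.0.0 or later), and the names nested, singles and multiples are made up.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Same shape as the question's data, without the random shuffling.
df <- tibble(
  a = c(rep(c("x", "y"), 4), "w", "z"),
  b = 1:10,
  c = 91:100
)

nested <- df %>% nest(data = -a)

# Keep only groups with a single observation (what the answer above returns).
singles <- nested %>% filter(map_int(data, nrow) == 1)

# Keep only groups with more than one observation (drops w and z).
multiples <- nested %>% filter(map_int(data, nrow) > 1)

singles$a
multiples$a
```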