apply function to grouped rows in dataframe [duplicate] - r

I have created a function that computes a number of biological statistics, such as species range edges. Here is a simplified version of the function:
range_stats <- function(rangedf, lat, lon, weighting, na.rm=T){
cent_lat <- weighted.mean(x=rangedf[,lat], w=rangedf[,weighting], na.rm=T)
cent_lon <- weighted.mean(x=rangedf[,lon], w=rangedf[,weighting], na.rm=T)
out <- data.frame(cent_lat, cent_lon)
return(out)
}
I would like to apply this to a large dataframe where every row is an observation of a species. As such, I want the function to group rows by a specified set of columns, and then compute these statistics for each group. Here is a test dataframe:
LATITUDE <- c(27.91977, 21.29066, 26.06340, 28.38918, 25.97517, 27.96313)
LONGITUDE <- c(-175.8617, -157.8645, -173.9593, -178.3571, -173.9679, -175.7837)
BIOMASS <- c(4.3540488, 0.2406332, 0.2406332, 2.1419699, 0.3451426, 1.0946017)
SPECIES <- c('Abudefduf abdominalis','Abudefduf abdominalis','Abudefduf abdominalis','Chaetodon lunulatus','Chaetodon lunulatus','Chaetodon lunulatus')
YEAR <- c('2005', '2005', '2014', '2009', '2009', '2015')
testdf <- data.table(LATITUDE, LONGITUDE, BIOMASS, SPECIES, YEAR)
I want to apply this function to every unique combination of species and year to calculate summary statistics, i.e., the following:
testresult <- testdf %>%
group_by(SPECIES, YEAR) %>%
range_stats(lat="LATITUDE",lon="LONGITUDE",weighting="BIOMASS",na.rm=T)
However, the code above does not work (I get a (list) object cannot be coerced to type 'double' error) and I am not sure how else to approach the problem.

Since you tagged the question with dplyr and purrr, I assume you are interested in a tidyverse solution, so that is what I demonstrate below.
First, your range_stats is problematic, which is why you got the error message. weighted.mean expects a vector for both the x and w arguments. However, if rangedf is a tibble, subsetting it with rangedf[,lat] still returns a one-column tibble rather than a vector. A better way is to use pull from the dplyr package.
library(tidyverse)
range_stats <- function(rangedf, lat, lon, weighting, na.rm=T){
cent_lat <- weighted.mean(x = rangedf %>% pull(lat),
w = rangedf %>% pull(weighting), na.rm=T)
cent_lon <- weighted.mean(x = rangedf %>% pull(lon),
w = rangedf %>% pull(weighting), na.rm=T)
out <- data.frame(cent_lat, cent_lon)
return(out)
}
Next, the way you created the data frame is OK, but data.table() comes from the data.table package, so it creates a data.table, not a tibble. Since I assume you want a tidyverse approach, I changed data.table to data_frame as follows. (In current tibble releases data_frame() is deprecated in favour of tibble(), which is called the same way here.)
LATITUDE <- c(27.91977, 21.29066, 26.06340, 28.38918, 25.97517, 27.96313)
LONGITUDE <- c(-175.8617, -157.8645, -173.9593, -178.3571, -173.9679, -175.7837)
BIOMASS <- c(4.3540488, 0.2406332, 0.2406332, 2.1419699, 0.3451426, 1.0946017)
SPECIES <- c('Abudefduf abdominalis','Abudefduf abdominalis','Abudefduf abdominalis','Chaetodon lunulatus','Chaetodon lunulatus','Chaetodon lunulatus')
YEAR <- c('2005', '2005', '2014', '2009', '2009', '2015')
testdf <- data_frame(LATITUDE, LONGITUDE, BIOMASS, SPECIES, YEAR)
Now, you said you want to apply the range_stats function to each combination of SPECIES and YEAR. One approach is to split the data frame into a list of data frames and use a function from the lapply family. Here, however, I want to show how to use the map family of functions for this task, as map comes from the purrr package, which is part of the tidyverse.
We can first create group indices based on SPECIES and YEAR.
testdf2 <- testdf %>%
mutate(Group = group_indices(., SPECIES, YEAR))
testdf2
# A tibble: 6 x 6
  LATITUDE LONGITUDE   BIOMASS SPECIES               YEAR  Group
     <dbl>     <dbl>     <dbl> <chr>                 <chr> <int>
1 27.91977 -175.8617 4.3540488 Abudefduf abdominalis 2005      1
2 21.29066 -157.8645 0.2406332 Abudefduf abdominalis 2005      1
3 26.06340 -173.9593 0.2406332 Abudefduf abdominalis 2014      2
4 28.38918 -178.3571 2.1419699 Chaetodon lunulatus   2009      3
5 25.97517 -173.9679 0.3451426 Chaetodon lunulatus   2009      3
6 27.96313 -175.7837 1.0946017 Chaetodon lunulatus   2015      4
As you can see, Group is a new column showing the index number. Now we can split the data frame based on Group, and then use map_dfr to apply the range_stats function.
testresult <- testdf2 %>%
split(.$Group) %>%
map_dfr(range_stats, lat = "LATITUDE",lon = "LONGITUDE",
weighting = "BIOMASS", na.rm = TRUE, .id = "Group")
testresult
  Group cent_lat  cent_lon
1     1 27.57259 -174.9191
2     2 26.06340 -173.9593
3     3 28.05418 -177.7480
4     4 27.96313 -175.7837
Notice that map_dfr automatically binds the output list of data frames into a single data frame. .id = "Group" means we want to create a column called Group from the names of the list elements.
I separated the process into two steps, but of course they can be all in one pipeline as follows.
testresult <- testdf %>%
mutate(Group = group_indices(., SPECIES, YEAR)) %>%
split(.$Group) %>%
map_dfr(range_stats, lat = "LATITUDE",lon = "LONGITUDE",
weighting = "BIOMASS", na.rm = TRUE, .id = "Group")
If you want, testresult can be merged with testdf using left_join, but I will stop here, as testresult is probably already the output you want. I hope this helps.
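The split-plus-lapply alternative mentioned earlier in this answer can be sketched in base R as follows. This is only a minimal sketch, assuming a plain data.frame instead of a tibble; the names groups and split_result are introduced here for illustration, and this variant of range_stats uses [[ so that weighted.mean receives vectors.

```r
# Centroid helper; [[ ]] always extracts a plain vector
range_stats <- function(rangedf, lat, lon, weighting, na.rm = TRUE) {
  cent_lat <- weighted.mean(rangedf[[lat]], rangedf[[weighting]], na.rm = na.rm)
  cent_lon <- weighted.mean(rangedf[[lon]], rangedf[[weighting]], na.rm = na.rm)
  data.frame(cent_lat, cent_lon)
}

testdf <- data.frame(
  LATITUDE  = c(27.91977, 21.29066, 26.06340, 28.38918, 25.97517, 27.96313),
  LONGITUDE = c(-175.8617, -157.8645, -173.9593, -178.3571, -173.9679, -175.7837),
  BIOMASS   = c(4.3540488, 0.2406332, 0.2406332, 2.1419699, 0.3451426, 1.0946017),
  SPECIES   = rep(c("Abudefduf abdominalis", "Chaetodon lunulatus"), each = 3),
  YEAR      = c("2005", "2005", "2014", "2009", "2009", "2015")
)

# One data frame per observed SPECIES/YEAR combination (drop = TRUE skips empty ones)
groups <- split(testdf, list(testdf$SPECIES, testdf$YEAR), drop = TRUE)

# Apply the function to each piece and stack the one-row results
split_result <- do.call(
  rbind,
  lapply(groups, range_stats, lat = "LATITUDE", lon = "LONGITUDE", weighting = "BIOMASS")
)
split_result
```

The list names produced by split (species and year joined by a dot) identify each group, which plays the same role as the Group column in the map_dfr version.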

Fundamentally, the main issue is in weighted.mean(), where you are passing a data frame object rather than a vector that can be coerced to double. To fix this inside the function, simply change:
x=rangedf[,lat]
To double brackets:
x=rangedf[[lat]]
Adjusted method:
range_stats <- function(rangedf, lat, lon, weighting, na.rm=T){
cent_lat <- weighted.mean(x=rangedf[[lat]], w=rangedf[[weighting]], na.rm=T)
cent_lon <- weighted.mean(x=rangedf[[lon]], w=rangedf[[weighting]], na.rm=T)
out <- data.frame(cent_lat, cent_lon)
return(out)
}
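The difference between the two kinds of subsetting can be reproduced in isolation. The sketch below is only an illustration with made-up numbers: df, lat_df and w_df are hypothetical names, and drop = FALSE is used to mimic, on a plain data.frame, what a tibble or data.table returns from x[, "col"].

```r
# A one-column data frame stands in for what tibble/data.table return from x[, "col"]
df <- data.frame(lat = c(10, 20, 30), w = c(1, 1, 2))

lat_df <- df[, "lat", drop = FALSE]  # still a data frame (a list), not a vector
w_df   <- df[, "w", drop = FALSE]

# weighted.mean() cannot use list-like inputs, so this call signals an error
bad <- tryCatch(weighted.mean(x = lat_df, w = w_df, na.rm = TRUE),
                error = function(e) e)
inherits(bad, "error")

# [[ ]] extracts the underlying numeric vectors, so the call succeeds
good <- weighted.mean(x = df[["lat"]], w = df[["w"]], na.rm = TRUE)
good
```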
As for the overall group-by computation, forgive me for bypassing dplyr and data.table, which you use, and consider base R's underused but useful function by().
The challenge with your current setup is that range_stats returns a data.frame of two columns, and piping a grouped tibble into it does not run the function once per group: group_by() only marks the groups for dplyr verbs such as summarise(). By contrast, by() passes data frame subsets (split by the grouping factors) to a function and returns a list of data.frames, which you can then rbind into one final data frame:
df_List <- by(testdf, testdf[, c("SPECIES", "YEAR")], FUN=function(df)
data.frame(species=df$SPECIES[1],
year=df$YEAR[1],
range_stats(df,"LATITUDE","LONGITUDE","BIOMASS"))
)
finaldf <- do.call(rbind, df_List)
finaldf
# species year cent_lat cent_lon
# 1 Abudefduf abdominalis 2005 27.57259 -174.9191
# 2 Chaetodon lunulatus 2009 28.05418 -177.7480
# 3 Abudefduf abdominalis 2014 26.06340 -173.9593
# 4 Chaetodon lunulatus 2015 27.96313 -175.7837


Deriving cosine values for vector contrasts distributed over rows in a dataframe (rows to individual vectors)

I am attempting to use the lsa::cosine function to derive cosine values between vectors distributed across successive rows of a dataframe. My raw dataframe has 15 numeric columns, and each row is a unique 15-item vector.
My challenge is to create a new variable (e.g., cosineraw) that reflects cosine(vec1, vec2). Vec1 is the vector for Row1 and Vec2 is the vector for the next row (lead). I need this function to loop over rows for very large dataframes and am attempting to avoid a for loop. Essentially I need to compute a cosine value for each row contrasted to the next row stopping at the second to last row of the dataframe (since there is no cosine value for the last observation).
I've tried selecting observations rowwise:
dat <- mydat %>% rowwise %>% mutate(cosraw = cosine(as.vector(t(select_all))), as.vector(t(lead(select_all))))
but am getting an 'argument is not a matrix' error
In isolation, this code snippet works:
maybe <- lsa::cosine(as.vector(t(dat[2,])), as.vector(t(dat[1,])))
The problem is that the row index must be relative. This only works successfully for row1 vs. row2 not as the basis for a function rolling across all rows.
Is there a way to do this avoiding a 'for' loop?
Here's a base R solution:
# Load {lsa}
library(lsa)
# Generate data with 250k rows and 300 columns
gen_list <- lapply(1:250000, function(i){
rnorm(300)
})
# Convert to matrix
mat <- t(simplify2array(gen_list))
# Obtain desired values
vals <- unlist(
lapply(
2:nrow(mat), function(i){
cosine(mat[i-1,], mat[i,])
}
)
)
You can ignore the gen_list code as this was to generate example data.
You will want to convert your data frame to a matrix to make it compatible with the {lsa} package.
Runs quickly -- 3.39 seconds on my computer
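For comparison, the per-row lapply can be replaced by whole-matrix arithmetic. The sketch below does not call lsa::cosine; it swaps in the textbook formula cos(a, b) = sum(a * b) / (||a|| * ||b||), which should agree with lsa::cosine for two numeric vectors, though that is worth checking on your own data. The names upper, lower, cos_next and cosraw are chosen here for illustration.

```r
set.seed(1)
mat <- matrix(rnorm(10 * 15), nrow = 10, ncol = 15)  # 10 rows, each a 15-item vector

n     <- nrow(mat)
upper <- mat[-n, , drop = FALSE]   # rows 1 .. n-1
lower <- mat[-1, , drop = FALSE]   # rows 2 .. n

# cos(a, b) = sum(a * b) / (||a|| * ||b||), computed for all row pairs at once
cos_next <- rowSums(upper * lower) /
  (sqrt(rowSums(upper^2)) * sqrt(rowSums(lower^2)))

# Pad with NA so the result lines up with the original rows
cosraw <- c(cos_next, NA)
cosraw
```

The trailing NA matches the requirement that the last row has no following row to compare against.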
My answer is similar to Kat's, but I first packaged the 15 values of each row into a list and then created a new column holding the lead of that list column.
Here is some reproducible data:
library(dplyr)
library(tidyr)
library(lsa)
set.seed(1)
df <- data.frame(replicate(15,runif(10)))
The actual workflow:
df %>%
rowwise %>%
summarise(row_v = list(c_across())) %>%
mutate(nextrow_v = lead(row_v)) %>%
replace_na(list(nextrow_v=list(rep(NA, 15)))) %>% # replace NA with a list of NAs
rowwise %>%
summarise(cosr = cosine(unlist(row_v), unlist(nextrow_v)))
# A tibble: 10 x 1
# Rowwise:
cosr[,1]
<dbl>
1 0.820
2 0.791
3 0.780
4 0.785
5 0.838
6 0.808
7 0.718
8 0.743
9 0.773
10 NA
I'm assuming that you aren't looking for vectorization, as well (i.e., lapply or map).
This works, but it's a bit cumbersome. I didn't have any actual data from you so I made my own.
library(lsa)
library(tidyverse)
set.seed(1)
df1 <- matrix(sample(rnorm(15 * 11, 1, .1), 15 * 10), byrow = T, ncol = 15)
Then I created a copy of the data to use as the lead, because for the mutate to work, you need to lead columnwise, but aggregate rowwise. (That doesn't sound quite right, but hopefully, you can make heads or tails of it.)
df2 <- df1
df3 <- df2[-1, ] # all but the first row
df3 <- rbind(df3, rep(NA, 15)) # fill the missing row with NA
df2 <- cbind(df2, df3) %>% as.data.frame()
So now I've got a data frame that is 30 columns wide. The first 15 columns are my vector; the second 15 are the lead.
df2 %>%
rowwise %>%
mutate(cosr = cosine(c_across(V1:V15), c_across(V16:V30))) %>%
select(cosr) %>% unlist()
# cosr1 cosr2 cosr3 cosr4 cosr5 cosr6 cosr7 cosr8
# 0.9869402 0.9881976 0.9932426 0.9921418 0.9946119 0.9917792 0.9908216 0.9918681
# cosr9 cosr10
# 0.9972666 NA
If in doubt, you can always use a loop or vectorization to validate the numbers.
for(i in 1:(nrow(df1) - 1)) {
v1 <- df1[i, ] %>% unlist()
v2 <- df1[i + 1, ] %>% unlist()
message(cosine(v1, v2))
}
invisible(
lapply(1:(nrow(df1) - 1),
function(i) {message(cosine(unlist(df1[i, ]),
unlist(df1[i + 1, ])))}))

Create multiple dataframe

I intend to create multiple data frames from data like the following:
ID Time Ethnicity LDL HDL ....
1 1 black
2 2 white
3 1 black
4 2 White
Each data frame should hold the mean values of one of the columns LDL, HDL, ... for the 4 groups shown in the data. I used the following code, but the problem is that all the data frames are identical: DF[[1]] is the same as DF[[2]], ..., DF[[15]]. I would appreciate it if you could help me find a solution.
dv=c(names(data[,4:15]))
library(ggplot2)
require(plyr)
for (i in 1:12) {
DF[[i]] = ddply(data, c("Time", "Ethnicity"), summarize,
Mean = mean(data[[paste(dv[i])]], na.rm = T))
}
plyr is retired; you could use dplyr instead. When you call mean(data[[paste(dv[i])]]), you are taking the entire column and not respecting the groups. Hence, every group gets the same mean within DF[[1]], DF[[2]], etc.
library(dplyr)
output_df <- data %>%
group_by(Time, Ethnicity) %>%
summarise_at(4:15, mean, na.rm = TRUE) %>%
ungroup
If you want list of dataframes you could use group_split :
DF <- output_df %>% group_split(Time, Ethnicity)
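The difference between an overall mean and a per-group mean can also be seen in base R. The sketch below uses made-up values in the shape described in the question (the numbers and the two-column dv are assumptions for illustration) and aggregate() instead of dplyr.

```r
# Hypothetical data in the shape described in the question
data <- data.frame(
  ID        = 1:8,
  Time      = rep(c(1, 2), times = 4),
  Ethnicity = rep(c("black", "white"), each = 4),
  LDL       = c(100, 110, 120, NA, 90, 95, 105, 115),
  HDL       = c(40, 45, 50, 55, 60, 65, 70, 75)
)
dv <- c("LDL", "HDL")

# Overall mean: one value for the whole column, regardless of group
mean(data[["LDL"]], na.rm = TRUE)

# Per-group means of every column in dv
group_means <- aggregate(data[dv],
                         by = data[c("Time", "Ethnicity")],
                         FUN = mean, na.rm = TRUE)
group_means
```

Here na.rm = TRUE is forwarded to mean, so a missing LDL value is skipped within its group rather than dropping the whole row.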

How to sum up a list of variables in a customized dplyr function?

Starting point:
I have a dataset (tibble) which contains a lot of Variables of the same class (dbl). They belong to different settings. A variable (column in the tibble) is missing. This is the rowSum of all variables belonging to one setting.
Aim:
My aim is to produce sub data sets with the same data structure for each setting including the "rowSum"-Variable (i call it "s1").
Problem:
In each setting there are a different number of variables (and of course they are named differently).
Because it should be the same structure with different variables it is a typical situation for a function.
Question:
How can I solve the problem using dplyr?
I wrote a function to
(1) subset the original dataset for the setting of interest (this works) and
(2) compute the rowSums of the variables of that setting (this does not work; why?).
Because it is a function for a special designed dataset, the function includes two predefined variables:
day - which is any day of an investigation period
N - which is the Number of cases investigated on this special day
Thank you for any help.
mkr.sumsetting <- function(...,dataset){
subvars <- rlang::enquos(...)
#print(subvars)
# Summarize the variables belonging to the interessting setting
dfplot <- dataset %>%
dplyr::select(day,N,!!! subvars) %>%
dplyr::mutate(s1 = rowSums(!!! subvars,na.rm = TRUE))
return(dfplot)
}
We can convert the quosures to strings with as_name and subset the dataset with [ for the rowSums
library(rlang)
library(purrr)
library(dplyr)
mkr.sumsetting <- function(...,dataset){
subvars <- rlang::enquos(...)
v1 <- map_chr(subvars, as_name)
#print(subvars)
# Summarize the variables belonging to the interessting setting
dfplot <- dataset %>%
dplyr::select(day, N, !!! subvars) %>%
dplyr::mutate(s1 = rowSums( .[v1],na.rm = TRUE))
return(dfplot)
}
out <- mkr.sumsetting(col1, col2, dataset = df1)
head(out, 3)
# day N col1 col2 s1
#1 1 20 -0.5458808 0.4703824 -0.07549832
#2 2 20 0.5365853 0.3756872 0.91227249
#3 3 20 0.4196231 0.2725374 0.69216051
Another option would be to select the columns with the quosures and then do the rowSums
mkr.sumsetting <- function(...,dataset){
subvars <- rlang::enquos(...)
#print(subvars)
# Summarize the variables belonging to the interessting setting
dfplot <- dataset %>%
dplyr::select(day, N, !!! subvars) %>%
dplyr::mutate(s1 = dplyr::select(., !!! subvars) %>%
rowSums(na.rm = TRUE))
return(dfplot)
}
mkr.sumsetting(col1, col2, dataset = df1)
data
set.seed(24)
df1 <- data.frame(day = 1:20, N = 20, col1 = rnorm(20),
col2 = runif(20))
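If quoting the column names is acceptable, the same result can be sketched without tidy evaluation. This is a hedged base R alternative, not the approach from the answer above: sum_setting is a hypothetical helper that takes the setting's columns as a character vector.

```r
# Hypothetical helper: column names are passed as a character vector
sum_setting <- function(dataset, vars) {
  out <- dataset[c("day", "N", vars)]            # keep the fixed columns plus the setting
  out$s1 <- rowSums(dataset[vars], na.rm = TRUE) # row-wise sum over the setting's columns
  out
}

set.seed(24)
df1 <- data.frame(day = 1:20, N = 20, col1 = rnorm(20), col2 = runif(20))

out <- sum_setting(df1, c("col1", "col2"))
head(out, 3)
```

The trade-off is that callers write sum_setting(df1, c("col1", "col2")) with quoted names instead of passing bare column names through the dots.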

Using magrittr and lapply to divide a column in each df in a list by a list of values

I have a list of dataframes containing different time series of different lengths. I want to summarize the count of a variable and then normalize it by the number of years of data contained in that particular dataset.
So, with a sample list of dataframes:
data_list <- list(data.frame(temp_bin = rep(1:4, 2:5), value = runif(14)),
data.frame(temp_bin = rep(1:4, 3:6), value = runif(18)),
data.frame(temp_bin = rep(1:4, 4:7), value = runif(22)))
# this might be ~10 different data sets with ~ 100k observations each
count <- lapply(data_list, function(x) {nrow(x)/5} )
# for real data this would be divided by 8760 for the # of hours in a year.
Here is approximately what I want to do, but the n()/count doesn't work because count is a list.
data_bin <- data_list %>%
lapply(., group_by, temp_bin) %>%
lapply(., summarise, n = n()/count)
I tried doing an lapply or mapply within the definition of n, but that didn't seem to work. I also tried doing it in two steps (first get a raw n value, then divide in the next step with mapply), but that didn't work either.
If you put the count step inside your data_bin step, I think it accomplishes what you want, although I am a little hazy on exactly what you mean. (Note that you can remove the . from the first argument of lapply; passing the left-hand side as the first argument is the default behavior of %>%.)
data_bin <- data_list %>%
lapply(group_by, temp_bin) %>%
# We need x so I put summarize in a manual function
lapply(function(x){summarize(x,n = 5*n()/nrow(x))}) # move the 5 to numerator
data_bin[[1]]
Source: local data frame [4 x 2]
temp_bin n
1 1 0.7142857
2 2 1.0714286
3 3 1.4285714
4 4 1.7857143
Is this what you wanted? You can double check that the summarize part is doing what you want by just returning the nrow(x) result.
data_bin <- data_list %>%
lapply(group_by, temp_bin) %>%
lapply(function(x){summarize(x,n = nrow(x))})
data_bin[[1]]
Source: local data frame [4 x 2]
temp_bin n
1 1 14
2 2 14
3 3 14
4 4 14
I would try to avoid using lapply on every row of a dplyr statement. You could wrap individual data.frame transformation in a function and then lapply that function to data_list
library(dplyr)
ret_db <- function(df) {
db <- df %>%
group_by(.,temp_bin) %>%
summarise(.,n=n()/(nrow(df)/5))
return(db)
}
data_bin <- lapply(data_list,ret_db)
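When the divisors really do live in a separate list, as count does in the question, Map() can walk both lists in parallel. The following is a minimal base R sketch under that assumption; it uses table() for the per-bin counts rather than dplyr, and the set.seed call is added only to make the sample data reproducible.

```r
set.seed(1)
data_list <- list(
  data.frame(temp_bin = rep(1:4, 2:5), value = runif(14)),
  data.frame(temp_bin = rep(1:4, 3:6), value = runif(18)),
  data.frame(temp_bin = rep(1:4, 4:7), value = runif(22))
)
# One divisor per data set (number of "years" of data in each)
count <- lapply(data_list, function(x) nrow(x) / 5)

# Map() walks both lists in parallel, so each data frame meets its own divisor
data_bin <- Map(function(df, k) {
  tab <- table(df$temp_bin)
  data.frame(temp_bin = as.integer(names(tab)), n = as.vector(tab) / k)
}, data_list, count)

data_bin[[1]]
```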

Iteratively create columns based on grouped variables

I've got some data (below) where I want to iteratively add columns based on sums of current columns by some grouping variable, and I want to name the columns a pasted value of the current name + "_tot". I'm thinking a combination of dplyr and lapply is the way to go about it but I can't get the structure correct.
set.seed(1234)
data <- data.frame(
biz = sample(c("telco","shipping","tech"), 50, replace = TRUE),
region = sample(c("mideast","americas"), 50, replace = TRUE),
june = sample(1:50, 50, replace=TRUE),
july = sample(100:150, 50, replace=TRUE)
)
So, what I want to do is 1) group this data by "region", then add a new column for each of the following months that is the sum of that month's value (in the real dataframe, there are many periods that follow).
Basically, I want to apply this function
library(dplyr)
data %>% group_by(region) %>% mutate(june_tot = sum(june))
across every month, without having to specify "june" or "july". My initial take:
testfun <- function(df, col) {
name <- paste(col, "_tot", sep="")
data2 <- df %>% group_by(region) %>% summarise(name=sum(col))
return(data2)
}
but lapplying this doesn't work, because I have to specify the columns to call into the initial function. Just removing the "col" argument from the initial function doesn't work either, of course.
Any ideas how to lapply this sort of argument?
Here are possible solutions to your problems using dplyr (first, since that is what you tried), and followed by data.table as well as base R solutions:
dplyr:
cols <- lapply(names(data)[-(1:2)], as.name)
names(cols) <- paste0(names(data)[-(1:2)], "_tot")
data %>% group_by(region) %>% mutate_each_q(funs(sum), cols)
This assumes that every column except the first two holds monthly data. An explanation by line:
1. We use as.name and lapply to generate a list of the column names we want to mutate, as symbols.
2. We give the new names we want (i.e. month_tot) to the list of symbols from 1.
3. We use mutate_each_q (known as mutate_each_ in dplyr 0.3.0.2) to apply sum to the list of expressions we created in 1. and 2.
This is the (sample) result:
Source: local data frame [50 x 6]
Groups: region
biz region june july june_tot july_tot
1 shipping mideast 17 124 780 3339
2 telco americas 11 101 465 2901
3 telco mideast 27 131 780 3339
4 tech americas 24 135 465 2901
... rows omitted
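mutate_each_q comes from a very old dplyr release. Assuming dplyr 1.0.0 or later, the same grouped totals can be sketched with across(); this is an updated variant rather than the code from the answer above, and with_totals is a name introduced here for illustration.

```r
library(dplyr)

set.seed(1234)
data <- data.frame(
  biz    = sample(c("telco", "shipping", "tech"), 50, replace = TRUE),
  region = sample(c("mideast", "americas"), 50, replace = TRUE),
  june   = sample(1:50, 50, replace = TRUE),
  july   = sample(100:150, 50, replace = TRUE)
)

# Sum every column except biz within each region (the grouping column is
# skipped automatically), naming the new columns <original>_tot
with_totals <- data %>%
  group_by(region) %>%
  mutate(across(-biz, sum, .names = "{.col}_tot")) %>%
  ungroup()

head(with_totals, 4)
```

Because the selection is "everything except biz", additional month columns would be picked up without naming them.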
data.table:
new.names <- paste0(tail(names(data), 2L), "_tot") # Make new names
data.table(data)[,
(new.names):=lapply(.SD, sum), # `lapply` `sum` to the selected columns (those in .SD), and assign to `new.names` columns
by=region, .SDcols=-1 # group by `region`, and exclude first column from `.SD` (note `region` is excluded as well by reason of being in `by`)
][] # extra `[]` just to force printing
Here, similar logic, except we use the special .SD object that represents every column in the data.table that we are not grouping by.
base:
do.call(
cbind,
list(
data,
setNames(
lapply(data[-(1:2)], function(x) ave(x, data$region, FUN=sum)),
paste0(names(data[-(1:2)]), "_tot")
) ) )
Here we use ave to compute the per region sums, use lapply to apply ave to each column, and use do.call(cbind, ...) to reconstruct the final data frame.
Try:
> for(i in 3:4) print(tapply(data[[i]], data$region, sum))
americas mideast
563 768
americas mideast
2538 3802
You can get all outputs in a list if you want.
Restructuring the data works well for this.
require(tidyr)
# wide to long
d2 <- gather(data = data,key = month,value = monthval,-c(biz,region))
# get totals and rename month
month_tots <- aggregate(x = list(total = d2$monthval),by = list(region = d2$region,month = d2$month),sum)
month_tots$month <- paste0(month_tots$month,'_tot')
# long to wide
month_tots <- spread(data = month_tots,key = month,value = total)
# recombine
merge(data,month_tots,by = 'region',all.x = T)
