R: Looping aggregate by group count

I want to write a loop that can aggregate the number of instances (of certain values) that are grouped by year. More specifically, say the variable is x1. I want to have two groups, one is when x1 = 1, and the other when it is a combination of some values (2,3, and 5 in the below example):
year x1
2000 1
2000 1
2000 2
2000 3
2000 5
The end result should look like this:
year x2 x3
2000 2 3
where x2 and x3 are the counts when x1 = 1 and x1 = c(2,3,5), respectively. How can one accomplish this?
Edit: Probably should have mentioned this earlier. I work with two datasets; one df1 is yearly (spanning approx. 200 years) and the other df2 is incident-based (around 50k observations; this is where x1 is currently located). So the idea of the loop is to look at each year[i] in df2 and aggregate the counts by grouping them as x2 and x3 in df1.
Edit2: Ah, I solved why the submitted answers were not working for me. Apparently I ran into the dplyr before plyr problem discussed in this answer; I followed ManneR's answer and detached plyr. Now the group_by command works again.

I am not sure what was wrong with user3349904's answer, as it seems to do what you are asking. It's not easy to know exactly what you need without seeing what your data looks like. Is your issue with the other solution that df1 needs to hold the x2 and x3 values? The last part of the code below handles that.
I tried to replicate your problem from scratch so here's my shot at a solution.
library(dplyr)
#create DF1 (years)
df1 <- as.data.frame(matrix(ncol=3,nrow = 200))
df1$V1 <- c(1800:1999)
colnames(df1) <- c("year","x2","x3")
#create DF2 (transactions)
df2 <- as.data.frame(matrix(ncol=2,nrow=50000))
#add random sample data
df2$V1 <- sample(1800:1999,50000,replace = T)
df2$V2 <- sample(1:5,50000,replace = T)
colnames(df2) <- c("year","x1")
# group by year in df2 and aggregate counts based on categories
df2 %>% group_by(year) %>%
summarise(x2 = sum(x1==1), x3 = sum(x1 %in% c(2,3,5))) -> df3
# match years in df3 and df1 and bring lookup value to df1
df1$x2 <- df3$x2[match(df1$year,df3$year)]
df1$x3 <- df3$x3[match(df1$year,df3$year)]
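One detail of the match() lookup is that years present in df1 but absent from df2 come back as NA. A minimal base R sketch of that behaviour, using small made-up tables (the values are illustrative only, and filling with 0 is an assumption about what a missing year should mean):

```r
# Hypothetical yearly table and per-year counts; 1998 and 2001 have no incidents
df1 <- data.frame(year = 1998:2001)
df3 <- data.frame(year = c(1999, 2000), x2 = c(4, 2), x3 = c(1, 3))

# Look up each df1 year in df3 once, then reuse the index for both columns
idx <- match(df1$year, df3$year)
df1$x2 <- df3$x2[idx]
df1$x3 <- df3$x3[idx]

# Years without a match are NA at this point; treat them as zero counts
df1[is.na(df1)] <- 0
df1
```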

Here is another option using dplyr/tidyr
library(dplyr)
library(tidyr)
df1 %>%
group_by(year, grp = paste0("x", (x1 != 1) + 2)) %>%
summarise(x1= n()) %>%
spread(grp, x1)
# year x2 x3
#* <int> <int> <int>
#1 2000 2 3
Or using base R
xtabs(Freq~year + x1, transform(df1, x1= paste0("x", (x1!=1)+2), Freq= 1))
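Note that (x1 != 1) + 2 puts every value other than 1 into x3, so a value such as 4 would be counted as well. A minimal base R sketch that counts only 2, 3 and 5, using the question's toy rows plus one hypothetical extra row with x1 = 4:

```r
# Question's toy data plus one x1 = 4 row that belongs to neither group
df <- data.frame(year = rep(2000, 6),
                 x1   = c(1, 1, 2, 3, 5, 4))

# Flag membership in each group explicitly
df$x2 <- df$x1 == 1
df$x3 <- df$x1 %in% c(2, 3, 5)

# Summing logicals counts the TRUE values per year
counts <- aggregate(df[c("x2", "x3")], by = list(year = df$year), FUN = sum)
counts
```

With these rows the x1 = 4 observation is left out of both counts.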

Assuming you are starting from a data frame called df, this will count the cases as you describe them by year:
library(dplyr)
df %>% group_by(year) %>% summarise(x2 = sum(x1==1), x3 = sum(x1 %in% c(2,3,5)))

Related

How to subtract using max(date) and second latest (month) date

I'm trying to create a new variable which equals the latest month's value minus the previous month's (or 3 months prior, etc.).
A quick df:
country <- c("XYZ", "XYZ", "XYZ")
my_dates <- c("2021-10-01", "2021-09-01", "2021-08-01")
var1 <- c(1, 2, 3)
df1 <- country %>% cbind(my_dates) %>% cbind(var1) %>% as.data.frame()
df1$my_dates <- as.Date(df1$my_dates)
df1$var1 <- as.numeric(df1$var1)
For example, I've tried (partially from: How to subtract months from a date in R?)
library(tidyverse)
df2 <- df1 %>%
mutate(dif_1month = var1[my_dates==max(my_dates)] -var1[my_dates==max(my_dates) %m-% months(1)])
I've also tried different variations of using lag():
df2 <- df1 %>%
mutate(dif_1month = var1[my_dates==max(my_dates)] - var1[my_dates==max(my_dates)-lag(max(my_dates), n=1L)])
Any suggestions on how to grab the value of a variable when dates equal the second latest observation?
Thanks for help, and apologies for not including any data. Can edit if necessary.
Edited with a few potential answers:
#this gives me the value of var1 of the latest date
df2 <- df1 %>%
mutate(value_1month = var1[my_dates==max(my_dates)])
#this gives me the date of the second latest date
df2 <- df1 %>%
mutate(month1 = max(my_dates) %m-%months(1))
#This gives me the second to latest value
df2 <- df1 %>%
mutate(var1_1month = var1[my_dates==max(my_dates) %m-%months(1)])
#This gives me the difference of the latest value and the second to last of var1
df2 <- df1 %>%
mutate(diff_1month = var1[my_dates==max(my_dates)] - var1[my_dates==max(my_dates) %m-%months(1)])
mutate requires each new column to have either length 1 or the same length as the number of rows of the original data. When we do the subsetting, the length can be different (for example zero, when no date matches). We may need ifelse or case_when
library(dplyr)
library(lubridate)
df1 %>%
mutate(diff_1month = case_when(my_dates==max(my_dates) ~
my_dates %m-% months(1)))
NOTE: Without a reproducible example, the column types and values are not clear
Based on the OP's update, we may do an arrange first, grab the last two 'var1' values and get the difference
df1 %>%
arrange(my_dates) %>%
mutate(dif_1month = diff(tail(var1, 2)))
    .   my_dates var1 dif_1month
1 XYZ 2021-08-01    3         -1
2 XYZ 2021-09-01    2         -1
3 XYZ 2021-10-01    1         -1
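The tail() approach takes the last two rows, so it assumes there are no gaps between months. For the "3 months prior" case in the question, a base R sketch that looks the earlier month up by date instead might look like this. diff_months_back is a hypothetical helper name, and the sketch assumes one row per month with dates on the first of the month:

```r
# Toy data from the question
df1 <- data.frame(
  country  = "XYZ",
  my_dates = as.Date(c("2021-10-01", "2021-09-01", "2021-08-01")),
  var1     = c(1, 2, 3)
)

# Hypothetical helper: latest value minus the value k calendar months earlier
diff_months_back <- function(dates, values, k = 1) {
  latest <- max(dates)
  # Step back k calendar months from the latest date
  target <- seq(latest, by = paste0("-", k, " month"), length.out = 2)[2]
  # match() gives NA when the target month is absent, so the result is NA too
  values[match(latest, dates)] - values[match(target, dates)]
}

d1 <- diff_months_back(df1$my_dates, df1$var1, k = 1)
d2 <- diff_months_back(df1$my_dates, df1$var1, k = 2)
```

Because the result has length 1, it can also be used inside mutate(), where it is recycled to every row.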

How to sum up a list of variables in a customized dplyr function?

Starting point:
I have a dataset (tibble) which contains a lot of Variables of the same class (dbl). They belong to different settings. A variable (column in the tibble) is missing. This is the rowSum of all variables belonging to one setting.
Aim:
My aim is to produce sub data sets with the same data structure for each setting including the "rowSum"-Variable (I call it "s1").
Problem:
In each setting there are a different number of variables (and of course they are named differently).
Because it should be the same structure with different variables it is a typical situation for a function.
Question:
How can I solve the problem using dplyr?
I wrote a function to
(1) subset the original dataset for the interesting setting (is working) and
(2) try to compute the rowSums of the variables of the setting (does not work; Why?).
Because it is a function for a special designed dataset, the function includes two predefined variables:
day - which is any day of an investigation period
N - which is the Number of cases investigated on this special day
Thank you for any help.
mkr.sumsetting <- function(...,dataset){
subvars <- rlang::enquos(...)
#print(subvars)
# Summarize the variables belonging to the interessting setting
dfplot <- dataset %>%
dplyr::select(day,N,!!! subvars) %>%
dplyr::mutate(s1 = rowSums(!!! subvars,na.rm = TRUE))
return(dfplot)
}
We can change the quosures to strings with as_name and subset the dataset with [ for the rowSums
library(rlang)
library(purrr)
library(dplyr)
mkr.sumsetting <- function(...,dataset){
subvars <- rlang::enquos(...)
v1 <- map_chr(subvars, as_name)
#print(subvars)
# Summarize the variables belonging to the interessting setting
dfplot <- dataset %>%
dplyr::select(day, N, !!! subvars) %>%
dplyr::mutate(s1 = rowSums( .[v1],na.rm = TRUE))
return(dfplot)
}
out <- mkr.sumsetting(col1, col2, dataset = df1)
head(out, 3)
# day N col1 col2 s1
#1 1 20 -0.5458808 0.4703824 -0.07549832
#2 2 20 0.5365853 0.3756872 0.91227249
#3 3 20 0.4196231 0.2725374 0.69216051
Or another option would be to select the columns with the quosures and then do the rowSums
mkr.sumsetting <- function(...,dataset){
subvars <- rlang::enquos(...)
#print(subvars)
# Summarize the variables belonging to the interessting setting
dfplot <- dataset %>%
dplyr::select(day, N, !!! subvars) %>%
dplyr::mutate(s1 = dplyr::select(., !!! subvars) %>%
rowSums(na.rm = TRUE))
return(dfplot)
}
mkr.sumsetting(col1, col2, dataset = df1)
data
set.seed(24)
df1 <- data.frame(day = 1:20, N = 20, col1 = rnorm(20),
col2 = runif(20))
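As to why the original attempt fails: as far as I can tell, rowSums(!!! subvars, na.rm = TRUE) is spliced into rowSums(col1, col2, na.rm = TRUE), so rowSums receives a plain vector as its first argument rather than a two-dimensional object. A small base R sketch of the difference, using made-up values (not the data above):

```r
# Small illustrative data with one missing value
df1 <- data.frame(day = 1:3, N = 20,
                  col1 = c(1, 2, NA),
                  col2 = c(10, 20, 30))

# Passing the columns as separate vectors fails: the first argument is not 2-D
res <- tryCatch(rowSums(df1$col1, df1$col2),
                error = function(e) "error")

# Passing one data frame holding the selected columns works
s1 <- rowSums(df1[c("col1", "col2")], na.rm = TRUE)
```

Both versions in the answer follow the second pattern: they hand rowSums a single data frame holding the selected columns.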

dplyr: ignore grouping variables for function input

I am trying to use tidyverse tools (instead of for loops) on some groups to be evaluated with procedures from the mvabund package.
Basically, for the procedure I need a dataframe with just numeric columns (species abundances) first and then grouping variables for a downstream procedure.
But if I want to do this on multiple groupings, I need to include grouping variables. However, when using group_by these non-numeric variables are still present and the procedure will not run.
How can I use dplyr to pass the numeric variables to a (mvabund) function?
If I were to do this for just one group, the process is as follows:
library(tidyverse)
library(mvabund)
df <- data.frame(Genus.species1 = rep(c(0, 1), each = 10),
Genus.species2 = rep(c(1, 0), each = 10),
Genus.species3 = sample(1:100,20,replace=T),
Genus.species4 = sample(1:100,20,replace=T),
GroupVar1 = rep(c("Site1", "Site2"), each=2, times=5),
GroupVar2 = rep(c("AA", "BB"), each = 10),
GroupVar3 = rep(c("A1", "B1"), times=10))
df1 <- filter(df, GroupVar2 == "AA" & GroupVar3 == "A1") # get desired subset/group
df2 <- select(df1, -GroupVar1, -GroupVar2, -GroupVar3) # retain numeric variables
MVA.fit <- mvabund(df2) # run procedure
MVA.model <- manyglm(MVA.fit ~ df1$GroupVar1, family="negative binomial") # here I need to bring back GroupVar1 for this procedure
MVA.anova <- anova(MVA.model, nBoot=1000, test="wald", p.uni="adjusted")
MVA.anova$table[2,] # desired result
I have tried using map, do, nest, etc to no avail.
Without groupings this works
df.t <- as_tibble(df)
nest.df <- df.t %>% nest(-GroupVar1, -GroupVar2, -GroupVar3)
mva.tt <- nest.df %>%
mutate(mva.tt = map(data, ~ mvabund(.x)))
but this next step does not
mva.tt %>% mutate(MANY = map(data, ~ manyglm(.x ~ GroupVar1, family="negative binomial")))
Moreover, once I try to remove columns that sum to zero or include groupings, everything fails.
Is there a smart way to do this with dplyr and pipes? Or is a for loop the answer?
Edit:
Originally, I also asked: "Also, when broken into groups, the dataframe will contain columns that are all zeroes; normally I'd remove these. Can I have dplyr groupings that vary in the number of variables?" but the comments revealed this is not possible given my proposed setup. So I am still interested in the above.
Copied the steps into a function. Also added group information to differentiate in the last line.
fun <- function(df) {
df1 <- select(df, -GroupVar1, -GroupVar2, -GroupVar3)
df3 <- df1 %>% select_if(~sum((.)) > 0)
MVA.fit <- mvabund(df3)
MVA.model <- manyglm(MVA.fit ~ df$GroupVar1, family="negative binomial")
MVA.anova <- anova(MVA.model, nBoot=1000, test="wald", p.uni="adjusted")
cbind(Group2 = df$GroupVar2[1], Group3 = df$GroupVar3[1], MVA.anova$table[2,])
}
Split the dataframe into groups and apply the function
library(tidyverse)
library(mvabund)
df %>%
group_split(GroupVar2, GroupVar3) %>%
map_dfr(fun)
#Time elapsed: 0 hr 0 min 0 sec
#Time elapsed: 0 hr 0 min 0 sec
#Time elapsed: 0 hr 0 min 0 sec
#Time elapsed: 0 hr 0 min 0 sec
# Group2 Group3 Res.Df Df.diff wald Pr(>wald)
#1 AA A1 3 1 1.028206 0.7432567
#2 AA B1 3 1 2.979169 0.1608392
#3 BB A1 3 1 2.330708 0.2137862
#4 BB B1 3 1 1.952617 0.2567433
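The same split-apply-combine pattern can be sketched in base R without mvabund. In the sketch below, summarise_group is a hypothetical stand-in for the model-fitting function, and the data are made up. It only shows that each group can drop its own all-zero columns before the per-group step, because every group is handled as a separate data frame:

```r
# Made-up abundances: sp1 is all zero in group AA, sp2 is all zero in group BB
df <- data.frame(sp1 = c(0, 0, 1, 2),
                 sp2 = c(3, 1, 0, 0),
                 sp3 = c(1, 1, 1, 1),
                 GroupVar2 = c("AA", "AA", "BB", "BB"))

# Hypothetical stand-in for the per-group analysis
summarise_group <- function(d) {
  num  <- d[sapply(d, is.numeric)]   # keep only the numeric columns
  kept <- num[colSums(num) > 0]      # drop columns that are all zero in this group
  data.frame(Group2    = d$GroupVar2[1],
             n_species = ncol(kept),
             total     = sum(kept))
}

# Split by group, apply the function, and bind the one-row results
res <- do.call(rbind, lapply(split(df, df$GroupVar2), summarise_group))
res
```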

How to numbering unique pairs X,Y

Ok, so I have the following data.frame:
v1<-c(456,234,981,776,112,998)
v2<-c(981,112,456,998,234,776)
df<- data.frame(v1,v2)
I want to obtain an extra variable with a numeric count of pairs of v1 and v2 values. The trick is that I need to number them by unique pairs so, for example (456,981 and 981,456) should be numbered 1.
So the outcome would be something like this:
v1<-c(456,234,981,776,112,998)
v2<-c(981,112,456,998,234,776)
v3<-c(1,2,1,3,2,3)
df<- data.frame(v1,v2,v3)
You can sort rowwise and use match, i.e.
v1 <- do.call(paste, data.frame(t(apply(df, 1, sort))))
match(v1, unique(v1))
#[1] 1 2 1 3 2 3
How about this using dplyr. Basically you would sort the columns for each row. Not sure if it would be more efficient or not. Obviously it is a lot more lines.
library(dplyr)
df <- data.frame(v1,v2)
# Sort by v1 and v2 elements by row
df.new <- df %>%
mutate(z1 = pmin(v1,v2),
z2 = pmax(v1,v2))
# Build a distinct coding table
df.codes <- df.new %>%
distinct(z1, z2) %>%
mutate(v3 = 1:n())
# Join it back together
df.new %>%
left_join(df.codes, by = c("z1", "z2")) %>%
select(v1, v2, v3)
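The two ideas above (order each pair, then number the distinct keys) can also be combined in a few lines of base R. A minimal sketch, assuming pairs are numbered in order of first appearance:

```r
v1 <- c(456, 234, 981, 776, 112, 998)
v2 <- c(981, 112, 456, 998, 234, 776)

# Build an order-independent key: smaller value first, larger value second
key <- paste(pmin(v1, v2), pmax(v1, v2))

# Number the keys by first appearance
v3 <- match(key, unique(key))
v3
#[1] 1 2 1 3 2 3
```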

Conditional subset of data frame by special condition

df1 <-
data.frame(Sector=c("auto","auto","auto","industry","industry","industry"),
Topic=c("1","2","3","3","5","5"),
Frequency=c(1,2,5,2,3,2))
df1
df2 <-
data.frame(Sector=c("auto","auto","auto"),
Topic=c("1","2","3"),
Frequency=c(1,2,5))
df2
I have the data frame df1 above and want a conditional subset of it that looks like df2. The condition is as follows:
"If at least one observation of a sector has a frequency larger than 3, all observations of that sector should be kept; if not, all observations of that sector should be dropped."
In the example above, only the three observations of the auto sector remain; industry is dropped.
Does anybody have an idea how I might achieve this subset?
We can use group_by and filter from dplyr to achieve this.
library(dplyr)
df2 <- df1 %>%
group_by(Sector) %>%
filter(any(Frequency > 3)) %>%
ungroup()
df2
# # A tibble: 3 x 3
# Sector Topic Frequency
# <fct> <fct> <dbl>
# 1 auto 1 1.
# 2 auto 2 2.
# 3 auto 3 5.
Here is a solution with base R:
df1 <-
data.frame(Sector=c("auto","auto","auto","industry","industry","industry"),
Topic=c("1","2","3","3","5","5"),
Frequency=c(1,2,5,2,3,2))
subset(df1, ave(Frequency, Sector, FUN=max) >3)
and a solution with data.table:
library("data.table")
setDT(df1)[, if (max(Frequency)>3) .SD, by=Sector]
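To see why the ave() version works, it may help to look at the intermediate vector: ave() returns the group maximum repeated for every row of the group, so the comparison yields one TRUE/FALSE per row. A short sketch on the question's data:

```r
df1 <- data.frame(Sector = c("auto", "auto", "auto", "industry", "industry", "industry"),
                  Topic = c("1", "2", "3", "3", "5", "5"),
                  Frequency = c(1, 2, 5, 2, 3, 2))

# Per-row copy of each sector's maximum frequency
group_max <- ave(df1$Frequency, df1$Sector, FUN = max)
group_max
#[1] 5 5 5 3 3 3

# Keep the rows whose sector maximum exceeds 3 (strictly, so industry is dropped)
kept <- df1[group_max > 3, ]
```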
