Get each row's value and that of the group in dplyr - r

Edit: I asked this question poorly. For a clearer version, please see "Find the variance over a sliding window in dplyr".
I'm trying to call a function using each row's value and that of the group.
# make some data with categories a and b
library(dplyr)
df = expand.grid(
a = LETTERS[1:3],
b = 1:3,
x = 1:5
)
# add a variable that changes within group
df$b2 = df$b + floor(runif(nrow(df))*100)
df %>%
# group the data
group_by(a, b) %>%
# row by row analysis
rowwise() %>%
# do some function based on this row's value and the vector for the group
mutate(y = x + 100*max(.$b2))
I want .$b2 to correspond to only items in the current group. Instead it's the entire data frame.
Is there any way to get just the group's data?
Note: I don't actually care about max. It's just a standin for a more complicated function. I need to be able to call foo(one_value, group_vector).

Try
df %>%
group_by(a,b) %>%
mutate(y=x+100*max(b2))
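For the general case mentioned in the question, where max is only a stand-in for a function of one value and the whole group vector, the same idea might be sketched as follows. Here foo is a hypothetical placeholder, and the toy data and the sapply loop are assumptions for illustration. Inside a grouped mutate(), a bare column name such as b2 refers to the current group's values only.

```r
library(dplyr)

# Hypothetical stand-in for foo(one_value, group_vector)
foo <- function(one_value, group_vector) {
  one_value + 100 * max(group_vector)
}

# Small toy data (an assumption, not the asker's data)
toy <- data.frame(
  a  = c("A", "A", "B", "B"),
  x  = c(1, 2, 3, 4),
  b2 = c(10, 20, 30, 5)
)

result <- toy %>%
  group_by(a) %>%
  # b2 is the current group's vector; sapply feeds x to foo one value at a time
  mutate(y = sapply(x, foo, group_vector = b2)) %>%
  ungroup()

print(result)
```

If foo is already vectorised over its first argument, the sapply wrapper is unnecessary and mutate(y = foo(x, b2)) should give the same result.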

Related

R: Add count for unique values within Group, disregarding other variables within dataframe

I would like to add a new variable to my data frame which, for each group, gives the number of unique entries of one variable (state), while disregarding the others.
Data input
df <- data.frame(id=c(1,2,3,4,5,6,7,8,9),
state=c("CT","CT","AK","TX","TX","AZ","GA","TX","WA"),
group=c(1,1,2,3,3,3,4,4,4),
age=c(12,33,57,98,45,67,16,85,22)
)
df
Desired output
want <- data.frame(id=c(1,2,3,4,5,6,7,8,9),
state=c("CT","CT","AK","TX","TX","AZ","GA","TX","WA"),
group=c(1,1,2,3,3,3,4,4,4),
age=c(12,33,57,98,45,67,16,85,22),
count=c(1,1,1,2,2,2,3,3,3)
)
want
We need to group by group and then count the distinct states with n_distinct:
library(dplyr)
df %>%
group_by(group) %>%
mutate(count = n_distinct(state)) %>%
ungroup()
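As a quick check, the pipeline can be run on the question's data and its count column compared with the desired output. This is only a sketch; the name res is introduced here for illustration.

```r
library(dplyr)

# Data from the question
df <- data.frame(
  id    = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
  state = c("CT", "CT", "AK", "TX", "TX", "AZ", "GA", "TX", "WA"),
  group = c(1, 1, 2, 3, 3, 3, 4, 4, 4),
  age   = c(12, 33, 57, 98, 45, 67, 16, 85, 22)
)

res <- df %>%
  group_by(group) %>%
  # number of distinct states within each group, repeated on every row
  mutate(count = n_distinct(state)) %>%
  ungroup()

# Compare with the count column of the desired output
print(all(res$count == c(1, 1, 1, 2, 2, 2, 3, 3, 3)))
```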

How to calculate weighted mean using mutate_at in R?

I have a dataframe ("df") with a number of columns that I would like to estimate the weighted means of, weighting by population (df$Population), and grouping by commuting zone (df$cz).
This is the list of columns I would like to estimate the weighted means of:
vlist = c("Public_Welf_Total_Exp", "Welf_Cash_Total_Exp", "Welf_Cash_Cash_Assist", "Welf_Ins_Total_Exp","Total_Educ_Direct_Exp", "Higher_Ed_Total_Exp", "Welf_NEC_Cap_Outlay","Welf_NEC_Direct_Expend", "Welf_NEC_Total_Expend", "Total_Educ_Assist___Sub", "Health_Total_Expend", "Total_Hospital_Total_Exp", "Welf_Vend_Pmts_Medical","Hosp_Other_Total_Exp","Unemp_Comp_Total_Exp", "Unemp_Comp_Cash___Sec", "Total_Unemp_Rev", "Hous___Com_Total_Exp", "Hous___Com_Construct")
This is the code I have been using:
df = df %>% group_by(cz) %>% mutate_at(vlist, weighted.mean(., df$Population))
I have also tried:
df = df %>% group_by(cz) %>% mutate_at(vlist, function(x) weighted.mean(x, df$Population))
As well as tested the following code on only 2 columns:
df = df %>% group_by(cz) %>% mutate_at(vars(Public_Welf_Total_Exp, Welf_Cash_Total_Exp), weighted.mean(., df$Population))
However, everything I have tried gives me the following error, even though there are no NAs in any of my variables:
Error in weighted.mean.default(., df$Population) :
'x' and 'w' must have the same length
I understand that I could do the following estimation using lapply, but I don't know how to group by another variable using lapply. I would appreciate any suggestions!
There is a lot to unpack here.
You probably mean summarise instead of mutate, because mutate would just repeat each group's result on every row of that group.
mutate_at and summarise_at are superseded; you should use across instead.
Your code failed for two reasons. First, you did not write the function as a formula (there is no ~ at the beginning), so weighted.mean(., df$Population) is evaluated immediately instead of being applied to each column. Second, you used df$Population instead of Population. A bare Population refers to the column of the grouped data, so it is subset to the current group like every other column. df$Population pulls the full, ungrouped column from the original data frame, so its length does not match the length of the group's values. That mismatch is exactly what the "'x' and 'w' must have the same length" error reports.
Here is how you could do it:
library(dplyr)
df %>%
group_by(cz) %>%
summarise(across(all_of(vlist), ~ weighted.mean(.x, Population)),
.groups = "drop")
If you really need to use summarise_at (and probably you are using an old version of dplyr [lower than 1.0.0]), then you could do:
df %>%
group_by(cz) %>%
summarise_at(vlist, ~weighted.mean(., Population)) %>%
ungroup()
I used the following df and vlist for testing:
vlist <- c("Public_Welf_Total_Exp", "Welf_Cash_Total_Exp", "Welf_Cash_Cash_Assist", "Welf_Ins_Total_Exp","Total_Educ_Direct_Exp", "Higher_Ed_Total_Exp", "Welf_NEC_Cap_Outlay","Welf_NEC_Direct_Expend", "Welf_NEC_Total_Expend", "Total_Educ_Assist___Sub", "Health_Total_Expend", "Total_Hospital_Total_Exp", "Welf_Vend_Pmts_Medical","Hosp_Other_Total_Exp","Unemp_Comp_Total_Exp", "Unemp_Comp_Cash___Sec", "Total_Unemp_Rev", "Hous___Com_Total_Exp", "Hous___Com_Construct")
df <- as.data.frame(matrix(rnorm(length(vlist) * 100), ncol = length(vlist)))
names(df) <- vlist
df$cz <- rep(letters[1:10], each = 10)
df$Population <- runif(100)
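To see that the grouped version really weights by each group's own Population, here is a minimal sketch on a tiny hand-made data frame. The names small, cols, v1 and v2 are assumptions for illustration and not part of the original data. Each group has 3 rows while the full Population column has 6, which is the length mismatch that the df$Population version runs into.

```r
library(dplyr)

# Tiny illustrative data: two commuting zones with three rows each
small <- data.frame(
  cz         = rep(c("a", "b"), each = 3),
  Population = c(1, 2, 3, 4, 5, 6),
  v1         = c(10, 20, 30, 40, 50, 60),
  v2         = c(1, 1, 1, 2, 4, 6)
)
cols <- c("v1", "v2")

out <- small %>%
  group_by(cz) %>%
  # the bare name Population is subset to the current group
  summarise(across(all_of(cols), ~ weighted.mean(.x, Population)),
            .groups = "drop")

# Manual check for group "a"
manual_a <- weighted.mean(small$v1[1:3], small$Population[1:3])

print(out)
print(manual_a)
```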

call variable that has been grouped by

Some sample data:
df <- data.frame(lang = rep(c("A", "B", "C"), 3),
answer = rep(c("1", "2", "3"), each=3))
I am getting an error when I try to call a variable that I recently grouped by:
df2 <- df %>%
Total = count(lang) %>% # count is short hand for tally + group_by()
filter(answer=='2') %>%
mutate(prop = NROW(answer)/NROW(Total))
Error in group_vars(x) : object 'lang' not found
I would like a new column on my dataframe that says the proportion of the answer '2' to total observations in each level of lang. So how many times does '2' occur in 'A' in proportion to the total number of observations in 'A'?
Here's a solution that does what you want:
df %>%
group_by(lang) %>%
summarize(
prop = length(lang[answer==2])/n()
)
Here, we group by the variable (or variables) that define the groups, then use summarize to count the rows in each group where answer equals 2 and divide that count by the number of rows in the group. If you want to keep the answer column alongside prop, just change summarize to mutate.
You got the error about lang not being found because count has to be called on a data frame, like mutate: either piped into, or with the data frame as its first argument. In Total = count(lang), count receives lang as its data argument, and no object called lang exists outside the data frame. To name the count column, use the name argument, i.e.
df %>%
count(lang, name = "Total")
You could achieve the same thing adapting your code, but you should use add_count (so your answer column is preserved) or mutate(Total = n()). However, group_by was designed to address problems such as this and is definitely worth spending some time to learn about.
df %>%
add_count(lang, name = "Total") %>%
filter(answer == 2) %>%
add_count(lang, name = "Twos") %>%
distinct(lang, .keep_all = TRUE) %>%
mutate(prop = Twos/Total) %>%
select(lang, prop)
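A more compact variant, offered as a sketch rather than as the answerer's method, relies on the fact that the mean of a logical vector is the proportion of TRUE values. It compares against the string "2" because answer is stored as text in the sample data. The name props is introduced here for illustration.

```r
library(dplyr)

# Sample data from the question
df <- data.frame(lang   = rep(c("A", "B", "C"), 3),
                 answer = rep(c("1", "2", "3"), each = 3))

props <- df %>%
  group_by(lang) %>%
  # proportion of rows in each group whose answer is "2"
  summarise(prop = mean(answer == "2"), .groups = "drop")

print(props)
```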
Alternate solution with data.table
Personally, I prefer data.table to data frames everywhere. Here is the implementation with data.table, although admittedly it looks a bit more cryptic than the dplyr solution. (The syntax for something like this may be more involved, but getting used to it gives you a whole bag of tricks, and for simple queries the syntax actually looks better.)
In your code, you end up using lang as if it were a standalone variable, when it is the name of a column.
To get the requested values, 0.3333 for each group:
library(data.table)
df <- data.table(df)
df[, nrow(.SD[answer == 2])/nrow(.SD), by="lang"]
lang V1
1: A 0.3333333
2: B 0.3333333
3: C 0.3333333
(the special variable .SD allows you to manipulate every subset of the data, split by by)
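The same proportion can also be written without subsetting .SD, using data.table's built-in .N (the number of rows in the current group). This is a sketch under the assumption that answer is stored as text, as in the sample data; dt and props are names introduced here.

```r
library(data.table)

# Sample data from the question, built directly as a data.table
dt <- data.table(lang   = rep(c("A", "B", "C"), 3),
                 answer = rep(c("1", "2", "3"), each = 3))

# Count the matching rows and divide by the group size .N
props <- dt[, .(prop = sum(answer == "2") / .N), by = lang]

print(props)
```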

In R, how can I filter based on the maximum value in each row of my data?

I have a tibble (or data frame, if you like) that is 19 columns of pure numerical data and I want to filter it down to only the rows where at least one value is above or below a threshold. I prefer a tidyverse/dplyr solution but whatever works is fine.
This is related to this question but is distinct in at least two ways that I can see:
I have no identifier column (besides the row number, I suppose)
I need to subset based on the max across the current row being evaluated, not across a column
Here are attempts I've tried:
data %>% filter(max(.) < 8)
data %>% filter(max(value) < 8)
data %>% slice(which.max(.))
Here's a way to keep the rows that have at least one value above the threshold. To keep rows with a value below the threshold instead, just reverse the inequality inside any:
data %>%
filter(apply(., 1, function(x) any(x > threshold)))
Actually, @r2evans has a better answer in the comments:
data %>%
filter(rowSums(. > threshold) >= 1)
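To make the rowSums idea concrete: comparing a data frame with a number gives a logical matrix, and rowSums counts the TRUE values in each row. The sketch below uses base R only, on a made-up three-row data frame; the data and the names keep and filtered are assumptions for illustration.

```r
# Made-up numeric data and threshold
data <- data.frame(a = c(1, 9, 3),
                   b = c(2, 4, 10),
                   c = c(3, 5, 6))
threshold <- 8

# Logical matrix: TRUE where a cell exceeds the threshold
above <- data > threshold

# Keep rows with at least one TRUE
keep <- rowSums(above) >= 1
filtered <- data[keep, ]

print(filtered)
```

Only the second and third rows contain a value above 8, so those are the rows that remain.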
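A couple more options that should scale pretty well. iris has a non-numeric Species column, so both options are restricted to its numeric columns:
library(dplyr)
# a more dplyr-y option; if_any() (dplyr >= 1.0.4) replaces the superseded filter_all(any_vars())
iris %>%
  filter(if_any(where(is.numeric), ~ .x > 5))
# or taking advantage of base functions: pmax() gives the row-wise maximum
iris %>%
  filter(do.call(pmax, select(., where(is.numeric))) > 5)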
Maybe there are better and more efficient ways, but these two functions should do what you need if I understood correctly. This solution assumes you have only numerical data.
You transpose the tibble (so you obtain a numerical matrix)
Then you use map to get the max or min by column (which is the max/min by row in the initial dataset).
You obtain the row index you are looking for
Finally, you can filter your dataset.
library(dplyr)
library(purrr)
# Random Data -------------------------------------------------------------
data <- as_tibble(as.data.frame(replicate(10, runif(20))))
# Thresholds to be used ---------------------------------------------------
max_threshold = 0.9
min_threshold = 0.1
# Lesser_max: keep rows whose maximum is below max_threshold ---------------
lesser_max = function(data, max_threshold = 0.9) {
  # after transposing, each column is one row of the original data
  row_max =
    data %>%
    t() %>%
    as.data.frame() %>%
    map_dbl(max) %>%
    unname()
  index_max =
    row_max < max_threshold
  data[index_max,]
}
# Greater_min: keep rows whose minimum is above min_threshold --------------
greater_min = function(data, min_threshold = 0.1) {
  row_min =
    data %>%
    t() %>%
    as.data.frame() %>%
    map_dbl(min) %>%
    unname()
  index_min =
    row_min > min_threshold
  data[index_min,]
}
# Examples ----------------------------------------------------------------
data %>%
  lesser_max(max_threshold)
data %>%
  greater_min(min_threshold)
We can use base R methods
data[Reduce(`|`, lapply(data, `>`, threshold)),]
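Broken into steps, the base R one-liner works like this: lapply compares each column with the threshold, giving one logical vector per column, and Reduce combines those vectors element-wise with |, so a row is kept if any column exceeds the threshold. The data below are made up for illustration, and per_column, keep and filtered are names introduced here.

```r
# Made-up numeric data and threshold
data <- data.frame(a = c(1, 9, 3),
                   b = c(2, 4, 10),
                   c = c(3, 5, 6))
threshold <- 8

# One logical vector per column: is each value above the threshold?
per_column <- lapply(data, `>`, threshold)

# Element-wise OR across the columns
keep <- Reduce(`|`, per_column)

filtered <- data[keep, ]
print(filtered)
```

Replacing | with & would keep only the rows where every column exceeds the threshold.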

How to return a value from a variable based on a condition in another variable within a grouped data frame?

I am calculating some metrics on each of a set of variables within a grouped dataframe using the basic group_by() + summarize_at approach. Each group represents a small timeseries. One metric I would like to calculate is the initial value (in this case, day == 1) of each variable within each group. Thus, the generalized problem is to return a value of a variable based on a criterion in another variable, within groups of a grouped dataframe. Within the group_by() + summarize_at approach, I believe I need a custom function that summarize_at can then apply to each variable. I can successfully deploy other custom functions that depend only on the data variable at hand. I seem to be hung up on getting the function to go look in other columns of the dataframe.
I am not married to this approach, and welcome alternate recommendations. However, I am most comfortable with dplyr.
# a dataset
df <- data.frame(day = rep(c(1:5),3),
group = c(rep(1,5),rep(2,5),rep(3,5)),
var_a = seq(1:15),
var_b = seq(2,30, length.out = 15),
var_c = seq(3,45, length.out = 15))
# the logic of what I am going for, on a manually extracted example group:
# initial value (day == 1) of var_a for group 2
df_subset <- df %>%
filter(group == 2)
df_subset$var_a[which(df_subset$day == 1)]
# [1] 6
# my laughable attempt at a function
initial <- function(x){
ini <- which(.$day == 1)
x[ini]
}
# custom function deployed in dplyr pipe (which of course doesn't work)
df %>%
group_by(group) %>%
summarize_at(c("var_a","var_b","var_c"),
list(max = max, ini = initial))
Many thanks.
After the group_by step, specify the variables to summarise in summarise_at using one of the select helpers (here starts_with works fine), and within the list, give the functions to apply to each of those columns. The ~ prefix is a shorthand for an anonymous function, so you do not have to write function(x) explicitly. For the second function, day is not one of the selected columns, but it can still be referenced by its unquoted column name, because the function is evaluated within the current group's data.
library(dplyr)
df %>%
group_by(group) %>%
summarise_at(vars(starts_with('var')),
list(max = ~max(.), ini = ~ .[day == 1]))
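With dplyr 1.0.0 or later, the same summary can be sketched with across(), which supersedes summarise_at. This is an alternative under that version assumption, not the answerer's code; out is a name introduced here, and the output columns get the default names var_a_max, var_a_ini, and so on.

```r
library(dplyr)

# Dataset from the question
df <- data.frame(day   = rep(1:5, 3),
                 group = rep(1:3, each = 5),
                 var_a = 1:15,
                 var_b = seq(2, 30, length.out = 15),
                 var_c = seq(3, 45, length.out = 15))

out <- df %>%
  group_by(group) %>%
  summarise(
    across(starts_with("var"),
           # day is looked up within the current group
           list(max = ~ max(.x), ini = ~ .x[day == 1])),
    .groups = "drop"
  )

print(out)
```

This relies on each group having exactly one row with day == 1, as in the sample data.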
