Why does dplyr::distinct behave like this for grouped data frames?

My question involves the distinct function from dplyr.
First, set up the data:
set.seed(0)
df <- data.frame(
  x = sample(10, 100, rep = TRUE),
  y = sample(10, 100, rep = TRUE)
)
Consider the following two uses of distinct.
df %>%
group_by(x) %>%
distinct()
df %>%
group_by(x) %>%
distinct(y)
The first produces a different result from the second. As far as I can tell, the first set of operations finds "all distinct values of x, returning the first value of y", whereas the second finds "for each value of x, all distinct values of y".
Why should this be so when
df %>%
distinct(x, y)
df %>% distinct()
produce the same result?
EDIT: It looks like this is a known bug already: https://github.com/hadley/dplyr/issues/1110

As far as I can tell, the answer is that distinct considers grouping columns when determining distinctness, which to me seems inconsistent with how the rest of dplyr works.
Thus:
df %>%
group_by(x) %>%
distinct()
Group by x, find values that are distinct in x(!). This seems to be a bug.
However:
df %>%
group_by(x) %>%
distinct(y)
Group by x, find values that are distinct in y given x. This is equivalent to either of these cases:
df %>%
distinct(x, y)
df %>% distinct()
Both find distinct values in x and y.
The take-home message seems to be: Don't use grouping and distinct. Just use the relevant column names as arguments in distinct.
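For reference, a minimal ungrouped sketch in current dplyr (where distinct() has a .keep_all argument that reproduces the "first value of y" behaviour described above):
library(dplyr)
# distinct combinations of x and y, no grouping needed
df %>% distinct(x, y)
# one row per distinct x, keeping the first value of y that appears
df %>% distinct(x, .keep_all = TRUE)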

Creating Groups based on Column Position

Good afternoon!
I think this is a pretty straightforward question, but I am missing a couple of steps. I would like to create groups based on column position.
I am working with a dataframe / tibble that is 33 rows long and 66 columns wide. However, every sequence of 6 columns should really be separated into its own sub-dataframe / tibble.
The column sequence is arbitrary to the dataframe; the grouping is purely by position. Below is an attempt with mtcars, where I am trying to group every 2 columns into its own sub-dataframe.
mtcars %>%
tibble() %>%
group_by(across(seq(1,2, length.out = 11))) %>%
nest()
However, that method generates errors. Something similar applies when working just within nest() as well.
Using mtcars, I would like to create groups using a sequence of every 3 columns, or some other number.
I would ultimately like the mtcars dataframe to be...
Columns 1:3 to be group 1,
Columns 4:6 to be group 2,
Columns 7:9 to be group 3, etc... while retaining the information for the rows in each column.
Also considered something with pivot_longer...
mtcars %>%
tibble() %>%
pivot_longer(cols = seq(1,3, by = 1))
...but that did not generate defined groups, or continue the sequencing along all columns of the dataframe.
Hope one of you can help me with this! Would make certain tasks for work much easier.
PS - A plus if you can keep the workflow to tidyverse centric code :)
You could try this. It splits the dataframe into a list of dataframes based on the number of columns you want (3 in your example):
library(tidyverse)
# reshape to long form, tag every value with a column-group id (here groups
# of 3 columns), then split by that id and pivot each piece back to wide form
list_of_dataframes <- mtcars %>%
  tibble() %>%
  mutate(row = row_number()) %>%
  pivot_longer(-row) %>%
  group_by(row) %>%
  mutate(group = ceiling(row_number() / 3)) %>%
  ungroup() %>%
  group_split(group) %>%
  map(
    ~ select(.x, row, name, value) %>%
      pivot_wider()
  )
EDIT
Here, based on comments from the question asker, we will avoid pivoting the data. Instead, we map over the starting column positions and select each block of columns directly.
list_of_dataframes <- map(
  seq(1, ncol(mtcars), by = 3),
  ~ mtcars %>%
    as_tibble() %>%
    select(all_of(.x:min(c(.x + 2, ncol(mtcars)))))
)
We can then wrap this in a function to make it a little easier to use and change group sizes and dataframes:
group_split_cols <- function(.data, ncols_per_group) {
  map(
    seq(1, ncol(.data), by = ncols_per_group),
    ~ .data %>%
      as_tibble() %>%
      select(all_of(.x:min(c(.x + ncols_per_group - 1, ncol(.data)))))
  )
}
list_of_dataframes <- group_split_cols(.data = mtcars, ncols_per_group = 3)
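As a quick check (not part of the original answer), you can inspect the column names of each element; with mtcars' 11 columns, chunks of 3 leave a final chunk of 2:
map(list_of_dataframes, names)
# first element: mpg, cyl, disp; ...; last element: gear, carb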

How to group by one column then convert the other column into vectors?

For example, my df now is:
person <- c("a","a","a","b","b","b","c","c","c")
score <- c(31,2,13,5,6,7,8,9,4)
df <- data.frame(person,score)
What I want to get is a two-column table with three rows.
[1,1]="a", [1,2]= a vector of c(31,2,13)
[2,1]="b", [2,2]= a vector of c(5,6,7)
[3,1]="c", [3,2]= a vector of c(8,9,4)
Actually, I just want the three vectors so I can apply another function to them. I tried something like the following code, but it didn't work (the actual function is much more complex, but it takes in two vectors of the same length, one of which is provided).
f <- function(x,y){x-y}
df <- df %>%
group_by(person) %>%
summarise(diff = f(c(1,2,3), score))
Thanks so much in advance!
Base R solution:
aggregate(
  score ~ person,
  df,
  list
)
Tidyverse solution:
library(dplyr)
df %>%
group_by(person) %>%
summarise(score = list(score))
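To then feed each per-person vector into a two-argument function like the f above, a minimal sketch (assuming, as in the example, that every person has exactly three scores so the lengths match) is to wrap the result in list() inside summarise, or to map over the list-column:
library(dplyr)
library(purrr)
f <- function(x, y) {x - y}
# wrap the per-group result in list() so summarise returns one row per person
df %>%
  group_by(person) %>%
  summarise(diff = list(f(c(1, 2, 3), score)), .groups = "drop")
# or start from the list-column version and map f over it
df %>%
  group_by(person) %>%
  summarise(score = list(score), .groups = "drop") %>%
  mutate(diff = map(score, ~ f(c(1, 2, 3), .x)))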

How to calculate weighted mean using mutate_at in R?

I have a dataframe ("df") with a number of columns that I would like to estimate the weighted means of, weighting by population (df$Population), and grouping by commuting zone (df$cz).
This is the list of columns I would like to estimate the weighted means of:
vlist = c("Public_Welf_Total_Exp", "Welf_Cash_Total_Exp", "Welf_Cash_Cash_Assist", "Welf_Ins_Total_Exp","Total_Educ_Direct_Exp", "Higher_Ed_Total_Exp", "Welf_NEC_Cap_Outlay","Welf_NEC_Direct_Expend", "Welf_NEC_Total_Expend", "Total_Educ_Assist___Sub", "Health_Total_Expend", "Total_Hospital_Total_Exp", "Welf_Vend_Pmts_Medical","Hosp_Other_Total_Exp","Unemp_Comp_Total_Exp", "Unemp_Comp_Cash___Sec", "Total_Unemp_Rev", "Hous___Com_Total_Exp", "Hous___Com_Construct")
This is the code I have been using:
df = df %>% group_by(cz) %>% mutate_at(vlist, weighted.mean(., df$Population))
I have also tried:
df = df %>% group_by(cz) %>% mutate_at(vlist, function(x) weighted.mean(x, df$Population))
As well as tested the following code on only 2 columns:
df = df %>% group_by(cz) %>% mutate_at(vars(Public_Welf_Total_Exp, Welf_Cash_Total_Exp), weighted.mean(., df$Population))
However, everything I have tried gives me the following error, even though there are no NAs in any of my variables:
Error in weighted.mean.default(., df$Population) :
'x' and 'w' must have the same length
I understand that I could do the following estimation using lapply, but I don't know how to group by another variable using lapply. I would appreciate any suggestions!
There is a lot to unpack here...
Probably you mean summarise instead of mutate, because with mutate you would just replicate your result for each row.
mutate_at and summarise_at are superseded, and you should use across instead.
The reason your code wasn't working is that you did not write your function as a formula (you did not add ~ at the beginning), and you used df$Population instead of Population. When you write Population, summarise knows you are talking about the column Population, which at that point is grouped like the rest of the dataframe. When you use df$Population, you are calling the column of the original, ungrouped dataframe. Not only is that conceptually wrong, it also throws an error because the length of the variable you are trying to average (one group) and the length of the weights provided by df$Population (the whole column) do not match.
Here is how you could do it:
library(dplyr)
df %>%
  group_by(cz) %>%
  summarise(across(all_of(vlist), ~ weighted.mean(.x, Population)),
            .groups = "drop")
If you really need to use summarise_at (and probably you are using an old version of dplyr [lower than 1.0.0]), then you could do:
df %>%
group_by(cz) %>%
summarise_at(vlist, ~weighted.mean(., Population)) %>%
ungroup()
I considered df and vlist like the following:
vlist <- c("Public_Welf_Total_Exp", "Welf_Cash_Total_Exp", "Welf_Cash_Cash_Assist", "Welf_Ins_Total_Exp","Total_Educ_Direct_Exp", "Higher_Ed_Total_Exp", "Welf_NEC_Cap_Outlay","Welf_NEC_Direct_Expend", "Welf_NEC_Total_Expend", "Total_Educ_Assist___Sub", "Health_Total_Expend", "Total_Hospital_Total_Exp", "Welf_Vend_Pmts_Medical","Hosp_Other_Total_Exp","Unemp_Comp_Total_Exp", "Unemp_Comp_Cash___Sec", "Total_Unemp_Rev", "Hous___Com_Total_Exp", "Hous___Com_Construct")
df <- as.data.frame(matrix(rnorm(length(vlist) * 100), ncol = length(vlist)))
names(df) <- vlist
df$cz <- rep(letters[1:10], each = 10)
df$Population <- runif(100)
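As a quick sanity check (not part of the original answer), the grouped result for one commuting zone should match a direct base R computation on the corresponding subset:
# spot check for group "a" and the first column in vlist
with(subset(df, cz == "a"),
     weighted.mean(Public_Welf_Total_Exp, Population))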

Standardize data columns in R in subgroups

I'm struggling with standardization of data columns in R in subgroups.
I created the data frame:
df <- data.frame(
  salesPerson = sample(c('Alan', 'Bob', 'Cindy'), 20, replace = TRUE),
  quater = sample(c('Q1', 'Q2', 'Q3'), 20, replace = TRUE),
  salesValue = runif(20, 5.0, 7.5)
)
I would like to add an additional column to the data frame with scaled sales values.
To scale the whole column I can use:
df$salesValueScaled<-scale(df$salesValue)
The problem is that I would like to scale sales separately for each combination of the columns salesPerson and quater. Something like:
df$salesValueScaled<-scale(df$salesValue, by =c(df$salesPerson,df$quater))
I have been searching this forum but couldn't find a solution to this problem.
Thank you in advance for your help.
You can use dplyr for this:
library(dplyr)
new_df <- df %>% group_by(salesPerson, quater) %>%
mutate(scaled_Col = scale(salesValue)) %>%
ungroup
To work around groups with a single row, where scale() returns NaN, you can either keep the original values as they are or filter those groups out before scaling:
Keeping the original values (by scaling only instances where NROW is greater than 1):
new_df <- df %>% group_by(salesPerson, quater) %>%
mutate(scaled_Col = ifelse(NROW(salesValue) > 1, scale(salesValue), salesValue)) %>%
ungroup
Filtering them out (as suggested by #steveb):
new_df <- df %>% group_by(salesPerson, quater) %>%
filter(n() > 1) %>%
mutate(scaled_Col = scale(salesValue)) %>%
ungroup
I hope this helps.
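One extra detail worth noting: scale() returns a one-column matrix, so scaled_Col ends up as a matrix column. A minimal sketch of keeping it a plain numeric vector (either by coercing, or by computing the z-score by hand):
library(dplyr)
new_df <- df %>%
  group_by(salesPerson, quater) %>%
  mutate(scaled_Col = as.numeric(scale(salesValue))) %>%
  ungroup()
# or, equivalently, without scale()
new_df <- df %>%
  group_by(salesPerson, quater) %>%
  mutate(scaled_Col = (salesValue - mean(salesValue)) / sd(salesValue)) %>%
  ungroup()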

Run function on all pairs of objects in column of data frame

Suppose I have a data frame with factor "subject", and continuous variables "a" and "b". For each level of subject, I create a distance matrix from a and b:
data %>%
group_by(subject) %>%
select(a,b) %>%
do(dmat = as.matrix(dist(.)))
This returns an n-by-2 data frame, with subject and dmat as columns. What I would like to do is compute the matrix norm of each pairwise subtraction. Something along the lines of:
norm(data$dmat[[1]]-data$dmat[[2]])
norm(data$dmat[[1]]-data$dmat[[3]])
# etc etc
Ideally, I'd get out an n^2-by-3 data frame, with the first two columns indicating the two subject levels that are being compared, and the third column containing this norm calculation.
Apologies for not providing a sample dataset. I'm hoping the answer is simple enough, but if one is required I will try to write some code to generate one.
You can use mapply for this.
data %>%
  group_by(subject) %>%
  select(a, b) %>%
  do(dmat = as.matrix(dist(.))) %>%
  ungroup %>%
  do(data.frame(s1 = rep(.$subject, each = nrow(.)),
                s2 = rep(.$subject, times = nrow(.)),
                dist = mapply(rep(.$dmat, each = nrow(.)),
                              rep(.$dmat, times = nrow(.)),
                              FUN = function(x, y) norm(x - y))))
I would probably find the matrix representation of this result easier to understand:
data %>%
  group_by(subject) %>%
  select(a, b) %>%
  do(dmat = as.matrix(dist(.))) %>%
  ungroup %>%
  do(data.frame(matrix(mapply(rep(.$dmat, each = nrow(.)),
                              rep(.$dmat, times = nrow(.)),
                              FUN = function(x, y) norm(x - y)),
                       nrow = nrow(.))))
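Since do() is superseded in current dplyr, here is a minimal alternative sketch (not from the original answer), assuming data has columns subject, a and b, and that every subject has the same number of rows so the difference of distance matrices is defined:
library(dplyr)
library(tidyr)
library(purrr)
# per-subject distance matrices as a named list
dmats <- lapply(split(data[, c("a", "b")], data$subject),
                function(d) as.matrix(dist(d)))
# every ordered pair of subjects with the norm of the matrix difference
pairwise_norms <- expand_grid(s1 = names(dmats), s2 = names(dmats)) %>%
  mutate(norm_diff = map2_dbl(s1, s2, ~ norm(dmats[[.x]] - dmats[[.y]])))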
