How to use group_by() with rep_len() in R

Let me know if I need a dummy example for this but essentially I have a df of subgroups, each subgroup a different length (typically 30-35k values). I'd like to bind in a vector with partial vector recycling of c(1:200). From this question I figure I can use rep_len() to get around the dataframe's anti-partial-recycling. The problem is, I can't define length.out in rep_len(), as length.out changes with each subgroup. Any help would be appreciated. I tried doing this:
df_new <- df %>%
group_by(subgroup) %>%
mutate(newcol <- rep_len(1:200, length.out=.))
Which threw an invalid length.out error. I also tried
df_new <- df %>%
group_by(subgroup) %>%
mutate(newcol <- rep_len(1:200, length.out=nrow(.)))
But this throws an error that length.out is the length of my entire df, not the previous subgroup. Any help would be appreciated!

The dplyr package has a function n() that returns the number of rows in the current group, so it can be used as length.out here:
mtcars %>%
group_by(cyl) %>%
mutate(newcol = rep_len(1:200, length.out=n()))
Also, inside mutate() the new column must be named with "=", not "<-".
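A minimal sketch of the same idea on a small data frame, so the partial recycling is easy to see. The data frame and the 1:3 pattern are made up for illustration; the question's real data would use 1:200.

```r
library(dplyr)

# Hypothetical toy data: two subgroups of different sizes (5 and 3 rows)
df <- data.frame(
  subgroup = c(rep("a", 5), rep("b", 3)),
  value    = 1:8
)

df_new <- df %>%
  group_by(subgroup) %>%
  # n() is the row count of the current group, so the 1:3 pattern
  # restarts in every subgroup and is cut off at that group's length
  mutate(newcol = rep_len(1:3, length.out = n())) %>%
  ungroup()

df_new$newcol
```

Subgroup "a" gets 1, 2, 3, 1, 2 and subgroup "b" gets 1, 2, 3, which is the partial recycling that a plain data frame assignment would refuse.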

Related

How to calculate weighted mean using mutate_at in R?

I have a dataframe ("df") with a number of columns that I would like to estimate the weighted means of, weighting by population (df$Population), and grouping by commuting zone (df$cz).
This is the list of columns I would like to estimate the weighted means of:
vlist = c("Public_Welf_Total_Exp", "Welf_Cash_Total_Exp", "Welf_Cash_Cash_Assist", "Welf_Ins_Total_Exp","Total_Educ_Direct_Exp", "Higher_Ed_Total_Exp", "Welf_NEC_Cap_Outlay","Welf_NEC_Direct_Expend", "Welf_NEC_Total_Expend", "Total_Educ_Assist___Sub", "Health_Total_Expend", "Total_Hospital_Total_Exp", "Welf_Vend_Pmts_Medical","Hosp_Other_Total_Exp","Unemp_Comp_Total_Exp", "Unemp_Comp_Cash___Sec", "Total_Unemp_Rev", "Hous___Com_Total_Exp", "Hous___Com_Construct")
This is the code I have been using:
df = df %>% group_by(cz) %>% mutate_at(vlist, weighted.mean(., df$Population))
I have also tried:
df = df %>% group_by(cz) %>% mutate_at(vlist, function(x) weighted.mean(x, df$Population))
As well as tested the following code on only 2 columns:
df = df %>% group_by(cz) %>% mutate_at(vars(Public_Welf_Total_Exp, Welf_Cash_Total_Exp), weighted.mean(., df$Population))
However, everything I have tried gives me the following error, even though there are no NAs in any of my variables:
Error in weighted.mean.default(., df$Population) :
'x' and 'w' must have the same length
I understand that I could do the following estimation using lapply, but I don't know how to group by another variable using lapply. I would appreciate any suggestions!
There is a lot to unpack here...
You probably mean summarise rather than mutate: mutate would just repeat each group's result on every row of that group.
mutate_at and summarise_at are superseded; you should use across instead.
Your code failed for two reasons. First, in the first and third attempts weighted.mean(., df$Population) is a call, not a function: mutate_at needs a function or a formula (a lambda starting with ~). Second, you used df$Population instead of Population. When you write Population, summarise knows you mean the column Population, which at that point is grouped like the rest of the data frame. df$Population refers to the column of the original, ungrouped data frame, so its length does not match the length of the group being averaged. That mismatch is exactly the "'x' and 'w' must have the same length" error.
Here is how you could do it:
library(dplyr)
df %>%
group_by(cz) %>%
summarise(across(all_of(vlist), ~ weighted.mean(.x, Population)),
.groups = "drop")
If you really need to use summarise_at (and probably you are using an old version of dplyr [lower than 1.0.0]), then you could do:
df %>%
group_by(cz) %>%
summarise_at(vlist, ~weighted.mean(., Population)) %>%
ungroup()
I used the following df and vlist as test data:
vlist <- c("Public_Welf_Total_Exp", "Welf_Cash_Total_Exp", "Welf_Cash_Cash_Assist", "Welf_Ins_Total_Exp","Total_Educ_Direct_Exp", "Higher_Ed_Total_Exp", "Welf_NEC_Cap_Outlay","Welf_NEC_Direct_Expend", "Welf_NEC_Total_Expend", "Total_Educ_Assist___Sub", "Health_Total_Expend", "Total_Hospital_Total_Exp", "Welf_Vend_Pmts_Medical","Hosp_Other_Total_Exp","Unemp_Comp_Total_Exp", "Unemp_Comp_Cash___Sec", "Total_Unemp_Rev", "Hous___Com_Total_Exp", "Hous___Com_Construct")
df <- as.data.frame(matrix(rnorm(length(vlist) * 100), ncol = length(vlist)))
names(df) <- vlist
df$cz <- rep(letters[1:10], each = 10)
df$Population <- runif(100)
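As a rough check of the explanation above, here is a sketch on a tiny made-up data frame (the names toy and spend are hypothetical, not from the question). It shows that a bare Population inside summarise is evaluated per group, and compares one group against a manual base R calculation.

```r
library(dplyr)

# Hypothetical toy data: two commuting zones, one expenditure column
toy <- data.frame(
  cz         = c("a", "a", "b", "b"),
  spend      = c(10, 20, 30, 50),
  Population = c(1, 3, 1, 1)
)

res <- toy %>%
  group_by(cz) %>%
  # Population is looked up inside the current group, so it always has
  # the same length as the column being averaged
  summarise(across(all_of("spend"), ~ weighted.mean(.x, Population)),
            .groups = "drop")

# Base R cross-check for zone "a": (10*1 + 20*3) / (1 + 3) = 17.5
manual_a <- with(toy[toy$cz == "a", ],
                 sum(spend * Population) / sum(Population))
```

Replacing Population with toy$Population in the lambda would pass all four weights to a two-row group and reproduce the length error from the question.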

Error when filtering on rowSums using dplyr

I have the following df where df <- data.frame(V1=c(0,0,1),V2=c(0,0,2),V3=c(-2,0,2))
If I do filter(df,rowSums!=0) I get the following error:
Error in filter_impl(.data, quo) :
Evaluation error: comparison (6) is possible only for atomic and list types.
Does anybody know why is that?
Thanks for your help
PS: Plain rowSums(df)!=0 works just fine and gives me the expected logical
A more tidyverse-style approach to the problem is to make your data tidy first, i.e. reshape it to long format with only one data value per row.
Sample data
library(dplyr)
library(tidyr)  # provides gather()
my_mat <- matrix(sample(c(1, 0), 60, replace = TRUE), nrow = 30) %>% as.data.frame()
Tidy data and form implicit row sums using group_by
my_mat %>%
mutate(row = row_number()) %>%
gather(col, val, -row) %>%
group_by(row) %>%
filter(sum(val) == 0)
This tidy approach is not always as fast as base R, and it isn't always appropriate for all data types.
OK, I got it.
filter(df,rowSums(df)!=0)
Not the most difficult one...
Thanks.
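For anyone hitting the same error, a short sketch of what goes wrong, using the data frame from the question: without parentheses, rowSums is the function itself, and comparing a function to 0 is what triggers the "comparison is possible only for atomic and list types" message.

```r
library(dplyr)

df <- data.frame(V1 = c(0, 0, 1), V2 = c(0, 0, 2), V3 = c(-2, 0, 2))

# Without parentheses, rowSums is the function object, not a vector of sums
is.function(rowSums)

# Calling it on the data frame gives one sum per row: -2, 0, 5
sums <- rowSums(df)

# Keep only the rows whose sum is non-zero (rows 1 and 3)
kept <- filter(df, rowSums(df) != 0)
```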

Subset a dplyr result

I'm trying to subset the result of a dplyr call. Can someone explain why this doesn't work?
library(dplyr)
df<-data.frame(name=c("bob","ann"),age=c(22,24),random=c(1,2))
View(df%>%filter(name=="bob")) #works fine
# Now, to avoid showing the random column, I tried:
View(df%>%filter(name=="bob")[,c(1,2)]) # standard subset notation to remove column 3 doesn't work here
I think if you're going to use dplyr to filter the df, you should use dplyr to select from the df. Not sure if there's any performance differences.
df %>%
filter(name == "bob") %>%
select(1,2)
df %>%
filter(name == "bob") %>%
select(name, age)
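As for why the original attempt fails: `[` binds more tightly than %>%, so the subset is applied to the filter(...) call on the right-hand side rather than to the result of the pipeline. If you do want bracket subsetting, the following sketch shows two ways that should work; it is only one possible workaround, and select() above is the more idiomatic choice.

```r
library(dplyr)

df <- data.frame(name = c("bob", "ann"), age = c(22, 24), random = c(1, 2))

# Option 1: parenthesise the whole pipeline, then subset its result
res1 <- (df %>% filter(name == "bob"))[, c(1, 2)]

# Option 2: stay in the pipe and call `[` on the dot placeholder
res2 <- df %>% filter(name == "bob") %>% .[, c(1, 2)]
```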

R: Using piping to pass a single argument to multiple locations in a function

I am attempting to exclusively use piping to rewrite the following code (using the babynames data from the babynames package):
library(babynames)
library(dplyr)
myDF <- babynames %>%
group_by(year) %>%
summarise(totalBirthsPerYear = sum(n))
slice(myDF, seq(1, nrow(myDF), by = 20))
The closest I have gotten is this code (not working):
myDF <- babyNames %>%
group_by(year) %>%
summarise(totalBirthsPerYear = sum(n)) %>%
slice( XXX, seq(1, nrow(XXX), by = 20))
where XXX is meant to be passed via pipes to slice, but I'm stuck. Any help is appreciated.
You can reference the piped data in a different position in the function by using the dot placeholder (.). In your case:
myDF2 <- babynames %>%
group_by(year) %>%
summarize(totalBirthsPerYear = sum(n)) %>%
slice(seq(1, nrow(.), by = 20))
Not sure if this should be opened as a separate question & answer but in case anybody arrives here as I did looking for the answer to the MULTIPLE in the title:
R: Using piping to pass a single argument to multiple locations in a function
Using the . from Andrew's answer in multiple places also achieves this.
For example, to get the last element of the vector
vec <- c("first", "middle", "last")
we could use this code:
vec[length(vec)]
Using piping, the following code achieves the same thing:
vec %>% .[length(.)]
Hopefully this is helpful to others as it would have helped me (I knew about the . but couldn't get it working in multiple locations).
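A related sketch, in case the expression gets more involved: wrapping the right-hand side in braces stops magrittr from inserting the left-hand side as a first argument, so the dot can be used freely in several places. The variable name ends is just for illustration.

```r
library(magrittr)

vec <- c("first", "middle", "last")

# Inside braces the left-hand side is not inserted as a first argument;
# it is only available as `.`, which can appear as often as needed
ends <- vec %>% { c(.[1], .[length(.)]) }
```

Here ends holds the first and the last element of vec.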

Standardize data columns in R in subgroups

I'm struggling with standardization of data columns in R in subgroups.
I created the data frame:
df<-data.frame(
salesPerson=sample(c('Alan','Bob','Cindy'),20 ,replace=TRUE)
, quater=sample(c('Q1','Q2','Q3'),20 ,replace=TRUE)
,salesValue=runif(20, 5.0, 7.5)
)
I would like to add an additional column to the data frame with scaled values of salesValue.
To scale the whole column I can use this code:
df$salesValueScaled<-scale(df$salesValue)
The problem is that I would like to scale sales separately for each combination of the columns salesPerson and quater. Something like:
df$salesValueScaled<-scale(df$salesValue, by =c(df$salesPerson,df$quater))
I have been searching for this solution on this forum but I couldn't find a solution to this problem.
Thank you in advance for help.
You can use dplyr for this:
library(dplyr)
new_df <- df %>% group_by(salesPerson, quater) %>%
mutate(scaled_Col = scale(salesValue)) %>%
ungroup
A group with a single row has no standard deviation, so scale() returns a missing value for it. To work around such rows, you can either keep the original values as they are or filter those groups out before scaling:
Keeping the original values (scaling only groups with more than one row):
new_df <- df %>% group_by(salesPerson, quater) %>%
mutate(scaled_Col = if (n() > 1) as.numeric(scale(salesValue)) else salesValue) %>%
ungroup
Filtering them out (as suggested by @steveb):
new_df <- df %>% group_by(salesPerson, quater) %>%
filter(n() > 1) %>%
mutate(scaled_Col = scale(salesValue)) %>%
ungroup
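A quick sanity check of the grouped scaling on a small made-up data frame (toy is hypothetical; the column names follow the question). It is only a sketch, but it shows both the per-group result and the missing value produced by a single-row group.

```r
library(dplyr)

# Hypothetical toy data: Alan/Q1 has three rows, Bob/Q1 has a single row
toy <- data.frame(
  salesPerson = c("Alan", "Alan", "Alan", "Bob"),
  quater      = c("Q1", "Q1", "Q1", "Q1"),
  salesValue  = c(5, 6, 7, 6.5)
)

scaled <- toy %>%
  group_by(salesPerson, quater) %>%
  # as.numeric() drops the one-column matrix that scale() returns
  mutate(scaled_Col = as.numeric(scale(salesValue))) %>%
  ungroup()

scaled$scaled_Col
```

The three Alan rows are centred on their own mean of 6 with a standard deviation of 1, while the lone Bob row comes back as a missing value that is.na() flags.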
I hope this helps.
