How to summarize the data by factor levels in R

I have the following data, and I want to summarise (min/max/mean/median/mode/sd) the data by factor levels, i.e. by the cluster.kmeans column.
head(MS.DATA.IMPVAR.KMEANS,10)
subscribers arpu handset3g mou rechargesum cluster.kmeans
1 105822 197704.10 19040 2854801.0 235430 5
2 18210 34799.21 2856 419109.0 39820 6
3 71351 133842.38 13056 2021183.0 157099 3
4 44975 104681.58 9439 1303220.6 121697 2
5 75860 133190.55 12605 1714640.8 144262 5
6 63740 119389.91 11067 1651303.2 143333 1
7 59368 117792.03 11747 1690910.7 136902 5
8 40064 80427.09 7217 886214.5 89226 2
9 51966 99385.52 9972 1407985.7 117353 5
10 70811 141131.66 12362 1373104.7 158206 4
I tried using dplyr and got the following:
s_kmeans <- MS.DATA.IMPVAR.KMEANS %>% group_by(cluster.kmeans) %>% summarise_all(c("mean", "median", "min", "max", "sd"))
s_kmeans <- gather(s_kmeans, key, value, -cluster.kmeans)
s_kmeans$variable <- sapply(strsplit(s_kmeans$key, "_"), `[`,1)
s_kmeans$stat <- sapply(strsplit(s_kmeans$key, "_"), `[`, 2)
MS.DATA.STATS.KMEANS <- select(s_kmeans, -key) %>% spread(key = stat, value = value)
head(MS.DATA.STATS.KMEANS)
A tibble: 6 × 7
cluster.kmeans variable max mean median min
<fctr> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 arpu 250153.5 164652.99 163718.33 88306.53
2 1 handset3g 21809.0 13736.38 13598.00 6936.00
3 1 mou 1143639.1 338834.54 313010.20 116523.59
4 1 rechargesum 270169.0 173397.03 171897.00 89080.00
5 1 subscribers 41428.0 26515.01 26321.00 13794.00
6 2 arpu 163566.9 84552.09 82402.23 29477.03
I would like to do this in some other way, with fewer lines of code and without dplyr, using base R functions like by or aggregate.

It is not clear whether fewer lines of code or base R is the priority. However, with the current Hadleyverse format, we can keep all of the code within the %>% chain and use separate instead of the two sapply steps to make it more compact:
library(dplyr)
library(tidyr)
MS.DATA.IMPVAR.KMEANS %>%
group_by(cluster.kmeans) %>%
summarise_all(funs(mean, median, min, max, sd)) %>%
gather(key, value, -cluster.kmeans) %>%
separate(key, into = c("variable", "stats")) %>%
spread(stats, value)
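Since the question also asks about base R functions such as by or aggregate, here is a minimal base R sketch (assuming the same MS.DATA.IMPVAR.KMEANS data frame); it returns one row per cluster with columns such as arpu.mean, arpu.sd, and so on, rather than the long/spread layout above:
stats_fun <- function(x) c(mean = mean(x), median = median(x),
                           min = min(x), max = max(x), sd = sd(x))
# aggregate() applies stats_fun within each cluster; each variable's statistics
# come back as a matrix column, which do.call(data.frame, ...) flattens into
# ordinary columns
agg <- aggregate(. ~ cluster.kmeans, data = MS.DATA.IMPVAR.KMEANS, FUN = stats_fun)
MS.DATA.STATS.BASE <- do.call(data.frame, agg)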

Related

Aggregate string variable using summarise and across function

df_input is the input file, and the ideal output file is df_output.
df_input <- data.frame(id = c(1,2,3,4,4,5,5,5,6,7,8,9,10),
party = c("A","B","C","D","E","F","G","H","I","J","K","L","M"),
winner= c(1,1,1,1,1,1,1,1,1,1,1,1,1))
df_output <- data.frame(id = c(1,2,3,4,5,6,7,8,9,10),
party = c("A","B","C","D,E","F_G_H","I","J","K","L","M"),
winner_sum = c(1,1,1,2,3,1,1,1,1,1))
Previously the code worked using the "summarise_at" function as follows:
df_output <- df_input %>%
dplyr::group_by_at(.vars = vars(id)) %>%
{left_join(
dplyr::summarise_at(., vars(party), ~ str_c(., collapse = ",")),
dplyr::summarise_at(., vars(winner), funs(sum))
)}
But it no longer works, as it seems both "summarise_at" and "funs" have been deprecated.
I am trying to replicate using across with dplyr (1.0.10), but I am getting an error. Here is my attempt:
df_output <- df_input %>%
group_by(id) %>%
summarise(across(winner, sum, na.rm=T)) %>%
summarise(across(party, str_c(., collapse = ",")))
I have multiple numeric and character variables, not just one as in the example. Thanks a lot.
We don't need across when we are applying different functions to individual columns:
library(dplyr)
library(stringr)
df_input %>%
group_by(id) %>%
summarise(party = str_c(party, collapse = ","),
winner_sum = sum(winner))
-output
# A tibble: 10 × 3
id party winner_sum
<dbl> <chr> <dbl>
1 1 A 1
2 2 B 1
3 3 C 1
4 4 D,E 2
5 5 F,G,H 3
6 6 I 1
7 7 J 1
8 8 K 1
9 9 L 1
10 10 M 1
If there are multiple 'party' and 'winner' columns, loop across them in a single summarise, because after the first summarise only the summarised columns and the grouping column remain:
df_input %>%
group_by(id) %>%
summarise(across(winner, sum, na.rm=TRUE),
across(party, ~ str_c(.x, collapse = ",")), .groups = "drop")
-output
# A tibble: 10 × 3
id winner party
<dbl> <dbl> <chr>
1 1 1 A
2 2 1 B
3 3 1 C
4 4 2 D,E
5 5 3 F,G,H
6 6 1 I
7 7 1 J
8 8 1 K
9 9 1 L
10 10 1 M
NOTE: If the columns share a similar prefix, use starts_with to select them all, i.e. across(starts_with("party"), ...); if the column names are unrelated, use across(c(party, othercol), ...); and if the functions to apply depend on the column type, use across(where(is.numeric), sum, na.rm = TRUE).
df_input %>%
group_by(id) %>%
summarise(across(where(is.numeric), sum, na.rm = TRUE),
across(where(is.character), str_c, collapse = ","),
.groups = 'drop')
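A side note: in dplyr 1.1.0 and later, passing extra arguments through ... inside across() is deprecated, so the same summary is better written with anonymous functions (a hedged sketch of the equivalent call):
df_input %>%
  group_by(id) %>%
  summarise(across(where(is.numeric), ~ sum(.x, na.rm = TRUE)),
            across(where(is.character), ~ str_c(.x, collapse = ",")),
            .groups = 'drop')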

Replacing NA values with mode from multiple imputation in R

I ran 5 imputations on a data set with missing values. For my purposes, I want to replace missing values with the mode from the 5 imputations. Let's say I have the following data sets, where df is my original data, ID is a grouping variable to identify each case, and imp is my imputed data:
df <- data.frame(ID = c(1,2,3,4,5),
var1 = c(1,NA,3,6,NA),
var2 = c(NA,1,2,6,6),
var3 = c(NA,2,NA,4,3))
imp <- data.frame(ID = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4,5,5,5,5,5),
var1 = c(1,2,3,3,2,5,4,5,6,6,7,2,3,2,5,6,5,6,6,6,3,1,2,3,2),
var2 = c(4,3,2,3,2,4,6,5,4,4,7,2,4,2,3,6,5,6,4,5,3,3,4,3,2),
var3 = c(7,6,5,6,6,2,3,2,4,2,5,4,5,3,5,1,2,1,3,2,1,2,1,1,1))
I have a method that works, but it involves a ton of manual coding as I have ~200 variables total (I'm doing this on 3 different data sets with different variables). My code looks like this for one variable:
library(dplyr)
mode <- function(codes){
which.max(tabulate(codes))
}
var1 <- imp %>% group_by(ID) %>% summarise(var1 = mode(var1))
df3 <- df %>%
left_join(var1, by = "ID") %>%
mutate(var1 = coalesce(var1.x, var1.y)) %>%
select(-var1.x, -var1.y)
Thus, the original value in df is replaced with the mode only if the value was NA.
It is taking forever to keep manually coding this for every variable. I'm hoping there is an easier way of calculating the mode from the imputed data set for each variable by ID and then replacing the NAs with that mode in the original data. I thought maybe I could put the variable names in a vector and somehow iterate through them in one pass, with i taking each variable name in turn, but I didn't know where to go with that idea.
x <- colnames(df)
# Attempting to iterate through variables names using i
i = as.factor(x[[2]])
This is where I am stuck. Any help is much appreciated!
Here is one option using tidyverse. Essentially, we can pivot both dataframes long, then join together and coalesce in one step rather than column by column. Mode function taken from here.
library(tidyverse)
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
imp_long <- imp %>%
group_by(ID) %>%
summarise(across(everything(), Mode)) %>%
pivot_longer(-ID)
df %>%
pivot_longer(-ID) %>%
left_join(imp_long, by = c("ID", "name")) %>%
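# 'var1' below is just a scratch column name; it holds the coalesced value for
# every variable before pivoting back to wide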
mutate(var1 = coalesce(value.x, value.y)) %>%
select(-c(value.x, value.y)) %>%
pivot_wider(names_from = "name", values_from = "var1")
Output
# A tibble: 5 × 4
ID var1 var2 var3
<dbl> <dbl> <dbl> <dbl>
1 1 1 3 6
2 2 5 1 2
3 3 3 2 5
4 4 6 6 4
5 5 3 6 3
You can use -
library(dplyr)
mode_data <- imp %>%
group_by(ID) %>%
summarise(across(starts_with('var'), Mode))
df %>%
left_join(mode_data, by = 'ID') %>%
transmute(ID,
across(matches('\\.x$'),
function(x) coalesce(x, .[[sub('x$', 'y', cur_column())]]),
.names = '{sub(".x$", "", .col)}'))
# ID var1 var2 var3
#1 1 1 3 6
#2 2 5 1 2
#3 3 3 2 5
#4 4 6 6 4
#5 5 3 6 3
mode_data has the Mode value for each of the var columns.
Join df and mode_data by ID.
Since every pair of columns is named name.x and name.y, we can take each .x column and replace the x with y to get its matching .y column (.[[sub('x$', 'y', cur_column())]]).
Use coalesce to select the non-NA value in each pair.
Rename the columns by removing the .x suffix ({sub(".x$", "", .col)}), so var1.x becomes just var1.
where the Mode function is taken from here:
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
library(dplyr, warn.conflicts = FALSE)
imp %>%
group_by(ID) %>%
summarise(across(everything(), Mode)) %>%
bind_rows(df) %>%
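# the original df rows are appended after the per-ID mode rows, so within each ID
# first(.x) is the mode and last(.x) is the original value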
group_by(ID) %>%
summarise(across(everything(), ~ coalesce(last(.x), first(.x))))
#> # A tibble: 5 × 4
#> ID var1 var2 var3
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 3 6
#> 2 2 5 1 2
#> 3 3 3 2 5
#> 4 4 6 6 4
#> 5 5 3 6 3
Created on 2022-01-03 by the reprex package (v2.0.1)
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
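For completeness, the per-variable recipe from the question could also be wrapped in a plain loop over the variable names (a hedged sketch reusing the Mode function above, rather than one of the answers' approaches):
vars <- setdiff(names(df), "ID")
df3 <- df
for (v in vars) {
  # per-ID mode of the imputed values for this variable
  m <- imp %>% group_by(ID) %>% summarise(mode_val = Mode(.data[[v]]))
  # keep the original value, fall back to the mode where it is NA
  df3[[v]] <- coalesce(df3[[v]], m$mode_val[match(df3$ID, m$ID)])
}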

Flip group_by variable to columns, and flip columns to rows dplyr

Thank you in advance for your response! I am working in RStudio, trying to create a specific table format that my customer is looking for. Specifically, I would like to show each metric as a row and the group_by variable, in this case application type, as a column. I'm using group_by to consolidate all my data by application type, and I'm using the summarise function to create the new variables.
subs <- data.frame(
App_type = c('A','A','A','B','B','B','C','C','C','C'),
Has_error = c(1,1,1,0,0,1,1,0,1,1),
Has_critical_error = c(1,0,1,0,0,1,0,0,1,1)
)
I'm able to group the submissions together by application type to see total submissions with errors and total with critical errors. Here's what I've done -
subs %>%
group_by(App_type) %>%
summarise(
total_sub = n(),
total_error = sum(Has_error),
total_critical_error = sum(Has_critical_error)
)
# A tibble: 3 x 4
App_type total_sub total_error total_critical_error
<fct> <int> <dbl> <dbl>
1 A 3 3 2
2 B 3 1 1
3 C 4 3 2
However, my customer would like to see it this way with application totals.
A B C TOTAL
1 total_sub 3 3 4 10
2 total_error 3 1 3 7
3 total_critical_error 2 1 2 5
We can pivot to 'wide' format after reshaping to 'long', and then move the 'name' column into the row names:
library(dplyr)
library(tidyr)
library(tibble)
subs %>%
group_by(App_type) %>%
summarise(
total_sub = n(),
total_error = sum(Has_error),
total_critical_error = sum(Has_critical_error)) %>%
pivot_longer(cols = -App_type) %>%
pivot_wider(names_from = App_type, values_from = value) %>%
mutate(TOTAL = A + B + C) %>%
column_to_rownames("name")
# A B C TOTAL
#total_sub 3 3 4 10
#total_error 3 1 3 7
#total_critical_error 2 1 2 5
Or another option is transpose from data.table
library(data.table)
data.table::transpose(setDT(out), make.names = 'App_type',
keep.names = 'name')[, TOTAL := A + B + C][]
where out is the OP's summarised output
out <- subs %>%
group_by(App_type) %>%
summarise(
total_sub = n(),
total_error = sum(Has_error),
total_critical_error = sum(Has_critical_error)
)
Or with base R
addmargins(t(cbind(total_sub = as.integer(table(subs$App_type)),
rowsum(subs[-1], subs$App_type))), 2)
# A B C Sum
#total_sub 3 3 4 10
#Has_error 3 1 3 7
#Has_critical_error 2 1 2 5
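If the row and column labels of the base R version need to match the requested layout exactly, they can be renamed afterwards (a hedged sketch building on the addmargins() result above):
out_base <- addmargins(t(cbind(total_sub = as.integer(table(subs$App_type)),
                               rowsum(subs[-1], subs$App_type))), 2)
# rename rows and the margin column to mirror the dplyr output
dimnames(out_base) <- list(c("total_sub", "total_error", "total_critical_error"),
                           c("A", "B", "C", "TOTAL"))
out_base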

How to Use Rank Function in R (using dplyr)

I have a data table called prob72. I want to add a column for rank. I want to rank each row by frac_miss_arr_delay. The highest value of frac_miss_arr_delay should get rank 1 and the lowest value should get the highest ranking (for my data that is rank 53). frac_miss_arr_delay are decimal values all less than 1. When I use the following line of code it ranks every single row as "1"
prob72<- prob72 %>% mutate(rank=rank(desc(frac_miss_arr_delay), ties.method = "first"))
I've tried using row_number as well
prob72<- prob72 %>% mutate(rank=row_number())
This STILL outputs all "1s" in the rank column.
week arrDelayIsMissi~ n n_total frac_miss_arr_d~
<dbl> <lgl> <int> <int> <dbl>
1 6. TRUE 1012 6101 0.166
2 26. TRUE 536 6673 0.0803
3 10. TRUE 518 6549 0.0791
4 50. TRUE 435 6371 0.0683
5 49. TRUE 404 6398 0.0631
6 21. TRUE 349 6285 0.0555
prob72[6]
# A tibble: 53 x 1
rank
<int>
1 1
2 1
3 1
4 1
5 1
6 1
7 1
8 1
9 1
10 1
# ... with 43 more rows
flights_week = mutate(flights, week=lubridate::week(time_hour))
prob51<-flights_week %>%
mutate(pos_arr_delay=if_else(arr_delay<0,0,arr_delay))
prob52<-prob51 %>% group_by(week) %>% mutate(avgDelay =
mean(pos_arr_delay,na.rm=T))
prob52 <- prob52 %>% mutate(ridic_late=TRUE)
prob52$ridic_late<- ifelse(prob52$pos_arr_delay>prob52$avgDelay*10,TRUE, FALSE)
prob53<- prob52 %>% group_by(week) %>% count(ridic_late) %>% arrange(desc(ridic_late))
prob53<-prob53 %>% filter(ridic_late==TRUE)
prob54<- prob52 %>% group_by(week) %>% count(n())
colnames(prob53)[3] <- "n_ridiculously_late"
prob53["n"] <- NA
prob53$n <- prob54$n
table5 = subset(prob53, select=c(week,n, n_ridiculously_late))
prob71 <- flights_week
prob72 <- prob71 %>% group_by(week) %>% count(arrDelayIsMissing=is.na(arr_delay)) %>% arrange(desc(arrDelayIsMissing)) %>% filter(arrDelayIsMissing==TRUE)
prob72["n_total"] <- NA
prob72$n_total<- table5$n
prob72<-prob72 %>% mutate(percentageMissing = n/n_total)
prob72<-prob72 %>% arrange(desc(percentageMissing))
colnames(prob72)[5]="frac_miss_arr_delay"
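A likely cause, judging from the code above (a hedged diagnosis, not taken from an answer): prob72 is built with group_by(week) followed by count() and filter(), so it is still a grouped tibble in which every week forms a one-row group. rank() and row_number() are then evaluated within each group, which is why every row gets rank 1. Dropping the grouping first should give the expected ranking:
prob72 <- prob72 %>%
  ungroup() %>%   # rank across all weeks, not within each one-row group
  mutate(rank = rank(desc(frac_miss_arr_delay), ties.method = "first"))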

r: Summarise for rowSums after group_by

I've tried searching a number of posts on SO but I'm not sure what I'm doing wrong here, and I imagine the solution is quite simple. I'm trying to group a dataframe by one variable and figure the mean of several variables within that group.
Here is what I am trying:
head(airquality)
target_vars = c("Ozone","Temp","Solar.R")
airquality %>% group_by(Month) %>% select(target_vars) %>% summarise(rowSums(.))
But I get the error that my lengths don't match. I've tried variations using mutate to create the column, or summarise_all, but neither of these seems to work. I need the row sums within each group, and then to compute the mean within the group (yes, it's nonsensical here).
Also, I want to use select because I'm trying to do this over just certain variables.
I'm sure this could be a duplicate, but I can't find the right one.
EDIT FOR CLARITY
Sorry, my original question was not clear. Imagine the grouping variable is the calendar month, and we have v1, v2, and v3. I'd like to know, within month, what was the average of the sums of v1, v2, and v3. So if we have 12 months, the result would be a 12x1 dataframe. Here is an example if we just had 1 month:
Month v1 v2 v3 Sum
1 1 1 0 2
1 1 1 1 3
1 1 0 0 1
Then the result would be:
Month Average
1 6/3
You can try:
library(tidyverse)
airquality %>%
select(Month, target_vars) %>%
gather(key, value, -Month) %>%
group_by(Month) %>%
summarise(n=length(unique(key)),
Sum=sum(value, na.rm = T)) %>%
mutate(Average=Sum/n)
# A tibble: 5 x 4
Month n Sum Average
<int> <int> <int> <dbl>
1 5 3 7541 2513.667
2 6 3 8343 2781.000
3 7 3 10849 3616.333
4 8 3 8974 2991.333
5 9 3 8242 2747.333
The idea is to convert the data from wide to long using tidyr::gather(), then group by Month and calculate the sum and the average.
This seems to deliver what you want, and it's regular (base) R. The sapply function keeps the months separated by "name"; the sum function applied to each data frame will not keep the column sums separate. (Correction #2: used only target_vars):
sapply( split( airquality[target_vars], airquality$Month), sum, na.rm=TRUE)
5 6 7 8 9
7541 8343 10849 8974 8242
If you wanted the result per variable, then you would divide by the number of variables:
sapply( split( airquality[target_vars], airquality$Month), sum, na.rm=TRUE)/
(length(target_vars))
5 6 7 8 9
2513.667 2781.000 3616.333 2991.333 2747.333
Perhaps this is what you're looking for
library(dplyr)
library(purrr)
library(tidyr) # forgot this in original post
airquality %>%
group_by(Month) %>%
nest(Ozone, Temp, Solar.R, .key=newcol) %>%
mutate(newcol = map_dbl(newcol, ~mean(rowSums(.x, na.rm=TRUE))))
# A tibble: 5 x 2
# Month newcol
# <int> <dbl>
# 1 5 243.2581
# 2 6 278.1000
# 3 7 349.9677
# 4 8 289.4839
# 5 9 274.7333
I've never encountered a situation where all the answers disagreed. Here's some validation (at least I think) for the 5th month
airquality %>%
filter(Month == 5) %>%
select(Ozone, Temp, Solar.R) %>%
mutate(newcol = rowSums(., na.rm=TRUE)) %>%
summarise(sum5 = sum(newcol), mean5 = mean(newcol))
# sum5 mean5
# 1 7541 243.2581
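For reference, the same per-month mean of row sums can also be written in a single dplyr pipeline with across() (a hedged sketch, assuming dplyr >= 1.0 where across() is available):
airquality %>%
  group_by(Month) %>%
  summarise(Average = mean(rowSums(across(all_of(target_vars)), na.rm = TRUE)))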
