In R, how to summarize a data frame in multiple dimensions

There is a data frame raw_data as below. How can I change it to wished_data in an easy way?
I currently know how to group_by/summarise the data several times (adding variables each time) and then rbind the results. But this is a little tedious, especially when there are more variables than in this example.
I want to know whether there is a general method for situations like this. Thanks!
library(tidyverse)
country <- c('UK','US','UK','US')
category <- c("A", "B", "A", "B")
y2021 <- c(17, 42, 21, 12)
y2022 <- c(49, 23, 52, 90)
raw_data <- data.frame(country,category,y2021,y2022)

We may use rollup/cube/groupingsets from data.table
library(data.table)
out <- rbind(setDT(raw_data),
             groupingsets(raw_data, j = lapply(.SD, sum),
                          by = c("country", "category"),
                          sets = list("country", "category", character())))
out[is.na(out)] <- 'TOTAL'
Output:
> out
country category y2021 y2022
<char> <char> <num> <num>
1: UK A 17 49
2: US B 42 23
3: UK A 21 52
4: US B 12 90
5: UK TOTAL 38 101
6: US TOTAL 54 113
7: TOTAL A 38 101
8: TOTAL B 54 113
9: TOTAL TOTAL 92 214
Or with cube
out <- rbind(raw_data,
             cube(raw_data,
                  j = .(y2021 = sum(y2021), y2022 = sum(y2022)),
                  by = c("country", "category")))
out[is.na(out)] <- 'TOTAL'
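The answer also mentions rollup; note that rollup aggregates hierarchically, producing subtotals only for leading prefixes of by, so there is no category-only subtotal and the finest (country, category) level is collapsed rather than kept as raw rows. A minimal sketch, assuming raw_data is already a data.table after the setDT above:
out <- rollup(raw_data, j = lapply(.SD, sum), by = c("country", "category"))
# grouping sets here are (country, category), (country) and (): no category-only subtotal
out[is.na(out)] <- 'TOTAL'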

We can use the adorn_totals function from janitor. get_totals accepts a data frame and a column, and it outputs the total rows for the numeric columns, one such row for each level of the specified column. It extracts the total rows and, since adorn_totals can rearrange the column order, uses select to restore the original order so that we can later bind multiple instances together. We then bind together the original data frame and each of the total-row data frames that we want.
library(dplyr)
library(janitor)
get_totals <- function(data, col) {
  data %>%
    group_by({{col}}) %>%
    group_modify(~ adorn_totals(.)) %>%
    ungroup() %>%
    filter(rowSums(. == "Total") > 0) %>%
    select(any_of(names(data)))
}
bind_rows(
  raw_data,
  get_totals(raw_data, category),
  get_totals(raw_data, country),
  get_totals(raw_data)
)
giving:
country category y2021 y2022
1 UK A 17 49
2 US B 42 23
3 UK A 21 52
4 US B 12 90
5 Total A 38 101
6 Total B 54 113
7 UK Total 38 101
8 US Total 54 113
9 Total - 92 214
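If you prefer to stay in plain dplyr, the repeated group_by/summarise pattern from the question can also be generalised by looping over grouping sets with purrr; a sketch, where the sets list and the 'TOTAL' label are choices for this example rather than a fixed API:
library(dplyr)
library(purrr)
sets <- list("country", "category", character(0))  # subtotal dimensions, plus the grand total
totals <- map_dfr(sets, ~ raw_data %>%
                    group_by(across(all_of(.x))) %>%
                    summarise(across(c(y2021, y2022), sum), .groups = "drop"))
bind_rows(raw_data, totals) %>%
  mutate(across(c(country, category), ~ coalesce(.x, "TOTAL")))  # NA marks a summed-over dimension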

Related

Keeping one row and discarding others in R using specific criteria?

I'm working with the data frame below, which is just part of the full data, and I need to condense the duplicate numbers in the id column into one row. I want to preserve the row that has the highest sbp number, unless it's 300 or over, in which case I want to discard that row too.
So, for example, for the first three rows with id 13480, I want to keep the row that has 124 and discard the other two.
id,sex,visits,sbp
13480,M,2,124
13480,M,3,306
13480,M,4,116
13520,M,2,124
13520,M,3,116
13520,M,4,120
13580,M,2,NA
13580,M,3,124
This is the farthest I've got; I've been trying to tweak this but I'm not sure I'm on the right track:
maxsbp <- split(sbp, sbp$sbp)
r <- data.frame()
for (i in 1:length(maxsbp)) {
  one <- maxsbp[[i]]
  index <- which(one$sbp == max(one$sbp))
  select <- one[index, ]
  r <- rbind(r, select)
}
r1 <- r[!(sbp$sbp>=300),]
r1
I think a tidy solution would work well here. I would first filter out all values of 300 or above, if you do not want to keep any value at or beyond that threshold. Then group_by id, arrange in descending order, and keep the first row.
my.df <- data.frame("id"  = c(124, 306, 116, 124, 116, 120, NA, 124) |> (\(x) c(13480, 13480, 13480, 13520, 13520, 13520, 13580, 13580))(),
                    "sex" = c("M", "M", "M", "M", "M", "M", "M", "M"),
                    "sbp" = c(124, 306, 116, 124, 116, 120, NA, 124))
my.df %>%
  filter(sbp < 300) %>%  # retain only values below 300
  group_by(id) %>%       # group by id
  arrange(-sbp) %>%      # arrange by sbp in descending order
  top_n(1, sbp)          # retain the first value, i.e. the largest
# A tibble: 3 x 3
# Groups: id [3]
# id sex sbp
# <dbl> <chr> <dbl>
#1 13480 M 124
#2 13520 M 124
#3 13580 M 124
In R, you'll very rarely need explicit for loops for tasks like this.
There are functions available that will help you perform such grouped operations.
For example, in base R you can use subset and ave:
subset(df, sbp == ave(sbp, id, FUN = function(x) max(x[x <= 300], na.rm = TRUE)))
# id sex visits sbp
#1 13480 M 2 124
#4 13520 M 2 124
#8 13580 M 3 124
The same can be done using dplyr whose syntax is a little bit easier to understand.
library(dplyr)
df %>%
  group_by(id) %>%
  filter(sbp == max(sbp[sbp <= 300], na.rm = TRUE))
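On dplyr 1.0 or later, the same idea can also be written with slice_max; a sketch, where with_ties = FALSE keeps a single row per id:
df %>%
  filter(sbp <= 300) %>%  # also drops NA readings
  group_by(id) %>%
  slice_max(sbp, n = 1, with_ties = FALSE) %>%
  ungroup()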
slice_head can also be used
my.df <- data.frame("id"  = c(13480, 13480, 13480, 13520, 13520, 13520, 13580, 13580),
                    "sex" = c("M", "M", "M", "M", "M", "M", "M", "M"),
                    "sbp" = c(124, 306, 116, 124, 116, 120, NA, 124))
> my.df
id sex sbp
1 13480 M 124
2 13480 M 306
3 13480 M 116
4 13520 M 124
5 13520 M 116
6 13520 M 120
7 13580 M NA
8 13580 M 124
Proceed simply like this:
my.df %>%
  group_by(id, sex) %>%
  arrange(desc(sbp)) %>%
  slice_head() %>%
  filter(sbp < 300)
# A tibble: 2 x 3
# Groups: id, sex [2]
id sex sbp
<dbl> <chr> <dbl>
1 13520 M 124
2 13580 M 124
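Note that filtering after slicing discards id 13480 altogether, since its largest sbp (306) wins the slice and is then removed. If its valid 124 reading should be kept, as in the other answers, filter before slicing:
my.df %>%
  filter(sbp < 300) %>%
  group_by(id, sex) %>%
  arrange(desc(sbp)) %>%
  slice_head()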

Sum Specific Rows by Multiple Groups

I have a data frame like the one below...
df <- data.frame(row.names = c(1, 2, 3, 4, 5, 6, 7, 8),
                 Week = c(1, 1, 2, 2, 52, 52, 53, 53),
                 State = c("Florida", "Georgia", "Florida", "Georgia",
                           "Florida", "Georgia", "Florida", "Georgia"),
                 Count_2001 = c(25, 16, 83, 45, 100, 98, 22, 34),
                 Count_2002 = c(3, 78, 22, 5, 78, 6, 88, 97))
I am now trying to manipulate this dataset so that only weeks 52 and 53 get summed together for each state, across all of the Count columns, similar to this example: GROUP BY for specific rows
The new dataset should have these rows summed together to create the new Week 52 row for each state, like this example below...
df2 <- data.frame(row.names = c(1, 2, 3, 4, 5, 6),
                  Week = c(1, 1, 2, 2, 52, 52),
                  State = c("Florida", "Georgia", "Florida", "Georgia", "Florida", "Georgia"),
                  Count_2001 = c(25, 16, 83, 45, 122, 132),
                  Count_2002 = c(3, 78, 22, 5, 166, 103))
Is there an easy solution for this in R?
Change your 53s to 52s and do a sum by group:
library(dplyr)
df %>%
  mutate(Week = case_when(Week == 53 ~ 52, TRUE ~ Week)) %>%
  group_by(State, Week) %>%
  summarize(across(everything(), sum))
# # A tibble: 6 x 4
# # Groups: State [2]
# State Week Count_2001 Count_2002
# <chr> <dbl> <dbl> <dbl>
# 1 Florida 1 25 3
# 2 Florida 2 83 22
# 3 Florida 52 122 166
# 4 Georgia 1 16 78
# 5 Georgia 2 45 5
# 6 Georgia 52 132 103
Using aggregate:
s <- 52:53
tp <- transform(aggregate(cbind(Count_2001, Count_2002) ~ State,
                          df[df$Week %in% s, ], sum),
                Week = 52)
df <- merge(df[!df$Week %in% s, ], tp, all = TRUE)
df
# Week State Count_2001 Count_2002
# 1 1 Florida 25 3
# 2 1 Georgia 16 78
# 3 2 Florida 83 22
# 4 2 Georgia 45 5
# 5 52 Florida 122 166
# 6 52 Georgia 132 103
A simple alternative that avoids anything state-specific is just to create a new column with weeks at the level of aggregation that works.
Using the tidyverse, I'd get this by doing:
df <- df %>%
  mutate(Week1 = if_else(Week %in% c(52, 53), 52, Week))
and then you can sum with:
dfsumm <- df %>%
  group_by(State, Week1) %>%
  summarise(across(starts_with("Count"), sum))

sum all visits by customer in the next week

I need to count future visits by a specific customer in the next 7 days. I solved this with purrr::map2 but I'm experiencing very slow performance. I think I must be missing something basic about how to use purrr. How do I speed this up? Thanks!
This toy example takes 2.3 seconds with 100 rows, but 3.3 minutes with 1000 rows on my machine. My actual data has 400K rows!
library(tidyverse)
set.seed(123)
rows <- 1000
df <- data.frame(cust_num = sample(c("123", "124", "128"), rows, replace = TRUE),
                 date = sample(seq(as.Date('2017/01/01'), as.Date('2017/01/31'), by = "day"),
                               rows, replace = TRUE))
df <- df %>%
  rowwise() %>%
  mutate(visits.next.7.days = map2_lgl(df$cust_num, df$date,
                                       ~ .x == cust_num & .y > date & .y < (date + 7)) %>% sum())
Here's an option that uses purrr::reduce to sum the list of vectors returned by data.table::shift (a vectorized version of lead/lag). pmap_int with sum would do the same as reduce with `+` if you like, but it's a little slower. You could similarly do map(1:7, ~lead(n, .x, default = 0L)) instead of data.table::shift, but it's more code and slower.
library(tidyverse)
set.seed(123)
rows <- 1000
df <- data.frame(cust_num = sample(c("123", "124", "128"), rows, replace = TRUE),
                 date = sample(seq(as.Date('2017/01/01'),
                                   as.Date('2017/01/31'),
                                   by = "day"),
                               rows, replace = TRUE))
df2 <- df %>%
  count(cust_num, date) %>%
  group_by(cust_num) %>%
  # add dates with no occurrences; none in this sample data, but quite possible in real data
  complete(date = seq(min(date), max(date), by = 'day'), fill = list(n = 0L)) %>%
  mutate(visits_next_7 = reduce(data.table::shift(n, 1:7, type = 'lead', fill = 0L), `+`)) %>%
  right_join(df)
df2
#> # A tibble: 1,000 x 4
#> # Groups: cust_num [?]
#> cust_num date n visits_next_7
#> <fctr> <date> <int> <int>
#> 1 123 2017-01-09 10 78
#> 2 128 2017-01-19 12 70
#> 3 124 2017-01-05 15 73
#> 4 128 2017-01-27 14 37
#> 5 128 2017-01-27 14 37
#> 6 123 2017-01-15 19 74
#> 7 124 2017-01-24 12 59
#> 8 128 2017-01-10 10 78
#> 9 124 2017-01-03 19 77
#> 10 124 2017-01-14 8 84
#> # ... with 990 more rows
This may not be the most efficient algorithm, as depending on the spacing of your data, complete could potentially expand your data dramatically.
Further, with data this size, you may find data.table is more practical unless you want to put your data in a database and access it with dplyr.
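For instance, a data.table sketch of the same rolling count (visits_next_7 is a hypothetical column name; per customer, it counts the visits falling strictly between date and date + 7, matching the question's condition):
library(data.table)
dt <- as.data.table(df)
dt[, visits_next_7 := {
  d <- as.numeric(date)
  s <- sort(d)
  # visits on dates <= d + 6 minus visits on dates <= d, i.e. the next six days
  findInterval(d + 6, s) - findInterval(d, s)
}, by = cust_num]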
A solution using the zoo package. The idea is to group the data by cust_num and date and count the rows first, then use the lead function to shift the count by 1 and rollapply to sum the next six days (not including the starting date). Finally, use left_join to merge the results back into the original data frame. This should be much faster than your original approach. df3 is the final output.
library(dplyr)
library(zoo)
df2 <- df %>%
  count(cust_num, date) %>%
  ungroup() %>%
  mutate(n2 = lead(n)) %>%
  mutate(visits.next.7.days = rollapply(n2, width = 6, FUN = sum, na.rm = TRUE,
                                        align = "left", partial = TRUE)) %>%
  select(cust_num, date, visits.next.7.days)
df3 <- df %>% left_join(df2, by = c("cust_num", "date"))
head(df3)
# cust_num date visits.next.7.days
# 1 123 2017-01-09 70
# 2 128 2017-01-19 54
# 3 124 2017-01-05 58
# 4 128 2017-01-27 37
# 5 128 2017-01-27 37
# 6 123 2017-01-15 68

Using dplyr to summarise values and store as vector in data frame?

I have a simple data.frame that looks like this:
Group Person Score_1 Score_2 Score_3
1 1 90 80 79
1 2 74 83 28
1 3 74 94 89
2 1 33 9 8
2 2 94 32 78
2 3 50 90 87
I first need to find the mean of Score_1, collapsing across persons within a group (i.e., the Score_1 mean for Group 1, the Score_1 mean for Group 2, etc.), and then I need to collapse across both groups to find the grand mean of Score_1. How can I calculate these values and store them as individual objects? I have used the summarise function in dplyr with the following code:
summarise(group_by(data, Group), mean(Score_1, na.rm = TRUE))
I would like to ultimately create a 6th column that gives the mean, repeated across persons for each group, and then a 7th column that gives the grand mean across all groups.
I'm sure there are other ways to do this, and I am open to suggestions (although I would still like to know how to do it in dplyr). Thanks!
data.table is good for tasks like this:
library(data.table)
dt <- read.table(text = "Group Person Score_1 Score_2 Score_3
1 1 90 80 79
1 2 74 83 28
1 3 74 94 89
2 1 33 9 8
2 2 94 32 78
2 3 50 90 87", header = T)
dt <- data.table(dt)
# Mean by group
dt[, score.1.mean.by.group := mean(Score_1), by = .(Group)]
# Grand mean
dt[, score.1.mean := mean(Score_1)]
dt
To create a column, we use mutate rather than summarise. We get the grand mean ('MeanScore1'), then, grouped by 'Group', get the mean by group ('MeanScorebyGroup'), and finally order the columns with select:
library(dplyr)
df1 %>%
  mutate(MeanScore1 = mean(Score_1)) %>%
  group_by(Group) %>%
  mutate(MeanScorebyGroup = mean(Score_1)) %>%
  select(1:5, 7, 6)
But this can also be done in base R in a simple way:
df1$MeanScorebyGroup <- with(df1, ave(Score_1, Group))
df1$MeanScore1 <- mean(df1$Score_1)
#akrun you just blew my mind!
Just to clarify what you said, here's my interpretation:
library(plyr)
Group <- c(1, 1, 1, 2, 2, 2)
Person <- c(1, 2, 3, 1, 2, 3)
Score_1 <- c(90, 74, 74, 33, 94, 50)
Score_2 <- c(80, 83, 94, 9, 32, 90)
Score_3 <- c(79, 28, 89, 8, 78, 87)
df <- data.frame(Group, Person, Score_1, Score_2, Score_3)
df2 <- ddply(df, .(Group), mutate, meanScore = mean(Score_1, na.rm = TRUE))
mutate(df2, meanScoreAll = mean(meanScore))

Assign value to variable based on values on multiple other columns (alternative to ifelse)

I have a data frame describing a large number of people. I want to assign each person to a group, based on several variables. For example, let's say I have the variable "state" with 5 states, the variable "age group" with 4 groups and the variable "income" with 5 groups. I will have 5x4x5 = 100 groups, that I want to name with numbers going from 1 to 100. I have always done this in the past using a combination of ifelse statements, but now as I have 100 possible outcomes I am wondering if there is a faster way than specifying each combination by hand.
Here's a MWE with the expected outcome:
mydata <- as.data.frame(cbind(c("FR","UK","UK","IT","DE","ES","FR","DE","IT","UK"),
                              c("20","80","20","40","60","20","60","80","40","60"),
                              c(1,4,2,3,1,5,5,3,4,2)))
colnames(mydata) <- c("Country","Age","Income")
group_grid <- transform(expand.grid(state = c("IT","FR","UK","ES","DE"),
                                    age = c("20","40","60","80"),
                                    income = 1:5),
                        val = 1:100)
desired_result <- as.data.frame(cbind(c("FR","UK","UK","IT","DE","ES","FR","DE","IT","UK"),
                                      c("20","80","20","40","60","20","60","80","40","60"),
                                      c(1,4,2,3,1,5,5,3,4,2),
                                      c(2,78,23,46,15,84,92,60,66,33)))
colnames(desired_result) <- c("Country","Age","Income","Group_code")
The following should do it:
mydata$Group_code <- with(mydata, as.integer(interaction(Country, Age, Income)))
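One caveat: as.integer follows factor-level order, so the codes only match group_grid's val if the levels are pinned in the order expand.grid used (interaction varies its first factor fastest, just as expand.grid does). A sketch with the levels set explicitly:
mydata$Group_code <- with(mydata, as.integer(interaction(
  factor(Country, levels = c("IT", "FR", "UK", "ES", "DE")),
  factor(Age,     levels = c("20", "40", "60", "80")),
  factor(Income,  levels = 1:5))))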
Here is a left_join option using dplyr:
library(dplyr)
grpD <- group_grid %>%
  mutate_if(is.factor, as.character) %>%  # convert to character, since joining
  mutate(income = as.character(income))   # columns must have the same class
mydata %>%
  mutate_if(is.factor, as.character) %>%  # change class here too
  left_join(grpD, by = c("Country" = "state", "Age" = "age", "Income" = "income"))
# Country Age Income val
#1 FR 20 1 2
#2 UK 80 4 78
#3 UK 20 2 23
#4 IT 40 3 46
#5 DE 60 1 15
#6 ES 20 5 84
#7 FR 60 5 92
#8 DE 80 3 60
#9 IT 40 4 66
#10 UK 60 2 33
