R - Aggregate and count instances of grouping [duplicate] - r

Dataset is a breakdown of responders and the number of contacts they have had within a given time period along with details on their age bracket, something similar to:
participant participant_age contact contact_age
1 18-30 1 18-30
1 18-30 2 30-40
2 30-40 1 18-30
3 18-30 1 18-30
3 18-30 2 50-60
My aim is to calculate the mean number of contacts each age group of participant has had with each age bracket of contact. Something similar to:
age_bracket 18-30 30-40 40-50
18-30 1 3 2
30-40 1.5 4 2
40-50 3 4 1
I have been attempting to use the group_by function from dplyr and the spread function from tidyr. The closest I have come is using
data%>%
group_by(participant_age, contact_age) %>%
tally() %>%
spread(key = participant_age, value = n)
But this produces the total number (n) of each contact, rather than the mean number of contacts per age bracket.

In base R use tapply.
t(with(dat, tapply(contact, list(contact_age, participant_age), mean)))
# 18-30 30-40 50-60
# 18-30 1 2 2
# 30-40 1 NA NA
Data:
dat <- structure(list(participant = c(1L, 1L, 2L, 3L, 3L), participant_age = c("18-30",
"18-30", "30-40", "18-30", "18-30"), contact = c(1L, 2L, 1L,
1L, 2L), contact_age = c("18-30", "30-40", "18-30", "18-30",
"50-60")), class = "data.frame", row.names = c(NA, -5L))

If I understood your aim correctly, you were pretty close to the right solution:
data %>%
group_by(participant_age, contact_age) %>%
summarise(mean = mean(contact), .groups = "drop") %>%
spread(key = participant_age, value = mean)

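You can use pivot_wider and pass the function to apply in values_fn. Set id_cols to participant_age so that rows are aggregated per age bracket rather than per participant:
tidyr::pivot_wider(dat, id_cols = participant_age, names_from = contact_age, values_from = contact, values_fn = mean)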
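The answers above average the values in the contact column. If the goal is instead the mean number of contact rows per participant in each age bracket, one possible reading of the question, a minimal base R sketch could look like the following. It assumes each row is one contact and that a participant with no contact in a bracket counts as 0; the names counts, ages and mean_contacts are illustrative only.

```r
# Sample data from the question
dat <- data.frame(
  participant     = c(1L, 1L, 2L, 3L, 3L),
  participant_age = c("18-30", "18-30", "30-40", "18-30", "18-30"),
  contact         = c(1L, 2L, 1L, 1L, 2L),
  contact_age     = c("18-30", "30-40", "18-30", "18-30", "50-60")
)

# Number of contacts per participant and contact age bracket
# (combinations that never occur get a count of 0)
counts <- as.data.frame(
  table(participant = dat$participant, contact_age = dat$contact_age),
  stringsAsFactors = FALSE
)
counts$participant <- as.integer(counts$participant)

# Attach each participant's own age bracket
ages <- unique(dat[c("participant", "participant_age")])
counts <- merge(counts, ages, by = "participant")

# Mean count per participant age bracket (rows) and contact age bracket (columns)
mean_contacts <- with(counts, tapply(Freq, list(participant_age, contact_age), mean))
mean_contacts
```

Whether absent combinations should count as 0 or be left out depends on how the survey was recorded, so treat that choice as an assumption.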

Related

How to count the number of customer per month in R?

So I have a table of customers with the respective date as below:
ID Date
1 2019-04-17
4 2019-05-12
1 2019-04-25
2 2019-05-19
I just want to count how many customers there are for each month-year, like below:
Month-Year Count of Customer
Apr-19 2
May-19 2
EDIT:
Sorry, I think my question should be clearer.
The same customer can appear more than once in a month and would then be counted as 2 customers for that month. I would basically like to find the number of transactions per month based on customer ID.
My assumed approach would be to first change the date into a month-year format, and then count the customers grouped by month, but I am not sure how to do this in R. Thank you!
You can use count -
library(dplyr)
df %>% count(Month_Year = format(as.Date(Date), '%b-%y'))
# Month_Year n
#1 Apr-19 2
#2 May-19 2
Or table in base R -
table(format(as.Date(df$Date), '%b-%y'))
#Apr-19 May-19
# 2 2
data
df <- structure(list(ID = c(1L, 4L, 1L, 2L), Date = c("2019-04-17",
"2019-05-12", "2019-04-25", "2019-05-19")),
class = "data.frame", row.names = c(NA, -4L))
We can use zoo::as.yearmon
library(dplyr)
df %>%
count(Date = zoo::as.yearmon(Date))
# Date n
#1 Apr 2019 2
#2 May 2019 2
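Two details may matter with real data: labels such as 'Apr-19' sort alphabetically rather than chronologically, and the edit in the question distinguishes transactions from distinct customers. A minimal base R sketch, assuming a "%Y-%m" key is acceptable as the month label (month_key, transactions and customers are illustrative names):

```r
df <- data.frame(
  ID   = c(1L, 4L, 1L, 2L),
  Date = c("2019-04-17", "2019-05-12", "2019-04-25", "2019-05-19")
)

# "YYYY-MM" keys sort chronologically and do not depend on the locale
month_key <- format(as.Date(df$Date), "%Y-%m")

# Transactions per month: every row counts
transactions <- as.data.frame(table(Month = month_key),
                              responseName = "n",
                              stringsAsFactors = FALSE)

# Distinct customers per month: repeated IDs count once
customers <- tapply(df$ID, month_key, function(x) length(unique(x)))

transactions
customers
```

With the sample data, April has two transactions from a single customer, so the two summaries differ for that month.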

Dplyr merge rows based on one column value and sum other columns

My current df looks like the following:
WEEK COUNT COUNT2 PERCENTAGE
2017-53 10 15 .05
2018-00 5 10 .1
2018-01 7 9 .1
....
2018-52 10 12 .06
2019-00 6 10 .05
....
What I would like to do is combine the last two weeks of each year into the final week of the year, summing COUNT, COUNT2, and PERCENTAGE. The weeks I currently have that I would like to combine are: 2017-53 and 2018-00, 2018-52 and 2019-00, 2019-52 and 2020-00, which I would like to merge into 2017-53, 2018-52, and 2019-52. My expected output would be the following:
WEEK COUNT COUNT2 PERCENTAGE
2017-53 15 25 .15
2018-01 7 9 .1
....
2018-52 16 22 .11
....
With tidyverse, after converting the 'WEEK' to Date class, arrange by that column, extract the 'year', create a grouping with 'WEEK' based on the difference of adjacent elements of 'year', and then summarise to get the sum of the columns that matches 'COUNT' or 'PERCENTAGE'
library(stringr)
library(lubridate)
library(dplyr) #1.0.0
df1 %>%
mutate(Date = as.Date(str_c(WEEK, "-01"), format = '%Y-%U-%w')) %>%
arrange(Date) %>%
mutate(year = year(Date)) %>%
group_by(WEEK = case_when(lag(year, default = first(year)) - year < 0 ~
lag(WEEK), TRUE ~ WEEK)) %>%
summarise(across(matches("COUNT|PERCENTAGE"), sum))
# A tibble: 3 x 4
# WEEK COUNT COUNT2 PERCENTAGE
# <chr> <int> <int> <dbl>
#1 2017-53 15 25 0.15
#2 2018-01 7 9 0.1
#3 2018-52 16 22 0.11
data
df1 <- structure(list(WEEK = c("2017-53", "2018-00", "2018-01", "2018-52",
"2019-00"), COUNT = c(10L, 5L, 7L, 10L, 6L), COUNT2 = c(15L,
10L, 9L, 12L, 10L), PERCENTAGE = c(0.05, 0.1, 0.1, 0.06, 0.05
)), class = "data.frame", row.names = c(NA, -5L))
You could use colSums() as is shown here, but it's a bit convoluted. I'd recommend using aggregate and pipes, as is shown further down in the same link.
Hope this helps!
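The aggregate approach mentioned above could be sketched roughly as follows. This is an illustration, not the code from the linked answer: it assumes every "-00" week should be folded into the row just before it once the rows are sorted, and WEEK_GROUP is a helper column introduced here.

```r
df1 <- data.frame(
  WEEK       = c("2017-53", "2018-00", "2018-01", "2018-52", "2019-00"),
  COUNT      = c(10L, 5L, 7L, 10L, 6L),
  COUNT2     = c(15L, 10L, 9L, 12L, 10L),
  PERCENTAGE = c(0.05, 0.1, 0.1, 0.06, 0.05)
)

# "YYYY-WW" strings sort chronologically as plain text
df1 <- df1[order(df1$WEEK), ]

# Relabel each week 00 with the label of the preceding row
# (a week 00 in the very first row has no predecessor and is left alone)
week0 <- which(grepl("-00$", df1$WEEK))
week0 <- week0[week0 > 1]
df1$WEEK_GROUP <- df1$WEEK
df1$WEEK_GROUP[week0] <- df1$WEEK[week0 - 1]

# Sum the numeric columns within each relabelled week
merged <- aggregate(cbind(COUNT, COUNT2, PERCENTAGE) ~ WEEK_GROUP,
                    data = df1, FUN = sum)
merged
```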

Percent change for grouped subjects at multiple timepoints R

id timepoint dv.a
1 baseline 100
1 1min 105
1 2min 90
2 baseline 70
2 1min 100
2 2min 80
3 baseline 80
3 1min 80
3 2min 90
I have repeated measures data for a given subject in long format as above. I'm looking to calculate percent change relative to baseline for each subject.
id timepoint dv pct.chg
1 baseline 100 100
1 1min 105 105
1 2min 90 90
2 baseline 70 100
2 1min 100 143
2 2min 80 114
3 baseline 80 100
3 1min 80 100
3 2min 90 113
df <- expand.grid( time=c("baseline","1","2"), id=1:4)
df$dv <- sample(100,12)
df %>% group_by(id) %>%
mutate(perc=dv*100/dv[time=="baseline"]) %>%
ungroup()
You want to do something within each 'id' group, which is what group_by is for, and you need to create a new column, hence the mutate. The new variable is the old dv scaled by the value that dv takes at baseline, which is the expression inside the mutate. Finally, ungroup removes the grouping you applied.
Try creating a helper column, group and arrange on that. Then use the window function first in your mutate function:
df %>% mutate(clean_timepoint = str_remove(timepoint,"min") %>% if_else(. == "baseline", "0", .) %>% as.numeric()) %>%
group_by(id) %>%
arrange(id,clean_timepoint) %>%
mutate(pct.chg = (dv / first(dv)) * 100) %>%
select(-clean_timepoint)
In base R you can do this:
for(i in 1:(NROW(df)/3)){
df[1+3*(i-1),4] <- 100
df[2+3*(i-1),4] <- df[2+3*(i-1),3]/df[1+3*(i-1),3]*100
df[3+3*(i-1),4] <- df[3+3*(i-1),3]/df[1+3*(i-1),3]*100
}
colnames(df)[4] <- "pct.chg"
output:
> df
id timepoint dv.a pct.chg
1 1 baseline 100 100.0000
2 1 1min 105 105.0000
3 1 2min 90 90.0000
4 2 baseline 70 100.0000
5 2 1min 100 142.8571
6 2 2min 80 114.2857
7 3 baseline 80 100.0000
8 3 1min 80 100.0000
9 3 2min 90 112.5000
Base R solution (assuming "baseline" always appears as the first record per group):
data.frame(do.call("rbind", lapply(split(df, df$id),
function(x){x$pct.change <- x$dv / x$dv[1] * 100; return(x)})), row.names = NULL)
Data:
df <- structure(
list(
id = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
timepoint = c(
"baseline",
"1min",
"2min",
"baseline",
"1min",
"2min",
"baseline",
"1min",
"2min"
),
dv = c(100L, 105L, 90L, 70L, 100L, 80L, 80L, 80L, 90L)
),
class = "data.frame",
row.names = c(NA,-9L)
)
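Several of the answers rely on "baseline" being the first row of each subject. A minimal base R sketch that looks the baseline up by label instead, assuming exactly one baseline row per id (baseline is an illustrative name):

```r
df <- data.frame(
  id        = rep(1:3, each = 3),
  timepoint = rep(c("baseline", "1min", "2min"), times = 3),
  dv        = c(100, 105, 90, 70, 100, 80, 80, 80, 90)
)

# One baseline value per subject, named by id so it can be looked up
is_base  <- df$timepoint == "baseline"
baseline <- setNames(df$dv[is_base], df$id[is_base])

# Each row is divided by the baseline of its own subject
df$pct.chg <- unname(df$dv / baseline[as.character(df$id)] * 100)
df
```

Because the lookup goes through the id, the rows can be in any order.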

cannot visualise using ggplot with NA values

I am trying to create a stacked bar chart showing % frequency of occurrences by group
library(dplyr)
library(ggplot2)
brfss_2013 %>%
group_by(incomeLev, mentalHealth) %>%
summarise(count_mentalHealth=n()) %>%
group_by(incomeLev) %>%
mutate(count_inc=sum(count_mentalHealth)) %>%
mutate(percent=count_mentalHealth / count_inc * 100) %>%
ungroup() %>%
ggplot(aes(x=forcats::fct_explicit_na(incomeLev),
y=count_mentalHealth,
group=mentalHealth)) +
geom_bar(aes(fill=mentalHealth),
stat="identity") +
geom_text(aes(label=sprintf("%0.1f%%", percent)),
position=position_stack(vjust=0.5))
However, this is the traceback I receive:
1. dplyr::group_by(., incomeLev, mentalHealth)
8. plyr::summarise(., count_mentalHealth = n())
9. [ base::eval(...) ] with 1 more call
11. dplyr::n()
12. dplyr:::from_context("..group_size")
13. `%||%`(...)
In addition: Warning message:
Factor `incomeLev` contains implicit NA, consider using `forcats::fct_explicit_na`
>
Here is a sample of my data
brfss_2013 <- structure(list(incomeLev = structure(c(2L, 3L, 3L, 2L, 2L, 3L,
NA, 2L, 3L, 1L, 3L, NA), .Label = c("$25,000-$35,000", "$50,000-$75,000",
"Over $75,000"), class = "factor"), mentalHealth = structure(c(3L,
1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L), .Label = c("Excellent",
"Ok", "Very Bad"), class = "factor")), row.names = c(NA, -12L
), class = "data.frame")
Update:
Output of str(brfss_2013):
'data.frame': 491775 obs. of 9 variables:
$ mentalHealth: Factor w/ 5 levels "Excellent","Good",..: 5 1 1 1 1 1 3 1 1 1 ...
$ pa1min_ : int 947 110 316 35 429 120 280 30 240 260 ...
$ bmiLev : Factor w/ 6 levels "Underweight",..: 5 1 3 2 5 5 2 3 4 3 ...
$ X_drnkmo4 : int 2 0 80 16 20 0 1 2 4 0 ...
$ X_frutsum : num 413 20 46 49 7 157 150 67 100 58 ...
$ X_vegesum : num 53 148 191 136 243 143 216 360 172 114 ...
$ sex : Factor w/ 2 levels "Male","Female": 2 2 2 2 1 2 2 2 1 2 ...
$ X_state : Factor w/ 55 levels "0","Alabama",..: 2 2 2 2 2 2 2 2 2 2 ...
$ incomeLev : Factor w/ 4 levels "$25,000-$35,000",..: 2 4 4 2 2 4 NA 2 4 1 ...
First of all, your code works incredibly well when you transform everything into character. So you could just do
brfss_2013[c("incomeLev", "mentalHealth")] <-
lapply(brfss_2013[c("incomeLev", "mentalHealth")], as.character)
and then just run your code as you figured it out.
But, let's do it with factors (don't run the lapply(.) line in this case!).
You want a "missing" category, which you can obtain by adding a new level "missing" for the NAs.
levels(brfss_2013$incomeLev) <- c(levels(brfss_2013$incomeLev), "missing")
brfss_2013$incomeLev[is.na(brfss_2013$incomeLev)] <- "missing"
Then, your aggregation (in a base R way).
b1 <- with(brfss_2013, aggregate(list(count_mentalHealth=incomeLev),
by=list(mentalHealth=mentalHealth, incomeLev=incomeLev),
length))
b2 <- aggregate(mentalHealth ~ ., brfss_2013, length)
names(b2)[2] <- "count_inc"
brfss_2013.agg <- merge(b1, b2)
rm(b1, b2) # just to clean up
Add the "percent" column.
brfss_2013.agg$percent <- with(brfss_2013.agg, count_mentalHealth / count_inc * 100)
Plot.
library(ggplot2)
ggplot(brfss_2013.agg, aes(x=incomeLev, y=count_mentalHealth, group=mentalHealth)) +
geom_bar(aes(fill=mentalHealth), stat="identity") +
geom_text(aes(label=sprintf("%0.1f%%", percent)),
position=position_stack(vjust=0.5))
So your code actually works fine for me. It looks like it might be an issue with package versions because it seems odd that you're using the plyr summarise function.
However, here's a slightly more concise way to create that graph (and hopefully this is helpful for whatever you want to add to this plot)
brfss_2013 %>%
# Add count of income levels first (note this only adds a variable)
add_count(incomeLev) %>%
rename(count_inc = n) %>%
# Count observations per group (this transforms data)
count(incomeLev, mentalHealth, count_inc) %>%
rename(count_mentalHealth = n) %>%
mutate(percent= count_mentalHealth / count_inc) %>%
ggplot(aes(x= incomeLev,
y= count_mentalHealth,
# Technically you don't need this group here but groups can be handy
group= mentalHealth)) +
geom_bar(aes(fill=mentalHealth),
stat="identity")+
# Using the scales package does the percent formatting for you
geom_text(aes(label = scales::percent(percent)), vjust = 1)+
theme_minimal()
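To check the numbers behind such a plot without ggplot2, the NA handling and the percentages can be reproduced in base R. This is only a sketch on the sample data from the question; "missing" is an arbitrary label for the NA level.

```r
lev_inc <- c("$25,000-$35,000", "$50,000-$75,000", "Over $75,000")
lev_mh  <- c("Excellent", "Ok", "Very Bad")
brfss <- data.frame(
  incomeLev    = factor(lev_inc[c(2, 3, 3, 2, 2, 3, NA, 2, 3, 1, 3, NA)],
                        levels = lev_inc),
  mentalHealth = factor(lev_mh[c(3, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1)],
                        levels = lev_mh)
)

# Turn NA into an explicit factor level so it forms its own group
inc <- addNA(brfss$incomeLev)
levels(inc)[is.na(levels(inc))] <- "missing"

# Counts per income level and mental health category, then row percentages
counts  <- table(incomeLev = inc, mentalHealth = brfss$mentalHealth)
percent <- prop.table(counts, margin = 1) * 100
round(percent, 1)
```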

Merging two Dataframes in R by ID, One is the subset of the other

I have 2 dataframes in R: 'dfold' with 175 variables and 'dfnew' with 75 variables. The 2 dataframes are matched by a primary key ('pid'). dfnew is a subset of dfold, so all variables in dfnew are also in dfold, but with updated, imputed values (no NAs anymore). At the same time dfold has more variables, and I will need them in the analysis phase. I would like to merge the 2 dataframes into dfmerge so as to update the common variables from dfnew --> dfold while retaining the pre-existing variables in dfold. I have tried merge(), match(), dplyr, and sqldf, but I obtain either a dfmerge with the updated 75 variables only (left join) or a dfmerge with 250 variables (old variables with NAs and new variables without them coexist). The only way I have found is an elegant but fairly long (10 rows) loop that eliminates the *.x variables after a merge by pid with the all.x = TRUE option. Could you please advise on a more efficient way to obtain this result, if one is available?
Thank you in advance
P.S.: To make things easier, I have created a minimal version of dfold and dfnew: dfnew now has 3 variables and no NAs, while dfold has 5 variables, NAs included. Here is the structure of the dataframes:
dfold:
structure(list(Country = structure(c(1L, 3L, 2L, 3L, 2L), .Label = c("France",
"Germany", "Spain"), class = "factor"), Age = c(44L, 27L, 30L,
38L, 40L), Salary = c(72000L, 48000L, 54000L, 61000L, NA), Purchased = structure(c(1L,
2L, 1L, 1L, 2L), .Label = c("No", "Yes"), class = "factor"),
pid = 1:5), .Names = c("Country", "Age", "Salary", "Purchased",
"pid"), row.names = c(NA, 5L), class = "data.frame")
dfnew:
structure(list(Age = c(44, 27, 30), Salary = c(72000, 48000,
54000), pid = c(1, 2, 3)), .Names = c("Age", "Salary", "pid"), row.names = c(NA,
3L), class = "data.frame")
Although here the issue is limited to just 2 variables, please bear in mind that the real scenario will involve 75 variables.
Alright, this solution assumes that you don't really need a merge but only want to update NA values within your dfold with imputed values in dfnew.
> dfold
Country Age Salary Purchased pid
1 France NA 72000 No 1
2 Spain 27 48000 Yes 2
3 Germany 30 54000 No 3
4 Spain 38 61000 No 4
5 Germany 40 NA Yes 5
> dfnew
Age Salary pid
1 44 72000 1
2 27 48000 2
3 30 54000 3
4 38 61000 4
5 40 70000 5
To do this for a single column, try
dfold$Salary <- ifelse(is.na(dfold$Salary), dfnew$Salary[match(dfold$pid, dfnew$pid)], dfold$Salary)
> dfold
Country Age Salary Purchased pid
1 France NA 72000 No 1
2 Spain 27 48000 Yes 2
3 Germany 30 54000 No 3
4 Spain 38 61000 No 4
5 Germany 40 70000 Yes 5
Using it on the whole dataset was a bit trickier:
First define all common colnames except pid:
cols <- names(dfnew)[names(dfnew) != "pid"]
> cols
[1] "Age" "Salary"
Now use mapply to replace the NA values with ifelse:
dfold[,cols] <- mapply(function(x, y) ifelse(is.na(x), y[match(dfold$pid, dfnew$pid)], x), dfold[,cols], dfnew[,cols])
> dfold
Country Age Salary Purchased pid
1 France 44 72000 No 1
2 Spain 27 48000 Yes 2
3 Germany 30 54000 No 3
4 Spain 38 61000 No 4
5 Germany 40 70000 Yes 5
This assumes that dfnew only includes columns that are present in dfold. If this is not the case, use
cols <- setdiff(intersect(names(dfnew), names(dfold)), "pid")
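If the aim is to overwrite every shared column for the matching pids, as the question describes, rather than to fill only the NA cells, a minimal base R sketch using match could look like this. It uses the question's own sample data, where dfnew has only three rows, and assumes pid is unique in both data frames; rows and ok are illustrative names.

```r
dfold <- data.frame(
  Country   = factor(c("France", "Spain", "Germany", "Spain", "Germany")),
  Age       = c(44L, 27L, 30L, 38L, 40L),
  Salary    = c(72000L, 48000L, 54000L, 61000L, NA),
  Purchased = factor(c("No", "Yes", "No", "No", "Yes")),
  pid       = 1:5
)
dfnew <- data.frame(
  Age    = c(44, 27, 30),
  Salary = c(72000, 48000, 54000),
  pid    = c(1, 2, 3)
)

# Shared columns, excluding the key
cols <- setdiff(intersect(names(dfnew), names(dfold)), "pid")

# Row of dfold that corresponds to each row of dfnew (NA if the pid is absent)
rows <- match(dfnew$pid, dfold$pid)
ok   <- !is.na(rows)

# Overwrite the shared columns for the matched rows; everything else is untouched
dfold[rows[ok], cols] <- dfnew[ok, cols]
dfold
```

The result keeps all five columns of dfold. Salary for pid 5 stays NA here because the sample dfnew has no row for that pid.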

Resources