Group by and summarise with a condition in R

I have a data frame df. After group_by(id, Year, Month, new_used_ind) and summarise(n = n()) it looks like:
id Year Month new_used_ind n
1 2001 apr N 3
1 2001 apr U 2
2 2002 mar N 5
3 2003 mar U 3
4 2004 july N 4
4 2004 july U 2
I want to get the total of n for each id, Year and Month, but I also want the total for the "N" rows of new_used_ind in a new column. Something like this:
id Year Month Total_New total
1 2001 apr 3 5
2 2002 mar 5 8
4 2004 july 4 6

library(dplyr)
read.table(text= "id Year Month new_used_ind n
1 2001 apr N 3
1 2001 apr U 2
2 2002 mar N 5
3 2003 mar U 3
4 2004 july N 4
4 2004 july U 2", header = T) -> df
df %>%
  group_by(id, Year, Month) %>%
  mutate(total_New = sum(n * (new_used_ind == "N"))) %>%
  mutate(total_n = sum(n)) %>%
  summarise_at(c("total_New", "total_n"), mean)
#> # A tibble: 4 x 5
#> # Groups: id, Year [4]
#> id Year Month total_New total_n
#> <int> <int> <fct> <dbl> <dbl>
#> 1 1 2001 apr 3 5
#> 2 2 2002 mar 5 5
#> 3 3 2003 mar 0 3
#> 4 4 2004 july 4 6
Created on 2019-06-11 by the reprex package (v0.3.0)
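A more direct route (a sketch, assuming dplyr >= 1.0 for the .groups argument) computes both totals in a single summarise(); the final filter() drops groups with no "N" rows, matching the desired output:

df %>%
  group_by(id, Year, Month) %>%
  summarise(
    Total_New = sum(n[new_used_ind == "N"]),  # sum n only for the "N" rows of the group
    total     = sum(n),                       # total over all rows of the group
    .groups   = "drop"
  ) %>%
  filter(Total_New > 0)                       # keep only groups with at least one "N" row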

Ranking of values in one quarter

I am trying to rank the Price values within separate partitions, one per quarter. Below you can see my data:
df <- data.frame(
  year = c(2010, 2010, 2010, 2010, 2010, 2010),
  quarter = c("q1", "q1", "q1", "q2", "q2", "q2"),
  Price = c(10, 20, 30, 10, 20, 30)
)
df
Now I want to rank within each quarter, and I expect 1 for the smallest Price and 3 for the highest Price:
df %>% group_by(quarter) %>% mutate(id = row_number(Price))
Instead of the expected results, I got something different: the ranking runs across both quarters instead of within each quarter separately. Can anybody help me solve this and get the per-quarter ranking described above?
You probably want rank.
transform(df, id=ave(Price, year, quarter, FUN=rank))
# year quarter Price id
# 1 2010 q1 10 1
# 2 2010 q1 20 2
# 3 2010 q1 30 3
# 4 2010 q2 10 1
# 5 2010 q2 20 2
# 6 2010 q2 30 3
With dplyr, use dense_rank
library(dplyr)
df %>%
  group_by(quarter) %>%
  mutate(id = dense_rank(Price)) %>%
  ungroup()
# A tibble: 6 × 4
year quarter Price id
<dbl> <chr> <dbl> <int>
1 2010 q1 10 1
2 2010 q1 20 2
3 2010 q1 30 3
4 2010 q2 10 1
5 2010 q2 20 2
6 2010 q2 30 3
In newer versions of dplyr (1.1.0+), you can also use .by in mutate:
df %>%
  mutate(id = dense_rank(Price), .by = 'quarter')
year quarter Price id
1 2010 q1 10 1
2 2010 q1 20 2
3 2010 q1 30 3
4 2010 q2 10 1
5 2010 q2 20 2
6 2010 q2 30 3
Alternatively with row_number()
library(tidyverse)
df %>% group_by(year, quarter) %>% mutate(id=row_number())
# A tibble: 6 × 4
# Groups: year, quarter [2]
year quarter Price id
<dbl> <chr> <dbl> <int>
1 2010 q1 10 1
2 2010 q1 20 2
3 2010 q1 30 3
4 2010 q2 10 1
5 2010 q2 20 2
6 2010 q2 30 3
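If Price values can repeat within a quarter, the choice of ranking function matters. A small comparison sketch (using a hypothetical vector x with a tie; assumes dplyr is loaded):

library(dplyr)
x <- c(10, 10, 20, 30)
row_number(x)  # 1 2 3 4         -- ties broken by position, every row gets a distinct rank
min_rank(x)    # 1 1 3 4         -- ties share the lowest rank, followed by a gap
dense_rank(x)  # 1 1 2 3         -- ties share a rank, no gaps
rank(x)        # 1.5 1.5 3.0 4.0 -- base R averages ties by default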

Is it possible to make groups based on an ID of a person in R?

I have this data:
data <- data.frame(
  id_pers = c(4102, 13102, 27101, 27102, 28101, 28102, 42101, 42102, 56102, 73102, 74103, 103104, 117103, 117104, 117105),
  birthyear = c(1992, 1994, 1993, 1992, 1995, 1999, 2000, 2001, 2000, 1994, 1999, 1978, 1986, 1998, 1999)
)
I want to group the persons into families in a new column, so that persons 27101 and 27102 (siblings) are family 1, 42101 and 42102 are family 2, 117103, 117104 and 117105 are family 3, and so on.
Person 4102 has no siblings and should get NA in the new column.
Two or more persons are siblings whenever their IDs are no more than 6 apart.
I have a far larger dataset with over 3000 rows. How could I do this most efficiently?
You can use round with digits = -1 (or -2 if a family can have more than 10 id_pers values). If you want the family id to be integers starting from 1, you can use cur_group_id():
library(dplyr)
data %>%
  group_by(fam_id = round(id_pers - 5, digits = -1)) %>%
  mutate(fam_gp = cur_group_id())
Output:
# A tibble: 15 × 4
# Groups: fam_id [10]
id_pers birthyear fam_id fam_gp
<dbl> <dbl> <dbl> <int>
1 4102 1992 4100 1
2 13102 1994 13100 2
3 27101 1993 27100 3
4 27102 1992 27100 3
5 28101 1995 28100 4
6 28102 1999 28100 4
7 42101 2000 42100 5
8 42102 2001 42100 5
9 56102 2000 56100 6
10 73102 1994 73100 7
11 74103 1999 74100 8
12 103104 1978 103100 9
13 117103 1986 117100 10
14 117104 1998 117100 10
15 117105 1999 117100 10
It looks like we can use the 1000s digit (and above) to delineate groups.
library(dplyr)
data %>%
  mutate(
    famgroup = trunc(id_pers / 1000),
    famgroup = match(famgroup, unique(famgroup))
  )
# id_pers birthyear famgroup
# 1 4102 1992 1
# 2 13102 1994 2
# 3 27101 1993 3
# 4 27102 1992 3
# 5 28101 1995 4
# 6 28102 1999 4
# 7 42101 2000 5
# 8 42102 2001 5
# 9 56102 2000 6
# 10 73102 1994 7
# 11 74103 1999 8
# 12 103104 1978 9
# 13 117103 1986 10
# 14 117104 1998 10
# 15 117105 1999 10
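Neither approach returns NA for persons without siblings, which the question asks for. A sketch of one way to work directly from the "IDs at most 6 apart" rule (assuming dplyr is loaded and that families correspond to runs of nearby IDs):

library(dplyr)
data %>%
  arrange(id_pers) %>%
  mutate(grp = cumsum(c(TRUE, diff(id_pers) > 6))) %>%        # start a new group whenever the ID gap exceeds 6
  add_count(grp) %>%                                          # n = number of persons sharing a group
  mutate(family = ifelse(n > 1,
                         match(grp, unique(grp[n > 1])),      # number only the real families: 1, 2, 3, ...
                         NA_integer_)) %>%                    # singletons such as 4102 get NA
  select(-grp, -n)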

Calculating cumulative sum for multiple columns in R

R newbie here: I'm trying to calculate cumulative sums grouped by year, month, group and subgroup, with multiple columns to accumulate.
Sample of the data:
df <- data.frame("Year"=2020,
"Month"=c("Jan","Jan","Jan","Jan","Feb","Feb","Feb","Feb"),
"Group"=c("A","A","A","B","A","B","B","B"),
"SubGroup"=c("a","a","b","b","a","b","a","b"),
"V1"=c(10,10,20,20,50,50,10,10),
"V2"=c(0,1,2,2,0,5,1,1))
Year Month Group SubGroup V1 V2
1 2020 Jan A a 10 0
2 2020 Jan A a 10 1
3 2020 Jan A b 20 2
4 2020 Jan B b 20 2
5 2020 Feb A a 50 0
6 2020 Feb B b 50 5
7 2020 Feb B a 10 1
8 2020 Feb B b 10 1
Resulting Table wanted:
Year Month Group SubGroup V1 V2
1 2020 Jan A a 20 1
2 2020 Feb A a 70 1
3 2020 Jan A b 20 2
4 2020 Feb A b 20 2
5 2020 Jan B a 0 0
6 2020 Feb B a 10 1
7 2020 Jan B b 20 2
8 2020 Feb B b 80 8
From the sample table, in Jan 2020 the sum for Group 'A' SubGroup 'a' was 10 + 10 = 20. In Feb 2020 the value was 50, therefore 20 from Jan + 50 = 70, and so on.
If there is no value, it should count as 0.
I've tried a few approaches but none got even close to the output I need. I would really appreciate some tips for this problem.
This is a simple group_by/mutate problem. The columns V1 and V2 are selected with across and cumsum is applied to them:
df$Month <- factor(df$Month, levels = c("Jan", "Feb"))

df %>%
  group_by(Year, Group, SubGroup) %>%
  mutate(across(V1:V2, ~ cumsum(.x))) %>%
  ungroup() %>%
  arrange(Year, Group, SubGroup, Month)
## A tibble: 8 x 6
# Year Month Group SubGroup V1 V2
# <chr> <fct> <chr> <chr> <dbl> <dbl>
#1 2020 Jan A a 10 0
#2 2020 Jan A a 20 1
#3 2020 Feb A a 70 1
#4 2020 Jan A b 20 2
#5 2020 Feb B a 10 1
#6 2020 Jan B b 20 2
#7 2020 Feb B b 70 7
#8 2020 Feb B b 80 8
If I understand what you are doing, you're taking the sum for each month, then taking the cumulative sum over the months. This is usually pretty easy in dplyr.
library(dplyr)
df %>%
  group_by(Year, Month, Group, SubGroup) %>%
  summarize(
    V1_sum = sum(V1),
    V2_sum = sum(V2)
  ) %>%
  group_by(Year, Group, SubGroup) %>%
  mutate(
    V1_cumsum = cumsum(V1_sum),
    V2_cumsum = cumsum(V2_sum)
  )
# A tibble: 6 x 8
# Groups: Year, Group, SubGroup [4]
# Year Month Group SubGroup V1_sum V2_sum V1_cumsum V2_cumsum
# <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 2020 Feb A a 50 0 50 0
# 2 2020 Feb B a 10 1 10 1
# 3 2020 Feb B b 60 6 60 6
# 4 2020 Jan A a 20 1 70 1
# 5 2020 Jan A b 20 2 20 2
# 6 2020 Jan B b 20 2 80 8
But you'll notice that the monthly cumulative sums are backwards (i.e. January comes after February), because by default the groups are ordered alphabetically. Also, you don't see the empty combinations because dplyr doesn't fill them in.
To fix the order of the months, you can either make your months numeric (convert them to dates) or turn them into factors. You can add back 'missing' combinations of the grouping variables by using aggregate in base R instead of dplyr::summarize: with drop = FALSE, aggregate includes all combinations of the grouping factors and sets the missing ones to NA, which you can then replace with 0 using tidyr::replace_na, for example.
library(dplyr)
library(tidyr)

df <- data.frame(
  Year = 2020,
  Month = c("Jan", "Jan", "Jan", "Jan", "Feb", "Feb", "Feb", "Feb"),
  Group = c("A", "A", "A", "B", "A", "B", "B", "B"),
  SubGroup = c("a", "a", "b", "b", "a", "b", "a", "b"),
  V1 = c(10, 10, 20, 20, 50, 50, 10, 10),
  V2 = c(0, 1, 2, 2, 0, 5, 1, 1)
)

df$Month <- factor(df$Month, levels = c("Jan", "Feb"), ordered = TRUE)

# Get monthly sums
df1 <- with(df, aggregate(
  list(V1_sum = V1, V2_sum = V2),
  list(Year = Year, Month = Month, Group = Group, SubGroup = SubGroup),
  FUN = sum, drop = FALSE
))

df1 <- df1 %>%
  # Replace NA with 0
  mutate(
    V1_sum = replace_na(V1_sum, 0),
    V2_sum = replace_na(V2_sum, 0)
  ) %>%
  # Get cumulative sum across months
  group_by(Year, Group, SubGroup) %>%
  mutate(
    V1cumsum = cumsum(V1_sum),
    V2cumsum = cumsum(V2_sum)
  ) %>%
  ungroup() %>%
  select(Year, Month, Group, SubGroup, V1 = V1cumsum, V2 = V2cumsum)
This gives the same result as your example:
# # A tibble: 8 x 6
# Year Month Group SubGroup V1 V2
# <dbl> <ord> <chr> <chr> <dbl> <dbl>
# 1 2020 Jan A a 20 1
# 2 2020 Feb A a 70 1
# 3 2020 Jan B a 0 0
# 4 2020 Feb B a 10 1
# 5 2020 Jan A b 20 2
# 6 2020 Feb A b 20 2
# 7 2020 Jan B b 20 2
# 8 2020 Feb B b 80 8
library(dplyr)
library(zoo)

df %>%
  arrange(as.yearmon(paste0(Year, '-', Month), '%Y-%b'), Group, SubGroup) %>%
  group_by(Year, Group, SubGroup) %>%
  mutate(
    V1 = cumsum(V1),
    V2 = cumsum(V2)
  ) %>%
  arrange(Year, Group, SubGroup, as.yearmon(paste0(Year, '-', Month), '%Y-%b'))  # for the desired output ordering
# A tibble: 8 x 6
# Groups: Year, Group, SubGroup [4]
# Year Month Group SubGroup V1 V2
# <chr> <chr> <chr> <chr> <dbl> <dbl>
# 1 2020 Jan A a 10 0
# 2 2020 Jan A a 20 1
# 3 2020 Feb A a 70 1
# 4 2020 Jan A b 20 2
# 5 2020 Feb B a 10 1
# 6 2020 Jan B b 20 2
# 7 2020 Feb B b 70 7
# 8 2020 Feb B b 80 8
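A tidyverse-only sketch of the same idea (assuming dplyr >= 1.0 for across() and tidyr for complete()): summarise per month, add the missing month/group combinations as 0 with complete(), then take cumulative sums within each Year/Group/SubGroup.

library(dplyr)
library(tidyr)

df %>%
  mutate(Month = factor(Month, levels = c("Jan", "Feb"))) %>%                # calendar order for the months present
  group_by(Year, Month, Group, SubGroup) %>%
  summarise(across(c(V1, V2), sum), .groups = "drop") %>%                    # monthly sums
  complete(Year, Month, Group, SubGroup, fill = list(V1 = 0, V2 = 0)) %>%    # add missing combinations as 0
  arrange(Year, Group, SubGroup, Month) %>%
  group_by(Year, Group, SubGroup) %>%
  mutate(across(c(V1, V2), cumsum)) %>%                                      # running totals across months
  ungroup()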

Extracting strings from links using regex in R

I have a list of URL links and I want to extract part of each string and save it in another variable. The sample data is below:
sample<- c("http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr01f2009.pdf",
"http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr02f2001.pdf",
"http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr03f2002.pdf",
"http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr04f2004.pdf",
"http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr05f2005.pdf",
"http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr06f2018.pdf",
"http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr07f2016.pdf",
"http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr08f2015.pdf",
"http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr09f2020.pdf",
"http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr10f2014.pdf")
sample
[1] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr01f2009.pdf"
[2] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr02f2001.pdf"
[3] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr03f2002.pdf"
[4] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr04f2004.pdf"
[5] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr05f2005.pdf"
[6] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr06f2018.pdf"
[7] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr07f2016.pdf"
[8] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr08f2015.pdf"
[9] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr09f2020.pdf"
[10] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr10f2014.pdf"
I want to extract week and year using regex.
week year
1 1 2009
2 2 2001
3 3 2002
4 4 2004
5 5 2005
6 6 2018
7 7 2016
8 8 2015
9 9 2020
10 10 2014
You could use str_match to capture the numbers after 'owgr' and 'f':
library(stringr)
str_match(sample, 'owgr(\\d+)f(\\d+)')[, -1]
You can convert this to a data frame, convert the columns to numeric with type.convert, and assign column names (the first capture is the week, the second the year):
setNames(type.convert(data.frame(
  str_match(sample, 'owgr(\\d+)f(\\d+)')[, -1])), c('week', 'year'))
# week year
#1 1 2009
#2 2 2001
#3 3 2002
#4 4 2004
#5 5 2005
#6 6 2018
#7 7 2016
#8 8 2015
#9 9 2020
#10 10 2014
Another way could be to extract all the numbers from the last part of each link; we can get that last part with basename:
str_extract_all(basename(sample), '\\d+', simplify = TRUE)
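That returns a character matrix with one row per link; a possible follow-up (the data frame shape is an assumption, not part of the original answer) is to name and convert the two columns:

m <- str_extract_all(basename(sample), '\\d+', simplify = TRUE)  # column 1 = week, column 2 = year
data.frame(week = as.integer(m[, 1]), year = as.integer(m[, 2]))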
Another way you can try
library(dplyr)
library(stringr)
df <- data.frame(sample)
df2 <- df %>%
  transmute(
    week = str_extract(sample, "(?<=wgr)\\d{1,2}(?=f)"),
    year = str_extract(sample, "(?<=f)\\d{4}(?=\\.pdf)")
  )
# week year
# 1 1 2009
# 2 2 2001
# 3 3 2002
# 4 4 2004
# 5 5 2005
# 6 6 2018
# 7 7 2016
# 8 8 2015
# 9 9 2020
# 10 10 2014
You could use {unglue}:
library(unglue)
unglue_data(
sample,
"http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr{week}f{year}.pdf")
#> week year
#> 1 01 2009
#> 2 02 2001
#> 3 03 2002
#> 4 04 2004
#> 5 05 2005
#> 6 06 2018
#> 7 07 2016
#> 8 08 2015
#> 9 09 2020
#> 10 10 2014
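For completeness, a base-R sketch with sub() and capture groups (no extra packages; it assumes every file name follows the owgrWWfYYYY.pdf pattern):

files <- basename(sample)                                    # e.g. "owgr01f2009.pdf"
data.frame(
  week = as.integer(sub("owgr(\\d+)f(\\d+)\\.pdf", "\\1", files)),
  year = as.integer(sub("owgr(\\d+)f(\\d+)\\.pdf", "\\2", files))
)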

How do I identify rows where an element appears for the first time?

I have the following data frame of student records. I want to identify students who joined a certain program for the first time in 2014, when they were in 9th grade.
names.first <- c('a', 'a', 'b', 'b', 'c', 'd')
names.last <- c('c', 'c', 'z', 'z', 'f', 'h')
year <- c(2014, 2013, 2014, 2015, 2015, 2014)
grade <- c(9, 8, 9, 10, 10, 10)
df <- data.frame(names.first, names.last, year, grade)
df
To do this, I have used the following statement, which flags students where the program year == 2014 and their grade == 9:
df$first.cohort <- ifelse(df$year == 2014 & df$grade == 9, 1, 0)
df
names.first names.last year grade first.cohort
1 a c 2014 9 1
2 a c 2013 8 0
3 b z 2014 9 1
4 b z 2015 10 0
5 c f 2015 10 0
6 d h 2014 10 0
However, as you can see, this would include students who didn't enter the program in 2014, such as student a, who started in 2013. How do I create an ifelse statement that only captures students who are in 9th grade and started the program in 2014 for the first time, so that the df looks like:
names.first names.last year grade first.cohort
1 a c 2014 9 0
2 a c 2013 8 0
3 b z 2014 9 1
4 b z 2015 10 0
5 c f 2015 10 0
6 d h 2014 10 0
We can use first after arranging by 'names' and 'year' to create the logical expression:
library(dplyr)
df %>%
  arrange(names, year) %>%
  group_by(names) %>%
  mutate(first.cohort = as.integer(grade == 9 & first(year) == 2014))
# A tibble: 6 x 4
# Groups: names [4]
# names year grade first.cohort
# <fct> <dbl> <dbl> <int>
#1 a 2013 8 0
#2 a 2014 9 0
#3 b 2014 9 1
#4 b 2015 10 0
#5 c 2015 10 0
#6 d 2014 10 0
To keep the same order as in the input dataset, we can create a sequence column first and then arrange on that column after the mutate:
df %>%
  mutate(rn = row_number()) %>%
  arrange(names, year) %>%
  group_by(names) %>%
  mutate(first.cohort = as.integer(grade == 9 & first(year) == 2014)) %>%
  ungroup() %>%
  arrange(rn) %>%
  select(-rn)
Or using the same logic with data.table, which has the additional advantage of keeping the same order as in the input dataset:
library(data.table)
setDT(df)[order(names, year),
          first.cohort := as.integer(grade == 9 & first(year) == 2014),
          names]
Update
With the new example in the OP's post, we group by both of the 'names' columns:
df %>%
  arrange(names.first, names.last, year) %>%
  group_by(names.first, names.last) %>%
  mutate(first.cohort = as.integer(grade == 9 & first(year) == 2014))
# A tibble: 6 x 5
# Groups: names.first, names.last [4]
# names.first names.last year grade first.cohort
# <fct> <fct> <dbl> <dbl> <int>
#1 a c 2013 8 0
#2 a c 2014 9 0
#3 b z 2014 9 1
#4 b z 2015 10 0
#5 c f 2015 10 0
#6 d h 2014 10 0
Using dplyr
library(dplyr)
df %>%
  group_by(names) %>%
  dplyr::mutate(Fc = as.numeric((year == 2014 & grade == 9) & (min(year) == 2014)))
# A tibble: 6 x 4
# Groups: names [4]
names year grade Fc
<fctr> <dbl> <dbl> <dbl>
1 a 2014 9 0
2 a 2013 8 0
3 b 2014 9 1
4 b 2015 10 0
5 c 2015 10 0
6 d 2014 10 0
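Note that df in the question has names.first and names.last rather than a single names column; a sketch of the same idea adapted to those columns (assuming a first/last name pair identifies one student):

library(dplyr)
df %>%
  group_by(names.first, names.last) %>%
  mutate(first.cohort = as.numeric(year == 2014 & grade == 9 & min(year) == 2014)) %>%
  ungroup()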
