R: Count unique rows for each unique name per year

This is my first question on stackoverflow. I searched for similar questions but I didn't find an answer.
I know that the question in the title isn't clear but I hope you are going to understand what I want as output.
I have a dataframe that looks like this:
ID Name Year
1 1 Anas 2018
2 1 Carl 2018
3 1 Catherine 2018
4 2 Anas 2018
5 2 Carl 2018
6 3 Catherine 2018
7 3 Julien 2018
8 4 Raul 2018
9 4 Ahmed 2018
10 4 Laurence 2018
11 4 Carl 2018
12 5 Anas 2019
13 5 Georges 2019
14 5 Arman 2019
15 6 Anas 2019
16 6 Pietro 2019
17 7 Pietro 2019
18 8 Diego 2019
If names in the "Name" column share the same ID, those people are collaborators on a project.
I want to add a column with the number of UNIQUE collaborators per year for each name (counting each name among its own collaborators).
The output should look like this (the last column only explains how the count is obtained; it is not needed):
ID Name Year Unique_Coll explanation
1 1 Anas 2018 3 (Anas, Carl, Catherine)
2 1 Carl 2018 6 (Carl, Anas, Catherine, Laurence, Ahmed, Raul)
3 1 Catherine 2018 4 (Catherine, Carl, Anas, Julien)
4 2 Anas 2018 3 (Anas, Carl, Catherine)
5 2 Carl 2018 6 (Carl, Anas, Catherine, Laurence, Ahmed, Raul)
6 3 Catherine 2018 4 (Catherine, Carl, Anas, Julien)
7 3 Julien 2018 2 (Julien, Catherine)
8 4 Raul 2018 4 (Raul, Ahmed, Laurence, Carl)
9 4 Ahmed 2018 4 (Ahmed, Raul, Laurence, Carl)
10 4 Laurence 2018 4 (Laurence, Raul, Ahmed, Carl)
11 4 Carl 2018 6 (Carl, Anas, Catherine, Laurence, Ahmed, Raul)
12 5 Anas 2019 4 (Anas, Georges, Arman, Pietro)
13 5 Georges 2019 3 (Georges, Anas, Arman)
14 5 Arman 2019 3 (Arman, Anas, Georges)
15 6 Anas 2019 4 (Anas, Georges, Arman, Pietro)
16 6 Pietro 2019 2 (Pietro, Anas)
17 7 Pietro 2019 2 (Pietro, Anas)
18 8 Diego 2019 1 (Diego)
Thank you

You could construct a variable that would be a list of names and count the number of unique names in the following way:
library(dplyr)
df = df %>%
group_by(ID) %>%
mutate(group = list(Name)) %>%
group_by(Year,Name) %>%
mutate(n = n_distinct(unlist(list(group)))) %>%
select(-group)
# A tibble: 18 x 4
# Groups: Year, Name [12]
ID Name Year n
<int> <chr> <int> <int>
1 1 Anas 2018 3
2 1 Carl 2018 6
3 1 Catherine 2018 4
4 2 Anas 2018 3
5 2 Carl 2018 6
6 3 Catherine 2018 4
7 3 Julien 2018 2
8 4 Raul 2018 4
9 4 Ahmed 2018 4
10 4 Laurence 2018 4
11 4 Carl 2018 6
12 5 Anas 2019 4
13 5 Georges 2019 3
14 5 Arman 2019 3
15 6 Anas 2019 4
16 6 Pietro 2019 2
17 7 Pietro 2019 2
18 8 Diego 2019 1

The following solution uses dplyr to first join all collaborators to every Name, creating a column Name_collab (note that this expands the data frame and could blow it up if it were large). Then, we count the distinct Name_collab for every Name, Year combination and get rid of duplicates.
library(dplyr)
df %>%
left_join(df, by = c("ID", "Year"), suffix = c("", "_collab")) %>%
group_by(Name, Year) %>%
mutate(Unique_Coll = n_distinct(Name_collab)) %>%
ungroup() %>%
distinct(ID, Name, Year, Unique_Coll)
which gives
# A tibble: 18 x 4
ID Name Year Unique_Coll
<int> <fct> <int> <int>
1 1 Anas 2018 3
2 1 Carl 2018 6
3 1 Catherine 2018 4
4 2 Anas 2018 3
5 2 Carl 2018 6
6 3 Catherine 2018 4
7 3 Julien 2018 2
8 4 Raul 2018 4
9 4 Ahmed 2018 4
10 4 Laurence 2018 4
11 4 Carl 2018 6
12 5 Anas 2019 4
13 5 Georges 2019 3
14 5 Arman 2019 3
15 6 Anas 2019 4
16 6 Pietro 2019 2
17 7 Pietro 2019 2
18 8 Diego 2019 1
Input:
df <- read.table(text="ID Name Year
1 1 Anas 2018
2 1 Carl 2018
3 1 Catherine 2018
4 2 Anas 2018
5 2 Carl 2018
6 3 Catherine 2018
7 3 Julien 2018
8 4 Raul 2018
9 4 Ahmed 2018
10 4 Laurence 2018
11 4 Carl 2018
12 5 Anas 2019
13 5 Georges 2019
14 5 Arman 2019
15 6 Anas 2019
16 6 Pietro 2019
17 7 Pietro 2019
18 8 Diego 2019")

I have a solution using joins.
library(tidyverse)
# read data
dta <- tribble(~ID, ~Name, ~Year,
1, "Anas", 2018,
1, "Carl", 2018,
1, "Catherine", 2018,
2, "Anas", 2018,
2, "Carl", 2018,
3, "Catherine", 2018,
3, "Julien", 2018,
4, "Raul", 2018,
4, "Ahmed", 2018,
4, "Laurence", 2018,
4, "Carl", 2018,
5, "Anas", 2019,
5, "Georges", 2019,
5, "Arman", 2019,
6, "Anas", 2019,
6, "Pietro", 2019,
7, "Pietro", 2019,
8, "Diego", 2019)
nb_collabs <- dta %>%
left_join(dta, by = c("ID", "Year")) %>%
select(-ID) %>%
group_by(Name.x, Year) %>%
nest(collaborators = Name.y) %>%
mutate(unique_collaborators = map(collaborators, distinct),
Unique_Coll = map_int(unique_collaborators, nrow)) %>%
select(-collaborators, -unique_collaborators)
left_join(dta, nb_collabs, by = c("Name"="Name.x", "Year"))
# A tibble: 18 x 4
# ID Name Year Unique_Coll
# <dbl> <chr> <dbl> <int>
# 1 1 Anas 2018 3
# 2 1 Carl 2018 6
# 3 1 Catherine 2018 4
# 4 2 Anas 2018 3
# 5 2 Carl 2018 6
# 6 3 Catherine 2018 4
# 7 3 Julien 2018 2
# 8 4 Raul 2018 4
# 9 4 Ahmed 2018 4
#10 4 Laurence 2018 4
#11 4 Carl 2018 6
#12 5 Anas 2019 4
#13 5 Georges 2019 3
#14 5 Arman 2019 3
#15 6 Anas 2019 4
#16 6 Pietro 2019 2
#17 7 Pietro 2019 2
#18 8 Diego 2019 1
The first step is to join the data with itself, so that each name is in "Name.x" and each of that person's collaborators gets a separate row in "Name.y". Nesting the collaborator names then gives one row per Name and Year, with a nested data frame of collaborators; removing the duplicates from it and counting the rows gives the number of unique collaborators.
nb_collabs is then a table with each person, year, and number of collaborators, so we can simply join it back to the original data frame to get the desired format.
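For readers who would rather not depend on the tidyverse, the same self-join idea can be sketched in base R. This is only an illustration, assuming the data has the ID, Name and Year columns shown in the question; the names collab, pairs, counts and key are made up for this sketch and do not come from any of the answers.

```r
# Data from the question
df <- data.frame(
  ID   = c(1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4, 5, 5, 5, 6, 6, 7, 8),
  Name = c("Anas", "Carl", "Catherine", "Anas", "Carl", "Catherine", "Julien",
           "Raul", "Ahmed", "Laurence", "Carl", "Anas", "Georges", "Arman",
           "Anas", "Pietro", "Pietro", "Diego"),
  Year = c(rep(2018, 11), rep(2019, 7)),
  stringsAsFactors = FALSE
)

# Copy of the data with the name column renamed, so that the self-join
# yields one row per (person, collaborator) pair within a project
collab <- setNames(df[c("ID", "Year", "Name")], c("ID", "Year", "Collab"))
pairs  <- merge(df, collab, by = c("ID", "Year"))

# Count distinct collaborators per person and year
counts <- aggregate(Collab ~ Name + Year, data = pairs,
                    FUN = function(x) length(unique(x)))

# Look the count up for every original row (keeps the original row order)
key <- function(d) paste(d$Name, d$Year)
df$Unique_Coll <- counts$Collab[match(key(df), key(counts))]
df
```

Using match instead of a second merge is a deliberate choice here: merge would re-sort the rows by the join keys, while the lookup leaves the input order untouched.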

Related

Repeating annual values multiple times to form a monthly dataframe

I have an annual dataset as below:
year <- c(2016,2017,2018)
xxx <- c(1,2,3)
yyy <- c(4,5,6)
df <- data.frame(year,xxx,yyy)
print(df)
year xxx yyy
1 2016 1 4
2 2017 2 5
3 2018 3 6
Where the values in column xxx and yyy correspond to values for that year.
I would like to expand this dataframe (or create a new dataframe), which retains the same column names, but repeats each value 12 times (corresponding to the month of that year) and repeat the yearly value 12 times in the first column.
As mocked up by the code below:
year <- rep(2016:2018,each=12)
xxx <- rep(1:3,each=12)
yyy <- rep(4:6,each=12)
df2 <- data.frame(year,xxx,yyy)
print(df2)
year xxx yyy
1 2016 1 4
2 2016 1 4
3 2016 1 4
4 2016 1 4
5 2016 1 4
6 2016 1 4
7 2016 1 4
8 2016 1 4
9 2016 1 4
10 2016 1 4
11 2016 1 4
12 2016 1 4
13 2017 2 5
14 2017 2 5
15 2017 2 5
16 2017 2 5
17 2017 2 5
18 2017 2 5
19 2017 2 5
20 2017 2 5
21 2017 2 5
22 2017 2 5
23 2017 2 5
24 2017 2 5
25 2018 3 6
26 2018 3 6
27 2018 3 6
28 2018 3 6
29 2018 3 6
30 2018 3 6
31 2018 3 6
32 2018 3 6
33 2018 3 6
34 2018 3 6
35 2018 3 6
36 2018 3 6
Any help would be greatly appreciated!
I'm new to R and I can see how I would do this with a loop statement but was wondering if there was an easier solution.
Convert df to a matrix, take the Kronecker product with a vector of 12 ones, and then convert back to a data.frame. The Kronecker product does not keep the column names, so restore them with setNames. The as.data.frame can be omitted if a matrix result is ok.
setNames(as.data.frame(as.matrix(df) %x% rep(1, 12)), names(df))
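An alternative that avoids the round trip through a matrix is to repeat the row indices. The following is a minimal sketch using the data from the question; idx is just an illustrative variable name. It keeps the column names and column types of df as they are.

```r
df <- data.frame(year = c(2016, 2017, 2018),
                 xxx  = c(1, 2, 3),
                 yyy  = c(4, 5, 6))

# Repeat every row index 12 times: 1, 1, ..., 1, 2, 2, ..., 3
idx <- rep(seq_len(nrow(df)), each = 12)
df2 <- df[idx, ]

# Row indexing leaves row names such as "1", "1.1", ...; reset them to 1..36
rownames(df2) <- NULL
head(df2, 3)
```

Unlike the matrix approach, this also works when the data frame mixes numeric and character columns, because nothing is coerced to a common type.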

How to calculate the number of months from the initial date for each individual

This is a representation of my dataset
ID<-c(rep(1,10),rep(2,8))
year<-c(2007,2007,2007,2008,2008,2009,2010,2009,2010,2011,
2008,2008,2009,2010,2009,2010,2011,2011)
month<-c(2,7,12,4,11,6,11,1,9,4,3,6,7,4,9,11,2,8)
mydata<-data.frame(ID,year,month)
I want to calculate for each individual the number of months from the initial date. I am using two variables: year and month.
I first order the data by year and month:
mydata2<-mydata%>%group_by(ID,year)%>%arrange(year,month,.by_group=T)
Then I create the variable date, assuming that the day is always 01:
mydata2$date<-paste("01",mydata2$month,mydata2$year,sep = "-")
Then I use lubridate to convert this variable to date format:
mydata2$date<-dmy(mydata2$date)
But after this, I really don't know what to do to obtain the dataset below (preferably using dplyr code):
ID year month date dif_from_init
1 1 2007 2 01-2-2007 0
2 1 2007 7 01-7-2007 5
3 1 2007 12 01-12-2007 10
4 1 2008 4 01-4-2008 14
5 1 2008 11 01-11-2008 21
6 1 2009 1 01-1-2009 23
7 1 2009 6 01-6-2009 28
8 1 2010 9 01-9-2010 43
9 1 2010 11 01-11-2010 45
10 1 2011 4 01-4-2011 50
11 2 2008 3 01-3-2008 0
12 2 2008 6 01-6-2008 3
13 2 2009 7 01-7-2009 16
14 2 2009 9 01-9-2009 18
15 2 2010 4 01-4-2010 25
16 2 2010 11 01-11-2010 32
17 2 2011 2 01-2-2011 35
18 2 2011 8 01-8-2011 41
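One way could be the following (it approximates months as days / 365 * 12 and takes the first row of each ID as the initial date):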
mydata %>%
group_by(ID) %>%
mutate(date = as.Date(sprintf('%d-%d-01',year, month)),
diff = as.numeric(round((date - date[1])/365*12)))
# A tibble: 18 x 5
# Groups: ID [2]
ID year month date diff
<dbl> <dbl> <dbl> <date> <dbl>
1 1 2007 2 2007-02-01 0
2 1 2007 7 2007-07-01 5
3 1 2007 12 2007-12-01 10
4 1 2008 4 2008-04-01 14
5 1 2008 11 2008-11-01 21
6 1 2009 6 2009-06-01 28
7 1 2010 11 2010-11-01 45
8 1 2009 1 2009-01-01 23
9 1 2010 9 2010-09-01 43
10 1 2011 4 2011-04-01 50
11 2 2008 3 2008-03-01 0
12 2 2008 6 2008-06-01 3
13 2 2009 7 2009-07-01 16
14 2 2010 4 2010-04-01 25
15 2 2009 9 2009-09-01 18
16 2 2010 11 2010-11-01 32
17 2 2011 2 2011-02-01 35
18 2 2011 8 2011-08-01 41
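Because the dates are all on the first of the month, the difference can also be computed exactly from year and month, without building dates or rounding. The sketch below is one possible variant, not the method of the answer above; month_index is a helper column invented for this example, and the rows are sorted first so that the result matches the layout requested in the question.

```r
library(dplyr)

mydata <- data.frame(
  ID    = c(rep(1, 10), rep(2, 8)),
  year  = c(2007, 2007, 2007, 2008, 2008, 2009, 2010, 2009, 2010, 2011,
            2008, 2008, 2009, 2010, 2009, 2010, 2011, 2011),
  month = c(2, 7, 12, 4, 11, 6, 11, 1, 9, 4, 3, 6, 7, 4, 9, 11, 2, 8)
)

result <- mydata %>%
  arrange(ID, year, month) %>%
  group_by(ID) %>%
  mutate(
    # months counted on one continuous scale
    month_index   = year * 12 + month,
    # distance to the earliest month of this individual
    dif_from_init = month_index - min(month_index)
  ) %>%
  ungroup() %>%
  select(-month_index)

result
```

Taking min(month_index) rather than the first row means the result does not depend on the order of the input rows.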

Extracting strings from links using regex in R

I have a list of URL links and I want to extract parts of each string and save them in another variable. The sample data is below:
sample<- c("http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr01f2009.pdf",
"http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr02f2001.pdf",
"http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr03f2002.pdf",
"http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr04f2004.pdf",
"http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr05f2005.pdf",
"http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr06f2018.pdf",
"http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr07f2016.pdf",
"http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr08f2015.pdf",
"http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr09f2020.pdf",
"http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr10f2014.pdf")
sample
[1] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr01f2009.pdf"
[2] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr02f2001.pdf"
[3] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr03f2002.pdf"
[4] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr04f2004.pdf"
[5] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr05f2005.pdf"
[6] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr06f2018.pdf"
[7] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr07f2016.pdf"
[8] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr08f2015.pdf"
[9] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr09f2020.pdf"
[10] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr10f2014.pdf"
I want to extract week and year using regex.
week year
1 1 2009
2 2 2001
3 3 2002
4 4 2004
5 5 2005
6 6 2018
7 7 2016
8 8 2015
9 9 2020
10 10 2014
You could use str_match to capture numbers after 'owgr' and 'f' :
library(stringr)
str_match(sample, 'owgr(\\d+)f(\\d+)')[, -1]
You can convert this to a data frame, convert the columns to numeric, and assign column names. The first captured group is the week and the second is the year.
setNames(type.convert(data.frame(
str_match(sample, 'owgr(\\d+)f(\\d+)')[, -1]), as.is = TRUE), c('week', 'year'))
# week year
#1 1 2009
#2 2 2001
#3 3 2002
#4 4 2004
#5 5 2005
#6 6 2018
#7 7 2016
#8 8 2015
#9 9 2020
#10 10 2014
Another way could be to extract all the numbers from last part of sample. We can get the last part with basename.
str_extract_all(basename(sample), '\\d+', simplify = TRUE)
Another way you can try
library(dplyr)
library(stringr)
df <- data.frame(sample)
df2 <- df %>%
transmute(week = as.integer(str_extract(sample, "(?<=wgr)\\d{1,2}(?=f)")), year = as.integer(str_extract(sample, "(?<=f)\\d{4}(?=\\.pdf)")))
# week year
# 1 1 2009
# 2 2 2001
# 3 3 2002
# 4 4 2004
# 5 5 2005
# 6 6 2018
# 7 7 2016
# 8 8 2015
# 9 9 2020
# 10 10 2014
You could use {unglue} :
library(unglue)
unglue_data(
sample,
"http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr{week}f{year}.pdf")
#> week year
#> 1 01 2009
#> 2 02 2001
#> 3 03 2002
#> 4 04 2004
#> 5 05 2005
#> 6 06 2018
#> 7 07 2016
#> 8 08 2015
#> 9 09 2020
#> 10 10 2014
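If adding a package is not desirable, base R's strcapture (in utils, available since R 3.4.0) can do the capture and the type conversion in one step. This is a small sketch on three of the URLs from the question; the variable names urls and res are chosen for this example only.

```r
urls <- c(
  "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr01f2009.pdf",
  "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr02f2001.pdf",
  "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr10f2014.pdf"
)

# The prototype data frame names the capture groups and sets their types:
# first group -> week, second group -> year, both converted to integer
res <- strcapture(
  "owgr(\\d+)f(\\d+)\\.pdf$",
  basename(urls),
  proto = data.frame(week = integer(), year = integer())
)
res
```

Applying the pattern to basename(urls) keeps the "owgr" directory in the path from interfering with the match.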

Replicating table in R with change in one column

I have this table in R :
Name ID Year Month Date
John 8 2017 7 16
Carol 90 2017 7 30
Bug 9 2017 7 1
I want to replicate this same table 4 times, all values should be the same. Except the Month column, which needs to be incremented by 1 every time. And the final table should look like this:
Name ID Year Month Date
John 8 2017 7 16
Carol 90 2017 7 30
Bug 9 2017 7 1
John 8 2017 8 16
Carol 90 2017 8 30
Bug 9 2017 8 1
John 8 2017 9 16
Carol 90 2017 9 30
Bug 9 2017 9 1
John 8 2017 10 16
Carol 90 2017 10 30
Bug 9 2017 10 1
John 8 2017 11 16
Carol 90 2017 11 30
Bug 9 2017 11 1
Please point out how to do this efficiently in R. Many thanks!
If this is your dataframe:
df = read.table(text = "Name ID Year Month Date
John 8 2017 7 16
Carol 90 2017 7 30
Bug 9 2017 7 1", header = TRUE)
Then this is your dataframe repeated 4 times:
df2 = df[rep(rownames(df), 4),]
And this is it again, but with the months incremented (each = 3 gives one offset per copy of the 3-row table):
df2$Month = df2$Month + rep(0:3, each = 3)
In the more general case:
m = 4 # <-- number of copies desired (use 5 to reproduce the table in the question)
df2 = df[rep(rownames(df), m), ]
df2$Month = df2$Month + rep(0:(m - 1), each = nrow(df))
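As a worked check, the sketch below stacks 5 copies, which is what the expected table in the question actually shows (months 7 through 11). It assumes the copies are stacked one after another, so the month offset has to be constant within each copy; it does not handle months running past 12.

```r
df <- read.table(text = "Name ID Year Month Date
John 8 2017 7 16
Carol 90 2017 7 30
Bug 9 2017 7 1", header = TRUE)

m <- 5  # number of copies of the table

# Stack m copies: rows 1, 2, 3, 1, 2, 3, ...
df2 <- df[rep(seq_len(nrow(df)), times = m), ]

# Offsets 0,0,0, 1,1,1, ..., one value per copy
df2$Month <- df2$Month + rep(seq_len(m) - 1, each = nrow(df))

rownames(df2) <- NULL
df2
```

The pairing of times = m for the row indices with each = nrow(df) for the offsets is what keeps every copy of the table on a single month.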

R - Add row index to a data frame but handle ties with minimum rank

I successfully used the answer in this SO thread (r-how-to-add-row-index-to-a-data-frame-based-on-combination-of-factors), but I need to handle the situation where two (or more) rows are tied.
df <- data.frame(
season = c(2014,2014,2014,2014,2014,2014, 2014, 2014),
week = c(1,1,1,1,2,2,2,2),
player.name = c("Matt Ryan","Peyton Manning","Cam Newton","Matthew Stafford","Carson Palmer","Andrew Luck", "Aaron Rodgers", "Chad Henne"),
fant.pts.passing = c(28,19,29,28,18,22,29,22)
)
df <- df[order(-df$season, df$week, -df$fant.pts.passing),]
df$Index <- ave( 1:nrow(df), df$season, df$week, FUN=function(x) 1:length(x) )
df
In this example, for week 1, Matt Ryan and Matthew Stafford would both be 2, and then Peyton Manning would be 4.
You would want to use the rank function with ties.method="min" within your ave call:
df$Index <- ave(-df$fant.pts.passing, df$season, df$week,
FUN=function(x) rank(x, ties.method="min"))
df
# season week player.name fant.pts.passing Index
# 3 2014 1 Cam Newton 29 1
# 1 2014 1 Matt Ryan 28 2
# 4 2014 1 Matthew Stafford 28 2
# 2 2014 1 Peyton Manning 19 4
# 7 2014 2 Aaron Rodgers 29 1
# 6 2014 2 Andrew Luck 22 2
# 8 2014 2 Chad Henne 22 2
# 5 2014 2 Carson Palmer 18 4
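To see what ties.method changes, it can help to rank the week 1 points on their own. This is a small standalone illustration; pts and the r_* names are introduced only for this example.

```r
# Week 1 points in the original row order:
# Matt Ryan, Peyton Manning, Cam Newton, Matthew Stafford
pts <- c(28, 19, 29, 28)

# Negate so that the highest score gets rank 1
r_min   <- rank(-pts, ties.method = "min")      # 2 4 1 2
r_first <- rank(-pts, ties.method = "first")    # 2 4 1 3
r_avg   <- rank(-pts, ties.method = "average")  # 2.5 4.0 1.0 2.5

r_min
```

With "min", both players on 28 points share rank 2 and the next player gets rank 4, which is the behaviour asked for in the question.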
Assuming you want ranks by season and week, this can be easily accomplished with dplyr's min_rank:
library(dplyr)
df %>% group_by(season, week) %>%
mutate(indx = min_rank(desc(fant.pts.passing)))
# season week player.name fant.pts.passing Index indx
# 1 2014 1 Cam Newton 29 1 1
# 2 2014 1 Matt Ryan 28 2 2
# 3 2014 1 Matthew Stafford 28 3 2
# 4 2014 1 Peyton Manning 19 4 4
# 5 2014 2 Aaron Rodgers 29 1 1
# 6 2014 2 Andrew Luck 22 2 2
# 7 2014 2 Chad Henne 22 3 2
# 8 2014 2 Carson Palmer 18 4 4
You could use the faster frank from data.table and assign (:=) the column by reference
library(data.table)#v1.9.5+
setDT(df)[, indx := frank(-fant.pts.passing, ties.method='min'), .(season, week)]
# season week player.name fant.pts.passing indx
#1: 2014 1 Cam Newton 29 1
#2: 2014 1 Matt Ryan 28 2
#3: 2014 1 Matthew Stafford 28 2
#4: 2014 1 Peyton Manning 19 4
#5: 2014 2 Aaron Rodgers 29 1
#6: 2014 2 Andrew Luck 22 2
#7: 2014 2 Chad Henne 22 2
#8: 2014 2 Carson Palmer 18 4
