I am a newly self-taught user of R and require assistance.
I am working with a dataset that captures location of residence and whether the locality is metropolitan, regional or rural over 7 years (2015-2021) for a subset of a population. Each individual has a unique ID and each year is on a new row (i.e. each ID has 7 rows). I am trying to figure out how many individuals have remained in the same location, how many have moved, and where they moved to.
I am really struggling to figure out what I need to do to get the required outputs, but I assume there is a way to get a summary table with the number of individuals who haven't moved (+- where they are located) and the number of individuals who have moved (+- where they have moved to).
Your assistance would be greatly appreciated.
Dummy dataset:
library(tibble)

stack <- tribble(
~ID, ~Year, ~Residence, ~Locality,
#--/--/--/----
"a", "2015", "Sydney", "Metro",
"a", "2016", "Sydney", "Metro",
"a", "2017", "Sydney", "Metro",
"a", "2018", "Sydney", "Metro",
"a", "2019", "Sydney", "Metro",
"a", "2020", "Sydney", "Metro",
"a", "2021", "Sydney", "Metro",
"b", "2015", "Sydney", "Metro",
"b", "2016", "Orange", "Regional",
"b", "2017", "Orange", "Regional",
"b", "2018", "Orange", "Regional",
"b", "2019", "Orange", "Regional",
"b", "2020", "Broken Hill", "Rural",
"b", "2021", "Sydney", "Metro",
"c", "2015", "Dubbo", "Regional",
"c", "2016", "Dubbo", "Regional",
"c", "2017", "Dubbo", "Regional",
"c", "2018", "Dubbo", "Regional",
"c", "2019", "Dubbo", "Regional",
"c", "2020", "Dubbo", "Regional",
"c", "2021", "Dubbo", "Regional",
)
Cheers in advance.
You can use the lead function to add columns containing each person's location in the following year. Using mutate with across, you can apply lead to two columns simultaneously. You can then make row-wise comparisons and look for moves before summarising.
# Group by individual before applying the lead function
# Apply the lead function to the two listed columns, adding "nextyear" as a suffix
# Add a logical column which returns TRUE if any change of residence or locality is detected
# Summarise the data by individual, retaining the location with the max year
stack %>%
  unite(col = "Location", c(Residence, Locality), sep = "-") %>%
  group_by(ID) %>%
  mutate(across(c("Year", "Location"), list(nextyear = lead)),
         Move = Location != Location_nextyear) %>%
  filter(!is.na(Year_nextyear)) %>%
  mutate(nb.of.moves = sum(Move, na.rm = TRUE)) %>%
  slice_max(Year) %>%
  select(ID, last.location = Location_nextyear, nb.of.moves)
# A tibble: 3 x 3
# Groups: ID [3]
ID last.location nb.of.moves
<chr> <chr> <int>
1 a Sydney-Metro 0
2 b Sydney-Metro 3
3 c Dubbo-Regional 0
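For reference (not part of the answer above), dplyr's lead() shifts a vector forward by one position so each row sees the following row's value, padding the end with NA — a minimal illustration:

```r
library(dplyr)

# lead() returns each element's successor; the last element has none, so NA
lead(c("Sydney", "Orange", "Orange"))
#> [1] "Orange" "Orange" NA
```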
Here is another tidyverse option, using cumsum. We can take the cumulative sum to show how many times each person moves (if they do). Then we can slice the last row and get the count of each location. The change column indicates how many times they moved. However, it's unclear what you want the final product to look like.
library(tidyverse)
stack %>%
  group_by(ID) %>%
  mutate(
    change = cumsum(case_when(
      paste0(Residence, Locality) != lag(paste0(Residence, Locality)) ~ TRUE,
      TRUE ~ FALSE
    ))
  ) %>%
  slice(n()) %>%
  ungroup() %>%
  count(Residence, Locality, change)
Output
Residence Locality change n
<chr> <chr> <int> <int>
1 Dubbo Regional 0 1
2 Sydney Metro 0 1
3 Sydney Metro 3 1
Using data.table.
library(data.table)
setDT(stack) # convert to data.table
setorder(stack, ID, Year) # assure rows are in correct order
stack[, rle(paste(Residence, Locality, sep=', ')), by=.(ID)]
## ID lengths values
## 1: a 7 Sydney, Metro
## 2: b 1 Sydney, Metro
## 3: b 4 Orange, Regional
## 4: b 1 Broken Hill, Rural
## 5: b 1 Sydney, Metro
## 6: c 7 Dubbo, Regional
So a stayed in Sydney for 7 years, b stayed in Sydney for 1 year then moved to Orange for 4 years, then moved to Broken Hill for 1 year, then moved back to Sydney for 1 year.
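This works because base R's rle() (run-length encoding) collapses consecutive identical values into value/length pairs — one pair per uninterrupted "stay". A minimal illustration:

```r
# Each run of identical consecutive values becomes one (length, value) pair
rle(c("Sydney", "Sydney", "Orange", "Sydney"))
#> Run Length Encoding
#>   lengths: int [1:3] 2 1 1
#>   values : chr [1:3] "Sydney" "Orange" "Sydney"
```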
To determine how many times each person moved:
result <- stack[, rle(paste(Residence, Locality, sep=', ')), by=.(ID)]
result[, .(N=.N-1), by=.(ID)]
## ID N
## 1: a 0
## 2: b 3
## 3: c 0
So a and c did not move at all, and b moved 3 times.
Similar to what @Dealec did, I used the lag function from dplyr instead.
library(tidyverse)
library(janitor)
#>
#> Attaching package: 'janitor'
#> The following objects are masked from 'package:stats':
#>
#> chisq.test, fisher.test
stack <- tribble(
~ID, ~Year, ~Residence, ~Locality,
#--/--/--/----
"a", "2015", "Sydney", "Metro",
"a", "2016", "Sydney", "Metro",
"a", "2017", "Sydney", "Metro",
"a", "2018", "Sydney", "Metro",
"a", "2019", "Sydney", "Metro",
"a", "2020", "Sydney", "Metro",
"a", "2021", "Sydney", "Metro",
"b", "2015", "Sydney", "Metro",
"b", "2016", "Orange", "Regional",
"b", "2017", "Orange", "Regional",
"b", "2018", "Orange", "Regional",
"b", "2019", "Orange", "Regional",
"b", "2020", "Broken Hill", "Rural",
"b", "2021", "Sydney", "Metro",
"c", "2015", "Dubbo", "Regional",
"c", "2016", "Dubbo", "Regional",
"c", "2017", "Dubbo", "Regional",
"c", "2018", "Dubbo", "Regional",
"c", "2019", "Dubbo", "Regional",
"c", "2020", "Dubbo", "Regional",
"c", "2021", "Dubbo", "Regional",
) %>%
clean_names()
results <- stack %>%
mutate(location = paste(residence, locality, sep = "_")) %>%
arrange(id, year) %>%
group_by(id) %>%
mutate(
row = row_number(),
movement = case_when(
row == 1 ~ NA_character_,
location == lag(location, n = 1) ~ "no_movement",
TRUE ~ location
)
) %>%
ungroup() %>%
select(-row)
results
#> # A tibble: 21 x 6
#> id year residence locality location movement
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 a 2015 Sydney Metro Sydney_Metro <NA>
#> 2 a 2016 Sydney Metro Sydney_Metro no_movement
#> 3 a 2017 Sydney Metro Sydney_Metro no_movement
#> 4 a 2018 Sydney Metro Sydney_Metro no_movement
#> 5 a 2019 Sydney Metro Sydney_Metro no_movement
#> 6 a 2020 Sydney Metro Sydney_Metro no_movement
#> 7 a 2021 Sydney Metro Sydney_Metro no_movement
#> 8 b 2015 Sydney Metro Sydney_Metro <NA>
#> 9 b 2016 Orange Regional Orange_Regional Orange_Regional
#> 10 b 2017 Orange Regional Orange_Regional no_movement
#> # ... with 11 more rows
results %>%
count(year, movement) %>%
pivot_wider(names_from = movement,
values_from = n) %>%
clean_names()
#> # A tibble: 7 x 6
#> year na no_movement orange_regional broken_hill_rural sydney_metro
#> <chr> <int> <int> <int> <int> <int>
#> 1 2015 3 NA NA NA NA
#> 2 2016 NA 2 1 NA NA
#> 3 2017 NA 3 NA NA NA
#> 4 2018 NA 3 NA NA NA
#> 5 2019 NA 3 NA NA NA
#> 6 2020 NA 2 NA 1 NA
#> 7 2021 NA 2 NA NA 1
#tracking movement from a location
from_location <- stack %>%
mutate(location = paste(residence, locality, sep = "_")) %>%
arrange(id, year) %>%
group_by(id) %>%
mutate(
row = row_number(),
movement_from = case_when(
row == 1 ~ NA_character_,
location == lag(location, n = 1) ~ "no_movement",
TRUE ~ lag(location, n = 1)
)
) %>%
ungroup() %>%
select(-row)
from_location %>%
count(year, movement_from) %>%
pivot_wider(names_from = movement_from,
names_prefix = "from_",
values_from = n) %>%
clean_names()
#> # A tibble: 7 x 6
#> year from_na from_no_movement from_sydney_metro from_orange_regional
#> <chr> <int> <int> <int> <int>
#> 1 2015 3 NA NA NA
#> 2 2016 NA 2 1 NA
#> 3 2017 NA 3 NA NA
#> 4 2018 NA 3 NA NA
#> 5 2019 NA 3 NA NA
#> 6 2020 NA 2 NA 1
#> 7 2021 NA 2 NA NA
#> # ... with 1 more variable: from_broken_hill_rural <int>
Created on 2022-04-28 by the reprex package (v2.0.1)
Related
I have a large DF in which certain columns hold a vector of character values, as below. The number of such columns varies from dataset to dataset, as does the number of character values each column holds.
ID Country1 Country2 Country3
1 1 Argentina, Japan,USA,Poland, Argentina,USA Pakistan
2 2 Colombia, Mexico,Uruguay,Dutch Mexico,Uruguay Afganisthan
3 3 Argentina, Japan,USA,NA Japan Khazagistan
4 4 Colombia, Mexico,Uruguay,Dutch Colombia, Dutch North Korea
5 5 India, China China Iran
I would like to match them one-to-one with another string vector, as below:
vals_to_find <-c("Argentina","USA","Mexico")
If a column/row matches any one of the strings passed, I would like to retain that column and row, remove duplicates, and finally remove the values that do not match.
the desired output is as follows
ID Countries.found
1 1 Argentina, USA
2 2 Mexico
3 3 Argentina, USA
4 4 Mexico
data
dput(df)
structure(list(ID = 1:5, Country1 = c("Argentina, Japan,USA,Poland,",
"Colombia, Mexico,Uruguay,Dutch", "Argentina, Japan,USA,NA",
"Colombia, Mexico,Uruguay,Dutch", "India, China"), Country2 = c("Argentina,USA",
"Mexico,Uruguay", "Japan", "Colombia, Dutch", "China"), Country3 = c("Pakistan",
"Afganisthan", "Khazagistan", "North Korea", "Iran")), class = "data.frame", row.names = c(NA,
-5L))
dput(df_out)
structure(list(ID = 1:4, Countries.found = c("Argentina, USA",
"Mexico", "Argentina, USA", "Mexico")), class = "data.frame", row.names = c(NA,
-4L))
Instead of each column holding a vector, if the file is read with one value per column, then I was able to do it as below:
dput(df_out)
structure(list(ID = 1:5, X1 = c("Argentina", "Colombia", "Argentina",
"Colombia", "India"), X2 = c("Japan", "Mexico", "Japan", "Mexico",
"China"), X3 = c("USA", "Uruguay", "USA", "Uruguay", NA), X4 = c("Poland",
"Dutch", NA, "Dutch", NA), X5 = c("Argentina", "Mexico", "Japan",
"Colombia", "China"), X6 = c("USA", "Uruguay", NA, "Dutch", NA
), X7 = c("Pakistan", "Afganisthan", "Khazagistan", "North Korea",
"Iran")), class = "data.frame", row.names = c(NA, -5L))
df_out %>%
dplyr::select(
where(~ !all(is.na(.x)))
) %>%
dplyr::select(c(1, where(~ any(.x %in% vals_to_find)))) %>%
dplyr::mutate(dplyr::across(
tidyselect::starts_with("X"),
~ vals_to_find[match(., vals_to_find)]
)) %>%
tidyr::unite("countries_found", tidyselect::starts_with("X"),
sep = " | ", remove = TRUE, na.rm = TRUE
)
Output
ID countries_found
1 1 Argentina | USA | Argentina | USA
2 2 Mexico | Mexico
3 3 Argentina | USA
4 4 Mexico
unite the "Country" columns, then create a long vector by separating the values into rows, get all distinct values per ID, filter only those that are in vals_to_find, and summarise each ID's Countries.found with toString.
library(tidyr)
library(dplyr)
df %>%
unite("Country", starts_with("Country"), sep = ",") %>%
separate_rows(Country) %>%
distinct(ID, Country) %>%
filter(Country %in% vals_to_find) %>%
group_by(ID) %>%
summarise(Countries.found = toString(Country))
output
# A tibble: 4 × 2
ID Countries.found
<int> <chr>
1 1 Argentina, USA
2 2 Mexico
3 3 Argentina, USA
4 4 Mexico
We may use
library(dplyr)
library(tidyr)
library(stringr)
df %>%
mutate(across(starts_with("Country"),
~ str_extract_all(.x, str_c(vals_to_find, collapse = "|")))) %>%
pivot_longer(cols = -ID, names_to = NULL,
values_to = 'Countries.found') %>%
unnest(Countries.found) %>%
distinct %>%
group_by(ID) %>%
summarise(Countries.found = toString(Countries.found))
Output
# A tibble: 4 × 2
ID Countries.found
<int> <chr>
1 1 Argentina, USA
2 2 Mexico
3 3 Argentina, USA
4 4 Mexico
I have the following DF
head(sample_data)
article value date
1 A 21920 2015
2 I 615 2017
3 B 1414 2018
4 D 102 2018
5 I 1096 2015
6 A 2577 2021
Full dataset
dput(sample_data)
structure(list(article = c("A", "I", "B", "D", "I", "A", "C",
"C", "D", "H", "B", "I", "A", "G", "E", "G", "D", "A", "D", "B",
"A", "C", "D", "F", "G", "D", "G", "C", "E", "E", "G", "G", "A",
"A", "E", "H", "B", "E", "E", "B", "B", "A", "H", "A", "B", "G",
"D", "C", "E", "A"), value = c(21920, 615, 1414, 102, 1096, 2577,
840, 311, 804, 695, 3863, 279, 7324, 299, 311, 133, 759, 5386,
5396, 11051, 14708, 856, 1749, 2212, 318, 3478, 415, 781, 227,
248, 122, 185, 1344, 15442, 248, 433, 5068, 38, 165, 369, 805,
18944, 264, 11716, 4274, 442, 2530, 827, 164, 18506), date = c("2015",
"2017", "2018", "2018", "2015", "2021", "2016", "2021", "2017",
"2021", "2019", "2015", "2019", "2016", "2015", "2019", "2018",
"2020", "2017", "2015", "2015", "2016", "2015", "2015", "2021",
"2015", "2019", "2016", "2016", "2015", "2019", "2020", "2019",
"2016", "2016", "2015", "2015", "2021", "2021", "2020", "2020",
"2015", "2016", "2017", "2019", "2016", "2015", "2016", "2019",
"2016")), row.names = c(NA, -50L), class = "data.frame")
I'm trying to use dplyr to get something along the lines of this:
sample_data %>%
  group_by(article, date) %>%
  summarise(weight = sum(value))
`summarise()` has grouped output by 'article'. You can override using the `.groups` argument.
# A tibble: 29 x 3
# Groups: article [9]
article date weight
<chr> <chr> <dbl>
1 A 2015 55572
2 A 2016 33948
3 A 2017 11716
4 A 2019 8668
5 A 2020 5386
6 A 2021 2577
7 B 2015 16119
8 B 2018 1414
9 B 2019 8137
10 B 2020 1174
# ... with 19 more rows
However, I want to add another column with a proportion of each article's weight of the total (sum of A:I) per year. The sum of all article proportions should then amount to 1 for each year.
I tried the code below, which prints one row per underlying value instead of one per group. I suspect this occurs because I reference "value" directly in summarise, which returns all the individual values, hence all occurrences. How can I summarise this so it looks like the table above with the added column?
sample_data %>%
  group_by(article, date) %>%
  summarise(weight = sum(value), prop = value/weight)
`summarise()` has grouped output by 'article', 'date'. You can override using the `.groups` argument.
# A tibble: 50 x 4
# Groups: article, date [29]
article date weight prop
<chr> <chr> <dbl> <dbl>
1 A 2015 55572 0.394
2 A 2015 55572 0.265
3 A 2015 55572 0.341
4 A 2016 33948 0.455
5 A 2016 33948 0.545
6 A 2017 11716 1
7 A 2019 8668 0.845
8 A 2019 8668 0.155
9 A 2020 5386 1
10 A 2021 2577 1
# ... with 40 more rows
After the initial summarize, you have one entry for each article per year. You then wish to know what the contribution of each article was to each year's total, so you need to group_by again using just the year, and finally mutate to get the proportion for each article.
library(dplyr)
sample_data %>%
group_by(article, date) %>%
summarise(weight = sum(value), .groups = "keep") %>%
group_by(date) %>%
mutate(prop = weight / sum(weight))
#> # A tibble: 29 x 4
#> # Groups: date [7]
#> article date weight prop
#> <chr> <chr> <dbl> <dbl>
#> 1 A 2015 55572 0.661
#> 2 A 2016 33948 0.876
#> 3 A 2017 11716 0.632
#> 4 A 2019 8668 0.491
#> 5 A 2020 5386 0.799
#> 6 A 2021 2577 0.628
#> 7 B 2015 16119 0.192
#> 8 B 2018 1414 0.622
#> 9 B 2019 8137 0.461
#> 10 B 2020 1174 0.174
#> # ... with 19 more rows
Created on 2022-02-19 by the reprex package (v2.0.1)
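As a quick sanity check of the grouping logic (not in the original answer, and using a small toy data frame rather than the question's sample_data), the proportions within each year should sum to 1:

```r
library(dplyr)

# Toy data just to verify the two-step group_by: two articles over two years
toy <- data.frame(article = c("A", "B", "A", "B"),
                  date = c("2015", "2015", "2016", "2016"),
                  value = c(3, 1, 2, 2))

props <- toy %>%
  group_by(article, date) %>%
  summarise(weight = sum(value), .groups = "drop") %>%
  group_by(date) %>%
  mutate(prop = weight / sum(weight))

# Proportions within each year add up to 1
props %>% summarise(total = sum(prop))
```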
An option is also to do the group-by sum within the first summarise:
library(dplyr)
library(tibble)
library(tidyr)
sample_data %>%
group_by(date) %>%
summarise(out = enframe(tapply(value, article, sum)/sum(value),
name = 'article', value = 'prop'), .groups = 'drop') %>%
unpack(out)
# A tibble: 29 × 3
date article prop
<chr> <chr> <dbl>
1 2015 A 0.661
2 2015 B 0.192
3 2015 D 0.0923
4 2015 E 0.00665
5 2015 F 0.0263
6 2015 H 0.00515
7 2015 I 0.0164
8 2016 A 0.876
9 2016 C 0.0853
10 2016 E 0.0123
# … with 19 more rows
I'm sorry if this question has already been answered, but I don't really know how to phrase my question.
I have a data frame structured in this way:
country  year  score
France   2020  10
France   2019  9
Germany  2020  15
Germany  2019  14
I would like to have a new column called previous_year_score that looks up, for each row, the "score" of that country in "year - 1". In this case France 2020 would have a previous_year_score of 9, while France 2019 would have an NA.
You can use match() for this. I imagine there are plenty of other solutions too.
Data:
df <- structure(list(country = c("France", "France", "Germany", "Germany"
), year = c(2020L, 2019L, 2020L, 2019L), score = c(10L, 9L, 15L,
14L), prev_score = c(9L, NA, 14L, NA)), row.names = c(NA, -4L
), class = "data.frame")
Solution:
i <- match(paste(df[[1]], df[[2]] - 1), paste(df[[1]], df[[2]]))
df$prev_score <- df[i, 3]
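To unpack how this works: paste() builds "country year" keys, and match() finds, for each row, the position of the row holding the same country in the previous year (NA where none exists). Reconstructing it with the data above:

```r
df <- data.frame(country = c("France", "France", "Germany", "Germany"),
                 year = c(2020L, 2019L, 2020L, 2019L),
                 score = c(10L, 9L, 15L, 14L))

# Keys looked up:   "France 2019" "France 2018" "Germany 2019" "Germany 2018"
# Keys available:   "France 2020" "France 2019" "Germany 2020" "Germany 2019"
i <- match(paste(df[[1]], df[[2]] - 1), paste(df[[1]], df[[2]]))
i
#> [1]  2 NA  4 NA
```

So row 1 (France 2020) finds its previous year in row 2, and rows with no previous year get NA.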
You can use the following solution:
library(dplyr)
df %>%
group_by(country) %>%
arrange(year) %>%
mutate(prev_val = ifelse(year - lag(year) == 1, lag(score), NA))
# A tibble: 4 x 4
# Groups: country [2]
country year score prev_val
<chr> <int> <int> <int>
1 France 2019 9 NA
2 Germany 2019 14 NA
3 France 2020 10 9
4 Germany 2020 15 14
Using case_when
library(dplyr)
df1 %>%
arrange(country, year) %>%
group_by(country) %>%
mutate(prev_val = case_when(year - lag(year) == 1 ~ lag(score)))
# A tibble: 4 x 4
# Groups: country [2]
country year score prev_val
<chr> <int> <int> <int>
1 France 2019 9 NA
2 France 2020 10 9
3 Germany 2019 14 NA
4 Germany 2020 15 14
This question already has answers here:
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
Closed 2 years ago.
I have two datasets I would like to merge in R: one is a long catch dataset and the other is a small effort dataset. I would like to join these so that I can multiply values for the same years AND industry together. E.g., the small effort columns will be repeated many times over, as they are industry-wide characteristics. I think this is a very simple merge but I am having trouble making it work!
Catch <- data.frame(
Species = c("a", "a", "c", "c", "a", "b"),
Industry= c( "ag", "fi", "ag", "fi", "ag", "fi" ),
Year = c("1990", "1990", "1991", "1992", "1990", "1990"),
Catch = c(0,1,4,7,5,6))
Effort<-data.frame(
Industry= c( "ag", "ag", "ag" , "fi", "fi", "fi"),
Year = c("1990", "1991", "1992", "1990", "1991", "1992"),
Effort = c(0,1,4,7,5,6))
What I have tried so far:
effort_catch<-merge(Effort, Catch , by.x= Year, by.y=Year )
I am not sure which one is what you need
transform(
merge(Catch, Effort, by = c("Industry", "Year"), all.x = TRUE),
prod = Catch * Effort
)
Industry Year Species Catch Effort prod
1 ag 1990 a 0 0 0
2 ag 1990 a 5 0 0
3 ag 1991 c 4 1 4
4 fi 1990 a 1 7 7
5 fi 1990 b 6 7 42
6 fi 1992 c 7 6 42
or
transform(
merge(Catch, Effort, by = c("Industry", "Year"), all = TRUE),
prod = Catch * Effort
)
Industry Year Species Catch Effort prod
1 ag 1990 a 0 0 0
2 ag 1990 a 5 0 0
3 ag 1991 c 4 1 4
4 ag 1992 <NA> NA 4 NA
5 fi 1990 a 1 7 7
6 fi 1990 b 6 7 42
7 fi 1991 <NA> NA 5 NA
8 fi 1992 c 7 6 42
Here's a solution using dplyr
library(dplyr)
full_join(Catch, Effort) %>%
mutate(Multiplied = Catch * Effort)
#> Joining, by = c("Industry", "Year")
#> Species Industry Year Catch Effort Multiplied
#> 1 a ag 1990 0 0 0
#> 2 a fi 1990 1 7 7
#> 3 c ag 1991 4 1 4
#> 4 c fi 1992 7 6 42
#> 5 a ag 1990 5 0 0
#> 6 b fi 1990 6 7 42
#> 7 <NA> ag 1992 NA 4 NA
#> 8 <NA> fi 1991 NA 5 NA
Based on your provided data...
Catch <- data.frame(
Species = c("a", "a", "c", "c", "a", "b"),
Industry= c( "ag", "fi", "ag", "fi", "ag", "fi" ),
Year = c("1990", "1990", "1991", "1992", "1990", "1990"),
Catch = c(0,1,4,7,5,6))
Effort<-data.frame(
Industry= c( "ag", "ag", "ag" , "fi", "fi", "fi"),
Year = c("1990", "1991", "1992", "1990", "1991", "1992"),
Effort = c(0,1,4,7,5,6))
I have the following data and I want to make a new column, using mutate, which when colour = 'g' takes the level on the g row minus the level on the r row.
Then likewise with type: where type = type1, take the corresponding level minus the level on the type2 row.
library(dplyr)
d <- tibble(
date = c("2018", "2018", "2018", "2019", "2019", "2019", "2020", "2020", "2020", "2020"),
colour = c("none","g", "r", "none","g", "r", "none", "none", "none", "none"),
type = c("type1", "none", "none", "type2", "none", "none", "none", "none", "none", "none"),
level= c(78, 99, 45, 67, 87, 78, 89, 87, 67, 76))
Just to be clear, this is what I want the data to look like:
d2 <- tibble(
date = c("2018", "2018", "2018", "2019", "2019", "2019", "2020", "2020", "2020", "2020"),
colour = c("none","g", "r", "none","g", "r", "none", "none", "none", "none"),
type = c("type1", "none", "none", "type2", "none", "none", "none", "none", "none", "none"),
level= c(78, 99, 45, 67, 87, 78, 89, 87, 67, 76),
color_gap = c("NULL", 44, "NULL", "NULL", 9, "NULL", "NULL", "NULL", "NULL", "NULL"),
type_gap = c(11, "NULL", "NULL", "NULL", "NULL", "NULL", "NULL", "NULL", "NULL", "NULL"))
I started to use mutate and case_when and got to the code below. However, I'm stuck on the final calculation part. How do I say I want to take the colour g level minus the colour r level?
d %>%
  mutate(color_gap = case_when(colour == "g" ~ level)) %>%
  mutate(type_gap = case_when(type == "type1" ~ level)) -> d2
Anyone know how to complete this?
Thanks
This subtracts the first r level from the first g level, second r level from second g level, etc. Same for type1 and type2. This has no checks at all. It doesn't check whether there is a matching r for each g, whether they are in the expected order, whether they are in the same date-group, etc. It assumes the data is already perfectly formatted as expected, so be careful using this on real data.
d %>%
  mutate(color_gap = replace(rep(NA, n()), colour == 'g',
                             level[colour == 'g'] - level[colour == 'r']),
         type_gap = replace(rep(NA, n()), type == 'type1',
                            level[type == 'type1'] - level[type == 'type2']))
# # A tibble: 10 x 6
# date colour type level color_gap type_gap
# <chr> <chr> <chr> <dbl> <dbl> <dbl>
# 1 2018 none type1 78 NA 11
# 2 2018 g none 99 54 NA
# 3 2018 r none 45 NA NA
# 4 2019 none type2 67 NA NA
# 5 2019 g none 87 9 NA
# 6 2019 r none 78 NA NA
# 7 2020 none none 89 NA NA
# 8 2020 none none 87 NA NA
# 9 2020 none none 67 NA NA
# 10 2020 none none 76 NA NA
You could do this with group_by and mutate.
I assumed that there is only one row per date that satisfies each condition.
d %>%
  mutate(color_gap = case_when(colour == "g" ~ level)) %>%
  mutate(type_gap = case_when(type == "type1" ~ level)) %>%
  group_by(date) %>%
  mutate(diff = max(color_gap, na.rm = TRUE) - max(type_gap, na.rm = TRUE))
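One caveat with this approach (an observation, not part of the original answer): for dates where no row satisfies a condition (2020 here), max(..., na.rm = TRUE) on an all-NA group warns and returns -Inf. A small helper that returns NA instead keeps the result clean — a sketch:

```r
# max() on an all-NA vector with na.rm = TRUE warns and returns -Inf
suppressWarnings(max(c(NA_real_, NA_real_), na.rm = TRUE))
#> [1] -Inf

# Returns NA instead of -Inf when the whole group is missing
safe_max <- function(x) if (all(is.na(x))) NA_real_ else max(x, na.rm = TRUE)
safe_max(c(NA_real_, NA_real_))
#> [1] NA
```

Swapping safe_max() in for max() in the pipeline above gives NA for the 2020 rows instead of -Inf, with no warnings.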