(R) Fill missing integer range by group with dplyr

I am using RStudio, mostly with dplyr processing. I have a df of users (A, B, C, ...) and the day since their first visit on which they were active (1, 2, 3, ...):
user  day  active
A     1    T
A     3    T
B     2    T
B     4    T
I would like to complete this list with all missing days, up to each user's current maximum value (so up to 4 for user B and up to 3 for user A), with active set to FALSE:
user  day  active
A     1    T
A     2    F
A     3    T
B     1    F
B     2    T
B     3    F
B     4    T
I've been googling and chewing on this for hours now. Anybody have an idea?

We could group by 'user' and then get the sequence from min to max of 'day' in complete to expand the data, while filling the 'active' column with FALSE (by default, missing combinations are filled with NA):
library(dplyr)
library(tidyr)

df1 %>%
  group_by(user) %>%
  complete(day = min(day):max(day), fill = list(active = FALSE)) %>%
  ungroup()
-output
# A tibble: 6 × 3
  user    day active
  <chr> <int> <lgl>
1 A         1 TRUE
2 A         2 FALSE
3 A         3 TRUE
4 B         2 TRUE
5 B         3 FALSE
6 B         4 TRUE
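Note that the desired output above starts user B at day 1, while min(day):max(day) only fills from each user's first observed day. If every user should instead be filled from day 1, a small tweak of the same pipeline works (a sketch, assuming the df1 from the data block below):

df1 %>%
  group_by(user) %>%
  complete(day = 1:max(day), fill = list(active = FALSE)) %>%
  ungroup()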
data
df1 <- structure(list(user = c("A", "A", "B", "B"), day = c(1L, 3L, 2L, 4L),
  active = c(TRUE, TRUE, TRUE, TRUE)), class = "data.frame",
  row.names = c(NA, -4L))

You can create a new dataframe with all user-day combinations, join it to your existing dataframe, and then set the active column. Something like this:
library(dplyr)

fullDf <- data.frame("user" = c(rep("A", 4), rep("B", 4)),
                     "day" = rep(1:4, 2))
existingDf <- left_join(fullDf, existingDf, by = c("user", "day"))
existingDf$active <- ifelse(is.na(existingDf$active), FALSE, existingDf$active)
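If you'd rather not hardcode the combinations, here is a minimal base R sketch that builds fullDf from the data itself, up to each user's maximum observed day (assuming the original, pre-join data frame is called existingDf):

maxDays <- aggregate(day ~ user, data = existingDf, FUN = max)
fullDf <- do.call(rbind, lapply(seq_len(nrow(maxDays)), function(i)
  data.frame(user = maxDays$user[i], day = seq_len(maxDays$day[i]))))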

Related

Identify rows with a value greater than threshold, but only the first one above per group

Suppose we have a dataset with a grouping variable, a value, and a threshold that is unique per group. Say I want to identify values greater than the threshold, but mark only the first such value per group.
test <- data.frame(
  grp = c("A", "A", "A", "B", "B", "B"),
  value = c(1, 3, 5, 1, 3, 5),
  threshold = c(4, 4, 4, 2, 2, 2)
)
want <- data.frame(
  grp = c("A", "A", "A", "B", "B", "B"),
  value = c(1, 3, 5, 1, 3, 5),
  threshold = c(4, 4, 4, 2, 2, 2),
  want = c(NA, NA, "yes", NA, "yes", NA)
)
In the table above, group A has a threshold of 4, and only the value 5 is higher. But in group B, the threshold is 2, and both 3 and 5 are higher; however, only the row with value 3 is marked.
I was able to do this by identifying which rows had a value greater than the threshold, then removing the repeated values:
library(dplyr)

test %>%
  group_by(grp) %>%
  mutate(want = if_else(value > threshold, "yes", NA_character_)) %>%
  mutate(across(want, ~ replace(.x, duplicated(.x), NA)))
I was wondering if there is a more direct way to do this with a single logical statement rather than this two-step method, something along the lines of:
test %>%
  group_by(grp) %>%
  mutate(want = if_else(???, "yes", NA_character_))
The answer doesn't have to be in R either. Just a logical step explanation would suffice as well. Perhaps using a rank?
Thank you!
library(dplyr)

test %>%
  group_by(grp) %>%
  # default = FALSE so a first-row exceedance still yields TRUE rather than NA
  mutate(want = (value > threshold),
         want = want & !lag(cumany(want), default = FALSE)) %>%
  ungroup()
# # A tibble: 6 × 4
#   grp   value threshold want
#   <chr> <dbl>     <dbl> <lgl>
# 1 A         1         4 FALSE
# 2 A         3         4 FALSE
# 3 A         5         4 TRUE
# 4 B         1         2 FALSE
# 5 B         3         2 TRUE
# 6 B         5         2 FALSE
If you really want strings, you can if_else after this.
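As an aside, the single logical statement the OP asked about is possible with cumsum: the first value above the threshold is exactly the row where the running count of exceedances equals 1. A minimal sketch of that idea:

library(dplyr)

test %>%
  group_by(grp) %>%
  mutate(want = if_else(value > threshold & cumsum(value > threshold) == 1,
                        "yes", NA_character_)) %>%
  ungroup()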
Here is a more direct way.
The essential part: with min(which((value > threshold) == TRUE)) we get the position of the first TRUE in our column. Next we use an ifelse to compare that position to the row number and set the values:
library(dplyr)

test %>%
  group_by(grp) %>%
  mutate(want = ifelse(row_number() == min(which((value > threshold) == TRUE)),
                       "yes", NA_character_))
  grp   value threshold want
  <chr> <dbl>     <dbl> <chr>
1 A         1         4 NA
2 A         3         4 NA
3 A         5         4 yes
4 B         1         2 NA
5 B         3         2 yes
6 B         5         2 NA
This is a perfect chance for a data.table answer using its non-equi matching and multiple match handling capabilities:
library(data.table)
setDT(test)
test[test, on=.(grp, value>threshold), mult="first", flag := TRUE]
test
#       grp value threshold   flag
#    <char> <num>     <num> <lgcl>
#1:       A     1         4     NA
#2:       A     3         4     NA
#3:       A     5         4   TRUE
#4:       B     1         2     NA
#5:       B     3         2   TRUE
#6:       B     5         2     NA
Find the "first" matching value in each group that is greater than (>) the threshold and set (:=) it to TRUE.
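If you want the "yes"/NA strings from the desired output rather than a logical flag, one possible follow-up using data.table's fifelse is:

test[, want := fifelse(is.na(flag), NA_character_, "yes")]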

Remove non-last rows with certain condition per group

I have the following dataframe called df (dput below):
  group indicator value
1     A     FALSE     2
2     A     FALSE     1
3     A     FALSE     2
4     A      TRUE     4
5     B     FALSE     5
6     B     FALSE     1
7     B      TRUE     3
I would like to remove the non-last rows with indicator == FALSE per group. This means that in df rows 1, 2, and 5 should be removed, because they are not the last FALSE rows in their group. Here is the desired output:
  group indicator value
1     A     FALSE     2
2     A      TRUE     4
3     B     FALSE     1
4     B      TRUE     3
So I was wondering if anyone knows how to remove non-last rows meeting a certain condition per group in R?
dput of df:
df <- structure(list(group = c("A", "A", "A", "A", "B", "B", "B"),
  indicator = c(FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE),
  value = c(2, 1, 2, 4, 5, 1, 3)), class = "data.frame",
  row.names = c(NA, -7L))
Filter using last(which()) to find the row number of the last FALSE row per group:
library(dplyr)

df %>%
  group_by(group) %>%
  filter(indicator | row_number() == last(which(!indicator))) %>%
  ungroup()
# A tibble: 4 × 3
  group indicator value
  <chr> <lgl>     <dbl>
1 A     FALSE         2
2 A     TRUE          4
3 B     FALSE         1
4 B     TRUE          3
You can do this with lead and check if the following indicator is TRUE.
library(tidyverse)

df <- structure(list(group = c("A", "A", "A", "A", "B", "B", "B"),
  indicator = c(FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE),
  value = c(2, 1, 2, 4, 5, 1, 3)), class = "data.frame",
  row.names = c(NA, -7L))

df |>
  group_by(group) |>
  mutate(slicer = if_else(lead(indicator) == FALSE, 1, 0)) |>
  mutate(slicer = if_else(is.na(slicer), 0, slicer)) |>
  filter(slicer == 0) |>
  select(-slicer)
#> # A tibble: 4 × 3
#> # Groups:   group [2]
#>   group indicator value
#>   <chr> <lgl>     <dbl>
#> 1 A     FALSE         2
#> 2 A     TRUE          4
#> 3 B     FALSE         1
#> 4 B     TRUE          3
Another approach:
library(dplyr)

df %>%
  group_by(group) %>%
  slice_max(cumsum(!indicator))
Note: While this approach covers the example shown and OP's clarification that T always comes last, it will not work in sequences such as T, F, F, T in which you'd like to keep both Ts and not just the one following F.
Output:
# A tibble: 4 x 3
# Groups:   group [2]
  group indicator value
  <chr> <lgl>     <dbl>
1 A     FALSE         2
2 A     TRUE          4
3 B     FALSE         1
4 B     TRUE          3
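To see why this works: cumsum(!indicator) stops increasing after the last FALSE in a group, so its maximum is shared by the last FALSE row and every row after it, and slice_max() keeps all tied rows by default. A quick check on group A:

df %>%
  filter(group == "A") %>%
  mutate(cs = cumsum(!indicator))
#   group indicator value cs
# 1     A     FALSE     2  1
# 2     A     FALSE     1  2
# 3     A     FALSE     2  3
# 4     A      TRUE     4  3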
Some alternatives one could come up with:
"Dumb" solution
should_be_kept <- logical(nrow(df))
for (row in 1:nrow(df)) {
  if (df[row, "indicator"]) {
    should_be_kept[row] <- TRUE
  } else if (row == max(which(!df[, "indicator"] & df$group == df[row, "group"]))) {
    should_be_kept[row] <- TRUE
  } else {
    should_be_kept[row] <- FALSE
  }
}
df[should_be_kept, ]
Solution using a custom function to find the last FALSE indicators from each group
rows_to_keep <- logical(nrow(df)) # a TRUE/FALSE vector with one entry for each row of df
rows_to_keep[df$indicator] <- TRUE # if indicator is TRUE, we mark that row as "selectable"

get_last_false_in_group <- function(df, group) {
  return(max(which(df$group == group & !df$indicator))) # gets the last time the condition inside which() is met
}

# The following chunk does a group-by-group search for the last FALSE indicator.
# There's probably some apply magic that simplifies this but I'm too dumb to come up with it.
groups <- levels(factor(df$group))
for (current_group in groups) {
  rows_to_keep[get_last_false_in_group(df, current_group)] <- TRUE
}

# Now that our rows_to_keep vector is ready, we can filter the corresponding rows
# and get the intended result:
df[rows_to_keep, ]
With the data.table package, it's possible to replace the calls to max(which(...)) with calls to its last() function.
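For instance, a minimal data.table sketch of the same idea (assuming the df from the dput above):

library(data.table)
setDT(df)
# keep TRUE rows plus, per group, the last row where indicator is FALSE
df[, .SD[indicator | seq_len(.N) == last(which(!indicator))], by = group]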

Add values to dataframe based on values in other column

I would like to add values to a column based on non-unique values in another column. For example, say I have a dataframe with a currently empty column that looks like this:
Site  Species Richness
A     0
A     0
A     0
B     0
B     0
I want to assign known species richness values for each site. Let's say site A has species richness 3, and site B has species richness 5. I would like the output to be:
Site  Species Richness
A     3
A     3
A     3
B     5
B     5
How do I input species richness values for specific sites?
I've tried this:
rows_update(df, tibble(Site = "A", richness = 3))
rows_update(df, tibble(Site = "B", richness = 5))
But I get an error message saying "'x' key values are not unique"
Any help would be appreciated!
Here, we could make use of a join (with on) from data.table and assign (:=) the corresponding column of 'SpeciesRichness'. It would be more efficient:
library(data.table)
setDT(df)[data.table(Site = c('A', 'B'), SpeciesRichness = c(3, 5)),
          SpeciesRichness := i.SpeciesRichness, on = .(Site)]
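The update join modifies df by reference; with the sample data the result looks like:

df
#    Site SpeciesRichness
# 1:    A               3
# 2:    A               3
# 3:    A               3
# 4:    B               5
# 5:    B               5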
The issue with ?rows_update is that the by column should uniquely identify rows in both datasets:
The two tables are matched by a set of key variables whose values must uniquely identify each row.
With 'df', the values are replicated 3 times for 'A' and 2 times for 'B'. Using dplyr, we can do a left_join instead:
library(dplyr)

df %>%
  left_join(tibble(Site = c('A', 'B'), new = c(3, 5))) %>%
  transmute(Site, SpeciesRichness = new)
-output
#  Site SpeciesRichness
#1    A               3
#2    A               3
#3    A               3
#4    B               5
#5    B               5
data
df <- structure(list(Site = c("A", "A", "A", "B", "B"),
  SpeciesRichness = c(0L, 0L, 0L, 0L, 0L)), class = "data.frame",
  row.names = c(NA, -5L))
You can create a dataframe with Site and Richness value and join them together.
In base R:
df1 <- data.frame(Site = rep(c('A', 'B'), c(3, 2)))
df2 <- data.frame(Site = c('A', 'B'), richness = c(3, 5))
df1 <- merge(df1, df2)
df1
#  Site richness
#1    A        3
#2    A        3
#3    A        3
#4    B        5
#5    B        5
You can also use match:
df1$richness <- df2$richness[match(df1$Site, df2$Site)]
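match() returns the position of each df1$Site within df2$Site, and those positions index df2$richness row by row. A quick check with the sample data:

match(df1$Site, df2$Site)
# [1] 1 1 1 2 2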
You could define the values, then use case_when:
library(dplyr)

x <- 3
y <- 5

df %>%
  mutate(SpeciesRichness = case_when(Site == "A" ~ x,
                                     Site == "B" ~ y))
Output:
  Site SpeciesRichness
1    A               3
2    A               3
3    A               3
4    B               5
5    B               5

Grouped recurrence by periods over a data.table

I have a dataset with names, dates, and several categorical columns. Let's say
library(data.table)
data <- data.table(name = c('Anne', 'Ben', 'Cal', 'Anne', 'Ben', 'Cal', 'Anne', 'Ben', 'Ben', 'Ben', 'Cal'),
                   period = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3),
                   category = c("A", "A", "A", "B", "B", "B", "A", "B", "A", "A", "B"))
Which looks like this:
name   period category
Anne        1        A
Ben         1        A
Cal         1        A
Anne        1        B
Ben         1        B
Cal         1        B
Anne        2        A
Ben         2        B
Ben         2        A
Ben         3        A
Cal         3        B
I want to compute, for each period, how many names were also present in the previous period, for every group of my categorical variables. The output should be as follows:
period category recurrence_count
     2        A                2  # due to Anne and Ben being on A, period 1
     2        B                1  # due to Ben being on B, period 1
     3        A                1  # due to Ben being on A, period 2
     3        B                0  # no match from B, period 2
I am aware of the .I and .GRP operators in data.table, but I have no idea how to write the notion of 'next group' in the j entry of my statement. I imagine something like this might be a reasonable path, but I can't figure out the correct syntax:
data[, .(recurrence_count = length(intersect(name, name[last(.GRP)]))), by = .(category, period)]
You can first summarize your data by category and period.
previous_period_names <- data[, .(names = list(name)), .(category, period)]
previous_period_names[, next_period := period + 1]
Join your summary with your original data.
data[previous_period_names, names := i.names, on = c('category', 'period==next_period')]
Now count how many of each row's names appear in the summarized names from the previous period:
data[, .(recurrence_count = sum(name %in% unlist(names))), by = .(period, category)]
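With the sample data this also returns zero-count rows for period 1, which has no predecessor; if those are unwanted, restrict the final count:

data[period != 1, .(recurrence_count = sum(name %in% unlist(names))), by = .(period, category)]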
Another data.table alternative. For rows that can have a previous period (period != 1), create such a variable (prev_period := period - 1).
Join the original data with a subset that has values for 'prev_period' (data[data[!is.na(prev_period)]]). Join on 'category', 'period = prev_period', and 'name'.
In the resulting data set, for each 'period' and 'category' (by = .(period = i.period, category)), count the number of names from original data (x.name) that had a match with previous period (length(na.omit(x.name))).
data[period != 1, prev_period := period - 1]
data[data[!is.na(prev_period)], on = c("category", period = "prev_period", "name"),
     .(category, i.period, x.name)][
       , .(n = length(na.omit(x.name))), by = .(period = i.period, category)]
#    period category n
# 1:      2        A 2
# 2:      2        B 1
# 3:      3        A 1
# 4:      3        B 0
One option in base R is to split 'data' by 'category', then loop over the list (lapply) and, for each category, apply Reduce with intersect on 'name' split by 'period' (with accumulate = TRUE). The lengths of that accumulated list give the recurrence counts; create a data.frame with the unique elements of 'period', use Map to add the 'category' from the names of the list output, and rbind the list of data.frames into a single dataset:
library(data.table)

lst1 <- lapply(split(data, data$category), function(x)
  data.frame(period = unique(x$period)[-1],
             recurrence_count = lengths(Reduce(intersect,
               split(x$name, x$period), accumulate = TRUE)[-1])))

rbindlist(Map(cbind, category = names(lst1), lst1))[
  order(period), .(period, category, recurrence_count)]
#    period category recurrence_count
# 1:      2        A                2
# 2:      2        B                1
# 3:      3        A                1
# 4:      3        B                0
Or, using the same logic within data.table: grouped by 'category', split 'name' by 'period' and apply Reduce with intersect:
setDT(data)[, .(period = unique(period),
                recurrence_count = lengths(Reduce(intersect,
                  split(name, period), accumulate = TRUE))), .(category)][
  duplicated(category)]
#    category period recurrence_count
# 1:        A      2                2
# 2:        A      3                1
# 3:        B      2                1
# 4:        B      3                0
Or a similar option in the tidyverse:
library(dplyr)
library(purrr)

data %>%
  group_by(category) %>%
  summarise(recurrence_count = lengths(accumulate(split(name, period),
              intersect)), period = unique(period), .groups = 'drop') %>%
  filter(duplicated(category))
# A tibble: 4 x 3
#   category recurrence_count period
#   <chr>               <int>  <int>
# 1 A                       2      2
# 2 A                       1      3
# 3 B                       1      2
# 4 B                       0      3
data
data <- structure(list(name = c("Anne", "Ben", "Cal", "Anne", "Ben",
  "Cal", "Anne", "Ben", "Ben", "Ben", "Cal"), period = c(1L, 1L,
  1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L), category = c("A", "A", "A",
  "B", "B", "B", "A", "B", "A", "A", "B")), class = "data.frame",
  row.names = c(NA, -11L))
A data.table option
setDT(data)[
  ,
  {
    u <- split(name, period)
    data.table(
      period = unique(period)[-1],
      recurrence_count = lengths(Map(intersect, head(u, -1), tail(u, -1)))
    )
  },
  category
]
gives
   category period recurrence_count
1:        A      2                2
2:        A      3                1
3:        B      2                1
4:        B      3                0

R - Merge two datatables and remove duplicates from older file?

I have two databases: an old one and an update one.
Both have the same structure, with a unique ID.
If a record changes, there's a new record with the same ID and new data.
So after rbind(m1, m2) I have duplicated records.
I can't just remove duplicated IDs, since the data could have been updated.
There's no way to tell which record is newer, besides it being in the old file or the update file.
How can I merge two tables, and if there's a row with a duplicated ID, keep the one from the newer file?
I know I could add a column to both and just ifelse() this, but I'm looking for something more elegant, preferably a one-liner.
It's hard to give the correct answer without sample data, but here is an approach that you can adjust to your data:
# sample data
library(data.table)
dt1 <- data.table(id = 2:3, value = c(2, 4))
dt2 <- data.table(id = 1:2, value = c(2, 6))
# dt1
#    id value
# 1:  2     2
# 2:  3     4
# dt2
#    id value
# 1:  1     2
# 2:  2     6

# row-bind...
DT <- rbindlist(list(dt1, dt2), use.names = TRUE)
#    id value
# 1:  2     2
# 2:  3     4
# 3:  1     2
# 4:  2     6

# deselect duplicated ids from the bottom up,
# assuming the last file in the list contains the updated values
DT[!duplicated(id, fromLast = TRUE), ]
#    id value
# 1:  3     4
# 2:  1     2
# 3:  2     6
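An equivalent one-liner of the kind the OP asked for, assuming a reasonably recent data.table (unique.data.table supports fromLast):

unique(DT, by = "id", fromLast = TRUE)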
Say you have:
old <- data.frame(id = c(1, 2, 3, 4, 5), val = c(21, 22, 23, 24, 25))
new <- data.frame(id = c(1, 4), val = c(21, 27))
so the record with id 4 has changed in the new dataset and 1 is a pure duplicate.
You can use dplyr::anti_join to find old records not in the new dataset and then just use rbind to add the new ones on.
library(dplyr)
joined <- rbind(anti_join(old, new, by = "id"), new)
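With the sample data above, joined keeps ids 2, 3, and 5 from old and takes ids 1 and 4 from new:

joined
#   id val
# 1  2  22
# 2  3  23
# 3  5  25
# 4  1  21
# 5  4  27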
You could use dplyr:
df_new %>%
  full_join(df_old, by = "id") %>%
  transmute(id = id, value = coalesce(value.x, value.y))
returns
   id      value
1   1 0.03432355
2   2 0.28396359
3   3 0.01121692
4   4 0.57214035
5   5 0.67337745
6   6 0.67637187
7   7 0.69178855
8   8 0.83953140
9   9 0.55350251
10 10 0.27050363
11 11 0.28181032
12 12 0.84292569
given
df_new <- structure(list(id = 1:10, value = c(0.0343235526233912,
  0.283963593421504, 0.011216921498999, 0.572140350239351, 0.673377452883869,
  0.676371874753386, 0.691788548836485, 0.839531400706619, 0.553502510068938,
  0.270503633422777)), class = "data.frame", row.names = c(NA, -10L))
df_old <- structure(list(id = c(1, 4, 5, 3, 7, 9, 11, 12), value = c(0.111697669373825,
  0.389851713553071, 0.252179590053856, 0.91874519130215, 0.504653975600377,
  0.616259852424264, 0.281810319051147, 0.842925694771111)), class = "data.frame",
  row.names = c(NA, -8L))
