R mutate a column by group in ifelse

I'd like to mutate a column in an R data.table.
Here's an example of my data.
df <- data.table(id=c(1,1,1,2,2,2,3,3,3),
stopId=c("a","b","c","a","b","c","a","b","c"),
category=c(1,1,1,NA,NA,NA,2,2,2),
result = c('a','a','a','b','b','b','c','c','c'))
My goal is to create a column using an if-else condition.
The column should hold the first value of groupId (stopId in the data above) within each id group.
The point is that, after mutating, the value should be the same for every row of a group.
If category is NA, the result should instead be the last value of groupId.
This is the result I'm looking for.
id groupId category result
1: 1 a 1 a
2: 1 b 1 a
3: 1 c 1 a
4: 2 a NA b
5: 2 c NA b
6: 2 b NA b
7: 3 c 2 c
8: 3 b 2 c
9: 3 a 2 c

With data.table:
df[,result:=fifelse(is.na(category),last(stopId),first(stopId)),by=id][]
id stopId category result
1: 1 a 1 a
2: 1 b 1 a
3: 1 c 1 a
4: 2 a NA c
5: 2 b NA c
6: 2 c NA c
7: 3 a 2 a
8: 3 b 2 a
9: 3 c 2 a
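Since category is constant within each id in the example, the choice between first and last can also be made once per group with a plain if/else that returns a single value, which := then recycles over the group's rows. This is only a sketch: it assumes that a group should use the last stopId whenever any of its category values is NA, which the question does not state explicitly.

```r
library(data.table)

df <- data.table(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                 stopId = c("a", "b", "c", "a", "b", "c", "a", "b", "c"),
                 category = c(1, 1, 1, NA, NA, NA, 2, 2, 2))

# One scalar per group: the last stopId if the group has a missing category
# (assumption: "any NA" decides), otherwise the first stopId.
df[, result := if (anyNA(category)) last(stopId) else first(stopId), by = id]
df
```

If category can mix NA and non-NA values inside one id, the row-wise fifelse version above and this group-wise version would give different answers, so the choice depends on the intended rule.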

As their names suggest, first and last do this (here with dplyr):
df %>%
group_by(id) %>%
mutate(resultt = ifelse(is.na(category), last(stopId), first(stopId)))
id stopId category result resultt
<dbl> <chr> <dbl> <chr> <chr>
1 1 a 1 a a
2 1 b 1 a a
3 1 c 1 a a
4 2 a NA b b
5 2 c NA b b
6 2 b NA b b
7 3 c 2 c c
8 3 b 2 c c
9 3 a 2 c c
Note that the data you provided differs from the expected output above.

We can use .N or 1 to index stopId per group
> df[, result := stopId[ifelse(is.na(category), .N, 1)], id][]
id stopId category result
1: 1 a 1 a
2: 1 b 1 a
3: 1 c 1 a
4: 2 a NA c
5: 2 b NA c
6: 2 c NA c
7: 3 a 2 a
8: 3 b 2 a
9: 3 c 2 a
or shorter
> df[, result := stopId[c(1, .N)[is.na(category) + 1]], id][]
id stopId category result
1: 1 a 1 a
2: 1 b 1 a
3: 1 c 1 a
4: 2 a NA c
5: 2 b NA c
6: 2 c NA c
7: 3 a 2 a
8: 3 b 2 a
9: 3 c 2 a
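The shorter form relies on logical-to-integer coercion: is.na(category) + 1 is 1 for non-missing rows and 2 for missing rows, and that number picks either 1 or .N out of c(1, .N). The sketch below spells this out in two steps. The helper column pos is a hypothetical name introduced only for illustration.

```r
library(data.table)

df <- data.table(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                 stopId = c("a", "b", "c", "a", "b", "c", "a", "b", "c"),
                 category = c(1, 1, 1, NA, NA, NA, 2, 2, 2))

# FALSE + 1L is 1 and TRUE + 1L is 2, so each row selects either the first
# element (1L) or the second element (.N, the group size) of c(1L, .N).
df[, pos := c(1L, .N)[is.na(category) + 1L], by = id]

# Index stopId within each group by that row position.
df[, result := stopId[pos], by = id]
df
```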

Related

Is there some way to keep variable names from .SD + .SDcols together with non-.SD variable names in data.table?

Given a data.table
library(data.table)
DT = data.table(x=rep(c("b","a","c"),each=3), v=c(1,1,1,2,2,1,1,2,2), y=c(1,3,6), a=1:9, b=9:1)
DT
x v y a b
1: b 1 1 1 9
2: b 1 3 2 8
3: b 1 6 3 7
4: a 2 1 4 6
5: a 2 3 5 5
6: a 1 6 6 4
7: c 1 1 7 3
8: c 2 3 8 2
9: c 2 6 9 1
if one does
DT[, .(a, .SD), .SDcols=x:y]
a .SD.x .SD.v .SD.y
1: 1 b 1 1
2: 2 b 1 3
3: 3 b 1 6
4: 4 a 2 1
5: 5 a 2 3
6: 6 a 1 6
7: 7 c 1 1
8: 8 c 2 3
9: 9 c 2 6
the variables from .SDcols become prefixed by .SD. On the other hand, if one tries, as in https://stackoverflow.com/a/62282856/997979,
DT[, c(.(a), .SD), .SDcols=x:y]
V1 x v y
1: 1 b 1 1
2: 2 b 1 3
3: 3 b 1 6
4: 4 a 2 1
5: 5 a 2 3
6: 6 a 1 6
7: 7 c 1 1
8: 8 c 2 3
9: 9 c 2 6
the other variable name (a) is lost. (This is why I am re-asking the question, which I initially marked as a duplicate of the one linked above.)
Is there some way to keep the names from both .SD variables and non-.SD variables?
The goal is to be able to use .() to select variables without quotes while also selecting variables through .SDcols = patterns("...").
Thanks in advance!
Not really sure why, but wrapping .SD in parentheses works ;-)
DT[, .(a, (.SD)), .SDcols=x:y]
# a x v y
# 1: 1 b 1 1
# 2: 2 b 1 3
# 3: 3 b 1 6
# 4: 4 a 2 1
# 5: 5 a 2 3
# 6: 6 a 1 6
# 7: 7 c 1 1
# 8: 8 c 2 3
# 9: 9 c 2 6
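A possibly more explicit alternative is to give the extra column a name inside the list that is concatenated with .SD, since c() on two lists keeps the names of both. This is a sketch rather than a recommendation from the original answer; it writes list() instead of .() and passes .SDcols as a character vector purely to keep the example conservative.

```r
library(data.table)

DT <- data.table(x = rep(c("b", "a", "c"), each = 3),
                 v = c(1, 1, 1, 2, 2, 1, 1, 2, 2),
                 y = rep(c(1, 3, 6), times = 3),
                 a = 1:9, b = 9:1)

# Naming the element (a = a) is what preserves the column name;
# x, v and y keep their names because .SD is itself a named list.
res <- DT[, c(list(a = a), .SD), .SDcols = c("x", "v", "y")]
names(res)
```

The same pattern should carry over to .SDcols = patterns("..."), because only the way .SD is built changes, not how it is combined with the named list.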

Get the group index in a data.table

I have the following data.table:
library(data.table)
DT <- data.table(a = c(1,2,3,4,5,6,7,8,9,10), b = c('A','A','A','B','B', 'C', 'C', 'C', 'D', 'D'), c = c(1,1,1,1,1,2,2,2,2,2))
> DT
a b c
1: 1 A 1
2: 2 A 1
3: 3 A 1
4: 4 B 1
5: 5 B 1
6: 6 C 2
7: 7 C 2
8: 8 C 2
9: 9 D 2
10: 10 D 2
I want to add a column that shows an index within each group of c (starting from 1 in each group) that only increases when the value of b changes. The result wanted is shown below:
Here are two ways to do this:
Using rleid:
library(data.table)
DT[, col := rleid(b), c]
With match + unique:
DT[, col := match(b, unique(b)), c]
# a b c col
# 1: 1 A 1 1
# 2: 2 A 1 1
# 3: 3 A 1 1
# 4: 4 B 1 2
# 5: 5 B 1 2
# 6: 6 C 2 1
# 7: 7 C 2 1
# 8: 8 C 2 1
# 9: 9 D 2 2
#10: 10 D 2 2
We can use factor with levels specified and coerce it to integer
library(data.table)
DT[, col := as.integer(factor(b, levels = unique(b))), c]
-output
DT
# a b c col
# 1: 1 A 1 1
# 2: 2 A 1 1
# 3: 3 A 1 1
# 4: 4 B 1 2
# 5: 5 B 1 2
# 6: 6 C 2 1
# 7: 7 C 2 1
# 8: 8 C 2 1
# 9: 9 D 2 2
#10: 10 D 2 2
Or using base R with rle
with(DT, as.integer(ave(b, c, FUN = function(x)
with(rle(x), rep(seq_along(values), lengths)))))
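The run-based approaches (rleid, rle) and the value-based approaches (match + unique, factor) agree on the example because every value of b is contiguous within its group of c. They differ once a value reappears after a gap, as this small illustration with a made-up vector shows:

```r
library(data.table)

# Hypothetical input in which "A" comes back after "B".
b <- c("A", "B", "A")

run_id   <- rleid(b)             # new id every time the value changes
value_id <- match(b, unique(b))  # same value always maps to the same id

run_id    # 1 2 3
value_id  # 1 2 1
```

Which one is appropriate depends on whether "changes" in the question means a new run or a new distinct value.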

R find intervals in data.table

I want to add a new column with intervals or breakpoints by group. As an example:
This is my data.table:
x <- data.table(a = c(1:8,1:8), b = c(rep("A",8),rep("B",8)))
I already have the breakpoints (row indices):
pos <- data.table(b = c("A","A","B","B"), bp = c(3,5,2,4))
Here I can find the intervals for group "A" with:
findInterval(1:nrow(x[b=="A"]), pos[b=="A"]$bp)
How can I do this for each group, in this case "A" and "B"?
An option is to split the datasets by 'b' column, use Map to loop over the corresponding lists, and apply findInterval
Map(function(u, v) findInterval(seq_len(nrow(u)), v$bp),
split(x, x$b), split(pos, pos$b))
#$A
#[1] 0 0 1 1 2 2 2 2
#$B
#[1] 0 1 1 2 2 2 2 2
or another option is to group by 'b' from 'x', then use findInterval by subsetting the 'bp' from 'pos' by filtering with a logical condition created based on .BY
x[, findInterval(seq_len(.N), pos$bp[pos$b==.BY]), b]
# b V1
# 1: A 0
# 2: A 0
# 3: A 1
# 4: A 1
# 5: A 2
# 6: A 2
# 7: A 2
# 8: A 2
# 9: B 0
#10: B 1
#11: B 1
#12: B 2
#13: B 2
#14: B 2
#15: B 2
#16: B 2
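For reference, findInterval returns, for each input value, the number of breakpoints that are less than or equal to it, which is why rows before the first breakpoint get 0. The sketch below shows this on one group and then stores the per-group result as a column with :=. The column name intvl and the use of .BY$b to pull out the current group value are choices made for this illustration, and the breakpoints are assumed to be sorted within each group, as findInterval requires.

```r
library(data.table)

x <- data.table(a = c(1:8, 1:8), b = c(rep("A", 8), rep("B", 8)))
pos <- data.table(b = c("A", "A", "B", "B"), bp = c(3, 5, 2, 4))

# Number of breakpoints <= each value, for group "A" only.
findInterval(1:8, c(3, 5))
# 0 0 1 1 2 2 2 2

# Same computation per group, assigned by reference as a new column.
# .BY$b is the value of the grouping column for the current group.
x[, intvl := findInterval(seq_len(.N), pos$bp[pos$b == .BY$b]), by = b]
x
```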
Another option using rolling join in data.table:
pos[, ri := rowid(b)]
x[, intvl := fcoalesce(pos[x, on=.(b, bp=a), roll=Inf, ri], 0L)]
output:
a b intvl
1: 1 A 0
2: 2 A 0
3: 3 A 1
4: 4 A 1
5: 5 A 2
6: 6 A 2
7: 7 A 2
8: 8 A 2
9: 1 B 0
10: 2 B 1
11: 3 B 1
12: 4 B 2
13: 5 B 2
14: 6 B 2
15: 7 B 2
16: 8 B 2
We can nest the pos data into list by b and join with x and use findInterval to get corresponding groups.
library(dplyr)
pos %>%
tidyr::nest(data = bp) %>%
right_join(x, by = 'b') %>%
group_by(b) %>%
mutate(interval = findInterval(a, data[[1]][[1]])) %>%
select(-data)
# b a interval
# <chr> <int> <int>
# 1 A 1 0
# 2 A 2 0
# 3 A 3 1
# 4 A 4 1
# 5 A 5 2
# 6 A 6 2
# 7 A 7 2
# 8 A 8 2
# 9 B 1 0
#10 B 2 1
#11 B 3 1
#12 B 4 2
#13 B 5 2
#14 B 6 2
#15 B 7 2
#16 B 8 2

Count how many times values has changed in column using R

Hi, I want to count how many times the value has changed in a column by group, and how many unique values there are in each group. I am sort of getting what I want, but there is an NA observation which I do not want to be counted.
df <- data.frame(x=c("a",'a', "a", "b",'b', "b", "c",'c', "d")
,y=c(1,2,NA,3,3,3,2,1,5))
library(data.table) #data.table_1.9.5
setDT(df)[, wanted := rleid(y), by=x][]
setDT(df)[, count := uniqueN(y),by=x][]
x y wanted count
1: a 1 1 3
2: a 2 2 3
3: a NA 3 3
4: b 3 1 1
5: b 3 1 1
6: b 3 1 1
7: c 2 1 2
8: c 1 2 2
9: d 5 1 1
Desired results:
x y wanted count
1: a 1 1 2
2: a 2 2 2
3: a NA 2 2
4: b 3 1 1
5: b 3 1 1
6: b 3 1 1
7: c 2 1 2
8: c 1 2 2
9: d 5 1 1
I tried rleid(!is.na(y)) but it does not seem to work as I expected. Thank you.
We can replace the NA elements with the previous non-NA element (na.locf), take the rleid of that to get 'wanted', and take the number of unique non-NA elements to get 'count'. Passing na.rm = FALSE to na.locf keeps a leading NA in place, so the result always has the same length as the group.
library(zoo)
setDT(df)[, c('wanted', 'count') := list(rleid(na.locf(y, na.rm = FALSE)), uniqueN(y, na.rm = TRUE)), x]
df
# x y wanted count
#1: a 1 1 2
#2: a 2 2 2
#3: a NA 2 2
#4: b 3 1 1
#5: b 3 1 1
#6: b 3 1 1
#7: c 2 1 2
#8: c 1 2 2
#9: d 5 1 1
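If adding zoo is not desirable, a similar result can probably be obtained with data.table alone by using nafill(type = "locf") for the carry-forward step. This is a sketch under the assumption that the installed data.table version provides nafill and the na.rm argument of uniqueN (both are absent from the old 1.9.5 release mentioned in the question), and that y is numeric.

```r
library(data.table)

df <- data.table(x = c("a", "a", "a", "b", "b", "b", "c", "c", "d"),
                 y = c(1, 2, NA, 3, 3, 3, 2, 1, 5))

# Carry the last non-NA value forward within each group, then number the runs.
df[, wanted := rleid(nafill(y, type = "locf")), by = x]

# Count distinct values per group, ignoring NA.
df[, count := uniqueN(y, na.rm = TRUE), by = x]
df
```

A group that starts with NA keeps that NA after nafill, so it would still form its own run; whether that is acceptable depends on the data.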

R data.table conditional selection

I have a data.table
set.seed(1)
DT <- data.table(tag = rep(LETTERS[1:4],each = 2, times = 3),
year = rep(1:3, each = 8), month = rep(c(1:2), 12),
value = runif(24, 1, 10))
DT
tag year month value
1: A 1 1 3.389578
2: A 1 2 4.349115
3: B 1 1 6.155680
4: B 1 2 9.173870
5: C 1 1 2.815137
6: C 1 2 9.085507
7: D 1 1 9.502077
8: D 1 2 6.947180
9: A 2 1 6.662026
10: A 2 2 1.556076
11: B 2 1 2.853771
12: B 2 2 2.589011
13: C 2 1 7.183206
14: C 2 2 4.456933
15: D 2 1 7.928573
16: D 2 2 5.479293
17: A 3 1 7.458567
18: A 3 2 9.927155
19: B 3 1 4.420317
20: B 3 2 7.997007
21: C 3 1 9.412347
22: C 3 2 2.909283
23: D 3 1 6.865064
24: D 3 2 2.129996
Sort this DT by year, month, and -value:
setorder(DT, year, month, -value)
will produce:
tag year month value
1: D 1 1 9.502077
2: B 1 1 6.155680
3: A 1 1 3.389578
4: C 1 1 2.815137
5: B 1 2 9.173870
6: C 1 2 9.085507
7: D 1 2 6.947180
8: A 1 2 4.349115
9: D 2 1 7.928573
10: C 2 1 7.183206
11: A 2 1 6.662026
12: B 2 1 2.853771
13: D 2 2 5.479293
14: C 2 2 4.456933
15: B 2 2 2.589011
16: A 2 2 1.556076
17: C 3 1 9.412347
18: A 3 1 7.458567
19: D 3 1 6.865064
20: B 3 1 4.420317
21: A 3 2 9.927155
22: B 3 2 7.997007
23: C 3 2 2.909283
24: D 3 2 2.129996
I would like the result to be like following:
tag year month value
1: D 1 1 9.502077
2: B 1 1 6.155680
3: B 1 2 9.173870
4: D 1 2 6.947180
5: D 2 1 7.928573
6: C 2 1 7.183206
7: D 2 2 5.479293
8: C 2 2 4.456933
9: C 3 1 9.412347
10: A 3 1 7.458567
11: A 3 2 9.927155
12: C 3 2 2.909283
The result DT should have the following property: within each year, keep only the two tags that have the largest values in month 1. For example, in year 1 the two tags with the largest values are D and B, so keep D and B for the whole of year 1. Year 2 keeps D and C. The selection is made again at month 1 of every year.
We group by 'year', get the first two elements in 'tag' for 'month' 1 (as it is already ordered), create a logical index using %in% and subset the rows.
DT[, .SD[tag %in% head(tag[month ==1],2)], .(year)]
# year tag month value
# 1: 1 D 1 9.502077
# 2: 1 B 1 6.155680
# 3: 1 B 2 9.173870
# 4: 1 D 2 6.947180
# 5: 2 D 1 7.928573
# 6: 2 C 1 7.183206
# 7: 2 D 2 5.479293
# 8: 2 C 2 4.456933
# 9: 3 C 1 9.412347
#10: 3 A 1 7.458567
#11: 3 A 2 9.927155
#12: 3 C 2 2.909283
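The one-liner above depends on DT already being sorted by descending value within each year and month. A variant that does not rely on the prior setorder call is to first build a small table of the tags to keep and then join back to it. The sketch below uses a small hand-made table instead of the random data so the outcome is easy to check by eye; the names keep and res are hypothetical.

```r
library(data.table)

# Small deterministic stand-in for the question's data (values are made up).
DT <- data.table(tag   = rep(c("A", "B", "C"), each = 2, times = 2),
                 year  = rep(1:2, each = 6),
                 month = rep(1:2, times = 6),
                 value = c(5, 1, 9, 2, 7, 8, 3, 9, 1, 9, 2, 9))

# For each year, the two tags with the largest month-1 values.
keep <- DT[month == 1, .(tag = head(tag[order(-value)], 2)), by = year]

# Join back: keep every row of DT whose (year, tag) pair was selected.
res <- DT[keep, on = .(year, tag)]
res
```

Here year 1 keeps B and C (month-1 values 9 and 7) and year 2 keeps A and C (3 and 2). The rows come back in the order of keep, so a final setorder may be wanted.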
