Frequency of two columns counting NAs in one column as zero frequency - r

Part 1 I have the following data table. I want to make a new column which contains the number of occurrences of each id where there is any style value, except for NA. The main problem is that I don't know how to deal with the NA. Currently, when NA is present I get a frequency of 1.
id style
1 A
1 A
2 A
2 B
3 NA
4 A
4 C
5 NA
I tried using the following, but it still counts NA values
dt[, allele_count := .N, by = list(pat_id, style)]
The desired data table would be as follows:
id style count
1 A 2
1 A 2
2 A 2
2 B 2
3 NA 0
4 A 4
4 B 4
4 B 4
4 C 4
5 NA 0
Part2 I would also like to be able to add another column which would have the number of occurrences of each id with a certain style value.
id style count2
1 A 2
1 A 2
2 A 1
2 B 1
3 NA 0
4 A 1
4 B 2
4 B 2
4 C 1
5 NA 0
Bonus Question: Instead of looking at how many times an id occurs with a given style value as in Part2 , How can you calculate the number of different style values for each id, as follows.
id style count3
1 A 1
1 A 1
2 A 2
2 B 2
3 NA 0
4 A 3
4 B 3
4 B 3
4 C 3
5 NA 0

Here's one possibility. Basically we use a row subset to assign the new columns, then replace the NA values in all three new columns with zero at the end.
nna <- !is.na(dt$style) ## so we don't have to call it four times
dt[nna, count := .N, by = id][nna, count2 := .N, by = .(id, style)][
nna, count3 := uniqueN(style), by = id][!nna, names(dt)[3:5] := 0L]
which results in
id style count count2 count3
1: 1 A 2 2 1
2: 1 A 2 2 1
3: 2 A 2 1 2
4: 2 B 2 1 2
5: 3 NA 0 0 0
6: 4 A 2 1 2
7: 4 C 2 1 2
8: 5 NA 0 0 0
Or you can simplify this down to the following, then reorder the columns if necessary.
dt[nna, c("count", "count3") := .(.N, uniqueN(style)), by = id][
nna, count2 := .N, by = .(id, style)][!nna, names(dt)[3:5] := 0L]
Note that this method is very similar to the other posted answer. I am not sure which of the two is the preferred method, row subset or if() statement.

This is a possible solution :
library(data.table)
dt <- data.table(id=c(1,1,2,2,3,4,4,5),
style=c('A','A','A','B',NA,'A','C',NA))
# count = number of ids having ALL styles defined
dt[, count := if(any(is.na(style))) 0L else .N, by = id]
# count2 = number of id-style occurrences (0 if style = NA)
dt[, count2 := if(is.na(style)) 0L else .N, by = .(id, style)]
> dt
id style count count2
1: 1 A 2 2
2: 1 A 2 2
3: 2 A 2 1
4: 2 B 2 1
5: 3 NA 0 0
6: 4 A 2 1
7: 4 C 2 1
8: 5 NA 0 0
BONUS:
dt[, count3 := uniqueN(na.omit(style)), by = id]
> dt
id style count count2 count3
1: 1 A 2 2 1
2: 1 A 2 2 1
3: 2 A 2 1 2
4: 2 B 2 1 2
5: 3 NA 0 0 0
6: 4 A 2 1 2
7: 4 C 2 1 2
8: 5 NA 0 0 0

Related

R Data Table add rows to each group if not existing [duplicate]

This question already has answers here:
data.table equivalent of tidyr::complete()
(3 answers)
Closed 29 days ago.
I have a data table with multiple groups. Each group I'd like to fill with rows containing the values in vals if they are not already present. Additional columns should be filled with NAs.
DT = data.table(group = c(1,1,1,2,2,3,3,3,3), val = c(1,2,4,2,3,1,2,3,4), somethingElse = rep(1,9))
vals = data.table(val = c(1,2,3,4))
What I want:
group val somethingElse
1: 1 1 1
2: 1 2 1
3: 1 3 NA
4: 1 4 1
5: 2 1 NA
6: 2 2 1
7: 2 3 1
8: 2 4 NA
9: 3 1 1
10: 3 2 1
11: 3 3 1
12: 3 4 1
The order of val does not necessarily have to be increasing, the values may also be appened at the beginning/end of each group.
I don't know how to approach this problem. I've thought about using rbindlist(...,fill = TRUE), but then the values will be simply appended.
I think some expression with DT[, lapply(...), by = c("group")] might be useful here but I have no idea how to check if a value already exists.
You can use a cross-join:
setDT(DT)[
CJ(group = group, val = val, unique = TRUE),
on = .(group, val)
]
group val somethingElse
1: 1 1 1
2: 1 2 1
3: 1 3 NA
4: 1 4 1
5: 2 1 NA
6: 2 2 1
7: 2 3 1
8: 2 4 NA
9: 3 1 1
10: 3 2 1
11: 3 3 1
12: 3 4 1
Another way to solve your problem:
DT[, .SD[vals, on="val"], by=group]
group val somethingElse
1: 1 1 1
2: 1 2 1
3: 1 3 NA
4: 1 4 1
5: 2 1 NA
6: 2 2 1
7: 2 3 1
8: 2 4 NA
9: 3 1 1
10: 3 2 1
11: 3 3 1
12: 3 4 1
# or
DT[CJ(group, val, unique=TRUE), on=.NATURAL]
I will just add this answer for a slightly more complex case:
#Raw Data
DT = data.table(group = c(1,1,2,2,2,3,3,3,3),
x = c(1,2,1,3,4,1,2,3,4),
y = c(2,4,2,6,8,2,4,6,8),
somethingElse = rep(1,9))
#allowed combinations of x and y
DTxy = data.table(x = c(1,2,3,4), y = c(2,4,6,8))
Here, I want to add all x,y combinations from DTxy to each group from DT, if not already present.
I've wrote a function to work for subsets.
#function to join subsets on two columns (here: x,y)
DTxyJoin = function(.SD, xy){
.SD = .SD[xy, on = .(x,y)]
return(.SD)
}
I then applied the function to each group:
#add x and y to each group if missing
DTres = DT[, DTxyJoin(.SD, DTxy), by = c("group")]
The Result:
group x y somethingElse
1: 1 1 2 1
2: 1 2 4 1
3: 1 3 6 NA
4: 1 4 8 NA
5: 2 1 2 1
6: 2 2 4 NA
7: 2 3 6 1
8: 2 4 8 1
9: 3 1 2 1
10: 3 2 4 1
11: 3 3 6 1
12: 3 4 8 1

Reshape complex time to event data in R

I have the following data frame where I have the beginning of the time, the end of the time AND a Date where the individual got the observations A or B.
df =
id Date Start_Date End_Date A B
1 2 1 4 1 0
1 3 1 4 0 1
2 3 2 9 1 0
2 6 2 9 1 0
2 7 2 9 1 0
2 2 2 9 0 1
What I want to do is to order the time chronologically (create a new Time variable), and fill the information A and B accordingly, that is, if the individual got A at time 2 it should also have at the following up times (i.e. 3 until End_Time). Ideally, the interval times are not regular but follow the changes in Date (see individual 2):
Cool_df =
id Time A B
1 1 0 0
1 2 1 0
1 3 1 1
1 4 1 1
2 2 0 1
2 3 1 1
2 6 1 1
2 7 1 1
2 9 1 1
Any recommendation highly appreciated because I do not know where to start.
Here is a data.table approach
library(data.table)
setDT(df)
# Summarise dates
ans <- df[, .(Date = unique(c(min(Start_Date), Date, max(End_Date)))), by = .(id)]
# Join
ans[ df[A==1,], A := 1, on = .(id,Date)]
ans[ df[B==1,], B := 1, on = .(id,Date)]
#fill down NA's using "locf"
cols.to.fill = c("A","B")
ans[, (cols.to.fill) := lapply(.SD, nafill, type = "locf"),
by = .(id), .SDcols = cols.to.fill]
#fill other NA with zero
ans[is.na(ans)] <- 0
# id Date A B
# 1: 1 1 0 0
# 2: 1 2 1 0
# 3: 1 3 1 1
# 4: 1 4 1 1
# 5: 2 2 0 1
# 6: 2 3 1 1
# 7: 2 6 1 1
# 8: 2 7 1 1
# 9: 2 9 1 1

Set values of a column to NA after a given point

I have a dataset like this:
ID NUMBER X
1 5 2
1 3 4
1 6 3
1 2 5
2 7 3
2 3 5
2 9 3
2 4 2
and I'd like to set values of variable X to NA after the variable NUMBER increses (even though after it decreases again) for each ID, and obtaining:
ID NUMBER X
1 5 2
1 3 4
1 6 NA
1 2 NA
2 7 3
2 3 5
2 9 NA
2 4 NA
How can I do it?
Thanks for your help!
Surely not the most elegant solution, but it is quite intuitive:
library(data.table)
setDT(d)
d[, n := ifelse(NUMBER > shift(NUMBER, 1, "lag"),1,0), by=ID]
d[is.na(n), n := 0]
d[, n := cumsum(n), by=ID]
d[n>0, X := NA ]
d
ID NUMBER X n
1: 1 5 2 0
2: 1 3 4 0
3: 1 6 NA 1
4: 1 2 NA 1
5: 2 7 3 0
6: 2 3 5 0
7: 2 9 NA 1
8: 2 4 NA 1
You can do this with dplyr package. If your dataframe is called df then you can use this code:
df %>% group_by(ID) %>%
mutate ( X = c(X[1:(min(which(diff(Number) > 0)))],rep("NA",length(X)-(min(which(diff(Number) > 0)))))) %>%
as.data.frame()
I first grouped them with ID and then I found the first increasing number with diff and which.

How do I find the sum of inequalities by unique group/subgroup pairs?

Suppose I am working with the following data.table:
dta <- setDT(
data.frame(
id = c("A","A","A","B","B","C","C","C"),
subid = c(1,1,2,1,2,1,1,1),
x1 = c(1,1,3,1,2,3,3,3),
x2 = c(3,3,1,1,1,3,3,3)
)
)
> dta
id subid x1 x2
1: A 1 1 3
2: A 1 1 3
3: A 2 3 1
4: B 1 1 1
5: B 2 2 1
6: C 1 3 3
7: C 1 3 3
8: C 1 3 3
For each unique id-subid pairing, I would like to find the total number of times that x1<x2 and the total number of times that x1>=x2, and have those counts be added to the data.table as new columns/variables but aggregated to the id level.
The outcome would look something like:
id subid x1 x2 lt gt
1: A 1 1 3 1 1
2: A 1 1 3 1 1
3: A 2 3 1 1 1
4: B 1 1 1 0 2
5: B 2 2 1 0 2
6: C 1 3 3 0 1
7: C 1 3 3 0 1
8: C 1 3 3 0 1
For example, of the two unique id-subidparings for id="A", one has x1<x2 and one has x1>x2, which means that for A the variable for "less-than" has a value of 1 (i.e. dta$lt[dta$id==A] <- 1), and the same for "greater-than" (dta$gt[dta$id==A] <- 1).
I have been searching for a solution to this but have not had much luck. I have found solutions to similar problems (e.g. counting number of unique observations by unique pairings), but have not been able to modify them to suit my needs. In particular, I am struggling to aggregate the count from the id-subid level to the id level. (It could be that I'm not exactly sure how to search for -- or even word -- this question.)
I've been able to do this using nested loops on a data frame, but I suspect there is a more efficient way of doing it. In particular, I am curious about doing this using data.table.
A possible solution:
dta[, c('lt', 'gt') := unique(.SD)[, .(sum(x1 < x2), sum(x1 >= x2))], by = .(id)]
which gives:
> dta
id subid x1 x2 lt gt
1: A 1 1 3 1 1
2: A 1 1 3 1 1
3: A 2 3 1 1 1
4: B 1 1 1 0 2
5: B 2 2 1 0 2
6: C 1 3 3 0 1
7: C 1 3 3 0 1
8: C 1 3 3 0 1

Count how many times values has changed in column using R

Hi i want to count how many times value has changed in a column by the group and how many unique values was in a group, and i sort of getting what i want, but it has a NA observation which i do not want to be counted.
df <- data.frame(x=c("a",'a', "a", "b",'b', "b", "c",'c', "d")
,y=c(1,2,NA,3,3,3,2,1,5))
library(data.table) #data.table_1.9.5
setDT(df)[, wanted := rleid(y), by=x][]
setDT(df)[, count := uniqueN(y),by=x][]
x y wanted count
1: a 1 1 3
2: a 2 2 3
3: a NA 3 3
4: b 3 1 1
5: b 3 1 1
6: b 3 1 1
7: c 2 1 2
8: c 1 2 2
9: d 5 1 1`
Desired results:
x y wanted count
1: a 1 1 2
2: a 2 2 2
3: a NA 2 2
4: b 3 1 1
5: b 3 1 1
6: b 3 1 1
7: c 2 1 2
8: c 1 2 2
9: d 5 1 1
I tried rleid(!is.na(y)) but seems not to work as i expected. Thank you.
We can replace the NA elements with previous non-NA element (na.locf), take the rleid on that to get the 'wanted' and also get the length of unique elements that are not NA to get the 'count'
library(zoo)
setDT(df)[, c('wanted', 'count') := list(rleid(na.locf(y)), uniqueN(y, na.rm = TRUE)), x]
df
# x y wanted count
#1: a 1 1 2
#2: a 2 2 2
#3: a NA 2 2
#4: b 3 1 1
#5: b 3 1 1
#6: b 3 1 1
#7: c 2 1 2
#8: c 1 2 2
#9: d 5 1 1

Resources