New variable for cumulative count of one variable in another variable - r

I have previously asked a question about how to create a variable which counts instances of one ID number appearing in another column (see here). However, I have now realised that I need the new columns to hold a cumulative sum of the number of times the victim (in a particular crime incident) has been recorded as a suspect (in incidents occurring previous to that incident), and one which counts the number of times the suspect (in a particular crime incident) has been recorded as a victim (in incidents occurring previous to that incident). My data is already ordered by date so all I need is a way of cumulatively counting.
Here's a simplified version of my data:
  s.uid v.uid
1     1     3
2     2     9
3     3     8
4     4     5
5     5     2
6     9     2
7    NA     7
8     5     9
9     9     5
And here is what I want to create:
  s.uid v.uid s.in.v v.in.s
1     1     3      0      0
2     2     9      0      0
3     3     8      1      0
4     4     5      0      0
5     5     2      1      1
6     9     2      1      1
7    NA     7     NA      0
8     5     9      1      1
9     9     5      2      2
Note that, where there is an NA, I would like the NA to be preserved. I'm currently trying to work in tidyverse and piping where possible, so I would prefer answers in that kind of format, but I'm open to any solution!
I tried adapting user438383's answer to my previous question but it threw an error (I'm quite new to R so I wasn't sure what this meant!):
# DUMMY DATA TEST
s.uid <- c(1:5, 9, NA, 5, 9)
v.uid <- c(3, 9, 8, 5, 2, 2, 7, 9, 5)
dat <- tibble(s.uid, v.uid)
dat %>%
  group_by(s.uid) %>%
  mutate(s.in.v = cumsum(dat$v.uid %in% s.uid)) %>%
  group_by(v.uid) %>%
  mutate(v.in.s = cumsum(dat$s.uid %in% v.uid))
Error: Problem with `mutate()` input `s.in.v`.
x Input `s.in.v` can't be recycled to size 1.
ℹ Input `s.in.v` is `cumsum(dat$v.uid %in% s.uid)`.
ℹ Input `s.in.v` must be size 1, not 9.
ℹ The error occurred in group 1: s.uid = 1.

One approach is to combine rowwise() with the magrittr pipe placeholder (the dot): for each row, compare that row's ID with the other column up to and including the current row, and sum the number of TRUE values.
library(dplyr)

dat %>%
  mutate(n = row_number()) %>%   # row index, used to subset the other column
  rowwise() %>%
  mutate(s.in.v = ifelse(is.na(s.uid), NA, sum(s.uid == .$v.uid[1:n], na.rm = TRUE)),
         v.in.s = ifelse(is.na(v.uid), NA, sum(v.uid == .$s.uid[1:n], na.rm = TRUE))) %>%
  ungroup() %>%
  select(-n)
# A tibble: 9 x 4
  s.uid v.uid s.in.v v.in.s
  <dbl> <dbl>  <int>  <int>
1     1     3      0      0
2     2     9      0      0
3     3     8      1      0
4     4     5      0      0
5     5     2      1      1
6     9     2      1      1
7    NA     7     NA      0
8     5     9      1      1
9     9     5      2      2
Note that this is likely not computationally efficient.
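The same counting logic can also be written in base R, without rowwise() or the placeholder. This is only a sketch: the helper name cum_count_in is made up for illustration, and it assumes the rows are already in date order, as stated in the question.

```r
# Dummy data from the question (assumed to be sorted by date already)
s.uid <- c(1:5, 9, NA, 5, 9)
v.uid <- c(3, 9, 8, 5, 2, 2, 7, 9, 5)
dat <- data.frame(s.uid, v.uid)

# Hypothetical helper: for each row i, count how often key[i] occurs in
# pool[1:i]. Rows where key is NA stay NA.
cum_count_in <- function(key, pool) {
  vapply(seq_along(key), function(i) {
    if (is.na(key[i])) return(NA_integer_)
    sum(pool[seq_len(i)] == key[i], na.rm = TRUE)
  }, integer(1))
}

dat$s.in.v <- cum_count_in(dat$s.uid, dat$v.uid)
dat$v.in.s <- cum_count_in(dat$v.uid, dat$s.uid)
dat
```

Like the rowwise version, this rescans the other column for every row, so the work still grows quadratically with the number of rows.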

Related

Nested if and max if statement in R error: unused argument

I am still quite new to R, and I would like to find the largest value of a variable within a group when another variable is positive or negative. Specifically, if l_diff<0, I would like to find the biggest value of t in each group when pos_n<0. If l_diff>0, I would like to find the biggest value of t in each group when pos_p>0. Example data:
l_diff <- c(-1,-1,-1,-1,-1,-1,1,1,1,1,1)
pos_n <- c(2,2,1,-4,-2,-2,2,1,-5,4,8)
pos_p <- c(3,4,-5,6,-7,2,-3,3,2,1,4)
t <- c(5,7,3,1,6,2,7,5,3,2,1)
group <- c(1,1,1,1,1,1,2,2,2,2,2)
db <- data.frame(cbind(l_diff,pos_n, pos_p, t, group))
Desired output:
cmax<- c(6,6,6,6,6,6,5,5,5,5,5)
I tried the following:
db<-db %>%
group_by((group)) %>%
mutate(ifelse(l_diff<0, t1 = max(t[pos_n<0], ifelse(l_diff>0, t1 = max(t[pos_p >0])))))
But I get the following error:
Error: Problem with mutate() input ..1. x unused argument (t1 =
max(t[pos_n < 0], ifelse(l_diff > 0, t1 = max(t[pos_p > 0])))) i
Input ..1 is ifelse(...). i The error occurred in group 1: (group)
= 1.
Any idea what may be wrong or any other suggestions?
With ifelse(), the assignment (t1 = ...) has to go outside the call, and all three arguments must be supplied:
ifelse(test, yes, no)
Here, the no argument was missing from the nested second ifelse(). That would not be an issue with case_when(), which returns NA by default when no condition matches.
library(dplyr)
db %>%
  group_by(group) %>%
  mutate(t1 = ifelse(l_diff < 0, max(t[pos_n < 0]),
              ifelse(l_diff > 0, max(t[pos_p > 0]), NA))) %>%
  ungroup()
-output
# A tibble: 11 x 6
# l_diff pos_n pos_p t group t1
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 -1 2 3 5 1 6
# 2 -1 2 4 7 1 6
# 3 -1 1 -5 3 1 6
# 4 -1 -4 6 1 1 6
# 5 -1 -2 -7 6 1 6
# 6 -1 -2 2 2 1 6
# 7 1 2 -3 7 2 5
# 8 1 1 3 5 2 5
# 9 1 -5 2 3 2 5
#10 1 4 1 2 2 5
#11 1 8 4 1 2 5
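For comparison, here is how the case_when() variant mentioned above might look. It is a sketch that reuses the example data from the question and assumes a reasonably recent dplyr; rows matching neither condition would get NA.

```r
library(dplyr)

# Example data from the question
db <- data.frame(
  l_diff = c(-1, -1, -1, -1, -1, -1, 1, 1, 1, 1, 1),
  pos_n  = c(2, 2, 1, -4, -2, -2, 2, 1, -5, 4, 8),
  pos_p  = c(3, 4, -5, 6, -7, 2, -3, 3, 2, 1, 4),
  t      = c(5, 7, 3, 1, 6, 2, 7, 5, 3, 2, 1),
  group  = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2)
)

res <- db %>%
  group_by(group) %>%
  mutate(t1 = case_when(
    l_diff < 0 ~ max(t[pos_n < 0]),  # negative l_diff: largest t where pos_n < 0
    l_diff > 0 ~ max(t[pos_p > 0])   # positive l_diff: largest t where pos_p > 0
  )) %>%
  ungroup()

res$t1
```

Both right-hand sides are evaluated for every group, so a group with no rows satisfying one of the subsets would trigger a warning from max() on an empty vector, even when that branch is never selected.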

Adding NA's where data is missing [duplicate]

I have a dataset that look like the following
id = c(1,1,1,2,2,2,3,3,4)
cycle = c(1,2,3,1,2,3,1,3,2)
value = 1:9
data.frame(id,cycle,value)
> data.frame(id,cycle,value)
id cycle value
1 1 1 1
2 1 2 2
3 1 3 3
4 2 1 4
5 2 2 5
6 2 3 6
7 3 1 7
8 3 3 8
9 4 2 9
so basically there is a variable called id that identifies the sample, a variable called cycle which identifies the timepoint, and a variable called value that identifies the value at that timepoint.
As you see, sample 3 does not have cycle 2 data and sample 4 is missing cycle 1 and 3 data. What I want to know is there a way to run a command outside of a loop to get the data to place NA's where there is no data. So I would like for my dataset to look like the following:
> data.frame(id,cycle,value)
id cycle value
1 1 1 1
2 1 2 2
3 1 3 3
4 2 1 4
5 2 2 5
6 2 3 6
7 3 1 7
8 3 2 NA
9 3 3 8
10 4 1 NA
11 4 2 9
12 4 3 NA
I am able to solve this problem with a lot of loops and if statements but the code is extremely long and cumbersome (I have many more columns in my real dataset).
Also, the number of samples I have is very large so I need something that is generalizable.
Using merge and expand.grid, we can come up with a solution. expand.grid creates a data.frame with all combinations of the supplied vectors (so you'd supply it with the id and cycle variables). By merging to your original data (and using all.x = T, which is like a left join in SQL), we can fill in those rows with missing data in dat with NA.
id = c(1,1,1,2,2,2,3,3,4)
cycle = c(1,2,3,1,2,3,1,3,2)
value = 1:9
dat <- data.frame(id,cycle,value)
grid_dat <- expand.grid(id = 1:4,
                        cycle = 1:3)
# or you could do (HT @jogo):
# grid_dat <- expand.grid(id = unique(dat$id),
#                         cycle = unique(dat$cycle))
merge(x = grid_dat, y = dat, by = c('id','cycle'), all.x = TRUE)
id cycle value
1 1 1 1
2 1 2 2
3 1 3 3
4 2 1 4
5 2 2 5
6 2 3 6
7 3 1 7
8 3 2 NA
9 3 3 8
10 4 1 NA
11 4 2 9
12 4 3 NA
A solution based on the tidyverse (complete() comes from the tidyr package).
library(tidyverse)
# Create example data frame
id <- c(1, 1, 1, 2, 2, 2, 3, 3, 4)
cycle <- c(1, 2, 3, 1, 2, 3, 1, 3, 2)
value <- 1:9
dt <- data.frame(id, cycle, value)
# Complete the combination between id and cycle
dt2 <- dt %>% complete(id, cycle)
Here is a solution with data.table doing a cross join:
library("data.table")
d <- data.table(id = c(1,1,1,2,2,2,3,3,4), cycle = c(1,2,3,1,2,3,1,3,2), value = 1:9)
d[CJ(id=id, cycle=cycle, unique=TRUE), on=.(id,cycle)]

Replacing the last value within groups with different values

My question is similar to this post, but the difference is instead of replacing the last value within each group/id with all 0's, different values are used to replace the last value within each group/id.
Here is an example (I borrowed it from the above link):
id Time
1 1 3
2 1 10
3 1 1
4 1 0
5 1 9999
6 2 0
7 2 9
8 2 500
9 3 0
10 3 1
In the above link, the last value within each group/id was replaced by a zero, using something like:
df %>%
group_by(id) %>%
mutate(Time = c(Time[-n()], 0))
And the output was
id Time
1 1 3
2 1 10
3 1 1
4 1 0
5 1 0
6 2 0
7 2 9
8 2 0
9 3 0
10 3 0
In my case, I would like the last value within each group/id to be replaced by a different value. Originally, the last value within each group/id was 9999, 500, and 1. Now I would like: 9999 is replaced by 5, 500 is replaced by 12, and 1 is replaced by 92. The desired output is:
id Time
1 1 3
2 1 10
3 1 1
4 1 0
5 1 5
6 2 0
7 2 9
8 2 12
9 3 0
10 3 92
I tried this one:
df %>%
group_by(id) %>%
  mutate(Time = replace(Time, n(), c(5,12,92)))
but it did not work.
This can be solved with almost the same solution as the one I posted in the linked question: just replace 0L with the desired values.
library(data.table)
indx <- setDT(df)[, .I[.N], by = id]$V1
df[indx, Time := c(5L, 12L, 92L)]
df
# id Time
# 1: 1 3
# 2: 1 10
# 3: 1 1
# 4: 1 0
# 5: 1 5
# 6: 2 0
# 7: 2 9
# 8: 2 12
# 9: 3 0
# 10: 3 92
So to add some explanations:
.I is like row_number() or 1:n() in dplyr on ungrouped data, i.e. 1:nrow(df) in base R.
.N is like n() in dplyr, i.e. the size of a given group (or of the whole data set). So when I run .I[.N] by group, I'm retrieving the global index of the last row of each group.
The next step is just to use this index as a row index within df while assigning the desired values to Time by reference with the := operator.
Edit
Per the OP's request, here's a possible dplyr solution. Your original attempt doesn't work because mutate() runs once per group, so all three values were being passed to every group.
The only way I can think of is to first calculate the group sizes, then ungroup, and then mutate on the cumulative sum of these sizes, something along these lines:
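library(dplyr)
df %>%
  group_by(id) %>%
  mutate(indx = n()) %>%   # size of each group
  ungroup() %>%
  # Caveat: unique(indx) collapses repeated group sizes, so this relies on
  # every group having a different size and on rows being sorted by id.
  mutate(Time = replace(Time, cumsum(unique(indx)), c(5, 12, 92))) %>%
  select(-indx)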
# Source: local data frame [10 x 2]
#
# id Time
# 1 1 3
# 2 1 10
# 3 1 1
# 4 1 0
# 5 1 5
# 6 2 0
# 7 2 9
# 8 2 12
# 9 3 0
# 10 3 92
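The last row of each group can also be located without any grouping step, which sidesteps the question of group sizes altogether. Below is a base R sketch using duplicated() with fromLast = TRUE. It assumes the rows of each id are contiguous and that the replacement values are listed in the order in which the groups' last rows appear in the data.

```r
# Example data from the question
df <- data.frame(id   = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3),
                 Time = c(3, 10, 1, 0, 9999, 0, 9, 500, 0, 1))

# TRUE for the last occurrence of each id, scanning from the bottom up
is_last <- !duplicated(df$id, fromLast = TRUE)

# Assign one replacement value per group, in row order
df$Time[is_last] <- c(5, 12, 92)
df
```

Because the index comes from the id column itself rather than from group sizes, groups of equal size are handled the same way as any others.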
Another way using data.table would be to create another data.table which contains the values to be replaced with for a given id, and then join and update by reference (simultaneously).
require(data.table) # v1.9.5+ (for 'on = ' feature)
replace = data.table(id = 1:3, val = c(5L, 12L, 92L)) # from @David
setDT(df)[replace, Time := val, on = "id", mult = "last"]
#     id Time
#  1:  1    3
#  2:  1   10
#  3:  1    1
#  4:  1    0
#  5:  1    5
#  6:  2    0
#  7:  2    9
#  8:  2   12
#  9:  3    0
# 10:  3   92
In data.table, joins are considered as an extension of subsets. It's natural to think of doing whatever operation we do on subsets also on joins. Both operations do something on some rows.
For each replace$id, we find the last matching row (mult = "last") in df$id, and update that row with the corresponding val.
Installation instructions for v1.9.5 here. Hope this helps.

Select first observed data and utilize mutate

I am running into an issue with my data: for each individual id, I want to compute the difference between the first observed score and the last observed score.
The problem with asking for the first observation minus the last observation is that sometimes the first observation is missing.
Is there any way to ask for the first observed score for each individual, skipping any missing data?
I built the below df to illustrate my problem.
help <- data.frame(id = c(5,5,5,5,5,12,12,12,17,17,20,20,20),
ob = c(1,2,3,4,5,1,2,3,1,2,1,2,3),
score = c(NA, 2, 3, 4, 3, 7, 3, 4, 3, 4, NA, 1, 4))
id ob score
1 5 1 NA
2 5 2 2
3 5 3 3
4 5 4 4
5 5 5 3
6 12 1 7
7 12 2 3
8 12 3 4
9 17 1 3
10 17 2 4
11 20 1 NA
12 20 2 1
13 20 3 4
And what I am hoping to run is code that will give me...
id ob score es
1 5 1 NA -1
2 5 2 2 -1
3 5 3 3 -1
4 5 4 4 -1
5 5 5 3 -1
6 12 1 7 3
7 12 2 3 3
8 12 3 4 3
9 17 1 3 -1
10 17 2 4 -1
11 20 1 NA -3
12 20 2 1 -3
13 20 3 4 -3
I am attempting to work out of dplyr and I understand the use of the 'group_by' command, however, not sure how to 'select' only first observed scores and then mutate to create es.
I would use first() and last() (both dplyr functions) and na.omit() (from the default stats package).
First, I would make sure your score column is a numeric column with proper NA values (not strings as in your example)
help <- data.frame(id = c(5,5,5,5,5,12,12,12,17,17,20,20,20),
ob = c(1,2,3,4,5,1,2,3,1,2,1,2,3),
score = c(NA, 2, 3, 4, 3, 7, 3, 4, 3, 4, NA, 1, 4))
then you can do
library(dplyr)
help %>% group_by(id) %>% arrange(ob) %>%
  mutate(es = first(na.omit(score)) - last(na.omit(score)))
library(dplyr)
temp <- help %>% group_by(id) %>%
arrange(ob) %>%
filter(!is.na(score)) %>%
mutate(es = first(score) - last(score)) %>%
select(id, es) %>%
distinct()
help %>% left_join(temp)
This solution is a little verbose, but only because it relies on a couple of helper functions, FIRST and LAST:
# The position (index) of the last value that evaluates to TRUE.
LAST <- function (x, none = NA) {
  out <- FIRST(rev(x), none = none)
  if (identical(none, out)) {
    return(none)
  }
  else {
    return(length(x) - out + 1)
  }
}
# The position (index) of the first value that evaluates to TRUE.
FIRST <- function (x, none = NA)
{
  x[is.na(x)] <- FALSE
  if (any(x))
    return(which.max(x))
  else return(none)
}
# returns the last non-missing value minus the first non-missing value
# (swap the two operands to get first minus last)
diff2 <- function(x)
  x[LAST(!is.na(x))] - x[FIRST(!is.na(x))]
library(dplyr)
help %>%
group_by(id) %>%
arrange(ob) %>%
summarise(diff = diff2(score))
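For readers who want to avoid extra packages, the "first observed minus last observed" calculation can be sketched in base R with ave(). The names scores and first_minus_last are made up for this illustration (the data frame is renamed so that it does not mask the help() function), and the sign convention follows the desired output in the question.

```r
# Example data from the question, under a different name
scores <- data.frame(id    = c(5, 5, 5, 5, 5, 12, 12, 12, 17, 17, 20, 20, 20),
                     ob    = c(1, 2, 3, 4, 5, 1, 2, 3, 1, 2, 1, 2, 3),
                     score = c(NA, 2, 3, 4, 3, 7, 3, 4, 3, 4, NA, 1, 4))

# Make sure observations are in order within each id
scores <- scores[order(scores$id, scores$ob), ]

# Hypothetical helper: first non-missing value minus last non-missing value,
# repeated for every row of the group (NA if the group has no observed score)
first_minus_last <- function(x) {
  obs <- x[!is.na(x)]
  if (length(obs) == 0) return(rep(NA_real_, length(x)))
  rep(obs[1] - obs[length(obs)], length(x))
}

scores$es <- ave(scores$score, scores$id, FUN = first_minus_last)
scores
```

ave() applies the function to each id's scores and writes the result back in the original row order, so no join is needed afterwards.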

Match group assignments between columns

I am trying to check the accuracy rate of a clustering algorithm, with a dataframe that looks like the one here. The orig.gp refers to the original grouping, which is the "correct" group assignment. The new.gp refers to the grouping assigned by the clustering algorithm.
df <- data.frame(id = 1:9,
orig.gp = c(rep(1:3, each = 3)),
new.gp = c(2, 2, 3, 3, 3, 1, 1, 1, 1) )
df
# id orig.gp new.gp
# 1 1 1 2
# 2 2 1 2
# 3 3 1 3
# 4 4 2 3
# 5 5 2 3
# 6 6 2 1
# 7 7 3 1
# 8 8 3 1
# 9 9 3 1
What I am trying to determine is whether the same ids are assigned the same grouping as the orig.gp. The group number itself is not that important, as the number is arbitrary. Ideally, I would like to achieve something like this:
# orig.gp new.gp correct
# 1 1 2 yes
# 2 1 2 yes
# 3 1 3 no
# 4 2 3 yes
# 5 2 3 yes
# 6 2 1 no
# 7 3 1 yes
# 8 3 1 yes
# 9 3 1 yes
To illustrate, in the original grouping, group 1 consists of ids 1, 2, 3; group 2 consists of ids 4, 5, 6; group 3 consists of 7, 8, 9. In the new grouping, ids 1, 2 are correctly assigned into the same group, thus the "yes" in the correct column. I would like to determine whether the same ids are assigned into the same groups as the original groupings.
Any suggestions would be appreciated!
The way I understand your problem, it is basically one of recoding. Namely, you want to identify observations that fall on the diagonal of a crosstabulation of new.gp and orig.gp, but the values of new.gp are mislabeled.
What I propose here is basically recoding the values of new.gp based on a simple crosstabulation (see tab below). The recoding is done by taking the modal value of orig.gp for each possible value of new.gp and assuming that this mode is the correct value label. I then use recode from car to perform the recoding.
library("car")
tab <- with(df, table(new.gp, orig.gp))
tab
## orig.gp
## new.gp 1 2 3
## 1 0 1 3
## 2 2 0 0
## 3 1 2 0
df$recoded <- recode(df$new.gp, paste(rownames(tab),colnames(tab)[max.col(tab)],sep='=',collapse=';'))
df$correct <- ifelse(df$orig.gp == df$recoded, "yes", "no")
The result:
> df
orig.gp new.gp recoded correct
1 1 2 1 yes
2 1 2 1 yes
3 1 3 2 no
4 2 3 2 yes
5 2 3 2 yes
6 2 1 3 no
7 3 1 3 yes
8 3 1 3 yes
9 3 1 3 yes
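The same modal-value recoding can be sketched in base R with a named lookup vector instead of car::recode. One assumption is made explicit here: max.col() breaks ties at random by default, so ties.method = "first" is passed to keep the result reproducible.

```r
df <- data.frame(id = 1:9,
                 orig.gp = rep(1:3, each = 3),
                 new.gp = c(2, 2, 3, 3, 3, 1, 1, 1, 1))

# Rows are new.gp labels, columns are orig.gp labels
tab <- table(df$new.gp, df$orig.gp)

# For each new.gp label, take the most frequent orig.gp label
modal <- colnames(tab)[max.col(tab, ties.method = "first")]
names(modal) <- rownames(tab)

# Look up the recoded label for every row and compare with the original
df$recoded <- as.integer(modal[as.character(df$new.gp)])
df$correct <- ifelse(df$orig.gp == df$recoded, "yes", "no")
df
```

If two new groups share the same modal original group, both are mapped to that label, so this is a heuristic rather than a one-to-one matching of clusters.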
