Create a dummy variable indicating whether a value is observed before - r

I have a huge dataset and want to create a binary dummy variable indicating whether a value has been observed before. Here is a sample data set.
data.frame(
id = c(rep("A",3),rep("B",3),rep("C",3)),
time = rep(seq(1:3),3),
item = c(11,12,13,11,11,13,22,11,22))
From the dataset, here is the desired column,
observed_b4 = c(NA,0,0,NA,1,0,NA,0,1)
Within each id group, I want to know whether item has already been observed in an earlier row. I can do this with a for loop, but the dataset is too large for that to be practical.

Using duplicated:
base:
cbind(x, flag = as.integer(duplicated(paste(x$id, x$item))))
# id time item flag
# 1 A 1 11 0
# 2 A 2 12 0
# 3 A 3 13 0
# 4 B 1 11 0
# 5 B 2 11 1
# 6 B 3 13 0
# 7 C 1 22 0
# 8 C 2 11 0
# 9 C 3 22 1
or dplyr:
library(dplyr)
x %>%
  group_by(id) %>%
  mutate(flag = as.integer(duplicated(item)))
## A tibble: 9 x 4
## Groups: id [3]
# id time item flag
# <chr> <int> <dbl> <int>
#1 A 1 11 0
#2 A 2 12 0
#3 A 3 13 0
#4 B 1 11 0
#5 B 2 11 1
#6 B 3 13 0
#7 C 1 22 0
#8 C 2 11 0
#9 C 3 22 1

A solution with base R that uses: ave and duplicated.
ave lets you apply a function to df$item within each group defined by df$id. duplicated checks whether an item has already appeared. ave automatically returns a numeric vector (the same class as the input vector).
df$observed_b4 <- ave(df$item, df$id, FUN = duplicated)
df
#> id time item observed_b4
#> 1 A 1 11 0
#> 2 A 2 12 0
#> 3 A 3 13 0
#> 4 B 1 11 0
#> 5 B 2 11 1
#> 6 B 3 13 0
#> 7 C 1 22 0
#> 8 C 2 11 0
#> 9 C 3 22 1
However, to get exactly what you're looking for, you can use this:
df$observed_b4 <- ave(df$item, df$id, FUN = function(x) replace(duplicated(x),1,NA))
df
#> id time item observed_b4
#> 1 A 1 11 NA
#> 2 A 2 12 0
#> 3 A 3 13 0
#> 4 B 1 11 NA
#> 5 B 2 11 1
#> 6 B 3 13 0
#> 7 C 1 22 NA
#> 8 C 2 11 0
#> 9 C 3 22 1
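To check the ave approach against the desired column from the question, here is a minimal self-contained sketch. It assumes the sample data is stored under the name df (the question never assigns it to a variable) and that rows are already ordered by time within each id.

```r
# Sample data from the question; the name `df` is an assumption.
df <- data.frame(
  id   = c(rep("A", 3), rep("B", 3), rep("C", 3)),
  time = rep(1:3, 3),
  item = c(11, 12, 13, 11, 11, 13, 22, 11, 22)
)

# Flag repeats within each id, then blank out the first row of each group.
df$observed_b4 <- ave(
  df$item, df$id,
  FUN = function(x) replace(duplicated(x), 1, NA)
)

# Desired column as given in the question.
desired <- c(NA, 0, 0, NA, 1, 0, NA, 0, 1)
isTRUE(all.equal(df$observed_b4, desired))
```

Because duplicated works in row order, "before" here means "in an earlier row of the same group", so unsorted data would likely need to be ordered by id and time first.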

We could group by 'id', 'item', create a logical vector with row_number() and coerce it to binary (+)
library(dplyr)
df1 %>%
  group_by(id, item) %>%
  mutate(flag = +(row_number() != 1))
-output
# A tibble: 9 x 4
# Groups: id, item [7]
# id time item flag
# <chr> <int> <dbl> <int>
#1 A 1 11 0
#2 A 2 12 0
#3 A 3 13 0
#4 B 1 11 0
#5 B 2 11 1
#6 B 3 13 0
#7 C 1 22 0
#8 C 2 11 0
#9 C 3 22 1
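For readers without dplyr, the same idea can be sketched in base R. This is only an illustration under assumptions: the data frame name d and the helper vector pos are made up here, and ave with seq_along stands in for row_number() within each (id, item) group.

```r
# Sample data from the question, reduced to the two columns that matter.
d <- data.frame(
  id   = rep(c("A", "B", "C"), each = 3),
  item = c(11, 12, 13, 11, 11, 13, 22, 11, 22)
)

# Position of each row within its (id, item) group, in row order.
pos <- ave(seq_len(nrow(d)), d$id, d$item, FUN = seq_along)

# Unary plus coerces the logical vector to integer 0/1.
d$flag <- +(pos != 1)
d$flag
```

The unary + is just a terse spelling of as.integer() for logical vectors; either form should give the same flag column.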

Related

Create column with ID starting at 1 and increment when value in another column changes in R

I have a data frame like so:
ID <- c('A','A','A','A','A','A','A','A','A','A','A','B','B','B','B','B')
val1 <- c(0,1,2,3,4,5,6,7,8,9,10,11,0,1,2,3)
val2 <- c(0,1,2,3,4,5,0,1,0,1,2,0,1,0,1,2)
df <- data.frame(ID, val1, val2)
Output:
ID val1 val2
1 A 0 0
2 A 1 1
3 A 2 2
4 A 3 3
5 A 4 4
6 A 5 5
7 A 6 0
8 A 7 1
9 A 8 0
10 A 9 1
11 A 10 2
12 B 11 0
13 B 0 1
14 B 1 0
15 B 2 1
16 B 3 2
I am trying to create a third column (val3) that works like an index, grouped by ID. It should be 1 at the first row of each ID (where val2 = 0), stay at that value, and then increment by 1 each time val2 returns to 0. Desired output:
ID val1 val2 val3
1 A 0 0 1
2 A 1 1 1
3 A 2 2 1
4 A 3 3 1
5 A 4 4 1
6 A 5 5 1
7 A 6 0 2
8 A 7 1 2
9 A 8 0 3
10 A 9 1 3
11 A 10 2 3
12 B 11 0 1
13 B 0 1 1
14 B 1 0 2
15 B 2 1 2
16 B 3 2 2
How can this be achieved? I tried:
df <- df %>%
  group_by(ID, val2) %>%
  mutate(val3 = row_number())
And:
df$val3 <- cumsum(c(1,diff(df$val2)==0))
But neither provides the desired outcome.
Inside cumsum, use the logical comparison val2 == 0. Each TRUE adds 1 to the running total, so the index goes up every time val2 resets to 0:
df %>%
  group_by(ID) %>%
  mutate(val3 = cumsum(val2 == 0))
# A tibble: 16 × 4
# Groups: ID [2]
ID val1 val2 val3
<chr> <dbl> <dbl> <int>
1 A 0 0 1
2 A 1 1 1
3 A 2 2 1
4 A 3 3 1
5 A 4 4 1
6 A 5 5 1
7 A 6 0 2
8 A 7 1 2
9 A 8 0 3
10 A 9 1 3
11 A 10 2 3
12 B 11 0 1
13 B 0 1 1
14 B 1 0 2
15 B 2 1 2
16 B 3 2 2
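The same grouped cumsum can be sketched in base R with ave, in case dplyr is not available. The data is copied from the question (val1 is left out because the index does not depend on it); as.integer() is used so that ave works on an integer vector from the start.

```r
# ID and val2 as in the question.
df <- data.frame(
  ID   = c(rep("A", 11), rep("B", 5)),
  val2 = c(0, 1, 2, 3, 4, 5, 0, 1, 0, 1, 2, 0, 1, 0, 1, 2)
)

# Running count of "val2 is 0" within each ID.
df$val3 <- ave(as.integer(df$val2 == 0), df$ID, FUN = cumsum)
df$val3
```

One caveat with this technique: if a group does not start with val2 == 0, its leading rows get index 0 rather than 1.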

How to create another column in a data frame based on repeated observations in another column?

So basically I have a data frame that looks like this:
BX BY
1 12
1 12
1 12
2 14
2 14
3 5
I want to create another column, ID, which has the same number for rows with the same values in BX and BY. The table would then look like this:
BX BY ID
1 12 1
1 12 1
1 12 1
2 14 2
2 14 2
3 5 3
Here is a base R way.
Subset the data.frame by the grouping columns, find the duplicated rows and use a standard cumsum trick.
df1<-'BX BY
1 12
1 12
1 12
2 14
2 14
3 5'
df1 <- read.table(textConnection(df1), header = TRUE)
cumsum(!duplicated(df1[c("BX", "BY")]))
#> [1] 1 1 1 2 2 3
df1$ID <- cumsum(!duplicated(df1[c("BX", "BY")]))
df1
#> BX BY ID
#> 1 1 12 1
#> 2 1 12 1
#> 3 1 12 1
#> 4 2 14 2
#> 5 2 14 2
#> 6 3 5 3
Created on 2022-10-12 with reprex v2.0.2
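One thing to keep in mind with the cumsum(!duplicated(...)) trick: it relies on identical rows sitting next to each other, as they do in the sample data. If a (BX, BY) pair can reappear later, a sketch like the following, which matches each row against the unique keys, may be closer to what is wanted. The data frame d, the extra non-contiguous row, and the "_" separator are assumptions made for illustration.

```r
# Hypothetical data where the pair (1, 12) reappears after another pair.
d <- data.frame(BX = c(1, 2, 1, 3), BY = c(12, 14, 12, 5))

# Running count of new rows: the repeated (1, 12) row inherits the
# current count (2) instead of its original ID (1).
running <- cumsum(!duplicated(d[c("BX", "BY")]))

# Build one key per row and number the keys in order of first appearance.
key <- paste(d$BX, d$BY, sep = "_")
by_first_seen <- match(key, unique(key))

running        # 1 2 2 3
by_first_seen  # 1 2 1 3
```

Pasting columns into a key assumes the separator cannot make two different pairs collide, which holds for plain numeric columns like these.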
You can do:
transform(dat, ID = as.numeric(interaction(dat, drop = TRUE, lex.order = TRUE)))
BX BY ID
1 1 12 1
2 1 12 1
3 1 12 1
4 2 14 2
5 2 14 2
6 3 5 3
Or if you prefer dplyr:
library(dplyr)
dat %>%
  group_by(across()) %>%
  mutate(ID = cur_group_id()) %>%
  ungroup()
# A tibble: 6 × 3
BX BY ID
<dbl> <dbl> <int>
1 1 12 1
2 1 12 1
3 1 12 1
4 2 14 2
5 2 14 2
6 3 5 3

How can I extract information of one group based on the filtrates of another group in dplyr

My data frame looks like this but with thousands of entries
type <- rep(c("A","B","C"),4)
time <- c(0,0,0,1,1,1,2,2,2,3,3,3)
counts <- c(0,30,15,30,30,10,31,30,8,30,8,0)
df <- data.frame(time,type,counts)
df
time type counts
1 0 A 0
2 0 B 30
3 0 C 15
4 1 A 30
5 1 B 30
6 1 C 10
7 2 A 31
8 2 B 30
9 2 C 8
10 3 A 30
11 3 B 8
12 3 C 0
At each time point greater than 0, I want to extract all the types that have counts == 30,
and then, for these types, extract their counts at the next time point.
I want my data to look like this
time type counts time_after type_after counts_after
1 A 30 2 A 31
1 B 30 2 B 30
2 B 30 3 B 8
Any help or guidance is appreciated.
Not very elegant but should do the job
library(dplyr)
type <- rep(c("A","B","C"),4)
time <- c(0,0,0,1,1,1,2,2,2,3,3,3)
counts <- c(0,30,15,30,30,10,31,30,8,30,8,0)
df <- tibble(time,type,counts)
df
#> # A tibble: 12 x 3
#> time type counts
#> <dbl> <chr> <dbl>
#> 1 0 A 0
#> 2 0 B 30
#> 3 0 C 15
#> 4 1 A 30
#> 5 1 B 30
#> 6 1 C 10
#> 7 2 A 31
#> 8 2 B 30
#> 9 2 C 8
#> 10 3 A 30
#> 11 3 B 8
#> 12 3 C 0
thirties <- df %>%
  filter(counts == 30 & time != 0) %>%
  mutate(time_after = time + 1)
inner_join(thirties, df, by = c("time_after" = "time",
                                "type" = "type")) %>%
  select(time,
         type = type,
         counts = counts.x,
         time_after,
         type_after = type,
         count_after = counts.y)
#> # A tibble: 3 x 6
#> time type counts time_after type_after count_after
#> <dbl> <chr> <dbl> <dbl> <chr> <dbl>
#> 1 1 A 30 2 A 31
#> 2 1 B 30 2 B 30
#> 3 2 B 30 3 B 8
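The same "shift the time, then join back" idea can be sketched in base R with merge. This is a rough equivalent rather than the answer's own code: the helper names nxt and res are made up, and type_after is omitted because it always equals type.

```r
df <- data.frame(
  time   = c(0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3),
  type   = rep(c("A", "B", "C"), 4),
  counts = c(0, 30, 15, 30, 30, 10, 31, 30, 8, 30, 8, 0)
)

# Rows with counts == 30 after time 0, plus the time point to look up.
thirties <- subset(df, counts == 30 & time > 0)
thirties$time_after <- thirties$time + 1

# Copy of the data with columns renamed so it can serve as the lookup table.
nxt <- setNames(df, c("time_after", "type", "counts_after"))

# Inner join: rows without a following time point (time 3) drop out.
res <- merge(thirties, nxt, by = c("type", "time_after"))
res <- res[order(res$time, res$type),
           c("time", "type", "counts", "time_after", "counts_after")]
res
```

This assumes time points are consecutive integers; with irregular spacing, the next time point would have to be looked up per type instead of computed as time + 1.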

Count number of new and lost friends between two data frames in R

I have two data frames of the same respondents, one from Time 1 and the next from Time 2. In each wave they nominated their friends, and I want to know:
1) how many friends are nominated in Time 2 but not in Time 1 (new friends)
2) how many friends are nominated in Time 1 but not in Time 2 (lost friends)
Sample data:
Time 1 DF
ID friend_1 friend_2 friend_3
1 4 12 7
2 8 6 7
3 9 NA NA
4 15 7 2
5 2 20 7
6 19 13 9
7 12 20 8
8 3 17 10
9 1 15 19
10 2 16 11
Time 2 DF
ID friend_1 friend_2 friend_3
1 4 12 3
2 8 6 14
3 9 NA NA
4 15 7 2
5 1 17 9
6 9 19 NA
7 NA NA NA
8 7 1 16
9 NA 10 12
10 7 11 9
So the desired DF would include these columns (EDIT filled in columns):
ID num_newfriends num_lostfriends
1 1 1
2 1 1
3 0 0
4 0 0
5 3 3
6 0 1
7 0 3
8 3 3
9 2 3
10 2 2
EDIT2:
I've tried doing an anti join
df3 <- anti_join(df1, df2)
But this method doesn't take into account friend ID numbers that might appear in a different column at Time 2 (for example, for respondent #6, friends 9 and 19 are in both T1 and T2 but in different columns).
Another option:
library(tidyverse)
left_join(
  gather(df1, key, x, -ID),
  gather(df2, key, y, -ID),
  by = c("ID", "key")
) %>%
  group_by(ID) %>%
  summarise(
    num_newfriends = sum(!y[!is.na(y)] %in% x[!is.na(x)]),
    num_lostfriends = sum(!x[!is.na(x)] %in% y[!is.na(y)])
  )
Output:
# A tibble: 10 x 3
ID num_newfriends num_lostfriends
<int> <int> <int>
1 1 1 1
2 2 1 1
3 3 0 0
4 4 0 0
5 5 3 3
6 6 0 1
7 7 0 3
8 8 3 3
9 9 2 3
10 10 2 2
Simple comparisons would be an option
library(tidyverse)
na_sums_old <- rowSums(is.na(time1))
na_sums_new <- rowSums(is.na(time2))
kept_friends <- map_dbl(seq(nrow(time1)), ~ sum(time1[.x, -1] %in% time2[.x, -1]))
kept_friends <- kept_friends - na_sums_old * (na_sums_new >= 1)
new_friends <- 3 - na_sums_new - kept_friends
lost_friends <- 3 - na_sums_old - kept_friends
tibble(ID = time1$ID, new_friends = new_friends, lost_friends = lost_friends)
# A tibble: 10 x 3
ID new_friends lost_friends
<int> <dbl> <dbl>
1 1 1 1
2 2 1 1
3 3 0 0
4 4 0 0
5 5 3 3
6 6 0 1
7 7 0 3
8 8 3 3
9 9 2 3
10 10 2 2
You can make anti_join work by first pivoting to a "long" data frame.
library(dplyr)
library(tidyr)
df1 <- df1 %>%
  pivot_longer(starts_with("friend_"), values_to = "friend") %>%
  drop_na()
df2 <- df2 %>%
  pivot_longer(starts_with("friend_"), values_to = "friend") %>%
  drop_na()
head(df1)
#> # A tibble: 6 x 3
#> ID name friend
#> <int> <chr> <int>
#> 1 1 friend_1 4
#> 2 1 friend_2 12
#> 3 1 friend_3 7
#> 4 2 friend_1 8
#> 5 2 friend_2 6
#> 6 2 friend_3 7
lost_friends <- anti_join(df1, df2, by = c("ID", "friend"))
new_friends <- anti_join(df2, df1, by = c("ID", "friend"))
respondents <- distinct(df1, ID)
respondents %>%
  full_join(
    count(lost_friends, ID, name = "num_lost_friends")
  ) %>%
  full_join(
    count(new_friends, ID, name = "num_new_friends")
  ) %>%
  mutate_at(vars(starts_with("num_")), replace_na, 0)
#> Joining, by = "ID"
#> Joining, by = "ID"
#> # A tibble: 10 x 3
#> ID num_lost_friends num_new_friends
#> <int> <dbl> <dbl>
#> 1 1 1 1
#> 2 2 1 1
#> 3 3 0 0
#> 4 4 0 0
#> 5 5 3 3
#> 6 6 1 0
#> 7 7 3 0
#> 8 8 3 3
#> 9 9 3 2
#> 10 10 2 2
Created on 2019-11-01 by the reprex package (v0.3.0)
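All of the answers above boil down to a per-respondent set difference, which can also be sketched directly in base R with setdiff. The names t1, t2 and friends_of are assumptions for illustration, and the sketch assumes both data frames list the same respondents in the same row order.

```r
# Time 1 and Time 2 nominations, typed in from the question.
t1 <- data.frame(
  ID = 1:10,
  friend_1 = c(4, 8, 9, 15, 2, 19, 12, 3, 1, 2),
  friend_2 = c(12, 6, NA, 7, 20, 13, 20, 17, 15, 16),
  friend_3 = c(7, 7, NA, 2, 7, 9, 8, 10, 19, 11)
)
t2 <- data.frame(
  ID = 1:10,
  friend_1 = c(4, 8, 9, 15, 1, 9, NA, 7, NA, 7),
  friend_2 = c(12, 6, NA, 7, 17, 19, NA, 1, 10, 11),
  friend_3 = c(3, 14, NA, 2, 9, NA, NA, 16, 12, 9)
)

# Hypothetical helper: non-missing friend IDs in row i (column 1 is ID).
friends_of <- function(d, i) {
  v <- unlist(d[i, -1], use.names = FALSE)
  v[!is.na(v)]
}

rows <- seq_len(nrow(t1))
result <- data.frame(
  ID = t1$ID,
  # In Time 2 but not in Time 1.
  num_newfriends = vapply(rows, function(i)
    length(setdiff(friends_of(t2, i), friends_of(t1, i))), integer(1)),
  # In Time 1 but not in Time 2.
  num_lostfriends = vapply(rows, function(i)
    length(setdiff(friends_of(t1, i), friends_of(t2, i))), integer(1))
)
result
```

Since setdiff ignores column position, respondent 6 (friends 9 and 19 in different columns at each wave) comes out with 0 new friends and 1 lost friend. A row-wise loop like this is likely slower than the join-based answers on large data.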

Grouping rows with multiple conditions across columns, incl. a sorting, in R/dplyr

In the following dataframe, I have 24 points in the 3D space (2 horizontal locations along X and Y, each with 12 vertical values along Z).
I would like to group together the points vertically if:
they have the same val value and
they follow each other along the Z axis (so two 1 separated by another value would not have the same ID).
This should apply only to the points beyond the first 3 Z values, which automatically get ID = 1, 2 and 3 respectively; the IDs of the following points start at 4.
set.seed(50)
library(dplyr)
mydf = data.frame(X = rep(1, 24), Y = rep(1:2, each = 12),
                  Z = c(sample(1:12,12,replace=F), sample(4:16,12,replace=F)),
                  val = c(rep(1:3, 8)))
mydf = mydf %>% group_by(X,Y) %>% arrange(X,Y,Z) %>% data.frame()
# X Y Z val
# 1 1 1 1 3 # In this X-Y location, Z starts at 1
# 2 1 1 2 3
# 3 1 1 3 3
# 4 1 1 4 2
# 5 1 1 5 2
# 6 1 1 6 1
# 7 1 1 7 1
# 8 1 1 8 1
# 9 1 1 9 1
# 10 1 1 10 2
# 11 1 1 11 2
# 12 1 1 12 3
# 13 1 2 4 2 # In this X-Y location, Z starts at 4
# [etc (see below)]
Desired output (note for example that lines 4-5 and 10-11 get a different ID):
rle1 = rle(mydf[4:12,]$val)
# Run Length Encoding
# lengths: int [1:4] 2 4 2 1
# values : int [1:4] 2 1 2 3
rle2 = rle(mydf[4:12 + 12,]$val)
# Run Length Encoding
# lengths: int [1:7] 2 1 1 2 1 1 1
# values : int [1:7] 3 1 2 1 3 1 2
mydf$ID = c(1:3, rep(4:(3+length(rle1$lengths)), rle1$lengths),
            1:3, rep(4:(3+length(rle2$lengths)), rle2$lengths))
# X Y Z val ID
# 1 1 1 1 3 1
# 2 1 1 2 3 2
# 3 1 1 3 3 3
# 4 1 1 4 2 4
# 5 1 1 5 2 4
# 6 1 1 6 1 5
# 7 1 1 7 1 5
# 8 1 1 8 1 5
# 9 1 1 9 1 5
# 10 1 1 10 2 6
# 11 1 1 11 2 6
# 12 1 1 12 3 7 # In this X-Y location, I have 7 groups in the end
# 13 1 2 4 2 1
# 14 1 2 5 2 2
# 15 1 2 6 3 3
# 16 1 2 7 3 4
# 17 1 2 9 3 4
# 18 1 2 10 1 5
# 19 1 2 11 2 6
# 20 1 2 12 1 7
# 21 1 2 13 1 7
# 22 1 2 14 3 8
# 23 1 2 15 1 9
# 24 1 2 16 2 10 # In this X-Y location, I have 10 groups in the end
How could I do this more efficiently, ideally in one line with dplyr? It needs to work for many (X,Y) locations, where the first 3 Z values (which start at a different value at each location) are always followed by a location-dependent number of ID groups.
I started by trying to work on a vector from a conditional subset in dplyr, which fails:
mydf %>% group_by(X,Y) %>% arrange(X,Y,Z) %>%
  mutate(dummy = mean(rle(val)$values))
Error: error in evaluating the argument 'x' in selecting a method for function 'mean': Error in rle(c(1L, 2L, 3L, 1L, 2L, 3L, 3L, 3L, 1L, 1L, 2L, 2L))$function (x, :
invalid subscript type 'closure'
Thanks!
You can use data.table::rleid on val starting from the 4th element and then add an offset of 3, which simplifies the rle calculation:
library(dplyr); library(data.table)
mydf %>%
  group_by(X, Y) %>%
  mutate(ID = c(1:3, rleid(val[-(1:3)]) + 3)) %>%
  as.data.frame() # for print purpose only
# X Y Z val ID
#1 1 1 1 3 1
#2 1 1 2 3 2
#3 1 1 3 3 3
#4 1 1 4 2 4
#5 1 1 5 2 4
#6 1 1 6 1 5
#7 1 1 7 1 5
#8 1 1 8 1 5
#9 1 1 9 1 5
#10 1 1 10 2 6
#11 1 1 11 2 6
#12 1 1 12 3 7
#13 1 2 4 2 1
#14 1 2 5 2 2
#15 1 2 6 3 3
#16 1 2 7 3 4
#17 1 2 9 3 4
#18 1 2 10 1 5
#19 1 2 11 2 6
#20 1 2 12 1 7
#21 1 2 13 1 7
#22 1 2 14 3 8
#23 1 2 15 1 9
#24 1 2 16 2 10
Or without rleid, use cumsum + diff:
mydf %>% group_by(X, Y) %>% mutate(ID = c(1:3, cumsum(c(4, diff(val[-(1:3)]) != 0))))

Resources