I am trying to create a new variable (v2) based on a pattern of numerical responses to another variable (v1). The dataset I am working with is in long format and ordered by visit. I have tried grouping by the 'id' variable and using various combinations of 'summarise' in dplyr, but cannot seem to figure this out. Below is an example of what I would like to achieve.
id visit v1 v2
<dbl> <int> <dbl> <int>
1 10001 1 0 1
2 10001 2 0 1
3 10002 1 0 2
4 10002 2 1 2
5 10003 1 1 3
6 10003 2 0 3
The value of 1 for v2 should reflect a response pattern of 0 across two visits for id 10001, 2 reflects a response pattern of 0/1, and so on.
Thank you in advance for the help!
Another way is:
dat %>%
group_by(id) %>%
mutate(v2 = c("00" = 1, "01" = 2, "10" = 3, "11" = 4)[paste(v1, collapse = "")])
# A tibble: 6 x 4
# Groups: id [3]
id visit v1 v2
<int> <int> <int> <dbl>
1 10001 1 0 1
2 10001 2 0 1
3 10002 1 0 2
4 10002 2 1 2
5 10003 1 1 3
6 10003 2 0 3
Assumption:
within an id, we always have exactly 2 rows
base R
ave(dat$v1, dat$id, FUN = function(z) {
if (length(z) != 2) return(NA_integer_)
switch(paste(z, collapse = ""),
"00" = 1L,
"01" = 2L,
"10" = 3L,
"11" = 4L,
NA_integer_)
})
# [1] 1 1 2 2 3 3
dplyr
library(dplyr)
dat %>%
group_by(id) %>%
mutate(v2 = if (n() != 2) NA_integer_ else case_when(
all(v1 == c(0L, 0L)) ~ 1L,
all(v1 == c(0L, 1L)) ~ 2L,
all(v1 == c(1L, 0L)) ~ 3L,
all(v1 == c(1L, 1L)) ~ 4L,
TRUE ~ NA_integer_)
) %>%
ungroup()
# # A tibble: 6 x 4
# id visit v1 v2
# <int> <int> <int> <int>
# 1 10001 1 0 1
# 2 10001 2 0 1
# 3 10002 1 0 2
# 4 10002 2 1 2
# 5 10003 1 1 3
# 6 10003 2 0 3
Data
dat <- structure(list(id = c(10001L, 10001L, 10002L, 10002L, 10003L, 10003L), visit = c(1L, 2L, 1L, 2L, 1L, 2L), v1 = c(0L, 0L, 0L, 1L, 1L, 0L), v2 = c(1L, 1L, 2L, 2L, 3L, 3L)), class = "data.frame", row.names = c("1", "2", "3", "4", "5", "6"))
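If an id can have more than two visits, the lookup table of patterns grows quickly. The following is a minimal sketch of one possible generalization, not part of the answers above. It assumes the rows are already ordered by visit within each id and that v1 only contains 0 or 1. It reads the responses as a binary number and adds 1, which reproduces the 1 to 4 coding for two visits. The helper name pattern_code is hypothetical.

```r
dat <- data.frame(
  id    = c(10001L, 10001L, 10002L, 10002L, 10003L, 10003L),
  visit = c(1L, 2L, 1L, 2L, 1L, 2L),
  v1    = c(0L, 0L, 0L, 1L, 1L, 0L)
)

# Hypothetical helper: treat the 0/1 responses of one id as binary digits
# (first visit = most significant digit) and add 1,
# so "00" -> 1, "01" -> 2, "10" -> 3, "11" -> 4.
pattern_code <- function(z) {
  weights <- 2^rev(seq_along(z) - 1)
  rep(sum(z * weights) + 1, length(z))
}

# ave() applies the helper within each id and keeps the original row order
dat$v2 <- ave(dat$v1, dat$id, FUN = pattern_code)
dat$v2
```

With a different number of visits per id, the same code value can arise from patterns of different lengths, so this only makes sense when every id has the same number of rows.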
Related
I have a dataframe such as:
COL1 VALUE1 VALUE2
1 A,A 1 5
2 A,A,B 1 3
3 C 1 1
4 D 1 2
5 D 1 2
6 A,A 1 10
7 A,B,A 1 2
I can already remove the duplicates within each COL1 value and count how often each deduplicated value occurs, using:
as.data.frame(table(tab$COL1)) %>%
group_by(Var1 = sapply(strsplit(as.character(Var1), ","), function(x) toString(unique(x)))) %>%
summarise(Freq = sum(Freq))
And then I get:
# A tibble: 4 × 2
Var1 Freq
<chr> <int>
1 A 2
2 A, B 2
3 C 1
4 D 2
Does anyone know how to add a new column called Mean that holds, for each COL1 group, the mean of the VALUE2 values? The result should look like this:
Var1 Freq Mean
1 A 2 7.5 < because (5+10)/2 =7.5
2 A, B 2 2.5 < because (3+2)/2 =2.5
3 C 1 1 < because 1/1 = 1
4 D 2 2 < because (2+2)/2 = 2
Here is the dataframe, in case it helps:
structure(list(COL1 = structure(c(1L, 2L, 4L, 5L, 5L, 1L, 3L), .Label = c("A,A",
"A,A,B", "A,B,A", "C", "D"), class = "factor"), VALUE1 = c(1L,
1L, 1L, 1L, 1L, 1L, 1L), VALUE2 = c(5L, 3L, 1L, 2L, 2L, 10L,
2L)), class = "data.frame", row.names = c(NA, -7L))
You can calculate the frequency directly in the dplyr chain and then add Mean = mean(VALUE2) to the summarise() call. Here n() counts the rows in each group; sum(VALUE1) gives the same result only because VALUE1 is 1 on every row.
I.e.
library(dplyr)
tab %>%
group_by(Var1 = sapply(strsplit(as.character(COL1), ","), function(x) toString(unique(x)))) %>%
summarise(Freq = n(), Mean = mean(VALUE2))
# # A tibble: 4 x 3
# Var1 Freq Mean
# <chr> <int> <dbl>
# 1 A 2 7.5
# 2 A, B 2 2.5
# 3 C 1 1
# 4 D 2 2
Is this what you want:
library(dplyr)
tab %>%
mutate(COL1 = sapply(strsplit(as.character(COL1), ","), function(x) toString(unique(x)))) %>%
group_by(COL1) %>%
summarise(Freq = n(), # row count; sum(VALUE1) matches only while VALUE1 is always 1
Mean = mean(VALUE2))
# A tibble: 4 x 3
COL1 Freq Mean
* <chr> <int> <dbl>
1 A 2 7.5
2 A, B 2 2.5
3 C 1 1
4 D 2 2
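One detail of the grouping key is worth spelling out: unique() keeps the order of first appearance, so "A,B" and "B,A" would end up in different groups. The following is a sketch under the assumption that the order of letters within COL1 should not matter, so the key is sorted before it is collapsed. It uses base R only, and the column name key is introduced here for illustration.

```r
tab <- data.frame(
  COL1   = c("A,A", "A,A,B", "C", "D", "D", "A,A", "A,B,A"),
  VALUE1 = 1L,
  VALUE2 = c(5L, 3L, 1L, 2L, 2L, 10L, 2L)
)

# Order-insensitive key: split on commas, drop duplicates, sort, re-join
tab$key <- sapply(strsplit(as.character(tab$COL1), ","),
                  function(x) toString(sort(unique(x))))

# table() and tapply() both order their results by the sorted key values
res <- data.frame(
  Var1 = sort(unique(tab$key)),
  Freq = as.vector(table(tab$key)),
  Mean = as.vector(tapply(tab$VALUE2, tab$key, mean))
)
res
```

For the sample data the result is the same as in the answers above, because no COL1 value lists its letters in a different order of first appearance.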
I have a dataframe like this:
ID S1 C
1 1 2 3
2 1 2 3
3 3 1 1
4 6 2 5
5 6 7 5
What I need is the number of rows per group ID where S1 <= C. This is the desired output.
ID Obs
1 1 2
2 3 1
3 6 1
Although the question was answered below, I have a follow-up question: is it possible to do the same for multiple columns (S1, S2, ...)? For example, for the dataframe below:
ID S1 S2 C
1 1 2 2 3
2 1 2 2 3
3 3 1 1 1
4 6 2 2 5
5 6 7 7 5
And then get:
ID S1.Obs S2.Obs
1 1 2 2
2 3 1 1
3 6 1 1
A base R solution with aggregate().
aggregate(Obs ~ ID, transform(df, Obs = S1 <= C), sum)
# ID Obs
# 1 1 2
# 2 3 1
# 3 6 1
A dplyr solution
library(dplyr)
df %>%
filter(S1 <= C) %>%
count(ID, name = "Obs")
# ID Obs
# 1 1 2
# 2 3 1
# 3 6 1
Data
df <- structure(list(ID = c(1L, 1L, 3L, 6L, 6L), S1 = c(2L, 2L, 1L, 2L, 7L),
C = c(3L, 3L, 1L, 5L, 5L)), class = "data.frame", row.names = c("1", "2", "3", "4", "5"))
Extension
If you want to apply this rule on multiple columns such as S1, S2, S3:
df %>%
group_by(ID) %>%
summarise(across(starts_with("S"), ~ sum(.x <= C)))
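To make the multi-column version concrete, here is a small runnable sketch on the two-column data from the follow-up question. It assumes dplyr 1.0.0 or later (for across()), and uses the .names argument to produce the S1.Obs / S2.Obs column names from the desired output.

```r
library(dplyr)

df2 <- data.frame(
  ID = c(1L, 1L, 3L, 6L, 6L),
  S1 = c(2L, 2L, 1L, 2L, 7L),
  S2 = c(2L, 2L, 1L, 2L, 7L),
  C  = c(3L, 3L, 1L, 5L, 5L)
)

res <- df2 %>%
  group_by(ID) %>%
  # for every column whose name starts with "S", count the rows where it is <= C
  summarise(across(starts_with("S"), ~ sum(.x <= C), .names = "{.col}.Obs"))
res
```

Unlike the filter() plus count() approach, this keeps an ID whose count is zero (it shows up with 0), which may or may not be what you want.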
data <- data.frame(
ID = c(1, 1, 3, 6, 6),
S1 = c(2, 2, 1, 2, 7),
C = c(3, 3, 1, 5, 5)
)
library(dplyr)
data.filtered <- data[data$S1 <= data$C,]
data.filtered %>% group_by(ID) %>%
summarize(Obs = length(ID))
An option with data.table
library(data.table)
setDT(df)[S1 <=C, .(Obs = .N), ID]
# ID Obs
#1: 1 2
#2: 3 1
#3: 6 1
data
df <- structure(list(ID = c(1L, 1L, 3L, 6L, 6L), S1 = c(2L, 2L, 1L, 2L, 7L),
C = c(3L, 3L, 1L, 5L, 5L)), class = "data.frame", row.names = c("1", "2", "3", "4", "5"))
I am quite a beginner in R but thanks to the community of Stackoverflow I am improving!
However, I am stuck with a problem:
I have a dataset with 5 variables:
id_house represents the id for each household
id_ind is an id that takes the value 1 for the first individual in the household, 2 for the second, 3 for the third, and so on.
indicator_tb_men indicates whether the first person answered the survey (1 = yes, 0 = no). All other members of the household take the value 0.
id_house id_ind indicator_tb_men
1 1 1
1 2 0
2 1 1
3 1 0
3 2 0
3 3 0
4 1 1
5 1 0
I would like to delete all members of households where the first individual has not answered the survey.
So it would give:
id_house id_ind indicator_tb_men
1 1 1
1 2 0
2 1 1
4 1 1
Using dplyr, here is one way:
library(dplyr)
df %>%
arrange(id_house, id_ind) %>%
group_by(id_house) %>%
filter(first(indicator_tb_men) != 0)
# id_house id_ind indicator_tb_men
# <int> <int> <int>
#1 1 1 1
#2 1 2 NA
#3 2 1 1
#4 4 1 1
data
df <- structure(list(id_house = c(1L, 1L, 2L, 3L, 3L, 3L, 4L, 5L),
id_ind = c(1L, 2L, 1L, 1L, 2L, 3L, 1L, 1L), indicator_tb_men = c(1L,
NA, 1L, 0L, NA, NA, 1L, 0L)), class = "data.frame", row.names = c(NA, -8L))
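In base R we can use nested logical indexing: first find the households whose first individual answered, then keep every row belonging to those households.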
df[df$id_house %in% df$id_house[df$id_ind == 1 & df$indicator_tb_men == 1],]
id_house id_ind indicator_tb_men
1 1 1 1
2 1 2 NA
3 2 1 1
7 4 1 1
Data: Using Ronak Shah's data
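A variant that does not depend on row order may also be useful. This is only a sketch, using the data as shown in the question (0 rather than NA for the other household members): it keeps a household when any of its rows is the first individual with indicator 1, so no arrange() is needed.

```r
library(dplyr)

df <- data.frame(
  id_house         = c(1L, 1L, 2L, 3L, 3L, 3L, 4L, 5L),
  id_ind           = c(1L, 2L, 1L, 1L, 2L, 3L, 1L, 1L),
  indicator_tb_men = c(1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L)
)

kept <- df %>%
  group_by(id_house) %>%
  # TRUE for the whole household if its first individual answered the survey
  filter(any(id_ind == 1 & indicator_tb_men == 1)) %>%
  ungroup()
kept
```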
For each id, I want to count how many values are positive, negative, and zero. My data looks like this:
dataset <- read.table(text =
"id value
1 4
1 -2
1 0
2 6
2 -4
2 -5
2 -1
3 0
3 0
3 -4
3 -5",
header = TRUE, stringsAsFactors = FALSE)
I want my result to look like this:
id num_pos_value num_neg_value num_zero_value
1 1 1 1
2 1 3 0
3 0 2 2
I want to extend the columns of the above result by adding sum of the positive and negative values.
id num_pos num_neg num_zero sum_pos sum_neg
1 1 1 1 4 -2
2 1 3 0 6 -10
3 0 2 2 0 -9
We group by 'id' and sum the logical vectors, since each TRUE counts as 1:
library(dplyr)
df1 %>%
group_by(id) %>%
summarise(num_pos = sum(value > 0),
num_neg = sum(value < 0),
num_zero = sum(value == 0))
# A tibble: 3 x 4
# id num_pos num_neg num_zero
# <int> <int> <int> <int>
#1 1 1 1 1
#2 2 1 3 0
#3 3 0 2 2
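Or get the table of the sign of 'value' and spread it to 'wide'. The factor levels -1:1 come out in the order negative, zero, positive, so the labels must follow that order:
library(tidyr)
df1 %>%
group_by(id) %>%
# as.vector() drops the table class so the list column holds plain integer vectors
summarise(num = list(as.vector(table(factor(sign(value), levels = -1:1))))) %>%
unnest(num) %>%
mutate(grp = rep(paste0("num_", c("neg", "zero", "pos")), n_distinct(id))) %>%
spread(grp, num)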
Or using count
df1 %>%
count(id, val = sign(value)) %>%
spread(val, n, fill = 0)
data
df1 <- structure(list(id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L,
3L), value = c(4L, -2L, 0L, 6L, -4L, -5L, -1L, 0L, 0L, -4L, -5L
)), class = "data.frame", row.names = c(NA, -11L))
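The question also asks for the sums of the positive and negative values, which the counts above do not cover. A minimal sketch that extends the first summarise() call with two more columns, using logical subsetting before summing:

```r
library(dplyr)

df1 <- data.frame(
  id    = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L),
  value = c(4L, -2L, 0L, 6L, -4L, -5L, -1L, 0L, 0L, -4L, -5L)
)

res <- df1 %>%
  group_by(id) %>%
  summarise(num_pos  = sum(value > 0),
            num_neg  = sum(value < 0),
            num_zero = sum(value == 0),
            # sum() of an empty selection is 0, so ids without positives get 0
            sum_pos  = sum(value[value > 0]),
            sum_neg  = sum(value[value < 0]))
res
```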
I have rows with recurring IDs that I would like to merge. The columns are binary, so I would like to sum them within each id.
Example before:
id nam1 nam2
1 1 1
1 0 0
2 1 0
2 0 1
3 1 1
3 1 0
Example after:
id nam1 nam2
1 1 1
2 1 1
3 2 1
Any ideas on how to do this?
#d.b's answer in comment:
aggregate(.~id, df, sum)
or using dplyr:
library(dplyr)
df %>%
group_by(id) %>%
summarize_all("sum")
Result:
# A tibble: 3 x 3
id nam1 nam2
<int> <int> <int>
1 1 1 1
2 2 1 1
3 3 2 1
Data
df = structure(list(id = c(1L, 1L, 2L, 2L, 3L, 3L), nam1 = c(1L, 0L,
1L, 0L, 1L, 1L), nam2 = c(1L, 0L, 0L, 1L, 1L, 0L)), .Names = c("id",
"nam1", "nam2"), row.names = c(NA, -6L), class = "data.frame")
#Sample data:
df <- data.frame(id=c(1,1,2,2,3,3),
nam1=c(1,0,1,0,1,1),
nam2=c(1,0,0,1,1,0))
library(data.table)
setDT(df)[, lapply(.SD, sum), by=.(id)]
id nam1 nam2
1 1 1
2 1 1
3 2 1
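In dplyr 1.0.0 and later, summarize_all() is superseded by across(). A sketch of the equivalent call, assuming that version or newer is installed:

```r
library(dplyr)

df <- data.frame(
  id   = c(1L, 1L, 2L, 2L, 3L, 3L),
  nam1 = c(1L, 0L, 1L, 0L, 1L, 1L),
  nam2 = c(1L, 0L, 0L, 1L, 1L, 0L)
)

res <- df %>%
  group_by(id) %>%
  # everything() covers all non-grouping columns, here nam1 and nam2
  summarise(across(everything(), sum))
res
```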