Grouping in Embedded Group Structures in R data.table

I have a data.table object that looks like this:
FamilyID InterFamilyID MumInFamilyID Edu
1 1 NA 2
1 2 NA 5
1 3 2 3
2 1 NA 6
2 2 1 9
2 2 1 3
I want to perform a query like this one:
tbl1[, MumEdu:= Edu[InterFamilyID == MumInFamilyID], by=FamilyID]
to get something like this:
FamilyID InterFamilyID MumInFamilyID Edu MumEdu
1 1 NA 2 NA
1 2 NA 5 NA
1 3 2 3 5
2 1 NA 6 NA
2 2 1 9 6
2 2 1 3 6
In other words, the data.table is grouped by one column (FamilyID), and within each family every member is identified by a second column (InterFamilyID). A third column (MumInFamilyID) holds the InterFamilyID of another member of the same family. I want to use that reference to look up values from the referenced row.

You can use match, which:
returns a vector of the positions of (first) matches of its first argument in its second.
Then use the resulting positions to pick the corresponding elements of the Edu column:
tbl1[, MumEdu := Edu[match(MumInFamilyID, InterFamilyID)], by = FamilyID]
tbl1
# FamilyID InterFamilyID MumInFamilyID Edu MumEdu
#1: 1 1 NA 2 NA
#2: 1 2 NA 5 NA
#3: 1 3 2 3 5
#4: 2 1 NA 6 NA
#5: 2 2 1 9 6
#6: 2 2 1 3 6
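To see what match contributes, it can help to run it on a single family first. The sketch below rebuilds the example table as a plain data frame (base R only, an assumption made to keep it dependency-free; `fam1` and `pos` are illustrative names) and then applies the same lookup per family with split and unsplit instead of data.table's by.

```r
# Example table from the question, as a plain data frame.
tbl1 <- data.frame(
  FamilyID      = c(1, 1, 1, 2, 2, 2),
  InterFamilyID = c(1, 2, 3, 1, 2, 2),
  MumInFamilyID = c(NA, NA, 2, NA, 1, 1),
  Edu           = c(2, 5, 3, 6, 9, 3)
)

# Step 1: inside one family, match() gives the row position of each mother.
fam1 <- tbl1[tbl1$FamilyID == 1, ]
pos  <- match(fam1$MumInFamilyID, fam1$InterFamilyID)
print(pos)            # NA NA  2
print(fam1$Edu[pos])  # NA NA  5

# Step 2: the same lookup for every family, then reassembled in row order.
per_family <- lapply(split(tbl1, tbl1$FamilyID), function(d) {
  d$Edu[match(d$MumInFamilyID, d$InterFamilyID)]
})
tbl1$MumEdu <- unsplit(per_family, tbl1$FamilyID)
print(tbl1)
```

Because match returns only the first matching position, a reference to an InterFamilyID that occurs more than once in a family resolves to the first row carrying that ID.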

Related

How to make the next number in a column a sequence in r

Sorry to bother everyone. I have been stuck on some code. My data looks like this:
Student Number
1 NA
1 NA
1 1
1 1
2 NA
2 1
2 1
2 1
3 NA
3 NA
3 1
3 1
I tried using dplyr to group by student, looking for a way so that every time it reads a 1 it adds it to a running count for that student, so the column would read as:
Student Number
1 NA
1 NA
1 1
1 2
2 NA
2 1
2 2
2 3
3 NA
3 NA
3 1
3 2
etc
Thank you! It'd help with attendance.
A data.table solution:
library(data.table)
setDT(df)
df[!is.na(Number),Number:=cumsum(Number),by=Student]
df
Student Number
<int> <int>
1 1 NA
2 1 NA
3 1 1
4 1 2
5 2 NA
6 2 1
7 2 2
8 2 3
9 3 NA
10 3 NA
11 3 1
12 3 2
Try using cumsum. Note that cumsum itself cannot ignore NA, so replace NA with 0 before summing and add 0 * Number to restore the NA afterwards:
library(dplyr)
df %>%
  group_by(Student) %>%
  mutate(n = cumsum(ifelse(is.na(Number), 0, Number)) + 0 * Number)
Student Number n
<int> <int> <dbl>
1 1 NA NA
2 1 NA NA
3 1 1 1
4 1 1 2
5 2 NA NA
6 2 1 1
7 2 1 2
8 2 1 3
9 3 NA NA
10 3 NA NA
11 3 1 1
12 3 1 2
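The NA handling in the two answers above can be checked on a small vector before applying it per student. The following is a base R sketch (the helper name `running` is hypothetical, and ave stands in for the grouped operations used in the answers):

```r
# Example data from the question.
df <- data.frame(
  Student = rep(1:3, each = 4),
  Number  = c(NA, NA, 1, 1,  NA, 1, 1, 1,  NA, NA, 1, 1)
)

# cumsum() propagates NA: once it meets one, every later value is NA too.
x <- c(NA, NA, 1, 1)
print(cumsum(x))     # NA NA NA NA

# Treat NA as 0 while summing, then add 0 * x to put NA back where x was NA.
running <- function(x) cumsum(ifelse(is.na(x), 0, x)) + 0 * x
print(running(x))    # NA NA  1  2

# Apply the helper within each student.
df$n <- ave(df$Number, df$Student, FUN = running)
print(df)
```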

Rearranging columns with NAs [duplicate]

Sorry guys,
this is probably a silly question, but I have not managed to find a quick solution to this issue.
I have a data frame of this form, indicating the number of components of each household and the gender of each member:
Familyid Gender_1 Gender_2 Gender_3 Gender_4 Ncomponent
1 1 NA NA NA 1
2 NA 1 NA NA 1
3 1 2 NA NA 2
4 1 NA 2 NA 2
5 NA 1 2 NA 2
6 2 NA NA 1 2
I would like to collect this information in just two columns, in the following way:
Familyid Gender_member1 Gender_member2 Ncomponent
1 1 NA 1
2 1 NA 1
3 1 2 2
4 1 2 2
5 1 2 2
6 2 1 2
In other words, I want to create one column indicating the gender of member 1, regardless of which column he/she is located in in my original data frame, and another one indicating the gender of the second family member, whenever the latter exists.
Can anyone help me out with this?
Marco
I just removed NAs for Gender_x columns.
xy <- read.table(text = "Familyid Gender_1 Gender_2 Gender_3 Gender_4 Ncomponent
1 1 NA NA NA 1
2 NA 1 NA NA 1
3 1 2 NA NA 2
4 1 NA 2 NA 2
5 NA 1 2 NA 2
6 2 NA NA 1 2",
header = TRUE)
xy
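# Select the Gender_1 ... Gender_4 columns by name.
fetch.gender <- grepl("^Gender_\\d{1}$", names(xy))
# For each row, drop the NAs and pad back to two values with NA, so that
# one-member families are not filled in by recycling when rows are combined.
out <- apply(xy[, fetch.gender], MARGIN = 1, FUN = function(x) {
  x <- x[!is.na(x)]
  length(x) <- 2  # pads with NA (assumes at most two members per family)
  x
})
out <- t(out)  # apply() returns one column per row of xy
colnames(out) <- c("Gender_member1", "Gender_member2")
data.frame(Familyid = xy$Familyid, out, Ncomponent = xy$Ncomponent)
Familyid Gender_member1 Gender_member2 Ncomponent
1 1 1 NA 1
2 2 1 NA 1
3 3 1 2 2
4 4 1 2 2
5 5 1 2 2
6 6 2 1 2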
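A different way to get the same two columns is to shift the non-NA values of each row to the left and keep the first two positions. This is only a sketch in base R: the names `gender_cols`, `shifted`, and `res` are illustrative, and it relies on order() keeping tied elements in their original order, so the genders stay in column order.

```r
# Example data from the question.
xy <- data.frame(
  Familyid   = 1:6,
  Gender_1   = c(1, NA, 1, 1, NA, 2),
  Gender_2   = c(NA, 1, 2, NA, 1, NA),
  Gender_3   = c(NA, NA, NA, 2, 2, NA),
  Gender_4   = c(NA, NA, NA, NA, NA, 1),
  Ncomponent = c(1, 1, 2, 2, 2, 2)
)

gender_cols <- grep("^Gender_[0-9]+$", names(xy))

# order(is.na(x)) lists the non-NA positions first, then the NA positions.
shifted <- t(apply(xy[, gender_cols], 1, function(x) x[order(is.na(x))]))

res <- data.frame(
  Familyid       = xy$Familyid,
  Gender_member1 = shifted[, 1],
  Gender_member2 = shifted[, 2],
  Ncomponent     = xy$Ncomponent
)
print(res)
```

Unlike a fixed two-value padding, this keeps all shifted columns available in `shifted`, which may be convenient if some households have more than two members.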

Create a counting variable which I can use to group my unemployment data in R

I have data as shown below, where I created the variable "B" with the following code:
index <- which(Count$unemploymentduration ==1)
Count$B[index]<-1:length(index)
ID unemployment B
1 1 1
1 2 NA
1 3 NA
1 4 NA
2 1 2
2 2 NA
2 0 NA
2 1 3
2 2 NA
2 3 NA
2 4 NA
2 5 NA
I want my data to look like this, and have no real idea how to get there.
I thought of an "if" function, but I have never used one in R.
ID unemployment B
1 1 1
1 2 1
1 3 1
1 4 1
2 1 2
2 2 2
2 0 2
2 1 3
2 2 3
2 3 3
2 4 3
2 5 3
Could someone help me out?
We can use na.locf from library(zoo) to carry the last non-NA value forward:
library(zoo)
Count$B <- na.locf(Count$B)
But this can also be created directly, without using an 'index':
Count$B <- cumsum(Count$unemployment==1)
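The two routes above should agree on the example data. The sketch below checks that in base R, replacing na.locf with a hand-rolled carry-forward (an assumption made to avoid the zoo dependency; it only works when the first value is not NA). The names `B_direct`, `B_sparse`, `last_seen`, and `B_filled` are illustrative.

```r
# Example data from the question.
Count <- data.frame(
  ID           = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2),
  unemployment = c(1, 2, 3, 4, 1, 2, 0, 1, 2, 3, 4, 5)
)

# Direct route: every row that starts a spell (unemployment == 1) bumps the counter.
B_direct <- cumsum(Count$unemployment == 1)
print(B_direct)   # 1 1 1 1 2 2 2 3 3 3 3 3

# Two-step route, as in the question: sparse counter first ...
index <- which(Count$unemployment == 1)
B_sparse <- rep(NA_integer_, nrow(Count))
B_sparse[index] <- seq_along(index)

# ... then carry the last non-NA value forward.
# last_seen[i] is the position of the most recent non-NA value up to row i.
last_seen <- cummax(seq_along(B_sparse) * (!is.na(B_sparse)))
B_filled <- B_sparse[last_seen]
print(B_filled)   # 1 1 1 1 2 2 2 3 3 3 3 3
```

Note that the counter runs over the whole table rather than restarting within each ID, which matches the desired output in the question.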

imputing forward / backward

I am trying to impute some longitudinal data in this way (see below). For each individual (id), if first values are NA, I would like to impute using the first observed value for that individual regardless when that occurs. Then, I would like to impute forward based on the last value observed for each individual (see imputed below).
var values might not necessarily increase monotonically. Those values might be a character vector.
I have tried several ways to do this, but still I cannot get a satisfactory solution.
Any ideas?
library(data.table)
id <- c(1,1,1,1,1,1,1,2,2,2,2)
time <- c(1,2,3,4,5,6,7,3,5,7,9)
var <- c(NA,NA,1,NA,2,3,NA,NA,2,3,NA)
imputed <- c(1,1,1,1,2,3,3,2,2,3,3)
dat <- data.table(id, time, var, imputed)
id time var imputed
1: 1 1 NA 1
2: 1 2 NA 1
3: 1 3 1 1
4: 1 4 NA 1
5: 1 5 2 2
6: 1 6 3 3
7: 1 7 NA 3
8: 2 3 NA 2
9: 2 5 2 2
10: 2 7 3 3
11: 2 9 NA 3
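You can chain two na.locf calls from zoo within each id: the inner call carries the last observation forward while keeping leading NAs, and the outer call, with fromLast = TRUE, fills those leading NAs from the first observed value:
library(zoo)
dat[, newimp := na.locf(na.locf(var, na.rm = FALSE), fromLast = TRUE), by = id]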
dat
# id time var imputed newimp
# 1: 1 1 NA 1 1
# 2: 1 2 NA 1 1
# 3: 1 3 1 1 1
# 4: 1 4 NA 1 1
# 5: 1 5 2 2 2
# 6: 1 6 3 3 3
# 7: 1 7 NA 3 3
# 8: 2 3 NA 2 2
# 9: 2 5 2 2 2
#10: 2 7 3 3 3
#11: 2 9 NA 3 3
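The same forward-then-backward fill can be sketched without zoo or data.table. The helpers below are hypothetical (`fill_forward` and `fill_both` are not part of any package) and work by indexing, so they should also handle character vectors:

```r
# Carry the last non-NA value forward; leading NAs stay NA.
fill_forward <- function(x) {
  idx <- cummax(seq_along(x) * (!is.na(x)))  # position of last non-NA so far
  idx[idx == 0] <- NA                        # nothing observed yet
  x[idx]
}

# Forward fill, then fill the remaining leading NAs from the first observed value.
fill_both <- function(x) {
  x <- fill_forward(x)
  rev(fill_forward(rev(x)))
}

# Example data from the question.
id      <- c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2)
var     <- c(NA, NA, 1, NA, 2, 3, NA, NA, 2, 3, NA)
imputed <- c(1, 1, 1, 1, 2, 3, 3, 2, 2, 3, 3)

# Apply within each id.
newimp <- ave(var, id, FUN = fill_both)
print(newimp)
print(fill_both(c(NA, "a", NA, "b", NA)))
```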

How to create a count variable by group for specific values in the variable of interest?

At the moment I have to deal with paradata (long format) generated by software during the data collection phase of a cohort study.
How can I create a variable containing the running number of occurrences of a certain value within a grouping variable (like by id: gen _n if VAR1==2 in Stata)?
Basically the data looks like this:
ID: VAR1:
1 2
1 1
1 2
2 2
2 3
2 2
3 1
3 2
3 2
I can create a variable count.1 using
data$count.1 <- ave(data$VAR1, data$ID, FUN = seq_along)
ID: VAR1: count.1:
1 2 1
1 1 2
1 2 3
2 2 1
2 3 2
2 2 3
3 1 1
3 2 2
3 2 3
How can I create a variable count.2 that counts, by ID, the occurrences of the value 2 in VAR1?
ID: VAR1: count.1: count.2:
1 2 1 1
1 1 2 NA
1 2 3 2
2 2 1 1
2 3 2 NA
2 2 3 2
3 1 1 NA
3 2 2 1
3 2 3 2
The Data:
ID=c(1,1,1,2,2,2,3,3,3)
VAR1=c(2,1,2,2,3,2,1,2,2)
data <- as.data.frame(cbind(ID, VAR1))
Thanks in advance!!!
Try
data$count.2 <- with(data, ave(VAR1==2, ID,
FUN=function(x) ifelse(x, cumsum(x), NA)) )
data$count.2
#[1] 1 NA 2 1 NA 2 NA 1 2
Or using data.table
library(data.table)
setDT(data)[VAR1==2, count.2:=1:.N, by=ID][]
# ID VAR1 count.2
#1: 1 2 1
#2: 1 1 NA
#3: 1 2 2
#4: 2 2 1
#5: 2 3 NA
#6: 2 2 2
#7: 3 1 NA
#8: 3 2 1
#9: 3 2 2
Or using dplyr
library(dplyr)
data %>%
  group_by(ID) %>%
  mutate(count.2 = ifelse(VAR1 == 2, cumsum(VAR1 == 2), NA))
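All three answers rest on the same idea: take the running sum of the condition, then blank out the rows where the condition is false. A small base R sketch on the question's data (the names `hit` and `count2` are illustrative) shows the intermediate steps for one ID before applying them per group with ave:

```r
# Example data from the question.
ID   <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
VAR1 <- c(2, 1, 2, 2, 3, 2, 1, 2, 2)

# Steps for ID 1 only.
hit <- VAR1[ID == 1] == 2
print(hit)                            # TRUE FALSE  TRUE
print(cumsum(hit))                    # 1 1 2
print(ifelse(hit, cumsum(hit), NA))   # 1 NA  2

# The same steps within every ID.
count2 <- ave(VAR1 == 2, ID, FUN = function(x) ifelse(x, cumsum(x), NA))
print(count2)
```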
