Create a summarizing variable for multiple columns in data.table (R)

I have the following data.table
dt <- data.table(id=c(1,2,2,2,3,3,4),
date=c("2019-09-13", "2018-12-06", "2017-12-14", "2018-02-08", "2015-12-06", "2012-12-14", "2011-02-08"),
variable_1=c("a","b",NA,NA,"b","c",NA),
variable_2=c(NA,NA,"a",NA,"a","c",NA),
variable_3=c(NA,NA,NA,"b","c","c",NA))
dt
id date variable_1 variable_2 variable_3
1: 1 2019-09-13 a <NA> <NA>
2: 2 2018-12-06 b <NA> <NA>
3: 2 2017-12-14 <NA> a <NA>
4: 2 2018-02-08 <NA> <NA> b
5: 3 2015-12-06 b a c
6: 3 2012-12-14 c c c
7: 4 2011-02-08 <NA> <NA> <NA>
I want to create a variable y that summarizes all the variable_ columns. Every row that has at least one non-NA value among these columns should be 0. Every row in which all of these columns are NA should be 1. Like this:
id date variable_1 variable_2 variable_3 y
1: 1 2019-09-13 a <NA> <NA> 0
2: 2 2018-12-06 b <NA> <NA> 0
3: 2 2017-12-14 <NA> a <NA> 0
4: 2 2018-02-08 <NA> <NA> b 0
5: 3 2015-12-06 b a c 0
6: 3 2012-12-14 c c c 0
7: 4 2011-02-08 <NA> <NA> <NA> 1
In the original data.table I have 22 variables that I am looking at among 830 total variables. So I would prefer not to look for every Variable with _1 to _22 separately.
Is there a way in data.table?

dt[, y := +(rowSums(!is.na(.SD)) == 0L), .SDcols = patterns("^variable_")]
# id date variable_1 variable_2 variable_3 y
# 1: 1 2019-09-13 a <NA> <NA> 0
# 2: 2 2018-12-06 b <NA> <NA> 0
# 3: 2 2017-12-14 <NA> a <NA> 0
# 4: 2 2018-02-08 <NA> <NA> b 0
# 5: 3 2015-12-06 b a c 0
# 6: 3 2012-12-14 c c c 0
# 7: 4 2011-02-08 <NA> <NA> <NA> 1
Walk-through:
.SDcols=patterns(...) defines the columns to be processed as .SD in the j component. This doesn't involve removing/selecting columns for the output, just the ones that will be referenced internally.
!is.na(.SD) returns a logical matrix with the same dimensions as .SD, indicating whether each value is not NA.
rowSums(...) returns the count of non-NA values in each row.
By counting the non-NA values in a row (rather than the NAs), we don't need to know how many columns are being processed; this is what allows the comparison == 0L.
+(...) is a shorthand trick for converting logical values to integer 0/1.
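The steps of the walk-through can also be run one at a time to inspect the intermediate objects. This is only a sketch on a small made-up table; the names toy, cols, present and n_present are hypothetical and not part of the original answer.

```r
library(data.table)

# Toy data (hypothetical), smaller than the table in the question
toy <- data.table(
  id         = c(1, 2, 4),
  variable_1 = c("a", NA, NA),
  variable_2 = c(NA, "b", NA)
)

# Columns whose names start with "variable_"
cols <- grep("^variable_", names(toy), value = TRUE)

# Step 1: logical matrix, TRUE where a value is present (not NA)
present <- !is.na(toy[, ..cols])

# Step 2: number of non-NA values per row
n_present <- rowSums(present)

# Step 3: TRUE where a row has no non-NA value, converted to 0/1 by unary +
toy[, y := +(n_present == 0L)]
print(toy)
```

Only the last row has no value in any variable_ column, so it is the only row that gets y equal to 1.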

Related

Extracting a numeric information align with ID from unstructured dataset in R

I am trying to extract score information for each ID and for each itemID. Here is what my sample dataset looks like.
df <- data.frame(Text_1 = c("Scoring", "1 = Incorrect","Text1","Text2","Text3","Text4", "Demo 1: Color Naming","Amarillo","Azul","Verde","Azul",
"Demo 1: Errors","Item 1: Color naming","Amarillo","Azul","Verde","Azul",
"Item 1: Time in seconds","Item 1: Errors",
"Item 2: Shape Naming","Cuadrado/Cuadro","Cuadrado/Cuadro","Círculo","Estrella","Círculo","Triángulo",
"Item 2: Time in seconds","Item 2: Errors"),
School.2 = c("Teacher:","DC Name:","Date (mm/dd/yyyy):","Child Grade:","Student Study ID:",NA, NA,NA,NA,NA,NA,
0,"1 = Incorrect responses",0,1,NA,NA,NA,0,"1 = Incorrect responses",0,NA,NA,1,1,0,NA,0),
X_Elementary_School..3 = c("Bill:","X District","10/7/21","K","123-2222-2:",NA, NA,NA,NA,NA,NA,
NA,"Child response",NA,NA,NA,NA,NA,NA,"Child response",NA,NA,NA,NA,NA,NA,NA,NA),
School.4 = c("Teacher:","DC Name:","Date (mm/dd/yyyy):","Child Grade:","Student Study ID:",NA, 0,NA,1,NA,NA,0,"1 = Incorrect responses",0,1,NA,NA,120,0,"1 = Incorrect responses",NA,1,0,1,NA,1,110,0),
Y_Elementary_School..2 = c("John:","X District","11/7/21","K","112-1111-3:",NA, NA,NA,NA,NA,NA,NA,"Child response",NA,NA,NA,NA,NA,NA,"Child response",NA,NA,NA,NA,NA,NA, NA,NA))
> df
Text_1 School.2 X_Elementary_School..3 School.4 Y_Elementary_School..2
1 Scoring Teacher: Bill: Teacher: John:
2 1 = Incorrect DC Name: X District DC Name: X District
3 Text1 Date (mm/dd/yyyy): 10/7/21 Date (mm/dd/yyyy): 11/7/21
4 Text2 Child Grade: K Child Grade: K
5 Text3 Student Study ID: 123-2222-2: Student Study ID: 112-1111-3:
6 Text4 <NA> <NA> <NA> <NA>
7 Demo 1: Color Naming <NA> <NA> 0 <NA>
8 Amarillo <NA> <NA> <NA> <NA>
9 Azul <NA> <NA> 1 <NA>
10 Verde <NA> <NA> <NA> <NA>
11 Azul <NA> <NA> <NA> <NA>
12 Demo 1: Errors 0 <NA> 0 <NA>
13 Item 1: Color naming 1 = Incorrect responses Child response 1 = Incorrect responses Child response
14 Amarillo 0 <NA> 0 <NA>
15 Azul 1 <NA> 1 <NA>
16 Verde <NA> <NA> <NA> <NA>
17 Azul <NA> <NA> <NA> <NA>
18 Item 1: Time in seconds <NA> <NA> 120 <NA>
19 Item 1: Errors 0 <NA> 0 <NA>
20 Item 2: Shape Naming 1 = Incorrect responses Child response 1 = Incorrect responses Child response
21 Cuadrado/Cuadro 0 <NA> <NA> <NA>
22 Cuadrado/Cuadro <NA> <NA> 1 <NA>
23 Círculo <NA> <NA> 0 <NA>
24 Estrella 1 <NA> 1 <NA>
25 Círculo 1 <NA> <NA> <NA>
26 Triángulo 0 <NA> 1 <NA>
27 Item 2: Time in seconds <NA> <NA> 110 <NA>
28 Item 2: Errors 0 <NA> 0 <NA>
This sample dataset is limited to two schools, two teachers and two students.
In this step, I need to extract student responses for each item.
Wherever the first column has Item, I need to grab from there. In particular, I need to index the rows and columns programmatically rather than giving exact row and column numbers, since this will be applied to multiple data files and each file has different information. There is no need to grab the "...: Errors" part.
################################################################################
# ## 2-extract the score information here
# ## 1-grab item information from where "Item 1:.." starts
How can I automate this part rather than hard-coding the row numbers?
score<-df[c(7:11,13:17,20:26),c(seq(2,dim(df)[2],2))] # need to automate row and columns index here
score<-as.data.frame(t(score))
rownames(score)<-seq(1,nrow(score),1)
colnames(score)<-paste0('i',seq(1,ncol(score),1)) # assign col names for items
score<-apply(score,2,as.numeric) # only keep numeric columns
score<-as.data.frame(score)
score$total<-rowSums(score,na.rm=T); score # create a total score
> score
i1 i2 i3 i4 i5 i6 i7 i8 i9 i10 i11 i12 i13 i14 i15 i16 i17 total
1 NA NA NA NA NA NA 0 1 NA NA NA 0 NA NA 1 1 0 3
2 0 NA 1 NA NA NA 0 1 NA NA NA NA 1 0 1 NA 1 5
Additionally, I need to add the ID, which I could not achieve here.
My desired output would be:
> score
ID i1 i2 i3 i4 i5 i6 i7 i8 i9 i10 i11 i12 i13 i14 i15 i16 i17 total
1 123-2222-2 NA NA NA NA NA NA 0 1 NA NA NA 0 NA NA 1 1 0 3
2 112-1111-3 0 NA 1 NA NA NA 0 1 NA NA NA NA 1 0 1 NA 1 5

data.table: (Quickly) set column values within a group to the group's last element

Say that I have a dataset like this one:
example <- data.table(Object = rep(LETTERS[1:3], each=3), date = as.Date(rep(c(NA,NA,"2020-01-01"),3)), date_data =1:9)
example
Object date date_data
1: A <NA> 1
2: A <NA> 2
3: A 2020-01-01 3
4: B <NA> 4
5: B <NA> 5
6: B 2020-01-01 6
7: C <NA> 7
8: C <NA> 8
9: C 2020-01-01 9
I would like to set all date_data within a certain group equal to the last date_data value for that group. So, the desired output is this:
Object date date_data
1: A <NA> 3
2: A <NA> 3
3: A 2020-01-01 3
4: B <NA> 6
5: B <NA> 6
6: B 2020-01-01 6
7: C <NA> 9
8: C <NA> 9
9: C 2020-01-01 9
Now, I have managed to get exactly what I need using example[, date_data:= .SD[.N]$date_data, by = "Object"]. The problem is that I want to make such a call in a loop iterating over a large data table. Calling .SD each time is too slow. Ideally, the code would use .I (like here) or some other data.table optimized features that I don't know about. I didn't manage to figure out the right way to do this.
Any ideas?
You can use data.table::last:
library(data.table)
example <- data.table(
Object = rep(LETTERS[1:3], each=3),
date = as.Date(rep(c(NA,NA,"2020-01-01"),3)),
date_data =1:9
)
example[, date_data := last(date_data), by = Object ]
example
# Object date date_data
# <char> <Date> <int>
# 1: A <NA> 3
# 2: A <NA> 3
# 3: A 2020-01-01 3
# 4: B <NA> 6
# 5: B <NA> 6
# 6: B 2020-01-01 6
# 7: C <NA> 9
# 8: C <NA> 9
# 9: C 2020-01-01 9
I don't know whether that is much more optimised, though. Alternatively, you can index the column with .N directly:
example[, date_data := date_data[.N], by = Object]
example
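As a quick sanity check, the three variants discussed here (the .SD[.N] call from the question, last(), and indexing with .N) can be compared on the example data. This is a sketch only: the helper make_example() is hypothetical, and no timing claims are made. To measure speed on real data, each call could be wrapped in system.time().

```r
library(data.table)

# Hypothetical helper: rebuilds the example so each variant starts from fresh data
make_example <- function() {
  data.table(
    Object    = rep(LETTERS[1:3], each = 3),
    date      = as.Date(rep(c(NA, NA, "2020-01-01"), 3)),
    date_data = 1:9
  )
}

# Variant from the question: last row of .SD per group
res_sd <- make_example()[, date_data := .SD[.N]$date_data, by = Object]

# Variant using last()
res_last <- make_example()[, date_data := last(date_data), by = Object]

# Variant indexing the column with .N
res_n <- make_example()[, date_data := date_data[.N], by = Object]

# All three should give the same column
print(all.equal(res_sd, res_last))
print(all.equal(res_sd, res_n))
```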

Melt or Replicate rows in a data table a certain number of times and include counter in R

I would like to "expand" a data frame, duplicating the information in some columns the number of times indicated by a fifth column.
What would be the most efficient way to achieve this task in R? (Open to data.table, dplyr, or reshape solutions.)
Original Dataframe/DataTable:
f_1 f_2 d_1 d_2 i_1
1: 1 A 2016-01-01 <NA> NA
2: 2 A 2016-01-02 <NA> NA
3: 2 B 2016-01-03 2016-01-01 2
4: 3 C 2016-01-04 <NA> NA
5: 4 D 2016-01-05 2016-01-02 5
Desired Dataframe/DataTable
f_1 f_2 d_1 d_2 i_1
1: 1 A 2016-01-01 <NA> NA
2: 2 A 2016-01-02 <NA> NA
3: 2 B 2016-01-03 2016-01-01 1
4: 2 B 2016-01-03 2016-01-01 2
5: 3 C 2016-01-04 <NA> NA
6: 4 D 2016-01-05 2016-01-02 1
7: 4 D 2016-01-05 2016-01-02 2
8: 4 D 2016-01-05 2016-01-02 3
9: 4 D 2016-01-05 2016-01-02 4
10: 4 D 2016-01-05 2016-01-02 5
Reproducible data:
DT <- data.table(
f_1 = factor(c(1,2,2,3,4)),
f_2 = factor(c("A", "A", "B", "C", "D")),
d_1 = as.Date(c("2016-01-01","2016-01-02","2016-01-03","2016-01-04","2016-01-05")),
d_2 = as.Date(c(NA,NA,"2016-01-01",NA,"2016-01-02")),
i_1 = as.integer(c(NA,NA,2,NA,5)))
Thanks, and sorry if this is a duplicate. I am struggling with this kind of reshaping exercise.
Here is a data.table solution. Basically, group by those columns that you want to duplicate and generate sequence of integers using the number in i_1
DT[, .(i_1=if(!is.na(i_1)) seq_len(i_1) else i_1),
by=c(names(DT)[-ncol(DT)])]
output:
f_1 f_2 d_1 d_2 i_1
1: 1 A 2016-01-01 <NA> NA
2: 2 A 2016-01-02 <NA> NA
3: 2 B 2016-01-03 2016-01-01 1
4: 2 B 2016-01-03 2016-01-01 2
5: 3 C 2016-01-04 <NA> NA
6: 4 D 2016-01-05 2016-01-02 1
7: 4 D 2016-01-05 2016-01-02 2
8: 4 D 2016-01-05 2016-01-02 3
9: 4 D 2016-01-05 2016-01-02 4
10: 4 D 2016-01-05 2016-01-02 5
Or another way using data.table: for each row, create a sequence of numbers using i_1, attach the original data to that sequence with c(.SD[, -"i_1"], ...), and finally remove the by column.
DT[, c(.SD[, -"i_1"], .(i_1=if (!is.na(i_1)) seq_len(i_1) else i_1)),
by=seq_len(DT[,.N])][,-1L]
Are you OK replacing i_1 with 1 when it's NA? If so, the following would be slightly more readable:
First, repeat the rows the specified number of times (ad hoc accounting for the missing values of i_1, using replace courtesy of @Frank):
DT_out = DT[rep(1:.N, replace(i_1, is.na(i_1), 1L))]
This could be just DT[rep(1:.N, i_1)] if we've already replaced DT[is.na(i_1), i_1 := 1L].
All that's left is to update the values of i_1. There are simpler versions of this, depending on your data's particulars. Here is what I think is the most general version:
DT_out[!is.na(i_1), i_1 := rowidv(.SD), .SDcols = !'i_1'][]
# f_1 f_2 d_1 d_2 i_1
# 1: 1 A 2016-01-01 <NA> NA
# 2: 2 A 2016-01-02 <NA> NA
# 3: 2 B 2016-01-03 2016-01-01 1
# 4: 2 B 2016-01-03 2016-01-01 2
# 5: 3 C 2016-01-04 <NA> NA
# 6: 4 D 2016-01-05 2016-01-02 1
# 7: 4 D 2016-01-05 2016-01-02 2
# 8: 4 D 2016-01-05 2016-01-02 3
# 9: 4 D 2016-01-05 2016-01-02 4
# 10: 4 D 2016-01-05 2016-01-02 5
rowid and rowidv give the row number within the groups defined by the variables it's passed. You can compare with rowid(f_2), rowid(f_1), and rowid(f_1, f_2) to get an idea of what I mean. rowidv(.SD) is a shorthand for rowid(f_1, f_2, d_1, d_2), since we exclude i_1 from the columns in .SD.
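To make the difference between these calls concrete, here is a small sketch on made-up data; the table x and its columns g1 and g2 are hypothetical and only serve to show how the within-group counter changes with the grouping variables.

```r
library(data.table)

# Hypothetical toy table with two grouping columns
x <- data.table(
  g1 = c("a", "a", "b", "a"),
  g2 = c(1, 1, 1, 2)
)

# Row number within each value of g1
x[, by_g1 := rowid(g1)]

# Row number within each (g1, g2) combination
x[, by_both := rowid(g1, g2)]

# Same as rowid(g1, g2), but passing the columns as a list via .SD
x[, by_sd := rowidv(.SD), .SDcols = c("g1", "g2")]

print(x)
```

The fourth row is the third "a" in g1 but the first (a, 2) pair, so by_g1 is 3 there while by_both and by_sd are 1.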

Copy a value from one person in a group to everyone in a group

I have a data set in long format (multiple observations per ID), due to omitted information on prescriptions. Each ID is part of a larger "set", and there are 50 or more sets all with one diseased ID. One person per set has the disease, and the others don't.
dt <- data.table(ID = rep(1:10, each = 4),
disease = c(rep(0, 16), rep(1, 4), rep(0, 12), rep(1,4), rep(0,4)),
dob = c(rep(as.Date("13/05/1924", "%d/%m/%Y"), 4), rep(as.Date("15/09/1936", "%d/%m/%Y"),4),
rep(as.Date("30/06/1957", "%d/%m/%Y"),4), rep(as.Date("19/02/1946", "%d/%m/%Y"),4),
rep(as.Date("26/04/1939", "%d/%m/%Y"),4), rep(as.Date("13/05/1922", "%d/%m/%Y"), 4), rep(as.Date("18/10/1945", "%d/%m/%Y"),4),
rep(as.Date("30/06/1957", "%d/%m/%Y"),4), rep(as.Date("19/02/1946", "%d/%m/%Y"),4),
rep(as.Date("26/12/1939", "%d/%m/%Y"),4)),
disease.date = c(rep(as.Date("01/01/2000", "%d/%m/%Y"), 16), rep(as.Date("19/02/2006", "%d/%m/%Y"),4),
rep(as.Date("01/01/2000", "%d/%m/%Y"), 12), rep(as.Date("13/11/2010", "%d/%m/%Y"),4),
rep(as.Date("01/01/2000", "%d/%m/%Y"), 4)),
set = c(rep(1,20), rep(2,20)))
dt <- dt[(disease==0), disease.date:=NA]
dt
ID disease dob disease.date set
1: 1 0 1924-05-13 <NA> 1
2: 1 0 1924-05-13 <NA> 1
3: 1 0 1924-05-13 <NA> 1
4: 1 0 1924-05-13 <NA> 1
5: 2 0 1936-09-15 <NA> 1
6: 2 0 1936-09-15 <NA> 1
7: 2 0 1936-09-15 <NA> 1
8: 2 0 1936-09-15 <NA> 1
9: 3 0 1957-06-30 <NA> 1
10: 3 0 1957-06-30 <NA> 1
11: 3 0 1957-06-30 <NA> 1
12: 3 0 1957-06-30 <NA> 1
13: 4 0 1946-02-19 <NA> 1
14: 4 0 1946-02-19 <NA> 1
15: 4 0 1946-02-19 <NA> 1
16: 4 0 1946-02-19 <NA> 1
17: 5 1 1939-04-26 2006-02-19 1
18: 5 1 1939-04-26 2006-02-19 1
19: 5 1 1939-04-26 2006-02-19 1
20: 5 1 1939-04-26 2006-02-19 1
21: 6 0 1922-05-13 <NA> 2
22: 6 0 1922-05-13 <NA> 2
23: 6 0 1922-05-13 <NA> 2
24: 6 0 1922-05-13 <NA> 2
25: 7 0 1945-10-18 <NA> 2
26: 7 0 1945-10-18 <NA> 2
27: 7 0 1945-10-18 <NA> 2
28: 7 0 1945-10-18 <NA> 2
29: 8 0 1957-06-30 <NA> 2
30: 8 0 1957-06-30 <NA> 2
31: 8 0 1957-06-30 <NA> 2
32: 8 0 1957-06-30 <NA> 2
33: 9 1 1946-02-19 2010-11-13 2
34: 9 1 1946-02-19 2010-11-13 2
35: 9 1 1946-02-19 2010-11-13 2
36: 9 1 1946-02-19 2010-11-13 2
37: 10 0 1939-12-26 <NA> 2
38: 10 0 1939-12-26 <NA> 2
39: 10 0 1939-12-26 <NA> 2
40: 10 0 1939-12-26 <NA> 2
I'm interested in finding the age of everyone in a set on the disease date of that set's case.
For example, how old is everyone in set 1 on 19/02/2006 (the case's disease date)? And in set 2 on 13/11/2010?
I've tried the data.table way:
cc[, age := dob - oa.cons.date, by = set]
which only worked for those with a disease.date
Any other thoughts I had involved copying the disease.date of each case to the controls in the same set, but I didn't know how to do that either.
You can copy the first non-empty disease date within each set group to the whole column disease.date:
dt[, disease.date := disease.date[!is.na(disease.date)][1], by = set]
Then calculate age:
dt[, age := disease.date - dob]
Note that the resulting time differences are in days. You may divide them by 365 (or 365.25, to account for leap years) or convert them in any other suitable way. The lubridate package may be useful here. With it loaded (library(lubridate)):
dt[, age := as.period(interval(dob, disease.date), unit = "years")]
or
dt[, age := decimal_date(disease.date) - decimal_date(dob)]
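The two steps of this answer (copy the case's date to the whole set, then subtract) can be tried on a small made-up table. This is a sketch only: the table toy and the column age_years are hypothetical, and dividing by 365.25 is just one approximate way to turn days into years.

```r
library(data.table)

# Hypothetical toy data: two sets, one case (non-NA disease.date) per set
toy <- data.table(
  set          = c(1, 1, 2, 2),
  dob          = as.Date(c("1950-01-01", "1960-01-01", "1940-06-15", "1970-06-15")),
  disease.date = as.Date(c(NA, "2000-01-01", "2010-06-15", NA))
)

# Step 1: copy the first non-missing disease date to every row of the set
toy[, disease.date := disease.date[!is.na(disease.date)][1], by = set]

# Step 2: age at that date, converted from days to approximate years
toy[, age_years := as.numeric(disease.date - dob) / 365.25]

print(toy)
```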
You can try this:
(dt$disease.date[20] - dt$dob)/365
This takes dt$disease.date[20] because there are NAs in the disease.date column. Row 20 holds the disease date of the case in set 1, so the result is only meaningful for the rows of set 1.
Since both columns are Date objects, R automatically calculates the difference between the two dates. The difference is in days, so dividing by 365 gives the approximate age in years.

Remove rows after a certain date based on a condition in R

There are similar questions I've seen, but none of them apply it to specific rows of a data.table or data.frame, rather they apply it to the whole matrix.
Subset a dataframe between 2 dates
How to select some rows with specific date from a data frame in R
I have a dataset with patients who were diagnosed with OA and those who were not:
dt <- data.table(ID = seq(1,10,1), OA = c(1,0,0,1,0,0,0,1,1,0),
oa.date = as.Date(c("01/01/2006", "01/01/2001", "01/01/2001", "02/03/2005","01/01/2001","01/01/2001","01/01/2001","05/06/2010", "01/01/2011", "01/01/2001"), "%d/%m/%Y"),
stop.date = as.Date(c("01/01/2006", "31/12/2007", "31/12/2008", "02/03/2005", "31/12/2011", "31/12/2011", "31/12/2011", "05/06/2010", "01/01/2011", "31/12/2011"), "%d/%m/%Y"))
dt$oa.date[dt$OA==0] <- NA
> dt
ID OA oa.date stop.date
1: 1 1 2006-01-01 2006-01-01
2: 2 0 <NA> 2007-12-31
3: 3 0 <NA> 2008-12-31
4: 4 1 2005-03-02 2005-03-02
5: 5 0 <NA> 2011-12-31
6: 6 0 <NA> 2011-12-31
7: 7 0 <NA> 2011-12-31
8: 8 1 2010-06-05 2010-06-05
9: 9 1 2011-01-01 2011-01-01
10: 10 0 <NA> 2011-12-31
What I want to do is delete those who were diagnosed with OA (OA==1) before start:
start <- as.Date("01/01/2009", "%d/%m/%Y")
So I want my final data to be:
> dt
ID OA oa.date stop.date
1: 2 0 <NA> 2007-12-31
2: 3 0 <NA> 2008-12-31
3: 5 0 <NA> 2011-12-31
4: 6 0 <NA> 2011-12-31
5: 7 0 <NA> 2011-12-31
6: 8 1 2010-06-05 2010-06-05
7: 9 1 2011-01-01 2011-01-01
8: 10 0 <NA> 2011-12-31
My tries are:
dt[dt$OA==1] <- dt[!(oa.date < start)]
I've also tried a loop but to no effect.
Any help is much appreciated.
This should be straightforward:
> dt[!(OA & oa.date < start)]
# ID OA oa.date stop.date
#1: 2 0 <NA> 2007-12-31
#2: 3 0 <NA> 2008-12-31
#3: 5 0 <NA> 2011-12-31
#4: 6 0 <NA> 2011-12-31
#5: 7 0 <NA> 2011-12-31
#6: 8 1 2010-06-05 2010-06-05
#7: 9 1 2011-01-01 2011-01-01
#8: 10 0 <NA> 2011-12-31
The OA column is binary (1/0), so & coerces it to logical (TRUE/FALSE) in the i-expression. For rows with OA == 0, oa.date is NA, but FALSE & NA evaluates to FALSE, so those rows are kept.
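The way this condition treats missing dates can be checked on a few made-up rows. This is a minimal sketch; the table toy and the vector drop are hypothetical names used only for illustration.

```r
library(data.table)

start <- as.Date("2009-01-01")

# Hypothetical toy data: undiagnosed patients (OA == 0) have a missing oa.date
toy <- data.table(
  ID      = 1:4,
  OA      = c(1, 0, 1, 0),
  oa.date = as.Date(c("2006-01-01", NA, "2010-06-05", NA))
)

# The numeric OA is coerced to logical by &; FALSE & NA is FALSE,
# so rows with OA == 0 are never flagged for removal
drop <- toy[, OA & oa.date < start]
print(drop)

# Keep everything that is not flagged
kept <- toy[!(OA & oa.date < start)]
print(kept)
```

Only the patient diagnosed before start is flagged, so the other three rows remain.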
You can try
dt=dt[dt$OA==0|(dt$OA==1&!(dt$oa.date < start)),]
