I'd like to summarize a set of observations in a data.table and could use some help with the syntax.
I think this is as simple as a join, but I'm trying to identify whether specific values were seen on a specific observation DAY, even if it's across multiple measurements or sensors on that day.
- observations are summarized by date
- observation dates have varying counts of measurements (rows per date)
- the 'M'easurement columns indicate that a specific value was observed in ANY sensor on that day
I've created 2 sample sets of data that I hope will clarify the goal. I've also created an image of an Excel spreadsheet that I hope shows the relationship between the data.
library(data.table)
raw <- data.table(
  Date = as.Date(c("2013-5-4","2013-5-4","2013-5-4", "2013-5-9","2013-5-9", "2013-5-16","2013-5-16","2013-5-16", "2013-5-30")),
  S1 = c(4, 2, 3, 1, 1, 8, 7, 3, 3),
  S2 = c(2, 5, 2, 4, 4, 9, 1, 6, 4),
  S3 = c(6, 2, 2, 7, 3, 2, 7, 2, 1)
)
summarized <- data.table(
  Date = as.Date(c("2013-5-4", "2013-5-9", "2013-5-16", "2013-5-30")),
  M1 = c(FALSE,TRUE,TRUE,TRUE),
  M2 = c(TRUE,FALSE,TRUE,FALSE),
  M3 = c(TRUE,TRUE,TRUE,TRUE),
  M4 = c(TRUE,FALSE,FALSE,TRUE),
  M5 = c(TRUE,FALSE,FALSE,FALSE),
  M6 = c(TRUE,FALSE,TRUE,FALSE),
  M7 = c(FALSE,TRUE,TRUE,FALSE),
  M8 = c(FALSE,FALSE,TRUE,FALSE),
  M9 = c(FALSE,FALSE,TRUE,FALSE),
  M10 = c(FALSE,FALSE,TRUE,FALSE)
)
[Excel screenshot showing the relationship between raw and summarized]
Raw is the measurement input. Multiple measurements can happen on the same observation date (i.e. multiple rows).
Summarized is what I'm hoping to get out. Rows are summarized by date, and the 'M'easurement columns merely indicate that the value following the M (i.e. 1 for M1, 2 for M2) was observed on that day in any of the S columns. For example, the number 2 was seen in the first and last measurements on 5/16, but the number 5 was not seen in any of the 9 values on 5/16.
I think I need to use a join but how to calculate the M columns escapes me.
Any help is much appreciated.
Question: is there a name for this type of operation in data science or mathematics?
Update:
I'm trying the following:
setkey(raw, Date)
s <- data.table(Date = unique(raw$Date))  # a data.table of just the unique dates
setkey(s, Date)
s[raw, M1 := (length(na.omit(match(c(raw$S1, raw$S2, raw$S3), 1))) >= 1)]
Note that the values are not what's expected for 5-4 (M1 should be FALSE). I think this is because the raw rows are not being constrained in my match statement.
Date M1
1: 2013-05-04 TRUE
2: 2013-05-09 TRUE
3: 2013-05-16 TRUE
4: 2013-05-30 TRUE
My guess is I need to use something different to subset the raw rows in the join.
This seems to work:
raw[, lapply(1:10, `%in%`, unique(unlist(.SD))), by = Date]
The result is
Date V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1: 2013-05-04 FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
2: 2013-05-09 TRUE FALSE TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE
3: 2013-05-16 TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE FALSE
4: 2013-05-30 TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
If you want the columns with "M" instead of "V", use c(M=1:10) in place of 1:10.
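For example, a sketch of that named variant (lapply() keeps the names of c(M = 1:10), so the result columns come out as M1 through M10):
raw[, lapply(c(M = 1:10), `%in%`, unique(unlist(.SD))), by = Date]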
This is a reshaping problem.
First, since it doesn't matter which sensor the data came from, let's aggregate your three columns into one column.
temp <- raw[,Reduce(union,list(S1,S2,S3)),by=Date]
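Reduce(union, list(S1, S2, S3)) collapses the three sensor columns into the set of distinct values seen on each date, one row per (Date, value) pair, in a column that data.table auto-names V1:
head(temp)
#          Date V1
# 1: 2013-05-04  4
# 2: 2013-05-04  2
# 3: 2013-05-04  3
# 4: 2013-05-04  5
# 5: 2013-05-04  6
# 6: 2013-05-09  1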
Now we want to reshape from "long" to "wide" format.
A data.table solution borrowed from this answer:
setkey(temp,Date,V1)
temp[CJ(unique(Date),unique(V1)), list(.N)][,
setNames(as.list(as.logical(N)), paste0("M",unique(V1))), by = Date]
# Date M1 M2 M3 M4 M5 M6 M7 M8 M9
# 1: 2013-05-04 FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
# 2: 2013-05-09 TRUE FALSE TRUE TRUE FALSE FALSE TRUE FALSE FALSE
# 3: 2013-05-16 TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE
# 4: 2013-05-30 TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
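Here CJ() builds the complete cross join of every date with every observed value, the join counts matches per combination (.N, which is 0 for pairs that never occurred), as.logical(N) turns those counts into TRUE/FALSE, and setNames() relabels the columns M1, M2, and so on. (This relies on the old "by-without-by" grouping; on data.table 1.9.4 or later you would add by = .EACHI to the first call.)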
Base reshape works this way:
as.data.table(reshape(temp, timevar = "V1", v.names = "V1", idvar = "Date", direction = "wide"))
# Date V1.4 V1.2 V1.3 V1.5 V1.6 V1.1 V1.7 V1.8 V1.9
# 1: 2013-05-04 4 2 3 5 6 NA NA NA NA
# 2: 2013-05-09 4 NA 3 NA NA 1 7 NA NA
# 3: 2013-05-16 NA 2 3 NA 6 1 7 8 9
# 4: 2013-05-30 4 NA 3 NA NA 1 NA NA NA
## to get the columns in value order, reshape with the rows sorted by V1
temp2 <- as.data.table(reshape(temp[order(V1)], timevar = "V1", v.names = "V1", idvar = "Date", direction = "wide"))
# Date V1.1 V1.2 V1.3 V1.4 V1.5 V1.6 V1.7 V1.8 V1.9
# 1: 2013-05-09 1 NA 3 4 NA NA 7 NA NA
# 2: 2013-05-16 1 2 3 NA NA 6 7 8 9
# 3: 2013-05-30 1 NA 3 4 NA NA NA NA NA
# 4: 2013-05-04 NA 2 3 4 5 6 NA NA NA
## convert to logical TRUE/FALSE
temp2[,lapply(.SD,function(x) {x[is.na(x)] <- 0; as.logical(x)}), by = Date]
# Date V1.1 V1.2 V1.3 V1.4 V1.5 V1.6 V1.7 V1.8 V1.9
# 1: 2013-05-09 TRUE FALSE TRUE TRUE FALSE FALSE TRUE FALSE FALSE
# 2: 2013-05-16 TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE
# 3: 2013-05-30 TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
# 4: 2013-05-04 FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
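as.logical() maps 0 to FALSE and any other value to TRUE, which is why the NAs are replaced with 0 first.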
The package reshape2 is a bit more intuitive:
require(reshape2)
## dummy variable for TRUE/FALSE
temp[,vv := TRUE]
temp_reshape2 <- as.data.table(dcast(temp, Date ~ V1, value.var = "vv"))
## replace NA with FALSE
temp_reshape2[, lapply(.SD, function(x) {x[is.na(x)] <- FALSE; x}), by = Date]
# Date 1 2 3 4 5 6 7 8 9
# 1: 2013-05-04 FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
# 2: 2013-05-09 TRUE FALSE TRUE TRUE FALSE FALSE TRUE FALSE FALSE
# 3: 2013-05-16 TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE
# 4: 2013-05-30 TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
And for completeness, a lame parse-eval solution:
limits <- temp[,c(min(V1),max(V1))]
sapply(limits[1]:limits[2], function(x) {
  temp[,eval(parse(text=paste0("M",x," := any(abs(V1 - ",x,") < .Machine$double.eps)"))),by = Date]
})
Related
Thanks in advance for your kind help. This is my dataframe:
df <- data.frame('a'=c(1,2,3,4,5), 'b'=c("A",NA,"B","C","A"))
df
And I want to create a new column for each value, based on whether that value of df$b is present or absent (TRUE/FALSE). I'm using grepl for this, but I'm not sure how to dynamically create the new columns.
I'm creating a vector with the unique values of df$b
list <- as.vector(unique(df$b))
And I want to iterate over df$b with a for loop, in order to get a data frame like this:
a b A B C
1 1 A TRUE FALSE FALSE
2 2 NA FALSE FALSE FALSE
3 3 B FALSE TRUE FALSE
4 4 C FALSE FALSE TRUE
5 5 A TRUE FALSE FALSE
But I'm not sure how to generate the new column inside the for loop. I'm trying to do something like this:
for (i in list) {
  logical <- grepl(df$b, i)
  df$i <- logical
}
But it generates an error. Any help will be appreciated.
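For reference, a minimal corrected sketch of that loop (my guess at the intent): grepl()'s argument order is grepl(pattern, x), and df$i creates a column literally named "i", so the computed name needs df[[i]]. Exact matching with == also avoids accidental substring matches:
vals <- as.vector(unique(na.omit(df$b)))  # drop NA so it doesn't become a column name
for (i in vals) {
  # NA entries in b would compare to NA, so mask them to FALSE explicitly
  df[[i]] <- !is.na(df$b) & df$b == i
}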
This can be done with table:
df <- cbind(df, as.data.frame.matrix(table(df) > 0))
-output
df
a b A B C
1 1 A TRUE FALSE FALSE
2 2 <NA> FALSE FALSE FALSE
3 3 B FALSE TRUE FALSE
4 4 C FALSE FALSE TRUE
5 5 A TRUE FALSE FALSE
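table(df) cross-tabulates a against b (the NA in b is dropped by default, so that row gets all-zero counts), > 0 converts the counts to logical, and as.data.frame.matrix() keeps the rectangular shape:
table(df)
#    b
# a   A B C
#   1 1 0 0
#   2 0 0 0
#   3 0 1 0
#   4 0 0 1
#   5 1 0 0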
You can use this for loop:
list <- as.vector(unique(na.omit(df$b)))
for (i in seq_along(list)) {
  df[[list[i]]] <- ifelse(!is.na(df$b), list[i] == df$b, FALSE)
}
output
a b A B C
1 1 A TRUE FALSE FALSE
2 2 <NA> FALSE FALSE FALSE
3 3 B FALSE TRUE FALSE
4 4 C FALSE FALSE TRUE
5 5 A TRUE FALSE FALSE
I have transactional data like this:
library(data.table)
library(stringr)
sample <- data.table(customerid = c(1,1,2,2,2,3,4,4,5,5,6,6,6,7,7),
                     product = c("A","A+B","A","A+B+C","A+C","B","B+C+D","C+D","A+D","A+B+D","A+B","A","A+C","B+D","D"))
I am trying to count how many distinct products each customer bought in total and add it in a column named total_product.
I tried this code in data.table:
sample[, A:= str_detect(product,"A")]
sample[, B:= str_detect(product,"B")]
sample[, C:= str_detect(product,"C")]
sample[, D:= str_detect(product,"D")]
sample
The code returns:
customerid product A B C D
1: 1 A TRUE FALSE FALSE FALSE
2: 1 A+B TRUE TRUE FALSE FALSE
3: 2 A TRUE FALSE FALSE FALSE
4: 2 A+B+C TRUE TRUE TRUE FALSE
5: 2 A+C TRUE FALSE TRUE FALSE
6: 3 B FALSE TRUE FALSE FALSE
7: 4 B+C+D FALSE TRUE TRUE TRUE
8: 4 C+D FALSE FALSE TRUE TRUE
9: 5 A+D TRUE FALSE FALSE TRUE
10: 5 A+B+D TRUE TRUE FALSE TRUE
11: 6 A+B TRUE TRUE FALSE FALSE
12: 6 A TRUE FALSE FALSE FALSE
13: 6 A+C TRUE FALSE TRUE FALSE
14: 7 B+D FALSE TRUE FALSE TRUE
15: 7 D FALSE FALSE FALSE TRUE
I saw a question on Stack Overflow suggesting that I merge the four columns c(A, B, C, D) and count the TRUEs.
But in my case, the same product would then be counted more than once.
Thanks for your advice!
We can use lapply on the pattern vector (LETTERS[1:4]) and either specify the arguments of str_detect directly:
sample[, LETTERS[1:4] := lapply(LETTERS[1:4], str_detect, string = product)]
Or use an anonymous (lambda) function:
sample[, LETTERS[1:4] := lapply(LETTERS[1:4], function(x)
str_detect(product, x))]
Then create the 'total_product' count as the row-wise sum of the logical vectors (TRUE -> 1 and FALSE -> 0):
sample[, total_product := rowSums(.SD), .SDcols = A:D]
If we want to count the unique elements of 'product' for each 'customerid', an option is to split the column with strsplit and get the distinct count with uniqueN:
sample[, .(total_product = uniqueN(unlist(strsplit(product,
'+', fixed = TRUE)))), by = customerid]
-output
# customerid total_product
#1: 1 2
#2: 2 3
#3: 3 1
#4: 4 3
#5: 5 3
#6: 6 3
#7: 7 2
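Here fixed = TRUE stops the + from being read as a regex quantifier, so each basket splits cleanly into individual products, and uniqueN() then counts each distinct product once per customer:
strsplit("A+B+C", "+", fixed = TRUE)
# [[1]]
# [1] "A" "B" "C"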
I'm working with a large data frame of hospitalization records. Many patients have two or more hospitalizations, and their past medical history may be incomplete at one or more of the hospitalizations. I'd like to collapse all the information from each of their hospitalizations into a single list of medical problems for each patient.
Here's a sample data frame:
id <- c("123","456","789","101","123","587","456","789")
HTN <- c(TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE)
DM2 <- c(FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE)
TIA <- c(TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE)
df <- data.frame(id,HTN,DM2,TIA)
df
Which comes out to:
> df
id HTN DM2 TIA
1 123 TRUE FALSE TRUE
2 456 FALSE FALSE TRUE
3 789 FALSE TRUE TRUE
4 101 FALSE TRUE TRUE
5 123 FALSE FALSE FALSE
6 587 TRUE TRUE TRUE
7 456 FALSE FALSE TRUE
8 789 FALSE TRUE TRUE
I'd like my output to look like this:
id <- c("101","123","456","587","789")
HTN <- c(FALSE,TRUE,FALSE,TRUE,FALSE)
DM2 <- c(TRUE,FALSE,FALSE,TRUE,TRUE)
TIA <- c(TRUE,TRUE,TRUE,TRUE,TRUE)
df2 <- data.frame(id,HTN,DM2,TIA)
df2
id HTN DM2 TIA
1 101 FALSE TRUE TRUE
2 123 TRUE FALSE TRUE
3 456 FALSE FALSE TRUE
4 587 TRUE TRUE TRUE
5 789 FALSE TRUE TRUE
So far I've got a pretty good hunch that arranging and grouping my data is the right place to start, and I think I could make it work by creating a new variable for each medical problem. I have about 30 medical problems I'll need to collapse this way, though, and that much repetitive code just seems like a recipe for an occult error.
df3 <- df %>%
arrange(id) %>%
group_by(id)
Looking around I haven't been able to find a particularly elegant way to go about this. Is there some slick dplyr function I'm overlooking?
We may use
df %>% group_by(id) %>% summarize_all(any)
# A tibble: 5 x 4
# id HTN DM2 TIA
# <fct> <lgl> <lgl> <lgl>
# 1 101 FALSE TRUE TRUE
# 2 123 TRUE FALSE TRUE
# 3 456 FALSE FALSE TRUE
# 4 587 TRUE TRUE TRUE
# 5 789 FALSE TRUE TRUE
In this way we first indeed group by id, as you suggested. Then we summarize all the variables with the function any: given a logical vector (e.g. HTN for patient 101), it returns TRUE if any of the rows is TRUE, and FALSE otherwise.
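For instance, with the sample data, patient 123 has HTN values c(TRUE, FALSE) across the two hospitalizations and DM2 values c(FALSE, FALSE):
any(c(TRUE, FALSE))   # TRUE  -> HTN is recorded for patient 123
any(c(FALSE, FALSE))  # FALSE -> DM2 is not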
A base R option would be
aggregate(.~ id, df, any)
# id HTN DM2 TIA
#1 101 FALSE TRUE TRUE
#2 123 TRUE FALSE TRUE
#3 456 FALSE FALSE TRUE
#4 587 TRUE TRUE TRUE
#5 789 FALSE TRUE TRUE
Or with rowsum
rowsum(+(df[-1]), group = df$id) > 0
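The unary + coerces the logical columns to 0/1 so rowsum() can add them within each id, and comparing the sums to 0 converts back to logical. Note the result is a matrix with the ids as row names:
#       HTN   DM2  TIA
# 101 FALSE  TRUE TRUE
# 123  TRUE FALSE TRUE
# 456 FALSE FALSE TRUE
# 587  TRUE  TRUE TRUE
# 789 FALSE  TRUE TRUE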
If we prefer data.table we might use:
setDT(df)[, lapply(.SD, any), keyby = id]
id HTN DM2 TIA
1: 101 FALSE TRUE TRUE
2: 123 TRUE FALSE TRUE
3: 456 FALSE FALSE TRUE
4: 587 TRUE TRUE TRUE
5: 789 FALSE TRUE TRUE
I am trying to apply a logical operator to columns of a data.table, which works fine without by= groups, but with by= groups something weird happens. I'm using R 3.3.1 and data.table 1.9.6 (from CRAN).
Here is an example data set:
library(data.table)
# x <- matrix(rnorm(10),ncol=2)
x <- structure(c(0.323618333400921, -0.103619160085676, -0.823322722581548,
-0.0175726272468397, 2.17286179204364, 1.63465793636681, 0.258469984285221,
-0.411622451584, 0.22594247704226, 0.114427086028551), .Dim = c(5L, 2L))
dt <- data.table(x)
dt$id <- c(1,1,1,2,2)
dt
## V1 V2 id
## 1: 0.32361833 1.6346579 1
## 2: -0.10361916 0.2584700 1
## 3: -0.82332272 -0.4116225 1
## 4: -0.01757263 0.2259425 2
## 5: 2.17286179 0.1144271 2
Let's say I would like to know whether the cumulative minimum (that is, the minimal value in each column from either the start of the sample or the start of the by= group, up to the current row) is below zero. With no by= groups, we are good:
dt[, cummin(.SD), .SDcols=1:2]
## V1 V2
## 1: 0.3236183 1.6346579
## 2: -0.1036192 0.2584700
## 3: -0.8233227 -0.4116225
## 4: -0.8233227 -0.4116225
## 5: -0.8233227 -0.4116225
dt[, cummin(.SD)<0, .SDcols=1:2]
## V1 V2
## [1,] FALSE FALSE
## [2,] TRUE FALSE
## [3,] TRUE TRUE
## [4,] TRUE TRUE
## [5,] TRUE TRUE
So far so good. I can also compute actual cumulative minimums with by= groups:
dt[, cummin(.SD), by=id]
## id V1 V2
## 1: 1 0.32361833 1.6346579
## 2: 1 -0.10361916 0.2584700
## 3: 1 -0.82332272 -0.4116225
## 4: 2 -0.01757263 0.2259425
## 5: 2 -0.01757263 0.1144271
However, if I try to compute the indicator variables as before, the output is as follows:
dt[, (cummin(.SD)<0), by=id]
## id V1
## 1: 1 FALSE
## 2: 1 TRUE
## 3: 1 TRUE
## 4: 1 FALSE
## 5: 1 FALSE
## 6: 1 TRUE
## 7: 2 TRUE
## 8: 2 TRUE
## 9: 2 FALSE
## 10: 2 FALSE
Instead of 2 variables with 5 observations each, we get one with 10. It appears that within each by group, the variables are stacked on top of each other. How do I do this correctly?
It will work with lapply:
dt[, lapply(.SD, function(x) cummin(x) < 0), by = "id"]
id V1 V2
1: 1 FALSE FALSE
2: 1 TRUE FALSE
3: 1 TRUE TRUE
4: 2 TRUE FALSE
5: 2 TRUE FALSE
As David Arenburg commented, it works with cummin(.SD) because that returns a data.frame, while cummin(.SD) < 0 is a matrix, which gets stacked into a single column within each group. So another solution is to cast the matrix back to a data.frame:
dt[, (as.data.frame(cummin(.SD)<0)), by=id]
id V1 V2
1: 1 FALSE FALSE
2: 1 TRUE FALSE
3: 1 TRUE TRUE
4: 2 TRUE FALSE
5: 2 TRUE FALSE
You can try something like this :
dt2 <- dt[, cummin(.SD), by = id]
results <- data.table(dt2[, .SD<0])
results$id <- dt$id
results
id V1 V2
1: 1 FALSE FALSE
2: 1 TRUE FALSE
3: 1 TRUE TRUE
4: 2 TRUE FALSE
5: 2 TRUE FALSE
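Note that reattaching results$id <- dt$id relies on dt2 having the same row order as dt; that holds here because id is already sorted, but with unsorted ids the grouped result would be rearranged into contiguous groups and the reattached column would misalign.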
I'm trying to use R to find the average number of attempts before a success in a data frame with 300,000+ rows. The data is structured as below.
EventID SubjectID ActionID Success DateUpdated
a b c TRUE 2014-06-21 20:20:08.575032+00
b a c FALSE 2014-06-20 02:58:40.70699+00
I'm still learning my way through R. It looks like I can use ddply to split the frame by Subject and Action (I want to see how many times a given subject tries an action before achieving a success), but I can't figure out how to write the function I need to apply.
library(data.table)
# example data
dt = data.table(group = c(1,1,1,1,1,2,2), success = c(F,F,T,F,T,F,T))
# group success
#1: 1 FALSE
#2: 1 FALSE
#3: 1 TRUE
#4: 1 FALSE
#5: 1 TRUE
#6: 2 FALSE
#7: 2 TRUE
dt[, which(success)[1] - 1, by = group]
# group V1
#1: 1 2
#2: 2 1
Replace group with list(subject, action) or whatever is appropriate for your data (after converting it to data.table from data.frame).
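A sketch with the question's column names (an assumption on my part; it also assumes the rows are already in chronological order, so sort by DateUpdated first if they are not):
library(data.table)
setDT(df)  # convert the data.frame to a data.table by reference
# failures before the first success; NA for pairs that never succeeded
df[, which(Success)[1] - 1, by = .(SubjectID, ActionID)]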
To follow up on Tarehman's suggestion, since I like rle,
foo <- rle(data$Success)
mean(foo$lengths[foo$values==FALSE])
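rle() run-length-encodes the success vector, so foo$lengths[foo$values==FALSE] holds the lengths of the consecutive failure streaks, and their mean is the average number of failed attempts before a success. Note that a trailing failure streak with no subsequent success is included as well.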
This might be an answer to a totally different question, but does this get close to what you want?
tfs <- sample(c(FALSE,TRUE),size = 50, replace = TRUE, prob = c(0.8,0.2))
tfs_sums <- cumsum(!tfs)
repsums <- tfs_sums[duplicated(tfs_sums)]
mean(repsums - c(0,repsums[-length(repsums)]))
tfs
[1] FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[20] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE
[39] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
repsums
1 6 8 9 20 20 20 20 24 26 31 36
repsums - c(0,repsums[-length(repsums)])
1 5 2 1 11 0 0 0 4 2 5 5
The last vector shown is the number of FALSE values preceding each TRUE in tfs (a 0 means two TRUEs in a row).
You could use a data.table workaround to get what you need, as follows:
library(data.table)
df = data.frame(EventID = c("a","b","c","d"), SubjectID = c("b","a","a","a"),
                ActionID = c("c","c","c","c"), Success = c(TRUE,FALSE,FALSE,TRUE))
dt = data.table(df)
dt[, Index := 1:.N, by = c("SubjectID", "ActionID", "Success")]
Now this Index column will hold the count you need for each subject/action set of experiments. You need to aggregate to get that number (the max):
result = aggregate(Index ~ SubjectID + ActionID, data = dt, FUN = max)
This will give you the max index, which is the number of FALSEs before you hit a TRUE. Note that you might need further processing to filter out subjects that never had a TRUE.
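A hedged sketch of that filtering step, reusing the which()-based idea from the first answer (it assumes rows within each pair are already in attempt order):
# keep only subject/action pairs that eventually succeeded,
# then count the failures preceding the first success
dt[, if (any(Success)) .(fails_before_success = which(Success)[1] - 1),
   by = .(SubjectID, ActionID)]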