My dataset has 575 rows and 368 columns and it looks like this:
NUTS3_2016 URAU_CODE FUA_CODE X2018.01.01.x X2018.01.02.x X2018.01.03.x ...
1 AT130 AT001C1 AT001L3 0.46369280 0.3582241 0.2777274 ...
2 AT211 AT006C1 AT006L2 -0.04453125 -0.3092773 -0.3284180 ...
3 AT312 AT003C1 AT003L3 1.02993164 0.9640137 0.6413086 ...
4 AT323 AT004C1 AT004L3 1.21105239 1.4335363 1.2400620 ...
... ... .... ... ... ... .... ...
I want to calculate the probability that x > 2.5 for each row.
I also want to calculate, for each row, for how many consecutive days x remains > 2.5.
What are your suggestions?
Many thanks
Attempt:
A <- c("a", "b", "c", "d", "e")
B <- c(1:5)
C <- c(1:5)
x <- data.frame(A,B,C)
x$prob <- rowMeans(x[-1] > 2)
x
# A B C prob
# 1 a 1 1 0
# 2 b 2 2 0
# 3 c 3 3 1
# 4 d 4 4 1
# 5 e 5 5 1
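The same idea scales to the posted data; a sketch, assuming the date columns start at column 4 (after NUTS3_2016, URAU_CODE, FUA_CODE) and the full data frame is named dat (a hypothetical name):
dat$prob <- rowMeans(dat[, -(1:3)] > 2.5, na.rm = TRUE)  # proportion of days with x > 2.5 per row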
We can use rle for finding the length of the maximum streak.
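To see why rle fits, note that it compresses a logical vector into runs of equal values; a quick illustration:
rle(c(TRUE, TRUE, FALSE, TRUE))
# Run Length Encoding
#   lengths: int [1:3] 2 1 1
#   values : logi [1:3] TRUE FALSE TRUE
The longest streak of TRUEs is then max(lengths[values]).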
## Some sample data:
set.seed(47)
data = matrix(rnorm(24, mean = 2.5), nrow = 3)
data = cbind(ID = c("A", "B", "C"), as.data.frame(data))
data
# ID V1 V2 V3 V4 V5 V6 V7 V8
# 1 A 4.494696 2.218235 1.514518 1.034250 2.9938202 3.170779 1.7966118 2.749148
# 2 B 3.211143 2.608776 2.515131 1.577544 0.6717708 2.418922 2.4594218 2.159584
# 3 C 2.685405 1.414263 2.247954 2.539602 2.5914729 3.764241 0.9338379 2.917191
data$max_streak = apply(data[-1], 1, function(x) with(rle(x > 2.5), max(lengths[values])))
# ID V1 V2 V3 V4 V5 V6 V7 V8 max_streak
# 1 A 4.494696 2.218235 1.514518 1.034250 2.9938202 3.170779 1.7966118 2.749148 2
# 2 B 3.211143 2.608776 2.515131 1.577544 0.6717708 2.418922 2.4594218 2.159584 3
# 3 C 2.685405 1.414263 2.247954 2.539602 2.5914729 3.764241 0.9338379 2.917191 3
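One caveat: if a row never exceeds 2.5, lengths[values] is empty and max() returns -Inf with a warning. A guarded variant (a sketch; max_streak is a hypothetical helper name):
max_streak <- function(x, thr = 2.5) {
  r <- rle(x > thr)
  if (any(r$values)) max(r$lengths[r$values]) else 0L  # 0 if no value exceeds thr
}
data$max_streak <- apply(data[paste0("V", 1:8)], 1, max_streak)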
I have a dataframe like this:
set.seed(123)
a <- c("A", "B", "C", "D", "E", "F", "G", "H", "I")
df <- data.frame(
  V1 = sample(a, 4, replace = TRUE),
  V2 = sample(a, 4, replace = TRUE),
  V3 = sample(a, 4, replace = TRUE),
  V4 = sample(a, 4, replace = TRUE)
)
which looks like
V1 V2 V3 V4
1 C I E G
2 H A E F
3 D E I A
4 H I E I
I'd like to count the number of unique values in each row in comparison with the previous rows (taking, for each row, the minimum count of new values over all comparisons with earlier rows), so the result would look like:
V1 V2 V3 V4 V5
1 C I E G 4
2 H A E F 3
3 D E I A 2
4 H I E I 1
V5 equals 4 for row 1 since it's the 1st row and all are unique
V5 equals 3 for row 2 since H, A, and F were not in row 1
V5 equals 2 for row 3 since 1) D and I were not in row 2, and 2) D and A were not in row 1.
V5 equals 1 for row 4 since 1) H was not in row 1, 2) I was not in row 2, and 3) H was not in row 3.
If row 4 were H I E A, then V5 for row 4 would still have been 1, since it only has 1 value not in row 3, even though it would have 2 values not in row 2 and 2 values not in row 1.
Here is a multi-step method in base R.
# Create a list of the elements by row, using mike H's method
myList <- strsplit(Reduce(paste0, df), "")
# previous method, could create new object first t(df) if large df
# myList <- split(t(df), col(t(df)))
# get pairwise combinations of rows
combos <- t(combn(nrow(df):1, 2))[choose(nrow(df), 2):1,]
# get desired values, sapply runs through pairs of rows, tapply calculates min with row
df$cnts <- c(length(unique(myList[[1]])),  # value for first row
             tapply(sapply(1:nrow(combos),  # sapply through pairs, taking set diffs
                           function(x) length(setdiff(myList[[combos[x, 1]]],
                                                      myList[[combos[x, 2]]]))),
                    combos[, 1], min))  # split set diff lengths by row, get min length
This returns
df
V1 V2 V3 V4 cnts
1 C I E G 4
2 H A E F 3
3 D E I A 2
4 H I E I 1
For tasks like this, storing the rows/sets of data in "df" in a tabulated (item-by-row) format can be helpful:
tab = table(as.matrix(df), row(df)) > 0
#> tab
#
# 1 2 3 4
# A FALSE TRUE TRUE FALSE
# C TRUE FALSE FALSE FALSE
# D FALSE FALSE TRUE FALSE
# E TRUE TRUE TRUE TRUE
# F FALSE TRUE FALSE FALSE
# G TRUE FALSE FALSE FALSE
# H FALSE TRUE FALSE TRUE
# I TRUE FALSE TRUE TRUE
crossprod can be used to retrieve (in a very efficient manner) the number of items that belong to one row but not to another:
ct = crossprod(tab, !tab)
#> ct
#
# 1 2 3 4
# 1 0 3 2 2
# 2 3 0 2 2
# 3 2 2 0 2
# 4 1 1 1 0
Above we can see that, e.g., row 4 contains 1 element that row 1 does not contain, while row 1 contains 2 elements that are not in row 4, etc.
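As a quick sanity check of one entry against a direct setdiff (rows is a hypothetical helper):
rows <- split(as.matrix(df), row(df))   # character values of each row
length(setdiff(rows[[4]], rows[[1]]))   # 1, matching ct[4, 1]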
Since here we only care about each row's previous rows and, specifically, the minimum over each row's set of one-to-one comparisons with them, one way to get the result is:
ct[upper.tri(ct, TRUE)] = Inf ## to ignore 'upper.tri' values in 'max.col'
j_min = max.col(-ct, "first") ## index of the previous row giving the minimum difference per row
c(sum(tab[, 1]),
  ct[cbind(2:nrow(df), j_min[-1])])
#[1] 4 3 2 1
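To attach the result as the V5 column from the question, for example:
df$V5 <- c(sum(tab[, 1]), ct[cbind(2:nrow(df), j_min[-1])])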
Here's an approach that uses Reduce and mapply:
df$cols_paste <- strsplit(Reduce(paste0, df), split = "")
df$V5 <- lapply(1:length(df$cols_paste), function(x) {
  if (x == 1) compare <- NA
  else compare <- df$cols_paste[seq_len(x - 1)]
  min(mapply(function(x, y) length(setdiff(x, y)), df$cols_paste[x], compare))
})
df[,setdiff(names(df), "cols_paste")]
V1 V2 V3 V4 V5
1 C I E G 4
2 H A E F 3
3 D E I A 2
4 H I E I 1
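Note that V5 is built with lapply() and is therefore stored as a list column; flatten it if a plain vector is preferred:
df$V5 <- unlist(df$V5)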
What is the most efficient way to transpose
> dt <- data.table( x = c(1, 1, 3, 1, 3, 1, 1), y = c(1, 2, 1, 2, 2, 1, 1) )
> dt
x y
1: 1 1
2: 1 2
3: 3 1
4: 1 2
5: 3 2
6: 1 1
7: 1 1
into:
> output
cn v1 v2 v3 v4 v5 v6 v7
1: x 1 1 3 1 3 1 1
2: y 1 2 1 2 2 1 1
dcast.data.table is supposed to be efficient, but I can't figure out how exactly it should be done.
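For the dcast route specifically, one sketch that melts first and spreads by within-group row id (assuming data.table's melt and rowid helpers; note the resulting columns are named 1..7 rather than v1..v7):
library(data.table)
melted <- melt(dt, measure.vars = names(dt))
dcast(melted, variable ~ rowid(variable), value.var = "value")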
How about data.table::transpose:
data.table(cn = names(dt), transpose(dt))
# cn V1 V2 V3 V4 V5 V6 V7
#1: x 1 1 3 1 3 1 1
#2: y 1 2 1 2 2 1 1
If you are really concerned about efficiency, this may be better:
tdt <- transpose(dt)[, cn := names(dt)]
setcolorder(tdt, c(ncol(tdt), 1:(ncol(tdt) - 1)))
tdt
# cn V1 V2 V3 V4 V5 V6 V7
#1: x 1 1 3 1 3 1 1
#2: y 1 2 1 2 2 1 1
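On recent data.table versions (keep.names was added around 1.12.0, if I recall correctly), transpose can keep the original column names directly, which gives the same result without the setcolorder step:
transpose(dt, keep.names = "cn")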
transpose seems to be a little faster than t (which calls do_transpose), but not by a large margin. I would guess that both of these implementations are roughly near the upper bound of efficiency for out-of-place transposition algorithms.
Dt <- data.table(
  x = rep(c(1, 1, 3, 1, 3, 1, 1), 10e2),
  y = rep(c(1, 2, 1, 2, 2, 1, 1), 10e2))
all.equal(data.table(t(Dt)), data.table(transpose(Dt)))
#[1] TRUE
microbenchmark::microbenchmark(
  "base::t" = data.table(t(Dt)),
  "data.table::transpose" = data.table(transpose(Dt))
)
#Unit: milliseconds
# expr min lq mean median uq max neval
#base::t 415.4200 434.5308 481.4373 458.1619 507.9556 747.2022 100
#data.table::transpose 409.5624 425.8409 474.9709 444.5194 510.3750 685.0543 100
Code to identify the fields of the object [temp_table] and report them via the object [temp_table_schema]:
temp_table
temp_table_data_types <- sapply(temp_table, class)  # class(es) of each column
temp_table_schema <- NULL
for (x in 1:length(temp_table_data_types)) {
  temp_table_schema <- base::rbind(temp_table_schema, data.table(
    ROWID       = x,
    COLUMN_NAME = names(temp_table_data_types[x]),
    DATA_TYPE   = temp_table_data_types[[x]][[1]],  # first class of the column
    # secondary class, if any (e.g. "POSIXt" for "POSIXct" columns)
    DETAILS     = if (length(as.list(temp_table_data_types[[x]])) > 1)
      as.list(temp_table_data_types[[x]])[[2]] else ""
  ))
}
temp_table_schema
rm(list = c("temp_table_data_types"))
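A loop-free sketch of the same schema table, which avoids growing temp_table_schema row by row (equivalent under the assumption that only the first two classes of a column matter):
temp_table_schema <- data.table(
  ROWID       = seq_along(temp_table),
  COLUMN_NAME = names(temp_table),
  DATA_TYPE   = sapply(temp_table, function(col) class(col)[1]),
  DETAILS     = sapply(temp_table, function(col)
    if (length(class(col)) > 1) class(col)[2] else "")
)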
I have a list of data.tables that I need to cbind; however, I only need the last X columns.
My data is structured as follows:
DT.1 <- data.table(x=c(1,1), y = c("a","a"), v1 = c(1,2), v2 = c(3,4))
DT.2 <- data.table(x=c(1,1), y = c("a","a"), v3 = c(5,6))
DT.3 <- data.table(x=c(1,1), y = c("a","a"), v4 = c(7,8), v5 = c(9,10), v6 = c(11,12))
DT.list <- list(DT.1, DT.2, DT.3)
> DT.list
[[1]]
x y v1 v2
1: 1 a 1 3
2: 1 a 2 4
[[2]]
x y v3
1: 1 a 5
2: 1 a 6
[[3]]
x y v4 v5 v6
1: 1 a 7 9 11
2: 1 a 8 10 12
Columns x and y are the same for each of the data.tables but the number of columns differs. The output should not include duplicate x and y columns. It should look as follows:
x y v1 v2 v3 v4 v5 v6
1: 1 a 1 3 5 7 9 11
2: 1 a 2 4 6 8 10 12
I want to avoid using a loop. I am able to bind the data.tables using do.call("cbind", DT.list) and then remove the duplicates manually, but is there a way in which the duplicates aren't created in the first place? Also, efficiency is important, since the lists can be long and the data.tables large.
thanks
Here's another way:
Reduce(
  function(x, y) {
    newcols <- setdiff(names(y), names(x))
    x[, (newcols)] <- y[, ..newcols]
    x
  },
  DT.list,
  init = copy(DT.list[[1]][, c("x", "y")])
)
# x y v1 v2 v3 v4 v5 v6
# 1: 1 a 1 3 5 7 9 11
# 2: 1 a 2 4 6 8 10 12
This avoids modifying the list (as @bgoldst's <- NULL assignment does) or making copies of every element of the list (as, I think, the lapply approach does). I would probably do the <- NULL thing in most practical applications, though.
Here's how it could be done in one shot, using lapply() to remove columns x and y from second-and-subsequent data.tables before calling cbind():
do.call(cbind,c(DT.list[1],lapply(DT.list[2:length(DT.list)],`[`,j=-c(1,2))));
## x y v1 v2 v3 v4 v5 v6
## 1: 1 a 1 3 5 7 9 11
## 2: 1 a 2 4 6 8 10 12
Another approach is to remove columns x and y from second-and-subsequent data.tables before doing a straight cbind(). I think there's nothing wrong with using a for loop for this:
for (i in seq_along(DT.list)[-1]) DT.list[[i]][,c('x','y')] <- NULL;
DT.list;
## [[1]]
## x y v1 v2
## 1: 1 a 1 3
## 2: 1 a 2 4
##
## [[2]]
## v3
## 1: 5
## 2: 6
##
## [[3]]
## v4 v5 v6
## 1: 7 9 11
## 2: 8 10 12
##
do.call(cbind,DT.list);
## x y v1 v2 v3 v4 v5 v6
## 1: 1 a 1 3 5 7 9 11
## 2: 1 a 2 4 6 8 10 12
Another option would be to use the [ indexing function inside lapply on the list of data.tables and exclude the "unwanted" columns (in your case, x and y). This way, duplicate columns are not created in the first place.
# your given test data
DT.1 <- data.table(x=c(1,1), y = c("a","a"), v1 = c(1,2), v2 = c(3,4))
DT.2 <- data.table(x=c(1,1), y = c("a","a"), v3 = c(5,6))
DT.3 <- data.table(x=c(1,1), y = c("a","a"), v4 = c(7,8), v5 = c(9,10), v6 = c(11,12))
DT.list <- list(DT.1, DT.2, DT.3)
A) using a character vector to indicate which columns to exclude
# cbind a list of subsetted data.tables
exclude.col <- c("x","y")
myDT <- do.call(cbind, lapply(DT.list, `[`,,!exclude.col, with = FALSE))
myDT
## v1 v2 v3 v4 v5 v6
## 1: 1 3 5 7 9 11
## 2: 2 4 6 8 10 12
# join x & y columns for final results
cbind(DT.list[[1]][,.(x,y)], myDT)
## x y v1 v2 v3 v4 v5 v6
## 1: 1 a 1 3 5 7 9 11
## 2: 1 a 2 4 6 8 10 12
B) same as above but using the character vector directly in lapply
myDT <- do.call(cbind, lapply(DT.list, `[`,,!c("x","y")))
myDT
## v1 v2 v3 v4 v5 v6
## 1: 1 3 5 7 9 11
## 2: 2 4 6 8 10 12
# join x & y columns for final results
cbind(DT.list[[1]][,.(x,y)], myDT)
## x y v1 v2 v3 v4 v5 v6
## 1: 1 a 1 3 5 7 9 11
## 2: 1 a 2 4 6 8 10 12
C) same as above, but all in one line
do.call( cbind, c(list(DT.list[[1]][,.(x,y)]), lapply(DT.list, `[`,,!c("x","y"))) )
# way too many brackets...but I think it works
## x y v1 v2 v3 v4 v5 v6
## 1: 1 a 1 3 5 7 9 11
## 2: 1 a 2 4 6 8 10 12
I have two named vectors
v1 <- 1:4
v2 <- 3:5
names(v1) <- c("a", "b", "c", "d")
names(v2) <- c("c", "e", "d")
I want to add them up by the names, i.e. the expected result is
> v3
a b c d e
1 2 6 9 4
Is there a way to programmatically do this in R? Note the names may not necessarily be in a sorted order, like in v2 above.
Just combine the vectors (using c, for example) and use tapply:
v3 <- c(v1, v2)
tapply(v3, names(v3), sum)
# a b c d e
# 1 2 6 9 4
Or, for fun (since you're just doing sum), continuing with "v3":
xtabs(v3 ~ names(v3))
# names(v3)
# a b c d e
# 1 2 6 9 4
I suppose with "data.table" you could also do something like:
library(data.table)
as.data.table(Reduce(c, mget(ls(pattern = "v\\d"))),
              keep.rownames = TRUE)[, list(V2 = sum(V2)), by = V1]
# V1 V2
# 1: a 1
# 2: b 2
# 3: c 6
# 4: d 9
# 5: e 4
(I shared the latter not so much for "data.table" but to show an automated way of capturing the vectors of interest.)
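One caution about that pattern: once v3 <- c(v1, v2) exists in the workspace, ls(pattern = "v\\d") matches it as well and the sums double up; a stricter pattern avoids this:
mget(ls(pattern = "^v[12]$"))  # only v1 and v2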
I have a data.frame
orig.DF<-data.frame(V1=c("A", "B", "C"), V2=c(3,2,4))
and I have to expand it so that it takes the following form
A 1
A 2
A 3
B 1
B 2
C 1
C 2
C 3
C 4
I tried tapply and ave but I can't get it to count 1:x and repeat V1 accordingly.
df <- data.frame(V1 = c("A", "B", "C"), V2 = c(3, 2, 4))
data.frame(x = rep(df$V1, df$V2), y = sequence(df$V2))
x y
1 A 1
2 A 2
3 A 3
4 B 1
5 B 2
6 C 1
7 C 2
8 C 3
9 C 4
Here is one approach:
do.call(
  rbind,
  apply(orig.DF, 1, function(row) expand.grid(row["V1"], 1:row["V2"]))
)
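Note that apply() coerces each row to character here (the : operator then converts row["V2"] back to numeric). A type-safe variant of the same idea, as a sketch:
do.call(rbind, Map(function(v, n) expand.grid(V1 = v, V2 = seq_len(n)),
                   orig.DF$V1, orig.DF$V2))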