Calculate the length of an interval if data are equal to zero - r

I have a data frame with time points and the corresponding measure of activity in different subjects. Each time point is a 5-minute interval.
time Subject1 Subject2
06:03:00 6,682129 8,127075
06:08:00 3,612061 20,58838
06:13:00 0 0
06:18:00 0,9030762 0
06:23:00 0 0
06:28:00 0 0
06:33:00 0 0
06:38:00 0 7,404663
06:43:00 0 11,55835
...
I would like to calculate the length of each interval that contains zero activity, as in the example below:
           Subject 1 Subject 2
Interval_1         1         5
Interval_2         5
I have the impression that I should solve this using loops and conditions, but as I am not very experienced with loops, I do not know where to start. Do you have any idea how to solve this? Any help is really appreciated!

You can use rle() to find runs of consecutive values and the length of the runs. We need to filter the results to only runs where the value is 0:
result = lapply(df[-1], \(x) with(rle(x), lengths[values == 0]))
result
# $Subject1
# [1] 1 5
#
# $Subject2
# [1] 5
As different subjects can have different numbers of 0-runs, the results make more sense in a list than a rectangular data frame.
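For readers who want to run this end to end, here is a minimal sketch that rebuilds the sample from the question. It assumes the activity columns are already numeric (the question shows decimal commas, so something like dec = "," would be needed on import); the values are copied from the question, and zero_runs is just an illustrative name.

```r
# Sample data from the question, with decimal commas converted to points
df <- data.frame(
  time = c("06:03:00", "06:08:00", "06:13:00", "06:18:00", "06:23:00",
           "06:28:00", "06:33:00", "06:38:00", "06:43:00"),
  Subject1 = c(6.682129, 3.612061, 0, 0.9030762, 0, 0, 0, 0, 0),
  Subject2 = c(8.127075, 20.58838, 0, 0, 0, 0, 0, 7.404663, 11.55835)
)

# For each subject column, keep only the lengths of runs whose value is 0
zero_runs <- lapply(df[-1], function(x) {
  r <- rle(x)                 # run lengths and the value of each run
  r$lengths[r$values == 0]
})
zero_runs
```

Each run length is a count of 5-minute time points, so multiplying by 5 would give the duration in minutes.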

Related

Problem finding number of elements in a dataframe in R

I have downloaded the data frame casos_hosp_uci_def_sexo_edad_provres_60_mas.csv from this webpage. It describes the number of people infected with Covid-19 in Spain, classified by province, age, gender... I read the data frame as:
db<-read.csv(file = 'casos_hosp_uci_def_sexo_edad_provres.csv')
The first five rows are shown
  provincia_iso sexo grupo_edad      fecha num_casos num_hosp num_uci num_def
1             A    H        0-9 2020-01-01         0        0       0       0
2             A    H      10-19 2020-01-01         0        0       0       0
3             A    H      20-29 2020-01-01         0        0       0       0
4             A    H      30-39 2020-01-01         0        0       0       0
5             A    H      40-49 2020-01-01         0        0       0       0
The first four columns of the data frame show the province, gender, age group and date; the last four columns show the number of people who got ill, were hospitalized, were admitted to the ICU, or died.
I want to use R to find the day with the highest rate of contagions. To do that, I have to sum the elements of the fifth column, num_casos, for each different value of the column fecha.
I have already been able to calculate the number of sick males as hombresEnfermos=sum(db[which(db$sexo=="H"), 5]). I think there has to be a better way to find the days with the highest contagion than counting manually, but I cannot figure out how.
Can someone please help me?
Using dplyr to get the total by date:
library(dplyr)
db %>% group_by(fecha) %>% summarise(total = sum(num_casos))
Two alternatives in base R:
data.frame(fecha = sort(unique(db$fecha)),
           total = sapply(split(db, f = db$fecha), function(x) {sum(x[['num_casos']])}))
Or more simply,
aggregate(db$num_casos, list(db$fecha), FUN=sum)
An alternative in data.table:
library(data.table)
db <- as.data.table(db)
db[, list(total=sum(num_casos)), by = fecha]
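The snippets above give the total per date; to get the single day with the highest total, one more step is needed. Here is a minimal base R sketch on a toy stand-in for db (the values are hypothetical, only the column names fecha and num_casos come from the question; totals and peak are illustrative names).

```r
# Toy stand-in for db (hypothetical values; column names as in the question)
db <- data.frame(
  fecha = c("2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02", "2020-01-03"),
  num_casos = c(0, 2, 5, 4, 1)
)

# Total cases per date; the formula interface keeps the column names
totals <- aggregate(num_casos ~ fecha, data = db, FUN = sum)

# Row with the largest total, i.e. the day with the most cases
peak <- totals[which.max(totals$num_casos), ]
peak
```

Note that which.max() returns only the first maximum, so ties between days would need totals[totals$num_casos == max(totals$num_casos), ] instead.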

Cavs vs. Warriors - probability of Cavs winning the series includes combinations like "0,1,0,0,0,1,1" - but the series is over after game 5

There is a problem in DataCamp about computing the probability of winning an NBA series. Cavs and the Warriors are playing a seven game championship series. The first to win four games wins the series. They each have a 50-50 chance of winning each game. If the Cavs lose the first game, what is the probability that they win the series?
Here is how DataCamp computed the probability using Monte Carlo simulation:
B <- 10000
set.seed(1)
results <- replicate(B, {
  x <- sample(0:1, 6, replace = TRUE)  # 0 when game is lost and 1 when won.
  sum(x) >= 4
})
mean(results)
Here is a different way they computed the probability using simple code:
# Assign a variable 'n' as the number of remaining games.
n<-6
# Assign a variable `outcomes` as a vector of possible game outcomes: 0 indicates a loss and 1 a win for the Cavs.
outcomes<-c(0,1)
# Assign a variable `l` to a list of all possible outcomes in all remaining games. Use the `rep` function on `list(outcomes)` to create list of length `n`.
l<-rep(list(outcomes),n)
# Create a data frame named 'possibilities' that contains all combinations of possible outcomes for the remaining games.
possibilities<-expand.grid(l) # My comment: note how this produces 64 combinations.
# Create a vector named 'results' that indicates whether each row in the data frame 'possibilities' contains enough wins for the Cavs to win the series.
rowSums(possibilities)
results<-rowSums(possibilities)>=4
# Calculate the proportion of 'results' in which the Cavs win the series.
mean(results)
Question/Problem:
They both produce approximately the same probability of winning the series, ~0.34. However, there seems to be a flaw in the concept and the code design. For example, the code (sampling six times) allows for combinations such as the following:
G2 G3 G4 G5 G6 G7 rowSums
0 0 0 0 0 0 0 # Series over after G4 (Cavs lose). No need for game G5-G7.
0 0 0 0 1 0 1 # Series over after G4 (Cavs lose). Double counting!
0 0 0 0 0 1 1 # Double counting!
...
1 1 1 1 0 0 4 # No need for game G6 and G7.
1 1 1 1 0 1 5 # Double counting! This is the same as 1,1,1,1,0,0.
0 1 1 1 1 1 5 # No need for game G7.
1 1 1 1 1 1 6 # Series over after G5 (Cavs win). Double counting!
> rowSums(possibilities)
[1] 0 1 1 2 1 2 2 3 1 2 2 3 2 3 3 4 1 2 2 3 2 3 3 4 2 3 3 4 3 4 4 5 1 2 2 3 2 3 3 4 2 3 3 4 3 4 4 5 2 3 3 4 3 4 4 5 3 4 4 5 4 5 5 6
As you can see, these are never possible. After winning the first four of the remaining six games, no more games should be played. Similarly, after losing the first three games of the remaining six games, no more games should be played. So these combinations shouldn't be included in the computation of the probability of winning the series. There is double counting for some of the combinations.
Here is what I did to omit some of the combinations that are not possible in real life.
library(dplyr)  # needed for %>%, mutate() and filter()
outcomes<-c(0,1)
l<-rep(list(outcomes),6)
possibilities<-expand.grid(l)
possibilities<-possibilities %>% mutate(rowsums=rowSums(possibilities)) %>% filter(rowsums<=4)
But then I am not able to omit the other unnecessary combinations. For example, I want to remove two of these three: (a) 1,0,0,0,0,0 (b) 1,0,0,0,0,1 (c) 1,0,0,0,1,1. This is because no more games will be played after losing three times in a row. And they are basically double counting.
There are too many conditions for me to be able to filter them individually. There has to be a more efficient and intuitive way to do this. Can someone provide me with some hints on how to solve this whole mess?
Here is a way:
library(dplyr)
outcomes<-c(0,1)
l<-rep(list(outcomes),6)
possibilities<-expand.grid(l)
possibilities %>%
  mutate(rowsums = rowSums(cur_data()),
         anti_sum = rowSums(!cur_data())) %>%
  filter(rowsums <= 4, anti_sum <= 3)
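This uses the fact that R coerces numbers to logical values, with 0 treated as FALSE, so !x is TRUE exactly where x is 0 and rowSums(!cur_data()) counts the zeros in each row. See sum(!0) as a short example.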
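As a cross-check on the double-counting concern, the sketch below compares the "play all six remaining games" calculation with one that stops as soon as the series is decided. The recursive helper p_win is a hypothetical name introduced here for illustration, not part of the DataCamp solution; it assumes independent games with a 0.5 win probability each.

```r
# Probability that the Cavs win at least 4 of 6 games when all 6 are played
p_full <- sum(dbinom(4:6, size = 6, prob = 0.5))

# Probability that the Cavs win when the series stops once it is decided.
# cavs_need / warriors_need: wins each team still needs (hypothetical helper).
p_win <- function(cavs_need, warriors_need) {
  if (cavs_need == 0) return(1)       # Cavs have already won the series
  if (warriors_need == 0) return(0)   # Warriors have already won the series
  0.5 * p_win(cavs_need - 1, warriors_need) +
    0.5 * p_win(cavs_need, warriors_need - 1)
}

# After losing game 1, the Cavs need 4 more wins and the Warriors need 3
p_stop <- p_win(4, 3)

c(p_full = p_full, p_stop = p_stop)
```

Both values come out to 22/64, about 0.344. A series that ends after k of the six remaining games corresponds to 2^(6-k) of the 64 equally likely full sequences, which is exactly its probability weight, so the extra rows in the grid act as weights rather than as double counting.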

Reverse cumsum with breaks with non-sequential numbers

Looking to fill a matrix with a reverse cumsum. There are multiple breaks that must be maintained.
I have provided a sample matrix for what I want to accomplish. The first column is the data, the second column is what I want. You will see that column 2 is updated to reflect the number of items that are left. When there are 0's the previous number must be carried through.
update <- matrix(c(rep(0,4),rep(1,2),2,rep(0,2),1,3,
rep(10,4), 9,8,6, rep(6,2), 5, 2),ncol=2)
I have tried multiple ways to create a sequence and to loop, using numerous packages (e.g. zoo). What makes this difficult is that the numbers in column 1 can be any of 0,1,..,X, but must be less than column 2.
Any help or tips would be appreciated
EDIT: Column 2 starts with a given value, which can represent any starting value (e.g. inventory at the beginning of a month). Column 1 would then represent the "purchases" made; thus, column 2 should reflect the total number of remaining items available.
The following will report the purchase and inventory balance as described:
starting_inventory <- 100
df <- data.frame(purchases=c(rep(0,4),rep(1,2),2,rep(0,2),1,3))
df$cum_purchases <- cumsum(df$purchases)
df$remaining_inventory <- starting_inventory - df$cum_purchases
Result:
   purchases cum_purchases remaining_inventory
1          0             0                 100
2          0             0                 100
3          0             0                 100
4          0             0                 100
5          1             1                  99
6          1             2                  98
7          2             4                  96
8          0             4                  96
9          0             4                  96
10         1             5                  95
11         3             8                  92
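As a quick check against the asker's own sample, the same idea can be applied with a starting value of 10, which is what the second column of the sample matrix implies. This is a minimal sketch; remaining is an illustrative name.

```r
# The asker's sample matrix: column 1 is the data, column 2 the desired result
update <- matrix(c(rep(0, 4), rep(1, 2), 2, rep(0, 2), 1, 3,
                   rep(10, 4), 9, 8, 6, rep(6, 2), 5, 2), ncol = 2)

# Assumption: the starting inventory is 10, as the sample's second column implies
starting_inventory <- 10

# Remaining items = starting value minus the running total of column 1
remaining <- starting_inventory - cumsum(update[, 1])

# Compare with the desired column
all(remaining == update[, 2])
```

Zeros in column 1 leave the running total unchanged, which is why the previous value is carried through without any special handling.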

Match() function only runs for 0, but not for other values

I use time series data (imported with read.csv()) and want to match() values from another data frame (also imported with read.csv()) to those recorded in my time series.
It looks like this:
df1 <- data.frame(hue=rawdata[,"hue"])
# This is my Time Series-raw data
hue
2017-07-01 00:00:00 0
2017-07-01 00:01:00 0
2017-07-01 00:02:00 0
2017-07-01 00:03:00 0
2017-07-01 00:04:00 0
2017-07-01 00:05:00 0
The Values change between 0 and 7 sometimes. Here's just a head() print
And this is the data frame I want to check it with:
df2 <- data.frame(hue=sz$hue, Q=sz$Q)
# sz is the imported csv file
hue Q
1 0 0
2 1 13
3 2 26
4 3 39
5 4 52
6 5 65
Here, the same: Just a head() print.
Now, my aim is to create a new column next to hue in my rawdata.
In this new column I want the Q-values that correspond to each hue according to df2. For example: from minute zero to five on 2017-07-01, the Q-value in the new column will be 0, because hue is 0.
I tried many things with the match function like:
df1$match=sz$Q[match(df1$hue, sz$hue)]
But it's only working for the 0's and not for other values like 1,2,3 etc. R only gives me NAs at those points.
It works perfectly in this Video:
Using Match function in R
Actually, I'm not quite sure whether this is really a "match" problem or more of a format problem, because I checked these two things:
> df1["2017-07-21 23:20:00","hue"]==2 # the value at this date is actually 2!
[1] FALSE
> is.numeric(df1["2017-07-21 23:20:00","hue"])
[1] TRUE
Does anyone know what I can do to get R to consider all values?
Thank you so much for taking time for this!
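The check above (a numeric value that prints as 2 but is not == 2) suggests, though this is only an assumption about the data, that hue is stored as a double that is not exactly an integer. match() compares values exactly, so such values would give NA. Here is a minimal sketch on toy data, where the small offsets are made up to imitate that situation and rounding is only one possible remedy:

```r
# Toy lookup table standing in for sz (values copied from the question's head())
sz <- data.frame(hue = 0:5, Q = c(0, 13, 26, 39, 52, 65))

# Toy hue values; the last two are deliberately not exactly 2 and 3
hue <- c(0, 1, 2 + 1e-9, 3 - 1e-9)

# Exact matching fails for the inexact values
q_exact <- sz$Q[match(hue, sz$hue)]

# Rounding to the nearest integer before matching recovers the lookup
q_rounded <- sz$Q[match(round(hue), sz$hue)]

q_exact
q_rounded
```

Printing the real column with more digits, for example print(df1$hue, digits = 17), would show whether this is actually what is happening.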

Number of different states/events in a sequence with TraMineR

I'm interested in counting the number of different states present in each sequence of my dataset. For the sake of simplicity, I'll use a TraMineR example. Starting from this sequence:
1230 D-D-D-D-A-A-A-A-A-A-A-D
and then extracting the distinct successive states with the seqdss function, I obtain:
1230 D-A-D
Is there a function to extract the overall number of different states in the sequence only accounting for presence of a state and not its potential repetition along the sequence? In other words, for the case described above I would like to obtain a vector containing for this sequence the value 2 (event A and event D) instead of 3 (1 event A + 2 events D).
Thank you.
You can compute the number of distinct states by first computing the state distribution of each sequence using seqistatd and then summing the number of non-zero elements in each row of the matrix returned by seqistatd. I illustrate below using the biofam data:
library(TraMineR)
data(biofam)
bf.seq <- seqdef(biofam[,10:25])
## longitudinal distributions
bf.ldist <- seqistatd(bf.seq)
n.states <- apply(bf.ldist,1,function(x) sum(x != 0))
## displaying results
bf.ldist[1:3,]
     0  1 2 3 4 5 6 7
1167 9  0 0 1 0 0 6 0
514  1 10 0 1 0 0 4 0
1013 7  5 0 1 0 0 3 0
n.states[1:3]
1167  514 1013
   3    4    4
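The counting step can also be written without apply(). The sketch below uses a plain matrix holding the three rows printed above as a stand-in for the seqistatd() result, so it runs without TraMineR; ldist and n_states are illustrative names.

```r
# Stand-in for the first three rows of the state distribution matrix shown above
ldist <- matrix(c(9,  0, 0, 1, 0, 0, 6, 0,
                  1, 10, 0, 1, 0, 0, 4, 0,
                  7,  5, 0, 1, 0, 0, 3, 0),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("1167", "514", "1013"), as.character(0:7)))

# ldist != 0 is a logical matrix; summing each row counts the states visited
n_states <- rowSums(ldist != 0)
n_states
```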
I might be missing something here, but it looks like you're after unique.
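A minimal base R sketch of the unique idea, assuming the sequence is available as a plain character string rather than a TraMineR sequence object (states and n_distinct_states are illustrative names):

```r
# The example sequence from the question, as a plain string
s <- "D-D-D-D-A-A-A-A-A-A-A-D"

# Split into individual states
states <- strsplit(s, "-", fixed = TRUE)[[1]]

# Distinct successive states, comparable to the D-A-D shown in the question
successive <- rle(states)$values

# Number of different states, ignoring repetitions
n_distinct_states <- length(unique(states))

successive
n_distinct_states
```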
Your expected result is not clear (maybe because you describe it in English and not in pseudocode). I guess you are looking for table to count the number of states per subject. Here I am using the actcal data provided with the TraMineR package:
library(TraMineR)
data(actcal)
actcal.seq <- seqdef(actcal,13:24)
head(actcal.seq )
Sequence
2848 B-B-B-B-B-B-B-B-B-B-B-B
1230 D-D-D-D-A-A-A-A-A-A-A-D
2468 B-B-B-B-B-B-B-B-B-B-B-B
654 C-C-C-C-C-C-C-C-C-B-B-B
6946 A-A-A-A-A-A-A-A-A-A-A-A
1872 D-B-B-B-B-B-B-B-B-B-B-B
Now applying table to the 4th row for example:
tab <- table(unlist(actcal.seq[4,]))
tab[tab>0]
B C
3 9
