Trace back to certain individuals - r

I have the following data and my issue is the following: at some point in time and at a certain place, a contamination occurs. I don't know who caused it, but I want to trace it back with the largest likelihood possible. I need a probability for each individual to be the cause of this contamination. This is what the desired column "Prob_Contaminator" should show.
I know certain times at which a note was made about a contamination, but this is only the time at which the contamination was reported. What I have in mind is a high probability of having caused the contamination if someone is temporally close to the occurrence, decreasing the further away an individual's observation is from the contamination.
It is important that individuals are only considered as possible causes of the contamination if they have the same location_id as the occurrence row. Another problem is that people who frequently appear in the data would automatically appear to have caused the contamination more often. I additionally have data on when cleaning occurred. I thought about limiting the observations of these "frequent users" to the one observation closest to the event within one cleaning interval. How do I manage to properly spot the contaminators without discriminating against the people who happen to be "heavy users"?
Data:
"Event_ID" "Person_ID" "Note" "time" "location_id" "Cleaning"
1 1 "" 1990-01-01 1 1
2 1 "" 1990-01-02 1 0
3 2 "" 1990-01-03 1 0
4 3 "Occured" 1993-01-03 1 1
5 3 "" 1995-01-04 2 0
6 3 "" 1995-01-04 2 0
7 4 "" 1995-01-04 3 0
8 5 "" 1995-01-05 6 0
9 6 "" 1995-01-05 5 0
10 7 "Ocurred" 1995-01-05 6 1
This is what I need (the Prob_Contaminator column is not complete):
"Event_ID" "Person_ID" "Note" "time" "location_id" "Cleaning" "Prob_Contaminator"
1 1 "" 1990-01-01 1 1 0.4
2 1 "" 1990-01-02 1 0 0.4
3 2 "" 1990-01-03 1 0 0.6
4 3 "Occured" 1993-01-03 1 1
5 3 "" 1995-01-04 2 0
6 3 "" 1995-01-04 2 0
7 4 "" 1995-01-04 3 0
8 5 "" 1995-01-05 6 0
9 6 "" 1995-01-05 5 0
10 7 "Ocurred" 1995-01-05 6 1
The following example shows how I imagine the column Prob_Contaminator to be constructed. If we consider row number 4 (Event_ID = 4), we see that a contamination has occurred. Now I want to look back at all events since the last cleaning (in this case 3 events, based on the cleaning that took place at Event_ID = 1) and consider how far they are from the contamination event. This should only happen under the condition that the events looked at take place in the same location_id. Since the location_id is the same in this example (= 1), the probability of being the contaminator for these 3 events is 1/3. Multiple events of one person should be reduced to the event closest in time to the contamination. This reduces the cases to two, and hence the probability for Person_ID 1 and Person_ID 2 is 1/2 each. Additionally, I want to weight each probability by the temporal distance to the contamination. Since the "time" value of Person_ID = 2 is closer to the contamination row than the "time" value of Person_ID = 1, the Prob_Contaminator for Person_ID = 2 should be weighted higher. In this case I applied a weight of 1.2 to the more recent event (1.2 * 0.5 = 0.6) and a weight of 0.8 to the less recent event (0.8 * 0.5 = 0.4).
Code:
df <- data.frame(Event_ID = 1:10,
                 Person_ID = c("1", "1", "2", "3", "3", "3", "4", "5", "6", "7"),
                 Note = c("", "", "", "Occured", "", "", "", "", "", "Ocurred"),
                 time = as.Date(c("1990-1-1", "1990-1-2", "1990-1-3", "1993-1-3", "1995-1-4",
                                  "1995-1-4", "1995-1-4", "1995-1-5", "1995-1-5", "1995-1-5")),
                 location_id = c("1", "1", "1", "1", "2", "2", "3", "6", "5", "6"),
                 Cleaning = c("1", "0", "0", "1", "0", "0", "0", "0", "0", "1"))
df2 <- data.frame(Event_ID = 1:10,
                  Person_ID = c("1", "1", "2", "3", "3", "3", "4", "5", "6", "7"),
                  Note = c("", "", "", "Occured", "", "", "", "", "", "Ocurred"),
                  time = as.Date(c("1990-1-1", "1990-1-2", "1990-1-3", "1993-1-3", "1995-1-4",
                                   "1995-1-4", "1995-1-4", "1995-1-5", "1995-1-5", "1995-1-5")),
                  location_id = c("1", "1", "1", "1", "2", "2", "3", "6", "5", "6"),
                  Cleaning = c("1", "0", "0", "1", "0", "0", "0", "0", "0", "1"),
                  Prob_Contaminator = c("0.4", "0.4", "0.6", "", "", "", "", "", "", ""))
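A minimal sketch of the logic described above, not a definitive implementation: for each contamination event, look back to the last cleaning at the same location_id, keep each person's closest event, and normalise a closeness weight into probabilities that sum to 1. The helper name score_contamination and the inverse-distance weighting are my own assumptions; the 1.2/0.8 weights used in the example above are one ad hoc choice and will not be reproduced exactly.
library(dplyr)

score_contamination <- function(df, ev) {
  window <- df %>%
    filter(location_id == ev$location_id,
           time <= ev$time,
           Event_ID != ev$Event_ID) %>%
    arrange(time, Event_ID)
  if (nrow(window) == 0) return(NULL)
  # keep only events from the last cleaning at this location (inclusive) onwards
  clean_times <- window$time[window$Cleaning == "1"]
  last_clean <- if (length(clean_times)) max(clean_times) else min(window$time)
  window %>%
    filter(time >= last_clean) %>%
    # one observation per person: the one closest in time to the contamination
    group_by(Person_ID) %>%
    slice_min(as.numeric(ev$time - time), n = 1, with_ties = FALSE) %>%
    ungroup() %>%
    # assumed weighting: inverse of the distance in days (+1 so same-day events stay finite)
    mutate(weight = 1 / (as.numeric(ev$time - time) + 1),
           Prob_Contaminator = weight / sum(weight),
           Occurrence_ID = ev$Event_ID)
}

occurrences <- df %>% filter(Note != "")
probs <- bind_rows(lapply(seq_len(nrow(occurrences)),
                          function(i) score_contamination(df, occurrences[i, ])))
probs then holds one row per candidate person and occurrence, with probabilities summing to 1 within each occurrence; it can be joined back to df by Event_ID if the result is needed as a column.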

Related

Cavs vs. Warriors - probability of Cavs winning the series includes combinations like "0,1,0,0,0,1,1" - but the series is over after game 5

There is a problem on DataCamp about computing the probability of winning an NBA series. The Cavs and the Warriors are playing a seven-game championship series. The first to win four games wins the series. They each have a 50-50 chance of winning each game. If the Cavs lose the first game, what is the probability that they win the series?
Here is how DataCamp computed the probability using Monte Carlo simulation:
B <- 10000
set.seed(1)
results <- replicate(B, {
  x <- sample(0:1, 6, replace = TRUE)  # 0 when a game is lost and 1 when won
  sum(x) >= 4
})
mean(results)
Here is a different way they computed the probability using simple code:
# Assign a variable 'n' as the number of remaining games.
n <- 6
# Assign a variable `outcomes` as a vector of possible game outcomes: 0 indicates a loss and 1 a win for the Cavs.
outcomes <- c(0, 1)
# Assign a variable `l` to a list of all possible outcomes in all remaining games. Use the `rep` function on `list(outcomes)` to create a list of length `n`.
l <- rep(list(outcomes), n)
# Create a data frame named 'possibilities' that contains all combinations of possible outcomes for the remaining games.
possibilities <- expand.grid(l)  # My comment: note how this produces 64 combinations.
# Create a vector named 'results' that indicates whether each row in the data frame 'possibilities' contains enough wins for the Cavs to win the series.
rowSums(possibilities)
results <- rowSums(possibilities) >= 4
# Calculate the proportion of 'results' in which the Cavs win the series.
mean(results)
Question/Problem:
They both produce approximately the same probability of winning the series, ~0.34. However, there seems to be a flaw in the concept and the code design. For example, the code (sampling six times) allows for combinations such as the following:
G2 G3 G4 G5 G6 G7 rowSums
0 0 0 0 0 0 0 # Series over after G4 (Cavs lose). No need for game G5-G7.
0 0 0 0 1 0 1 # Series over after G4 (Cavs lose). Double counting!
0 0 0 0 0 1 1 # Double counting!
...
1 1 1 1 0 0 4 # No need for game G6 and G7.
1 1 1 1 0 1 5 # Double counting! This is the same as 1,1,1,1,0,0.
0 1 1 1 1 1 5 # No need for game G7.
1 1 1 1 1 1 6 # Series over after G5 (Cavs win). Double counting!
> rowSums(possibilities)
[1] 0 1 1 2 1 2 2 3 1 2 2 3 2 3 3 4 1 2 2 3 2 3 3 4 2 3 3 4 3 4 4 5 1 2 2 3 2 3 3 4 2 3 3 4 3 4 4 5 2 3 3 4 3 4 4 5 3 4 4 5 4 5 5 6
As you can see, these combinations are never possible in reality. After winning the first four of the remaining six games, no more games would be played. Similarly, after losing the first three of the remaining six games, no more games would be played. So these combinations shouldn't be included in the computation of the probability of winning the series; some of the combinations are double counted.
Here is what I did to omit some of the combinations that are not possible in real life.
library(dplyr)
outcomes <- c(0, 1)
l <- rep(list(outcomes), 6)
possibilities <- expand.grid(l)
possibilities <- possibilities %>%
  mutate(rowsums = rowSums(possibilities)) %>%
  filter(rowsums <= 4)
But then I am not able to omit the other unnecessary combinations. For example, I want to remove two of these three: (a) 1,0,0,0,0,0 (b) 1,0,0,0,0,1 (c) 1,0,0,0,1,1. This is because no more games would be played after losing three more in a row (on top of the game 1 loss), so these are basically double counted.
There are too many conditions for me to filter them out individually. There has to be a more efficient and intuitive way to do this. Can someone give me some hints on how to solve this whole mess?
Here is a way:
library(dplyr)
outcomes <- c(0, 1)
l <- rep(list(outcomes), 6)
possibilities <- expand.grid(l)
possibilities %>%
  mutate(rowsums = rowSums(cur_data()),
         anti_sum = rowSums(!cur_data())) %>%
  filter(rowsums <= 4, anti_sum <= 3)
We use the fact that R coerces numbers to logicals, where 0 becomes FALSE, so rowSums(!cur_data()) counts the losses in each row.
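A quick illustration of that coercion in plain R:
sum(!0)                                           # 1: 0 is coerced to FALSE, then negated to TRUE
sum(!c(0, 1, 1))                                  # 1: only the 0 counts after negation
rowSums(!data.frame(G2 = c(0, 1), G3 = c(0, 0)))  # 2 1: number of losses in each row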

Episode splitting in survival analysis by the timing of an event in R

Is it possible to split episodes by a given variable in survival analysis in R, similar to Stata's stsplit, used like this: stsplit var, at(0) after(time=time)?
I am aware that the survival package allows one to split episodes by given cut points such as c(0,5,10,15) in survSplit, but if a variable, say time of divorce, differs for each individual, then providing cut points for each individual would be impossible, and the split would have to be based on the value of a variable (say graduation, or divorce, or job termination).
Is anyone aware of a package or a resource I might be able to tap into?
Perhaps the Epi package is what you are looking for. It offers multiple ways to cut/split the follow-up time using Lexis objects. Here is the documentation of cutLexis().
After some poking around, I think tmerge() in the survival package can achieve what stsplit var does, which is to split episodes not just at given cut points (the same for all observations), but at the time when an event occurs for an individual.
This is the only way I knew how to split the data:
library(survival)  ## for survSplit() and tmerge()
id <- c(1, 2, 3)
age <- c(19, 20, 29)
job <- c(1, 1, 0)
time <- age - 16  ## create time since age 16 ##
data <- data.frame(id, age, job, time)
id age job time
1 1 19 1 3
2 2 20 1 4
3 3 29 0 13
## simple split by time ##
## 0 to up 2 years, 2-5 years, 5+ years ##
data2 <- survSplit(data, cut = c(0, 2, 5), end = "time", start = "start",
                   event = "job")
id age start time job
1 1 19 0 2 0
2 1 19 2 3 1
3 2 20 0 2 0
4 2 20 2 4 1
5 3 29 0 2 0
6 3 29 2 5 0
7 3 29 5 13 0
However, if I want to split by a certain variable, such as when each individual finished school, each person might have a different cut point (having finished school at a different age).
## split by time dependent variable (age finished school) ##
d1 <- data.frame(id, age, time, job)
scend <- c(17, 21, 24) - 16
d2 <- data.frame(id, scend)
## create start/stop time ##
base <- tmerge(d1, d1, id = id, tstop = time)
## create time-dependent covariate ##
s1 <- tmerge(base, d2, id = id,
             finish = tdc(scend))
id age time job tstart tstop finish
1 1 19 3 1 0 1 0
2 1 19 3 1 1 3 1
3 2 20 4 1 0 4 0
4 3 29 13 0 0 8 0
5 3 29 13 0 8 13 1
I think tmerge() is more or less comparable to the stsplit function in Stata.

Row wise iteration with condition

I have a data frame and I want to generate a new column that holds the result of a calculation based on the previous row. Additionally, the calculation has some conditions.
The data frame consists of energy production = p, energy consumption = c, energy from the grid = g, and energy stored in the battery = s.
My goal is to calculate the usage of a battery in a PV system. When the modules produce more than needed, the battery gets charged, and otherwise discharged. When the battery doesn't have enough energy, the grid delivers the remaining energy.
So in the first row the battery gets charged because I produce more than I need. In row 5 I need more energy than I produce, so the battery gets discharged, and so on.
One row is one hour, so row n+1 is based on the energy demand and supply of row n.
### Old:
n p c g
1 2 1 0
2 3 1 0
3 4 3 0
4 3 5 2
5 5 8 3
6 2 1 0
### New:
n p c g s
1 2 1 0 1
2 3 1 0 3
3 4 3 0 4
4 3 5 0 2
5 5 8 1 0
6 2 1 0 1
When I use your code the result is like this:
First column - c
Second column - p
Third column - g
Fourth column - s
The battery gets charged, but the discharging does not match what I expect. The battery holds 2.3801 units of energy and the demand in n+1 is 0.875.
So the result should be 2.3801 - 0.875 = 1.5051.
This process should end when s = 0.
I don't understand why your code works for the rest of the data.
I found a solution here which works very well for my problem.
My battery is floored at 0 and capped at 16 kWh, so I just added the pmin() function:
mutate(result = accumulate(production - consumw1, ~ pmin(16, pmax(0, .x + .y)), .init = 0)[-1])
Thanks for your help!
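For completeness, here is a self-contained sketch of that accumulate() idea applied to the small example above (column names p, c, g and s as in the question; the 16 kWh cap is taken from the comment above, and the grid column is recomputed as whatever the battery cannot cover):
library(dplyr)
library(purrr)

dat <- data.frame(n = 1:6,
                  p = c(2, 3, 4, 3, 5, 2),   # production
                  c = c(1, 1, 3, 5, 8, 1))   # consumption

cap <- 16  # battery capacity in kWh

dat <- dat %>%
  mutate(s = accumulate(p - c, ~ pmin(cap, pmax(0, .x + .y)), .init = 0)[-1],  # battery state after each hour
         g = pmax(0, c - p - lag(s, default = 0)))                             # grid covers what the battery cannot
dat
This reproduces the "New" table above: s = 1, 3, 4, 2, 0, 1 and g = 0, 0, 0, 0, 1, 0.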

Imputation for longitudinal data using observation before and after missing data

I’m in the process of cleaning some longitudinal data and I have several missing cases. I am trying to use an imputation that incorporates observations before and after the missing case. I’m wondering how I can go about addressing the issues detailed below.
I’ve been trying to break the problem apart into smaller, more manageable operations and objects; however, the solutions I keep coming to force me to use conditional logic based on the rows immediately above and below a missing value, and, quite frankly, I’m at a bit of a loss as to how to do this. I would love a little guidance if you know of a good technique I can use or experiment with, or if you know of any good search terms I can use when looking up a solution.
The details are below:
#Fake dataset creation
id <- c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3,4,4,4,4,4,4,4)
time <-c(0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6)
ss <- c(1,3,2,3,NA,0,0,2,4,0,NA,0,0,0,4,1,2,4,2,3,NA,2,1,0,NA,NA,0,0)
mydat <- data.frame(id, time, ss)
The goal here is to find a way to get the mean of the value before (3) and after (0) the NA value for ID #1 (variable ss) so that the data look like this: 1,3,2,3,1.5,0,0,
ID# 2 (variable ss) should look like this: 2,4,0,0,0,0,0
ID #3 (variable ss) should use a last observation carried forward approach, so it would need to look like this: 4,1,2,4,2,3,3
ID #4 (variable ss) has two consecutive NA values and should not be changed. It will be flagged for a different analysis later in my project. So, it should look like this: 2,1,0,NA,NA,0,0 (no change).
I use the smwrBase package; the syntax for filling in only one missing value is below, but it doesn't address id.
smwrBase::fillMissing(ss, max.fill=1)
The zoo package might be more standard, same issue though.
zoo::na.approx(ss, maxgap=1)
Below is an approach that accounts for the variable id. Current interpolation approaches don't like to fill in the last value, so I added a manual if statement for that. It's a bit brute force; there might be a tapply approach out there.
> id <- c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3,4,4,4,4,4,4,4)
> time <-c(0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6)
> ss <- c(1,3,2,3,NA,0,0,2,4,0,NA,0,0,0,4,1,2,4,2,3,NA,2,1,0,NA,NA,0,0)
> mydat <- data.frame(id, time, ss, ss2=NA_real_)
> for (i in unique(id)) {
+   # interpolate for gaps
+   mydat$ss2[mydat$id==i] <- zoo::na.approx(ss[mydat$id==i], maxgap=1, na.rm=FALSE)
+   # extension for gap as last value
+   if (is.na(mydat$ss2[mydat$id==i][length(mydat$ss2[mydat$id==i])])) {
+     mydat$ss2[mydat$id==i][length(mydat$ss2[mydat$id==i])] <-
+       mydat$ss2[mydat$id==i][length(mydat$ss2[mydat$id==i])-1]
+   }
+ }
> mydat
id time ss ss2
1 1 0 1 1.0
2 1 1 3 3.0
3 1 2 2 2.0
4 1 3 3 3.0
5 1 4 NA 1.5
6 1 5 0 0.0
7 1 6 0 0.0
8 2 0 2 2.0
9 2 1 4 4.0
10 2 2 0 0.0
11 2 3 NA 0.0
12 2 4 0 0.0
13 2 5 0 0.0
14 2 6 0 0.0
15 3 0 4 4.0
16 3 1 1 1.0
17 3 2 2 2.0
18 3 3 4 4.0
19 3 4 2 2.0
20 3 5 3 3.0
21 3 6 NA 3.0
22 4 0 2 2.0
23 4 1 1 1.0
24 4 2 0 0.0
25 4 3 NA NA
26 4 4 NA NA
27 4 5 0 0.0
28 4 6 0 0.0
The interpolated value for id=1 is 1.5 (the average of 3 and 0), for id=2 it is 0 (the average of 0 and 0), and for id=3 it is 3 (the preceding value, since there is no following value).
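For reference, the same logic can also be written as a grouped mutate instead of an explicit loop (a sketch; with this data it reproduces the ss2 column above):
library(dplyr)
mydat %>%
  group_by(id) %>%
  mutate(ss2 = zoo::na.approx(ss, maxgap = 1, na.rm = FALSE),
         # carry the previous value forward when a person's last value is still NA
         ss2 = if_else(row_number() == n() & is.na(ss2), lag(ss2), ss2)) %>%
  ungroup()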

How can I sort the X axis in a Barplot in R?

I have binned data that looks like this:
(8.048,18.05] (-21.95,-11.95] (-31.95,-21.95] (18.05,28.05] (-41.95,-31.95]
81 76 18 18 12
(-132,-122] (-122,-112] (-112,-102] (-162,-152] (-102,-91.95]
6 6 6 5 5
(-91.95,-81.95] (-192,-182] (28.05,38.05] (38.05,48.05] (58.05,68.05]
5 4 4 4 4
(78.05,88.05] (98.05,108] (-562,-552] (-512,-502] (-482,-472]
4 4 3 3 3
(-452,-442] (-412,-402] (-282,-272] (-152,-142] (48.05,58.05]
3 3 3 3 3
(68.05,78.05] (118,128] (128,138] (-582,-572] (-552,-542]
3 3 3 2 2
(-532,-522] (-422,-412] (-392,-382] (-362,-352] (-262,-252]
2 2 2 2 2
(-252,-242] (-142,-132] (-81.95,-71.95] (148,158] (-1402,-1392]
2 2 2 2 1
(-1372,-1362] (-1342,-1332] (-942,-932] (-862,-852] (-822,-812]
1 1 1 1 1
(-712,-702] (-682,-672] (-672,-662] (-632,-622] (-542,-532]
1 1 1 1 1
(-502,-492] (-492,-482] (-472,-462] (-462,-452] (-442,-432]
1 1 1 1 1
(-432,-422] (-352,-342] (-332,-322] (-312,-302] (-302,-292]
1 1 1 1 1
(-202,-192] (-182,-172] (-172,-162] (-51.95,-41.95] (88.05,98.05]
1 1 1 1 1
(108,118] (158,168] (168,178] (178,188] (298,308]
1 1 1 1 1
(318,328] (328,338] (338,348] (368,378] (458,468]
1 1 1 1 1
How can I plot this data so that the bins are sorted from most negative on the left to most positive on the right? Currently the bars are not sorted at all; in particular, the second bar (value = 76) is placed to the right of the first:
(8.048,18.05] (-21.95,-11.95]
81 76
This is the command I use to plot:
barplot(x,ylab="Number of Unique Tags", xlab="Expected - Observed")
I really want to help answer your question, but I have to tell you, I can't make heads or tails of your data. I see a lot of opening parentheses but no closing ones. The data look sorted descending by whatever the values are on the bottom of each row. I have no idea what to make of a value like "(8.048,18.05]".
Am I missing something obvious? Can you make a simpler example where your data structure is not a factor?
I would generally expect a data frame or a matrix with two columns, one for the X and one for the Y.
See if this example of sorting helps (I'm sort of shooting in the dark here):
tN <- table(Ni <- rpois(100, lambda = 5))
r <- barplot(tN)
# stop here and examine the plot
# the next bit converts the table to a data frame,
# sorts it, and plots it again
df <- data.frame(tN)
df2 <- df[order(df$Freq), ]
barplot(df2$Freq)
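If the counts come from something like table(cut(...)), another route (a sketch, assuming x is the named vector of bin counts the question plots with barplot(x, ...)) is to order the bars by the numeric lower bound parsed out of each bin label:
# extract the lower bound of a label like "(8.048,18.05]" -> 8.048
lower <- as.numeric(sub("^\\((-?[0-9.]+),.*$", "\\1", names(x)))
x_sorted <- x[order(lower)]
barplot(x_sorted, ylab = "Number of Unique Tags", xlab = "Expected - Observed")
Alternatively, tabulating the cut() factor directly and plotting that table keeps the bins in their natural order, since the factor levels follow the order of the breaks.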

Resources