SPSS: counting first consecutive zeros across variables

I want to count the days when a subject did not receive treatment (a "0" in my file). If a subject did receive treatment, it is denoted with "1". A subject can get multiple courses of treatment, and I would like to count the days between the first and second treatment. I am not (yet) interested in the time between the second and third treatment.
Basically my spss file looks like this:
id   day1  day2  day3  day4  ...  day28
A    1     0     0     1     ...  0
B    1     0     1     0     ...  1
C    etc.
I am only interested in the first series of zeros. The output I hope to get is:
id first_series_zero
A 2
B 1
C ...
Can anyone help me out here? Obviously, just counting all the zeros isn't going to work, because there might be multiple sets of zeros in one row.
Cheers, Dylan

Here is one fairly general approach that lets you calculate the times between all of the different treatments. First I create a vector that stores the locations of all of the treatments, Loc1 TO Loc5 (using day1 to day5 as an example).
DATA LIST FREE / day1 day2 day3 day4 day5.
BEGIN DATA
1 0 0 1 0
1 0 1 0 1
END DATA.
VECTOR day = day1 TO day5.
VECTOR Loc(5,F2.0).
* Walk across the days and record the position of every treatment.
COMPUTE #id = 1.
LOOP #i = 1 TO 5.
  DO IF day(#i) = 1.
    COMPUTE Loc(#id) = #i.
    COMPUTE #id = #id + 1.
  END IF.
END LOOP.
Now if you run this transformation, the Loc vector will look like this for this example data.
Loc1 Loc2 Loc3 Loc4 Loc5
1 4 . . .
1 3 5 . .
Calculating the difference for the first series is then as simple as:
COMPUTE first_series_zero = Loc2 - Loc1 - 1.
This returns missing if there is never a second (or first) treatment, and it does not depend on day1 always being the first day of treatment. Calculating the differences between all of the treatments is just as simple; here is a DO REPEAT approach.
VECTOR DifS(4,F2.0).
DO REPEAT F = Loc1 TO Loc4 /B = Loc2 TO Loc5 /D = DifS1 TO DifS4.
  COMPUTE D = B - F - 1.
END REPEAT.
And so DifS1 is the number of zeros between the 1st and 2nd treatment, DifS2 the number between the 2nd and 3rd, etc. (Both this DO REPEAT and the first loop could be made more efficient with a loop that only runs over valid/possible values.)
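For comparison, here is the same first-gap logic in R (a sketch of my own, not from the original answer), assuming a data frame d with columns day1 to day5 holding the same example data:
d <- data.frame(day1 = c(1, 1), day2 = c(0, 0), day3 = c(0, 1),
                day4 = c(1, 0), day5 = c(0, 1))

d$first_series_zero <- apply(d[paste0("day", 1:5)], 1, function(x) {
  loc <- which(x == 1)             # locations of all treatments
  if (length(loc) < 2) return(NA)  # missing without a second treatment
  loc[2] - loc[1] - 1              # zeros between 1st and 2nd treatment
})
d$first_series_zero
# [1] 2 1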

Related

Simplify time-dependent data created with tmerge

I have a large data.table containing many time-dependent variables (50+) for use in coxph models. This dataset was generated using tmerge. Patients are identified by the patid variable, and time intervals are defined by tstart and tstop.
The majority of the models I want to fit use only a selection of these time-dependent variables. Unfortunately, the speed of Cox proportional hazards models depends on the number of rows and the number of timepoints in my data.table, even if all the data in those rows is identical. Is there a good/fast way of combining rows that are identical apart from the time interval, in order to speed up my models? In many cases, tstop for one line equals tstart for the next, with everything else identical after removing some columns.
For example, I would want to convert the data.table example into results:
library(data.table)
example=data.table(patid = c(1,1,1,2,2,2), tstart=c(0,1,2,0,1,2), tstop=c(1,2,3,1,2,3), x=c(0,0,1,1,2,2), y=c(0,0,1,2,3,3))
results=data.table(patid = c(1,1,2,2), tstart=c(0,2,0,1), tstop=c(2,3,1,3), x=c(0,1,1,2), y=c(0,1,2,3))
This example is extremely simplified. My current dataset has ~600k patients, >20M rows, and ~3.65k time points. Removing variables should significantly reduce the number of rows needed, which should significantly speed up models fit on a subset of variables.
The best I can come up with is:
example = data.table(patid = c(1,1,1,2,2,2), tstart=c(0,1,2,0,1,2), tstop=c(1,2,3,1,2,3), x=c(0,0,1,1,2,2), y=c(0,0,1,2,3,3))
example = example[order(patid, tstart),]
# flag rows whose covariates match the following row within the same patient
example[, matched := x==shift(x,-1) & y==shift(y,-1), by="patid"]
example[is.na(matched), matched := FALSE]
# extend matched rows to the end of the following interval
example[, tstop := ifelse(matched, shift(tstop,-1), tstop)]
# drop the following row, which is now redundant
example[, remove := tstop==shift(tstop), by="patid"]
example = example[is.na(remove) | remove==FALSE,]
example$matched = NULL
example$remove = NULL
This solves the example; however, the code is complex and overkill, and when the dataset has many columns, having to edit x==shift(x,-1) for each variable is asking for errors. Is there a sane way of doing this? The list of columns will change a number of times inside loops, so accepting a vector of column names to compare would be ideal.
This solution also doesn't cope with multiple consecutive rows that contain the same covariate values (e.g. time periods of (0,1), (1,3), (3,4) with the same covariate values).
This solution creates a temporary group id based on the rleid() of the combination of x and y. The temporary value is used for grouping and then dropped (temp := NULL):
example[, .(tstart = min(tstart), tstop = max(tstop), x[1], y[1]),
        by = .(patid, temp = rleid(paste(x, y, sep = "_")))][, temp := NULL][]
# patid tstart tstop x y
# 1: 1 0 2 0 0
# 2: 1 2 3 1 1
# 3: 2 0 1 1 2
# 4: 2 1 3 2 3
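A small variant (my addition, not from the thread): rleid() also accepts several vectors directly, so the paste() can be dropped, which avoids accidental key collisions such as paste("1_2", "3", sep = "_") and paste("1", "2_3", sep = "_") producing the same string:
example[, .(tstart = min(tstart), tstop = max(tstop), x = x[1], y = y[1]),
        by = .(patid, temp = rleid(x, y))][, temp := NULL][]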
Here is an option that builds on our conversation/comments above, but allows the flexibility of passing a vector of column names:
cols=c("x","y")
cbind(
example[, id:=rleidv(.SD), .SDcols = cols][, .(tstart=min(tstart), tstop=max(tstop)), .(patid,id)],
example[,.SD[1],.(patid,id),.SDcols =cols][,..cols]
)[,id:=NULL][]
Output:
patid tstart tstop x y
1: 1 0 2 0 0
2: 1 2 3 1 1
3: 2 0 1 1 2
4: 2 1 3 2 3
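The same result can also be had in a single chain without the cbind() (a variant of mine, using data.table's first() to take the covariate values of each run):
example[, id := rleidv(.SD), .SDcols = cols][
  , c(.(tstart = min(tstart), tstop = max(tstop)), lapply(.SD, first)),
  by = .(patid, id), .SDcols = cols][, id := NULL][]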
Based on Wimpel's answer, I have created the following solution, which also allows using a vector of column names as input.
example=data.table(patid = c(1,1,1,2,2,2), tstart=c(0,1,2,0,1,2), tstop=c(1,2,3,1,2,3), x=c(0,0,1,1,2,2), y=c(0,0,1,2,3,3))
variables = c("x","y")
# build a composite key from the selected columns
example[, key_ := do.call(paste, c(.SD, sep = "_")), .SDcols = variables]
# collapse each run of identical keys within a patient to one interval
example[, c("tstart", "tstop") := .(min(tstart), max(tstop)),
        by = .(patid, temp = rleid(key_))][, key_ := NULL]
example = unique(example)
I would imagine this could be simplified, but I think it does what is needed for more complex examples.
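Wrapped up as a reusable helper (a sketch of my own, assuming data.table is attached and that the table holds patid, tstart, tstop plus only the covariate columns you intend to keep; any other varying column would prevent the final unique() from collapsing rows):
collapse_intervals <- function(dt, cols) {
  dt <- copy(dt)  # work on a copy so the caller's table is not modified by reference
  dt[, key_ := do.call(paste, c(.SD, sep = "_")), .SDcols = cols]
  dt[, c("tstart", "tstop") := .(min(tstart), max(tstop)),
     by = .(patid, temp = rleid(key_))][, key_ := NULL]
  unique(dt)
}
collapse_intervals(example, c("x", "y"))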

Obtaining CCF values using a loop in R

I have a data frame which looks like this:
files  Time  Male  Female
A      1.1   0     1
A      1.2   0     1
A      1.3   1     1
A      1.4   1     0
B      2.4   0     1
B      2.5   1     1
B      2.6   0     1
B      2.7   1     1
The 'files' column represents recording file names, 'Time' represents discretised time bins of 0.1 seconds, and the 'Male' and 'Female' columns indicate whether the male and female are calling (1) or not (0) during that time bin.
I want to find the lags at which the female and male are most correlated for all the different recordings. More specifically, I want the output to be a dataframe with three columns: recording file name, peak correlation score between female and male, and the lag value at which the peak correlation occurred.
So far I have managed to measure the peak cross-correlation of the files individually:
file1 <- dataframe %>% filter(file == unique(dataframe$`Begin File`)[1])
#obtaining observations of first file entry
Then I used the following function to find the peak correlation:
Find_Abs_Max_CCF <- function(a, b, e = 0) {
  d <- ccf(a, b, plot = FALSE, lag.max = length(a)/2)
  cor <- d$acf[,,1]                # correlation at each lag
  abscor <- abs(d$acf[,,1])        # magnitude, so negative peaks also count
  lag <- d$lag[,,1]
  res <- data.frame(cor, lag)
  absres <- data.frame(abscor, lag)
  maxcor <- max(absres$abscor)
  # keep every lag whose |correlation| lies within a tolerance e of the peak
  absres_max <- res[which(absres$abscor >= maxcor - maxcor*e & absres$abscor <= maxcor + maxcor*e),]
  return(absres_max)
}
Find_Abs_Max_CCF(file1$f,file1$m,0.05)
Is there a way to use a function or loop to automate the process so that I get the peak correlation value and the respective lag value for all distinct file entries?
Any help is highly appreciated. Thanks in advance.
Edit:
I used the group_map() function with the following code:
part.cor <- dataframe %>% group_by(files) %>% group_map(~Find_Abs_Max_CCF(dataframe$f, dataframe$m, 0.05))
However, it returns the same peak correlation and lag values repeated throughout the output dataframe.
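A likely fix (a sketch of mine, not an accepted answer from the thread): group_map() hands each group's data to the function as .x, so passing the full dataframe is why every group returns identical values. Assuming the calling columns are f and m as in the code above:
library(dplyr)

part.cor <- dataframe %>%
  group_by(files) %>%
  group_map(~ Find_Abs_Max_CCF(.x$f, .x$m, 0.05) %>%
              mutate(files = .y$files)) %>%  # .y carries the group key
  bind_rows()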

Proportion across rows and specific columns in R

I want to get a proportion of PCR detection and create a new column, using the following test dataset. I want a proportion of detection for each row, computed only over the columns pcr1 to pcr6; the other columns should be ignored.
site    sample  pcr1  pcr2  pcr3  pcr4  pcr5  pcr6
pond 1  1       1     1     1     0     1     1
pond 1  2       1     1     0     0     1     1
pond 1  3       0     0     1     1     1     1
I want the output to be a new column with the proportion detected. The dataset above is only a small sample of the one I am using. I've tried:
data$detection.proportion <- rowMeans(subset(testdf, select = c(pcr1, pcr2, pcr3, pcr4, pcr5, pcr6)), na.rm = TRUE)
This works for this small dataset, but when I tried it on my larger one it gave incorrect proportions. What I'm looking for is a way to count all the 1s from pcr1 to pcr6 and divide them by the total number of 1s and 0s (which I know is 6, but I would like R to work this out in case it's not inputted).
I found a way to do it, in case anyone else needs it. I don't know if this is the most effective approach, but it worked for me.
data$detection.proportion <- length(subset(testdf, select = c(pcr1, pcr2, pcr3, pcr4, pcr5, pcr6)))
# calculates the number of pcr columns = 6
p.detection <- rowSums(data[,c(-1, -2, -3, -10)] == "1")
# counts the occurrences of 1 (which is detection) for each row
data$detection.proportion <- p.detection/data$detection.proportion
# divides the occurrences by the total number of pcrs to find the detected proportion
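A more compact alternative (my sketch, assuming the pcr columns all share the pcr name prefix and contain only 0/1 or NA): selecting the columns by name pattern makes the column count automatic, and rowMeans() does the counting and dividing in one step.
pcr_cols <- grep("^pcr", names(testdf), value = TRUE)  # finds pcr1..pcr6
testdf$detection.proportion <- rowMeans(testdf[pcr_cols], na.rm = TRUE)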

Complex data calculation for consecutive zeros at row level in R (lag v/s lead)

I have a complex calculation that needs to be done. It is basically at a row level, and I am not sure how to tackle it.
If you can help me with the approach or any functions, that would be really great.
I will break my problem into two sub-problems for simplicity.
Below is what my data looks like:
Group,Date,Month,Sales,lag7,lag6,lag5,lag4,lag3,lag2,lag1,lag0(reference),lead1,lead2,lead3,lead4,lead5,lead6,lead7
Group1,42005,1,2503,1,1,0,0,0,0,0,0,0,0,0,0,1,0,1
Group1,42036,2,3734,1,1,1,1,1,0,0,0,0,1,1,0,0,0,0
Group1,42064,3,6631,1,0,0,1,0,0,0,0,0,0,1,1,1,1,0
Group1,42095,4,8606,0,1,0,1,1,0,1,0,1,1,1,0,0,0,0
Group1,42125,5,1889,0,1,1,0,1,0,0,0,0,0,0,0,1,1,0
Group1,42156,6,4819,0,1,0,0,0,1,0,0,1,0,1,1,1,1,0
Group1,42186,7,5120,0,0,1,1,1,1,1,0,0,1,1,0,1,1,0
I have data for each Group at a monthly level.
I would like to capture the two things below.
1. The count of consecutive zeros for each row, moving outward from lag0(reference)
The relevant cases (highlighted in yellow in my original sheet) are the zeros contiguous with lag0(reference), extending out on each side until the first 1 is reached. I want to capture the count of those zeros at row level, along with the corresponding Sales value.
Below is the output I am looking for in part 1.
Output:
Month,Sales,Count
1,2503,9
2,3734,3
3,6631,5
4,8606,0
5,1889,6
6,4819,1
7,5120,1
2. Identify the consecutive rows (rows 1, 2 and 3, and similarly rows 5 and 6) where the zero runs around lag0(reference) overlap, and capture their Sales and Month values.
For example, for rows 1, 2 and 3, the overlap happens at least at lags 3, 2, 1 and leads 1, 2; this needs to be captured and tagged as case 1. Similarly, for rows 5 and 6, at least lag1 overlaps, so this needs to be captured and tagged as case 2, along with the Sales and Month values.
Row 7 does not overlap with the previous or following consecutive row, so it is not captured.
Below is the result I am looking for in part 2.
Month,Sales,Case
1,2503,1
2,3734,1
3,6631,1
5,1889,2
6,4819,2
I want to run this for multiple groups, so I will incorporate either dplyr or a loop to get the result. Currently, I am simply looking for the approach.
I am not sure how to solve this problem; it is the first time I am trying to capture things at row level in R. I am not looking for a full solution, simply a first step to counter this problem. I would appreciate any leads.
An option using rle for the first part of the calculation:
df$count <- apply(df[,-c(1:4)], 1, function(x){
  first <- rle(x[1:7])    # runs over the lags (lag7..lag1)
  second <- rle(x[9:15])  # runs over the leads (lead1..lead7)
  count <- 0
  # zeros immediately before lag0(reference)
  if(first$values[length(first$values)] == 0){
    count = first$lengths[length(first$values)]
  }
  # zeros immediately after lag0(reference)
  if(second$values[1] == 0){
    count = count + second$lengths[1]
  }
  count
})
df[,c("Month", "Sales", "count")]
# Month Sales count
# 1 1 2503 9
# 2 2 3734 3
# 3 3 6631 5
# 4 4 8606 0
# 5 5 1889 6
# 6 6 4819 1
# 7 7 5120 1
Data:
df <- read.table(text =
"Group,Date,Month,Sales,lag7,lag6,lag5,lag4,lag3,lag2,lag1,lag0(reference),lead1,lead2,lead3,lead4,lead5,lead6,lead7
Group1,42005,1,2503,1,1,0,0,0,0,0,0,0,0,0,0,1,0,1
Group1,42036,2,3734,1,1,1,1,1,0,0,0,0,1,1,0,0,0,0
Group1,42064,3,6631,1,0,0,1,0,0,0,0,0,0,1,1,1,1,0
Group1,42095,4,8606,0,1,0,1,1,0,1,0,1,1,1,0,0,0,0
Group1,42125,5,1889,0,1,1,0,1,0,0,0,0,0,0,0,1,1,0
Group1,42156,6,4819,0,1,0,0,0,1,0,0,1,0,1,1,1,1,0
Group1,42186,7,5120,0,0,1,1,1,1,1,0,0,1,1,0,1,1,0",
header = TRUE, stringsAsFactors = FALSE, sep = ",")
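The answer above only covers part 1. For part 2, here is one possible sketch (my own approach, not from the thread), assuming the lags/leads are monthly so each row's zero run maps to the month interval [Month - nlag, Month + nlead]; consecutive rows are then tagged as a case when these intervals overlap:
runs <- t(apply(df[,-c(1:4)], 1, function(x){
  first <- rle(x[1:7])    # lags, as in the part 1 answer
  second <- rle(x[9:15])  # leads
  nlag <- 0
  nlead <- 0
  if(first$values[length(first$values)] == 0) nlag <- first$lengths[length(first$lengths)]
  if(second$values[1] == 0) nlead <- second$lengths[1]
  c(nlag = nlag, nlead = nlead)
}))

start <- df$Month - runs[, "nlag"]
end   <- df$Month + runs[, "nlead"]
valid <- rowSums(runs) > 0   # rows without a zero run cannot overlap anything

n <- nrow(df)
ov_next <- c(valid[-n] & valid[-1] & end[-n] >= start[-1], FALSE)  # overlaps next row?
in_case <- ov_next | c(FALSE, ov_next[-n])    # overlaps either neighbour
new_case <- in_case & !c(FALSE, in_case[-n])  # first row of each stretch
df$Case <- ifelse(in_case, cumsum(new_case), NA)

df[in_case, c("Month", "Sales", "Case")]
#   Month Sales Case
# 1     1  2503    1
# 2     2  3734    1
# 3     3  6631    1
# 5     5  1889    2
# 6     6  4819    2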

Combining data using R (or maybe Excel) -- looping to match stimuli

I have two sets of data corresponding to different experimental tasks that I want to merge for analysis. The problem is that I need to search and match up certain rows for particular stimuli and particular participants. I'd like to use a script to save some trouble. This is probably quite simple, but I've never done it before.
Here's my problem more specifically:
In the first dataset, each row corresponds to a two-alternative forced-choice task where two stimuli are presented at a time and the participant selects one. In the second dataset, each row corresponds to a single-item task where the participants are asked if they have ever seen the stimulus before. The stimuli in the second task match the stimuli in the pairs of the first task (twice as many rows). I want to match them up and add two columns to the first dataset: one that states whether the left-side item was recognized later, and one for the right-side stimulus.
I assume this could be done with nested loops, but I'm not sure if there is an elegant way to do this, or perhaps a package.
As I understand it, your first dataset looks something like this:
(dat1 <- data.frame(person=1:2, stim1=1:2, stim2=3:4))
# person stim1 stim2
# 1 1 1 3
# 2 2 2 4
This would mean person 1 got stimuli 1 and 3 and person 2 got stimuli 2 and 4. Then your second dataset looks something like this:
(dat2 <- data.frame(person=c(1, 1, 2, 2), stim=c(1, 3, 4, 2), responded=c(0, 1, 0, 1)))
# person stim responded
# 1 1 1 0
# 2 1 3 1
# 3 2 4 0
# 4 2 2 1
This gives information about how each person responded to each stimulus they were given.
You can merge these two by matching person/stimulus pairs with the match function:
dat1$response1 <- dat2$responded[match(paste(dat1$person, dat1$stim1), paste(dat2$person, dat2$stim))]
dat1$response2 <- dat2$responded[match(paste(dat1$person, dat1$stim2), paste(dat2$person, dat2$stim))]
dat1
# person stim1 stim2 response1 response2
# 1 1 1 3 0 1
# 2 2 2 4 1 0
Another option (starting from the original dat1 and dat2) would be to merge twice with the merge function. You have a little less control over the names of the output columns, but it requires a bit less typing:
merged <- merge(dat1, dat2, by.x=c("person", "stim1"), by.y=c("person", "stim"))
merged <- merge(merged, dat2, by.x=c("person", "stim2"), by.y=c("person", "stim"))
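For reference (my addition): after the second merge the two clashing response columns get merge()'s default suffixes, so they can be renamed to line up with the first approach:
# merge() auto-suffixes the clashing 'responded' columns as .x and .y
names(merged)[names(merged) == "responded.x"] <- "response1"
names(merged)[names(merged) == "responded.y"] <- "response2"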
