I hope this finds you well. I was hoping to get some help analyzing some code where I identify a series of trials based on the start trigger (but ignoring the immediate triggers that follow). In the example below I would like to find the first 1 in a series of 1's and take the average across the next three numbers in Value_1 and Value_2. It should then find the next start period (the 8th value with the next set of 1's) and again take the average for the following 3 values, and so on. Thank you for your help and I am happy to answer any questions.
df <- data.frame(Value_1 = c(1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10), Value_2 = c(10,2,3,4,5,6,7,8,10,10,1,2,3,4,5,6,7,8,9,10), Triggers = c(0,1,1,1,0,0,0,1,1,1,0,0,0,0,0,1,1,1,0,0))
In the updated_df example below, I would like the code to work through possible interruptions in the Trigger value (e.g., a 0 within a run of 1's): it should find the first 1 in a group of 1's and possible zeros and take the average across the next four numbers in Value_1 and Value_2. It should then find the next start period (the 9th value, which begins the next set of 1's and 0's) and again take the average for the following 4 values, and so on. Thank you for your help and I am happy to answer any questions.
updated_df <- data.frame(
Value_1 = c(1,2,3,3,4,5,6,7,8,9,9,10,1,2,3,4,5,6,7,8,9,9,10),
Value_2 = c(10,2,3,3,4,5,6,7,8,10,10,10,1,2,3,4,5,6,6,7,8,9,10),
Triggers = c(0,1,1,0,1,0,0,0,1,1,0,1,0,0,0,0,0,1,0,1,1,0,0)
)
Here's a base R solution that handles the updated question ("interruptions in trigger values"). It includes a lag function based on this SO answer.
updated_df <- data.frame(
Value_1 = c(1,2,3,3,4,5,6,7,8,9,9,10,1,2,3,4,5,6,7,8,9,9,10),
Value_2 = c(10,2,3,3,4,5,6,7,8,10,10,10,1,2,3,4,5,6,6,7,8,9,10),
Triggers = c(0,1,1,0,1,0,0,0,1,1,0,1,0,0,0,0,0,1,0,1,1,0,0)
)
# lag function, based on @Andrew's answer at
# https://stackoverflow.com/a/13128713/17303805
lag_fx <- function(x, by = 1L, default = NA) {
if (by < 0 || !isTRUE(all.equal(by, round(by)))) {
stop("`by` should be a whole number >= 0")
}
c(rep(default, by), x)[1:length(x)]
}
# number of trials per set
set_k <- 4
### to find index of each start trigger:
# (1) make matrix to "look back" at previous k - 1 trials
lagged <- sapply(
1:(set_k - 1),
\(x) lag_fx(updated_df$Triggers, by = x, default = 0)
)
# (2) then find rows where trigger == 1, but no 1s in previous k - 1 trials
starts <- which(updated_df$Triggers == 1 & rowSums(lagged) == 0)
# indices of each trigger and following k - 1 rows
sets <- lapply(starts, \(x) x + 0:(set_k - 1))
# means of each set of trials
Value_1 <- sapply(sets, \(x) mean(updated_df$Value_1[x]))
Value_2 <- sapply(sets, \(x) mean(updated_df$Value_2[x]))
# back to a data.frame
data.frame(Value_1, Value_2)
# Value_1 Value_2
# 1 3.0 3.00
# 2 9.0 9.50
# 3 7.5 6.75
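The same steps can be wrapped in a small helper so they can be rerun with a different set length, for instance on the original df with sets of three. This is only a sketch: the name trigger_means and its arguments are made up here for illustration, it counts the trigger row itself as the first trial of each set (as the code above does), and it assumes k is at least 2 and that every set fits within the data.

```r
# Hypothetical helper (not part of the original answer): mean of each value
# column over the k rows that begin at each start trigger. Assumes k >= 2.
trigger_means <- function(dat, k, trigger = "Triggers") {
  trg <- dat[[trigger]]
  n <- length(trg)
  # column j holds the trigger vector shifted down by j rows, padded with 0
  lagged <- sapply(seq_len(k - 1), function(by) c(rep(0, by), trg)[seq_len(n)])
  # a start is a 1 with no 1s in the previous k - 1 rows
  starts <- which(trg == 1 & rowSums(lagged) == 0)
  value_cols <- setdiff(names(dat), trigger)
  as.data.frame(lapply(dat[value_cols], function(v) {
    vapply(starts, function(s) mean(v[s + 0:(k - 1)]), numeric(1))
  }))
}

# original example from the question, with sets of three
df <- data.frame(
  Value_1 = c(1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10),
  Value_2 = c(10,2,3,4,5,6,7,8,10,10,1,2,3,4,5,6,7,8,9,10),
  Triggers = c(0,1,1,1,0,0,0,1,1,1,0,0,0,0,0,1,1,1,0,0)
)
res <- trigger_means(df, k = 3)
res
```

Calling trigger_means(updated_df, k = 4) should reproduce the table shown above.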
I have a column of timestamps (Time). I consider one duration to be a run of consecutive timestamps spaced 1 second apart:
data <- data.frame(Time = c("2021-12-01 01:01:01","2021-12-01 01:01:02","2021-12-01 01:01:03","2021-12-01 01:01:05","2021-12-01 01:01:06"))
I would like to generate an Id for each duration like this:
data <- data.frame(Time = c("2021-12-01 01:01:01","2021-12-01 01:01:02","2021-12-01 01:01:03","2021-12-01 01:01:05","2021-12-01 01:01:06"),Id = c(1,1,1,2,2))
With dplyr...
Thank you
Up front:
cumsum(c(TRUE, as.numeric(diff(as.POSIXct(data$Time)), units = "secs") > 1L))
# [1] 1 1 1 2 2
First, you should really be working with real timestamps and not strings. If you're doing anything else with your Time field, it is almost certainly going to be a number-like operation, so you should do this up-front with
data$Time <- as.POSIXct(data$Time)
This works easily here because they are well-formed along the default format of "%Y-%m-%d %H:%M:%S"; see ?strptime for the %-codes.
From here, you want to keep track of when a difference in time is more than 1 second. The differencing is easy enough with:
as.numeric(diff(data$Time), units = "secs")
# [1] 1 1 2 1
Really, the key operator is diff, but it can report minutes or hours or such if the data is widely-enough spaced; there's an internal heuristic for that. Wrapping it in as.numeric(., units="secs") forces it to always be in seconds.
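To see that heuristic in action, here is a small sketch with two made-up timestamps five minutes apart (they are not from the question's data). The time zone is fixed to UTC only to keep the example reproducible.

```r
# Two timestamps five minutes apart (illustrative values only)
tm <- as.POSIXct(c("2021-12-01 01:01:01", "2021-12-01 01:06:01"), tz = "UTC")
d <- diff(tm)

# diff() picked the unit by itself; for this spacing it is minutes
auto_unit <- units(d)

# forcing seconds gives a number that is comparable across any spacing
secs <- as.numeric(d, units = "secs")
```

Comparing the raw difference against 1 would silently mean "1 minute" here, which is why the conversion to seconds matters.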
From here, we need a cumulative sum of when it is above 1, ergo > 1L, so cumsum(. > 1L).
Note that the input has length 5 but the differences have length 4; this makes sense because each difference is taken between two adjacent elements. We therefore prepend TRUE so that the first row always starts the first group. If you have already converted Time to POSIXt class, the original code reduces slightly to
cumsum(c(TRUE, as.numeric(diff(data$Time), units = "secs") > 1L))
and therefore to store it as Id,
data$Id <- cumsum(c(TRUE, as.numeric(diff(data$Time), units = "secs") > 1L))
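If the allowed gap might change, the one-liner could be wrapped in a function. The sketch below is an assumption-laden convenience, not part of the answer: the name gap_id and the max_gap argument are invented here, and strings are parsed as UTC.

```r
# Hypothetical helper: start a new Id whenever the gap to the previous
# timestamp is larger than max_gap seconds.
gap_id <- function(time, max_gap = 1) {
  time <- as.POSIXct(time, tz = "UTC")  # no-op if already POSIXct
  cumsum(c(TRUE, as.numeric(diff(time), units = "secs") > max_gap))
}

data <- data.frame(Time = c("2021-12-01 01:01:01", "2021-12-01 01:01:02",
                            "2021-12-01 01:01:03", "2021-12-01 01:01:05",
                            "2021-12-01 01:01:06"))
data$Id <- gap_id(data$Time)

# with a tolerance of 2 seconds, all five rows fall into a single duration
loose_id <- gap_id(data$Time, max_gap = 2)
```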
For example, I have the dataset below, where 1 = yes and 0 = no, and I need to figure out how many calls were made by landline that lasted under 10 minutes.
Image of example dataset
You can also specifically define the values you're looking for in each column when you're finding the sum. (This will help if you need to count rows with values other than 1 in a column.)
sum(df$landline == 1 & df$`under 10 minutes` == 1)
We can use sum
sum(df1[, "under 10 minutes"])
If two columns are needed
colSums(df1[, c("landline", "under 10 minutes")])
If we are checking both columns, use rowSums
sum(rowSums(df1[, c("landline", "under 10 minutes")], na.rm = TRUE) == 2)
The grep function finds the rows where landline=1. We then only call those rows and sum the under 10 min column.
sum( df[ grep(1,df[,1]) ,4] )
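One caveat with the grep approach, sketched with a made-up vector: grep matches the text "1" anywhere in a value, so it is only equivalent to an exact comparison when the column holds nothing but 0 and 1. The vector x below is hypothetical and deliberately contains other values.

```r
x <- c(1, 0, 10, 11)        # made-up column with values beyond 0/1

by_pattern <- grep(1, x)    # values are matched as text, so 10 and 11 also hit
by_value   <- which(x == 1) # exact numeric comparison
```

For a strictly 0/1 column both give the same rows; which(x == 1) is simply the safer habit.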
R will conveniently treat 1 and 0 as if they mean TRUE and FALSE, so we can apply logical Boolean operations like AND (&) and OR (|) on them.
df <- data.frame(x = c(1, 0, 1, 0),
y = c(0, 0, 1, 1))
> sum(df$x & df$y)
[1] 1
> sum(df$x | df$y)
[1] 3
For future questions, you should look up how to use functions like dput or other ways to give an example data set instead of using an image.
This is my first project in R, after just having learned Java.
I have a (large) data set that I have imported from a csv file into a data frame.
I have identified the two relevant columns for this question: the first has the name of the patient, and the second asks the patient about the level of swelling.
The level of swelling is relative, i.e. better, worse or about the same.
Not all patients have the same number of observations.
I am having difficulty converting these relative values into numerical values that can be used as part of a greater analysis.
Below is pseudocode for what I think could be an appropriate solution:
for row in 'patientname'
    patientcounter = dtfr1[row, 'patientname'];
    if dtfr1[row, 'patientname'] == patientcounter
        if dtfr1[row, 'Does.you.swelling.seem.better.or.worse'] == 'better'
            conditioncounter--;
            dtfr1[row, 'Does.you.swelling.seem.better.or.worse'] = conditioncounter;
        elseif dtfr1[row, 'Does.you.swelling.seem.better.or.worse'] == 'worse'
            conditioncounter++;
            dtfr1[row, 'Does.you.swelling.seem.better.or.worse'] = conditioncounter;
        else
            dtfr1[row, 'Does.you.swelling.seem.better.or.worse'] = conditioncounter;
    if dtfr1[row, 'patientname'] != patientcounter
        patientcounter = dtfr1[row, 'patientname'];
What would your advice be for a good solution to this problem? Thanks!
If I'm understanding correctly, you want the difference in the counts of worse and better, by patient? If so, something like this would work.
# Simulated data
dtfr1 <- data.frame(patient = sample(letters[1:3], 100, replace=TRUE),
condition = sample(c("better", "worse"), 100, replace=TRUE))
head(dtfr1)
# patient condition
# 1 a worse
# 2 b better
# 3 b worse
# 4 a better
# 5 c worse
# 6 a better
better_count <- tapply(dtfr1$condition, dtfr1$patient, function(x) sum(x == "better"))
worse_count <- tapply(dtfr1$condition, dtfr1$patient, function(x) sum(x == "worse"))
worse_count - better_count
# a b c
# 5 0 -1
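If what is wanted is closer to the pseudocode in the question, that is, a running value on every row rather than one total per patient, a grouped cumulative sum is one option. The sketch below makes several assumptions: the column names follow the simulated data above, the third answer is spelled "about the same", and "better" lowers the counter while "worse" raises it, as in the pseudocode. The six rows are invented for illustration.

```r
# Small made-up data set, in observation order
dtfr1 <- data.frame(
  patient   = c("a", "a", "b", "a", "b", "b"),
  condition = c("worse", "worse", "better", "better", "about the same", "worse")
)

# map each answer to a step: better = -1, worse = +1, no change = 0
step_by_label <- c("better" = -1, "worse" = 1, "about the same" = 0)
step <- unname(step_by_label[as.character(dtfr1$condition)])

# running total within each patient, kept in the original row order
dtfr1$score <- ave(step, dtfr1$patient, FUN = cumsum)
dtfr1
```

The last score of each patient equals the worse-minus-better difference computed with tapply above.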
I'm looking for a nice way to count the longest number of consecutive reductions in a row in a data.table (package version 1.9.2) in R. I am horribly lost and any help is much appreciated. For the example I am trying to do, a reduction is where a value is less than or equal to the previous value (<=).
Below is a toy sample of the data I am dealing with. I have also put down my best attempt so far, which to be honest went horribly wrong and returned an error. My attempt also uses 2 for loops, which I'm not hugely keen on since I have been advised that apply functions are more often used in R. I have tried searching this site and the web for a similar solution but haven't had any luck. The number of rows I actually have in my full data table is just over 1 million and the number of columns I have is 17.
library(data.table)
TEST_DF <- data.table(COL_1 = c(5,2,3,1), COL_2 = c(1,0,4,2),
COL_3 = c(0,1,6,3), COL_4 = c(0,0,0,4))
TEST_DF$COUNT <- as.numeric(0)
for( i in 1:NROW(TEST_DF))
{
for (j in 1:(NCOL(TEST_DF) - 1))
{
TEST_DF$COUNT[j] <- if (TEST_DF[i, j, with = FALSE] >=
TEST_DF[i, j + 1, with = FALSE])
{
TEST_DF$COUNT[j] + 2
}
}
}
DESIRED <- data.table(COL_1 = c(5,2,3,1), COL_2 = c(1,0,4,2),
COL_3 = c(0,1,6,3), COL_4 = c(0,0,0,4),
COUNT = c(4,2,1,0))
The desired output appears at the bottom of the code. As all four "COL" columns appear in the longest reduction sequence, the COUNT column for the first row would get a value of 4. In the second row, there is a reduction in the first 2 columns and the last two but none in between, so the COUNT would get a value of 2 for this. In the third row, there is a reduction from COL_3 to COL_4 so COUNT would get a value of 2 for this row. In any row where there is no reduction, such as the last, there would be a value of 0 for the COUNT.
Let me know if any further clarification or information is needed.
Thank you so much in advance.
You can use the functions diff() and rle() to build a function to extract the run lengths. Then use apply() across the rows of your data:
foo <- function(x) {
runs <- rle(c(x[2] <= x[1], diff(x) <= 0))
if(all(runs$values == 0)) 0 else max(runs$lengths[runs$values == 1])
}
apply(TEST_DF, 1, foo)
[1] 4 2 1 0
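For comparison, here is a sketch that counts the longest run of reductions themselves, without repeating the first comparison as the function above does. It is an alternative reading of the question, not the asker's stated desired output, and the name longest_reduction_run is made up. A plain matrix with the same values stands in for TEST_DF so the example needs no packages.

```r
# Hypothetical variant: length of the longest run of consecutive
# "value <= previous value" comparisons within one row
longest_reduction_run <- function(x) {
  runs <- rle(diff(x) <= 0)
  if (any(runs$values)) max(runs$lengths[runs$values]) else 0
}

# same values as TEST_DF, as a plain matrix
m <- cbind(COL_1 = c(5, 2, 3, 1), COL_2 = c(1, 0, 4, 2),
           COL_3 = c(0, 1, 6, 3), COL_4 = c(0, 0, 0, 4))

run_counts <- apply(m, 1, longest_reduction_run)
run_counts
```

With four columns there are only three comparisons per row, so this version can never return 4; which convention is right depends on whether columns or comparisons are being counted.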
I used apply with one for loop to accomplish what you're looking for. The apply acts on each row, and the for loop compares consecutive columns.
COUNT <- rep(0,4)
for (i in 1:(ncol(TEST_DF)-1)) {
COUNT<-COUNT+apply(TEST_DF,1,function(x) ifelse(x[i]>=x[i+1],1,0))
}
This produces: 3, 2, 1, 0, as there are 3 reductions in the first row. The last column has nothing to compare to, so there can only be three comparisons. I'm not sure why you want it to be 4?
If you want count to be part of your original table:
TEST_DF$COUNT<-COUNT
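The same total can probably be computed without a loop by comparing the matrix of all-but-the-last columns against the matrix of all-but-the-first columns. This is a sketch on a plain matrix holding the TEST_DF values; with the data.table itself, the value columns would first need converting with as.matrix(), which is an assumption about how you would plug it in.

```r
# same values as TEST_DF, as a plain matrix
m <- cbind(COL_1 = c(5, 2, 3, 1), COL_2 = c(1, 0, 4, 2),
           COL_3 = c(0, 1, 6, 3), COL_4 = c(0, 0, 0, 4))
k <- ncol(m)

# element-wise: is each column >= the column to its right?
is_reduction <- m[, -k, drop = FALSE] >= m[, -1, drop = FALSE]

# number of reductions per row
COUNT_vec <- rowSums(is_reduction)
COUNT_vec
```

On a million rows this avoids calling an R function once per row, which is usually where the time goes.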