Subsetting data in R based on a test

I would like to subset a dataframe based on a test performed. For instance, I ran the test
CheckUnsystematic(dat = long, deltaq = 0.025, bounce = 0.1, reversals = 0, ncons0 = 2)
It gave me this:
> CheckUnsystematic(dat = long, deltaq = 0.025, bounce = 0.1, reversals = 0, ncons0 = 2)
id TotalPass DeltaQ DeltaQPass Bounce BouncePass Reversals ReversalsPass NumPosValues
1 2 3 0.9089 Pass 0.0000 Pass 0 Pass 15
2 3 3 0.6977 Pass 0.0000 Pass 0 Pass 16
3 4 2 0.0000 Fail 0.0000 Pass 0 Pass 18
4 5 3 0.2107 Pass 0.0000 Pass 0 Pass 18
5 6 3 0.2346 Pass 0.0000 Pass 0 Pass 18
6 7 3 0.9089 Pass 0.0000 Pass 0 Pass 16
7 8 3 0.9622 Pass 0.0000 Pass 0 Pass 15
8 9 3 0.8620 Pass 0.0000 Pass 0 Pass 11
9 10 3 0.9089 Pass 0.0000 Pass 0 Pass 12
10 11 3 0.9089 Pass 0.0000 Pass 0 Pass 11
I want to keep only the observations that have a "3" in "TotalPass".
I tried this, but it did not work:
CleanAPT <- long[ which(long$TotalPass==3),]

Since you tagged this as a dplyr question, let's use it:
library(dplyr)
check_df <- CheckUnsystematic(dat = long, deltaq = 0.025,
bounce = 0.1, reversals = 0, ncons0 = 2)
CleanAPT <- check_df %>%
filter(TotalPass == 3)
The reason CleanAPT <- long[which(long$TotalPass == 3), ] is not working is that you are subsetting the long data frame, which is left unmodified by the CheckUnsystematic function. In the code above, I save the function's results to check_df, so the base R equivalent CleanAPT <- check_df[which(check_df$TotalPass == 3), ] should also work.
Merging back with the original data (difficult to say exactly how without knowing the column names of long, so I assume id is present and unique) can be done with a semi_join from dplyr:
long_filtered <- long %>%
  mutate(id = as.character(id)) %>%
  semi_join(CleanAPT %>%
              mutate(id = as.character(id)),
            by = "id")

Try this with your long dataset.
CleanAPT <- subset(long, TotalPass == 3)
CheckUnsystematic(dat = CleanAPT, deltaq = 0.025, bounce = 0.1, reversals = 0, ncons0 = 2)

Related

Cumulative conditional product with reset

I have a large xts object. However the example is in a data.frame two column subset of the data. I would like to calculate (in a new column) the cumulative product of the first column df$rt whenever the second column df$dd is less than 0. Whenever df$dd is 0 I want to reset the cumulating to 0 again. So for the next instance that df$dd is less than 0 the cumulative product starts again for df$rt.
The following example dataframe adds the desired outcome as column three df$crt, for reference. Note that some rounding has been applied.
df <- data.frame(
rt = c(0, 0.0171, 0.0796, 0.003, 0.0754, -0.0314, 0.0275, -0.0323, 0.0364, 0.0473, -0.0021),
dd = c(0, -0.0657, -0.0013, 0, -0.018, -0.0012, 0, 0, 0, -0.0016, -0.0856),
crt = c(0, 0.171, 0.0981, 0, 0.0754, 0.0415, 0, 0, 0, 0.473, 0.045)
)
I have tried various combinations of with, ifelse and cumprod like:
df$crt <- with(df, ifelse(df$dd<0, cumprod(1+df$rt)-1, 0))
However, this does not reset the cumulative product after a 0 in df$dd; it only writes a 0 and then continues the previous cumulation of df$rt when df$dd is below zero again.
I think I am missing a counter of some sort to initiate the reset. Note that the dataframe I'm working with to implement this is large.
Create a grouping column by taking the cumulative sum of the logical vector (dd == 0), so that the group index increments by 1 at each position where dd is 0. Then use replace, with that same condition, to compute the cumulative product of 'rt' only at positions where 'dd' is not equal to 0:
library(dplyr)
df %>%
  group_by(grp = cumsum(dd == 0)) %>%
  mutate(crt1 = replace(dd, dd != 0, cumprod(1 + rt[dd != 0]) - 1)) %>%
  ungroup %>%
  select(-grp)
Output:
# A tibble: 11 x 4
rt dd crt crt1
<dbl> <dbl> <dbl> <dbl>
1 0 0 0 0
2 0.0171 -0.0657 0.171 0.0171
3 0.0796 -0.0013 0.0981 0.0981
4 0.003 0 0 0
5 0.0754 -0.018 0.0754 0.0754
6 -0.0314 -0.0012 0.0415 0.0416
7 0.0275 0 0 0
8 -0.0323 0 0 0
9 0.0364 0 0 0
10 0.0473 -0.0016 0.473 0.0473
11 -0.0021 -0.0856 0.045 0.0451
Or using base R
with(df, ave(rt * (dd != 0), cumsum(dd == 0), FUN = function(x)
replace(x, x != 0, (cumprod(1 + x[x != 0]) - 1))))
Output:
[1] 0.00000000 0.01710000 0.09806116 0.00000000 0.07540000 0.04163244 0.00000000 0.00000000 0.00000000 0.04730000 0.04510067
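If you want to keep the base R result as a column of df, you can assign it back (a minimal sketch reusing the same ave() call as above):
df$crt1 <- with(df, ave(rt * (dd != 0), cumsum(dd == 0), FUN = function(x)
  replace(x, x != 0, cumprod(1 + x[x != 0]) - 1)))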

Find edges of intervals in dataframe column and use them for geom_rect xmin-xmax in ggplot

I have a data frame consisting of two columns:
positionx <- c(1:10)
pvalue <- c(0.1, 0.04, 0.03, 0.02, 0.001, 0.2, 0.5, 0.6, 0.001, 0.002)
df <- data.frame(cbind(positionx, pvalue))
df
positionx pvalue
1 1 0.100
2 2 0.040
3 3 0.030
4 4 0.020
5 5 0.001
6 6 0.200
7 7 0.500
8 8 0.600
9 9 0.001
10 10 0.002
I would like to find in which intervals of positionx values my pvalue is below a certain threshold, let's say 0.05.
Using which I can find the indices of the rows, and from those I could go back to the values of positionx.
which(df[,2]<0.05)
[1] 2 3 4 5 9 10
However, what I would like are the edges of the intervals; by that I mean a result like: 2-5, 9-10.
I also tried to use the findInterval function as below
int <- c(-10, 0.05, 10)
separation <- findInterval(pvalue,int)
separation
[1] 2 1 1 1 1 2 2 2 1 1
df_sep <- data.frame(cbind(df, separation))
df_sep
positionx pvalue separation
1 1 0.100 2
2 2 0.040 1
3 3 0.030 1
4 4 0.020 1
5 5 0.001 1
6 6 0.200 2
7 7 0.500 2
8 8 0.600 2
9 9 0.001 1
10 10 0.002 1
However I am stuck again with a column of numbers, while I want the edges of the intervals that contain 1 in the separation column.
Is there a way to do that?
This is a simplified example; in reality I have many plots, and for each plot one data frame of this type (just much longer and with p-values not as easy to judge at a glance).
The reason why I think I need the information of the edges of my intervals, is that I would like to colour the background of my ggplot according to the pvalue. I know I can use geom_rect for it, but I think I need the edges of the intervals in order to build the coloured rectangles.
Is there a way to do this in an automated way instead of manually?
This seems like a great use case for run length encoding.
Example as below:
library(ggplot2)
# Data from question
positionx <- c(1:10)
pvalue <- c(0.1, 0.04, 0.03, 0.02, 0.001, 0.2, 0.5, 0.6, 0.001, 0.002)
df <- data.frame(cbind(positionx, pvalue))
# Sort data (just to be sure)
df <- df[order(df$positionx),]
# Do run length encoding magic
threshold <- 0.05
rle <- rle(df$pvalue < threshold)
starts <- {ends <- cumsum(rle$lengths)} - rle$lengths + 1
df2 <- data.frame(
  xmin = df$positionx[starts],
  xmax = df$positionx[ends],
  type = rle$values
)
# Filter on type
df2 <- df2[df2$type == TRUE, ] # satisfies the threshold criterion
ggplot(df2, aes(xmin = xmin, xmax = xmax, ymin = 0, ymax = 1)) +
  geom_rect()
Created on 2020-05-22 by the reprex package (v0.3.0)
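To use these rectangles as a shaded background behind the p-values themselves, the geom_rect layer can go underneath the data layer (a sketch assuming the df and df2 built above; the fill colour, alpha, and dashed threshold line are arbitrary choices):
ggplot() +
  geom_rect(data = df2,
            aes(xmin = xmin, xmax = xmax, ymin = -Inf, ymax = Inf),
            fill = "steelblue", alpha = 0.3) +
  geom_line(data = df, aes(x = positionx, y = pvalue)) +
  geom_hline(yintercept = 0.05, linetype = "dashed")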

R: Moving sum of one column in a data frame based on values in other column

I want to calculate a moving sum of one column (populated with ones and zeroes), but only when the value in a corresponding column (time) is within a (moving) range of values.
My data looks like this:
values <- c(1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0)
seconds <- c(0.0, 1.0, 2.5, 3.0, 5.5, 6.0, 6.5, 7.0, 8.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.5, 16.0, 17.0, 18.0, 19.0, 20.0)
data <- data.frame(values, seconds)
Say I want to sum every 5 seconds worth of data in the 'values' column.
Then my first 5-second sum (seconds >= 0 & seconds <= 5) would be:
1 + 0 + 0 + 0 = 1
(the 'values' entries at seconds 0.0, 1.0, 2.5 and 3.0). It stops here because the next value (a 1) corresponds to 5.5 seconds, outside of the interval.
The next 5-second interval (seconds >= 1 & seconds <= 6) would equal:
0 + 0 + 0 + 1 + 1 = 2
3rd interval:
(seconds >= 2.5 & seconds <= 7.5) = 0 + 0 + 1 + 1 + 0 + 1 = 3
and so on.
I'm an R noob, so this is the method I'm using to calculate it (and it is super slow, so I know there must be a better way):
movsum <- numeric(nrow(data))
for (i in 1:nrow(data)) { movsum[i] <- sum(subset(data, seconds >= (seconds[i] - 5.0) & seconds <= seconds[i])$values) }
Thanks for your help. Let me know if there's anything I should clarify.
Here's a possible data.table::foverlaps solution. The idea here is to create a 5-second interval lookup table and then look up, within data, which values fall in each interval.
Choose an interval
int <- 5 ## 5 seconds
Then load the package, add an additional (identical) column to data in order to set the boundaries, create a new data set that holds the desired boundaries per row, run foverlaps, key data to enable the binary join, find the corresponding values in data$values, and sum them per interval. Something like the following seems to work:
library(data.table)
setkey(setDT(data)[, seconds2 := seconds], seconds, seconds2)
lookup <- data[, .(seconds, seconds2 = seconds + int)]
res <- foverlaps(lookup, data, which = TRUE)[, values := data$values[yid]]
res[, .(SumValues = sum(values)), by = .(SecInterval = xid)]
# SecInterval SumValues
# 1: 1 1
# 2: 2 2
# 3: 3 3
# 4: 4 3
# 5: 5 3
# 6: 6 2
# 7: 7 1
# 8: 8 2
# 9: 9 1
# 10: 10 2
# 11: 11 3
# 12: 12 3
# 13: 13 2
# 14: 14 2
# 15: 15 1
# 16: 16 0
# 17: 17 0
# 18: 18 0
# 19: 19 0
# 20: 20 0
You may try some functions from the zoo package:
library(zoo)
# convert your data to a zoo time series
z <- read.zoo(data, index = "seconds")
# create an empty, regular time series,
# which contains the full time range, in steps of 0.5 sec
z0 <- zoo(, seq(from = start(z), to = end(z), by = 0.5))
# 'expand' the irregular, original data to a regular series, by merging it with z0
z2 <- merge(z, z0)
# apply the desired function (sum) to a rolling window of width 11
# (number of observations in each window)
# move the time frame in steps of 2 (by = 2) which correspond to 1 sec
# use partial = TRUE, to allow the window to pass outside the time range
rollapply(z2, width = 11, by = 2, FUN = sum, na.rm = TRUE,
align = "left", partial = TRUE)
# 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
# 1 2 3 3 3 3 2 2 1 2 2 3 3 2 2 1 0 0 0 0 0
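For comparison, the loop from the question can also be written as a single sapply over the time points; it is still quadratic, but it avoids the repeated subset() calls (a minimal base R sketch assuming data as defined in the question, using the same trailing window seconds[i] - 5 to seconds[i] as the original loop):
movsum <- sapply(data$seconds, function(s)
  sum(data$values[data$seconds >= s - 5 & data$seconds <= s]))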

R: Find the Variance of all Non-Zero Elements in Each Row

I have a dataframe d like this:
ID Value1 Value2 Value3
1 20 25 0
2 2 0 0
3 15 32 16
4 0 0 0
What I would like to do is calculate the variance for each person (ID), based only on non-zero values, and to return NA where this is not possible.
So for instance, in this example the variance for ID 1 would be var(20, 25),
for ID 2 it would return NA because you can't calculate a variance on just one entry, for ID 3 the var would be var(15, 32, 16), and for ID 4 it would again return NA because it has no non-zero values at all to calculate a variance on.
How would I go about this? I currently have something like the following, but this might not be the best way to go about it:
len <- nrow(d)
variances <- numeric(len)
for (i in 1:len) {
  # get all non-zero values in the ith row (excluding the ID column) into a vector
  rowvals <- as.numeric(d[i, -1])
  nonzerodat <- rowvals[rowvals != 0]
  currentvar <- var(nonzerodat)
  variances[i] <- currentvar
}
Note this is a toy example, but the dataset I'm actually working with has over 40 different columns of values to calculate variance on, so something that easily scales would be great.
Data <- data.frame(ID = 1:4, Value1=c(20,2,15,0), Value2=c(25,0,32,0), Value3=c(0,0,16,0))
var_nonzero <- function(x) var(x[!x == 0])
apply(Data[, -1], 1, var_nonzero)
[1] 12.5 NA 91.0 NA
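If you want the variances attached to the original data frame, you can assign the apply() result back as a new column (a small follow-up sketch using the same Data and var_nonzero as above):
Data$variance <- apply(Data[, -1], 1, var_nonzero)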
This seems overwrought, but it works, and it gives you back an object with the ids attached to the statistics:
library(reshape2)
library(dplyr)
variances <- df %>%
  melt(., id.var = "id") %>%
  group_by(id) %>%
  summarise(variance = var(value[value != 0]))
Here's the toy data I used to test it:
df <- data.frame(id = seq(4), X1 = c(3, 0, 1, 7), X2 = c(10, 5, 0, 0), X3 = c(4, 6, 0, 0))
> df
id X1 X2 X3
1 1 3 10 4
2 2 0 5 6
3 3 1 0 0
4 4 7 0 0
And here's the result:
id variance
1 1 14.33333
2 2 0.50000
3 3 NA
4 4 NA

R: Adding one new row for each subject

I would like to add one new row for each of the subjects in my dataframe, which looks something like this:
Subject = c("1","5","10")
time = c("2", "2.25", "2.5")
value = c("3", "17", "9")
DF <- data.frame(Subject, time, value)
Subject time value
1 1 2 3
2 5 2.25 17
3 10 2.5 9
I want to add a new row for each subject with a time = 0 and value = 0, giving this:
Subject = c("1","1","5","5","10","10")
time = c("0","2","0", "2.25","0", "2.5")
value = c("0","3","0", "17","0", "9")
DF2 <- data.frame(Subject, time, value)
Subject time value
1 1 0 0
2 1 2 3
3 5 0 0
4 5 2.25 17
5 10 0 0
6 10 2.5 9
I have a lot of subjects with a lot of gaps in their subject numbers, and I want to do this for all of them in a reasonable way. Any suggestions?
Thank you in advance.
Sincerely,
ykl
I would just rbind in the new values (not sure why you specified all your values as character; here I changed them to numeric):
DF <- data.frame(
  Subject = c(1, 5, 10),
  time = c(2, 2.25, 2.5),
  value = c(3, 17, 9)
)
DF2 <- rbind(
  DF,
  data.frame(Subject = unique(DF$Subject), time = 0, value = 0)
)
This puts the new rows at the bottom, but you can re-sort if you like:
DF2[order(DF2$Subject, DF2$time), ]
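If you are already working with dplyr, the add-rows-and-sort steps can also be written as one pipeline (a sketch assuming the numeric DF defined above; bind_rows() and arrange() stand in for rbind() and order()):
library(dplyr)
DF2 <- DF %>%
  distinct(Subject) %>%
  mutate(time = 0, value = 0) %>%
  bind_rows(DF) %>%
  arrange(Subject, time)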
You can also use interleave from the "gdata" package:
library(gdata)
interleave(DF, data.frame(Subject = 0, time = 0, value = 0))
# Subject time value
# 1 1 2.00 3
# 11 0 0.00 0
# 2 5 2.25 17
# 1.1 0 0.00 0
# 3 10 2.50 9
# 1.2 0 0.00 0
This uses #MrFlick's sample data.
DF <- data.frame(
Subject = c(1,5,10),
time = c(2, 2.25, 2.5),
value = c(3, 17, 9)
)
