I have a large table with timestamps from several nights. Columns are an id for the night, an id for the timestamp within that night, and the heart rate at that timestamp. It looks like this:
library(data.table)
allData <- data.table(nightNo=c(1,1,1,1,1,1,2,2,2,2), withinNightNo=c(1,2,3,4,5,6,1,2,3,4), HR=c(1:10))
nightNo withinNightNo HR
      1             1  1
      1             2  2
      1             3  3
      1             4  4
      1             5  5
      1             6  6
      2             1  7
      2             2  8
      2             3  9
      2             4 10
I'd like to add two new columns to the table: the slope and the cumsum of HR over up to the last 10 rows of the same night. I calculate the slope using linear regression, and I define the cumsum recursively as CUMSUM_n = max(CUMSUM_(n-1), 0) + (value_n - mean(value_1..n)). The result should look like this:
nightNo withinNightNo HR HRSlope HRCumsum
      1             1  1     NaN      0.0
      1             2  2       1      0.5
      1             3  3       1      1.5
      1             4  4       1      3.0
      1             5  5       1      5.0
      1             6  6       1      7.5
      2             1  7     NaN      0.0
      2             2  8       1      0.5
      2             3  9       1      1.5
      2             4 10       1      3.0
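Spelled out as a direct (slow) reference implementation, just to pin down the cumsum definition for one night:

# reference implementation of the recursion above, applied within one night
refCumsum <- function(hr) {
  out <- numeric(length(hr))
  for (n in seq_along(hr)) {
    prev <- if (n == 1) 0 else max(out[n - 1], 0)
    out[n] <- prev + (hr[n] - mean(hr[1:n]))
  }
  out
}
refCumsum(1:6)  # 0.0 0.5 1.5 3.0 5.0 7.5 -- matches the HRCumsum column above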
I've created code for both of these functions using for loops. They work, but my table is so large that it takes a long time to even calculate the slope/cumsum of a single value. My code looks like this:
# Add HRSlope column
library(dplyr)  # for filter()
allData$HRSlope <- 0
for (i in 1:nrow(allData)) {
  # Get points from up to last 10 seconds of the same night
  start <- ifelse(i < 11, 1, (i - 10))
  points <- filter(allData[start:i, ], nightNo == allData[i, ]$nightNo)[, c("withinNightNo", "HR")]
  # Calculate necessary values
  meanX <- mean(points$withinNightNo)
  meanY <- mean(points$HR)
  meanXY <- mean(points$withinNightNo * points$HR)
  meanX2 <- mean(points$withinNightNo^2)
  # Calculate slope and add to table
  allData[i, ]$HRSlope <- (meanX * meanY - meanXY) / (meanX^2 - meanX2)
  cat(i, "\n")
}
# Add cumsum column, and add first value to sum
allData$HRCumsum <- 0
Sum <- allData[1, ]$HR
for (i in 2:nrow(allData)) {
  # Get sum and average of HR in night so far, reset Sum if new night started
  Sum <- allData[i, ]$HR + ifelse(allData[i, ]$nightNo != allData[i - 1, ]$nightNo, 0, Sum)
  Average <- Sum / allData[i, ]$withinNightNo
  # Get previous cumsum, if available
  pCumsum <- ifelse(allData[i, ]$nightNo != allData[i - 1, ]$nightNo, 0, allData[i - 1, ]$HRCumsum)
  # Calculate current cumsum
  allData[i, ]$HRCumsum <- max(pCumsum, 0) + (allData[i, ]$HR - Average)
  cat(i, "\n")
}
Is there a more efficient way to do this, presumably without for loops?
EDIT:
I've been able to increase the speed of my slope function somewhat. However, it still uses a for loop, and it writes a wrong value to each field nine times before writing the correct value. Any thoughts on how to fix these two issues?
getSlope <- function(x, y) {
  # Calculate necessary values
  meanX <- mean(x)
  meanY <- mean(y)
  meanXY <- mean(x * y)
  meanX2 <- mean(x^2)
  # Calculate slope
  return((meanX * meanY - meanXY) / (meanX^2 - meanX2))
}
# Loop back to 1
for (i in max(allData$withinNightNo):1) {
  # Prevent the lower bound of the window from going below 0
  low <- ifelse(i < 10, 0, i - 10)
  # Grab up to last 10 points and calculate slope
  allData[with(allData, withinNightNo > low & withinNightNo <= i), slope := getSlope(withinNightNo, HR), by = nightNo]
}
EDIT2:
I've also been able to improve my cumsum a little, but it suffers from the same issues as the slope. Besides that, it works on larger chunks of the table, because it needs the running average, and it loops over all the data twice. Any thoughts on improving this would also be highly appreciated.
# Calculate part of the cumsum
getCumsumPart <- function(x) {
  return(x - mean(x))
}
# Calculate valueN - mean(value1:N)
for (i in max(allData$withinNightNo):1) {
  allData[with(allData, withinNightNo <= i), cumsumPart := getCumsumPart(HR), by = nightNo]
}
# Calculate + max(cumsumN-1, 0)
for (i in max(allData$withinNightNo):1) {
  allData[with(allData, withinNightNo <= i & cumsumPart > 0), cumsum := sum(cumsumPart), by = nightNo]
}
# Remove helper column
allData$cumsumPart <- NULL
# Set NA values to 0
allData[with(allData, is.na(cumsum)), cumsum := 0]
Try this approach:
library(dplyr)
library(caTools)
allData <- data.frame(nightNo=c(1,1,1,1,1,1,2,2,2,2),
withinNightNo=c(1,2,3,4,5,6,1,2,3,4),
HR=c(1:10))
group_fun <- function(grouped_df, window = 10L) {
  # slope
  mean_x <- runmean(grouped_df$withinNightNo, window, align = "right")
  mean_y <- runmean(grouped_df$HR, window, align = "right")
  mean_xy <- runmean(grouped_df$withinNightNo * grouped_df$HR, window, align = "right")
  mean_xx <- runmean(grouped_df$withinNightNo * grouped_df$withinNightNo, window, align = "right")
  grouped_df$slope <- (mean_x * mean_y - mean_xy) / (mean_x^2 - mean_xx)
  # cumsum
  partial <- grouped_df$HR - mean_y # from above
  # the "loop" is unavoidable here, I think
  cumsum <- 0
  grouped_df$cumsum <- sapply(partial, function(val) {
    cumsum <<- max(cumsum, 0) + val
    cumsum
  })
  grouped_df
}
out <- allData %>%
  group_by(nightNo) %>%
  do(group_fun(., window = 3L)) # change window as desired
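If you'd rather stay in data.table, the same right-aligned running means can be computed with frollmean. A sketch, assuming data.table >= 1.12.0 (the adaptive form lets the window grow from 1 up to 10 at the start of each night):

library(data.table)
setDT(allData)
win <- 10L
allData[, c("mx", "my", "mxy", "mxx") := {
  k <- pmin(seq_len(.N), win) # growing window: 1, 2, ..., win, win, ...
  list(frollmean(withinNightNo, k, adaptive = TRUE),
       frollmean(HR, k, adaptive = TRUE),
       frollmean(withinNightNo * HR, k, adaptive = TRUE),
       frollmean(withinNightNo^2, k, adaptive = TRUE))
}, by = nightNo]
allData[, HRSlope := (mx * my - mxy) / (mx^2 - mxx)]
allData[, c("mx", "my", "mxy", "mxx") := NULL]

The first row of each night gives 0/0 = NaN, matching the expected HRSlope column.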
I am working with the R programming language. I am trying to build a loop that performs the following :
Step 1: Keep generating two random numbers "a" and "b" until both "a" and "b" are greater than 12
Step 2: Track how many random numbers had to be generated for Step 1 to complete
Step 3: Repeat Step 1 and Step 2 100 times
Since I do not know how to keep generating random numbers until a condition is met, I tried to generate a large amount of random numbers hoping that the condition is met (there is probably a better way to write this):
results <- list()
for (i in 1:100) {
  # do until break
  repeat {
    # repeat many random numbers
    a = rnorm(10000, 10, 1)
    b = rnorm(10000, 10, 1)
    # does any pair meet the requirement
    if (any(a > 12 & b > 12)) {
      # put it in a data.frame
      d_i = data.frame(a, b)
      # end repeat
      break
    }
  }
  # select all rows until the first time the requirement is met
  # it must be met, otherwise the loop would not have ended
  d_i <- d_i[1:which(d_i$a > 10 & d_i$b > 10)[1], ]
  # prep other variables and only keep last row (i.e. the row where the condition was met)
  d_i$index = seq_len(nrow(d_i))
  d_i$iteration = as.factor(i)
  e_i = d_i[nrow(d_i), ]
  results[[i]] <- e_i
}
results_df <- do.call(rbind.data.frame, results)
Problem: When I look at the results, I noticed that the loop is incorrectly considering the condition to be met, for example:
head(results_df)
a b index iteration
4 10.29053 10.56263 4 1
5 10.95308 10.32236 5 2
3 10.74808 10.50135 3 3
13 11.87705 10.75067 13 4
1 10.17850 10.58678 1 5
14 10.14741 11.07238 1 6
For instance, in each of these rows, both "a" and "b" are smaller than 12.
Does anyone know why this is happening and can someone please show me how to fix this problem?
Thanks!
How about this way? Since you tagged while-loop, I tried using one. (As for why your version returns values below 12: the repeat block checks a > 12 & b > 12, but the subsetting line afterwards checks d_i$a > 10 & d_i$b > 10, so the row you keep is the first one to clear 10, not 12.)
res <- matrix(0, nrow = 0, ncol = 3)
for (j in 1:100) {
  a <- rnorm(1, 10, 1)
  b <- rnorm(1, 10, 1)
  i <- 1
  while (a < 12 | b < 12) {
    a <- rnorm(1, 10, 1)
    b <- rnorm(1, 10, 1)
    i <- i + 1
  }
  x <- c(a, b, i)
  res <- rbind(res, x)
}
head(res)
[,1] [,2] [,3]
x 12.14232 12.08977 399
x 12.27158 12.01319 1695
x 12.57345 12.42135 302
x 12.07494 12.64841 600
x 12.03210 12.07949 82
x 12.34006 12.00365 782
dim(res)
[1] 100 3
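Since a qualifying pair is rare (both values above 12 happens with probability (1 - pnorm(2))^2, roughly 1 in 1900), generating one pair at a time can be slow. Your batch idea works too once the two thresholds agree; a sketch (the chunk size of 10000 and the helper name one_run are arbitrary choices):

one_run <- function(chunk = 10000) {
  tries <- 0
  repeat {
    a <- rnorm(chunk, 10, 1)
    b <- rnorm(chunk, 10, 1)
    hit <- which(a > 12 & b > 12) # same threshold as the stopping rule
    if (length(hit) > 0)
      return(c(a = a[hit[1]], b = b[hit[1]], index = tries + hit[1]))
    tries <- tries + chunk
  }
}
results_df <- as.data.frame(t(replicate(100, one_run())))
head(results_df)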
A survey shows an average score of 4.2 out of 5, with a sample size of 14. How do I create a dataframe of combinations of results that achieve a score of 4.2?
I tried this, but it got too big:
library(tidyverse)
n <- 14
avg <- 4.2
df <- expand.grid(rep(list(c(1:5)), n))
df <- df %>%
  rowwise() %>%
  mutate(avge = mean(c_across(everything()))) %>%
  filter(avge >= 4)
The aim of this is, given the limited information above, to know the distribution of combinations of individual scores, to see which combinations are more likely to occur, and how many low and high scores are needed to reach that average.
Thanks!
If you can tolerate doing this randomly, then
set.seed(42) # only so that you get the same results I show here
n <- 14
iter <- 1000000
scores <- integer(0)
while (iter > 0) {
  tmp <- sample(1:5, size = n, replace = TRUE)
  if (mean(tmp) > 4) {
    scores <- tmp
    break
  }
  iter <- iter - 1
}
mean(scores)
# [1] 4.142857
scores
# [1] 5 3 5 5 5 3 3 5 5 2 5 5 4 3
Notes:
The reason I use iter in there is to preclude the possibility of an "infinite" loop. While here it reacts rather quickly and is highly unlikely to go that far, if you change the conditions then it is possible your conditions could be infeasible or just highly improbable. If you don't need this, then remove iter and use instead while (TRUE) ...; you can always interrupt R with Escape (or whichever mechanism your IDE provides).
The reason I prefill scores with an empty vector and use tmp is so that you won't accidentally assume that scores having values means you have your average. That is, if the constraints are too tight, then you should find nothing, and therefore scores should not have values.
FYI: if you're looking for an average of 4.2, two things to note:
change the conditional to be what you need, such as looking for 4.2 ... but ...
looking for floating-point equality is going to bite you hard (see Why are these numbers not equal?, Is floating point math broken?, and https://en.wikipedia.org/wiki/IEEE_754), I suggest looking within a tolerance, perhaps
tol <- 0.02
# ...
if (abs(mean(tmp) - 4.2) < tol) {
  scores <- tmp
  break
}
# ...
where tol is some meaningful number. Unfortunately, using this seed (and my iter limit), there is no combination of 14 votes (of 1 to 5) that produces a mean within tol = 0.01 of 4.2:
set.seed(42)
n <- 14
iter <- 100000
scores <- integer(0)
tol <- 0.01
while (iter > 0) {
  tmp <- sample(1:5, size = n, replace = TRUE)
  # if (mean(tmp) > 4) {
  if (abs(mean(tmp) - 4.2) < tol) {
    scores <- tmp
    break
  }
  iter <- iter - 1
}
iter
# [1] 0 # <-- this means the loop exited on the iteration-limit, not something found
scores
# integer(0)
if you instead set tol = 0.02 then you will find something:
tol <- 0.02
# ...
scores
# [1] 4 4 4 4 4 5 4 5 5 5 3 4 3 5
mean(scores)
# [1] 4.214286
You can try the code below; comparing integer sums (sum(x) == round(avg * n)) sidesteps the floating-point equality issue mentioned above
n <- 14
avg <- 4.2
repeat {
  x <- sample(1:5, n, replace = TRUE)
  if (sum(x) == round(avg * n)) break
}
and you will see
> x
[1] 5 5 5 5 5 5 4 5 5 4 1 5 1 4
> mean(x)
[1] 4.214286
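To get at the distribution of combinations you mention, you can repeat that rejection step many times and tally how often each score value shows up in a qualifying sample. A sketch (the 1000 repetitions are an arbitrary choice; increase as needed):

n <- 14
target <- round(4.2 * n) # 59
draws <- replicate(1000, {
  repeat {
    x <- sample(1:5, n, replace = TRUE)
    if (sum(x) == target) break
  }
  table(factor(x, levels = 1:5)) # counts of 1s..5s in this qualifying sample
})
rowMeans(draws) # average number of each score among samples hitting the mean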
I want to evaluate the distance between non-zero data. So if I have 50 data points, and only the first and the last are non-zero, I want the result to be 49.
For example, my data is:
1. 0
2. 0
3. 5
4. 6
5. 0
6. 1
7. 0
Based on my data above, I want to get 4 variables:
v0 = 3 (because the distance from the 0th to the 3rd data point is 3 jumps)
v1 = 1 (because the distance from the 3rd to the 4th data point is 1 jump)
v2 = 2 (because the distance from the 4th to the 6th data point is 2 jumps)
v3 = 1 (because the distance from the 6th to the 7th data point is 1 jump)
This is my code:
data = c(0, 0, 5, 6, 0, 1, 0)
t = 1
for (i in data) {
  if (i == 0) {
    t[i] = t + 1
  } else {
    t[i] = 1
  }
}
t
The result is:
[1] 1 NA NA NA 1 1
Could you help me figure out this problem? I would also like the code to use some kind of loop, so that it can be applied to other data.
The general rule is not clear from the question but if x is the input we assume that:
the input is non-negative
the first element in output is the position of the first +ve element in x
subsequent elements of output are distances between successive +ve elements of x
if that results in a vector whose sum is less than length(x) append the remainder
To do that determine the positions of the positive elements of c(1, x), calculate the differences between successive elements in that reduced vector using diff and then if they don't sum to length(x) append the remainder.
dists <- function(x) {
  d <- diff(which(c(1, x) > 0))
  if (sum(d) < length(x)) c(d, length(x) - sum(d)) else d
}
# distance to 5 is 3 and then to 6 is 1 and then to 1 is 2 and 1 is left
x1 <- c(0, 0, 5, 6, 0, 1, 0)
dists(x1)
## [1] 3 1 2 1
# distance to first 1 is 1 and from that to second 1 is 3
x2 <- c(1, 0, 0, 1)
dists(x2)
## [1] 1 3
Here it is redone using a loop:
dists2 <- function(x) {
  pos <- 0
  out <- numeric(0)
  for (i in seq_along(x)) {
    if (x[i]) {
      out <- c(out, i - pos)
      pos <- i
    }
  }
  if (sum(out) < length(x)) out <- c(out, length(x) - sum(out))
  out
}
dists2(x1)
## [1] 3 1 2 1
dists2(x2)
## [1] 1 3
Updates
Simplification based on comments below answer. Added loop approach.
I have not been able to find a solution to a problem similar to this on StackOverflow. I hope someone can help!
I am using the R environment.
I have data from turtle nests. There are two types of hourly data in each nest. The first is hourly Temperature, and it has an associated hourly Development (the amount of "anatomical" embryonic development).
I am calculating a weighted median. In this case, the median is temperature and it is weighted by development.
I have a script here that I am using to calculate the weighted median:
weighted.median <- function(x, w, probs = 0.5, na.rm = TRUE) {
  x <- as.numeric(as.vector(x))
  w <- as.numeric(as.vector(w))
  if (anyNA(x) || anyNA(w)) {
    ok <- !(is.na(x) | is.na(w))
    x <- x[ok]
    w <- w[ok]
  }
  stopifnot(all(w >= 0))
  if (all(w == 0)) stop("All weights are zero", call. = FALSE)
  #'
  oo <- order(x)
  x <- x[oo]
  w <- w[oo]
  Fx <- cumsum(w) / sum(w)
  #'
  result <- numeric(length(probs))
  for (i in seq_along(result)) {
    p <- probs[i]
    lefties <- which(Fx <= p)
    if (length(lefties) == 0) {
      result[i] <- x[1]
    } else {
      left <- max(lefties)
      result[i] <- x[left]
      if (Fx[left] < p && left < length(x)) {
        right <- left + 1
        y <- x[left] + (x[right] - x[left]) * (p - Fx[left]) / (Fx[right] - Fx[left])
        if (is.finite(y)) result[i] <- y
      }
    }
  }
  names(result) <- paste0(format(100 * probs, trim = TRUE), "%")
  return(result)
}
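For example, with made-up numbers (values and weights chosen arbitrarily, just to show the call):

weighted.median(x = c(20, 25, 30), w = c(1, 1, 2))
## 50%
##  25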
So from the function you can see that I need two input vectors, x and w (which will be temperature and development, respectively).
The problem I'm having is that I have hourly temperature traces that last anywhere from 5 days to 53 days (i.e., 120 hours to 1272 hours).
I would like to calculate the daily weighted median for all days within a nest (i.e., take the 24 rows of x and w, and calculate the weighted median, then move onto rows 25-48, and so forth.) The output vector would therefore be a list of daily weighted medians with length n/24 (where n is the total number of rows in x).
In other words, I would like to analyse my data automatically, in a fashion equivalent to doing this manually (nest1 is the datasheet for Nest 1, which contains two vectors, temp and devo, where devo is the weight):
weighted.median(nest1$temp[1:24], nest1$devo[1:24], na.rm = TRUE)
followed by
weighted.median(nest1$temp[25:48], nest1$devo[25:48], na.rm = TRUE)
followed by
weighted.median(nest1$temp[49:72], nest1$devo[49:72], na.rm = TRUE)
all the way to
weighted.median(nest1$temp[(n - 23):n], nest1$devo[(n - 23):n], na.rm = TRUE)
I'm afraid I don't even know where to start. Any help or clues would be very much appreciated.
The main idea is to create a new column for day 1, day 2, ..., day n/24, split the dataframe into subsets by day, and apply your function to each subset.
First I create some sample data:
set.seed(123)
n <- 120 # number of rows
nest1 <- data.frame(temp = rnorm(n), devo = rpois(n, 5))
Create the splitting variable:
nest1$day <- rep(1:(nrow(nest1)/24), each = 24)
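(If the number of rows were ever not an exact multiple of 24, the rep() call above would fail; a small variant using integer division handles partial days — an assumption beyond the question, where trace lengths are whole days:)

nest1$day <- (seq_len(nrow(nest1)) - 1) %/% 24 + 1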
Then, use the by() function to split nest1 by nest1$day and apply the function to each subset:
out <- by(nest1, nest1$day, function(d) {
  weighted.median(d$temp, d$devo, na.rm = TRUE)
})
data.frame(day = dimnames(out)[[1]], x = as.vector(out))
# day x
# 1 1 -0.45244433
# 2 2 0.15337312
# 3 3 0.07071673
# 4 4 0.23873174
# 5 5 -0.27694709
Instead of using by, you can also use the group_by + summarise functions from the dplyr package:
library(dplyr)
nest1 %>%
  group_by(day) %>%
  summarise(x = weighted.median(temp, devo, na.rm = TRUE))
# # A tibble: 5 x 2
# day x
# <int> <dbl>
# 1 1 -0.452
# 2 2 0.153
# 3 3 0.0707
# 4 4 0.239
# 5 5 -0.277
I have data that looks like this:
"A" "B"
3 0
2 1
3 0
3 1
4 0
3 0
3 0
3 1
3 1
3 0
3 1
B is an event that may or may not happen on each day (B==1 means the event happened, B==0 means it did not), A is a measured variable, each row is a day, and the data is in time order. I need to plot the means of A for each of the N days before and after, as a function of whether the event happened (B==1) or not (B==0). Is there a function somewhere that can do this, so I won't need to create new variables for each day before and after, reshape, etc.?
Writing the function to do this turned out to be easier than I expected:
library(data.table)
library(ggplot2) # for qplot() and mean_cl_boot
plot_n_bef_aft <- function(data, n = 5) {
  data <- data.table(data)
  for (i in c(1:n)) {
    # lagged and lead copies of A, named "A.-i" and "A.i"
    data[, (paste0('A', '.-', i)) := c(rep(NA, i), A[seq_len(.N - i)])]
    data[, (paste0('A', '.', i)) := c(A[seq_len(.N - i) + i], rep(NA, i))]
  }
  data[, 'A.0' := A]
  datar <- reshape(data, varying = c('A.0', paste0('A', '.-', c(1:n)), paste0('A', '.', c(1:n))), direction = 'long')
  qplot(data = datar, x = time, y = A, stat = 'summary', fun.data = mean_cl_boot, color = factor(B), geom = c('pointrange', 'line'))
}
You may find it convenient to use the filter function (stats::filter) to calculate the N-before and N-after values.
n <- 5  # window size
m <- 25 # number of days
x <- 1:m
a <- 3 + round(rnorm(m)) # measured variable
b <- round(runif(m)^1.5) # event indicator
plot(x, a, type = "l", col = grey(.5))
# centered convolution filters of length 2n+1: the first n weights pick up the
# n values after a point, the last n weights the n values before it
fafter <- c(rep(1/n, n), rep(0, n + 1))
fbefore <- c(rep(0, n + 1), rep(1/n, n))
points(x[b == 1], stats::filter(a, fbefore)[b == 1], col = "red", pch = "-")
points(x[b == 1], stats::filter(a, fafter)[b == 1], col = "blue", pch = "+")
I've used the b==1 condition to only show the filter results for the events, though the calculations are done for all time values, if that matters.
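As a quick sanity check of the filter alignment (a sketch; it assumes the objects from the snippet above and an event at least n days from either edge), an interior event at position i should reproduce plain windowed means:

i <- which(b == 1 & x > n & x <= m - n)[1] # an interior event (NA if none in this draw)
all.equal(c(stats::filter(a, fbefore))[i], mean(a[(i - n):(i - 1)]))
all.equal(c(stats::filter(a, fafter))[i], mean(a[(i + 1):(i + n)]))

Both comparisons should return TRUE.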