Calculate the SD in a different dataset with different data value limits - r

I am a beginner with R and want to calculate the SD of values in another dataframe several times within limits of values in a dataframe.
Imagine I have a dataframe looking like this.
peak <- c("max", "max", "max")
value <- c(42, 105, 170)
minbefore<- c(20, 50, 115)
minafter <- c(50, 115, 180)
extrema <- data.frame(peak, value, minbefore, minafter)
I now want to calculate the SD of the values in another dataframe, em$Position, within the limits of extrema$minbefore and extrema$minafter for each row of the dataframe extrema.
My idea was something like this
extrema$SD <- sd(em$Position[em$Position>extrema$minbefore & em$Position<extrema$minafter])
Then I get the following error message: longer object length is not a multiple of shorter object length
This makes sense to me, because I assume R tries to compare em$Position against the whole vectors extrema$minbefore and extrema$minafter at once instead of one row at a time, which obviously cannot work.
What would be the right way to do it?
Thanks in advance.
Dominik.

You can use the apply function to do this:
# dummy data
em <- data.frame(Position = unlist(as.integer(runif(n = 30, min = 20, max = 190))))
# function to calculate sd
extrema$SD <- apply(extrema[,c('minbefore','minafter')], 1, function(x){
return( sd(em[(em$Position > x[1]) & (em$Position < x[2]),'Position']))
})
print(extrema)
peak value minbefore minafter SD
1 max 42 20 50 5.966574
2 max 105 50 115 19.07878
3 max 170 115 180 18.407426
Explanation:
We traverse each row of extrema and get its minbefore and minafter values.
Using min, max values, we subset the em$Position and calculate the sd.
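The same row-by-row idea can also be sketched with mapply, which walks the two limit columns in parallel. This is a minimal sketch, not the answer's own method: the seed, the sample size of 200, and the uniform dummy data are assumptions chosen only to make the example reproducible.

```r
# Reproducible dummy data; seed and sample size are arbitrary assumptions.
set.seed(1)
em <- data.frame(Position = runif(200, min = 20, max = 190))
extrema <- data.frame(
  peak      = c("max", "max", "max"),
  value     = c(42, 105, 170),
  minbefore = c(20, 50, 115),
  minafter  = c(50, 115, 180)
)

# mapply() passes one (minbefore, minafter) pair per row to the function,
# so each comparison uses scalar limits and no recycling warning occurs.
extrema$SD <- mapply(
  function(lo, hi) sd(em$Position[em$Position > lo & em$Position < hi]),
  extrema$minbefore,
  extrema$minafter
)
extrema
```

Compared with apply, this avoids converting the selected columns to a matrix first, which matters if the data frame mixes numeric and character columns.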

Related

Generate random decimal numbers with given mean in given range in R

Hey I want to generate 100 decimal numbers in the range of 10 and 50 with the mean of 32.2.
I can use this to generate the numbers in the wanted range, but I don't get the mean:
runif(100, min=10, max=50)
Or I could use this, but then I don't get the range:
rnorm(100,mean=32.2,sd=10)
How can I combine those two or can I use another function?
I have tried to use this approach:
R - random distribution with predefined min, max, mean, and sd values
But I don't get the exact mean I want... (31.7 in my example try)
n <- 100
y <- rgbeta(n, mean = 32.2, var = 200, min = 10, max = 50)
Edit: OK, I have lowered the var and the mean gets near 32.2, but I still want some values near the min and max of the range...
In order to get random numbers between 10 and 50 with a (true) mean of 32.2, you would need a density function that would fulfill those properties.
A uniform distribution with a min of 10 and a max of 50 (runif) will never deliver you that mean, as the true mean is 30 for that distribution.
The normal distribution has a range from -infinity to infinity, regardless of its mean, so rnorm can return numbers greater than 50 and smaller than 10.
You could use a truncated normal distribution
rnormTrunc(n = 100, mean = 32.2, sd = 1, min = 10, max = 50),
if that distribution would be okay. If you need a different distribution, things will get a little more complicated.
Edit: feel free to ask if you need the math behind that, but depending on what your density function should look like it will get very complicated
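Note that rnormTrunc is not part of base R (a function of that name is provided by the EnvStats package). If installing a package is not an option, a truncated normal can be sketched in base R by inverse-CDF sampling. The helper name rtruncnorm_sketch and the sd of 10 are assumptions for illustration only.

```r
# Hypothetical helper (not from any package): draw from a normal distribution
# truncated to [lower, upper] by sampling uniformly on the CDF scale.
rtruncnorm_sketch <- function(n, mean, sd, lower, upper) {
  p_lo <- pnorm(lower, mean, sd)   # CDF value at the lower bound
  p_hi <- pnorm(upper, mean, sd)   # CDF value at the upper bound
  qnorm(runif(n, p_lo, p_hi), mean, sd)
}

set.seed(42)
x <- rtruncnorm_sketch(100, mean = 32.2, sd = 10, lower = 10, upper = 50)
range(x)  # stays inside [10, 50]
mean(x)   # sample mean; varies from draw to draw
```

Because the bounds 10 and 50 are not symmetric around 32.2, the mean of the truncated distribution is not exactly the mean parameter passed in, and a sample of 100 will scatter around it in any case.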
This isn't perfect, but maybe it's a start. I can't get the range to work out perfectly, so I just played with the "max" until I got an output I was happy with. There is probably a more rigorous mathematical way to do this. The result is uniform-adjacent... at best...
rand_unif_constrained <- function(num, min, max, mean) {
vec <- runif(num, min, max)
vec / sum(vec) * mean*num
}
set.seed(35)
test <- rand_unif_constrained(100, 10, 40, 32.2) # play with max until the max output is less than 50
mean(test)
#> [1] 32.2
min(test)
#> [1] 12.48274
max(test)
#> [1] 48.345
hist(test)

Find adjacent rows that match condition

I have a financial time series in R (currently an xts object, but I'm also looking into tibble right now).
How do I find the probability of 2 adjacent rows matching a condition?
For example, I want to know the probability of 2 consecutive days having a higher than mean/median value. I know I can lag the previous day's value into the next row, which would allow me to get this statistic, but that seems very cumbersome and inflexible.
Is there a better way to get this done?
xts sample data:
foo <- xts(x = c(1,1,5,1,5,5,1), seq(as.Date("2016-01-01"), length = 7, by = "days"))
What's the probability of 2 consecutive days having a higher than median value?
You can create a new column that calls out which are higher than the median, and then take only those that are consecutive and higher
> foo <- as_tibble(data.table(x = c(1,1,5,1,5,5,1), seq(as.Date("2016-01-01"), length = 7, by = "days")))
Step 1
Create column to find those that are higher than median
> foo$higher_than_median <- foo$x > median(foo$x)
Step 2
Compare that column using diff,
Take it only when both are consecutively higher or lower: c(0, diff(foo$higher_than_median)) == 0
Then add the condition that they must both be higher foo$higher_than_median == TRUE
Full Expression:
foo$both_higher <- c(0, diff(foo$higher_than_median)) == 0 & foo$higher_than_median == TRUE
Step 3
To find probability take the mean of foo$both_higher
mean(foo$both_higher)
[1] 0.1428571
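The result above divides by the number of rows (7). A variant that divides by the number of adjacent pairs (6) can be sketched in base R by comparing the vector with a shifted copy of itself. Which denominator is appropriate depends on how the probability is defined, so treat this as one possible reading rather than the definitive one.

```r
x <- c(1, 1, 5, 1, 5, 5, 1)
above <- x > median(x)

# Pair each day with the following day:
# above[-length(above)] is "today", above[-1] is "tomorrow".
both_above <- above[-length(above)] & above[-1]

sum(both_above)   # number of adjacent pairs with both days above the median
mean(both_above)  # share of all adjacent pairs
```

Here exactly one of the six adjacent pairs (days 5 and 6) has both values above the median, so the share is 1/6.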
Here is a pure xts solution.
How do you define the median? There are several ways.
In an online time series setting, such as computing a moving average, you can compute the median over a fixed lookback window (shown below) or from the origin up to now (an anchored window calculation). Either way, the median computation must not use values beyond the current time step, which avoids look-ahead bias:
library(xts)
library(TTR)
x <- rep(c(1,1,5,1,5,5,1, 5, 5, 5), 10)
y <- xts(x = x, seq(as.Date("2016-01-01"), length = length(x), by = "days"), dimnames = list(NULL, "x"))
# Avoid look ahead bias in an online time series application by computing the median over a rolling fixed time window:
nMedLookback <- 5
y$med <- runPercentRank(y[, "x"], n = nMedLookback)
y$isAboveMed <- y$med > 0.5
nSum <- 2
y$runSum2 <- runSum(y$isAboveMed, n = nSum)
z <- na.omit(y)
prob <- sum(z[,"runSum2"] >= nSum) / NROW(z)
The case where your median is over the entire data set is obviously a much easier modification of this.

Sampling a specific age distribution from a dataset

Suppose I have a dataset with 1,000,000 observations. Variables are age, race, gender. This dataset represents the whole US.
How can I draw a sample of 1,000 people from this dataset, given a certain age distribution? E.g. I want this dataset with 1000 people distributed like this:
0.3 * Age 0 - 30
0.3 * Age 31 - 50
0.2 * Age 51 - 69
0.2 * Age 70 - 100
Is there a quick way to do it? I already created a sample of 1000 people with the desired age distribution, but how do I combine that now with my original dataset?
As an example, this is how I have created the population distribution of Maine:
set.seed(123)
library(magrittr)
popMaine <- data.frame(min=c(0, 19, 26, 35, 55, 65), max=c(18, 25, 34, 54, 64, 113), prop=c(0.2, 0.07, 0.11, 0.29, 0.14, 0.21))
Mainesample <- sample(nrow(popMaine), 1000, replace=TRUE, prob=popMaine$prop)
Maine <- round(popMaine$min[Mainesample] + runif(1000) * (popMaine$max[Mainesample] - popMaine$min[Mainesample])) %>% data.frame()
names(Maine) <- c("Age")
Now I don't know how to bring this together with my other dataset which has the whole US population... I'd appreciate any help, I am stuck for quite a while now...
Below are four different approaches. Two use functions from, respectively, the splitstackshape and sampling packages, one uses base mapply, and one uses map2 from the purrr package (which is part of the tidyverse collection of packages).
First let's set up some fake data and sampling parameters:
# Fake data
set.seed(156)
df = data.frame(age=sample(0:100, 1e6, replace=TRUE))
# Add a grouping variable for age range
df$age.groups = cut(df$age, c(0,30,51,70,Inf), right=FALSE)
# Total number of people sampled
n = 1000
# Named vector of sample proportions by group
probs = setNames(c(0.3, 0.3, 0.2, 0.2), levels(df$age.groups))
Using the above sampling parameters, we want to sample n total values with a proportion probs from each age group.
Option 1: mapply
mapply can apply multiple arguments to a function. Here, the arguments are (1) the data frame df split into the four age groupings, and (2) probs*n, which gives the number of rows we want from each age group:
df.sample = mapply(a=split(df, df$age.groups), b=probs*n,
function(a,b) {
a[sample(1:nrow(a), b), ]
}, SIMPLIFY=FALSE)
mapply returns a list of four data frames, one for each stratum. Combine this list into a single data frame:
df.sample = do.call(rbind, df.sample)
Check the sampling:
table(df.sample$age.groups)
[0,30) [30,51) [51,70) [70,Inf)
300 300 200 200
Option 2: stratified function from the splitstackshape package
The size argument requires a named vector with the number of samples from each stratum.
library(splitstackshape)
df.sample2 = stratified(df, "age.groups", size=probs*n)
Option 3: strata function from the sampling package
This option is by far the slowest.
library(sampling)
# Data frame must be sorted by stratification column(s)
df = df[order(df$age.groups),]
sampled.rows = strata(df, 'age.groups', size=probs*n, method="srswor")
df.sample3 = df[sampled.rows$ID_unit, ]
Option 4: tidyverse packages
map2 is like mapply in that it applies two arguments in parallel to a function, in this case the dplyr package's sample_n function. map2 returns a list of four data frames, one for each stratum, which we combine into a single data frame with bind_rows.
library(dplyr)
library(purrr)
df.sample4 = map2(split(df, df$age.groups), probs*n, sample_n) %>% bind_rows
Timings
library(microbenchmark)
Unit: milliseconds
expr min lq mean median uq max neval cld
mapply 86.77215 110.82979 156.66855 123.95275 145.25115 486.2078 10 a
strata 5028.42933 5541.40442 5709.16796 5699.50711 5845.69921 6467.7250 10 b
stratified 38.33495 41.76831 89.93954 45.43525 79.18461 408.2346 10 a
tidyverse 71.48638 135.49113 143.12011 142.86866 155.72665 192.4174 10 a
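The same stratified idea can be applied to the Maine age bins from the question. The following is a base R sketch under stated assumptions: the data frame us is a made-up stand-in for the real US dataset, and the proportions are rescaled because the values given in the question sum to 1.02 rather than 1.

```r
set.seed(123)

# Stand-in for the US dataset (assumption: ages uniform on 0..100).
us <- data.frame(age = sample(0:100, 1e5, replace = TRUE))

popMaine <- data.frame(
  min  = c(0, 19, 26, 35, 55, 65),
  max  = c(18, 25, 34, 54, 64, 113),
  prop = c(0.2, 0.07, 0.11, 0.29, 0.14, 0.21)
)
n <- 1000

# Assign each row to a bin, using the bin minima as break points.
us$bin <- findInterval(us$age, popMaine$min)

# Rows to draw per bin; proportions are normalised so they sum to 1.
size <- round(n * popMaine$prop / sum(popMaine$prop))

# Sample row indices within each bin, without replacement.
rows <- unlist(lapply(seq_len(nrow(popMaine)), function(i) {
  candidates <- which(us$bin == i)
  candidates[sample.int(length(candidates), size[i])]
}))

maine_sample <- us[rows, ]
table(maine_sample$bin)
```

Rounding the per-bin counts happens to total exactly 1000 for these proportions, but that is not guaranteed in general, so it is worth checking sum(size) before sampling.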

How to make raw data from descriptives?

I have a table of descriptives, but I'd like to generate the raw data so that I can run stats like t.test and whatnot.
# Table. Heart rate between groups
# Mean heart rate sd n
# Group1 125 11 218
# Group2 133 12 156
I'd like to fill in the remainder of group 2 with NAs so the vectors are the same length. Then, I'd like to run a t test to see if the two groups are different. I've been hacking away at it, but I can't seem to get everything working properly.
I was using rnorm, I'm just not sure how to then get that into a suitable format for stats:
group1 <- rnorm(n=218, mean = 125, sd = 11)
group2 <- rnorm(n=156, mean = 133, sd = 12)
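One way this could be approached is sketched below. It assumes that simulated values are acceptable as a stand-in for the real raw data; the helper sim_exact and the column names are hypothetical. Plain rnorm draws will not reproduce the table's mean and sd exactly, so the helper standardises the draws first and then rescales them. No NA padding is needed: t.test accepts groups of different sizes, and a long-format data frame holds one row per observation.

```r
set.seed(1)

# Hypothetical helper: n simulated values whose sample mean and sd match the
# targets exactly (the individual values are still invented).
sim_exact <- function(n, mean, sd) {
  as.numeric(scale(rnorm(n))) * sd + mean
}

group1 <- sim_exact(218, mean = 125, sd = 11)
group2 <- sim_exact(156, mean = 133, sd = 12)

# Long format: one row per observation, groups may differ in size.
hr <- data.frame(
  group      = rep(c("Group1", "Group2"), times = c(218, 156)),
  heart_rate = c(group1, group2)
)

# Welch two-sample t test (R's default when comparing two groups).
res <- t.test(heart_rate ~ group, data = hr)
res
```

Because the test statistic depends only on the group means, standard deviations, and sizes, the result here is determined by the table itself rather than by the particular random draw.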

How can I introduce values to a vector in random positions in R?

I'm just starting to learn R, and my assignment was to create a vector of 10000 values with normal distribution, mean = 0 and sd = 100. Which I did.
x <- rnorm(10000, mean = 0, sd = 100)
But now I'm asked to introduce values between 500 and 700 at 1000 random positions in that vector.
Can anyone help me?
If you mean to replace 1000 elements in the x vector with values between 500 and 700, you first need to generate these 1000 elements:
r <- runif(1000, min=500, max=700)
I am assuming here that the random values are uniformly distributed between 500 and 700.
Then you need to select places to put these values in:
idx <- sample(10000, 1000)
Finally, replace the values at these places:
x[ idx ] <- r
To see the results of your action:
hist(x)
It should look like: [histogram of x]
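The three steps above can be put together as one self-contained sketch with a couple of sanity checks. The seed is an arbitrary assumption added for reproducibility.

```r
set.seed(2024)
x <- rnorm(10000, mean = 0, sd = 100)

# sample() without replacement returns 1000 distinct positions in x.
idx <- sample(length(x), 1000)

# Overwrite those positions with uniform values between 500 and 700.
x[idx] <- runif(1000, min = 500, max = 700)

# Sanity checks: positions are unique and the new values are in range.
anyDuplicated(idx) == 0
all(x[idx] >= 500 & x[idx] <= 700)
```

Using length(x) instead of the literal 10000 keeps the index sampling correct if the vector length changes.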
