generate normal distribution with exactly N elements in Y bins - r

I'll probably want to hit myself over the head for not getting this:
How do I generate a vector with the expected height of a normal distribution over Y bins (nbins in the below), of exactly N elements.
Like so, in the below picture:
Y or nbins = 15
N or nstat = 77
... should return something like: c(1,1,2,4, ...)
I know I could draw rnorm(77), but that'll never be exactly normal, and looping over 10,000 iterations or so seems overkill.
So I tried using qnorm for that purpose, but I have a hunch that something is wrong with the code below and that there has to be an easier, more elegant way.
Here is what I got:
nbins <- 15
nstat <- 77
item.pos <- qnorm( # to the left of which value lies...
1:(nstat) / (nstat+1)# ... the n-statement?
# using nstat + 1 because we want midpoints, not cutoffs for later
)
bins <- cut(
x = item.pos,
breaks = nbins,
ordered_result = TRUE
)
height <- summary(bins)
height <- as.numeric(height) # counts per bin, not the bin index of each item

If your data range from -2 to 2 with 15 intervals and the sample size is 77, I would suggest the following to get the expected heights of the 15 intervals:
rn <- dnorm(seq(-2,2, length = 15))/sum(dnorm(seq(-2,2, length = 15)))*77
[1] 1.226486 2.084993 3.266586 4.716619 6.276462 7.697443 8.700123 9.062576 8.700123 7.697443
[11] 6.276462 4.716619 3.266586 2.084993 1.226486
The barplot of this looks like:
barplot(height = rn, names.arg = round(seq(-2, 2, length = 15), 2))
So, in a sample of 77 you would expect 1.226486 cases in the first bin, 2.084993 cases in the second, etc. It's difficult to generate a vector like the one you described at the beginning, because the expected heights above are not integers.
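If integer counts are needed, one option (not part of the answer above, just a sketch) is largest-remainder rounding: round every expected height down, then hand the missing elements to the bins with the largest fractional parts. The range of -2 to 2 is the same assumption as above, and the names expected, counts, and short are made up for illustration.

```r
nbins <- 15
nstat <- 77

# expected (non-integer) heights, as in the answer above
expected <- dnorm(seq(-2, 2, length.out = nbins))
expected <- expected / sum(expected) * nstat

# round down, then distribute the leftover elements to the bins
# with the largest fractional remainders
counts <- floor(expected)
remainder <- expected - counts
short <- nstat - sum(counts)
top <- order(remainder, decreasing = TRUE)[seq_len(short)]
counts[top] <- counts[top] + 1

counts
sum(counts) # equals nstat
```

The total is exactly nstat by construction. Ties between remainders are broken by bin order, so the result need not be perfectly symmetric.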

Related

for loop to determine the top 10 percent of values in an interval

I essentially have two columns (vectors) with speed and accel in a data.frame as such:
speed acceleration
1 3.2694444 2.6539535522
2 3.3388889 2.5096979141
3 3.3888889 2.2722134590
4 3.4388889 1.9815256596
5 3.5000000 1.6777544022
6 3.5555556 1.3933215141
7 3.6055556 1.1439051628
8 3.6527778 0.9334115982
9 3.6722222 0.7561602592
For each speed interval on the x axis (for example 3.2-3.4, 3.4-3.6, and so on), I need to find the top 10% of the acceleration values on the y axis. Can you please show me what a for loop would look like in this situation?
As @alistaire already pointed out, you have provided a very limited amount of data. So we first have to simulate a bit more data on which we can test our code.
set.seed(1)
# your data
speed <- c(3.2694444, 3.3388889, 3.3388889, 3.4388889, 3.5,
3.5555556, 3.6055556, 3.6527778, 3.6722222)
acceleration <- c(2.6539535522, 2.5096979141, 2.2722134590,
1.9815256596, 1.6777544022, 1.3933215141,
1.1439051628, 0.9334115982, 0.7561602592)
df <- data.frame(speed, acceleration)
# expand data.frame and add a little bit of noise to all values
# to make them 'unique'
df <- as.data.frame(do.call(
rbind,
replicate(15L, apply(df, 2, \(x) (x + runif(length(x), -1e-1, 1e-1) )),
simplify = FALSE)
))
The function create_intervals, as the name suggests, creates user-defined intervals. The rest of the code does the 'heavy lifting' and stores the desired result in out.
If you would like to have intervals of speed with equal widths, simply specify the number of groups (n_groups) you would like to have and leave the rest of the arguments (i.e. lwr, upr, and interval_span) unspecified.
# Cut speed into user-defined intervals
create_intervals <- \(n_groups = NULL, lwr = NULL, upr = NULL, interval_span = NULL) {
if (!is.null(lwr) & !is.null(upr) & !is.null(interval_span) & is.null(n_groups)) {
speed_low <- subset(df, speed < lwr, select = speed)
first_interval <- with(speed_low, c(min(speed), lwr))
middle_intervals <- seq(lwr + interval_span, upr - interval_span, interval_span)
speed_upp <- subset(df, speed > upr, select = speed)
last_interval <- with(speed_upp, c(upr, max(speed)))
intervals <- c(first_interval, middle_intervals, last_interval)
} else {
step <- with(df, c(max(speed) - min(speed))/n_groups)
intervals <- array(0L, dim = n_groups)
for(i in seq_len(n_groups)) {
intervals[i] <- min(df$speed) + i * step
}
}
return(intervals)
}
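# n_groups = 3L returns three breakpoints (min + step, min + 2 * step, max),
# i.e. two intervals; speeds below min + step are not covered (see Output 1)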
my_intervals <- create_intervals(n_groups = 3L)
# Keep the rows of each speed interval whose acceleration is greater than
# or equal to the 90th percentile of that interval
out <- lapply(1:(length(my_intervals)-1L), \(i) {
x <- subset(df, speed >= my_intervals[i] & speed <= my_intervals[i+1L])
x[x$acceleration >= quantile(x$acceleration, 0.9), ]
})
# function to round values to two decimal places
r <- \(x) format(round(x, 2), nsmall = 2L)
# assign names to each element of out
for(i in seq_along(out)) {
names(out)[i] <- paste0(r(my_intervals[i]), '-', r(my_intervals[i+1L]))
}
Output 1
> out
$`3.38-3.57`
speed acceleration
11 3.394378 2.583636
21 3.383631 2.267659
57 3.434123 2.300234
83 3.394886 2.580924
101 3.395459 2.460971
$`3.57-3.76`
speed acceleration
6 3.635234 1.447290
41 3.572868 1.618293
51 3.615017 1.420020
95 3.575412 1.763215
We could also compute the desired values of speed based on intervals that make more 'sense' than just equally spaced speed intervals, e.g. [min(speed), 3.3), [3.3, 3.45), [3.45, 3.6), and [3.6, max(speed)).
This can be accomplished by leaving n_groups unspecified and instead specifying lwr, upr, and an interval_span that makes sense. For instance, it makes sense to have an interval span of 0.15 when the lower limit is 3.3 and the upper limit is 3.6.
# custom boundaries based on a lower limit and upper limit
my_intervals <- create_intervals(lwr = 3.3, upr = 3.6, interval_span = 0.15)
# re-run the lapply() and naming code above to refresh out
Output 2
> out
$`3.18-3.30`
speed acceleration
37 3.238781 2.696456
82 3.258691 2.722076
$`3.30-3.45`
speed acceleration
11 3.394378 2.583636
19 3.328292 2.711825
73 3.315306 2.644580
83 3.394886 2.580924
$`3.45-3.60`
speed acceleration
4 3.520530 2.018930
40 3.517329 2.032943
58 3.485247 2.079893
67 3.458031 2.078545
$`3.60-3.76`
speed acceleration
6 3.635234 1.447290
34 3.688131 1.218969
51 3.615017 1.420020
78 3.628465 1.348873
Note: use function(x) instead of \(x) if you use a version of R <4.1.0
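The equal-width branch of create_intervals starts its breakpoints at min(speed) + step, so the lowest interval is not part of Output 1. A possible alternative, sketched here with base R's cut() and split() rather than the answer's own function, includes the minimum as the first breakpoint. The data below are a hypothetical stand-in, not the simulated data from the answer, and top10 is an illustrative name.

```r
set.seed(1)
# hypothetical stand-in data (assumption, for illustration only)
df <- data.frame(speed = runif(135, 3.2, 3.7))
df$acceleration <- 6 - df$speed + rnorm(135, sd = 0.1)

n_groups <- 3L
# n_groups + 1 breakpoints from min to max give n_groups intervals
breaks <- seq(min(df$speed), max(df$speed), length.out = n_groups + 1L)
df$interval <- cut(df$speed, breaks = breaks, include.lowest = TRUE)

# per interval, keep rows at or above the 90th percentile of acceleration
top10 <- lapply(split(df, df$interval), function(x) {
  x[x$acceleration >= quantile(x$acceleration, 0.9), ]
})
names(top10)
```

cut() labels the intervals itself, so the separate naming loop is not needed in this variant.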

Hoping for help to translate a thought experiment into R code, using randomization

I'm more experienced with R than many of my peers, yet it sometimes takes hours to move a novel-to-me concept into the code line, and usually a few more to get a successful output. I don't know how to describe this in R language, so I hope you can help me- either with sample code, or pointing me in the right direction.
I have c(X1,X2,X3,...Xn) for starting variable, a non-random numeric value.
I have c(Y1,Y2,Y3,...Yn) for change variable, a non-random numeric value denoting by how much to change X, give or take, and a value between 0-10.
I have c(Z1,Z2,Z3,...Zn) which is the min and max range of X.
What I want to observe is the random sampling of all numbers X, which have all randomly had corresponding Y variable subtracted or added to them. What I'm trying to ask in this problem, is how many times will I draw X values which are exactly the X values which I initially input as well as give or take only a low Y value.
For instance,
Exes<-c(135,462,579,222)
Whys<-c(1,3,3,2)
Zees<-c(c(115,155),c(450,474),c(510,648),c(200,244))
First iteration: X = c(135, 462, 579, 222); second iteration: X = c(130, 471, 585, 230). As you can see, X in the second iteration has changed by (-5*Y1), (+3*Y2), (+2*Y3), and (+11*Y4).
What I want to output is a list of randomized X values which have changed by only a factor of their corresponding Y value, and always fall within the range of given Z values. Further, I want to examine how many times at least one, and only one, X value will be significantly different from the corresponding starting input X.
I feel like I'm not wording the question succinctly, but I also feel that this is why I've posted. I'm not trying to ask for hand-holding, but rather seeking advice.
I am not sure that I understood the question. Do you want to repeat the process numerous times? Is it for the purpose of simulation? Here is a start of a solution.
library(dplyr)
x <- c(135,462,579,222)
y <- c(1,3,3,2)
z.lower <- c(115, 450, 510, 200)
z.upper <- c(155, 474, 648, 244)
temp.df <- data.frame(x, y, z.lower, z.upper)
temp.df %>%
mutate(samp = sample(seq(-10, 10, 1), nrow(temp.df))) %>% ### Sample multipliers between -10 and 10 (without replacement)
mutate(new.val = x + samp * y) %>% ### Create new X
mutate(is.bound = new.val < z.upper & new.val > z.lower) ### Check that the new value falls within bounds
x y z.lower z.upper samp new.val is.bound
1 135 1 115 155 -10 125 TRUE
2 462 3 450 474 10 492 FALSE
3 579 3 510 648 8 603 TRUE
4 222 2 200 244 6 234 TRUE
For this dataset, this is a possibility:
Exes<-c(135,462,579,222)
Whys<-c(1,3,3,2)
Zees<-c(c(115,155),c(450,474),c(510,648),c(200,244))
n = 10000
x_range_l <- split(Zees, rep(seq_len(length(Zees) / 2), each = 2))
mapply(function(y, x_range) sample(seq(from = x_range[1], to = x_range[2], by = y), size = n, replace = T),
Whys, x_range_l)
Note that this option depends more on the Zees than the Exes. A more complete way to do it would be:
Exes<-c(135,462,579,222)
Whys<-c(1,3,3,2)
Why_Range <- c(20, 4, 13, 11)
x_range_l <- Map(function(x, y, rng) c(x - y * rng, x + y * rng), Exes, Whys, Why_Range)
n = 10000
mapply(function(y, x_range) sample(seq(from = x_range[1], to = x_range[2], by = y), size = n, replace = T),
Whys, x_range_l)
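To tie the pieces together, here is a rough base R sketch of the whole simulation under stated assumptions: each X changes by an integer multiple k of its Y with k between -10 and 10, only multiples that keep X inside its Z range are drawn, and "significantly different" is taken to mean |k| > 3. That threshold and the names valid_k, k_draws, and exactly_one are assumptions for illustration and do not come from the question.

```r
set.seed(42)
x <- c(135, 462, 579, 222)
y <- c(1, 3, 3, 2)
z.lower <- c(115, 450, 510, 200)
z.upper <- c(155, 474, 648, 244)

# multipliers k in -10..10 that keep x + k * y inside [z.lower, z.upper]
valid_k <- Map(function(x, y, lo, hi) {
  k <- -10:10
  k[x + k * y >= lo & x + k * y <= hi]
}, x, y, z.lower, z.upper)

n_iter <- 10000
# one multiplier per element and iteration; rows are iterations
k_draws <- sapply(valid_k, function(k) {
  k[sample.int(length(k), n_iter, replace = TRUE)]
})
new_x <- sweep(sweep(k_draws, 2, y, "*"), 2, x, "+")

# assumption: a change counts as "significant" when |k| > 3
big_change <- abs(k_draws) > 3
exactly_one <- sum(rowSums(big_change) == 1)
exactly_one / n_iter # share of iterations with exactly one big change
```

Sampling only from the valid multipliers avoids having to reject or clamp values afterwards, at the cost of a narrower spread for elements with tight Z ranges.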

Mapping slope of an area and returning percent above and below a threshold in R

I am trying to figure out the proportion of an area that has a slope of 0, +/- 5 degrees. In other words, anything steeper than 5 degrees in either direction is bad. I am looking for both the actual number and a graphic.
To achieve this I turned to R and the raster package.
Let's use a generic country, in this case, the Philippines
{list.of.packages <- c("sp","raster","rasterVis","maptools","rgeos")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)}
library(sp) # classes for spatial data
library(raster) # grids, rasters
library(rasterVis) # raster visualisation
library(maptools)
library(rgeos)
Now let's get the altitude information and plot the slopes.
elevation <- getData("alt", country = "PHL")
x <- terrain(elevation, opt = c("slope", "aspect"), unit = "degrees")
plot(x$slope)
Not very helpful due to the scale, so let's simply look at the Island of Palawan
e <- drawExtent(show=TRUE) #to crop out Palawan (it's the long skinny island that is roughly midway on the left and is oriented between 2 and 8 O'clock)
gewataSub <- crop(x,e)
plot(gewataSub, 1)## Now visualize the new cropped object
A little bit better to visualize. I get a sense of the magnitude of the slopes and that with a 5 degree restriction, I am mostly confined to the coast. But I need a little bit more for analysis.
I would like the results to come in two parts:
1. " 35 % (made up) of the selected area has a slope exceeding +/- 5 degrees" or " 65 % of the selected area is within +/- 5 degrees". (with the code to get it)
2. A picture where everything within +/- 5 degrees is one color, call it good or green, and everything else is in another color, call it bad or red.
Thanks
There are no negative slopes, so I assume you want those that are less than 5 degrees
library(raster)
elevation <- getData('alt', country='CHE')
x <- terrain(elevation, opt='slope', unit='degrees')
z <- x <= 5
Now you can count cells with freq
f <- freq(z)
If you have a planar coordinate reference system (that is, with units in meters or similar) you can do
f <- cbind(f, area=f[,2] * prod(res(z)))
to get areas. But for lon/lat data, you would need to correct for different sized cells and do
a <- area(z)
zonal(a, z, fun=sum)
And there are different ways to plot, but the most basic one
plot(z)
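For the percentage itself, the share of cells at or below the threshold is simply the mean of the logical layer. The sketch below shows the arithmetic on a plain matrix of made-up slope values, so it runs without downloading elevation data; the matrix slope and its values are assumptions for illustration. With a RasterLayer such as z above, cellStats(z, "mean") should give the same unweighted proportion.

```r
set.seed(1)
# hypothetical slope values in degrees; NA marks cells outside the area
slope <- matrix(rexp(10000, rate = 1 / 6), nrow = 100)
slope[sample(length(slope), 2000)] <- NA

within5 <- slope <= 5                 # TRUE = good, FALSE = bad, NA = no data
pct_good <- 100 * mean(within5, na.rm = TRUE)
pct_bad <- 100 - pct_good

cat(sprintf("%.1f %% of the selected area is within 5 degrees\n", pct_good))
cat(sprintf("%.1f %% of the selected area exceeds 5 degrees\n", pct_bad))
```

This counts cells, so for lon/lat data it ignores the differences in cell size mentioned above.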
You can use reclassify from the raster package to achieve that. The function assigns each cell value that lies within a defined interval a certain value. For example, you can assign cell values within interval (0,5] to value 0 and cell values within the interval (5, maxSlope] to value 1.
library(raster)
library(rasterVis)
elevation <- getData("alt", country = "PHL")
x <- terrain(elevation, opt = c("slope", "aspect"), unit = "degrees")
plot(x$slope)
e <- drawExtent(show = TRUE)
gewataSub <- crop(x, e)
plot(gewataSub$slope, 1)
m <- c(0, 5, 0, 5, maxValue(gewataSub$slope), 1)
rclmat <- matrix(m, ncol = 3, byrow = TRUE)
rc <- reclassify(gewataSub$slope, rclmat)
levelplot(
rc,
margin = F,
col.regions = c("wheat", "gray"),
colorkey = list(at = c(0, 1, 2), labels = list(at = c(0.5, 1.5), labels = c("<= 5", "> 5")))
)
After the reclassification you can calculate the percentages:
length(rc[rc == 0]) / (length(rc[rc == 0]) + length(rc[rc == 1])) # <= 5 degrees
[1] 0.6628788
length(rc[rc == 1]) / (length(rc[rc == 0]) + length(rc[rc == 1])) # > 5 degrees
[1] 0.3371212

Print frequencies (as numbers) in plot

In R, I would like to insert frequencies (as numbers) in a plot:
my code to create the plot:
par(mar=c(4.5,4.5,9.5,4), xpd=TRUE)
plot(factor(ArtMehrspr)~Mehrspr_Vielf, data=datProjektMehr, col=terrain.colors(4),
bty='L', main="Vielfalt nutzen")
legend("topright", inset=c(0,-.225), title="Art der Mehrsprachigkeit", levels(factor(datProjektMehr$ArtMehrspr)),
fill=terrain.colors(4), horiz=TRUE)
par(mar=c(5,4,4,2)+0.1)
In the plot, 2 columns of my dataframe are depicted: ArtMehrspr and Mehrspr_Vielf.
Now what I would like to know is, how many "Kombi" are in category "1", how many "Paral" are in category "1" and so on, and then to print this number in the plot, so that in every box of the plot, I can see the corresponding number of observations. R must know these numbers, otherwise it could not vary the height of the different boxes according to the number of observations. So it cannot be that hard to get these numbers into the plot, can it?
With the command table(), I can get these numbers, but I would have to have 5 table()-commands to get all the numbers. Example for category = 1:
> table(subset(datProjektMehr, Mehrspr_Vielf=="1")$ArtMehrspr)
einspr Kombi Paral Versc Wechs
0 1 9 2 1
Apparently, you can achieve what I am looking for by adding the command labels = TRUE. But it does not work:
par(mar=c(4.5,4.5,9.5,4), xpd=TRUE, labels = TRUE)
plot(factor(ArtMehrspr)~Mehrspr_Vielf, data=datProjektMehr, col=terrain.colors(4),
bty='L', main="Vielfalt nutzen")
legend("topright", inset=c(0,-.225), title="Art der Mehrsprachigkeit", levels(factor(datProjektMehr$ArtMehrspr)),
fill=terrain.colors(4), horiz=TRUE)
par(mar=c(5,4,4,2)+0.1)
R gives me the following warning message:
Warning message:
In par(mar = c(4.5, 4.5, 9.5, 4), xpd = TRUE, labels = TRUE) :
"labels" is not a graphical parameter
Is this not the right command? Does anyone know how to do this?
First of all, the warning tells you that par has no labels argument.
As for plotting the table output, I'm not aware of an easy way to do this, but I managed to write some rather unreliable and possibly inefficient code. On my machine, though, it works every time I run it.
The idea is to use text() to write all the values from your table inside the plot. To do so, the x and y coordinates had to be estimated. I prefer the term "estimated" to "calculated" because I didn't find a way to compute exact values for the coordinates, since the plot method used is plot.factor.
So:
#random data. DF = datProjektMehr, artmehr = ArtMehrspr, mehrviel = Mehrspr_Vielf
DF <- data.frame(artmehr = sample(letters[1:4], 20, T), mehrviel = as.factor(sample(1:5, 20, T)))
#your code of plotting
par(mar = c(4.5,4.5,9.5,4), xpd = TRUE)
plot(factor(artmehr) ~ mehrviel, data = DF, col = terrain.colors(4),
bty = 'L', main = "Vielfalt nutzen")
legend("topright", inset=c(0,-.225), title="Art der Mehrsprachigkeit", levels(factor(DF$artmehr)),
fill=terrain.colors(4), horiz=TRUE)
#no need to "table()" many times
tab = table(DF$artmehr, DF$mehrviel)
#maximum value of x axis (at least in my machine)
#I found -through trial and error- that for a factor of n levels, x.max = 1 + (n-1)*0.02
x.max = 1 + (length(levels(DF$mehrviel)) - 1) * 0.02
#coordinates of "mehrviel" (as I named it)
mehrviel.coords = ((cumsum(apply(tab, 2, sum)) / sum(tab)) * x.max) - ((apply(tab, 2, sum) / sum(tab)) / 2)
#coordinates of "artmehr" (as I named it)
artmehr.coords <- apply(tab, 2, function(x) { cumsum(x / sum(x)) })
artmehr.coords <- apply(artmehr.coords, 2, function(x) { x - c(x[1]/2, diff(x)/2) })
#"text" the values in your table
#don't plot "0"s
for(i in 1:ncol(artmehr.coords))
{
text(x = mehrviel.coords[i], y = artmehr.coords[,i], labels = ifelse(tab[,i] != 0, tab[,i], ""), cex = 2)
}
The values of table:
tab
1 2 3 4 5
a 1 1 0 1 0
b 0 0 2 1 2
c 1 1 2 1 0
d 2 0 0 3 2
The plot:
EDIT: 1) "Tidied" the answer. 2) Added an extra level to the factor plotted on the x axis to match your data exactly. 3) Placed the frequencies in the middle of each box.
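The geometry behind those coordinates can be checked without drawing anything. The following sketch rebuilds the example table from above as a plain matrix and computes the label midpoints: bar widths are the column shares of the total, box heights are the within-column proportions. It ignores the small gaps the plot puts between bars (the reason for the x.max correction above), and the names widths, heights, x_mid, and y_mid are made up for illustration.

```r
# the example table from above, rebuilt as a plain matrix
tab <- matrix(c(1, 1, 0, 1, 0,
                0, 0, 2, 1, 2,
                1, 1, 2, 1, 0,
                2, 0, 0, 3, 2),
              nrow = 4, byrow = TRUE,
              dimnames = list(letters[1:4], 1:5))

# horizontal: each bar is as wide as its share of all observations
widths <- colSums(tab) / sum(tab)
x_mid <- cumsum(widths) - widths / 2

# vertical: within a bar, each box is as tall as its share of the column
heights <- prop.table(tab, 2)
y_mid <- apply(heights, 2, cumsum) - heights / 2

x_mid
round(y_mid, 3)
```

Boxes with a count of zero have zero height, so their midpoint coincides with a box edge; this is why the answer skips the zeros when labelling.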

Running Mean/SD: How can I select within the averaging window based on criteria

I need to calculate a moving average and standard deviation for a moving window. This is simple enough with the caTools package!
... However, having defined my moving window, I would like to take the average from ONLY those values within the window whose corresponding values of other variables meet certain criteria. For example, I would like to calculate a moving Temperature average, using only the values within the window (e.g. +/- 2 days) where, say, Relative Humidity is above 80%.
Could anybody help point me in the right direction? Here is some example data:
da <- data.frame(matrix(c(12,15,12,13,8,20,18,19,20,80,79,91,92,70,94,80,80,90),
ncol = 2, byrow = FALSE)) # first nine values are Temp, last nine are RH
names(da) = c("Temp", "RH")
Thanks,
Brad
I haven't used caTools, but the help text for the (presumably) most relevant function in that package, ?runmean, says that x, the input data, can be either "a numeric vector [...] or matrix with n rows". In your case the matrix alternative is the relevant one: you wish to calculate the mean of a focal variable, Temp, conditional on a second variable, RH, so the function needs access to both variables. However, "[i]f x is a matrix than each column will be processed separately". Thus, I don't think caTools can solve your problem. Instead, I would suggest rollapply in the zoo package. rollapply has the argument by.column, which defaults to TRUE: "If TRUE, FUN is applied to each column separately". Because we need access to both columns inside the function, we set by.column to FALSE.
library(zoo)
# First, specify a function to apply to each window: mean of Temp where RH > 80
meanfun <- function(x) mean(x[(x[ , "RH"] > 80), "Temp"])
# Apply the function to windows of size 3 in your data 'da'.
meanTemp <- rollapply(data = da, width = 3, FUN = meanfun, by.column = FALSE)
meanTemp
# If you want to add the means to 'da',
# you need to make it the same length as number of rows in 'da'.
# This can be achieved by the `fill` argument,
# where we can pad the resulting vector of running means with NA
meanTemp <- rollapply(data = da, width = 3, FUN = meanfun, by.column = FALSE, fill = NA)
# Add the vector of means to the data frame
da2 <- cbind(da, meanTemp)
da2
# even smaller example to make it easier to see how the function works
da <- data.frame(Temp = 1:9, RH = rep(c(80, 81, 80), each = 3))
meanTemp <- rollapply(data = da, width = 3, FUN = meanfun, by.column = FALSE, fill = NA)
da2 <- cbind(da, meanTemp)
da2
# Temp RH meanTemp
# 1 1 80 NA
# 2 2 80 NaN
# 3 3 80 4.0
# 4 4 81 4.5
# 5 5 81 5.0
# 6 6 81 5.5
# 7 7 80 6.0
# 8 8 80 NaN
# 9 9 80 NA
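For readers who want to see what happens inside each window, the same conditional running mean can be sketched in base R without zoo. This is only an illustration of the logic, not a replacement for rollapply; half (the number of rows on each side of the centre) is a name introduced here.

```r
da <- data.frame(Temp = 1:9, RH = rep(c(80, 81, 80), each = 3))

half <- 1L # rows on each side of the centre, i.e. window width 3
n <- nrow(da)

meanTemp <- sapply(seq_len(n), function(i) {
  # incomplete windows at the edges are padded with NA
  if (i - half < 1L || i + half > n) return(NA_real_)
  w <- (i - half):(i + half)
  # mean of Temp over the rows of the window where RH > 80;
  # NaN when no row in the window qualifies
  mean(da$Temp[w][da$RH[w] > 80])
})

cbind(da, meanTemp)
```

Setting half to 2 would correspond to the +/- 2 days window mentioned in the question.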

Resources