Filtering a vector on condition - r

I am trying to filter a vector of integers.
The condition is that the distance between two consecutive kept elements must be at least 100; if it is not, drop the element and look at the next candidate.
Here is an example :
set.seed(42)
input <- sort(sample(1:1000, 20))
head(input, 20)
[1] 24 49 74 128 146 153 165 228 303 321 356 410 532 561 601 622 634 839 882 997
If I start from the first element 24, I would like to keep the first element that has a distance at least 100 from it.
In this case, it would be 128.
Then, from 128, repeat the same process.
The result should be :
24 128 228 356 532 634 839 997
I managed to create a quick and dirty loop that gives the correct result, but I can guess that it would not be very efficient for very large vectors...
result <- integer(length(input))
result[1] <- input[1]
for(i in seq_along(input)[-1]) {
if(is.na(input[2])) break
if(input[2] - input[1] < 100) {
input <- input[-2]
} else {
result[i] <- input[2]
input <- input[-1]
}
}
result <- result[result != 0]
What would be an efficient way to get the expected result ? Can it be done using vectorization ?

unique(Reduce(function(x, y) ifelse(y - x >= 100, y, x), input, accumulate = TRUE))
[1] 24 128 228 356 532 634 839 997
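To see why the one-liner above works, it can help to look at the intermediate vector that Reduce builds. The sketch below hard-codes the example input and uses a plain if/else instead of ifelse (my substitution; the comparison is scalar). The name last_kept is only illustrative. It assumes a sorted input and a positive minimum gap, so the kept values are strictly increasing and unique() is safe.

```r
# Example input from the question (already sorted)
input <- c(24, 49, 74, 128, 146, 153, 165, 228, 303, 321,
           356, 410, 532, 561, 601, 622, 634, 839, 882, 997)

# With accumulate = TRUE, Reduce returns every intermediate state.
# Here the state is "the most recently kept value": it only changes
# when the current element is at least 100 away from it.
last_kept <- Reduce(function(x, y) if (y - x >= 100) y else x,
                    input, accumulate = TRUE)

# Each kept value is repeated until the next one qualifies,
# so dropping the repeats leaves exactly the kept elements.
result <- unique(last_kept)
print(result)
```

Note that this is still a sequential pass over the vector, not true vectorization: each step depends on the previous kept value.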

Not thoroughly tested, but I believe this gets you there. I am using purrr::accumulate. This is a pretty neat problem :-) hoping to see some other solutions/approaches, so maybe leave this open (unanswered) for a bit...
library(purrr)
input <- c(24, 49, 74, 128, 146, 153, 165, 228, 303, 321, 356, 410, 532, 561, 601, 622, 634, 839, 882, 997)
idx <- which(accumulate(diff(input), ~ ifelse(.x >= 100, .y, .x + .y)) >= 100)
input[c(1, idx + 1)]
#> [1] 24 128 228 356 532 634 839 997
And to make this read a little more purrr, I suppose we could do:
accumulate(diff(input), ~ ifelse(.x >= 100, .y, .x + .y)) %>%
map_lgl(~ . >= 100) %>%
which %>%
{ input[c(1, . + 1)] }
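For comparison, here is a minimal sketch of a plain single-pass loop. The loop in the question shrinks input on every iteration (input[-1], input[-2]), which copies the vector each time; the version below only flips flags in a preallocated logical vector. The function name filter_min_gap and its gap argument are hypothetical, not from any package.

```r
# Keep the first element, then every element that is at least `gap`
# away from the most recently kept one. Assumes x is sorted ascending.
filter_min_gap <- function(x, gap = 100) {
  if (length(x) == 0) return(x)
  keep <- logical(length(x))   # preallocated, all FALSE
  keep[1] <- TRUE
  last <- x[1]
  for (i in seq_along(x)[-1]) {
    if (x[i] - last >= gap) {
      keep[i] <- TRUE
      last <- x[i]
    }
  }
  x[keep]
}

input <- c(24, 49, 74, 128, 146, 153, 165, 228, 303, 321,
           356, 410, 532, 561, 601, 622, 634, 839, 882, 997)
filtered <- filter_min_gap(input)
print(filtered)
```

Because each decision depends on the previously kept value, some form of sequential pass seems hard to avoid; the gain here is only that the pass does no copying.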

Related

Get a seed to generate a specific set of pseudo-random numbers

I was wondering if there is a way in R to find the specific seed that generates a specific set of numbers.
For example:
sample(1:300, 10)
I want to find the seed that gives me, as the output of the previous code:
58 235 243 42 281 137 2 219 284 184
As far as I know there is no elegant way to do this, but you could brute force it:
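# sample(1:300, 10) returns integers, and identical() is type-sensitive,
# so the target must be stored as integer for the comparison to ever succeed
desired_output <- as.integer(c(58, 235, 243, 42, 281, 137, 2, 219, 284, 184))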
MAX_SEED <- .Machine$integer.max
MIN_SEED <- MAX_SEED * -1
i <- MIN_SEED
while (i < MAX_SEED - 1) {
set.seed(i)
actual_output <- sample(1:300, 10)
if (identical(actual_output, desired_output)) {
message("Seed found! Seed is: ", i)
break
}
i <- i + 1
}
This takes 11.5 seconds to run with the first 1e6 seeds on my laptop - so if you're unlucky then it would take about 7 hours to run. Also, this is exactly the kind of task you could run in parallel in separate threads to cut the time down quite a bit.
EDIT: Updated to include negative seeds which I had not considered. So in fact it could take twice as long.
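The same search can be wrapped in a small function that scans one range of seeds, which makes it easier to split the work into chunks (for example, one chunk per worker). This is only a sketch: find_seed and its arguments are hypothetical names, and the demo target is generated from a known seed rather than taken from the question. Keep in mind that any seed found this way is tied to the sampling algorithm in use; the default for sample() changed in R 3.6.0, so the same seed can give different draws on older versions.

```r
# Scan `seeds` and return the first one for which sample(pool, n)
# reproduces `target`; NA if none of them does.
find_seed <- function(target, seeds, pool = 1:300) {
  target <- as.integer(target)  # identical() distinguishes integer from double
  for (s in seeds) {
    set.seed(s)
    if (identical(sample(pool, length(target)), target)) {
      return(s)
    }
  }
  NA_integer_
}

# Demo with a target whose seed we already know
set.seed(123)
target <- sample(1:300, 10)
found <- find_seed(target, seeds = 1:1000)
print(found)
```

Each call is independent of the others, so disjoint seed ranges can be handed to separate processes.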

One-to-one mapping of 'between' values?

I am trying to map values row by row (one-to-one) and return a data frame that keeps only the values lying between two bounds. For example, on this dataset:
Coastal_Cities Summer_2009 Summer_2010 Summer_2011 Summer_2012 Summer_2013 Summer_2014 Summer_2015 Summer_2016 Summer_2017 Summer_2018 Summer_2019
1 Aberdeen City 497 434 437 310 541 556 556 492 474 616 526
2 Barrow-in-Furness 552 555 637 445 671 726 616 514 547 773 627
3 Blackpool 551 550 623 433 664 700 585 493 535 738 611
I want to keep each value in a row if it lies between that row's lower and upper confidence bounds, and set it to 0 otherwise.
lower_confidence upper_confidence
1 479.8784 509.0307
2 588.6927 622.7619
3 573.3041 605.4232
So row 1 should map to row 1, row 2 should map to row 2 etc ...
For example, given a set of values:
structure(list(Coastal_Cities = c("Aberdeen City", "Barrow-in-Furness",
"Blackpool", "Bournemouth, Christchurch and Poole", "Caerdydd - Cardiff"
), Summer_2009 = c(497, 552, 551, 654, 529), Summer_2010 = c(434,
555, 550, 642, 598), Summer_2011 = c(437, 637, 623, 567, 549),
Summer_2012 = c(310, 445, 433, 481, 433), Summer_2013 = c(541,
671, 664, 776, 733), Summer_2014 = c(556, 726, 700, 799,
741), Summer_2015 = c(556, 616, 585, 619, 621), Summer_2016 = c(492,
514, 493, 598, 524), Summer_2017 = c(474, 547, 535, 659,
569), Summer_2018 = c(616, 773, 738, 806, 730), Summer_2019 = c(526,
627, 611, 688, 561)), row.names = c(NA, 5L), class = "data.frame")
#Get its 95% confidence interval by rows
ci <- function(x){
z= rowMeans(x[,-1])-1.96*(apply(x[, -1], 1, sd)/length(x[,-1]));
v =rowMeans(x[,-1])+1.96*(apply(x[, -1], 1, sd)/length(x[,-1])) ;
y=data.frame(lower_confidence = z, upper_confidence = v);
return(y)}
Now I am trying to produce a function that stores those values in each row that are between the upper and lower interval:
diff_ci <- function(x, y) {
for(i in nrow(x)) {
for(j in length(x[, -1])){
t = x[j] > ci(y)[1][[1]][i]
p = x[j] < ci(y)[2][[1]][i]
}
}
e = data.frame(t, p)
return(e)
}
However, when I call the function I get output that I cannot interpret:
diff_ci(weather[, -1], weather)
Summer_2018 Summer_2018.1
1 FALSE TRUE
2 TRUE FALSE
3 TRUE FALSE
4 TRUE FALSE
5 TRUE FALSE
It is not clear to me that it has done anything I had in mind.
As when I run this into a lapply function:
> lapply(weather[, -1], diff_ci, y=weather)
I get the error:
Error in data.frame(t, p) : object 'p' not found
I am guessing I messed up with the allocation of arguments within the for-loop?
My expected output (for first 2 rows):
Coastal_Cities Summer_2009 Summer_2010 Summer_2011 Summer_2012 Summer_2013 Summer_2014 Summer_2015 Summer_2016 Summer_2017 Summer_2018 Summer_2019
1 Barrow-in-Furness 497 0 0 0 0 0 0 492 0 0 0
2 Blackpool 0 0 0 0 0 0 0 0 0 0 0
I would also appreciate a cleaner version of my code, with an explanation of where I went wrong.
EDIT:
I have also tried this which fixes some of the i /j uses, however it just prints the entire output ...:
diff_ci <- function(x, y) {
for(i in nrow(x)) {
for(j in length(x[, -1])){
if(x[[j]][i] > ci(y)[1][[1]][i] | x[[j]][i] < ci(y)[2][[1]][i]){
print(x)
}
}
}
}
diff_ci(ten_year.average[, -1], ten_year.average)
Reading your question, it seems to me that what you need is fairly straightforward. If a value falls within the row-wise confidence interval, keep that value. If a value is outside the confidence interval, return 0. I named the data you supplied reprex.
I'm not sure what your use case is, so I didn't really fiddle with your ci calculation except to pull it out of a function so you can step through building the dataframe.
min <-
sapply(seq_along(reprex$Coastal_Cities),
function(x) {
rowMeans(reprex[x,-1])-1.96*(apply(reprex[x, -1], 1, sd)/length(reprex[x,-1]))
}
)
max <-
sapply(seq_along(reprex$Coastal_Cities),
function(x) {
rowMeans(reprex[x,-1])+1.96*(apply(reprex[x, -1], 1, sd)/length(reprex[x,-1]))
}
)
confint <- data.frame(min = min, max = max)
Instead of stepping into indexing hell via for loops, you can take advantage of the dataframe structure. A dataframe is a list of vectors and so it is easy to work with lapply and sapply to iterate through.
If I use lapply on a dataframe object itself, it will iterate through each column. You can see this in action by running the simple example below:
lapply(data.frame(a = 1:3, b = 4:6, c = 7:9), print)
So for your use case you want to iterate through each element of the vector we get from an initial lapply. We can use a nested sapply to keep the vector structure.
Using our really simple example above, let's say we want to paste "b" onto each observation of the dataframe:
lapply(data.frame(a = 1:3, b = 4:6, c = 7:9),
function(x) {
sapply(x,
function(y) {
paste(y, "b") }
)
})
# if we print the above output
$a
[1] "1 b" "2 b" "3 b"
$b
[1] "4 b" "5 b" "6 b"
$c
[1] "7 b" "8 b" "9 b"
So we apply the same logic, except instead of pasting "b" we use an if else statement to either return the original value, or return 0 if the value falls outside the interval in the corresponding row (y) of the confint dataframe.
Finally, we want to get a dataframe from our output. So we use do.call to supply the arguments (a list of vectors) to the data.frame() function call.
conditional <-
lapply(reprex[-1], function(x) {
sapply(seq_along(x), function(y) {
if(x[y] > confint$min[y] & x[y] < confint$max[y]) {x[y]} else {0}
})
})
do.call(data.frame, conditional)
But for real, if you do this kind of thing often, I recommend taking some time to learn the tidyverse. Tons of time saving tools for tasks like this. Using the tidyverse you could tackle this problem like so:
library(tidyverse)
reprex %>%
pivot_longer(starts_with("Summer")) %>%
group_by(Coastal_Cities) %>%
mutate(sd = sd(value),
mean = mean(value),
ci_min = mean - 1.96 * sd/n(),
ci_max = mean + 1.96 * sd/n()) %>%
ungroup() %>%
mutate(value = case_when(
ci_min <= value & value <= ci_max ~ value,
TRUE ~ 0
))
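On the "where did I go wrong" part of the question: for (i in nrow(x)) visits only the single value nrow(x), not 1 through nrow(x) (that would be seq_len(nrow(x))), so only the last row and last column were ever compared. When the function is called through lapply, x is a plain vector, nrow(x) is NULL, the loop body never runs, and p is never created, which would explain the "object 'p' not found" error. The sketch below avoids explicit loops altogether by comparing a matrix against the row-wise bounds. It uses the first three cities from the question and keeps the question's sd / n formula so the bounds match the ones shown there; the conventional standard error would divide by sqrt(n) instead. Variable names are my own.

```r
reprex <- data.frame(
  Coastal_Cities = c("Aberdeen City", "Barrow-in-Furness", "Blackpool"),
  Summer_2009 = c(497, 552, 551), Summer_2010 = c(434, 555, 550),
  Summer_2011 = c(437, 637, 623), Summer_2012 = c(310, 445, 433),
  Summer_2013 = c(541, 671, 664), Summer_2014 = c(556, 726, 700),
  Summer_2015 = c(556, 616, 585), Summer_2016 = c(492, 514, 493),
  Summer_2017 = c(474, 547, 535), Summer_2018 = c(616, 773, 738),
  Summer_2019 = c(526, 627, 611)
)

vals <- as.matrix(reprex[, -1])        # numeric part only
n <- ncol(vals)

# Same interval as the question's ci(): mean +/- 1.96 * sd / n
# (assumption kept from the question; sd / sqrt(n) is the usual choice)
half_width <- 1.96 * apply(vals, 1, sd) / n
lower <- rowMeans(vals) - half_width
upper <- rowMeans(vals) + half_width

# A length-nrow vector is recycled down the columns of a matrix,
# so element [i, j] is compared with lower[i] and upper[i].
inside <- vals > lower & vals < upper
vals[!inside] <- 0

result <- data.frame(Coastal_Cities = reprex$Coastal_Cities, vals)
print(result)
```

This keeps the wide layout of the input, with one row per city.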

Accessing a global vector component using a logical condition inside a function

This function returns people's BMIs. The function that returns the BMIs does what I want, but I cannot figure out how to print the WEIGHTS of the people who have a BMI over 25. I feel like I should be subscripting, but how do I get that association? Thanks in advance.
heights_in_cms <- c(180, 165, 160, 193)
weights_in_kg <- c(87, 58, 65, 100)
bmi_calc <- function(h, w) {
bmi = w / (h / 100)^2
cat("BMIs are:", bmi, "\n")
for (b in bmi) {
if (b > 25) {
print(b)
}
}
}
bmi_seq <- bmi_calc(heights_in_cms, weights_in_kg)
Using the same method as tmfmnk's answer, you can make the output of the function a data frame if you prefer.
bmi_calc <- function(h, w) {
bmi <- w/(h/100)^2
return(data.frame(BMI=bmi[bmi > 25], Weight = w[bmi > 25]))
}
> bmi_calc(heights_in_cms, weights_in_kg)
BMI Weight
1 26.85185 87
2 25.39062 65
3 26.84636 100
Edit: from the discussion in the comments
data.frame(Height.cm=heights_in_cms, Weight.kg=weights_in_kg, BMI=weights_in_kg/(heights_in_cms/100)^2)
Height.cm Weight.kg BMI
1 180 87 26.85185
2 165 58 21.30395
3 160 65 25.39062
4 193 100 26.84636
You can do:
bmi_calc <- function(h, w) {
bmi = w/(h/100)^2
print(list(`BMIs are:` = bmi[bmi > 25],
`Weights are:` = w[bmi > 25]))
}
bmi_calc(heights_in_cms, weights_in_kg)
$`BMIs are:`
[1] 26.85185 25.39062 26.84636
$`Weights are:`
[1] 87 65 100

FitDist ERROR: 'data must be a numeric vector of length greater than 1'

I am trying to fit a distribution using the following code:
fit.gamma <- fitdist(x, distr = "gamma", method = "mle")
I get the following error:
Error in fitdist(x, distr = "gamma", method = "mle") :
data must be a numeric vector of length greater than 1
x is a numeric variable.
Why do I get this error? Any tips are greatly appreciated.
> class(x)
[1] "numeric"
> str(x)
atomic [1:18839] 7 175 386 375 397 333 378 394 330 346 ...
- attr(*, "na.action")=Class 'omit' int [1:17] 1 209 267 286 288 297 299 300 304 305 ...
> dput(head(x, 20))
c(7, 175, 386, 375, 397, 333, 378, 394, 330, 346, 306, 344, 308,
278, 291, 284, 252, 294, 277, 241)
Thanks
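The issue appears to be that you used na.omit, which returns the vector with an extra na.action attribute (visible in your str output), so it no longer passes as the plain numeric vector that fitdist is expecting.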
Instead of this
x <- na.omit(x.init)
fit.gamma <- fitdist(x, distr = "gamma", method = "mle")
try converting the output of na.omit to a vector
x <- c(na.omit(x.init))
fit.gamma <- fitdist(x, distr = "gamma", method = "mle")
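The effect of that attribute can be checked without fitdistrplus. The sketch below uses a made-up x.init and assumes that fitdist validates its input with something like is.vector(), which returns FALSE when a vector carries attributes other than names. That assumption would explain the error message, but I have not verified it against every version of the package.

```r
# Made-up data with a couple of missing values
x.init <- c(7, NA, 175, 386, NA, 375)

x <- na.omit(x.init)
print(attributes(x))     # shows the na.action attribute added by na.omit
print(is.vector(x))      # FALSE: attributes other than names are present

# c() drops all attributes except names, leaving a plain numeric vector
x_plain <- c(x)
print(is.vector(x_plain))  # TRUE
```

as.numeric(x) or as.vector(x) strip the attribute in the same way, and x.init[!is.na(x.init)] never adds it in the first place.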

R - overestimation predict function in sinusoidal linear model

Background
I have a data file which consists of Sea Levels near the Dutch Storm Barrier over time, for various days. The goal is to fit a linear model that describes this evolution of the sea level, and given a certain time frame, make a 5 minute ahead-prediction of the sea level (forecasting).
Approach
Given a certain day (chosen beforehand), I chose a time frame on which I want to fit (train) the linear model. After some technical adjustments (see below for actual code), I fitted the model. Then the linear model object and a 5-minute time range are passed to predict() for a prediction, and the forecast, along with a confidence interval, is graphed just behind the fitted model in the first time frame, all in one plot (see below for an example plot).
Problem
The forecast of the model always over- or under-predicts hugely. In terms of magnitude, the forecast is off by a factor of 10^10 (or, equivalently, e+10). At the same time, the R^2 and adjusted R^2 are quite high (0.972 and 0.9334, respectively), and the model diagnostics (leverages, fitted vs residuals, normal Q-Q) look reasonably good. Hence, my question is: how can a model that fits the data so well forecast so badly? My only explanation is a mistake in the code, but I can't spot it.
The data set
More specifically, the dataset is a data frame which consists (apart from an index column) of 3 columns: 'date' ("yyyy-mm-dd" format), 'time' ("hh:mm:ss" format) and 'water' (integer between approx. -150 and 350, sea level in cm). Here's a small slice of the data which already gives rise to the above problem:
> SeaLvlAug30[fitRngAug, ]
date time water
1574161 2010-08-30 04:40:00 253
1574162 2010-08-30 04:40:10 254
1574163 2010-08-30 04:40:20 253
1574164 2010-08-30 04:40:30 250
1574165 2010-08-30 04:40:40 250
1574166 2010-08-30 04:40:50 252
1574167 2010-08-30 04:41:00 250
1574168 2010-08-30 04:41:10 247
1574169 2010-08-30 04:41:20 246
1574170 2010-08-30 04:41:30 245
1574171 2010-08-30 04:41:40 242
1574172 2010-08-30 04:41:50 241
1574173 2010-08-30 04:42:00 242
1574174 2010-08-30 04:42:10 244
1574175 2010-08-30 04:42:20 245
1574176 2010-08-30 04:42:30 247
1574177 2010-08-30 04:42:40 247
1574178 2010-08-30 04:42:50 249
1574179 2010-08-30 04:43:00 250
1574180 2010-08-30 04:43:10 250
Minimal runnable R code
# Construct a time frame of a day with steps of 10 seconds
SeaLvlDayTm <- c(1:8640)*10
# Construct the desired fit Range and prediction Range
ftRng <- c(1:20)
predRng <- c(21:50)
# Construct the desired columns for the data frame
date <- rep("2010-08-30", length(c(ftRng,predRng)))
time <- c("04:40:00", "04:40:10", "04:40:20", "04:40:30", "04:40:40", "04:40:50", "04:41:00",
"04:41:10", "04:41:20", "04:41:30", "04:41:40", "04:41:50", "04:42:00", "04:42:10",
"04:42:20", "04:42:30", "04:42:40", "04:42:50", "04:43:00", "04:43:10", "04:43:20",
"04:43:30", "04:43:40", "04:43:50", "04:44:00", "04:44:10", "04:44:20", "04:44:30",
"04:44:40", "04:44:50", "04:45:00", "04:45:10", "04:45:20", "04:45:30", "04:45:40",
"04:45:50", "04:46:00", "04:46:10", "04:46:20", "04:46:30", "04:46:40", "04:46:50",
"04:47:00", "04:47:10", "04:47:20", "04:47:30", "04:47:40", "04:47:50", "04:48:00",
"04:48:10")
timeSec <- c(1681:1730)*10
water <- c(253, 254, 253, 250, 250, 252, 250, 247, 246, 245, 242, 241, 242, 244, 245, 247,
247, 249, 250, 250, 249, 249, 250, 249, 246, 246, 248, 248, 245, 247, 251, 250,
251, 255, 256, 256, 257, 259, 257, 256, 260, 260, 257, 260, 261, 258, 256, 256,
258, 258)
# Construct the data frame
SeaLvlAugStp2 <- data.frame(date, time, timeSec, water)
# Change the index set of the data frame to correspond that of a year
rownames(SeaLvlAugStp2) <- c(1574161:1574210)
# Use a separate variable for the time (because of a weird error)
SeaLvlAugFtTm <- SeaLvlAugStp2$timeSec[ftRng]
# Fit the linear model
lmObjAug <- lm(SeaLvlAugStp2$water[ftRng] ~ sin((2*pi)/44700 * SeaLvlAugFtTm)
+ cos((2*pi)/44700 * SeaLvlAugFtTm) + poly(SeaLvlAugFtTm, 3)
+ cos((2*pi)/545 * SeaLvlAugFtTm) + sin((2*pi)/545 * SeaLvlAugFtTm)
+ cos((2*pi)/205 * SeaLvlAugFtTm) + sin((2*pi)/205 * SeaLvlAugFtTm)
+ cos((2*pi)/85 * SeaLvlAugFtTm) + sin((2*pi)/85 * SeaLvlAugFtTm),
data = SeaLvlAugStp2[ftRng, ])
# Get information about the linear model fit
summary(lmObjAug)
plot(lmObjAug)
#Compute time range prediction and fit
prdtRngTm <- timeSec[predRng]
ftRngTm <- timeSec[ftRng]
#Compute prediction/forecast based on fitted data linear model
prdtAug <- predict(lmObjAug, newdata=data.frame(SeaLvlAugFtTm = prdtRngTm), interval="prediction", level=0.95)
#Calculate lower and upper bound confidence interval prediction
lwrAug <- prdtAug[, 2]
uprAug <- prdtAug[, 3]
#Calculate minimum and maximum y axis plot
yminAug <- min(SeaLvlAugStp2$water[ftRng], SeaLvlAugStp2$water[predRng], lwrAug)
ymaxAug <- max(SeaLvlAugStp2$water[ftRng], SeaLvlAugStp2$water[predRng], uprAug)
#Make the plot
plot((timeSec/10)[ftRng], SeaLvlAugStp2$water[ftRng], xlim = c(min(timeSec/10), max(prdtRngTm/10)), ylim = c(yminAug, ymaxAug), col = 'green', pch = 19, main = "Sea Level high water & prediction August 30 ", xlab = "Time (seconds)", ylab = "Sea Level (cm)")
polygon(c(sort(prdtRngTm/10), rev(sort(prdtRngTm/10))), c(uprAug, rev(lwrAug)), col = "gray", border = "gray")
points(prdtRngTm/10, SeaLvlAugStp2$water[predRng], col = 'green', pch = 19)
lines(ftRngTm/10, fitted(lmObjAug), col = 'blue', lwd = 2)
lines(prdtRngTm/10, prdtAug[, 1], col = 'blue', lwd = 2)
legend("topleft", legend = c("Observ.", "Predicted", "Conf. Int."), lwd = 2, col=c("green", "blue", "gray"), lty = c(NA, 1, 1), pch = c(19, NA, NA))
Example plot
Sea Lvl High Water & prediction August 30
Until you post code that we can actually run, we won't be able to help you much more, but in the meantime here is a quick and dirty forecast using Rob J Hyndman's forecast package:
string_data <- "row date time water
1574161 2010-08-30 04:40:00 253
1574162 2010-08-30 04:40:10 254
1574163 2010-08-30 04:40:20 253
1574164 2010-08-30 04:40:30 250
1574165 2010-08-30 04:40:40 250
1574166 2010-08-30 04:40:50 252
1574167 2010-08-30 04:41:00 250
1574168 2010-08-30 04:41:10 247
1574169 2010-08-30 04:41:20 246
1574170 2010-08-30 04:41:30 245
1574171 2010-08-30 04:41:40 242
1574172 2010-08-30 04:41:50 241
1574173 2010-08-30 04:42:00 242
1574174 2010-08-30 04:42:10 244
1574175 2010-08-30 04:42:20 245
1574176 2010-08-30 04:42:30 247
1574177 2010-08-30 04:42:40 247
1574178 2010-08-30 04:42:50 249
1574179 2010-08-30 04:43:00 250
1574180 2010-08-30 04:43:10 250"
SeaLvlAug30 <- read.table(textConnection(string_data), header=TRUE)
library(forecast)
fit <- auto.arima(SeaLvlAug30$water)
summary(fit)
preds <- forecast(fit, h = 25)
preds
# Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
# 21 249.7563 247.7314 251.7812 246.6595 252.8531
# 22 249.4394 246.1177 252.7611 244.3593 254.5195
# 23 249.1388 244.9831 253.2945 242.7832 255.4944
# 24 248.8930 244.2626 253.5234 241.8114 255.9746
# 25 248.7110 243.8397 253.5822 241.2610 256.1609
# 26 248.5867 243.6073 253.5661 240.9713 256.2021
# 27 248.5085 243.4867 253.5302 240.8284 256.1885
# 28 248.4636 243.4280 253.4991 240.7624 256.1647
# 29 248.4410 243.4020 253.4800 240.7345 256.1475
# 30 248.4322 243.3927 253.4718 240.7249 256.1396
# 31 248.4311 243.3916 253.4707 240.7238 256.1385
# 32 248.4337 243.3941 253.4733 240.7263 256.1411
# 33 248.4376 243.3979 253.4773 240.7300 256.1452
# 34 248.4414 243.4016 253.4812 240.7337 256.1492
# 35 248.4447 243.4048 253.4845 240.7368 256.1525
# 36 248.4471 243.4072 253.4870 240.7392 256.1550
# 37 248.4488 243.4089 253.4887 240.7409 256.1567
# 38 248.4499 243.4100 253.4898 240.7420 256.1578
# 39 248.4505 243.4106 253.4905 240.7426 256.1585
# 40 248.4509 243.4109 253.4908 240.7429 256.1588
# 41 248.4510 243.4111 253.4910 240.7431 256.1589
# 42 248.4510 243.4111 253.4910 240.7431 256.1590
# 43 248.4510 243.4111 253.4910 240.7431 256.1590
# 44 248.4510 243.4110 253.4909 240.7430 256.1589
# 45 248.4509 243.4110 253.4909 240.7430 256.1589
plot(preds)
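Staying with lm for a moment: the model in the question estimates 12 coefficients (intercept, cubic polynomial, four sine/cosine pairs) from 20 observations and then extrapolates. That is one plausible, unverified, contributor to the wild forecasts. The sketch below fits a much smaller model, with the variables named directly in the formula and supplied through data =, so that predict() can match them in newdata. The choice of a linear trend plus a single 545-second harmonic is purely an assumption for illustration, as are the names sea, train, test and fit_small.

```r
# The 50 readings from the question, 10 seconds apart
timeSec <- (1681:1730) * 10
water <- c(253, 254, 253, 250, 250, 252, 250, 247, 246, 245, 242, 241, 242,
           244, 245, 247, 247, 249, 250, 250, 249, 249, 250, 249, 246, 246,
           248, 248, 245, 247, 251, 250, 251, 255, 256, 256, 257, 259, 257,
           256, 260, 260, 257, 260, 261, 258, 256, 256, 258, 258)

# Shift time so it starts at 0; this keeps the regressors well scaled
sea <- data.frame(t = timeSec - min(timeSec), water = water)
train <- sea[1:20, ]    # fit range
test  <- sea[21:50, ]   # forecast range

# Hypothetical reduced model: linear trend plus one harmonic pair
fit_small <- lm(water ~ t + sin(2 * pi / 545 * t) + cos(2 * pi / 545 * t),
                data = train)

# newdata only needs a column called t, matching the formula
pred_small <- predict(fit_small, newdata = test,
                      interval = "prediction", level = 0.95)
print(head(pred_small))
```

Comparing pred_small with test$water gives a quick sanity check on the scale of the forecast before moving on to a richer model.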

Resources