Sequentially re-ordering sections of a vector around NA values - r

I have a large set of data that I want to reorder in groups of twelve using the sample() function in R to generate randomised data sets with which I can carry out a permutation test. However, this data has NA characters where data could not be collected and I would like them to stay in their respective original positions when the data is shuffled.
With help on a previous question I have managed to shuffle the data around the NA values for a single vector of 12 values with the code:
example.data <- c(0.33, 0.12, NA, 0.25, 0.47, 0.83, 0.90, 0.64, NA, NA, 1.00, 0.42)
example.data[!is.na(example.data)] <- sample(example.data[!is.na(example.data)], replace = F, prob = NULL)
[1] 0.64 0.83 NA 0.33 0.47 0.90 0.25 0.12 NA NA 0.42 1.00
Extending from this, if I have a set of data with a length of 24 how would I go about re-ordering the first and second set of 12 values as individual cases in a loop?
For example, a vector extending from the first example:
example.data <- c(0.33, 0.12, NA, 0.25, 0.47, 0.83, 0.90, 0.64, NA, NA, 1.00, 0.42, 0.73, NA, 0.56, 0.12, 1.0, 0.47, NA, 0.62, NA, 0.98, NA, 0.05)
Where example.data[1:12] and example.data[13:24] are shuffled separately within their own respective groups around their NA values.
The code I am trying to work this solution into is as follows:
shuffle.data = function(input.data, nr, ns){
  simdata <- input.data
  for(i in 1:nr){
    start.row <- (ns*(i-1)) + 1
    end.row <- start.row + actual.length[i] - 1
    newdata <- sample(input.data[start.row:end.row], size = actual.length[i], replace = FALSE)
    simdata[start.row:end.row] <- newdata
  }
  return(simdata)
}
Where input.data is the raw input data (example.data); nr is the number of groups (2); ns is the size of each sample (12); and actual.length is the length of each group excluding NAs, stored in a vector (actual.length <- c(9, 8) for the example above).
Would anyone know how to go about achieving this?
Thank you again for your help!

I agree with Gregor's comment that it may be a better approach to work with the data in another form. However, what you need to accomplish can still be done easily enough even if all the data is in one vector.
First make a function that shuffles only non-NA values of an entire vector:
shuffle_real <- function(data){
  # Sample from only the non-NA values,
  # and store the result only in indices of non-NA values
  data[!is.na(data)] <- sample(data[!is.na(data)])
  # Then return the shuffled data
  return(data)
}
Now write a function that takes in a larger vector, and applies this function to each group in the vector:
shuffle_groups <- function(data, groupsize){
  # It will be convenient to store the length of the data vector
  N <- length(data)
  # Do a sanity check to make sure there's a match between N and groupsize
  if ( N %% groupsize != 0 ) {
    stop('The length of the data is not a multiple of the group size.',
         call. = FALSE)
  }
  # Get the index of every first element of a new group
  starts <- seq(from = 1, to = N, by = groupsize)
  # For every segment of the data of length 'groupsize',
  # apply shuffle_real to it;
  # note the use of c() -- otherwise a matrix would be returned,
  # where each column is one group of length 'groupsize'
  # (which I note because that may be more convenient)
  return(c(sapply(starts, function(x) shuffle_real(data[x:(x+groupsize-1)]))))
}
For example,
example.data <- c(0.33, 0.12, NA, 0.25, 0.47, 0.83, 0.90, 0.64, NA, NA, 1.00,
0.42, 0.73, NA, 0.56, 0.12, 1.0, 0.47, NA, 0.62, NA, 0.98,
NA, 0.05)
set.seed(1234)
shuffle_groups(example.data, 12)
which results in
> shuffle_groups(example.data, 12)
[1] 0.12 0.83 NA 1.00 0.47 0.64 0.25 0.33 NA NA 0.90 0.42 0.47 NA
[15] 0.05 1.00 0.56 0.62 NA 0.73 NA 0.98 NA 0.12
or try shuffle_groups(example.data[1:23], 12), which results in Error: The length of the data is not a multiple of the group size.
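As an aside on Gregor's point about keeping the data in another form: if each value carries an explicit group label, base R's ave() applies shuffle_real group-wise in a single call. A minimal sketch (the rep()-based labelling is my assumption about how the groups would be tagged):
groups <- rep(seq_len(length(example.data) / 12), each = 12)
shuffled <- ave(example.data, groups, FUN = shuffle_real)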

Related

Recursive regression using R

I would like to run a recursive regression using my variables residential_ddiff and interest_diff, thus testing the stability of the coefficients of interest_diff.
The issue occurs because I want the recursive regression to be run on the first lag over the window of [i:(i+10)] observations, but I keep getting the same error:
Error in merge.zoo(residential_ddiff[i], L(interest_diff, 1)[i:(i + 10)], :
all(sapply(args, function(x) is.zoo(x) || !is.plain(x) || (is.plain(x) && .... is not TRUE
Both time series are n=91 and stored as ts objects. I've tried using both ts and numerical objects in my loop.
I've attached a screenshot of my code.
Grateful for any help, thank you.
I've tried lots of different options, both coercing with as.zoo() and defining the current observation residential_ddiff[i] directly, but the same error keeps occurring.
Reproducible example:
library(dynlm)
library(dplyr)  # for mutate_all()
library(zoo)    # for as.zoo()
# Creating two datasets
data_1 <- c(3.2705, -1.9710, 1.3821, 1.3194, -0.8008, 0.2832, 3.2705, -1.9710, 1.3821, 1.3194, -0.8008, 0.2832, 3.2705, -1.9710, 1.3821, 1.3194, -0.8008, 0.2832, 3.2705, -1.9710, 1.3821, 1.3194, -0.8008, 0.2832)
data_2 <- c(1.41, 0.33, 0.32, 1.53, -1.55, 0.73, 1.41, 0.33, 0.32, 1.53, -1.55, 0.73, 1.41, 0.33, 0.32, 1.53, -1.55, 0.73, 1.41, 0.33, 0.32, 1.53, -1.55, 0.73)
# Storing the data in a dataframe
df <- data.frame(data_1, data_2)
# Making sure the dataframe is numeric
df1 <- mutate_all(df, function(x) as.numeric(as.character(x)))
# Creating a variable to store the coefficients
estimate.store <- matrix(ncol = 6, nrow = nrow(df1) - 3)
# Loop begins
for (i in 1:(nrow(df1) - 3)) {
  # Creating a dynamic linear regression with data_1 and the recursive values
  # i:(i+3) for the first lag of data_2. Here the main issue arises, I think.
  estimation.store <- dynlm(df1$data_1[i] ~ L(df1$data_2, 1)[i:(i + 3)], as.zoo(df1))
  estimate.store[i, ] <- c(estimation.store$coef[1], confint(estimation.store)[1, ],
                           estimation.store$coef[2], confint(estimation.store)[2, ])
}
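For what it's worth, the merge.zoo error suggests dynlm is being handed a plain subsetted vector (df1$data_1[i] is a single number) rather than aligned series. A hedged sketch of one way to get rolling-window estimates with plain lm() instead (the lag construction and the 4-observation window are my assumptions about the intent):
lag_2 <- c(NA, head(df1$data_2, -1))   # first lag of data_2
est <- t(sapply(1:(nrow(df1) - 3), function(i) {
  idx <- i:(i + 3)
  fit <- lm(df1$data_1[idx] ~ lag_2[idx])
  c(coef(fit), confint(fit)[2, ])      # intercept, slope, and the slope's CI
}))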

Find indices of slope changes in a vector in R

I have a data frame with two columns: (1) datetimes and (2) streamflow values. I would like to create a 3rd column with indicator values to find sudden increases (usually a 0 but it is a 1 when the streamflow shows a big increase).
library(data.table)
datetime <- as.POSIXct(c(1557439200, 1557440100, 1557441000, 1557441900, 1557442800,
                         1557443700, 1557444600, 1557445500, 1557446400, 1557447300,
                         1557448200, 1557449100, 1557450000, 1557450900, 1557451800,
                         1557452700, 1557453600, 1557454500, 1557455400, 1557456300,
                         1557457200, 1557458100, 1557459000), origin = "1970-01-01")
streamflow <- c(0.35, 0.35, 0.36, 0.54, 1.0, 2.7, 8.4, 9.3, 6.2, 3.8, 4.7,
                2.91, 2.01, 1.65, 1.41, 1.12, 0.95, 0.62, 0.52, 0.53, 0.53, 0.44, 0.35)
data <- data.table(as.POSIXct(datetime), as.numeric(streamflow))
I am trying to create a function that would identify the datetime where the streamflow jumps from 0.54 to 1.0, because that is when the event starts. It would then stop flagging an event once the streamflow goes below a certain threshold.
My current idea is a function that compares the local slope between two consecutive points of streamflow to the slope across all values of streamflow within some window, but I don't really know how to write that. Or maybe there is a better idea for how to do what I am trying to do.
data = data[, delta := (V2 - shift(V2)) / shift(V2)][
  , ind_jump := delta > 0.5
]
indices <- data[ind_jump == TRUE, V1]
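Building on the delta column above, here is one hedged sketch of the start/stop logic (both cutoffs, 0.5 for the relative jump and 0.5 for the recession threshold, are assumptions to tune):
data[, event := 0L]
in_event <- FALSE
for (k in seq_len(nrow(data))) {
  if (!in_event && isTRUE(data$delta[k] > 0.5)) in_event <- TRUE   # event starts on a big jump
  if (in_event && data$V2[k] < 0.5) in_event <- FALSE              # event ends once flow recedes
  set(data, i = k, j = "event", value = as.integer(in_event))
}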
Not related to this, but for some weird reason R gives
(0.54 - 0.36)/0.36 > 0.5
[1] TRUE
while
0.18/0.36 > 0.5
[1] FALSE
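That is ordinary floating-point rounding (R FAQ 7.31), not a bug: 0.54 - 0.36 is stored as a binary double fractionally above 0.18, while 0.18/0.36 happens to be exactly 0.5.
print(0.54 - 0.36, digits = 17)
# [1] 0.18000000000000005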

Continuously calculate the value of an initial investment from its return vector

I'd need some help creating a vector that contains the value of an investment at every point in time.
Imagine I have the return (in %) of a single stock for 10 consecutive months. I start with an initial value of $100 and consecutively multiply the return of period t with the value of the investment in period t-1. The output must be a vector because I want to plot the results.
Unfortunately, I have no idea how to code this - probably it's a for loop?
The Monthly return:
c(-0.09, -0.11, -0.2, -0.45, -0.11, 0.2, -0.27, -0.15, -0.24,
0.16)
Value of the investment, respectively:
100*(1+(-0.09)) = 91
91*(1+(-0.11)) = 80.99
...
Desired Output Vector:
c(91, 80.99, 64.792, …)
I'm not quite sure how to compute this vector with a loop, a function or some other method.
I'm very glad about any help! Cheers!
r <- c(-0.09, -0.11, -0.2, -0.45, -0.11, 0.2, -0.27, -0.15, -0.24,
0.16)
100*cumprod(1 + r)
# [1] 91.00000 80.99000 64.79200 35.63560 31.71568 38.05882 27.78294 23.61550 17.94778 20.81942
or
Reduce(function(x, r) x*(1 + r), r, init = 100, accumulate = T)
# [1] 100.00000 91.00000 80.99000 64.79200 35.63560 31.71568 38.05882 27.78294 23.61550
# [10] 17.94778 20.81942
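Since the stated goal is a plot, either result can go straight into plot(), e.g.:
value <- 100 * cumprod(1 + r)
plot(value, type = "l", xlab = "Month", ylab = "Investment value ($)")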

Optimisation of matrix in R

I'm new to optimisation/calibration of models in R, but I'm eager to learn and really need some help. My question relates to demographic modelling.
I've done some research and found help here and here, but neither has quite answered my question.
I have a matrix of scalars (propensities) where each column must total to 1. These propensities are used to estimate the number of households that would arise from a given population (persons by age). The propensities model tends to overestimate the number of households in history (for which I know the true number of households).
I want to calibrate the model to minimise the error in the number of households by tweaking the propensities such that the columns still add to 1 and propensities with an initial value of zero must remain zero.
Simple example:
# Propensities matrix
mtx <- matrix(c(0.00, 0.00, 0.85, 0.00, 0.15, 0.35, 0.45, 0.00,
0.20, 0.00, 0.65, 0.15, 0.00, 0.20, 0.00), ncol = 3)
# Population by age cohort
pop <- c(2600, 16200, 13400)
# True number of households
target <- c(7000, 4500, 5500)
# Function to optimise
hh <- function(mtx, pop, target) {
  # Estimate living arrangements
  x <- mtx %*% pop
  # Estimate the number of households using parent cohorts (1, 2 and 4)
  x <- c(x[1,1]/2, x[2,1]/2, x[4,1]) - target
  return(x)
}
I haven't included any of my code for the optimisation/calibration step as it would be embarrassing and I haven't been able to get anything to work!
Ideally I will have one set of propensities that generalises well for lots of different regions at the end of this process. Any advice on how I should go about achieving it? Helpful links?
Update
The snippet of code below executes the local search method as suggested by Enrico.
library(tidyverse)
library(NMOF)
data <- list(mtx = matrix(c(0.00, 0.00, 0.90, 0.00, 0.10, 0.25, 0.50, 0.00,
0.25, 0.00, 0.60, 0.20, 0.00, 0.20, 0.00), ncol = 3),
pop = c(2600, 16200, 13400),
target = c(7190, 4650, 5920))
# True mtx
mtx.true <- matrix(c(0.00, 0.00, 0.75, 0.00, 0.25, 0.35, 0.45, 0.00,
0.20, 0.00, 0.65, 0.15, 0.00, 0.20, 0.00), ncol = 3)
# Function to optimise
households <- function(x, data) {
  # Estimate living arrangements
  z <- x %*% data$pop
  # Estimate the number of households using parent cohorts (1, 2 and 4)
  z <- c(z[1,1]/2, z[2,1]/2, z[4,1]) - data$target
  sum(abs(z))
}
# Local search function to perturb propensities
neighbour <- function(x, data) {
  # Choose a random column from mtx
  i <- sample(1:ncol(x), 1)
  # Select two non-zero propensities from that mtx column
  j <- which(x[, i] != 0) %>% sample(2, replace = FALSE)
  # Randomly select one to perturb positively
  x[j[1], i] <- 0.1 * (1 - x[j[1], i]) + x[j[1], i]
  # Perturb the second propensity so the mtx column still adds to 1
  x[j[2], i] <- x[j[2], i] + (1 - sum(x[, i]))
  x
}
# Local search algorithm inputs
localsearch <- list(x0 = data$mtx,
neighbour = neighbour,
nS = 50000,
printBar = FALSE)
# Execute
now <- Sys.time()
solution <- LSopt(OF = households, algo = localsearch, data)
#>
#> Local Search.
#> Initial solution: 2695
#> Finished.
#> Best solution overall: 425.25
Sys.time() - now
#> Time difference of 6.33272 secs
# Inspect propensity matrices
print(solution$xbest)
#> [,1] [,2] [,3]
#> [1,] 0.0000000 0.3925 0.6
#> [2,] 0.0000000 0.4250 0.2
#> [3,] 0.2937976 0.0000 0.0
#> [4,] 0.0000000 0.1825 0.2
#> [5,] 0.7062024 0.0000 0.0
print(mtx.true)
#> [,1] [,2] [,3]
#> [1,] 0.00 0.35 0.65
#> [2,] 0.00 0.45 0.15
#> [3,] 0.75 0.00 0.00
#> [4,] 0.00 0.20 0.20
#> [5,] 0.25 0.00 0.00
Thanks!
I can only comment on the optimisation part.
The code you have provided is mostly sufficient; the only issue is that your objective function evaluates to a vector. You will need to transform this vector into a single number to be minimised, such as the sum of squares or the sum of absolute values.
When it comes to methods, I would try heuristics; in fact, I would try a Local-Search method. These methods operate on the solution through functions which you define; thus, you may code your solution as a matrix. More specifically, you would need two functions: the objective function (which you essentially have) and a neighbourhood function, which takes a solution as input and modifies it. In your particular case, it could take a matrix, select two non-zero elements from one column, and increase one and decrease the other. Thus, the column sum would remain unchanged.
Perhaps the tutorial http://enricoschumann.net/files/NMOF_Rmetrics2012.pdf is of interest, with R code http://enricoschumann.net/files/NMOF_Rmetrics2012.R .
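For instance, a scalarised wrapper around the hh() function from the question could look like this (using the sum of absolute values, as the Update above does; the sum of squares works the same way):
hh_scalar <- function(mtx, pop, target) sum(abs(hh(mtx, pop, target)))
hh_scalar(mtx, pop, target)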

r - The confusing details of findCorrelation() (caret package) when setting exact = TRUE

Following the findCorrelation() documentation, I ran the official example shown below:
Code:
library(caret)
R1 <- structure(c(1, 0.86, 0.56, 0.32, 0.85, 0.86, 1, 0.01, 0.74, 0.32,
0.56, 0.01, 1, 0.65, 0.91, 0.32, 0.74, 0.65, 1, 0.36,
0.85, 0.32, 0.91, 0.36, 1),
.Dim = c(5L, 5L))
colnames(R1) <- rownames(R1) <- paste0("x", 1:ncol(R1))
findCorrelation(R1, cutoff = .6, exact = TRUE, names = TRUE,
                verbose = TRUE)
Result:
> findCorrelation(R1, cutoff = .6, exact = TRUE, names = TRUE, verbose = TRUE)
## Compare row 1 and column 5 with corr 0.85
## Means: 0.648 vs 0.545 so flagging column 1
## Compare row 5 and column 3 with corr 0.91
## Means: 0.53 vs 0.49 so flagging column 5
## Compare row 3 and column 4 with corr 0.65
## Means: 0.33 vs 0.352 so flagging column 4
## All correlations <= 0.6
## [1] "x1" "x5" "x4"
I have no idea how the computation process works, i.e. why row 1 and column 5 are compared first, or how the means are calculated, even after reading the source file.
I hope someone can explain the algorithm using my example.
First, it determines the average absolute correlation for each variable. Columns x1 and x5 have the highest averages (mean(c(0.86, 0.56, 0.32, 0.85)) and mean(c(0.85, 0.32, 0.91, 0.36)) respectively), so it looks to remove one of these in the first step. It finds x1 to be the more globally offensive, so it removes it.
After that, it recomputes the means and compares x5 and x3 using the same process.
It stops after removing three columns, since all remaining pairwise correlations are below your threshold.
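You can reproduce that first step by hand (an illustration of the bookkeeping, not caret's exact internals):
absCorr <- abs(R1)
diag(absCorr) <- NA
round(colMeans(absCorr, na.rm = TRUE), 4)
#     x1     x2     x3     x4     x5
# 0.6475 0.4825 0.5325 0.5175 0.6100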
