R: Extrapolating x no. of values beyond known values

I'm looking for a function/method to extrapolate (linearly) for an x number of values beyond the original values.
Let's say I start with:
a <- c(NA, NA, NA, NA, NA, NA, 1, 2, 3, NA, NA, NA, NA, NA, NA)
And I want to extrapolate two values beyond, I would end up with:
[1] NA NA NA NA -1 0 1 2 3 4 5 NA NA NA NA
What I found so far is the approxExtrap function from Hmisc (https://rdrr.io/cran/Hmisc/man/approxExtrap.html). But since you have to define 'xout', I feel that I have to write a loop and every time select pieces I want to extrapolate on. This is possible of course, but ultimately I expect to have sequences of millions of datapoints with a lot of gaps, so I feel this may be too time consuming. So I hope I'm overlooking a simpler solution.
Added: There are no small gaps in the data, but typically ~ 100 NA's and then ~ 40 datapoints. I would like to extrapolate/extend the 40 datapoints with 5 new datapoints before the start and after the end of the 40 datapoints and replace 5 NA's at both locations. It is not possible to interpolate between two sequences of 40 datapoints.

I managed to solve the problem by:
Determining the ranges of the different series of data
Define the range I want to extrapolate to
Do the actual extrapolation through the Hmisc package
Initially, I thought I could only manage this by some loops that had to go through the raw data row by row, and was hoping for an existing function.
I'm sure many of you would have coded this more efficiently and elegantly, but I wanted to post my script anyway for people with a similar problem.
require(Hmisc)
extrapol.length <- 5
test <- data.frame('Time' = c(1:100), # I didn't use this as my data was equally spaced in time; if you want to use it, see the first argument of the approxExtrap function in the second-to-last line
'x' = c(rep(NA, 10), 1:30, rep(NA, 30), 1:10, rep(NA, 20)))
## Determine start and end of the continuous (non-NA) data streams
length.values <- diff(c(0, which(is.na(test[,2]))))-2 # length non-NA's
length.values <- length.values[length.values > -1]
length.nas <- diff(c(0, which(!is.na(test[,2])))) # length NA's
length.nas <- length.nas[length.nas > 1]
if(is.na(test[1,2])){
# data starts with NA
length.nas <- data.frame('Order' = seq(1, length(length.nas)*2, by = 2),
'Length' = length.nas, 'Type' = 'na')
length.values <- data.frame('Order' = seq(2, length(length.values)*2, by = 2),
'Length' = length.values, 'Type' = 'value')
start.end <- rbind(length.nas, length.values)
start.end <- start.end[order(start.end$Order),]
value.seqs <- data.frame('no' = c(1:length(start.end$Type[start.end$Type == 'na'])),
'start' = NA, 'end' = NA)
for(a in value.seqs$no){
value.seqs$start[a] <- sum(start.end$Length[1:((a*2)-1)])
value.seqs$end[a] <- sum(start.end$Length[1:(a*2)])
}
}else{
# Data starts with actual values
length.nas <- data.frame('Order' = seq(2, length(length.nas)*2, by = 2),
'Length' = length.nas, 'Type' = 'na')
length.values <- data.frame('Order' = seq(1, length(length.values)*2, by = 2),
'Length' = length.values, 'Type' = 'value')
start.end <- rbind(length.nas, length.values)
start.end <- start.end[order(start.end$Order),]
value.seqs <- data.frame('no' = c(1:length(start.end$Type[start.end$Type == 'value'])),
'start' = c(1,rep(NA, (length(start.end$Type[start.end$Type == 'value'])-1))), 'end' = NA)
for(a in value.seqs$no){
value.seqs$end[a] <- sum(start.end$Length[1:((a*2)-1)])+1
if(a < max(value.seqs$no))
value.seqs$start[a+1] <- sum(start.end$Length[1:(a*2)])+1
}
}
## Do not extrapolate outside of the time-range of the original dataframe
value.seqs$start.extr <- value.seqs$start - extrapol.length
value.seqs$start.extr[value.seqs$start.extr < 1] <- 1 # do not extrapolate below time < 1
value.seqs$end.extr <- value.seqs$end + extrapol.length
value.seqs$end.extr[value.seqs$end.extr > nrow(test) | is.na(value.seqs$end.extr)] <- nrow(test)
value.seqs$end[is.na(value.seqs$end)] <- max(which(!is.na(test[,2])))
## Extrapolate
for(b in value.seqs$no){
test[c(value.seqs$start.extr[b]:value.seqs$end.extr[b]),3] <- approxExtrap(value.seqs$start[b]:value.seqs$end[b],test[c(value.seqs$start[b]:value.seqs$end[b]),2],xout=c(value.seqs$start.extr[b]:value.seqs$end.extr[b]))[2]
}
Thanks for thinking along!
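For anyone who wants to avoid the bookkeeping above, here is a minimal base-R sketch of the same idea using rle(). It is not the approxExtrap approach: it extends each non-NA run with the slope of its first two and last two points. The name extend_runs is hypothetical, and the sketch assumes every run has at least two points and that gaps are longer than 2*k, so extensions never overwrite a neighbouring run (true for the ~100-NA gaps described in the question). The loop runs once per run, not once per row.

```r
# Hypothetical helper: extend every non-NA run in `a` by `k` linearly
# extrapolated values on each side (assumes gaps are longer than 2 * k).
extend_runs <- function(a, k) {
  r <- rle(!is.na(a))
  ends <- cumsum(r$lengths)          # last index of each run
  starts <- ends - r$lengths + 1     # first index of each run
  out <- a
  for (i in which(r$values)) {       # loop over runs of real data only
    s <- starts[i]
    e <- ends[i]
    if (e - s < 1) next              # a single point has no slope
    lo <- max(1, s - k)              # do not go below index 1
    hi <- min(length(a), e + k)      # do not go past the end
    if (lo < s) {
      idx <- lo:(s - 1)
      out[idx] <- a[s] + (a[s + 1] - a[s]) * (idx - s)
    }
    if (hi > e) {
      idx <- (e + 1):hi
      out[idx] <- a[e] + (a[e] - a[e - 1]) * (idx - e)
    }
  }
  out
}

a <- c(NA, NA, NA, NA, NA, NA, 1, 2, 3, NA, NA, NA, NA, NA, NA)
res <- extend_runs(a, 2)
print(res)
```

On the small example from the question this reproduces the desired result (NA NA NA NA -1 0 1 2 3 4 5 NA NA NA NA).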

Related

R: How to access a 'complicated list'

I am working on an assignment, which tasks me to generate a list of data, using the below code.
##Use the make_data function to generate 25 different datasets, with mu_1 being a vector
library(caret) # for createDataPartition()
library(dplyr) # for %>% and slice()
x <- seq(0, 3, len=25)
make_data <- function(a){
n = 1000
p = 0.5
mu_0 = 0
mu_1=a
sigma_0 = 1
sigma_1 = 1
y <- rbinom(n, 1, p)
f_0 <- rnorm(n, mu_0, sigma_0)
f_1 <- rnorm(n, mu_1, sigma_1)
x <- ifelse(y == 1, f_1, f_0)
test_index <- createDataPartition(y, times = 1, p = 0.5, list = FALSE)
list(train = data.frame(x = x, y = as.factor(y)) %>% slice(-test_index),
test = data.frame(x = x, y = as.factor(y)) %>% slice(test_index))
}
dat <- sapply(x,make_data)
The code looks good to go, and 'dat' appears to be a 25 column, 2 row table, each with its own data frame.
Now, each data frame within a cell has 2 columns.
And this is where I get stuck.
While I can get to the data frame in row 1, column 1, just fine (i.e. just use dat[1,1]), I can't reach the column of 'x' values within dat[1,1]. I've experimented with
dat[1,1]$x
dat[1,1][1]
But they only throw weird responses: error/null.
Any idea how I can pull the column? Thanks.
dat[1, 1] is a list.
class(dat[1, 1])
#[1] "list"
So to reach to x you can do
dat[1, 1]$train$x
Or
dat[1, 1][[1]]$x
As a sidenote, instead of having this 25 X 2 matrix as output in dat I would actually prefer to have a nested list.
dat <- lapply(x,make_data)
#Access `x` column of first list from `train` dataset.
dat[[1]]$train$x
However, this is quite subjective and you can choose whichever format you like best.
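To see the difference between [ and [[ on such a list matrix without needing caret or dplyr, here is a small self-contained sketch. make_pair is a hypothetical stand-in for make_data that returns two tiny data frames; the point is only that sapply() turns a list of two-element lists into a 2-row matrix of mode list.

```r
# Hypothetical stand-in for make_data(): returns a list of two data frames
make_pair <- function(a) {
  list(train = data.frame(x = a + 0:2),
       test  = data.frame(x = a - 0:2))
}

m <- sapply(1:3, make_pair)   # 2 x 3 matrix whose cells are data frames

class(m[1, 1])                # single bracket: still a list wrapping the cell
train1 <- m[[1, 1]]           # double bracket: the data frame itself
print(train1$x)

l <- lapply(1:3, make_pair)   # nested-list alternative
print(l[[1]]$train$x)
```

With [[ the extra list layer is gone, so $x works directly on the result.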

rollapply for moving average with non-business day

I'd like to compute a moving average on data that has NAs in the middle, like below.
library(xts)
date <- seq.Date(as.Date("2018-07-02"),as.Date("2018-07-14"),by = "days")
A <- c(100,110,120,130,140,NA,NA,150,160,170,180,190,200)
B <- c(200,220,240,260,280,NA,NA,300,320,340,360,380,400)
C <- c(150,160,170,180,190,200,210,NA,NA,220,230,240,250)
dataset <- data.frame(A,B,C)
dataset <- as.xts(dataset, order.by = date)
If I use rollapply like below to get 3-day MovingAverage...
y <- rollapply(dataset, width = 3, function(x) mean(x, na.rm = TRUE ))
This is not what I want.
For example, for the moving average of A at "2018-07-09", the result is (NA+NA+150)/1 = 150. But I want to get (130+140+150)/3 = 140.
How can I do that?
I assume you want NAs to stay as NA and otherwise to take the mean of the last 3 non-NAs.
1) Take 5 elements at a time and if the last element is NA then return NA; otherwise, remove the NAs and take the mean of the last 3. Note that this does imply that the first 4 rows will be NA.
mean_bus <- function(x) if (is.na(tail(x, 1))) NA else mean(tail(na.omit(x), 3))
y1 <- rollapplyr(dataset, width = 5, mean_bus)
2) An alternate would be to take the last 3 non-NAs and then overwrite that with NAs in all positions where the input is NA.
mean_omit <- function(x) mean(tail(na.omit(x), 3))
y <- rollapplyr(dataset, 5, mean_omit)
y2 <- replace(y, is.na(dataset), NA)
all.equal(y1, y2)
## [1] TRUE
3) If you prefer to fill in the first 4 rows with partial values then convert to zoo and use the partial= argument of rollapplyr.zoo. mean_bus is from (1).
y3 <- as.xts(rollapplyr(as.zoo(dataset), 5, mean_bus, partial = TRUE))
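To check by hand what mean_bus does on one column, here is a base-R sketch that slides a right-aligned window of 5 over A without xts or zoo. The width of 5 assumes there are never more than two consecutive NAs, as in the example data; roll_A is just an illustrative name.

```r
# mean_bus as defined in (1): NA if the current day is NA,
# otherwise the mean of the last 3 non-NA values in the window
mean_bus <- function(x) if (is.na(tail(x, 1))) NA else mean(tail(na.omit(x), 3))

A <- c(100, 110, 120, 130, 140, NA, NA, 150, 160, 170, 180, 190, 200)

# Right-aligned window of width 5; the first 4 positions have no full window
roll_A <- sapply(seq_along(A), function(i) {
  if (i < 5) NA else mean_bus(A[(i - 4):i])
})
print(roll_A)
```

Position 8 corresponds to 2018-07-09, where the window is 130, 140, NA, NA, 150 and the result is (130+140+150)/3 = 140, which is the value asked for in the question.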
You could either remove the NAs in each series before you compute the moving average (MA).
Or you use a larger window and keep only the last three values for the MA.
y <- rollapply(dataset, width = 5,
function(x) {mean(tail(x[ !is.na(x) ], 3))})

Mice: partial imputation using where argument failing

I have run into a problem using the mice function for multiple imputation. I want to impute only part of the missing data, which, judging by the help page, seems possible and straightforward. But I can't get it to work.
here is the example:
I have some missing data on x and y:
library(mice)
plouf <- data.frame(ID = rep(LETTERS[1:10],each = 10), x = sample(10,100,replace = T), y = sample(10,100,replace = T))
plouf[sample(100,10),c("x","y")] <- NA
I want only to impute missing data on y:
where <- data.frame(ID = rep(FALSE,100),x = rep(FALSE,100),y = is.na(plouf$y))
I do the imputation
plouf.imp <- mice(plouf, m = 1,method="pmm",maxit=5,where = where)
I look at the imputed values:
test <- complete(plouf.imp)
Here i still have NAs on y:
> sum(is.na(test$y))
[1] 10
if I use where to say to impute on all values, it works:
where <- data.frame(ID = rep(FALSE,100),x = is.na(plouf$x),y = is.na(plouf$y))
plouf.imp <- mice(plouf, m = 1,method="pmm",maxit=5,where = where)
test <- complete(plouf.imp)
> sum(is.na(test$y))
[1] 0
But it imputes x too, which I don't want in this specific case (for speed reasons in a statistical simulation study).
Has anyone any idea ?
This is happening because of the code below:
plouf[sample(100,10),c("x","y")] <- NA
Let's consider your first case, where you want to impute y only. Check its PredictorMatrix:
plouf.imp <- mice(plouf, m = 1, method="pmm", maxit=5, where = whr)
plouf.imp
#PredictorMatrix:
# ID x y
#ID 0 0 0
#x 0 0 0
#y 1 1 0
This says that y's missing values will be predicted from ID and x, since both have the value 1 in row y.
Now check your sample data, where you set x and y to NA in the same rows: wherever y is NA, x is NA as well.
So when mice uses the PredictorMatrix to impute y, it encounters NA in x and skips those rows, because all predictors (i.e. ID and x) are expected to be non-missing in order to predict the missing values in y.
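A quick way to confirm this overlap before calling mice is to count the rows where x and y are missing together. This is only a base-R diagnostic sketch built on the data construction from the question (the seed and the name both_missing are my own additions); it does not call mice at all.

```r
set.seed(1)  # arbitrary seed, only for reproducibility of the sketch
plouf <- data.frame(ID = rep(LETTERS[1:10], each = 10),
                    x = sample(10, 100, replace = TRUE),
                    y = sample(10, 100, replace = TRUE))
# Same construction as in the question: x and y go missing in the same rows
plouf[sample(100, 10), c("x", "y")] <- NA

both_missing <- is.na(plouf$x) & is.na(plouf$y)
n_both <- sum(both_missing)                           # y missing with x missing
n_y_only <- sum(is.na(plouf$y) & !is.na(plouf$x))     # y missing with x observed
print(c(both = n_both, y_only = n_y_only))
```

Here every missing y sits in a row where x is missing too, so no missing y has a complete set of predictors.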
Try this -
library(mice)
#sample data
set.seed(123)
plouf <- data.frame(ID = rep(LETTERS[1:10],each = 10), x = sample(10,100,replace = T), y = sample(10,100,replace = T))
plouf[sample(100,10), "x"] <- NA
set.seed(999)
plouf[sample(100,10), "y"] <- NA
#missing value imputation
whr <- data.frame(ID = rep(FALSE,100), x = rep(FALSE,100), y = is.na(plouf$y))
plouf.imp <- mice(plouf, m = 1, method="pmm", maxit=5, where = whr)
test <- complete(plouf.imp)
sum(is.na(test$y))
#[1] 1
Here only one value of y is left unimputed, and it is in a row where both x and y are NA, i.e. row number 39 (similar to your first case).

How to apply a custom function to each participant in a data frame

I created a function that calculates dPrime. The function takes a data frame as its argument. This works fine, however the columns must be called "stimDiff" and "stimSame", as the function calculates dPrime using these specific names. I would like to apply this function to a data frame that has multiple subjects, and be able to calculate dPrime for each subject, with the result being a new data frame with the dPrime score of each subject. The test data frame looks like this:
stimDiff0 <- c(rep("diff", 20), rep("same", 5))
stimSame0 <- c(rep("diff", 10), rep("same", 15))
stimDiff1 <- c(rep("diff", 10), rep("same", 15))
stimSame1 <- c(rep("diff", 10), rep("same", 15))
stimDiff2 <- c(rep("diff", 19), rep("same", 6))
stimSame2 <- c(rep("diff", 11), rep("same", 14))
stimDiff3 <- c(rep("diff", 21), rep("same", 4))
stimSame3 <- c(rep("diff", 9), rep("same", 16))
stimDiff4 <- c(rep("diff", 18), rep("same", 7))
stimSame4 <- c(rep("diff", 12), rep("same", 13))
stimDiff5 <- c(rep("diff", 22), rep("same", 3))
stimSame5 <- c(rep("diff", 14), rep("same", 11))
stimDiff <- c(stimDiff0, stimDiff1, stimDiff2,
stimDiff3, stimDiff4, stimDiff5)
stimSame <- c(stimSame0, stimSame1, stimSame2,
stimSame3, stimSame4, stimSame5)
subject <- rep(0:5, each = 25)
x <- data.frame(subject = subject, stimDiff = stimDiff, stimSame = stimSame)
I am trying to obtain a data frame of dPrime by subject using the following code:
tapply(c(x$stimDiff, x$stimSame), x$subject, data = x, FUN = dPrime)
I get the following error:
Error in tapply(list(x$stimDiff, x$stimSame), x$subject, data = x, FUN = dPrime) :
arguments must have same length
I am aware of the fact that there are packages that can calculate dPrime. I am doing this in order to learn how to write functions. I would prefer to find a solution using base R.
Here is the code for the function dPrime:
dPrime <- function(x) {
# Calculate number of same, diff and total responses
# for the stimuli that were actually different
stimDiffRdiff <- nrow(x[x$stimDiff == 'diff', ])
stimDiffRsame <- nrow(x[x$stimDiff == 'same', ])
stimDiffTotal <- length(x$stimDiff)
# Calculate number of same, diff and total responses
# for the stimuli that were actually the same
stimSameRdiff <- nrow(x[x$stimSame == 'diff', ])
stimSameRsame <- nrow(x[x$stimSame == 'same', ])
stimSameTotal <- length(x$stimSame)
# Hit rate = the number of correct responses 'diff'
# when the stimuli were actually diff, divided by
# the total number of responses
hitRate <- stimDiffRdiff / stimDiffTotal
# Miss rate = the number of incorrect responses
# 'same' when the stimuli were actually diff
# divided by the total number of responses
missRate <- stimDiffRsame / stimDiffTotal
# False alarm = the number responses 'diff'
# when the stimuli were actually the same
# divided by the total number of responses
falseAlarm <- stimSameRdiff / stimSameTotal
# Correct rejection = the number of responses
# same when the stimuli were actually the same
# divided by the number of total responses
corrReject <- stimSameRsame / stimSameTotal
# Calculate z-critical values for hit rate
# and false alarm rate
zHitRate <- qnorm(hitRate)
zFalseAlarm <- qnorm(falseAlarm)
# Calculate d prime
dPrime <- zHitRate - zFalseAlarm
print(dPrime)
}
To build on @jvcasill's original function and on other users' responses:
dPrime <- function (data, subj = 1, stimDiff = 2, stimSame = 3) {
# dPrime() returns a vector of the length of the number of subjects
#+ in data[, subj] that contains the sensitivity index "d'" for each.
# `data`: data frame
# `subj`: index of "subject" column in `data`; default is 1
# `stimDiff`: index of "stimDiff" column in `data`; default is 2
# `stimSame`: index of "stimSame" column in `data`; default is 3
if (is.data.frame(data)) {
# Divide `data` by subject with split(), as have done others who've
#+ responded to this question
data.by.subj <- split(data, data[, subj])
# Calculate number of subjects and create vector of same length
#+ to return
n.subj <- length(data.by.subj)
dPrime.by.subj <- vector(mode = "double", length = n.subj)
# Loop through "data.by.subj" subject by subject and calculate d'
for (subj in seq_len(n.subj)) {
# For clarity, create temporary data set with data of
#+ current "subj"
data.tmp <- data.by.subj[[subj]]
stimDiffRdiff <- nrow(data.tmp[data.tmp[, stimDiff] == "diff", ])
stimDiffRsame <- nrow(data.tmp[data.tmp[, stimDiff] == "same", ])
stimDiffTotal <- length(data.tmp[, stimDiff])
stimSameRdiff <- nrow(data.tmp[data.tmp[, stimSame] == "diff", ])
stimSameRsame <- nrow(data.tmp[data.tmp[, stimSame] == "same", ])
stimSameTotal <- length(data.tmp[, stimSame])
hitRate <- stimDiffRdiff / stimDiffTotal
missRate <- stimDiffRsame / stimDiffTotal
falseAlarm <- stimSameRdiff / stimSameTotal
# The following appears unused in the original function
# corrReject <- stimSameRsame / stimSameTotal
zHitRate <- qnorm(hitRate)
zFalseAlarm <- qnorm(falseAlarm)
dPrime <- zHitRate - zFalseAlarm
dPrime.by.subj[subj] <- dPrime
}
# For clarity, give each d' value in vector to be returned,
#+ "dPrime.by.subj", name of corresponding subject
names(dPrime.by.subj) <- names(data.by.subj)
return(dPrime.by.subj)
} else stop("'data' is not a data frame")
}
Note that I'm not sure whether the values it returns make sense; for the example data set provided by @jvcasill, they are the same as those obtained with @Splendour's method.
Try data.table (using length function rather than dPrime):
library(data.table)
xt = data.table(x)
xt[,list(len=length(c(stimSame,stimDiff))),by=subject]
subject len
1: 0 50
2: 1 50
3: 2 50
4: 3 50
5: 4 50
6: 5 50
With base R:
sapply(split(x, x$subject), dPrime)
[1] 1.094968
[1] 0
[1] 0.8572718
[1] 1.352917
[1] 0.6329951
[1] 1.024018
0 1 2 3 4 5
1.0949683 0.0000000 0.8572718 1.3529167 0.6329951 1.0240176
The duplicate output comes from the 'print(dPrime)' statement in your dPrime function. You should replace it with a return() call. Better still, since dPrime is also the name of the function, you should rename the variable in the 'dPrime <- zHitRate - zFalseAlarm' statement to something else, say 'ret':
ret <- zHitRate - zFalseAlarm
return(ret)
Here's a (somewhat inelegant) solution in base R:
Split the dataframe into lists, one per subject:
by.subject <- split(x, x$subject)
Calculate dPrime for every chunk, returning a named numeric vector:
dPrime.values <- unlist(lapply(by.subject, dPrime), use.names=T)
Construct a new dataframe:
df <- data.frame(dPrime=dPrime.values)
df$subject <- as.numeric(rownames(df))
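Putting the pieces from the answers together, here is a compact base-R sketch that returns a value instead of printing it and builds the per-subject data frame in one go. d_prime is a hypothetical, shortened version of the original function (it keeps only the hit rate and false alarm rate, which are the only quantities the original actually uses), and the data is cut down to subjects 0 and 1 from the question.

```r
# Hypothetical compact variant: proportion of "diff" responses in each column
d_prime <- function(d) {
  hit_rate    <- mean(d$stimDiff == "diff")   # "diff" responses to different stimuli
  false_alarm <- mean(d$stimSame == "diff")   # "diff" responses to same stimuli
  qnorm(hit_rate) - qnorm(false_alarm)        # returned, not printed
}

# Subjects 0 and 1 from the question's test data
x <- data.frame(
  subject  = rep(0:1, each = 25),
  stimDiff = c(rep("diff", 20), rep("same", 5), rep("diff", 10), rep("same", 15)),
  stimSame = c(rep("diff", 10), rep("same", 15), rep("diff", 10), rep("same", 15))
)

res <- sapply(split(x, x$subject), d_prime)   # named numeric vector
out <- data.frame(subject = names(res), dPrime = unname(res))
print(out)
```

Because the function returns its value, sapply() yields a single named vector with no duplicated printed lines. Note that a hit rate or false alarm rate of exactly 0 or 1 gives an infinite qnorm(), which this sketch does not handle.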

Generate random number in a for loop in R software

I would like to ask how to generate random numbers in a for loop in R software.
I am trying to make a table with two columns, ID and time. Each ID has 7 times: 0,1,2,3,4,5 and the last number has to be random, between 6 and 7.
ID<-data.frame(rep(1:100,each=7))
for (i in unique(ID)){
ID$time <- c(0,1,2,3,4,5, x <-runif(1,6,7), 100)[ID==i]
}
An error message popped up:
Error in `$<-.data.frame`(`*tmp*`, "time", value = c(0, NA, NA, NA, NA, :
replacement has 8 rows, data has 700
You could try using replicate, such as
ID$time <- c(replicate(100, c(0:5, runif(1, 6, 7))))
Although replicate is a wrapper for sapply, which is basically a hidden for loop. Instead, you could also try a vectorized approach, such as:
ID$time <- 0:6
ID[ID$time == 6, "time"] <- runif(100, 6, 7)
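Here is the vectorized idea as a self-contained sketch with a quick sanity check. The column name ID inside the data frame, the seed, and the helper variable last are my own choices for illustration; rep(0:6, times = 100) just spells out the recycling that the shorter version relies on.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible
ID <- data.frame(ID = rep(1:100, each = 7))

# Times 0..6 for every ID, then overwrite each 7th value with a random draw
ID$time <- rep(0:6, times = 100)
last <- ID$time == 6
ID$time[last] <- runif(sum(last), 6, 7)

head(ID, 7)
# Every ID has times 0..5 followed by one value between 6 and 7
range(ID$time[last])
```

All 100 random values are drawn in a single runif() call, so no loop over IDs is needed.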
I think you want this:
set.seed(123)
ID <- data.frame( time = c(0,1,2,3,4,5), x = runif(100,6,7))
You don't want to use '<-' for the arguments to data.frame
But maybe it's this that you want:
ID <- data.frame( time = rep( c(0,1,2,3,4,5,6), each=100) , x = runif(700,6,7))
(it's always a good idea to describe in a natural language what it is that you want.)
