How to bootstrap respecting within-subject information? - r

This is the first time I have posted to this forum, and I want to say from the start that I am not a skilled programmer. So please let me know if the question or code is unclear!
I am trying to get the 95% confidence interval (CI) for an interaction (that is my test statistic) by bootstrapping. I am using the package "boot". My problem is that for every resample, I would like the randomization to be done within subjects, so that observations from different subjects are not mixed. Here is the code to generate a data frame similar to mine. As you can see, I have two within-subjects factors ("Num" and "Gram"), and I am interested in the interaction between them:
Subject = rep(c("S1","S2","S3","S4"),4)
Num = rep(c("singular","plural"),8)
Gram = rep(c("gram","gram","ungram","ungram"),4)
RT = c(657,775,678,895,887,235,645,916,930,768,890,1016,590,978,450,920)
data = data.frame(Subject,Num,Gram,RT)
This is the code I used to get the empirical interaction value:
summary(lm(RT ~ Num*Gram, data=data))
As you can see, the interaction between my two factors is -348. I want to get a bootstrap confidence interval for this statistic, which I can generate using the "boot" package:
# You need the following packages
install.packages("car")
install.packages("MASS")
install.packages("boot")
library("car")
library("MASS")
library("boot")
# Function to compute the statistic to be bootstrapped
boot.huber <- function(data, indices) {
  data <- data[indices, ]                 # select obs. in bootstrap sample
  mod <- lm(RT ~ Num*Gram, data=data)     # refit the model on the resample
  coefficients(mod)                       # return coefficient vector
}
#Generate bootstrap estimate
data.boot <- boot(data, boot.huber, 1999)
#Get confidence interval
boot.ci(data.boot, index=4, type=c("norm", "perc", "bca"),conf=0.95) #4 gets the CI for the interaction
My problem is that I think the resamples should be generated without mixing the observations of different subjects: that is, to generate the new resamples, the observations from subject 1 (S1) should be resampled within subject 1, not mixed with the observations from subject 2, and so on. I don't know how "boot" does the resampling (I read the documentation but don't understand how the function does it).
Does anyone know how I could make sure that the resampling procedure used by "boot" respects subject level information?
Thanks a lot for your help/advice!

Just modify your call to boot() like this:
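# strata must be an integer vector or factor; in R >= 4.0 data$Subject is a character vector, hence factor()
data.boot <- boot(data, boot.huber, 1999, strata=factor(data$Subject))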
?boot provides this description of the strata= argument, which does exactly what you are asking for:
strata: An integer vector or factor specifying the strata for
multi-sample problems. This may be specified for any
simulation, but is ignored when ‘sim = "parametric"’. When
‘strata’ is supplied for a nonparametric bootstrap, the
simulations are done within the specified strata.
Additional note:
To confirm that it's working as you'd like, you can call debugonce(boot), run the call above, and step through the debugger until the object i (whose rows contain the indices used to resample rows of data to create each bootstrap resample) has been assigned, and then have a look at it.
debugonce(boot)
data.boot <- boot(data, boot.huber, 1999, strata=factor(data$Subject))
# Browse[2]>
## [Press return 34 times]
# Browse[2]> head(i)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
# [1,] 9 10 11 16 9 14 15 16 9 2 15 16 1 10
# [2,] 9 14 7 12 5 6 15 4 13 6 11 16 13 6
# [3,] 5 10 15 16 9 6 3 4 1 2 15 12 5 6
# [4,] 5 10 11 4 9 6 15 16 9 14 11 16 5 2
# [5,] 5 10 3 4 1 10 15 16 9 6 3 8 13 14
# [6,] 13 10 3 12 5 10 3 4 5 14 7 16 5 14
# [,15] [,16]
# [1,] 7 8
# [2,] 11 16
# [3,] 3 16
# [4,] 3 8
# [5,] 7 8
# [6,] 7 12
(You can enter Q to leave the debugger at any time.)
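If stepping through the debugger is inconvenient, a rougher check is to bootstrap a throwaway statistic that only counts how many resampled rows belong to each subject. This is a sketch, not part of the original answer: count_subjects is a hypothetical helper, and the data frame is rebuilt from the question so that the block runs on its own. With strata supplied, every replicate should contain exactly four rows per subject.

```r
library(boot)

# Rebuild the example data from the question
Subject <- rep(c("S1", "S2", "S3", "S4"), 4)
Num     <- rep(c("singular", "plural"), 8)
Gram    <- rep(c("gram", "gram", "ungram", "ungram"), 4)
RT      <- c(657, 775, 678, 895, 887, 235, 645, 916,
             930, 768, 890, 1016, 590, 978, 450, 920)
data    <- data.frame(Subject, Num, Gram, RT)

# Hypothetical helper: number of resampled rows per subject
count_subjects <- function(d, indices) {
  subj <- factor(d$Subject[indices], levels = c("S1", "S2", "S3", "S4"))
  as.vector(table(subj))
}

set.seed(1)
chk <- boot(data, count_subjects, R = 200, strata = factor(data$Subject))

# Each row of chk$t is one replicate; each column is one subject
all_within_subject <- all(chk$t == 4)
print(all_within_subject)
```

Dropping the strata argument from the call should make the counts vary from replicate to replicate, which is the mixing across subjects that the question wants to avoid.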

Related

Stacking kernel density graphs vertically

I have a matrix (C) of outputs from a Bayesian model with 3000 rows, which contain the week number (1-13) in which a given bird breeding behavior (columns: singing, incubating, fledglings, etc.) is most likely to occur. I have visualized kernel density estimates for the week in which a behavior is most likely to occur using this code:
G <- mcmc_dens(C, pars = c("Singing", "Building", "Incubating", "Nestlings", "Empty Nest", "Fledglings Observed", "Fledgling/Adult Interactions", "Fledgling Foraging"))
G <- G + theme(axis.title = element_text(face="plain",size=12)) + labs(x ="Week") + scale_x_continuous(breaks = 1:13)
...which produces these figures:
I would like to stack the figures above one another so that I have one figure with the same x-axis where you can easily see which behaviors peak at the same time, but I don't know how to do this with mcmc_dens (i.e. I want the graph for singing to be above building, both singing and building to be above incubating, and so on so that I have eight vertically aligned graphs).
Data sample from matrix C (does not include all columns):
Singing Building Incubating Nestlings Empty Nest
[1,] 8 8 8 8 13
[2,] 8 8 8 11 4
[3,] 9 8 8 12 13
[4,] 5 4 8 11 13
[5,] 9 8 8 8 13
[6,] 9 8 8 8 13
[7,] 5 8 8 11 13
[8,] 9 8 10 11 12
[9,] 9 4 8 10 8
[10,] 5 7 12 10 8
Figured it out! mcmc_dens has the argument facet_args which turns each figure into its own facet (took me so long because I was unfamiliar with facets). Modifying the first line of my original code gave me the figure I was looking for:
pars <- c("Singing", "Building", "Incubating", "Nestlings",
"Empty", "Fledglings", "Interactions", "Foraging")
G <- mcmc_dens(C, pars=pars, facet_args=list(ncol=1, strip.position="left"))
This is what the images look like now:

How to extract the values from a raster in R

I want to use R to extract values from a raster. Basically, my raster has values from 0-6 and I want to extract the corresponding value for every single pixel, so that at the end I have a data table containing those two variables.
Thank you for your help, I hope my explanation is precise enough.
Example data
library(raster)
r <- raster(ncol=5, nrow=5, vals=1:25)
To get all values, you can do
values(r)
# [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
as.matrix(r)
#      [,1] [,2] [,3] [,4] [,5]
# [1,]    1    2    3    4    5
# [2,]    6    7    8    9   10
# [3,]   11   12   13   14   15
# [4,]   16   17   18   19   20
# [5,]   21   22   23   24   25
Also see ?getValues
You can also use indexing
r[2,2]
#7
r[7:8]
#[1] 7 8
For more complex extractions using points, lines or polygons, see ?extract
x is the raster object you are trying to extract values from; y may be a SpatialPoints, SpatialPolygons, SpatialLines or Extent object, or a vector of cell numbers (take a look at ?extract). Your code values_raster <- extract(x = values, df=TRUE) will not work because you are not passing the function any y object/vector.
You could try to build a vector with all the cell numbers of your raster. Imagine your raster has 200 cells. If you do values_raster <- extract(x = values, y=seq(1,200,1), df=TRUE) you'll get a data frame with the value of each cell.
How about simply doing
as.data.frame(s, xy=TRUE) # s is your raster file

Splitting a variable into equally sized groups

I have a continuous variable called Longitude (it corresponds to geographical longitude) that has 12465 unique values. I need to create a new variable called Longitude1024 that consists of the variable Longitude split into 1024 equally sized groups. I did that using the following function:
data$Longitude1024 <- as.factor( as.numeric( cut(data$Longitude,1024)))
However, the problem is that, when I use this function to create the new variable Longitude1024, this new variable consists of only 651 unique elements rather than 1024. Does anyone know what the problem here is and how could I actually get the new variable with 1024 unique values?
Thanks a lot
Use rank, then scale it down. Here's an example with 10 groups:
x <- rnorm(124655)
g <- floor(rank(x) * 10 / (length(x) + 1))
table(g)
# g
# 0 1 2 3 4 5 6 7 8 9
# 12465 12466 12465 12466 12465 12466 12466 12465 12466 12465
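For context, cut(x, 1024) divides the range of x into 1024 intervals of equal width, not equal count, so intervals that contain no observations never appear in the result; that is why fewer than 1024 distinct groups can come out. The rank idea above can be applied to the 1024-group case roughly as follows. This is a sketch under assumptions: longitude is simulated stand-in data with no ties, and ceiling() is used instead of floor() so that groups are numbered 1 to 1024. With tied values, ties.method = "first" would place equal values in different groups, which may or may not be acceptable.

```r
set.seed(42)

# Stand-in for data$Longitude: 12465 distinct values (assumption)
longitude <- runif(12465, min = -180, max = 180)
n_groups  <- 1024

# Rank every value, then scale the ranks onto 1..n_groups
r   <- rank(longitude, ties.method = "first")
grp <- ceiling(r * n_groups / length(longitude))

n_distinct  <- length(unique(grp))   # number of groups actually used
group_sizes <- table(grp)            # observations per group

print(n_distinct)
print(range(group_sizes))
```

Because 12465 is not a multiple of 1024, the groups cannot all have the same size; they differ by at most one observation.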
Short answer: try cut2 from the Hmisc package
Long answer
Example: split dat, which is 1000 unique values, into 100 equal groups of 10.
Doesn't work:
# dummy data
set.seed(321)
dat <- rexp(1000)
# all unique values
length(unique(dat))
[1] 1000
cut generates 100 levels
init_res <- cut(dat, 100)
length(unique(levels(init_res)))
[1] 100
But does not split the data into equally sized groups
init_grps <- split(dat, cut(dat, 100))
table(unlist(lapply(init_grps, length)))
0 1 2 3 4 5 6 7 9 10 11 13 15 17 18 19 22 23 24 25 27 37 38 44 47 50 63 71 72 77
42 9 8 4 1 3 1 3 2 1 2 1 1 1 2 1 1 1 2 2 2 1 1 1 1 1 1 2 1 1
Works with Hmisc::cut2
cut2 divides the vector into groups of equal length, as desired
require(Hmisc)
final_grps <- split(dat, cut2(dat, g=100))
table(unlist(lapply(final_grps, length)))
10
100
If you want, you can store the results in a data frame, for example
foobar <- do.call(rbind, final_grps)
head(foobar)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[0.000611,0.00514) 0.004345915 0.002192086 0.004849693 0.002911516 0.003421753 0.003159641 0.004855366 0.0006111574
[0.005137,0.01392) 0.009178133 0.005137309 0.008347482 0.007072484 0.008732725 0.009379002 0.008818794 0.0110489833
[0.013924,0.02004) 0.014283326 0.014356782 0.013923721 0.014290554 0.014895342 0.017992638 0.015608931 0.0173707930
[0.020041,0.03945) 0.023047527 0.020437743 0.026353839 0.036159321 0.024371834 0.026629812 0.020793695 0.0214221779
[0.039450,0.05912) 0.043379064 0.039450453 0.050806316 0.054778805 0.040093806 0.047228050 0.055058519 0.0446634954
[0.059124,0.07362) 0.069671018 0.059124220 0.063242564 0.064505875 0.072344089 0.067196661 0.065575249 0.0634142853
[,9] [,10]
[0.000611,0.00514) 0.002524557 0.003155055
[0.005137,0.01392) 0.008287758 0.011683228
[0.013924,0.02004) 0.018537469 0.014847937
[0.020041,0.03945) 0.026233400 0.020040981
[0.039450,0.05912) 0.041310471 0.058449603
[0.059124,0.07362) 0.063608022 0.066316782
Hope this helps

Creating large matrices in R in reasonable time

I am working on a movie recommender that predicts a user's rating for an unseen movie. Most of the work is done and I have created a 7000x3000 matrix userRatingsNew containing 7000 users and their ratings for 3000 movies, with all the missing values replaced by the predicted rating.
I was provided two other files, mapping and test, and used read.csv() to load them into matrices of the following format.
mapping is a 8,400,000x3 matrix that contains id, user, movie, where id is basically the transaction id associated with a user's rating of movie x.
test is a 8,400,000x2 matrix that contains id, rating, where rating is the user's rating for that movie associated with id. The values in the rating column are empty and I need to fill those in using the predicted values that I have already calculated.
Here is my code
writeResult <- function(userRatingsNew, mapping, test, writeToFile = FALSE){
start <- Sys.time()
result <- test
entries <- nrow(test)
for (i in 1:entries){
result[i,2] <- userRatingsNew[mapping[i,2], mapping[i,3]]
}
if (writeToFile)
write.csv(result, "result.csv", row.names=FALSE)
print(Sys.time()-start)
return(result)
}
My problem is that for i=1:100, it takes ~7 seconds. So in order to process all 8.4 million entries, it'd take ~163 hours. I tried using doMC() and implemented parallel processing, but I ran into the problem where my computer ran out of memory. What exactly can I do to speed this process up?
You can index a matrix with another matrix, as in:
M <- matrix(1:25,nc=5,nr=5)
M
# [,1] [,2] [,3] [,4] [,5]
# [1,] 1 6 11 16 21
# [2,] 2 7 12 17 22
# [3,] 3 8 13 18 23
# [4,] 4 9 14 19 24
# [5,] 5 10 15 20 25
m <- cbind(1:5,5:1)
m
# [,1] [,2]
# [1,] 1 5
# [2,] 2 4
# [3,] 3 3
# [4,] 4 2
# [5,] 5 1
M[m]
# [1] 21 17 13 9 5
So try
result[,2] <- userRatingsNew[as.matrix(mapping[,2:3])]  # as.matrix() in case mapping is a data frame, which is what read.csv() returns
You should not need a loop.
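To see that the matrix-index form gives the same result as the loop from the question, here is a small self-contained sketch. The sizes (7 users, 3 movies, 10 transactions) and the column names id, user, movie and rating are made up for illustration; only the indexing pattern carries over to the real data.

```r
set.seed(1)

# Toy stand-ins for the real objects (sizes and values are assumptions)
userRatingsNew <- matrix(sample(1:5, 7 * 3, replace = TRUE), nrow = 7, ncol = 3)
mapping <- data.frame(id    = 1:10,
                      user  = sample(1:7, 10, replace = TRUE),
                      movie = sample(1:3, 10, replace = TRUE))
test <- data.frame(id = 1:10, rating = NA)

# Vectorised lookup: a two-column (row, column) matrix indexes all cells at once
idx        <- as.matrix(mapping[, c("user", "movie")])
vectorised <- userRatingsNew[idx]

# The original element-by-element loop, kept only for comparison
looped <- numeric(nrow(mapping))
for (i in seq_len(nrow(mapping))) {
  looped[i] <- userRatingsNew[mapping$user[i], mapping$movie[i]]
}

result        <- test
result$rating <- vectorised
same_answer   <- all(looped == vectorised)
print(same_answer)
```

The vectorised version makes a single pass in compiled code, whereas the loop modifies result one cell at a time, which is particularly slow when result is a data frame.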
A thought:
Instead of storing a dense 7000 x 3000 matrix, you could store, for each user, an array of (movie index, rating) pairs. Presumably most users will not rate all 3000 films. Say they rate 20 movies on average: you then only need about (7000) x (20x2+20) operations, where 20x2 covers the 20 ratings plus the index of each film, and the other 20 covers retrieving each film's name. You can compile all the results first using the array positions, and then attach the names by indexing into an array of film names.

Perform 'cross product' of two vectors, but with addition

I am trying to use R to perform an operation (ideally with similarly displayed output) such as
> x<-1:6
> y<-1:6
> x%o%y
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1 2 3 4 5 6
[2,] 2 4 6 8 10 12
[3,] 3 6 9 12 15 18
[4,] 4 8 12 16 20 24
[5,] 5 10 15 20 25 30
[6,] 6 12 18 24 30 36
where each entry is found through addition not multiplication.
I would also be interested in creating the 36 ordered pairs (1,1) , (1,2), etc...
Furthermore, I want to use another vector like
z<-1:4
to create all the ordered triplets possible between x, y, and z.
I am using R to look into likelihoods of possible total when rolling dice with varied numbers of sizes.
Thank you for all your help! This site has been a big help to me. I appreciate anyone that takes the time to answer a stranger's question.
UPDATE: So I found that outer(x,y,'+') will do what I wanted first. But I still don't know how to create ordered pairs or ordered triplets.
Your first question is easily handled by outer:
outer(1:6,1:6,"+")
For the others, I suggest you try expand.grid, although there are specialized combination and permutation functions out there as well if you do a little searching.
expand.grid can answer your second question:
expand.grid(1:6,1:6)
expand.grid(1:6,1:6,1:4)
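Putting the two functions together for the dice problem mentioned in the question, a minimal sketch could look like this. The column names x, y, z and total, and the choice of two six-sided dice plus one four-sided die, are illustrative assumptions based on the vectors in the question.

```r
x <- 1:6   # first six-sided die
y <- 1:6   # second six-sided die
z <- 1:4   # four-sided die

# Addition table for two dice: entry [i, j] is x[i] + y[j]
sums <- outer(x, y, "+")

# All ordered triplets, one row per possible roll
rolls <- expand.grid(x = x, y = y, z = z)

# Total shown by each roll, then the probability of each total
rolls$total <- rowSums(rolls)
prob <- table(rolls$total) / nrow(rolls)

print(sums)
print(prob)
```

Each row of rolls is equally likely for fair dice, so dividing the counts by nrow(rolls) gives the probability of each total directly.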
