I'm new to loops and functions in R.
Imagine I have measurements at every 0.1 units from 1.0 to 3.5 for four samples (A, B, C, D).
I want to find the average measurements (+/- 0.2 units) near 1.5, 2.5, and 3.5. So, for 1.5 I'm averaging the values at 1.3, 1.4, 1.5, 1.6, and 1.7, and so on.
How can I write a statement to summarize those three average values for all four samples? I think it might start something like this:
X <- c(1.5, 2.5, 3.5)
for (i in X)
{
avg <- colMeans(subset(data,data$measurement > (i - 0.2) & data$measurement < (i + 0.2)))
}
I've also considered using '[' instead:
colMeans(data[data$measurement > (i - 0.2) & data$measurement < (i + 0.2), ])
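Putting those pieces together, here is a sketch of the kind of statement I think I'm after (untested; it assumes data has a measurement column plus one numeric column per sample, A through D):
X <- c(1.5, 2.5, 3.5)
res <- sapply(X, function(i) {
  rows <- data$measurement >= (i - 0.2) & data$measurement <= (i + 0.2)
  colMeans(data[rows, c("A", "B", "C", "D")])
})
colnames(res) <- X  # one column of averages per center point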
Thanks for the help so far; sqldf is a really nice tool, and the example does just what I want!
However, I can't get it to work with the real data set. I modified the code so it looks like this (sorry, it no longer corresponds to the sample data set):
M <- sqldf("select r.i,avg(w.X1),avg(w.X2),avg(w.X3),avg(w.X4)
from Y r, Y w
where w.i between r.i - 1 and r.i + 1
group by r.i
having r.i+0.0 in (600, 700, 800)")
To give some context, I am trying to summarize the average of all points from 599–601, 699–701, and 799–801 for four columns named X1, X2, X3, and X4. I named this data frame 'Y'. The rows are actually wavelengths, and the data points are the amount of light reflected at each wavelength.
Do you see anything wrong with the above code? It creates a matrix with the right dimensions, but the averages don't match what they should be based on the larger dataset. I'm wondering if I'm misunderstanding something in the code, for instance the role of the 'w' alias.
Proper indexing is faster than the loop.
library(zoo)
set.seed(1)
x <- as.character(seq(1,3.5,.1))
z <- zoo(data.frame(a=rnorm(length(x)),
b=rnorm(length(x)),
c=rnorm(length(x))),
x)
z2 <- rollmean(z, k = 5, align = "center")[as.character(seq(1,3.5,.5)),]
> z2
a b c
1.5 0.46601479 0.40153999 0.2007418
2 0.31015536 -0.22912642 0.4673692
2.5 -0.04141133 0.31978341 0.4350507
3 0.63816023 -0.07509644 -0.3622883
> data.frame(z2, index = index(z2))
a b c index
1.5 0.46601479 0.40153999 0.2007418 1.5
2 0.31015536 -0.22912642 0.4673692 2
2.5 -0.04141133 0.31978341 0.4350507 2.5
3 0.63816023 -0.07509644 -0.3622883 3
If you want the partial fills on the edges where the window is less than 5 wide:
> rollapply(z, width = 5, align = "center", partial = TRUE, FUN = mean)[as.character(seq(1,3.5,.5)),]
a b c
1 -0.42614637 -0.70156598 0.21492677
1.5 0.46601479 0.40153999 0.20074176
2 0.31015536 -0.22912642 0.46736921
2.5 -0.04141133 0.31978341 0.43505071
3 0.63816023 -0.07509644 -0.36228832
3.5 -0.47521823 0.22239574 -0.05024676
If the window sizes are irregular but the points are equally spaced, as mentioned in the comment:
> z2 <- as.data.frame(z)
> z2$i <- row.names(z2)
> library(sqldf)
> sqldf("select a.i,avg(b.a),avg(b.b),avg(b.c)
from z2 a, z2 b
where b.i between a.i - .21 and a.i + .21
group by a.i
having a.i+0.0 in (1.5,2.0,2.5,3.0,3.5)")
i avg(b.a) avg(b.b) avg(b.c)
1 1.5 0.46601479 0.40153999 0.20074176
2 2 0.31015536 -0.22912642 0.46736921
3 2.5 -0.04141133 0.31978341 0.43505071
4 3 0.63816023 -0.07509644 -0.36228832
5 3.5 -0.47521823 0.22239574 -0.05024676
When using the matchit function for full matching, the results depend on the order of the input data frame: if the order of the data is changed, the results change too. This is surprising because, in my understanding, optimal full matching should yield a single best solution.
Am I missing something or is this an error?
Similar differences occur with the optimal algorithm.
Below you will find a reproducible example. The subclasses should be identical for the two data sets, but they are not.
Thank you for your help!
# create data
nr <- c(1:100)
x1 <- rnorm(100, mean=50, sd=20)
x2 <- c(rep("a", 20),rep("b", 60), rep("c", 20))
x3 <- rnorm(100, mean=230, sd=2)
outcome <- rnorm(100, mean=500, sd=20)
group <- c(rep(0, 50),rep(1, 50))
df <- data.frame(x1=x1, x2=x2, outcome=outcome, group=group, row.names=nr, nr=nr)
df_neworder <- df[order(outcome),] # re-order data.frame
# perform matching
library(MatchIt)
model_oldorder <- matchit(group~x1, data=df, method="full", distance ="logit")
model_neworder <- matchit(group~x1, data=df_neworder, method="full", distance ="logit")
# store matching results
matcheddata_oldorder <- match.data(model_oldorder, distance="pscore")
matcheddata_neworder <- match.data(model_neworder, distance="pscore")
# Results based on original data.frame
head(matcheddata_oldorder[order(nr),], 10)
x1 x2 outcome group nr pscore weights subclass
1 69.773776 a 489.1769 0 1 0.5409943 1.0 27
2 63.949637 a 529.2733 0 2 0.5283582 1.0 32
3 52.217666 a 526.7928 0 3 0.5028106 0.5 17
4 48.936397 a 492.9255 0 4 0.4956569 1.0 9
5 36.501507 a 512.9301 0 5 0.4685876 1.0 16
# Results based on re-ordered data.frame
head(matcheddata_neworder[order(matcheddata_neworder$nr),], 10)
x1 x2 outcome group nr pscore weights subclass
1 69.773776 a 489.1769 0 1 0.5409943 1.0 25
2 63.949637 a 529.2733 0 2 0.5283582 1.0 31
3 52.217666 a 526.7928 0 3 0.5028106 0.5 15
4 48.936397 a 492.9255 0 4 0.4956569 1.0 7
5 36.501507 a 512.9301 0 5 0.4685876 2.0 14
Apparently, the assignment of objects to subclasses differs. In my understanding, this should not be the case.
The developers of the optmatch package (which the matchit function calls) provided useful help:
I think what we're seeing here is the result of the tolerance argument that fullmatch has. The matching algorithm requires integer distances, so we have to scale then truncate floating point distances. For a given set of integer distances, there may be multiple matchings that achieve the minimum, so the solver is free to pick among these non-unique solutions.
Developing your example a little more:
library(optmatch)
nr <- c(1:100)
x1 <- rnorm(100, mean=50, sd=20)
outcome <- rnorm(100, mean=500, sd=20)
group <- c(rep(0, 50), rep(1, 50))
df_oldorder <- data.frame(x1=x1, outcome=outcome, group=group, row.names=nr, nr=nr)
df_neworder <- df_oldorder[order(outcome),]  # re-order data.frame
glm_oldorder <- match_on(glm(group~x1, data=df_oldorder), data = df_oldorder)
glm_neworder <- match_on(glm(group~x1, data=df_neworder), data = df_neworder)
fm_old <- fullmatch(glm_oldorder, data=df_oldorder)
fm_new <- fullmatch(glm_neworder, data=df_neworder)
mean(sapply(matched.distances(fm_old, glm_oldorder), mean))
## 0.06216174
mean(sapply(matched.distances(fm_new, glm_neworder), mean))
## 0.062058
mean(sapply(matched.distances(fm_old, glm_oldorder), mean)) -
  mean(sapply(matched.distances(fm_new, glm_neworder), mean))
## 0.00010373
which we can see is smaller than the default tolerance of 0.001. You can always decrease the tolerance level, which may require increased run time, in order to get closer to the true floating point minimum. We found 0.001 seemed to work well in practice, but there is nothing special about this value.
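In practice that means passing a smaller tol to fullmatch, along the lines of the developers' note; a sketch (not re-run on this data):
# tighter tolerance than the 0.001 default; may increase run time
fm_old_tight <- fullmatch(glm_oldorder, data = df_oldorder, tol = 1e-4)
fm_new_tight <- fullmatch(glm_neworder, data = df_neworder, tol = 1e-4)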
I'm writing my master's thesis and I'm stuck on the complexity of my data, so I'd like to plot the data to see what's in there.
My data frame looks like this: I have 333 perceivers (PID) who each rated 60 target photos (TID), resulting in 19,980 rows. Each perceiver (PID) rated every target photo on how likeable the person is (Rating) and provided multiple self-reports about themselves (SDO_mean, KSA_mean, threats_overall).
The photos were either of photo type A (Dwithin = 0) or type B (Dwithin = 1), which is my within-subject factor, as every perceiver saw all photos. In addition, perceivers were assigned to one of two between-subject conditions (Dbetween): all photos (TID) of type B (Dwithin = 1) were labeled either as people with a migration background (Dbetween = 0) or as refugees (Dbetween = 1).
This results in a nested design where the Ratings are nested within PID and also within TID. My data look like this:
TID PID Dwithin Dbetween Rating SDO_mean KSA_mean threats_overall
1 1 0 0 5 3.1 2.3 2.2
2 1 1 0 2 3.1 2.3 2.2
3 1 0 0 5 3.1 2.3 2.2
4 1 1 0 1 3.1 2.3 2.2
5 1 0 0 3 3.1 2.3 2.2
6 1 1 0 3 3.1 2.3 2.2
Now I want to predict the likeability rating mainly from the categorical variables Dwithin and Dbetween. As Dbetween can only be interpreted as the interaction Dwithin*Dbetween (because the label applied only to Dwithin = 1 targets), the formula would be:
model1 <- lmer(Rating~1+Dwithin+Dbetween+Dwithin*Dbetween+(1+Dwithin|PID)+(1|TID),data=df)
Now I want to plot the data I'm using for my regression. One option could be to plot the Rating separately for each Dwithin / Dbetween condition, or to plot the regression as in the model1 formula. But as these are categorical predictors, I didn't manage to plot the data in the right way. I looked into the lattice package but couldn't apply it to my data. Is there anyone who could help me plot it? Thanks a lot in advance!
@SASpencer: I thought, for example, of something like this. But my y-scale isn't continuous: it only has integer values from 1-5. It could also be interesting for the combination of Dwithin and Dbetween (so like in your plot).
Here is a reproducible example:
mysamp <- function(n, m, s, lwr, upr, nnorm) {
set.seed(1)
samp <- rnorm(nnorm, m, s)
samp <- samp[samp >= lwr & samp <= upr]
if (length(samp) >= n) {
return(sample(samp, n))
}
}
options(digits=2)
TID <- rep(1:60, times=333)
PID <- rep(1:333,each=60)
Dwithin <- rep(rep(0:1, times=19980/2))
Dbetween <- rep(rep(0:1, each=60),times=333)[1:19980]
Rating <- floor(runif(19980, min=1, max=6))
SDO_mean <- rep(mysamp(n=333, m=4, s=2.5, lwr=1, upr=5, nnorm=1000000), each=60)
KSA_mean <- rep(mysamp(n=333, m=2, s=0.8, lwr=1, upr=5, nnorm=1000000), each=60)
threats_overall <- rep(mysamp(n=333, m=3, s=1.5, lwr=1, upr=5, nnorm=1000), each=60)
df <- data.frame(TID,PID,Dwithin,Dbetween, Rating, SDO_mean, KSA_mean, threats_overall)
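For what it's worth, a minimal sketch of one way to view the Ratings by condition (assuming ggplot2; jittered raw ratings with condition means overlaid, since Rating only takes integer values 1-5; the factor labels are illustrative):
library(ggplot2)
pdat <- df   # work on a copy so the dummy-coded columns stay numeric for lmer
pdat$Dwithin  <- factor(pdat$Dwithin,  labels = c("type A", "type B"))
pdat$Dbetween <- factor(pdat$Dbetween, labels = c("label: migration background", "label: refugees"))
ggplot(pdat, aes(x = Dwithin, y = Rating)) +
  geom_jitter(width = 0.2, height = 0.1, alpha = 0.05) +   # raw ratings, spread out a little
  stat_summary(fun = mean, geom = "point", colour = "red", size = 3) +  # condition means
  facet_wrap(~ Dbetween)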
I have some data showing a long list of regions, the population of each region and the number of people in each region with a certain disease. I'm trying to show the confidence intervals for each proportion (but I'm not testing whether the proportions are statistically different).
One approach is to manually calculate the standard errors and confidence intervals but I'd like to use a built-in tool like prop.test, because it has some useful options. However, when I use prop.test with vectors, it runs a chi-square test across all the proportions.
I've solved this with a while loop (see dummy data below), but I sense there must be a better and simpler way to approach this problem. Would apply work here, and how? Thanks!
dat <- data.frame(1:5, c(10, 50, 20, 30, 35))
names(dat) <- c("X", "N")
dat$Prop <- dat$X / dat$N
ConfLower = 0
x = 1
while (x < 6) {
a <- prop.test(dat$X[x], dat$N[x])$conf.int[1]
ConfLower <- c(ConfLower, a)
x <- x + 1
}
ConfUpper = 0
x = 1
while (x < 6) {
a <- prop.test(dat$X[x], dat$N[x])$conf.int[2]
ConfUpper <- c(ConfUpper, a)
x <- x + 1
}
dat$ConfLower <- ConfLower[2:6]
dat$ConfUpper <- ConfUpper[2:6]
Here's an attempt using Map, essentially stolen from a previous answer here:
https://stackoverflow.com/a/15059327/496803
res <- Map(prop.test,dat$X,dat$N)
dat[c("lower","upper")] <- t(sapply(res,"[[","conf.int"))
# X N Prop lower upper
#1 1 10 0.1000000 0.005242302 0.4588460
#2 2 50 0.0400000 0.006958623 0.1485882
#3 3 20 0.1500000 0.039566272 0.3886251
#4 4 30 0.1333333 0.043597084 0.3164238
#5 5 35 0.1428571 0.053814457 0.3104216
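For comparison, an equivalent apply-family version (a sketch using the same dat as above):
ci <- t(sapply(seq_len(nrow(dat)), function(i) prop.test(dat$X[i], dat$N[i])$conf.int))
dat$lower <- ci[, 1]
dat$upper <- ci[, 2]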
I have a dataset of species and their rough locations in a 100 x 200 meter area. The location part of the data frame is not in a format that I find usable. In this 100 x 200 meter rectangle, there are two hundred 10 x 10 meter squares named A through CV. Within each 10 x 10 square there are four 5 x 5 meter squares named 1, 2, 3, and 4, respectively (1 is south of 2 and west of 3; 4 is east of 2 and north of 3). I want to let R know that A is the square with corners at (0,0), (10,0), (10,10), and (0,10); that B is just north of A and has corners (0,10), (0,20), (10,10), and (10,20); that K is just east of A and has corners at (10,0), (10,10), (20,0), and (20,10); and so on for all the 10 x 10 meter squares. Additionally, I want to let R know where each 5 x 5 meter square is within the 100 x 200 meter plot.
So, my data frame looks something like this:
10x10 5x5 Tree Diameter
A 1 tree1 4
B 1 tree2 4
C 4 tree3 6
D 3 tree4 2
E 3 tree5 3
F 2 tree6 7
G 1 tree7 12
H 2 tree8 1
I 2 tree9 2
J 3 tree10 8
K 4 tree11 3
L 1 tree12 7
M 2 tree13 5
Eventually, I want to be able to plot the 100 x 200 meter area and have each 10 x 10 meter square show the number of trees, the number of species, or the total biomass.
What is the best way to turn the data I have into spatial data that R can use for graphing and perhaps analysis?
Here's a start.
## set up a vector of all 10x10 position tags
tags10 <- c(LETTERS,
paste0("A",LETTERS),
paste0("B",LETTERS),
paste0("C",LETTERS[1:22]))
A function to convert (e.g.) {"J",3} to the center of the corresponding sub-square.
convpos <- function(pos10,pos5) {
## convert letters to major (x,y) positions
p1 <- as.numeric(factor(pos10,levels=tags10)) ## or use match()
p1.x <- ((p1-1) %% 10) *10+5 ## %% is modulo operator
p1.y <- ((p1-1) %/% 10)*10+5 ## %/% is integer division
## sort out sub-positions relative to the 10x10 square's center
p2.x <- ifelse(pos5 <= 2, -2.5, 2.5)       ## {1,2} are west, {3,4} east of center
p2.y <- ifelse(pos5 %% 2 == 1, -2.5, 2.5)  ## odd {1,3} south, even {2,4} north
cbind(x = p1.x + p2.x, y = p1.y + p2.y)    ## one row of (x,y) per input position
}
usage:
convpos("J",2)
convpos(mydata$tenbytenpos,mydata$fivebyfivepos)
Important notes:
this is a proof of concept; I can pretty much guarantee I haven't got the correspondence of x and y coordinates quite right, but you should be able to trace through this line by line and see what it's doing ...
it should work correctly on vectors (see second usage example above): I switched from switch to ifelse for that reason
your column names (10x10) are likely to get mangled into something like X10.10 when reading data into R: see ?data.frame and ?check.names
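For example (a tiny illustration; the file name is hypothetical):
dat <- read.csv("trees.csv", check.names = FALSE)  # keeps the column name "10x10" exactly as written
names(dat)  # with the default check.names = TRUE the name is run through make.names() and altered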
Similar to what @Ben Bolker has done, here's a lookup function (though you may need to transpose something to make the labels match what you describe).
tenbyten <- c(LETTERS[1:26],
paste0("A",LETTERS[1:26]),
paste0("B",LETTERS[1:26]),
paste0("C",LETTERS[1:22]))
tenbyten <- matrix(rep(tenbyten, each = 2), ncol = 10)
tenbyten <- t(apply(tenbyten, 1, function(x){rep(x, each = 2)}))
# the 1234 squares
squares <- matrix(c(rep(c(1,2),10),rep(c(4,3),10)), nrow = 20, ncol = 20)
# stick together into a reference grid
my.grid <- matrix(paste(tenbyten, squares, sep = "-"), nrow = 20, ncol = 20)
# a lookup function for the site grid
coordLookup <- function(tbt, fbf, .my.grid = my.grid){
x <- col(.my.grid) * 5 - 2.5
y <- row(.my.grid) * 5 - 2.5
marker <- .my.grid == paste(tbt, fbf, sep = "-")
list(x = x[marker], y = y[marker])
}
coordLookup("BB",2)
$x
[1] 52.5
$y
[1] 37.5
If this isn't what you're looking for, then maybe you'd prefer a SpatialPolygonsDataFrame, which has proper polygon IDs and which you can attach data to, etc. In that case, just Google around for how to make one from scratch, and manipulate the row() and col() functions to get your polygon corners, similar to what's done in this lookup function, which only returns centroids.
Edit: getting SPDF started:
This is modified from the function example and can hopefully be a good start:
library(sp)
# really you have a 20x20 grid, counting the small 5x5 m squares
# GridTopology(cellcentre.offset, cellsize, cells.dim): cell centers start at
# (2.5, 2.5) and each cell is 5 m x 5 m
grd <- GridTopology(c(2.5, 2.5), c(5, 5), c(20, 20))
grd <- as.SpatialPolygons.GridTopology(grd)
# get centroids
coords <- coordinates(grd)
# make the SPDF, with an extra column for your grid codes, taken from the above.
# you can add further columns to this data.frame() later via polys@data
polys <- SpatialPolygonsDataFrame(grd,
    data = data.frame(x = coords[, 1], y = coords[, 2], my.ID = as.vector(my.grid),
                      row.names = getSpPPolygonsIDSlots(grd)))
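From there, per-square summaries can be attached and mapped. A rough sketch, assuming your tree data sit in a data frame called mydata with the grid code (in the same "A-1" style as my.ID) stored in a hypothetical column named code:
counts <- as.data.frame(table(code = mydata$code))  # trees per 5x5 cell
polys@data$ntrees <- counts$Freq[match(polys@data$my.ID, counts$code)]
spplot(polys, "ntrees")  # shade each 5x5 cell by its tree count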
The question has two parts.
What data structure in R allows storing paired data like this:
0:0
0.5:10
1:20
(like a Python dictionary: {0: 0, 0.5: 10, 1: 20}),
and how can it be initialized with a one-liner, i.e. coupling seq(0,1,by=0.5)
with seq(0,10,by=5) in that data structure?
Assume I add 0.25 to the list; then I want the weighted average of the neighboring nodes to appear (automatically) in the data set, i.e. the element 0.25:5, so the paired set would be
0:0
0.25:5
0.5:10
1:20
If I then add the element 0.3, it must be paired with 5 + (10-5)*(0.3-0.25)/(0.5-0.25) = 6, i.e. the element 0.3:6 is added.
How can I create a class, using the S4 or Reference Class model, that provides this functionality?
Not really sure what you are getting at, but the hash package may have what you want:
library(hash)
h<-hash(keys=seq(0,1,by=0.5),values=seq(0,10,by=5))
h[['0.25']]<-2.5
That probably deals with the first part of your question; http://cran.r-project.org/web/packages/hash/hash.pdf may help with the second.
A similar construct with a named vector:
lst <- seq(0, 10, 5)
names(lst) <- seq(0, 1, 0.5)
> lst['0.5']
0.5
5
lst['0.25']<-2.5
For your second part, you could construct a simple function to update your hash/list with a new value.
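For example, a rough sketch of such an update function for the named-vector version above, using approx to interpolate the new value from the existing pairs:
add_point <- function(lst, x) {
  keys <- as.numeric(names(lst))
  lst[as.character(x)] <- approx(keys, lst, xout = x)$y  # linearly interpolated value
  lst[order(as.numeric(names(lst)))]                     # keep the pairs sorted by key
}
lst <- add_point(lst, 0.3)  # pairs 0.3 with its interpolated value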
A two-column data.frame seems appropriate:
xy <- data.frame(x = seq(0, 1, by = 0.5), y = seq(0, 20, by = 10))
xy
# x y
# 1 0.0 0
# 2 0.5 10
# 3 1.0 20
Then, what you are trying to do is linear interpolation, which you can achieve using the approx function. For example:
approx(xy$x, xy$y, xout = 0.3)
# $x
# [1] 0.3
#
# $y
# [1] 6
If you want to add that result to the data.frame, you can do something like:
xy <- as.data.frame(approx(xy$x, xy$y, xout = sort(c(xy$x, 0.3))))
xy
# x y
# 1 0.0 0
# 2 0.3 6
# 3 0.5 10
# 4 1.0 20
which is a bit expensive, especially if you plan to add points one at a time. You could instead add all your points at once since the result is independent of the order in which you add them:
add.points <- c(0.25, 0.3)
xy <- as.data.frame(approx(xy$x, xy$y, xout = sort(c(xy$x, add.points))))
xy
# x y
# 1 0.00 0
# 2 0.25 5
# 3 0.30 6
# 4 0.50 10
# 5 1.00 20