I am trying to apply Simpson's Diversity Index across a number of different datasets, each with a variable number of species ('nuse') captured. I therefore want code that copes with this automatically, without my having to construct a formula by hand each time. An example dataset with a manual formula is below:
diverse <- data.frame(nuse1=c(0,20,40,20), nuse2=c(5,5,3,20), nuse3=c(0,2,8,20), nuse4=c(5,8,2,20), total=c(10,35,53,80))
simp <- function(x) {
total <- x[,"total"]
nuse1 <- x[,"nuse1"]
nuse2 <- x[,"nuse2"]
nuse3 <- x[,"nuse3"]
nuse4 <- x[,"nuse4"]
div <- round(((1-(((nuse1*(nuse1 - 1)) + (nuse2*(nuse2 - 1)) + (nuse3*(nuse3 - 1)) + (nuse4*(nuse4 - 1)))/(total*(total - 1))))),digits=4)
return(div)
}
diverse$Simpson <- simp(diverse)
diverse
As you can see, this works fine. However, how could I write a function that automatically adjusts to, for example, 9 species (so up to nuse9)?
I have experimented with paste() plus as.formula(), as suggested in "Formula with dynamic number of variables"; however, it is the expanded form of (nuse1 * (nuse1 - 1)) that I'm struggling with. Does anyone have any suggestions? Thanks.
How about something like:
diverse <- data.frame(nuse1=c(0,20,40,20), nuse2=c(5,5,3,20), nuse3=c(0,2,8,20), nuse4=c(5,8,2,20), total=c(10,35,53,80))
simp <- function(x, species) {
spcs <- grep(species, colnames(x)) # which column names have "nuse"
total <- rowSums(x[,spcs]) # sum by row
div <- round(1 - rowSums(apply(x[,spcs], 2, function(s) s*(s-1))) / (total*(total - 1)), digits = 4)
return(div)
}
diverse$Simpson2 <- simp(diverse, species = "nuse")
diverse
# nuse1 nuse2 nuse3 nuse4 total Simpson2
# 1 0 5 0 5 10 0.5556
# 2 20 5 2 8 35 0.6151
# 3 40 3 8 2 53 0.4107
# 4 20 20 20 20 80 0.7595
All it does is find the columns whose names contain "nuse" (or whatever species prefix your dataset uses; pass "^nuse" to match only names that start with it). It computes the "total" value inside the function, so the dataset does not need a total column.
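The same idea can also be written without apply(), since arithmetic on a numeric matrix is already element-wise. This is only a sketch: simpson_rows and its prefix argument are names I made up for illustration, and it assumes every species column starts with the same prefix.

```r
# Hypothetical helper: Simpson's index 1 - sum(n_i * (n_i - 1)) / (N * (N - 1))
# for every row, using all columns whose names start with `prefix`.
simpson_rows <- function(x, prefix = "nuse") {
  counts <- as.matrix(x[, startsWith(colnames(x), prefix), drop = FALSE])
  n <- rowSums(counts)                 # total individuals per row
  # rows with a total of 0 or 1 give NaN (0/0), as in the original formula
  1 - rowSums(counts * (counts - 1)) / (n * (n - 1))
}

diverse <- data.frame(nuse1 = c(0, 20, 40, 20), nuse2 = c(5, 5, 3, 20),
                      nuse3 = c(0, 2, 8, 20),  nuse4 = c(5, 8, 2, 20))
round(simpson_rows(diverse), 4)
```

Because the columns are picked by prefix, the same call should work unchanged for a data frame with nuse1 to nuse9.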
I'm writing my master's thesis and I'm stuck on the complexity of my data, so I'd like to plot the data to see what's in there.
My data frame looks like this: I have 333 perceivers (PID) who each rated 60 target photos (TID), resulting in 19980 rows. Each perceiver (PID) rated how likeable every target's photo is (Rating) and provided several self-reports (SDO_mean, KSA_mean, threats_overall).
The photos were either of photo type A (Dwithin = 0) or type B (Dwithin = 1), which is my within-subject factor, as every perceiver saw all photos. In addition, perceivers were assigned to one of two between-subject conditions (Dbetween): all photos (TID) of type B (Dwithin = 1) were labeled either as people with a migration background (Dbetween = 0) or as refugees (Dbetween = 1).
This results in a nested design where the Ratings are nested in PID and also in TID. My data look like this:
TID PID Dwithin Dbetween Rating SDO_mean KSA_mean threats_overall
1 1 0 0 5 3.1 2.3 2.2
2 1 1 0 2 3.1 2.3 2.2
3 1 0 0 5 3.1 2.3 2.2
4 1 1 0 1 3.1 2.3 2.2
5 1 0 0 3 3.1 2.3 2.2
6 1 1 0 3 3.1 2.3 2.2
Now I want to predict the likeability rating mainly from the categorical variables Dwithin and Dbetween. As Dbetween can only be interpreted through the Dwithin*Dbetween interaction (because the label was applied only to Dwithin = 1 targets), the formula would be:
model1 <- lmer(Rating~1+Dwithin+Dbetween+Dwithin*Dbetween+(1+Dwithin|PID)+(1|TID),data=df)
Now I want to plot the data that I'm using for my regression. One option would be to plot the Rating separately for each Dwithin / Dbetween condition; another would be to plot the regression as in the model1 formula. But as these are categorical predictors, I haven't managed to plot the data in the right way. I looked into lattice but couldn't apply it to my data. Could anyone help me plot it? Thanks a lot in advance!
@SASpencer: I was thinking of something like this, for example. But my y-scale isn't continuous; it only takes integer values from 1 to 5. It could also be interesting for the combination of Dwithin and Dbetween (as in your plot).
Here is a reproducible example:
mysamp <- function(n, m, s, lwr, upr, nnorm) {
set.seed(1)
samp <- rnorm(nnorm, m, s)
samp <- samp[samp >= lwr & samp <= upr]
if (length(samp) >= n) {
return(sample(samp, n))
}
}
options(digits=2)
TID <- rep(1:60, times=333)
PID <- rep(1:333,each=60)
Dwithin <- rep(rep(0:1, times=19980/2))
Dbetween <- rep(rep(0:1, each=60),times=333)[1:19980]
Rating <- floor(runif(19980, min=1, max=6))
SDO_mean <- rep(mysamp(n=333, m=4, s=2.5, lwr=1, upr=5, nnorm=1000000), each=60)
KSA_mean <- rep(mysamp(n=333, m=2, s=0.8, lwr=1, upr=5, nnorm=1000000), each=60)
threats_overall <- rep(mysamp(n=333, m=3, s=1.5, lwr=1, upr=5, nnorm=1000), each=60)
df <- data.frame(TID,PID,Dwithin,Dbetween, Rating, SDO_mean, KSA_mean, threats_overall)
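One possible starting point, offered as a rough sketch rather than a full answer: summarise Rating per Dwithin x Dbetween cell and plot the cell means plus the share of each rating value, which suits an integer 1-5 response. The toy data frame below is a made-up stand-in with the same column names as df, and it ignores the PID/TID grouping that the mixed model accounts for.

```r
# Toy stand-in for the question's df (hypothetical values, same column names)
set.seed(42)
toy <- data.frame(
  Dwithin  = rep(0:1, times = 200),
  Dbetween = rep(0:1, each = 200),
  Rating   = sample(1:5, 400, replace = TRUE)
)

# Mean rating in each of the four Dwithin x Dbetween cells
cell_means <- with(toy, tapply(Rating,
                               list(Dwithin = Dwithin, Dbetween = Dbetween),
                               mean))
print(cell_means)

# Share of each rating value (1-5) within each cell
cond <- with(toy, interaction(Dwithin, Dbetween, sep = "/"))
rating_share <- prop.table(table(cond, factor(toy$Rating, levels = 1:5)),
                           margin = 1)

# Base-graphics views: cell means as an interaction plot,
# rating distribution as grouped bars (one group per Dwithin/Dbetween cell)
with(toy, interaction.plot(Dwithin, Dbetween, Rating, ylab = "mean Rating"))
barplot(t(rating_share), beside = TRUE, legend.text = TRUE,
        xlab = "Dwithin/Dbetween", ylab = "proportion of ratings")
```

Replacing toy with df should give the same two plots for the real data.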
I am trying to compute in R. I have the following values.
nb <- 100
tb <- 25
ns <- 90
ts <- 15
A0 <- 1
S_norm <- 0.4
R <- tb/ts
y_meas <- (ns-nb/R)/A0
sigma_meas <- sqrt(ns+(nb+1)/R^2)/A0
I am confused about how to integrate L(psi), say from -10 to 10, because the integration is with respect to log A.
You can substitute x = log A (so that A = exp(x)); then, for a fixed value of psi, you can integrate numerically as follows:
psi <- 5
integrate(function(x) exp(-0.5*(((x/A0)/S_norm)^2 + ((psi-y_meas*A0/exp(x))/sigma_meas)^2)),
-10, 10)
# 0.1775989 with absolute error < 6.6e-05
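If L is needed for several values of psi, the call above can be wrapped in a function and evaluated over a grid. This is a sketch under the same assumptions as the snippet above; integrand, L_psi and psi_grid are names introduced here for illustration.

```r
# Values from the question
nb <- 100; tb <- 25; ns <- 90; ts <- 15
A0 <- 1; S_norm <- 0.4
R <- tb / ts
y_meas <- (ns - nb / R) / A0
sigma_meas <- sqrt(ns + (nb + 1) / R^2) / A0

# Integrand in x = log A, identical to the expression used above
integrand <- function(x, psi) {
  exp(-0.5 * (((x / A0) / S_norm)^2 +
              ((psi - y_meas * A0 / exp(x)) / sigma_meas)^2))
}

# L(psi) for a single psi; integrate() passes psi through to integrand
L_psi <- function(psi, lower = -10, upper = 10) {
  integrate(integrand, lower, upper, psi = psi)$value
}

# integrate() is not vectorised over psi, so loop with sapply()
psi_grid <- seq(0, 30, by = 5)
L_vals <- sapply(psi_grid, L_psi)
cbind(psi = psi_grid, L = L_vals)
```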
On top of the excellent answer by @SandipanDey: if you can extend the limits to -Inf...+Inf, there is a better way to integrate functions with an exp(-x^2) kernel, namely Gauss-Hermite quadrature, and there is an R package for that (gaussquad).
Simple example:
library(gaussquad)
n.quad <- 128 # integration order
# get the particular (weights,abscissas) as data frame
# with 2 observables and n.quad observations
rule <- ghermite.h.quadrature.rules(n.quad, mu = 0.0)[[n.quad]]
# test function - integrate 1 * exp(-x^2) from -Inf to Inf
# (the rule supplies the exp(-x^2) weight); should get sqrt(pi) as an answer
f <- function(x) {
1.0
}
q <- ghermite.h.quadrature(f, rule)
print(q - sqrt(pi))
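To see what such a rule consists of without installing a package, here is a rough base-R sketch that builds Gauss-Hermite nodes and weights with the Golub-Welsch eigenvalue construction. I am not claiming this is how gaussquad computes its rules; gauss_hermite is a made-up helper name, and it needs n >= 2.

```r
# Nodes and weights for the weight function exp(-x^2) on (-Inf, Inf).
# The nodes are the eigenvalues of the symmetric tridiagonal Jacobi matrix
# with zero diagonal and off-diagonal entries sqrt(k / 2), k = 1..n-1;
# the weights are sqrt(pi) times the squared first eigenvector components.
gauss_hermite <- function(n) {
  k <- seq_len(n - 1)
  J <- matrix(0, n, n)
  J[cbind(k, k + 1)] <- sqrt(k / 2)
  J[cbind(k + 1, k)] <- sqrt(k / 2)
  e <- eigen(J, symmetric = TRUE)
  list(x = e$values, w = sqrt(pi) * e$vectors[1, ]^2)
}

gh <- gauss_hermite(20)

# integral of 1 * exp(-x^2): compare with sqrt(pi)
sum(gh$w) - sqrt(pi)

# integral of x^2 * exp(-x^2): compare with sqrt(pi) / 2
sum(gh$w * gh$x^2) - sqrt(pi) / 2
```

Any integrand of the form g(x) * exp(-x^2) is then approximated by sum(gh$w * g(gh$x)).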
I have a simple (indeed standard in economics) nonlinear constrained discrete maximisation problem to solve in R and am having trouble. I found solutions for parts of the problem (nonlinear maximisation; discrete maximisation) but not for the combination of all of them.
Here is the problem. A consumer wants to buy three products (ananas, banana, cookie), knows the prices and has a budget of 20€. He likes variety (i.e., he wants to have all three products if possible) and his satisfaction is decreasing in the amount consumed (he likes his first cookie way more than his 100th).
The function he wishes to maximise is
U(N, M, K) = sqrt(N) * sqrt(M) * sqrt(K)
and of course, since each product has a price and his budget is limited, he maximises this function under the constraint that
2.17 * N + 0.75 * M + 1.34 * K <= 20
What I want to do is to find the optimal buying list (N ananas, M bananas, K cookies) that satisfies the constraint.
If the problem were linear, I would simply use linprog::solveLP(). But the objective function is nonlinear.
If the problem were continuous, there would be a simple analytic solution to it.
The question being discrete and nonlinear, I do not know how to proceed.
Here is some toy data to play with.
df <- data.frame(rbind(c("ananas",2.17),c("banana",0.75),c("cookie",1.34)))
names(df) <- c("product","price")
I'd like to have an optimization routine that gives me an optimal buying list of (N,M,K).
Any hints?
1) no packages This can be done by brute force. Using df from the question as input, make sure that price is numeric (it is a factor in the question's df) and calculate an upper bound mx on the count of each product. Then create a grid g of all count combinations and compute the total price and the objective value of each, giving gg. Finally, sort gg in descending order of objective and keep the rows that satisfy the budget constraint; head() then shows the top few solutions.
price <- as.numeric(as.character(df$price))
mx <- ceiling(20/price)
g <- expand.grid(ana = 0:mx[1], ban = 0:mx[2], cook = 0:mx[3])
gg <- transform(g, total = as.matrix(g) %*% price, objective = sqrt(ana * ban * cook))
best <- subset(gg[order(-gg$objective), ], total <= 20)
giving:
> head(best) # 1st row is best soln, 2nd row is next best, etc.
ana ban cook total objective
1643 3 9 5 19.96 11.61895
1929 3 7 6 19.80 11.22497
1346 3 10 4 19.37 10.95445
1611 4 6 5 19.88 10.95445
1632 3 8 5 19.21 10.95445
1961 2 10 6 19.88 10.95445
2) dplyr This can also be nicely expressed using the dplyr package. Using g and price from above:
library(dplyr)
g %>%
mutate(total = c(as.matrix(g) %*% price), objective = sqrt(ana * ban * cook)) %>%
filter(total <= 20) %>%
arrange(desc(objective)) %>%
top_n(6)
giving:
Selecting by objective
ana ban cook total objective
1 3 9 5 19.96 11.61895
2 3 7 6 19.80 11.22497
3 3 10 4 19.37 10.95445
4 4 6 5 19.88 10.95445
5 3 8 5 19.21 10.95445
6 2 10 6 19.88 10.95445
If you do not mind using a "by hand" solution:
uf=function(x)prod(x)^.5
bf=function(x,pr){
if(!is.null(dim(x)))apply(x,1,bf,pr) else x%*%pr
}
budget=20
df <- data.frame(product=c("ananas","banana","cookie"),
price=c(2.17,0.75,1.34),stringsAsFactors = F)
an=0:(budget/df$price[1]) #include 0 for all possibilities
bn=0:(budget/df$price[2])
co=0:(budget/df$price[3])
X=expand.grid(an,bn,co)
colnames(X)=df$product
EX=apply(X,1,bf,pr=df$price)
psX=X[which(EX<=budget),] #1st restrict
psX=psX[apply(psX,1,function(z)sum(z==0))==0,] #2nd restrict
Ux=apply(psX,1,uf)
cbind(psX,Ux)
(sol=psX[which.max(Ux),])
uf(sol) # utility
bf(sol,df$price) #budget
> (sol=psX[which.max(Ux),])
ananas banana cookie
1444 3 9 5
> uf(sol) # utility
[1] 11.61895
> bf(sol,df$price) #budget
1444
19.96
I think this problem is very similar in nature to this question (Solve indeterminate equation system in R). The answer by Richie Cotton was the basis to this possible solution:
df <- data.frame(product=c("ananas","banana","cookie"),
price=c(2.17,0.75,1.34),stringsAsFactors = F)
FUN <- function(w, price=df$price){
total <- sum(price * w)
errs <- c((total-20)^2, -(sqrt(w[1]) * sqrt(w[2]) * sqrt(w[3])))
sum(errs)
}
init_w <- rep(10,3)
res <- optim(init_w, FUN, lower=rep(0,3), method="L-BFGS-B")
res
res$par # 3.140093 9.085182 5.085095
sum(res$par*df$price) # 20.44192
Notice that the total cost for this solution is 20.44, which exceeds the budget of 20. To address this, we can weight the error terms to put more emphasis on the 1st term, which relates to the total cost:
### weighting of error terms
FUN2 <- function(w, price=df$price){
total <- sum(price * w)
errs <- c(100*(total-20)^2, -(sqrt(w[1]) * sqrt(w[2]) * sqrt(w[3]))) # 1st term weighted by 100
sum(errs)
}
init_w <- rep(10,3)
res <- optim(init_w, FUN2, lower=rep(0,3), method="L-BFGS-B")
res
res$par # 3.072868 8.890832 4.976212
sum(res$par*df$price) # 20.00437
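optim returns a continuous solution, so it still has to be turned into whole units. One simple heuristic, sketched below, is to try every floor/ceiling combination of the continuous values and keep the best one that stays within budget. This is not guaranteed to find the true integer optimum in general; w_cont is simply the res$par value reported above, hard-coded so that the snippet runs on its own.

```r
price  <- c(2.17, 0.75, 1.34)
budget <- 20
w_cont <- c(3.072868, 8.890832, 4.976212)  # res$par reported above

# All 2^3 floor/ceiling combinations around the continuous solution
cand <- as.matrix(expand.grid(lapply(w_cont,
                                     function(w) c(floor(w), ceiling(w)))))
cost    <- drop(cand %*% price)        # total cost of each candidate
utility <- sqrt(apply(cand, 1, prod))  # sqrt(N * M * K)

# Keep candidates within budget and pick the one with the highest utility
feasible <- cost <= budget
best <- cand[feasible, , drop = FALSE][which.max(utility[feasible]), ]
best
sum(best * price)
```

Here the heuristic lands on 3 ananas, 9 bananas and 5 cookies, the same list that the brute-force answers report as best.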
As LyzandeR remarked, there is no nonlinear integer programming solver available in R. Instead, you can use the R package rneos, which sends data to one of the NEOS solvers and returns the results to your R process.
Select one of the solvers for "Mixed Integer Nonlinearly Constrained Optimization" on the NEOS Solvers page, e.g., Bonmin or Couenne. For your example above, send the following files in the AMPL modeling language to one of these solvers:
[Note that maximizing the product x1 * x2 * x3 is the same as maximizing the product sqrt(x1) * sqrt(x2) * sqrt(x3).]
Model file:
param p{i in 1..3};
var x{i in 1..3} integer >= 1;
maximize profit: x[1] * x[2] * x[3];
subject to restr: sum{i in 1..3} p[i] * x[i] <= 20;
Data file:
param p:= 1 2.17 2 0.75 3 1.34 ;
Command file:
solve;
display x;
and you will receive the following solution:
x [*] :=
1 3
2 9
3 5
;
This approach will work for more extended examples where solutions "by hand" are not reasonable and rounded optim solutions are not correct.
To look at a more demanding example, let me propose the following problem:
Find an integer vector x = (x_i), i=1,...,10, that maximizes x1 * ... * x10, such that p1*x1 + ... + p10*x10 <= 10, where p = (p_i), i=1,...,10, is the following price vector
p <- c(0.85, 0.22, 0.65, 0.73, 0.91, 0.11, 0.31, 0.47, 0.93, 0.71)
Using constrOptim for this nonlinear optimization problem with a linear inequality constraint, I get solutions such as 900 for different starting points, but never the optimal solution, which is 960!
I have a dataset of species and their rough locations in a 100 x 200 meter area. The location part of the data frame is not in a format that I find usable. In this 100 x 200 meter rectangle, there are two hundred 10 x 10 meter squares named A through CV. Within each 10 x 10 square there are four 5 x 5 meter squares named 1, 2, 3, and 4, respectively (1 is south of 2 and west of 3; 4 is east of 2 and north of 3). I want to let R know that A is the square with corners at (0,0), (10,0), (10,10), and (0,10), that B is just north of A and has corners at (0,10), (0,20), (10,10), and (10,20), that K is just east of A and has corners at (10,0), (10,10), (20,0), and (20,10), and so on for all the 10 x 10 meter squares. Additionally, I want to let R know where each 5 x 5 meter square is in the 100 x 200 meter plot.
So, my data frame looks something like this
10x10 5x5 Tree Diameter
A 1 tree1 4
B 1 tree2 4
C 4 tree3 6
D 3 tree4 2
E 3 tree5 3
F 2 tree6 7
G 1 tree7 12
H 2 tree8 1
I 2 tree9 2
J 3 tree10 8
K 4 tree11 3
L 1 tree12 7
M 2 tree13 5
Eventually, I want to be able to plot the 100 x 200 meter area and have each 10 x 10 meter square show up with the number of trees, the number of species, or the total biomass.
What is the best way to turn the data I have into spatial data that R can use for graphing and perhaps analysis?
Here's a start.
## set up a vector of all 10x10 position tags
tags10 <- c(LETTERS,
paste0("A",LETTERS),
paste0("B",LETTERS),
paste0("C",LETTERS[1:22]))
A function to convert (e.g.) {"J",3} to the center of the corresponding sub-square.
convpos <- function(pos10,pos5) {
## convert letters to major (x,y) positions
p1 <- as.numeric(factor(pos10,levels=tags10)) ## or use match()
## south-west corner of each 10x10 square: labels run north first
## (B is north of A), then east (K is east of A)
p1.x <- ((p1-1) %/% 10)*10 ## %/% is integer division
p1.y <- ((p1-1) %% 10) *10 ## %% is modulo operator
## sort out sub-positions (offset of the 5x5 centre from that corner)
p2.x <- ifelse(pos5 <=2,2.5,7.5) ## {1,2} vs {3,4} values
p2.y <- ifelse(pos5 %%2 ==1 ,2.5,7.5) ## odd {1,3} vs even {2,4} values
cbind(x=p1.x+p2.x,y=p1.y+p2.y) ## one row per input position
}
usage:
convpos("J",2)
convpos(mydata$tenbytenpos,mydata$fivebyfivepos)
Important notes:
this is a proof of concept, I can pretty much guarantee I haven't got the correspondence of x and y coordinates quite right. But you should be able to trace through this line-by-line and see what it's doing ...
it should work correctly on vectors (see second usage example above): I switched from switch to ifelse for that reason
your column names (10x10) are likely to get mangled into something like X10x10 when reading data into R: see the check.names argument in ?data.frame and ?read.table
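Once each tree is tied to a square label, counting trees per 10 x 10 square and drawing the result needs only base R. The sketch below makes several assumptions: the column names sq10, sq5 and Diameter are placeholders for the renamed columns, the toy rows are the ones shown in the question, and the 100 labels A to CV are assumed to fill a 10 x 10 block of squares running north first and then east.

```r
tags10 <- c(LETTERS, paste0("A", LETTERS), paste0("B", LETTERS),
            paste0("C", LETTERS[1:22]))

# Sample rows from the question, with syntactically valid column names
trees <- data.frame(
  sq10     = c("A","B","C","D","E","F","G","H","I","J","K","L","M"),
  sq5      = c(1, 1, 4, 3, 3, 2, 1, 2, 2, 3, 4, 1, 2),
  Diameter = c(4, 4, 6, 2, 3, 7, 12, 1, 2, 8, 3, 7, 5)
)

# Trees per 10x10 square, including squares with zero trees
counts <- table(factor(trees$sq10, levels = tags10))

# Column-major fill: rows index south -> north, columns index west -> east
grid <- matrix(as.vector(counts), nrow = 10, ncol = 10)

# image() expects z[i, j] at (x[i], y[j]), hence the transpose
x_mid <- seq(5, 95, by = 10)
y_mid <- seq(5, 95, by = 10)
image(x_mid, y_mid, t(grid), xlab = "east (m)", ylab = "north (m)")
text(rep(x_mid, each = 10), rep(y_mid, times = 10), labels = as.vector(grid))
```

Swapping table() for tapply() over Diameter (or a biomass column) would give totals per square in the same layout.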
Similar to what @Ben Bolker has done, here's a lookup function (though you may need to transpose something to make the labels match what you describe).
tenbyten <- c(LETTERS[1:26],
paste0("A",LETTERS[1:26]),
paste0("B",LETTERS[1:26]),
paste0("C",LETTERS[1:22]))
tenbyten <- matrix(rep(tenbyten, each = 2), ncol = 10)
tenbyten <- t(apply(tenbyten, 1, function(x){rep(x, each = 2)}))
# the 1234 squares
squares <- matrix(c(rep(c(1,2),10),rep(c(4,3),10)), nrow = 20, ncol = 20)
# stick together into a reference grid
my.grid <- matrix(paste(tenbyten, squares, sep = "-"), nrow = 20, ncol = 20)
# a lookup function for the site grid
coordLookup <- function(tbt, fbf, .my.grid = my.grid){
x <- col(.my.grid) * 5 - 2.5
y <- row(.my.grid) * 5 - 2.5
marker <- .my.grid == paste(tbt, fbf, sep = "-")
list(x = x[marker], y = y[marker])
}
coordLookup("BB",2)
$x
[1] 52.5
$y
[1] 37.5
If this isn't what you're looking for, then maybe you'd prefer a SpatialPolygonsDataFrame, which has proper polygon IDs and lets you attach data to the polygons. In that case, search for how to build one from scratch, and use the row() and col() functions to get your polygon corners, similar to what this lookup function does (it only returns centroids).
Edit: getting SPDF started:
This is modified from the function example and can hopefully be a good start:
library(sp)
# really you have a 20x20 grid, counting the small ones.
# GridTopology takes the centre of the first (smallest-coordinate) cell, the cell
# size, and the number of cells: 5 x 5 m cells covering 0-100 m in each direction
grd <- GridTopology(c(2.5,2.5), c(5,5), c(20,20))
polys <- as.SpatialPolygons.GridTopology(grd)
# get centroids
coords <- coordinates(polys)
# make SPDF, with an extra column for your grid codes, taken from the above.
# you can add further columns to this data.frame(), using polys@data
# NB: check that the polygon order matches as.vector(my.grid) before relying on my.ID
polys <- SpatialPolygonsDataFrame(polys,
data=data.frame(x=coords[,1], y=coords[,2], my.ID = as.vector(my.grid),
row.names=sapply(slot(polys, "polygons"), slot, "ID")))