Apply formula for between species comparison - r

I have a data frame laid out in the following manner:
Species Trait.p Trait.y Trait.z
a 20.1 7.2 14.1
b 20.4 8.3 15.2
b 19.2 6.8 13.9
For each species combination, I would like to compute (Xa) - (Xb), where X is the trait value, the letter is the species, and Xa > Xb. That is, the larger value in each species combination has to come first, and the difference should be calculated for every trait.
Would this be a multi-step process?
An example output could be
Combination Trait.p Trait.y Trait.z
a/b 0.3 1.1 1.1

I assumed you choose the largest value, but David brings up a good point. I doubt this is the best approach, but I think it gives you what you're after. Note that I added a c, as I'm sure your problem is a bit more complex than just a and b:
dat <- read.table(text="Species Trait.p Trait.y Trait.z
a 20.1 7.2 14.1
b 20.4 8.3 15.2
b 19.2 6.8 13.9
c 14.2 3.8 11.9", header=T)
li <- lapply(split(dat, dat$Species), function(x) apply(x[, -1], 2, max))
com <- expand.grid(names(li), names(li))
inds <- com[com[, 1] != com[, 2], ]
inds <- t(apply(inds, 1, sort))
inds <- inds[!duplicated(inds), ]
ans <- lapply(1:nrow(inds), function(i) {
abs(li[[inds[i, 1]]]-li[[inds[i, 2]]])
})
cbind(Combination = paste(inds[, 1], inds[, 2], sep="/"),
as.data.frame(do.call(rbind, ans)))
This gives us:
Combination Trait.p Trait.y Trait.z
1 a/b 0.3 1.1 1.1
2 a/c 5.9 3.4 2.2
3 b/c 6.2 4.5 3.3
Sorry for the lack of annotation but I'm heading to class.
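For readers who want the same result with annotation, here is a minimal sketch of an alternative that uses aggregate() and combn() from base R. It makes the same assumption as the answer above, namely that the per-species maximum is the value to compare. The names mx, pairs, diffs and res are only illustrative.

```r
dat <- read.table(text = "Species Trait.p Trait.y Trait.z
a 20.1 7.2 14.1
b 20.4 8.3 15.2
b 19.2 6.8 13.9
c 14.2 3.8 11.9", header = TRUE)

# Per-species maximum of every trait (one row per species).
# Assumption: the maximum is the value to compare, as in the answer above.
mx <- aggregate(. ~ Species, data = dat, FUN = max)

# All unordered species pairs, one pair per column.
pairs <- combn(as.character(mx$Species), 2)

# Absolute difference of the per-species maxima for each pair;
# abs() guarantees "larger minus smaller" for every trait.
diffs <- apply(pairs, 2, function(p) {
  abs(unlist(mx[mx$Species == p[1], -1]) - unlist(mx[mx$Species == p[2], -1]))
})

res <- data.frame(Combination = paste(pairs[1, ], pairs[2, ], sep = "/"),
                  t(diffs))
res
```

combn() produces each pair exactly once, so the expand.grid, sort and duplicated steps are not needed in this variant.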


Trying to use a variable as label in ggplots

I'm not sure what's going on here, but when I try to run ggplot, it tells me that u and u1 are not valid lists. Did I define u and u1 incorrectly so that it treats them as functions, did I forget something, or did I pass the wrong things to ggplot?
u1 <- function(x,y){max(utilityf1(x))}
utilityc1 <- data.frame("utilityc1" =
u(c(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,20),
c(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,20)))
utilityc1 <- data.frame("utilityc1" =
u1(c(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,20),
c(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,20)))
hhcomp <- data.frame(
pqx, pqy, utility, hours, p1qx, p1qy, utilit, utilityc1,
utilityc, u,u1, o, o1, o2
)
library(ggplot2)
ggplot(hhcomp, aes(x=utility, y=consumption))+
coord_cartesian(xlim = c(0, 16) )+
ylim(0,20)+
labs(x = "leisure(hours)",y="consumption(units)")+
geom_line(aes(x = u, y = consumption))+
geom_line(aes(x = u1, y = consumption))
I'm not sure what else to explain, so any advice on how to provide code on Stack Overflow would be useful. I'm also not sure how much description to include: I should have enough code to be reproducible, but Stack Overflow only allows so much code, so it would be good to know the right amount to add.
I think you may need to read the documentation for ggplot2 and maybe R in general.
data.frame
For starters, a data.frame object is a collection of vectors bound together column-wise. Most of what you have defined as inputs for hhcomp are functions, which cannot be stored in a data.frame. A canonical example of a data frame in R is iris:
head(iris)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1 5.1 3.5 1.4 0.2 setosa
#2 4.9 3.0 1.4 0.2 setosa
#3 4.7 3.2 1.3 0.2 setosa
#4 4.6 3.1 1.5 0.2 setosa
#5 5.0 3.6 1.4 0.2 setosa
#6 5.4 3.9 1.7 0.4 setosa
str(iris) #print the structure of an r object
#'data.frame': 150 obs. of 5 variables:
# $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
# $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
# $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
# $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
# $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
functions
There is a lot going on with your functions. Nested functions are fine, but it seems as though you are failing to pass all values on. This probably means you are relying on R's scoping rules to find them, which makes it ambiguous where the values come from.
With the currently defined functions, calling u(1:2,3:4) passes 1:2 to utilityf, but utilityf's y argument is never assigned (because of R's lazy evaluation, we hit a different error before R notices that this value is missing). The next function that gets evaluated in this nest is p1qyf, which is defined as follows:
p1qyf <- function(y){(w1*16)-(w1*x)}
With this definition, it does not matter what you pass to the argument y: it is never used, and the function always returns the same thing.
#with only the function defined
p1qyf()
#Error in p1qyf() : object 'w1' not found
#defining w1
w1 <- 1.5
p1qyf()
#Error in p1qyf() : object 'x' not found
x <- 10:20
#All variables defined in the function
#can now be found in the global environment
#thus the function can be called with no errors because
#w1 and x are defined somewhere...
p1qyf() #nothing assigned to y
#[1] 9.0 7.5 6.0 4.5 3.0 1.5 0.0 -1.5 -3.0 -4.5 -6.0
p1qyf(y = iris) #a data.frame assigned to y
#[1] 9.0 7.5 6.0 4.5 3.0 1.5 0.0 -1.5 -3.0 -4.5 -6.0
p1qyf(y = foo_bar) #an object that hasn't even been assigned yet
#[1] 9.0 7.5 6.0 4.5 3.0 1.5 0.0 -1.5 -3.0 -4.5 -6.0
I imagine you actually intended to define it this way:
p1qyf <- function(y){(w1*16)-(w1*y)}
#Now what we pass to it affects the output
p1qyf(1:10)
#[1] 22.5 21.0 19.5 18.0 16.5 15.0 13.5 12.0 10.5 9.0
head(p1qyf(iris))
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1 16.35 18.75 21.90 23.7 NA
#2 16.65 19.50 21.90 23.7 NA
#3 16.95 19.20 22.05 23.7 NA
#4 17.10 19.35 21.75 23.7 NA
#5 16.50 18.60 21.90 23.7 NA
#6 15.90 18.15 21.45 23.4 NA
You can improve this further by defining more arguments, so that R doesn't need to search for missing values with its scoping rules:
p1qyf <- function(y, w1 = 1.5){(w1*16)-(w1*y)}
#w1 is defaulted to 1.5 and doesn't need to be searched for.
I would spend some time looking into your functions because they are unclear and some, such as your p1qyf, do not fully use the arguments they are passed.
ggplot
ggplot takes a structured data object, such as a data.frame or tbl_df, and plots it. The aes mappings take the names of the columns you wish to map. Continuing with iris as an example:
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species))+
geom_point() +
geom_line()
I hope this helps clear up why you may be getting some errors. Honestly though, even if you were able to declare a data.frame, the problem remains that your post is still not that reproducible. Good luck.
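To make the data.frame point concrete, here is a minimal sketch that evaluates the budget lines first and stores only the resulting numeric vectors. The wage w = 1 is an assumption (the question never defines w), and the column names consumption and consumption1 are hypothetical; only w1 = 1.5 and the 16-hour budget come from the question. The plot is built only if ggplot2 is installed.

```r
hours <- 0:16
w  <- 1.0   # assumed wage for the first budget line (not given in the question)
w1 <- 1.5   # wage taken from the question

# Evaluate the formulas first, then store plain numeric vectors.
# A data.frame can hold these columns; it cannot hold the functions themselves.
hhcomp <- data.frame(
  hours        = hours,
  consumption  = w  * 16 - w  * hours,
  consumption1 = w1 * 16 - w1 * hours
)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # aes() refers to column names of hhcomp, not to functions.
  p <- ggplot(hhcomp, aes(x = hours)) +
    geom_line(aes(y = consumption)) +
    geom_line(aes(y = consumption1), linetype = "dashed") +
    labs(x = "leisure (hours)", y = "consumption (units)")
  # print(p) draws the plot in an interactive session.
}
```

Every column here is an ordinary numeric vector of the same length, which is what both data.frame() and aes() expect.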
pqxf <- function(x){(1)*(y)} # replace 1 with py and assign a value to py
pqyf <- function(y){(w * 16)-(w * x)} #
utilityf <- function(x, y) { (pqyf(x)) * ((pqxf(y)))} # the utility function C,l
hours <- c(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,20)
w1 <- 1.5
p1qxf <- function(x){(1)*(y)} # replace 1 with py and assign a value to p1y
p1qyf <- function(y){(w1 * 16)-(w1 * x)} #
utilityf1 <- function(x, y) { (p1qyf(x)) * ((p1qxf(y)))} # the utility function (C,l)
utilitycf <- function(x,y){max(utilityf(x))/((pqyf(y)))}
utilityc1f <- function(x,y){max(utilityf1(x))/((pqyf(y)))}
u <- function(x,y){max(utilityf(x))}
u1 <- function(x,y){max(utilityf1(x))}

Contingency tables after multiply imputed data with MICE in R

After imputation in R with the MICE package, I want to generate contingency tables. The fit shows the tables in a list, but if I pool() them, the following error is thrown: Error in pool(fit) : Object has no coef() method. What am I doing wrong?
This basic example reproduces the error:
library("mice")
imp <- mice(nhanes)
fit <- with(imp, table(bmi, hyp))
est <- pool(fit)
The function mice::pool(object) simply calculates estimates and standard errors for scalar estimands using "Rubin's rules". It relies on the fact that the estimates can usually be extracted with coef(object), and that the standard errors of these estimates are usually available from the diagonal of vcov(object). It is intended to be used with objects of classes like lm, which have coef and vcov methods neatly defined.
In your example, Rubin's rules do not apply. What are the "estimates" and "standard errors" of the entries in a contingency table? For this reason, pool complains that there is no method available for extracting the coefficients from your fit.
So if your "estimate" is simply supposed to be the "average" contingency table, try this:
library("mice")
imp <- mice(nhanes)
fit <- with(imp, table(bmi, hyp))
# pool(fit) is not called here: it fails because a table has no coef() method
# dimensions
nl <- length(fit$analyses)
nr <- nrow(fit$analyses[[1]])
nc <- ncol(fit$analyses[[1]])
# names
rnames <- rownames(fit$analyses[[1]])
cnames <- colnames(fit$analyses[[1]])
# cast list to array
fit.arr <- array(unlist(fit$analyses), dim=c(nr,nc,nl),
dimnames=list(rnames,cnames))
# get "mean" contingency table
apply(fit.arr, 1:2, mean)
# 1 2
# 20.4 1.8 0.0
# 21.7 1.4 0.0
# 22 1.4 0.2
# 22.5 1.8 0.4
# 22.7 1.2 0.4
# 24.9 1.2 0.0
# 25.5 1.0 1.6
# 26.3 0.0 1.0
# 27.2 0.4 1.0
# 27.4 1.4 0.4
# 27.5 1.6 0.2
# 28.7 0.0 1.0
# 29.6 1.0 0.2
# 30.1 1.8 0.2
# 33.2 1.0 0.0
# 35.3 1.2 0.2
Whether or not the "average" table is of any use, however, is probably debatable.
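The same element-wise average can also be sketched without casting to an array, by summing the tables with Reduce(). The example below uses simulated tables rather than mice output, so tabs, lev_x and lev_y are purely illustrative. It assumes, as the array approach does, that every table has identical dimensions, which is why the factor levels are fixed up front.

```r
set.seed(1)
lev_x <- c("low", "high")
lev_y <- c("no", "yes")

# Five simulated 2 x 2 tables standing in for fit$analyses.
# Fixing the factor levels guarantees identical dimensions across tables.
tabs <- lapply(1:5, function(i) {
  x <- factor(sample(lev_x, 25, replace = TRUE), levels = lev_x)
  y <- factor(sample(lev_y, 25, replace = TRUE), levels = lev_y)
  table(x, y)
})

# Element-wise sum of all tables, divided by the number of tables.
avg <- Reduce(`+`, tabs) / length(tabs)
avg
```

With real imputations, Reduce(`+`, fit$analyses) / length(fit$analyses) would be the analogous call, provided the tables really do share the same rows and columns.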

Estimating missing values in time-series data frame based on a "rate of change"

I am trying to use a loop in R to estimate values that will replace the NAs in my data frame based on a rate of change ("rate") that multiplies my last value (ok, this is confusing, but please refer to the example below). This is something similar to my data:
l1 <- c(NA,NA,NA,27,31,0.5)
l2 <- c(NA,8,12,28,39,0.5)
l3 <- c(NA,NA,NA,NA,39,0.3)
l4 <- c(NA,NA,11,15,31,0.2)
l5 <- c(NA,NA,NA,NA,51,0.9)
data <- as.data.frame(rbind(l1,l2,l3,l4,l5))
colnames(data) <- c("dbh1","dbh2","dbh3","dbh4","dbh5","rate")
So I created a loop to identify the first non-NA value in each row, then use that value to estimate the previous values based on the "rate". For instance, in row 1, the first NA value would be replaced by "27-(0.5*3)", the second one by "27-(0.5*2)" and the third one by "27-(0.5*1)". This is the loop I came up with. I know the first part (the outside loop) works, but the inside one doesn't:
for (i in 1: nrow(data)) {
dbh.cols <- data3[i,c("dbh1","dbh2","dbh3","dbh4","dbh5")]
sample.year <- which(dbh.cols != "NA")
data$first.dbh[i] <- min(dbh.cols, na.rm = T)
data$first.index[i] <- min(sample.year)
for (j on 1: (min(sample.year)-1)) {
ifelse(is.na(data[i,j]), min(dbh.cols, na.rm = T) - (min(sample.year)-j)*rate[i,j], data[i,j])
}
}
I am not good at programming so probably my internal loop strategy with "ifelse" is too weird (and wrong) but I just couldn't think of anything else that would work here... Any suggestions?
1) This uses no explicit loops, just an apply. It assumes that the NAs are all leading as in the example given.
fillIn <- function(x) {
rate <- tail(x, 1)
n <- sum(is.na(x)) # no of NAs
c(x[n+1] - rate * rev(seq_len(n)), na.omit(x)) # rev(seq_len(n)) is empty when a row has no NAs
}
replace(data, TRUE, t(apply(data, 1, fillIn)))
giving:
dbh1 dbh2 dbh3 dbh4 dbh5 rate
l1 25.5 26.0 26.5 27.0 31 0.5
l2 7.5 8.0 12.0 28.0 39 0.5
l3 37.8 38.1 38.4 38.7 39 0.3
l4 10.6 10.8 11.0 15.0 31 0.2
l5 47.4 48.3 49.2 50.1 51 0.9
2) Here is a second approach that uses na.approx from the zoo package. It does not require apply. Here data1 has the same content as data except that the first column is filled in. The other NAs remain. The last line uses na.approx to fill in the remaining NAs linearly.
library(zoo)
NAs <- rowSums(is.na(data))
data1 <- cbind( data[cbind(1:nrow(data), NAs + 1)] - data$rate * NAs, data[-1] )
replace(data, TRUE, t(na.approx(t(data1))))
giving:
dbh1 dbh2 dbh3 dbh4 dbh5 rate
l1 25.5 26.0 26.5 27.0 31 0.5
l2 7.5 8.0 12.0 28.0 39 0.5
l3 37.8 38.1 38.4 38.7 39 0.3
l4 10.6 10.8 11.0 15.0 31 0.2
l5 47.4 48.3 49.2 50.1 51 0.9
2a) A variation on (2) uses na.locf in the middle line to bring forward the first non-NA in each row. The first and last lines are the same.
library(zoo)
NAs <- rowSums(is.na(data))
data1 <- cbind(na.locf(t(data), fromLast = TRUE)[1, ] - data$rate * NAs, data[-1])
replace(data, TRUE, t(na.approx(t(data1))))
You do not need nested for loops for this. Here is some simplified code that does what you want. Working explicitly with your data, we first need to get the first non-NA value from each row.
for_estimate <- apply(data, 1, function(x) x[min(which(is.na(x) == FALSE))])
Secondly, we need to determine what integer to multiply the rate by for each row depending on how many NA values there are.
# total number of NA values per row
n_na <- apply(data,1, function(x) sum(is.na(x)) )
# make it a matrix, padded with 0s in the remaining columns
n_na <- matrix(c(n_na, rep(0, nrow(data) * (ncol(data)-1))),
nrow = nrow(data), ncol = ncol(data)-1)
# fill in the rest of the matrix
for(i in 2:ncol(n_na)){
n_na[,i] <- n_na[,i-1] -1
}
Once we have that, we can use this code to back-fill the NA values in the way you are interested in.
for(i in (ncol(data)-1):1){
if(sum(is.na(data[,i]))>0){
to_fill <- which(is.na(data[,i])==TRUE)
data[to_fill,i] <- for_estimate[to_fill] - (data$rate[to_fill]*n_na[to_fill,i])
}
}
output
dbh1 dbh2 dbh3 dbh4 dbh5 rate
l1 25.5 26.0 26.5 27.0 31 0.5
l2 7.5 8.0 12.0 28.0 39 0.5
l3 37.8 38.1 38.4 38.7 39 0.3
l4 10.6 10.8 11.0 15.0 31 0.2
l5 47.4 48.3 49.2 50.1 51 0.9
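The back-fill rule can also be sketched as a single matrix computation in base R. This is a minimal sketch under the same assumption as the answers above, namely that all NAs in a row are leading; the names m, first_idx, first_val, steps and fill are illustrative only.

```r
l1 <- c(NA, NA, NA, 27, 31, 0.5)
l2 <- c(NA, 8, 12, 28, 39, 0.5)
l3 <- c(NA, NA, NA, NA, 39, 0.3)
l4 <- c(NA, NA, 11, 15, 31, 0.2)
l5 <- c(NA, NA, NA, NA, 51, 0.9)
data <- as.data.frame(rbind(l1, l2, l3, l4, l5))
colnames(data) <- c("dbh1", "dbh2", "dbh3", "dbh4", "dbh5", "rate")

m <- as.matrix(data[, 1:5])

# Column index and value of the first observed measurement in each row.
first_idx <- apply(m, 1, function(x) which(!is.na(x))[1])
first_val <- m[cbind(seq_len(nrow(m)), first_idx)]

# Number of steps between each cell and the first observation in its row
# (positive for the leading cells that need filling).
steps <- first_idx - col(m)

# Candidate value for every cell: first value minus rate times steps.
fill <- first_val - data$rate * steps

# Overwrite only the missing cells (assumed to be leading NAs).
m[is.na(m)] <- fill[is.na(m)]
data[, 1:5] <- m
data
```

Because first_idx, first_val and data$rate each have one element per row, they recycle down the columns of the matrix, so each row uses its own first value and rate.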

Make a Consecutive row calculation in R

Suppose I have the following data:
a<- c(1:10)
b<- c(10:1)
Now I want to make a consecutive calculation (of variable length, in this example 2) on both vectors (a and b) and save the output in two separate lists (a and b).
The calculation should look like the following:
for a:
(1+2)/2; (2+3)/2; (3+4)/2;...; (9+10)/2
for b(the same):
(10+9)/2; (9+8)/2; (8+7)/2;...;(2+1)/2
a
1,5 2,5 3,5 ... 9,5
b
9,5 8,5 7,5 ... 1,5
I found this function in StackOverflow:
v <- c(1, 2, 3, 10, 20, 30)
grp <- 3
res <- sapply(1:(length(v)-grp+1),function(x){sum(v[x:(x+grp-1)])})
This pretty much does what I need, but I would prefer a function that does it without using sapply, in just base R.
Any help would be appreciated!
You can do this in base R:
f = function(x) (head(x,-1) + tail(x,-1))/2
list(a=f(a), b=f(b))
#$a
#[1] 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5
#$b
#[1] 9.5 8.5 7.5 6.5 5.5 4.5 3.5 2.5 1.5
Or if you want to use the apply family:
library(zoo)
list(a=rollapply(a,2, mean), b=rollapply(b,2, mean))
sapply is not really recommended here, but if you want to try it (just as a test):
sapply(1:(length(a)-1), function(i) mean(a[i:(i+1)]))
#[1] 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5
#same for b
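# stats::filter is named explicitly because dplyr::filter masks it when dplyr is loaded
na.omit(stats::filter(a, c(1,1))/2)
na.omit(stats::filter(b, c(1,1))/2)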
You could try this:
d1 <- ((a + a[seq(a)+1])/2)[-length(a)]
#[1] 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5
and
d2 <- ((b + b[seq(b)+1])/2)[-length(b)]
#[1] 9.5 8.5 7.5 6.5 5.5 4.5 3.5 2.5 1.5
The last part [-length(a)] and [-length(b)] removes NA entries at the end of the sequence.
If a and b have the same length:
# Pre-allocate the result vectors so they exist before the loop
list1 <- numeric(length(a) - 1)
list2 <- numeric(length(b) - 1)
for(i in 1:(length(a) - 1))
{
list1[i] <- (a[i] + a[i+1])/2
list2[i] <- (b[i] + b[i+1])/2
}
list1
#[1] 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5
list2
#[1] 9.5 8.5 7.5 6.5 5.5 4.5 3.5 2.5 1.5
Otherwise, write a separate loop for each vector:
list1 <- numeric(length(a) - 1)
for(i in 1:(length(a) - 1))
{
list1[i] <- ((a[i] + a[i+1])/2)
}
list2 <- numeric(length(b) - 1)
for(i in 1:(length(b) - 1))
{
list2[i] <- ((b[i] + b[i+1])/2)
}
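Since the question asks for a variable window length without sapply, here is a minimal sketch built on embed() from base R (the stats package), which lays out every run of k consecutive values as one row of a matrix. The helper name roll_mean is hypothetical.

```r
a <- 1:10
b <- 10:1

# Hypothetical helper: mean of every run of k consecutive values.
# embed(x, k) returns a matrix with one row per window of length k.
roll_mean <- function(x, k = 2) rowMeans(embed(x, k))

res <- list(a = roll_mean(a, 2), b = roll_mean(b, 2))
res

# The same idea with a window of 3 and sums instead of means,
# matching the sapply snippet quoted in the question.
v <- c(1, 2, 3, 10, 20, 30)
rowSums(embed(v, 3))
```

Changing k changes the window length, and the result always has length(x) - k + 1 elements.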

referencing an xts object with a matrix

I have a 207x7 xts object (called temp). I have a 207x3 matrix (called ac.topn), each row of which contains the columns I'd like from the corresponding row in the xts object.
For example, given the following top two rows of temp and ac.topn,
temp
v1 v2 v3 v4 v5 v6 v7
1997-09-30 14.5 8.7 -5.8 2.6 4.7 1.9 17.2
1997-10-31 6.0 -2.0 -25.7 2.9 4.9 9.6 8.4
head(ac.topn)
Rank1 Rank2 Rank3
1997-09-30 7 4 2
1997-10-31 6 5 7
I would like to get the result:
1997-09-30 17.2 2.6 8.7 (elements 7, 4, and 2 from the first row of temp)
1997-10-31 9.6 4.9 8.4 (elements 6, 5, 7 from the second row of temp)
My first attempt was temp[,ac.topn]. I've browsed for help, but am struggling to word my request effectively.
Thank you.
Well, this works, but I've got to think there's a better way...
result <- do.call(rbind,lapply(index(temp),function(i)temp[i,ac.topn[i]]))
colnames(result) <- colnames(ac.topn)
result
# Rank1 Rank2 Rank3
# 1997-09-30 17.2 2.6 8.7
# 1997-10-31 9.6 4.9 8.4
You may subset a matrix version of the xts object, using indexing via a numeric matrix:
m <- as.matrix(temp)
cols <- as.vector(ac.topn)
rows <- rep(1:nrow(ac.topn), ncol(ac.topn))
vals <- m[cbind(rows, cols)]
xts(x = matrix(vals, nrow = nrow(temp)), order.by = index(temp))
# [,1] [,2] [,3]
# 1997-09-30 17.2 2.6 8.7
# 1997-10-31 9.6 4.9 8.4
However, I say the same as @jlhoward: I've got to think there's a better way...
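The matrix-indexing step can be checked on the two sample rows from the question without loading xts. In this minimal sketch, temp_m and topn are plain matrices standing in for temp and ac.topn, and picked is an illustrative name for the result.

```r
# Plain-matrix stand-ins for the two sample rows shown in the question.
temp_m <- matrix(c(14.5,  8.7,  -5.8, 2.6, 4.7, 1.9, 17.2,
                    6.0, -2.0, -25.7, 2.9, 4.9, 9.6,  8.4),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("1997-09-30", "1997-10-31"),
                                 paste0("v", 1:7)))
topn <- matrix(c(7, 4, 2,
                 6, 5, 7),
               nrow = 2, byrow = TRUE,
               dimnames = list(rownames(temp_m), paste0("Rank", 1:3)))

# A two-column (row, column) index matrix picks one cell per entry of topn.
# row(topn) and topn are both flattened column by column, so they stay aligned.
picked <- topn
picked[] <- temp_m[cbind(c(row(topn)), c(topn))]
picked
```

Assigning into picked[] keeps the dimensions and the Rank1 to Rank3 column names of topn, so no separate reshaping or renaming step is needed.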
