I have 2000 wheat plants, growing over the course of 40 days.
I'd like to perform the coeff function on each plant to find the coefficients of the quadratic equation the 3 time points make. (a, b, and c)
(1) The coef(lm(y~poly(x,2,raw=TRUE)) function works exactly the way I want it to.
(2) However, the way my data is presented, requires me to manually set x and y.
(3) Thus, I melted my data, and ordered it.
(4) I'd like to make a loop that will take the first three in column "Day" and set that as x. Then I'd like it to take the first three in column "Height" and set that as y.
Then I'd like to perform the coeff function.
Last I'd like it to present the coefficient outputs I need, preferably in a new data table.
Then repeat for every three rows, which represent each wheat ID, for all wheat plants.
1) This function works, giving me coefficients: a, b, c
x<-c(1,2,3)
y<-c(1,10,4)
coef(lm(y~poly(x,2,raw=TRUE)))
2) This is what my data originally looked like
A = matrix(c(5, 4, 2, 10, 10, 4, 5, 15, 6),nrow=3, ncol=3)
colnames(A)<-c("10", "25", "40")
rownames(A)<-c("Wheat 1", "Wheat 2", "Wheat 3")
A
3) This is my melted format
A.melted<-as.data.frame(melt(A, id.vars="ID"))
A.melted<-A.melted[with(A.melted,order(Var1)),]
colnames(A.melted) <- c("WheatID", "Day", "Height")
A.melted$Day<-as.numeric(as.character(A.melted$Day))
A.melted
#
4) This is what I am trying to do with my loop....
for every 3 rows,
x<-A.melted[,2]
y<-A.melted[,3]
coef(lm(y~poly(x,2,raw=TRUE)))
something to compile the coefficients: a, b, c
I am just not familiar with the syntax of loops, and I'd love any tips and suggestions. Perusing Google tells me that one should not do loops unless it is absolutely required since I may run into more problems- thus I am open to non loop techniques as well.
If you want to do it in a loop try this. The crucial part is to use seq together with a by = argument to let the index take the steps you need.
library(tibble)
df <- tibble(
WheatID = rep(NA_character_, nrow(A)),
Intercept = rep(NA_real_, nrow(A)),
poly1 = rep(NA_real_, nrow(A)),
poly2 = rep(NA_real_, nrow(A))
)
cnt <- 1
for (i in seq(1, nrow(A.melted), by = 3)) {
x <- A.melted$Day[i + 0:2]
y <- A.melted$Height[i + 0:2]
df$WheatID[cnt] <- as.character(A.melted$WheatID[i])
df[cnt, 2:4] <- coef(lm(y~poly(x,2,raw=TRUE)))
cnt <- cnt + 1
}
df
Note: I am not a data.table guy. Therefore, I present you with a tibble.
We can do this with the help of data.table, see ?data.table:
library(data.table)
A.models = A.melted[, model := list(.(lm(Height ~ poly(Day, 2),
data = list(.(.SD[WheatID == .BY[[1]]]))))),
by = WheatID]
A.models[, coefs := list(.(coefficients(model[[1]]))),
by = WheatID]
You can access each model like this:
A.models[WheatID == "Wheat 1", model[[1]]]
and even
A.models[WheatID == "Wheat 1", summary(model[[1]])]
The magic here happens because data.table takes in J expressions, not only functions.
This is something you can do with data.table package.
data.list <- split(A.melted, f = (1:nrow(A.melted) - 1) %/% 3)
coefs <- lapply(data.list, function(x) {
coefs <- coef(lm(Day ~ poly(Height, raw=TRUE), data = x))
data.table(
intercept = coefs[1],
poly.height = coefs[2]
)
})
coefs <- rbindlist(coefs)
Or you could perform apply() directly on the original matrix:
x <- as.numeric(colnames(A))
apply(A, 1, function(y) coef(lm(y~poly(x,2,raw=TRUE))))
Wheat 1 Wheat 2 Wheat 3
(Intercept) -3.88888889 -0.555555556 6.666667e-01
poly(x, 2, raw = TRUE)1 1.11111111 0.477777778 1.333333e-01
poly(x, 2, raw = TRUE)2 -0.02222222 -0.002222222 -2.417315e-18
Or you could transpose the data and use the coef(...) call directly:
x <- as.numeric(colnames(A))
coef(lm(t(A) ~ poly(x, 2, raw = TRUE)))
Related
I tried to create a matrix from a list which consists of N unequal matrices...
The reason to do this is to make R individual bootstrap samples.
In the example below you can find e.g. 2 companies, where we have 1 with 10 & 1 with just 5 observations.
Data:
set.seed(7)
Time <- c(10,5)
xv <- matrix(c(rnorm(10,5,2), rnorm(5,20,1), rnorm(10,5,2), rnorm(5,20,1)), ncol=2);
y <- matrix( c(rnorm(10,5,2), rnorm(5,20,1)));
z <- matrix(c(rnorm(10,5,2), rnorm(5,20,1), rnorm(10,5,2), rnorm(5,20,1)), ncol=2)
# create data frame of input variables which helps
# to conduct the rowise bootstrapping
data <- data.frame (y = y, xv = xv, z = z);
rows <- dim(data)[1];
cols <- dim(data)[2];
# create the index to sample from the different panels
cumTime <- c(0, cumsum (Time));
index <- findInterval (seq (1:rows), cumTime, left.open = TRUE);
# draw R individual bootstrap samples
bootList <- replicate(R = 5, list(), simplify=F);
bootList <- lapply (bootList, function(x) by (data, INDICES = index, FUN = function(x) dplyr::sample_n (tbl = x, size = dim(x)[1], replace = T)));
---------- UNLISTING ---------
Currently, I try do it incorrectly like this:
Example for just 1 entry of the list:
matrix(unlist(bootList[[1]], recursive = T), ncol = cols)
The desired output is just
bootList[[1]]
as a matrix.
Do you have an idea how to do this & if possible reasonably efficient?
The matrices are then processed in unfortunately slow MLE estimations...
i found a solution for you. From what i gather, you have a Dataframe containing all observations of all companies, which may have different panel lengths. And as a result you would like to have a Bootstap sample for each company of same size as the original panel length.
You mearly have to add a company indicator
data$company = c(rep(1, 10), rep(2, 5)) # this could even be a factor.
L1 = split(data, data$company)
L2 = lapply(L1, FUN = function(s) s[sample(x = 1:nrow(s), size = nrow(s), replace = TRUE),] )
stop here if you would like to have saperate bootstap samples e.g. in case you want to estimate seperately
bootdata = do.call(rbind, L2)
Best wishes,
Tim
I have a collection of sixth-degree polynomial regression models from which I want to gather only the coefficients.
I have a large dataset that contains 3 columns: the first one is an arbitrary parameter that acts as a flag, the second is the input, and third is the output.
I subsetted my dataset according to my parameters, so I have 10 smaller datasets. My models arose from these subsets.
As an example:
#-----"Dummy" Dataset-----
a = seq(1:100) #act as input
b = a + rnorm(n = 100, mean = 0, sd = 20) #act as output
df = as.data.frame(cbind(a,b))
colnames(df) = c("input", "output")
#
#-----Subsets-----
df_1_XlessThen50 = subset(df, x< 50) #example of subsetting. In this
#case I used the x values itself as threshold
#for subsetting just for simplicity.
#In reality, I use the first column of my dataframe(parameter).
df_2_XmoreThen50 = subset(df, x >= 50) #second subset. In other words,
#for every parameter, I will divide that subset
#into two smaller ones.
#
#-----Models-----
model_3_ab.1 = lm(output ~ poly(input, 6, raw = T), data = df_1_XlessThen50)
model_3_ab.2 = lm(output ~ poly(input, 6, raw = T), data = df_2_XmoreThen50)
My models's names follow a pattern: "model" + parameter + "_ab." + id number.
I should clarify that the "id number" indicates which of the two models for every parameter I will consider. (Theses smaller datasets within every parameter are the results of subsetting according to a pre-determined threshold.)
What I have now is a collection of models like these two above for every parameter in my dataset. I have 10 parameters, hence, 20 models.
I want to gather only the coefficients of every model and store them into a matrix or dataframe. To achieve that, I tried:
parameter = c(2,4,6,7,9,11,33,35,37,50)
myData = array()
for (i in parameter){ #Loop over all parameters
for (j in 1:2){ #Loop over the pair of models for each parameter
for ( k in 1:6){ #Loop over my model's coefficient
aux = paste("model",i,"ab.",j, sep = "")
aux = get(aux)
myData[i,j,k] = aux$coefficients[k]
}
}
}
However, I keep getting the same error:
Error in myData[i, j, k] = aux$coefficients[k] :
incorrect number of subscripts
With this error, I can't advance into my goal, which is to write a .txt with one single column formatted as such:
A(2,1,1) = first order coefficient for the first model related to parameter 2
B(2,2,1) = second order coefficient for the first model related to parameter 2
C(2,3,1)
...
G(2,7,1)
A(2,1,2)
where in (M, N, O): M is the parameter, N is the the coefficient of the N-th degree (N = 7 is the intercept), and O is either 1 or 2, respectively, the first or second model in each pair of models for every parameter.
It'd be nice to get help/guidance for the whole problem, but I'll already be grateful if I can get past the part where I want to store my coefficients in a matrix using for-loops. Thanks
Here is what I mean:
set.seed(42)
a1 = seq(1:100) #act as input
a2 <- runif(100)
b = a1 + a2 + rnorm(n = 100, mean = 0, sd = 20) #act as output
df = data.frame(input1 = a1,
input2 = a2,
output = b)
df$flag <- a1 <= 50
library(reshape2)
df <- melt(df, id.vars = c("output", "flag"))
library(lme4)
df$flag_par <- interaction(df$flag, df$variable)
fits <- lmList(output ~ poly(value, 2, raw = TRUE) | flag_par, data = df)
coef(fits)
# (Intercept) poly(value, 2, raw = TRUE)1 poly(value, 2, raw = TRUE)2
#FALSE.input1 125.957730 -2.434849 0.022137337
#TRUE.input1 2.842223 1.216113 -0.006686362
#FALSE.input2 68.807752 -7.429319 26.486493218
#TRUE.input2 31.791633 -18.595105 16.608600876
I know similar questions have been asked in this site here, here, and here, but none of them tackles my problem.
I've a data frame which I want to apply the rdirichlet function (from gtools) to each line. So, each line shall be consider as aplha.
data = NULL
data <- data.frame(rbind(
oct = c(60, 32, 8),
sep = c(53, 35, 12),
ago = c(54, 40, 6)
))
data <- data/100*1000
library(gtools) # contains the function
sim <- 10000 # simulation
My first attenpt was to use apply, it does work, but the output is not that clear for conducting further analysis; each row computation becomes a vector:
p = apply(data, 1, function(x) rdirichlet(sim, alpha = x + 1))
I also try in a loop without success:
p = NULL
for(i in 1:length(data)) {
p[i] <- rdirichlet(sim, alpha = data[i] + 1)
}
Any tip how can I solve this?
Well firstly you might want to change the data in your anonymous function in the apply to x to match the x in function(x)
apply(data, 1, function(x) rdirichlet(sim, alpha = x + 1))
This works for me, as in it provides an output with three columns and 30000 rows.
Two important things here. First, vectorizing is the best way to go:
ans <- apply(data, 1, function(x) rdirichlet(sim, alpha = x + 1))
By doing this, you'll receive each row computations as vector, essentially k vs sim like.
Then you'll need to subsample things like:
margin <- ans[1:100000,1] - ans[100001:200000,1]
This is a follow up question from R: t-test over all columns
Suppose I have a huge data set, and then I created numerous subsets based on certain conditions. The subsets should have the same number of columns. Then I want to do t-test on two subsets at a time (outer loop) and then for each combination of subsets go through all columns one column at a time (inner loop).
Here is what I have come up with based on previous answer. This one stops with an error.
C <- c("c1","c1","c1","c1","c1",
"c2","c2","c2","c2","c2",
"c3","c3","c3","c3","c3",
"c4","c4","c4","c4","c4",
"c5","c5","c5","c5","c5",
"c6","c6","c6","c6","c6",
"c7","c7","c7","c7","c7",
"c8","c8","c8","c8","c8",
"c9","c9","c9","c9","c9",
"c10","c10","c10","c10","c10")
X <- rnorm(n=50, mean = 10, sd = 5)
Y <- rnorm(n=50, mean = 15, sd = 6)
Z <- rnorm(n=50, mean = 20, sd = 5)
Data <- data.frame(C, X, Y, Z)
Data.c1 = subset(Data, C == "c1",select=X:Z)
Data.c2 = subset(Data, C == "c2",select=X:Z)
Data.c3 = subset(Data, C == "c3",select=X:Z)
Data.c4 = subset(Data, C == "c4",select=X:Z)
Data.c5 = subset(Data, C == "c5",select=X:Z)
Data.Subsets = c("Data.c1",
"Data.c2",
"Data.c3",
"Data.c4",
"Data.c5")
library(plyr)
combo1 <- combn(length(Data.Subsets),1)
adply(combo1, 1, function(x) {
combo2 <- combn(ncol(Data.Subsets[x]),2)
adply(combo2, 2, function(y) {
test <- t.test( Data.Subsets[x][, y[1]], Data.Subsets[x][, y[2]], na.rm=TRUE)
out <- data.frame("Subset" = rownames(Data.Subsets[x]),
, "Row" = colnames(x)[y[1]]
, "Column" = colnames(x[y[2]])
, "t.value" = round(test$statistic,3)
, "df"= test$parameter
, "p.value" = round(test$p.value, 3)
)
return(out)
} )
} )
First of all, you can more easily define you dataset using gl, and by avoiding creating individual variables for the columns.
Data <- data.frame(
C = gl(10, 5, labels = paste("c", 1:10, sep = "")),
X = rnorm(n = 50, mean = 10, sd = 5),
Y = rnorm(n = 50, mean = 15, sd = 6),
Z = rnorm(n = 50, mean = 20, sd = 5)
)
Convert this to "long" format using melt from the reshape package. (You can also use the base reshape function.)
longData <- melt(Data, id.vars = "C")
Now Use pairwise.t.test to compute t tests on all pairs of X/Y/Z for for each level of C.
with(longData, pairwise.t.test(value, interaction(C, variable)))
Note that it is important to use pairwise.t.test rather than just lots of individual calls to t.test because you need to adjust your p values if you run lots of tests. (See, e.g., xkcd for explanation.)
In general, pairwise t tests are inferior to a regression so be careful about their usage.
You can use get(Data.subset[x]) which will pick out the relevant data frame. But I don't think this should be necessary.
Explicitly subsetting that many times shoudn't be necessry either. You could create them using something like
conditions = c("c1", "c2", "c3", "c4", "c5")
dfs <- lapply(conditions, function(x){subset(Data, C==x, select=X:Z)})
That should (didn't test it) return a list of data frames each subseted on the various conditions you passed it.
However it would be a much better idea as #Richie Cotton points out, to reshape your data frame and use pairwise t tests.
I should point out that doing this many t-tests doesn't seem wise. Even after correction for multiple testing, be it FDR, permutation or otherwise. It would be better to try and figure out if you can use an anova of some sort as they are used for almost exactly this purpose.
I'm trying to run apply a function to each row of a dataset. The function looks up matching rows in a second dataset and computes a similarity score for the product details passed to it.
The function works if I just call it with test numbers but I can't figure out how to run it on all rows of my dataset. I've tried using apply but can't get it working.
I'm going to be iterating different parameter settings to find those that best fit historical data so speed is important... meaning that a loop is out. Any help you can provide would be hugely appreciated.
Thanks! Alan
GetDistanceTest <- function(SnapshotDate, Cand_Type, Cand_Height, Cand_Age) {
HeightParam <- 1/5000
AgeParam <- 1
Stock_SameType <- HistoricalStock[!is.na(HistoricalStock$date) & !is.na(HistoricalStock$Type) & as.character(HistoricalStock$date)==as.character(SnapshotDate) & HistoricalStock$Type==Cand_Type,]
Stock_SameType$ED <- (HeightParam*(Stock_SameType$Height - Cand_Height))^2 + (AgeParam*(Stock_SameType$Age - Cand_Age))^2
return(sqrt(sum(Stock_SameType$ED)))
}
HistoricalStock <- HistoricalAQStock[,c(1, 3, 4, 5)]
colnames(HistoricalStock) <- c("date", "Age", "Height", "Type")
Sales <- AllSales[,c(2,10,11,25)]
colnames(Sales) <- c("date", "Age", "Height", "Type")
GetDistanceTest("2010-04-01", 5261, 12, 7523) #works and returns a single number
res1 <- transform(Sales, ClusterScore=GetDistanceTest(date, Type, Height, Age))
# returns Error in `$<-.data.frame`(`*tmp*`, "ED", value = c(419776714.528591, 22321257.0276852, : replacement has 4060 rows, data has 54
# also 4 warnings, one for each variable. e.g. 1: In as.character(HistoricalStock$date) == as.character(SnapshotDate) : longer object length is not a multiple of shorter object length
res2 <- apply(Sales, 1, GetDistanceTest, Sales$Type, Sales$Height, Sales$Age)
# `$<-.data.frame`(`*tmp*`, "ED", value = c(419648071.041523, 22325941.2704261, : replacement has 4060 rows, data has 13
# also same 4 warnings as res1
I took some liberties with your code b/c I try to vectorize vice use loops whenever I can... With the merge function, you merge the two data frames, and operate on the "columns", which allows you to use the vectorization built into R. I think this will do what you want (in the second line I'm just making sure that A and B don't have the same values for height and age so that your distance isn't always zero):
A <- B <- data.frame(date=Sys.Date()-9:0, stock=letters[1:10], type=1:10, height=1:10, age=1:10)
B$height <- B$age <- 10:1
AB <- merge(x=A, y=B, by=c("date", "type"), suffixes=c(".A", ".B"))
height.param <- 1/5000
age.param <- 1
temp <- sqrt( height.param * (AB$height.A - AB$height.B)^2 + age.param * (AB$age.A - AB$age.B)^2 )
Use mapply, the multivariate form of apply:
res1 <- mapply(GetDistanceTest, Sales$date, Sales$Type, Sales$Height, Sales$Age)
Code as per above comment:
A <- data.frame(date=rep(Sys.Date()-9:0,100), id=letters[1:10], type=floor(runif(1000, 1, 10)), height=runif(1000, 1, 100), age=runif(1000, 1, 100))
B <- data.frame(date=rep(Sys.Date()-9:0,1000), type=floor(runif(10000, 1, 10)), height=runif(10000, 1, 10), age=runif(10000, 1, 10))
AB <- merge(x=A, y=B, by=c("date", "type"), suffixes=c(".A", ".B"))
height.param <- 1
age.param <- 1
AB$ClusterScore <- sqrt( height.param * (AB$height.A - AB$height.B)^2 + age.param * (AB$age.A - AB$age.B)^2 )
Scores <- ddply(AB, c("id"), function(df)sum(df$ClusterScore))