Consider a minimal working example (e.g. for a binomial model):
test.a.tset <- rnorm(10)
test.b.tset <- rnorm(10)
c <- runif(10)
c[c < 0.5] <- 0
c[c >= 0.5] <- 1
df <- data.frame(test.a.tset,test.b.tset,c)
Using a regex, I want to regress c on all variables with the structure test."anything".tset:
summary(glm(paste("c ~ ",paste(colnames((df[, grep("test\\.\\w+\\.tset", colnames(df))])),
collapse = "+"), sep = ""), data = df, family=binomial))
So far, no problems. Now we get to the part where cbind comes into play. Suppose I want to use a different statistical model (e.g. rbprobitGibbs from the bayesm package), which requires a design matrix as input.
Thus, I need to transform the data frame into the appropriate format.
X <- cbind(df$test.a.tset,df$test.b.tset)
Or, alternatively, if I want to use regex again (where I even add a second grep to ensure that only the part inside the quotation marks is selected):
X2 <- cbind(grep("[^\"]+",paste(paste("df$", colnames((df[, grep("test\\.\\w+\\.tset", colnames(df))])),
sep = ""), collapse = ","), value = TRUE))
But there is a difference:
> X
[,1] [,2]
[1,] -0.4525601 -1.240484170
[2,] 0.3135625 1.240519383
[3,] -0.2883953 -0.554670224
[4,] -1.3696994 -1.373690426
[5,] 0.8514529 -0.063945537
[6,] -1.1804205 -0.314132743
[7,] -1.0161170 -0.001605679
[8,] 1.0072168 0.938921869
[9,] -0.8797069 -1.158626865
[10,] -0.9113297 1.641201924
> X2
[,1]
[1,] "df$test.a.tset,df$test.b.tset"
From my point of view, the problem seems to be that paste and grep return the selected value as a single character string, and that, while glm parses a string such as "c ~ test.a.tset+test.b.tset" as a formula, cbind does not evaluate the string "df$test.a.tset,df$test.b.tset" as code.
That is, after the paste, the call for X2 is actually read as:
X2 <- cbind("df$test.a.tset,df$test.b.tset")
Question: Is there a way to get the same result for X2 as for X using a regex?
The code grep("test\\.\\w+\\.tset", colnames(df)) returns the indices of the columns that match your pattern. To build a matrix from just those columns, you can use:
X3 <- as.matrix(df[,grep("test\\.\\w+\\.tset", colnames(df))])
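As a quick check, here is a minimal sketch (using its own simulated data) that compares the regex-based selection with the manual cbind. The variable name `keep`, the `^`/`$` anchors, `value = TRUE` and `drop = FALSE` are my additions and not part of the original answer; `drop = FALSE` only guards against the case where a single column matches.

```r
set.seed(42)
df <- data.frame(test.a.tset = rnorm(10),
                 test.b.tset = rnorm(10),
                 c = rbinom(10, 1, 0.5))

# Names (not indices) of the columns matching the pattern
keep <- grep("^test\\.\\w+\\.tset$", colnames(df), value = TRUE)

# Regex-based design matrix; drop = FALSE keeps a data frame even for one match
X3 <- as.matrix(df[, keep, drop = FALSE])

# Manual version from the question
X <- cbind(df$test.a.tset, df$test.b.tset)

# Same numbers; X3 additionally carries the column names
isTRUE(all.equal(unname(X3), X))
```

The only difference between the two objects is that X3 keeps the column names, which is usually harmless for a design matrix.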
I have created a list whose elements are themselves lists of matrices. I want to be able to extract the vector of observations for each variable.
p13 = 0.493;p43 = 0.325;p25 = 0.335;p35 = 0.574;p12 = 0.868
std_e2 = sqrt(1-p12^2)
std_e3 = sqrt(1-(p13^2+p43^2))
std_e5 = sqrt(1-(p25^2+p35^2+2*p25*p35*(p13*p12)))
set.seed(1234)
z1<-c(0,1)
z2<-c(0,1)
z3<-c(0,1)
z4<-c(0,1)
z5<-c(0,1)
s<-expand.grid(z1,z2,z3,z4,z5); s
s<-s[-1,];s
shift<-3
scenari<-s*shift;scenari
scenario_1<-scenari[1];scenario_1
genereting_fuction<-function(n){
sample<-list()
for (i in 1:nrow(scenario_1)){
X1=rnorm(n)+scenari[i,1]
X4=rnorm(n)+scenari[i,4]
X2=X1*p12+std_e2*rnorm(n)+scenari[i,2]
X3=X1*p13+X4*p43+std_e3*rnorm(n)+scenari[i,3]
X5=X2*p25+X3*p35+std_e5*rnorm(n)+scenari[i,5]
sample[[i]]=cbind(X1,X2,X3,X4,X5)
colnames(sample[[i]])<-c("X1","X2","X3","X4","X5")
}
sample
}
set.seed(123)
dati_fault<- lapply(rep(10, 100), genereting_fuction)
dati_fault[[1]]
[[1]]
X1 X2 X3 X4 X5
[1,] 2.505826 1.736593 1.0274581 -0.6038358 1.9967656
[2,] 4.127593 3.294344 2.8777777 1.2386725 3.0207723
[3,] 1.853050 1.312617 1.1875699 0.5994921 1.0471564
[4,] 4.481019 3.330629 2.1880050 -0.1087338 2.7331061
[5,] 3.916191 3.306036 0.7258404 -1.1388570 1.0293168
[6,] 3.335131 2.379439 1.2407679 0.3198553 1.6755424
[7,] 3.574675 3.769436 1.1084120 -1.0065481 2.0034434
[8,] 3.203620 2.842074 0.6550587 -0.8516120 -0.1433508
[9,] 2.552959 2.642094 2.5376430 2.0387860 3.5318055
[10,] 2.656474 1.607934 2.2760391 -1.3959822 1.0095796
I want to save only the elements of X1 in one object, and likewise for each of the other variables.
The code below gives you a list of matrices, each with one row per scenario and n columns.
genereting_fuction <- function(n, scenario, scenari){
# added argument because you assume global variable use
nr <- nrow(scenario)
sample <- vector("list", length = nr) # sample<-list()
# creating a list is better than expanding it each iteration
for (i in 1:nr){
X1=rnorm(n)+scenari[i,1]
X4=rnorm(n)+scenari[i,4]
X2=X1*p12+std_e2*rnorm(n)+scenari[i,2]
X3=X1*p13+X4*p43+std_e3*rnorm(n)+scenari[i,3]
X5=X2*p25+X3*p35+std_e5*rnorm(n)+scenari[i,5]
sample[[i]]=cbind(X1,X2,X3,X4,X5)
colnames(sample[[i]])<-c("X1","X2","X3","X4","X5")
}
sample
}
set.seed(123)
dati_fault<- lapply(rep(3, 2), function(x) genereting_fuction(x, scenario_1, scenari))
dati_fault
lapply(dati_fault, function(x) {
tmp <- lapply(x, function(y) y[,"X1"])
tmp <- do.call(rbind, tmp)
})
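The question also asks for the other variables. A minimal sketch of one way to repeat the extraction for every column is shown below. It uses a small stand-in structure (2 replicates, 4 scenarios, n = 3) rather than the simulation above, and the names `make_rep`, `dati` and `by_var` are hypothetical.

```r
set.seed(1)
vars <- c("X1", "X2", "X3", "X4", "X5")

# Stand-in generator: a list of n_scen matrices, each n x 5 with named columns
make_rep <- function(n, n_scen) {
  lapply(seq_len(n_scen), function(i) {
    matrix(rnorm(n * length(vars)), ncol = length(vars),
           dimnames = list(NULL, vars))
  })
}

# Two replicates of four scenarios with n = 3 observations each
dati <- lapply(1:2, function(r) make_rep(3, 4))

# For each variable: one matrix per replicate, scenarios in rows, n columns
by_var <- lapply(setNames(vars, vars), function(v) {
  lapply(dati, function(rep_i) {
    do.call(rbind, lapply(rep_i, function(m) m[, v]))
  })
})

dim(by_var$X1[[1]])  # scenarios x n
```

With this layout, `by_var$X1` holds everything for X1, `by_var$X2` everything for X2, and so on.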
If you want to assemble this list of matrices, e.g. with cbind, I suggest you simply use a single large n instead of the lapply with rep inside it.
Also, I bet there is an easier way to simulate this number of scenarios, but it's difficult to say without knowing the context of your code.
Finally, try to solve your issue with a minimal example: working with a list of 100 lists of 31 matrices of size 10 x 5 is a bit messy!
Good luck!
I'm trying to get a matrix of values using replicate(), but when the if() statement returns a NULL, that creates a list instead of a matrix. I've spent time reading web pages and Questions here but just can't seem to get something that works. I've tried variations on invisible and sink, but still haven't been able to get an output that doesn't return a NULL value.
Here is an example that gives a NULL value for the 5th entry.
How do I get the if() statement to not return anything, including NULL?
set.seed(10)
reps <- 10
f_myfun2 <- function(i){
x1 <- rep(1:4, each=5)
x2 <- rep(1:5, times=4)
n <- length(x1)
y <- 0.20 + 0.30*x1 + 0.7*x2 + 0.50*rnorm(n)
cis <- confint(lm( y ~ x1*x2 ))
int_lower <- cis[4,1]
int_upper <- cis[4,2]
if(int_lower > 0 | int_upper < 0){
# I don't want to return anything, including a NULL value
# tried various things including invisible, sink, etc
}
else{
cis2 <- confint(lm( y ~ x1 + x2))
c(cis2[2,], cis2[3,])
}
}
sims <- replicate(reps, f_myfun2(1))
sims # [[5]] is NULL rather than just missing
str(sims) # now it's a list rather than a matrix without the NULL
is(sims)
If your final goal is to combine all of the results into one data frame or matrix, you don't have to worry about those NULL values: they are dropped automatically when you combine the list with rbind.
set.seed(10)
sims <- replicate(reps, f_myfun2(1))
result <- do.call(rbind, sims)
result
# 2.5 % 97.5 % 2.5 % 97.5 %
# [1,] 0.26428935 0.5953288 0.5766971 0.8384067
# [2,] 0.20815417 0.5763875 0.5750148 0.8661288
# [3,] 0.16616864 0.5437533 0.5694641 0.8679710
# [4,] 0.05366132 0.5055326 0.5076891 0.8649247
# [5,] 0.26292580 0.6246576 0.6441824 0.9301565
# [6,] 0.21173249 0.5972766 0.5297051 0.8345045
# [7,] -0.01499442 0.5077043 0.5399975 0.9532272
# [8,] 0.23639931 0.5871463 0.5270991 0.8043890
# [9,] -0.09686737 0.2529322 0.4378804 0.7144212
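Note that do.call needs a list, and replicate only returns a list here because at least one entry is NULL. A minimal sketch of a more defensive variant, assuming you are free to change the replicate call, is to pass `simplify = FALSE` so the result is always a list. The function `f_maybe_null` below is a hypothetical stand-in for f_myfun2, not the original function.

```r
set.seed(1)

# Hypothetical stand-in for f_myfun2: returns NULL for some draws
f_maybe_null <- function() {
  x <- rnorm(1)
  if (x > 0) {
    NULL
  } else {
    c(lower = x - 1, upper = x + 1)
  }
}

# simplify = FALSE always returns a list, with or without NULL entries
sims <- replicate(10, f_maybe_null(), simplify = FALSE)

# Drop the NULL entries explicitly, then stack the rest into a matrix
kept <- Filter(Negate(is.null), sims)
result <- do.call(rbind, kept)
result
```

The explicit Filter step is optional, since rbind ignores NULL arguments, but it makes it easy to count how many simulations were discarded via `length(sims) - length(kept)`.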
I am wondering whether there is a way to use apply-family function to evaluate two different dataframes at once? Or is there a better way to solve this problem? I can only think of a loop, and that is too slow:
# example data
df_model <- data.frame(DY = c(93,100,107), CC=rnorm(1:3, mean = 0.1))
df_data <- data.frame(DY = rep(c(93,100,107),each = 3), CC = c(rnorm(1:3),rnorm(1:3),rnorm(1:3)))
In this example, I would like to have a vector of three elements as output, processed as follows (shown here for the first case):
#example procedure case 1
collect <- matrix(0,ncol=3,nrow=3)
collect[1,] <- dnorm( df_data[which(df_data$DY == df_model$DY[1]),]$CC, df_model[1,]$CC, log=TRUE )
As input, I envisage a vector of the CC values in df_data, subset by the corresponding day DY (0.07624536 1.32623789 0.92921693), evaluated against the single value (0.00049671) of df_model for that same day DY.
In the end I would like to collect the vectors in a matrix (collect in the example) with one row per day in df_model$DY and three columns, containing the evaluation of df_data against df_model on day DY.
[,1] [,2] [,3]
[1,] -0.9218075 -1.7977334 -1.3501992
[2,] -0.9356356 -0.9850012 -1.1753341
[3,] -1.2152926 -0.9195071 -2.4127840
This needs to be done as efficiently as possible.
I can do it in a loop (above you see the first case for the loop), but I am sure there are better ways.
I looked into the apply function family, but I get confused, as I have two different dataframes which I evaluate. Any help/pointers would be much appreciated!
We can use mapply or Map:
mapply(function(x, y) dnorm(df_data$CC[df_data$DY == x], y,
log = TRUE), df_model$DY, df_model$CC)
Output:
# [,1] [,2] [,3]
#[1,] -1.5031401 -2.7449464 -1.734319
#[2,] -0.9237629 -0.9243094 -1.115875
#[3,] -4.9848319 -1.1494313 -1.187122
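One detail worth checking: mapply binds each result as a column, whereas the collect matrix in the question stores each day as a row. The sketch below, using freshly simulated data and the hypothetical names `res` and `collect_loop`, transposes the mapply result and builds the loop version for comparison.

```r
set.seed(1)
df_model <- data.frame(DY = c(93, 100, 107), CC = rnorm(3, mean = 0.1))
df_data  <- data.frame(DY = rep(c(93, 100, 107), each = 3), CC = rnorm(9))

# One column per model day (mapply simplifies equal-length results to a matrix)
res <- mapply(function(day, mu) {
  dnorm(df_data$CC[df_data$DY == day], mean = mu, log = TRUE)
}, df_model$DY, df_model$CC)

# Transpose so that each row corresponds to one day, as in the question
collect <- t(res)

# Loop version from the question, for comparison
collect_loop <- matrix(0, nrow = 3, ncol = 3)
for (i in seq_len(nrow(df_model))) {
  collect_loop[i, ] <- dnorm(df_data$CC[df_data$DY == df_model$DY[i]],
                             df_model$CC[i], log = TRUE)
}

isTRUE(all.equal(collect, collect_loop))
```

If the number of observations per day differs, mapply returns a list instead of a matrix, and Map (which always returns a list) is the more predictable choice.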
I have several columns of data in a matrix, and I want to use a for loop to run a function on each column (specifically a for loop; please, no answers that don't involve one).
x <- runif(10,0,10)
y <- runif(10,10,20)
z <- runif(10,20,30)
tab <- cbind(x,y,z)
x y z
[1,] 9.5262742 16.22999 21.93228
[2,] 5.8183264 14.53771 21.81774
[3,] 3.9509342 17.36694 22.46594
[4,] 3.0245614 19.46411 25.80411
[5,] 5.0284351 13.89636 21.61767
[6,] 3.0291715 17.50267 26.28110
[7,] 8.4727471 16.77365 27.60535
[8,] 3.3816903 15.23395 22.01265
[9,] 0.3182083 13.97575 29.25909
[10,] 2.6499290 16.71129 27.05160
for (i in 1:ncol(tab)){
print(mean(i))
}
I have almost no familiarity with R and have had trouble finding a solution that specifically uses a for loop to run a function and output a result per column.
Well, strictly using a for loop, I think this would do what you want!
x <- runif(10,0,10)
y <- runif(10,10,20)
z <- runif(10,20,30)
tab <- cbind(x,y,z)
for (i in 1:ncol(tab)){
print(mean(tab[, i]))
}
You need to index the matrix by using [row, column]. When you want to select all rows for a specific column (which is your case), just leave the row field empty. So that's why you have to use [, i], where i is your index.
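If you want to keep the results rather than just print them, a minimal sketch (still strictly a for loop) is to preallocate a vector and fill it by index. The name `col_means` is my own, and `seq_len(ncol(tab))` is used instead of `1:ncol(tab)` only because it also behaves sensibly for a matrix with zero columns.

```r
set.seed(1)
x <- runif(10, 0, 10)
y <- runif(10, 10, 20)
z <- runif(10, 20, 30)
tab <- cbind(x, y, z)

# Preallocate one slot per column and label it with the column names
col_means <- numeric(ncol(tab))
names(col_means) <- colnames(tab)

# tab[, i] selects all rows of column i
for (i in seq_len(ncol(tab))) {
  col_means[i] <- mean(tab[, i])
}

col_means
```

The same pattern works for any function that returns a single number per column; just replace mean with the function you need.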
EDIT: For anyone interested, I completed my little project here and it can be seen at this link http://fdrennan.net/pages/myCurve.html
Scroll down to "I think it's been generalized decently" to see the curve_fitter function. If you find it useful, steal it and I don't need credit. I still have ncol as an input but it isn't required anymore. I just didn't delete it.
I am writing a script that will do some least squares stuff for me. I'll be using it to fit curves but want to generalize it. I want to be able to write in "x, x^2" in a function and have it pasted into a matrix function and read. Here is what I am talking about.
expressionInput <- function(func = "A written function", x = "someData",
nCol = "ncol") {
# Should I coerce the string to something in order to make...
func <- as.SOMETHING?(func)
# ...this line to be equivalent to ...
A <- matrix(c(rep(1, length(x)), func), ncol = nCol)
# .... this line
# A <- matrix(c(rep(1, length(x)), x, x^2), ncol = 3)
A
}
expressionInput(func = "x, x^2", x = 1:10, nCol = 3)
This should return a 10 x 3 matrix with 1's in the first column, x in the second, and the squared values in the third.
The link below shows a few different functions for curve fitting. The idea behind this post is to be able to write in "x + x^2" or "x + sin(x)" or "e^x" etc., and get back the coefficients for the curve.
http://fdrennan.net/pages/myCurve.html
I think you are looking for something like this
f <- function(expr="", x=NULL, nCol=3) {
expr <- unlist(strsplit(expr,","))
ex <- rep("rep(1,length(x))", nCol)
ex[1:length(expr)] <- expr
sapply(1:length(ex), function(i) eval(parse(text=ex[i])))
}
f("x, x^2", 1:10, 3)
[,1] [,2] [,3]
[1,] 1 1 1
[2,] 2 4 1
[3,] 3 9 1
[4,] 4 16 1
[5,] 5 25 1
[6,] 6 36 1
[7,] 7 49 1
[8,] 8 64 1
[9,] 9 81 1
[10,] 10 100 1
Note that, in your example, you separate the expressions to evaluate with a comma (,). Accordingly, I have used a comma to split the string into expressions. If you pass expressions that themselves contain commas, this will fail. So either restrict yourself to simple expressions without commas or, if that is not possible, use a different character to separate the expressions (and escape it if you need that character to be evaluated).
However, I would also reiterate the warnings in the comments to your question that depending on what you are trying to achieve, there are likely better ways to do it.
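As one possible "better way", and only as a sketch under my own assumptions, R's formula interface can build this kind of design matrix without eval(parse()). The helper `design_from_string` below is hypothetical; it expects the terms separated by `+` (the form mentioned at the end of the question) and powers wrapped in I(), because `^` has a special meaning inside formulas.

```r
# Hypothetical helper: build a design matrix from a right-hand-side string
design_from_string <- function(rhs, x) {
  # as.formula turns "~ x + I(x^2)" into a one-sided formula
  f <- as.formula(paste("~", rhs))
  # model.matrix adds the intercept column of 1's automatically
  model.matrix(f, data = data.frame(x = x))
}

A <- design_from_string("x + I(x^2)", 1:10)
dim(A)       # 10 rows, 3 columns
colnames(A)  # "(Intercept)" "x" "I(x^2)"
```

Here the column of 1's comes first, which matches the layout described in the question. Other terms such as `sin(x)` or `exp(x)` can be written directly in the string, since function calls are evaluated as ordinary R code inside a formula.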