I am currently trying to write my first loop for lagged regressions on 30 variables. The variables are labeled rx1, rx2, ..., rx30, and the data frame is called my_num_data.
I have created a loop that looks like this:
z <- zoo(my_num_data)
for (i in 1:30)
{dyn$lm(my_num_data$rx[i] ~ lag(my_num_data$rx[i], 1)
+ lag(my_num_data$rx[i], 2))
}
But I received an error message:
Error in model.frame.default(formula = dyn(my_num_data$rx[i] ~ lag(my_num_data$rx[i], :
invalid type (NULL) for variable 'my_num_data$rx[i]'
Can anyone tell me what the problem is with the loop?
Thanks!
This produces a list, L, whose ith component has the name of the ith column of z and whose content is the regression of the ith column of z on its first two lags. Lag is the same as lag except that the sign of argument k is reversed.
library(dyn)
z <- zoo(anscombe) # test input using builtin data.frame anscombe
Lag <- function(x, k) lag(x, -k)
L <- lapply(as.list(z), function(x) dyn$lm(x ~ Lag(x, 1:2)))
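For readers who do not have dyn installed, here is a rough base-R sketch of the same idea. It is not the dyn approach above: it swaps in embed() to build the lagged columns by hand, and the helper name lag_fit is made up for illustration.

```r
# Base-R sketch: regress each column on its first two lags.
# embed(x, 3) returns a matrix whose columns are x[t], x[t-1], x[t-2].
lag_fit <- function(x) {
  e <- embed(x, 3)
  lm(e[, 1] ~ e[, 2] + e[, 3])
}

# Same test input as above: the built-in data.frame anscombe.
fits <- lapply(anscombe, lag_fit)
names(fits)      # one fitted model per column
coef(fits$x1)    # intercept plus the two lag coefficients
```

Because lapply() keeps the column names, fits$x1 (or fits[["x1"]]) picks out the model for a given variable, just as with L above.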
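First, dyn$lm() only works once the dyn package is loaded; if you meant the dynlm package instead, the function is dynlm(), without the $ character. Second, using $rx[i] doesn't concatenate rx and the contents of i; it selects the (single) element with index i of a column named rx, and since there is no such column the result is NULL, which is what the error message reports. Try this (I don't have your data, so I can't test it on my machine):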
results <- list()
for (i in 1:30) {
results[[i]] <- dynlm(my_num_data[,i] ~ lag(my_num_data[,i], 1)
+ lag(my_num_data[,i], 2))
}
and then list element results[[1]] will be the results from the first regression, and so on.
Note that this assumes your my_num_data data.frame ONLY consists of columns rx1, rx2, etc.
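To make the $rx[i] point concrete, here is a small sketch on a made-up two-column data frame (df and its values are invented for illustration, not your data):

```r
# Hypothetical stand-in for my_num_data with only two columns.
df <- data.frame(rx1 = 1:3, rx2 = 4:6)

# There is no column called "rx", and the partial match is ambiguous,
# so this is NULL rather than rx1 or rx2.
is.null(df$rx)

# Build the column name as a string and select with [[ ]] instead.
i <- 2
df[[paste0("rx", i)]]
```

Selecting by name this way also drops the assumption that the data frame contains only the rx columns.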
I am not super familiar with R, but it appears you are trying to increase the index of rx. Is rx a vector with values at different indices?
If not, the solution may be to build the column name as a string:
for (i in 1:30) {
  varName <- paste0("rx", i)           # build the column name, e.g. "rx1"
  y <- zoo(my_num_data[[varName]])     # select that column by its name
  dyn$lm(y ~ lag(y, -1) + lag(y, -2))  # for zoo series, negative k means past values
}
Again, I may be way off here, as this is my first post and R is still pretty new to me.
I am trying to analyse a dataframe using hierarchical clustering hclust function in R.
I would like to pass in a vector of p values I'll write beforehand (maybe something like c(5/4, 3/2, 7/4, 9/4)) and be able to have these specified as the different p value options with Minkowski distance when I use expand.grid. Ideally, when hyperparams is viewed, it would also be clear which value of p has been used for each minkowski, i.e. they should be labelled. So for example, where (if you run my code for hyperparams) there would currently just be one minkowski under Dists, for each of the methods in Meths, there would be, if I supplied the p vector as c(5/4, 3/2, 7/4, 9/4), now instead 4 rows for Minkowski distance: minkowski, p=5/4, minkowski, p=3/2, minkowski, p=7/4, minkowski, p=9/4 (or looking something like that, making the p values clear). Any ideas?
(Note: no packages please, only base R!)
Edit: I worded it poorly before, now rewritten. Let's take the following example instead:
acc <- function(x){
first = sum(x)
second = sum(x^2)
return(list(First=first,Second=second))
}
iris0 <- iris
iris1 <- cbind(log(iris[,1:4]),iris[5])
iris2 <- cbind(sqrt(iris[,1:4]),iris[5])
Now the important bit:
tests <- expand.grid(Dists=c("euclidean","maximum","manhattan","canberra","binary"),
DS=c("iris0","iris1","iris2"))
Table <- Map(function(x, ds){acc(table(ds$Species, cutree(hclust(dist(get(ds)[,1:4], method=x)),3)))},tests[[1]], tests[[2]])
This will work. But now if I want to include a term like "minkowski",p=3 in expand.grid, how would I do it?
tests <- expand.grid(Dists=c("euclidean","maximum","manhattan","canberra","binary","minkowski,p=3"),
DS=c("iris0","iris1","iris2"))
Table <- Map(function(x, ds){acc(table(ds$Species, cutree(hclust(dist(get(ds)[,1:4], method=x)),3)))},tests[[1]], tests[[2]])
This gives an error.
In reality there should be no p argument unless method="minkowski". I have tried to use strsplit to get the first part of the expression into the method argument, and a switch with strsplit to get the second part and then use parse (it would return NULL if the length of the strsplit result was not 2, which should pass no argument, I think). The issue seems to be that strsplit(x, ",") fails because the x it receives from the vectorized call is not a plain character string. Can anyone suggest any workaround/fix or other method for including the minkowski,p=1.6 terms and the like?
We can create a 'p' value column
tests <- expand.grid(Dists=c("euclidean","maximum","manhattan","canberra","binary",
"minkowski3", "minkowski4", "minkowski5"),
DS=c("iris0","iris1","iris2"))
We then add a column of 'p' values to 'tests' and pass it to dist along with the method:
tests$p <- as.list(args(dist))$p # default value
i1 <- grepl("minkowski", tests$Dists)
tests$Dists <- sub("[0-9.]+$", "", tests$Dists)
tests$p[i1] <- rep(3:5, length.out = sum(i1))
Map(function(x, ds, p){
dist1 <- dist(get(ds)[, 1:4], method = x, p = p)
ct <- cutree(hclust(dist1), 3)
acc(table(get(ds)$Species, ct))},
as.character(tests[[1]]), as.character(tests[[2]]), tests$p )
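If the p values should also be visible as labels, as the question asks, one possible base-R sketch is to keep a separate Label column next to the method and p columns. Everything named here (p_vals, datasets, dists, hyperparams, Label) is an assumption for illustration, and a named list of datasets is used in place of get().

```r
# Hypothetical inputs: the p values to try and the three datasets from the question.
p_vals   <- c(5/4, 3/2, 7/4, 9/4)
datasets <- list(iris0 = iris,
                 iris1 = cbind(log(iris[, 1:4]), iris[5]),
                 iris2 = cbind(sqrt(iris[, 1:4]), iris[5]))
base_dists <- c("euclidean", "maximum", "manhattan", "canberra", "binary")

# One row per distance setting; Label makes the p value visible.
dists <- data.frame(
  Label = c(base_dists, paste0("minkowski, p=", p_vals)),
  Dists = c(base_dists, rep("minkowski", length(p_vals))),
  p     = c(rep(2, length(base_dists)), p_vals),  # 2 is dist()'s default p
  stringsAsFactors = FALSE
)

# merge() with no shared columns returns the Cartesian product.
hyperparams <- merge(dists,
                     data.frame(DS = names(datasets), stringsAsFactors = FALSE))

# Cluster every combination; p is ignored by dist() unless method is "minkowski".
clusters <- Map(function(m, p, ds) {
  d <- dist(datasets[[ds]][, 1:4], method = m, p = p)
  cutree(hclust(d), 3)
}, hyperparams$Dists, hyperparams$p, hyperparams$DS)
names(clusters) <- paste(hyperparams$Label, hyperparams$DS, sep = " / ")
```

The labels show the p values as decimals (for example "minkowski, p=1.25"); the acc(table(...)) step from above can be applied to each element of clusters in the same way.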
I have a series of lines of code that replace the contents of an existing column based on the contents of another column (i.e. I am creating a categorical variable where the 'cut' function is not applicable). I am new to R and want to write a function that will perform this task on all data.frames without having to insert and customize 50 lines of code each time.
X is the data frame, Y is the categorical variable, and Z is the other (string) variable. This code works:
X$Y <- ""
X <- transform(X, Y=ifelse(Z=="Alameda",20,""))
... (many more lines)
For example I do:
d.f$loc <- ""
d.f <- transform(d.f, loc=ifelse(county=="Alameda",20,""))
# ... and so on
Now I want to do this for several dataframes and different columns instead of loc and county.
However, neither of these functions produces the desired results:
ab<-function(Y,Z,env=X) {
env$Y<-transform(env,Y=ifelse(Z=="Alameda",20,""))
...
}
abc<-function(X,Y,Z) {
X<-transform(X,Y=ifelse(Z=="Alameda",20,""))
...
}
Both of these functions run without error but do not alter the data frame X in any way. Am I doing something wrong in calling the environment or using a function within another function? It seems like a simple question and I would not post if I had not already spent 5+ hours trying to learn this. Thanks in advance!
R uses "call by value" for all objects. Only the return value goes back to the calling environment (see "parameter passing mechanism in R").
You can do
ab <- function(X, Y, Z) {
X <- transform(X, Y=ifelse(Z=="Alameda",20,""))
...
return(X)
}
If your dataframes are in a list L you can do lapply(L, ab) or, if needed, lapply(L, ab, Y=..., Z=...). As a result you will get a list of the modified dataframes. BTW: also have a look at with() and within(), e.g. X$Y <- with(X, ifelse(Z=="Alameda",20,""))
Implicitly returning the value
There is no need for an explicit call of return(...). You can return implicitly, relying on the fact that a function returns the value of its last evaluated expression:
ab <- function(X, Y, Z) {
X <- transform(X, Y=ifelse(Z=="Alameda",20,""))
...
X ### <<<<< last expression
}
Here is an example of how you can do it for your situation:
ab <- function(X, Y, Z) {
X[, Y] <- ifelse(X[,Z]>12,20,99)
# ...
X ### <<<<< last expression
}
B <- BOD # BOD is one of the dataframes which come with R
ab(B, "loc", "demand")
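As a sketch of how the many ifelse lines could collapse into one reusable function, a named lookup vector can carry the codes. The names recode_col and codes, the second county and its code 21, and the second data frame are all made up for illustration; note that unmatched values come back as NA here rather than "".

```r
# Hypothetical lookup table: county name -> category code.
codes <- c(Alameda = 20, Alpine = 21)

# X: data frame, Y: name of the new column, Z: name of the source column.
recode_col <- function(X, Y, Z, codes) {
  X[[Y]] <- unname(codes[as.character(X[[Z]])])  # NA where no code is defined
  X                                              # return the modified copy
}

d.f  <- data.frame(county = c("Alameda", "Alpine", "Other"),
                   stringsAsFactors = FALSE)
d.f2 <- data.frame(county = c("Alpine", "Alameda"),
                   stringsAsFactors = FALSE)

# One data frame: assign the result back, since R does not modify X in place.
d.f <- recode_col(d.f, "loc", "county", codes)

# Several data frames held in a list.
all_dfs <- lapply(list(first = d.f, second = d.f2), recode_col,
                  Y = "loc", Z = "county", codes = codes)
```

Assigning the result back (d.f <- recode_col(...)) is the step the functions in the question were missing.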
I am normally a Maple user currently working with R, and I have a problem with correctly indexing variables.
Say I want to define 2 vectors, v1 and v2, and I want to call the nth element in v1. In maple this is easily done:
v[1]:=some vector,
and the nth element is then called by the command
v[1][n].
How can this be done in R? The actual problem is as follows:
I have a sequence M (say of length 10, indexed by k) of simulated negbin variables. For each of these simulated variables I want to construct a vector X of length M[k] with entries given by some formula. So I should end up with 10 different vectors, each of different length. My incorrect code looks like this
sims<-10
M<-rnegbin(sims, eks_2016_kasko*exp(-2.17173), 840.1746)
for(k in 1:sims){
x[k]<-rep(NA,M[k])
X[k]<-rep(NA,M[k])
for(i in 1:M[k]){x[k][i]<-runif(1,min=0,max=1)
if(x[k][i]>=0 & x[i]<=0.1056379){
X[k][i]<-rlnorm(1, 6.228244, 0.3565041)}
else{
X[k][i]<-rlnorm(1, 8.910837, 1.1890874)
}
}
}
The error appears to be that x[k] is not a valid name for a variable. Any way to make this work?
Thanks a lot :)
I've edited your R script slightly to get it working and make it reproducible. To do this I had to assume that eks_2016_kasko was an integer value of 10.
require(MASS)
sims<-10
# Because R is not zero-indexed, add one (a count of 0 would break the 1:M[k] loop)
M<-rnegbin(sims, 10*exp(-2.17173), 840.1746) + 1
# Create a list
x <- list()
X <- list()
for(k in 1:sims){
x[[k]]<-rep(NA,M[k])
X[[k]]<-rep(NA,M[k])
for(i in 1:M[k]){
x[[k]][i]<-runif(1,min=0,max=1)
if(x[[k]][i]>=0 & x[[k]][i]<=0.1056379){
X[[k]][i]<-rlnorm(1, 6.228244, 0.3565041)}
else{
X[[k]][i]<-rlnorm(1, 8.910837, 1.1890874)
}
}
}
This will work, and I think it is what you were trying to do, BUT it is not great R code. I strongly recommend using the lapply family instead of for loops, and learning data.table and parallelisation if you need things to scale. Additionally, if you want to read more about indexing and subsetting in R, Hadley Wickham has a comprehensive breakdown here.
Hope this helps!
Let me start with a few remarks and then show you, how your problem can be solved using R.
In R, there is most of the time no need to use a for loop in order to assign several values to a vector. So, for example, to fill a vector of length 100 with uniformly distributed random variables, you do something like:
set.seed(1234)
x1 <- rep(NA, 100)
for (i in 1:100) {
x1[i] <- runif(1, 0, 1)
}
(set.seed() is used to set the random seed, such that you get the same result each time.) It is much simpler (and also much faster) to do this instead:
set.seed(1234)  # reset the seed so both versions start from the same state
x2 <- runif(100, 0, 1)
identical(x1, x2)
## [1] TRUE
As you see, results are identical.
The reason that x[k]<-rep(NA,M[k]) does not work is that indeed x[k] is not a valid variable name in R. [ is used for indexing, so x[k] refers to element k of a vector x. Since you try to assign a vector of length larger than 1 to a single element, R complains (with an error if x does not exist yet, otherwise with a warning that the lengths do not match). What you probably want to use is a list, as you will see in the example below.
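As a minimal sketch of the list idea, here is the closest R counterpart to Maple's v[1][n] (the vectors are made up for illustration):

```r
# A list can hold vectors of different lengths.
v <- list()
v[[1]] <- c(10, 20, 30)
v[[2]] <- c(1, 2)

# [[ ]] extracts the whole vector, a following [ ] indexes into it.
v[[1]][2]

# Length of each stored vector.
lengths(v)
```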
So here comes the code that I would use instead of what you proposed in your post. Note that I am not sure that I correctly understood what you intend to do, so I will also describe below what the code does. Let me know if this fits your intentions.
# define M
library(MASS)
eks_2016_kasko <- 486689.1
sims<-10
M<-rnegbin(sims, eks_2016_kasko*exp(-2.17173), 840.1746)
# define the function that calculates X for a single value from M
calculate_X <- function(m) {
  x <- runif(m, min=0,max=1)
  # as in the loop from the question: x <= 0.1056379 draws from the first distribution
  X <- ifelse(x <= 0.1056379, rlnorm(m, 6.228244, 0.3565041),
              rlnorm(m, 8.910837, 1.1890874))
  X
}
# apply that function to each element of M
X <- lapply(M, calculate_X)
As you can see, there are no loops in that solution. I'll start to explain at the end:
lapply is used to apply a function (calculate_X) to each element of a list or vector (here it is the vector M). It returns a list. So, you can get, e.g. the third of the vectors with X[[3]] (note that [[ is used to extract elements from a list). And the contents of X[[3]] will be the result of calculate_X(M[3]).
The function calculate_X() does the following: It creates a vector of m uniformly distributed random values (remember that m runs over the elements of M) and stores that in x. Then it creates a vector X that contains log normally distributed random variables. The parameters of the distribution depend on the value x.
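A quick way to check the shape of the result is to run the same pattern on a few hand-picked sizes. This is only a sketch: the helper name draw_mixture and the sizes in M_demo are made up, and the threshold follows the loop in the question (values at or below 0.1056379 use the first set of parameters).

```r
# Stand-in for calculate_X(), written out so the example is self-contained.
draw_mixture <- function(m) {
  x <- runif(m, min = 0, max = 1)
  ifelse(x <= 0.1056379,
         rlnorm(m, 6.228244, 0.3565041),
         rlnorm(m, 8.910837, 1.1890874))
}

set.seed(1)
M_demo <- c(3, 0, 5)            # hypothetical counts, including a zero
X_demo <- lapply(M_demo, draw_mixture)

lengths(X_demo)                 # one vector per count, each of the requested length
```

A count of zero simply gives an empty vector here, whereas a for loop over 1:M[k] would iterate over 1 and 0 in that case.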
Ran a bunch of regressions and now I am trying to collect their p values and put them into a vector.
x=summary(reg2)$coefficients[4,4] #p value from the first regression, p-val is in row 4, col 4
for (i in 3:1000){
currentreg=summary(paste("reg",i,sep=""))
assign(x,c(x,currentreg$coefficients[4,4]))
}
I also tried eval(parse(currentreg)) and eval(parse(summary(paste("reg",i,sep="")))) with no luck. I always have this problem with telling R "Hey don't treat this as a string, treat it as a variable" and vice versa.
While it would be better to store the objects in a list and loop over that, you're asking for get:
currentreg <- summary(get(paste("reg", i, sep="")))
If you had a list of objects, models <- list(reg2, reg3, reg4, ...), you could then loop over this list with sapply to achieve the desired result (looping and collecting the results into a vector):
x <- sapply(models, function(z) { summary(z)$coefficients[4,4] })
You can use
sapply(mget(ls(pattern = "^reg\\d+$")), function(x) summary(x)$coefficients[4,4])
to create a vector with all p-values.
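To show the list-based route end to end, here is a small sketch on the built-in mtcars data. The choice of response, predictors, and the names preds, models and pvals are all assumptions for illustration; the only thing carried over from the question is that the p-value of interest sits in row 4 of the coefficient table.

```r
# Hypothetical setup: three regressions that differ in their third predictor.
preds  <- c("wt", "hp", "qsec")
models <- lapply(preds, function(v) {
  lm(reformulate(c("cyl", "disp", v), response = "mpg"), data = mtcars)
})
names(models) <- preds

# Row 4 is the third predictor (row 1 is the intercept);
# selecting the column by name avoids counting columns.
pvals <- sapply(models, function(m) summary(m)$coefficients[4, "Pr(>|t|)"])
pvals
```

Storing the models in a list from the start means no get(), assign() or paste() tricks are needed later.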
For example, I have a matrix k
> k
d e
a 1 3
b 2 4
I want to apply a function on k
> apply(k,MARGIN=1,function(p) {p+1})
a b
d 2 3
e 4 5
However, I also want to print the row name of the row being processed, so that I know which row the function is being applied to at that time.
It might look like this:
apply(k,MARGIN=1,function(p) {print(rowname(p)); p+1})
But I really don't know how to do that in R.
Does anyone have any idea?
Here's a neat solution to what I think you're asking. (I've called the input matrix mat rather than k for clarity - in this example, mat has 2 columns and 10 rows, and the rows are named abc1 through to abc10.)
In the code below, the result out1 is the thing you wanted to calculate (the outcome of the apply command). The result out2 comes out identically to out1 except that it prints out the rownames that it is working on (I put in a delay of 0.3 seconds per row so you can see it really does do this - take this out when you want the code to run full speed obviously!)
The trick I came up with was to cbind the row numbers (1 to n) onto the left of mat (to create a matrix with one additional column), and then use this to refer back to the rownames of mat. Note the line x = y[-1] which means that the actual calculation within the function (here, adding 1) ignores the first column of row numbers, which means it's the same as the calculation done for out1. Whatever sort of calculation you want to perform on the rows can be done this way - just pretend that y never existed, and formulate your desired calculation using x. Hope this helps.
set.seed(1234)
mat = as.matrix(data.frame(x = rpois(10,4), y = rpois(10,4)))
rownames(mat) = paste("abc", 1:nrow(mat), sep="")
out1 = apply(mat,1,function(x) {x+1})
out2 = apply(cbind(seq_len(nrow(mat)),mat),1,
function(y) {
x = y[-1]
cat("Doing row:",rownames(mat)[y[1]],"\n")
Sys.sleep(0.3)
x+1
}
)
identical(out1,out2)
You can use a variable outside of the apply call to keep track of the row index and pass the row names as an extra argument to your function:
idx <- 1
apply(k, 1, function(p, rn) {print(rn[idx]); idx <<- idx + 1; p + 1}, rownames(k))
This should work. The cat() function is what you want to use when printing results during evaluation of a function. paste(), conversely, just returns a character vector but doesn't send it to the command window.
The solution below uses a counter created as a closure, allowing it to "remember" how many times the function has been run before. Note the use of the superassignment operator <<-, which modifies i in the enclosing environment. If you really want to understand what's going on here, I recommend reading through this wiki https://github.com/hadley/devtools/wiki/
Note there may be an easier way to do this; my solution assumes that there is no way to access the rownumber or rowname of a current row using typical means within an apply function. As previously mentioned, this would be no problem in a loop.
k <- matrix(c(1,2,3,4),ncol=2)
rownames(k) <- c("a","b")
colnames(k) <- c("d","e")
make.counter <- function(x){
i <- 0
function(){
i <<- i+1
i
}
}
counter1 <- make.counter()
apply(k,MARGIN=1,function(p){
current.row <- rownames(k)[counter1()]
cat(current.row,"\n")
return(p+1)
})
As far as I know you cannot do that with apply, but you could loop through the rownames of your data frame. Lame example:
lapply(rownames(mtcars), function(x) sprintf('The mpg of %s is %s.', x, mtcars[x, 1]))
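Applied to the matrix from the question, the same idea could look like the sketch below: iterate over the row names and index the matrix by name (out is just an illustrative variable name).

```r
# The matrix k from the question.
k <- matrix(1:4, ncol = 2, dimnames = list(c("a", "b"), c("d", "e")))

# Loop over row names, so the name is available inside the function.
out <- sapply(rownames(k), function(rn) {
  cat("Doing row:", rn, "\n")
  k[rn, ] + 1
})
out
```

Like apply(k, 1, ...), sapply() returns one column per row of k, so the result has the same layout as the apply output shown in the question.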