Suppose I have a nice little data frame
df <- data.frame(x=seq(1,5),y=seq(5,1),z=c(1,2,3,2,1),a=c(1,1,1,2,2))
df
## x y z a
## 1 1 5 1 1
## 2 2 4 2 1
## 3 3 3 3 1
## 4 4 2 2 2
## 5 5 1 1 2
and I want to aggregate a part of it:
aggregate(cbind(x,z)~a,FUN=sum,data=df)
## a x z
## 1 1 6 6
## 2 2 9 3
How do I go about making it programmatic? I want to pass:
The list of variables to be aggregated cbind(x,z)
The grouping variable a (I will be using it in several other parts of the program, so passing the whole thing cbind(x,z)~a is not helpful)
The environment within which the things are happening
My starting point is
blah <- function(varlist,groupvar,df) {
# I kinda like to see what I am doing here
cat(paste0(deparse(substitute(varlist)),"~",deparse(substitute(groupvar))),"\n")
cat(is.data.frame(df),"\n")
cat(dim(df),"\n")
# but I really need to aggregate this
return( aggregate(eval(deparse(substitute(varlist))~deparse(substitute(groupvar)),df),
FUN=sum,data=df) )
}
and it works halfway:
blah(cbind(x,z),a,df)
## [1] "cbind(x, z)~a"
## TRUE
## 5 4
## Error in FUN(X[[i]], ...) : invalid 'type' (character) of argument
So I am kind of able to build the character representation of the formula that I need, but putting it into aggregate() fails.
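One way this could be made to work is to turn the deparsed text into a real formula object with as.formula() before handing it to aggregate(). What follows is only a minimal sketch under that assumption, reusing the blah name and the example data from above; it is not the only approach (reformulate() or passing column names as strings would be alternatives).

```r
df <- data.frame(x = seq(1, 5), y = seq(5, 1), z = c(1, 2, 3, 2, 1), a = c(1, 1, 1, 2, 2))

blah <- function(varlist, groupvar, df) {
  # capture the unevaluated arguments and turn them into text
  lhs <- paste(deparse(substitute(varlist)), collapse = "")
  rhs <- paste(deparse(substitute(groupvar)), collapse = "")
  # build an actual formula object from the text, e.g. cbind(x, z) ~ a
  f <- as.formula(paste(lhs, "~", rhs))
  # the variables in the formula are looked up in df
  aggregate(f, data = df, FUN = sum)
}

res <- blah(cbind(x, z), a, df)
res
```

The original attempt fails because deparse() returns character strings, so the formula ends up containing strings rather than column references; as.formula() parses the text back into symbols.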
This is my data frame:
time<-rep(c(1:5),4)
sim1<-rep(c(paste("sim",1)),5)
sim2<-rep(c(paste("sim",2)),5)
sim3<-rep(c(paste("sim",3)),5)
sim4<-rep(c(paste("sim",4)),5)
sim<-c(sim1,sim2,sim3,sim4)
id<-as.vector(replicate(4,sample(1:5)))
df<-data.frame(time,sim,id)
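# convert via factor() so this also works when sim is a character column (the default since R 4.0)
df$simnu<-as.numeric(factor(df$sim))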
Which should look something like this:
time sim id simnu
1 1 sim 1 1 1
2 2 sim 1 3 1
3 3 sim 1 2 1
4 4 sim 1 4 1
5 5 sim 1 5 1
6 1 sim 2 1 2
7 2 sim 2 5 2
8 3 sim 2 4 2
9 4 sim 2 2 2
10 5 sim 2 3 2
11 1 sim 3 2 3
12 2 sim 3 3 3
13 3 sim 3 4 3
14 4 sim 3 1 3
15 5 sim 3 5 3
16 1 sim 4 3 4
17 2 sim 4 5 4
18 3 sim 4 2 4
19 4 sim 4 1 4
20 5 sim 4 4 4
I have created this loop that subsets the data by simulation and then calculates the output I want:
surveillance<-5
n<-1
simsub<-df[which(df$simnu==1),names(df)%in%c("time","sim","id")]
while (n<=surveillance){
print (n)
rndid<-df[sample(nrow(simsub),1),]
print(rndid)
if(n<rndid$time){
n<-n+1
} else {
tinf<-sum(length(df[which(simsub$time<=n),1]))
prev<-tinf/length(simsub[,1])
print(paste(prev,"prevalence"))
break
}
}
My question is how do I run this loop for each simulation and return the values of this as a vector?
My suggestion is to take a look at the lapply function (or sapply and vapply) and to avoid using while. To be honest, it's a bit tricky to help without really knowing what your code is meant to do, but here is an example of how you can use lapply. Since I don't know what your code should return, I can't be sure that the output is correct.
I added comments and questions next to your original lines. Hope this helps.
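# my_df is the data frame from the question (called df there);
# surveillance is defined as in the question
my_df <- df
# first define a function that takes one simnu and returns whatever you want it to return
my_calc_fun <- function(sim_nr){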
## you can subset the DF without which, names, or %in%
# simres[[i]] <- my_df[which(my_df$simnu==i),names(my_df)%in%c("time","sim","id")]
sim_df <- my_df[my_df$simnu == sim_nr, c("time","sim","id")]
for(n in 1:surveillance){
## I'm not sure that is what you meant to do,
## you are sampling the full DF, but you want a sample
## from the subset i.e., simres[[i]]
# rndid<-my_df[sample(nrow(simres[[i]]),1),]
row_id <- sample(nrow(sim_df), 1)
rndid <- sim_df[row_id, ]
if(n >= rndid$time){
## what are you trying to sum here?
## because you are giving the function one number length(....)
## and just like above you are subsetting the full DF here
# tinf<-sum(length(my_df[which(simres[[i]]$time<=n),1]))
tinf <- length(sim_df[sim_df$time<=n, 1])
# is this the value you want to return for each simnu?
# nrow() counts the rows; length(sim_df["time"]) would count columns and always give 1
prev <- tinf/nrow(sim_df)
break
}
}
return(c('simnu'=sim_nr, 'prev' = prev))
}
# apply this function on all values of simnu and save to list
result_all <- lapply(unique(my_df$simnu), my_calc_fun)
result_all
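Since the question asks for a vector rather than a list, the same idea could be written with vapply. This is only a sketch: prev_for_sim is a hypothetical name, the data frame is rebuilt here with simnu created directly, the seed is fixed purely for reproducibility, and I am assuming the prevalence is the single value wanted per simulation.

```r
set.seed(1)  # assumption: fixed seed only so the sketch is reproducible
my_df <- data.frame(
  time  = rep(1:5, 4),
  simnu = rep(1:4, each = 5),
  id    = as.vector(replicate(4, sample(1:5)))
)
surveillance <- 5

# hypothetical helper: one prevalence value per simulation number
prev_for_sim <- function(sim_nr) {
  sim_df <- my_df[my_df$simnu == sim_nr, ]
  for (n in seq_len(surveillance)) {
    # draw one random row from this simulation's subset
    rndid <- sim_df[sample(nrow(sim_df), 1), ]
    if (n >= rndid$time) {
      # share of rows with time <= n
      return(sum(sim_df$time <= n) / nrow(sim_df))
    }
  }
  NA_real_  # nothing found within the surveillance period
}

# numeric(1) makes vapply check that every call returns a single number
prev_vec <- vapply(unique(my_df$simnu), prev_for_sim, numeric(1))
prev_vec
```

vapply returns a plain numeric vector with one element per simulation, and it fails loudly if the function ever returns something of a different shape.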
Sorry, I have a question about for loops.
Below are two different versions of a loop; my goal is to compute a factorial with a function that uses a for loop.
----------------------------------
Method 1
s<-function(input){
stu<-1
for(i in 1:input){
stu<-1*((1:input)[i])
}
return(stu)
}
----------------------------------------
Method 2
k <- function(input){
y <- 1
for(i in 1:input){
y <-y*((1:input)[i])
}
return(y)
}
But the result of Method 1 is
> s(1)
[1] 1
> s(4)
[1] 4
> s(8)
[1] 8
and the result of Method 2 is
> k(1)
[1] 1
> k(4)
[1] 24
> k(8)
[1] 40320
-------------------------------
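It's obvious that Method 2 is correct and Method 1 is incorrect. But why? What is the difference between the two? Why can't I use stu<-1*((1:input)[i]) instead of stu<-stu*((1:input)[i])?
It's because the variable stu is not accumulating within the for loop: every iteration overwrites stu with 1 times the current element, so the value from the previous iteration is thrown away and only the last element survives.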
s<-function(input){
stu<-1
for(i in 1:input){
stu<-1*((1:input)[i])
message(paste(i,stu,sep="\t"))
}
return(stu)
}
s(5)
1 1 # at the first loop, 1 x 1 is calculated
2 2 # at the 2nd loop, 1 x 2 is calculated
3 3 # at the 3rd loop, 1 x 3 is calculated
4 4 # at the 4th loop, 1 x 4 is calculated
5 5 # at the 5th loop, 1 x 5 is calculated
[1] 5
However, if you use stu<-stu*((1:input)[i]) instead of stu<-1*((1:input)[i]), the result is the following:
s(5)
1 1 # at the first loop, 1 x 1 is calculated.
2 2 # at the second loop, 1 x 2 is calculated.
3 6 # at the third loop, 2 x 3 is calculated.
4 24 # at the fourth loop, 6 x 4 is calculated.
5 120 # at the fifth loop, 24 x 5 is calculated.
[1] 120
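For comparison, here is a slightly tidier version of the accumulating loop next to the built-in alternatives. It is a sketch only; fact_loop is a hypothetical name. It iterates over seq_len(n) instead of indexing into 1:input, which also keeps the loop from running when n is 0.

```r
# hypothetical name; multiplies the running product by each i in turn
fact_loop <- function(n) {
  acc <- 1
  for (i in seq_len(n)) {  # seq_len(0) is empty, so fact_loop(0) stays at 1
    acc <- acc * i         # reuse the previous value of acc on every pass
  }
  acc
}

fact_loop(5)
prod(seq_len(5))  # same product without an explicit loop
factorial(5)      # base R function
```

The key point is the same as above: the variable on the left of the assignment also has to appear on the right, otherwise nothing carries over between iterations.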
Seems like this very simple maneuver used to work for me, and now it simply doesn't. A dummy version of the problem:
df <- data.frame(x = 1:5) # create simple dataframe
df
x
1 1
2 2
3 3
4 4
5 5
df$y <- c(1:5) # adding a new column with a vector of the exact same length. Works out like it should
df
x y
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
df$z <- c(1:4) # trying to add a new column, this time with a vector with fewer elements than there are rows in the dataframe.
Error in `$<-.data.frame`(`*tmp*`, "z", value = 1:4) :
replacement has 4 rows, data has 5
I was expecting this to work with the following result:
x y z
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 1
I.e. the shorter vector should just start repeating itself automatically. I'm pretty certain this used to work for me (it's in a script that I've been running a hundred times before without problems). Now I can't even get the above dummy example to work like I want to. What am I missing?
If the vector can be evenly recycled into the data.frame, you do not get an error or a warning:
df <- data.frame(x = 1:10)
df$z <- 1:5
This may be what you were experiencing before.
You can get your vector to fit as you mention with rep_len:
df$y <- rep_len(1:3, length.out=10)
This results in
df
x z y
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 1
5 5 5 2
6 6 1 3
7 7 2 1
8 8 3 2
9 9 4 3
10 10 5 1
Note that in place of rep_len, you could use the more common rep function:
df$y <- rep(1:3,len=10)
From the help file for rep:
rep.int and rep_len are faster simplified versions for two common cases. They are not generic.
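Applied to the five-row example from the question, the same idea could look like the sketch below. Using nrow(df) as the target length is my own choice here, so the code keeps working if the number of rows changes.

```r
df <- data.frame(x = 1:5)
df$y <- 1:5
# recycle the shorter vector explicitly to the number of rows
df$z <- rep_len(1:4, nrow(df))
df
```

This gives the z column the question was expecting, with the shorter vector starting over at 1 in the last row.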
If the total number of rows is a multiple of the length of your new vector, it works fine. When it is not, recycling does not work everywhere. In particular, you have probably used this type of recycling with matrices, where cbind only gives a warning:
data.frame(1:6, 1:3, 1:4) # not a multiple
# Error in data.frame(1:6, 1:3, 1:4) :
# arguments imply differing number of rows: 6, 3, 4
data.frame(1:6, 1:3) # a multiple
# X1.6 X1.3
# 1 1 1
# 2 2 2
# 3 3 3
# 4 4 1
# 5 5 2
# 6 6 3
cbind(1:6, 1:3, 1:4) # works even with not a multiple
# [,1] [,2] [,3]
# [1,] 1 1 1
# [2,] 2 2 2
# [3,] 3 3 3
# [4,] 4 1 4
# [5,] 5 2 1
# [6,] 6 3 2
# Warning message:
# In cbind(1:6, 1:3, 1:4) :
# number of rows of result is not a multiple of vector length (arg 3)
I am trying to use paste0 with merge, so that I can merge a bunch of stuff in a loop. However, I'm having trouble with calling specific columns from data.frames
To illustrate, I'll use head
Example:
df <- data.frame(x=1:10,y=1:10)
head(df)
x y
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
head(get("df"))
x y
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
head(df$x)
[1] 1 2 3 4 5 6
head(get("df$x"))
Error in get("df$x") : object 'df$x' not found
Is there a way to get a specific column?
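The function get looks up an object by name in an environment. If you do not specify the environment, the search starts in the environment get is called from (at top level, that is your global workspace). "df$x" is an expression, not the name of an object, which is why get cannot find it.
One option is to coerce df into an environment using as.environment, and then call get in that environment, e.g.:
get("x", as.environment(get("df")))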
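A simpler variant, in case it fits your loop better, is to fetch the data frame by name and then index the column with [[ ]]. This is a small sketch; df_name and col_name are hypothetical variables standing in for whatever paste0 produces in your code.

```r
df <- data.frame(x = 1:10, y = 1:10)

df_name  <- "df"  # hypothetical: e.g. built with paste0() in a loop
col_name <- "x"

# look up the data frame by name, then pick the column by name
col1 <- get(df_name)[[col_name]]

# the environment-based approach from above gives the same result
col2 <- get(col_name, as.environment(get(df_name)))

head(col1)
identical(col1, col2)
```

[[ ]] accepts a character string, so both the data frame name and the column name can be computed at run time.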
In R, in a vector (i.e. a 1-dim matrix), I would like to change components with value 3 to value 1, and components with value 4 to value 2. How shall I do that? Thanks!
The idiomatic R way is to use [<-, in the form
x[index] <- result
If you are dealing with integers / factors or character variables, then == will work reliably for the indexing,
x <- rep(1:5,3)
x[x==3] <- 1
x[x==4] <- 2
x
## [1] 1 2 1 2 5 1 2 1 2 5 1 2 1 2 5
The car package has a useful function recode (which is a wrapper for [<-) that lets you combine all the recoding in a single call, e.g.
library(car)
x <- rep(1:5,3)
xr <- recode(x, '3=1; 4=2')
x
## [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
xr
## [1] 1 2 1 2 5 1 2 1 2 5 1 2 1 2 5
Thanks to @joran for mentioning mapvalues from the plyr package, another wrapper for [<-
library(plyr)
x <- rep(1:5,3)
mapvalues(x, from = c(3,4), to = c(1,2))
plyr::revalue is a wrapper for mapvalues specifically for factor or character variables.
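If you would rather not load a package, a lookup with match() does all the recoding in one pass in base R. This is a sketch; from, to, idx and hit are hypothetical names of my own.

```r
x <- rep(1:5, 3)
from <- c(3, 4)  # values to replace
to   <- c(1, 2)  # their replacements, in the same order

# position of each element of x in 'from' (NA where there is no match)
idx <- match(x, from)
hit <- !is.na(idx)

xr <- x
xr[hit] <- to[idx[hit]]  # overwrite only the matched elements
xr
```

Because every element is matched against the original x, a replacement value can never be picked up again by a later rule, which can happen with a chain of separate x[x == a] <- b assignments when the old and new values overlap.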