I have datasets called example1, example2, example3 and example4 in the working library, each containing a variable SEX (1 or 2),
and I've created datasets called exampleS1, exampleS2, exampleS3 and exampleS4, restricted to SEX=1, using a SAS macro
like this:
%macro ms(var=);
data exampleS&var.;
set example&var.; IF SEX=1;
run;
%mend ms;
%ms(var=1);
%ms(var=2);
%ms(var=3);
%ms(var=4);
Now I want to do the same job in R.
I find this hard to do in R. How can I do it? (Assume example1, example2, example3 and example4 are data.frames.)
Thank you in advance.
Having variables with a numeric index in the name is a very SAS thing to do, and not at all R-like. If you have related data.frames in R, you keep them in a list. There are many ways to read many files into a list (see here). So say you have a list of data.frames
examples <- list(
data.frame(id=1:3, SEX=c(1,2,1)),
data.frame(id=4:6, SEX=c(1,1,2)),
data.frame(id=7:9, SEX=c(2,2,1))
)
Then you can get all the SEX=1 values with
exampleS <- lapply(examples, subset, SEX==1)
and you access them with
exampleS[[1]]
exampleS[[2]]
exampleS[[3]]
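If the data frames already exist as separate objects named example1 to example4, as in the question, one way to collect them into such a list is mget. The following is only a minimal sketch: the data are made up, and an explicit anonymous function is used for the filter instead of subset.

```r
# Made-up stand-ins for the question's example1..example4
example1 <- data.frame(id = 1:3,   SEX = c(1, 2, 1))
example2 <- data.frame(id = 4:6,   SEX = c(1, 1, 2))
example3 <- data.frame(id = 7:9,   SEX = c(2, 2, 1))
example4 <- data.frame(id = 10:12, SEX = c(2, 1, 1))

# Collect the existing objects by name into a named list
examples <- mget(paste0("example", 1:4))

# Keep only the SEX == 1 rows of every data frame
exampleS <- lapply(examples, function(d) d[d$SEX == 1, ])

# Roughly the counterpart of the SAS dataset exampleS2
exampleS[["example2"]]
```

Because the list is named, each filtered data frame can be reached by the name of its source, which plays the role of the numeric suffix in the SAS macro.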
You should program R the R way, not the SAS way, because the SAS way will lead to endless pain. The SAS macro language and R don't mix, in my opinion, but this is how you could do it:
# create example df's
for (i in 1:4) {
assign(paste0("example", i), data.frame(sex = sample(0:1, 10, replace = TRUE)))
}
example1; example2; example3; example4
# filter and store result in a list of df's
l <- list(example1 = example1, example2 = example2, example3 = example3, example4 = example4)
want <- lapply(l, function(x) subset(x, sex == 1))
want$example1; want$example2; want$example3; want$example4 # access the data frames in the list
# keeping them in a list like this is almost certainly what you should do
# overwriting the originals in the global environment is possible in principle, but I advise against it
list2env(lapply(l, function(x) subset(x, sex == 1)), .GlobalEnv)
example1; example2; example3; example4
Incredibly basic question. I'm brand new to R. I feel bad for asking, but also like someone will crush it:
I'm trying to generate a number of vectors with a for loop. Each with an unique name, numbered by iteration. The code I'm attaching throws an error, but I think it explains what I'm trying to do in principle fairly well.
Thanks in advance.
vectorBuilder <- function(num){
for (x in num){
paste0("vec",x) <- rnorm(10000, mean = 0, sd = 1)}
}
numSeries <- 1:10
vectorBuilder(numSeries)
You can write the function to return a named list :
create_vector <- function(n) {
setNames(replicate(n, rnorm(10000), simplify = FALSE),
paste0('vec', seq_len(n)))
}
and call it as :
data <- create_vector(10)
data will be a list of length 10, with each element a vector of length 10000. It is better to keep the data in this list than to create lots of vectors in the global environment. However, if you still want separate vectors, you can use list2env:
list2env(data, .GlobalEnv)
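Keeping the vectors in the named list also keeps later steps short. A small sketch reusing create_vector from above; the set.seed call and the means variable are only there for illustration:

```r
set.seed(1)  # only to make the sketch reproducible

create_vector <- function(n) {
  setNames(replicate(n, rnorm(10000), simplify = FALSE),
           paste0("vec", seq_len(n)))
}
data <- create_vector(10)

# Access a single vector by its name
third <- data[["vec3"]]

# Apply one function to every vector; the result keeps the names vec1..vec10
means <- sapply(data, mean)
```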
I am trying to build a list of values computed from several data frames called kc2 to kc10. Can anyone give me some advice on how to write this as a for loop? For a single data frame the step looks like this:
sum_square=append(sum_square,weighted.mean(x=kc2$withinss,w=kc2$size, na.rm=TRUE))
I tried something like this, but it didn't work:
for (i in 2:10){
nam1 = paste0("kc",i,"$withinss")
nam2 = paste0("kc",i,"$size")
sum_square = append(sum_square, lapply(c(as.numeric(nam1),as.numeric(nam2)), weighted.mean))
}
There are a lot of problems with the code you posted, so I'll just cut right to the point. In R, when you want to apply a function to multiple objects and collect the result, you should be thinking of using lapply. lapply loops through a list of objects (you can put your data frames into a list), applies the chosen function to each, and then returns the result of each as a list. The below code is in the form of what you want:
# Add data frames to list by name
list_of_data_frames <- list(kc2, kc3, kc4, kc5, kc6, kc7, kc8, kc9, kc10)
# OR add them programmatically
list_of_data_frames <- mget(paste0('kc', seq.int(from = 2, to = 10)))
result <- lapply(list_of_data_frames,
function(x) weighted.mean(x = x$withinss, w = x$size, na.rm=TRUE))
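For a version that runs end to end, here is a hedged sketch. It assumes that kc2 to kc10 are kmeans results (withinss and size are components of a kmeans fit), which the question does not state. The iris data and the fits list are placeholders, and sapply is swapped in for lapply so that the result is a plain numeric vector.

```r
set.seed(42)                 # kmeans starts from random centres
x <- as.matrix(iris[, 1:4])  # built-in data, used only as a placeholder

# Hypothetical stand-ins for kc2 ... kc10: one kmeans fit per cluster count
fits <- lapply(2:10, function(k) kmeans(x, centers = k))
names(fits) <- paste0("kc", 2:10)

# Size-weighted mean of the within-cluster sums of squares, one value per fit
sum_square <- sapply(fits, function(fit)
  weighted.mean(x = fit$withinss, w = fit$size, na.rm = TRUE))
```

Creating the fits directly inside a list avoids the separate kc2, kc3, ... objects, so no mget step is needed.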
Since I am fairly new to R, I have been struggling for days to find the right solution. Searching the internet and Stack Overflow has not helped so far.
All attempts with rbind, cbind, lapply and sapply have failed. So here is the problem:
I have a data frame with a time series in the column "value X".
I want to calculate simple and exponential moving averages on this column (SMA and EMA).
Since the window size can be set with the parameter "n" in the SMA/EMA calculation, I want to vary that parameter in a loop from 5 to 150 in steps of 5, and then write the results into a data frame.
So the data frame should look like this:
SMA_5 | SMA_10 | SMA_15 .... EMA_5 | EMA_10 | EMA_15 ...
Ideally the column names are also created in this loop.
Can you help me out?
Thank you in advance
As far as I know, loops are seen as a non-optimal solution in R and should be avoided if possible. It seems to me that the built-in R functions sapply and colnames provide quite a simple solution to your problem:
library("TTR")
# example of data
test <- data.frame(moments = 101:600, values = 1:500)
seq_of_windows_size <- seq(from = 5, to = 150, by = 5)
col_names_of_sma <- paste("SMA", seq_of_windows_size, sep = "_")
SMA_columns <- sapply(FUN = function(i) SMA(x = test$values, n = i),
X = seq_of_windows_size)
colnames(SMA_columns) <- col_names_of_sma
Then you just have to add SMA_columns to your original data frame. The steps for EMA are much the same.
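The final step of attaching the columns can be sketched as follows. To keep the sketch free of package dependencies, a hypothetical base R helper sma_base (a one-sided stats::filter) stands in for TTR's SMA; with TTR loaded, SMA(x, n = n) or EMA(x, n = n) could be used in its place.

```r
# Example data, same shape as in the answer above
test <- data.frame(moments = 101:600, values = 1:500)
windows <- seq(from = 5, to = 150, by = 5)

# Hypothetical helper: trailing simple moving average via a one-sided filter
sma_base <- function(x, n) {
  as.numeric(stats::filter(x, rep(1 / n, n), sides = 1))
}

# One column per window size, named SMA_5, SMA_10, ...
sma_cols <- sapply(windows, function(n) sma_base(test$values, n))
colnames(sma_cols) <- paste("SMA", windows, sep = "_")

# Attach the new columns to the original data frame
result <- cbind(test, sma_cols)
```

The first n - 1 entries of each column are NA, because a full window is not yet available there.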
Hope it helps :)
I'm working with a set of results of INLA package in R. These results are stored in objects with meaningful names so I can have, for instance, model_a, model_b... in current environment. For each of these models I'd like to do several processing tasks including extracting of the data to separate data frame, which can then be used to merge to spatial data to create map, etc.
Turning to simpler, reproducible example let's assume two results
ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2, 10, 20, labels = c("Ctl","Trt"))
weight <- c(ctl, trt)
model_a <- lm(weight ~ group)
model_b <- lm(weight ~ group - 1)
I can handle the steps for an individual model, for instance:
model_a_sum <- data.frame(var = character(1), model_a_value = numeric(1))
model_a_sum$var <- "Intercept"
model_a_sum$model_a_value <- model_a$coefficients[1]
png("model_a_plot.png")
plot(model_a, las = 1)
dev.off()
Now, I'd like to reuse this code for each of the models, essentially constructing correct names depending on the model I'm using. I'm more Stata than R person and inside Stata that would be a trivial task to use the stub of a name (model_a, or even a only..) and construct foreach loop that would implement all the steps, adapting names for each of the models.
In R, for loops have been bashed all over the internet so I presume I shouldn't attempt to venture into the territory of:
models <- c("model_a", "model_b", "model_c")
for (model in models) {
...
}
What would be the better solution for such scenario?
Update 1: Since comments suggested that for might indeed be an option, I'm trying to put all the tasks into a loop. So far I have managed to name the data frame correctly using assign, and to get the correct data plotted under the correct name using get:
models <- c("model_a", "model_b")
for (i in 1:length(models)) {
# create df
name.df <- paste0(models[i], "_sum")
assign(name.df, data.frame(var = character(1), value = numeric(1)))
# replace variables of df with results from the model
# plot and save
name.plot <- paste0(models[i], "_plot.png")
png(name.plot)
plot(get(models[i]), which = 1, las = 1)
dev.off()
}
Is this a reasonable approach? Are there any better solutions?
One thing I cannot solve is naming the second variable of the df according to the model (i.e. model_a_value instead of the current value). Any ideas how to solve that?
Some general tips/advice:
As mentioned in comments, don't believe much of the negativity about for loops in R. The issue is not that they are bad, but more that they are correlated with some bad code patterns that are inefficient.
More important is to use the right data organization. Don't keep each model in a separate object! Put them in a list:
l <- vector("list",3)
l[[1]] <- lm(...)
l[[2]] <- lm(...)
l[[3]] <- lm(...)
Then name the list:
names(l) <- paste0("model_",letters[1:3])
Now you can loop over the list without resorting to awkward and unnecessary tools like assign and get, and more importantly when you're ready to step up from for loops to tools like lapply you're all good to go.
I would use similar strategies for your data frames as well.
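Applied to the two models from the question, the list approach might look like the sketch below. The names models and sums are made up for illustration, and only the summary data frames are built here; the plotting step can loop over names(models) in the same way.

```r
# Data and models taken from the question's reproducible example
ctl <- c(4.17, 5.58, 5.18, 6.11, 4.50, 4.61, 5.17, 4.53, 5.33, 5.14)
trt <- c(4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.03, 4.89, 4.32, 4.69)
group <- gl(2, 10, 20, labels = c("Ctl", "Trt"))
weight <- c(ctl, trt)

models <- list(
  model_a = lm(weight ~ group),
  model_b = lm(weight ~ group - 1)
)

# One small summary data frame per model; the value column is named after the model
sums <- lapply(names(models), function(nm) {
  out <- data.frame(var = "Intercept",
                    value = unname(coef(models[[nm]])[1]))
  names(out)[2] <- paste0(nm, "_value")
  out
})
names(sums) <- paste0(names(models), "_sum")

sums$model_a_sum
```

No assign or get is needed: the model name is just a string used to index the list and to build the column name.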
See @joran's answer. This one shows how to use assign and get, but they should be avoided when possible.
I would go this way for the for loop:
for (model in models) {
m <- get(model) # to get the real model object
# create the model_?_sum dataframe
assign(paste0(model,"_sum"), data.frame(var = "Intercept", value = m$coefficients[1]))
# rename the value column, per comment (thanks to @Franck in chat for the guidance)
assign(paste0(model,"_sum"), setNames( get(paste0(model,"_sum")), c("var",paste0(model,"_value"))) )
# paste0 to create the text
png(paste0(model,"_plot.png"))
plot(m, las = 1) # use the m object to graph
dev.off()
}
which gives the two images and this:
> model_a_sum
var value
(Intercept) Integer 5.032
> model_b_sum
var value
groupCtl Integer 5.032
>
I'm unsure why you want this data frame, but I hope this gives some clues on how to build variable names and how to access them.
I am having trouble optimising a piece of R code. The following example code should illustrate my optimisation problem:
Some initialisations and a function definition:
a <- c(10,20,30,40,50,60,70,80)
b <- c("a","b","c","d","z","g","h","r")
c <- c(1,2,3,4,5,6,7,8)
myframe <- data.frame(a,b,c)
values <- vector(length=columns)
solution <- matrix(nrow=nrow(myframe),ncol=columns+3)
myfunction <- function(frame,columns){
athing = 0
if(columns == 5){
athing = 100
}
else{
athing = 1000
}
values[columns+1] = athing
return(values)}
The problematic for-loop looks like this:
columns = 6
for(i in 1:nrow(myframe)){
values <- myfunction(as.matrix(myframe[i,]), columns)
values[columns+2] = i
values[columns+3] = myframe[i,3]
#more columns added with simple operations (i.e. sum)
solution <- rbind(solution,values)
#solution is a large matrix from outside the for-loop
}
The problem seems to be the rbind function. I frequently get error messages about the size of solution, which seems to become too large after a while (more than 50 MB).
I want to replace this loop and the rbind with a list and lapply and/or foreach. I have started by converting myframe to a list.
myframe_list <- lapply(seq_len(nrow(myframe)), function(i) myframe[i,])
I have not really come further than this, although I tried applying this very good introduction to parallel processing.
How do I have to reconstruct the for-loop without having to change myfunction? Obviously I am open to different solutions...
Edit: This problem seems to be straight from the 2nd circle of hell from the R Inferno. Any suggestions?
The reason that using rbind in a loop like this is bad practice is that in each iteration you enlarge your solution object and then copy it to a new object, which is a very slow process and can also lead to memory problems. One way around this is to create a list whose ith component stores the output of the ith loop iteration. The final step is to call rbind on that list (just once, at the end). This will look something like
my.list <- vector("list", nrow(myframe))
for(i in 1:nrow(myframe)){
# Call all necessary commands to create values
my.list[[i]] <- values
}
solution <- rbind(solution, do.call(rbind, my.list))
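A runnable sketch of the same pattern, using lapply in place of the for loop. The helper make_row is hypothetical and only mimics the per-row work from the question, and the data frame is shortened.

```r
myframe <- data.frame(a = c(10, 20, 30, 40),
                      b = c("a", "b", "c", "d"),
                      c = c(1, 2, 3, 4))
columns <- 6

# Hypothetical stand-in for the per-row work done inside the loop
make_row <- function(i) {
  values <- rep(NA_real_, columns + 3)
  values[columns + 1] <- if (columns == 5) 100 else 1000
  values[columns + 2] <- i
  values[columns + 3] <- myframe[i, 3]
  values
}

# Build every row first, then bind them in a single call
my.list <- lapply(seq_len(nrow(myframe)), make_row)
solution <- do.call(rbind, my.list)
```

The matrix is allocated once by do.call(rbind, ...) instead of being copied and regrown on every iteration.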
A bit too long for a comment, so I put it here:
If columns is known in advance:
myfunction <- function(frame){
athing = 0
if(columns == 5){
athing = 100
}
else{
athing = 1000
}
values[columns+1] = athing
return(values)}
apply(myframe, 1, myfunction)
If columns is not available from the enclosing environment, you can pass it explicitly:
apply(myframe, 1, myfunction, columns) with your original myfunction definition.