Looping dataset R - r

I'm trying to make a loop to automate a lot of actions in R. The code I have looks like this:
datA <- droplevels(datSUM[datSUM$Conc=="a",])
datB <- droplevels(datSUM[datSUM$Conc=="b",])
datC <- droplevels(datSUM[datSUM$Conc=="c",])
datD <- droplevels(datSUM[datSUM$Conc=="d",])
datE <- droplevels(datSUM[datSUM$Conc=="e",])
datX <- droplevels(datSUM[datSUM$Conc=="x",])
datY <- droplevels(datSUM[datSUM$Conc=="y",])
datAf <- droplevels(datA[datA$Sex=="f",])
datAf1 <- droplevels(datAf[datAf$rep=="1",])
datAf2 <- droplevels(datAf[datAf$rep=="2",])
datAf3 <- droplevels(datAf[datAf$rep=="3",])
datAm <- droplevels(datA[datA$Sex=="m",])
datAm1 <- droplevels(datAm[datAm$rep=="1",])
datAm2 <- droplevels(datAm[datAm$rep=="2",])
datAm3 <- droplevels(datAm[datAm$rep=="3",])
So since I have to do this 7 times, it seems like making a loop for this operation is the best way to do it. Can someone help me make that? I'm new to R so please bear that in mind.

Well I will have a stab at this.
concs <- c(a='a',b='b',c='c',d='d',e='e',x='x',y='y')
sex <- c(m='m',f='f')
reps <- c(rep1='1',rep2='2',rep3='3')
# By using m='m' we can label the objects within the list, making it
# easier to navigate the final object, otherwise use:
# concs <- c('a','b','c','d','e','x','y')
# sex <- c('m','f')
# reps <- c('1','2','3')
dfs <- lapply(concs, function(x){
droplevels(datSUM[datSUM$Conc==x,])}
)
sdfs <- lapply(sex, function(x){
lapply(dfs, function(y){
droplevels(y[y$Sex==x,])}
)}
)
rsdfs <- lapply(reps, function(x){
lapply(sdfs, function(y){
lapply(y, function(z){
droplevels(z[z$rep==x,])}
)}
)}
)
There is probably a better way to do this, which may involve more lapply calls, but I think this should do the trick.
The only downside to this method is that you will have to access individual objects with rsdfs[[1]][[1]][[1]] or rsdfs[['rep1']][['m']][['a']], etc.
Applying functions to these would itself require a bunch of lapply calls.
Let me know if this helps.
This is one method to do so - I will work on a more elegant solution later.
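As a flatter alternative to the nested lapply calls above, here is a minimal sketch using base R's split() with several grouping factors at once. The small datSUM built here is a made-up stand-in (the real data are not shown in the question), and the column names Conc, Sex and rep are taken from the question's code.

```r
# Hypothetical stand-in for datSUM: 2 concentrations x 2 sexes x 3 replicates.
datSUM <- expand.grid(Conc = c("a", "b"),
                      Sex  = c("f", "m"),
                      rep  = c("1", "2", "3"),
                      stringsAsFactors = TRUE)
datSUM$value <- seq_len(nrow(datSUM))

# One data frame per Conc/Sex/rep combination, named like "a.f.1".
# drop = TRUE omits combinations that do not occur in the data.
subsets <- split(datSUM, list(datSUM$Conc, datSUM$Sex, datSUM$rep), drop = TRUE)

# Remove unused factor levels in every subset, as droplevels() did before.
subsets <- lapply(subsets, droplevels)

names(subsets)       # all combination names
subsets[["a.f.1"]]   # the equivalent of datAf1
```

The result is a single flat list, so one lapply(subsets, ...) is enough to apply a function to every subset.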

Related

Plot function by condition and subsample for factor data

I'm working on a plotting function for the likert data from a survey and I'm trying to optimize it to be as automated as possible since I have to make quite a lot of plots and make it as user-friendly as possible, but I'm having some problems and really need help finishing this function...
These are the data:
df1<-data.frame(A=c(1,2,2,3,4,5,1,1,2,3),
B=c(4,4,2,3,4,2,1,5,2,2),
C=c(3,3,3,3,4,2,5,1,2,3),
D=c(1,2,5,5,5,4,5,5,2,3),
E=c(1,4,2,3,4,2,5,1,2,3),
dummy1=c("yes","yes","no","no","no","no","yes","no","yes","yes"),
dummy2=c("high","low","low","low","high","high","high","low","low","high"))
df1[colnames(df1)] <- lapply(df1[colnames(df1)], factor)
Columns A and B pertain to the "Technology" section of my survey, while C, D and E are in "Social".
I have transformed my data using the likert package and compiled them in a list so they can be called more easily in my function (I don't know if this is the best way to go about it, as I'm still quite new to R, so feel free to make suggestions on this point too):
vals <- colnames(df1)[1:5]
dummies <- colnames(df1)[-(1:5)]
step1 <- lapply(dummies, function(x) df1[, c(vals, x)])
step2 <- lapply(step1, function(x) split(x, x[, 6]))
names(step2) <- dummies
tbls <- unlist(step2, recursive=FALSE)
tbls<-lapply(tbls, function(x) x[(names(x) %in% names(df1[c(1:5)]))])
So far, here is the function I could come up with (with great help of user #gaut):
mynames <- sapply(names(tbls), function(x) {
paste("How do they rank? -",gsub("\\.",": ",x))
})
myfilenames <- names(tbls)
plot_likert <- function(x, myname, myfilename){
p <- plot(likert(x),
type ="bar",center=3,
group.order=names(x))+
labs(x = "Theme", subtitle=paste("Number of observations:",nrow(x)))+
guides(fill=guide_legend("Rank"))+
ggtitle(myname)
p
}
I then lapply the function to get a list of plots:
list_plots <- lapply(1:length(tbls),function(i) {
plot_likert(tbls[[i]], mynames[i], myfilenames[i])
})
And then save them all as .png
sapply(1:length(list_plots), function(i) ggsave(
filename = paste0("plots ",i,".png"),
plot = list_plots[[i]],
width = 15, height = 9
))
Now, there are 3 main things I want my function to do but don't really know how to approach:
1) Right now I can export all the plots in one batch, but I would also like to be able to export a single plot, for example obtaining the above graph by writing:
plot_likert(tbls$dummy1.no)
2) Ideally, my plotting function would also take into account the sections of my data mentioned above, so that specifying the section Technology gives me a Likert plot of columns A and B only, and specifying the subsample gives me the corresponding dummy subset. Like so:
plot_likert(section=Technology, subsample=dummy1.no)
3) As you maybe have already noted, I need the titles of the plot to be fully automatic, so that by changing section or subsample they too change accordingly.
Apologies for the long/intricate question but I've been stuck on this function for quite some time and really need help finalizing it. For any further clarification/info, do not hesitate to ask!
Thank you in advance for any advice!
There are many ways to get what you want. Essentially, you need to add a few arguments to your function.
I agree with Limey though (and of course Hadley) - generally better to have a few simple functions that do a little step and then you can collate everything in one bigger function.
df1<-data.frame(A=c(1,2,2,3,4,5,1,1,2,3),
B=c(4,4,2,3,4,2,1,5,2,2),
C=c(3,3,3,3,4,2,5,1,2,3),
D=c(1,2,5,5,5,4,5,5,2,3),
E=c(1,4,2,3,4,2,5,1,2,3),
dummy1=c("yes","yes","no","no","no","no","yes","no","yes","yes"),
dummy2=c("high","low","low","low","high","high","high","low","low","high"))
## this can be shortened
df1 <- data.frame(lapply(df1, factor))
## the rest of dummy data creation probably too, but I won't dig too much into this now
vals <- colnames(df1)[1:5]
dummies <- colnames(df1)[-(1:5)]
step1 <- lapply(dummies, function(x) df1[, c(vals, x)])
step2 <- lapply(step1, function(x) split(x, x[, 6]))
names(step2) <- dummies
tbls <- unlist(step2, recursive=FALSE)
tbls<-lapply(tbls, function(x) x[(names(x) %in% names(df1[c(1:5)]))])
library(ggplot2)
library(likert)
#> Loading required package: xtable
## no need for sapply, really!
mynames <- paste("How do they rank? -", gsub("\\.",": ",names(tbls)))
myfilenames <- names(tbls)
## giving an argument a default of NULL makes it optional: the caller can simply leave it out
plot_likert <- function(x, myname, myfilename, section = NULL, subsample = NULL){
## first take only the tbl of interest
if(!is.null(subsample)) x <- x[subsample]
## then filter for your section and subsample
if(!is.null(section)) x <- lapply(x, function(y) y[, section, drop = FALSE])
## you can run your lapply within the function -
## ideally make a separate function and call the smaller function in the bigger one
## use seq_along
lapply(seq_along(x), function(i) {
plot(likert(x[[i]]),
type ="bar",center=3,
group.order=names(x[[i]]))+
labs(x = "Theme", subtitle=paste("Number of observations:",nrow(x[[i]])))+
guides(fill=guide_legend("Rank")) +
## programmatic title
ggtitle(names(x)[i])
})
}
## you need to pass character vectors to your arguments
patchwork::wrap_plots(plot_likert(tbls))
patchwork::wrap_plots(plot_likert(tbls, section = LETTERS[1:2], subsample = paste("dummy1", c("no", "yes"), sep = ".")))
Created on 2022-08-17 by the reprex package (v2.0.1)
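To let callers write a section name such as "Technology" instead of a column vector, one option is a small lookup list. The sketch below is only an illustration: the sections list and the helpers pick_section and make_title are hypothetical names that do not appear in the original code, and the title format is assumed from the question's mynames.

```r
# Hypothetical lookup from section name to the survey columns it covers.
sections <- list(Technology = c("A", "B"), Social = c("C", "D", "E"))

# Small stand-in for one element of tbls.
tbl <- data.frame(A = 1:3, B = 3:1, C = 2, D = 4, E = 5)

pick_section <- function(tbl, section) {
  # drop = FALSE keeps a data frame even if a section has a single column
  tbl[, sections[[section]], drop = FALSE]
}

make_title <- function(section, subsample) {
  # "dummy1.no" becomes "dummy1: no"
  paste0("How do they rank? - ", section,
         " (", sub(".", ": ", subsample, fixed = TRUE), ")")
}

names(pick_section(tbl, "Technology"))
make_title("Technology", "dummy1.no")
```

The output of pick_section could then be passed to likert(), and make_title to ggtitle(), so that the title follows the chosen section and subsample.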

Object not found - nested function - R

I am still getting used to functions. I had a look at the documentation on environments but I can't figure out how to solve the error. Let's see what I have tried so far:
I have a list of documents. Let's suppose it is "core"
library(dplyr)
table_1 <- data.frame(replicate(10,sample(0:1,1000,rep=TRUE)))
table_2 <- data.frame(replicate(10,sample(0:1,1000,rep=TRUE)))
core <- list(table_1, table_2)
Then, I have to run the function documents_ for each element of the list. This function gives some parameters to execute in another nested function:
documents_ <- function(i) {
core_processed <- as.data.frame(core[[i]])
x <- 1:nrow(core_processed)
y <- 1:ncol(core_processed)
temp <- sapply(x, function(x) mapply(calc_dens_,x,y))
return(temp)
}
Inside that, there is the function calc_dens, which is:
calc_dens_ <- function(x, y) {
core_temp <- core_processed %>%
filter(X2 == x & X3 == y)
return(core_temp)
}
Then, to iterate over each element of the list, I tried the following without success:
calc <- lapply(c(1:2), function(i) documents_(i))
Error in eval(lhs, parent, parent) : object 'core_processed' not found
The calc_dens_ function doesn't see the variables created inside documents_ (an environment problem). Is there a way to solve this, or a better approach? My function is more complex than this, but the main elements are in this example. Thank you in advance.
As the other commenters have said, the problem is that you are referring to a variable, core_processed that is not in scope. You can make it a global variable, but it might be more sensible just to use it in a closure like this:
table_1 <- data.frame(replicate(10,sample(0:1,1000,rep=TRUE)))
table_2 <- data.frame(replicate(10,sample(0:1,1000,rep=TRUE)))
cores <- list(table_1, table_2)
documents_ <- function(core_processed) {
x <- 1:nrow(core_processed)
y <- 1:ncol(core_processed)
calc_dens <- function(x, y) core_processed %>% filter(X2 == x & X3 == y)
sapply(x, function(x) mapply(calc_dens, x, y))
}
calc <- lapply(cores, documents_)
If cores is a list of data frames, you do not need to use as.data.frame, and since you use lapply, there is no need to apply over indices and then index into the list. So the code I wrote here is simplified but does the same as your code.
I have to wonder, though, is this really what you want? The sapply over x and then mapply over x and y -- where x is the one from the sapply and not the list you built in documents_ -- looks mighty strange to me.
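The scoping rule behind the error can be shown with a minimal sketch that is independent of dplyr. All function and variable names below are made up for illustration. A function defined at top level looks up free variables where it was defined, not in the function that calls it, so a caller's local variables are invisible to it.

```r
# Defined at top level: free variables are looked up in the global
# environment, not in whichever function happens to call this one.
uses_free_variable <- function() local_value * 2

caller <- function() {
  local_value <- 21
  # The lookup of local_value fails; tryCatch converts the error to NA.
  tryCatch(uses_free_variable(), error = function(e) NA)
}

# Passing the value as an argument avoids relying on scope altogether.
takes_argument <- function(value) value * 2

caller_fixed <- function() {
  local_value <- 21
  takes_argument(local_value)
}

caller()        # NA, as long as no global local_value exists
caller_fixed()  # 42
```

Defining the helper inside the calling function, as in the answer above, or passing the data frame as an extra argument both make the dependency explicit.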

Creating a for loop to plot multiple data series in ggplot

I'm pretty sure this should be really straightforward but I cannot find a solution and cannot see the answer in other questions on for loops in r. I have a dataset datDET that contains 21 data sets of different 'Gels', and I want to make a plot where I have a series from each dataset plotted altogether. I have the following code, however, I just get the error that there is an unexpected symbol in my code, which is the ] after the i. Any help solving this would be greatly appreciated! Here is my current code!
G1.dat <- datDET[datDET$Gel==1,]
G2.dat <- datDET[datDET$Gel==2,]
G3.dat <- datDET[datDET$Gel==3,]
G4.dat <- datDET[datDET$Gel==4,]
G5.dat <- datDET[datDET$Gel==5,]
G6.dat <- datDET[datDET$Gel==6,]
G7.dat <- datDET[datDET$Gel==7,]
G8.dat <- datDET[datDET$Gel==8,]
G9.dat <- datDET[datDET$Gel==9,]
G10.dat <- datDET[datDET$Gel==10,]
G11.dat <- datDET[datDET$Gel==11,]
G12.dat <- datDET[datDET$Gel==12,]
G13.dat <- datDET[datDET$Gel==13,]
G14.dat <- datDET[datDET$Gel==14,]
G15.dat <- datDET[datDET$Gel==15,]
G16.dat <- datDET[datDET$Gel==16,]
G17.dat <- datDET[datDET$Gel==17,]
G18.dat <- datDET[datDET$Gel==18,]
G19.dat <- datDET[datDET$Gel==19,]
G20.dat <- datDET[datDET$Gel==20,]
G21.dat <- datDET[datDET$Gel==21,]
library(ggplot2)
p <- ggplot(datDET, aes(x = NO3, y = Depth))
for (i in c(1:21)){
p1 <- p + geom_point(data=Gi.dat)
}
data=Gi.dat looks for an object literally named Gi.dat, which you don't have. If you want the i to be replaced by the loop value, you have to build the name as a string and fetch the object with get and paste0:
data=get(paste0("G",i,".dat"))
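A minimal sketch of how get() with paste0() resolves the object names, using a made-up three-gel stand-in for datDET (the real data are not shown). It also shows split() as an alternative that avoids creating one object per gel. Note that in the question's loop, p1 is rebuilt from p on every pass, so only the last layer would survive; accumulating layers would need something like p <- p + geom_point(...).

```r
# Hypothetical stand-in for datDET with three gels.
datDET <- data.frame(Gel   = rep(1:3, each = 2),
                     NO3   = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6),
                     Depth = rep(c(1, 2), times = 3))

G1.dat <- datDET[datDET$Gel == 1, ]
G2.dat <- datDET[datDET$Gel == 2, ]
G3.dat <- datDET[datDET$Gel == 3, ]

# get() fetches the object whose name is the string built by paste0().
rows_per_gel <- integer(3)
for (i in 1:3) {
  gel_i <- get(paste0("G", i, ".dat"))
  rows_per_gel[i] <- nrow(gel_i)
}
rows_per_gel

# Alternative: one named list of subsets instead of separate objects.
gels <- split(datDET, datDET$Gel)
gels[["2"]]$NO3   # same values as G2.dat$NO3
```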

Repeating sequence of operations without for loop

I have a simulation in R that involves executing several lines of code.
I would like to replicate this process 1000 times.
Is there any way to do this without a for-loop?
I know there is replicate() but that can only replicate 1 process at a time.
Here's an example:
for (r in 1:reps){
first<-sapply(1:100, function(x) sample(c(1,2),100,prob=c(0.45,0.55),replace=T))
second<-sapply(2:100, function(i) length(which(apply(sapply(1:100, function(x) sample(easy[x,],i)),2,max)==2)) )
third[r,]<-second
}
Can this be done without a for loop?
The command replicate is useful for you (it's really just a wrapper of sapply, but makes your code more readable). I've also made the inside of your loop slightly more readable:
set.seed(123)
for (r in 1:reps){
# first <- matrix(sample(c(1,2),100*100,prob=c(0.45,0.55),replace=T), nrow=100)
second <- sapply(2:100, function(i) length(which(apply(sapply(1:100, function(x) sample(easy[x,],i)),2,max)==2)) )
third[r,]<-second
}
set.seed(123)
third.2 <- t(replicate(reps, sapply(2:100, function(i)
sum(apply(easy[1:100, ], 1, function(x) max(sample(x, i))==2)))))
all.equal(third, third.2)
By the way, even though you didn't ask for this, here is a faster way to calculate first, which does not need sapply at all.
set.seed(123)
first <- sapply(1:100, function(x) sample(c(1,2),100,prob=c(0.45,0.55),replace=T))
set.seed(123)
first.2 <- matrix(sample(c(1,2), 100*100, prob=c(0.45,0.55), replace=T), nrow=100)
all.equal(first, first.2)
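On the concern that replicate() can only repeat one process: its second argument is an expression, so several statements can be wrapped in braces. The sketch below uses a simplified simulation of my own, not the one from the question, and reps is set to a small assumed value.

```r
set.seed(1)
reps <- 5  # assumed small value, for illustration only

# Everything inside the braces runs once per replication; the value of
# the last expression is what replicate() collects.
result <- replicate(reps, {
  draws  <- sample(c(1, 2), 100, prob = c(0.45, 0.55), replace = TRUE)
  n_twos <- sum(draws == 2)
  c(n_twos = n_twos, share = n_twos / 100)
})

dim(result)  # 2 rows (one per returned value) and reps columns
t(result)    # transposed: one row per replication
```

Because each replication returns a vector of the same length, replicate() simplifies the results to a matrix, which is why the answer above wraps the call in t().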
As I mentioned in the comment, something like this would enable you to avoid the loop.
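foo = function (dummy) {
first<-sapply(1:100, function(x) sample(c(1,2),100,prob=c(0.45,0.55),replace=T))
second<-sapply(2:100, function(i) length(which(apply(sapply(1:100, function(x) sample(easy[x,],i)),2,max)==2)) )
# return the result; r does not exist inside foo, so third[r,] cannot be assigned here
second
}
# sapply returns one column per replication, so transpose to get one row per replication
third <- t(sapply(1:reps, foo))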

How to avoid writing the same line several times in R?

I'm writing a program in R and I need to select observations based on a particular value of one of the variables. The program is as follows:
a1961 <- base[base[,5]==1961,]
a1962 <- base[base[,5]==1962,]
a1963 <- base[base[,5]==1963,]
a1964 <- base[base[,5]==1964,]
a1965 <- base[base[,5]==1965,]
a1966 <- base[base[,5]==1966,]
a1967 <- base[base[,5]==1967,]
a1968 <- base[base[,5]==1968,]
a1969 <- base[base[,5]==1969,]
a1970 <- base[base[,5]==1970,]
a1971 <- base[base[,5]==1971,]
a1972 <- base[base[,5]==1972,]
a1973 <- base[base[,5]==1973,]
a1974 <- base[base[,5]==1974,]
a1975 <- base[base[,5]==1975,]
a1976 <- base[base[,5]==1976,]
a1977 <- base[base[,5]==1977,]
a1978 <- base[base[,5]==1978,]
a1979 <- base[base[,5]==1979,]
a1980 <- base[base[,5]==1980,]
a1981 <- base[base[,5]==1981,]
a1982 <- base[base[,5]==1982,]
a1983 <- base[base[,5]==1983,]
a1984 <- base[base[,5]==1984,]
a1985 <- base[base[,5]==1985,]
a1986 <- base[base[,5]==1986,]
a1987 <- base[base[,5]==1987,]
a1988 <- base[base[,5]==1988,]
a1989 <- base[base[,5]==1989,]
...
a2012 <- base[base[,5]==2012,]
Is there a way (like modules in SAS) in which I can avoid writing the same thing over and over again?
In general, coding/implementation questions really belong on StackOverflow. That said, my recommendation is instead of naming individual variables for each result, just throw them all into a list:
a = lapply(1961:1989, function(x) base[base[,5]==x,])
You can also use the assign command.
years <- 1961:2012
for(i in 1:length(years)) {
assign(x = paste0("a", years[i]), value = base[base[,5]==years[i],])
}
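A minimal sketch of the list approach with named elements, so that each subset can still be reached by a name like a1963 without creating separate objects. The small base data frame here is a made-up stand-in; only the fact that the year sits in column 5 is taken from the question.

```r
# Hypothetical stand-in for `base`; the fifth column holds the year.
base <- data.frame(id = 1:6, v2 = 0, v3 = 0, v4 = 0,
                   year = c(1961, 1961, 1962, 1963, 1963, 1963))

# Use the years present in the data rather than typing them out.
years <- sort(unique(base[, 5]))

by_year <- lapply(years, function(y) base[base[, 5] == y, ])
names(by_year) <- paste0("a", years)

nrow(by_year[["a1963"]])  # 3 rows in this toy data
sapply(by_year, nrow)     # row count for every year
```

Keeping the subsets in one list means later steps can be written as a single lapply or sapply over by_year, which is harder to do with objects created through assign.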
