I want to adjust my function so that my if and else if statements recognize the name of the dataframe used and execute the correct plotting function. These are some mock data structured the same as mine:
df1<-data.frame(A=c(1,2,2,3,4,5,1,1,2,3),
B=c(4,4,2,3,4,2,1,5,2,2),
C=c(3,3,3,3,4,2,5,1,2,3),
D=c(1,2,5,5,5,4,5,5,2,3),
E=c(1,4,2,3,4,2,5,1,2,3),
dummy1=c("yes","yes","no","no","no","no","yes","no","yes","yes"),
dummy2=c("high","low","low","low","high","high","high","low","low","high"))
df1[colnames(df1)] <- lapply(df1[colnames(df1)], factor)
vals <- colnames(df1)[1:5]
dummies <- colnames(df1)[-(1:5)]
step1 <- lapply(dummies, function(x) df1[, c(vals, x)])
step2 <- lapply(step1, function(x) split(x, x[, 6]))
names(step2) <- dummies
tbls <- unlist(step2, recursive=FALSE)
tbls<-lapply(tbls, function(x) x[(names(x) %in% names(df1[c(1:5)]))])
A<-lapply(tbls,"[", c(1,2))
B<-lapply(tbls,"[", c(3,4))
C<-lapply(tbls,"[", c(3,4))
list<-list(A,B,C)
names(list)<-c("A","B","C")
And this is my function:
plot_1<-function (section, subsample) {
data<-list[grep(section, names(list))]
data<-data[[1]]
name=as.character(names(data))
if(section=="A" && subsample=="None"){plot_likert_general_section(df1[c(1:2)],"A")}
else if (section==name && subsample=="dummy1"){plot_likert(data$dummy1.yes, title=paste("How do the",name,"topics rank?"));plot_likert(data$dummy1.no, title = paste("How do the",name,"topics rank?"))}
}
Basically, I want it to plot a certain graph when I specify the section and subsample I'm interested in. For example, to plot section C and subsample dummy1, I would just write:
plot_1(section="C", subsample="dummy1")
I want to avoid writing this:
else if (section=="A" && subsample=="dummy1"){plot_likert(data$dummy1.yes, title=paste("How do the A topics rank?"));plot_likert(data$dummy1.no, title = paste("How do the A topics rank?"))}
else if (section=="B" && subsample=="dummy1"){plot_likert(data$dummy1.yes, title=paste("How do the B topics rank?"));plot_likert(data$dummy1.no, title = paste("How do the B topics rank?"))}
else if (section=="C" && subsample=="dummy1"){plot_likert(data$dummy1.yes, title=paste("How do the C topics rank?"));plot_likert(data$dummy1.no, title = paste("How do the C topics rank?"))}
else if (section=="C" && subsample=="dummy2")...
.
.
}
So I tried to extract the dataframe used from the list so that it matches the string of the section typed in the function (data<-list[grep(section, names(list))]) and store its name as a character (name=as.character(names(data))), because I thought that in this way the function would have recognized the string "A", "B" or "C" by itself, without the need for me to specify each condition.
However, if I run it, I get this error: Warning message: In section == name && subsample == "dummy1" : 'length(x) = 4 > 1' in coercion to 'logical(1)', that, from what I understand, is due to the presence of a vector in the statement. But I have no idea how to correct for this (I'm still quite new to R).
How can I fix the function so that it does what I want? Thanks in advance!
Well, I can't really test your code without the plot_likert_general_section function or the plot_likert function, but I've done a bit of simplifying and best practices--passing list in as an argument, consistent spaces and assignment operators, etc.--and this is my best guess as to what you want:
plot_1 = function(list, section, subsample) { ## added `list` as an argument
data = list[[grep(section, names(list))]] # use [[ to extract a single item
title = paste("How do the", section, "topics rank?") # `section` is a single string; names(data) would be a length-4 vector
if(subsample == "None"){
plot_likert_general_section(df1[c(1:2)], section)
} else {
yesno = paste(subsample, c("yes", "no"), sep = ".")
plot_likert(data[[yesno[1]]], title = title)
plot_likert(data[[yesno[2]]], title = title)
}
}
plot_1(list, section = "C", subsample = "dummy1")
I'm not sure if your plot_likert functions use base or grid graphics, but either way you'll need to handle the multiple plots. With base, probably use par(mfrow = ...) to display both of them; with grid, I'd suggest putting them in a list to return them both, and then maybe using gridExtra::grid.arrange() (or similar) to plot both of them.
You're right that the error is due to passing a vector where a single value is expected. Try inserting print statements before the equality test to diagnose why this is.
Also, be careful with variable names like list and data, which mask base R functions (see ?list and ?data). I'd also recommend following the tidyverse style guide: https://style.tidyverse.org/.
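To make the length issue concrete, here is a minimal sketch with made-up data (the sections list and the pick_subsample() helper are hypothetical stand-ins, not part of the original code). It shows that names() of one section is a vector, which is why section == name cannot be used with &&, and that looking the section up by name avoids the comparison altogether:

```r
# Hypothetical stand-in for the nested list built in the question.
sections <- list(
  A = list(dummy1.yes = 1:3, dummy1.no = 4:6),
  B = list(dummy1.yes = 7:9, dummy1.no = 10:12)
)

# names() of one section has several elements, so `section == name`
# yields a logical vector rather than a single TRUE/FALSE.
length(names(sections[["B"]]))

# Hypothetical helper: select the section by name, then build the
# "<subsample>.yes" / "<subsample>.no" keys instead of testing each case.
pick_subsample <- function(sections, section, subsample) {
  stopifnot(length(section) == 1, section %in% names(sections))
  data <- sections[[section]]
  keys <- paste(subsample, c("yes", "no"), sep = ".")
  data[keys]
}

res <- pick_subsample(sections, "B", "dummy1")
```

Each element of res could then be handed to the plotting function in turn.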
Here are the set of circumstances that have gotten me stuck:
The Problem
I have written a function in R that needs to execute within a for loop, with the goal of manually adjusting some values and then updating a large nested list in the global environment. I have two functions, more.points() and get.num.pts(), that ask for user input. These are part of a larger function, add.points(), which runs everything and will be wrapped in a for loop. Unfortunately, I cannot figure out how to update the nested list at the correct place in the list's hierarchy from within the function. I must do it from within the function so that I don't run lines of code in the for loop after the function, because that would cause readline() to fail and take the next line of code as user input. How do I update the hierarchical list object in the correct place from within the add.points() function? assign() does not appear to be up to the task, at least to my limited knowledge. Any help is greatly appreciated. I am making a pipeline for aligning an atlas to brain images so I can localize cells that fluoresce to their respective brain regions.
more.points <- function(){
more.pts <- readline(prompt = "Do you need to add correspondence points to adjust the atlas registration? (y/n): ")
}
get.num.pts <- function(){
num.pts <- readline(prompt = "How many additional points are required? (You will be able to add additional points later if you need to): ")
}
add.points <- function(){
mo.pts <- as.character(more.points());
if(mo.pts == "y" || mo.pts == "Y" || mo.pts == "Yes" || mo.pts == "yes"){
while(mo.pts == "y" || mo.pts == "Y" || mo.pts == "Yes" || mo.pts == "yes") {
#ask for user input about number of new points to be created
n.pts <- as.integer(get.num.pts());
reg.fun.obj <- paste0(n.pts," updated!");
print(reg.fun.obj)
#do other stuff
#assign totally works here just fine because it isnt a hierarchical list being updated
assign("reg.obj", reg.fun.obj, envir = .GlobalEnv);
#Need to update the correct position in the list object hierarchy with new info.
assign(i.data[[reference.df$i[[i]]]][[reference.df$j[[i]]]][[reference.df$k[[i]]]], reg.obj, envir = .GlobalEnv);
#But this cannot take `i.data[[reference.df$i[[i]]]][[reference.df$j[[i]]]][[reference.df$k[[i]]]]` for the name argument. it must be a string.
mo.pts = as.character(more.points())
}
}
}
Reproducible example:
Here is an example of the global environment hierarchical list I need to update from an object within the add.points() function:
#Hierarchical List Object Example
#The image objects have more complexity in my real implementation i.e. image_1 is itself a list object with multiple attributes.
list.i <- c("channel1", "channel2", "channel3")
list.j <- c("m1", "m2", "m3")
list.k <- c("image_1", "image_2", "image_3")
k.tmp <- list()
j.tmp <- list()
i.data <- list()
for(i in seq_along(list.i)){
for(j in seq_along(list.j)){
for(k in seq_along(list.k)){
k.tmp[[k]] <- list.k[[k]]
names(k.tmp)[[k]] <- paste0("img", k)
}
j.tmp[[j]] <- k.tmp
names(j.tmp)[[j]] <- paste0("m", j)
k.tmp <- list()
}
i.data[[i]] <- j.tmp
names(i.data)[[i]] <- paste0("channel", i)
j.tmp <- list()
}
remove(k.tmp,j.tmp)
#Additional example list I am using to know which elements of the hierarchy need to be updated/adjusted as the for loop cycles.
reference.df <- data.frame(i = c(rep(1, 9), rep(2, 9), rep(3, 9)), j = c(rep(c(1, 1, 1, 2, 2, 2, 3, 3, 3),3)), k = c(rep(c(1, 2, 3),9)))
Code to run function:
reg.obj <- i.data[[reference.df$i[[i]]]][[reference.df$j[[i]]]][[reference.df$k[[i]]]]
for(i in seq_along(reference.df$k)){
add.points()
}
Remember: I am unable to run anything after the function within the for loop because R will interpret the next line as the user input being fed to readline(). Thus, the whole point of this loop and function (getting user input, saving, and cycling to the next image for the user to provide input on) will not occur.
For anyone else who runs into an issue like this: don't be stupid like me. Use return() within your function to hand back the updated object, which you can then assign into your nested list like so:
in the function:
myfun <- function(){
#do stuff to make object containing update
return(update.obj)
}
#run the function:
list[[x]][[y]][[z]] <- myfun()
#is equivalent to below occurring outside a function:
list[[x]][[y]][[z]] <- update.obj
Yes this was stupid but hopefully I helped someone avoid my fundamental mistake here. If you can avoid it, don't use assign() in a function.
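As a rough sketch of that pattern on a toy version of the nested list (make.update() is a hypothetical stand-in for add.points(), and the data below are invented for illustration), the function only returns the new value and the caller assigns it into the hierarchy. Note that [[ accepts a vector index on lists, so the three levels can be addressed in one step:

```r
# Toy stand-in for the nested i.data structure (channel -> mouse -> image).
i.data <- list(
  channel1 = list(m1 = list(img1 = "image_1", img2 = "image_2")),
  channel2 = list(m1 = list(img1 = "image_1", img2 = "image_2"))
)
reference.df <- data.frame(i = c(1, 2), j = c(1, 1), k = c(2, 1))

# Hypothetical replacement for add.points(): it builds and returns the
# update instead of calling assign(). In the real function the readline()
# prompts would live in here.
make.update <- function(old.value, n.pts) {
  paste0(old.value, " (", n.pts, " points added)")
}

for (row in seq_len(nrow(reference.df))) {
  idx <- c(reference.df$i[row], reference.df$j[row], reference.df$k[row])
  # A vector index walks the hierarchy: same as i.data[[i]][[j]][[k]]
  i.data[[idx]] <- make.update(i.data[[idx]], n.pts = 3L)
}
```

Because the assignment is part of the same statement as the function call, no extra line follows the call inside the loop.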
I've built a prediction function in R, but when I run it, it's very slow, and I'm only using a 1% sample of the data I'll be using in production. The function is intended to predict the next word given a series of ngrams (two-word, three-word, or four-word combinations created from my corpus).
I pass the words to the function, for example "i can", along with the matrix of three-word combinations. The output, ranked in decreasing order of count, would be something like "i can read" with a count of 4.
The two-word ngram table is passed as a matrix. Here are its dimensions and example data from row 100:
dim(bigram_index)
[1] 46201 3
bigram_index[,1][100]
[1] "abandon"
bigram_index[,2][100]
[1] "contemporary"
bigram_index[,3][100]
[1] "1"
Here is the prediction function:
predict.next.word <- function(word, ng_matrix){
ngram_df <- data.frame(predicted=character(), count = numeric(), stringsAsFactors=FALSE)
col_ng_matrix <- nrow(ng_matrix)
if(ncol(ng_matrix)==3){
for (i in 1:col_ng_matrix){
first_word <- ng_matrix[,1][i]
second_word <- ng_matrix[,2][i]
count_word <- ng_matrix[,3][i]
if (word[1] == first_word && !is.na(first_word)){
matched_factor <- structure(c(second_word, count_word), .Names = c("predicted", "count"))
ngram_df[i,] <- as.list(matched_factor)
}
}
} else if(ncol(ng_matrix)==4){
for (i in 1:col_ng_matrix){
first_word <- ng_matrix[,1][i]
second_word <- ng_matrix[,2][i]
third_word <- ng_matrix[,3][i]
count_word <- ng_matrix[,4][i]
if (word[1] == first_word && !is.na(first_word) && word[2] == second_word && !is.na(second_word)){
matched_factor <- structure(c(third_word, count_word), .Names = c("predicted", "count"))
ngram_df[i,] <- as.list(matched_factor)
}
}
} else if(ncol(ng_matrix)==5){
for (i in 1:col_ng_matrix){
first_word <- ng_matrix[,1][i]
second_word <- ng_matrix[,2][i]
third_word <- ng_matrix[,3][i]
fourth_word <- ng_matrix[,4][i]
count_word <- ng_matrix[,5][i]
if (word[1] == first_word && !is.na(first_word) && word[2] == second_word
&& !is.na(second_word) && word[3] == third_word && !is.na(third_word)){
matched_factor <- structure(c(fourth_word, count_word), .Names = c("predicted", "count"))
ngram_df[i,] <- as.list(matched_factor)
}
}
}
ngram_df <- transform(ngram_df, count = as.numeric(count))
return (ngram_df[order(ngram_df$count, decreasing = TRUE),])
}
Using the smallest ngram (only two-word) here is the time results:
system.time(predict.next.word(c("abandon"), bigram_index))
user system elapsed
92.125 59.395 152.149
Again, the ngram matrix passed here is only 1% of production data, and when I get into three- and four-word ngrams, it takes much longer. Please provide your insight on how to improve this function's speed.
Instead of looping through the rows with for(), I would write a function that performs the key actions of the loop body, and use apply() (with MARGIN=1 for rows, 2 for columns; I think you'll be using the former) to apply that function to each row (FUN= argument set equal to your function). Depending on the output format, apply might not be suitable. At that point you could look into the plyr package, dplyr, or, my favorite (but somewhat of a learning curve, as is dplyr), the data.table package.
In general, take a look at Hadley's book chapter on the topic: http://adv-r.had.co.nz/Performance.html
Currently, your code doesn't take advantage of the fact that so-called "vectorized" R code performs loops in C, making them much faster (forgive me if this description is technically incorrect; just getting the idea across).
For a more specific example, it might be helpful to see input (use dput(data)) and desired output. Then I'd have an easier time digesting what you want your function to accomplish.
Some general points that could help, at least a little:
You call ncol(ng_matrix) several times; instead, do nc.ngm <- ncol(ng_matrix) once at the start. Savings will be minimal, but the idea is still useful.
Instead of defining first_word, second_word, etc., just do something like words <- ng_matrix[i,]. Then use the previously mentioned object to get the count with count_word <- words[nc.ngm] and the other words with numbered_words <- words[-nc.ngm]. To compare the elements of word to the elements of numbered_words, you could even use mapply to get your logic. Again, this is all a little hard to follow without an example. But in general, do things "in bulk" (vectorize).
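For what it's worth, here is one way the "in bulk" idea could look, as a sketch only: I'm assuming the matrix layout from the question (context words first, then the predicted word, then the count as the last column), and predict_next_word_vec() and the toy matrix are names I made up. Whole columns are compared at once, so there is no row-by-row loop and no growing data frame:

```r
# Sketch of a vectorized lookup; assumes columns are
# [context word(s)..., predicted word, count].
predict_next_word_vec <- function(word, ng_matrix) {
  nc <- ncol(ng_matrix)
  n_ctx <- nc - 2                       # number of context-word columns
  keep <- rep(TRUE, nrow(ng_matrix))
  for (j in seq_len(n_ctx)) {           # loops over columns, not rows
    keep <- keep & !is.na(ng_matrix[, j]) & ng_matrix[, j] == word[j]
  }
  out <- data.frame(predicted = ng_matrix[keep, nc - 1],
                    count = as.numeric(ng_matrix[keep, nc]),
                    stringsAsFactors = FALSE)
  out[order(out$count, decreasing = TRUE), , drop = FALSE]
}

# Made-up bigram matrix for illustration.
toy <- cbind(c("i", "i", "you", NA),
             c("can", "will", "can", "x"),
             c("3", "5", "1", "2"))
res <- predict_next_word_vec("i", toy)
```

Unlike the original, this returns only the matching rows, so there are no all-NA rows to sort through.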
My problem is as follows: I'm trying to write a function that sets a collection of attributes on an object in a given environment. I'm trying to mimic a metadata layer, like SAS does, so you can set various attributes on a variable, like label, decimal places, date format, and many others.
Example:
SetAttributes(object = "list$dataframe$column", label="A label", width=20, decDigits=2,
dateTimeFormat="....", env=environment())
But I have to set attributes on different levels of objects, say:
comment(list$dataframe$column) <- "comment on a column of a dataframe in a list"
comment(dataframe$column) <- "comment on a column of a dataframe"
comment(list) <- "comment on a list/dataframe/vector"
Alternatively it can be done like this:
comment("env[[list]][[dataframe]][[column]]") <- "text"
# (my function recognizes both formats, as a variable and as a string with chain of
# [[]] components).
So I have implemented it this way:
SetAttributes <- function(varDescription, label="", .........., env=.GlobalEnv) {
parts <- strsplit( varDescription, "$", fixed=TRUE)[[1]]
if(length(parts) == 3) {
lst <- parts[1]
df <- parts[2]
col <- parts[3]
if(!is.na(label)) comment(env[[lst]][[df]][[col]]) <- label
if(!is.na(textWidth)) attr(env[[lst]][[df]][[col]], "width") <- textWidth
....
} else if(length(parts) == 2) {
df <- parts[1]
col <- parts[2]
if(!is.na(label)) comment(env[[df]][[col]]) <- label
if(!is.na(textWidth)) attr(env[[df]][[col]], "width") <- textWidth
....
} else if(length(parts) == 1) {
....
You see the problem now: I have three blocks of similar code for length(parts) == 3, 2 and 1
When I tried to automate it this way:
path <- c()
sapply(parts, FUN=function(comp){ path <<- paste0(path, "[[", comp, "]]") })
comment(eval(parse(text=paste0(".GlobalEnv", path)))) <- "a comment"
I've got an error:
Error in comment(eval(parse(text = paste0(".GlobalEnv", path)))) <- "a comment" :
target of assignment expands to non-language object
Is there any way to get an object on any level and set attributes for it not having a lot of repeated code?
PS: yes, I've heard a thousand times that changing external variables from inside a function is evil, so please don't mention it. I know what I want to achieve.
Just to make sure you hear it 1001 times, it's a very bad idea for a function to have side effects like this. This is a very un R-like way to program something like this. If you're going to write R code, it's better to do things the R way. This means returning modified objects that can optionally be reassigned. This would make life much easier.
Here's a simplified version which only focuses on the comment.
SetComment <- function(varDescription, label=NULL, env=.GlobalEnv) {
obj <- parse(text= varDescription)[[1]]
eval(substitute(comment(X)<-Y, list(X=obj, Y=label)), env)
}
a<-list(b=4)
comment(a$b)
# NULL
SetComment("a$b", "check")
comment(a$b)
# [1] "check"
Here, rather than parsing and splitting the string, we build an expression that we evaluate in the proper context. We use substitute() to pop in the values you want to the actual call.
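The same trick should extend to arbitrary attributes. The following is only a sketch along those lines (SetAttr() and the toy environment are hypothetical, not from the question): it substitutes the parsed target, the attribute name, and the value into an attr(X, N) <- V call and evaluates that call in the chosen environment:

```r
# Hypothetical generalization of SetComment() to any attribute name.
SetAttr <- function(varDescription, name, value, env = .GlobalEnv) {
  obj <- parse(text = varDescription)[[1]]
  # Builds e.g. attr(lst$df$col, "width") <- 20 and evaluates it in env
  eval(substitute(attr(X, N) <- V, list(X = obj, N = name, V = value)), env)
  invisible(NULL)
}

# Toy environment with a three-level object, for illustration only.
e <- new.env()
e$lst <- list(df = list(col = 1:3))

SetAttr("lst$df$col", "width", 20, env = e)
SetAttr("lst$df", "label", "a nested element", env = e)
attr(e$lst$df$col, "width")
```

Because the target is parsed as an expression, the same code handles one, two, or three levels without separate branches.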
uniq <- unique(file[,12])
pdf("SKAT.pdf")
for(i in 1:length(uniq)) {
dat <- subset(file, file[,12] == uniq[i])
names <- paste("Sample_filtered_on_", uniq[i], sep="")
qq.chisq(-2*log(as.numeric(dat[,10])), df = 2, main = names, pvals = T,
sub=subtitle)
}
dev.off()
file[,12] is an integer so I convert it to a factor when I'm trying to run it with by instead of a for loop as follows:
pdf("SKAT.pdf")
by(file, as.factor(file[,12]), function(x) { qq.chisq(-2*log(as.numeric(x[,10])), df = 2, main = paste("Sample_filtered_on_", file[1,12], sep=""), pvals = T, sub=subtitle) } )
dev.off()
It works fine to sort the data frame by this (now a factor) column. My problem is that for the plot title, I want to label it with the correct index from that column. This is easy to do in the for loop by uniq[i]. How do I do this in a by function?
Hope this makes sense.
A more vectorized (== cooler?) version would pull the common operations out of the loop and let R do the book-keeping about unique factor levels.
dat <- split(-2 * log(as.numeric(file[,10])), file[,12])
names(dat) <- paste0("Sample_filtered_on_", names(dat))
(paste0 is a convenience function for the common use case where normally one would use paste with the argument sep=""). The for loop is entirely appropriate when you're running it for its side effects (plotting pretty pictures) rather than trying to capture values for further computation; it's definitely un-cool to use T instead of TRUE, while seq_along(dat) means that your code won't produce unexpected results when length(dat) == 0.
pdf("SKAT.pdf")
for(i in seq_along(dat)) {
vals <- dat[[i]]
nm <- names(dat)[[i]]
qq.chisq(vals, main = nm, df = 2, pvals = TRUE, sub=subtitle)
}
dev.off()
If you did want to capture values, the basic observation is that your function takes 2 arguments that vary. So by or tapply or sapply or ... are not appropriate; each of these assumes that just a single argument is varying. Instead, use mapply or the comparable Map:
Map(qq.chisq, dat, main=names(dat),
MoreArgs=list(df=2, pvals=TRUE, sub=subtitle))
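Since qq.chisq needs a plotting device and real data, here is a small stand-in sketch of the same Map pattern (describe() and the numbers are invented for illustration). Two arguments vary together, the data and its title, while MoreArgs carries the arguments that stay fixed:

```r
# Hypothetical stand-in for qq.chisq: takes data, a title, and a fixed option.
describe <- function(x, main, digits) {
  paste0(main, ": mean = ", round(mean(x), digits))
}

# Split the values by group and build one title per group.
dat <- split(c(1, 2, 3, 4, 5, 6), c("a", "a", "b", "b", "b", "a"))
names(dat) <- paste0("Sample_filtered_on_", names(dat))

# x and main vary in parallel; digits is the same for every call.
res <- Map(describe, dat, main = names(dat), MoreArgs = list(digits = 1))
```

The result is a list with one element per group, named after the groups.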
Please excuse me if there are already answers to this, but I can't quite figure it out from the archives.
I have generated a list of very similar functions via a for-loop in R:
adoptint.fun=list()
for(i in 1:40) {
#function name for each column
func.name <- paste('adoptint',i,sep='')
#function
func = paste('function(yearenter, adoptyear, yearleave) {ifelse(is.na(yearenter) | yearenter >', i+1905, ' | is.na(adoptyear) | yearleave > ', i+1905, ', NA, ifelse(yearenter <= ', i+1905, ' & adoptyear <= ', i+1905, ', 1, 0))}', sep='')
adoptint.fun[[func.name]] = eval(parse(text=func))
}
I am now interested in applying this function to generate values for variables that have yet to be created in the dataframe. I want to do this using a loop or similar since the process is identical, though the specific values change, over the 40 iterations. The code would look something like:
#generate variables that will be inserted into dataframe, dfanal.reshape
var_names <- paste("dfanal.reshape$adopt", 1:40, sep="")
#run function i to obtain values for variable i, which should be appended to dataframe
for(i in 1:40){
var_names[i] <- eval(parse(paste("adoptint.fun[[" ,i, "]](dfanal.reshape$intoobsyear,dfanal.reshape$adoptyear,dfanal.reshape$yearleave)", sep="")))
}
I have played around with mget for the var_names segment, but that doesn't seem to work, and the eval segment is also not working (i.e., it is not assigning the values determined by the function, which works fine, to the appropriate dataframe column).
Again, apologies if this has already been answered and thanks in advance for your help.
How about adding an extra argument to your function?
func <- function(yearenter, adoptyear, yearleave,i) {
ifelse(is.na(yearenter) | yearenter > i+1905 | is.na(adoptyear) | yearleave > i+1905 , NA,
ifelse(yearenter <= i+1905 & adoptyear <= i+1905, 1, 0))
}
This would allow you to do the replacement much more easily, using the fact that a dataframe is a special kind of list. That was your original problem, I believe:
for(i in 1:40){
varname <- paste('adopt',i,sep='')
dfanal.reshape[[varname]] <-
with(dfanal.reshape,
func(intoobsyear,adoptyear,yearleave,i)
)
}
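To see the loop in action without the real data, here is a small sketch on an invented data frame (toy and adopt_flag() are hypothetical names; adopt_flag() has the same body as func above, with the cutoff year pulled into a variable):

```r
# Same logic as func above, with the cutoff year named for readability.
adopt_flag <- function(yearenter, adoptyear, yearleave, i) {
  cutoff <- i + 1905
  ifelse(is.na(yearenter) | yearenter > cutoff | is.na(adoptyear) | yearleave > cutoff, NA,
         ifelse(yearenter <= cutoff & adoptyear <= cutoff, 1, 0))
}

# Invented data, for illustration only.
toy <- data.frame(intoobsyear = c(1906, 1906, 1908, NA),
                  adoptyear   = c(1906, 1907, 1908, 1906),
                  yearleave   = c(1906, 1907, 1908, 1906))

# One new column per value of i, created by name with [[<-
for (i in 1:3) {
  toy[[paste0("adopt", i)]] <- with(toy, adopt_flag(intoobsyear, adoptyear, yearleave, i))
}
```

After the loop, toy has three extra columns, adopt1 to adopt3, alongside the original three.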
Check also the help pages ?which and ?Extract
Now, without a reproducible example (see "How to make a great R reproducible example?"), it's hard to guess what you want to do and how to do it more economically. You're still using a lot of calculation time. The following function might do what you want:
func <- function(df,j){
out <- matrix(0,nrow=nrow(df),ncol=j)
attach(df)
idna <- sapply(1:j,function(i)
is.na(yearenter) | yearenter > i+1905 | is.na(adoptyear) | yearleave > i+1905
)
out[idna] <- NA
id1 <- sapply(1:j,function(i)
yearenter <= i+1905 & adoptyear <= i+1905
)
out[id1] <- 1
detach(df)
colnames(out)<- paste('adopt',1:j,sep='')
cbind(df,out)
}
which allows you to simply do
dfanal.reshape <- func(dfanal.reshape,40)
to get the desired result. This is given that the names of your variables are yearenter, adoptyear and yearleave. As far as I can see, you have to change yearenter to intoobsyear in the function, but that's a detail.
Learning to use indices will save you a lot of frustration. And please, never ever make 40 identical functions again if adding one argument will do.