More efficient than for loop in R - r

I wonder whether there are more efficient ways to assign values to a new variable in a data frame than using for loops. I have two recent examples:
[1] Getting the normalized Levenshtein distance using the vwr package:
rst34$Levenshtein = rep(0, nrow(rst34))
for (i in 1:nrow(rst34)) {
rst34$Levenshtein[i] = levenshtein.distance(
as.character(rst34$target[i]), as.character(rst34$prime[i]))[[1]] /
max(nchar(as.character(rst34$target[i])), nchar(as.character(rst34$prime[i]))
)
}
[2] Extracting substring from another variable:
rst34$Experiment = 'rst4'
for (i in 1:nrow(rst34)) {
rst34$Experiment[i] = unlist(strsplit(as.character(rst34$subject[i]), '[.]'))[1]
}
Also, I think that there should be no difference between the initializations in the two examples:
rst34$Levenshtein = rep(0, nrow(rst34))
rst34$Experiment = 'rst4'
Many thanks!

Something like...
rst34$Experiment = sapply(rst34$subject, function(element){
unlist(strsplit(as.character(element), '[.]'))[1]
})
Should hopefully do the trick. I don't have your data, so I couldn't actually test it out.
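Since sub and strsplit already work on whole vectors, the same result can probably be had without iterating element by element. A minimal sketch, where the subject vector is a made-up stand-in for rst34$subject (the real data is not available here):

```r
# Hypothetical stand-in for rst34$subject
subject <- factor(c("rst4.s01", "rst3.s02", "rst4.s17"))

# Drop everything from the first dot onwards, for the whole vector at once
experiment <- sub("[.].*$", "", as.character(subject))

# Equivalent with strsplit: take the first piece of each split
experiment2 <- vapply(strsplit(as.character(subject), ".", fixed = TRUE),
                      `[`, character(1), 1)
```

Either result could then be assigned to rst34$Experiment in a single step, with no initialization needed beforehand.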

It would only make sense to apply nchar to a character variable so the as.character calls are probably not needed:
rst34$Levenshtein <-
levenshtein.distance( rst34$target, rst34$prime) /
pmax(nchar(rst34$target),
nchar(rst34$prime ) )
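If levenshtein.distance turns out not to vectorise over both arguments, a pairwise version can be sketched with base R's adist (from utils) and mapply. Note that this swaps in adist for the vwr function, and the two toy vectors are stand-ins for rst34$target and rst34$prime:

```r
# Toy stand-ins for rst34$target and rst34$prime
target <- c("kitten", "flaw", "same")
prime  <- c("sitting", "lawn", "same")

# adist() returns a matrix of all pairs; mapply feeds it one pair at a time
raw <- mapply(function(a, b) adist(a, b)[1, 1], target, prime,
              USE.NAMES = FALSE)

# Normalise by the longer string of each pair; pmax works elementwise
normalized <- raw / pmax(nchar(target), nchar(prime))
```

The pmax call is the important part: max would collapse both columns to a single number, whereas pmax returns one maximum per row.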

Related

Calling a character string into object names within a function

I've currently got a very lengthy and repetitive bit of code for data normalisation and inversion, (((x-min)/(max-min))*-1)+1, that I want to clean up a bit.
This is a small sample of what it currently looks like:
W3_E1_Norm_New <- W3_E1_Average%>%
mutate(W3_E1_Norm_New = ((W3_E1_zoo-W3_E1_Min)/(W3_E1_Max-W3_E1_Min)*-1)+1)
W3_E2_Norm_New <- W3_E2_Average%>%
mutate(W3_E2_Norm_New = ((W3_E2_zoo-W3_E2_Min)/(W3_E2_Max-W3_E2_Min)*-1)+1)
W3_E3_Norm_New <- W3_E3_Average%>%
mutate(W3_E3_Norm_New = ((W3_E3_zoo-W3_E3_Min)/(W3_E3_Max-W3_E3_Min)*-1)+1)
Each 'W3_E1' refers to a sample ID, and at present each sample ID requires the two lines of code to be written out each time.
Ideally I'd like to write a function which can call a character string (Sample_IDs) into the names of each data frame, so something like
a_Norm_New
would return
W3_E1_Norm_New
then
W3_E2_Norm_New
etc.
Is there a way to write a function that could accomplish this?
Many thanks
I don't have your data but this should work. Define a function:
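library(dplyr)
my_fun <- function (x) {
# Build the object and column names for this sample ID
norm_new <- paste0(x, "_Norm_New")
average <- paste0(x, "_Average")
zoo <- paste0(x, "_zoo")
min_name <- paste0(x, "_Min")
max_name <- paste0(x, "_Max")
# get() turns each name back into the column (or variable) it refers to
df <- get(average) %>%
mutate(new_norm = ((get(zoo) - get(min_name)) / (get(max_name) - get(min_name)) * -1) + 1)
# assign() takes the name first, then the value; write to the global environment
assign(norm_new, df, envir = .GlobalEnv)
}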
Then run a for loop:
Sample_IDs <- c("W3_E1", "W3_E2", "W3_E3")
for (i in Sample_IDs) {
my_fun(i)
}
With data.table, it is very easy to write functions that use quoted variable names (see a blog post I wrote on the subject).
Here, we paste the pattern of your column name everywhere with the sufx variable:
library(data.table)
normalize <- function(dt, sufx = "W3_E1"){
df <- as.data.table(dt)
df[, (paste0(sufx,"_Norm_New")) := ((
(get(paste0(sufx,"_zoo")) - get(paste0(sufx,"_Min"))
)/(
get(paste0(sufx,"_Max")) - get(paste0(sufx,"_Min"))
)*-1)+1)]
}
The code is not easy to read because I wanted to show that this can be done in a single expression, but it can easily be made more readable.
In this solution, get is used to unquote the variable name.
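Another option, sketched here under the assumption that the minimum and maximum are simply those of each series, is to keep the samples in a named list and apply one function to all of them, which avoids building object names with paste0 altogether. The invert_norm helper, the samples list and its values are made up for illustration:

```r
# Hypothetical helper: min-max normalise, then invert so the maximum maps to 0
invert_norm <- function(x, lo = min(x), hi = max(x)) {
  ((x - lo) / (hi - lo)) * -1 + 1
}

# Made-up data: one numeric vector per sample ID
samples <- list(W3_E1 = c(0, 5, 10), W3_E2 = c(2, 4, 6, 8))

# One call handles every sample; results stay addressable by ID
norm_new <- lapply(samples, invert_norm)
norm_new$W3_E1
```

Adding a new sample then only means adding an element to the list, not writing two more lines of code.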

User defined function - issue with return values

I regularly come up against the issue of how to categorise dataframes from a list of dataframes according to certain values within them (E.g. numeric, factor strings, etc). I am using a simplified version using vectors here.
After writing messy for loops for this task a bunch of times, I am trying to write a function to repeatedly solve the problem. The code below returns a subscripting error (given at the bottom), however I don't think this is a subscripting problem, but to do with my use of return.
As well as fixing this, I would be very grateful for any pointers on whether there are any cleaner / better ways to code this function.
library(plyr)
library(dplyr)
#dummy data
segmentvalues <- c('1_P', '2_B', '3_R', '4_M', '5_D', '6_L')
trialvec <- vector()
for (i in 1:length(segmentvalues)){
for (j in 1:20) {
trialvec[i*j] <- segmentvalues[i]
}
}
#vector categorisation
vcategorise <- function(categories, data) {
#categorises a vector into a list of vectors
#requires plyr and dyplyr
assignment <- list()
catlength <- length(categories)
for (i in 1:length(catlength)){
for (j in 1:length(data)) {
if (any(contains(categories[i], ignore.case = TRUE,
as.vector(data[j])))) {
assignment[[i]][j] <- data[j]
}
}
}
return (assignment)
}
result <- vcategorise(categories = segmentvalues, data = trialvec)
Error in *tmp*[[i]] : subscript out of bounds
You are indexing assignments -- which is ok, even if at an index that doesn't have a value, that just gives you NULL -- and then indexing into what you get there -- which won't work if you get NULL. And NULL you will get, because you haven't allocated the list to be the right size.
In any case, I don't think it is necessary for you to allocate a table. You are already using a flat indexing structure in your test data generation, so why not do the same with assignment and then set its dimensions afterwards?
Something like this, perhaps?
vcategorise <- function(categories, data) {
assignment <- vector("list", length = length(data) * length(categories))
n <- length(data)
for (i in 1:length(categories)){
for (j in 1:length(data)) {
assignment[(i-1)*n + j] <-
if (any(contains(categories[i],
ignore.case = TRUE,
as.vector(data[j])))) {
data[j]
} else {
NA
}
}
}
dim(assignment) <- c(length(data), length(categories))
assignment
}
It is not the prettiest code, but without fully understanding what you want to achieve, I don't know how to go further.
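If the goal is simply one vector of matches per category, a shorter route may be to let lapply build the list, which sidesteps the pre-allocation problem. This is only a guess at the intent: it uses base R's grepl in place of contains, and vcategorise2 and the small data set are made up for the example:

```r
categories <- c("1_P", "2_B", "3_R")
data <- rep(c("1_P", "2_B", "3_R"), times = c(2, 3, 1))

# Hypothetical replacement for vcategorise: one list element per category
vcategorise2 <- function(categories, data) {
  out <- lapply(categories, function(category) {
    # Case-insensitive literal match of the category inside each data value
    data[grepl(tolower(category), tolower(data), fixed = TRUE)]
  })
  setNames(out, categories)
}

result <- vcategorise2(categories, data)
lengths(result)
```

Because lapply creates each list element in one go, there is never an assignment into an element that does not exist yet.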

Delete data.frame columns and loop through data.frame assignment function

I found the following piece of code here at stackoverflow:
library(svDialogs)
columnFunction <- function (x) {
column.D <- dlgList(names(x), multiple = T, title = "Spalten auswaehlen")$res
if (!length((column.D))) {
cat("No column selected\n")
} else {
cat("The following columns are choosen:\n")
print(column.D)
x <- x[,!names(x) %in% column.D]
}
return(x)
}
df <- columnFunction(df)
So I wanted to use it for my own purposes, but it did not work out as planned.
What I am trying to achieve is to use it in a for loop or with lapply, so that it works with multiple data.frames. Among other things, I tried:
d.frame1 <- iris
d.frame2 <- cars
l.frames <- c("d.frame1","d.frame2")
for (b in l.frames){
columnFunction(b)
}
but it yields the following error message:
Error in dlgList(names(x), multiple = T, title = "Spalten auswaehlen")$res :
$ operator is invalid for atomic vectors
What I additionally need is to loop through that function so that I can iterate over different data.frames.
Last but not least I would need something like:
for (xyz in l.frames){
xyz <- columnFunction(xyz)
}
to automate the saving step.
Does anyone have an idea how I could loop through that function, or how I could change the function so that it performs all those steps and is loopable?
I'm quite new to R, so perhaps I'm missing something obvious.
lapply was designed for this task:
l.frames <- list(d.frame1, d.frame2)
l.frames <- lapply(l.frames, columnFunction)
If you insist on using a for loop:
for (i in seq_along(l.frames)) l.frames[[i]] <- columnFunction(l.frames[[i]])
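The loop in the question fails because l.frames holds the names of the data frames as strings, so columnFunction receives "d.frame1" rather than the data frame itself. Below is a sketch of one way to keep the names, using mget to fetch the objects and list2env to write the results back. dropColumns is a hypothetical non-interactive stand-in for columnFunction, so the example runs without a dialog:

```r
d.frame1 <- iris
d.frame2 <- cars

# Hypothetical stand-in for columnFunction: drops fixed columns, no dialog
dropColumns <- function(x, drop = c("Species", "dist")) {
  x[, !names(x) %in% drop, drop = FALSE]
}

# mget() turns a character vector of names into a named list of objects
l.frames <- mget(c("d.frame1", "d.frame2"))
l.frames <- lapply(l.frames, dropColumns)

# Optional: overwrite the original variables with the processed versions
invisible(list2env(l.frames, envir = environment()))
names(d.frame1)
```

Keeping everything in the named list and skipping the list2env step is usually the simpler choice, since the results then stay together for further lapply calls.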

Can a way in R be found out to do permutation for objects

I want to do
am<-0
an<-0
bm<-0
bn<-0
cm<-0
cn<-0
.....
.....
and so on until zn. Is there a way to do it without writing so much?
You can use assign to create variable by name:
for (first in letters[1:3]) {
for (second in letters[13:14]) {
assign(paste(first, second, sep=""), 0)
}
}
Probably a better way would be to use dataframe like this:
df <- data.frame(
name=paste(rep(letters[1:3], each=2), rep(letters[13:14], 3), sep=""),
value=0
)
If you have already defined am, an, ... as separate variables, one way to collect them into a vector is:
vars <- unlist(mget(ls(pattern='^[a-z](m|n)$')))
Then it is easy to reset every element of the vector like this:
vars[] <- 0
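When the variables do not exist yet, a single named vector can probably replace all 52 of them. A minimal sketch; the names nm and counts are made up for the example:

```r
# All 52 names: am, an, bm, bn, ..., zm, zn
nm <- as.vector(t(outer(letters, c("m", "n"), paste0)))

# One named numeric vector instead of 52 separate objects
counts <- setNames(numeric(length(nm)), nm)

# Update a single entry by name
counts["bn"] <- counts["bn"] + 1
```

Each value is then read or changed by name, for example counts["zn"], and the whole set can be reset with counts[] <- 0.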

R checking a parameter is defined

I'm looking for a general practice for checking whether a parameter was defined in a function.
I came up with these three ideas. Which one is the proper way of doing it?
Unfortunately, the third one does not work. substitute() behaves differently inside a function, and I couldn't figure out how to use it properly.
file.names <- list(
cov.value = "cov.rds",
plot.name = "plot.pdf"
)
test1 <- function(file.names){
is.save <- !missing(file.names)
}
test2 <- function(file.names = NULL) {
is.save <- !is.null(file.names)
}
test3 <- function(file.names = NULL) {
is.save <- exists(as.character(substitute(file.names)))
}
I personally think the second approach with a default value is MUCH easier to use & understand. (And the third approach is just really bad)
...especially when you are writing a wrapper function that needs to pass an argument to it. How to pass a "missing"-value is not obvious!
wraptest1 <- function(n) {
file.names <- if (n > 0) sample(LETTERS, n)
else alist(a=)[[1]] # Hacky way of assigning 'missing'-value
print(test1(file.names))
}
wraptest1(2) # TRUE
wraptest1(0) # FALSE
wraptest2 <- function(n) {
file.names <- if (n > 0) sample(LETTERS, n)
else NULL # Much easier to read & understand
print(test2(file.names))
}
wraptest2(2) # TRUE
wraptest2(0) # FALSE
[granted, there are other ways to work around passing the missing value, but the point is that using a default value is much easier...]
Some default values to consider are NULL, NA, numeric(0), ''
It's generally a good idea to look at code from experienced coders---and R itself has plenty of examples in the R sources.
I have seen both your first and second example being used. The first one is pretty idiomatic; I personally still use the second more often by force of habit. The third one I find too obscure.
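To make the NULL-default approach concrete, here is a small sketch of how it might be used to switch saving on and off. The run_analysis function and its message are hypothetical; only the is.null check comes from test2 above:

```r
# Hypothetical function: file.names defaults to NULL, meaning "do not save"
run_analysis <- function(file.names = NULL) {
  is.save <- !is.null(file.names)
  if (is.save) {
    message("would save to: ", paste(unlist(file.names), collapse = ", "))
  }
  is.save
}

run_analysis()
run_analysis(list(cov.value = "cov.rds", plot.name = "plot.pdf"))
```

A caller that has nothing to save can pass NULL explicitly or leave the argument out; both take the same branch, which is what makes this version easy to wrap.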
