I wonder are there more efficient ways to assign values to a new variable in a data frame, than using for loops. I have two recent example:
[1] Getting normalized Leveshtein distance using vwr package:
rst34$Levenshtein = rep(0, nrow(rst34))
for (i in 1:nrow(rst34)) {
rst34$Levenshtein[i] = levenshtein.distance(
as.character(rst34$target[i]), as.character(rst34$prime[i]))[[1]] /
max(nchar(as.character(rst34$target[i])), nchar(as.character(rst34$prime[i]))
)
}
[2] Extracting substring from another variable:
rst34$Experiment = 'rst4'
for (i in 1:nrow(rst34)) {
rst34$Experiment[i] = unlist(strsplit(as.character(rst34$subject[i]), '[.]'))[1]
}
Also, I think that there should be no difference between initializations in two examples:
rst34$Levenshtein = rep(0, nrow(rst34))
rst34$Experiment = 'rst4'
Many thanks!
Something like...
rst34$Experiment = sapply(rst34$subject, function(element){
unlist(strsplit(as.character(element), '[.]'))[1]
})
Should hopefully do the trick. I don't have your data, so I couldn't actually test it out.
It would only make sense to apply nchar to a character variable so the as.character calls are probably not needed:
rst34$Levenshtein <-
levenshtein.distance( rst34$target, rst34$prime) /
pmax(nchar(rst34$target),
nchar(rst34$prime ) )
Related
I've currently got a very lengthy and repetitive bit of code for data normalisation and inversion ((x-min)/(max-min)*-1)+1) that I want to clean up a bit.
This is a small sample of what it currently looks like:
W3_E1_Norm_New <- W3_E1_Average%>%
mutate(W3_E1_Norm_New = ((W3_E1_zoo-W3_E1_Min)/(W3_E1_Max-W3_E1_Min)*-1)+1)
W3_E2_Norm_New <- W3_E2_Average%>%
mutate(W3_E2_Norm_New = ((W3_E2_zoo-W3_E2_Min)/(W3_E2_Max-W3_E2_Min)*-1)+1)
W3_E3_Norm_New <- W3_E3_Average%>%
mutate(W3_E3_Norm_New = ((W3_E3_zoo-W3_E3_Min)/(W3_E3_Max-W3_E3_Min)*-1)+1)
Each 'W3_E1' refers to a sample ID, and at present each sample ID requires the two lines of code to be written out each time.
Ideally I'd like to write a function which can call a character string (Sample_IDs) into the names of each data frame, so something like
a_Norm_New
would return
W3_E1_Norm_New
then
W3_E2_Norm_New
etc.
Is there a way to write a function that could accomplish this?
Many thanks
I don't have your data but this should work. Define a function:
my_fun <- function (x) {
norm_new <- paste0(x,"_Norm_New")
average <- paste0(x,"_Average")
zoo <- paste0(x, "_zoo")
min <- paste0(x, "_Min")
max <- paste0(x, "_Max")
df <- get(average) %>%
mutate(new_norm = ((zoo - min) / (max - min) * - 1) + 1)
assign(df, norm_new)
}
Then run a for loop:
Sample_IDs <- c("W3_E1", "W3_E2", "W3_E3")
for (i in Sample_IDs) {
my_fun(i)
}
With data.table, it is very easy to write functions that use quoted variable names (see a blog post I wrote on the subject).
Here, we paste the pattern of your column name everywhere with the sufx variable:
library(data.table)
normalize <- function(dt, sufx = "W3_E1"){
df <- as.data.table(dt)
df[, (paste0(sufx,"_Norm_New")) := (
(get(paste0(sufx,_zoo)) - get(paste0(sufx,"_Min"))
)/(
get(paste0(sufx,"_Max")) - get(paste0(sufx,"_Min"))
)*-1)+1)
}
Here the code is not easy to read because I wanted to show that this can be done in one line but you can give more readability easily.
In this solution, you use get to unquote your variable name.
I regularly come up against the issue of how to categorise dataframes from a list of dataframes according to certain values within them (E.g. numeric, factor strings, etc). I am using a simplified version using vectors here.
After writing messy for loops for this task a bunch of times, I am trying to write a function to repeatedly solve the problem. The code below returns a subscripting error (given at the bottom), however I don't think this is a subscripting problem, but to do with my use of return.
As well as fixing this, I would be very grateful for any pointers on whether there are any cleaner / better ways to code this function.
library(plyr)
library(dplyr)
#dummy data
segmentvalues <- c('1_P', '2_B', '3_R', '4_M', '5_D', '6_L')
trialvec <- vector()
for (i in 1:length(segmentvalues)){
for (j in 1:20) {
trialvec[i*j] <- segmentvalues[i]
}
}
#vector categorisation
vcategorise <- function(categories, data) {
#categorises a vector into a list of vectors
#requires plyr and dyplyr
assignment <- list()
catlength <- length(categories)
for (i in 1:length(catlength)){
for (j in 1:length(data)) {
if (any(contains(categories[i], ignore.case = TRUE,
as.vector(data[j])))) {
assignment[[i]][j] <- data[j]
}
}
}
return (assignment)
}
result <- vcategorise(categories = segmentvalues, data = trialvec)
Error in *tmp*[[i]] : subscript out of bounds
You are indexing assignments -- which is ok, even if at an index that doesn't have a value, that just gives you NULL -- and then indexing into what you get there -- which won't work if you get NULL. And NULL you will get, because you haven't allocated the list to be the right size.
In any case, I don't think it is necessary for you to allocate a table. You are already using a flat indexing structure in your test data generation, so why not do the same with assignment and then set its dimensions afterwards?
Something like this, perhaps?
vcategorise <- function(categories, data) {
assignment <- vector("list", length = length(data) * length(categories))
n <- length(data)
for (i in 1:length(categories)){
for (j in 1:length(data)) {
assignment[(i-1)*n + j] <-
if (any(contains(categories[i],
ignore.case = TRUE,
as.vector(data[j])))) {
data[j]
} else {
NA
}
}
}
dim(assignment) <- c(length(data), length(categories))
assignment
}
It is not the prettiest code, but without fully understanding what you want to achieve, I don't know how to go further.
I found the following piece of code here at stackoverflow:
library(svDialogs)
columnFunction <- function (x) {
column.D <- dlgList(names(x), multiple = T, title = "Spalten auswaehlen")$res
if (!length((column.D))) {
cat("No column selected\n")
} else {
cat("The following columns are choosen:\n")
print(column.D)
x <- x[,!names(x) %in% column.D]
}
return(x)
}
df <- columnFunction(df)
So i wanted to use it for my own proposes, but it did not work out as planned.
What i try to archive is to use it in a for loop or with lapply to use it with multiple data.frames. Amongst others I tried:
d.frame1 <- iris
d.frame2 <- cars
l.frames <- c("d.frame1","d.frame2")
for (b in l.frames){
columnFunction(b)
}
but it yields the following error message:
Error in dlgList(names(x), multiple = T, title = "Spalten auswaehlen")$res :
$ operator is invalid for atomic vectors
Well, what i need additionally is that I can loop though that function so that i can iterate through different data.frames.
Last but not least I would need something like:
for (xyz in l.frames){
xyz <- columnFunction(xyz)
}
to automate the saving step.
Does anyone have any idea how i could loop though that function or how i could change the function so that it performs all those steps and is loopable.
I`m quite new to R so perhaps Im missing something obvious.
lapply was designed for this task:
l.frames <- list(d.frame1, d.frame2)
l.frames <- lapply(l.frames, columnFunction)
If you insist on using a for loop:
for (i in seq_along(l.frames)) l.frames[[i]] <- columnFunction(l.frames[[i]])
I want to do
am<-0
an<-0
bm<-0
bn<-0
cm<-0
cn<-0
.....
.....
and son on till zn.Is there a way to do it without writing so much
You can use assign to create variable by name:
for (first in letters[1:3]) {
for (second in letters[13:14]) {
assign(paste(first, second, sep=""), 0)
}
}
Probably a better way would be to use dataframe like this:
df <- data.frame(
name=paste(rep(letters[1:3], each=2), rep(letters[13:14], 3), sep=""),
value=0
)
If you have already defined you am,an,... as a separate variable , one way to aggregate them in a vector is :
vars <- unlist(mget(ls(pattern='^[a-z](m|n)$')))
Then it is easy to initialize your vector like this :
vars <- 0
I'm looking for a general practice how to check a parameter was defined in a function.
I came up with these three ideas. Which one is the proper way of doing it?
Unfortunately, the third one is not working. substitute() is working differently in a function and I couldn't figure it out how to use it properly.
file.names <- list(
cov.value <- "cov.rds",
plot.name <- "plot.pdf"
)
test1 <- function(file.names){
is.save <- !missing(file.names)
}
test2 <- function(file.names = NULL) {
is.save <- !is.null(file.names)
}
test3 <- function(file.names = NULL) {
is.save <- exists(as.character(substitute(file.names)))
}
I personally think the second approach with a default value is MUCH easier to use & understand. (And the third approach is just really bad)
...especially when you are writing a wrapper function that needs to pass an argument to it. How to pass a "missing"-value is not obvious!
wraptest1 <- function(n) {
file.names <- if (n > 0) sample(LETTERS, n)
else alist(a=)[[1]] # Hacky way of assigning 'missing'-value
print(test1(file.names))
}
wraptest1(2) # TRUE
wraptest1(0) # FALSE
wraptest2 <- function(n) {
file.names <- if (n > 0) sample(LETTERS, n)
else NULL # Much easier to read & understand
print(test2(file.names))
}
wraptest2(2) # TRUE
wraptest2(0) # FALSE
[granted, there are other ways to work around passing the missing value, but the point is that using a default value is much easier...]
Some default values to consider are NULL, NA, numeric(0), ''
It's generally a good idea to look at code from experienced coders---and R itself has plenty of examples in the R sources.
I have seen both your first and second example being used. The first one is pretty idiomatic; I personally still use the second more often by force of habit. The third one I find too obscure.