Function to abbreviate scientific names - r

Could you please help me?
I'm trying to modify an R function written by a colleague. This function receives a character vector with scientific names (Latin binomials), just like this one:
Name
Cerradomys scotti
Oligoryzomys sp
Philander frenatus
Byrsonima sp
Campomanesia adamantium
Cecropia pachystachya
Cecropia sp
Erythroxylum sp
Ficus sp
Leandra aurea
Then, it should abbreviate the scientific names, using only the first three letters of the genus (first term) and the epithet (second term) to make a short code. For instance, Cerradomys scotti should become Cersco.
This is the original function:
AbbreviatedNames <- function(vector) {
abbreviations <- character(length = length(vector))
splitnames <- strsplit(vector, " ")
for (i in 1:length(vector)) {
vector[i] <- if(splitnames[[i]][2] == "^sp") {
paste(substr(splitnames[[i]][1],1,3),
splitnames[[i]][2], sep = "")
}
else {
paste(substr(splitnames[[i]][1],1,3),
substr(splitnames[[i]][2],1,3), sep = "")
}
}
vector
}
With a simple list like that one, the function works perfectly. However, when the list has some missing or extra elements, it does not work. The loop stops when it meets the first row that does not match the pattern. Let's take this more complex list as an example:
Name
Cerradomys scotti
Oligoryzomys sp
Philander frenatus
Byrsonima sp
Campomanesia adamantium
Cecropia pachystachya
Cecropia sp
Erythroxylum sp
Ficus sp
Leandra aurea
Morfosp1
Vismia cf brasiliensis
See that Morfosp1 has only 1 term. And Vismia cf brasiliensis has an additional term (cf) in the middle.
I've tried adapting the function, for instance, this way:
AbbreviatedNames <- function(vector) {
abbreviations <- character(length = length(vector))
splitnames <- strsplit(vector, " ")
for (i in 1:length(vector)) {
vector[i] <- if(splitnames[[i]][2] == "^sp" & is.na(splitnames[[i]][2])) {
paste(substr(splitnames[[i]][1],1,3),
splitnames[[i]][2], sep = "")
}
else {
paste(substr(splitnames[[i]][1],1,3),
substr(splitnames[[i]][2],1,3), sep = "")
}
}
vector
}
Nevertheless, it does not work. I get this error message:
Error in if (splitnames[[i]][2] == "^sp" & is.na(splitnames[[i]][2])) { :
missing value where TRUE/FALSE needed
How could I make the function:
Deal also with names that have only 1 term?
Expected outcome: Morfosp1 -> Morfosp1 (stays the same)
Deal also with names that have an additional term in the middle?
Expected outcome: Vismia cf brasiliensis -> Visbra (term in the middle is ignored)
Thank you very much!

Something like this is pretty concise:
test <- c("Cerradomys scotti", "Oligoryzomys sp", "Latingstuff", "Latin staff more")
# function to truncate a given name
trunc_str <- function(latin_name) {
# split it on a space
name_split <- unlist(strsplit(latin_name, " ", fixed = TRUE))
# if one name, just return it
if (length(name_split) == 1) return(name_split)
# truncate to first 3 letters
name_trunc <- substr(name_split, 1, 3)
# paste the first and last term together (skipping any middle ones)
paste0(head(name_trunc, 1), tail(name_trunc, 1))
}
# iterate over all
vapply(test, trunc_str, "")
# Cerradomys scotti Oligoryzomys sp Latingstuff Latin staff more
# "Cersco" "Olisp" "Latingstuff" "Latmor"
If you don't want a named vector output, you can use USE.NAMES = FALSE in vapply(). Or feel free to use a loop here.

AbbreviatedNames <- function(vector) {
abbreviations <- character(length = length(vector))
splitnames <- strsplit(vector, " ")
for (i in 1:length(vector)){
# One name: keep it unchanged
if(length(splitnames[[i]])==1){
vector[i] <- splitnames[[i]][1]
}
# Two names
else if(length(splitnames[[i]])==2){
vector[i] <- if(splitnames[[i]][2] == "sp") {
paste(substr(splitnames[[i]][1],1,3),
splitnames[[i]][2], sep = "")
}
else {
paste(substr(splitnames[[i]][1],1,3),
substr(splitnames[[i]][2],1,3), sep = "")
}
}
# Three names
else if(length(splitnames[[i]])==3){
vector[i] <- paste(substr(splitnames[[i]][1],1,3),
substr(splitnames[[i]][3],1,3), sep = "")
# Assuming that the unwanted word is always in the middle
}
}
return(vector)
}
I tested it on the list you gave and it seems to work. Tell me if you need more general code.
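For reference, R's "missing value where TRUE/FALSE needed" error from the question appears because indexing past the end of a vector yields NA, and comparing NA with == also yields NA, which if() cannot use. A minimal sketch of the cause and of a length-based guard (the variable names parts, second and code are only illustrative):

```r
parts <- strsplit("Morfosp1", " ", fixed = TRUE)[[1]]

# Indexing past the end of a vector yields NA, not an error
second <- parts[2]
is.na(second)      # TRUE

# Comparing NA with == yields NA, which if() rejects
second == "sp"     # NA

# Checking the number of terms first avoids the comparison entirely
if (length(parts) < 2) {
  code <- parts[1]
} else {
  code <- paste0(substr(parts[1], 1, 3),
                 substr(parts[length(parts)], 1, 3))
}
code
```

Testing the length before touching the second element is what both answers above rely on.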

Thank you very much for the help, Ricardo and Adam! I've made the code available on GitHub to other people who work with interaction networks, and need to abbreviate scientific names to be used in graphs.

Related

Can't manipulate global/local variables inside a function in R

dna = c("A","G","C","T")
x =sample(dna,50,replace =TRUE)
dna_f = function(x){
dnastring <- ""
for (val in x){
paste(dnastring,val,sep="")
}
return(dnastring)
}
dna_f(x)
I'm trying to produce a single string that contains all the randomly sampled letters. x contains all 50 letters, and I'm trying to combine them into one string using the paste function, but when I run this, the output is an empty string. I tried making dnastring a global variable because I thought the scope of a function might work differently in R (I'm new to R), but I got the same output. Some help would be appreciated, thanks.
You don't need a for loop here. Try paste with the collapse argument.
dna_f = function(x){
paste0(x, collapse = '')
}
dna_f(x)
#[1] "CCTACCAACCCTTTCTAGCCCACTATGCATCACAACTGCGGTCTCATCAC"
You forgot to assign the result back with dnastring <-. paste() returns a new string; it does not modify dnastring in place.
dna = c("A","G","C","T")
x =sample(dna,50,replace =TRUE)
dna_f = function(x){
dnastring <- ""
for (val in x){
dnastring <- paste(dnastring,val,sep="")
}
return(dnastring)
}
Output:
> dna_f(x)
[1] "GGTCTGGCCGAACTACTGTACACCCCAAAGACAACGCCCCCGACGCTCTA"
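The reason the original loop returned an empty string can be seen in isolation: paste() computes a new value and never changes its arguments, so the result is lost unless it is assigned. A small sketch (the variable names s and unchanged are illustrative):

```r
s <- ""
paste(s, "A", sep = "")       # the result is computed, then discarded
unchanged <- s                # s is still the empty string

s <- paste(s, "A", sep = "")  # assigning keeps the result
s
```

This is unrelated to global versus local scope; the same thing happens at the top level.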

Function Recursion in R

I'm writing a function (NextWordPrediction) in R to predict the next word given some words. The basic structure is as follows:
If input exists in dat such that nrow(dat) != 0 return input and answer
If input doesn't exist such that nrow(dat) == 0, recurse and attempt input-1 (e.g. if input is "hello great world", try "great world", and so on until nrow(dat) != 0)
If after step 2 nrow(dat) == 0 return string "Word not in dictionary. We added this to our database!" and add original input to dataset
Here is the full code:
NextWordPrediction <- function(input) {
dat <- training %>%
filter(., N_gram == str_count(input, "\\S+") + 1) %>%
filter(grepl(paste("^", tolower(str_squish(input)), sep = ""), Word)) %>%
arrange(., desc(Prop))
if (nrow(dat) != 0) {
assign("training",
training %>%
mutate(Frequency = ifelse(Word == input &
N_gram == str_count(input, "\\S+"),
Frequency + 1,
Frequency)) %>%
group_by(., N_gram) %>%
mutate(., Prop = Frequency/ sum(Frequency)) %>%
data.frame(.),
envir = .GlobalEnv)
val <- dat$Word_to_Predict[1]
ans <- paste(str_squish(input), val)
return(list(ans, head(dat,5)))
} else if (nrow(dat) == 0 & word(input, 1) != "NA") {
input_1 <- Reduce(paste, word(input, 2:str_count(input,"\\S+")))
return(NextWordPrediction(input_1))
} else if (nrow(dat) == 0 & word(input, 1) == "NA") {
assign("training",
training %>%
add_row(., Word = tolower(input), Frequency = 1, N_gram = str_count(input, "\\S+")),
envir = .GlobalEnv)
ans <- paste("Word not in dictionary. We added this to our database!")
return(ans)
}
}
The issue I'm having happens somewhere between step 2 and 3. If input is not found after the recursion call, the added input to the database is input-1 ("great world") where I'd like the original input ("hello great world"). This is my first attempt to implement recursion and would like to understand the mistake in my code.
Thanks :)
Update to be Reproducible:
library(dplyr); library(stringr)
training <- data.frame(Word = c("hello", "she was great", "this is", "long time ago in"), Frequency = c(4, 3, 10, 1),
N_gram = c(1, 3, 2, 4), Prop = c(4/18, 3/18, 10/18, 1/18), Word_to_Predict = c(NA, "great", "is", "in"))
NextWordPrediction("she was") ## returns "she was" & "great"
NextWordPrediction("hours ago") ## returns "hours ago" & "in"
NextWordPrediction("words not in data") ## returns "Word not in dictionary. We added this to our database!" after trying "not in data", "in data" and adds "words not in data" to dataset
Here is an imperfect and overly-complicated demonstration of a recursive function operating on strings. Ideally there are some more safeguards that could be put into place, and there are of course much faster, more efficient, smarter ways of doing this one task, but ... perhaps you'll get the point.
I'm going to change every "e" to an "a", one word at a time.
e_to_a <- function(strings) {
# unnecessarily complex
message("# Called : ", sQuote(strings))
if (!nzchar(strings)) return(strings)
word1 <- sub("^([^[:space:]]*)[[:space:]]?.*", "\\1", strings)
others <- sub("^[^[:space:]]*[[:space:]]?", "", strings)
message("# - word1 : ", sQuote(word1))
message("# - others: ", sQuote(others))
# operate on the first word
word1 <- gsub("e", "a", word1)
if (nzchar(others)) {
others <- e_to_a(others)
return(paste(word1, others))
} else {
return(word1)
}
}
In action:
e_to_a("hello great world")
# # Called : 'hello great world'
# # - word1 : 'hello'
# # - others: 'great world'
# # Called : 'great world'
# # - word1 : 'great'
# # - others: 'world'
# # Called : 'world'
# # - word1 : 'world'
# # - others: ''
# [1] "hallo graat world"
The key is that when you make the recursive call, what you're currently doing
return(NextWordPrediction(input_1))
is going to return just the recursive part, dismissing the first word. That would be analogous to me doing
if (nzchar(others)) {
others <- e_to_a(others)
# return(paste(word1, others))
return(others)
} else {
return(word1)
}
I hope you can apply this to your function.
Bottom line, since your question is not reproducible, I'll guess that your fix is something like:
} else if (nrow(dat) == 0 & word(input, 1) != "NA") {
input_vec <- str_split(input, "\\s+")[[1]] # str_split() returns a list; take its first element
input_firstword <- input_vec[1]
input_otherwords <- paste(input_vec[-1], collapse = " ")
return(paste(input_firstword, NextWordPrediction(input_otherwords)))
} else if (nrow(dat) == 0 & word(input, 1) == "NA") {
Stream-of-consciousness answer. It doesn't solve anything, but it highlights some areas where code can or must be changed. Up front: == NA fails; you're always discarding the first word in recursion; NA (the object meaning "could be anything") is being coerced into "NA", the literal string.
Starting with a fresh training, I'll debug(NextWordPrediction) and trace line-by-line. It gets to input_1 <- ..., the first thing I notice is:
first time, input_1 is "great world";
next time, it is "world";
next time, it is "na world", fail.
This is a classic fail on two counts:
the code assumes that there are multiple words, even though str_count(input,"\\S+") returns 1 here; and
it is a common mistake to assume that 2:... is always increasing and will not go over a certain count, but unfortunately 2:1 returns c(2L, 1L) ... perhaps you should check the length of your vectors before arbitrarily counting past them.
I think you're trying to guard against this with your previous test of word(input,1) != NA (which is also a mistake), but the only time that's going to happen is when input is 0-length vector (character(0)), not empty-string "". You won't get that with the current code, and I think your intent is for it to reduce to "".
I'm going to change your word(input, 2:str_count(...)) to
input_1 <- sub("^\\S*\\s?", "", input)
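The counting pitfall and the replacement can be checked on their own. This is a minimal sketch; drop_first is a hypothetical helper name wrapping the sub() call above:

```r
n_words <- 1
2:n_words               # counts down instead of being empty: 2 1

# seq_len() returns an empty vector when there is nothing to iterate over
seq_len(n_words - 1)    # integer(0)

# Dropping the first word with sub() works for any number of words
drop_first <- function(s) sub("^\\S*\\s?", "", s)
drop_first("hello great world")
drop_first("world")
```

With a single remaining word the sub() version reduces to an empty string rather than indexing past the end.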
You have word(input, 1) != "NA" (and ==). That is either mistaking R's native object for a string, or you think you should be checking for a literal string "NA"; granted, English doesn't use that much as a real word, but some languages do. I'm not certain if you intend that to be the NA literal or if for some reason your function will convert NA to "NA" and you want to guard against that.
That last assumption is fixing a symptom, not a problem. Never allow your function to return "NA" (this happens here in a couple of places), you need to guard against it. To me, it is perfectly reasonable to see a word "NA" and differentiate it from the R native NA. Data missingness is important to differentiate.
Assuming you meant != NA instead ... word(input, 1) != NA will never work. Let's run through some examples:
word("hello", 1)
# [1] "hello"
word("", 1)
# [1] ""
word(c(), 1)
# Warning in rep(string, length.out = n) :
# 'x' is NULL so the result will be NULL
# Error in mapply(function(word, loc) word[loc, "start"], words, start) :
# zero-length inputs cannot be mixed with those of non-zero length
word(character(0), 1)
# [1] NA
Okay, so it can return an NA, when the input vector is a 0-length character vector, but ...
word(character(0), 1) == NA
# [1] NA
word(character(0), 1) == NA_character_
# [1] NA
That's right, you cannot check for NA-ness that way. (Did you know that there are several kinds of NA, such as NA, NA_integer_, NA_real_, NA_character_ and NA_complex_? They are not the same: identical(NA, NA_real_) is FALSE.)
Use is.na(.):
is.na(word(character(0), 1))
# [1] TRUE
(That's assuming we can see it in normal operation.)
I'm going to change that if condition to:
} else if (nrow(dat) == 0 && nzchar(input) && !is.na(word(input, 1))) {
We're getting closer. Now I can get into the third call of the function, where input is finally "" and we go into the first conditional block, assigning the new content to training. Unfortunately, dat$Word_to_Predict[1] is NA, so your ans is " NA", which just doesn't seem logical. Granted, your default training dataset has this explicitly, and while I don't know what you mean to happen here, stringifying an R object of NA into " NA" seems wrong to me.
I don't have a fundamental fix to this flow, though: you want to concatenate the val found with the previous input string, but ... if Word_to_Predict is NA (not a normal string), then ... what do you do? For the sake of moving forward, I'll dismiss concatenating "NA" onto a string ... though it's producing results that are "wrong" from a linguistic standpoint, I believe. (I'll just interpret "NA" as "(I don't have a great value for this spot)" or similar :-)
You are always pasting a squished input with val, but ... if input is "", then paste still adds a space between them, which seems unnecessary. You can always "patch" this later by repeatedly squishing the strings, but ... symptom/problem again. I suggest instead using
ans <- str_squish(paste(input, val))
And my original point ...
When you start with "she was", it will find something on the first invocation, and we paste the input with the val to get the answer. However, when you have to go into recursion, you call the function again with the rest of the sentence and completely discard the first word. For instance:
NextWordPrediction("hello great world")
#1> `input` is "hello great world", second `if` block, `input_1` is "great world"
#2> `input` is "great world", second `if` block, `input_1` is "world"
#3> `input` is "world", second `if` block, `input_1` is `""`
#4> `input` is "", first `if` block, `val` is `NA`, and `ans` is "NA"
#3> blindly returns list("NA", head(dat)) (discarding "world")
#2> blindly returns list("NA", head(dat)) (discarding "great")
#1> blindly returns list("NA", head(dat)) (discarding "hello")
Do you see the problem now? Instead of return(NextWordPrediction(input_rest)), you need to capture the result, prepend the word you stripped from input, and continue passing the updated return value up the chain. I suggest
input_1 <- gsub("\\s\\S*", "", input)
input_rest <- sub("^\\S*\\s?", "", input)
out <- NextWordPrediction(input_rest)
out[[1]] <- str_squish(paste(input_1, out[[1]]))
return(out)
After all of that, I now see
NextWordPrediction("hello great world")
# [[1]]
# [1] "hello great world NA"
# [[2]]
# Word Frequency N_gram Prop Word_to_Predict
# 1 hello 4 1 1 <NA>
which, according to your initial training, is correct.
Unfortunately, this breaks something else.
"words not in data" always eventually matches something (as will anything not in training), since it reduces to an empty string "", and your first logic of grepl(paste("^", tolower(str_squish(input)), sep = ""), Word) will always match something with input of "".
We can fix this with a simple additional condition in your first filtering:
filter(nzchar(input) & grepl(paste("^", tolower(str_squish(input)), sep = ""), Word)) %>%
And finally, when you get to the final if block when you need to add data to training, if this is the first/outer call of the function, then input truly reflects the entire sentence, which is what you want. However, if you've done one or more calls of recursion, then input is merely one word in the chain, not the entire thing. And due to some of the assumptions above, at this stage input is "", so ... any addition would be useless.
There are two strategies for dealing with this:
Keep track of whether this is the outer (first) call or some inner call. When you recursively call, check the return value ... if empty and this is an inner call, return empty; if empty and this is the first/outer call, then append to training; or
Always pass the entire string along with the current input. This would reverse my recommendation in bullet 6 above, so your second if block would just call NextWordPrediction(input_rest, input_1) (using my variables) and not str_squish after it. The squishing/pasting would be handled in the first if block, where you would need to prepend the value (if any) of preceding.
NextWordPrediction <- function(input, preceding = "") {
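A toy sketch of the second strategy, kept separate from NextWordPrediction: the recursion drops one leading word at a time while the untouched original string travels along as an extra argument. The names backoff, known and original are hypothetical and do not come from the question.

```r
backoff <- function(input, known, original = input) {
  # nothing left to try: report failure, but keep the full original string
  if (!nzchar(input)) {
    return(list(found = NA_character_, original = original))
  }
  # a match at this level: return it together with the original string
  if (input %in% known) {
    return(list(found = input, original = original))
  }
  # otherwise drop the first word and recurse, passing `original` along
  rest <- sub("^\\S*\\s?", "", input)
  backoff(rest, known, original)
}

hit  <- backoff("hello great world", known = c("world"))
miss <- backoff("words not in data", known = c("world"))
```

Because original is never shortened, the outermost caller can still add the whole sentence to its data when found comes back as NA.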
Side notes, not wrong per se but still not good.
& (single) in an if condition works but is bad practice: & does vector logic, which means it can return vectors of length other than 1; if conditions must be length exactly 1, not 0 or 2 or more. Use && here.
Reduce(paste, ...) is just unnecessary. Use paste(...).
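Both side notes can be illustrated with a short sketch (the variable names are illustrative only):

```r
# & is element-wise and can return more than one value
x <- c(TRUE, FALSE)
y <- c(TRUE, TRUE)
x & y

# && short-circuits: the right-hand side is not evaluated when the left is FALSE
input <- ""
ok <- nzchar(input) && stop("never evaluated")
ok

# paste() with collapse gives the same result as Reduce(paste, ...)
words <- c("great", "world")
same <- identical(Reduce(paste, words), paste(words, collapse = " "))
same
```

Note that replacing Reduce(paste, ...) needs the collapse argument; a bare paste(words) would return the vector unchanged.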
After understanding the implications of recursion in my function thanks to @r2evans, I realized that a solution by means of recursion would be too complicated. The following code meets all my conditions and works as expected:
NextWordPrediction <- function(input) {
dat <- training %>%
filter(., N_gram == str_count(input, "\\S+") + 1) %>%
filter(grepl(paste("^", tolower(str_squish(input)), sep = ""), Word)) %>%
arrange(., desc(Prop))
if (nrow(dat) != 0) {
assign("training",
training %>%
mutate(Frequency = ifelse(Word == input &
N_gram == str_count(input, "\\S+"),
Frequency + 1,
Frequency)) %>%
group_by(., N_gram) %>%
mutate(., Prop = Frequency/ sum(Frequency)) %>%
data.frame(.),
envir = .GlobalEnv)
val <- dat$Word_to_Predict[1]
ans <- paste(str_squish(input), val)
return(list(ans, head(dat,5)))
} else {
for (i in 2:str_count(input, "\\S+")) {
input_1 <- word(input, start = i, end = str_count(input,"\\S+"))
dat <- training %>%
filter(., N_gram == str_count(input_1, "\\S+") + 1) %>%
filter(grepl(paste("^", tolower(str_squish(input_1)), sep = ""), Word)) %>%
arrange(., desc(Prop))
if (nrow(dat) != 0) {
val <- dat$Word_to_Predict[1]
ans <- paste(str_squish(input), val)
return(list(ans, head(dat,5)))
} else if (nrow(dat) == 0 & i == str_count(input, "\\S+")) {
assign("training",
training %>%
add_row(., Word = tolower(input), Frequency = + 1, N_gram = str_count(input, "\\S+"),
Word_to_Predict = word(input, -1)) %>%
group_by(., N_gram) %>%
mutate(., Prop = Frequency/ sum(Frequency)) %>%
data.frame(.),
envir = .GlobalEnv)
ans <- paste("Word not in dictionary. We added this to our database!")
return(ans)
}
}
}
}
It loops through input-1 until a value is found in the dataframe and when this happens an answer is returned, otherwise we add the original input to the dataframe.

String splitting in R Programming

Currently, the script below splits a combined item code into specific item codes.
rule2 <- c("MR")
df_1 <- test[grep(paste("^",rule2,sep="",collapse = "|"),test$Name.y),]
SpaceName_1 <- function(s){
num <- str_extract(s,"[0-9]+")
if(nchar(num) >3){
former <- substring(s, 1, 4)
latter <- strsplit(substring(s,5,nchar(s)),"")
latter <- unlist(latter)
return(paste(former,latter,sep = "",collapse = ","))
}
else{
return (s)
}
}
df_1$Name.y <- sapply(df_1$Name.y, SpaceName_1)
Example,
Combined item code: Room 324-326 is splitting into MR324 MR325 MR326.
However for this particular Combined item code: Room 309-311 is splitting into MR309 MR300 MR301.
How should I amend the script to give me MR309 MR310 MR311?
You can try something along these lines:
range <- "324-326"
x <- as.numeric(unlist(strsplit(range, split="-")))
paste0("MR", seq(x[1], x[2]))
#[1] "MR324" "MR325" "MR326"
I assume that you can obtain the numerical room sequence by some means, and then use the snippet I gave you above.
If your combined item codes always have the form Room xxx-yyy, then you can extract the range using gsub:
range <- gsub("Room ", "", "Room 324-326")
If your item codes were in a vector called codes, then you could obtain a vector of ranges using:
ranges <- sapply(codes, function(x) gsub("Room ", "", x))
We can also evaluate the string after replacing the - with : and then paste the prefix "MR".
paste0("MR", eval(parse(text=sub("\\S+\\s+(\\d+)-(\\d+)", "\\1:\\2", range))))
#[1] "MR324" "MR325" "MR326"
Wrap it as a function for convenience
fChange <- function(prefixStr, RangeStr){
paste0(prefixStr, eval(parse(text=sub("\\S+\\s+(\\d+)-(\\d+)",
"\\1:\\2", RangeStr))))
}
fChange("MR", range)
fChange("MR", range1)
#[1] "MR309" "MR310" "MR311"
For multiple elements, just loop over and apply the function
sapply(c(range, range1), fChange, prefixStr = "MR")
data
range <- "Room 324-326"
range1 <- "Room 309-311"
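An alternative sketch that avoids eval(parse()), assuming every combined code contains exactly two runs of digits as in "Room 309-311". The function name expand_rooms and its prefix argument are hypothetical:

```r
expand_rooms <- function(code, prefix = "MR") {
  # pull out the runs of digits, e.g. "309" and "311"
  nums <- as.integer(regmatches(code, gregexpr("[0-9]+", code))[[1]])
  stopifnot(length(nums) == 2)
  # build the full sequence and attach the prefix
  paste0(prefix, seq(nums[1], nums[2]))
}

expand_rooms("Room 309-311")
#[1] "MR309" "MR310" "MR311"
```

Treating the two ends as numbers, rather than splitting the string character by character, is what keeps the carry from 309 to 310 correct.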

Dynamic variable names in plots, files and compatibility with loop

I am trying to write a function that makes a plot and saves it into a file automatically.
The trick I struggle with is to do both dynamically [plotname=varname & filename=varname],
and to make it compatible with calling it from a loop.
# Create data
my_df = cbind(uni=runif (100),norm=rnorm (100),bino=rbinom(100,20, 0.5)); head (my_df)
my_vec = my_df[,'uni'];
# How to make plot and file-name meaningful if you call the variable in a loop?
# if you call by name, the plotname is telling. It is similar what I would like to see.
hist(my_df[,'bino'])
for (plotit in colnames(my_df)) {
hist(my_df[,plotit])
print (plotit)
# this is already not meaningful
}
# step 2 write it into files
hist_auto <- function(variable, col ="gold1", ...) {
if ( length (variable) > 0 ) {
plotname = paste(substitute(variable), sep="", collapse = "_"); print (plotname); is (plotname)
# I would like to define plotname, and later tune it according to my needs
FnP = paste (getwd(),'/',plotname, '.hist.pdf', collapse = "", sep=""); print (FnP)
hist (variable, main = plotname)
#this is apparently not working: I do not get my_df[, "bino"] or anything similar
dev.copy2pdf (file=FnP )
} else { print ("var empty") }
}
hist_auto (my_vec)
# name works, and is meaningful [as much as the var name ... ]
hist_auto (my_df[,'bino'])
# name sort of works, but falls apart
assign (plotit, my_df[,'bino'])
hist_auto (get(plotit))
# name works, but meaningless
# Now in a loop
for (plotit in colnames(my_df)) {
my_df[,plotit]
hist(my_df[,plotit])
## name works, but meaningless and NOT UNIQUE > overwritten by next
}
for (plotit in colnames(my_df)) {
hist_auto(my_df[,plotit])
## name works, but meaningless and NOT UNIQUE > overwritten by next
}
for (plotit in colnames(my_df)) {
assign (plotit, my_df[,plotit])
hist_auto (get(plotit))
## name works, but meaningless and NOT UNIQUE > overwritten by next
}
My aim is to have a function that iterates over eg. columns of a matrix, plots and saves each with a unique and meaningful name.
The solution will probably involve a smart combination of substitute(), parse(), eval() and paste(), but, lacking a solid understanding of them, I failed to figure it out.
My basis of experimentation was:
how to dynamically call a variable?
How about something like this? You may need to install.packages("ggplot2")
library(ggplot2)
my_df <- data.frame(uni=runif(100),
norm=rnorm(100),
bino=rbinom(100, 20, 0.5))
get_histogram <- function(df, varname, binwidth=1, save=T) {
stopifnot(varname %in% names(df))
title <- sprintf("Histogram of %s", varname)
p <- (ggplot(df, aes_string(x=varname)) +
geom_histogram(binwidth=binwidth) +
ggtitle(title))
if(save) {
filename <- sprintf("histogram_%s.png", gsub(" ", "_", varname))
ggsave(filename, p, width=10, height=8)
}
return(p)
}
for(var in names(my_df))
get_histogram(my_df, var, binwidth=0.5) # If you want to save them
get_histogram(my_df, "uni", binwidth=0.1, save=F) # If you want to look at a specific one
So I ended up with 2 functions, one that can iterate over data frames, and another that takes a single vector. Using parts of Adrian's [thanks!] solution:
hist_dataframe <- function(df, colName, col ="gold1", ...) {
stopifnot(colName %in% colnames(df))
variable = df[,colName]
stopifnot(length (variable) >1 )
plotname = paste(substitute(df),'__', colName, sep="")
FnP = paste (getwd(),'/',plotname, '.hist.pdf', collapse = "", sep=""); print (FnP)
hist (variable, main = plotname)
dev.copy2pdf (file=FnP )
}
And the one for simple vectors stays as in Q.
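For the loop case, a base-R sketch that takes the plot title and file name from the column name, so each file is unique. The function name save_histograms and the use of tempdir() as the output directory are assumptions for illustration; it writes directly to a pdf() device instead of using dev.copy2pdf().

```r
save_histograms <- function(m, dir = tempdir(), col = "gold1") {
  files <- character(0)
  for (nm in colnames(m)) {
    file <- file.path(dir, paste0(nm, ".hist.pdf"))
    pdf(file)                       # open a PDF device for this column
    hist(m[, nm], main = nm, xlab = nm, col = col)
    dev.off()                       # close it so the file is written
    files <- c(files, file)
  }
  files                             # paths of the files that were written
}

my_df <- cbind(uni = runif(100), norm = rnorm(100), bino = rbinom(100, 20, 0.5))
saved <- save_histograms(my_df)
basename(saved)
```

Passing the name as a string, rather than recovering it with substitute(), is what makes the title meaningful inside a loop.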

R: How to convert from loops and rbinds to efficient code?

I'm new to R. I have a problem to solve, and a working function below that solves it nicely (in decent time). But, from what I'm reading on R tutorials, and here on SO, I feel like I'm doing way too much work to solve it. Is there some fancy R way to collapse this all into a few lines?
The problem to solve: Given a CSV file of character data, and a "flag" argument, extract the value at position [row, 1]. "row" is calculated to be the minimum value from column "InterestingColumn" for "flag a", the maximum value from column "InterestingColumn" for "flag b", or the n-th value defined by a numeric "flag". The output should be grouped by the unique values of "InterestingColumn". The returned result should be a data frame. The column schema is known, but the length of the file is not.
My instinct is that I should be able to get rid of the for loop altogether, and also that my reconstruction of the matrix with rbind each time is inefficient (like this?) Any tutelage would be appreciated, thanks!
myfunc <- function(flag = "a") {
csv <- read.csv("data.csv", colClasses = "character")
col <- unique(csv$InterestingColumn)
output <- NULL
for (i in 1:length(col)) {
sub <- subset(csv, InterestingColumn == col[i])
vals <- as.numeric(sub[, 12])
if (flag == "a") {
output <- rbind(output, matrix(c(sub[which.min(vals),1], col[i]), ncol = 2))
}
else if (flag == "b") {
output <- rbind(output, matrix(c(sub[which.max(vals),1], col[i]), ncol = 2))
}
else if (is.numeric(flag)) {
output <- rbind(output, matrix(c(sub[flag,1], col[i]), ncol = 2))
}
}
colnames(output) <- c("data", "col")
as.data.frame(output)
}
Say that column 12 is named Col12. Then aggregate may be in order. Everything after the read.csv call in the function should be handled by the following expression (but you may want to set the names of the resulting data frame):
aggregate(Col12 ~ InterestingColumn, data=csv, FUN=function(x) {
if (flag == "a") {
min(x);
} else if (flag == "b") {
max(x);
} else if (is.numeric(flag)) {
x[flag];
}
})
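Note that aggregate() as written returns the minimum or maximum of Col12 itself, whereas the question's function returns the value in column 1 of the row where that extreme occurs. A sketch of the latter using split() and lapply(), on made-up data standing in for the CSV; the names pick_rows, id and Col12 are assumptions:

```r
# Made-up data standing in for the CSV (all columns read as character)
csv <- data.frame(
  id = c("r1", "r2", "r3", "r4"),
  InterestingColumn = c("x", "x", "y", "y"),
  Col12 = c("5", "2", "7", "9"),
  stringsAsFactors = FALSE
)

pick_rows <- function(csv, flag = "a") {
  groups <- split(csv, csv$InterestingColumn)
  rows <- lapply(groups, function(g) {
    vals <- as.numeric(g$Col12)
    # row position within the group, depending on the flag
    idx <- if (identical(flag, "a")) {
      which.min(vals)
    } else if (identical(flag, "b")) {
      which.max(vals)
    } else {
      flag
    }
    data.frame(data = g[idx, 1], col = g$InterestingColumn[1],
               stringsAsFactors = FALSE)
  })
  out <- do.call(rbind, rows)   # bind once, instead of growing inside a loop
  rownames(out) <- NULL
  out
}

pick_rows(csv, "a")
```

Building the list first and calling rbind once avoids the repeated copying that the loop version incurs.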
