To make my code more readable, I like to avoid reusing the names of objects that already exist when I create new ones. Because R is package-based and functions are first-class objects, it is easy to overwrite common functions that are not in base R: a popular package might use a short function name, and unless you know which package to load there is no way to check for it. Objects such as the built-in logicals T and F also cause trouble.
Some examples that come to mind are:
One letter
c
t
T/F
J
Two letters
df
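As a small illustration of why T and F are riskier than most names on this list, here is a minimal sketch of my own. It relies on two facts: T and F are ordinary variables that can be masked (unlike TRUE and FALSE), and R skips non-function objects when it looks up a name in call position. The name res is just a placeholder for this example.

```r
# Run inside local() so the masking does not leak into the workspace.
res <- local({
  T <- 0                   # allowed: T is just a variable bound to TRUE
  masked <- isTRUE(T)      # FALSE, because the local T is now 0
  rm(T)                    # remove the mask
  restored <- isTRUE(T)    # TRUE, base's T is visible again

  c <- 5                   # masks the name c with a number
  still_works <- c(c, 2)   # the call c(...) still finds base::c
  list(masked = masked, restored = restored, still_works = still_works)
})
res
```

So masking a function name with data is usually survivable, while masking T or F silently changes the meaning of any code that uses them as logicals.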
A better solution might be to avoid using short names altogether in favor of more descriptive ones, and I generally try to do that as a matter of habit. Yet "df" for a function which manipulates a generic data.frame is plenty descriptive and a longer name adds little, so short names have their uses. In addition, for SO questions where the larger context isn't necessarily known, coming up with descriptive names is well-nigh impossible.
What other one- and two-letter variable names conflict with existing R objects? Which among those are sufficiently common that they should be avoided? If they are not in base, please list the package as well. The best answers will involve at least some code; please provide it if used.
Note that I am not asking whether overwriting functions that already exist is advisable. That question is addressed on SO already:
In R, what exactly is the problem with having variables with the same name as base R functions?
For visualizations of some answers here, see this question on CV:
https://stats.stackexchange.com/questions/13999/visualizing-2-letter-combinations
apropos is ideal for this:
apropos("^[[:alpha:]]{1,2}$")
With no packages loaded, this returns:
[1] "ar" "as" "by" "c" "C" "cm" "D" "de" "df" "dt" "el" "F" "gc" "gl"
[15] "I" "if" "Im" "is" "lh" "lm" "ls" "pf" "pi" "pt" "q" "qf" "qr" "qt"
[29] "Re" "rf" "rm" "rt" "sd" "t" "T" "ts" "vi"
The exact contents will depend upon the search list. Try loading a few packages and re-running it if you care about conflicts with packages that you commonly use.
I loaded all the (>200) packages installed on my machine with this:
lapply(rownames(installed.packages()), require, character.only = TRUE)
And reran the call to apropos, wrapping it in unique, since there were a few duplicates.
one_or_two <- unique(apropos("^[[:alpha:]]{1,2}$"))
This returned:
[1] "Ad" "am" "ar" "as" "bc" "bd" "bp" "br" "BR" "bs" "by" "c" "C"
[14] "cc" "cd" "ch" "ci" "CJ" "ck" "Cl" "cm" "cn" "cq" "cs" "Cs" "cv"
[27] "d" "D" "dc" "dd" "de" "df" "dg" "dn" "do" "ds" "dt" "e" "E"
[40] "el" "ES" "F" "FF" "fn" "gc" "gl" "go" "H" "Hi" "hm" "I" "ic"
[53] "id" "ID" "if" "IJ" "Im" "In" "ip" "is" "J" "lh" "ll" "lm" "lo"
[66] "Lo" "ls" "lu" "m" "MH" "mn" "ms" "N" "nc" "nd" "nn" "ns" "on"
[79] "Op" "P" "pa" "pf" "pi" "Pi" "pm" "pp" "ps" "pt" "q" "qf" "qq"
[92] "qr" "qt" "r" "Re" "rf" "rk" "rl" "rm" "rt" "s" "sc" "sd" "SJ"
[105] "sn" "sp" "ss" "t" "T" "te" "tr" "ts" "tt" "tz" "ug" "UG" "UN"
[118] "V" "VA" "Vd" "vi" "Vo" "w" "W" "y"
You can see where they came from with
lapply(one_or_two, find)
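The list returned by lapply is unnamed, so it can be hard to read. A possible way to tabulate the same information is sketched below; the names short_names and origins are my own, and the result will differ from session to session depending on what is attached.

```r
# Short (one- or two-letter) names visible on the current search path.
short_names <- unique(apropos("^[[:alpha:]]{1,2}$"))

# For each name: where it lives, and whether the first match is a function.
origins <- data.frame(
  name = short_names,
  where = vapply(short_names,
                 function(nm) paste(find(nm), collapse = ", "),
                 character(1)),
  is_function = vapply(short_names,
                       function(nm) is.function(get(nm)),
                       logical(1)),
  row.names = NULL,
  stringsAsFactors = FALSE
)
head(origins)
```

The is_function column helps separate names that mask functions (c, t, df) from names that mask constants or data (pi, T, F).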
Been thinking about this more. Here's a list of one-letter object names in base R:
> var.names <- c(letters,LETTERS)
> var.names[sapply(var.names,exists)]
[1] "c" "q" "t" "C" "D" "F" "I" "T" "X"
And one- and two-letter object names in base R:
one.letter.names <- c(letters,LETTERS)
N <- length(one.letter.names)
first <- rep(one.letter.names,N)
second <- rep(one.letter.names,each=N)
two.letter.names <- paste(first,second,sep="")
var.names <- c(one.letter.names,two.letter.names)
> var.names[sapply(var.names,exists)]
[1] "c" "d" "q" "t" "C" "D" "F" "I" "J" "N" "T" "X" "bc" "gc"
[15] "id" "sd" "de" "Re" "df" "if" "pf" "qf" "rf" "lh" "pi" "vi" "el" "gl"
[29] "ll" "cm" "lm" "rm" "Im" "sp" "qq" "ar" "qr" "tr" "as" "bs" "is" "ls"
[43] "ns" "ps" "ts" "dt" "pt" "qt" "rt" "tt" "by" "VA" "UN"
That's a much bigger list than I initially suspected, although I would never think of naming a variable "if", so to a certain degree it makes sense.
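One caveat: exists() consults the whole search path, so the list above also contains names such as df and sd that live in stats rather than in the base package itself. If you only want names defined in the base package proper, a sketch along these lines should work (base_names and base_short are placeholder names):

```r
# Everything bound in the base package's environment, including
# syntactic names such as `if`.
base_names <- ls(baseenv(), all.names = TRUE)

# Keep only the names made of one or two letters.
base_short <- sort(grep("^[[:alpha:]]{1,2}$", base_names, value = TRUE))
base_short
```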
Still doesn't capture object names not in base, or give any sense of which functions are best avoided. I think a better answer would either use expert opinion to figure out which functions are important (e.g. using c is probably worse than using qf) or use a data mining approach on a bunch of R code to see what short-named functions get used the most.
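A rough sketch of the data mining idea: parse some R source text and count how often short-named functions are called. The function count_short_calls and the sample code string are hypothetical, made up for this example; a real analysis would loop over many source files.

```r
# Count calls to functions whose names have at most two characters.
count_short_calls <- function(code_text) {
  parsed <- parse(text = code_text, keep.source = TRUE)
  tokens <- utils::getParseData(parsed)
  # The parser tags the name in call position as SYMBOL_FUNCTION_CALL.
  calls <- tokens$text[tokens$token == "SYMBOL_FUNCTION_CALL"]
  short <- calls[nchar(calls) <= 2]
  sort(table(short), decreasing = TRUE)
}

example_code <- "x <- c(1, 2, 3)\ny <- t(matrix(x))\nz <- c(sd(x), x)"
counts <- count_short_calls(example_code)
counts
```

On this toy input, c is counted twice and t and sd once each, while matrix is ignored because its name is longer than two characters.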
file_name <- 'I am a good boy who went to Africa, Brazil and India'
strsplit(file_name, ' ')
[[1]]
[1] "I" "am" "a" "good" "boy" "who" "went" "to" "Africa," "Brazil"
[11] "and" "India"
In the above implementation, I want to return all the strings individually. However, the function returns 'Africa,' as a single element, whereas I want the comma returned separately as well.
The expected output, with the , as a separate element, is:
[[1]]
[1] "I" "am" "a" "good" "boy" "who" "went" "to" "Africa" "," "Brazil"
[11] "and" "India"
Perhaps this helps
strsplit(file_name, '\\s+|(?<=[a-z])(?=[[:punct:]])', perl = TRUE)
#[[1]]
#[1] "I" "am" "a" "good" "boy" "who" "went"
#[8] "to" "Africa" "," "Brazil" "and" "India"
Or use an extraction method
regmatches(file_name, gregexpr("[[:alnum:]]+|,", file_name))
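If other punctuation marks should be split off as well, the extraction approach generalizes by matching any single punctuation character instead of only a comma. This is a sketch under that assumption; tokens is a placeholder name.

```r
file_name <- 'I am a good boy who went to Africa, Brazil and India'

# Match either a run of letters/digits or one punctuation character.
tokens <- regmatches(file_name,
                     gregexpr("[[:alnum:]]+|[[:punct:]]", file_name))[[1]]
tokens
# [1] "I"      "am"     "a"      "good"   "boy"    "who"    "went"
# [8] "to"     "Africa" ","      "Brazil" "and"    "India"
```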
I haven't worked with regular expressions for quite some time, so I'm not sure if what I want to do can be done "directly" or if I have to work around.
My expressions look like the following two:
crb_gdp_g_100000_16_16_ftv_all.txt
crt_r_g_25000_20_40_flin_g_2.txt
Only the parts replaced by an asterisk vary; the rest is constant (or irrelevant, as with the last part, after "f*_"):
cr*_*_g_*_*_*_f*_
Is there a straightforward way to get only the values of the asterisk parts? For example, for "r" or "gdp" I have to include the underscores in the pattern, otherwise I match the r at the beginning of the expression. But including the underscores gives "_r_" or "_gdp_", and I only want "r" or "gdp".
Or in short: I know a lot about my expressions but I only want to extract the varying parts. (How) Can I do that?
You can use sub with captures and then strsplit to get a list of the separated elements:
str <- c("crb_gdp_g_100000_16_16_ftv_all.txt", "crt_r_g_25000_20_40_flin_g_2.txt")
strsplit(sub("cr([[:alnum:]]+)_([[:alnum:]]+)_g_([[:alnum:]]+)_([[:alnum:]]+)_([[:alnum:]]+)_f([[:alnum:]]+)_.+", "\\1.\\2.\\3.\\4.\\5.\\6", str), "\\.")
#[[1]]
#[1] "b" "gdp" "100000" "16" "16" "tv"
#[[2]]
#[1] "t" "r" "25000" "20" "40" "lin"
Note: I replaced \\w with [[:alnum:]] to avoid inclusion of the underscore.
We can also use regmatches and regexec to extract these values like this:
regmatches(str, regexec("^cr([^_]+)_([^_]+)_g_([^_]+)_([^_]+)_([^_]+)_f([^_]+)_.*$", str))
[[1]]
[1] "crb_gdp_g_100000_16_16_ftv_all.txt" "b"
[3] "gdp" "100000"
[5] "16" "16"
[7] "tv"
[[2]]
[1] "crt_r_g_25000_20_40_flin_g_2.txt" "t" "r"
[4] "25000" "20" "40"
[7] "lin"
Note that the first element in each vector is the full string, so to drop it we can use lapply and "[":
lapply(regmatches(str,
regexec("^cr([^_]+)_([^_]+)_g_([^_]+)_([^_]+)_([^_]+)_f([^_]+)_.*$", str)),
"[", -1)
[[1]]
[1] "b" "gdp" "100000" "16" "16" "tv"
[[2]]
[1] "t" "r" "25000" "20" "40" "lin"
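Since the fields are separated by underscores and sit at fixed positions, another option is to split on "_" and pick the pieces by position, with no capture groups at all. This is only a sketch that assumes every file name follows the cr*_*_g_*_*_*_f*_ layout; file_names, parts and fields are placeholder names.

```r
file_names <- c("crb_gdp_g_100000_16_16_ftv_all.txt",
                "crt_r_g_25000_20_40_flin_g_2.txt")

parts <- strsplit(file_names, "_", fixed = TRUE)

# Keep pieces 1, 2, 4, 5, 6 and 7, stripping the constant "cr" and "f".
fields <- t(vapply(parts, function(p) {
  c(sub("^cr", "", p[1]), p[2], p[4], p[5], p[6], sub("^f", "", p[7]))
}, character(6)))
fields
```

The result is a character matrix with one row per file name, which can be convenient if the values are going into a data frame.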
I want to use the blank row represented by "" that exists in my list so I can group all the rows in between into sublists.
For example I have a long list that looks like this:
> data
[1] "data science"
[2] "big data"
[3] "machine learning"
[4] "BI"
[5] "analytics"
[6] ""
[7] "SAS"
[8] "R"
[9] "Python"
[10] "Spark"
[11] ""
[12] "Hive"
[13] "PIG"
[14] "IMPALA"
....
And I want something like this:
> output
[[1]] [1] "data science" "big data" "machine learning" "BI" "analytics"
[[2]] [1] "SAS" "R" "Python" "Spark"
[[3]] [1] "Hive" "PIG" "IMPALA"
The indexing in my output may be wrong, but overall this is what I want.
Maybe something with split would do it.
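You are correct that split can help you. If you cumsum a logical vector that marks the blank entries, you get a group index that you can pass to split. Every group after the first then starts with the "" separator, which is what tail(..., -1) drops in the lapply. Note that the first group does not begin with "", so tail also removes its first real element (compare the four letters in $`0` below with the five entries before the first blank):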
set.seed(201)
x <- sample(letters, 20, replace = TRUE)
x[c(6,12)] <- ""
> lapply(split(x, cumsum(x == "")), tail, -1)
$`0`
[1] "p" "p" "q" "r"
$`1`
[1] "v" "n" "g" "l" "t"
$`2`
[1] "p" "p" "n" "e" "t" "c" "j" "m"
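One way to keep every non-blank entry is to drop the blanks before splitting instead of trimming each group afterwards. The sketch below uses the data from the question; entries, is_blank and groups are placeholder names.

```r
entries <- c("data science", "big data", "machine learning", "BI",
             "analytics", "", "SAS", "R", "Python", "Spark", "",
             "Hive", "PIG", "IMPALA")

is_blank <- entries == ""

# cumsum(is_blank) numbers the groups; subsetting both vectors with
# !is_blank removes the separators without touching real entries.
groups <- unname(split(entries[!is_blank], cumsum(is_blank)[!is_blank]))
groups
```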
For example, I have an element "computer" in a vector. I need to get a vector consisting of "c", "o", "m", "p", "u", "t", "e", "r".
The second part of my question is optional. How can I create a vector of letter combinations from the elements of that vector, such that the letters in each combination keep the order they have in the original word? For instance, I want to get something like "puter" or "mpu" in this vector, but not "tumpo".
You can use
strsplit("computer", "\\b")
and
library("RWeka")
gsub(" ", "",
NGramTokenizer(paste(strsplit("computer", "\\b")[[1]], collapse=" "),
Weka_control(min=2,
max=5)),
fixed=TRUE)
# [1] "compu" "omput" "mpute" "puter" "comp"
# [6] "ompu" "mput" "pute" "uter" "com"
# [11] "omp" "mpu" "put" "ute" "ter"
# [16] "co" "om" "mp" "pu" "ut"
# [21] "te" "er"
to create n-grams with 2 <= n <= 5.
The first part of the question is really easy:
splits <- unlist(strsplit("computer",split=""))
> splits
[1] "c" "o" "m" "p" "u" "t" "e" "r"
For the second part you can use the following code:
subseqs <-
unlist(
lapply(1:length(splits),FUN=function(x){
lapply(1:(length(splits)+1-x),FUN=function(y){
paste(splits[y:(y+x-1)],collapse="") })
})
)
> subseqs
[1] "c" "o" "m" "p" "u" "t" "e"
[8] "r" "co" "om" "mp" "pu" "ut" "te"
[15] "er" "com" "omp" "mpu" "put" "ute" "ter"
[22] "comp" "ompu" "mput" "pute" "uter" "compu" "omput"
[29] "mpute" "puter" "comput" "ompute" "mputer" "compute" "omputer"
[36] "computer"
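The same set of contiguous substrings can also be built without nested lapply calls, by listing every (start, end) pair and passing the pairs to substring. This is a sketch of an alternative, not a replacement for the code above; word, idx and subs are placeholder names.

```r
word <- "computer"
n <- nchar(word)

# All start/end positions with start <= end.
idx <- expand.grid(start = seq_len(n), end = seq_len(n))
idx <- idx[idx$start <= idx$end, ]

# Order by substring length, then by starting position.
idx <- idx[order(idx$end - idx$start, idx$start), ]

subs <- substring(word, idx$start, idx$end)
length(subs)  # n * (n + 1) / 2 = 36
```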
For three consecutive letter combinations:
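# strsplit returns a list, so take its first element to get the letters
x <- strsplit("computer", "")[[1]]
# the first combination starting at each position 1..6 is the consecutive triple
y <- combn(seq_along(x), 3); m <- match(1:6, y[1, ])
combn(x, 3)[, m]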
I have a very simple question: how can I split the following text into groups of three letters in a single line of code?
mycodes <- c("ATTTGGGCTAATTTTGTTTCTTTCTGGGTCTCTC")
strsplit(mycodes, split = character(3), fixed = T, perl = FALSE, useBytes = FALSE)
[[1]]
[1] "A" "T" "T" "T" "G" "G" "G" "C" "T" "A" "A" "T" "T" "T" "T" "G" "T" "T" "T" "C"
[21] "T" "T" "T" "C" "T" "G" "G" "G" "T" "C" "T" "C" "T" "C"
This is not what I want; I want three letters at a time:
[1] "ATT" "TGG" "GCT" ... and so on; the final element may have one, two or three letters, depending on how many are left.
Thanks;
I assume you want to work with codons. If that's the case, you might want to look at the Biostrings package from Bioconductor. It provides a variety of tools for working with biological sequence data.
library(Biostrings)
?codons
You can achieve what you want, with a little bit of clumsy coercion:
as.character(codons(DNAString(mycodes)))
Here is one approach using the stringr package:
require(stringr)
start = seq(1, nchar(mycodes), 3)
stop = pmin(start + 2, nchar(mycodes))
str_sub(mycodes, start, stop)
Output is
[1] "ATT" "TGG" "GCT" "AAT" "TTT" "GTT" "TCT" "TTC" "TGG"
[10] "GTC" "TCT" "C"
You can also use:
strsplit(mycodes, '(?<=.{3})', perl=TRUE)
[[1]]
[1] "ATT" "TGG" "GCT" "AAT" "TTT" "GTT" "TCT" "TTC" "TGG" "GTC" "TCT" "C"
or
library(stringi)
stri_extract_all_regex(mycodes, '.{1,3}')
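If you would rather not load any package, base R's substring is vectorized over start and end positions, so the same chunking can be sketched as follows (starts and chunks are placeholder names):

```r
mycodes <- "ATTTGGGCTAATTTTGTTTCTTTCTGGGTCTCTC"

# Start positions 1, 4, 7, ...; each chunk ends two characters later,
# capped at the length of the string.
starts <- seq(1, nchar(mycodes), by = 3)
chunks <- substring(mycodes, starts, pmin(starts + 2, nchar(mycodes)))
chunks
# [1] "ATT" "TGG" "GCT" "AAT" "TTT" "GTT" "TCT" "TTC" "TGG" "GTC" "TCT" "C"
```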