I have a bunch of letters, and cannot for the life of me figure out how to convert them to their number equivalent.
letters[1:4]
Is there a function
numbers['e']
which returns
5
or something user-defined (e.g. 1994)?
I want to convert all 26 letters to a specific value.
I don't know of a "pre-built" function, but such a mapping is pretty easy to set up using match. For the specific example you give, matching a letter to its position in the alphabet, we can use the following code:
myLetters <- letters[1:26]
match("a", myLetters)
[1] 1
It is almost as easy to associate other values to the letters. The following is an example using a random selection of integers.
# assign values for each letter, here a sample from 1 to 2000
set.seed(1234)
myValues <- sample(1:2000, size=26)
names(myValues) <- myLetters
myValues[match("a", names(myValues))]
a
228
Note also that this method can be extended to ordered collections of letters (strings) as well.
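As a minimal sketch of that extension, assuming the string contains only lowercase letters, a word can be split into single characters and matched in one call (the variable names word and chars are illustrative only):

```r
myLetters <- letters[1:26]

# Split a string into its characters, then look up each one's position
word <- "cab"
chars <- strsplit(word, "")[[1]]
positions <- match(chars, myLetters)
positions
# [1] 3 1 2
```

match returns NA for any character that is not in myLetters, so unexpected input shows up in the result rather than being dropped.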
You could try this function:
letter2number <- function(x) {utf8ToInt(x) - utf8ToInt("a") + 1L}
Here's a short test:
letter2number("e")
#[1] 5
set.seed(123)
myletters <- letters[sample(26,8)]
#[1] "h" "t" "j" "u" "w" "a" "k" "q"
unname(sapply(myletters, letter2number))
#[1] 8 20 10 21 23 1 11 17
The function computes the UTF-8 code of the letter passed to it, subtracts the UTF-8 code of the letter "a", and adds one. Adding one follows R's indexing convention, in which numbering starts at 1 rather than 0.
The code works because the numeric sequence of the utf8 codes representing letters respects the alphabetic order.
For capital letters you could use, accordingly,
LETTER2num <- function(x) {utf8ToInt(x) - utf8ToInt("A") + 1L}
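If both cases can occur, one option is to lower-case the input first. The sketch below assumes plain ASCII letters; letter2number_any is a hypothetical name, not part of the answer above. Because utf8ToInt returns one code per character of a single string, a whole word can be converted without sapply:

```r
# Case-insensitive variant: fold to lower case before taking the code points
letter2number_any <- function(x) {
  utf8ToInt(tolower(x)) - utf8ToInt("a") + 1L
}

letter2number_any("E")
# [1] 5
letter2number_any("Het")
# [1]  8  5 20
```

Note that utf8ToInt expects a single string, so a character vector with several elements still needs sapply or a similar loop.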
The which function seems appropriate here.
which(letters == 'e')
#[1] 5
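For more than one letter at a time, a small comparison may help. This is only a sketch: comparing with == recycles the shorter vector, so %in% or match is the safer choice, and the two differ in the order of the result:

```r
x <- c("e", "a", "z")

# match keeps the order of the input: one position per element of x
by_input <- match(x, letters)
by_input
# [1]  5  1 26

# which(... %in% ...) reports positions in alphabet order instead
by_alphabet <- which(letters %in% x)
by_alphabet
# [1]  1  5 26
```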
Create a lookup vector and use simple subsetting:
x <- letters[1:4]
lookup <- setNames(seq_along(letters), letters)
lookup[x]
#a b c d
#1 2 3 4
Use unname if you want to remove the names.
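A short sketch of how the lookup vector behaves with unname, with a character that is not in the table, and with a user-defined value (the 1994 is taken from the question; the name custom is illustrative):

```r
lookup <- setNames(seq_along(letters), letters)

# Characters missing from the lookup come back as NA
x <- c("a", "e", "?")
plain <- unname(lookup[x])
plain
# [1]  1  5 NA

# Any letter can be given its own value by overwriting its entry
custom <- lookup
custom["e"] <- 1994L
custom[["e"]]
# [1] 1994
```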
thanks for all the ideas, but I am a dumdum.
Here's what I did. Made a mapping from each letter to a specific number, then called each letter
df=data.frame(L=letters[1:26],N=rnorm(26))
df[df$L=='e',2]
Related
I have three columns containing the characters A, B, and C. I am using as.numeric to convert them to numeric so that I can then assign them values, e.g. 1, 2 and 3, but as.numeric() returns NAs. In different data frames the order varies, e.g. ABC or ACB, but A=i+0i, B=2+3i and C is also a complex number. I want to first convert the string to a complex number and then assign values to them.
LV$phase1 <- as.numeric(LV$phase1)
class(phase1)
A=1
print(phase1)
This is the error:
"Warning message:
NAs introduced by coercion "
It does not usually make sense to convert character data to numeric, but if the letters refer to an ordered sequence of events/phases/periods, then it may be useful. R uses factors for this purpose. For example
set.seed(42)
phase <- sample(LETTERS[1:4], 10, replace=TRUE)
phase
# [1] "A" "A" "A" "A" "B" "D" "B" "B" "A" "D"
factor(phase)
# [1] A A A A B D B B A D
# Levels: A B D
as.numeric(factor(phase))
# [1] 1 1 1 1 2 3 2 2 1 3
If this is what you are trying to do
LV$phase1 <- as.numeric(factor(LV$phase1))
will convert the letters to an ordered sequence and assign numbers to represent those categories.
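Because factor() numbers only the levels that are present, the same letter can get a different code in different data frames. A possible way around this, sketched below, is to fix the levels explicitly. The second part maps letters to complex values through a named vector; the values in phase_values are hypothetical placeholders (only B = 2+3i comes from the question):

```r
phase <- c("A", "A", "B", "D", "B")

# Fixed levels: A is always 1, B always 2, C always 3, D always 4
codes <- as.integer(factor(phase, levels = LETTERS[1:4]))
codes
# [1] 1 1 2 4 2

# Hypothetical complex value per letter, looked up by name
phase_values <- c(A = 1+0i, B = 2+3i, C = 0+1i, D = 4+0i)
mapped <- unname(phase_values[phase])
mapped
```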
When I run the code:
library(vecsets)
p <- c("a","b")
q <- c( "a")
vunion(p,q, multiple = TRUE)
I get the result:
[1] "a" "b"
But I expect the result to be
vunion(p,q, multiple = TRUE)
[1] "a" "b" "a"
I also do not understand the result provided in the example of the vecsets package. The example shows:
x <- c(1:5,3,3,3,2,NA,NA)
y <- c(2:5,4,3,NA)
vunion(x,y,multiple=TRUE)
[1] 2 3 3 4 5 NA 1 3 3 2 NA 4
But if we check
length(x)+length(y); length(vunion(x,y))
[1] 18
[1] 12
we get different lengths, but I think they should be the same. Note, for example, 5 appears only once.
What's going on here? Can someone explain?
I think the vecsets package documentation describes this behavior quite well:
The base::union function removes duplicates per algebraic set theory. vunion does not, and so returns as many duplicate elements as are in either input vector (not the sum of their inputs.) In short, vunion is the same as vintersect(x,y) + vsetdiff(x,y) + vsetdiff(y,x).
It's true that you have to read carefully, though. The important part is "as many duplicate elements as are in either input vector (not the sum of their inputs)". The issue is not character versus numeric vectors, but whether elements are repeated within the same vector. Consider p1 versus p2 in the following example. The result from vunion will have as many a's as the larger count in either p or q, so we expect one "a" in the first part and two a's in the second part; both times we expect only one "b":
library(vecsets)
q <- c("a", "b")
p1 <- c("a", "b")
vunion(p1, q, multiple = TRUE)
[1] "a" "b"
p2 <- c("a", "a", "b")
vunion(p2, q, multiple = TRUE)
[1] "a" "b" "a"
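To make the rule concrete, here is a base R sketch of the same idea: each distinct value is repeated as many times as its larger count in either input. multiset_union is a hypothetical helper, not part of vecsets, and the order of its result may differ from what vunion returns:

```r
# For every distinct value, keep max(count in x, count in y) copies
multiset_union <- function(x, y) {
  vals <- unique(c(x, y))
  n_x <- vapply(vals, function(v) sum(x %in% v), integer(1))
  n_y <- vapply(vals, function(v) sum(y %in% v), integer(1))
  rep(vals, times = pmax(n_x, n_y))
}

multiset_union(c("a", "a", "b"), c("a", "b"))
# [1] "a" "a" "b"

# The documentation example: 12 elements, not 18
x <- c(1:5, 3, 3, 3, 2, NA, NA)
y <- c(2:5, 4, 3, NA)
length(multiset_union(x, y))
# [1] 12
```

In the documentation example, 5 occurs once in x and once in y, so it appears once in the result, while 3 occurs four times in x and twice in y, so it appears four times.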
I need a regular expression that returns a specific letter and the following (one or two) digits until the next letter.
For example, I would like to extract how many carbons (C) are in a formula using regular expressions in R
strings <- c("C16H4ClNO2", "CH8O", "F2Ni")
I need an expression that returns the number of C which can be one or 2 digits and that does not return the number after chlorine (Cl).
substr(strings,regexpr("C[0-9]+",strings) + 1, regexpr("[ABDEFGHIJKLMNOPQRSTUVWXYZ]+",strings) -1)
[1] "16" "C" ""
but the answer I want to be returned is
"16","1","0"
Moreover, I would like the regular expression to automatically locate the next letter and stop before it, instead of having a final position which I specify as a letter not being a C.
makeup in the CHNOSZ package will parse a chemical formula. Here are some alternatives that use it:
1) Create a list L of such fully parsed formulas and then for each one check if it has a "C" component and return its value or 0 if none:
library(CHNOSZ)
L <- Map(makeup, strings)
sapply(L, function(x) if ("C" %in% names(x)) x[["C"]] else 0)
## C16H4ClNO2 CH8O F2Ni
## 16 1 0
Note that L is a list of the fully parsed formulas in case you have other requirements:
> L
$C16H4ClNO2
C H Cl N O
16 4 1 1 2
$CH8O
C H O
1 8 1
$F2Ni
F Ni
2 1
1a) By adding c(C = 0) to each list component we can avoid having to test for the existence of carbon yielding the following shorter version of the sapply line in (1):
sapply(lapply(L, c, c(C = 0)), "[[", "C")
2) This one-line variation of (1) gives the same answer as in (1) except for names. It appends "C0" to each formula to avoid having to test for the existence of carbon:
sapply(lapply(paste0(strings, "C0"), makeup), "[[", "C")
## [1] 16 1 0
2a) Here is a variation of (2) that eliminates the lapply by using the fact that makeup will accept a matrix:
sapply(makeup(as.matrix(paste0(strings, "C0"))), "[[", "C")
## [1] 16 1 0
If I understood your question correctly, you're looking for two things:
C + a number immediately afterwards => match this number
C followed by another UPPERCASE letter (another chemical element, that is) => count C
If you're able to install another library, you might get along with:
library("stringr")
strings <- c("C16H4ClNO2", "CH8O", "F2Ni")
str1 <- str_extract(strings, '(?<=C)\\d+')
str2 <- str_count(strings, 'C[A-Z]')
str2[!is.na(str1)] = str1[!is.na(str1)]
str2
# [1] "16" "1" "0"
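There are two steps here: str1 handles the first condition (C followed by digits), while str2 handles the second condition (C followed by another uppercase letter). The assignment before the final str2 line combines the two vectors, using the extracted number wherever one was found.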
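The two conditions can also be folded into a single pattern in base R. This is only a sketch, assuming each formula contains at most one carbon group and that a lowercase letter after C always means a different element (Cl, Ca, Co, ...); count_carbon is a hypothetical name:

```r
count_carbon <- function(formula) {
  # "C" not followed by a lowercase letter, then optional digits
  m <- regexpr("C(?![a-z])\\d*", formula, perl = TRUE)
  out <- integer(length(formula))            # stays 0 where there is no carbon
  hit <- as.vector(m) != -1L
  digits <- sub("^C", "", regmatches(formula, m))
  # A bare "C" with no digits means one carbon atom
  out[hit] <- ifelse(digits == "", 1L, as.integer(digits))
  out
}

strings <- c("C16H4ClNO2", "CH8O", "F2Ni")
count_carbon(strings)
# [1] 16  1  0
```

The negative lookahead (?![a-z]) is what stops the pattern from matching the C in Cl, and \\d* stops on its own at the next letter, so no closing position has to be specified.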
We can do this with base R
sub("C(\\d+).*", "\\1", sub("C([^0-9]+)",
"C1\\1", ifelse(!grepl("C", strings), paste0("C0", strings), strings)))
#[1] "16" "1" "0"
ifelse(str_extract(strings,'(?<=C)(\\d+|)')=='',1,str_extract(strings,'(?<=C)(\\d+|)'))
[1] "16" "1" NA
I have a string like this:
data <- c("A:B:C", "A:B", "E:F:G", "H:I:J", "B:C:D")
I want to convert this to a string of:
c("A:B:C:D", "E:F:G", "H:I:J")
The idea is that each element inside the string is another string of sub-elements (e.g. A, B, C) that have been pasted together (with sep=":"). Each element within the string is compared with all other elements to look for common sub-elements, and elements with common sub-elements are combined.
I don't care about the order of the string (or order of the sub-elements) FWIW.
Thanks for any help offered!
--
Answers so far...
I liked d.b's suggestion, not least because it stayed in base R. However, with a larger and more complicated set, it did not give the complete result until everything was run a second time. With an even more complicated dataset, it might need to be re-run more than twice.
I had more difficulty with thelatemail's suggestion. I had to upgrade R to use lengths, and I then had to figure out how to get to the end point because the answer was incomplete. In any case, this was how I got to the end (I suspect there is a better way). This worked with a larger set without a hitch.
library(igraph)
spl <- strsplit(data,":")
combspl <- data.frame(
grp = rep(seq_along(spl),lengths(spl)),
val = unlist(spl)
)
cl <- clusters(graph.data.frame(combspl))$membership[-(1:length(spl))]
dat <- data.frame(cl) # after getting nowhere working with the list as formatted
dat[,2] <- row.names(dat)
a <- character(0)
for (i in 1:max(cl)) {
a[i] <- paste(paste0(dat[(dat[,1] == i),][,2]), collapse=":")
}
a
#[1] "A:B:C:D" "E:F:G" "H:I:J"
I'm going to leave this for now as is.
A possible application for the igraph library, if you think of your values as an edgelist of paired groups:
library(igraph)
spl <- strsplit(data,":")
combspl <- data.frame(
grp = rep(seq_along(spl),lengths(spl)),
val = unlist(spl)
)
cl <- clusters(graph.data.frame(combspl))$membership[-(1:length(spl))]
#A B C E F G H I J D
#1 1 1 2 2 2 3 3 3 1
split(names(cl),cl)
#$`1`
#[1] "A" "B" "C" "D"
#
#$`2`
#[1] "E" "F" "G"
#
#$`3`
#[1] "H" "I" "J"
Or as collapsed text:
sapply(split(names(cl),cl), paste, collapse=";")
# 1 2 3
#"A;B;C;D" "E;F;G" "H;I;J"
a = character(0)
for (i in 1:length(data)){
a[i] = paste(unique(unlist(strsplit(data[sapply(1:length(data), function(j)
any(unlist(strsplit(data[i],":")) %in% unlist(strsplit(data[j],":"))))],":"))), collapse = ":")
}
unique(a)
#[1] "A:B:C:D" "E:F:G" "H:I:J"
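A single pass like the one above can miss groups that are only linked through a chain of other groups. One possible base R variant, sketched here, keeps merging until nothing changes. merge_groups is a hypothetical helper, and it makes no attempt to be fast on large inputs:

```r
merge_groups <- function(data, sep = ":") {
  groups <- lapply(strsplit(data, sep, fixed = TRUE), unique)
  repeat {
    merged <- FALSE
    i <- 1
    while (i < length(groups)) {
      j <- i + 1
      while (j <= length(groups)) {
        if (any(groups[[i]] %in% groups[[j]])) {
          # Absorb group j into group i, then drop group j
          groups[[i]] <- union(groups[[i]], groups[[j]])
          groups[[j]] <- NULL
          merged <- TRUE
        } else {
          j <- j + 1
        }
      }
      i <- i + 1
    }
    if (!merged) break   # a full pass without merges: result is stable
  }
  vapply(groups, paste, character(1), collapse = sep)
}

data <- c("A:B:C", "A:B", "E:F:G", "H:I:J", "B:C:D")
merge_groups(data)
# [1] "A:B:C:D" "E:F:G"   "H:I:J"
```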
I am teaching myself the basics of R and have been encountering trouble using the function tapply when passing the sort function while trying to use non-default optional arguments for sort. Here is an example of the trouble I am facing:
Given the vectors
x <- c(1.1, 1.0, 2.1, NA_real_)
y <- c("a", "b", "c","d")
I find that
tapply(y, x, sort, decreasing=TRUE, na.last=TRUE)
results in the same output regardless of the logical assignments I endow decreasing and na.last with. In fact, the output always defaults to the sort default values
decreasing = FALSE, na.last = NA
For the record, when entering the above example, the output is
> tapply(y, x, sort, decreasing=TRUE, na.last=TRUE)
1 1.1 2.1
"b" "a" "c"
Let me also mention that if I define the alternate function
sort2 <- function(v) sort(v, decreasing=TRUE, na.last=TRUE);
and pass sort2 to tapply instead, I still encounter the same trouble.
I am running this code on Mac OS X 10.10.4, using R 3.2.0. Calling sort on its own, without passing it through tapply, gives the desired behavior: it responds correctly when I change the decreasing and na.last arguments.
Thank you in advance for any help.
I don't think you're using tapply() correctly.
tapply(y, x, sort, decreasing=TRUE, na.last=TRUE)
The above line of code basically says "sort vector y, grouped by the categorical vector x". Your vector x is not really categorical at all: it is a numeric vector of distinct values plus an NA. tapply() ignores the NA index and treats each of the three remaining distinct values in x as a separate group, so it passes each of the three corresponding strings from y to its own call of sort(). Sorting a single element has no effect, which is why your decreasing and na.last arguments appear to be ignored, and the result comes back ordered by the x groups.
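A quick way to see this, using the vectors from the question, is to ask tapply for the size of each group instead of sorting it (group_sizes is just an illustrative name):

```r
x <- c(1.1, 1.0, 2.1, NA_real_)
y <- c("a", "b", "c", "d")

# One element per group, and no group at all for the NA in x
group_sizes <- tapply(y, x, length)
group_sizes
#   1 1.1 2.1 
#   1   1   1 
```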
Here's an example of how to do what I think you're trying to do:
x <- c(NA,1,2,3,NA,2,1,3);
g <- rep(letters[1:2],each=4);
x;
## [1] NA 1 2 3 NA 2 1 3
g;
## [1] "a" "a" "a" "a" "b" "b" "b" "b"
tapply(x,g,sort,decreasing=T,na.last=T);
## $a
## [1] 3 2 1 NA
##
## $b
## [1] 3 2 1 NA
##
Edit: When you want to sort a vector by another vector, you can use order():
y[order(x,decreasing=T,na.last=T)];
## [1] "c" "a" "b" "d"
y[order(x,decreasing=F,na.last=T)];
## [1] "b" "a" "c" "d"