R: get index of letters

I'm a noob with R. I want the alphabet index of each letter in a word. I don't understand what I'm doing wrong, since the individual command works perfectly...
word <- "helloworld"
l <- numeric(nchar(word))
for (i in 0:nchar(word)) {
l[i] <- match(substr(word,i,i+1), letters)
}
l
returns a weird [1] NA NA NA NA NA NA NA NA NA 4
when match(substr(word,0,1), letters) returns the appropriate [1] 8

There are two problems with your code.
First, R indices start at 1, so when i = 0 the assignment to l[i] does nothing (l[0] selects no elements).
Second, substr(word, i, i+1) pulls off two characters at a time, not a single letter:
i = 1
substr(word,i,i+1)
[1] "he"
A different approach is to build a named lookup vector and index it with the split characters:
setNames(1:26, letters)[ strsplit("hello", NULL )[[1]] ]
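For completeness, here is a minimal sketch of the original loop with both fixes applied (start at 1, and take one character with substr(word, i, i)). The name idx is just an illustrative replacement for l.

```r
word <- "helloworld"
idx <- integer(nchar(word))

# seq_len() counts from 1 and is safe even for an empty string
for (i in seq_len(nchar(word))) {
  # start and stop are the same position, so exactly one character is taken
  idx[i] <- match(substr(word, i, i), letters)
}
idx
# [1]  8  5 12 12 15 23 15 18 12  4
```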

The error lies in the i+1: you are getting two-character strings, so no match is found. Use substring, which is vectorized:
match(substring(word,1:nchar(word),1:nchar(word)),letters)
#[1] 8 5 12 12 15 23 15 18 12 4
Another (nerdy) way is to get the offset of the ASCII value of each character to the value of the char a:
as.integer(charToRaw(word))-as.integer(charToRaw("a"))+1
#[1] 8 5 12 12 15 23 15 18 12 4
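One caveat with the offset trick: it assumes the input is all lowercase a-z. Below is a rough sketch of how mixed case could be handled with utf8ToInt and tolower from base R. The inputs word2 and "a-b" are made up for illustration.

```r
word2 <- "HelloWorld"   # hypothetical mixed-case input
offsets <- utf8ToInt(tolower(word2)) - utf8ToInt("a") + 1L
offsets
# [1]  8  5 12 12 15 23 15 18 12  4

# match() returns NA for anything that is not a lowercase letter,
# whereas the offset arithmetic silently returns an out-of-range number
via_match  <- match(strsplit("a-b", "")[[1]], letters)
via_offset <- utf8ToInt("a-b") - utf8ToInt("a") + 1L
via_match
# [1]  1 NA  2
via_offset
# [1]   1 -51   2
```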

You happened to test the only combination that works: substr(word, 0, 1).
Vectors are indexed from 1 in R, so your loop should run from 1 to nchar(word).
Have a look at ?substr: start and stop are both positions, so to get a single character you have to use substr(word, i, i).
But you don't need a loop here. As Richard Telford suggested, you can split your string into a character vector and then match every element of that vector against the letters vector:
lapply(strsplit(word, ""), match, letters)
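Note that lapply() over strsplit() returns a list with one element per input string. A small sketch of both shapes follows. The two-word vector words is only an illustration.

```r
word <- "helloworld"

# single string: take the first list element to get a plain integer vector
single <- match(strsplit(word, "")[[1]], letters)

# several strings: keep the list, one vector of indices per word
words <- c("abc", "xyz")
res <- lapply(strsplit(words, ""), match, letters)
```

Here single is an integer vector of length 10, while res is a list of two integer vectors.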

Related

R changes my list of character strings with "na" into the words as missing values (ex : BDNA3 --> NA) - How to deal with this?

I have been struggling with R for two days without finding a solution!
Here is my problem:
I have a list of symbols extracted from one data frame: annotation$"SYMBOL"
I would like to bind it to another data frame, called "matrix", and assign the symbols as rownames.
I extracted the column and bound it without problems. However, once this was done, setting the symbols as rownames did not work, because about 5000 of the 15000 genes had been changed to NA.
It turns out that all the genes with "NA" in their symbol are treated as missing values.
I tried converting them with as.character(annotation$"SYMBOL"), but that doesn't change anything.
Here is what I tried:
X=as.character(annotation$"SYMBOL")
summary(X)
Length Class Mode
16978 character character
unique (unlist (lapply (as.character(annotation$"SYMBOL"), function (x) which (is.na (x)))))
[1] 1
Y=na.exclude(X)
summary(Y)
Length Class Mode
9954 character character
U=na.exclude(annotation$"SYMBOL")
Error in `$<-.data.frame`(`*tmp*`, "SYMBOL", value = c("SCYL3", "C1orf112", :
replacement has 9954 rows, data has 16978
So it looks like every gene with "NA" in its name is replaced by NA.
Does anyone have an idea how to get around this?
For example, numbers 11 and 15 in this image are deleted when I use the na.omit function.
To mark the literal string "NA" as missing, use df[df == "NA"] <- NA. I tried this on your test dataset and it produced the desired result. You can then call na.omit() on df to remove the rows that now contain NA. You did not post working code, so here is an outline of what yours should look like:
df <- data.frame(lapply(df, as.character), stringsAsFactors = FALSE)
df
X1 X2
1 1 SCYL3
2 2 C1orf112
3 3 FGR
4 4 CFH
5 5 STPG1
6 6 NIPAL3
7 7 AK2
8 8 KDM1A
9 9 TTC22
10 10 ST7L
11 11 DNAJC11
12 12 FMO3
13 13 E2F2
14 14 CDK11A
15 15 NADK
16 16 CSDE1
17 17 MASP2
df[df == "NA"] <- NA
For this sample, is.na(df) returns FALSE for every cell, because none of the symbols is exactly "NA" (DNAJC11 and NADK merely contain it). If a cell does hold NA, na.omit(df) will drop that row.
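To see what the == "NA" comparison does and does not touch, here is a small sketch on a made-up character vector: symbols that merely contain "NA" are not missing, and only the literal string "NA" gets converted. If your symbols had already turned into NA when the file was read, one thing worth checking (an assumption on my part, since the import code was not shown) is the na.strings argument of read.table/read.csv.

```r
sym <- c("SCYL3", "DNAJC11", "NADK", "NA", NA)  # made-up sample

is.na(sym)
# [1] FALSE FALSE FALSE FALSE  TRUE

# convert only the literal string "NA" to a real missing value
is_literal_na <- !is.na(sym) & sym == "NA"
sym[is_literal_na] <- NA

is.na(sym)
# [1] FALSE FALSE FALSE  TRUE  TRUE

# drop the missing entries; DNAJC11 and NADK survive
kept <- sym[!is.na(sym)]
kept
# [1] "SCYL3"   "DNAJC11" "NADK"
```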

R: match () only returns first occurrence

I have a dataframe
names2 <- c('AdagioBarber','AdagioBarber', 'Beethovan','Beethovan')
Value <- c(33,55,21,54)
song.data <- data.frame(names2,Value)
I would like to arrange it according to this character vector
names <- c('Beethovan','Beethovan','AdagioBarber','AdagioBarber')
I am using match() to achieve this
data.frame(song.data[match((names), (song.data$names2)),])
The problem is that match returns only the first occurrence of each name:
names2 Value
3 Beethovan 21
3.1 Beethovan 21
1 AdagioBarber 33
1.1 AdagioBarber 33
You can use order, as @zx8754 and @Evan Friedland have pointed out.
> name.order <- c('Beethovan','AdagioBarber')
> song.data$names2 <- factor(song.data$names2, levels= name.order)
> song.data[order(song.data$names2), ]
names2 Value
3 Beethovan 21
4 Beethovan 54
1 AdagioBarber 33
2 AdagioBarber 55
Basically, factor turns the strings into integers and creates a lookup table of which integers correspond to which strings. The levels argument specifies what you want that lookup table to be. Without that argument, the levels are the sorted unique values (alphabetical order for strings).
So for example:
> as.numeric(factor(letters[1:5]))
[1] 1 2 3 4 5
> as.numeric(factor(letters[1:5], levels=c("d","b","e","a","c")))
[1] 4 2 5 1 3
Note: You'll need to be absolutely sure you get all your (correctly spelled) levels in that name.order vector; any value missing from it becomes NA in the factor, and order puts those rows last.
(sort also works on a factor, but it returns the sorted values rather than the row indices, so order is what you need to rearrange a data frame.)
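If you would rather not convert the column to a factor, a possible alternative (a sketch, not necessarily what the commenters had in mind) is to rank each row by its position in name.order with match and feed that to order. Rows with equal rank keep their original relative order.

```r
names2 <- c('AdagioBarber', 'AdagioBarber', 'Beethovan', 'Beethovan')
Value <- c(33, 55, 21, 54)
song.data <- data.frame(names2, Value, stringsAsFactors = FALSE)

name.order <- c('Beethovan', 'AdagioBarber')

# match() gives each row the position of its name in name.order (1 or 2);
# order() then sorts the rows by that position
sorted <- song.data[order(match(song.data$names2, name.order)), ]
sorted$Value
# [1] 21 54 33 55
```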

Count number of short strings in a long string in R [duplicate]

Suppose I have a long string such as:
c<-"abcabcdabcdeabcdefghijkabcdabcaba"
My question is how to quickly count the number of exact occurrences of "abcd" in c.
1) gregexpr First paste "abcd" onto c so that there is at least 1 match. (This is needed because gregexpr returns -1 for any component of c having no matches rather than a zero length numeric vector.) Now, gregexpr returns a list whose components are numeric vectors of the starting positions of the matches one component per component of c -- in this case c only has one component but the code below works more generally. Now find the lengths of the components of the result of gregexpr and subtract 1 to take into account the extra abcd we added. No packages are used.
Example 1
lengths(gregexpr("abcd", paste(c, "abcd"))) - 1
## [1] 4
Note: If we knew that there was at least one match it could be slightly simplified to: lengths(gregexpr("abcd", c)) .
Example 2
Here is another example. Here DF has 3 rows and the corresponding components of c have 4, 4, and 0 occurrences of "abcd".
DF <- data.frame(c = c(c, c, "X")) # test input
lengths(gregexpr("abcd", paste(DF$c, "abcd"))) - 1
## [1] 4 4 0
2) regmatches
Here is an alternative approach. This approach has the advantage that no special code is needed for the no-match case. Again, no packages are used.
Here are the same two examples:
lengths(regmatches(c, gregexpr("abcd", c)))
## [1] 4
lengths(regmatches(DF$c, gregexpr("abcd", DF$c)))
## [1] 4 4 0
Using library stringr, you can do it as follows (on larger set, it will be fairly fast and efficient):
library(stringr)
c <- "abcabcdabcdeabcdefghijkabcdabcaba"
c
[1] "abcabcdabcdeabcdefghijkabcdabcaba"
str_count(c, 'abcd')
[1] 4
This will work on a column of a data frame as follows:
df <- data.frame(txt = rep(c, 10))
df$abcd_count <- str_count(df$txt, 'abcd')
df
txt abcd_count
1 abcabcdabcdeabcdefghijkabcdabcaba 4
2 abcabcdabcdeabcdefghijkabcdabcaba 4
3 abcabcdabcdeabcdefghijkabcdabcaba 4
4 abcabcdabcdeabcdefghijkabcdabcaba 4
5 abcabcdabcdeabcdefghijkabcdabcaba 4
6 abcabcdabcdeabcdefghijkabcdabcaba 4
7 abcabcdabcdeabcdefghijkabcdabcaba 4
8 abcabcdabcdeabcdefghijkabcdabcaba 4
9 abcabcdabcdeabcdefghijkabcdabcaba 4
10 abcabcdabcdeabcdefghijkabcdabcaba 4
Here is one method using base R's gsub and strsplit:
# example
temp <- "abcabcdabcdeabcdefghijkabcdabcaba"
# substitute pattern for character not in string, here 9
temp2 <- gsub("abcd", "9", temp)
# split on 9, and count number of elements
length(strsplit(temp2, split="9")[[1]]) - 1
You need the [[1]] because strsplit is designed to operate over vectors of strings and returns a list; here the vector is of length 1. An alternative to [[1]] in this case is unlist.
Also, 1 is subtracted because the number of pieces is one more than the number of abcd patterns. Be aware that strsplit drops a trailing empty string, so this undercounts by one when the string ends with the pattern.
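A short sketch comparing the split-and-count idea with a gregexpr/regmatches count on a string that ends with the pattern. The helper name count_fixed and the extra test strings are made up for illustration; fixed = TRUE treats the pattern as a literal string rather than a regular expression.

```r
# count non-overlapping literal occurrences of pattern in each element of x
count_fixed <- function(x, pattern) {
  lengths(regmatches(x, gregexpr(pattern, x, fixed = TRUE)))
}

strings <- c("abcabcdabcdeabcdefghijkabcdabcaba", "xabcd", "X")
counts <- count_fixed(strings, "abcd")
counts
# [1] 4 1 0

# the split-and-count approach misses a match at the very end of the string
split_count <- length(strsplit(gsub("abcd", "9", "xabcd"), split = "9")[[1]]) - 1
split_count
# [1] 0
```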

R: recursive function to give groups of consecutive numbers

Given a sorted vector x:
x <- c(1,2,4,6,7,10,11,12,15)
I am trying to write a small function that yields a vector y of the same length, giving for each element the last integer of its run of consecutive numbers, so that consecutive numbers are grouped together. In my case (with groups ending at 2, 4, 7, 12 and 15):
> y
[1] 2 2 4 7 7 12 12 12 15
I tried this recursive idea (where x is the vector and i is an index that would start at 1 in most cases: if the content of the next index is one larger than the content at the current i, then call the function with i+1; else return the content):
fun <- function(x,i){
ifelse(x[i]+1 == x[i+1],
fun(x,i+1),
return(x[i]))
}
However:
> sapply(x,fun,1)
[1] NA NA NA NA NA NA NA NA NA
How can I get this to work?
Your sapply call is applying fun across all values of x, when you really want it to be applying across all values of i. To get the sapply to do what I assume you want to do, you can do the following:
sapply(X = 1:length(x), FUN = fun, x = x)
[1] 2 2 4 7 7 12 12 12 NA
Note that it returns NA as the last value instead of 15. This is because your function is not set up to handle the last element of a vector: there is no x[10], so the comparison yields NA and ifelse returns NA. You can probably edit your function to handle this fairly easily.
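One possible way to handle the last element, sketched under the assumption that a plain if/else is acceptable in place of ifelse: check the index against length(x) before looking at x[i + 1]. The function name last_in_run is made up here.

```r
x <- c(1, 2, 4, 6, 7, 10, 11, 12, 15)

last_in_run <- function(x, i) {
  # && short-circuits, so x[i + 1] is never read past the end of x
  if (i < length(x) && x[i] + 1 == x[i + 1]) {
    last_in_run(x, i + 1)
  } else {
    x[i]
  }
}

y <- sapply(seq_along(x), last_in_run, x = x)
y
# [1]  2  2  4  7  7 12 12 12 15
```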
Maybe this helps:
find_non_consec <- function(x){ c(x[which(as.logical(diff(x)-1))],x[length(x)]) }
x <- c(1,2,4,6,7,10,11,12,15)
res <- find_non_consec(x)
The result is:
> res
[1] 2 4 7 12 15
This function identifies the numbers where the series ceases to be consecutive.
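To get from those run ends to the full-length vector y asked for in the question, one option (a sketch, with the hypothetical name last_of_run) is to repeat each run end by the length of its run.

```r
x <- c(1, 2, 4, 6, 7, 10, 11, 12, 15)

last_of_run <- function(x) {
  # positions where a run of consecutive numbers ends
  ends <- c(which(diff(x) != 1), length(x))
  # run lengths are the gaps between successive end positions
  rep(x[ends], times = diff(c(0, ends)))
}

y <- last_of_run(x)
y
# [1]  2  2  4  7  7 12 12 12 15
```

This avoids recursion entirely, which may matter for long vectors.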

Obtain indices of a factor of characters with names contained in a character vector

I have a factor of characters, let's say:
A <- factor(c(rep("home", times=5), rep("work", times=3), rep("hobby", times=7), rep("friends", times=10)))
and I would like to get the indices of the characters equal to the ones contained in another vector, say:
B <- c("work", "hobby")
in this case I would like to obtain the vector 6:15.
I tried with which(A==B) but it does not work...
Any idea?
As akrun pointed out, %in% should do the trick: which(A %in% B) gives the output:
[1] 6 7 8 9 10 11 12 13 14 15
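A runnable sketch of the %in% approach using the A and B from the question. The reason which(A==B) does not work is that == compares element by element and recycles B, so it asks whether A[1] is "work", A[2] is "hobby", A[3] is "work", and so on, instead of testing membership.

```r
A <- factor(c(rep("home", times = 5), rep("work", times = 3),
              rep("hobby", times = 7), rep("friends", times = 10)))
B <- c("work", "hobby")

# %in% tests each element of A for membership in B
idx <- which(A %in% B)
idx
# [1]  6  7  8  9 10 11 12 13 14 15
```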

Resources