R: Sorting a vector alphabetically after nth character - r

I would like to sort the elements (string) of a vector alphabetically, but only considering characters after the nth. The strings can contain both numbers and characters, for example:
> v <- c("ENCSR529JNJ_HNR35NPK_21_K562", "ENCSR529MBZ_AP22IG_11_K562", "ENCSR529MBZ_AP22IG_21_K562", "ENCSR530BOP_DUPT6H_11_K562", "ENCSR530BOP_DUPT6H_21_K562")
and after sorting after the 11th character, v would become:
"ENCSR529MBZ_AP22IG_11_K562", "ENCSR529MBZ_AP22IG_21_K562", "ENCSR530BOP_DUPT6H_11_K562", "ENCSR530BOP_DUPT6H_21_K562", "ENCSR529JNJ_HNR35NPK_21_K562"
Any help will be greatly appreciated! Thanks

v[order(substr(v, start = 12, stop = max(nchar(v))))]
# [1] "ENCSR529MBZ_AP22IG_11_K562" "ENCSR529MBZ_AP22IG_21_K562" "ENCSR530BOP_DUPT6H_11_K562" "ENCSR530BOP_DUPT6H_21_K562"
# [5] "ENCSR529JNJ_HNR35NPK_21_K562"
substr(v, start = 12, stop = max(nchar(v))) gives the substring omitting the first 11 characters. So we order by that.

Related

Getting last two digits of Sequence Date in R

I have sequence date:
names<-format(seq.Date(as.Date("2012-11-01"),as.Date("2012-12-01"),
by = 'months'),format = "%Y%m")
How can I get the last two digit, like the result for last two digits of names[1] is 11?
Using the stringr package you can just put
stringr::str_sub(string = names, start = -2, end = -1)
You could use substr():
names = substr(names, nchar(names)-1, nchar(names))
The result is:
[1] "11" "12"
Or as integer:
names = as.integer(substr(names, nchar(names)-1, nchar(names))
Result:
[1] 11 12
There can be tenths of ways. The simpliest I could invent was to find the remainder of intiger division:
as.integer(names) %% 100
that returns:
[1] 11 12
Technically these are integers. If you stricktly require characters apply as.character() to the result to cast the type.

Limiting word count in a character column in R and saving extra words in another variable [duplicate]

I have a string in R as
x <- "The length of the word is going to be of nice use to me"
I want the first 10 words of the above specified string.
Also for example I have a CSV file where the format looks like this :-
Keyword,City(Column Header)
The length of the string should not be more than 10,New York
The Keyword should be of specific length,Los Angeles
This is an experimental basis program string,Seattle
Please help me with getting only the first ten words,Boston
I want to get only the first 10 words from the column 'Keyword' for each row and write it onto a CSV file.
Please help me in this regards.
Regular expression (regex) answer using \w (word character) and its negation \W:
gsub("^((\\w+\\W+){9}\\w+).*$","\\1",x)
^ Beginning of the token (zero-width)
((\\w+\\W+){9}\\w+) Ten words separated by not-words.
(\\w+\\W+){9} A word followed by not-a-word, 9 times
\\w+ One or more word characters (i.e. a word)
\\W+ One or more non-word characters (i.e. a space)
{9} Nine repetitions
\\w+ The tenth word
.* Anything else, including other following words
$ End of the token (zero-width)
\\1 when this token found, replace it with the first captured group (the 10 words)
How about using the word function from Hadley Wickham's stringr package?
word(string = x, start = 1, end = 10, sep = fixed(" "))
Here is an small function that unlist the strings, subsets the first ten words and then pastes it back together.
string_fun <- function(x) {
ul = unlist(strsplit(x, split = "\\s+"))[1:10]
paste(ul,collapse=" ")
}
string_fun(x)
df <- read.table(text = "Keyword,City(Column Header)
The length of the string should not be more than 10 is or are in,New York
The Keyword should be of specific length is or are in,Los Angeles
This is an experimental basis program string is or are in,Seattle
Please help me with getting only the first ten words is or are in,Boston", sep = ",", header = TRUE)
df <- as.data.frame(df)
Using apply (the function isn't doing anything in the second column)
df$Keyword <- apply(df[,1:2], 1, string_fun)
EDIT
Probably this is a more general way to use the function.
df[,1] <- as.character(df[,1])
df$Keyword <- unlist(lapply(df[,1], string_fun))
print(df)
# Keyword City.Column.Header.
# 1 The length of the string should not be more than New York
# 2 The Keyword should be of specific length is or are Los Angeles
# 3 This is an experimental basis program string is or Seattle
# 4 Please help me with getting only the first ten Boston
x <- "The length of the word is going to be of nice use to me"
head(strsplit(x, split = "\ "), 10)

Count string length and remove characters if a certain length [duplicate]

There are functions in Excel called left, right, and mid, where you can extract part of the entry from a cell. For example, =left(A1, 3), would return the 3 left most characters in cell A1, and =mid(A1, 3, 4) would start with the the third character in cell A1 and give you characters number 3 - 6. Are there similar functions in R or similarly straightforward ways to do this?
As a simplified sample problem I would like to take a vector
sample<-c("TRIBAL","TRISTO", "RHOSTO", "EUGFRI", "BYRRAT")
and create 3 new vectors that contain the first 3 characters in each entry, the middle 2 characters in each entry, and the last 4 characters in each entry.
A slightly more complicated question that Excel doesn't have a function for (that I know of) would be how to create a new vector with the 1st, 3rd, and 5th characters from each entry.
You are looking for the function substr or its close relative substring:
The leading characters are straight-forward:
substr(sample, 1, 3)
[1] "TRI" "TRI" "RHO" "EUG" "BYR"
So is extracting some characters at a defined position:
substr(sample, 2, 3)
[1] "RI" "RI" "HO" "UG" "YR"
To get the trailing characters, you have two options:
substr(sample, nchar(sample)-3, nchar(sample))
[1] "IBAL" "ISTO" "OSTO" "GFRI" "RRAT"
substring(sample, nchar(sample)-3)
[1] "IBAL" "ISTO" "OSTO" "GFRI" "RRAT"
And your final "complicated" question:
characters <- function(x, pos){
sapply(x, function(x)
paste(sapply(pos, function(i)substr(x, i, i)), collapse=""))
}
characters(sample, c(1,3,5))
TRIBAL TRISTO RHOSTO EUGFRI BYRRAT
"TIA" "TIT" "ROT" "EGR" "BRA"

Adding leading 0s in r

I have a large data frame that is filled with characters such as:
x <- c("Y188","Y204" ,"Y221","EP121_1" ,"Y233" , "Y248" ,"Y268", "BB2","BB20",
"BB32" ,"BB044" ,"BB056" , "Y234" , "Y249" ,"Y271" ,"BB3", "BB21", "BB33",
"BB045","BB057" ,"Y236", "Y250", "Y272" , "BB4", "BB22" )
As you can see, certain tags such as BB20 only have two integers. I would like the entire list of characters to have at least 3 integers like this(the issue is only in the BB tags if that helps):
Y188, Y204, Y221, EP121_1, Y233, Y248, Y268, BB002, BB020, BB032, BB044,
BB056, Y234, Y249, Y271, BB003, BB021, BB033, BB045, BB057, Y236, Y250,
Y272, BB004, BB022
Ive looked into the sprintf and FormatC functions but still am having no luck.
A forceful approach with a nested gsub call:
gsub("(.*[A-Z])(\\d{1}$)", "\\100\\2",
gsub("(.*[A-Z])(\\d{2}$)", "\\10\\2", x))
# [1] "Y188" "Y204" "Y221" "EP121_1" "Y233" "Y248" "Y268" "BB002" "BB020"
# [10] "BB032" "BB044" "BB056" "Y234" "Y249" "Y271" "BB003" "BB021" "BB033"
# [19] "BB045" "BB057" "Y236" "Y250" "Y272" "BB004" "BB022"
There is surely a more general way to do this, but for such a localized task, two simple sub can be enough: add one trailing zero for two-digit numbers, two trailing zeros for one-digit numbers.
x <- sub("^BB(\\d{1})$","BB00\\1",x)
x <- sub("^BB(\\d{2})$","BB0\\1",x)
This works, but will have edge case
# indicator for numeric of length less than three
num <- gsub("[^0-9]", "", x)
id <- nchar(num) < 3
# overwrite relevant values with the reformatted ones
x[id] <- paste0(gsub("[0-9]", "", x)[id],
formatC(as.numeric(num[id]), width = 3, flag = "0"))
[1] "Y188" "Y204" "Y221" "EP121_1" "Y233" "Y248" "Y268" "BB002" "BB020" "BB032"
[11] "BB044" "BB056" "Y234" "Y249" "Y271" "BB003" "BB021" "BB033" "BB045" "BB057"
[21] "Y236" "Y250" "Y272" "BB004" "BB022"
It can be done using sprintf and gsub function.This step would extract numeric values and change its format.
num=sprintf("%03d",as.numeric(gsub("[^[:digit:]]", "", x)))
Next step would be to paste back numbers with changed format
x=paste(gsub("[^[:alpha:]]", "", x),num,sep="")

Finding number of r's in the vector (Both R and r) before the first u

rquote <- "R's internals are irrefutably intriguing"
chars <- strsplit(rquote, split = "")[[1]]
in the above code we need to find the number of r's(R and r) in rquote
You could use substrings.
## find position of first 'u'
u1 <- regexpr("u", rquote, fixed = TRUE)
## get count of all 'r' or 'R' before 'u1'
lengths(gregexpr("r", substr(rquote, 1, u1), ignore.case = TRUE))
# [1] 5
This follows what you ask for in the title of the post. If you want the count of all the "r", case insensitive, then simplify the above to
lengths(gregexpr("r", rquote, ignore.case = TRUE))
# [1] 6
Then there's always stringi
library(stringi)
## count before first 'u'
stri_count_regex(stri_sub(rquote, 1, stri_locate_first_regex(rquote, "u")[,1]), "r|R")
# [1] 5
## count all R or r
stri_count_regex(rquote, "r|R")
# [1] 6
To get the number of R's before the first u, you need to make an intermediate step. (You probably don't need to. I'm sure akrun knows some incredibly cool regular expression to get the job done, but it won't be as easy to understand as this).
rquote <- "R's internals are irrefutably intriguing"
before_u <- gsub("u[[:print:]]+$", "", rquote)
length(stringr::str_extract_all(before_u, "(R|r)")[[1]])
You may try this,
> length(str_extract_all(rquote, '[Rr]')[[1]])
[1] 6
To get the count of all r's before the first u
> length(str_extract_all(rquote, perl('u.*(*SKIP)(*F)|[Rr]'))[[1]])
[1] 5
EDIT: Just saw before the first u. In that case, we can get the position of the first 'u' from either which or match.
Then use grepl in the 'chars' up to the position (ind) to find the logical index of 'R' with ignore.case=TRUE and use sum using the strsplit output from the OP's code.
ind <- which(chars=='u')[1]
Or
ind <- match('u', chars)
sum(grepl('r', chars[seq(ind)], ignore.case=TRUE))
#[1] 5
Or we can use two gsubs on the original string ('rquote'). First one removes the characters starting with u until the end of the string (u.$) and the second matches all characters except R, r ([^Rr]) and replace it with ''. We can use nchar to get count of the characters remaining.
nchar(gsub('[^Rr]', '', sub('u.*$', '', rquote)))
#[1] 5
Or if we want to count the 'r' in the entire string, gregexpr to get the position of matching characters from the original string ('rquote') and get the length
length(gregexpr('[rR]', rquote)[[1]])
#[1] 6

Resources