Generate vector of a repeated string with incremental suffix number - r

I would like to generate a vector based on repeating the string "FST" but with a number at the end which increments:
"Fst1" "Fst2" "Fst3" "Fst4" ... "Fst100"

An alternative to paste is sprintf, which can be a bit more convenient if, for instance, you wanted to "pad" your digits with leading zeroes.
Here's an example:
sprintf("Fst%d", 1:10) ## No padding
# [1] "Fst1" "Fst2" "Fst3" "Fst4" "Fst5"
# [6] "Fst6" "Fst7" "Fst8" "Fst9" "Fst10"
sprintf("Fst%02d", 1:10) ## Pads anything less than two digits with zero
# [1] "Fst01" "Fst02" "Fst03" "Fst04" "Fst05"
# [6] "Fst06" "Fst07" "Fst08" "Fst09" "Fst10"
So, for your question, you would be looking at:
sprintf("Fst%d", 1:100) ## or sprintf("Fst%03d", 1:100)

You can use the paste function to create a vector that combines a set character string with incremented numbers: paste0('Fst', 1:100)

Related

How can I give sequential names to items in a list?

I have a character string that I have split into a list of smaller strings using strsplit. For example:
> full.seq <- "FZpcgK3VdAQzEFZpcAVdV8QM8ZpsEFZpacgGKi3VdVSQzEFZpcgGKAVdVRpEFKGIZpg13"
> full.seq
[1] "FZpcgK3VdAQzEFZpcAVdV8QM8ZpsEFZpacgGKi3VdVSQzEFZpcgGKAVdVRpEFKGIZpg13"
> sequences <- strsplit(full.seq, "cg")
> sequences
[[1]]
[1] "FZp" "K3VdAQzEFZpcAVdV8QM8ZpsEFZpa" "GKi3VdVSQzEFZp"
[4] "GKAVdVRpEFKGIZpg13"
I would like to give each of these new strings a unique, sequential name that I can still use to identify that they were from the same original string (for a later analysis I will do on these strings). For example, "ID.seq1", "ID.seq2", "ID.seq3" etc. I have tried doing this manually but receive this error:
> names(sequences) <- c("ID.seq1", "ID.seq2", "ID.seq3", "ID.seq4")
Error in names(sequences) <- c("ID.seq1", "ID.seq2", "ID.seq3", "ID.seq4") :
'names' attribute [4] must be the same length as the vector [1]
I would also like an automated way of doing this though, as I will need to label up to 30 new strings from a number of original strings. Any suggestions?
First of all, if you want a character vector, you will have to subset the list, because strsplit returns a list. After doing that, you can easily assign names to that vector of terms.
full.seq <- "FZpcgK3VdAQzEFZpcAVdV8QM8ZpsEFZpacgGKi3VdVSQzEFZpcgGKAVdVRpEFKGIZpg13"
sequences <- strsplit(full.seq, "cg")[[1]]
names(sequences) <- paste0("ID.seq", c(1:4))
sequences
ID.seq1 ID.seq2
"FZp" "K3VdAQzEFZpcAVdV8QM8ZpsEFZpa"
ID.seq3 ID.seq4
"GKi3VdVSQzEFZp" "GKAVdVRpEFKGIZpg13"
Answer by Tim is perfect. I just want to add if you want to keep your list and name elements of each item:
full.seq <- "FZpcgK3VdAQzEFZpcAVdV8QM8ZpsEFZpacgGKi3VdVSQzEFZpcgGKAVdVRpEFKGIZpg13"
full.seq
sequences <- strsplit(full.seq, "cg")
names(sequences[[1]]) <- paste("ID.seq",1:4,sep="")

R-- Add leading zero to string, with no fixed string format

I have a column as below.
9453, 55489, 4588, 18893, 4457, 2339, 45489HQ, 7833HQ
I would like to add leading zero if the number is less than 5 digits. However, some numbers have "HQ" in the end, some don't.(I did check other posts, they dont have similar problem in the "HQ" part)
so the finally desired output should be:
09453, 55489, 04588, 18893, 04457, 02339, 45489HQ, 07833HQ
any idea how to do this? Thank you so much for reading my post!
A one-liner using regular expressions:
my_strings <- c("9453", "55489", "4588",
"18893", "4457", "2339", "45489HQ", "7833HQ")
gsub("^([0-9]{1,4})(HQ|$)", "0\\1\\2",my_strings)
[1] "09453" "55489" "04588" "18893"
"04457" "02339" "45489HQ" "07833HQ"
Explanation:
^ start of string
[0-9]{1,4} one to four numbers in a row
(HQ|$) the string "HQ" or the end of the string
Parentheses represent capture groups in order. So 0\\1\\2 means 0 followed by the first capture group [0-9]{1,4} and the second capture group HQ|$.
Of course if there is 5 numbers, then the regex isn't matched, so it doesn't change.
I was going to use the sprintf approach, but found the the stringr package provides a very easy solution.
library(stringr)
x <- c("9453", "55489", "4588", "18893", "4457", "2339", "45489HQ", "7833HQ")
[1] "9453" "55489" "4588" "18893" "4457" "2339" "45489HQ" "7833HQ"
This can be converted with one simple stringr::str_pad() function:
stringr::str_pad(x, 5, side="left", pad="0")
[1] "09453" "55489" "04588" "18893" "04457" "02339" "45489HQ" "7833HQ"
If the number needs to be padded even if the total string width is >5, then the number and text need to be separated with regex.
The following will work. It combines regex matching with the very helpful sprintf() function:
sprintf("%05.0f%s", # this encodes the format and recombines the number with padding (%05.0f) with text(%s)
as.numeric(gsub("^(\\d+).*", "\\1", x)), #get the number
gsub("[[:digit:]]+([a-zA-Z]*)$", "\\1", x)) #get just the text at the end
[1] "09453" "55489" "04588" "18893" "04457" "02339" "45489HQ" "07833HQ"
Another attempt, which will also work in cases like "123" or "1HQR":
x <- c("18893","4457","45489HQ","7833HQ","123", "1HQR")
regmatches(x, regexpr("^\\d+", x)) <- sprintf("%05d", as.numeric(sub("\\D+$","",x)))
x
#[1] "18893" "04457" "45489HQ" "07833HQ" "00123" "00001HQR"
This basically finds any numbers at the start of the string (^\\d+) and replaces them with a zero-padded (via sprintf) string that was subset out by removing any non-numeric characters (\\D+$) from the end of the string.
We can use only sprintf() and gsub() by splitting up the parts then putting them back together.
sprintf("%05d%s", as.numeric(gsub("[^0-9]+", "", x)), gsub("[0-9]+", "", x))
# [1] "18893" "04457" "45489HQ" "07833HQ" "00123" "00001HQR"
Using #thelatemail's data:
x <- c("18893", "4457", "45489HQ", "7833HQ", "123", "1HQR")

Adding leading 0s in r

I have a large data frame that is filled with characters such as:
x <- c("Y188","Y204" ,"Y221","EP121_1" ,"Y233" , "Y248" ,"Y268", "BB2","BB20",
"BB32" ,"BB044" ,"BB056" , "Y234" , "Y249" ,"Y271" ,"BB3", "BB21", "BB33",
"BB045","BB057" ,"Y236", "Y250", "Y272" , "BB4", "BB22" )
As you can see, certain tags such as BB20 only have two integers. I would like the entire list of characters to have at least 3 integers like this(the issue is only in the BB tags if that helps):
Y188, Y204, Y221, EP121_1, Y233, Y248, Y268, BB002, BB020, BB032, BB044,
BB056, Y234, Y249, Y271, BB003, BB021, BB033, BB045, BB057, Y236, Y250,
Y272, BB004, BB022
Ive looked into the sprintf and FormatC functions but still am having no luck.
A forceful approach with a nested gsub call:
gsub("(.*[A-Z])(\\d{1}$)", "\\100\\2",
gsub("(.*[A-Z])(\\d{2}$)", "\\10\\2", x))
# [1] "Y188" "Y204" "Y221" "EP121_1" "Y233" "Y248" "Y268" "BB002" "BB020"
# [10] "BB032" "BB044" "BB056" "Y234" "Y249" "Y271" "BB003" "BB021" "BB033"
# [19] "BB045" "BB057" "Y236" "Y250" "Y272" "BB004" "BB022"
There is surely a more general way to do this, but for such a localized task, two simple sub can be enough: add one trailing zero for two-digit numbers, two trailing zeros for one-digit numbers.
x <- sub("^BB(\\d{1})$","BB00\\1",x)
x <- sub("^BB(\\d{2})$","BB0\\1",x)
This works, but will have edge case
# indicator for numeric of length less than three
num <- gsub("[^0-9]", "", x)
id <- nchar(num) < 3
# overwrite relevant values with the reformatted ones
x[id] <- paste0(gsub("[0-9]", "", x)[id],
formatC(as.numeric(num[id]), width = 3, flag = "0"))
[1] "Y188" "Y204" "Y221" "EP121_1" "Y233" "Y248" "Y268" "BB002" "BB020" "BB032"
[11] "BB044" "BB056" "Y234" "Y249" "Y271" "BB003" "BB021" "BB033" "BB045" "BB057"
[21] "Y236" "Y250" "Y272" "BB004" "BB022"
It can be done using sprintf and gsub function.This step would extract numeric values and change its format.
num=sprintf("%03d",as.numeric(gsub("[^[:digit:]]", "", x)))
Next step would be to paste back numbers with changed format
x=paste(gsub("[^[:alpha:]]", "", x),num,sep="")

Finding number of r's in the vector (Both R and r) before the first u

rquote <- "R's internals are irrefutably intriguing"
chars <- strsplit(rquote, split = "")[[1]]
in the above code we need to find the number of r's(R and r) in rquote
You could use substrings.
## find position of first 'u'
u1 <- regexpr("u", rquote, fixed = TRUE)
## get count of all 'r' or 'R' before 'u1'
lengths(gregexpr("r", substr(rquote, 1, u1), ignore.case = TRUE))
# [1] 5
This follows what you ask for in the title of the post. If you want the count of all the "r", case insensitive, then simplify the above to
lengths(gregexpr("r", rquote, ignore.case = TRUE))
# [1] 6
Then there's always stringi
library(stringi)
## count before first 'u'
stri_count_regex(stri_sub(rquote, 1, stri_locate_first_regex(rquote, "u")[,1]), "r|R")
# [1] 5
## count all R or r
stri_count_regex(rquote, "r|R")
# [1] 6
To get the number of R's before the first u, you need to make an intermediate step. (You probably don't need to. I'm sure akrun knows some incredibly cool regular expression to get the job done, but it won't be as easy to understand as this).
rquote <- "R's internals are irrefutably intriguing"
before_u <- gsub("u[[:print:]]+$", "", rquote)
length(stringr::str_extract_all(before_u, "(R|r)")[[1]])
You may try this,
> length(str_extract_all(rquote, '[Rr]')[[1]])
[1] 6
To get the count of all r's before the first u
> length(str_extract_all(rquote, perl('u.*(*SKIP)(*F)|[Rr]'))[[1]])
[1] 5
EDIT: Just saw before the first u. In that case, we can get the position of the first 'u' from either which or match.
Then use grepl in the 'chars' up to the position (ind) to find the logical index of 'R' with ignore.case=TRUE and use sum using the strsplit output from the OP's code.
ind <- which(chars=='u')[1]
Or
ind <- match('u', chars)
sum(grepl('r', chars[seq(ind)], ignore.case=TRUE))
#[1] 5
Or we can use two gsubs on the original string ('rquote'). First one removes the characters starting with u until the end of the string (u.$) and the second matches all characters except R, r ([^Rr]) and replace it with ''. We can use nchar to get count of the characters remaining.
nchar(gsub('[^Rr]', '', sub('u.*$', '', rquote)))
#[1] 5
Or if we want to count the 'r' in the entire string, gregexpr to get the position of matching characters from the original string ('rquote') and get the length
length(gregexpr('[rR]', rquote)[[1]])
#[1] 6

How to format a number with specified level of precision?

I would like to create a function that returns a vector of numbers a precision reflected by having only n significant figures, but without trailing zeros, and not in scientific notation
e.g, I would like
somenumbers <- c(0.000001234567, 1234567.89)
myformat(x = somenumbers, n = 3)
to return
[1] 0.00000123 1230000
I have been playing with format, formatC, and sprintf, but they don't seem to want to work on each number independently, and they return the numbers as character strings (in quotes).
This is the closest that i have gotten example:
> format(signif(somenumbers,4), scientific=FALSE)
[1] " 0.000001235" "1235000.000000000"
You can use the signif function to round to a given number of significant digits. If you don't want extra trailing 0's then don't "print" the results but do something else with them.
> somenumbers <- c(0.000001234567, 1234567.89)
> options(scipen=5)
> cat(signif(somenumbers,3),'\n')
0.00000123 1230000
>
sprintf seems to do it:
sprintf(c("%1.8f", "%1.0f"), signif(somenumbers, 3))
[1] "0.00000123" "1230000"
how about
myformat <- function(x, n) {
noquote(sapply(a,function(x) format(signif(x,2), scientific=FALSE)))
}

Resources