Find occurrences of strings starting with a value in R

Is there a function for printing the total number of values in a dataset that begin with a given value?
Consider this dataset of 7 version numbers:
df <- c("1.20", "3.1.20", "2.45", "1.10", "1.67.4.3", "5.200.1", "70.1.2.7")
I only need to print the 1.x version numbers.
My output would be:
1.20, 1.10, 1.67.4.3
(because these are the version numbers starting with "1."; I do not want to print 3.1.20 or 70.1.2.7 because they do not start with "1.", even though they contain "1." as a substring).

df <- c("1.20", "3.1.20", "2.45", "1.10", "1.67.4.3", "5.200.1", "70.1.2.7")
grep("^1\\.", df, value = TRUE)
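The pattern ^1\\. anchors the match at the start of the string and escapes the dot so that it is matched literally. For the count the question asks about, a minimal sketch could look like this, assuming R 3.3.0 or later for startsWith(); the names hits, n_regex, n_prefix and matched are only illustrative:

```r
df <- c("1.20", "3.1.20", "2.45", "1.10", "1.67.4.3", "5.200.1", "70.1.2.7")

# TRUE where the string begins with "1." (anchored at the start, dot escaped)
hits <- grepl("^1\\.", df)

n_regex  <- sum(hits)                  # count via the regex
n_prefix <- sum(startsWith(df, "1."))  # same count without a regex
matched  <- df[hits]                   # the matching values themselves
```

startsWith() compares a fixed prefix, so nothing needs escaping there.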

Use the function substring inside brackets for subsetting:
df[substring(df, 1,2) == "1."]

Or:
sum(substr(df, 1, 2) == "1.")
[1] 3
And for the values themselves:
df[substr(df, 1, 2) == "1."]
[1] "1.20" "1.10" "1.67.4.3"

df[df<"2"]
#[1] "1.20" "1.10" "1.67.4.3"
This relies on character strings being compared lexicographically. Depending on your dataset (e.g., if there are version numbers with a leading zero), you might need to extend this solution to df[df<"2" & df>="1"].
The total number of values starting with a "1" can in this case be obtained with length(df[df<"2"]) (or length(df[df<"2" & df>="1"])).
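One caveat with comparing the strings directly: a major version with more than one digit, such as "10.3", also sorts before "2". A hedged sketch of an alternative uses base R's numeric_version(), which compares versions component by component. The value "10.3" is not in the original data and is added here only to illustrate the pitfall; v, nv and major_one are illustrative names:

```r
# "10.3" is an added, hypothetical value used to show the pitfall.
v <- c("1.20", "3.1.20", "2.45", "1.10", "1.67.4.3", "10.3")

# Plain string comparison works character by character, so in typical
# locales "10.3" < "2" holds and v[v < "2"] would keep "10.3" as well.

# numeric_version() parses each string into integer components instead.
nv <- numeric_version(v)

# Keep versions whose major component is 1, i.e. 1 <= version < 2.
major_one <- v[nv >= "1" & nv < "2"]
```

This only works when every element is a valid version string made of dot-separated non-negative integers.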

Related

Getting last two digits of Sequence Date in R

I have sequence date:
names<-format(seq.Date(as.Date("2012-11-01"),as.Date("2012-12-01"),
by = 'months'),format = "%Y%m")
How can I get the last two digits, so that, for example, the result for names[1] is 11?
Using the stringr package you can just put
stringr::str_sub(string = names, start = -2, end = -1)
You could use substr():
names = substr(names, nchar(names)-1, nchar(names))
The result is:
[1] "11" "12"
Or as integer:
names = as.integer(substr(names, nchar(names)-1, nchar(names)))
Result:
[1] 11 12
There are dozens of ways. The simplest I could come up with is to take the remainder of integer division:
as.integer(names) %% 100
that returns:
[1] 11 12
Note that the result is numeric rather than character. If you strictly require characters, apply as.character() to the result.
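Since the values come from Date objects in the first place, another option is to ask format() for the month directly rather than cutting it out of the string afterwards. A small sketch, assuming the original dates are still available (dates, yyyymm, mm_chr and mm_int are illustrative names):

```r
dates <- seq.Date(as.Date("2012-11-01"), as.Date("2012-12-01"), by = "months")

yyyymm <- format(dates, format = "%Y%m")  # the strings from the question
mm_chr <- format(dates, format = "%m")    # two-digit month as character
mm_int <- as.integer(mm_chr)              # the same month as integer
```

If only the formatted strings are left, the substr() and %% approaches above are the ones that apply.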

R: remove substring and change the remaining string by addition of a number

in R: I have some strings with the following pattern of letters and numbers
A11B3XyC4
A1B14C23XyC16
B14C23XyC16D3
B14C23C16D3
I want to remove the part "Xy" (always the same letters) and when I do this I want to increase the number behind the Letter B by one (everything else should stay the same).
When there is no "Xy" in the string there is no change to the string
The result should look like this:
A11B4C4
A1B15C23C16
B15C23C16D3
B14C23C16D3
Could you point me to a function capable of this? I struggle with doing a calculation (x+1) with a string.
Thank you!
We could use str_replace to increment the number that follows 'B', after removing 'Xy' with str_remove. Wrapping this in case_when applies the change only to strings that contain 'Xy' and leaves the others as they are.
library(stringr)
library(dplyr)
case_when(str_detect(str1, "Xy") ~ str_replace(str_remove(str1,
"Xy"), "(?<=B)(\\d+)", function(x) as.numeric(x) + 1), TRUE ~str1)
[1] "A11B4C4" "A1B15C23C16" "B15C23C16D3" "B14C23C16D3"
data
str1 <- c("A11B3XyC4", "A1B14C23XyC16", "B14C23XyC16D3", "B14C23C16D3")
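For comparison, the same steps can be sketched in base R without extra packages: remove "Xy", then use regmatches<- to overwrite the digits after "B" with their value plus one, only for the strings that contained "Xy". This is a sketch under the assumption that each affected string has a single "B" followed by digits; has_xy, changed, m and out are illustrative names:

```r
str1 <- c("A11B3XyC4", "A1B14C23XyC16", "B14C23XyC16D3", "B14C23C16D3")

has_xy <- grepl("Xy", str1, fixed = TRUE)  # which strings need changing
out <- str1

# Work only on the strings that contained "Xy".
changed <- sub("Xy", "", str1[has_xy], fixed = TRUE)

# Locate the first run of digits that directly follows a "B".
m <- regexpr("(?<=B)[0-9]+", changed, perl = TRUE)

# Replace those digits with their value plus one.
regmatches(changed, m) <- as.character(as.integer(regmatches(changed, m)) + 1L)

out[has_xy] <- changed
```

Strings without "Xy" are never touched, which matches the requirement that they stay unchanged.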

Keep significant zeros when switching column to character formatting in R

I am cleaning up data in R and would like to maintain numeric formatting when switching my column from numeric to character, specifically the significant zeros in the hundredths place (in example below). My input columns mostly begin as Factor data and the below is an example of what I am trying to do.
I'm sure there is a better way, just hoping for some folks with more knowledge than I to shed some light. Most questions online deal with leading zeros or formatting purely numeric columns, but the aspect of the "<" symbol in my data throws me for a loop as to the proper way of doing this.
df = as.factor(c("0.01","5.231","<0.02","0.30","0.801","2.302"))
ind = which(df %in% "<0.02") # Locate the below detection value.
df[ind] <- NA # Substitute NA temporarily
df = as.numeric(as.character(df)) # Changes to numeric column
df = round(df, digits = 2) # Rounds to hundredths place
ind1 = which(df < 0.02) # Check for below reporting limit values
df = as.character(df) # Change back to character column...
df[c(ind,ind1)] = "<0.02" # so I can place the reporting limit back
> # RESULTS::
> df
[1] "<0.02" "5.23" "<0.02" "0.3" "0.8" "2.3"
However, the 4th, 5th, and 6th values in the data are no longer reporting the zero in the hundredths place. What would be the proper order of operations for this? Perhaps changing the column back to character is incorrect? Any advice would be appreciated.
Thank you.
EDIT: ---- Upon recommendations from hrbrmstr and Mike:
Thanks for the advice. I tried the following and they both result in the same problem. Perhaps there is another way I could be indexing/replacing values?
format, same problem:
#... code from above...
ind1 = which(df < 0.02)
df = as.character(df)
df[!c(ind,ind1)] = format(df[!c(ind,ind1)],digits=2,nsmall=2)
> df
[1] "<0.02" "5.23" "<0.02" "0.3 " "0.8 " "2.3 "
sprintf, same problem:
# ... above code from example ...
ind1 = which(df < 0.02) # Check for below reporting limit values.
sprintf("%.2f",df) # sprintf attempt.
[1] "0.01" "5.23" "NA" "0.30" "0.80" "2.30"
df[c(ind,ind1)] = "<0.02" # Feed the symbols back into the column.
> df
[1] "<0.02" "5.23" "<0.02" "0.3" "0.8" "2.3" #Same Problem.
Tried a different way of replacing the values, and same problem.
# ... above code from example ...
> ind1 = which(df < 0.02)
> df[c(ind,ind1)] = 9999999
> sprintf("%.2f",df)
[1] "9999999.00" "5.23" "9999999.00" "0.30" "0.80" "2.30"
> gsub("9999999.00","<0.02",df)
[1] "<0.02" "5.23" "<0.02" "0.3" "0.8" "2.3" #Same Problem.
You could just pad it out with a gsub and a bit of regex...
df <- c("<0.02", "5.23", "<0.02", "0.3", "4", "0.8", "2.3")
gsub("^([^\\.]+)$", "\\1\\.00", gsub("\\.(\\d)$", "\\.\\10", df))
[1] "<0.02" "5.23" "<0.02" "0.30" "4.00" "0.80" "2.30"
The inner gsub, which runs first, looks for a dot followed by a single digit at the end of the string and replaces the digit (the capture group \\1) with itself followed by a zero. The outer gsub then matches values that contain no dot at all and appends .00 to them.
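An alternative to padding afterwards is to build the character column in one step, so that the formatted values are never converted back through as.character(). The sketch below assumes the column has already been converted from factor to character with as.character(); raw, censored, num, below and out are illustrative names:

```r
raw <- c("0.01", "5.231", "<0.02", "0.30", "0.801", "2.302")

# Values already reported as below the detection limit.
censored <- startsWith(raw, "<")

# Convert only the plain numbers; censored entries stay NA.
num <- rep(NA_real_, length(raw))
num[!censored] <- as.numeric(raw[!censored])
num <- round(num, 2)

# Below the reporting limit: censored input or a rounded value under 0.02.
below <- censored | (!is.na(num) & num < 0.02)

# sprintf("%.2f") keeps the trailing zeros because the result is
# never converted back to numeric or re-formatted afterwards.
out <- ifelse(below, "<0.02", sprintf("%.2f", num))
```

The trailing zeros were lost in the question because as.character() on a numeric vector drops them; formatting with sprintf() as the last step avoids that.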

R-- Add leading zero to string, with no fixed string format

I have a column as below.
9453, 55489, 4588, 18893, 4457, 2339, 45489HQ, 7833HQ
I would like to add a leading zero if the number has fewer than 5 digits. However, some numbers have "HQ" at the end and some don't. (I did check other posts; they don't deal with a similar problem in the "HQ" part.)
so the finally desired output should be:
09453, 55489, 04588, 18893, 04457, 02339, 45489HQ, 07833HQ
any idea how to do this? Thank you so much for reading my post!
A one-liner using regular expressions:
my_strings <- c("9453", "55489", "4588",
"18893", "4457", "2339", "45489HQ", "7833HQ")
gsub("^([0-9]{1,4})(HQ|$)", "0\\1\\2",my_strings)
[1] "09453" "55489" "04588" "18893" "04457" "02339" "45489HQ" "07833HQ"
Explanation:
^ start of string
[0-9]{1,4} one to four numbers in a row
(HQ|$) the string "HQ" or the end of the string
Parentheses represent capture groups, numbered in order. So 0\\1\\2 means 0 followed by the first capture group [0-9]{1,4} and the second capture group HQ|$.
Of course, if there are 5 digits, the regex does not match, so the string is left unchanged.
I was going to use the sprintf approach, but found that the stringr package provides a very easy solution.
library(stringr)
x <- c("9453", "55489", "4588", "18893", "4457", "2339", "45489HQ", "7833HQ")
[1] "9453" "55489" "4588" "18893" "4457" "2339" "45489HQ" "7833HQ"
This can be converted with one simple stringr::str_pad() function:
stringr::str_pad(x, 5, side="left", pad="0")
[1] "09453" "55489" "04588" "18893" "04457" "02339" "45489HQ" "7833HQ"
If the number needs to be padded even if the total string width is >5, then the number and text need to be separated with regex.
The following will work. It combines regex matching with the very helpful sprintf() function:
sprintf("%05.0f%s", # this encodes the format and recombines the number with padding (%05.0f) with text(%s)
as.numeric(gsub("^(\\d+).*", "\\1", x)), #get the number
gsub("[[:digit:]]+([a-zA-Z]*)$", "\\1", x)) #get just the text at the end
[1] "09453" "55489" "04588" "18893" "04457" "02339" "45489HQ" "07833HQ"
Another attempt, which will also work in cases like "123" or "1HQR":
x <- c("18893","4457","45489HQ","7833HQ","123", "1HQR")
regmatches(x, regexpr("^\\d+", x)) <- sprintf("%05d", as.numeric(sub("\\D+$","",x)))
x
#[1] "18893" "04457" "45489HQ" "07833HQ" "00123" "00001HQR"
This basically finds any numbers at the start of the string (^\\d+) and replaces them with a zero-padded (via sprintf) string that was subset out by removing any non-numeric characters (\\D+$) from the end of the string.
We can use only sprintf() and gsub() by splitting up the parts then putting them back together.
sprintf("%05d%s", as.numeric(gsub("[^0-9]+", "", x)), gsub("[0-9]+", "", x))
# [1] "18893" "04457" "45489HQ" "07833HQ" "00123" "00001HQR"
Using @thelatemail's data:
x <- c("18893", "4457", "45489HQ", "7833HQ", "123", "1HQR")
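The answers above that go through as.numeric() would drop any leading zeros already present before re-padding. A possible variant that stays entirely with strings is sketched below: split each value into its leading digits and the remainder, then prepend as many zeros as the digit part is short of the target width. The names width, digits, rest, zeros and padded are illustrative, and strrep() assumes R 3.3.0 or later:

```r
x <- c("9453", "55489", "4588", "18893", "4457", "2339", "45489HQ", "7833HQ")
width <- 5

# Leading run of digits, and whatever follows it (possibly empty).
digits <- sub("^([0-9]*).*$", "\\1", x)
rest   <- substring(x, nchar(digits) + 1)

# Number of zeros needed per element, never negative.
zeros  <- strrep("0", pmax(0, width - nchar(digits)))

padded <- paste0(zeros, digits, rest)
```

Because only the digit part is measured, "7833HQ" is padded to "07833HQ" even though the whole string is already longer than 5 characters.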

Finding number of r's in the vector (Both R and r) before the first u

rquote <- "R's internals are irrefutably intriguing"
chars <- strsplit(rquote, split = "")[[1]]
In the above code, we need to find the number of r's (both R and r) in rquote before the first u.
You could use substrings.
## find position of first 'u'
u1 <- regexpr("u", rquote, fixed = TRUE)
## get count of all 'r' or 'R' before 'u1'
lengths(gregexpr("r", substr(rquote, 1, u1), ignore.case = TRUE))
# [1] 5
This follows what you ask for in the title of the post. If you want the count of all the "r", case insensitive, then simplify the above to
lengths(gregexpr("r", rquote, ignore.case = TRUE))
# [1] 6
Then there's always stringi
library(stringi)
## count before first 'u'
stri_count_regex(stri_sub(rquote, 1, stri_locate_first_regex(rquote, "u")[,1]), "r|R")
# [1] 5
## count all R or r
stri_count_regex(rquote, "r|R")
# [1] 6
To get the number of R's before the first u, you need to make an intermediate step. (You probably don't need to. I'm sure akrun knows some incredibly cool regular expression to get the job done, but it won't be as easy to understand as this).
rquote <- "R's internals are irrefutably intriguing"
before_u <- gsub("u[[:print:]]+$", "", rquote)
length(stringr::str_extract_all(before_u, "(R|r)")[[1]])
You may try this,
> length(str_extract_all(rquote, '[Rr]')[[1]])
[1] 6
To get the count of all r's before the first u
> length(str_extract_all(rquote, perl('u.*(*SKIP)(*F)|[Rr]'))[[1]])
[1] 5
EDIT: Just saw before the first u. In that case, we can get the position of the first 'u' from either which or match.
Then, using the strsplit output from the OP's code, apply grepl with ignore.case=TRUE to 'chars' up to that position (ind) to get a logical index of the 'r' matches, and sum it.
ind <- which(chars=='u')[1]
Or
ind <- match('u', chars)
sum(grepl('r', chars[seq(ind)], ignore.case=TRUE))
#[1] 5
Or we can use sub and gsub on the original string ('rquote'). The inner sub removes everything from the first u to the end of the string (u.*$), and the outer gsub matches every character except R and r ([^Rr]) and replaces it with ''. nchar then gives the count of the remaining characters.
nchar(gsub('[^Rr]', '', sub('u.*$', '', rquote)))
#[1] 5
Or, if we want to count the 'r' in the entire string, we can use gregexpr to get the positions of the matching characters in the original string ('rquote') and take the length:
length(gregexpr('[rR]', rquote)[[1]])
#[1] 6
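One thing to watch with the length(gregexpr(...)) idiom: when nothing matches, gregexpr returns -1, so the length is 1 rather than 0. A small helper that guards against this could look like the sketch below. The function name count_r_before_u is hypothetical, and treating a string without any "u" as "count over the whole string" is an assumption, not something the question specifies:

```r
# Hypothetical helper: count "r"/"R" before the first "u" in a single string.
count_r_before_u <- function(s) {
  u_pos <- regexpr("u", s, fixed = TRUE)

  # Assumption: if there is no "u", the whole string is searched.
  head_part <- if (u_pos > 0) substr(s, 1, u_pos - 1) else s

  hits <- gregexpr("[Rr]", head_part)[[1]]

  # gregexpr gives -1 when there is no match, so count positive positions.
  sum(hits > 0)
}

rquote <- "R's internals are irrefutably intriguing"
n_before_u <- count_r_before_u(rquote)
n_none     <- count_r_before_u("xyz")
```

For the quote from the question this agrees with the answers above, and it returns 0 rather than 1 for a string with no r at all.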
