string split operation in R - r

In my data I have a column of strings. Each string is five characters long. I would like to figure out how to split the string so that I keep the first two characters, the last two and disregard the middle or third character.
I looked at other Stack Overflow questions and adapted the code below. It seemed to work at first, but in some cases it does not return what I expect.
This is what I have:
library(stringr)
statecensusFIPS <- c("01001", "03001", "13144")
newFIPS <- lapply(2:3, function(i){
if(i==2){
str_sub(statecensusFIPS, end = i)
} else {
str_sub(statecensusFIPS, i)
}})
StateFIPS <- newFIPS[[1]]
CountyFIPS <- newFIPS[[2]]
# Results
> StateFIPS
[1] "01" "03" "13"
> CountyFIPS
[1] "001" "001" "144"
How do I adjust the code so that I have these results instead?
StateFIPS
[1] "01" "03" "13"
CountyFIPS
[1] "01" "01" "44"

How about this? It assumes every string is exactly five characters long, with the first two characters as the state FIPS and the last two as the county FIPS.
library(stringr)
statecensusFIPS <- c("01001", "03001", "13144")
newFIPS <- lapply(2:3, function(i) if (i == 2) str_sub(statecensusFIPS, end = i) else str_sub(statecensusFIPS, i + 1))
StateFIPS <- newFIPS[[1]]
CountyFIPS <- newFIPS[[2]]
A simpler way could be:
statecensusFIPS <- c("01001", "03001", "13144")
StateFIPS <- str_sub(statecensusFIPS, end = 2)
CountyFIPS <- str_sub(statecensusFIPS, 4)
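The same split can be sketched in base R without stringr. This is only a sketch under the assumption stated above, that every code is exactly five characters long; the `stopifnot` guard is an addition for illustration and is not part of the original answer.

```r
statecensusFIPS <- c("01001", "03001", "13144")

# Guard the assumption that every code has exactly five characters.
stopifnot(all(nchar(statecensusFIPS) == 5))

# Characters 1-2 are the state part, characters 4-5 the county part;
# the third character is dropped.
StateFIPS  <- substr(statecensusFIPS, 1, 2)
CountyFIPS <- substr(statecensusFIPS, 4, 5)

StateFIPS
# [1] "01" "03" "13"
CountyFIPS
# [1] "01" "01" "44"
```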

Related

readr::parse_number with leading zero

I would like to parse numbers that have a leading zero.
I tried readr::parse_number, however, it omits the leading zero.
library(readr)
parse_number("thankyouverymuch02")
#> [1] 2
Created on 2022-12-30 with reprex v2.0.2
The desired output would be 02
The simplest and most naive would be:
gsub("\\D", "", "thankyouverymuch02")
[1] "02"
The regex special "\\d" matches a single 0-9 character only; the inverse is "\\D" which matches a single character that is anything except 0-9.
If you have strings with multiple patches of numbers and you want them to be distinct, neither parse_number nor this simple gsub is going to work.
vec <- c("thankyouverymuch02", "thank03youverymuch02")
gsub("\\D", "", vec)
# [1] "02" "0302"
For that, the result must always be a list (since we don't necessarily know a priori how many elements have 0, 1 or more number-groups).
regmatches(vec, gregexpr("\\d+", vec))
# [[1]]
# [1] "02"
# [[2]]
# [1] "03" "02"
#### equivalently
stringr::str_extract_all(vec, "\\d+")
# [[1]]
# [1] "02"
# [[2]]
# [1] "03" "02"
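If only the first run of digits in each string is wanted, a base R sketch could look like the following. It assumes every element contains at least one digit, because `regmatches` combined with `regexpr` silently drops elements that have no match. A leading zero only survives while the value stays a character string; converting to a number discards it, and `sprintf` can restore a fixed width afterwards.

```r
vec <- c("thankyouverymuch02", "thank03youverymuch02")

# regexpr() finds only the first match per element.
first_run <- regmatches(vec, regexpr("\\d+", vec))
first_run
# [1] "02" "03"

# Converting to integer drops the leading zero ...
as.integer(first_run)
# [1] 2 3

# ... and sprintf() can pad it back to two digits as a string.
sprintf("%02d", as.integer(first_run))
# [1] "02" "03"
```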

How can I use `for loop` at once without non-numeric errors?

I wonder how a for loop can be used here without a non-numeric error. I would like to replace several values in the vector Nums with character labels, using a for loop.
But after the third line of the loop the vector becomes chr, so the remaining lines fail. The same happens when I use if or a while loop. Can someone give a hint about this?
for(n in 1:30){
Nums<-1:n
Nums[Nums%%2==0 & Nums%%3==0]<-"OK1"
Nums[Nums%%2==0 & Nums%%3!=0]<-"OK2"
Nums[Nums%%2!=0 & Nums%%3==0]<-"OK3"
Nums[Nums%%2!=0 & Nums%%3!=0]<-n
}
Error in Nums%%2 : non-numeric argument to binary operator
I don't think the loop is actually doing what you want it to do. You are replacing Nums at every iteration, so nothing is actually being saved. Maybe you don't actually want a loop.
Nums <- 1:30
x <- 1:30
dplyr::case_when(
Nums%%2==0 & x%%3==0 ~ "OK1",
Nums%%2==0 & x%%3!=0 ~ "OK2",
Nums%%2!=0 & x%%3==0 ~ "OK3",
Nums%%2!=0 & x%%3!=0 ~ as.character(x)
)
#> [1] "1" "OK2" "OK3" "OK2" "5" "OK1" "7" "OK2" "OK3" "OK2" "11" "OK1"
#> [13] "13" "OK2" "OK3" "OK2" "17" "OK1" "19" "OK2" "OK3" "OK2" "23" "OK1"
#> [25] "25" "OK2" "OK3" "OK2" "29" "OK1"
Character and numeric values can't coexist in a vector*. As @Ands. points out, you don't really need a loop for this. If you want to avoid case_when (which is from the dplyr package, part of the "tidyverse"), you can do:
n <- 30
Nums <- 1:n
x <- as.character(Nums)
x[Nums%%2==0 & Nums%%3==0]<-"OK1"
x[Nums%%2==0 & Nums%%3!=0]<-"OK2"
x[Nums%%2!=0 & Nums%%3==0]<-"OK3"
You don't need the final statement because x already holds the character form of each number for the remaining elements.
If you want to use a for loop and replace as you go, you could convert the vector to a list:
Nums <- 1:n
Nums <- as.list(Nums)
for (i in 1:n) {
if (i%%2==0 && i%%3==0) Nums[[i]] <- "OK1"
if (i%%2==0 && i%%3!=0) Nums[[i]] <- "OK2"
if (i%%2!=0 && i%%3==0) Nums[[i]] <- "OK3"
}
unlist(Nums)
* Technically they can't coexist in an atomic vector — lists are vectors too ...
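The coercion behind the original error can be seen in a small sketch. The vector `x` here is a made-up example, not taken from the question: assigning a single string into an integer vector converts the whole vector to character, after which `%%` no longer works on it.

```r
x <- 1:3
class(x)
# [1] "integer"

# Assigning one string coerces every element to character.
x[2] <- "OK"
class(x)
# [1] "character"
x
# [1] "1"  "OK" "3"

# Arithmetic on the coerced vector now fails; capture the error message
# instead of stopping the script.
res <- tryCatch(x %% 2, error = function(e) conditionMessage(e))
res
```

This is why the answers above keep the numeric vector intact for the tests and write the labels into a separate character vector or a list.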

Extracting every nth element of vector of lists

I have the following ids.
ids <- c('a-000', 'b-001', 'c-002')
I want to extract the numeric part of them (000, 001, 002).
I tried this :
(str_split(ids, '-', n=2))[2]
returns the following :
[[1]]
[1] "b" "001"
I don't want the second element of the list. I want the second element of every element in the list. I know this is definitely a basic question, but how do I resolve the syntax conflict? Should I go through a lambda function?
The function is also available in base R.
sapply(strsplit(ids, "-"), `[`, 2)
# [1] "000" "001" "002"
You can also try gsub and substring.
gsub("\\D+", "", ids)
# [1] "000" "001" "002"
substring(ids, 3)
# [1] "000" "001" "002"
To continue with your attempt, you can use sapply :
sapply(stringr::str_split(ids, '-', n=2), `[`, 2)
#[1] "000" "001" "002"
It is better to use str_split_fixed though here.
stringr::str_split_fixed(ids, '-', n=2)[, 2]
#[1] "000" "001" "002"
Or in base R :
sub('.*?-(.*)-?.*', '\\1', ids)
You could try str_remove(ids, "\\D+")
With base R you can remove all the characters that are not digits:
ids <- c('a-000', 'b-001', 'c-002')
gsub("[^[:digit:]]", "", ids)
#> [1] "000" "001" "002"
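[:digit:] is the POSIX character class for digits, and a leading ^ inside the brackets negates the class, so [^[:digit:]] matches any character that is not a digit. The call therefore replaces every non-digit character with the empty string "".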
For more information see documentation for gsub() and regex in R.
An option with str_replace
library(stringr)
str_replace(ids, "\\D+", "")
#[1] "000" "001" "002"
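To make the difference in the question concrete, here is a base R sketch. The names `parts`, `second_of_list` and `second_of_each` are chosen for illustration only. Indexing the split result with `[2]` selects the second list element, while applying `[` to each element picks the second piece of every split; `vapply` is used instead of `sapply` so the result type is checked.

```r
ids <- c("a-000", "b-001", "c-002")

# strsplit() returns a list with one character vector per input string.
parts <- strsplit(ids, "-", fixed = TRUE)

# [2] on the list returns a one-element list holding the second split.
second_of_list <- parts[2]

# Taking element 2 of every split gives the numeric parts.
second_of_each <- vapply(parts, function(p) p[2], character(1))
second_of_each
# [1] "000" "001" "002"
```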

R table not outputting results within for loop

I am just trying to loop over my columns and print the count of unique values for further processing, but I am getting no output. This should be simple. Here is a simplified version of my code. Is there something glaringly obvious missing, as I suspect?
for (i in 1:length(mydata)) {
(table(mydata[,i]))
}
Do you mean using apply?
> x <- data.frame("SN" = 1:4, "Age" = c(21,15,56,15), "Name" =
c("John","Dora","John","Dora"))
> apply(x,2,function(x) unique(x))
$SN
[1] "1" "2" "3" "4"
$Age
[1] "21" "15" "56"
$Name
[1] "John" "Dora"
You can also count the uniques like this:
> apply(x,2,function(x) length(unique(x)))
SN Age Name
4 3 2
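As for why the original loop shows nothing: R only auto-prints the value of an expression at the top level, so inside a for loop the result of `table()` is computed and then discarded unless it is printed or stored. A minimal sketch, reusing the example data frame from this answer under the assumed name `mydata`:

```r
mydata <- data.frame(SN = 1:4,
                     Age = c(21, 15, 56, 15),
                     Name = c("John", "Dora", "John", "Dora"))

# Inside a loop the result must be printed explicitly.
for (i in seq_along(mydata)) {
  print(table(mydata[[i]]))
}

# For further processing, storing the results is usually more useful.
counts <- lapply(mydata, table)

# Number of unique values per column, as a named integer vector.
n_unique <- vapply(mydata, function(col) length(unique(col)), integer(1))
n_unique
```

Working column by column with `lapply` or `vapply` also avoids the conversion to a character matrix that `apply` performs on a data frame.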

Regexes work on their own, but not when used together in strsplit

I'm trying to split a string in R using strsplit and a perl regex. The string consists of various alphanumeric tokens separated by periods or hyphens, e.g "WXYZ-AB-A4K7-01A-13B-J29Q-10". I want to split the string:
wherever a hyphen appears.
wherever a period appears.
between the second and third character of a token that is exactly 3 characters long and consists of 2 digits followed by 1 capital letter, e.g "01A" produces ["01", "A"] (but "012A", "B1A", "0A1", and "01A2" are not split).
For example, "WXYZ-AB-A4K7-01A-13B-J29Q-10" should produce ["WXYZ", "AB", "A4K7", "01", "A", "13", "B", "J29Q", "10"].
My current regex is ((?<=[-.]\\d{2})(?=[A-Z][-.]))|[.-] and it works perfectly in this online regex tester.
Furthermore, the two parts of the alternative, ((?<=[-.]\\d{2})(?=[A-Z][-.])) and [.-], both serve to split the string as intended in R, when they are used separately:
#correctly splits on periods and hyphens
strsplit("WXYZ-AB-A4K7-01A-13B-J29Q-10", "[.-]", perl=T)
[[1]]
[1] "WXYZ" "AB" "A4K7" "01A" "13B" "J29Q" "10"
#correctly splits tokens where a letter follows two digits
strsplit("WXYZ-AB-A4K7-01A-13B-J29Q-10", "((?<=[-.]\\d{2})(?=[A-Z][-.]))", perl=T)
[[1]]
[1] "WXYZ-AB-A4K7-01" "A-13" "B-J29Q-10"
But when I try and combine them using an alternative, the second regex stops working, and the string is only split on periods and hyphens:
#only second alternative is used
strsplit("WXYZ-AB-A4K7-01A-13B-J29Q-10", "((?<=[-.]\\d{2})(?=[A-Z][-.]))|[.-]", perl=T)
[[1]]
[1] "WXYZ" "AB" "A4K7" "01A" "13B" "J29Q" "10"
Why is this happening? Is it a problem with my regex, or with strsplit? How can I achieve the desired behavior?
Desired output:
## [[1]]
## [1] "WXYZ" "AB" "A4K7" "01" "A" "13" "B" "J29Q" "10"
An alternative that saves you from having to consider how the strsplit algorithm works is to use your original regex with gsub to insert a simple splitting character in all the right places, then use strsplit to do the straightforward splitting.
x <- "XYZ-02-01C-33D-2285"
strsplit(
gsub("((?<=[-.]\\d{2})(?=[A-Z][-.]))|[.-]", "-", x, perl = TRUE),
"-",
fixed = TRUE)
#[[1]]
#[1] "XYZ" "02" "01" "C" "33" "D" "2285"
Of course, RichScriven's answer and Wiktor Stribiżew's comment are probably better since they only have one function call.
You may use the match reset operator \K as a consuming stand-in for a positive lookbehind. This makes strsplit work correctly in R and avoids nesting a negative lookbehind inside a positive one.
"(?<![^.-])\\d{2}\\K(?=[A-Z](?:[.-]|$))|[.-]"
See the R demo online (and a regex demo here).
strsplit("XYZ-02-01C-33D-2285", "(?<![^.-])\\d{2}\\K(?=[A-Z](?:[.-]|$))|[.-]", perl=TRUE)
## => [[1]]
## [1] "XYZ" "02" "01" "C" "33" "D" "2285"
strsplit("WXYZ-AB-A4K7-01A-13B-J29Q-10", "(?<![^.-])\\d{2}\\K(?=[A-Z](?:[.-]|$))|[.-]", perl=TRUE)
## => [[1]]
## [1] "WXYZ" "AB" "A4K7" "01" "A" "13" "B" "J29Q" "10"
Here, the pattern matches:
(?<![^.-])\d{2}\K(?=[A-Z](?:[.-]|$)) - a sequence of:
(?<![^.-])\d{2} - 2 digits (\d{2}) that are not preceded by a char other than . and - (i.e. that are preceded by . or - or the start of the string; this is a common trick to avoid alternation inside a lookaround)
\K - the match reset operator, which makes the regex engine discard the text matched so far and go on matching the subsequent subpatterns, if any
(?=[A-Z](?:[.-]|$)) - a positive lookahead that requires an uppercase ASCII letter followed by . or - or the end of the string
| - or
[.-] - matches . or -.
Thanks to Rich Scriven and Jota I was able to solve the problem. Every time strsplit finds a match, it removes the match and everything to its left before looking for the next match. This means that regexes that rely on lookbehinds may not function as expected when the lookbehind overlaps with a previous match. In my case, the hyphens between tokens were removed upon being matched, meaning that the second regex could not use them to detect the beginning of the token:
#first match found
"WXYZ-AB-A4K7-01A-13B-J29Q-10"
^
#match + left removed
"AB-A4K7-01A-13B-J29Q-10"
#further matches found and removed
"01A-13B-J29Q-10"
#second regex fails to match because of missing hyphen in lookbehind:
#((?<=[-.]\\d{2})(?=[A-Z][-.]))
# ^^^^^^^^
"01A-13B-J29Q-10"
#algorithm continues
"13B-J29Q-10"
This was fixed by replacing the [.-] class to detect the edges of the token in the lookbehind with a boundary anchor, as per Jota's suggestion:
> strsplit("WXYZ-AB-A4K7-01A-13B-J29Q-10", "[-.]|(?<=\\b\\d{2})(?=[A-Z]\\b)", perl=T)
[[1]]
[1] "WXYZ" "AB" "A4K7" "01" "A" "13" "B" "J29Q" "10"
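A two-step approach sidesteps lookarounds entirely. The following is a sketch rather than any of the answers' methods: it first splits on periods and hyphens, then splits only those tokens that consist of exactly two digits followed by one capital letter. The names `tokens`, `needs_split` and `result` are made up for illustration.

```r
x <- "WXYZ-AB-A4K7-01A-13B-J29Q-10"

# Step 1: split on periods and hyphens.
tokens <- strsplit(x, "[.-]")[[1]]

# Step 2: find tokens that are exactly two digits plus one capital letter.
needs_split <- grepl("^[0-9]{2}[A-Z]$", tokens)

# Split those tokens after the second character and keep the rest as is.
pieces <- lapply(seq_along(tokens), function(i) {
  if (needs_split[i]) {
    c(substr(tokens[i], 1, 2), substr(tokens[i], 3, 3))
  } else {
    tokens[i]
  }
})
result <- unlist(pieces)
result
# [1] "WXYZ" "AB"   "A4K7" "01"   "A"    "13"   "B"    "J29Q" "10"
```

Because each token is tested in isolation, the result does not depend on how strsplit consumes the string between matches.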
