I am afraid I have a regex question. I would like to extract the first group of a string, which is 1 digit, leave out the 2nd group, which is 2 digits, and then extract the ending 5 digits as the 3rd group.
I thought it should look like this: str_extract(a, "(\\d{1})(\\d{2})(\\d{5})\\1\\3"). But that doesn't work.
Here is some sample data along with the wanted outcome, produced with a different expression:
library(tidyverse)
d <- tibble(a = as.character(as.integer(runif(10, 1e8, 2e8))) )
d %>%
  mutate(want_but_wrong_regex = str_remove(a, "(?<=\\d)\\d{2}"))
# A tibble: 10 x 2
#    a         want_but_wrong_regex
#    <chr>     <chr>
#  1 103016397 1016397
#  2 164356395 1356395
#  3 134615352 1615352
#  4 176581897 1581897
#  5 127035705 1035705
#  6 158055182 1055182
#  7 193991176 1991176
#  8 147845896 1845896
#  9 177083273 1083273
# 10 129086338 1086338
I think what you are looking for is str_replace, rather than str_extract (thanks to @AnilGoyal for the dummy data), i.e.,
> str_replace(strings, "(\\d{1})(\\d{2})(\\d{5})", "\\1\\3")
[1] "133333" "245678" "023456"
You're doing it the wrong way: you're capturing groups but not extracting them. Use a string substitution function like gsub, with capturing groups in the pattern argument and group references in the replacement argument, and you'll get the desired results.
strings <- c('12233333', '23345678', '00123456')
gsub('(\\d{1})(\\d{2})(\\d{5})', '\\1\\3', strings)
[1] "133333" "245678" "023456"
Since the positions are fixed, why not extract the string using substring or similar functions instead of regex? They are usually faster than regex extraction.
library(dplyr)
library(stringr)
d %>% mutate(res = str_c(str_sub(a, 1, 1), str_sub(a, 4, 9)))
#            a     res
# 1  103016397 1016397
# 2  164356395 1356395
# 3  134615352 1615352
# 4  176581897 1581897
# 5  127035705 1035705
# 6  158055182 1055182
# 7  193991176 1991176
# 8  147845896 1845896
# 9  177083273 1083273
# 10 129086338 1086338
Or in base R -
transform(d, res = paste0(substr(a, 1, 1), substr(a, 4, 9)))
Related
I have the following data-frame
df <- data.frame(var1 = c("f253.02.ds.a01", "f253.02.ds.a02", "f253.02.ds.x.a01", "f253.02.ds.x.a02", "f253.02.ds.a10", "test"))
df
What's the easiest way to extract the last two digits of the variable var1? (e.g. 1, 2, 10, NA) I was experimenting with separate(), but the number of dots in the strings is not always the same. Maybe with regular expressions?
With separate, we can use a regex lookaround
library(dplyr)
library(tidyr)
df %>%
  separate(var1, into = c('prefix', 'suffix'),
           sep = "(?<=[a-z])(?=\\d+$)", remove = FALSE, convert = TRUE)
-output
# var1 prefix suffix
#1 f253.02.ds.a01 f253.02.ds.a 1
#2 f253.02.ds.a02 f253.02.ds.a 2
#3 f253.02.ds.x.a01 f253.02.ds.x.a 1
#4 f253.02.ds.x.a02 f253.02.ds.x.a 2
#5 f253.02.ds.a10 f253.02.ds.a 10
#6 test test NA
The expected output shown in the question has 4 elements but the input has 6 rows, so we assume that the expected output shown in the question is erroneous and that the correct output is that shown below.
Now, assuming that the 2 digits are preceded by a non-digit, note that \D means non-digit (the backslash must be doubled within double quotes).
df %>% mutate(last2 = as.numeric(sub(".*\\D", "", var1)))
giving:
var1 last2
1 f253.02.ds.a01 1
2 f253.02.ds.a02 2
3 f253.02.ds.x.a01 1
4 f253.02.ds.x.a02 2
5 f253.02.ds.a10 10
6 test NA
I have columns whose names all start with 'm'. Example: mIncome, mAge. I want to remove the prefix. So far, I have tried the following:
df %>%
rename_all(~stringr::str_replace_all(.,"m",""))
This removes every 'm' in the column names; I just need it removed from the start. Any suggestions?
You can use sub in base R to remove "m" from the beginning of the column names.
names(df) <- sub('^m', '', names(df))
We need to anchor the location: the ^ matches the start of the string (here, the column name). So, if we use ^m, it will only match 'm' at the start of the string and not elsewhere.
library(dplyr)
library(stringr)
df %>%
rename_all(~stringr::str_replace(.,"^m",""))
# ba Mbgeg gmba cfor
#1 1 2 4 6
#2 2 3 5 7
#3 3 4 6 8
Also, if the case should be ignored, wrap the pattern in regex() and specify ignore_case = TRUE
df %>%
rename_all(~ stringr::str_replace(., regex("^m", ignore_case = TRUE), ""))
# ba bgeg gmba cfor
#1 1 2 4 6
#2 2 3 5 7
#3 3 4 6 8
Another option is a word boundary (\\bm), but this could match the beginning of a word anywhere in multi-word column names
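A quick sketch of that pitfall on made-up names (these example strings are not from the question's data):
str_replace(c("mba", "gross margin"), "\\bm", "")
# [1] "ba"          "gross argin"   <- \\bm also matches the start of "margin"
str_replace(c("mba", "gross margin"), "^m", "")
# [1] "ba"          "gross margin"  <- ^m leaves the second name untouched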
NOTE: str_replace_all is used when we want to replace multiple occurrences of the pattern. Here, we just need to replace the first instance, and for that str_replace is enough.
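A small illustration of that difference on a made-up string:
str_replace("mamma", "m", "")      # "amma" - only the first m is replaced
str_replace_all("mamma", "m", "")  # "aa"   - every m is replaced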
data
df <- data.frame(mba = 1:3, Mbgeg = 2:4, gmba = 4:6, cfor = 6:8)
Another way you can try
library(tidyverse)
df <- data.frame(mma = 1:2, mbapbe = 1:2)
df2 <- df %>%
  rename_at(vars(c("mma", "mbapbe")), function(x) gsub("^m", "", x))
# ma bapbe
# 1 1 1
# 2 2 2
df <- data.frame(PATIENT_ID=c(1,2,3,4),
CODE=c('N18','N180','N190','M1920'))
I want to separate the variable 'CODE' into two variables. One variable shows the first letter of 'CODE' ('N' or 'M' in my case), the other shows the remaining number. If there are more than two digits, insert a '.' after the second digit.
The output should be
df <- data.frame(PATIENT_ID=c(1,2,3,4),
CODE=c('N18','N180','N190','M1920'),
VOR_1=c('N','N','N','M'),
VOR_2=c('18','18.0','19.0','19.20'))
Finally, define the variable 'VOR_2' as a numeric variable.
Using sub for a base R solution:
df$VOR_1 <- sub("^([A-Z]).*$", "\\1", df$CODE)
df$VOR_2 <- sub("^([0-9]{2})(?=[0-9])", "\\1.", sub("^[A-Z]([0-9]+)$", "\\1", df$CODE), perl=TRUE)
df$VOR_2 <- as.numeric(df$VOR_2) # if desired
df
PATIENT_ID CODE VOR_1 VOR_2
1 1 N18 N 18
2 2 N180 N 18.0
3 3 N190 N 19.0
4 4 M1920 M 19.20
An explanation of the logic behind VOR_2 is warranted. We first extract all the digits from the second character onwards using the simple regex ^[A-Z]([0-9]+)$. Then, we make a second call to sub on the digit string to insert a decimal point after the second digit. The pattern uses a positive lookahead, which ensures that a dot gets intercalated only in the case of three or more digits.
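A quick check of that second sub call on bare digit strings (values taken from the sample CODEs):
sub("^([0-9]{2})(?=[0-9])", "\\1.", c("18", "180", "190", "1920"), perl = TRUE)
# [1] "18"    "18.0"  "19.0"  "19.20"
The dot appears only where a third digit follows the first two.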
An idea via tidyr using separate can be,
library(dplyr)
library(tidyr) #separate
df %>%
  separate(CODE, into = c("text", "num"), sep = "(?<=[A-Za-z])(?=[0-9])") %>%
  mutate(num = as.numeric(num),
         num = num / (10 ^ (nchar(num) - 2)))
# PATIENT_ID text num
#1 1 N 18.0
#2 2 N 18.0
#3 3 N 19.0
#4 4 M 19.2
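The num / (10 ^ (nchar(num) - 2)) step deserves a note (my reading of the code, not the author's wording): nchar counts the digits of the parsed number, so everything beyond the first two digits is pushed behind the decimal point. For example:
180 / 10^(nchar(180) - 2)    # 18   (shown as 18.0 in the column)
1920 / 10^(nchar(1920) - 2)  # 19.2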
You can use str_extract and sub:
library(stringr)
df$VOR1 <- str_extract(df$CODE, "^[A-Z]")
Here, you simply grab the capital letter at the beginning of the string, marked by ^.
df$VOR2 <- sub("(\\d{2})(\\d{1,2})", "\\1.\\2", str_extract(df$CODE, "\\d+"))
Here, you first extract just the digits using str_extract and then insert the period where appropriate:
Result:
df
PATIENT_ID CODE VOR1 VOR2
1 1 N18 N 18
2 2 N180 N 18.0
3 3 N190 N 19.0
4 4 M1920 M 19.20
I have data with two columns. Each column mostly contains numerical values, but some entries are not numerical. I want to remove the rows in which not all values are numerical. In reality, the data has 1000 rows, but for simplicity I made the data file smaller here. Thanks!
a <- c(1, 2, 3, 4, "--")
b <- c("--", 2, 3, "--", 5)
data <- data.frame(a, b)
One base R option could be:
data[!is.na(Reduce(`+`, lapply(data, as.numeric))), ]
a b
2 2 2
3 3 3
And for importing the data, use stringsAsFactors = FALSE.
Or using sapply():
data[!is.na(rowSums(sapply(data, as.numeric))), ]
An easier option is to check for NA after converting to numeric with as.numeric. If an element is not numeric, as.numeric returns NA, which can be detected with is.na and used inside filter_all to remove the rows.
library(dplyr)
data %>%
filter_all(all_vars(!is.na(as.numeric(.))))
# a b
#1 2 2
#2 3 3
If we don't like the warnings, an option is to detect numbers-only elements with a regex, using str_detect to check for one or more digits or dots ([0-9.]+) from the start (^) to the end ($) of the string.
library(stringr)
data %>%
filter_all(all_vars(str_detect(., "^[0-9.]+$")))
# a b
#1 2 2
#2 3 3
If -- is the only non-numeric value, it is easier to remove:
data[!rowSums(data == "--"),]
# a b
#2 2 2
#3 3 3
data
data <- data.frame(a,b, stringsAsFactors = FALSE)
I am trying the code below to separate "M" and "B", together with their values, into 2 different columns.
I want output like this:
level 1   level 2
M 3.2     B 3.6
M 4       B 2.8
          B 3.5
Input:
reve=c("M 3.2","B 3.6","B 2.8","B 3.5","M 4")
#class(reve)
data=data.frame(reve)
Here is what I have tried.
index=which(grepl("M ", data$reve))
data$reve=gsub("M ","",data$reve)
data$reve=gsub("B ","",data$reve)
data$reve=as.numeric(data$reve)
If you have a data frame, you can do that with tidyr's separate().
Here is an example:
library(dplyr)
library(tidyr)  # for separate()
df <- tibble(coupe = c("M 2.3", "M 4.5", "B 1"))
df %>% separate(coupe, c("MorB", "Quant"), " ")
OUTPUT
# MorB Quant
# <chr> <chr>
#1 M 2.3
#2 M 4.5
#3 B 1
Hope it helps you!
For counting the number of "M" rows:
df %>% separate(YourColumn, c("MorB","Quant"), " ") %>%
filter(MorB == "M") %>% nrow()
Here is a base R approach.
lst <- split(reve, substr(reve, 1, 1))
df1 <- as.data.frame(lapply(lst, `length<-`, max(lengths(lst))))
df1
# B M
#1 B 3.6 M 3.2
#2 B 2.8 M 4
#3 B 3.5 <NA>
split divides the vector in two by the first letter. This gives you a list with entries of unequal length. Use lapply to make the entries the same length, i.e. pad the shorter one with NAs, then call as.data.frame.
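The intermediate objects look like this for the sample reve vector:
lst
# $B
# [1] "B 3.6" "B 2.8" "B 3.5"
#
# $M
# [1] "M 3.2" "M 4"
lapply(lst, `length<-`, max(lengths(lst)))
# $B
# [1] "B 3.6" "B 2.8" "B 3.5"
#
# $M
# [1] "M 3.2" "M 4"   NA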
If you want to change the names, you can use setNames
setNames(df1, c("level_2", "level_1"))
In case I misunderstood your desired output, try
df1 <- data.frame(do.call(rbind, (strsplit(reve, " "))), stringsAsFactors = FALSE)
df1[] <- lapply(df1, type.convert, as.is = TRUE)
df1
# X1 X2
#1 M 3.2
#2 B 3.6
#3 B 2.8
#4 B 3.5
#5 M 4.0
I think options rooted in regex may also be helpful for these types of problems.
reve=c("M 3.2","B 3.6","B 2.8","B 3.5","M 4")
data=data.frame(reve, stringsAsFactors = F) # handle your data as strings, not factors
# regex to extract M vals and B vals
mvals <- stringi::stri_extract_all_regex(data, "M+\\s[0-9]\\.[0-9]|M+\\s[0-9]")[[1]]
bvals <- stringi::stri_extract_all_regex(data, "B+\\s[0-9]\\.[0-9]|B+\\s[0-9]")[[1]]
# gluing things together into a single df
len <- max(length(mvals), length(bvals))                # length of the longer vector
data.frame(M = c(mvals, rep(NA, len - length(mvals))),  # pad each vector to the same size
           B = c(bvals, rep(NA, len - length(bvals))))
In case regex is unfamiliar: the first expression searches for "M" followed by a space, then a digit 0 through 9, then a period, then another digit 0 through 9. The vertical pipe is an "or" operator, so the expression also searches for "M" followed by a space and a single digit; this second half accounts for cases like "M 4". The second expression does the same thing, just for lines that contain "B" in lieu of "M".
These are quick and dirty regex statements. I'm sure cleaner formulations are possible to get the same results.
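For instance, the alternation can be collapsed by putting the dot inside the character class (a sketch that should be equivalent for this data):
mvals <- stringi::stri_extract_all_regex(data, "M\\s[0-9.]+")[[1]]
bvals <- stringi::stri_extract_all_regex(data, "B\\s[0-9.]+")[[1]]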
We can count Millions or Billions as follows:
Input datatset:
reve=c("M 3.2","B 3.6","B 2.8","B 3.5","M 4")
data=data.frame(reve)
Code
library(dplyr)
library(tidyr)
data %>%
  separate(reve, c("Label", "Value"), extra = "merge") %>%
  group_by(Label) %>%
  summarise(n = n())
Output
# A tibble: 2 x 2
Label n
<chr> <int>
1 B 3
2 M 2
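A note on extra = "merge": separate() splits on every non-alphanumeric character by default, so "M 3.2" alone would give three pieces ("M", "3", "2"); extra = "merge" folds the surplus into the last column so Value keeps "3.2". A quick sketch:
separate(data.frame(reve = "M 3.2"), reve, c("Label", "Value"))                  # warns that extra pieces are discarded; Value is "3"
separate(data.frame(reve = "M 3.2"), reve, c("Label", "Value"), extra = "merge") # Value is "3.2"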