R: explode a character-string and get the last element (row-wise)

I have the following data-frame
df <- data.frame(var1 = c("f253.02.ds.a01", "f253.02.ds.a02", "f253.02.ds.x.a01", "f253.02.ds.x.a02", "f253.02.ds.a10", "test"))
df
What's the easiest way to extract the last two digits of the variable var1? (e.g. 1, 2, 10, NA) I was experimenting with separate(), but the number of points in the characters is not always the same. Maybe with regular expressions?

With separate, we can use a regex lookaround
library(dplyr)
library(tidyr)
df %>%
separate(var1, into = c('prefix', 'suffix'),
sep="(?<=[a-z])(?=\\d+$)", remove = FALSE, convert = TRUE)
-output
# var1 prefix suffix
#1 f253.02.ds.a01 f253.02.ds.a 1
#2 f253.02.ds.a02 f253.02.ds.a 2
#3 f253.02.ds.x.a01 f253.02.ds.x.a 1
#4 f253.02.ds.x.a02 f253.02.ds.x.a 2
#5 f253.02.ds.a10 f253.02.ds.a 10
#6 test test NA

The expected output shown in the question has 4 elements but the input has 6 rows, so we assume that the expected output in the question is erroneous and that the correct output is the one shown below.
Assuming that the 2 digits are preceded by a non-digit, and noting that \D means non-digit (the backslash must be doubled within double quotes), we can use sub to strip everything up to and including the last non-digit. For "test" nothing remains, so as.numeric returns NA:
df %>% mutate(last2 = as.numeric(sub(".*\\D", "", var1)))
giving:
var1 last2
1 f253.02.ds.a01 1
2 f253.02.ds.a02 2
3 f253.02.ds.x.a01 1
4 f253.02.ds.x.a02 2
5 f253.02.ds.a10 10
6 test NA
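The same extraction can also be sketched in base R, without dplyr or tidyr. This is only a minimal sketch, assuming the number of interest is always the run of digits at the very end of the string; the names x, m and last_num are illustrative and not from the answers above.

```r
x <- c("f253.02.ds.a01", "f253.02.ds.x.a02", "f253.02.ds.a10", "test")

# Locate a run of digits anchored at the end of each string (-1 means no match)
m <- regexpr("[0-9]+$", x)

# Start with NA everywhere, then fill in only the strings that matched
last_num <- rep(NA_integer_, length(x))
last_num[m > 0] <- as.integer(regmatches(x, m))

last_num
```

Strings without trailing digits, such as "test", stay NA without relying on a coercion warning.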

Related

filter with a list of string conditions

This is an example what the data looks like:
height <- c("T_0.1", "T_0.2", "T_0.3", "T_0.11", "T_0.12", "T_0.13", "T_10.1", "T_10.2",
"T_10.3", "T_10.11", "T_10.12", "T_10.13","T_36.1", "T_36.2", "T_36.3", "T_36.11", "T_36.12",
"T_36.13")
value <- c(1,12,14,15,20,22,5,9,4,0.0,0.45,0.7,1,2,7,100,9,45)
df <- data.frame(height,value)
I want to filter all the values in height that end with ".1", ".2", and ".3". However, I want to do that using a "list of patterns" because the actual data frame has more than 1000 values.
Here is what I tried:
vars_list <- c(".1", ".2",".3")
df_new<-df[grepl(paste(vars_list, collapse = "|"), df$height),]
matchPattern <- paste(vars_list, collapse = "|")
df_new <- df %>% select(matches(matchPattern))
Both attempts return 0 observations. I am not sure what the issue is and I couldn't find a post that would help. So any help is very much appreciated!
The dot is a regex metacharacter, which matches any character except a new line. You need to escape it (i.e. tell R you are looking for a literal dot), by prepending it with \\.
However, your pattern will then match all rows in your sample data.
I assume you do not want to match, for example, "T_0.13", because it does not end with ".1", ".2" or ".3". In which case, you should add a dollar sign to indicate that you want your string to end with the desired match, rather than just contain it.
vars_list <- c("\\.1$", "\\.2$","\\.3$")
df_new<-df[grepl(paste(vars_list, collapse = "|"), df$height),]
df_new
# height value
# 1 T_0.1 1
# 2 T_0.2 12
# 3 T_0.3 14
# 7 T_10.1 5
# 8 T_10.2 9
# 9 T_10.3 4
# 13 T_36.1 1
# 14 T_36.2 2
# 15 T_36.3 7
Incidentally, another way you could express this is:
df[grepl("\\.[1-3]$", df$height),]
You can read more about the syntax used in regular expressions on R's ?regex help page.
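To make the effect of escaping and anchoring concrete, here is a small sketch on three made-up strings (the vector x and the result names are illustrative). It compares the unescaped pattern, the escaped pattern, the escaped and anchored pattern, and a plain substring search with fixed = TRUE.

```r
x <- c("T_0.1", "T_0.13", "T_011")

unescaped <- grepl(".1", x)               # "." matches any character
escaped   <- grepl("\\.1", x)             # a literal dot followed by 1, anywhere
anchored  <- grepl("\\.1$", x)            # a literal ".1" at the end of the string
literal   <- grepl(".1", x, fixed = TRUE) # no regex at all: plain substring search

data.frame(x, unescaped, escaped, anchored, literal)
```

Only the anchored pattern rejects "T_0.13", which is why both the escape and the dollar sign are needed here.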
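Alternatively, use the base function endsWith. It compares element by element and recycles a shorter suffix vector, so test each suffix separately and combine the results with |:
vars_list <- c(".1", ".2", ".3")
df <- data.frame(height,value) %>% filter(Reduce(`|`, lapply(vars_list, function(s) endsWith(height, s))))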
Created on 2023-02-12 with reprex v2.0.2
height value
1 T_0.1 1
2 T_0.2 12
3 T_0.3 14
4 T_10.1 5
5 T_10.2 9
6 T_10.3 4
7 T_36.1 1
8 T_36.2 2
9 T_36.3 7

regex grouping and re-ordering

I am afraid I have a regex question. I would like to extract the first group of a string, which is 1 digit, leave out 2nd group, which is 2 digits, and then extract the ending 5 digits as 3rd group.
In my opinion it should look like: str_extract(a, "(\\d{1})(\\d{2})(\\d{5})\\1\\3"). But that doesn't work.
Sample data is here and also the wanted outcome, but with a different expression:
library(tidyverse)
d <- tibble(a = as.character(as.integer(runif(10, 1e8, 2e8))) )
d %>%
mutate(want_but_wrong_regex = str_remove(a, "(?<=\\d)\\d{2}")) #
# A tibble: 10 x 2
#a want_but_wrong_regex
#<chr> <chr>
# 1 103016397 1016397
#2 164356395 1356395
#3 134615352 1615352
#4 176581897 1581897
#5 127035705 1035705
#6 158055182 1055182
#7 193991176 1991176
#8 147845896 1845896
#9 177083273 1083273
#10 129086338 1086338
I think what you are looking for is str_replace rather than str_extract (thanks to @AnilGoyal for the dummy data), i.e.,
> str_replace(strings, "(\\d{1})(\\d{2})(\\d{5})", "\\1\\3")
[1] "133333" "245678" "023456"
You're doing it the wrong way. You're capturing groups but not extracting them. Use a string substitution function like gsub, with the capturing groups in the pattern argument and the group references in the replacement argument, and you'll get the desired result:
strings <- c('12233333', '23345678', '00123456')
gsub('(\\d{1})(\\d{2})(\\d{5})', '\\1\\3', strings)
[1] "133333" "245678" "023456"
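Inside the pattern, \\1 and \\3 are backreferences: they require the text already captured by groups 1 and 3 to appear again, which is why the original str_extract attempt does not return the wanted pieces. If the groups themselves are wanted rather than a substitution, one possible base R sketch uses regexec with regmatches. The character classes are written as [0-9] here, and the names groups and res are illustrative.

```r
s <- c("12233333", "23345678", "00123456")
pattern <- "([0-9]{1})([0-9]{2})([0-9]{5})"

# regexec() reports the full match plus one entry per capture group
groups <- regmatches(s, regexec(pattern, s))

# Element 1 is the full match; elements 2 and 4 are groups 1 and 3
res <- vapply(groups, function(g) paste0(g[2], g[4]), character(1))
res
```

This gives the same strings as the gsub call above, but keeps every captured group available for other uses.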
Since the positions are fixed, why not extract the pieces with substring or similar functions instead of regex? They are usually faster than regex extraction.
library(dplyr)
library(stringr)
d %>% mutate(res = str_c(str_sub(a, 1, 1), str_sub(a, 4, 9)))
# a res
#1 103016397 1016397
#2 164356395 1356395
#3 134615352 1615352
#4 176581897 1581897
#5 127035705 1035705
#6 158055182 1055182
#7 193991176 1991176
#8 147845896 1845896
#9 177083273 1083273
#10 129086338 1086338
Or in base R -
transform(d, res = paste0(substr(a, 1, 1), substr(a, 4, 9)))

Remove prefix letter from column variables

I have all column names that start with 'm'. Example: mIncome, mAge. I want to remove the prefix. So far, I have tried the following:
df %>%
rename_all(~stringr::str_replace_all(.,"m",""))
This removes every 'm' in the column names. I just need it removed from the start. Any suggestions?
You can use sub in base R to remove "m" from the beginning of the column names.
names(df) <- sub('^m', '', names(df))
We need to specify the location. The ^ matches the start of the string (or here the column name). So, if we use ^m, it will only match 'm' at the beginning or start of the string and not elsewhere.
library(dplyr)
library(stringr)
df %>%
rename_all(~stringr::str_replace(.,"^m",""))
# ba Mbgeg gmba cfor
#1 1 2 4 6
#2 2 3 5 7
#3 3 4 6 8
Also, if the case should be ignored, wrap with regex and specify ignore_case = TRUE
df %>%
rename_all(~ stringr::str_replace(., regex("^m", ignore_case = TRUE), ""))
# ba bgeg gmba cfor
#1 1 2 4 6
#2 2 3 5 7
#3 3 4 6 8
Another option is word boundary (\\bm), but this could match the beginning of words where there are multi word column names
NOTE: str_replace_all is used when we want to replace multiple occurrence of the pattern. Here, we just need to replace the first instance and for that str_replace is enough.
data
df <- data.frame(mba = 1:3, Mbgeg = 2:4, gmba = 4:6, cfor = 6:8)
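The difference between an anchored and an unanchored pattern can be seen on a plain character vector. This is a minimal base R sketch; the vector nms and the result names are made up for illustration.

```r
nms <- c("mIncome", "mAge", "gmba", "Mbgeg")

anchored   <- sub("^m", "", nms)                      # only a leading "m"
everywhere <- gsub("m", "", nms)                      # every "m", wherever it occurs
any_case   <- sub("^m", "", nms, ignore.case = TRUE)  # a leading "m" or "M"

rbind(anchored, everywhere, any_case)
```

The unanchored gsub turns "mIncome" into "Incoe" and "gmba" into "gba", which is the behaviour the question wanted to avoid.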
Another way you can try
library(tidyverse)
df <- data.frame(mma = 1:2, mbapbe = 1:2)
df2 <- df %>%
rename_at(vars(c("mma", "mbapbe")) ,function(x) gsub("^m", "", x))
# ma bapbe
# 1 1 1
# 2 2 2

How to separate a column into two columns

df <- data.frame(PATIENT_ID=c(1,2,3,4),
CODE=c('N18','N180','N190','M1920'))
I want to separate the variable 'CODE' into two variables. One variable holds the first letter of 'CODE' ('N' or 'M' in my case), the other holds the remaining number. If there are more than two digits, put a '.' after the second digit.
The output should be
df <- data.frame(PATIENT_ID=c(1,2,3,4),
CODE=c('N18','N180','N190','M1920'),
VOR_1=c('N','N','N','M'),
VOR_2=c('18','18.0','19.0','19.20'))
Finally, define the variable of 'VOR_2' as a numeric variable.
Using sub for a base R solution:
df$VOR_1 <- sub("^([A-Z]).*$", "\\1", df$CODE)
df$VOR_2 <- sub("^([0-9]{2})(?=[0-9])", "\\1.", sub("^[A-Z]([0-9]+)$", "\\1", df$CODE), perl=TRUE)
df$VOR_2 <- as.numeric(df$VOR_2) # if desired; the output below shows VOR_2 before this conversion
df
PATIENT_ID CODE VOR_1 VOR_2
1 1 N18 N 18
2 2 N180 N 18.0
3 3 N190 N 19.0
4 4 M1920 M 19.20
An explanation of the logic behind VOR_2 is warranted. We first extract all the digits from the second character onwards using the simple regex ^[A-Z]([0-9]+)$. Then we make a second call to sub on the digit string to insert a decimal point after the second digit. The pattern uses a positive lookahead, which ensures that a dot is inserted only when there are three or more digits.
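To see what the lookahead contributes, the following sketch runs the second sub call with and without it on three digit strings taken from the example codes. The variable names are illustrative.

```r
digits <- c("18", "180", "1920")

# Without the lookahead, a dot is appended even when no digit follows
no_lookahead <- sub("^([0-9]{2})", "\\1.", digits)

# With the lookahead, the first two digits only match if a third digit follows
with_lookahead <- sub("^([0-9]{2})(?=[0-9])", "\\1.", digits, perl = TRUE)

rbind(no_lookahead, with_lookahead)
as.numeric(with_lookahead)
```

Without the lookahead, "18" becomes "18." with a dangling dot; with it, "18" is left unchanged. The perl = TRUE argument is needed because lookaheads are not part of R's default regex flavour.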
An idea via tidyr using separate can be,
library(dplyr)
library(tidyr) #separate
df %>%
separate(CODE, into = c("text", "num"), sep = "(?<=[A-Za-z])(?=[0-9])") %>%
mutate(num = as.numeric(num),
num = num / (10 ^ (nchar(num) - 2))
)
# PATIENT_ID text num
#1 1 N 18.0
#2 2 N 18.0
#3 3 N 19.0
#4 4 M 19.2
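The arithmetic in the mutate step can be checked on its own. A minimal sketch, assuming the numeric parts are small whole numbers that print as plain digit strings, since nchar on a number counts the characters of its text form; the names num, n_digits and scaled are illustrative.

```r
num <- c(18, 180, 190, 1920)

# nchar() coerces the number to text, so this is the count of digits
n_digits <- nchar(num)

# Dividing by 10^(digits - 2) leaves exactly two digits before the decimal point
scaled <- num / 10^(n_digits - 2)
scaled
```

Unlike the string-based answers, this approach converts to numeric first, so a code whose digits start with a zero would lose that zero before the digits are counted.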
You can use str_extract and sub:
library(stringr)
df$VOR1 <- str_extract(df$CODE, "^[A-Z]")
Here, you simply grab the capital letter at the beginning of the string, marked by ^.
df$VOR2 <- sub("(\\d{2})(\\d{1,2})", "\\1.\\2", str_extract(df$CODE, "\\d+"))
Here, you first extract just the digits using str_extract and then insert the period where appropriate.
Result:
df
PATIENT_ID CODE VOR1 VOR2
1 1 N18 N 18
2 2 N180 N 18.0
3 3 N190 N 19.0
4 4 M1920 M 19.20

Is there any way to delete the rows of data which don't have all numeric values?

I have data that has two columns. Each column of data has numerical values in it but some of them don't have any numerical values. I want to remove the rows which don't have all values numerical. In reality, the data has 1000 rows but for simplification, I made the data file in smaller size here. Thanks!
a <- c(1, 2, 3, 4, "--")
b <- c("--", 2, 3, "--", 5)
data <- data.frame(a, b)
One base R option could be:
data[!is.na(Reduce(`+`, lapply(data, as.numeric))), ]
a b
2 2 2
3 3 3
And for importing the data, use stringsAsFactors = FALSE.
Or using sapply():
data[!is.na(rowSums(sapply(data, as.numeric))), ]
An easier option is to convert to numeric with as.numeric and then check for NA. A non-numeric element is converted to NA (with a warning), which is.na detects; use that in filter_all to remove the rows:
library(dplyr)
data %>%
filter_all(all_vars(!is.na(as.numeric(.))))
# a b
#1 2 2
#2 3 3
If we don't like the warnings, an option is to detect number-only elements with a regex in str_detect: one or more digits or dots ([0-9.]+) from the start (^) to the end ($) of the string
library(stringr)
data %>%
filter_all(all_vars(str_detect(., "^[0-9.]+$")))
# a b
#1 2 2
#2 3 3
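Another way to avoid the warnings is to keep the as.numeric check and silence the coercion explicitly. This is a base R sketch rather than a dplyr one; the names num, keep and clean are illustrative, and rows that already contain NA would be dropped as well.

```r
data <- data.frame(a = c("1", "2", "3", "4", "--"),
                   b = c("--", "2", "3", "--", "5"),
                   stringsAsFactors = FALSE)

# Convert every column; non-numeric strings become NA and the warnings are silenced
num <- suppressWarnings(sapply(data, as.numeric))

# Keep the rows that have no NA in any column
keep <- complete.cases(num)
clean <- as.data.frame(num[keep, , drop = FALSE])
clean
```

A side effect of this variant is that the surviving columns are already numeric, whereas the regex filter leaves them as character. The [0-9.]+ pattern is also looser and stricter than as.numeric in different ways: it would accept a string such as "1.2.3" and reject a negative number such as "-5".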
If we have only -- as non-numeric, it is easier to remove
data[!rowSums(data == "--"),]
# a b
#2 2 2
#3 3 3
data
data <- data.frame(a,b, stringsAsFactors = FALSE)
