How to remove only numbers from string - r

I have the following dataframe in R
ID Village_Name
1 23
2 Name-23
3 34
4 Vasai2
5 23
I want to remove the rows where Village_Name contains only numbers; my desired dataframe would be
ID Village_Name
1 Name-23
2 Vasai2
How can I do it in R?

We can use grepl to match strings that consist of one or more digits from the start (^) to the end ($), and negate (!) the result so that all-digit elements become FALSE and everything else TRUE
i1 <- !grepl("^[0-9]+$", df1$Village_Name)
df1[i1, ]
Based on the OP's post, it could be also
data.frame(ID = head(df1$ID, sum(i1)), Village_Name = df1$Village_Name[i1])
# ID Village_Name
#1 1 Name-23
#2 2 Vasai2
Another option is to convert the column to numeric: non-numeric elements become NA (with a coercion warning), and is.na turns that into a logical index
df1[is.na(as.numeric(df1$Village_Name)),]
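To make the two approaches above easy to try, here is a minimal, self-contained sketch. The data frame is reconstructed from the question, and the names keep_regex, keep_num and res are only illustrative; it assumes Village_Name is stored as character rather than factor.

```r
# Sample data reconstructed from the question. stringsAsFactors = FALSE keeps
# Village_Name as character (as.numeric() on a factor returns level codes).
df1 <- data.frame(
  ID = 1:5,
  Village_Name = c("23", "Name-23", "34", "Vasai2", "23"),
  stringsAsFactors = FALSE
)

# Approach 1: keep rows whose value is not made up entirely of digits.
keep_regex <- !grepl("^[0-9]+$", df1$Village_Name)

# Approach 2: values that cannot be parsed as numbers become NA.
# suppressWarnings() hides the "NAs introduced by coercion" warning.
keep_num <- is.na(suppressWarnings(as.numeric(df1$Village_Name)))

res <- df1[keep_regex, ]
rownames(res) <- NULL  # renumber the remaining rows from 1
res
```

The two filters agree on this data, but they are not equivalent in general: as.numeric also parses strings such as "2.5" or "1e5", which the digit-only regex does not match.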

Here is another option using sub:
df1[nchar(sub("\\d+", "", df1$Village_Name)) > 0, ]
The basic idea is to strip the digits from each Village_Name entry and then check that at least one character remains, which implies that the entry is not entirely numeric. (sub removes only the first run of digits, which is enough here because an all-digit entry is a single run.)
In practice, though, I would probably go with the grepl option given by @akrun.
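As a quick check of the sub idea, the sketch below compares sub (first run of digits only) with gsub (every run) on a small vector. The vector x and the extra value "12ab34" are made up for illustration.

```r
# Illustrative vector; "12ab34" is added to show where sub and gsub differ.
x <- c("23", "Name-23", "34", "Vasai2", "12ab34")

first_run_removed <- sub("\\d+", "", x)   # removes only the first run of digits
all_runs_removed  <- gsub("\\d+", "", x)  # removes every run of digits

# Both leftovers are empty exactly when the entry was all digits,
# so both give the same row filter.
keep_sub  <- nchar(first_run_removed) > 0
keep_gsub <- nchar(all_runs_removed) > 0
```

The stripped strings differ for "12ab34" ("ab34" versus "ab"), but the resulting logical index is the same, and it matches the grepl filter.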

Related

How to extract n-th occurence of a pattern with regex

Let's say I have a string like this:
my_string = "my string a-maxeka UU-AA-19.01.03-20.01.22-bamdanool"
And I'd like to extract the first and the second date separately with stringr.
I tried something like str_extract(my_string, '(\\d+\\.\\d+\\.\\d+){n}') and while it works when n=1 it doesn't work with n=2. How can I extract the second occurrence?
Example of data.frame:
df <- data.frame(string_col = c("my string a-maxeka UU-AA-19.01.03-20.01.22-bamdanool",
"my string a-maxeka UU-AA-20.01.08-20.04.01-jdasdasd",
"asdad asda-adsad KK-ASD-20.05.05-jjj"))
And I want to create columns date1, date2.
Edit:
Although @RonanShah and @ThomasIsCoding provided solutions based on str_extract_all, I'd really like to know how to do it using regex only, as finding the n-th occurrence seems to be an important pattern and may result in a much neater solution.
(I) Capturing groups (marked by ()) can be repeated with {n}, but they then still count as a single capture group, which holds the last repetition. If you explicitly write down capturing groups for both dates, you can use str_match (without the "_all"):
> stringr::str_match(df$string_col, '(\\d+\\.\\d+\\.\\d+)-(\\d+\\.\\d+\\.\\d+)?')[, -1, drop = FALSE]
[,1] [,2]
[1,] "19.01.03" "20.01.22"
[2,] "20.01.08" "20.04.01"
[3,] "20.05.05" NA
Here, ? makes the occurrence of the second date optional and [, -1, drop = FALSE] removes the first column that always contains the whole match. You might want to change the - in the pattern to something more general.
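To really find only the nth match, you could use (I) in an expression like this (here with n <- 2):
stringr::str_match(df$string_col, paste0('(?:(\\d+\\.\\d+\\.\\d+).*?){', n, '}'))[, -1]
[1] "20.01.22" "20.04.01" NA
Here, we used (?: ) to specify a non-capturing group, so that the capture (( )) does not include what is in between the dates (.*?). The lazy .*? matters: with a greedy .* the filler swallows as much as it can and the capture is left with a truncated date such as "0.01.22".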
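For comparison, the n-th occurrence can also be picked in base R by collecting all matches and indexing into them. The sketch below is one possible way to do that; nth_match is a hypothetical helper name, not a function from base R or stringr.

```r
# nth_match() is a hypothetical helper: it returns the n-th match of
# `pattern` in each element of `x`, or NA when there are fewer than n matches.
nth_match <- function(x, pattern, n) {
  hits <- regmatches(x, gregexpr(pattern, x))  # list with all matches per string
  vapply(
    hits,
    function(h) if (length(h) >= n) h[n] else NA_character_,
    character(1)
  )
}

strings <- c("my string a-maxeka UU-AA-19.01.03-20.01.22-bamdanool",
             "my string a-maxeka UU-AA-20.01.08-20.04.01-jdasdasd",
             "asdad asda-adsad KK-ASD-20.05.05-jjj")
date_pattern <- "\\d+\\.\\d+\\.\\d+"

date1 <- nth_match(strings, date_pattern, 1)
date2 <- nth_match(strings, date_pattern, 2)
```

This keeps the pattern itself simple (one date) and moves the "n-th" logic into ordinary indexing, at the cost of extracting every match first.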
You could use stringr::str_extract_all() instead, like this:
str_extract_all(my_string, '\\d+\\.\\d+\\.\\d+')
str_extract always returns the first match. There might be ways of altering your regex to capture the nth occurrence of a pattern, but a simple way is to use str_extract_all and return the nth value.
library(stringr)
n <- 1
str_extract_all(my_string, '(\\d+\\.\\d+\\.\\d+)')[[1]][n]
#[1] "19.01.03"
n <- 2
str_extract_all(my_string, '(\\d+\\.\\d+\\.\\d+)')[[1]][n]
#[1] "20.01.22"
For the dataframe input we can extract all the date pattern and store it in a list and use unnest_wider to get them as separate columns.
library(dplyr)
df %>%
mutate(date = str_extract_all(string_col, '\\d+\\.\\d+\\.\\d+')) %>%
tidyr::unnest_wider(date) %>%
rename_with(~paste0('date', seq_along(.)), starts_with('..'))
# string_col date1 date2
# <chr> <chr> <chr>
#1 my string a-maxeka UU-AA-19.01.03-20.01.22-bamdanool 19.01.03 20.01.22
#2 my string a-maxeka UU-AA-20.01.08-20.04.01-jdasdasd 20.01.08 20.04.01
#3 asdad asda-adsad KK-ASD-20.05.05-jjj 20.05.05 NA
I guess you might need str_extract_all
str_extract_all(my_string, '(\\d+\\.\\d+\\.\\d+)')
or regmatches if you prefer with base R
regmatches(my_string,gregexpr('(\\d+\\.\\d+\\.\\d+)',my_string))
Update
With your data frame df
transform(df,
date = do.call(
rbind,
lapply(
u <- str_extract_all(string_col, "(\\d+\\.\\d+\\.\\d+)"),
`length<-`,
max(lengths(u))
)
)
)
we will get
string_col date.1 date.2
1 my string a-maxeka UU-AA-19.01.03-20.01.22-bamdanool 19.01.03 20.01.22
2 my string a-maxeka UU-AA-20.01.08-20.04.01-jdasdasd 20.01.08 20.04.01
3 asdad asda-adsad KK-ASD-20.05.05-jjj 20.05.05 <NA>
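The `length<-` step in the code above is easy to miss, so here is a small base R sketch of just that part. The list u is typed in by hand to stand in for the output of str_extract_all; padded and m are illustrative names.

```r
# Stand-in for the list returned by str_extract_all(): one vector per string.
u <- list(c("19.01.03", "20.01.22"), "20.05.05")

# `length<-` is the replacement function behind `length(x) <- n`.
# Extending a vector this way pads it with NA.
padded <- lapply(u, `length<-`, max(lengths(u)))

# All vectors now have equal length, so they can be row-bound into a matrix.
m <- do.call(rbind, padded)
m
```

Each row of m corresponds to one input string, and missing dates show up as NA, which is what transform then turns into the date.1 and date.2 columns.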
This is a good example to showcase {unglue}.
Here you have 2 patterns (one date or two dates), the first is two dates separated by a dash and surrounded by anything, the second is a date surrounded by anything. We can write it this way :
library(unglue)
unglue_unnest(
df, string_col,
c("{}{date1=\\d+\\.\\d+\\.\\d+}-{date2=\\d+\\.\\d+\\.\\d+}{}",
"{}{date1=\\d+\\.\\d+\\.\\d+}{}"),
remove = FALSE)
#> string_col date1 date2
#> 1 my string a-maxeka UU-AA-19.01.03-20.01.22-bamdanool 19.01.03 20.01.22
#> 2 my string a-maxeka UU-AA-20.01.08-20.04.01-jdasdasd 20.01.08 20.04.01
#> 3 asdad asda-adsad KK-ASD-20.05.05-jjj 20.05.05 <NA>

Finding matches on a character in more than one position in R

I have a character vector where I want to match the first and last parts so I can generate a list of matching characters.
Here is an example character: "20190625_165055_0f4e"
The first part is a date. The last 4 characters are a unique identifier. I need all characters in the list where these two parts are duplicates.
I could use a simple regex to match characters according to position, but some have more middle characters than others, e.g. "20190813_170215_17_1057"
Here is an example vector:
mylist<-c("20190712_164755_1034","20190712_164756_1034","20190712_164757_1034","20190719_164712_1001","20190719_164713_1001","20190722_153110_1054","20190813_170215_17_1057","20190813_170217_22_1057","20190828_170318_14_1065")
With this being the desired output:
c("20190712_164755_1034","20190712_164756_1034","20190712_164757_1034")
c("20190719_164712_1001","20190719_164713_1001")
c("20190722_153110_1054")
c("20190813_170215_17_1057","20190813_170217_22_1057")
c("20190828_170318_14_1065")
edits: made my character vector more simple and added desired output
We could remove the middle substring with sub and split the list based on that into a list of character vectors
lst1 <- split(mylist, sub("^(\\d+)_.*_([^_]+)$", "\\1_\\2", mylist))
lst1
#$`20190712_1034`
#[1] "20190712_164755_1034" "20190712_164756_1034" "20190712_164757_1034"
#$`20190719_1001`
#[1] "20190719_164712_1001" "20190719_164713_1001"
#$`20190722_1054`
#[1] "20190722_153110_1054"
#$`20190813_1057`
#[1] "20190813_170215_17_1057" "20190813_170217_22_1057"
#$`20190828_1065`
#[1] "20190828_170318_14_1065"
In the sub, we capture (with (...)) one or more digits (\\d+) from the start (^) of the string, followed by a _ and any other characters (.*) up to the last _, and then capture the remaining characters that are not a _ ([^_]+) up to the end ($) of the string. In the replacement, we specify the backreferences (\\1, \\2) of the captured groups. Essentially, this removes the varying part in the middle and keeps the fixed substrings at the beginning and end, which are then used to split the character vector.
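To see what the sub call produces before it is passed to split, the following sketch applies it to three of the strings from the question. The names ids, key and groups are only illustrative.

```r
# Three strings from the question, with differing middle parts.
ids <- c("20190712_164755_1034",
         "20190813_170215_17_1057",
         "20190813_170217_22_1057")

# Keep the leading digits (\\1) and the part after the last underscore (\\2).
key <- sub("^(\\d+)_.*_([^_]+)$", "\\1_\\2", ids)
key

# Strings that share a key end up in the same list element.
groups <- split(ids, key)
groups
```

Because .* is greedy, it absorbs any number of middle fields, so "170215_17" and "170217_22" are both reduced away and the two 20190813 entries get the same key.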
Here's an alternative approach with extract from tidyr.
library(tidyr)
result <- as.data.frame(mylist) %>%
extract(1, into = c("date","var1","var2"),
regex = "(^[0-9]{8}_[0-9]{6})_?(.*)?_([^_]+$)",
remove = FALSE)
result
# mylist date var1 var2
#1 20190625_165055_0f4e 20190625_165055 0f4e
#2 20190625_165056_0f4e 20190625_165056 0f4e
#3 20190625_165057_0f4e 20190625_165057 0f4e
#4 20190712_164755_1034 20190712_164755 1034
#...
#27 20190828_170318_14_1065 20190828_170318 14 1065
#28 20190828_170320_26_1065 20190828_170320 26 1065
#...
Now you can easily manipulate the data based on those variables.
split(result,result$var2)
#$`0f22`
# mylist date var1 var2
#29 20190917_165157_0f22 20190917_165157 0f22
#
#$`0f2a`
# mylist date var1 var2
#18 20190813_152856_0f2a 20190813_152856 0f2a
#19 20190813_152857_0f2a 20190813_152857 0f2a
#...
We can use extract to extract the date part and last 4 characters into separate columns. We then use group_split to split data based on those 2 columns.
tibble::tibble(mylist) %>%
tidyr::extract(mylist, c('col1', 'col2'), regex = '(.*?)_.*_(.*)',
remove = FALSE) %>%
dplyr::group_split(col1, col2, .keep = FALSE)
#[[1]]
# A tibble: 3 x 1
# mylist
# <chr>
#1 20190712_164755_1034
#2 20190712_164756_1034
#3 20190712_164757_1034
#[[2]]
# A tibble: 2 x 1
# mylist
# <chr>
#1 20190719_164712_1001
#2 20190719_164713_1001
#[[3]]
# A tibble: 1 x 1
# mylist
# <chr>
#1 20190722_153110_1054
#...

Sum number in a character string (R)

I have a vector that looks like :
numbers <- c("1/1/1", "1/0/2", "1/1/1/1", "2/0/1/1", "1/2/1")
(not always the same number of "/" character)
How can I create another vector with the sum of the numbers of each string?
Something like :
sum
3
3
4
4
4
One solution with strsplit and sapply:
sapply(strsplit(numbers, '/'), function(x) sum(as.numeric(x)))
#[1] 3 3 4 4 4
strsplit will split your strings on / (it doesn't matter how many /s you have). The output of strsplit is a list, so we iterate over it with sapply to calculate each sum.
What seems to me the most straightforward approach here is to convert each number string into a valid arithmetic expression and then evaluate it in R using eval along with parse. The string 1/0/2 becomes 1+0+2, and we simply evaluate that expression.
sapply(numbers, function(x) { eval(parse(text=gsub("/", "+", x))) })
1/1/1 1/0/2 1/1/1/1 2/0/1/1 1/2/1
3 3 4 4 4
1) strapply: this matches each string of digits using \\d+ and then applies as.numeric to it, returning a list with one vector of numbers per input string. We then apply sum to each of those vectors. This solution seems particularly short.
library(gsubfn)
sapply(strapply(numbers, "\\d+", as.numeric), sum)
## [1] 3 3 4 4 4
2) read.table: this applies sum(read.table(...)) to each string. It is a bit longer (but still only one line of code) and uses no packages.
sapply(numbers, function(x) sum(read.table(text = x, sep = "/")))
## 1/1/1 1/0/2 1/1/1/1 2/0/1/1 1/2/1
## 3 3 4 4 4
Add the USE.NAMES = FALSE argument to sapply if you don't want names on the output.
scan(textConnection(x), sep = "/", quiet = TRUE) could be used in place of read.table but is longer.
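The scan variant mentioned above is not shown in full, so here is a minimal sketch of it. It uses scan's text argument instead of textConnection, which is an equivalent substitution on my part, and vapply instead of sapply so that the result type is fixed; sums is an illustrative name.

```r
numbers <- c("1/1/1", "1/0/2", "1/1/1/1", "2/0/1/1", "1/2/1")

# scan() reads the "/"-separated fields of each string as doubles;
# quiet = TRUE suppresses the "Read n items" message.
sums <- vapply(
  numbers,
  function(x) sum(scan(text = x, sep = "/", quiet = TRUE)),
  numeric(1),
  USE.NAMES = FALSE
)
sums
```

Like the read.table version, this needs no packages, and USE.NAMES = FALSE drops the input strings as names on the result.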

Find strings whose second character is 9 using R

I have a vector of numbers and I want to find the ones whose second character is 9. grep() finds any number that contains a 9, but I am looking for code that finds only the numbers whose second character is 9. For example, the code below returns:
p <- c(34405, 09098424, 6908347, 8900333, 453434)
grep(9, p)
[1] 1 2 3 4
I am looking for something that returns:
[1] 2 3 4
Thanks
Majran
We can use substr to extract the 2nd character, check whether it is equal (==) to "9", and get the numeric index by wrapping the comparison with which.
which(substr(p,2,2)=="9")
#[1] 2 3 4
Or another option is grep where we match the pattern ^.9 (where ^ suggests the start of the string, . can be any character followed by 9 i.e. the second character)
grep("^.9", p)
#[1] 2 3 4
NOTE: Here I am assuming that the OP's vector is character class because numeric elements don't have 0 padded on the left.
data
p <- c("34405", "09098424", "6908347", "8900333", "453434")
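The note about the character class can be checked directly. The sketch below, using the illustrative names p_num and p_chr, contrasts the numeric vector from the question with the character version from the answer.

```r
# Numeric version as typed in the question: the leading zero is not stored.
p_num <- c(34405, 09098424, 6908347, 8900333, 453434)
# Character version assumed in the answer: the leading zero is kept.
p_chr <- c("34405", "09098424", "6908347", "8900333", "453434")

as.character(p_num)[2]  # the second element no longer starts with "0"

idx_num <- which(substr(p_num, 2, 2) == "9")  # second element is missed
idx_chr <- which(substr(p_chr, 2, 2) == "9")  # matches the desired output
idx_re  <- grep("^.9", p_chr)                 # same result via regex
```

With the numeric vector the second value is stored as 9098424, so its second character is "0" rather than "9". This is why the answer works on strings.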

How to grep two terms at the same time in R

I have a dataframe as follows
chr Type
1 Tum,B,B,Tum
2 B,B
3 Tum,Tum
4 B,B,B,Tum
I would like to only select those rows which have BOTH Tum and B to be inserted into a new dataframe with the following result:
chr Type
1 Tum,B,B,Tum
4 B,B,B,Tum
I have tried the following
PusungMix <- as.data.frame(Pusung[grep("Barr"&"Tum", Pusung$Type])
but I get the error
Error in "Barr" & "Tum" :
operations are possible only for numeric, logical or complex types
We can use two grepl calls to create two logical indexes and combine them with & to find the elements where both are TRUE. The result can be used to subset the rows of 'df1'.
indx <- grepl('B', df1$Type) & grepl('Tum', df1$Type)
df1[indx,]
# chr Type
#1 1 Tum,B,B,Tum
#4 4 B,B,B,Tum
Or, as @Gaurav suggested in the comments, subset is another option if we don't want to use [. We can drop the df1$ inside subset, and we don't have to worry about dropping dimensions: subset has drop=FALSE as its default and returns a data frame here, whereas [ on a data frame drops a single-column result to a vector unless we explicitly specify drop=FALSE.
subset(df1, grepl('B', Type) & grepl('Tum', Type))
Or by pure regex w/o the need of 2 grepl:
indx <- grepl("Tum.*B|B.*Tum", df1$Type)
df1[indx, ]
# chr Type
# 1 1 Tum,B,B,Tum
# 4 4 B,B,B,Tum
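The patterns above match 'B' and 'Tum' anywhere in the string, so a value such as "Barr" would also count as containing B. If whole comma-separated entries should be compared instead, one possible sketch is to split on the commas first. The data frame is reconstructed from the question, and has_both, has_both_re and res are illustrative names.

```r
# Sample data reconstructed from the question.
df1 <- data.frame(
  chr = 1:4,
  Type = c("Tum,B,B,Tum", "B,B", "Tum,Tum", "B,B,B,Tum"),
  stringsAsFactors = FALSE
)

# Split each Type on commas and require both exact tokens to be present.
has_both <- vapply(
  strsplit(df1$Type, ",", fixed = TRUE),
  function(tokens) all(c("B", "Tum") %in% tokens),
  logical(1)
)
res <- df1[has_both, ]

# Substring-based alternative in a single pattern, using lookaheads
# (requires perl = TRUE); equivalent to the two-grepl approach.
has_both_re <- grepl("^(?=.*Tum)(?=.*B)", df1$Type, perl = TRUE)
```

On this data both indexes select rows 1 and 4. They would only diverge if some entry merely contained B or Tum as part of a longer token.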

Resources