Splitting strings in R and extracting information from lists

I have the following row names in my data:
column_01.1
column_01.2
column_01.3
column_02.1
column_02.2
I can split these rownames with the following command:
strsplit(rownames(my_data),split= "\\.")
and get the list:
[[1]]
[1] "column_01" "1"
[[2]]
[1] "column_01" "2"
[[3]]
[1] "column_01" "3"
...
But I want to keep only the first part of each name and completely discard the second part, like this:
column_01
column_01
column_01
column_02
column_02
I have run out of tricks to extract only this part of the information. I've tried some options with unlist() and as.data.frame(), but no luck. Or is there an easier way to split the strings? I do not want to use as.character(substring(rownames(my_data),1,9)) as the location of the "." can change (while it would work for this example).

You can map [ to get the first elements:
sapply(strsplit(rownames(my_data),split= "\\."),'[',1)
...or (better) use regular expressions:
gsub('\\..*$','',rownames(my_data))
(Translation: match a literal dot, everything after it, up to the end of the string, and replace the match with an empty string.)
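As a rough check that the two approaches agree, here is a minimal sketch on made-up row names (the vector rn is an assumption standing in for rownames(my_data), with the dot in different positions). It uses vapply() and sub() instead of sapply() and gsub(), which should behave the same for this input:

```r
# Hypothetical row names; the dot position varies on purpose.
rn <- c("column_01.1", "column_01.2", "col_2.10", "x.3")

# Approach 1: split on a literal dot, then take the first piece of each element.
parts <- strsplit(rn, split = ".", fixed = TRUE)
first_piece <- vapply(parts, `[`, character(1), 1)

# Approach 2: delete everything from the first dot to the end of the string.
stripped <- sub("\\..*$", "", rn)

print(first_piece)
print(identical(first_piece, stripped))
```

vapply() is used here only because it fails loudly if an element does not yield exactly one string; sapply() as in the answer works too.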

Since I like the stringr package, I thought I'd throw this out there:
library(stringr)
str_replace(rownames(my_data), "(^column_.+)\\.\\d+", "\\1")
(I'm not great with regex, so the ^ might be better outside the parentheses)

Related

Extract text in two columns from a string

I have a table where one column has data like this:
table$test_string<- "[projectname](https://somewebsite.com/projectname/Abc/xyz-09)"
1.) I am trying to extract the first part of this string within the square brackets in one column, i.e.
table$project_name <- "projectname"
using the regex:
project_name <- "^\\[|(?:[a-zA-Z]|[0-9])+|\\]$"
table$project_name <- str_extract(table$test_string, project_name)
If I test the regex on a single value (one row) of the table, it works when I use
str_extract_all(table$test_string, project_name[[1]][2]).
However, I get NA when I apply the regex pattern to the whole table and an error if I use str_extract_all.
2.) Second part of the string, which is a URL in another column,
table$url_link <- "https://somewebsite.com/projectname/Abc/xyz-09"
I am using the following regex expression for URL:
url_pattern <- "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
table$url_link <- str_extract(table$test_string, url_pattern)
This works on the whole table, but I still get the final ')' parenthesis in the URL.
What am I missing here? Why does the first regex work on a single value but not on the whole table?
And for the URL, how do I avoid capturing the final parenthesis?
It feels like you could simplify things considerably by using parentheses as capture groups. For example:
test_string<- "[projectname](https://somewebsite.com/projectname/Abc/xyz-09)"
regex <- "\\[(.*)\\]\\((.*)\\)"
gsub(regex, "\\1", test_string)
#> [1] "projectname"
gsub(regex, "\\2", test_string)
#> [1] "https://somewebsite.com/projectname/Abc/xyz-09"
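Because sub() is vectorised, the same capture-group idea can fill both columns of a table in one pass. The sketch below assumes a hypothetical data frame called links with a made-up second row, and swaps the greedy .* for negated character classes so a stray bracket later in the string cannot be swallowed:

```r
# Hypothetical data frame with one Markdown-style link per row.
links <- data.frame(
  test_string = c(
    "[projectname](https://somewebsite.com/projectname/Abc/xyz-09)",
    "[other1](https://somewebsite.com/other1/Def/uvw-10)"
  ),
  stringsAsFactors = FALSE
)

# [^]]* = anything but "]"; [^)]* = anything but ")".
regex <- "^\\[([^]]*)\\]\\(([^)]*)\\)$"

# sub() is vectorised, so each call fills a whole column at once.
links$project_name <- sub(regex, "\\1", links$test_string)
links$url_link     <- sub(regex, "\\2", links$test_string)

print(links[, c("project_name", "url_link")])
```

Note that sub() returns the input unchanged when the pattern does not match, so rows in another format would need a separate check.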
We can make use of convenient functions from qdapRegex
library(qdapRegex)
rm_round(test_string, extract = TRUE)[[1]]
#[1] "https://somewebsite.com/projectname/Abc/xyz-09"
rm_square(test_string, extract = TRUE)[[1]]
#[1] "projectname"
data
test_string<- "[projectname](https://somewebsite.com/projectname/Abc/xyz-09)"

Extract inside of first square brackets

I know there are a few similar questions, but they did not help me, perhaps due to my lack of understanding the basics of string manipulation.
I have a string and want to extract the contents of its first pair of square brackets.
x <- "cons/mod2/det[4]/rost2/rost_act[2]/Q2w5"
I have looked all over the internet to assemble the following code, but it gives me the contents of the second pair of brackets:
sub(".*\\[(.*)\\].*", "\\1", x, perl=TRUE)
The code returns 2. I expect to get 4.
Would appreciate if someone points out the missing piece.
---- update ----
Replacing .* with .*? in the first two instances worked, but I do not know why. I am leaving the question open for someone who can explain why this works:
sub(".*?\\[(.*?)\\].*", "\\1", x, perl=TRUE)
You're almost there:
sub("^[^\\]]*\\[(\\d+)\\].*", "\\1", x, perl=TRUE)
## [1] "4"
The original problem is that .* is greedy: it matches as much as possible before the regex goes on to match [, so the [ that gets matched is the last one in the string. Your fix, .*?, is the lazy (non-greedy, reluctant) version of .*, which matches as little as it can.
That is completely valid. The alternative I used is [^\\]]*, which means "match any run of characters that are not ]".
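To see the three behaviours side by side, here is a small sketch on the string from the question. The third pattern is a variant of the negated-class idea that excludes "[" rather than "]"; that choice is mine, not the answer's:

```r
x <- "cons/mod2/det[4]/rost2/rost_act[2]/Q2w5"

# Greedy: the leading .* grabs as much as it can, so "[" matches the LAST bracket.
greedy <- sub(".*\\[(.*)\\].*", "\\1", x, perl = TRUE)

# Lazy: .*? grabs as little as it can, so "[" matches the FIRST bracket.
lazy <- sub(".*?\\[(.*?)\\].*", "\\1", x, perl = TRUE)

# Negated class: [^\\[]* cannot cross a "[", so it must stop at the first one.
negated <- sub("^[^\\[]*\\[([^\\]]*)\\].*", "\\1", x, perl = TRUE)

print(c(greedy = greedy, lazy = lazy, negated = negated))
```

The negated class gets the same result as the lazy quantifier without relying on backtracking order, which some people find easier to reason about.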
stringr
You can solve this with base R, but I usually prefer the functions from the stringr package when handling such problems.
x <- "cons/mod2/det[4]/rost2/rost_act[2]/Q2w5"
If you want only the first string between brackets, use str_extract:
stringr::str_extract(x, "(?<=\\[).+?(?=\\])")
# [1] "4"
If you want all the strings between brackets, use str_extract_all:
stringr::str_extract_all(x, "(?<=\\[).+?(?=\\])")
# [[1]]
# [1] "4" "2"

R: replacing a table column with a modified version of that column

I am currently using R and have produced a table with 3 columns. The first column contains names that look like "XXX_YYY_ZZZ", and I only want to keep the "XXX" part. I tried gsub but couldn't make it work, so I turned to strapplyc(), which works but produces only one column. I would like to keep my initial table, but with the first column replaced by the strapplyc() output. Any other approach you think would fit better is welcome too!
Thank you in advance.
Since you have not shown any sample data, here is a simple example for testing.
cal1 <- c("XXX_YYY_ZZZ","XXX_YYY_ZZZ")
gsub("_.*","",cal1)
Output will be as follows.
> gsub("_.*","",cal1)
[1] "XXX" "XXX"
Works for me. Here is a regex which looks for three groups of text, separated by underscores. The ^ indicates start of string and $ indicates end of string. I capture first (\\1) group, but there's nothing stopping you from capturing \\2, \\3 or even \\1\\3.
gsub("^(.*)_(.*)_(.*)$", "\\1", "XXX_YYY_ZZZ")
[1] "XXX"
You could also use strsplit.
> strsplit("XXX_YYY_ZZZ", "_")[[1]][1]
[1] "XXX"
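None of the snippets above writes the result back into the table, which is what the question asks for. A minimal sketch, assuming a hypothetical three-column data frame called tab whose first column is named name:

```r
# Hypothetical three-column table; only the first column needs trimming.
tab <- data.frame(
  name  = c("XXX_YYY_ZZZ", "AAA_BBB_CCC"),
  value = c(1.5, 2.5),
  group = c("a", "b"),
  stringsAsFactors = FALSE
)

# Overwrite the first column in place; the other columns are untouched.
tab$name <- sub("_.*", "", tab$name)

print(tab)
```

If the column name is not known in advance, tab[[1]] <- sub("_.*", "", tab[[1]]) should do the same by position.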

Using tidyr package in R, are we able to filter and extract links from a tibble?

Let's say I have this in my tibble,
Transcript
1 Hi i would like to find out more about http://mywebsite.com/internalfaq/faq/154200 please help
2 Hello my results were withheld at https://mywebsite.com/123 hope you can help
3 Hello my friend join me at https://mywebsite.com/456
I tried
links = data %>%
extract(Transcript, url.pattern)
but it's not giving me what I want. It's not returning me the list of links even though I supply the url pattern. It returns me the first word only. Is there something wrong here that I did?
Thanks in advance!
This is my url pattern: https://mywebsite.com/.*
The into argument of extract() must be specified. Also, try adding parentheses to your regex so that it contains a capture group.
url.pattern <- "(https://mywebsite.com/[^> | ]*)"
data %>%
extract(Transcript, into = 'link',regex = url.pattern)
You can use regmatches:
regmatches(h,gregexpr("http.*?(\\d+)",h))
[[1]]
[1] "https://mywebsite.com/internalfaq/faq/154200" "http://mywebsite.com/internalfaq/faq/154200"
[[2]]
[1] "https://mywebsite.com/123" "https://mywebsite.com/123"
[[3]]
[1] "https://mywebsite.com/456"
This gives you the whole URLs. What is h? h is Transcript[,1]. It is a vector and not a data frame.
Since the URLs seem to be repeated, you can obtain only the first one in each element by using regexpr instead of gregexpr:
regmatches(h,regexpr("http.*?(\\d+)",h))
[1] "https://mywebsite.com/internalfaq/faq/154200" "https://mywebsite.com/123"
[3] "https://mywebsite.com/456"
You can also use the sub function with a backreference:
sub("(.*:)(.*\\d+)(.*)","https:\\2",h)
[1] "https://mywebsite.com/internalfaq/faq/154200" "https://mywebsite.com/123"
[3] "https://mywebsite.com/456"
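One thing to watch with regmatches() and regexpr() is that elements without a match are dropped, so the result can be shorter than the input. The sketch below keeps one entry per row by filling in NA. The vector h is a made-up stand-in for the Transcript column, and the pattern (any scheme, then everything up to the next whitespace) is my assumption rather than the one used above:

```r
# Hypothetical transcript vector; the last element has no link on purpose.
h <- c(
  "Hi i would like to find out more about http://mywebsite.com/internalfaq/faq/154200 please help",
  "Hello my results were withheld at https://mywebsite.com/123 hope you can help",
  "No link in this one"
)

# https? allows both schemes; [^[:space:]]+ runs until the next whitespace.
m <- regexpr("https?://[^[:space:]]+", h)

# regmatches() drops non-matching elements, so fill a full-length vector instead.
links <- rep(NA_character_, length(h))
links[m != -1] <- regmatches(h, m)

print(links)
```

A full-length result like this can be assigned straight back as a new column without misaligning rows.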

Prevent grep in R from treating "." as a letter

I have a character vector that contains text similar to the following:
text <- c("ABc.def.xYz", "ge", "lmo.qrstu")
I would like to remove everything before a .:
> "xYz" "ge" "qrstu"
However, the grep function seems to be treating . as a letter:
pattern <- "([A-Z]|[a-z])+$"
grep(pattern, text, value = T)
> "ABc.def.xYz" "ge" "lmo.qrstu"
The pattern works elsewhere, such as on regexpal.
How can I get grep to behave as expected?
grep is for finding a pattern. It returns the indices of the vector elements that match the pattern. If value=TRUE is specified, it returns the matching elements themselves. From the description, it seems that you want to remove a substring rather than return a subset of the initial vector.
If you need to remove the substring, you can use sub
sub('.*\\.', '', text)
#[1] "xYz" "ge" "qrstu"
The first argument is the pattern, '.*\\.'. It matches zero or more characters (.*) followed by a dot (\\.). The \\ is needed to escape the . so that it is treated as a literal dot instead of "any character". Because .* is greedy, the match extends to the last . in the string. We replace the matched text with '' (the replacement argument), which removes that substring.
grep doesn't do any replacements. It searches for matches and returns the indices (or the value if you specify value=T) that give a match. The results you're getting are just saying that those meet your criteria at some point in the string. If you added something that doesn't meet the criteria anywhere into your text vector (for example: "9", "#$%23", ...) then it wouldn't return those when you called grep on it.
If you want it just to return the matched portion you should look at the regmatches function. However for your purposes it seems like sub or gsub should do what you want.
gsub(".*\\.", "", text)
I would suggest reading the help page on regular expressions, ?regex. The Wikipedia page is a decent read as well, but note that R's regexes are a little different from some others. https://en.wikipedia.org/wiki/Regular_expression
You may try str_extract function from stringr package.
str_extract(text, "[^.]*$")
This matches the run of non-dot characters at the end of the string.
Your pattern does work, the problem is that grep does something different than what you are thinking it does.
Let's first use your pattern with str_extract_all from the package stringr.
library(stringr)
str_extract_all(text, pattern ="([A-Z]|[a-z])+$")
[[1]]
[1] "xYz"
[[2]]
[1] "ge"
[[3]]
[1] "qrstu"
Notice that the results came as you expected!
The problem you are having is that grep gives you the complete element that matches your regular expression, not only the matching part of the element. In the example below, grep returns the whole first element because it contains "a":
grep(pattern = "a", x = c("abcdef", "bcdf"), value = TRUE)
[1] "abcdef"
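The difference between the three tools can be sketched in base R alone, using the vector from the question. The character class [A-Za-z] is my shorthand for the question's ([A-Z]|[a-z]) and should match the same text here:

```r
text <- c("ABc.def.xYz", "ge", "lmo.qrstu")
pattern <- "[A-Za-z]+$"

# grep(): which elements contain a match? Returns whole elements with value = TRUE.
whole <- grep(pattern, text, value = TRUE)

# regmatches() + regexpr(): return only the matched portion of each element.
matched <- regmatches(text, regexpr(pattern, text))

# sub(): delete everything up to and including the last dot.
trimmed <- sub(".*\\.", "", text)

print(whole)
print(matched)
print(trimmed)
```

Here regmatches() and sub() agree, while grep() simply hands back the input because every element ends in letters.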
