R-- Extract specific text and the subsequent text/numbers - r

I have a column with text strings in it, and I would like to extract not just a specific string, but also the string or number/s following this specified string. What is a good solution for this?
In the example below, I would like to create a column "extract" and use str_extract to pull out the words "lot" and "unit" together with the numbers that follow them.
id  notes                                        extract
1   LOT 56, STRATA TITLE, 56/SP77100,            LOT 56
2   18/SP71866, COMMERCIAL, 17/SP71866, lot 18   lot 18
3   unit 9; 3R/PS732002                          unit 9
4   V1602 F63, Section 8 Block 68 Unit 3         Unit 3
Have looked at a lot of regex code but nothing helpful to find how to extract subsequent values from the specified target text string.
Tried this so far from another StackOverflow problem-
result <- table %>%
mutate(extract = str_extract(notes, "(?lot\\s)\\W\\s?\\d+\\")) %>%
mutate(lot = str_squish(lot))

You can use
str_extract(notes, "(?i)\\b(?:lot|unit)\\W*\\d+")
See the regex demo.
Details
(?i) - case insensitive flag
\b - a word boundary
(?:lot|unit) - either lot or unit
\W* - any zero or more non-word chars
\d+ - one or more digits.
R test:
library(dplyr)
library(stringr)
df <- data.frame(notes=c("LOT 56, STRATA TITLE, 56/SP77100,","18/SP71866, COMMERCIAL, 17/SP71866, lot 18","unit 9; 3R/PS732002", "V1602 F63, Section 8 Block 68 Unit 3"))
df %>%
  mutate(extract = str_extract(notes, "(?i)\\b(?:lot|unit)\\W*\\d+"))
                                       notes extract
1          LOT 56, STRATA TITLE, 56/SP77100,  LOT 56
2 18/SP71866, COMMERCIAL, 17/SP71866, lot 18  lot 18
3                        unit 9; 3R/PS732002  unit 9
4       V1602 F63, Section 8 Block 68 Unit 3  Unit 3
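If the keyword and the number are needed as separate columns, capture groups can pull them apart. The sketch below assumes stringr's str_match(), which returns a character matrix with the full match in the first column and one column per group. The names parts, keyword and number are made up for illustration.

```r
library(stringr)

notes <- c("LOT 56, STRATA TITLE, 56/SP77100,",
           "18/SP71866, COMMERCIAL, 17/SP71866, lot 18",
           "unit 9; 3R/PS732002",
           "V1602 F63, Section 8 Block 68 Unit 3")

# Same pattern as above, but with two capturing groups:
# group 1 = the keyword, group 2 = the digits that follow it
m <- str_match(notes, "(?i)\\b(lot|unit)\\W*(\\d+)")

parts <- data.frame(
  extract = m[, 1],              # full match, e.g. "LOT 56"
  keyword = m[, 2],              # "LOT", "lot", "unit", "Unit"
  number  = as.integer(m[, 3]),  # 56, 18, 9, 3
  stringsAsFactors = FALSE
)
parts
```

Rows without a match would come back as NA in all three columns.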

A base R option using regmatches
transform(
  df,
  extract = unlist(regmatches(notes, gregexpr("\\b(lot|unit)\\s\\d+", notes, ignore.case = TRUE)))
)
gives
  id                                      notes extract
1  1           LOT 56, STRATA TITLE, 56/SP77100  LOT 56
2  2 18/SP71866, COMMERCIAL, 17/SP71866, lot 18  lot 18
3  3                        unit 9; 3R/PS732002  unit 9
4  4       V1602 F63, Section 8 Block 68 Unit 3  Unit 3
Data
> dput(df)
structure(list(id = 1:4, notes = c("LOT 56, STRATA TITLE, 56/SP77100",
"18/SP71866, COMMERCIAL, 17/SP71866, lot 18", "unit 9; 3R/PS732002",
"V1602 F63, Section 8 Block 68 Unit 3")), class = "data.frame", row.names = c(NA,
-4L))
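One caveat with unlist() over gregexpr() results: it only lines up with the rows when every note contains exactly one match. A minimal sketch for data where some rows may have no match, using regexpr() (first match only) and leaving NA in the gaps. The fifth note is a made-up example.

```r
notes <- c("LOT 56, STRATA TITLE, 56/SP77100",
           "18/SP71866, COMMERCIAL, 17/SP71866, lot 18",
           "unit 9; 3R/PS732002",
           "V1602 F63, Section 8 Block 68 Unit 3",
           "no keyword in this note")  # hypothetical row without a match

# regexpr() returns the position of the first match, or -1 if there is none
m <- regexpr("\\b(lot|unit)\\W*\\d+", notes, ignore.case = TRUE, perl = TRUE)

# regmatches() only returns the matched strings, so fill them into
# a vector that is pre-populated with NA
extract <- rep(NA_character_, length(notes))
extract[m > 0] <- regmatches(notes, m)
extract
```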

Related

Extract portion of nested tibbles?

I've got a pile of nested tibbles that are from the tidyrss package. The data looks like this:
What I'm trying to do is take the four common items from each tibble and tidy them, so that the output looks like this:
item_title            item_link  item_description  item_pub_date
title from article 1  some url   longer text       posix date
title from article 2  some url   longer text       posix date
title from article 3  some url   longer text       posix date
title from article 4  some url   longer text       posix date
Thus far I've tried unlist() and deframe() and both of those just make a general mess of things - and an added twist is not all the list items are tibbles. Some are functions, and I want to ignore those. What's the best tidyverse approach to tackle this task?
map_dfr seems to do what you want! It loops over a list and applies a function to each one - in this case, the only "function" we want to apply is returning the data frame/tibble, but that also allows us to skip the functions:
library(purrr)
clean_feed_df <- list(
  data.frame(item_title=sample(letters, 3),
             item_link=sample(letters, 3),
             item_desc=sample(letters, 3),
             item_date=sample(letters, 3)),
  data.frame(item_title=as.character(sample(1:100, 5)),
             item_link=as.character(sample(1:100, 5)),
             item_desc=as.character(sample(1:100, 5)),
             item_date=as.character(sample(1:100, 5))),
  function(x)sum(x)
)
map_dfr(clean_feed_df, function(rssentry){
  # return data frames unchanged; anything else yields NULL and is dropped
  if(is.data.frame(rssentry)){
    return(rssentry)
  }
})
which returns
item_title item_link item_desc item_date
1 s s u i
2 x d o x
3 t x d h
4 40 51 21 91
5 4 25 37 34
6 5 44 18 71
7 65 70 83 90
8 32 85 76 89
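The same idea can also be written as "filter first, then bind". This is only a sketch, assuming purrr 1.0.0 or later, where list_rbind() is available. The list feed and its contents are invented for the example.

```r
library(purrr)

# Hypothetical feed: two data frames with a function mixed in
feed <- list(
  data.frame(item_title = c("a", "b"), item_link = c("u1", "u2"),
             stringsAsFactors = FALSE),
  function(x) sum(x),
  data.frame(item_title = "c", item_link = "u3",
             stringsAsFactors = FALSE)
)

# keep() drops every element that is not a data frame (tibbles pass too)
tibbles_only <- keep(feed, is.data.frame)

# list_rbind() stacks the remaining data frames row-wise
tidy_feed <- list_rbind(tibbles_only)
tidy_feed
```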

Separate character variable into two columns

I have scraped some data from a URL to analyse cycling results. Unfortunately the name column consists of the rider's name and the name of the team in one field. I would like to separate these from each other. Here's the code (the last part doesn't work):
library(rvest)
library(tidyr)
#get url
stradebianchi_2020 <- read_html("https://www.procyclingstats.com/race/strade-bianche/2020/result")
#scrape table
results_2020 <- stradebianchi_2020 %>%
  html_nodes("td") %>%
  html_text()
#transpose scraped data into dataframe
results_stradebianchi_2020 <- as.data.frame(t(matrix(results_2020, 8, byrow = FALSE)))
#rename
names(results_stradebianchi_2020) <- c("rank", "#", "name", "age", "team", "UCI point", "PCS points", "time")
#split rider from team
separate(data = results_stradebianchi_2020, col = name, into = c("left", "right"), sep = " ")
I think the best option is to get the team variable name and use that name to remove it from the 'name' column.
All suggestions are welcome!
I think your request is wrongly formulated. You want to remove team from name.
That's how you should do it in my opinion:
results_stradebianchi_2020 %>%
mutate(name = stringr::str_remove(name, team))
Write this instead of your line with separate.
In this case separate is not an optimal solution for you because the separation character is not clearly defined.
Also, I would advise you to remove the initial blanks from name with stringr::str_trim(name)
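One thing to watch: str_remove() treats team as a regular expression, so a team name containing characters such as + or ( could fail to match. A hedged sketch that wraps the pattern in stringr::fixed() to match it literally and trims in the same step. The riders data frame, including the second row, is invented for illustration.

```r
library(stringr)

# Hypothetical rows: the name field has the team name glued on
riders <- data.frame(
  name = c("  van Aert WoutTeam Jumbo-Visma", "Example RiderTeam A+B (Dev)"),
  team = c("Team Jumbo-Visma", "Team A+B (Dev)"),
  stringsAsFactors = FALSE
)

# fixed() makes each team name a literal string rather than a regex;
# str_remove() is vectorised, so row i of team is removed from row i of name
riders$name <- str_trim(str_remove(riders$name, fixed(riders$team)))
riders
```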
You could do this in base R with gsub and replace in the name column the pattern of team column with "", i.e. nothing. We use apply() with MARGIN=1 to go through the data frame row by row. Finally we use trimws to clean from whitespace (where we change to whitespace="[\\h\\v]" for better matching the spaces).
res <- transform(results_stradebianchi_2020,
name=trimws(apply(results_stradebianchi_2020, 1, function(x)
gsub(x["team"], "", x["name"])), whitespace="[\\h\\v]"))
head(res)
# rank X. name age team UCI.point PCS.points time
# 1 1 201 van Aert Wout 25 Team Jumbo-Visma 300 200 4:58:564:58:56
# 2 2 234 Formolo Davide 27 UAE-Team Emirates 250 150 0:300:30
# 3 3 87 Schachmann Maximilian 26 BORA - hansgrohe 215 120 0:320:32
# 4 4 111 Bettiol Alberto 26 EF Pro Cycling 175 100 1:311:31
# 5 5 44 Fuglsang Jakob 35 Astana Pro Team 120 90 2:552:55
# 6 6 7 Štybar Zdenek 34 Deceuninck - Quick Step 115 80 3:593:59

Splitting complex string between symbols R

I have a dataset full of IDs and qualification strings. My issue with this is twofold:
1. how to deal with splits between different symbols, and
2. how to iterate the output down a data frame whilst retaining an ID.
ID <- c(1,2,3)
Qualstring <- c("LE:Science = 45 Distinctions",
"A:Chemistry = A A:Biology = A A:Mathematics = A",
"A:Biology = A A:Chemistry = A A:Mathematics = A B:Baccalaureate Advanced Diploma = Pass"
)
s <- data.frame(ID, Qualstring)
The desired output would be:
ID Qualification Subject Grade
1 1 LE: Science 45 Distinctions
2 2 A: Chemistry A
3 2 A: Biology A
4 2 A: Mathematics A
5 3 A: Biology A
6 3 A: Chemistry A
7 3 A: Mathematics A
8 3 WB: Welsh Baccalaureate Advanced Diploma Pass
The commonality of the splits is the ":" and "=", and the codes/words around those.
From my perspective the problem looks complex, and I wonder whether continuing to fudge it in Excel is ultimately the way to go for data structured like this. I would love to know otherwise, if there are any recommendations or direction.
A solution using data.table and stringr. The use of data.table is just for my personal convenience, you could use data.frame with do.call(rbind,.) instead of rbindlist()
library(stringr)
qual <- str_extract_all(s$Qualstring,"[A-Z]+(?=\\:)")
subject <- str_extract_all(s$Qualstring,"(?<=\\:)[\\w ]+")
grade <- str_extract_all(s$Qualstring,"(?<=\\= )[A-z0-9]+")
library(data.table)
df <- lapply(seq(s$ID),function(i){
N = length(qual[[i]])
data.table(ID = rep(s[i,"ID"],N),
Qualification = qual[[i]],
Subject = subject[[i]],
Grade = grade[[i]]
)
}) %>% rbindlist()
ID Qualification Subject Grade
1: 1 LE Science 45
2: 2 A Chemistry A
3: 2 A Biology A
4: 2 A Mathematics A
5: 3 A Biology A
6: 3 A Chemistry A
7: 3 A Mathematics A
8: 3 B Baccalaureate Advanced Diploma Pass
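In short, I use a positive lookbehind (?<=) and a positive lookahead (?=). [A-Z]+ matches a run of uppercase letters, [\\w ]+ a run of word characters and spaces, and [A-z0-9]+ letters (upper and lower case) and digits. Note that the range A-z also covers the few punctuation characters that sit between Z and a in ASCII, and that the grade pattern stops at the first space, which is why row 1 shows 45 rather than 45 Distinctions. str_extract_all gives a list with all the matches for each element of the character vector tested.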
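An alternative sketch in base R that keeps multi-word grades such as "45 Distinctions": first cut each string into whole "Code:Subject = Grade" records, then split every record into its three parts. It assumes that each record starts with an uppercase code followed by a colon and that " = " appears exactly once per record. The names records and out are made up.

```r
ID <- c(1, 2, 3)
Qualstring <- c("LE:Science = 45 Distinctions",
                "A:Chemistry = A A:Biology = A A:Mathematics = A",
                "A:Biology = A A:Chemistry = A A:Mathematics = A B:Baccalaureate Advanced Diploma = Pass")
s <- data.frame(ID, Qualstring, stringsAsFactors = FALSE)

# One record runs from "CODE:" up to (but not including) the next " CODE:",
# or to the end of the string
pattern <- "[A-Z]+:[^=]+= .*?(?= [A-Z]+:|$)"
records <- regmatches(s$Qualstring, gregexpr(pattern, s$Qualstring, perl = TRUE))

rec <- unlist(records)
out <- data.frame(
  ID            = rep(s$ID, lengths(records)),           # repeat ID once per record
  Qualification = sub(":.*$", "", rec),                   # text before the colon
  Subject       = sub("^[A-Z]+:([^=]+) = .*$", "\\1", rec), # between ":" and " = "
  Grade         = sub("^.* = ", "", rec),                 # everything after " = "
  stringsAsFactors = FALSE
)
out
```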

How do I add a leading numeric identifier (not necessarily zero) to a character string in r

I apologize if this is a duplicate, I've searched through all of the "add leading zero" content I can find, and I'm struggling to find a solution I can work with. I have the following:
siteid<-c("1","11","111")
modifier<-c("44","22","11")
df<-data.frame(siteid,modifier)
and I want a modified siteid that is always six (6) characters long with zeroes to fill the gaps. The Site ID can vary in nchar from 1-3, the modifier is always a length of 2, and the number of zeroes can vary depending on the length of the site ID (so that 6 is always the final modified length).
I would like the following final output:
df
# siteid modifier mod.siteid
#1 1 44 440001
#2 11 22 220011
#3 111 11 110111
Thanks for any suggestions or direction. This could also be numeric, but it seems like character manipulation has more options...?
The vocabulary here is "left pad" and "paste". Here is one way using sprintf():
df$mod.siteid <- with(df, sprintf("%s%04d", modifier, as.integer(siteid)))
# Note:
# code simplified thanks to suggestion by Maurits.
Output:
siteid modifier mod.siteid
1 1 44 440001
2 11 22 220011
3 111 11 110111
Data:
df <- data.frame(
siteid = c("1", "11", "111"),
modifier = c("44", "22", "11"),
stringsAsFactors = FALSE
)
Extra: If you don't want to left pad with 0, then using the stringi package is one option: with(df, paste0(modifier, stringi::stri_pad_left(siteid, 4, "q")))
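The padding can also be sketched in base R without converting siteid to a number, which helps if an ID is not purely numeric. This assumes R 3.3.0 or later (for strrep()) and that no site ID is longer than four characters, since strrep() errors on a negative count. The variable pad is just an illustrative name.

```r
df <- data.frame(
  siteid = c("1", "11", "111"),
  modifier = c("44", "22", "11"),
  stringsAsFactors = FALSE
)

# Number of fill characters needed so that the site ID part is 4 wide
pad <- strrep("0", 4 - nchar(df$siteid))

# modifier (2 chars) + padding + site ID = 6 characters in total
df$mod.siteid <- paste0(df$modifier, pad, df$siteid)
df
```

Swapping "0" for another single character changes the fill without touching the rest.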
siteid<-c("1","11","111")
modifier<-c("44","22","11")
df<-data.frame(siteid,modifier, stringsAsFactors = FALSE)
df$mod.siteid = paste0( df$modifier,
formatC( as.numeric(df$siteid), width = 4, format = "d", flag="0") )
df
# siteid modifier mod.siteid
# 1 1 44 440001
# 2 11 22 220011
# 3 111 11 110111

Parsing in a R Data Frame by columns, not by sep

I have Two Line Element data (https://celestrak.org/NORAD/elements/) that I have joined into single lines, resulting in 1,000s of rows of 160 numbers and characters. Unlike a CSV there are no separators. Using R, how do I parse the data into the correct column widths? Here is an example of the data, and some of the first columns.
1 00011U 59001A 18243.16403752 .00000112
123456789012345678901234567890
... col# content
1 01–01 Line number,example - 1
2 03–07 Satellite number, example - 25544
3 08–08 Classification (U=Unclassified), example - U
4 10–11 Intl Designator (Last two digits of launch year), example - 98
5 12–14 Intl Designator (Launch number - year),example - 067
6 15–17 Intl Designator (piece of the launch), example - A
thank you so much in advance
You can parse these kinds of "fixed width format" files in R using read.fwf(). You have to specify the width of each column. I'm having a little bit of trouble matching your example data to the column descriptions you provided, but this mostly works:
read.fwf(
textConnection("1 00011U 59001A 18243.16403752 .00000112"),
widths = c(2, 5, 2, 2, 3, 4),
# Just reading everything as a string for the moment
colClasses = "character"
)
Output:
V1 V2 V3 V4 V5 V6
1 1 00011 U 59 001 A 18
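When the specification is given as start and end positions, as in the question's table, substring() can use those positions directly. This is a rough sketch that assumes the line still has its original fixed spacing (the piece field padded to three characters), which the pasted sample appears to have lost. The column names are invented.

```r
# Assumed original spacing: three blanks after "A" so the epoch starts at column 19
lines <- c("1 00011U 59001A   18243.16403752  .00000112")

# Start and end columns taken from the table in the question
first <- c(1, 3, 8, 10, 12, 15)
last  <- c(1, 7, 8, 11, 14, 17)

# Cut every line at each (first, last) pair and strip the padding
cols <- Map(function(f, l) trimws(substring(lines, f, l)), first, last)
names(cols) <- c("line_number", "satellite_number", "classification",
                 "launch_year", "launch_number", "launch_piece")

parsed <- as.data.frame(cols, stringsAsFactors = FALSE)
parsed
```

Everything stays a character string here, so leading zeros such as "00011" survive.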
