How to remove data after certain characters - r

I need to know how to remove all characters from a value after the leading letter D and the digits that follow it. I am not sure how to start.
I have a data frame and I have a column of type Character
The column is called " Eircode "
The postal codes go from D01 to D24 ( these are Dublin postal codes )
The values are inputted like so
What you see in red is what needs to be removed.
I need to be able to remove the characters after the last digit.
My dataframe is called "MainSchools"
So if the " Eircode " is D03P820, I need to have it as D03 after my change.
I would preferably like to be able to do this with the Tidyverse package if possible.

You may use sub here:
df <- data.frame(Eircode=c("D15P820", "K78YD27", "D03P820"),
stringsAsFactors=FALSE)
df$Eircode <- sub("^(D(?:0[1-9]|1[0-9]|2[0-4])).*$", "\\1", df$Eircode)
df
Eircode
1 D15
2 K78YD27
3 D03
The regex pattern used above matches and captures Dublin postal codes as follows:
D match D
(?:
0[1-9] followed by 01-09
| OR
1[0-9] 10-19
| OR
2[0-4] 20-24
)
Then, we use \1 as the replacement in sub, leaving behind only the 3 character Dublin postal code.
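As a quick check of the pattern described above, here is a minimal base R sketch. The sample codes (including the out-of-range D25XY99) are made up for illustration, and perl = TRUE is passed only so that the non-capturing group is handled by the PCRE engine.

```r
# Hypothetical sample values: two Dublin codes, one out-of-range D code,
# and one non-Dublin code.
codes <- c("D03P820", "D24AB12", "D25XY99", "K78YD27")

# Capture D01-D24 at the start and drop everything after it.
pattern <- "^(D(?:0[1-9]|1[0-9]|2[0-4])).*$"
trimmed <- sub(pattern, "\\1", codes, perl = TRUE)

print(trimmed)
# [1] "D03"     "D24"     "D25XY99" "K78YD27"
```

Values that do not match the pattern are returned unchanged, which is why D25XY99 and K78YD27 keep their full length.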

I like to use the stringr package for such operations.
library(dplyr)
library(stringr)
df %>% mutate(Eircode = str_extract(Eircode, '^[A-Z][0-9]{2}'))
output with the data from @Tim Biegeleisen:
Eircode
1 D15
2 K78
3 D03
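Note that the output above also shortens the non-Dublin code K78YD27. If such values should stay untouched, one possible sketch (an assumption about the desired behaviour, not something stated in the question) is to combine the stricter Dublin pattern with stringr::str_replace:

```r
library(dplyr)
library(stringr)

df <- data.frame(Eircode = c("D15P820", "K78YD27", "D03P820"),
                 stringsAsFactors = FALSE)

# Keep only the three-character prefix for D01-D24. Anything else does not
# match the pattern and is therefore returned unchanged.
dublin <- "^(D(?:0[1-9]|1[0-9]|2[0-4])).*$"
out <- df %>% mutate(Eircode = str_replace(Eircode, dublin, "\\1"))

out$Eircode
# [1] "D15"     "K78YD27" "D03"
```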

Related

Extract words enclosed within asterisks in a column in R

I have a dataframe, col1 contains text and within the text there are words enclosed by double asterisks. I want to extract all of these words and put them in another column called col2. If there is more than 1 word enclosed with double asterisks, I would like them to be separated by a comma. col2 in the example shows the desired result.
col1<-c("**sometimes** i code in python",
"I like **walks** in the park",
"I **often** do **exercise**")
col2<-c("**sometimes**","**walks**","**often**,**exercise**")
df<-data.frame(col1, col2, stringsAsFactors = FALSE)
Can anyone suggest a solution?
You may use stringr::str_match_all -
df$col3 <- sapply(stringr::str_match_all(df$col1, '(\\*+.*?\\*+)'),
function(x) toString(x[, 2]))
df
# col1 col2 col3
#1 **sometimes** i code in python **sometimes** **sometimes**
#2 I like **walks** in the park **walks** **walks**
#3 I **often** do **exercise** **often**,**exercise** **often**, **exercise**
* has a special meaning in regex. Here we want to match a literal *, so we escape it with \\. We extract every value that is enclosed by one or more * on each side.
str_match_all returns a list of matrices. We are interested in the capture group between (...), which is the 2nd column, hence x[, 2]. Finally, when there is more than one value, we collapse them into one comma-separated string using toString.
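For readers without stringr, the same idea can be sketched in base R with gregexpr and regmatches. This is an alternative formulation, not the answer's own code; perl = TRUE is used so that the lazy quantifier is handled by PCRE.

```r
col1 <- c("**sometimes** i code in python",
          "I like **walks** in the park",
          "I **often** do **exercise**")

# gregexpr finds every match per string; regmatches pulls out the text.
matches <- regmatches(col1, gregexpr("\\*+.*?\\*+", col1, perl = TRUE))

# Collapse each element of the list into one comma-separated string.
col3 <- vapply(matches, toString, character(1))

print(col3)
# [1] "**sometimes**"           "**walks**"               "**often**, **exercise**"
```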
You can use str_extract_all:
library(stringr)
library(dplyr)
df %>%
mutate(col2 = str_extract_all(col1, "\\*\\*[^* ]+\\*\\*"))
col1 col2
1 **sometimes** i code in python **sometimes**
2 I like **walks** in the park **walks**
3 I **often** do **exercise** **often**, **exercise**
How the regex works:
\\*\\* matches two asterisks
[^* ]+ matches any character occurring one or more times which is not a literal * and not a whitespace
\\*\\* matches two asterisks
If you don't need the asterisks in col2, then this is how you can extract the strings without them:
df %>%
mutate(col2 = str_extract_all(col1, "(?<=\\*\\*)[^* ]+(?=\\*\\*)"))
col1 col2
1 **sometimes** i code in python sometimes
2 I like **walks** in the park walks
3 I **often** do **exercise** often, exercise
How this regex works:
(?<=\\*\\*): positive lookbehind asserting that there must be two asterisks to the left
[^* ]+ matches any character occurring one or more times which is not a literal * and not a whitespace
(?=\\*\\*) positive lookahead asserting that there must be two asterisks to the right
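One thing to keep in mind is that str_extract_all returns a list, so col2 above is a list column. If a plain character column with comma-separated values is wanted (as in the question), a minimal sketch could collapse each list element with paste; the variable names here are only illustrative.

```r
library(stringr)

col1 <- c("**sometimes** i code in python",
          "I like **walks** in the park",
          "I **often** do **exercise**")

# One character vector of matches per input string.
found <- str_extract_all(col1, "(?<=\\*\\*)[^* ]+(?=\\*\\*)")

# Collapse each vector into a single comma-separated string.
col2 <- vapply(found, paste, character(1), collapse = ",")

print(col2)
# [1] "sometimes"      "walks"          "often,exercise"
```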

Removing a tweet/row if it contains any non-english word

I want to remove the whole tweet or a row from a data-frame if it contains any non-english word.
My data-frame looks like
text
1 | morning why didnt i go to sleep earlier oh well im seEING DNP TODAY!!
JIP UHH <f0><U+009F><U+0092><U+0096><f0><U+009F><U+0092><U+0096>
2 | #natefrancis00 #SimplyAJ10 <f0><U+009F><U+0098><U+0086><f0><U+009F
<U+0086> if only Alan had a Twitter hahaha
3 | #pchirsch23 #The_0nceler #livetennis Whoa whoa let’s not take this too
far now
4 | #pchirsch23 #The_0nceler #livetennis Well Pat that’s just not true
5 | One word #Shame on you! #Ji allowing looters to become president
The expected dataframe should be like this:
text
3 | #pchirsch23 #The_0nceler #livetennis Whoa whoa let’s not take this too
far now
4 | #pchirsch23 #The_0nceler #livetennis Well Pat that’s just not true
5 | One word #Shame on you! #Ji allowing looters to become president.
You want to preserve the alphanumeric characters along with some punctuation such as #, ! etc.
If your column consists mainly of <unicode> escapes, then this should do:
For a data frame df with a text column, using grep:
new_str <- grep(df$text, pattern = "<[^>]+>", value = TRUE, invert = TRUE)
new_str[new_str != ""]
To put the result back into your original text column, you can work with the indices that you need and set the others to NA:
idx <- grep(df$text, pattern = "<[^>]+>", invert = TRUE)
df$text[-idx] <- NA
For cleaning the tweets themselves, you can use the gsub function; refer to the post "cleaning tweet in R".
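If the goal is to drop the rows entirely rather than set them to NA, a minimal sketch with grepl could look like this. The three sample tweets are shortened stand-ins for the data in the question, and the pattern "<[^>]+>" (any text wrapped in angle brackets) is an assumption about how the non-English bytes show up.

```r
df <- data.frame(
  text = c("JIP UHH <f0><U+009F><U+0092><U+0096>",
           "Well Pat that's just not true",
           "One word #Shame on you!"),
  stringsAsFactors = FALSE
)

# TRUE for rows that contain at least one <...> escape sequence.
has_escape <- grepl("<[^>]+>", df$text)

# Keep only the rows without such sequences; drop = FALSE keeps a data frame.
clean <- df[!has_escape, , drop = FALSE]

print(clean$text)
# [1] "Well Pat that's just not true" "One word #Shame on you!"
```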

extracting names and numbers using regex

I think I might have some issues with understanding the regular expressions in R.
I need to extract phone numbers and names from a sample vector and create a data-frame with corresponding columns for names and numbers using stringr package functionality.
The following is my sample vector.
phones <- c("Ann 077-789663", "Johnathan 99656565",
"Maria2 099-65-6569 office")
The code that I came up with to extract those is as follows
numbers <- str_remove_all(phones, pattern = "[^0-9]")
numbers <- str_remove_all(numbers, pattern = "[a-zA-Z]")
numbers <- trimws(numbers)
names <- str_remove_all(phones, pattern = "[A-Za-z]+", simplify = T)
phones_data <- data.frame("Name" = names, "Phone" = numbers)
It doesn't work, as it takes the digit in the name and joins with the phone number. (not optimal code as well)
I would appreciate some help in explaining the simplest way for accomplishing this task.
Not a regex expert, however with stringr package we can extract a number pattern with optional "-" in it and replace the "-" with empty string to extract numbers without any "-". For names, we extract the first word at the beginning of the string.
library(stringr)
data.frame(Name = str_extract(phones, "^[A-Za-z]+"),
Number = gsub("-","",str_extract(phones, "[0-9]+[-]?[0-9]+[-]?[0-9]+")))
# Name Number
#1 Ann 077789663
#2 Johnathan 99656565
#3 Maria 099656569
If you want to stick completely with stringr we can use str_replace_all instead of gsub
data.frame(Name = str_extract(phones, "[A-Za-z]+"),
Number=str_replace_all(str_extract(phones, "[0-9]+[-]?[0-9]+[-]?[0-9]+"), "-",""))
# Name Number
#1 Ann 077789663
#2 Johnathan 99656565
#3 Maria 099656569
I think Ronak's answer is good for the name part, I don't really have a good alternative to offer there.
For numbers, I would go with "numbers and hyphens, with a word boundary at either end", i.e.
numbers = str_extract(phones, "\\b[-0-9]+\\b") %>%
str_remove_all("-")
# Can also specify that you need at least 5 numbers/hyphens
# in a row to match
numbers2 = str_extract(phones, "\\b[-0-9]{5,}\\b") %>%
str_remove_all("-")
That way, you're not locked into a fixed format for the number of hyphens that appear in the number (my suggested regex allows for any number).
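Putting the name and number parts together, the whole task can also be sketched in base R. This is only one possible arrangement, and it assumes every entry contains both a leading name and a number, because regmatches silently drops elements without a match.

```r
phones <- c("Ann 077-789663", "Johnathan 99656565",
            "Maria2 099-65-6569 office")

# The leading run of letters is taken as the name.
name <- regmatches(phones, regexpr("^[A-Za-z]+", phones))

# First run of 5+ digits/hyphens between word boundaries, hyphens stripped.
raw <- regmatches(phones, regexpr("\\b[-0-9]{5,}\\b", phones, perl = TRUE))
number <- gsub("-", "", raw, fixed = TRUE)

phones_data <- data.frame(Name = name, Phone = number,
                          stringsAsFactors = FALSE)
print(phones_data)
```

The digit in "Maria2" is not picked up as a phone number because there is no word boundary between "a" and "2".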
If you (like me) prefer to use base-R and want to keep the regex as simple as possible you could do something like this:
phone_split <- lapply(
strsplit(phones, " "),
function(x) {
name_part <- grepl("[^-0-9]", x)
c(
name = paste(x[name_part], collapse = " "),
phone = x[!name_part]
)
}
)
phone_split
[[1]]
name phone
"Ann" "077-789663"
[[2]]
name phone
"Johnathan" "99656565"
[[3]]
name phone
"Maria2 office" "099-65-6569"
do.call(rbind, phone_split)
name phone
[1,] "Ann" "077-789663"
[2,] "Johnathan" "99656565"
[3,] "Maria2 office" "099-65-6569"

Rename Dataframe Column Names in R using Previous Column Name and Regex Pattern

I am working in R for the first time and I have been having difficulty renaming column names in a dataframe (Grade.Data). I have a dataset imported from a CSV file that has column names like this:
Student.ID
Grade
Interactive.Exercises.1..Health
Interactive.Exercises.2..Fitness
Quizzes.1..Week.1.Quiz
Quizzes.2..Week.2.Quiz
Case.Studies.1..Case.Study1
Case.Studies.2..Case.Study2
I would like to be able to change the variable names so that they are more simple, i.e. from Interactive.Exercises.1.Health to Interactive.Exercises.1 or Quizzes.1.Week.1.Quiz to Quizzes.1
So far, I have tried this:
grep(".*[0-9]", names(Grade.Data))
But I get this returned:
[1] 3 4 5 6 7 8 9 11 12 13 14 15 16 17 19 20 21 22 23 24 25
Can anyone help me figure out what is going on, and write a better regex expression? Thank you so much.
It seems you want to truncate column names after the first chunk of digits.
You may use the following sub solution:
names(Grade.Data) <- sub("^(.*?\\d+).*$", "\\1", names(Grade.Data))
Details
^ - start of string
(.*?\\d+) - Group 1 (later referred with \1 from the replacement pattern) matching any 0+ chars as few as possible (.*?) and then 1 or more digits (\d+)
.* - any 0+ chars as many as possible
$ - end of string
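To see the effect of this pattern without the full dataset, here is a small sketch on a character vector of column names taken from the question. perl = TRUE is an addition here so that \d and the lazy quantifier are handled by PCRE.

```r
nms <- c("Student.ID", "Grade",
         "Interactive.Exercises.1..Health",
         "Quizzes.2..Week.2.Quiz",
         "Case.Studies.1..Case.Study1")

# Keep everything up to and including the first run of digits.
short <- sub("^(.*?\\d+).*$", "\\1", nms, perl = TRUE)

print(short)
# [1] "Student.ID"              "Grade"
# [3] "Interactive.Exercises.1" "Quizzes.2"
# [5] "Case.Studies.1"
```

Names without any digit, such as Student.ID and Grade, do not match and are left as they are.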
There is nothing wrong with your regex itself. What you are looking for is probably the combination of regexpr, which gets the start position and length of each match, and regmatches, which gets the actual string corresponding to the output of regexpr:
start_end <- regexpr(".*[0-9]", names(Grade.Data))
regmatches(names(Grade.Data), start_end)
# [1] "Interactive.Exercises.1" "Interactive.Exercises.2"
# [3] "Quizzes.1..Week.1" "Quizzes.2..Week.2"
# [5] "Case.Studies.1..Case.Study1"
Adding a question-mark behind the dot-star will make the regex match as few characters as possible, so it will stop after the first numeric value:
start_end <- regexpr(".*?[0-9]", names(Grade.Data))
regmatches(names(Grade.Data), start_end)
# [1] "Interactive.Exercises.1" "Interactive.Exercises.2"
# [3] "Quizzes.1" "Quizzes.2"
# [5] "Case.Studies.1"
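One caveat with this approach is that regmatches only returns the elements that matched, so its result is shorter than the input whenever some names contain no digit. A hedged sketch of how the matches could be written back in place (the sample names and the new_names variable are illustrative):

```r
nms <- c("Student.ID", "Grade",
         "Interactive.Exercises.1..Health",
         "Quizzes.2..Week.2.Quiz",
         "Case.Studies.1..Case.Study1")

# regexpr returns -1 for elements without a match.
start_end <- regexpr(".*?[0-9]", nms, perl = TRUE)

# Overwrite only the names that matched; the rest stay unchanged.
new_names <- nms
new_names[start_end != -1] <- regmatches(nms, start_end)

print(new_names)
```

The resulting vector has the same length as the original, so it could be assigned back with names(Grade.Data) <- new_names.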
You should use the names function. Here is a small example; the vector of names can be as long as you need.
names(x = Grade.Data) <- c("Col1_name", "Col2_name")

Using regular expression in string replacement

I have a broken csv file that I am attempting to read into R and repair using a regular expression.
The reason it is broken is that it contains some fields which include a comma but does not wrap those fields in double quotes. So I have to use a regular expression to find these fields, and wrap them in double quotes.
Here is an example of the data source:
DataField1,DataField2,Price
ID1,Value1,
ID2,Value2,$500.00
ID3,Value3,$1,250.00
So you can see that in the third row, the Price field contains a comma but it is not wrapped in double quotes. This breaks the read.table function.
My approach is to use readLines and str_replace_all to wrap the price with commas in double quotes. But I am not good at regular expressions and stuck.
vector <- readLines(file)
vector_temp <- str_replace_all(vector, ",\\$[0-9]+,\\d{3}\\.\\d{2}", ",\"\\$[0-9]+,\\d{3}\\.\\d{2}\"")
I want the output to be:
DataField1,DataField2,Price
ID1,Value1,
ID2,Value2,$500.00
ID3,Value3,"$1,250.00"
With this format, I can read into R.
Appreciate any help!
lines <- readLines(textConnection(object="DataField1,DataField2,Price
ID1,Value1,
ID2,Value2,$500.00
ID3,Value3,$1,250.00"))
library(stringi)
library(tidyverse)
stri_split_regex(lines, ",", n=3, simplify=TRUE) %>%
as_data_frame() %>%
docxtractr::assign_colnames(1)
## DataField1 DataField2 Price
## 1 ID1 Value1
## 2 ID2 Value2 $500.00
## 3 ID3 Value3 $1,250.00
From there you can use readr::write_csv() or write.csv().
The extra facilities in the stringi or stringr packages do not seem needed. gsub seems perfectly suited for this. You just need to understand capture groups with paired parentheses (brackets to Brits) and the use of the double-backslash_n convention for referring to capture-group matches in the replacement argument:
txt <- "DataField1,DataField2,Price, extra
ID1,Value1, ,
ID2,Value2,$500.00,
ID3,Value3,$1,250.00, o"
vector<- gsub("([$][0-9]{1,3}([,]([0-9]{3})){0,10}([.][0-9]{0,2}))" , "\"\\1\"", readLines(textConnection(txt)) )
> read.csv(text=vector)
DataField1 DataField2 Price extra
1 ID1 Value1
2 ID2 Value2 $500.00
3 ID3 Value3 $1,250.00 o
You are putting quotes around a specific sequence: a dollar sign, one to three digits, up to ten repeated groups of a comma plus three digits, and then a period followed by up to two digits. There might be earlier SO questions about formatting as "currency".
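Applied to the three-column data from the question, the same capture-group idea can be sketched as follows. The pattern here is a slightly stricter variant (it requires exactly two decimal digits), which is an assumption about the price format rather than something stated in the question.

```r
lines <- c("DataField1,DataField2,Price",
           "ID1,Value1,",
           "ID2,Value2,$500.00",
           "ID3,Value3,$1,250.00")

# Wrap each dollar amount in double quotes; \\1 refers to the whole
# parenthesised amount.
quoted <- gsub("([$][0-9]{1,3}(,[0-9]{3})*[.][0-9]{2})", "\"\\1\"", lines)

# The quoted lines can now be parsed as regular CSV.
parsed <- read.csv(text = quoted, stringsAsFactors = FALSE)

print(quoted[4])
# [1] "ID3,Value3,\"$1,250.00\""
```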
Here are some solutions:
1) read.pattern This uses read.pattern in the gsubfn package to read in a file (assumed to be called sc.csv) such that the capture groups, i.e. the parenthesized portions, of the pattern are the fields. This will read in the file and process it all in one step so it is not necessary to use readLines first.
^(.*?), that begins the pattern will match everything from the start until the first comma. Then (.*?), will match to the next comma and finally (.*)$ will match everything else to the end. Normally * is greedy, i.e. it matches as much as it can, but the question mark after it makes it ungreedy. We needed to specify perl=TRUE so that it uses perl regular expressions since by default gsubfn uses tcl regular expressions based on Henry Spencer's regex parser which does not support *? . If you would rather have character columns instead of factor then add the as.is=TRUE argument to read.pattern.
The final line of code removes the $ and , characters from the Price column and converts it to numeric. (Omit this line if you actually want it formatted.)
library(gsubfn)
DF <- read.pattern("sc.csv", pattern = "^(.*?),(.*?),(.*)$", perl = TRUE, header = TRUE)
DF$Price <- as.numeric(gsub("[$,]", "", DF$Price)) ##
giving:
> DF
DataField1 DataField2 Price
1 ID1 Value1 NA
2 ID2 Value2 500
3 ID3 Value3 1250
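If installing gsubfn is not an option, the capture-group idea from (1) can be approximated in base R with regexec and regmatches. This is a sketch of the same pattern, not a drop-in replacement for read.pattern; the lines are typed in directly instead of being read from sc.csv.

```r
lines <- c("DataField1,DataField2,Price",
           "ID1,Value1,",
           "ID2,Value2,$500.00",
           "ID3,Value3,$1,250.00")

# Each list element holds the full match followed by the three groups.
m <- regmatches(lines, regexec("^(.*?),(.*?),(.*)$", lines, perl = TRUE))

# Drop the full match and stack the groups into a character matrix.
fields <- do.call(rbind, lapply(m, function(x) x[-1]))

# First row is the header, the rest is data.
DF <- as.data.frame(fields[-1, , drop = FALSE], stringsAsFactors = FALSE)
names(DF) <- fields[1, ]

print(DF$Price)
# [1] ""          "$500.00"   "$1,250.00"
```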
2) sub This uses a very simple regular expression (just a single character match) and no packages. Using vector as defined in the question, this replaces the first two commas with semicolons. Then it can be read in using sep = ";"
read.table(text = sub(",", ";", sub(",", ";", vector)), header = TRUE, sep = ";")
Add the line marked ## in (1) if you want numeric prices.

Resources