I am trying to change all the names containing the word stocker in Job.tittle and write the result to a new column job.title.2.
I tried to use gsub() but did not get the expected result.
My data.frame looks like this:
x<- data.frame(Job.tittle=c("DW Overnight Stockers", "Checkers","TH Stockers", "CM Midland Stockers"), Head.counts=c(100,50,100,200))
Thank you
I tried this: x$job.tittle.2<-gsub("\bDW Overnight Stockers\w+","Stocker",x$Job.tittle)
but it did not work.
Here you go. Using regex, this takes a string that contains the word "stocker" or "stockers", in either upper or lower case, anywhere in the string, and replaces it with "Stocker".
x$job.title.2 <- gsub(".*stockers?.*", "Stocker", x$Job.tittle, ignore.case = TRUE)
x
Job.tittle Head.counts job.title.2
1 DW Overnight Stockers 100 Stocker
2 Checkers 50 Checkers
3 TH Stockers 100 Stocker
4 CM Midland Stockers 200 Stocker
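If you prefer to make the match explicit, an equivalent approach (a sketch using the same column names) flags the matching rows with grepl() and only overwrites those:
# Flag rows whose Job.tittle mentions "stocker" (case-insensitive),
# then set those rows to "Stocker" and leave the rest unchanged
has_stocker <- grepl("stocker", x$Job.tittle, ignore.case = TRUE)
x$job.title.2 <- ifelse(has_stocker, "Stocker", as.character(x$Job.tittle))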
I have been practicing web scraping from Wikipedia with the rvest library, and I would like to solve a problem I ran into when using the str_replace_all() function. Here is the code:
library(tidyverse)
library(rvest)
pagina <- read_html("https://es.wikipedia.org/wiki/Anexo:Premio_Grammy_al_mejor_%C3%A1lbum_de_rap") %>%
# list all tables on the page
html_nodes(css = "table") %>%
# convert to a table
html_table()
rap <- pagina[[2]]
rap <- rap[, -c(5)]
rap$Artista <- str_replace_all(rap$Artista, '\\[[^\\]]*\\]', '')
rap$Trabajo <- str_replace_all(rap$Trabajo, '\\[[^\\]]*\\]', '')
table(rap$Artista)
The problem is that when I remove the elements between brackets (Wikipedia hyperlinks) from the Artista variable and then tabulate the counts by artist, Eminem is repeated three times as if it were three different artists; the same happens with Kanye West, which is repeated twice. I appreciate any solutions in advance.
There are some hidden bits still attached to the strings, and trimws() does not remove them. You can use nchar(sort(rap$Artista)) to see the number of characters associated with each entry.
Here is a messy regular expression to extract the letters, spaces, commas, and ampersands, and skip everything else at the end.
rap <- pagina[[2]]
rap <- rap[, -c(5)]
rap$Artista<-gsub("([a-zA-Z -,&]+).*", "\\1", rap$Artista)
rap$Trabajo <- stringr::str_replace_all(rap$Trabajo, '\\[[^\\]]*\\]', '')
table(rap$Artista)
Cardi B Chance the Rapper Drake Eminem Jay Kanye West Kendrick Lamar
1 1 1 6 1 4 2
Lil Wayne Ludacris Macklemore & Ryan Lewis Nas Naughty by Nature Outkast Puff Daddy
1 1 1 1 1 2 1
The Fugees Tyler, the Creator
1 2
Here is another regular expression that seems a bit clearer:
gsub("[^[:alpha:]]*$", "", rap$Artista)
From the end of the string, this replaces zero or more characters that are not letters (a to z or A to Z).
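If you want to see what the hidden bits actually are, a quick check along those lines (a sketch; the non-breaking space "\u00a0" is a common culprit in scraped Wikipedia tables, but that is an assumption here, not something verified against this page):
# Entries that print identically can still differ in length if invisible
# characters are attached to them
nchar(sort(unique(rap$Artista)))
# If the trailing characters turn out to be non-breaking spaces, a targeted
# replacement also works (assumption: \u00a0 is the culprit)
rap$Artista <- gsub("\u00a0", "", rap$Artista)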
I have a question about how to extract parts of a text field and convert them into a data frame output.
This is an example from my df: the content of one cell (the output of one row in my one column):
[{"id"=>"aaaaaaaaaaaaaaaa", "effortDate"=>"2021-07-04T23:00:00.000Z", "effort"=>2, "author"=>"a:aa:a"}, {"id"=>"bbbbbbbbbbbbbb", "effortDate"=>"2021-07-11T23:00:00.000Z", "effort"=>1, "author"=>"b:bb:b"}, {"id"=>"ccccccccccccc", "effortDate"=>"2021-07-17T23:00:00.000Z", "effort"=>1, "author"=>"c:cc:c"}]
My expected output would be two columns, with as many rows as I get from this string:
effortDate
2021-07-04
2021-07-11
and second column
effort
2
1
Any suggestion how to achieve that?
Thanks!
It looks like JSON content, but the => messes with the parsing. If you replace it with :, you should be able to read it properly.
mystr <- '[{"id"=>"aaaaaaaaaaaaaaaa", "effortDate"=>"2021-07-04T23:00:00.000Z", "effort"=>2, "author"=>"a:aa:a"}, {"id"=>"bbbbbbbbbbbbbb", "effortDate"=>"2021-07-11T23:00:00.000Z", "effort"=>1, "author"=>"b:bb:b"}, {"id"=>"ccccccccccccc", "effortDate"=>"2021-07-17T23:00:00.000Z", "effort"=>1, "author"=>"c:cc:c"}]'
jsonlite::fromJSON(gsub("=>", ":", mystr))
# id effortDate effort author
# 1 aaaaaaaaaaaaaaaa 2021-07-04T23:00:00.000Z 2 a:aa:a
# 2 bbbbbbbbbbbbbb 2021-07-11T23:00:00.000Z 1 b:bb:b
# 3 ccccccccccccc 2021-07-17T23:00:00.000Z 1 c:cc:c
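From there, the two columns asked for can be pulled straight out of the parsed data frame; a minimal sketch, truncating the timestamps to dates:
parsed <- jsonlite::fromJSON(gsub("=>", ":", mystr))
data.frame(
  effortDate = as.Date(substr(parsed$effortDate, 1, 10)),
  effort = parsed$effort
)
#   effortDate effort
# 1 2021-07-04      2
# 2 2021-07-11      1
# 3 2021-07-17      1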
I am trying the following:
gg <-c("delete from below 110 11031133 11 11031135 110",
"delete froml #10989431 from adfdaf 10888022 <(>&<)> 10888018",
"this is for the deletion of an incorrect numberss that is no longer used for asd09 and sd040",
"please delete the following mangoes from trey 10246211 1 10821224 1 10821248 1 10821249",
"from 11015647 helppp 1 na from 0050 - zfhhhh 10840637 1")
pattern_to_find <- c('\\d{4,}')
aa <- str_extract_all(gg, pattern_to_find)
aa
With this code I am able to extract any numeric pattern with at least a fixed number of digits. But if I want to extract two-digit numbers with
pattern_to_find <- c('\\d{2}')
it picks up the first two digits from every numeric field.
How can I modify my pattern so it works both ways?
Regards,
R
Tidyverse solution:
library(tidyverse)
pattern_to_find <- c('\\d{2,}')
aa <- str_extract_all(gg, pattern_to_find)
Base R solution:
base_aa <- regmatches(gg, gregexpr(pattern_to_find, gg))
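If the goal is to pick up only standalone two-digit numbers (rather than the first two digits of longer ones), word boundaries should do it; a sketch, assuming that is the intent:
# \\b marks a word boundary, so only digit runs of exactly two digits match
two_digit_only <- stringr::str_extract_all(gg, '\\b\\d{2}\\b')
two_digit_only[[1]]
# [1] "11"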
I'm trying to split a string in one column...
> df.arpt
arpt
1 CMH 39402
2 IAH 97571
3 DAL 67191
4 HOU 07614
5 OKC 11127
...and break it out into two new columns with a result that looks like this...
> df.arpt
arpt arptCode arptID
1 CMH 39402 CMH 39402
2 IAH 97571 IAH 97571
3 DAL 67191 DAL 67191
4 HOU 07614 HOU 07614
5 OKC 11127 OKC 11127
I really want something like this to be possible...
> df.arpt$arptCode <- strsplit(df.arpt$arpt, " ")[[...]][1]
> df.arpt$arptID <- strsplit(df.arpt$arpt, " ")[[...]][2]
... where the ... in the code represents "for every record in the data frame".
Any suggestions on how to go about this? (I'd like to stick with base R / "out-of-the-box" R rather than higher-level packages.) Am I thinking about this the right way in R?
If the arptCode values are row names, you can convert them into a column.
library(tidyverse)
df.arpt %>%
rownames_to_column(var = "arptCode")
If they are not row names then you can use separate.
library(tidyverse)
df.arpt %>%
separate(arpt, into = c('arptCode', 'arptID'))
How about this:
df<-data.frame(arpt =c("CMH 39402", "IAH 97571", "DAL 67191", "HOU 07614", "OKC 11127"))
tidyr::separate(df, arpt, into = c("arptCode", "arptID"))
Because the strings are all of fixed length, I was able to use the substr function instead to move past the problem. However, I still don't know what the solution would be if the result of the function were a list.
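For the sake of completeness, here is one base-R way to handle the list that strsplit() returns: a sketch, assuming every arpt value splits into exactly two pieces.
# strsplit() gives a list with one character vector per row;
# rbind the pieces into a matrix and assign its columns
parts <- do.call(rbind, strsplit(df.arpt$arpt, " ", fixed = TRUE))
df.arpt$arptCode <- parts[, 1]
df.arpt$arptID <- parts[, 2]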
I have a dataset with a "Notes" column, which I'm trying to clean up with R. The notes look something like this:
Collected for 2 man-hours total. Cloudy, imminent storms.
Collected for 2 man-hours total. Rainy.
Collected 30 min w/2 staff for a total of 1 man-hour of sampling. Sunny.
...and so on.
I want to remove all sentences that start with "Collected" but not any of the sentences that follow. The number of sentences that follow varies, e.g. from 0 to 4 afterwards. I was trying to remove all combinations of Collected + (last word of the sentence), but there are too many combinations. Removing Collected + [.] removes all the subsequent sentences as well. Does anyone have any suggestions? Thank you in advance.
One option using gsub is:
gsub("^Collected[^.]*\\. ","",df$Notes)
# [1] "Cloudy, imminent storms."
# [2] "Rainy."
# [3] "Sunny."
Regex explanation:
- `^Collected` : Starts with `Collected`
- `[^.]*` : Followed by anything other than `.`
- `\\. ` : Ends with `.` and `space`.
Replace such matches with "".
Data:
df<-read.table(text=
"Notes
'Collected for 2 man-hours total. Cloudy, imminent storms.'
'Collected for 2 man-hours total. Rainy.'
'Collected 30 min w/2 staff for a total of 1 man-hour of sampling. Sunny.'",
header = TRUE, stringsAsFactors = FALSE)
a = "Collected 30 min w/2 staff for a total of 1 man-hour of sampling. Sunny."
sub("^ ","",sub("Collected.*?\\.","",a))
> [1] "Sunny."
Or if you know that there will be a space after the period:
sub("Collected.*?\\. ","",a)