Vexing Regex Using stringr [next line query]

I have made so many attempts at this and must now turn to you. I've seen related posts here on SO, but none help. I'm vexed as to why I can't get the list of instruments, which seems to appear on the line following the word "Instruments:"!
library(RCurl);library(XML);library(rvest);library(dplyr);library(stringr)
A<-"https://www.google.com/search?q=lester+young&oq=lester+young&aqs=chrome..69i57j69i60l2j0l3.1767j1j4&sourceid=chrome&ie=UTF-8"
result <- A %>%
  read_html() %>%
  html_nodes(xpath = "//span") %>%
  html_text()
# Parse `result` with regex
instruments<-str_extract(result,"(.*Instruments:\n.*)")
instruments
dob<-str_extract(result,".*(Born: \n.*)")
dob
'result' looks like this, in part:
[38] "Lester Willis Young, nicknamed \"Pres\" or \"Prez\", was an American jazz tenor saxophonist and occasional clarinetist.\nComing to prominence while a member of Count Basie's orchestra, Young was one of the most influential players on his instrument. Wikipedia"
[39] "Born: "
[40] "August 27, 1909, Woodville, MS"
[41] "Died: "
[42] "March 15, 1959, New York City, NY"
[43] "Nickname: "
[44] "Prez"
[45] "Instruments: "
[46] "Tenor saxophone, clarinet"
While it's possible to use instruments<-result[46] for this webpage, the HTML scraping yields instrument and dob information on different lines for different searches.
Ultimately, I would like to see "Piano" in the instruments object and a date of birth in the dob object.
Thank you...

This worked for me. Get the index of "Instruments:" and then print the next entry. Of course, if the page format changes, this may not work.
> i <- as.integer(grep("Instruments:",result))
> print(result[i+1])
[1] "Tenor saxophone, clarinet"
or this:
> result_all <- paste(result,collapse="\n")
> str_extract(result_all,"(Instruments:.*\\n.*)")
[1] "Instruments: \nTenor saxophone, clarinet"

Parse text from html tags in xml table

I've been asked to review questions for grammar and spelling:
library(XML)
tbls_all <- readHTMLTable(url_v9)
length(tbls_all)
[1] 34
names(tbls_all)
[1] "NULL" "DisplayedQuestions" "AllQuestions1"
[4] "HiddenAnswerTable1" "HiddenAnswerTable1" "AllQuestions2"
[7] "HiddenAnswerTable2" "HiddenAnswerTable2" "AllQuestions3"
[10] "HiddenAnswerTable3" "HiddenAnswerTable3" "AllQuestions4"
[13] "HiddenAnswerTable4" "HiddenAnswerTable4" "AllQuestions5"
[16] "HiddenAnswerTable5" "HiddenAnswerTable5" "AllQuestions6"
[19] "HiddenAnswerTable6" "HiddenAnswerTable6" "AllQuestions7"
[22] "HiddenAnswerTable7" "HiddenAnswerTable7" "AllQuestions8"
[25] "HiddenAnswerTable8" "HiddenAnswerTable8" "AllQuestions9"
[28] "HiddenAnswerTable9" "HiddenAnswerTable9" "AllQuestions10"
[31] "HiddenAnswerTable10" "HiddenAnswerTable10" "TotalsTable"
[34] "HiddenTable"
I'm only interested in the AllQuestions tables, so
tbls_q <- tbls_all[grep('AllQuestions\\d', names(tbls_all))]
length(tbls_q)
[1] 10
names(tbls_q[[1]])
[1] "V1" "V2" "V3" "V4"
The questions are in V1
tbls_q[[1]]$V1[2]
[1] "<strong>Now I am going to evaluate how well you can remember the names of some common items. First, I will show you pictures of 16 items that I want you to remember. Each item belongs to a different category. For example, 'type of reading materials' is a category. I will show you the items four at a time and ask you to tell me which item belongs with each category and then to immediately recall the items when I tell you their categories. Later, I will ask you to recall all of the items I have shown you. For any items you miss, I will tell you the categories to help you recall more items. You will have 3 tries to recall the items.</strong>(726368 - WAS_Card1_Intro)"
> tbls_q[[1]]$V1[4]
[1] "<br><br><font color=\"blue\"><i>Bear - Correctly Named?</i></font>(726370 - WAS_Card1_Word1_Name)"
> tbls_q[[1]]$V1[3]
[1] "<font color=\"blue\"><i>Place Worksheet 1 in front of the subject.</i></font><br><br><strong>There are 4 pictures on this worksheet. When I tell you a category, point to the item that is in that category and tell me its name. <br><br><br>Point to the 4–Legged Animal and tell me its name.</strong><br><br><font color=\"blue\"><i>Bear - Correctly Identified?</i></font>(726369 - WAS_Card1_Word1_Identify)"
At this point I'm stuck on how to extract the text without the embedded HTML tags, report what each question says, report what it should say, and report which variable (726369, for example) the question is. I can imagine some regex approaches, but they seem fragile...
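For what it's worth, here is one possible (admittedly fragile) sketch of such a regex approach, assuming every question ends with an "(id - VariableName)" suffix like the ones above:
library(stringr)
txt   <- tbls_q[[1]]$V1[3]
plain <- str_squish(gsub("<[^>]+>", " ", txt))             # strip the <tags> and collapse whitespace
parts <- str_match(plain, "^(.*)\\((\\d+) - (\\w+)\\)$")   # plain text, numeric id, variable name
parts[, 3]   # "726369"
parts[, 4]   # "WAS_Card1_Word1_Identify"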

Substitution of strings results in incorrect names

I'd like to change several strings in a vector. In my case, I have the following in the all.images object:
# Original character's list
all.images <-c("S2B2A_20171003_124_IndianaIIPR00911120170922_BOA_10.tif",
"S2B2A_20181028_124_IndianaIIPR0065820181024_BOA_10.tif",
"S2B2A_20170715_124_SantaMariaCalcasPR0033420170731_BOA_10.tif",
"S2B2A_20180928_124_NSraAparecidaBortolettoPR0042720180912_BOA_10.tif",
"S2A2A_20170610_124_LagoaAmarelaPR0022020170619_BOA_10.tif",
"S2A2A_20160705_124_AguaSumidaPR001320160629_BOA_10.tif",
"S2A2A_20181023_124_SaoPedroGabrielGarciaPR001720181031_BOA_10.tif",
"S2B2A_20180908_124_NSraAparecidaBortolettoPR001920180911_BOA_10.tif",
"S2A2A_20180824_124_NSraAparecidaBortolettoPR0043320180911_BOA_10.tif",
"S2A2A_20170720_124_VoAnaPR001520170802_BOA_10.tif",
"S2B2A_20180322_124_SaoMateusPR0021920180314_BOA_10.tif",
"S2A2A_20181212_124_NSradeFatimaJoaoBatistaPR002320181128_BOA_10.tif",
"S2A2A_20180413_081_SantaFeSebastiaoFogacaPR0021920180427_BOA_10.tif",
"S2B2A_20170913_124_PerdizesPR0034920170905_BOA_10.tif",
"S2A2A_20170610_124_TresMeninasPR001820170601_BOA_10.tif",
"S2B2A_20180428_081_SantaFeSebastiaoFogacaPR0021020180501_BOA_10.tif",
"S2B2A_20180508_081_SantaFeSebastiaoFogacaPR0022320180427_BOA_10.tif",
"S2A2A_20170809_124_VoAnaPR001620170803_BOA_10.tif",
"S2B2A_20180819_124_PontalIIPR0012220180801_BOA_10.tif",
"S2B2A_20181214_081_NSradeFatimaJoaoBatistaPR002320181128_BOA_10.tif",
"S2A2A_20180423_081_SantaFeSebastiaoFogacaPR0033920180427_BOA_10.tif",
"S2A2A_20180814_124_PontalIIPR0012220180801_BOA_10.tif",
"S2B2A_20170715_124_VoAnaPR0015A20170803_BOA_10.tif",
"S2A2A_20160615_124_AguaSumidaPR0011220160627_BOA_10.tif",
"S2A2A_20170720_124_SantaMariaCalcasPR0022820170726_BOA_10.tif",
"S2A2A_20180913_124_SantaMariaCalcasPR001620180829_BOA_10.tif",
"S2B2A_20170804_124_NSraAparecidaBortolettoPR0035720170811_BOA_10.tif",
"S2A2A_20170809_124_SantaFeBaracatPR001920170801_BOA_10.tif",
"S2B2A_20180322_124_NSradeFatimaGlebaAPR001320180403_BOA_10.tif",
"S2B2A_20180508_081_SantaFeSebastiaoFogacaPR0021920180427_BOA_10.tif")
#
My idea is to: 1) remove S2B2A_ and _BOA_10.tif; 2) after S2B2A_, convert the 8 digits into a date (e.g. 2017-09-05); 3) after the date, take the next three
digits (e.g. 124 or 081); and 4) split the remaining characters based on capital letters and dates (e.g. AguaSumidaPR0011220160627 to AguaSumida-PR00112-2016-06-27).
But when I try:
sub("^\\w+_(\\d+)_(\\d+)_([A-Za-z]+)([A-Z]{2}\\d{3})(\\d)(\\d{4})(\\d{2})(\\d+)_.*",
"\\3_\\4_\\5_\\6-\\7-\\8_\\1_\\2", all.images)
[1] "IndianaII_PR009_1_1120-17-0922_20171003_124"
[2] "IndianaII_PR006_5_8201-81-024_20181028_124"
...
[28] "SantaFeBaracat_PR001_9_2017-08-01_20170809_124"
[29] "NSradeFatimaGlebaA_PR001_3_2018-04-03_20180322_124"
[30] "SantaFeSebastiaoFogaca_PR002_1_9201-80-427_20180508_081"
I get incorrect dates (e.g. 9201-80-427 in [30]), and my desired output is:
[1] "IndianaII_PR009111_2017-09-22_2017-10-03_124"
[2] "IndianaII_PR00658_2018-10-24_2018-10-28_124"
...
[28] "SantaFeBaracat_PR0019_2017-08-01_2017-08-09_124"
[29] "NSradeFatimaGlebaA_PR0013_2018-04-03_2018-03-22_124"
[30] "SantaFeSebastiaoFogaca_PR00219_2018-04-27_2018-05-08_081"
Could anyone please help with this?
I think this handles the exceptions mentioned in the comments on your answer, using a lookahead:
sub("^\\w+_(\\d{4})(\\d{2})(\\d{2})_(\\d+)_([A-Za-z]+)([A-Z]{2}\\w+)(?=\\d{8})+(\\d{4})(\\d{2})(\\d+)_.*",
"\\5_\\6_\\7-\\8-\\9_\\1-\\2-\\3_\\4", all.images, perl = TRUE)

Levels of a dataframe after filtering

I've been doing an assignment for self-study in R programming. I have a question about what happens to factors in a data frame once you filter it. I have a data frame that has the columns (movie) Studio and Genre.
For the assignment I need to filter it. I succeeded in this, but when I check the levels of the newly filtered columns, all the original levels are still present, not just the ones that survived the filter.
Why is this? Am I doing something wrong?
StudioTarget <- c("Buena Vista Studios","Fox","Paramount Pictures","Sony","Universal","WB")
GenreTarget <- c("action","adventure","animation","comedy","drama")
dftest <- df[df$Studio %in% StudioTarget & df$Genre %in% GenreTarget,]
> levels(dftest$Studio)
[1] "Art House Studios" "Buena Vista Studios" "Colombia Pictures"
[4] "Dimension Films" "Disney" "DreamWorks"
[7] "Fox" "Fox Searchlight Pictures" "Gramercy Pictures"
[10] "IFC" "Lionsgate" "Lionsgate Films"
[13] "Lionsgate/Summit" "MGM" "MiraMax"
[16] "New Line Cinema" "New Market Films" "Orion"
[19] "Pacific Data/DreamWorks" "Paramount Pictures" "Path_ Distribution"
[22] "Relativity Media" "Revolution Studios" "Screen Gems"
[25] "Sony" "Sony Picture Classics" "StudioCanal"
[28] "Summit Entertainment" "TriStar" "UA Entertainment"
[31] "Universal" "USA" "Vestron Pictures"
[34] "WB" "WB/New Line" "Weinstein Company"
You can do droplevels(dftest$Studio) to remove unused levels
No, you're not doing anything wrong. A factor defines a fixed number of levels. These levels remain the same even if one or more of them are not present in the data. You've asked for the levels of your factor, not the values present after filtering.
Consider:
library(tidyverse)
mtcars %>%
  mutate(cyl = as.factor(cyl)) %>%
  filter(cyl == 4) %>%
  distinct(cyl) %>%
  pull(cyl)
[1] 4
Levels: 4 6 8
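If you do want the unused levels gone after filtering, drop them explicitly; a small sketch using the dftest from the question:
dftest$Studio <- droplevels(dftest$Studio)                              # base R
# dftest <- dplyr::mutate(dftest, Studio = forcats::fct_drop(Studio))   # tidyverse equivalent
levels(dftest$Studio)   # now only the studios that survived the filter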
Welcome to SO. Next time, please try to provide a minimum working example. This post will help you construct one.

Delete/filter rows with a specific value

We conducted an experiment at uni, which we tried out ourselves before giving it to real test persons. The problem now is that our testing data is included in the whole CSV data file, so I need to delete the first 23 "test persons".
They all got a unique code, and I could count how many of those unique codes exist (as you can see, there are 38). Now I only need the last 15 of them... I tried it with subset, but I don't really know how to filter for those specific last 15 subject IDs (VPcount).
unique(d$VPcount)
uniqueN(d$VPcount)
[1] 7.941675e-312 7.941683e-312 7.941686e-312 7.941687e-312 7.941695e-312 7.941697e-312 7.941734e-312
[8] 7.942134e-312 7.942142e-312 7.942146e-312 7.942176e-312 7.942191e-312 7.942194e-312 7.942199e-312
[15] 7.942268e-312 7.942301e-312 7.942580e-312 7.943045e-312 7.944383e-312 7.944386e-312 7.944388e-312
[22] 7.944388e-312 7.944429e-312 7.944471e-312 7.944477e-312 7.944478e-312 7.944494e-312 7.944500e-312
[29] 7.944501e-312 7.944501e-312 7.944503e-312 7.944503e-312 7.944506e-312 7.944506e-312 7.944506e-312
[36] 7.944506e-312 7.944508e-312 7.944511e-312
[1] 38
You can try:
data <- subset(d, VPcount %in% tail(unique(VPcount), 15))
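A dplyr equivalent, in case the rest of the pipeline already uses the tidyverse (hedged sketch; same d and VPcount as above):
library(dplyr)
last15 <- tail(unique(d$VPcount), 15)
data   <- d %>% filter(VPcount %in% last15)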

How to turn rvest output into table

Brand new to R, so I'll try my best to explain this.
I've been playing with data scraping using the "rvest" package. In this example, I'm scraping US state populations from a table on Wikipedia. The code I used is:
library(rvest)
statepop = read_html("https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population")
forecasthtml = html_nodes(statepop, "td")
forecasttext = html_text(forecasthtml)
forecasttext
The resulting output was as follows:
[2] "7000100000000000000♠1"
[3] " California"
[4] "39,250,017"
[5] "37,254,503"
[6] "7001530000000000000♠53"
[7] "738,581"
[8] "702,905"
[9] "12.15%"
[10] "7000200000000000000♠2"
[11] "7000200000000000000♠2"
[12] " Texas"
[13] "27,862,596"
[14] "25,146,105"
[15] "7001360000000000000♠36"
[16] "763,031"
[17] "698,487"
[18] "8.62%"
How can I turn these strings of text into a table that is set up similar to the way it is presented on the original Wikipedia page (with columns, rows, etc)?
Try using rvest's html_table function.
Note there are five tables on the page, so you will need to specify which table you would like to parse.
library(rvest)
statepop = read_html("https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population")
# find all of the tables on the page
tables <- html_nodes(statepop, "table")
# convert the first table into a data frame
# (note the double brackets: html_table(tables[1]) would return a list, not a data frame)
table1 <- html_table(tables[[1]])
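With recent rvest versions (>= 1.0), a slightly shorter sketch is to call html_table() on the whole document, which returns one data frame per <table> on the page, and then pick the one you want:
library(rvest)
statepop   <- read_html("https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population")
all_tables <- html_table(statepop)   # list of data frames, one per table
state_pop  <- all_tables[[1]]        # the index may differ if the page changes
head(state_pop)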
