How to parse non-XML as XML? [closed] - r

Closed. This question needs to be more focused. It is not currently accepting answers. Closed 5 years ago.
http://api.bart.gov/api/stn.aspx?cmd=stns&key=MW9S-E7SL-26DU-VV8V
How do I parse this as an XML document? I'm trying to do it in R.

You can use xml2 to read and parse:
library(xml2)
library(tidyverse)
xml <- read_xml('https://api.bart.gov/api/stn.aspx?cmd=stns&key=MW9S-E7SL-26DU-VV8V')
bart <- xml %>%
  xml_find_all('//station') %>%  # select all station nodes
  map_df(as_list) %>%            # coerce each node to a list, collect into a data frame
  unnest()                       # unnest the list columns of the data frame
bart
#> # A tibble: 46 × 9
#> name abbr gtfs_latitude gtfs_longitude
#> <chr> <chr> <chr> <chr>
#> 1 12th St. Oakland City Center 12TH 37.803768 -122.271450
#> 2 16th St. Mission 16TH 37.765062 -122.419694
#> 3 19th St. Oakland 19TH 37.808350 -122.268602
#> 4 24th St. Mission 24TH 37.752470 -122.418143
#> 5 Ashby ASHB 37.852803 -122.270062
#> 6 Balboa Park BALB 37.721585 -122.447506
#> 7 Bay Fair BAYF 37.696924 -122.126514
#> 8 Castro Valley CAST 37.690746 -122.075602
#> 9 Civic Center/UN Plaza CIVC 37.779732 -122.414123
#> 10 Coliseum COLS 37.753661 -122.196869
#> # ... with 36 more rows, and 5 more variables: address <chr>, city <chr>,
#> # county <chr>, state <chr>, zipcode <chr>
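Everything in the XML comes back as text, so gtfs_latitude and gtfs_longitude above are character columns. A minimal follow-up sketch (assuming a dplyr version with across) to make them numeric:
bart <- bart %>%
  mutate(across(c(gtfs_latitude, gtfs_longitude), as.numeric))  # coordinates as numbers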

Using the rvest library: the basic idea is to find the nodes of interest (xml_nodes) with XPath selectors, then grab their values with xml_text.
library(rvest)
doc <- read_xml("http://api.bart.gov/api/stn.aspx?cmd=stns&key=MW9S-E7SL-26DU-VV8V")
names <- doc %>%
  xml_nodes(xpath = "/root/stations/station/name") %>%
  xml_text()
names[1:5]
# [1] "12th St. Oakland City Center" "16th St. Mission" "19th St. Oakland" "24th St. Mission"
# [5] "Ashby"

I had some problems using the URL within read_html directly, so I used readLines first. After that, it finds all the nodesets matching <station>, transforms them into a list, and feeds that into data.table::rbindlist. The idea of using rbindlist came from here.
library(xml2)
library(data.table)
nodesets <- read_html(paste(readLines("http://api.bart.gov/api/stn.aspx?cmd=stns&key=MW9S-E7SL-26DU-VV8V"), collapse = "\n")) %>%
  xml_find_all(".//station")
data.table::rbindlist(as_list(nodesets))
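One thing to note: since each field returned by as_list() is itself a one-element list, rbindlist() produces list columns here. A small sketch (under that assumption) to flatten them into plain character columns:
stations <- data.table::rbindlist(as_list(nodesets))  # list columns at this point
stations <- stations[, lapply(.SD, unlist)]           # unlist every column to a plain vector
head(stations)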

Related

I need help merging two rows based on a certain string; the string is in the complaint column

I am trying to calculate the fraction of construction noise complaints per zip code across New York City. The data is from NYC 311.
I am using dplyr and have grouped the data per zip.
However, I am having difficulty merging the rows for the complaint column: I have to merge the data based on the string "construction", which can appear anywhere in the value (beginning, middle, or end).
My solution so far (this is just the beginning):
comp_types <- df %>%
  select(complaint_type, descriptor, incident_zip) %>%
  group_by(incident_zip)
Can you help me merge the rows if a unique value in descriptor contains any "construction" value?
Can you clarify what you mean by "merging"? I don't think you actually want to merge because you only have one dataframe. The term "merging" is used to describe the joining of two dataframes.
See ?base::merge:
Merge two data frames by common columns or row names, or do other versions of database join operations.
If I understand correctly, you want to look into the descriptor variable and see if it contains the string "construction" anywhere in the cell, so you can determine if the person's complaint was construction-related; same for "music". I don't believe you need to use complaint_type since complaint_type never contains the string "construction" or "music"; only descriptor does.
You can use a combination of ifelse and grepl to create a new variable that indicates whether the complaint was construction-related, music-related, or other.
library(tidyverse)
library(janitor)
url <- "https://data.cityofnewyork.us/api/views/p5f6-bkga/rows.csv"
df <- read.csv(url, nrows = 10000) %>%
  clean_names() %>%
  select(complaint_type, descriptor, incident_zip)
comp_types <- df %>%
  select(complaint_type, descriptor, incident_zip) %>%
  group_by(incident_zip)
head(comp_types)
#> # A tibble: 6 × 3
#> # Groups: incident_zip [6]
#> complaint_type descriptor incident_zip
#> <chr> <chr> <int>
#> 1 Noise - Residential Banging/Pounding 11364
#> 2 Noise - Residential Loud Music/Party 11222
#> 3 Noise - Residential Banging/Pounding 10033
#> 4 Noise - Residential Loud Music/Party 11208
#> 5 Noise - Residential Loud Music/Party 10037
#> 6 Noise Noise: Construction Before/After Hours (NM1) 11238
table(df$complaint_type)
#>
#> Noise Noise - Commercial Noise - Helicopter
#> 555 591 145
#> Noise - House of Worship Noise - Park Noise - Residential
#> 20 72 5675
#> Noise - Street/Sidewalk Noise - Vehicle
#> 2040 902
df <- df %>%
  mutate(descriptor_misc = ifelse(grepl("Construction", descriptor), "Construction",
                                  ifelse(grepl("Music", descriptor), "Music", "Other")))
df %>%
  group_by(descriptor_misc) %>%
  count()
#> # A tibble: 3 × 2
#> # Groups: descriptor_misc [3]
#> descriptor_misc n
#> <chr> <int>
#> 1 Construction 328
#> 2 Music 6354
#> 3 Other 3318
head(df)
#> complaint_type descriptor incident_zip
#> 1 Noise - Residential Banging/Pounding 11364
#> 2 Noise - Residential Loud Music/Party 11222
#> 3 Noise - Residential Banging/Pounding 10033
#> 4 Noise - Residential Loud Music/Party 11208
#> 5 Noise - Residential Loud Music/Party 10037
#> 6 Noise Noise: Construction Before/After Hours (NM1) 11238
#> descriptor_misc
#> 1 Other
#> 2 Music
#> 3 Other
#> 4 Music
#> 5 Music
#> 6 Construction
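A side note: nested ifelse() calls grow hard to read as categories are added. dplyr::case_when() expresses the same logic more clearly; a sketch with the same grepl() tests as above:
df <- df %>%
  mutate(descriptor_misc = case_when(
    grepl("Construction", descriptor) ~ "Construction",  # checked first, as above
    grepl("Music", descriptor) ~ "Music",
    TRUE ~ "Other"                                       # fallback, like the final else
  ))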

R combine rows and columns within a dataframe

I've looked around for a while trying to figure this out, but I just can't seem to describe my problem concisely enough to google my way out of it. I am trying to work with Michigan COVID stats where the data has Detroit listed separately from Wayne County. I need to add Detroit's numbers to Wayne County's numbers, then remove the Detroit rows from the data frame.
I have included a screen grab too. For the purposes of this problem, can someone explain how I can get Detroit City's numbers added to Wayne County, and then make the Detroit City rows disappear? Thanks.
library(tidyverse)
library(openxlsx)
cases_deaths <- read.xlsx("https://www.michigan.gov/coronavirus/-/media/Project/Websites/coronavirus/Cases-and-Deaths/4-20-2022/Cases-and-Deaths-by-County-2022-04-20.xlsx?rev=f9f34cd7a4614efea0b7c9c00a00edfd&hash=AA277EC28A17C654C0EE768CAB41F6B5.xlsx")[,-5]
# Remove rows that don't describe counties
cases_deaths <- cases_deaths[-c(51,52,101,102,147,148,167,168),]
[Screenshot: output of the code chunk above]
You could do:
cases_deaths %>%
  filter(COUNTY %in% c("Wayne", "Detroit City")) %>%
  mutate(COUNTY = "Wayne") %>%
  group_by(COUNTY, CASE_STATUS) %>%
  summarize_all(sum) %>%
  bind_rows(cases_deaths %>%
              filter(!COUNTY %in% c("Wayne", "Detroit City")))
#> # A tibble: 166 x 4
#> # Groups: COUNTY [83]
#> COUNTY CASE_STATUS Cases Deaths
#> <chr> <chr> <dbl> <dbl>
#> 1 Wayne Confirmed 377396 7346
#> 2 Wayne Probable 25970 576
#> 3 Alcona Confirmed 1336 64
#> 4 Alcona Probable 395 7
#> 5 Alger Confirmed 1058 8
#> 6 Alger Probable 658 5
#> 7 Allegan Confirmed 24109 294
#> 8 Allegan Probable 3024 52
#> 9 Alpena Confirmed 4427 126
#> 10 Alpena Probable 1272 12
#> # ... with 156 more rows
Created on 2022-04-23 by the reprex package (v2.0.1)
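An alternative that avoids the bind_rows() step is to relabel "Detroit City" as "Wayne" first and then aggregate every county in one pass; a sketch assuming the Cases and Deaths columns shown above:
cases_deaths %>%
  mutate(COUNTY = ifelse(COUNTY == "Detroit City", "Wayne", COUNTY)) %>%  # fold Detroit into Wayne
  group_by(COUNTY, CASE_STATUS) %>%
  summarize(across(c(Cases, Deaths), sum), .groups = "drop")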

Webscraping Tables From Wikipedia in R

I was wondering if anyone had useful ideas or code for web scraping tables from Wikipedia.
Specifically, I'm interested in the Presidential election results table in the "Results by county" section on Wikipedia.
An example table can be found using the following link and scrolling down to the "Results by county" section: https://en.wikipedia.org/wiki/1948_United_States_presidential_election_in_Texas
The table looks like this: [screenshot of the "Results by county" table]
I've tried some solutions from the following StackOverflow post: Importing wikipedia tables in R
However, they don't appear to be applicable to the type of table I want to scrape from Wikipedia.
Any advice, solutions, or code would be greatly appreciated. Thank you!
Making use of the rvest package you could get the table by first selecting the element containing the desired table via html_element("table.wikitable.sortable") and then extracting the table via html_table() like so:
library(rvest)
url <- "https://en.wikipedia.org/wiki/1948_United_States_presidential_election_in_Texas"
html <- read_html(url)
county_table <- html %>%
  html_element("table.wikitable.sortable") %>%
  html_table()
head(county_table)
#> # A tibble: 6 x 14
#> County `Harry S. Truman… `Harry S. Truman… `Thomas E. Dewey… `Thomas E. Dewe…
#> <chr> <chr> <chr> <chr> <chr>
#> 1 County # % # %
#> 2 Anders… 3,242 62.37% 1,199 23.07%
#> 3 Andrews 816 85.27% 101 10.55%
#> 4 Angeli… 4,377 69.05% 1,000 15.78%
#> 5 Aransas 418 61.02% 235 34.31%
#> 6 Archer 1,599 86.20% 191 10.30%
#> # … with 9 more variables: Strom ThurmondStates’ Rights Democratic <chr>,
#> # Strom ThurmondStates’ Rights Democratic.1 <chr>,
#> # Henry A. WallaceProgressive <chr>, Henry A. WallaceProgressive.1 <chr>,
#> # Various candidatesOther parties <chr>,
#> # Various candidatesOther parties.1 <chr>, Margin <chr>, Margin.1 <chr>,
#> # Total votes cast[11] <chr>
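Note that html_table() leaves the first row holding the "#"/"%" sub-headers, and the vote counts are character with comma separators. A minimal cleanup sketch (column indices assumed from the output above; parse_number() is from readr):
library(readr)
county_table <- county_table[-1, ]                     # drop the sub-header row
county_table[[2]] <- parse_number(county_table[[2]])   # e.g. "3,242" -> 3242
head(county_table[, 1:2])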

How to replace NA values when parsing a html page to make a dataframe? [duplicate]

This question already has answers here: How to return NA when nothing is found in an xpath? (2 answers). Closed 5 years ago.
When parsing an HTML page, we can get NA values, and when we then try to build a data frame from data in a list, the missing values make it impossible. Is there an easy way to succeed? Please see the following example:
library(rvest)
library(RCurl)
library(XML)
pg <- getURL("https://agences.axa.fr/ile-de-france/paris/paris-19e-75019")
page <- htmlTreeParse(pg, useInternal = TRUE, encoding = "UTF-8")
unlist(xpathApply(page, '//b[@class="Name"]', xmlValue))
data.frame(noms = unlist(xpathApply(page, '//b[@class="Name"]', xmlValue)),
           rue = unlist(xpathApply(page, '//span[@class="street-address"]', xmlValue)))
Using rvest and purrr (the tidyverse package for lists/functional programming, which pairs very nicely with rvest),
library(rvest)
library(purrr)
# be nice, only scrape once
h <- 'https://agences.axa.fr/ile-de-france/paris/paris-19e-75019' %>% read_html()
df <- h %>%
  # select each list item
  html_nodes('div.ListConseiller li') %>%
  # for each item, make a list of the parsed name and street; coerce results to a data frame
  map_df(~list(nom = .x %>% html_node('b.Name') %>% html_text(),
               rue = .x %>% html_node('span.street-address') %>% html_text(trim = TRUE)))
df
#> # A tibble: 14 × 2
#> nom rue
#> <chr> <chr>
#> 1 Marie France Tmim <NA>
#> 2 Rachel Tobie <NA>
#> 3 Bernard Licha <NA>
#> 4 David Giuili <NA>
#> 5 Myriam Yajid Khalfi <NA>
#> 6 Eytan Elmaleh <NA>
#> 7 Allister Charles <NA>
#> 8 Serge Savergne 321 Rue De Belleville
#> 9 Patrick Allouche 1 Rue Clavel
#> 10 Anne Fleiter 14 Avenue De Laumiere
#> 11 Eric Fitoussi <NA>
#> 12 Jean-Baptiste Crocombette 1 Bis Rue Emile Desvaux
#> 13 Eric Zunino 14 Rue De Thionville
#> 14 Eric Hayoun <NA>
The code uses CSS selectors for brevity, but use XPath ones via the xpath parameter of html_nodes and html_node, if you prefer.
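For completeness, a sketch of the same extraction written with XPath selectors instead (the class names are carried over from the CSS version above, so treat them as assumptions about the page):
df <- h %>%
  html_nodes(xpath = '//div[contains(@class, "ListConseiller")]//li') %>%
  map_df(~list(nom = .x %>% html_node(xpath = './/b[@class="Name"]') %>% html_text(),
               rue = .x %>% html_node(xpath = './/span[@class="street-address"]') %>%
                 html_text(trim = TRUE)))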

r rvest webscraping hltv

Yes, that's just another "how-to-scrape" question. Sorry for that, but I've read the previous answers and the rvest manual as well.
I'm doing web scraping for my homework (so I do not plan to use the data for any commercial purpose). The idea is to show that the average skill of a team affects individual skill. I'm trying to use CS:GO data from HLTV.org for it.
The information is available at http://www.hltv.org/?pageid=173&playerid=9216
I need two tables: Key stats (data only) and Teammates (data and URLs). I tried to use the CSS selectors generated by SelectorGadget, and I also tried to analyze the source code of the webpage. I failed. I'm doing the following:
library(rvest)
library(dplyr)
url <- 'http://www.hltv.org/?pageid=173&playerid=9216'
info <- html_session(url) %>% read_html()
info %>% html_node('.covSmallHeadline') %>% html_text()
Can you please tell me what the right CSS selector is?
If you look at the source, those tables aren't HTML tables, but just piles of divs with inconsistent nesting and inline CSS for alignment. Thus it's easiest to just grab all the text and fix the strings afterwards, as each value is either entirely numeric or not numeric at all.
library(rvest)
library(tidyverse)
h <- 'http://www.hltv.org/?pageid=173&playerid=9216' %>% read_html()
stats <- h %>%
  html_nodes('.covGroupBoxContent') %>%
  .[-1] %>%                                # drop the first box, which isn't a stats table
  html_text(trim = TRUE) %>%
  strsplit('\\s*\\n\\s*') %>%              # split each box's text into lines
  setNames(map_chr(., ~.x[1])) %>%         # the first line of each box is its title
  map(~.x[-1]) %>%                         # drop the title line, keep the data lines
  map(~data_frame(variable = gsub('[.0-9]+', '', .x),  # text part becomes the variable name
                  value = parse_number(.x)))           # numeric part becomes the value
stats
#> $`Key stats`
#> # A tibble: 9 × 2
#> variable value
#> <chr> <dbl>
#> 1 Total kills 9199.00
#> 2 Headshot %% 46.00
#> 3 Total deaths 6910.00
#> 4 K/D Ratio 1.33
#> 5 Maps played 438.00
#> 6 Rounds played 11242.00
#> 7 Average kills per round 0.82
#> 8 Average deaths per round 0.61
#> 9 Rating (?) 1.21
#>
#> $TeammatesRating
#> # A tibble: 4 × 2
#> variable value
#> <chr> <dbl>
#> 1 Gabriel 'FalleN' Toledo 1.11
#> 2 Fernando 'fer' Alvarenga 1.11
#> 3 Joao 'felps' Vasconcellos 1.09
#> 4 Epitacio 'TACO' de Melo 0.98
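Since the pipeline returns a named list of tibbles (saved as stats above), dplyr's bind_rows() can stack them into one long data frame, keeping each table's name as a column; a short sketch:
bind_rows(stats, .id = "table")  # columns: table, variable, value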
