Compiling API outputs in XML format in R

I have searched everywhere trying to find an answer to this question and I haven't quite found what I'm looking for yet, so I'm hoping asking directly will help.
I am working with the USPS Tracking API, which provides its output in XML format. The API is limited to 35 results per call (i.e. you can only provide 35 tracking numbers each time you call the API) and I need information on ~90,000 tracking numbers, so I am running my calls in a for loop. I was able to store the results of the calls in a list, but then I had trouble exporting the list as-is into anything usable. And when I tried to convert the results from the list into JSON, it dropped the attribute tag, which contained the tracking number I had used to generate the results.
Here is what a sample result looks like:
<TrackResponse>
<TrackInfo ID="XXXXXXXXXXX1">
<TrackSummary> Your item was delivered at 6:50 am on February 6 in BARTOW FL 33830.</TrackSummary>
<TrackDetail>February 6 6:49 am NOTICE LEFT BARTOW FL 33830</TrackDetail>
<TrackDetail>February 6 6:48 am ARRIVAL AT UNIT BARTOW FL 33830</TrackDetail>
<TrackDetail>February 6 3:49 am ARRIVAL AT UNIT LAKELAND FL 33805</TrackDetail>
<TrackDetail>February 5 7:28 pm ENROUTE 33699</TrackDetail>
<TrackDetail>February 5 7:18 pm ACCEPT OR PICKUP 33699</TrackDetail>
Here is the script I ran to get the output I'm currently working with:
library(xml2)     # read_xml()
library(jsonlite) # toJSON()

final_tracking_info <- list()
for (i in 1:x) { # where x = the number of calls to the API the loop will need to make
  usps <- input_tracking_info[i] # input_tracking_info = GET commands
  usps <- read_xml(usps)
  final_tracking_info[[i]] <- usps$TrackResponse
  gc()
}
final_output <- toJSON(final_tracking_info)
write(final_output, "final_tracking_info.json") # tried converting to JSON, lost the ID attribute
cat(capture.output(print(final_tracking_info)), file = "Final_Tracking_Info.txt") # exported the list to a text file, but the format was not ideal to work with
What I ultimately want to get from this data is a table containing the tracking number, the first track detail, and the last track detail. What I'm wondering is: is there a better way to compile this in XML/JSON that will make it easier to convert to a tibble/df down the line? Is there an easy way or preferred format to select from, given that most of the columns will have the same name ("TrackDetail") and the DFs will have different lengths (since each package has a different number of track details) when I'm compiling thousands of results into one final output?

Using XML::xmlToList() will store the ID attribute in .attrs:
$TrackSummary
[1] " Your item was delivered at 6:50 am on February 6 in BARTOW FL 33830."
$TrackDetail
[1] "February 6 6:49 am NOTICE LEFT BARTOW FL 33830"
$TrackDetail
[1] "February 6 6:48 am ARRIVAL AT UNIT BARTOW FL 33830"
$TrackDetail
[1] "February 6 3:49 am ARRIVAL AT UNIT LAKELAND FL 33805"
$TrackDetail
[1] "February 5 7:28 pm ENROUTE 33699"
$TrackDetail
[1] "February 5 7:18 pm ACCEPT OR PICKUP 33699"
$.attrs
ID
"XXXXXXXXXXX1"
A way of using that output which assumes that the Summary and ID are always present as first and last elements, respectively, is:
xml_data <- XML::xmlToList("71563898.xml") %>%
  unlist() %>% # flattening
  unname()     # removing names

data.frame(
  ID = tail(xml_data, 1),      # getting last element
  Summary = head(xml_data, 1), # getting first element
  Info = xml_data %>% head(-1) %>% tail(-1) # remove first and last elements
)
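Alternatively, a sketch using the xml2 package (where read_xml in the question comes from): pull the ID attribute and the first/last TrackDetail directly with XPath. This scales to many TrackInfo nodes per response and sidesteps the same-name-columns problem. The inline XML is an abridged stand-in for one API response; note the API lists the newest event first, so the chronologically first detail is the last element:

```r
library(xml2)

# Inline stand-in for one API response (abridged from the question)
doc <- read_xml('<TrackResponse>
  <TrackInfo ID="XXXXXXXXXXX1">
    <TrackSummary>Your item was delivered at 6:50 am on February 6 in BARTOW FL 33830.</TrackSummary>
    <TrackDetail>February 6 6:49 am NOTICE LEFT BARTOW FL 33830</TrackDetail>
    <TrackDetail>February 5 7:18 pm ACCEPT OR PICKUP 33699</TrackDetail>
  </TrackInfo>
</TrackResponse>')

# One row per TrackInfo: the ID attribute plus first and last TrackDetail
infos <- xml_find_all(doc, ".//TrackInfo")
out <- do.call(rbind, lapply(infos, function(ti) {
  details <- xml_text(xml_find_all(ti, ".//TrackDetail"))
  data.frame(
    id           = xml_attr(ti, "ID"),
    first_detail = details[length(details)], # oldest event is listed last
    last_detail  = details[1],               # newest event is listed first
    stringsAsFactors = FALSE
  )
}))
```

Because the ID is read as an attribute rather than flattened away, nothing is lost when you later bind thousands of these rows together with rbind or dplyr::bind_rows.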

Related

How do I access (new july 2022) targeting information from the Facebook Ad Library API (R solution preferred)?

As this announcement mentions (https://www.facebook.com/business/news/transparency-social-issue-electoral-political-ads) new targeting information (or a summary) has been made available in the Facebook Ad Library.
I am used to using the 'Radlibrary' package in R, but I can't seem to find any fields in 'Radlibrary' that allow me to get this information. Does anyone know either how to access this information from the Radlibrary package in R (preferred, since this is what I know and usually work with) or how to access this from the API in another way?
I use it to look at how politicians choose to target their ads; it would be too big a task to look this up manually at facebook.com/ads/library.
EDIT
The targeting I refer to is found browsing the ad library like the screenshots below.
Thanks for highlighting this data being published which I did not know had been announced. I just registered for an API token to play around with it.
It seems to me that looking for ads from a particular politician or organisation is a question of downloading large amounts of data and then manipulating it in R. For example, to recreate the curl query on the API docs page:
curl -G \
-d "search_terms='california'" \
-d "ad_type=POLITICAL_AND_ISSUE_ADS" \
-d "ad_reached_countries=['US']" \
-d "access_token=<ACCESS_TOKEN>" \
"https://graph.facebook.com/<API_VERSION>/ads_archive"
We can simply do:
library(Radlibrary)

# enter token interactively so it doesn't get added to R history
token <- readline()
query <- adlib_build_query(
  search_terms = "california",
  ad_reached_countries = "US",
  ad_type = "POLITICAL_AND_ISSUE_ADS"
)
response <- adlib_get(params = query, token = token)
results_df <- as_tibble(response, censor_access_token = TRUE)
This seems to return what one would expect:
names(results_df)
# [1] "id" "ad_creation_time" "ad_creative_bodies" "ad_creative_link_captions" "ad_creative_link_titles" "ad_delivery_start_time"
# [7] "ad_snapshot_url" "bylines" "currency" "languages" "page_id" "page_name"
# [13] "publisher_platforms" "estimated_audience_size_lower" "estimated_audience_size_upper" "impressions_lower" "impressions_upper" "spend_lower"
# [19] "spend_upper" "ad_creative_link_descriptions" "ad_delivery_stop_time"
library(dplyr)
results_df |>
group_by(page_name) |>
summarise(n = n()) |>
arrange(desc(n))
# # A tibble: 237 x 2
# page_name n
# <chr> <int>
# 1 Senator Brian Dahle 169
# 2 Katie Porter 122
# 3 PragerU 63
# 4 Results for California 28
# 5 Big News Buzz 20
# 6 California Water Service 20
# 7 Cancer Care is Different 17
# 8 Robert Reich 14
# 9 Yes On 28 14
# 10 Protect Tribal Gaming 13
# # ... with 227 more rows
Now - assuming that you are interested specifically in the ads by Senator Brian Dahle, it does not appear that you can send a query for all ads he has placed (i.e. there is no page_name parameter in the query). But you can request all political ads in the area (setting the limit parameter to a high number) with a particular search_term or search_page_id, and then filter the data down to the relevant person.
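That download-then-filter step is a one-liner in base R. A minimal stand-in (toy data here; the real results_df has the columns listed above):

```r
# Toy stand-in for the downloaded results_df
results_df <- data.frame(
  page_name = c("Senator Brian Dahle", "Katie Porter", "Senator Brian Dahle"),
  id        = c("1", "2", "3"),
  stringsAsFactors = FALSE
)

# Keep only rows from the advertiser of interest
dahle_ads <- results_df[results_df$page_name == "Senator Brian Dahle", ]
nrow(dahle_ads)  # 2
```

The same filter works with dplyr::filter() if you are already in a tidyverse pipeline.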

Scraping table from timeanddate.com using R

I'm trying to scrape weather data (in R) for the 2nd of March on the following web page: https://www.timeanddate.com/weather/sweden/stockholm/historic?month=3&year=2020 I am interested in the table at the end, below "Stockholm Weather History for..."
Just above and to the right of that table is a drop-down list where I chose the 2nd of March. But when I scrape using RSelenium I only get the data for the 1st of March.
How can I get the data for the 2nd (or any other date except the 1st)?
I have also tried to scrape the entire page using read_html, but I can't find a way to extract the data I want from that.
The following code only seems to work for the 1st, not for any other date in the month.
library(tidyverse)
library(rvest)
library(RSelenium)
library(stringr)
library(dplyr)
rD <- rsDriver(browser="chrome", port=4234L, chromever ="85.0.4183.83")
remDr <- rD[["client"]]
remDr$navigate("https://www.timeanddate.com/weather/sweden/stockholm/historic?month=3&year=2020")
webElems <- remDr$findElements(using = "class name", value = "sticky-wr")
s <- webElems[[1]]$getElementText()
s <- as.character(s)
print(s)
Here's an approach with RSelenium
library(RSelenium)
library(rvest)
driver <- rsDriver(browser="chrome", port=4234L, chromever ="87.0.4280.87")
client <- driver[["client"]]
client$navigate("https://www.timeanddate.com/weather/sweden/stockholm/historic?month=3&year=2020")
client$findElement(using = "link text","Mar 2")$clickElement()
source <- client$getPageSource()[[1]]
read_html(source) %>%
  html_node(xpath = '//*[@id="wt-his"]') %>%
  html_table() %>%
  head()
Conditions Conditions Conditions Comfort Comfort Comfort
1 Time Temp Weather Wind Humidity Barometer Visibility
2 12:20 amMon, Mar 2 39 °F Chilly. 7 mph ↑ 87% 29.18 "Hg N/A
3 12:50 am 37 °F Chilly. 7 mph ↑ 87% 29.18 "Hg N/A
4 1:20 am 37 °F Passing clouds. 7 mph ↑ 87% 29.18 "Hg N/A
5 1:50 am 37 °F Passing clouds. 7 mph ↑ 87% 29.18 "Hg N/A
6 2:20 am 37 °F Overcast. 8 mph ↑ 87% 29.18 "Hg N/A
You can then iterate over dates for findElement().
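A loop sketch for that iteration (the request lines are commented out since they need the live browser session from above; the Sys.sleep pause is an assumption to give the table time to re-render after each click):

```r
# Link texts as they appear on the page for the dates you want
dates <- paste("Mar", 2:5)
dates

# for (d in dates) {
#   client$findElement(using = "link text", d)$clickElement()
#   Sys.sleep(1)  # let the table update before grabbing the page source
#   tbl <- read_html(client$getPageSource()[[1]]) %>%
#     html_node(xpath = '//*[@id="wt-his"]') %>%
#     html_table()
#   # ... store tbl, e.g. in a list keyed by d
# }
```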
You can find the xpath by right clicking on the table and choosing Inspect in Chrome:
Then, you can find the table element, right click and choose Copy > Copy XPath.
It is always useful to use your browser's "developer tools" to inspect the web page and figure out how to extract the information you need.
A couple of tutorials that explain this, which I just googled:
https://towardsdatascience.com/tidy-web-scraping-in-r-tutorial-and-resources-ac9f72b4fe47
https://www.scrapingbee.com/blog/web-scraping-r/
For example, in this particular webpage, when we select a new date in the dropdown list, the webpage sends a GET request to the server, which returns a JSON string with the data of the requested date. Then the webpage updates the data in the table (probably using javascript -- did not check this).
So, in this case you need to emulate this behavior, capture the json file and parse the info in it.
In Chrome, if you look at the developer tool network pane, you will see that the address of the GET request is of the form:
https://www.timeanddate.com/scripts/cityajax.php?n=sweden/stockholm&mode=historic&hd=YYYYMMDD&month=M&year=YYYY&json=1
where YYYY stands for year with 4 digits, MM(M) month with two (one) digits, and DD day of the month with two digits.
So you can set your code to do the GET request directly to this address, get the json response and parse it accordingly.
library(rjson)
library(rvest)
library(plyr)
library(dplyr)
year <- 2020
month <- 3
day <- 7
# create formatted url with desired dates
url <- sprintf('https://www.timeanddate.com/scripts/cityajax.php?n=sweden/stockholm&mode=historic&hd=%4d%02d%02d&month=%d&year=%4d&json=1', year, month, day, month, year)
webpage <- read_html(url) %>% html_text()
# json string is not formatted the way fromJSON function needs
# so I had to parse it manually
# split string on each row
x <- strsplit(webpage, "\\{c:")[[1]]
# remove first element (garbage)
x <- x[2:length(x)]
# clean last 2 characters in each row
x <- sapply(x, FUN=function(xx){substr(xx[1], 1, nchar(xx[1])-2)}, USE.NAMES = FALSE)
# function to get actual data in each row and put it into a dataframe
parse.row <- function(row.string) {
  # parse columns using '},{' as divider
  a <- strsplit(row.string, '\\},\\{')[[1]]
  # remove some leftover characters from parsing
  a <- gsub('\\[\\{|\\}\\]', '', a)
  # remove what I think is metadata
  a <- gsub('h:', '', gsub('s:.*,', '', a))
  df <- data.frame(time = a[1], temp = a[3], weather = a[4], wind = a[5],
                   humidity = a[7], barometer = a[8])
  return(df)
}
# use ldply to run function parse.row for each element of x and combine the results in a single dataframe
df.final <- ldply(x, parse.row)
Result:
> head(df.final)
time temp weather wind humidity barometer
1 "12:20 amSat, Mar 7" "28 °F" "Passing clouds." "No wind" "100%" "29.80 \\"Hg"
2 "12:50 am" "28 °F" "Passing clouds." "No wind" "100%" "29.80 \\"Hg"
3 "1:20 am" "28 °F" "Passing clouds." "1 mph" "100%" "29.80 \\"Hg"
4 "1:50 am" "30 °F" "Passing clouds." "2 mph" "100%" "29.80 \\"Hg"
5 "2:20 am" "30 °F" "Passing clouds." "1 mph" "100%" "29.80 \\"Hg"
6 "2:50 am" "30 °F" "Low clouds." "No wind" "100%" "29.80 \\"Hg"
I left everything as strings in the data frame, but you can convert the columns to numeric or dates as you need.
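For instance, a small sketch of that conversion on a hypothetical one-row stand-in for df.final (values copied from the output above, including the stray quotes left by the manual parsing):

```r
# One-row stand-in for df.final, quotes and all
df <- data.frame(
  time     = '"12:20 amSat, Mar 7"',
  temp     = '"28 °F"',
  humidity = '"100%"',
  stringsAsFactors = FALSE
)

# Drop the surrounding quote characters
strip_quotes <- function(x) gsub('^"|"$', '', x)

# Numeric temperature in °F and humidity in percent
df$temp_f       <- as.numeric(gsub('[^0-9.-]', '', strip_quotes(df$temp)))
df$humidity_pct <- as.numeric(sub('%', '', strip_quotes(df$humidity)))
```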

how to clean irregular strings & organize them into a dataframe with the right columns

I have two long strings that look like this in a vector:
x <- c("Job Information\n\nLocation: \n\n\nScarsdale, New York, 10583-3050, United States \n\n\n\n\n\nJob ID: \n53827738\n\n\nPosted: \nApril 22, 2020\n\n\n\n\nMin Experience: \n3-5 Years\n\n\n\n\nRequired Travel: \n0-10%",
"Job Information\n\nLocation: \n\n\nGlenview, Illinois, 60025, United States \n\n\n\n\n\nJob ID: \n53812433\n\n\nPosted: \nApril 21, 2020\n\n\n\n\nSalary: \n$110,000.00 - $170,000.00 (Yearly Salary)")
and my goal is to neatly organized them in a dataframe (output form) something like this:
#View(df)
Location Job ID Posted Min Experience Required Travel Salary
[1] Scarsdale,... 53827738 April 22... 3-5 Years 0-10% NA
[2] Glenview,... 53812433 April 21... NA NA $110,000.00 - $170,000.00 (Yearly Salary)
(...) was done to present the dataframe here neatly.
However, as you can see, the two strings don't necessarily have the same attributes. For example, the first string has Min Experience and Required Travel, but in the second string those fields don't exist, while it has Salary instead. So this is getting very tricky for me. I thought I would split on the \n characters, but their number is not consistent: some fields are separated by two newlines, others by 4 or 5. I was wondering if someone can help me out. I will appreciate it!
We can split each string on one or more '\n' ('\n{1,}') and drop the first word of each (which is 'Job Information'), as we don't need it anywhere (x <- x[-1]). The remaining parts of the string come in pairs of the form columnname - columnvalue. We create a dataframe from each string using alternating indices, and bind_rows combines all of them by name.
dplyr::bind_rows(sapply(strsplit(gsub(':', '', x), '\n{1,}'), function(x) {
  x <- x[-1]
  setNames(as.data.frame(t(x[c(FALSE, TRUE)])), x[c(TRUE, FALSE)])
}))
# Location Job ID Posted Min Experience
#1 Scarsdale, New York, 10583-3050, United States 53827738 April 22, 2020 3-5 Years
#2 Glenview, Illinois, 60025, United States 53812433 April 21, 2020 <NA>
# Required Travel Salary
#1 0-10% <NA>
#2 <NA> $110,000.00 - $170,000.00 (Yearly Salary)
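The x[c(TRUE, FALSE)] recycling idiom that pairs names with values above is worth seeing in isolation, on a toy key/value vector:

```r
# A flattened key/value vector, like one split job listing
v <- c("Location", "Scarsdale, New York", "Job ID", "53827738")

keys <- v[c(TRUE, FALSE)]  # logical index recycles: odd positions
vals <- v[c(FALSE, TRUE)]  # even positions

pairs <- setNames(vals, keys)
pairs[["Job ID"]]  # "53827738"
```

Because R recycles the two-element logical vector along v, this selects every other element without computing indices by hand.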

Is there a R function to remove rows that do NOT start with a time stamp?

I am trying to familiarize myself with R by cleaning some data from a WhatsApp chat between myself and a friend. So far, I have converted the .txt to a .csv,
but I have an issue.
I would like my rows to be formatted like this:
,,
When a chat message is too long, it continues on a new line (row). So then I end up with a row like:
How can I clean the file so that all my rows start with a timestamp?
I have been trying to work with regular expressions. I have been following a tutorial https://journocode.com/2016/01/31/project-visualizing-whatsapp-chat-logs-part-1-cleaning-data/ but the results are not what I expected
# Add 5 empty rows to end to make space for shift
chat <- cbind(chat, matrix(nrow = nrow(chat), ncol = 5))
cat("Rows without time stamp:", length(grep("^\\D", chat[,1])),
"(", grep("^\\D", chat[,1]), ")", "\n")
for(row in grep("^\\D", chat[,1])){
end <- which(is.na(chat[row,]))[1] #first column without text in it
chat[row, 6:(5+end)] <- chat[row, 1:(end-1)]
chat[row, 1:(end-1)] <- NA
}
chat <- chat[-which(apply(chat, 1, function(x) all(is.na(x))) == TRUE),]
I end up with a very messy csv file: time stamps all over, chat text all over. Definitely not the outcome I had in mind.
I wrote a package a little while ago to take care of WhatsApp data. I will use the important parts of the source code and the example data to show you how you can do that on your own. First, let's grab some example data:
chat_raw <- scan(text = "
12/07/2017, 22:35 - Messages to this group are now secured with end-to-end encryption. Tap for more info.
12/07/2017, 22:35 - You created group 'Tes'
12/07/2017, 22:35 - Johannes Gruber: <Media omitted>
12/07/2017, 22:35 - Johannes Gruber: Fruit bread with cheddar <U+263A><U+0001F44C><U+0001F3FB>
13/07/2017, 09:12 - Test: It's fun doing text analysis with R
isn't it?
13/07/2017, 09:16 - Johannes Gruber: Haha it sure is <U+0001F605>
28/09/2018, 13:27 - Johannes Gruber: Did you know there is an incredible number of emojis in WhatsApp? Check it out:
", what = character(), sep = "\n")
This leaves us with an object that is just like what we would get by using readLines(): every line of text is one element of the character vector. Now we can extract the timestamps using regular expressions:
time <- stringi::stri_extract_first_regex(
str = chat_raw,
pattern = "^\\d{2}/\\d{2}/\\d{4}, \\d{2}:\\d{2}"
)
\\d{2} matches a number with two digits, \\d{4} a number with four digits. You will have to alter the characters in between the numbers to get your correct date format. I use stringi here for speed, but many people find stringr more convenient, and its functions work in pretty much the same way. Now the time vector looks like this:
time
#> [1] "12/07/2017, 22:35" "12/07/2017, 22:35" "12/07/2017, 22:35"
#> [4] "12/07/2017, 22:35" "13/07/2017, 09:12" NA
#> [7] "13/07/2017, 09:16" "28/09/2018, 13:27"
We grabbed a time from every line but the one without a timestamp. We can loop through all the lines without a timestamp and add the characters in there to the line before:
for (l in which(is.na(time))) {
chat_raw[l - 1] <- stringi::stri_paste(chat_raw[l - 1], chat_raw[l],
sep = " ")
}
which(is.na(time)) in this case will return only 6, as this is the only line where time is NA. So you can read chat_raw[l - 1] as chat_raw[5], i.e. the fifth line of chat_raw. stringi::stri_paste is the same as paste(), hence line 6 is appended to line 5. You can choose a different separator if you want; I chose "\n" to mark a newline in my package. Now the chat_raw and time vectors still have this additional element, which is useless to us now. We can remove it with:
chat_raw <- chat_raw[!is.na(time)]
time <- time[!is.na(time)]
To get this into a nice format, let's make a data.frame out of it:
tibble::tibble(
time = time,
text = chat_raw
)
#> # A tibble: 7 x 2
#> time text
#> <chr> <chr>
#> 1 12/07/2017, 22~ 12/07/2017, 22:35 - Messages to this group are now secur~
#> 2 12/07/2017, 22~ 12/07/2017, 22:35 - You created group 'Tes'
#> 3 12/07/2017, 22~ 12/07/2017, 22:35 - Johannes Gruber: <Media omitted>
#> 4 12/07/2017, 22~ 12/07/2017, 22:35 - Johannes Gruber: Fruit bread with ch~
#> 5 13/07/2017, 09~ 13/07/2017, 09:12 - Test: It's fun doing text analysis w~
#> 6 13/07/2017, 09~ 13/07/2017, 09:16 - Johannes Gruber: Haha it sure is <U+~
#> 7 28/09/2018, 13~ 28/09/2018, 13:27 - Johannes Gruber: Did you know there ~
Here you go, nice and clean output :)
If you want to do more with your whatsapp data, work through the demo of my package. I haven't posted it on CRAN since I think the contribution is a bit small so far but if you can think of cool functions, I can add them and maybe over time this becomes a legitimate package.
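One follow-up that often helps: parsing the time column into real timestamps with base R's as.POSIXct (the format string below matches the dd/mm/yyyy, HH:MM layout in the output above):

```r
# Two of the timestamp strings extracted above
time <- c("12/07/2017, 22:35", "13/07/2017, 09:12")

parsed <- as.POSIXct(time, format = "%d/%m/%Y, %H:%M", tz = "UTC")
format(parsed[1], "%Y-%m-%d %H:%M")  # "2017-07-12 22:35"
```

With real POSIXct values you can sort, diff, and bin the messages by hour or day instead of comparing strings.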

how to extract text from anchor tag inside div class in r

I am trying to fetch text from an anchor tag, which is embedded in a div tag. Following is the link to the website: http://mmb.moneycontrol.com/forum-topics/stocks-1.html
The text I want to extract is Mawana Sugars
So I want to extract all the stocks names listed on this website and description of it.
Here is my attempt to do it in R
library(XML)

doc <- htmlParse("http://mmb.moneycontrol.com/forum-topics/stocks-1.html")
xpathSApply(doc, "//div[@class='clearfix PR PB5']//text()", xmlValue)
But, it does not return anything. How can I do it in R?
My answer is essentially the same as the one I just gave here.
The data is dynamically loaded, and cannot be retrieved directly from the html. But, looking at "Network" in Chrome DevTools for instance, we can find a nicely formatted JSON at http://mmb.moneycontrol.com/index.php?q=topic/ajax_call&section=get_messages&offset=&lmid=&isp=0&gmt=cat_lm&catid=1&pgno=1
To get you started:
library(jsonlite)
dat <- fromJSON("http://mmb.moneycontrol.com/index.php?q=topic/ajax_call&section=get_messages&offset=&lmid=&isp=0&gmt=cat_lm&catid=1&pgno=1")
Output looks like:
dat[1:3, c("msg_id", "user_id", "topic", "heading", "flag", "price", "message")]
# msg_id user_id topic heading flag
# 1 47730730 liontrade NMDC Stocks APR
# 2 47730726 agrawalknath Glenmark Glenmark APR
# 3 47730725 bissy91 Infosys Stocks APR
# price
# 1 Price when posted : BSE: Rs. 127.90 NSE: Rs. 128.15
# 2 Price when posted : NSE: Rs. 714.10
# 3 Price when posted : BSE: Rs. 956.50 NSE: Rs. 955.00
# message
# 1 There is no mention of dividend in the announcement.
# 2 Eagerly Waiting for 670 to 675 to BUY second phase of Buying in Cash Delivery. Already Holding # 800.
# 3 6 ✂ ✂--Don t Pay High Brokerage While Trading. Take Delivery Free & Rs 20 to trade in any size - Join Today .👉 goo.gl/hDqLnm
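If you need more than the first page of messages, the pgno parameter in that URL suggests you can page through results. A sketch (the request lines are commented out since they need network access, and the column selection is illustrative):

```r
# URL template with a format placeholder for the page number
base_url <- "http://mmb.moneycontrol.com/index.php?q=topic/ajax_call&section=get_messages&offset=&lmid=&isp=0&gmt=cat_lm&catid=1&pgno=%d"
urls <- sprintf(base_url, 1:3)  # first three pages

# pages    <- lapply(urls, jsonlite::fromJSON)             # needs network access
# all_msgs <- do.call(rbind, lapply(pages, `[`,
#                     c("msg_id", "topic", "heading", "message")))
```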
