I am using the WFP country website (http://www1.wfp.org/countries), aiming to web-scrape it in order to build a dataset containing the news issued periodically there, without clicking page after page.
Furthermore, I would like to add some columns, including keyword counts.
Leaving aside the part of the script containing the countries and the URLs, I will focus on the scraping itself.
At the moment, I am using a bunch of packages:
library(rvest)
library(stringr)
library(tidyr)
library(data.table)
library(plyr)
library(xml2)
library(selectr)
library(tibble)
library(purrr)
library(datapasta)
library(jsonlite)
library(countrycode)
library(httr)
library(stringi)
library(tidyverse)
library(dplyr)
library(XML)
I have prepared the dataset for another website and it seems to work well.
A helper here suggested a quite elegant solution for the thing and I have integrated it with my previous work on the country part and everything works well in that. Nevertheless, the solution does not seem to comply with my present need.
So far, I have this:
## 11. Creating a function in order to scrape data from a website (in this case, WFP's)
wfp_get_news <- function(iso3) {
  res <- GET(
    url = "http://www1.wfp.org/countries/common/allnews/en/",
    query = list(iso3 = iso3)
  )
  warn_for_status(res)
  if (status_code(res) > 399) return(NULL)
  out <- content(res, as = "text", encoding = "UTF-8")
  out <- jsonlite::fromJSON(out)
  out$iso3 <- iso3
  tbl_df(out)
}
## 12. Setting all the Country urls in order for them to be automatically scraped
pb <- progress_estimated(length(countrycode_data$iso3c)) # THIS TAKES LONG TO BE PROCESSED
map_df(countrycode_data$iso3c, ~{
pb$tick()$print()
Sys.sleep(5)
wfp_get_news(.x)
}) -> xdf
## 13. Setting keywords (of course, this process is arbitrary: one can decide any keyword s/he prefers)
keywords <- c("drought", "food security")
keyword_regex <- sprintf("(%s)", paste0(keywords, collapse="|"))
## 14. Setting the keywords search
bind_cols(
xdf,
stri_match_all_regex(tolower(xdf$bodytext), keyword_regex) %>%
map(~.x[,2]) %>%
map_df(~{
res <- table(.x, useNA="always")
nm <- names(res)
nm <- ifelse(is.na(nm), "NONE", stri_replace_all_regex(nm, "[ -]", "_"))
as.list(set_names(as.numeric(res), nm))
})
) %>%
select(-NONE) -> xdf_with_keyword_counts
In particular, when I run point 14 of the script, I get the following error message:
Error in overscope_eval_next(overscope, expr) :
object "NONE" not found
Furthermore: Warning message:
Unknown or uninitialised column: 'bodytext'.
Instead, the expected result should be, more or less:
> glimpse(xdf_with_keyword_counts)
Observations: 12,375
Variables: 12
$ uid <chr> "1071595", "1069933", "1069560", "1045264", "1044139", "1038339", "405003", "1052711", NA, "1062329", "1045248", "...
$ table <chr> "news", "news", "news", "news", "news", "news", "news", "news", NA, "news", "news", "news", "news", "news", NA, "n...
$ title <chr> "Conflicts and drought spur hunger despite strong global food supply", "FAO Calls for Stronger Collaboration on Tr...
$ date <chr> "1512640800", "1511823600", "1511737200", "1508191200", "1508104800", "1505980800", "1459461600", "1293836400", NA...
$ bodytext <chr> " 7 December 2017, Rome- Strong cereal harvests are keeping global food supplies buoyant, but localised drought, f...
$ date_format <chr> "07/12/2017", "28/11/2017", "27/11/2017", "17/10/2017", "16/10/2017", "21/09/2017", "01/04/2016", "01/01/2011", NA...
$ image <chr> "http://www.wfp.org...", "http://www.wfp.org...
$ pid <chr> "2330", "50840", "16275", "70992", "16275", "2330", "40990", "40990", NA, "53724", "53724", "2330", "53724", "5084...
$ detail_pid <chr> "/news/story/en/item/1071595/icode/", "/neareast/news/view/en/c/1069933/", "/asiapacific/news/detail-events/en/c/1...
$ iso3 <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "ALA", "ALB", "ALB", "ALB", "ALB", "DZA", "ASM", "AND", "A...
$ drought <dbl> 1, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ food_security <dbl> NA, NA, NA, 2, 1, NA, 1, NA, NA, NA, 1, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
I hope I made myself quite clear.
Any clue?
I think you hit one of the "gotchas" of web scraping: they removed this functionality/these paths on the website.
Try going to http://www1.wfp.org/countries/common/allnews/en/iso=SLV (El Salvador's news page from the URL scheme you were using a couple of days ago). It doesn't exist.
But, if you go to http://www1.wfp.org/countries/el-salvador, there's a link to http://www.wfp.org/news/el-salvador-177 on that page, which is the El Salvador news items.
I think it's the same content, just presented differently, so it's just a matter of attacking it differently:
library(rvest)
library(httr)
library(stringi)
library(tidyverse)
This is a helper so we can get their country IDs and name mappings:
get_countries <- function() {
pg <- read_html("http://www.wfp.org/news/news-releases?tid=All&tid_2=All")
# find the country popup
country_sel <- html_nodes(pg, "select[name='tid'] option")
# extract ids and name for each country, ignoring "All"
data_frame(
cid = html_attr(country_sel, "value"),
cname = html_text(country_sel)
) %>%
filter(stri_detect_regex(cid, "[[:digit:]]"))
}
This is a helper to get the news content on a page:
get_news <- function(cid, tid) {
GET("http://www.wfp.org/news/news-releases",
query=list(tid=cid, tid_2=tid)) -> res
warn_for_status(res)
if (status_code(res) > 200) return(NULL)
res <- content(res, as="parsed")
# check for no stories by testing for the presence of the
# div that has the "no stories are found" text
if (length(html_node(res, "div.view-empty")) != 0) return(NULL)
# find the news item boxes on this page
items <- html_nodes(res, "div.list-page-item")
# extract the contents
data_frame(
cid = cid,
tid = tid,
# significant inconsistency in how they assign CSS classes to date boxes
date = html_text(html_nodes(items, xpath=".//div[contains(@class, 'box-date')]"), trim=TRUE),
title = html_text(html_nodes(items, "h3"), trim=TRUE),
# how & where they put summary text in the div is also inconsistent, so we
# need to (unfortunately) include the date and title to ensure we capture it.
# we could get just the text, but it's more complex code.
summary = html_text(items, trim=TRUE),
link = html_attr(html_nodes(items, "h3 a"), "href")
)
}
Now, we iterate over the countries and get all the stories:
country_df <- get_countries()
pb <- progress_estimated(length(country_df$cid))
map_df(country_df$cid, ~{
pb$tick()$print()
get_news(.x, "All")
}) -> news_df
# add in country names
mutate(news_df, cid = as.character(cid)) %>%
left_join(country_df) -> news_df
glimpse(news_df)
## Observations: 857
## Variables: 7
## $ cid <chr> "120", "120", "120", "120", "120", "120", "120", "120", "120", "120"...
## $ tid <chr> "All", "All", "All", "All", "All", "All", "All", "All", "All", "All"...
## $ date <chr> "26 October 2017", "16 October 2017", "2 October 2017", "10 July 201...
## $ title <chr> "US Contribution To Boost WFP Food Assistance And Local Economy In A...
## $ summary <chr> "26 October 2017\t\t\r\n\t\t\r\n\tUS Contribution To Boost WFP Food ...
## $ link <chr> "/news/news-release/us-contribution-boost-wfp-food-assistance-and-lo...
## $ cname <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "Afghani...
You still need to try to classify this by adapting the other code you have, and you can use the link in the data frame to mine more text for said classification.
NOTE: this only gets the most recent news page for each country but that's pretty much what you want to do anyway (check for net-new & classify them).
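For instance, a minimal sketch of how your step 14 could be adapted to this new data frame (this assumes the news_df built above, uses its summary column in place of bodytext, and reuses the keywords vector from your step 13; news_with_keyword_counts is a name made up for the example):
keywords <- c("drought", "food security")
keyword_regex <- sprintf("(%s)", paste0(keywords, collapse="|"))
bind_cols(
  news_df,
  stri_match_all_regex(tolower(news_df$summary), keyword_regex) %>%
    map(~.x[,2]) %>%
    map_df(~{
      res <- table(.x, useNA="always")
      nm <- names(res)
      nm <- ifelse(is.na(nm), "NONE", stri_replace_all_regex(nm, "[ -]", "_"))
      as.list(set_names(as.numeric(res), nm))
    })
) %>%
  # matches() is used instead of -NONE so this doesn't error when every
  # summary contains a keyword and no NONE column gets created
  select(-matches("^NONE$")) -> news_with_keyword_counts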
Now, we can try to auto-classify stories by looping through country & pop-up topics list since those topics seem to be what you care about (some of them). You'll need to trust that they tagged things well.
NOTE: This is going to take a long time, especially with the "being kind" delay, which is why I only scaffolded it and didn't run it apart from a light test to ensure it worked:
# get topic ids
get_topics <- function() {
pg <- read_html("http://www.wfp.org/news/news-releases?tid=All&tid_2=All")
# find the topic popup
country_sel <- html_nodes(pg, "select[name='tid_2'] option")
# extract ids and name for each topic, ignoring "All" and sub-topics
# i.e. ignore ones that begin with "-"
data_frame(
tid = html_attr(country_sel, "value"),
tname = html_text(country_sel)
) %>%
filter(stri_detect_regex(tid, "[[:digit:]]")) %>%
filter(tid != "All") # exclude "All" since we're trying to auto-tag
}
topics_df <- get_topics()
pb <- progress_estimated(length(country_df$cid))
map_df(country_df$cid, ~{
pb$tick()$print()
cid <- .x
Sys.sleep(5) ## NOTE THIS SHOULD REALLY GO IN get_news() but I didn't want to mess with that function for this extra part of the example
map_df(topics_df$tid, ~get_news(cid, .x))
}) -> news_with_tagged_topics_df
mutate(news_with_tagged_topics_df, tid = as.character(tid), cid = as.character(cid)) %>%
left_join(topics_df) %>%
left_join(country_df) %>%
glimpse()
I ran it for a random sample of 3 countries:
## Observations: 11
## Variables: 8
## $ cid <chr> "4790", "4790", "4790", "4790", "4790", "4790", "4790", "152", "152"...
## $ tid <chr> "4488", "3929", "3929", "995", "999", "1005", "1005", "997", "995", ...
## $ date <chr> "16 December 2014", "2 September 2016", "1 October 2014", "1 October...
## $ title <chr> "Russia & WFP Seal Partnership To End Hunger; Kamaz Trucks Rolled Ou...
## $ summary <chr> "16 December 2014\t\t\r\n\t\t\r\n\tRussia & WFP Seal Partnership To ...
## $ link <chr> "/news/news-release/russia-wfp-seal-partnership-end-hunger-kamaz-tru...
## $ tname <chr> "Executive Director", "Centre of Excellence against Hunger", "Centre...
## $ cname <chr> "Brazil", "Brazil", "Brazil", "Brazil", "Brazil", "Brazil", "Brazil"...
and it did pick up a diversity of tags:
unique(news_with_tagged_topics_df$tname)
## [1] "Executive Director" "Centre of Excellence against Hunger"
## [3] "Nutrition" "Procurement"
## [5] "School Meals" "Logistics"
Related
I have two data frames:
df<-structure(list(`Active Contact*` = "Entries must be in a Yes or No format. Only active staff may be added to a protocol.",
`First Name*` = "Free text field. [255]", `Middle Name
` = "Free text field. [255]",
`Last Name*` = "Free text field. [255]", `Email**
` = "This field is required when the contact is a user or the contact has any of the Receives Broadcast Emails, Receives Notifications, or Receives Administrative System Notifications settings set to Yes.\r\nThis field must be mapped if Email is selected in the Unique Identifier field. Entries must be unique across all contacts (both active and inactive) and must be in a valid email format (abc#efg.zyx). [254]"), row.names = c(NA,
-1L), class = c("tbl_df", "tbl", "data.frame"))
df2<-structure(list(ActiveContact = c("Yes", "Yes", "Yes", "Yes",
"Yes", "Yes", "Yes"), fname = c("practice", "practice", "practice",
"practice", "practice", "practice", "practice"), middlename = c(NA,
NA, NA, NA, NA, NA, NA), lname = c("PI", "research nurse", "research nurse",
"research nurse", "regulatory", "regulatory", "regulatory"),
email = c("ppi#lifespan.org", "prn#lifespan.org", "prn#lifespan.org",
"prn#lifespan.org", "preg#lifespan.org", "preg#lifespan.org",
"preg#lifespan.org")), row.names = c(NA, -7L), class = c("tbl_df",
"tbl", "data.frame"))
I need to use the column names from df, and also the first row from df, as the column names and first row in df2 (replacing the column names of df2, and pushing the first row of df2 down one row to fit).
My expected output would be:
I know the column names are terrible (weird symbols and spaces and things I hate), and also I know the first row that I need is full of all sorts of stuff I typically hate, but I need this for my output format.
Thank you!
You can try to row-bind them, renaming the columns of df2 at the same time:
rbind(df, setNames(df2, names(df)))
Output:
`Active Contact*` `First Name*` `Middle Name\n ~ `Last Name*` `Email**\n \n ~
<chr> <chr> <chr> <chr> <chr>
1 Entries must be in a Yes or No fo~ Free text field~ Free text field. [255] Free text fie~ "This field is required when the contact is a us~
2 Yes practice NA PI "ppi@lifespan.org"
3 Yes practice NA research nurse "prn@lifespan.org"
4 Yes practice NA research nurse "prn@lifespan.org"
5 Yes practice NA research nurse "prn@lifespan.org"
6 Yes practice NA regulatory "preg@lifespan.org"
7 Yes practice NA regulatory "preg@lifespan.org"
8 Yes practice NA regulatory "preg@lifespan.org"
Alternatively, rename the columns of df2 first and then bind:
names(df2) <- names(df)
df3 <- rbind(df, df2)
I have this sample dataset:
structure(list(Title = c("Isn't It Romantic", "Isn't It Romantic",
"Isn't It Romantic", "Isn't It Romantic", "Isn't It Romantic",
"Isn't It Romantic", "Gully Boy", "Gully Boy", "Gully Boy", "Gully Boy",
"Gully Boy", "Gully Boy", "The Wandering Earth", "The Wandering Earth",
"The Wandering Earth", "The Wandering Earth", "The Wandering Earth",
"The Wandering Earth", "How to Train Your Dragon: The Hidden World",
"How to Train Your Dragon: The Hidden World", "How to Train Your Dragon: The Hidden World",
"How to Train Your Dragon: The Hidden World", "How to Train Your Dragon: The Hidden World",
"How to Train Your Dragon: The Hidden World", "American Woman",
"American Woman", "Us", "Us", "Us", "Us", "Us", "Us", "The Wolf's Call",
"The Wolf's Call", "Avengers: Endgame", "Avengers: Endgame",
"Avengers: Endgame", "Avengers: Endgame", "Avengers: Endgame",
"Avengers: Endgame", "The Silence", "The Silence", "The Silence",
"The Silence", "The Silence", "The Silence", "My Little Pony: Equestria Girls: Spring Breakdown",
"My Little Pony: Equestria Girls: Spring Breakdown"), Ratings = c("Internet Movie Database",
"5.9/10", "Rotten Tomatoes", "68%", "Metacritic", "60/100", "Internet Movie Database",
"8.4/10", "Rotten Tomatoes", "100%", "Metacritic", "65/100",
"Internet Movie Database", "6.4/10", "Rotten Tomatoes", "74%",
"Metacritic", "62/100", "Internet Movie Database", "7.6/10",
"Rotten Tomatoes", "91%", "Metacritic", "71/100", "Rotten Tomatoes",
"57%", "Internet Movie Database", "7.1/10", "Rotten Tomatoes",
"94%", "Metacritic", "81/100", "Internet Movie Database", "7.6/10",
"Internet Movie Database", "8.7/10", "Rotten Tomatoes", "94%",
"Metacritic", "78/100", "Internet Movie Database", "5.2/10",
"Rotten Tomatoes", "23%", "Metacritic", "25/100", "Internet Movie Database",
"7.7/10")), row.names = c(NA, -48L), class = c("tbl_df", "tbl",
"data.frame"))
The Ratings column has 3 different types of Ratings (Imdb, Rotten Tomatoes and Metacritic) for each movie, spread out over 6 rows for each movie.
I'd like to wrangle this dataset so that for each movie there is a new column called rottentomatoes_rating whose value is that movie's Rotten Tomatoes rating. So, in my sample dataset, the movie Isn't It Romantic would have 68% under rottentomatoes_rating, Gully Boy would have 100%, etc.
For those movies that don't have a rottentomatoes_rating, then I'd like to put NA under rottentomatoes_rating.
I've thought about using spread in tidyr, but I can't quite figure out how to do so since in my case, the variable and values are all in the same column!
Assuming your dataset is called dt, you can use this process to get a tidy version of your dataset:
library(tidyverse)
# specify indexes of Rating companies
ids = seq(1, nrow(dt), 2)
# get rows of Rating companies
dt %>% slice(ids) %>%
# combine with the rating values
cbind(dt %>% slice(-ids) %>% select(RatingsValue = Ratings)) %>%
# reshape dataset
spread(Ratings, RatingsValue)
# Title Year Rated Released Runtime Internet Movie Database Metacritic Rotten Tomatoes
# 1 Gully Boy 2019 Not Rated 2019-02-14 153 min 8.4/10 65/100 100%
# 2 Isn't It Romantic 2019 PG-13 2019-02-13 89 min 5.9/10 60/100 68%
If the data is formatted similarly throughout your dataset, the following code should work:
df %>% group_by(Title) %>%
  # use Ratings (evaluated per group), not df$Ratings (the whole column)
  slice(match("Rotten Tomatoes", Ratings) + 1) %>%
  rename(rottentomatoes_rating = Ratings)
This gives:
# A tibble: 2 x 6
# Groups: Title [2]
Title Year Rated Released Runtime rottentomatoes_rating
<chr> <chr> <chr> <date> <chr> <chr>
1 Gully Boy 2019 Not Rated 2019-02-14 153 min 100%
2 Isn't It Romantic 2019 PG-13 2019-02-13 89 min 68%
For movies that don't have a Rotten Tomatoes rating: note that match() then returns NA within the group, and slice(NA + 1) silently drops that group, so those titles disappear from the result rather than getting NA.
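If you do need those titles back with an explicit NA, a minimal sketch (rt is a name made up here for the grouped result above):
library(tidyverse)
rt <- df %>%
  group_by(Title) %>%
  slice(match("Rotten Tomatoes", Ratings) + 1) %>%
  rename(rottentomatoes_rating = Ratings) %>%
  ungroup()
# titles with no "Rotten Tomatoes" row were dropped by slice();
# joining back onto the full list of titles restores them with NA
df %>%
  distinct(Title) %>%
  left_join(rt, by = "Title")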
sumshyftw's answer is good.
But here is a data.table version if you simply want to get the Rotten Tomatoes percentages:
dt <- dt[dt$Ratings %like% "%", ]
setnames(dt, "Ratings", "rottentomatoes_rating")
Output :
# A tibble: 2 x 6
Title Year Rated Released Runtime rottentomatoes_rating
<chr> <chr> <chr> <date> <chr> <chr>
1 Isn't It Romantic 2019 PG-13 2019-02-13 89 min 68%
2 Gully Boy 2019 Not Rated 2019-02-14 153 min 100%
I used %like% "%" because I assume that the full data is just like your example.
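As a quick sanity check of that assumption: %like% is a wrapper around grepl(), and "%" has no special meaning in the pattern, so the subset keeps exactly the rows whose rating text contains a literal percent sign:
library(data.table)
c("68%", "5.9/10", "60/100") %like% "%"
## [1]  TRUE FALSE FALSE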
A new version that fills in NA values when blank:
# using data.table
library(data.table)
dt <- as.data.table(df)
# Index will hold whether the row is a Provider eg Rotten Tomatoes, or a value
dt[, Index:=rep(c("Provider", "Value"), .N/2)]
# Need an index to bind these together
dt[, Provider.Id:=rep(1:(.N/2), each=2), by=Title]
dt[1:6,]
# segment out the Provider & Values in to columns
out <- dcast(dt, Title+Provider.Id~Index, value.var = "Ratings")
out[, Provider.Id := NULL] # drop the helper index; Provider is still needed for the next dcast
# now convert to full wide format
out_df <- as.data.frame(dcast(out, Title~Provider, value.var="Value", fill=NA))
out_df
To get all metrics with data.table:
# using data.table
library(data.table)
dt <- as.data.table(df)
# groups the data set with by, and extracts the Ratings
# makes use of logic that the odd indeces hold the name of the provider,
# the even ones hold the values. Only works if this holds.
# It can probably be optimised a bit. dcast converts from long to required wide
# format
splitRatings <- function(Ratings){
# e.g. Ratings=dt$Ratings[1:6]
N <- length(Ratings)
split_dt <- data.table(DB=Ratings[1:N %% 2 == 1],
Values=Ratings[1:N %% 2 == 0])
out <- dcast(split_dt, .~DB, value.var = "Values")
out[, ".":=NULL]
out
}
# applies the function based on the by clause, returning the table embedded
dt2 <- dt[, splitRatings(Ratings), by=.(Title, Year, Rated, Released, Runtime)]
# convert back
out <- as.data.frame(dt2)
Here is one more version: flag the rows whose Ratings contain digits as values, fill each provider row upward with the value below it, then drop the value rows and spread:
df %>%
mutate(Value = ifelse(str_detect(Ratings, "\\d"), Ratings, NA)) %>%
fill(Value, .direction = "up") %>%
filter(!str_detect(Ratings, "\\d")) %>%
spread(Ratings, Value)
After running this expression:
good.rows<-ifelse(nchar(ufo$DateOccurred)!=10 | nchar(ufo$DateReported)!=10,
FALSE, TRUE)
I expected to get a vector of Booleans, but I got:
length(good.rows)
[1] 0
This shows as logical (empty) in RStudio. What can I do to solve this?
dput(head(ufo))
"structure(list(DateOccured = structure(c(9412, 9413, 9131, 9260,
9292, 9428), class = "Date"), DateReported = structure(c(9412,
9414, 9133, 9260, 9295, 9427), class = "Date"), Location = c(" Iowa City, IA",
" Milwaukee, WI", " Shelton, WA", " Columbia, MO", " Seattle, WA",
" Brunswick County, ND"), ShortDescription = c(NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_
), Duration = c(NA, "2 min.", NA, "2 min.", NA, "30 min."), LongDescription = c("Man repts. witnessing \"flash, followed by a classic UFO, w/ a tailfin at back.\" Red color on top half of tailfin. Became triangular.",
"Man on Hwy 43 SW of Milwaukee sees large, bright blue light streak by his car, descend, turn, cross road ahead, strobe. Bizarre!",
"Telephoned Report:CA woman visiting daughter witness discs and triangular ships over Squaxin Island in Puget Sound. Dramatic. Written report, with illustrations, submitted to NUFORC.",
"Man repts. son's bizarre sighting of small humanoid creature in back yard. Reptd. in Acteon Journal, St. Louis UFO newsletter.",
"Anonymous caller repts. sighting 4 ufo's in NNE sky, 45 deg. above horizon. (No other facts reptd. No return tel. #.)",
"Sheriff's office calls to rept. that deputy, 20 mi. SSE of Wilmington, is looking at peculiar, bright white, strobing light."
)), row.names = c(NA, 6L), class = "data.frame")
There are a couple of reasons why this could be happening:
1. Your dataset is empty; check this using dim().
2. The columns are not of type character, or the column doesn't exist; check this using class(). Note that in your dput the column is spelled DateOccured, so ufo$DateOccurred is NULL, and nchar(NULL) yields a zero-length result, which would produce exactly the empty logical vector you are seeing.
If both of these check out, try running the nchar(...) statements separately.
Below I've created an example that works correctly, where I've gone through the above-mentioned steps. In future, please provide a reproducible example as part of your question.
# Create sample data
ufo <- data.frame(DateOccurred=c("a","bb","ccc"),
DateReported=c("a","bb","ccc"),
stringsAsFactors = FALSE)
print(ufo)
# Check size of data (make sure data has rows and columns are of type Character)
dim(ufo)
class(ufo$DateOccurred)
class(ufo$DateReported)
# Check nchar statements (Should run without error/warnings)
nchar(ufo$DateOccurred)
nchar(ufo$DateReported)
# Actual
good.rows <- ifelse(nchar(ufo$DateOccurred)!=3 | nchar(ufo$DateReported)!=3,
FALSE, TRUE)
print(good.rows)
length(good.rows)
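Given your dput, a minimal sketch of the direct fix (assuming the misspelled DateOccured column is the one you meant; those columns are of class Date, so compare their character representation):
# as.character() turns a Date into "YYYY-MM-DD", which has 10 characters
good.rows <- nchar(as.character(ufo$DateOccured)) == 10 &
  nchar(as.character(ufo$DateReported)) == 10
length(good.rows)
## [1] 6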
My objective is to replace NAs with the nearest groupings of variables. For instance, let's say there are four variables: A, B, C and Num. Num is a numerical variable, while A, B and C are categorical. Now, if the value of Num is missing for, say, A = Alpha, B = Beta and C = Theta, then I would like to look for other observations with this combination, calculate their mean and replace the NA.
If such combination doesn't exist, I would look for observations with combinations of A = Alpha and B = Beta (hence, the term "nearest groupings"), calculate their mean and substitute it.
If such combination doesn't exist, I would look for all observations classified as A = Alpha, calculate their mean and substitute it.
If this is the only observation, then we will replace it with 0. I have created such scenarios in the test file I am posting herewith.
While my code works well, it's very procedural. I have transitioned from doing C/C++ programming and I am still not used to R's vectorized methods. Hence, I am looking for a method that is:
a) cleaner (no for loops please; less memory and faster). While writing the code, I realized that I am not fully utilizing the power of R programming.
b) easy to understand.
I have added comments in sample output below just for reference.
Input Data:
dput(DFile)
structure(list(Region_SL = c("G1", "G1", "G1", "G1", "G2", "G2",
"G3", "G3", "G3", "G3", "G5", "G5", "G5", "G5", "G5", "G6"),
Country_SV = c("United States", "United States", "United States",
"United States", "United States", "United States", "United States",
NA, NA, NA, "Europe", "UK", "France", "Europe", "Europe",
"Australia"), Product_BU = c("Laptop", "Laptop", "Laptop",
"Laptop", "Laptop", "Laptop", "Laptop", NA, NA, NA, "Pencil",
"Power Cord", "Laptop", "Keyboard", "Mouse", "Motherboard"
), Prob_model3 = c(0, 79647405.9878251, 282615405.328728,
NA, NA, 363419594.065383, 0, 72870592.8458704, 260045174.088548,
369512727.253779, NA, 234, NA, 5, 10, NA)), .Names = c("Region_SL",
"Country_SV", "Product_BU", "Prob_model3"), row.names = c(NA,
16L), class = "data.frame")
Expected Output:
Please note that comments are just for reference. That column isn't required.
dput(Output)
structure(list(Region_SL = c("G1", "G1", "G1", "G1", "G2", "G2",
"G3", "G3", "G3", "G3", "G5", "G5", "G5", "G5", "G5", "G6"),
Country_SV = c("United States", "United States", "United States",
"United States", "United States", "United States", "United States",
"United States", "United States", "United States", "Europe",
"UK", "France", "Europe", "Europe", "Australia"), Product_BU = c("Laptop",
"Laptop", "Laptop", "Laptop", "Laptop", "Laptop", "Laptop",
"Laptop", "Laptop", "Laptop", "Pencil", "Power Cord", "Laptop",
"Keyboard", "Mouse", "Motherboard"), Prob_model3 = c(0, 79647405.9878251,
282615405.328728, 120754270.438851, 363419594.065383, 363419594.065383,
0, 72870592.8458704, 260045174.088548, 369512727.253779,
7.5, 234, 83, 5, 10, 0), Comment = c(NA, NA, NA, "Grouped on G1, Laptop, US; Average of rows 1 to 3",
"Grouped on G2, US, Laptop; Average is the only value in row 6",
NA, NA, NA, NA, NA, "Group of G5, Europe and Pencil are unique; G5 and Europe exist. Average of row 14 and 15",
NA, "Group of G5, France and Laptop is unique; Group of G5 and France is unique as well; Use group of G5 and take average of row 12, 14, 15",
NA, NA, "Unique. Substitute 0")), .Names = c("Region_SL",
"Country_SV", "Product_BU", "Prob_model3", "Comment"), row.names = c(NA,
16L), class = "data.frame")
Here's my code. (The code works well; the expected output is b, which is nothing but the Output posted above without the comments.)
DFile_New <-DFile
DFile_New<-DFile_New %>%
arrange(Region_SL, Country_SV,Product_BU)
#replace categorical variable with the combination above or below the row to complete cases.
DFile_New[,1:3]<- zoo::na.locf(DFile_New[,1:3])
#Create look-up table for means, for each type of combination.
Lookup1<- DFile_New %>%
dplyr::group_by(Region_SL, Country_SV, Product_BU) %>%
dplyr::summarise(count=n(),Mean_prob = mean(Prob_model3,na.rm = TRUE))
Lookup2<-DFile_New %>%
dplyr::group_by(Region_SL, Country_SV) %>%
dplyr::summarise(count=n(),Mean_prob = mean(Prob_model3,na.rm = TRUE))
Lookup3<-DFile_New %>%
dplyr::group_by(Region_SL) %>%
dplyr::summarise(count=n(),Mean_prob = mean(Prob_model3,na.rm = TRUE))
Lookup_Table<-dplyr::bind_rows(Lookup1,Lookup2,Lookup3)
#Get rid of those rows with count = 1
Lookup_Table<-Lookup_Table[!Lookup_Table$count==1,]
colnames(Lookup_Table)[5]<-"Prob_model3"
#Look for combinations based on Region, Country and Product
b<-DFile_New %>%
dplyr::left_join(Lookup_Table,by=c("Region_SL", "Country_SV", "Product_BU"))
b$Prob_model3 <- coalesce(b$Prob_model3.x,b$Prob_model3.y)
#Drop the two columns
b$Prob_model3.x<-NULL
b$Prob_model3.y<-NULL
b$count<-NULL
b<-b[!(is.na(b$Country_SV)&is.na(b$Product_BU)),]
c<-b[is.na(b$Prob_model3),] %>%
dplyr::left_join(Lookup_Table[is.na(Lookup_Table$Product_BU) & !is.na(Lookup_Table$Country_SV),],by=c("Region_SL", "Country_SV")) %>%
dplyr::mutate(Prob_model3 = coalesce(Prob_model3.x,Prob_model3.y)) %>%
dplyr::select(Region_SL:Product_BU.x, Prob_model3)
colnames(c)[3]<-"Product_BU"
colnames(c)[4]<-"Prob_model3"
b<-rbind(b,c)
b%>% unite(Col,Region_SL:Product_BU,sep=".")
c<-b
b<-b[complete.cases(b[4]),]
#Look for combinations based on Region, and Country
c<-c[is.na(c$Prob_model3),] %>%
dplyr::left_join(Lookup_Table[is.na(Lookup_Table$Product_BU) & is.na(Lookup_Table$Country_SV),],by=c("Region_SL")) %>%
dplyr::filter(!is.na(Prob_model3.y)) %>%
dplyr::mutate(Prob_model3.1 = coalesce(Prob_model3.x,Prob_model3.y)) %>%
dplyr::select(Region_SL:Product_BU.x, Prob_model3.1) %>%
unique(.)
colnames(c)[3]<-"Product_BU"
colnames(c)[2]<-"Country_SV"
#Look for combinations based on Region
b<-b%>%
full_join(c) %>%
dplyr::mutate(Prob_model3.2 = coalesce(Prob_model3,Prob_model3.1)) %>%
dplyr::select(Region_SL:Product_BU,Prob_model3.2)
colnames(b)[4]<-"Prob_model3"
#Are there any unique observations left?
b<-rbind(b,anti_join(DFile_New,b,by=c("Region_SL", "Country_SV", "Product_BU")))
b[is.na(b$Prob_model3),"Prob_model3"]<-0
I'm relatively new to the world of R programming. I'd sincerely appreciate any help.
I'm preferably looking for an advanced solution: lapply/dplyr/tidyr, anything is fine, so long as it isn't as complicated as mine.
My sessionInfo:
R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] grDevices datasets stats graphics grid tcltk utils methods base
$otherPkgs
[1] "bit" "bit64" "boot" "car" "compare"
[6] "corrgram" "corrplot" "cowplot" "debug" "directlabels"
[11] "dplyr" "foreign" "Formula" "ggplot2" "ggthemes"
[16] "gmodels" "hexbin" "Hmisc" "installr" "knitr"
[21] "lattice" "lubridate" "magrittr" "maps" "openxlsx"
[26] "pastecs" "plotly" "plyr" "psych" "purrr"
[31] "R2HTML" "readr" "readstata13" "reshape2" "ResourceSelection"
[36] "rJava" "rmarkdown" "sm" "stringr" "survival"
[41] "tables" "tibble" "tidyr" "tidyverse" "tufte"
[46] "tufterhandout" "vcd" "xlsxjars" "xts" "zoo"
$loadedOnly
[1] "acepack" "assertthat" "backports" "base64enc" "bitops" "broom" "caTools"
[8] "checkmate" "class" "cluster" "codetools" "colorspace" "data.table" "DBI"
[15] "dendextend" "DEoptimR" "digest" "diptest" "evaluate" "flexmix" "foreach"
[22] "fpc" "gclus" "gdata" "gplots" "gridExtra" "gtable" "gtools"
[29] "haven" "hms" "htmlTable" "htmltools" "htmlwidgets" "httr" "iterators"
[36] "jsonlite" "kernlab" "KernSmooth" "latticeExtra" "lazyeval" "lme4" "lmtest"
[43] "MASS" "Matrix" "MatrixModels" "mclust" "mgcv" "minqa" "mnormt"
[50] "modelr" "modeltools" "munsell" "mvbutils" "mvtnorm" "nlme" "nloptr"
[57] "nnet" "parallel" "pbkrtest" "prabclus" "quadprog" "quantreg" "R6"
[64] "RColorBrewer" "Rcpp" "readxl" "registry" "robustbase" "rpart" "rprojroot"
[71] "rvest" "scales" "seriation" "SparseM" "splines" "stats4" "stringi"
[78] "tools" "trimcluster" "TSP" "viridisLite" "whisker" "xml2"
There is a two-step solution using dplyr tools:
1. Create columns that represent "lookups" for a specific type of averaging;
2. Replace NAs in consecutive fashion.
Here is the code:
library(dplyr)
df_1 <- df %>%
group_by(Region_SL) %>%
summarise(lookup_1 = mean(Prob_model3, na.rm=TRUE))
df_2 <- df %>%
group_by(Region_SL, Country_SV) %>%
summarise(lookup_2 = mean(Prob_model3, na.rm=TRUE))
df_3 <- df %>%
group_by(Region_SL, Country_SV, Product_BU) %>%
summarise(lookup_3 = mean(Prob_model3, na.rm=TRUE))
df_new <- df %>%
left_join(df_3, by = c("Region_SL", "Country_SV", "Product_BU")) %>%
left_join(df_2, by = c("Region_SL", "Country_SV")) %>%
left_join(df_1, by = c("Region_SL")) %>%
mutate(modProb_model3 = coalesce(x=Prob_model3,
lookup_3, lookup_2, lookup_1,
0)) %>%
select(Region_SL, Country_SV, Product_BU, Prob_model3=modProb_model3)
Here df is the input data frame. df_1, df_2 and df_3 are data frames with certain averaging information (index represents the number of categorical variables for grouping).
After the consecutive left joins, the new variable modProb_model3 is created with the coalesce function: it finds the first non-missing value at each position.
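As a quick illustration of coalesce on toy vectors (the values are made up for the example):
library(dplyr)
# element-wise: take the first non-missing value across the arguments;
# the scalar 0 is recycled and acts as the final fallback
coalesce(c(NA, 2, NA), c(1, NA, NA), 0)
## [1] 1 2 0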
Edit
I think the solution above is the most effective for this particular problem. If, for example, there were at least 10 possible groupings to consider when replacing NAs, then some automation would be better. That automation can look as follows (using the packages tidyverse and lazyeval):
library(tidyverse)
value_name <- "Prob_model3"
max_group_vars <- c("Region_SL", "Country_SV", "Product_BU")
n_group_vars <- length(max_group_vars)
lookup_vars_list <- c(x = value_name, paste0("lookup_", n_group_vars:1)) %>%
as.list()
get_lookup_table <- function(.data,
group_vars,
value_name = "Prob_model3",
lookup_index = 1) {
summarise_data <- (~ mean(val, na.rm = TRUE)) %>%
lazyeval::interp(val = as.name(value_name)) %>%
list() %>%
setNames(paste0("lookup_", lookup_index))
.data %>%
group_by_(.dots = as.list(group_vars)) %>%
summarise_(.dots = summarise_data)
}
df_new_1 <- c(
list(df),
map(n_group_vars:1, function(lookup_index) {
get_lookup_table(.data = df,
group_vars = max_group_vars[1:lookup_index],
value_name = value_name,
lookup_index = lookup_index)
})
) %>%
reduce(left_join) %>%
mutate(modValue = select_(., .dots = lookup_vars_list) %>%
as.list() %>%
c(0) %>%
do.call(what = coalesce)) %>%
select(-matches(match = paste0("^lookup_[0-9]+$|", value_name))) %>%
rename_(.dots = setNames(list("modValue"), value_name))
Basically the algorithm is the same, but the code is more general: it replaces NA values in the column named value_name, based on its averages in groups defined by a decreasing set of column names (starting with the set stored in max_group_vars). The code makes heavy use of dplyr's standard evaluation (see https://cran.r-project.org/web/packages/dplyr/vignettes/nse.html). Here are some explanations:
get_lookup_table creates a lookup table for the specified grouping variables. The lookup column gets a unique name lookup_i, where i is the specified index of the lookup table;
Consecutive left joins are done with the reduce function, which consecutively applies a two-argument function (here left_join with x and y) to the result of its previous call and the next value. For example: f1 = f(x1, x2) --> f2 = f(f1, x3), and so on. Note that the left joins work correctly without specifying the by argument, because the lookup tables are constructed so that the "natural join" is correct; dplyr will also print messages about its guessing of the joining columns;
modValue is created with do.call applied to a preliminary list of arguments for coalesce;
The last two elements in the pipe select the appropriate columns (with the use of a regular expression) and then rename modValue to the desired value_name.
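To make the reduce step concrete, here is a toy illustration (the data frames are made up for the example):
library(tidyverse)
df_a <- data_frame(id = 1:3, a = letters[1:3])
df_b <- data_frame(id = 1:3, b = LETTERS[1:3])
df_c <- data_frame(id = 1:3, c = 101:103)
# reduce(left_join) is equivalent to left_join(left_join(df_a, df_b), df_c);
# `by` is guessed from the shared "id" column, hence the joining messages
list(df_a, df_b, df_c) %>% reduce(left_join)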
I'm writing an R function. I would like it to be able to take a list of movies, download info about them, and then throw it into a data frame.
So far, I have:
rottenrate <- function(movie){
link <- paste("http://www.omdbapi.com/?t=", movie, "&y=&plot=short&r=json&tomatoes=true", sep = "")
jsonData <- fromJSON(link)
return(jsonData)
}
This will return info for one movie and won't convert to a data.frame.
Thanks for any help.
You could do it like this:
# First, vectorize function
rottenrate <- function(movie){
require(RJSONIO)
link <- paste("http://www.omdbapi.com/?t=", movie, "&y=&plot=short&r=json&tomatoes=true", sep = "")
jsonData <- fromJSON(link)
return(jsonData)
}
vrottenrate <- Vectorize(rottenrate, "movie", SIMPLIFY = FALSE)
# Now, query and combine
movies <- c("inception", "toy story")
df <- do.call(rbind, lapply(vrottenrate(movies), function(x) as.data.frame(t(x), stringsAsFactors = FALSE)))
dplyr::glimpse(df)
# Observations: 2
# Variables:
# $ Title (chr) "Inception", "Toy Story"
# $ Year (chr) "2010", "1995"
# $ Rated (chr) "PG-13", "G"
# $ Released (chr) "16 Jul 2010", "22 Nov 1995"
# $ Runtime (chr) "148 min", "81 min"
# $ Genre (chr) "Action, Mystery, Sci-Fi", "Animation,
# ...
Interesting database btw... :-)