I have the following dataframe, which contains reviews that customers have left on a restaurant website:
id<-c(1,2,3,4,5,6)
review<- c("the food was very delicious and hearty - perfect to warm up during a freezing winters day", "Excellent service as usual","Love this place!", "Service and quality of food first class"," Customer services was exceptional by all staff","excellent services")
df<-data.frame(id, review)
Now I am looking for a way (preferably without using a for loop) to find the part-of-speech labels in each customer's review in R.
This is a pretty straightforward adaptation of the example on the Maxent_POS_Tag_Annotator help page.
df<-data.frame(id, review, stringsAsFactors=FALSE)
library(NLP)
library(openNLP)
review.pos <-
  sapply(df$review, function(ii) {
    a2 <- Annotation(1L, "sentence", 1L, nchar(ii))
    a2 <- annotate(ii, Maxent_Word_Token_Annotator(), a2)
    a3 <- annotate(ii, Maxent_POS_Tag_Annotator(), a2)
    a3w <- subset(a3, type == "word")
    tags <- sapply(a3w$features, `[[`, "POS")
    sprintf("%s/%s", as.String(ii)[a3w], tags)
  })
Which results in this output:
#[[1]]
# [1] "the/DT" "food/NN" "was/VBD" "very/RB" "delicious/JJ"
# [6] "and/CC" "hearty/NN" "-/:" "perfect/JJ" "to/TO"
#[11] "warm/VB" "up/RP" "during/IN" "a/DT" "freezing/JJ"
#[16] "winters/NNS" "day/NN"
#
#[[2]]
#[1] "Excellent/JJ" "service/NN" "as/IN" "usual/JJ"
#
#[[3]]
#[1] "Love/VB" "this/DT" "place/NN" "!/."
#
#[[4]]
#[1] "Service/NNP" "and/CC" "quality/NN" "of/IN" "food/NN"
#[6] "first/JJ" "class/NN"
#
#[[5]]
#[1] "Customer/NN" "services/NNS" "was/VBD" "exceptional/JJ"
#[5] "by/IN" "all/DT" "staff/NN"
#
#[[6]]
#[1] "excellent/JJ" "services/NNS"
It should be relatively straightforward to adapt this to whatever format you want.
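For example, a minimal sketch (assuming the review.pos list and df from above; the pos_df and word_tag names are just illustrative) that reshapes the word/tag pairs into a long data frame keyed by the review id:
# reshape the list of "word/tag" strings into one long data frame
pos_df <- do.call(rbind, lapply(seq_along(review.pos), function(i) {
  data.frame(id = df$id[i],
             word_tag = review.pos[[i]],
             stringsAsFactors = FALSE)
}))
head(pos_df)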
Considering that in your example the id column is simply the row index, I believe you can obtain your desired output with the pos() function from the qdap package.
library(qdap)
pos(df$review)
If you do need grouping because of multiple reviews per customer, you can use
pos_by(df$review,df$id)
If you don't mind trying a GitHub package, I have the tagger package, which wraps NLP/openNLP to do a number of tasks quickly, in the way Python users manipulate POS tags. Note that the output prints in the traditional word/tag format, but in reality the object is a list of named vectors. This makes working with the words and tags easier. Here I demo how to get the tags and a few manipulations that tagger makes easy:
# First load your data and get the tagger package for those playing along at home
id<-c(1,2,3,4,5,6)
review<- c("the food was very delicious and hearty - perfect to warm up during a freezing winters day", "Excellent service as usual","Love this place!", "Service and quality of food first class"," Customer services was exceptional by all staff","excellent services")
df<-data.frame(id, review)
if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh("trinker/tagger")
# Now tag and manipulate
(out <- tag_pos(as.character(df[["review"]])))
## [1] "the/DT food/NN was/VBD very/RB delicious/JJ and/CC hearty/NN -/: perfect/JJ to/TO warm/VB up/RP during/IN a/DT freezing/JJ winters/NNS day/NN"
## [2] "Excellent/JJ service/NN as/IN usual/JJ"
## [3] "Love/VB this/DT place/NN !/."
## [4] "Service/NNP and/CC quality/NN of/IN food/NN first/JJ class/NN"
## [5] "Customer/NN services/NNS was/VBD exceptional/JJ by/IN all/DT staff/NN"
## [6] "excellent/JJ services/NNS"
c(out) ## True structure: list of named vectors
as_word_tag(out) ## Match the print method (less mutable)
count_tags(out, df[["id"]]) ## Get counts by row
plot(out) ## tag distribution (plot at end)
as_basic(out) ## basic pos tags
## [1] "the/article food/noun was/verb very/adverb delicious/adjective and/conjunction hearty/noun -/. perfect/adjective to/preposition warm/verb up/preposition during/preposition a/article freezing/adjective winters/noun day/noun"
## [2] "Excellent/adjective service/noun as/preposition usual/adjective"
## [3] "Love/verb this/adjective place/noun !/."
## [4] "Service/noun and/conjunction quality/noun of/preposition food/noun first/adjective class/noun"
## [5] "Customer/noun services/noun was/verb exceptional/adjective by/preposition all/adjective staff/noun"
## [6] "excellent/adjective services/noun"
select_tags(out, c("NN", "NNP", "NNPS", "NNS"))
## [1] "food/NN hearty/NN winters/NNS day/NN"
## [2] "service/NN"
## [3] "place/NN"
## [4] "Service/NNP quality/NN food/NN class/NN"
## [5] "Customer/NN services/NNS staff/NN"
## [6] "services/NNS"
Everything works pretty nicely within a magrittr pipeline as well, which is my preference. The Examples Section of the README has a nice overview of the package's usage.
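For instance, here is the select_tags() call from above rewritten as a pipe (a sketch, assuming df and the tagger functions loaded above):
library(magrittr)

df[["review"]] %>%
  as.character() %>%
  tag_pos() %>%
  select_tags(c("NN", "NNP", "NNPS", "NNS"))  # same noun filter as above, pipeline style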
I'm trying to use R to fetch all the links to data files on the Eurostat website. While my code currently "works", I seem to get a duplicate result for every link.
Note: the use of download.file is to get around my company's firewall, per this answer.
library(dplyr)
library(rvest)
myurl <- "http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?dir=data&sort=1&sort=2&start=all"
download.file(myurl, destfile = "eurofull.html")
content <- read_html("eurofull.html")
links <- content %>%
  html_nodes("a") %>% # Note that I don't know the significance of "a"; this was trial and error
  html_attr("href") %>%
  data.frame()
# filter to only get the ".tsv.gz" links
files <- filter(links, grepl("tsv.gz", .))
Looking at the top of the dataframe
files$.[1:6]
[1] /eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=data%2Faact_ali01.tsv.gz
[2] /eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_ali01.tsv.gz
[3] /eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=data%2Faact_ali02.tsv.gz
[4] /eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_ali02.tsv.gz
[5] /eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=data%2Faact_eaa01.tsv.gz
[6] /eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_eaa01.tsv.gz
The only difference between 1 and 2 is that 1 says "...file=data..." while 2 says "...downfile=data...". This pattern continues for all pairs down the dataframe.
If I download 1 and 2 and read the files into R, an identical check confirms they are the same.
Why are two links to the same data being returned? Is there a way (other than filtering for "downfile") to only return one of the links?
As noted, you can just do some better node targeting. This uses XPath vs CSS selectors and picks the links with downfile in the href:
html_nodes(content, xpath = ".//a[contains(@href, 'downfile')]") %>%
  html_attr("href") %>%
  sprintf("http://ec.europa.eu/%s", .) %>%
  head()
## [1] "http://ec.europa.eu//eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_ali01.tsv.gz"
## [2] "http://ec.europa.eu//eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_ali02.tsv.gz"
## [3] "http://ec.europa.eu//eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_eaa01.tsv.gz"
## [4] "http://ec.europa.eu//eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_eaa02.tsv.gz"
## [5] "http://ec.europa.eu//eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_eaa03.tsv.gz"
## [6] "http://ec.europa.eu//eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2Faact_eaa04.tsv.gz"
My question is almost the same as here. I want to download all the files from this page. But the difference is that I do not have the same pattern to be able to download all the files.
Any idea how to do the download in R?
# use the FTP mirror link provided on the page
mirror <- "ftp://srtm.csi.cgiar.org/SRTM_v41/SRTM_Data_GeoTIFF/"
# read the file listing
pg <- readLines(mirror)
# take a look
head(pg)
## [1] "06-18-09 06:18AM 713075 srtm_01_02.zip"
## [2] "06-18-09 06:18AM 130923 srtm_01_07.zip"
## [3] "06-18-09 06:18AM 130196 srtm_01_12.zip"
## [4] "06-18-09 06:18AM 156642 srtm_01_15.zip"
## [5] "06-18-09 06:18AM 317244 srtm_01_16.zip"
## [6] "06-18-09 06:18AM 160847 srtm_01_17.zip"
# clean it up and make them URLs
fils <- sprintf("%s%s", mirror, sub("^.*srtm", "srtm", pg))
head(fils)
## [1] "ftp://srtm.csi.cgiar.org/SRTM_v41/SRTM_Data_GeoTIFF/srtm_01_02.zip"
## [2] "ftp://srtm.csi.cgiar.org/SRTM_v41/SRTM_Data_GeoTIFF/srtm_01_07.zip"
## [3] "ftp://srtm.csi.cgiar.org/SRTM_v41/SRTM_Data_GeoTIFF/srtm_01_12.zip"
## [4] "ftp://srtm.csi.cgiar.org/SRTM_v41/SRTM_Data_GeoTIFF/srtm_01_15.zip"
## [5] "ftp://srtm.csi.cgiar.org/SRTM_v41/SRTM_Data_GeoTIFF/srtm_01_16.zip"
## [6] "ftp://srtm.csi.cgiar.org/SRTM_v41/SRTM_Data_GeoTIFF/srtm_01_17.zip"
# test download
download.file(fils[1], basename(fils[1]))
# validate it worked before slamming the server (your job)
# do the rest whilst being kind to the mirror server
for (f in fils[-1]) {
  download.file(f, basename(f))
  Sys.sleep(5) # unless you have entitlement issues, space out the downloads by a few seconds
}
If you don't mind using a non-base package, curl can help you just get the file names vs doing the sub above:
unlist(strsplit(rawToChar(
  curl::curl_fetch_memory(mirror, curl::new_handle(dirlistonly = TRUE))$content
), "\n"))
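For instance, a sketch of plugging that into the fils construction above (assuming the result of the call above is assigned to listing; the nzchar() filter just drops any trailing blank entry):
fils <- sprintf("%s%s", mirror, listing[nzchar(listing)])
head(fils)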
This is not the most elegant solution, but it appears to be working when I try it on random subsets of helplinks.
library(rvest)
#Grab filenames from separate URL
helplinks <- read_html("http://rdf.muninn-project.org/api/elevation/datasets/srtm/") %>% html_nodes("a") %>% html_text(trim = T)
#Keep only filenames relevant for download
helplinks <- helplinks[grepl("srtm", helplinks)]
#Download files - make sure to adjust the `destfile` argument of the download.file function.
lapply(helplinks, function(x) {
  download.file(sprintf("http://srtm.csi.cgiar.org/SRT-ZIP/SRTM_V41/SRTM_Data_GeoTiff/%s", x),
                sprintf("C:/Users/aud/Desktop/%s", x))
})
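A slightly more defensive variant of that last step, as a sketch: wrapping each download in tryCatch so one bad link doesn't stop the rest (the tempdir() destination and the binary mode are illustrative choices, adjust as needed):
results <- lapply(helplinks, function(x) {
  tryCatch(
    download.file(sprintf("http://srtm.csi.cgiar.org/SRT-ZIP/SRTM_V41/SRTM_Data_GeoTiff/%s", x),
                  destfile = file.path(tempdir(), x), # adjust the destination as needed
                  mode = "wb"),                       # binary mode for zip files
    error = function(e) NA_integer_                   # record a failure and keep going
  )
})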
This is the second time I have faced this recently, so I wanted to reach out to see if there is a better way to parse dataframes returned from jsonlite when one of the elements is an array stored as a list column in the dataframe.
I know that this is part of the power of jsonlite, but I am not sure how to work with this nested structure. In the end, I suppose I could write my own custom parsing, but given that I am almost there, I wanted to see how to work with this data.
For example:
## options
options(stringsAsFactors=F)
## packages
library(httr)
library(jsonlite)
## setup
gameid="2015020759"
SEASON = '20152016'
BASE = "http://live.nhl.com/GameData/"
URL = paste0(BASE, SEASON, "/", gameid, "/PlayByPlay.json")
## get the data
x <- GET(URL)
## parse
api_response <- content(x, as="text")
api_response <- jsonlite::fromJSON(api_response, flatten=TRUE)
## get the data of interest
pbp <- api_response$data$game$plays$play
colnames(pbp)
And exploring what comes back:
> class(pbp$aoi)
[1] "list"
> class(pbp$desc)
[1] "character"
> class(pbp$xcoord)
[1] "integer"
From above, the column pbp$aoi is a list. Here are a few entries:
> head(pbp$aoi)
[[1]]
[1] 8465009 8470638 8471695 8473419 8475792 8475902
[[2]]
[1] 8470626 8471276 8471695 8476525 8476792 8477956
[[3]]
[1] 8469619 8471695 8473492 8474625 8475727 8476525
[[4]]
[1] 8469619 8471695 8473492 8474625 8475727 8476525
[[5]]
[1] 8469619 8471695 8473492 8474625 8475727 8476525
[[6]]
[1] 8469619 8471695 8473492 8474625 8475727 8475902
I don't really care whether I parse these lists into the same dataframe, but what options do I have for parsing out the data?
I would prefer to take the data out of the lists and parse them into a dataframe that can be "related" back to the original record they came from.
Thanks in advance for your help.
From @hrbmstr above, I was able to get what I wanted using unnest.
library(dplyr)
library(tidyr)
select(pbp, eventid, aoi) %>% unnest() %>% head()
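For reference, a base-R sketch of the same unnesting idea (assuming pbp as built above, with an eventid column):
# expand each list entry into rows, keeping the id it came from
aoi_long <- do.call(rbind, lapply(seq_len(nrow(pbp)), function(i) {
  data.frame(eventid = pbp$eventid[i],
             aoi = pbp$aoi[[i]],
             stringsAsFactors = FALSE)
}))
head(aoi_long)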
I am attempting to extract all words that start with a particular phrase from a website. The website I am using is:
http://docs.ggplot2.org/current/
I want to extract all the words that start with "stat_". I should get 21 names like "stat_identity" in return. I have the following code:
stats <- readLines("http://docs.ggplot2.org/current/")
head(stats)
grep("stat_{1[a-z]", stats, value=TRUE)
This returns every line containing the phrase "stat_", but I just want to extract the "stat_" words themselves. So I tried something else:
gsub("\b^stat_[a-z]+ ", "", stats)
I think the output I got was an empty string, " ", where a "stat_" phrase would be? So now I'm trying to think of ways to extract all the text and set everything that is not a "stat_" phrase to empty strings. Does anyone have any ideas on how to get my desired output?
rvest & stringr to the rescue:
library(xml2)
library(rvest)
library(stringr)
pg <- read_html("http://docs.ggplot2.org/current/")
unique(str_match_all(html_text(html_nodes(pg, "body")),
                     "(stat_[[:alnum:]_]+)")[[1]][,2])
## [1] "stat_bin" "stat_bin2dCount"
## [3] "stat_bindot" "stat_binhexBin"
## [5] "stat_boxplot" "stat_contour"
## [7] "stat_density" "stat_density2d"
## [9] "stat_ecdf" "stat_functionSuperimpose"
## [11] "stat_identity" "stat_qqCalculation"
## [13] "stat_quantile" "stat_smooth"
## [15] "stat_spokeConvert" "stat_sum"
## [17] "stat_summarySummarise" "stat_summary_hexApply"
## [19] "stat_summary2dApply" "stat_uniqueRemove"
## [21] "stat_ydensity" "stat_defaults"
Unless you need the links (then you can use other rvest functions), this removes all the markup for you and just gives you the text of the website.
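If you do want the links as well, here is a small sketch along those lines, assuming the pg object above and that the anchor text on that page starts with the stat_ name:
nodes <- html_nodes(pg, "a")
keep  <- grepl("^stat_", html_text(nodes))   # anchors whose text starts with "stat_"
data.frame(name = html_text(nodes[keep]),
           href = html_attr(nodes[keep], "href"),
           stringsAsFactors = FALSE)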
I have Twitter data. Using library(stringr) I have extracted all the weblinks. However, when I try to do the same for the hashtags, I am getting an error. The same code had worked some days ago. The following is the code:
library(stringr)
hash <- "#[a-zA-Z0-9]{1, }"
hashtag <- str_extract_all(travel$texts, hash)
The following is the error:
Error in stri_extract_all_regex(string, pattern, simplify = simplify, :
Error in {min,max} interval. (U_REGEX_BAD_INTERVAL)
I have re-installed the stringr package, but that doesn't help.
The code that I used for the weblinks is:
pat1 <- "http://t.co/[a-zA-Z0-9]{1,}"
twitlink <- str_extract_all(travel$texts, pat1)
A reproducible example is as follows:
rtt <- structure(data.frame(texts = c("Review Anthem of the Seas Anthems maiden voyage httptcoLPihj2sNEP #stevenewman", "#Job #Canada #Marlin Travel Agentagente de voyages Full Time in #St Catharines ON httptconMHNlDqv69", "Experience #Fiji amp #NewZealand like never before on a great 10night voyage 4033 pp departing Vancouver httptcolMvChSpaBT"), source = c("Twitter Web Client", "Catch a Job Canada", "Hootsuite"), tweet_time = c("2015-05-07 19:32:58", "2015-05-07 19:37:03", "2015-05-07 20:45:36")))
Your problem comes from the whitespace in the hash pattern:
#Not working (note the whitespace after the comma)
str_extract_all(rtt$texts,"#[a-zA-Z0-9]{1, }")
#working
str_extract_all(rtt$texts,"#[a-zA-Z0-9]{1,}")
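As an aside, {1,} is just a verbose way of writing +, so an equivalent (and arguably clearer) pattern is:
str_extract_all(rtt$texts, "#[a-zA-Z0-9]+")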
You may want to consider using the qdapRegex package that I maintain for this task. It makes extracting URLs and hash tags easy. qdapRegex contains a bunch of canned regexes and uses the amazing stringi package as a backend to do the regex tasks.
rtt <- structure(data.frame(texts = c("Review Anthem of the Seas Anthems maiden voyage httptcoLPihj2sNEP #stevenewman", "#Job #Canada #Marlin Travel Agentagente de voyages Full Time in #St Catharines ON httptconMHNlDqv69", "Experience #Fiji amp #NewZealand like never before on a great 10night voyage 4033 pp departing Vancouver httptcolMvChSpaBT"), source = c("Twitter Web Client", "Catch a Job Canada", "Hootsuite"), tweet_time = c("2015-05-07 19:32:58", "2015-05-07 19:37:03", "2015-05-07 20:45:36")))
library(qdapRegex)
## first combine the built in url + twitter regexes into a function
rm_twitter_n_url <- rm_(pattern=pastex("@rm_twitter_url", "@rm_url"), extract=TRUE)
rm_twitter_n_url(rtt$texts)
rm_hash(rtt$texts, extract=TRUE)
Giving the following output:
## > rm_twitter_n_url(rtt$texts)
## [[1]]
## [1] "httptcoLPihj2sNEP"
##
## [[2]]
## [1] "httptconMHNlDqv69"
##
## [[3]]
## [1] "httptcolMvChSpaBT"
## > rm_hash(rtt$texts, extract=TRUE)
## [[1]]
## [1] "#stevenewman"
##
## [[2]]
## [1] "#Job" "#Canada" "#Marlin" "#St"
##
## [[3]]
## [1] "#Fiji" "#NewZealand"