Combine tidy text with synonyms to create dataframe - r

I have sample data frame as below:
quoteiD <- c("q1","q2","q3","q4", "q5")
quote <- c("Unthinking respect for authority is the greatest enemy of truth.",
"In the middle of difficulty lies opportunity.",
"Intelligence is the ability to adapt to change.",
"Science is not only a disciple of reason but, also, one of romance and passion.",
"If I have seen further it is by standing on the shoulders of Giants.")
library(dplyr)
quotes <- tibble(quoteiD = quoteiD, quote= quote)
quotes
I have created some tidy text as below
library(tidytext)
data(stop_words)
tidy_words <- quotes %>%
unnest_tokens(word, quote) %>%
anti_join(stop_words) %>%
count( word, sort = TRUE)
tidy_words
Further, I have searched the synonyms using qdap package as below
library(qdap)
syns <- synonyms(tidy_words$word)
The qdap out put is a list , and I am looking to pick the first 5 synonym for each word in the tidy data frame and create a column called synonyms as below:
word n synonyms
ability 1 adeptness, aptitude, capability, capacity, competence
adapt 1 acclimatize, accommodate, adjust, alter, apply,
authority 1 ascendancy, charge, command, control, direction
What is an elegant way of merging the list of 5 words from qdap synonym function and separate by commas?

One way this can be done using a tidyverse solution is
library(plyr)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:plyr':
#>
#> arrange, count, desc, failwith, id, mutate, rename, summarise,
#> summarize
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tidytext)
library(qdap)
#> Loading required package: qdapDictionaries
#> Loading required package: qdapRegex
#>
#> Attaching package: 'qdapRegex'
#> The following object is masked from 'package:dplyr':
#>
#> explain
#> Loading required package: qdapTools
#>
#> Attaching package: 'qdapTools'
#> The following object is masked from 'package:dplyr':
#>
#> id
#> The following object is masked from 'package:plyr':
#>
#> id
#> Loading required package: RColorBrewer
#>
#> Attaching package: 'qdap'
#> The following object is masked from 'package:dplyr':
#>
#> %>%
#> The following object is masked from 'package:base':
#>
#> Filter
library(tibble)
library(tidyr)
#>
#> Attaching package: 'tidyr'
#> The following object is masked from 'package:qdap':
#>
#> %>%
quotes <- tibble(quoteiD = paste0("q", 1:5),
quote= c(".\n\nthe ebodac consortium consists of partners: janssen (efpia), london school of hygiene and tropical medicine (lshtm),",
"world vision) mobile health software development and deployment in resource limited settings grameen\n\nas such, the ebodac consortium is well placed to tackle.",
"Intelligence is the ability to adapt to change.",
"Science is a of reason of romance and passion.",
"If I have seen further it is by standing on ."))
quotes
#> # A tibble: 5 x 2
#> quoteiD quote
#> <chr> <chr>
#> 1 q1 ".\n\nthe ebodac consortium consists of partners: janssen (efpia~
#> 2 q2 "world vision) mobile health software development and deployment~
#> 3 q3 Intelligence is the ability to adapt to change.
#> 4 q4 Science is a of reason of romance and passion.
#> 5 q5 If I have seen further it is by standing on .
data(stop_words)
tidy_words <- quotes %>%
unnest_tokens(word, quote) %>%
anti_join(stop_words) %>%
count( word, sort = TRUE)
#> Joining, by = "word"
tidy_words
#> # A tibble: 33 x 2
#> word n
#> <chr> <int>
#> 1 consortium 2
#> 2 ebodac 2
#> 3 ability 1
#> 4 adapt 1
#> 5 change 1
#> 6 consists 1
#> 7 deployment 1
#> 8 development 1
#> 9 efpia 1
#> 10 grameen 1
#> # ... with 23 more rows
syns <- synonyms(tidy_words$word)
#> no match for the following:
#> consortium, ebodac, consists, deployment, efpia, grameen, janssen, london, lshtm, partners, settings, software, tropical
#> ========================
syns %>%
plyr::ldply(data.frame) %>% # Change the list to a dataframe (See https://stackoverflow.com/questions/4227223/r-list-to-data-frame)
rename("Word_DefNumber" = 1, "Syn" = 2) %>% # Rename the columns with a name that is more intuitive
separate(Word_DefNumber, c("Word", "DefNumber"), sep = "\\.") %>% # Find the word part of the word and definition number
group_by(Word) %>% # Group by words, so that when we select rows it is done for each word
slice(1:5) %>% # Keep the first 5 rows for each word
summarise(synonyms = paste(Syn, collapse = ", ")) %>% # Combine the synonyms together comma separated using paste
ungroup() # So there are not unintended effects of having the data grouped when using the data later
#> # A tibble: 20 x 2
#> Word synonyms
#> <chr> <chr>
#> 1 ability adeptness, aptitude, capability, capacity, competence
#> 2 adapt acclimatize, accommodate, adjust, alter, apply
#> 3 change alter, convert, diversify, fluctuate, metamorphose
#> 4 development advance, advancement, evolution, expansion, growth
#> 5 health fitness, good condition, haleness, healthiness, robustness
#> 6 hygiene cleanliness, hygienics, sanitary measures, sanitation
#> 7 intelligence acumen, alertness, aptitude, brain power, brains
#> 8 limited bounded, checked, circumscribed, confined, constrained
#> 9 medicine cure, drug, medicament, medication, nostrum
#> 10 mobile ambulatory, itinerant, locomotive, migrant, motile
#> 11 passion animation, ardour, eagerness, emotion, excitement
#> 12 reason apprehension, brains, comprehension, intellect, judgment
#> 13 resource ability, capability, cleverness, ingenuity, initiative
#> 14 romance affair, affaire (du coeur), affair of the heart, amour, at~
#> 15 school academy, alma mater, college, department, discipline
#> 16 science body of knowledge, branch of knowledge, discipline, art, s~
#> 17 standing condition, credit, eminence, estimation, footing
#> 18 tackle accoutrements, apparatus, equipment, gear, implements
#> 19 vision eyes, eyesight, perception, seeing, sight
#> 20 world earth, earthly sphere, globe, everybody, everyone
Created on 2019-04-05 by the reprex package (v0.2.1)
Please note that plyr should be loaded before dplyr

Related

I need help merging two rows based on certain string character, the string is complaint

I am trying to calculate the fraction of the construction noise per zip code across NY city. The data is from NYC 311.
I am using dplyr and have grouped the data per zip.
However, I am finding difficulties merging the row for the complain column, I have to merge the data as per the string "construction" it appear anywhere meaning middle, front or end.
My solution, this is just the beginning
comp_types <- df %>% select(complaint_type,descriptor,incident_zip) %>%
group_by(incident_zip)
can you help me merge the row if unique value in descriptor contains any construction value.
Can you clarify what you mean by "merging"? I don't think you actually want to merge because you only have one dataframe. The term "merging" is used to describe the joining of two dataframes.
See ?base::merge:
Merge two data frames by common columns or row names, or do other versions of database join operations.
If I understand correctly, you want to look into the descriptor variable and see if it contains the string "construction" anywhere in the cell, so you can determine if the person's complaint was construction-related; same for "music". I don't believe you need to use complaint_type since complaint_type never contains the string "construction" or "music"; only descriptor does.
You can use a combination of ifelse and grepl to create a new variable that indicates whether the complaint was construction-related, music-related, or other.
library(tidyverse)
library(janitor)
url <- "https://data.cityofnewyork.us/api/views/p5f6-bkga/rows.csv"
df <- read.csv(url, nrows = 10000) %>%
clean_names() %>%
select(complaint_type, descriptor, incident_zip)
comp_types <- df %>%
select(complaint_type, descriptor, incident_zip) %>%
group_by(incident_zip)
head(comp_types)
#> # A tibble: 6 × 3
#> # Groups: incident_zip [6]
#> complaint_type descriptor incident_zip
#> <chr> <chr> <int>
#> 1 Noise - Residential Banging/Pounding 11364
#> 2 Noise - Residential Loud Music/Party 11222
#> 3 Noise - Residential Banging/Pounding 10033
#> 4 Noise - Residential Loud Music/Party 11208
#> 5 Noise - Residential Loud Music/Party 10037
#> 6 Noise Noise: Construction Before/After Hours (NM1) 11238
table(df$complaint_type)
#>
#> Noise Noise - Commercial Noise - Helicopter
#> 555 591 145
#> Noise - House of Worship Noise - Park Noise - Residential
#> 20 72 5675
#> Noise - Street/Sidewalk Noise - Vehicle
#> 2040 902
df <- df %>%
mutate(descriptor_misc = ifelse(grepl("Construction", descriptor), "Construction",
ifelse(grepl("Music", descriptor), "Music", "Other")))
df %>%
group_by(descriptor_misc) %>%
count()
#> # A tibble: 3 × 2
#> # Groups: descriptor_misc [3]
#> descriptor_misc n
#> <chr> <int>
#> 1 Construction 328
#> 2 Music 6354
#> 3 Other 3318
head(df)
#> complaint_type descriptor incident_zip
#> 1 Noise - Residential Banging/Pounding 11364
#> 2 Noise - Residential Loud Music/Party 11222
#> 3 Noise - Residential Banging/Pounding 10033
#> 4 Noise - Residential Loud Music/Party 11208
#> 5 Noise - Residential Loud Music/Party 10037
#> 6 Noise Noise: Construction Before/After Hours (NM1) 11238
#> descriptor_misc
#> 1 Other
#> 2 Music
#> 3 Other
#> 4 Music
#> 5 Music
#> 6 Construction

split char currency into two separate columns in data frame

I have this data frame df
Items Item Code Prices
1 Beds 1630 $135.60
2 Big Shelve 1229 89.5USD
3 Small Shelve 1229 ¥3680.03
4 Chair 445 92.63€
5 Desk 802 206.43 euro
6 Lamp 832 25307.1 JPY
I want to split the prices column into three column: Prices and Currency and Exchange rate from USD using
Items Item Code Prices Currency Exchange rates
1 Beds 1630 135.60 USD 1.00
2 Big Shelve 1229 89.50 USD 1.00
3 Small Shelve 1229 3680.03 JPY 115.71
4 Chair 445 92.63 EUR 0.90
5 Desk 802 206.43 EUR 0.90
6 Lamp 832 25307.10 JPY 115.71
I tried using dplyr::separate() but instead it would separate at comma instead.
If I try using the gsub() it gives me this error
> df2 <- df %>%
+ mutate(price = as.numeric(gsub'[$,€,¥,]','', df$Col3))
Error: unexpected string constant in:
"df2 <- df %>%
mutate(price = as.numeric(gsub'[$,€,¥,]'"
Any ideas what to do? Also, how would I able to reference the currency to correct items?
This should solve the problem. Using the quantmod package, you can get the current exchange rate and add that into the data:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(stringr)
library(tidyr)
library(quantmod)
#> Loading required package: xts
#> Loading required package: zoo
#>
#> Attaching package: 'zoo'
#> The following objects are masked from 'package:base':
#>
#> as.Date, as.Date.numeric
#>
#> Attaching package: 'xts'
#> The following objects are masked from 'package:dplyr':
#>
#> first, last
#> Loading required package: TTR
#> Registered S3 method overwritten by 'quantmod':
#> method from
#> as.zoo.data.frame zoo
dat <- tibble::tribble(
~Items, ~"Item Code", ~Prices,
"Beds", 1630, "$135.60",
"Big Shelve", 1229, "89.5USD",
"Small Shelve", 1229, "¥3680.03",
"Chair", 445, "92.63€",
"Desk", 802, "206.43 euro",
"Lamp", 832, "25307.1 JPY")
dat <- dat %>%
mutate(currency = c(trimws(str_extract_all(Prices, "[^\\d\\.]+", simplify = TRUE))),
currency = case_when(currency %in% c("€", "euro") ~ "EUR",
currency == "$" ~ "USD",
currency == "¥" ~ "JPY",
TRUE ~ currency),
Prices = as.numeric(str_extract_all(Prices, "\\d+\\.\\d+", simplify=TRUE)),
xr = paste0("USD", currency, "=X")) %>%
left_join(getQuote(unique(.$xr)) %>% as_tibble(rownames = "xr") %>% select(xr, Last)) %>%
select(-xr) %>%
rename("Exchange rates" = "Last")
#> Joining, by = "xr"
dat
#> # A tibble: 6 × 5
#> Items `Item Code` Prices currency `Exchange rates`
#> <chr> <dbl> <dbl> <chr> <dbl>
#> 1 Beds 1630 136. USD 1
#> 2 Big Shelve 1229 89.5 USD 1
#> 3 Small Shelve 1229 3680. JPY 116.
#> 4 Chair 445 92.6 EUR 0.902
#> 5 Desk 802 206. EUR 0.902
#> 6 Lamp 832 25307. JPY 116.
Created on 2022-03-03 by the reprex package (v2.0.1)

fct_collapse in R?

I have a factor that's words (instances of words that difference participants said). I want to collapse it so that there are the categories "that" (every instance of the word "that") and notThat (all other words combined into one category). Naturally there are a lot of other words, and I don't want to go through and type them all. I've tried using != in various places, but it won't work. Maybe I just have the syntax wrong?
Anyway, is there a way to do this? That is, collapse all words that aren't "that" into one group?
How about this:
library(forcats)
x <- c("that", "something", "else")
fct_collapse(x, that = c("that"), other_level="notThat")
#> [1] that notThat notThat
#> Levels: that notThat
Created on 2022-02-15 by the reprex package (v2.0.1)
Edit to show in a data frame
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(forcats)
dat <- data.frame(
gender = factor(c(1,0,1,1,1,0), labels=c("male", "female")),
age = round(runif(6, 18,85)),
word = c("that", "something", "altogether", "different", "entirely", "that"))
dat %>%
mutate(word_collapse = fct_collapse(word, that="that", other_level="notThat"))
#> gender age word word_collapse
#> 1 female 74 that that
#> 2 male 72 something notThat
#> 3 female 57 altogether notThat
#> 4 female 44 different notThat
#> 5 female 79 entirely notThat
#> 6 male 81 that that
Created on 2022-02-15 by the reprex package (v2.0.1)

Extract Model Description from a mable

I have a mable object that is like so:
models
# A mable: 1 x 3
ets arima nnetar
<model> <model> <model>
1 <ETS(M,Ad,M)> <ARIMA(2,1,2)(0,0,2)[12]> <NNAR(14,1,10)[12]>
I just want the models descriptions so I can place them in a plot. So I ran the following code:
model_desc <- models %>%
gather() %>%
select(key, value) %>%
set_names("model","model_desc") %>%
mutate(model_desc_char = model_desc %>% as.character())
as_tibble() %>%
select(model, model_desc)
This still gives me back a tibble where model_desc is still a list object. I think this is because of how a mable is constructed and how its structure is supposed to be.
** UPDATE **
I solved the problem by doing the following:
model_desc <- models %>%
as_tibble() %>%
gather() %>%
mutate(model_desc = print(value)) %>%
select(key, model_desc) %>%
set_names("model", "model_desc")
For anybody else who will encounter this going forward, I have pasted a solution that works for me with the latest versions of fable/fabletools.
library(fable)
#> Loading required package: fabletools
library(tsibble)
library(tsibbledata)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:tsibble':
#>
#> interval
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tidyr)
aus_retail %>%
filter(
State %in% c("New South Wales", "Victoria"),
Industry == "Department stores"
) %>%
model(
ets = ETS(box_cox(Turnover, 0.3)),
arima = ARIMA(log(Turnover)),
snaive = SNAIVE(Turnover)
) %>%
pivot_longer(cols = -c(State, Industry),
names_to = "model_type",
values_to = "model_specifics_mdl") %>%
mutate(model_specifics = format(model_specifics_mdl)) %>%
select(-model_specifics_mdl)
#> # A tibble: 6 x 4
#> State Industry model_type model_specifics
#> <chr> <chr> <chr> <chr>
#> 1 New South Wales Department stores ets <ETS(A,Ad,A)>
#> 2 New South Wales Department stores arima <ARIMA(2,1,1)(2,1,1)[12]>
#> 3 New South Wales Department stores snaive <SNAIVE>
#> 4 Victoria Department stores ets <ETS(A,A,A)>
#> 5 Victoria Department stores arima <ARIMA(2,1,1)(1,1,2)[12]>
#> 6 Victoria Department stores snaive <SNAIVE>
Created on 2020-09-07 by the reprex package (v0.3.0)
This ended up solving my issue:
model_desc <- models %>%
as_tibble() %>%
gather() %>%
mutate(model_desc = print(value)) %>%
select(key, model_desc) %>%
set_names("model", "model_desc")

r rvest webscraping hltv

Yes, that's just another "how-to-scrape" question. Sorry for that, but I've read the previous answers and the manual for rvest as well.
I'm doing web-scraping for my homework (so I do not plan to use the data for any commercial issue). The idea is to show that average skill of team affect individual skill. I'm trying to use CS:GO data from HLTV.org for it.
The information is available at http://www.hltv.org/?pageid=173&playerid=9216
I need two tables: Keystats (data only) and Teammates (data and URLs). I try to use CSS selectors generated by SelectorGadget and I also tryed to analyze the source code of webpage. I've failed. I'm doing the following:
library(rvest)
library(dplyr)
url <- 'http://www.hltv.org/?pageid=173&playerid=9216'
info <- html_session(url) %>% read_html()
info %>% html_node('.covSmallHeadline') %>% html_text()
Can you please tell me that is the right CSS selector?
If you look at the source, those tables aren't HTML tables, but just piles of divs with inconsistent nesting and inline CSS for alignment. Thus, it's easiest to just grab all the text and fix the strings afterwards, as the data is either all numeric or not at all.
library(rvest)
library(tidyverse)
h <- 'http://www.hltv.org/?pageid=173&playerid=9216' %>% read_html()
h %>% html_nodes('.covGroupBoxContent') %>% .[-1] %>%
html_text(trim = TRUE) %>%
strsplit('\\s*\\n\\s*') %>%
setNames(map_chr(., ~.x[1])) %>% map(~.x[-1]) %>%
map(~data_frame(variable = gsub('[.0-9]+', '', .x),
value = parse_number(.x)))
#> $`Key stats`
#> # A tibble: 9 × 2
#> variable value
#> <chr> <dbl>
#> 1 Total kills 9199.00
#> 2 Headshot %% 46.00
#> 3 Total deaths 6910.00
#> 4 K/D Ratio 1.33
#> 5 Maps played 438.00
#> 6 Rounds played 11242.00
#> 7 Average kills per round 0.82
#> 8 Average deaths per round 0.61
#> 9 Rating (?) 1.21
#>
#> $TeammatesRating
#> # A tibble: 4 × 2
#> variable value
#> <chr> <dbl>
#> 1 Gabriel 'FalleN' Toledo 1.11
#> 2 Fernando 'fer' Alvarenga 1.11
#> 3 Joao 'felps' Vasconcellos 1.09
#> 4 Epitacio 'TACO' de Melo 0.98

Resources