Using regex and/or removing duplicates - R

I am scraping a website and, as a result, I have half-cleaned data:
[3] "2♠2:2♠2: Texas:28,,845:25,46,5:4.4%:36♠36:55,32:9,23:698,53:8.68%"*
Above is one example, and I am trying to remove the duplicated number before or after each ♠ symbol.
Desired output is:
[3] "2:2: Texas:28,,845:25,46,5:4.4%:36:55,32:9,23:698,53:8.68%"
Basically, removing the numbers between the ♠ and the following colon, including the ♠ itself.
I would greatly appreciate any help. I have tried the following, but neither worked:
str_replace_all(dataSet, "♠*:", "", fixed = T)
gsub("*♠", "", data, fixed = T)
library(rvest)
library(stringr)
website <- read_html("https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States_by_population")
results <- website %>% html_nodes("table")
data_body <- results[1] %>% html_nodes("tbody")
rows <- data_body %>% html_nodes("tr")
rows_text <- rows %>% html_text()  # get the text of each row
clean_rows_text <- str_replace_all(rows_text, "[7000100000000000000]", "")
clean_rows_text <- str_replace_all(clean_rows_text, "\n\n", ":")
clean_rows_text <- str_replace_all(clean_rows_text, "\n", "")
From this point, I can handle the rest.

This should do it:
data <- "2♠2:2♠2: Texas:28,,845:25,46,5:4.4%:36♠36:55,32:9,23:698,53:8.68%*"
gsub("♠.+?(?=:)", "", data, perl=T)

Related

R: Convert a list to a dataframe to export to Excel - Text Mining

When trying to stem and tokenize my reviews, the result automatically becomes a list. The variable is of type "character" at first, but applying the following code turns it into a "list":
reviews <- tokenize_word_stems(reviews)
I eventually want to export this to Excel, but my write_xlsx function can only handle dataframes, not lists.
The rest of my code looks like this, but it goes "wrong" when trying to stem the words:
reviews <- readLines("Reviewlist.csv")
reviews <- gsub(pattern = "\\W", replace = " ", reviews)
reviews <- tolower(reviews)
reviews <- gsub(pattern="\\b[A-z]\\b{1}", replace=" ", reviews)
reviews <- stripWhitespace(reviews)
reviews <- removeWords(reviews, stopwords())
reviews <- tokenize_word_stems(reviews)
Thanks in advance!
Creating a lorem-ipsum dummy input here, based on my assumption of what your "Reviewlist.csv" looks like:
library(dplyr)
library(stringi)
stri_rand_lipsum(5) %>%
writeLines("Reviewlist.csv")
Then, this is just your original code without alterations, but written in pipe style and explicitly stating the necessary libraries:
library(tm)
library(tokenizers)
reviews <- readLines("Reviewlist.csv") %>%
gsub(pattern = "\\W", replace = " ", .) %>%
tolower() %>%
gsub(pattern="\\b[A-z]\\b{1}", replace=" ", .) %>%
stripWhitespace() %>%
removeWords(stopwords()) %>%
tokenize_word_stems()
Now, what you can do is bind your list items into a dataframe before writing it out as an xlsx file:
library(purrr)
library(writexl)
reviews_df <- reviews %>%
map_dfr(~ setNames(., sprintf("word_%04d", seq_along(.))))
reviews_df %>%
write_xlsx("Reviewlist.xlsx")
And that might create a very wide xlsx for you.
No idea whether Excel really is able to open it, but there you go :)
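If the wide layout is awkward, a long format with one word per row is a common alternative. A sketch, assuming reviews is the tokenized list from above (the review/word column names are my own):
library(purrr)
library(tibble)
# one row per (review number, word) pair instead of one very wide row per review
reviews_long <- imap_dfr(reviews, ~ tibble(review = .y, word = .x))
reviews_long %>%
write_xlsx("Reviewlist_long.xlsx")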

Error in 1:maxResults : argument of length 0 in gdscrapeR

I am using gdscrapeR to extract Glassdoor reviews, but I cannot get past the very first call in gdscrapeR:
library(gdscrapeR)
df <- get_reviews(companyNum = "E40371")
Number of web pages to scrape:
Starting
Error in 1:maxResults : argument of length 0
I went to the creator's blog here. The blog shows a breakdown of the get_reviews function. I think the problem is occurring here:
totalReviews <- read_html(paste(baseurl, companyNum, sort, sep = "")) %>%
html_nodes(".tightVert.floatLt strong, .margRtSm.minor") %>%
html_text() %>%
sub("Found | reviews", "", .) %>%
sub(",", "", .) %>%
as.integer()
maxResults <- as.integer(ceiling(totalReviews/10)) #10 reviews per page, round up to whole number
I don't know what I have to do to fix this issue. I just want to extract Glassdoor Reviews. Please help!
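Not a fix, but the mechanism behind the message: if the CSS selectors no longer match anything on the page (Glassdoor changes its markup regularly), html_text() returns character(0), which propagates all the way through to maxResults, and 1:maxResults then fails. A minimal reproduction of that failure mode in plain R:
# what the pipeline produces when html_nodes() finds nothing
totalReviews <- as.integer(character(0))            # integer(0)
maxResults <- as.integer(ceiling(totalReviews/10))  # still integer(0)
1:maxResults                                        # Error in 1:maxResults : argument of length 0
So the likely cause is a stale selector (".tightVert.floatLt strong, .margRtSm.minor") rather than anything in your call.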

How do I properly convert a vector of numbers written as characters into numeric type?

I have a problem with my R code.
I want to convert the numbers written as characters in the vector powpow into real numbers. As usual, I used the as.numeric() function, but I have no idea why it doesn't work.
Here is my code; if anyone knows how to solve my problem, please write.
Thanks in advance.
The problematic part starts with the comment "# mean and quantiles of district areas in Wielkopolska province".
############################################################
### Task 1 ###
library(rvest)
library(tidyverse)
library(magrittr)
url <- "https://pl.wikipedia.org/wiki/Wojew%C3%B3dztwo_wielkopolskie"
website_html <- url %>% read_html()
tbls <- website_html %>% html_nodes("table")
tabele <- tbls[11] %>% html_table() %>% as.data.frame()
head(tabele)
tabele <- tabele[, -1]
head(tabele)
length(colnames(tabele))
nazwy <- colnames(tabele)
nazwy[1] <- 'powiat'
nazwy[2] <- 'siedziba'
nazwy[3] <- 'ludnosc'
nazwy[4] <- 'powierzchnia'
nazwy[5] <- 'gestosc'
nazwy[6] <- 'urbanizacja'
nazwy[7] <- 'wyd_budzet'
nazwy[8] <- 'doch_budzet'
nazwy[9] <- 'zadluzenie'
nazwy[10] <- 'stopa'
colnames(tabele) <- nazwy
head(tabele)
powiaty <- tabele # rm(tabele)
# mean and quantiles of district areas in Wielkopolska province
str(powiaty$powierzchnia)
powpow <- powiaty$powierzchnia
str(powpow)
for(i in 1:length(powpow))
{
powpow[i] <- powpow[i] %>% gsub("\\,", "\\.", ., perl=TRUE) %>% as.numeric()
print(str(powpow[i]))
}
What I want is a powpow vector of numbers, not characters.
Depending on your locale settings, you may need to replace "," with "." as the decimal separator. An easy solution is as.numeric():
# if your global settings accept "," as a decimal separator
powpow_numeric <- as.numeric(powpow)
# if your global settings do NOT accept "," as a decimal separator
powpow_numeric <- as.numeric(sub(",", ".", powpow, fixed = T))
There is also a way to change your global settings if the first option doesn't work, but I don't know it off the top of my head. Maybe someone else can help with this.
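Note also that the for-loop in the question cannot work as written: powpow is a character vector, so each as.numeric() result is coerced straight back to character when assigned into powpow[i]. Since the tidyverse is already loaded, readr can also handle the comma decimal mark directly; a sketch:
library(readr)
# declare the decimal mark instead of substituting characters by hand
powpow_numeric <- parse_double(powpow, locale = locale(decimal_mark = ","))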
You already loaded the tidyverse package, so you can use the parse_number() function from readr to get a numeric vector out of powpow:
parse_number(powpow)
as.numeric(powpow) can do the same, but parse_number() will also work in cases where the vector contains non-numeric characters, like letters.
Anyway, based on what you have done, here is how I would handle all the other variables you will need to convert:
powiaty <- powiaty %>%
mutate(powierzchnia = parse_number(powierzchnia),
urbanizacja = parse_number(urbanizacja),
wyd_budzet = parse_number(wyd_budzet),
doch_budzet = parse_number(doch_budzet),
# in the case of "zadluzenie" and "stopa" we have to change ',' by dots before parsing
zadluzenie = str_replace(zadluzenie, ",", "\\."),
stopa = str_replace(stopa, ",", "\\."),
zadluzenie = parse_number(zadluzenie),
stopa = parse_number(stopa))
glimpse(powiaty)
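As an alternative to the manual str_replace() step, parse_number() also accepts a locale, so the comma-decimal columns can be parsed directly; a sketch (across() requires dplyr >= 1.0):
powiaty <- powiaty %>%
mutate(across(c(zadluzenie, stopa),
~ parse_number(.x, locale = locale(decimal_mark = ","))))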

How do I avoid 'NA' values when coercing a .tsv column into numeric via as.numeric?

I have a dataframe with several columns from a .tsv file and want to convert one of them to numeric for analysis. However, I keep getting the "NAs introduced by coercion" warning and do not know exactly why. There is some unnecessary info at the beginning of another column, which is pretty much the only formatting I did.
Originally, I thought the file might have added some extra tabs or spaces, which is why I tried to delete these with sub() before converting.
I should also mention that I get the NA warnings even when I do not replace the values and run the dataframe as is:
library(tidyverse)
data_2018 <- read_tsv('teina230.tsv')
data_1995 <- read_csv('OECD_1995.csv')
#get rid of long colname & select only columns containing %GDP
clean_data_2018 <- data_2018 %>%
select('na_item,sector,unit,geo','2018Q1','2018Q2','2018Q3','2018Q4') %>%
rename(country = 'na_item,sector,unit,geo')
clean_data_2018 <- clean_data_2018[grep("PC_GDP", clean_data_2018$'country'), ]
#remove unnecessary info
clean_data_2018 <- clean_data_2018 %>%
mutate(country=gsub('\\GD,S13,PC_GDP,','',country))
clean_data_2018 <- clean_data_2018 %>%
mutate(
'2018Q1'=as.numeric(sub("", "", '2018Q1', fixed = TRUE)),
'2018Q2'=as.numeric(sub(" ", "", '2018Q2', fixed = TRUE)),
'2018Q3'=as.numeric(sub(" ", "", '2018Q3', fixed = TRUE)),
'2018Q4'=as.numeric(sub(" ", "", '2018Q4', fixed = TRUE))
)
Is there another way to get around the problem and convert the column without replacing all the values with 'NA'?
Thanks guys :)
Thanks for the hint @divibisan!
Renaming the columns via rename() actually solved the problem. Here is the code that finally worked:
library(tidyverse)
data_2018 <- read_tsv('teina230.tsv')
#get rid of long colname & select only columns containing %GDP
clean_data_2018 <- data_2018 %>%
select('na_item,sector,unit,geo','2018Q1','2018Q2','2018Q3','2018Q4') %>%
rename(country = 'na_item,sector,unit,geo',
quarter_1 = '2018Q1',
quarter_2 = '2018Q2',
quarter_3 = '2018Q3',
quarter_4 = '2018Q4')
clean_data_2018 <- clean_data_2018[grep("PC_GDP", clean_data_2018$'country'), ]
#remove unnecessary info
clean_data_2018 <- clean_data_2018 %>%
mutate(country=gsub('\\GD,S13,PC_GDP,','',country))
clean_data_2018 <- clean_data_2018 %>%
mutate(
quarter_1 = as.numeric(quarter_1),
quarter_2 = as.numeric(quarter_2),
quarter_3 = as.numeric(quarter_3),
quarter_4 = as.numeric(quarter_4)
)
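For completeness: renaming works because '2018Q1' in quotes inside the original mutate() is just a literal string, so as.numeric() tried to parse the text "2018Q1" itself and returned NA with a coercion warning. Backticks refer to the column rather than a string, so this would also have worked without renaming; a sketch for one column:
clean_data_2018 <- clean_data_2018 %>%
mutate(`2018Q1` = as.numeric(`2018Q1`))  # backticks, not quotes, for non-syntactic names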

Inserting NA in blank values from web scraping

I am working on scraping some data into a data frame and am getting some empty fields where I would prefer to have NA. I have tried na.strings, but I am either placing it in the wrong place or it just isn't working, and I also tried to gsub out any all-whitespace fields from beginning of line to end, but that didn't work.
htmlpage <- read_html("http://www.gourmetsleuth.com/features/wine-cheese-pairing-guide")
sugPairings <- html_nodes(htmlpage, ".meta-wrapper")
suggestions <- html_text(sugPairings)
suggestions <- gsub("\\r\\n", '', suggestions)
How can I sub out the blank fields with NA, either once they are added to the data frame, or before adding them?
rvest::html_text has a built-in trimming option: set trim = TRUE.
After you have done this, you can use e.g. ifelse to test for an empty string (== "") or use nzchar.
In full, you could do this:
html_nodes(htmlpage, ".meta-wrapper") %>% html_text(trim=TRUE) %>% ifelse(. == "", NA, .)
or this:
res <- html_nodes(htmlpage, ".meta-wrapper") %>% html_text(trim=TRUE)
res[!nzchar(res)] <- NA_character_
# Richard Scriven's improvement:
html_nodes(htmlpage, ".meta-wrapper") %>% html_text(trim=TRUE) %>% replace(!nzchar(.), NA)
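If dplyr is on hand, na_if() expresses the same replacement in one step; a sketch:
library(dplyr)
html_nodes(htmlpage, ".meta-wrapper") %>% html_text(trim = TRUE) %>% na_if("")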
