Inserting NA in blank values from web scraping - r

I am working on scraping some data into a data frame, and am getting some empty fields, where I would instead prefer to have NA. I have tried na.strings, but am either placing it in the wrong place or it just isn't working, and I tried to gsub anything that was whitespace from beginning of line to end, but that didn't work.
htmlpage <- read_html("http://www.gourmetsleuth.com/features/wine-cheese-pairing-guide")
sugPairings <- html_nodes(htmlpage, ".meta-wrapper")
suggestions <- html_text(sugPairings)
suggestions <- gsub("\\r\\n", '', suggestions)
How can I replace the blank fields with NA, either once the data is in the data frame or before adding it?

rvest::html_text has a built-in trimming option: set trim = TRUE.
After that, you can use e.g. ifelse to test for an empty string (== "") or use nzchar.
In full, you could do this:
html_nodes(htmlpage, ".meta-wrapper") %>% html_text(trim=TRUE) %>% ifelse(. == "", NA, .)
or this:
res <- html_nodes(htmlpage, ".meta-wrapper") %>% html_text(trim=TRUE)
res[!nzchar(res)] <- NA_character_
# Improvement suggested by Richard Scriven:
html_nodes(htmlpage, ".meta-wrapper") %>% html_text(trim=TRUE) %>% replace(!nzchar(.), NA)
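The same idea can be tried without network access on a made-up character vector. This is only a sketch: the sample values are invented, and base R's trimws() stands in for html_text(trim = TRUE).

```r
# Invented sample values standing in for scraped text (not real page content)
suggestions <- c("Brie", "", "  ", "Cheddar\r\n")

# trimws() removes leading/trailing spaces, tabs, \r and \n,
# playing the role of html_text(trim = TRUE) here
cleaned <- trimws(suggestions)

# nzchar() is FALSE for empty strings, so those positions become NA
cleaned[!nzchar(cleaned)] <- NA_character_

print(cleaned)
```

Using NA_character_ rather than a bare NA keeps the vector's type explicit, although a plain NA would be coerced in the same way.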

Related

How can I remove a specific symbol from an entire column

I am wondering how I can delete a specific symbol from an entire column. Here is what the original data look like: [image: original data]
The only elements I want to keep are the first words.
Here is what my full dataset looks like:
Below is some background on the data:
library("dplyr")
library("stringr")
library("tidyverse")
library("ggplot2")
# load the .csv into R studio, you can do this 1 of 2 ways
#read.csv("the name of the .csv you downloaded from kaggle")
spotiify_origional <- read.csv("charts.csv")
spotiify_origional <- read.csv("https://raw.githubusercontent.com/info201a-au2022/project-group-1-section-aa/main/data/charts.csv")
View(spotiify_origional)
# filters down the data
# removes the track id, explicit, and duration columns
spotify_modify <- spotiify_origional %>%
select(name, country, date, position, streams, artists, genres = artist_genres)
# returns all the data just from 2022
# this is the data set you should use for the project
spotify_2022 <- spotify_modify %>%
filter(date >= "2022-01-01") %>%
arrange(date) %>%
group_by(date)
spotify_2022_global <- spotify_modify %>%
filter(date >= "2022-01-01") %>%
filter(country == "global") %>%
arrange(date) %>%
group_by(streams)
View(spotify_2022_global)
This is what I did,
top_15 <- spotify_2022_global[order(spotify_2022_global$streams, decreasing = TRUE), ]
top_15 <- top_15[1:15,]
top_15$streams <- as.numeric(top_15$streams)
View(top_15)
top_15 <- top_15 %>%
separate(genres, c("genres"), sep = ',')
top_15$genres<-gsub("]","",as.character(top_15$genres))
View(top_15)
And now the names look like this: [image: name now look like this]
I tried to use the same gsub function to remove the rest of the brackets and quotation marks, but it didn't work.
I wonder what I should do at this point. Any recommendations would be a huge help! Thank you!
You could do this with a combination of gsub() to remove unwanted characters and stringr::word(), which is a convenient way to extract a word.
w <- "[firstWord, secondWord, thirdWord]"
stringr::word(gsub('[\\[,\']', '', w),1)
#> [1] "firstWord"
This works also for w <- "['firstWord', 'secondWord', 'thirdWord']".
top_15$genres <- gsub("]|\\[|[']","",as.character(top_15$genres))
where the regular expression "]|\\[|[']" uses the | character (OR) to match any of the following:
] closing square bracket
\\[ opening square bracket
['] single quotations
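To see the alternation pattern at work, here is a small sketch on invented genre strings. The sample values are assumptions, shaped like the bracketed lists in the question. It strips brackets and quotes, then keeps everything before the first comma.

```r
# Invented values shaped like the genres column in the question
genres <- c("['pop', 'dance pop']", "['latin']")

# Remove closing brackets, opening brackets and single quotes
cleaned <- gsub("]|\\[|[']", "", genres)

# Keep only the text before the first comma, i.e. the first genre
first_genre <- sub(",.*$", "", cleaned)

print(first_genre)
```

Note that sub() works element by element, so each row keeps its own first genre.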
Tidying up the "This is what I did" code in tidyverse style gives you:
spotify_2022_global %>%
  ungroup() %>%  # streams is a grouping column, so drop the grouping before changing it
  arrange(desc(streams)) %>%
  head(15) %>%
  mutate(streams = as.numeric(streams),
         genres = gsub("]|\\[|[']", "", genres),  # remove brackets and quote marks
         genres = sub(",.*$", "", genres))        # keep only the first genre in each row

Problems extracting data using JSON in R (getting a lexical error)

Related to the question asked here: R - Using SelectorGadget to grab a dataset
library(rvest)
library(jsonlite)
library(magrittr)
library(stringr)
library(purrr)
library(dplyr)
get_state_index <- function(states, state) {
return(match(T, map(states, ~ {
.x$name == state
})))
}
s <- read_html("https://www.opentable.com/state-of-industry") %>% html_text()
all_data <- jsonlite::parse_json(stringr::str_match(s, "__INITIAL_STATE__ = (.*?\\});w\\.")[, 2])
fullbook <- all_data$covidDataCenter$fullbook
hawaii_dataset <- tibble(
date = fullbook$headers %>% unlist() %>% as.Date(),
yoy = fullbook$states[get_state_index(fullbook$states, "Hawaii")][[1]]$yoy %>% unlist()
)
I am trying to grab the Hawaii dataset from the State tab. The code was working before but now it is throwing an error with this part of the code:
all_data <- jsonlite::parse_json(stringr::str_match(s, "__INITIAL_STATE__ = (.*?\\});w\\.")[, 2])
I am getting the error:
Error: lexical error: invalid char in json text.
                                       NA
                     (right here) ------^
Any proposed solutions? It seems that the website has remained the same for the year but what type of change is causing the code to break?
EDIT: The solution proposed by @QHarr:
all_data <- jsonlite::parse_json(stringr::str_match(s, "__INITIAL_STATE__ = ([\\s\\S]+\\});")[, 2])
This worked for a while, but then it seems their website changed the underlying HTML again.
Change the regex pattern as shown below so that it correctly captures the desired string within the response text, i.e. the JavaScript object to use for all_data:
all_data <- jsonlite::parse_json(stringr::str_match(s, "__INITIAL_STATE__ = ([\\s\\S]+\\});")[, 2])
Note: in an R string literal each regex backslash must be doubled, e.g. \\s rather than \s.
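The "NA" in the lexical error is a hint: when the pattern does not match, str_match returns NA, and parse_json is then handed NA instead of JSON text. The sketch below reproduces this on a made-up string. The contents of s and the helper extract_state are hypothetical; they are not the real OpenTable payload or part of any library.

```r
library(stringr)
library(jsonlite)

# Hypothetical stand-in for the page text; note the newline after "};"
s <- 'w.__INITIAL_STATE__ = {"covidDataCenter":{"fullbook":{"headers":["2020-02-18","2020-02-19"]}}};\nw.other = 1;'

old_pattern <- "__INITIAL_STATE__ = (.*?\\});w\\."
new_pattern <- "__INITIAL_STATE__ = ([\\s\\S]+\\});"

# The old pattern requires "});w." with nothing in between, so it fails here
old_capture <- str_match(s, old_pattern)[, 2]
print(old_capture)  # NA: this is what parse_json would have received

# Hypothetical helper: fail with a clear message instead of a lexical error
extract_state <- function(text, pattern) {
  json_txt <- str_match(text, pattern)[, 2]
  if (is.na(json_txt)) {
    stop("Pattern did not match; the page layout may have changed")
  }
  parse_json(json_txt)
}

state <- extract_state(s, new_pattern)
print(state$covidDataCenter$fullbook$headers[[1]])
```

Checking for NA before parsing does not fix a changed page, but it makes the cause obvious the next time the markup shifts.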

R: Convert type "List" to dataframe to convert to excel - Text Mining

When trying to stem and tokenize my list of reviews, it will automatically make it a list. It is a "character" type variable at first, but when applying the following code it turns it into a "list":
reviews <- tokenize_word_stems(reviews)
I want to eventually export this to Excel, but my write_xlsx function can only write data frames, not lists.
The rest of my code looks like this, but it goes "wrong" when trying to stem the words:
reviews <- readLines("Reviewlist.csv")
reviews <- gsub(pattern = "\\W", replace = " ", reviews)
reviews <- tolower(reviews)
reviews <- gsub(pattern="\\b[A-z]\\b{1}", replace=" ", reviews)
reviews <- stripWhitespace(reviews)
reviews <- removeWords(reviews, stopwords())
reviews <- tokenize_word_stems(reviews)
the file:
Thanks in advance!
Creating a lorem-ipsum dummy input here, based on my assumption of what your "Reviewlist.csv" looks like.
library(dplyr)
library(stringi)
stri_rand_lipsum(5) %>%
writeLines("Reviewlist.csv")
Then, this here is just your original code without alterations, but using dplyr grammar and explicitly stating the libraries necessary:
library(tm)
library(tokenizers)
reviews <- readLines("Reviewlist.csv") %>%
gsub(pattern = "\\W", replace = " ", .) %>%
tolower() %>%
gsub(pattern="\\b[A-z]\\b{1}", replace=" ", .) %>%
stripWhitespace() %>%
removeWords(stopwords()) %>%
tokenize_word_stems()
Now you can bind your list items into a data frame, which can then be written as an xlsx file:
library(purrr)
library(writexl)
reviews_df <- reviews %>%
map_dfr(~ setNames(., sprintf("word_%04d", seq_along(.))))
reviews_df %>%
write_xlsx("Reviewlist.xlsx")
And that might create a very wide xlsx for you.
No idea whether Excel really is able to open it, but there you go :)
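If the wide layout proves unwieldy, one alternative (a different technique from the map_dfr approach above) is a long layout with one row per token. The sketch below uses base R only; the token list is invented to mimic the output of tokenize_word_stems.

```r
# Invented list shaped like the output of tokenize_word_stems():
# one character vector of stems per review
tokens <- list(c("good", "food"), c("slow", "servic", "cold"))

# One row per token; "review" records which list element the token came from
reviews_long <- data.frame(
  review = rep(seq_along(tokens), lengths(tokens)),
  word   = unlist(tokens),
  stringsAsFactors = FALSE
)

print(reviews_long)
```

The resulting data frame has two columns no matter how long the reviews are, and it can be passed to write_xlsx in the same way as reviews_df.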

Whitespace string can't be replaced with NA in R

I want to replace whitespace with NA. A simple way could be df[df == ""] <- NA, and that works for most of the cells of my data frame... but not for all of them!
I have the following code:
library(rvest)
library(dplyr)
library(tidyr)
#Read website
htmlpage <- read_html("http://www.soccervista.com/results-Liga_MX_Apertura-2016_2017-844815.html")
#Extract table
df <- htmlpage %>% html_nodes("table") %>% html_table()
df <- as.data.frame(df)
#Set whitespaces into NA's
df[df == ""] <- NA
I figured out that some cells contain a single space between the quotation marks:
df[11,1]
[1] " "
So my solution was to do the following: df[df == " "] <- NA
However, the problem is still there and the cell still contains the space! I thought trimming would work, but it didn't...
#Trim
df[,c(1:10)] <- sapply(df[,c(1:10)], trimws)
However, the problem persists.
Any ideas?
We need to use lapply instead of sapply as sapply returns a matrix instead of a list and this can create problems in the quotes.
df[1:10] <- lapply(df[1:10], trimws)
Another option, if we have spaces like " ", is to use gsub to replace them with ""
df[1:10] <- lapply(df[,c(1:10)], function(x) gsub("^\\s+|\\s+$", "", x))
and then change the "" to NA
df[df == ""] <- NA
Or, instead of doing the two replacements, we can do this in one go and change the class with type.convert
df[] <- lapply(df, function(x)
type.convert(replace(x, grepl("^\\s*$", trimws(x)), NA), as.is = TRUE))
NOTE: We don't have to specify the column index when all the columns are looped
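A small self-contained sketch of the one-pass approach on an invented data frame (the column names and values are made up for illustration): each column is trimmed, blank cells become NA, and type.convert then restores a numeric type where possible.

```r
# Invented data frame with empty and whitespace-only cells
df <- data.frame(
  a = c("x", " ", ""),
  b = c("1", "2", "  "),
  stringsAsFactors = FALSE
)

# df[] keeps the data frame structure while replacing every column
df[] <- lapply(df, function(x) {
  x <- trimws(x)                 # strip leading/trailing whitespace
  x[x == ""] <- NA               # blank cells become NA
  type.convert(x, as.is = TRUE)  # e.g. "1", "2", NA becomes integer
})

str(df)
```

Because every column is looped over, no column index is needed, and as.is = TRUE keeps text columns as character rather than factor.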
I just spent some time trying to determine a method usable in a pipe.
Here is my method:
df <- df %>%
  dplyr::mutate(dplyr::across(dplyr::everything(), ~ sub("^\\s*$", NA, .x)))  # funs() is deprecated; note that sub() returns character columns
Hope this helps the next searcher.

Convert character to numeric without NA in r

I know this question has been asked many times (Converting Character to Numeric without NA Coercion in R, Converting Character\Factor to Numeric without NA Coercion in R, etc.) but I cannot seem to figure out what is going on in this one particular case (Warning message: NAs introduced by coercion). Here is some reproducible data I'm working with.
#dependencies
library(rvest)
library(dplyr)
library(pipeR)
library(stringr)
library(translateR)
#scrape data from website
url <- "http://irandataportal.syr.edu/election-data"
ir.pres2014 <- url %>%
read_html() %>%
html_nodes(xpath='//*[@id="content"]/div[16]/table') %>%
html_table(fill = TRUE)
ir.pres2014<-ir.pres2014[[1]]
colnames(ir.pres2014)<-c("province","Rouhani","Velayati","Jalili","Ghalibaf","Rezai","Gharazi")
ir.pres2014<-ir.pres2014[-1,]
#Get rid of unnecessary rows
ir.pres2014<-ir.pres2014 %>%
subset(province!="Votes Per Candidate") %>%
subset(province!="Total Votes")
#Get rid of commas
clean_numbers = function (x) str_replace_all(x, '[, ]', '')
ir.pres2014 = ir.pres2014 %>% mutate_each(funs(clean_numbers), -province)
#remove any possible whitespace in string
no_space = function (x) gsub(" ","", x)
ir.pres2014 = ir.pres2014 %>% mutate_each(funs(no_space), -province)
This is where things start going wrong for me. I tried each of the following lines of code but I got all NA's each time. For example, I begin by trying to convert the second column (Rouhani) to numeric:
#First check class of vector
class(ir.pres2014$Rouhani)
#convert character to numeric
ir.pres2014$Rouhani.num<-as.numeric(ir.pres2014$Rouhani)
Above returns a vector of all NA's. I also tried:
as.numeric.factor <- function(x) {seq_along(levels(x))[x]}
ir.pres2014$Rouhani2<-as.numeric.factor(ir.pres2014$Rouhani)
And:
ir.pres2014$Rouhani2<-as.numeric(levels(ir.pres2014$Rouhani))[ir.pres2014$Rouhani]
And:
ir.pres2014$Rouhani2<-as.numeric(paste(ir.pres2014$Rouhani))
All those return NA's. I also tried the following:
ir.pres2014$Rouhani2<-as.numeric(as.factor(ir.pres2014$Rouhani))
That created a list of single digit numbers so it was clearly not converting the string in the way I have in mind. Any help is much appreciated.
The reason is what looks like a leading space before the numbers:
> ir.pres2014$Rouhani
[1] " 1052345" " 885693" " 384751" " 1017516" " 519412" " 175608" …
Just remove that as well before the conversion. The situation is complicated by the fact that this character isn’t actually a space, it’s something else:
mystery_char = substr(ir.pres2014$Rouhani[1], 1, 1)
charToRaw(mystery_char)
# [1] c2 a0
I have no idea where it comes from but it needs to be replaced:
str_replace_all(x, rawToChar(as.raw(c(0xc2, 0xa0))), '')
Furthermore, you can simplify your code by applying the same transformation to all your columns at once:
mystery_char = rawToChar(as.raw(c(0xc2, 0xa0)))
to_replace = sprintf('[,%s]', mystery_char)
clean_numbers = function (x) as.numeric(str_replace_all(x, to_replace, ''))
ir.pres2014 = ir.pres2014 %>% mutate_each(funs(clean_numbers), -province)
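For reference, the bytes c2 a0 are the UTF-8 encoding of U+00A0, the no-break space, which commonly appears in HTML tables. The sketch below shows the same cleanup on invented values using base R only; the sample strings are made up, and this clean_numbers is an illustrative variant rather than the exact function above.

```r
# Invented values with a leading no-break space (U+00A0) and thousands separators
x <- c("\u00a01,052,345", "\u00a0885,693")

# Inspect the first character: code point 160 is the no-break space
first_code_point <- utf8ToInt(substr(x[1], 1, 1))
print(first_code_point)

# Remove commas, no-break spaces and ordinary spaces, then convert
clean_numbers <- function(x) as.numeric(gsub("[,\u00a0 ]", "", x))

cleaned <- clean_numbers(x)
print(cleaned)
```

Writing the character as "\u00a0" avoids building it from raw bytes and makes the intent visible in the pattern.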
