I am using this example to conduct sentiment analysis of a collection of .txt documents in R. The code is:
library(tm)
library(tidyverse)
library(tidytext)
library(glue)
library(stringr)
library(dplyr)
library(wordcloud)
require(reshape2)
files <- list.files(inputdir, pattern = "*.txt")
GetNrcSentiment <- function(file){
  fileName <- glue(inputdir, file, sep = "")
  fileName <- trimws(fileName)
  fileText <- glue(read_file(fileName))
  fileText <- gsub("\\$", "", fileText)
  tokens <- data_frame(text = fileText) %>% unnest_tokens(word, text)
  # get the sentiment from the first text:
  sentiment <- tokens %>%
    inner_join(get_sentiments("nrc")) %>% # pull out only sentiment words
    count(sentiment) %>% # count the # of positive & negative words
    spread(sentiment, n, fill = 0) %>% # make data wide rather than narrow
    mutate(sentiment = positive - negative) %>% # positive - negative
    mutate(file = file) %>% # add the name of our file
    mutate(year = as.numeric(str_match(file, "\\d{4}"))) %>% # add the year
    mutate(city = str_match(file, "(.*?).2")[2])
  return(sentiment)
}
The .txt files are stored in inputdir and have names of the form AB-City.0000, where AB is a country abbreviation, City is a city name, and 0000 is the year (ranging from 2000 to 2017).
The function works for a single file as expected, i.e. GetNrcSentiment(files[1]) gives me a tibble with proper counts per sentiment. However, when I try to run it for the whole set, i.e.
nrc_sentiments <- data_frame()
for(i in files){
  nrc_sentiments <- rbind(nrc_sentiments, GetNrcSentiment(i))
}
I get the following error message:
Joining, by = "word"
Error in rbind(deparse.level, ...) :
numbers of columns of arguments do not match
The exact same code works well with longer documents but gives an error when dealing with shorter texts. My guess is that not all sentiments are found in small documents, so the number of columns varies from document to document, which would explain the error, but I am not sure. I would appreciate any advice on how to fix the problem. If a sentiment is not found, I would want its entry to be zero (if that is indeed the cause of my problem).
As an aside, the bing sentiment function runs through about two dozen files and then gives a different error, which seems to point to the same problem (negative sentiment not found?):
GetBingSentiment <- function(file){
  fileName <- glue(inputdir, file, sep = "")
  fileName <- trimws(fileName)
  fileText <- glue(read_file(fileName))
  fileText <- gsub("\\$", "", fileText)
  tokens <- data_frame(text = fileText) %>% unnest_tokens(word, text)
  # get the sentiment from the first text:
  sentiment <- tokens %>%
    inner_join(get_sentiments("bing")) %>% # pull out only sentiment words
    count(sentiment) %>% # count the # of positive & negative words
    spread(sentiment, n, fill = 0) %>% # make data wide rather than narrow
    mutate(sentiment = positive - negative) %>%
    mutate(file = file) %>% # add the name of our file
    mutate(year = as.numeric(str_match(file, "\\d{4}"))) %>% # add the year
    mutate(city = str_match(file, "(.*?).2")[2])
  # return our sentiment dataframe
  return(sentiment)
}
Error in mutate_impl(.data, dots) :
Evaluation error: object 'negative' not found.
EDIT: Following the recommendation by David Klotz, I edited the code to
for(i in files){
  nrc_sentiments <- dplyr::bind_rows(nrc_sentiments, GetNrcSentiment(i))
}
As a result, instead of throwing an error, the nrc function generates NA when words from a certain sentiment are not found; however, after 22 joins I get a different error:
Error in mutate_impl(.data, dots) : Evaluation error: object 'negative' not found.
The same error shows up when I run the bing function with dplyr. By the time the function reaches the 22nd document, both data frames contain columns for all sentiments. What may cause the error, and how can I diagnose it?
dplyr's bind_rows function is more flexible than rbind, at least when it comes to missing columns:
nrc_sentiments <- dplyr::bind_rows(nrc_sentiments, GetNrcSentiment(i))
The input might be missing the "negative" column that is used in the mutate() expression, so guard against absent sentiment columns before computing the score.
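For example, a minimal sketch of that guard inside the function, assuming the wide data frame produced by spread() as in the question (the loop over expected column names is my addition, not part of the original code):
sentiment <- tokens %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(sentiment) %>%
  spread(sentiment, n, fill = 0)
# add any sentiment column the document happened not to contain
for (col in c("positive", "negative")) {
  if (!col %in% names(sentiment)) sentiment[[col]] <- 0
}
sentiment <- sentiment %>% mutate(sentiment = positive - negative)
The same guard works for the nrc lexicon; just extend the vector of expected column names to all ten nrc sentiments.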
I am wondering how I can delete a specific symbol from an entire column. Here is what the original data look like: (screenshot of the original data).
The only elements I want to keep are the first words.
Here is what my full dataset looks like: (screenshot).
Below is some background info on the data:
library("dplyr")
library("stringr")
library("tidyverse")
library("ggplot2")
# load the .csv into R studio, you can do this 1 of 2 ways
#read.csv("the name of the .csv you downloaded from kaggle")
spotiify_origional <- read.csv("charts.csv")
spotiify_origional <- read.csv("https://raw.githubusercontent.com/info201a-au2022/project-group-1-section-aa/main/data/charts.csv")
View(spotiify_origional)
# filters down the data
# removes the track id, explicit, and duration columns
spotify_modify <- spotiify_origional %>%
  select(name, country, date, position, streams, artists, genres = artist_genres)
#returns all the data just from 2022
#this is the data set you should use on the project
spotify_2022 <- spotify_modify %>%
  filter(date >= "2022-01-01") %>%
  arrange(date) %>%
  group_by(date)
spotify_2022_global <- spotify_modify %>%
  filter(date >= "2022-01-01") %>%
  filter(country == "global") %>%
  arrange(date) %>%
  group_by(streams)
View(spotify_2022_global)
This is what I did:
top_15 <- spotify_2022_global[order(spotify_2022_global$streams, decreasing = TRUE), ]
top_15 <- top_15[1:15, ]
top_15$streams <- as.numeric(top_15$streams)
View(top_15)
top_15 <- top_15 %>%
  separate(genres, c("genres"), sep = ',')
top_15$genres <- gsub("]", "", as.character(top_15$genres))
View(top_15)
And now the names look like this: (screenshot of the current names).
I tried using the same gsub function to remove the rest of the brackets and quotation marks, but it didn't work.
I wonder what I should do at this point? Any recommendations would be a huge help! Thank you!
You could do this with a combination of gsub() to remove unwanted characters and stringr::word(), which is a handy way to extract a word.
w <- "[firstWord, secondWord, thirdWord]"
stringr::word(gsub('[\\[,\']', '', w),1)
#> [1] "firstWord"
This works also for w <- "['firstWord', 'secondWord', 'thirdWord']".
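Applied to the whole column from the question (a sketch, assuming the top_15 data frame defined above):
top_15$genres <- stringr::word(gsub('[\\[,\']', '', top_15$genres), 1)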
top_15$genres <- gsub("]|\\[|[']","",as.character(top_15$genres))
where the regular expression "]|\\[|[']" uses the | character, OR, to match several things, namely:
] closing square bracket
\\[ opening square bracket
['] single quotations
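A quick check of the pattern on a sample value:
gsub("]|\\[|[']", "", "['pop', 'dance pop']")
#> [1] "pop, dance pop"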
Tidyversing up the "This is what I did" code gives you:
spotify_2022_global %>%
  arrange(desc(streams)) %>%
  head(15) %>%
  mutate(streams = as.numeric(streams),
         genres = gsub("]|\\[|[']", "", genres),          # remove brackets and quote marks
         genres = sapply(str_split(genres, ","), `[`, 1)) # keep the first genre
which gives the top 15 rows with a single cleaned genre in each genres value.
I have been trying to understand how to use possibly() to wrap a lambda/anonymous function within map_dfr() so that my iterations continue even when an error is encountered. I am currently iterating over a large number of webpages and scraping them with rvest, but some are not compiled correctly or do not work. I would simply like to note each error so that I can return to it later, while continuing to collect data from the remaining webpages. My current code is posted below, along with what I've tried:
df <- tibble(df, map_dfr(df$link, ~ {
  # Replicate Human Input by Forcing Random Pauses
  Sys.sleep(runif(1, 1, 3))
  # Read in the html links
  url <- .x %>% html_session(user_agent(user_agents)) %>% read_html()
  # Full Job Description Text
  description <- url %>%
    html_elements(xpath = "//div[@id = 'jobDescriptionText']") %>%
    html_text() %>% tolower()
  description <- as.character(description)
  # Hiring Insights
  hiring_insights <- url %>%
    html_elements(xpath = "//div[@id = 'hiringInsightsSectionRoot']") %>%
    html_text() %>% str_extract("#REGEX") %>%
    str_extract("#REGEX") %>%
    str_trim()
  hiring_insights <- as.character(hiring_insights)
  ### Extract Number of Hires
  hiring_insights <- str_trim(str_extract(hiring_insights, "#REGEX"))
  hiring_insights <- tolower(hiring_insights)
  ### Fill in all Missing Values with 1
  hiring_insights[which(is.na(hiring_insights))] <- "1"
  tibble(description, hiring_insights)
}))
I have tried wrapping the lambda function a few different ways but without success:
# First Attempt
df <- tibble(df, map_dfr(df$link, possibly(~ {——}, otherwise = "error")))
# Second Attempt
df <- tibble(df, map_dfr(df$link, possibly(function(x) {——}, otherwise = "error")))
# Third Attempt
df <- tibble(df, possibly(map_dfr(df$link, ~ {——}), otherwise = "error"))
# Fourth Attempt
df <- tibble(df, possibly(map_dfr(df$link, function(x) {——}), otherwise = "error"))
When writing the function with function(x) rather than with ~, I change .x to x within the lambda when defining the url variable. With each of these attempts, however, I eventually encounter a bad link, receive an HTTP 403 error, and the iteration stops and discards all of the data scraped from the previous links. What I would like is a dummy variable noting whether or not the link was bad, with the scraped variables for bad links filled in with NA, or simply with whatever the otherwise argument is set to. Thank you in advance! I've really hit a wall here.
map_dfr() expects a dataframe or named vector on every iteration. Your otherwise value isn’t named, so it throws an error. To illustrate:
library(purrr)
vals <- list(1, 2, "bad", 4, 5)
map_dfr(
  vals,
  possibly(
    ~ data.frame(x = .x^2),
    otherwise = NA_real_
  )
)
Error in `dplyr::bind_rows()`:
! Argument 3 must have names.
But if you change otherwise to return a dataframe:
map_dfr(
  vals,
  possibly(
    ~ data.frame(x = .x^2),
    otherwise = data.frame(x = NA_real_)
  )
)
x
1 1
2 4
3 NA
4 16
5 25
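Applied to the question, the same idea works: have otherwise return a one-row tibble whose columns match what the lambda normally returns. A minimal sketch, with a hypothetical scrape_one() standing in for the scraping code and a bad_link flag (my addition) so failed links can be revisited later:
library(purrr)
library(tibble)
# hypothetical stand-in for the scraping lambda in the question
scrape_one <- function(link) {
  if (grepl("bad", link)) stop("HTTP 403")  # simulate a failing link
  tibble(description = paste("text from", link),
         hiring_insights = "1",
         bad_link = FALSE)
}
safe_scrape <- possibly(
  ~ scrape_one(.x),
  otherwise = tibble(description = NA_character_,
                     hiring_insights = NA_character_,
                     bad_link = TRUE)
)
links <- c("https://ok/1", "https://bad/2", "https://ok/3")
map_dfr(links, safe_scrape)
#> rows 1 and 3 hold scraped values; row 2 is all NA with bad_link = TRUE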
I have a complex .txt file (screenshot below). I need to have each line as its own character string in order to group the lines by the 5-letter code near the beginning of each line (group together all GPGGA lines, for example; see the sample below) so I can process them as needed. Here's what I've run so far:
df <- data.frame(Weather_data)
df %>%
  mutate("Entry" = gsub(".*\\$([A-Z]+),.*", "\\1", text)) %>%
  group_by(Entry) %>%
  filter(Entry == "GPGGA")
This received the error:
Error: Problem with `mutate()` column `Entry`.
i `Entry = gsub(".*\\$([A-Z]+),.*", "\\1", text)`.
x cannot coerce type 'closure' to vector of type 'character'
I had success filtering as I needed when I copied in the first few lines and manually made them character strings to see if the code could work, so the next step is making each line a character string NOT manually (there are over 3000 lines). Does anyone have a function to do this?
Here are some of the lines produced when I load the imported txt file:
HEADER
<chr>
13:30:00.587: <- $GPGGA,183000.30,4415.6243,N,08823.9769,W,1,7,1.7,225.5,M,-33.4,M,,*68
13:30:00.683: <- $GPGLL,4415.6243,N,08823.9769,W,183000.40,A,A*72
13:30:00.779: <- $GPVTG,159.6,T,163.2,M,0.1,N,0.1,K,A*2E
13:30:00.827: <- $HCHDG,74.8,0.0,E,3.6,W*6E
13:30:01.003: <- $WIMDA,29.9641,I,1.0147,B,26.5,C,,,48.2,,14.6,C,323.0,T,326.6,M,1.4,N,0.7,M*66
13:30:01.051: <- $WIMWV,248.4,R,1.1,N,A*29
13:30:01.114: <- $WIMWV,255.6,T,1.3,N,A*23
13:30:01.195: <- $YXXDR,A,-53.9,D,PTCH,A,-34.2,D,ROLL*57
13:30:01.307: <- $YXXDR,A,0.571,G,XACC,A,0.783,G,YACC,A,-0.181,G,ZACC*57
13:30:01.578: <- $GPGGA,183001.30,4415.6242,N,08823.9769,W,1,7,1.7,225.9,M,-33.4,M,,*64
You referenced the variable text, which does not exist in your data.frame. Your column is named HEADER.
df %>%
  mutate("Entry" = gsub(".*\\$([A-Z]+),.*", "\\1", HEADER)) %>%
  group_by(Entry) %>%
  filter(Entry == "GPGGA")
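As for turning each of the 3000+ lines into its own character string without manual work: readLines() does exactly that, returning one string per line. A minimal sketch, assuming the raw log is still available as a .txt file (the file name here is hypothetical):
# one character string per line of the file
df <- data.frame(HEADER = readLines("weather_log.txt"), stringsAsFactors = FALSE)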
Hi, I am trying to scrape data from ebay in R. I used the code below, but ran into a problem: some listings were missing values for a particular selector element. To get around it I used a for loop (manually inspecting each listing and noting the indices where data were missing). Since the amount of data scraped was small this inspection was possible, but how do you handle it when there are large amounts of data to be scraped?
Thanks in advance
library(rvest)
url <- "https://www.ebay.in/sch/i.html_from=R40&_sacat=0&LH_ItemCondition=4&_ipg=100&_nkw=samsung+j7"
web <- read_html(url)
subdescp <- html_nodes(web, ".lvsubtitle+ .lvsubtitle")
subdescp1 <- html_text(subdescp)
head(subdescp1)
library(stringr)
subdescp1 <- str_replace_all(subdescp1, "[\t\n\r]", "")
head(subdescp1)
for (i in c(5, 6, 10, 19, 33, 34, 35)){
  a <- subdescp1[1:(i-1)]
  b <- subdescp1[i:length(subdescp1)]
  subdescp1 <- append(a, list("NA"))
  subdescp1 <- append(subdescp1, b)
}
Z <- as.character(subdescp1)
Z
webpage <- read_html(url)
Descp_data_html <- html_nodes(webpage, '.vip')
Descp_data <- html_text(Descp_data_html)
head(Descp_data)
price_data_html <- html_nodes(web, '.prc .bold')
price_data <- html_text(price_data_html)
head(price_data)
library(stringr)
price_data <- str_replace_all(price_data, "[\t\n]", "")
price_data <- gsub("Rs. ", "", price_data)
price_data <- gsub(",", "", price_data)
price_data <- as.numeric(price_data)
price_data
Desc_data_html <- html_nodes(webpage, '.lvtitle+ .lvsubtitle')
Desc_data <- html_text(Desc_data_html, trim = TRUE)
head(Desc_data)
j7_f2 <- data.frame(Title = Descp_data, Description = Desc_data, Sub_Description = Z, Price = price_data)
For instance, you can use something like this:
data <- read_html("url.xml")
var <- data %>% html_nodes(xpath = "//node") %>% xml_text()
# observations that don't have certain nodes - fill them with NA
var_pair <- data %>% html_nodes("node_var_pair")
var_missing_clean <- sapply(var_pair, function(x) {
  tryCatch(xml_text(html_nodes(x, xpath = "./var_missing")),
           error = function(err) NA)
})
df <- data.frame(var, var_pair, var_missing_clean)
Here there are three types of nodes to consider: var gathers the nodes that have no missing data, var_pair includes the nodes that you want to pair with the nodes containing missing observations, and var_missing refers to the nodes with missing information. You can create these variables and aggregate them in a data frame (df).
The process here is simple and has two steps. First, extract all nodes at the block level (not each element, and don't convert to text); this gives a list with one entry per block. Second, from this extracted list, extract each element as text and clean it. Since this is done from a list, NAs are automatically coerced into the right places where applicable. See an example from the same ebay India site:
library(rvest)
library(stringr)
# specify the url
url <-"https://www.ebay.in/sch/Mobile-Phones"
# read the page
web <- read_html(url)
# define the supernode that has the entire block of information
super_node <- '.li'
# read as vector of all blocks of supernode (imp: use html_nodes function)
super_node_read <- html_nodes(web, super_node)
# define each node element that you want
node_model_details <- '.lvtitle'
node_description_1 <- '.lvtitle+ .lvsubtitle'
node_description_2 <- '.lvsubtitle+ .lvsubtitle'
node_model_price <- '.prc .bold'
node_shipping_info <- '.bfsp'
# extract the output for each as cleaned text (imp: use html_node function)
model_details <- html_node(super_node_read, node_model_details) %>%
  html_text() %>%
  str_replace_all("[\t\n\r]", "")
description_1 <- html_node(super_node_read, node_description_1) %>%
  html_text() %>%
  str_replace_all("[\t\n\r]", "")
description_2 <- html_node(super_node_read, node_description_2) %>%
  html_text() %>%
  str_replace_all("[\t\n\r]", "")
model_price <- html_node(super_node_read, node_model_price) %>%
  html_text() %>%
  str_replace_all("[\t\n\r]", "")
shipping_info <- html_node(super_node_read, node_shipping_info) %>%
  html_text() %>%
  str_replace_all("[\t\n\r]", "")
# create the data.frame
mobile_phone_data <- data.frame(
  model_details,
  description_1,
  description_2,
  model_price,
  shipping_info
)
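To see why html_node() (singular) on the vector of blocks puts NA in the right places, here is a tiny self-contained demonstration with synthetic HTML (not taken from the ebay page):
library(rvest)
html <- minimal_html('
  <li class="item"><span class="prc">100</span></li>
  <li class="item"></li>')  # the second block has no price element
blocks <- html_nodes(html, ".item")
html_node(blocks, ".prc") %>% html_text()
#> [1] "100" NA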
I know this question has been asked many times (Converting Character to Numeric without NA Coercion in R, Converting Character\Factor to Numeric without NA Coercion in R, etc.), but I cannot figure out what is going on in this one particular case (Warning message: NAs introduced by coercion). Here is some reproducible data I'm working with.
#dependencies
library(rvest)
library(dplyr)
library(pipeR)
library(stringr)
library(translateR)
#scrape data from website
url <- "http://irandataportal.syr.edu/election-data"
ir.pres2014 <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="content"]/div[16]/table') %>%
  html_table(fill = TRUE)
ir.pres2014 <- ir.pres2014[[1]]
colnames(ir.pres2014) <- c("province", "Rouhani", "Velayati", "Jalili", "Ghalibaf", "Rezai", "Gharazi")
ir.pres2014 <- ir.pres2014[-1, ]
#Get rid of unnecessary rows
ir.pres2014 <- ir.pres2014 %>%
  subset(province != "Votes Per Candidate") %>%
  subset(province != "Total Votes")
#Get rid of commas
clean_numbers = function (x) str_replace_all(x, '[, ]', '')
ir.pres2014 = ir.pres2014 %>% mutate_each(funs(clean_numbers), -province)
#remove any possible whitespace in string
no_space = function (x) gsub(" ", "", x)
ir.pres2014 = ir.pres2014 %>% mutate_each(funs(no_space), -province)
This is where things start going wrong for me. I tried each of the following lines of code but I got all NA's each time. For example, I begin by trying to convert the second column (Rouhani) to numeric:
#First check class of vector
class(ir.pres2014$Rouhani)
#convert character to numeric
ir.pres2014$Rouhani.num<-as.numeric(ir.pres2014$Rouhani)
Above returns a vector of all NA's. I also tried:
as.numeric.factor <- function(x) {seq_along(levels(x))[x]}
ir.pres2014$Rouhani2<-as.numeric.factor(ir.pres2014$Rouhani)
And:
ir.pres2014$Rouhani2<-as.numeric(levels(ir.pres2014$Rouhani))[ir.pres2014$Rouhani]
And:
ir.pres2014$Rouhani2<-as.numeric(paste(ir.pres2014$Rouhani))
All those return NA's. I also tried the following:
ir.pres2014$Rouhani2<-as.numeric(as.factor(ir.pres2014$Rouhani))
That created a list of single-digit numbers, so it was clearly not converting the string in the way I have in mind. Any help is much appreciated.
The reason is what looks like a leading space before the numbers:
> ir.pres2014$Rouhani
[1] " 1052345" " 885693" " 384751" " 1017516" " 519412" " 175608" …
Just remove that as well before the conversion. The situation is complicated by the fact that this character isn’t actually a space, it’s something else:
mystery_char = substr(ir.pres2014$Rouhani[1], 1, 1)
charToRaw(mystery_char)
# [1] c2 a0
I have no idea where it comes from but it needs to be replaced:
str_replace_all(x, rawToChar(as.raw(c(0xc2, 0xa0))), '')
Furthermore, you can simplify your code by applying the same transformation to all your columns at once:
mystery_char = rawToChar(as.raw(c(0xc2, 0xa0)))
to_replace = sprintf('[,%s]', mystery_char)
clean_numbers = function (x) as.numeric(str_replace_all(x, to_replace, ''))
ir.pres2014 = ir.pres2014 %>% mutate_each(funs(clean_numbers), -province)
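As a hedged alternative (my suggestion, not part of the code above): since these columns are vote counts, you could strip every character that is not a digit before converting, which sidesteps identifying the stray byte altogether:
# assumes the columns contain only integer counts, as vote totals do
clean_numbers = function (x) as.numeric(gsub("[^0-9]", "", x))
ir.pres2014 = ir.pres2014 %>% mutate_each(funs(clean_numbers), -province)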