subgraph text analysis in R (igraph)

subgraph text analysis in R (igraph) - r

I am curious how to access additional attributes for a graph which are associated with the edges. To follow along here is a minimal example:
library("igraph")
library("SocialMediaLab")
myapikey =''
myapisecret =''
myaccesstoken = ''
myaccesstokensecret = ''
tweets <- Authenticate("twitter",
apiKey = myapikey,
apiSecret = myapisecret,
accessToken = myaccesstoken,
accessTokenSecret = myaccesstokensecret) %>%
Collect(searchTerm="#trump", numTweets = 100,writeToFile=FALSE,verbose=TRUE)
g_twitter_actor <- tweets %>% Create("Actor", writeToFile=FALSE)
c <- igraph::components(g_twitter_actor, mode = 'weak')
subCluster <- induced.subgraph(g_twitter_actor, V(g_twitter_actor)[which(c$membership == which.max(c$csize))])
The initial tweets contains the following columns
colnames(tweets)
[1] "text" "favorited" "favoriteCount" "replyToSN" "created_at" "truncated" "replyToSID" "id"
[9] "replyToUID" "statusSource" "screen_name" "retweetCount" "isRetweet" "retweeted" "longitude" "latitude"
[17] "from_user" "reply_to" "users_mentioned" "retweet_from" "hashtags_used"
How can I access the text property for the subgraph in order to perform text analysis?
E(subCluster)$text does not work

E(subCluster)$text does not work because the values for tweets$text are not added to the graph when it is made. So you have to do that manually. It's a bit of a pain, but doable. Requires some subsetting of the tweets data frame and matching based on user names.
First, notice that the edge types are in a particular order: retweets, mentions, replies. The same text from a particular user can apply to all three of these. So I think it makes sense to add text serially.
> unique(E(g_twitter_actor)$edgeType)
[1] "Retweet" "Mention" "Reply"
Using dplry and reshape2 makes this easier.
library(reshape2); library(dplyr)
#Make data frame for retweets, mentions, replies
rts <- tweets %>% filter(!is.na(retweet_from))
ms <- tweets %>% filter(users_mentioned!="character(0)")
rpls <- tweets %>% filter(!is.na(reply_to))
Since users_mentioned can contain a list of individuals, we have to unlist it. But we want to associate the users mentioned with the user who mentioned them.
#Name each element in the users_mentioned list after the user who mentioned
names(ms$users_mentioned) <- ms$screen_name
ms <- melt(ms$users_mentioned) #melting creates a data frame for each user and the users they mention
#Add the text
ms$text <- tweets[match(ms$L1,tweets$screen_name),1]
Now add each of these to the network as an edge attribute by matching the edge type.
E(g_twitter_actor)$text[E(g_twitter_actor)$edgeType %in% "Retweet"] <- rts$text
E(g_twitter_actor)$text[E(g_twitter_actor)$edgeType %in% "Mention"] <- ms$text
E(g_twitter_actor)$text[E(g_twitter_actor)$edgeType %in% "Reply"] <- rpls$text
Now you can subset and get the edge value for text.
subCluster <- induced.subgraph(g_twitter_actor,
V(g_twitter_actor)[which(c$membership == which.max(c$csize))])

Related

Applying Function to Multiple Files & Handling the resulting dataframes

I've got a function which I'm trying to apply in a for loop that extracts a dataframe from multiple files and combines them into a single one.
This is how, from what I've read, I thought would be the best way to attack it but I get an empty list returned, when I was hoping for a list of dataframes which could be combined using bind_rows.
This is the code I'm using:
combined_functions <- function(file_name) {
#combines the get_dfm_df and get corp function: get dfm tibble straight from the file name
data_frame_returned<- get_dfm_df(getcorp(file_name))
data_frame_returned
}
list_of_dataframes <- list()
file.list <- dir(pattern ="DOCX$")
for (file in file.list) {
dataframe_of_file <- combined_function(file)
append(list_of_dataframes,dataframe_of_file)
}
bind_rows(list_of_dataframes, .id = "column_label") #https://stackoverflow.com/questions/2851327/convert-a-list-of-data-frames-into-one-data-frame
It creates an empty list, gets a list of the file names which the function combined_function uses to create a data frame out of the file and should, to my understanding, append this dataframe to the list. After all the files in the directory have been matched, bind_rows should combine it into one overall dataframe but it only returns an empty tibble. list_of_dataframes is also empty.
I've tried the solution in this answer but it didn't help:
Append a data frame to a list
https://www.dropbox.com/sh/z8vh50b370gcb1j/AAAcbnfAUOM6-y8uWn4-lUWLa?dl=0
This a link to the raw files I am using in this case, but I think the problem is a general one.
Appendix:
These are the functions combined_functions refer too. They work on the individual cases so I'm confident this is not the cause of the problem but I've included them for completeness anyway.
rm(list = ls())
library(quanteda)
library(quanteda.corpora)
library(readtext)
library(LexisNexisTools)
library(tidyverse)
library(tools)
getcorp<- function(file_name){
#function to take the lexis word document, convert it into quanteda corpus object, returns duplicate df and date from filename in list
LNToutput <- lnt_read(file_name)
duplicates_df <- lnt_similarity(LNToutput = LNToutput,
threshold = 0.99)
duplicates_df <- duplicates_df[duplicates_df$Similarity > 0.99] #https://github.com/JBGruber/LexisNexisTools creates dataframe of duplicate articles
LNToutput <- LNToutput[!LNToutput#meta$ID %in% duplicates_df$ID_duplicate, ] #removes these duplicates from the main dataframe
corp <- lnt_convert(LNToutput, to = "quanteda") #to return multiple values from the r function, must be placed in a list
corp_date_from_file_name <- basename(file_name)
file_date <- as.Date(corp_date_from_file_name, format ="%d_%m_%y")
list_of_returns <-list(duplicates_df, corp,file_date) #list returns has duplicate df in first position, corpus in second and the file date in third
list_of_returns
}
get_dfm_df <- function(corp_list){
# takes the corp from getcorp, applies lexicoder dictionary, adds the neg_pos etc to their equivalent columns,
# calculates the percentage each category is of the total number of sentiment bearing words, adds the date specified from the file name
corpus_we_want <- corp_list[[2]]
sentiment_df <- dfm(corpus_we_want, dictionary = data_dictionary_LSD2015) %>% #applies the dictionary
convert("data.frame") %>%
cbind(docvars(corpus_we_want)) %>% #https://stackoverflow.com/questions/60419692/how-to-convert-dfm-into-dataframe-but-keeping-docvars
as_tibble() %>%
mutate(combined_negative = negative + neg_positive, combined_positive = positive + neg_negative) %>%
mutate(pos_percentage = combined_positive/(combined_positive + combined_negative ), neg_percentage =combined_negative/(combined_positive + combined_negative ) ) %>%
mutate(date = corp_list[[3]])
sentiment_df
}

Fetching data from OECD into R via SDMX(XML)

I want to extract data from the OECD website particularily the dataset "REGION_ECONOM" with the dimensions "GDP" (GDP of the respective regions) and "POP_AVG" (the average population of the respective region).
This is the first time I am doing this:
I picked all the required dimensions on the OECD website and copied the SDMX (XML) link.
I tried to load them into R and convert them to a data frame with the following code:
(in the link I replaced the list of all regions with "ALL" as otherwise the link would have been six pages long)
if (!require(rsdmx)) install.packages('rsdmx') + library(rsdmx)
url2 <- "https://stats.oecd.org/restsdmx/sdmx.ashx/GetData/REGION_ECONOM/1+2.ALL.SNA_2008.GDP+POP_AVG.REAL_PPP.ALL.1990+1991+1992+1993+1994+1995+1996+1997+1998+1999+2000+2001+2002+2003+2004+2005+2006+2007+2008+2009+2010+2011+2012+2013+2014+2015+2016+2017+2018/all?"
sdmx2 <- readSDMX(url2)
stats2 <- as.data.frame(sdmx2)
head(stats2)
Unfortunately, this returns a "400 Bad request" error.
When just selecting a couple of regions the error does not appear:
if (!require(rsdmx)) install.packages('rsdmx') + library(rsdmx)
url1 <- "https://stats.oecd.org/restsdmx/sdmx.ashx/GetData/REGION_ECONOM/1+2.AUS+AU1+AU101+AU103+AU104+AU105.SNA_2008.GDP+POP_AVG.REAL_PPP.ALL.1990+1991+1992+1993+1994+1995+1996+1997+1998+1999+2000+2001+2002+2003+2004+2005+2006+2007+2008+2009+2010+2011+2012+2013+2014+2015+2016+2017+2018/all?"
sdmx1 <- readSDMX(url1)
stats1 <- as.data.frame(sdmx1)
head(stats1)
I also tried to use the "OECD" package to get the data. There I had the same problem. ("400 Bad Request")
if (!require(OECD)) install.packages('OECD') + library(OECD)
df1<-get_dataset("REGION_ECONOM", filter = "GDP+POP_AVG",
start_time = 2008, end_time = 2009, pre_formatted = TRUE)
However, when I use the package for other data sets it does work:
df <- get_dataset("FTPTC_D", filter = "FRA+USA", pre_formatted = TRUE)
Does anyone know where my mistake could lie?

the sdmx-ml api does not seem to work as explained (using the all parameter), whereas the json API works just fine. The following query returns the values for all countries and returns them as json - I simply replaced All by an empty field.
query <- https://stats.oecd.org/sdmx-json/data/REGION_ECONOM/1+2..SNA_2008.GDP+POP_AVG.REAL_PPP.ALL.1990+1991+1992+1993+1994+1995+1996+1997+1998+1999+2000+2001+2002+2003+2004+2005+2006+2007+2008+2009+2010+2011+2012+2013+2014+2015+2016+2017+2018/all?
Transforming it to a readable format is not so trivial. I played around a bit to find the following work-around:
# send a GET request using httr
library(httr)
query <- "https://stats.oecd.org/sdmx-json/data/REGION_ECONOM/1+2..SNA_2008.GDP+POP_AVG.REAL_PPP.ALL.1990+1991+1992+1993+1994+1995+1996+1997+1998+1999+2000+2001+2002+2003+2004+2005+2006+2007+2008+2009+2010+2011+2012+2013+2014+2015+2016+2017+2018/all?"
dat_raw <- GET(query)
dat_parsed <- parse_json(content(dat_raw, "text")) # parse the content
Next, access the observations from the nested list and transform them to a matrix. Also extract the features from the keys:
dat_obs <- dat_parsed[["dataSets"]][[1]][["observations"]]
dat0 <- do.call(rbind, dat_obs) # get a matrix
new_features <- matrix(as.numeric(do.call(rbind, strsplit(rownames(dat0), ":"))), nrow = nrow(dat0))
dat1 <- cbind(new_features, dat0) # add feature columns
dat1_df <- as.data.frame(dat1) # optionally transform to data frame
Finally you want to find out about the keys. Those are hidden in the "structure". This one you also need to parse correctly, so I wrote a function for you to easier extract the values and ids:
## Get keys of features
keys <- dat_parsed[["structure"]][["dimensions"]][["observation"]]
for (i in 1:length(keys)) print(paste("id position:", i, "is feature", keys[[i]]$id))
# apply keys
get_features <- function(data_input, keys_input, feature_index, value = FALSE) {
keys_temp <- keys_input[[feature_index]]$values
keys_temp_matrix <- do.call(rbind, keys_temp)
keys_temp_out <- keys_temp_matrix[, value + 1][unlist(data_input[, feature_index])+1] # column 1 is id, 2 is value
return(unlist(keys_temp_out))
}
head(get_features(dat1_df, keys, 7))
head(get_features(dat1_df, keys, 2, value = FALSE))
head(get_features(dat1_df, keys, 2, value = TRUE))
I hope that helps you in your project.
Best, Tobias

how to extract title from a pdf documment with R

I need help to extract information from a pdf file in r
(for example https://arxiv.org/pdf/1701.07008.pdf)
I'm using pdftools, but sometimes pdf_info() doesn't work and in that case I can't manage to do it automatically with pdf_text()
NB notice that tabulizer didn't work on my PC.
Here is the treatment I'm doing (Sorry you need to save the pdf and do it with your own path):
info <- pdf_info(paste0(path_folder,"/",pdf_path))
title <- c(title,info$keys$Title)
key <- c(key,info$keys$Keywords)
auth <- c(auth,info$keys$Author)
dom <- c(dom,info$keys$Subject)
metadata <- c(metadata,info$metadata)
I would like to get title and abstract most of the time.

We will need to make some assumptions about the structure of the pdf we wish to scrape. The code below makes the following assumptions:
Title and abstract are on page 1 (fair assumption?)
Title is of height 15
The abstract is between the first occurrence of the word "Abstract" and first occurrence of the word "Introduction"
library(tidyverse)
library(pdftools)
data = pdf_data("~/Desktop/scrape.pdf")
#Get First page
page_1 = data[[1]]
# Get Title, here we assume its of size 15
title = page_1%>%
filter(height == 15)%>%
.$text%>%
paste0(collapse = " ")
#Get Abstract
abstract_start = which(page_1$text == "Abstract.")[1]
introduction_start = which(page_1$text == "Introduction")[1]
abstract = page_1$text[abstract_start:(introduction_start-2)]%>%
paste0(collapse = " ")
You can, of course, work off of this and impose stricter constraints for your scraper.

R: find a specific string next to another string with for loop

I have the text of a novel in a single vector, it has been split by words novel.vector.words I am looking for all instances of the string "blood of". However since the vector is split by words, each word is its own string and I don't know to search for adjacent strings in a vector.
I have a basic understanding of what for loops do, and following some instructions from a text book, I can use this for loop to target all positions of "blood" and the context around it to create a tab-delineated KWIC display (key words in context).
node.positions <- grep("blood", novel.vector.words)
output.conc <- "D:/School/U Alberta/Classes/Winter 2019/LING 603/dracula_conc.txt"
cat("LEFT CONTEXT\tNODE\tRIGHT CONTEXT\n", file=output.conc) # tab-delimited header
#This establishes the range of how many words we can see in our KWIC display
context <- 10 # specify a window of ten words before and after the match
for (i in 1:length(node.positions)){ # access each match...
# access the current match
node <- novel.vector.words[node.positions[i]]
# access the left context of the current match
left.context <- novel.vector.words[(node.positions[i]-context):(node.positions[i]-1)]
# access the right context of the current match
right.context <- novel.vector.words[(node.positions[i]+1):(node.positions[i]+context)]
# concatenate and print the results
cat(left.context,"\t", node, "\t", right.context, "\n", file=output.conc, append=TRUE)}
What I am not sure how to do however, is use something like an if statement or something to only capture instances of "blood" followed by "of". Do I need another variable in the for loop? What I want it to do basically is for every instance of "blood" that it finds, I want to see if the word that immediately follows it is "of". I want the loop to find all of those instances and tell me how many there are in my vector.

You can create an index using dplyr::lead to match 'of' following 'blood':
library(dplyr)
novel.vector.words <- c("blood", "of", "blood", "red", "blood", "of", "blue", "blood")
which(grepl("blood", novel.vector.words) & grepl("of", lead(novel.vector.words)))
[1] 1 5
In response to the question in the comments:
This certainly could be done with a loop based approach but there is little point in re-inventing the wheel when there are already packages better designed and optimized to do the heavy lifting in text mining tasks.
Here is an example of how to find how frequently the words 'blood' and 'of' appear within five words of each other in Bram Stoker's Dracula using the tidytext package.
library(tidytext)
library(dplyr)
library(stringr)
## Read Dracula into dataframe and add explicit line numbers
fulltext <- data.frame(text=readLines("https://www.gutenberg.org/ebooks/345.txt.utf-8", encoding = "UTF-8"), stringsAsFactors = FALSE) %>%
mutate(line = row_number())
## Pair of words to search for and word distance
word1 <- "blood"
word2 <- "of"
word_distance <- 5
## Create ngrams using skip_ngrams token
blood_of <- fulltext %>%
unnest_tokens(output = ngram, input = text, token = "skip_ngrams", n = 2, k = word_distance - 1) %>%
filter(str_detect(ngram, paste0("\\b", word1, "\\b")) & str_detect(ngram, paste0("\\b", word2, "\\b")))
## Return count
blood_of %>%
nrow
[1] 54
## Inspect first six line number indices
head(blood_of$line)
[1] 999 1279 1309 2192 3844 4135

Create data frame from lists that are uneven

My code below is me scraping data from IMDB from multiple pages, however, when I try to combine the data into one data frame it is giving me an error telling me the differing rows for gross and meta. I was wondering how would I go about inserting NA values to those empty places so the strings are equal in length? (Note, I have to remove some links because I need certain rep to post more links)
urls <- c("https://www.imdb.com/search/title?title_type=feature&release_date=2010-01-01,2017-12-31",
"https://www.imdb.com/search/title?title_type=feature&release_date=2010-01-01,2017-12-31&start=51&ref_=adv_nxt",
"https://www.imdb.com/search/title?title_type=feature&release_date=2010-01-01,2017-12-31&start=101&ref_=adv_nxt",
"https://www.imdb.com/search/title?title_type=feature&release_date=2010-01-01,2017-12-31&start=151&ref_=adv_nxt",
"https://www.imdb.com/search/title?title_type=feature&release_date=2010-01-01,2017-12-31&start=201&ref_=adv_nxt",
"https://www.imdb.com/search/title?title_type=feature&release_date=2010-01-01,2017-12-31&start=251&ref_=adv_nxt",
"https://www.imdb.com/search/title?title_type=feature&release_date=2010-01-01,2017-12-31&start=301&ref_=adv_nxt",
"https://www.imdb.com/search/title?title_type=feature&release_date=2010-01-01,2017-12-31&start=351&ref_=adv_nxt",
"https://www.imdb.com/search/title?title_type=feature&release_date=2010-01-01,2017-12-31&start=401&ref_=adv_nxt",
"https://www.imdb.com/search/title?title_type=feature&release_date=2010-01-01,2017-12-31&start=451&ref_=adv_nxt",
"https://www.imdb.com/search/title?title_type=feature&release_date=2010-01-01,2017-12-31&start=501&ref_=adv_nxt",
"https://www.imdb.com/search/title?title_type=feature&release_date=2010-01-01,2017-12-31&start=551&ref_=adv_nxt",
"https://www.imdb.com/search/title?
)
results_list <- list()
for(.page in seq_along(urls)){
webpage <- read_html(urls[[.page]])
titlehtml <- html_nodes(webpage,'.lister-item-header a')
title <- html_text(titlehtml)
runtimehtml <- html_nodes(webpage,'.text-muted .runtime')
runtime <- html_text(runtimehtml)
runtime <- gsub(" min","",runtime)
ratinghtml <- html_nodes(webpage,'.ratings-imdb-rating strong')
rating<- html_text(ratinghtml)
voteshtml <- html_nodes(webpage,'.sort-num_votes-visible span:nth-child(2)')
votes <- html_text(voteshtml)
votes<-gsub(",","",votes)#removing commas
metascorehtml <- html_nodes(webpage,'.metascore')
metascore <- html_text(metascorehtml)
metascore<-gsub(" ","",metascore)#removing extra space in metascore
grosshtml <- html_nodes(webpage,'.ghost~ .text-muted+ span')
gross <- html_text(grosshtml)
gross<-gsub("M","",gross)#removing '$' and 'M' signs
gross<-substring(gross,2,6)
results_list[[.page]] <- data.frame(Title = title,
Runtime = as.numeric(runtime),
Rating = as.numeric(rating),
Metascore = as.numeric(metascore),
Votes = as.numeric(votes),
Gross_Earning_in_Mil = as.numeric(unlist(gross))
)
}
final_results <- plyr::ldply(results_list)
Error in data.frame(Title = title, Runtime = as.numeric(runtime), Rating = as.numeric(rating), :
arguments imply differing number of rows: 50, 49, 48

You need to know where your data is missing, so you need to know which items belong together. Right now you just have seperate vectors of values, so you don't know which belong together.
Looking at the page, it looks they are neatly organized into "lister-item-content"-nodes, so the clean thing to do is first extract those nodes, and only then pull out more info from each unit seperately. Something like this works for me:
items <- html_nodes(webpage,'.lister-item-content')
gross <- sapply(items, function(i) {html_text(html_node(i, '.ghost~ .text-muted+ span'))})
It inserts NA at every place where 'items' does not contain the header you're looking for.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

subgraph text analysis in R (igraph) - r

Related

Applying Function to Multiple Files & Handling the resulting dataframes

Fetching data from OECD into R via SDMX(XML)

how to extract title from a pdf documment with R

R: find a specific string next to another string with for loop

Create data frame from lists that are uneven

Categories

Resources