I am running into an error while trying to make a corpus object from the tm package in R.
The data have been scraped from a website and I have included the full code below so you can run and see how the data were gathered and the tibble was created. The very last line of code is where I am getting stuck! (I have modified the loop so it should run in a few seconds).
Any help would be appreciated. :)
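library(rvest)  # read_html(), html_nodes(), html_attr(), html_text() and the %>% pipe
library(tibble) # tibble()
library(tm)     # VCorpus(), DataframeSource()
# create a loop that iteratively appends page numbers to the archive URL
# keep the loop numbers small for testing before full data is pulled in
output <- character()
for (i in 1:2) {
  article.links <- paste0("https://scholarlykitchen.sspnet.org/archives/page/", i ,"/") %>%
    read_html() %>%
    html_nodes(".list-article__title") %>%
    html_nodes("a") %>%
    html_attr("href") # keep the link target of each article title
  output <- c(output, article.links)
}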
# get all comments
get.comments <- function(output) {
  article.page <- read_html(output)
  article.comments <- article.page %>% html_nodes(".comment") %>% html_text() %>% trimws(which = "both")
}
text <- sapply(output, FUN = get.comments, USE.NAMES = FALSE)
# get all dates
get.dates <- function(output) {
  article.page <- read_html(output)
  article.comments <- article.page %>% html_nodes(".comment__meta__date") %>% html_text() %>% trimws(which = "both")
}
dates <- sapply(output, FUN = get.dates, USE.NAMES = FALSE)
# create the main df for the analysis
df <- tibble(
  text = unlist(text, recursive = TRUE), # unlist is needed because sapply returns a list when the results differ in length
  dates = unlist(dates, recursive = TRUE)
)
# extract dates from meta data
df$dates <- as.character(gsub(",","",df$dates))
df$dates <- as.Date(df$dates, "%B%d%Y")
# create df ready for topic modelling
# this needs to have very specifically named columns
df.tm <- df[-2] # create duplicate for backup (dates not needed for topic modelling yet)
df.tm$doc_id <- row.names(df) # create a unique id for each row as is needed by the tm package
df.tm <- df.tm[c(2,1)] # reorders the columns
# From the comments text, create the corpus
corpus <- VCorpus(DataframeSource(df))
The error is below:
Error in DataframeSource(df) :
all(!is.na(match(c("doc_id", "text"), names(x)))) is not TRUE
DataframeSource() requires the data frame to have a document index in its first column, labeled "doc_id", and the document text in a column labeled "text". The call in the question passes df, which has no doc_id column, rather than df.tm.
df_with_id <- rowid_to_column(df, var = "doc_id") # Alternatively, generate a doc index that better represents your collection of documents.
corpus <- VCorpus(DataframeSource(df_with_id))
Metadata: corpus specific: 0, document level (indexed): 1
Content: documents: 141
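The column requirement can be checked without tm at all. The sketch below is only an illustration in base R: the toy data and the helper name has_required_columns are made up, and the helper simply repeats the condition quoted in the error message.

```r
# Toy stand-in for the scraped comments (made-up data, not from the site)
df <- data.frame(
  text  = c("first comment", "second comment", "third comment"),
  dates = as.Date(c("2020-01-05", "2020-01-06", "2020-01-07")),
  stringsAsFactors = FALSE
)

# Hypothetical helper that repeats the condition shown in the error message
has_required_columns <- function(x) {
  all(!is.na(match(c("doc_id", "text"), names(x))))
}

before <- has_required_columns(df)

# Put doc_id first and text second; any remaining columns follow
df_ready <- data.frame(
  doc_id = as.character(seq_len(nrow(df))),
  df,
  stringsAsFactors = FALSE
)
after <- has_required_columns(df_ready)

print(c(before = before, after = after))
print(names(df_ready))
```

Running a check like this before calling DataframeSource() makes it obvious which data frame is actually being passed in.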
I have a script that works fine for one file. It takes the information from a json file, extracts a list and a sublist of it (A), and then another list B with the third element of list A. It creates a data frame with list B and compares it with a master file. Finally, it provides two numbers: the number of elements in the list B and the number of matching elements of that list when comparing with the master file.
However, I have 180 different json files in a folder and I need to run the script for all of them, and build a data frame with the results for each file. So the final result should be something like this (note that the last line's figures are correct, the first two are fictitious):
The code I have so far is the following:
library(rjson)     # fromJSON(file = ...)
library(tidyverse) # read_csv(), select(), inner_join() and the %>% pipe
# load data from file
file <- "./raw_data/whf.json"
json_data <- fromJSON(file = file)
org_name <- json_data$id
# extract lists and the sublist
usernames <- json_data$twitter
following <- usernames$following
# create empty vector to populate
longitud = length(following)
names <- vector(length = longitud)
# loop to populate the empty vector with third element of the sub-list
for(i in 1:longitud){
  names[i] <- following[[i]][3]
}
# create a data frame and change column name
names_list <- data.frame(sapply(names, c))
colnames(names_list) <- "usernames"
# create a data frame with the correct formatting ready to comparison
org_handles <- data.frame(paste("#", names_list$usernames, sep=""))
colnames(org_handles) <- "Twitter"
# load master file and select the needed columns
psa_handles <- read_csv(file = "./raw_data/psa_handles.csv") %>%
select(Name, AKA, Twitter)
# merge data frames and present the results
org_list <- inner_join(psa_handles, org_handles)
My first attempt is to include this code at the beginning:
files <- list.files()
for(f in files){
  json_data <- fromJSON(file = f)
  # the rest of the script for one file here
}
but I do not know how to write the code for the data frame, or even how to integrate both ideas (the working script and the loop over the file names). I took the idea from here.
The new code after Alvaro Morales' answer is the following
archivos <- list.files("./raw_data/")
calculate_accounts <- function(archivos){
  # load data from file
  path <- paste("./raw_data/", archivos, sep = "")
  json_data <- fromJSON(file = path)
  org_name <- json_data$id
  # extract lists and the sublist
  usernames <- json_data$twitter
  following <- usernames$following
  # create empty vector to populate
  longitud = length(following)
  names <- vector(length = longitud)
  # loop to populate the empty vector with third element of the sub-list
  for(i in 1:longitud){
    names[i] <- following[[i]][3]
  }
  # create a data frame and change column name
  names_list <- data.frame(sapply(names, c))
  colnames(names_list) <- "usernames"
  # create a data frame with the correct formatting ready to comparison
  org_handles <- data.frame(paste("#", names_list$usernames, sep=""))
  colnames(org_handles) <- "Twitter"
  # load master file and select the needed columns
  psa_handles <- read_csv(file = "./psa_handles.csv") %>%
    select(Name, AKA, Twitter)
  # merge data frames and present the results
  org_list <- inner_join(psa_handles, org_handles)
  accounts_db_org <- length(org_list$Twitter)
  accounts_total_org <- length(usernames$following)
}
table_psa <- map_dfr(archivos, calculate_accounts)
However, there is now an error when joining (Joining, by = "Twitter"); it says subscript out of bounds.
Links to 3 test files to put together in raw_data folder:
Link to the master file to compare:
<<<<< UPDATE >>>>>>
I kept looking for a solution and got the code to run and produce a valid output (a 180x3 data frame), but the columns that should hold the values of the objects accounts_db_org and accounts_total_org show NA. When I check the values stored in those objects, they are correct (for the last iteration). So the output is now in the right format, but with NA instead of numbers.
I am really close, but I cannot get the code to show the right numbers. My last attempt is:
archivos <- list.files("./raw_data", pattern = "json", full.names = TRUE)
psa_handles <- read_csv(file = "./raw_data/psa_handles.csv", show_col_types = FALSE) %>%
  select(Name, AKA, Twitter)
nr_archivos <- length(archivos)
psa_result <- matrix(nrow = nr_archivos, ncol = 3)
# loop for working with all files, one by one
for(f in 1:nr_archivos){
  # load file
  json_data <- fromJSON(file = archivos[f])
  org_name <- json_data$id
  # extract lists and the sublist
  usernames <- json_data$twitter
  following <- usernames$following
  # empty vector
  longitud = length(following)
  names <- vector(length = longitud)
  # loop to populate with the third element of each i item of the sublist
  for(i in 1:longitud){
    names[i] <- following[[i]][3]
  }
  # convert the list into a data frame
  names_list <- data.frame(sapply(names, c))
  colnames(names_list) <- "usernames"
  # applying some format prior to comparison
  org_handles <- data.frame(paste("#", names_list$usernames, sep=""))
  colnames(org_handles) <- "Twitter"
  # merge tables and calculate the results for each iteration
  org_list <- inner_join(psa_handles, org_handles)
  accounts_db_org <- length(org_list$Twitter)
  accounts_total_org <- length(usernames$following)
  # populate the matrix row by row
  psa_result[f] <- c(org_name, accounts_db_org, accounts_total_org)
}
# create a data frame from the matrix and save the result
psa_result <- data.frame(psa_result)
write_csv(psa_result, file = "./outputs/cuentas_seguidas_en_psa.csv")
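A likely reason for the NA columns is the indexing in psa_result[f]: on a matrix, a single index addresses one cell, so only the first of the three values is stored. The base R sketch below uses made-up results (the names and numbers are placeholders) to contrast that with psa_result[f, ], which fills a whole row.

```r
# Made-up per-file results standing in for org_name, accounts_db_org, accounts_total_org
results <- list(
  c("org_a", 20, 2924),
  c("org_b", 11, 530)
)

# Single index: each assignment targets ONE cell, so only the first value is kept
single_cell <- matrix(nrow = 2, ncol = 3)
for (f in seq_along(results)) {
  # R warns that the replacement has more items than the target; silenced here
  suppressWarnings(single_cell[f] <- results[[f]])
}

# Row index: each assignment fills all three columns of row f
whole_row <- matrix(nrow = 2, ncol = 3)
for (f in seq_along(results)) {
  whole_row[f, ] <- results[[f]]
}

print(single_cell)
print(whole_row)
```

Note that a matrix holds a single type, so the counts become character strings next to the organisation name; building a data frame row per file keeps them numeric.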
The subscript out of bounds error was caused by a json file with 0 records. That was fixed by deleting the file.
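An empty file does not have to be deleted for the loop to survive it. With zero records, 1:longitud counts down from 1 to 0, so the loop body still runs and indexes a position that does not exist. The sketch below, in base R with made-up records and a hypothetical helper extract_third, shows the difference with seq_len().

```r
following <- list()            # a file with 0 records, as described above
longitud <- length(following)

risky <- 1:longitud            # counts down: 1, 0
safe  <- seq_len(longitud)     # empty, so a for loop over it never runs

visited <- 0
for (i in safe) visited <- visited + 1

# Hypothetical helper: pull the third element of every record without an explicit loop
extract_third <- function(following) {
  vapply(following, function(entry) as.character(entry[[3]]), character(1))
}

empty_names <- extract_third(list())
some_names  <- extract_third(list(list(1, "x", "alice"), list(2, "y", "bob")))

print(risky)
print(length(safe))
print(some_names)
```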
You can do it with purrr::map or purrr::map_dfr.
Is this what you are looking for?
library(tidyverse) # purrr, dplyr, readr, stringr
archivos <- list.files("./raw_data", pattern = "json", full.names = TRUE)
# load master file and select the needed columns. This needs to be out of "calculate_accounts" because you only read it once.
psa_handles <- read_csv(file = "./raw_data/psa_handles.csv") %>%
  select(Name, AKA, Twitter)
# calculate accounts
calculate_accounts <- function(archivo){
  json_data <- rjson::fromJSON(file = archivo)
  org_handles <- json_data %>%
    pluck("twitter", "following") %>%
    map_chr("username") %>%
    as_tibble() %>%
    rename(usernames = value) %>%
    mutate(Twitter = str_c("#", usernames)) %>%
    select(Twitter) # keep only the join key
  org_list <- inner_join(psa_handles, org_handles)
  org_list %>%
    mutate(accounts_db_org = length(Twitter),
           accounts_total_org = nrow(org_handles)) %>%
    select(-Twitter) # drop the join key from the result
}
table_psa <- map_dfr(archivos, calculate_accounts)
# A tibble: 53 x 4
Name AKA accounts_db_org accounts_total_org
<chr> <chr> <int> <int>
1 Association of American Medical Colleges AAMC 20 2924
2 American College of Cardiology ACC 20 2924
3 American Heart Association AHA 20 2924
4 British Association of Dermatologists BAD 20 2924
5 Canadian Psoriasis Network CPN 20 2924
6 Canadian Skin Patient Alliance CSPA 20 2924
7 European Academy of Dermatology and Venereology EADV 20 2924
8 European Society for Dermatological Research ESDR 20 2924
9 US Department of Health and Human Service HHS 20 2924
10 International Alliance of Dermatology Patients Organisations (Global Skin) IADPO 20 2924
# ... with 43 more rows
Unfortunately, the answer provided by Álvaro does not work as expected, since the output repeats the same number with different organisation names, making it really difficult to read. Actually, the number 20 is repeated 20 times, the number 11, 11 times, and so on. The information is there, but it is not accessible without further data treatment.
I was doing my own research in the meantime and arrived at the following code. I finally got it to work, but the result had class "matrix" "array", which was really confusing. The last lines therefore transpose the data, unlist the array and convert it into a matrix, which can then be converted into a data frame and manipulated as usual.
Maybe my explanation is not very useful, and since I am a newbie, I am sure the code is far from being elegant and optimised. Anyway, please review the code below:
# Load data from files.
archivos <- list.files("./raw_data/json_files",
                       pattern = ".json",
                       full.names = TRUE)
psa_handles <- read_csv(file = "./raw_data/psa_handles.csv") %>%
  select(Name, AKA, Twitter)
nr_archivos <- length(archivos)
calcula_cuentas <- function(a){
  # Extract lists
  json_data <- fromJSON(file = a)
  org_aka <- json_data$id
  org_meta <- json_data$metadata
  org_name <- org_meta$company
  twitter <- json_data$twitter
  following <- twitter$following
  # create an empty vector to populate
  longitud = length(following)
  names <- vector(length = longitud)
  # loop to populate the empty vector with third element of the sub-list
  for(i in 1:longitud){
    names[i] <- following[[i]][3]
  }
  # create a data frame and change column name
  names_list <- data.frame(sapply(names, c))
  colnames(names_list) <- "usernames"
  # Create a data frame with the correct formatting ready to comparison
  org_handles <- data.frame(paste("#",
                                  names_list$usernames,
                                  sep = ""))
  colnames(org_handles) <- "Twitter"
  # merge tables
  org_list <- inner_join(psa_handles, org_handles)
  cuentas_db_org <- length(org_list$Twitter)
  cuentas_total_org <- length(twitter$following)
  results <- data.frame(Name = org_name,
                        AKA = org_aka,
                        Cuentas_db = cuentas_db_org,
                        Total = cuentas_total_org)
}
# apply function to list of files and unlist the result
psa <- sapply(archivos, calcula_cuentas)
psa1 <- t(as.data.frame(psa))
psa2 <- matrix(unlist(psa1), ncol = 4) %>%
  as.data.frame()
colnames(psa2) <- c("Name", "AKA", "tw_int_outbound", "tw_ext_outbound")
# Save the results.
saveRDS(psa2, file = "rda/psa.RDS")
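The transpose and unlist steps can probably be avoided by letting the function return a one-row data frame and binding the rows afterwards. The following is a minimal base R sketch under stated assumptions: parsed_files, master_handles and summarise_org are made-up stand-ins for the parsed JSON files, the Twitter column of the master file and calcula_cuentas, and the match count uses %in% instead of inner_join.

```r
# Made-up stand-ins for three parsed JSON files (names and values are hypothetical)
parsed_files <- list(
  list(id = "org_a", twitter = list(following = list(
    list(1, "x", "alice"), list(2, "y", "bob")))),
  list(id = "org_b", twitter = list(following = list(
    list(3, "z", "carol")))),
  list(id = "org_c", twitter = list(following = list()))
)
# Made-up stand-in for the Twitter column of the master file
master_handles <- c("#alice", "#carol", "#dave")

# One file in, one row out
summarise_org <- function(json_data) {
  following <- json_data$twitter$following
  handles <- vapply(following,
                    function(entry) paste0("#", entry[[3]]),
                    character(1))
  data.frame(
    AKA        = json_data$id,
    Cuentas_db = sum(handles %in% master_handles), # followed accounts found in the master list
    Total      = length(following),                # all followed accounts
    stringsAsFactors = FALSE
  )
}

# Bind the per-file rows into a single data frame
psa <- do.call(rbind, lapply(parsed_files, summarise_org))
print(psa)
```

With real files, the lapply() call would run over archivos and the function would start by reading the file; purrr::map_dfr would do the same binding in one step.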
I am new to web scraping. The URL I am working with is https://tsmc.tripura.gov.in/doc_list. At present, I am able to extract data from the first page only. Since the URL does not change between pages, I don't have an identifier for the other pages that I could use in a loop to extract the tables.
Here is my code:
library(RCurl) # getURL()
library(XML)   # readHTMLTable()
library(rlist) # list.clean()
url1<- getURL("https://tsmc.tripura.gov.in/doc_list",.opts =
list(ssl.verifypeer = FALSE))
table1<- readHTMLTable(url1)
table1<- list.clean(table1, fun = is.null, recursive = FALSE)
n.rows <- unlist(lapply(table1, function(t) dim(t)[1]))
table11= table1[["NULL"]]
Please help. Thanks!
Perhaps try this solution:
url <- "https://tsmc.tripura.gov.in/doc_list?page="
sq <- seq(1, 30) # There appear to be 30 pages, so we create a sequence 1:30
links <- paste0(url, sq) #Paste the sequence after the url "page="
store <- NULL
tbl <- NULL
library(rvest) # extract the tables
library(plyr)  # ldply()
for(i in links){
  store[[i]] = read_html(i)
  tbl[[i]] = html_table(store[[i]])
}
df <- ldply(tbl, data.frame) #combine the list of data frames into one large data frame
df$`.id` <- gsub("https://tsmc.tripura.gov.in/doc_list?page=", " ", df$`.id`, fixed = TRUE)
Which gives 846 observations across 8 variables.
EDIT: I found that the first url does not have a sequence. In order to add the first page and rbind it with the rest of the data use the following:
firsturl <- "https://tsmc.tripura.gov.in/doc_list"
first_store = read_html(firsturl)
first_tbl = html_table(first_store)
first_df <- as.data.frame(first_tbl)
first_df$`.id` <- 0
df2 <- rbind(first_df, df)
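The first page and the numbered pages can also be handled in one pass by putting the bare URL at the front of the vector of links. The sketch below only builds the URLs and page ids and binds placeholder tables, so it needs no network access; n_pages = 30 comes from the answer above, while fake_tables is made-up data standing in for the scraped tables.

```r
base_url <- "https://tsmc.tripura.gov.in/doc_list"
n_pages  <- 30  # page count as observed in the answer above

# First page has no query string; the rest follow the ?page=N pattern
page_urls <- c(base_url, paste0(base_url, "?page=", seq_len(n_pages)))

# Page id: 0 for the bare URL, N for ?page=N
page_id <- ifelse(grepl("?page=", page_urls, fixed = TRUE),
                  sub(".*page=", "", page_urls),
                  "0")
page_id <- as.integer(page_id)

# Placeholder tables (made up); in practice each would come from html_table()
fake_tables <- lapply(page_id, function(id) data.frame(page = id, row = 1:2))
combined <- do.call(rbind, fake_tables)

print(head(page_urls, 3))
print(dim(combined))
```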
I have a set of XML files that I am reading in, and wanted to know the best way to deal with the following:
<DecisionReasons xmlns:a="http://schemas.datacontract.org/2004/07/Contracts">
At the moment, I am running some R code but get the response:
Error: Duplicate identifiers for rows (2, 3, 4)
The intended output is a dataframe that looks something like:
fieldname |contents
MyDecision_Decision_DecisionID |X1234
MyDecision_Decision_DecisionReasons_Reason_Description_DOBMismatch |DOBMismatch
MyDecision_Decision_DecisionReasons_Reason_Description_PrimaryChecksFail |PrimaryChecksFail
MyDecision_Decision_DecisionReasons_Reason_Description_IncomeReferral |IncomeReferral
My current code is as below:
library(xml2)  # read_xml(), xml_children(), xml_parents(), xml_name(), xml_text()
library(tidyr) # spread()
library(plyr)  # rbind.fill()
df <- data.frame()
transposed.df1 <- data.frame()
allxmldata <- data.frame()
inputfiles <- as.character('test.xml')
findchildren<-function(nodes, df) {
  numchild <- sapply(nodes, function(x){length(xml_children(x))})
  xmlvalue <- xml_text(nodes[numchild==0])
  xmlname <- xml_name(nodes[numchild==0])
  xmlpath <- sapply(nodes[numchild==0], function(x) {gsub(', ','_', toString(rev(xml_name(xml_parents(x)))))})
  if (isTRUE(xmlpath == 'MyDecision_Decision_DecisionReasons_Reason')) {
    fieldname <- paste(xmlpath,xmlname,xmlvalue,sep = '_')
  } else {
    fieldname <- paste(xmlpath,xmlname,sep = '_')
  }
  contents <- sapply(xmlvalue, function(f){is.na(f)<-which(f == '');f})
  dftemp <- data.frame(fieldname, contents)
  df <- rbind(df, dftemp)
  if (sum(numchild)>0){
    findchildren(xml_children(nodes[numchild>0]), df) }
  else{ return(df)}
}
for (x in inputfiles) {
  df1 <- findchildren(xml_children(read_xml(x)),df)
  xml.df1 <- data.frame(spread(df1, key = fieldname, value = contents), fix.empty.names = TRUE)
  allxmldata <- rbind.fill(allxmldata,xml.df1)
}
I hope that there is someone that can point out what I have done wrong...
I came up with the following solution that works for your example dataset. Hopefully, this approach can also be used for your larger dataset.
# we need these two packages
library(xml2)
library(tidyverse)
# read in the xml-file
xml <- read_xml(
<DecisionReasons xmlns:a="http://schemas.datacontract.org/2004/07/Contracts">
xml %>%
  # first find all nodes that don't contain any child nodes
  xml_find_all(".//node()[not(node())]") %>%
  # find the parent of each node
  map(xml_parent) %>%
  # extract name and text of each of the childless nodes
  map(~list(name = xml_name(.x), text = xml_text(.x))) %>%
  # bind rows.
  bind_rows()
This produces the following output:
# A tibble: 4 x 2
name text
<chr> <chr>
1 DecisionID X1234
2 Description DOBMismatch
3 Description PrimaryChecksFail
4 Description IncomeReferral
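A possible cause of the duplicate identifiers in the question is the isTRUE() test: it is only TRUE for a single TRUE value, so when several Description leaves are handled together the comparison yields a vector and the value is never appended to the field name. The base R sketch below assumes the leaves arrive as vectors like those built inside findchildren(), with the values taken from the intended output, and uses ifelse() to decide element by element.

```r
# Leaf nodes as vectors, using the values from the intended output above
reason_path <- "MyDecision_Decision_DecisionReasons_Reason"
xmlpath  <- c("MyDecision_Decision", rep(reason_path, 3))
xmlname  <- c("DecisionID", rep("Description", 3))
xmlvalue <- c("X1234", "DOBMismatch", "PrimaryChecksFail", "IncomeReferral")

# isTRUE() on a comparison of length > 1 is FALSE, even if some elements match
whole_vector_test <- isTRUE(xmlpath == reason_path)

# ifelse() applies the rule per element, so only the Reason leaves get the value suffix
fieldname <- ifelse(xmlpath == reason_path,
                    paste(xmlpath, xmlname, xmlvalue, sep = "_"),
                    paste(xmlpath, xmlname, sep = "_"))

result <- data.frame(fieldname, contents = xmlvalue, stringsAsFactors = FALSE)
print(whole_vector_test)
print(result)
```

Because every field name is now unique, a later spread() on fieldname has nothing to collide on.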
Hi, I am trying to scrape data from eBay in R. I used the code below, but I ran into a problem: some listings have no value for a particular selector, so the scraped vectors end up with different lengths. To get round it I used a for loop, as shown, after inspecting each listing and noting the positions where data was missing. That was feasible because only a small amount of data was scraped, but how can this be done when there is a large amount of data to scrape?
Thanks in advance
library(rvest)   # read_html(), html_nodes(), html_text()
library(stringr) # str_replace_all()
web<- read_html(url)
subdescp<- html_nodes(web, ".lvsubtitle+ .lvsubtitle")
subdescp1<- html_text(subdescp)
subdescp1<- str_replace_all(subdescp1, "[\t\n\r]" , "")
# insert NA at the positions of the listings found (by inspection) to have no second subtitle
for (i in c(5,6,10,19,33,34,35)){
  subdescp1 <- append(subdescp1, NA, after = i - 1)
}
Z <- subdescp1
webpage <- read_html(url)
Descp_data_html <- html_nodes(webpage,'.vip')
Descp_data <- html_text(Descp_data_html)
price_data_html <- html_nodes(web,'.prc .bold')
price_data <- html_text(price_data_html)
price_data<-str_replace_all(price_data, "[\t\n]" , "")
price_data<-gsub("Rs. ","",price_data)
price_data<- as.numeric(price_data)
Desc_data_html <- html_nodes(webpage,'.lvtitle+ .lvsubtitle')
Desc_data <- html_text(Desc_data_html, trim = TRUE)
j7_f2<-data.frame(Title = Descp_data, Description= Desc_data, Sub_Description= Z, Price = price_data)
For instance you can use something like this.
library(rvest) # read_html(), html_nodes()
library(xml2)  # xml_text()
data <- read_html("url.xml")
var <- data %>% html_nodes(xpath = "//node") %>% xml_text()
# observations that don't have certain nodes - fill them with NA
var_pair <- data %>% html_nodes("node_var_pair")
var_missing <- sapply(var_pair, function(x) {
  tryCatch(xml_text(html_nodes(x, xpath = "./var_missing")),
           error=function(err) NA)
})
df = data.frame(var, var_pair = xml_text(var_pair), var_missing)
There are three types of nodes to consider here. var gathers the nodes that have no missing data. var_pair holds the nodes that you want to pair with the nodes that may be missing, and var_missing refers to the nodes with missing information. You can create these variables and combine them in a data frame (df).
The process here is simple and in two steps -- First extract all nodes at the block level (not each element and don't convert to text). This is a list of length equal to the number of blocks. Second from this extracted list extract each element as text and clean it. Since this is being done from a list, NA's where applicable are automatically coerced in the right places. See an example from the same ebay India site:
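library(rvest)   # read_html(), html_nodes(), html_node(), html_text() and the %>% pipe
library(stringr) # str_replace_all()
# specify the url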
url <-"https://www.ebay.in/sch/Mobile-Phones"
# read the page
web <- read_html(url)
# define the supernode that has the entire block of information
super_node <- '.li'
# read as vector of all blocks of supernode (imp: use html_nodes function)
super_node_read <- html_nodes(web, super_node)
# define each node element that you want
node_model_details <- '.lvtitle'
node_description_1 <- '.lvtitle+ .lvsubtitle'
node_description_2 <- '.lvsubtitle+ .lvsubtitle'
node_model_price <- '.prc .bold'
node_shipping_info <- '.bfsp'
# extract the output for each as cleaned text (imp: use html_node function)
model_details <- html_node(super_node_read, node_model_details) %>%
html_text() %>%
str_replace_all("[\t\n\r]" , "")
description_1 <- html_node(super_node_read, node_description_1) %>%
html_text() %>%
str_replace_all("[\t\n\r]" , "")
description_2 <- html_node(super_node_read, node_description_2) %>%
html_text() %>%
str_replace_all("[\t\n\r]" , "")
model_price <- html_node(super_node_read, node_model_price) %>%
html_text() %>%
str_replace_all("[\t\n\r]" , "")
shipping_info <- html_node(super_node_read, node_shipping_info) %>%
html_text() %>%
str_replace_all("[\t\n\r]" , "")
# create the data.frame
mobile_phone_data <- data.frame(
  model_details,
  description_1,
  description_2,
  model_price,
  shipping_info
)
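The difference between the two steps is easiest to see on a tiny page. The sketch below parses a made-up HTML string (not real eBay markup, only reusing the class names from above) in which the second listing has no subtitle: html_nodes() on the whole page silently skips it, while html_node() applied to each block returns one result per block, with NA where nothing matched.

```r
library(rvest)

# Made-up HTML: the second listing has no subtitle
page <- read_html('
<ul>
  <li class="li"><h3 class="lvtitle">Phone A</h3><div class="lvsubtitle">Brand new</div></li>
  <li class="li"><h3 class="lvtitle">Phone B</h3></li>
  <li class="li"><h3 class="lvtitle">Phone C</h3><div class="lvsubtitle">Refurbished</div></li>
</ul>')

# Step 1: one node per listing block
blocks <- html_nodes(page, ".li")

# Step 2: one result per block, so the vectors stay aligned
titles    <- html_text(html_node(blocks, ".lvtitle"))
subtitles <- html_text(html_node(blocks, ".lvsubtitle"))

# For comparison: selecting on the whole page drops the missing entry
flat <- html_text(html_nodes(page, ".lvsubtitle"))

listings <- data.frame(title = titles, subtitle = subtitles, stringsAsFactors = FALSE)
print(listings)
print(length(flat))
```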
I have a file with multiple HTML links in it and now want to use dplyr and rvest to get the image link for the URL in each row.
When I do it manually it works fine and returns the row but when the same code is called within a function it fails with the following error:
Error: no applicable method for 'xml_find_all' applied to an object of
class "factor"
I don't know what I'm doing wrong. Any help is appreciated. In order to make my question more clear I have added (in comments) a few example rows and also shown the manual approach.
library(dplyr) # mutate()
library(rvest) # html_nodes(), html_attr() and the %>% pipe
library(httr) # contains function stop_for_status()
#get html links from file
# "_id",url
# 560fc55c65818bee0b77ec33,http://www.seriouseats.com/recipes/2011/01/sriracha-ceviche-recipe.html
# 560fc57e65818bee0b78d8b7,http://www.seriouseats.com/recipes/2008/07/pasta-arugula-tomatoes-recipe.html
# 560fc57e65818bee0b78dcde,http://www.seriouseats.com/recipes/2007/08/cook-the-book-minty-boozy-chic.html
# 560fc57e65818bee0b78de93,http://www.seriouseats.com/recipes/2010/02/chipped-beef-gravy-on-toast-stew-on-a-shingle-recipe.html
# 560fc57e65818bee0b78dfe6,http://www.seriouseats.com/recipes/2011/05/dinner-tonight-quinoa-salad-with-lemon-cream.html
# 560fc58165818bee0b78e65e,http://www.seriouseats.com/recipes/2010/10/dinner-tonight-spicy-quinoa-salad-recipe.html
#load into SE
SE <- read.csv("~/Desktop/SeriousEats.csv")
#function to retrieve imgPath per URL
#using rvest
getImgPath <- function(x) {
  imgPath <- x %>% html_nodes(".photo") %>% html_attr("src")
}
#This works fine
#UrlPage <- read_html ("http://www.seriouseats.com/recipes/2011/01/sriracha-ceviche-recipe.html")
#imgPath <- UrlPage %>% html_nodes(".photo") %>% html_attr("src")
#This throws an error msg
S <- mutate(SE, imgPath = getImgPath(SE$url))
This works:
# SE <- data_frame(url = c(
# "http://www.seriouseats.com/recipes/2011/01/sriracha-ceviche-recipe.html",
# "http://www.seriouseats.com/recipes/2008/07/pasta-arugula-tomatoes-recipe.html"
# ))
library(dplyr) # rowwise(), mutate()
library(rvest) # read_html(), html_nodes(), html_attr()
SE <- read.csv('/path/to/SeriousEats.csv', stringsAsFactors = FALSE)
getImgPath <- function(x) {
  # x must be "a document, a node set or a single node" per rvest documentation; cannot be a factor or character
  imgPath <- read_html(x) %>% html_nodes(".photo") %>% html_attr("src")
  # httr::stop_for_status(res) OP said this is not necessary, so I removed
}
S <- SE %>%
  rowwise() %>%
  mutate(imgPath = getImgPath(url))
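Two things changed in the working version: the URL column is read as character, and the function is called once per row. Before R 4.0.0, read.csv() converted strings to factors by default, which is where the "factor" in the error message comes from. The base R sketch below uses made-up example.com URLs and a hypothetical stand-in, fake_img_path, so it needs no network access.

```r
csv_text <- "id,url
1,http://www.example.com/a.html
2,http://www.example.com/b.html"

# Same file, read with and without conversion to factors
with_factors <- read.csv(text = csv_text, stringsAsFactors = TRUE)
with_strings <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Hypothetical stand-in for getImgPath(): accepts exactly one URL string
fake_img_path <- function(one_url) {
  stopifnot(is.character(one_url), length(one_url) == 1)
  sub("[.]html$", ".jpg", one_url)
}

# Call the function once per URL, which is what rowwise() arranges in the dplyr version
with_strings$imgPath <- vapply(with_strings$url, fake_img_path,
                               character(1), USE.NAMES = FALSE)

print(class(with_factors$url))
print(with_strings)
```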
Thanks for the help and patience, #Jubbles. For the benefit of others, here is the complete answer.
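library(dplyr) # rowwise(), mutate()
library(rvest) # read_html(), html_nodes(), html_attr()
library(RCurl) # url.exists()
SE <- read.csv("~/Desktop/FILE.txt", stringsAsFactors = FALSE)
getImgPath <- function(x) {
  if (try(url.exists(x))) {
    imgPath <- read_html(x) %>%
      html_nodes(".photo") %>%
      html_attr("src")
  }
  else {
    imgPath = "NA"
  }
}
SE1 <- SE %>%
  rowwise() %>%
  mutate(imgPath = getImgPath(url))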