PROJECT - collecting topic information from Australian High Court decisions published online to analyse trends over time.
CURRENT PROBLEM - trying to iterate 'case_importer' function (which returns relevant character strings from decisions) over each year a number of times equal to the number of cases decided in that year.
I can achieve this manually by using a for loop which iterates as many times as there were cases each year - using 2020 (which has had 14 cases so far) for example:
for (i in 1:14) {
hc_cases_2020_list[[i]] <- case_importer(2020, i)
}
However, I would rather not have to rewrite the number of iterations for each year manually.
I have stored the number of cases per year in an Excel sheet called 'hc_cases_per_year', which has two columns, 'year' and 'num_cases', and created a function which returns the 'num_cases' value for a given year.
Any advice in solving the specific problem or taking a better approach would be beyond appreciated.
My code so far is below - thanks for taking the time to read!
library(rvest)
library(dplyr)
library(readxl)
# Function to import html data and turn it into text
html_to_text <- function(url) {
url %>%
read_html %>%
html_nodes("li , p , blockquote , b , h2") %>%
html_text
}
# Function to extract topic paragraphs using grep
grep_topics <- function(text) {
grep("^\\n?(:?\\w+\\s)+\\(?C?t?h?\\)?\\s?–.*", text, perl = TRUE, value = TRUE)
}
# Function to wrap the above functions into a single function that takes year and case number
case_importer <- function(year, case_num) {
url <- url_maker(year, case_num)
text <- html_to_text(url)
grep_topics(text)
}
# Function + command to create empty list for each year
list_namer <- function(year) {
assign(paste0("hc_cases_", year, "_list"), list(), envir = .GlobalEnv)
}
for (i in 1948:2020) {
list_namer(i)
}
# Function to change list to dataframe
list_to_df <- function(list) {
num_obs <- sapply(list, length)
seq_max <- seq_len(max(num_obs))
t(sapply(list, "[", i = seq_max))
}
# Function to return number of cases in given year
cases_in_year <- function(year) {
hc_cases_per_year$num_cases[year - 1947]
}
# Function to create a URL for any given year and case number
url_maker <- function(year, case_num) {
paste("http://classic.austlii.edu.au/au/cases/cth/HCA/", year, "/", case_num, ".html", sep = "")
}
# Importing hc_cases_per_year
hc_cases_per_year <- read_excel("PATH/hc_cases_per_year.xlsx")
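One way to avoid hard-coding the iteration count is to look it up per year and build the index with seq_len(). This is only a minimal sketch, assuming the year/num_cases layout described above. The case_importer() here is a stub that skips the download, import_year() is a hypothetical helper, and the 2019 count is a made-up placeholder (only the 14 for 2020 comes from the post):

```r
# Stand-in for the Excel sheet; the 2019 count is a placeholder, not real data
hc_cases_per_year <- data.frame(year = c(2019, 2020), num_cases = c(3, 14))

# Look up the count by matching the year rather than by row position
cases_in_year <- function(year) {
  hc_cases_per_year$num_cases[hc_cases_per_year$year == year]
}

# Stub so the sketch runs offline; the real case_importer() scrapes AustLII
case_importer <- function(year, case_num) {
  paste(year, case_num, sep = "/")
}

# Hypothetical helper: one list per year, one element per case
import_year <- function(year) {
  lapply(seq_len(cases_in_year(year)), function(i) case_importer(year, i))
}

years <- hc_cases_per_year$year
hc_cases <- lapply(setNames(years, years), import_year)

length(hc_cases[["2020"]])  # number of cases imported for 2020
```

Keeping everything in one named list (hc_cases[["2020"]]) may also be easier to work with than creating a separate global object per year with assign(), and seq_len() copes with a year that has zero cases.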
I'm writing a small package for scraping lyrics. The problem right now is that it acts in a very clumsy way with no control. I would like to be able to specify what range to affect in the for loop that does the scraping.
I have two functions, one which presents the list of songs and the other that scrapes them:
songlist <- function(x) {
url <- paste0("https://www.azlyrics.com/", substring(x, 1, 1),"/",x, ".html")
page <- url
songs <- page %>%
xml2::read_html() %>%
rvest::html_nodes(xpath = "/html/body/div[2]/div/div[2]/div[4]/div/a") %>%
rvest::html_text() %>%
as.data.frame()
chart <- cbind(songs)
names(chart) <- c("Songs")
chart <- tibble::as_tibble(chart)
return(chart)
}
This gives me a tibble like this:
Songs
<chr>
1 Meet Me In The Hallway
2 Sign Of The Times
3 Carolina
4 Two Ghosts
5 Sweet Creature
6 Only Angel
7 Kiwi
8 Ever Since New York
9 Woman
Another function, called songscrape(), takes a string argument used to build the URL and scrapes all songs from that artist. It has a copy of songlist() inside it, which it uses to generate the song URLs, like so:
songscrape <- function(x) {
url <- paste0("https://www.azlyrics.com/", substring(x, 1, 1),"/",x, ".html")
artist <- x
SongsListScrapper <- function(x) {
page <- x
songs <- page %>%
xml2::read_html() %>%
rvest::html_nodes(xpath = "/html/body/div[2]/div/div[2]/div[4]/div/a") %>%
rvest::html_text() %>%
as.data.frame()
chart <- cbind(songs)
names(chart) <- c("Songs")
chart <- tibble::as_tibble(chart)
return(chart)
}
SongsList <- purrr::map_df(url, SongsListScrapper)
SongsList
SongsList %<>%
dplyr::mutate(
Songs = as.character(Songs)
,Songs = gsub("[[:punct:]]", "", Songs)
,Songs = tolower(Songs)
,Songs = gsub(" ", "", Songs)
)
SongsList$Songs
#Scrape Lyrics
wipe_html <- function(str_html) {
gsub("<.*?>", "", str_html)
}
lyrics <- c()
for(i in seq_along(SongsList$Songs)) {
for_url_name <- SongsList$Songs[i]
#clean name
for_url_name <- tolower(gsub("[[:punct:]]\\s", "", for_url_name))
#create url
paste_url <- paste0("https://www.azlyrics.com/lyrics/", artist,"/", for_url_name, ".html")
tryCatch( {
#open connection to url
for_html_code <- xml2::read_html(paste_url)
for_lyrics <- rvest::html_node(for_html_code, xpath = "/html/body/div[2]/div/div[2]/div[5]")
}
As you can see, the for loop says for(i in seq_along(SongsList$Songs)), which basically means all songs. Is it possible to enter a range instead? For example, only the first 5 songs, or the 10th through the 15th song (based on the row numbers in the tibble).
So the updated function call would look something like songscrape("ironwine", from = 5, to = 15) or songscrape("ironwine", n = 20) #first 20 songs
How can I do this?
Figured it out by making the numeric arguments optional, a form of function overloading! If you notice, the number of loop iterations is decided by the SongsList data frame. If we subset that by the appropriate amount, the loop runs over just those songs. I won't post the full code but here is the gist:
songscrape <- function(x, y, z) {
#If only the string argument is passed
if(missing(y)) {
y <- FALSE
}
#If only one numeric argument is passed
if(missing(z)) {
z <- FALSE
}
#Now just modify SongsList to fit the numbers passed.
if(isFALSE(y)) {
#User wants all songs
} else if (isFALSE(z)) {
#User wants top N songs given by the songlist() function.
#Slice the SongsList data frame by y value
SongsList <- SongsList %>% dplyr::slice(1:y)
} else {
#User wants between a range, so slice between y and z.
SongsList <- SongsList %>% dplyr::slice(y:z)
}
#...rest of songscrape() (the scraping loop) unchanged
}
Since the for loop uses the SongsList data frame to run through, the only thing you need to do is slice it appropriately and it'll run between those values.
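The slicing idea can also be expressed with named arguments and defaults, which is closer to the songscrape("ironwine", from = 5, to = 15) call asked for. This is a minimal sketch, not the package's actual code: select_songs() is a hypothetical helper, and it works on a plain character vector rather than the tibble:

```r
# Hypothetical helper: pick a range of songs, or the first n
select_songs <- function(songs, from = 1, to = length(songs), n = NULL) {
  if (!is.null(n)) {
    # n takes precedence and means "the first n songs"
    from <- 1
    to <- n
  }
  # Never run past the end of the list
  to <- min(to, length(songs))
  if (from > to) {
    return(songs[0])
  }
  songs[seq(from, to)]
}

songs <- c("meetmeinthehallway", "signofthetimes", "carolina",
           "twoghosts", "sweetcreature")

first_two <- select_songs(songs, n = 2)
middle <- select_songs(songs, from = 2, to = 4)
```

Inside songscrape(), the for loop could then iterate over select_songs(SongsList$Songs, from, to, n) instead of the full vector.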
I've got a function which I'm trying to apply in a for loop that extracts a dataframe from multiple files and combines them into a single one.
This is how, from what I've read, I thought would be the best way to attack it but I get an empty list returned, when I was hoping for a list of dataframes which could be combined using bind_rows.
This is the code I'm using:
combined_functions <- function(file_name) {
#combines the get_dfm_df and get corp function: get dfm tibble straight from the file name
data_frame_returned<- get_dfm_df(getcorp(file_name))
data_frame_returned
}
list_of_dataframes <- list()
file.list <- dir(pattern ="DOCX$")
for (file in file.list) {
dataframe_of_file <- combined_functions(file)
append(list_of_dataframes,dataframe_of_file)
}
bind_rows(list_of_dataframes, .id = "column_label") #https://stackoverflow.com/questions/2851327/convert-a-list-of-data-frames-into-one-data-frame
It creates an empty list, gets a list of the file names, uses combined_functions to create a data frame out of each file and should, to my understanding, append this data frame to the list. After all the files in the directory have been processed, bind_rows should combine them into one overall data frame, but it only returns an empty tibble. list_of_dataframes is also empty.
I've tried the solution in this answer ("Append a data frame to a list") but it didn't help.
This is a link to the raw files I am using in this case, but I think the problem is a general one:
https://www.dropbox.com/sh/z8vh50b370gcb1j/AAAcbnfAUOM6-y8uWn4-lUWLa?dl=0
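A likely cause, independent of the files: append() returns a new list rather than modifying its argument, so the result has to be assigned back. Passing a data frame directly also splices in its columns one by one, so it needs wrapping in list(). A minimal sketch with a stand-in for combined_functions() (make_df() and the file names are made up for illustration):

```r
# Stand-in for combined_functions(); returns a one-row data frame per file
make_df <- function(file_name) {
  data.frame(file = file_name, n_chars = nchar(file_name))
}

file_list <- c("a.DOCX", "bb.DOCX")  # made-up file names

list_of_dataframes <- list()
for (file in file_list) {
  # Assign the result back, and wrap the data frame in list()
  list_of_dataframes <- append(list_of_dataframes, list(make_df(file)))
}

# The same list without an explicit loop
list_of_dataframes2 <- lapply(file_list, make_df)

# Stack the data frames; dplyr::bind_rows() would work here as well
combined <- do.call(rbind, list_of_dataframes)
```

With the list populated, the bind_rows(list_of_dataframes, .id = "column_label") call from the question should have something to combine.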
Appendix:
These are the functions combined_functions refers to. They work on the individual cases, so I'm confident they are not the cause of the problem, but I've included them for completeness anyway.
rm(list = ls())
library(quanteda)
library(quanteda.corpora)
library(readtext)
library(LexisNexisTools)
library(tidyverse)
library(tools)
getcorp<- function(file_name){
#function to take the lexis word document, convert it into quanteda corpus object, returns duplicate df and date from filename in list
LNToutput <- lnt_read(file_name)
duplicates_df <- lnt_similarity(LNToutput = LNToutput,
threshold = 0.99)
duplicates_df <- duplicates_df[duplicates_df$Similarity > 0.99] #https://github.com/JBGruber/LexisNexisTools creates dataframe of duplicate articles
LNToutput <- LNToutput[!LNToutput@meta$ID %in% duplicates_df$ID_duplicate, ] #removes these duplicates from the main dataframe
corp <- lnt_convert(LNToutput, to = "quanteda") #to return multiple values from the r function, must be placed in a list
corp_date_from_file_name <- basename(file_name)
file_date <- as.Date(corp_date_from_file_name, format ="%d_%m_%y")
list_of_returns <-list(duplicates_df, corp,file_date) #list returns has duplicate df in first position, corpus in second and the file date in third
list_of_returns
}
get_dfm_df <- function(corp_list){
# takes the corp from getcorp, applies lexicoder dictionary, adds the neg_pos etc to their equivalent columns,
# calculates the percentage each category is of the total number of sentiment bearing words, adds the date specified from the file name
corpus_we_want <- corp_list[[2]]
sentiment_df <- dfm(corpus_we_want, dictionary = data_dictionary_LSD2015) %>% #applies the dictionary
convert("data.frame") %>%
cbind(docvars(corpus_we_want)) %>% #https://stackoverflow.com/questions/60419692/how-to-convert-dfm-into-dataframe-but-keeping-docvars
as_tibble() %>%
mutate(combined_negative = negative + neg_positive, combined_positive = positive + neg_negative) %>%
mutate(pos_percentage = combined_positive/(combined_positive + combined_negative ), neg_percentage =combined_negative/(combined_positive + combined_negative ) ) %>%
mutate(date = corp_list[[3]])
sentiment_df
}
I wrote the following scripts months ago and they worked without any issue. Over the last few days I have tried to rerun the same scripts but always get the same error. I've changed my script and updated the packages, but I'm unable to make it work again. The script should give me all the delays in the Belgian railway stations.
I have added my two separate scripts (one contains the functions) and the error/traceback.
library(httr)
library(jsonlite)
library(tidyverse)
load.stations <- function(){
a <- GET("https://api.irail.be/stations/?format=json") #get command for all stations from irail api
parsed <- jsonlite::fromJSON(content(a, "text"), flatten=TRUE) #parse json into r
stations <- parsed$station %>%
filter(grepl("^BE.NMBS.0088",id)) #keep only stations in Belgium. Regular expression ^ is begins with
return(stations)
}
get.time <- function(){
time <- paste(format(Sys.time(),"%d/%m/%y %H:%M:%S")) #formats system time as dd/mm/yy hh:mm:ss in a string
strpt <- strptime(time,"%d/%m/%y %H:%M:%S") #takes time-string and converts to interpretable date and time
return(strpt)
}
get.temp_df <- function(stations, i){
goget <- paste0("https://api.irail.be/liveboard/?format=json&id=",stations$id[i]) #http for get command, get liveboard (similar to screens in station i)
c <- GET(goget) #get the data
parsed_c <- jsonlite::fromJSON(content(c, "text"), flatten=TRUE) #parse from json
temp_df <- parsed_c$departures$departure #get the dataframe with departures from the parsed json
return(temp_df)
}
add.to.all <- function(all_df, temp_df){
all_df <- rbind(all_df,temp_df)%>% #add temporary dataframe to master dataframe
group_by(stationneke,time,vehicle)%>% # group departure times by station - remove doubles
top_n(1,importtime)%>% #only keep the most recent observation - remove doubles 2
ungroup() #lift grouping
return(all_df)
}
save.day <- function(all_df){
strpt <- get.time()
saveRDS(all_df,file = paste(strpt$mday, strpt$mon+1, strpt$year+1900,"Punct.rda",sep = "-"))
Sys.sleep(time = 3600-(strpt$min*60+strpt$sec)) #sleep one hour minus number of secs in the sleep time
return(data.frame())
}
library(httr)
library(jsonlite)
library(tidyverse)
## all departures - scraper
loop.scraper <- function(hour_of_pause =3){
source("NMBS-punctuality-functions.R")
all_df <- data.frame() #empty dataframe
stations <- load.stations()
while (TRUE) { #infinite loop
strpt <- get.time()
while(strpt$hour != hour_of_pause){ #enters loop when hour is not "hour_of_pause"
# startloop <- (strpt$min*60 + strpt$sec)
for (i in 1:nrow(stations)) { #second loop through the stations
temp_df <- get.temp_df(stations, i)
if(is.null(temp_df)) next #skip if dataframe is empty (some stations have been closed in recent years)
temp_df$stationneke <- stations$name[i] #add departure station name i to the dataframe
temp_df$importtime <- Sys.time() # add variable with the time of import of the observation
all_df <- add.to.all(all_df, temp_df)
strpt <- get.time()
} #end of loop through stations
# stoploop <- (strpt$min*60 + strpt$sec)
} #end of hour-check loop, code below only executed when no trains active (at night)
all_df <- save.day(all_df) #saves file and returns empty dataframe
}
}
Error: lexical error: invalid char in json text.
<br /> <b>Fatal error</b>: Unc
(right here) ------^
5.
parse_string(txt, bigint_as_char)
4.
parseJSON(txt, bigint_as_char)
3.
parse_and_simplify(txt = txt, simplifyVector = simplifyVector,
simplifyDataFrame = simplifyDataFrame, simplifyMatrix = simplifyMatrix,
flatten = flatten, ...)
2.
jsonlite::fromJSON(content(c, "text"), flatten = TRUE)
1.
loop.scraper(12)
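The traceback suggests the API sometimes answers with an HTML error page ("<br /> <b>Fatal error</b>...") instead of JSON, so fromJSON() fails on that one response and takes the whole loop down with it. A minimal sketch of one way to skip such responses: parse_or_null() is a hypothetical helper, and the parser is passed in as an argument (with a fake one used here) so the example runs without a network call:

```r
# Hypothetical helper: return NULL instead of stopping when parsing fails
parse_or_null <- function(txt, parser) {
  tryCatch(
    parser(txt),
    error = function(e) {
      message("Skipping response that could not be parsed: ", conditionMessage(e))
      NULL
    }
  )
}

# Fake parser that fails on HTML, standing in for jsonlite::fromJSON()
fake_parser <- function(txt) {
  if (startsWith(trimws(txt), "<")) {
    stop("lexical error: invalid char in json text.")
  }
  list(ok = TRUE)
}

bad <- parse_or_null("<br /> <b>Fatal error</b>", fake_parser)
good <- parse_or_null("{\"ok\": true}", fake_parser)
```

In get.temp_df() this could wrap the existing call, e.g. parse_or_null(content(c, "text"), function(t) jsonlite::fromJSON(t, flatten = TRUE)). A NULL result makes temp_df NULL as well, which the existing if(is.null(temp_df)) next check already skips. Checking httr::http_error(c) before parsing is another option.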
So I am trying to make a basic sensitivity analysis script. The outputs come out as I want via the print I added to the end of the script. The issue is that I would like a tibble or other object that has all the outputs appended together, which I can export as a csv or xlsx.
I created two functions, sens_analysis which runs all the code, and multiply_across which multiplies across each possible percentage across each possible column of your table. You need multiply_across to run the sens_analysis.
I would normally like a title, but instead I just added an indicator column that I can sort by.
I made everything with mtcars so it should be easy to replicate, the issue is that I just have a huge print at the end; not an object that I can manipulate or pull from for other analysis.
I have tried rbind, bind_rows, and appending rows in a variety of ways, as well as building a new object. As you can see in the code at line (18), I create something called output that I have tried to populate, which hasn't gone well.
rm(list = ls())
library(dplyr)
library(tidyr)
library(purrr)
library(tibble)
library(magrittr)
library(xtable)
data<-mtcars
percent<-c(.05,.1,.15)
goods<-c("hp","gear","wt")
weight<-c(6,7,8)
disagg<-"cyl"
func<-median
sens_analysis<-function(data=data, goods=goods, weight=weight, disagg=disagg, percent=percent, func=func){
output<-NULL%>%
as.tibble()
basket<-(rbind(goods,weight))
percent<-c(0,percent,(percent*-1))
percent_to_1<-percent+1
data_select<-data%>%
dplyr::select(c(goods,disagg))%>%
group_by_at(disagg)%>%
summarise_at(.vars = goods ,.funs = func)%>%
as_tibble()
data_select_weight<-purrr::map2(data_select[,-1], as.numeric(basket[2,]),function(var, weight){
var*weight
})%>% as_tibble %>%
add_column(data_select[,1], .before = 1)
colnames(data_select_weight)[1]<-disagg
multiply_across(data_select_weight,percent_to_1)
return(output)
#output2<-rbind(output2,output)
}
############################
multiply_across<-function(data=data_select_weight,list=percent_to_1){
varlist<-names(data[,-1])
for(i in varlist){
df1 = data[,i]
for(j in list){
df<-data
df[,i]<-round(df1*j,2)
df<-mutate(df, total = round(rowSums(df[,-1]),2))%>%
mutate(type=paste0(i," BY ",(as.numeric(j)-1)*100,"% OVER ",disagg))%>%
print(df)
#output<-bind_rows(output,df)
#output<-bind_rows(output,df)
#output[[j]]<-df[[j]]
}
}
}
##############################################################################################
sens_analysis(data,goods,weight,disagg,percent,func)
The expected result if you just run the code straight-up should be a bunch of printed tibbles that aren't stored in an object. But ideally, for future analysis on the data or ease of use, a table of the outputs appended together would be best.
So I figured it out and will add my answer here in case someone else hits this issue.
I created a list within the loops and then bound its elements together.
The key is to bind the rows outside the for-loops, after the list has been filled.
multiply_across<-function(data=data_select_weight,
list=percent_to_1){
varlist <- colnames(data[, -1])
output_list <- list()
for (i in varlist) {
df1 <- data[,i]
for (j in list) {
name <- paste0(i, " BY ", (as.numeric(j)-1)*100, "% OVER ", disagg)
df <- as_tibble(data)
df[,i] <- round(df1*j, 2)
df <- mutate(df, total = round(rowSums(df[,-1]),2))%>%
mutate(type = paste0(i, " BY ", (as.numeric(j)-1)*100, "% OVER ", disagg))
df<-df[,c(6,1,2,3,4,5)]
output_list[[paste0(i," BY ",(as.numeric(j)-1)*100)]] <- (assign(paste0(i," BY ",(as.numeric(j)-1)*100,"% OVER ",disagg),df))
}
}
bind_rows(lapply(output_list,
as.data.frame.list,
stringsAsFactors=F))
}
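The same collect-then-bind pattern, reduced to a minimal base R sketch on mtcars. The variable names here are illustrative, only two goods and two percentages are used, and the weighting step is left out:

```r
goods <- c("hp", "wt")
percent <- c(0.05, -0.05)

# Median of each good by number of cylinders
base_tbl <- aggregate(mtcars[, goods], by = list(cyl = mtcars$cyl), FUN = median)

scenarios <- list()
for (g in goods) {
  for (p in percent) {
    df <- base_tbl
    # Scale one good by the given percentage, then recompute the total
    df[[g]] <- round(df[[g]] * (1 + p), 2)
    df$total <- round(rowSums(df[, goods]), 2)
    df$type <- paste0(g, " BY ", p * 100, "% OVER cyl")
    # Store each scenario in the list instead of printing it
    scenarios[[length(scenarios) + 1]] <- df
  }
}

# One data frame with every scenario stacked, ready for write.csv()
output <- do.call(rbind, scenarios)
```

Because output is an ordinary data frame, it can be filtered on the type column or exported directly.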
I have the following .csv file:
https://drive.google.com/open?id=0Bydt25g6hdY-RDJ4WG41VFpyX1k
And I would like to be able to take the date and agent name(pasting its constituent parts) and append them as columns to the right of the table, up until it finds a different name and date, doing the same for the remaining name and date items, to get the following result:
The only thing I have been able to do with the dplyr package is the following:
library(dplyr)
library(stringr)
report <- read.csv(file ="test15.csv", head=TRUE, sep=",")
date_pattern <- "(\\d+/\\d+/\\d+)"
date <- str_extract(report[,2], date_pattern)
report <- mutate(report, date = date)
Which gives me the following result:
The difficulty I am having is probably with using conditionals to make the script pick up the appropriate string and append it as a column at the end of the table.
This might be crude, but I think it illustrates several things: a) setting stringsAsFactors=F; b) "pre-allocating" the columns in the data frame; and c) using the column name instead of column number to set the value.
report<-read.csv('test15.csv', header=T, stringsAsFactors=F)
# first, allocate the two additional columns (with NAs)
report$date <- rep(NA, nrow(report))
report$agent <- rep(NA, nrow(report))
# step through the rows
for (i in 1:nrow(report)) {
# grab current name and date if "Agent:"
if (report[i,1] == 'Agent:') {
currDate <- report[i+1,2]
currName=paste(report[i,2:5], collapse=' ')
# otherwise append the name/date
} else {
report[i,'date'] <- currDate
report[i,'agent'] <- currName
}
}
write.csv(report, 'test15a.csv')
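The same fill-down can be done without a loop by numbering the agent blocks with cumsum(). This is only a sketch on made-up toy data, since the CSV itself is not shown here. It assumes the first row is an "Agent:" row, that the date sits in the second column of the row after it (as in the loop above), and, unlike the real file, that the agent name is in a single column:

```r
# Toy data standing in for test15.csv (invented for illustration)
report <- data.frame(
  V1 = c("Agent:", "Date:", "call", "call", "Agent:", "Date:", "call"),
  V2 = c("Ann", "01/02/2017", "x", "y", "Bob", "03/02/2017", "z"),
  stringsAsFactors = FALSE
)

is_agent <- report$V1 == "Agent:"
# Block number for every row: 1 for the first agent's rows, 2 for the second, ...
block <- cumsum(is_agent)

# One name and one date per block
agent_names <- report$V2[is_agent]
agent_dates <- report$V2[which(is_agent) + 1]

# Spread them over all rows of their block
report$agent <- agent_names[block]
report$date <- agent_dates[block]
```

Unlike the loop, this also fills the "Agent:" rows themselves, which may or may not be wanted.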