Is there a different method to increase run performance in R?

I'm collecting some economic indicator data, and as part of this process I also want to collect hourly tweet counts with the script below. I asked a similar question with simpler data before. As the historical data grows, the run times will get longer. Since the result table will be a data frame, can I run this script more efficiently with functions such as the apply family or do.call?
library(httr)
library(dplyr)
library(lubridate)
library(tidyverse)
library(stringr)

sel1 <- c('"#fed"', '"#usd"', '"#ecb"', '"#eur"')

for (i in sel1) {
  for (ii in 1:20) {
    headers <- c(
      `Authorization` = 'Bearer #enter your Bearer token#'
    )
    params <- list(
      `query` = i,
      # my sys.time is different
      `start_time` = strftime(Sys.time() - (ii + 1) * 60 * 60, "%Y-%m-%dT%H:%M:%SZ", tz = 'GMT'),
      `end_time` = strftime(Sys.time() - ii * 60 * 60, "%Y-%m-%dT%H:%M:%SZ", tz = 'GMT'),
      `granularity` = 'hour'
    )
    res1 <- httr::GET(url = 'https://api.twitter.com/2/tweets/counts/recent',
                      httr::add_headers(.headers = headers), query = params) %>%
      content(as = 'parsed')
    x1 <- cbind(data.frame(res1), topic = str_replace_all(i, "([\n\"#])", ""))
    if (!exists("appnd1")) {
      appnd1 <- x1
    } else {
      appnd1 <- rbind(appnd1, x1)
    }
  }
}

In general, iteratively rbind-ing data in a for loop always gets worse with time: each rbind copies the entire previous frame in memory, so you temporarily have two copies of everything. With small amounts of data this is not so bad, but copying a lot of data around in memory quickly becomes a problem. (This is covered in The R Inferno, chapter 2, "Growing Objects". It's good reading, even if it is not a recent document.)
The best approach is to create a list of frames (see https://stackoverflow.com/a/24376207/3358227), add contents to it, and then when you are done combine all frames within the list into a single frame.
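As a rough illustration (a hedged sketch on toy data, not from the original post; exact timings will vary by machine):
# Compare grow-by-rbind against build-a-list-then-combine on toy data.
n <- 2000
grow <- function() {
  out <- NULL
  for (k in seq_len(n)) {
    out <- rbind(out, data.frame(a = k, b = rnorm(1)))  # copies all earlier rows every pass
  }
  out
}
collect <- function() {
  lst <- vector("list", n)
  for (k in seq_len(n)) {
    lst[[k]] <- data.frame(a = k, b = rnorm(1))          # no copying of earlier rows
  }
  do.call(rbind, lst)                                    # one combine at the end
}
system.time(grow())    # slows down quadratically as n grows
system.time(collect()) # stays roughly linear in n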
Untested, but try this modified process:
library(httr)
library(dplyr)
library(lubridate)
library(tidyverse)
library(stringr)

sel1 <- c('"#fed"', '"#usd"', '"#ecb"', '"#eur"')
listofframes <- list()

for (i in sel1) {
  for (ii in 1:20) {
    headers <- c(
      `Authorization` = 'Bearer #enter your Bearer token#'
    )
    params <- list(
      `query` = i,
      # my sys.time is different
      `start_time` = strftime(Sys.time() - (ii + 1) * 60 * 60, "%Y-%m-%dT%H:%M:%SZ", tz = 'GMT'),
      `end_time` = strftime(Sys.time() - ii * 60 * 60, "%Y-%m-%dT%H:%M:%SZ", tz = 'GMT'),
      `granularity` = 'hour'
    )
    res1 <- httr::GET(url = 'https://api.twitter.com/2/tweets/counts/recent',
                      httr::add_headers(.headers = headers), query = params) %>%
      content(as = 'parsed')
    x1 <- cbind(data.frame(res1), topic = str_replace_all(i, "([\n\"#])", ""))
    listofframes <- c(listofframes, list(x1))
  }
}
# choose one of the following based on your R-dialect/package preference
appnd1 <- do.call(rbind, listofframes)
appnd1 <- dplyr::bind_rows(listofframes)
appnd1 <- data.table::rbindlist(listofframes)

Related

Web scraping: Unable to read the main table

I am new to web scraping. I am trying to scrape a table with the following code, but I am unable to get it. The source of the data is
https://www.investing.com/stock-screener/?sp=country::6|sector::a|industry::a|equityType::a|exchange::a%3Ceq_market_cap;1
library(XML)  # htmlTreeParse(), getNodeSet(), readHTMLTable()

url <- "https://www.investing.com/stock-screener/?sp=country::6|sector::a|industry::a|equityType::a|exchange::a%3Ceq_market_cap;1"
urlYAnalysis <- paste(url, sep = "")
webpage <- readLines(urlYAnalysis)
html <- htmlTreeParse(webpage, useInternalNodes = TRUE, asText = TRUE)
tableNodes <- getNodeSet(html, "//table")
Tab <- readHTMLTable(tableNodes[[1]])
I copied this approach from the link (Web scraping of key stats in Yahoo! Finance with R), where it is applied to Yahoo Finance data.
In my opinion the table I want should be readHTMLTable(tableNodes[[12]]), i.e. table 12, but when I pass tableNodes[[12]] it always gives me an error:
Error in do.call(data.frame, c(x, alis)) :
variable names are limited to 10000 bytes
Please suggest a way to extract the table and to combine the data from the other tabs as well (Fundamental, Technical and Performance).
This data is returned dynamically as JSON. In R (which behaves differently from Python's requests here) you get HTML back, from which you can extract a given page's results as JSON. A page includes the info for all of the tabs and 50 records. The first page gives you the total record count, so you can calculate the total number of pages to loop over to get all results; you combine them into a final dataframe during that loop, altering the pn param of the XHR POST body to the appropriate page number in each new POST request. There are two required headers.
It is probably a good idea to write a function that accepts a page number in its signature and returns the given page's JSON as a dataframe, then apply it via a tidyverse package to handle the looping and the combining of results into the final dataframe.
library(httr)
library(jsonlite)
library(magrittr)
library(rvest)
library(stringr)

headers <- c(
  'User-Agent' = 'Mozilla/5.0',
  'X-Requested-With' = 'XMLHttpRequest'
)

data <- list(
  'country[]' = '6',
  'sector' = '7,5,12,3,8,9,1,6,2,4,10,11',
  'industry' = '81,56,59,41,68,67,88,51,72,47,12,8,50,2,71,9,69,45,46,13,94,102,95,58,100,101,87,31,6,38,79,30,77,28,5,60,18,26,44,35,53,48,49,55,78,7,86,10,1,34,3,11,62,16,24,20,54,33,83,29,76,37,90,85,82,22,14,17,19,43,89,96,57,84,93,27,74,97,4,73,36,42,98,65,70,40,99,39,92,75,66,63,21,25,64,61,32,91,52,23,15,80',
  'equityType' = 'ORD,DRC,Preferred,Unit,ClosedEnd,REIT,ELKS,OpenEnd,Right,ParticipationShare,CapitalSecurity,PerpetualCapitalSecurity,GuaranteeCertificate,IGC,Warrant,SeniorNote,Debenture,ETF,ADR,ETC,ETN',
  'exchange[]' = '109',
  'exchange[]' = '127',
  'exchange[]' = '51',
  'exchange[]' = '108',
  'pn' = '1', # page number; alter this in a loop over all pages. 50 results (rows) per page
  'order[col]' = 'eq_market_cap',
  'order[dir]' = 'd'
)

r <- httr::POST(url = 'https://www.investing.com/stock-screener/Service/SearchStocks',
                httr::add_headers(.headers = headers), body = data)
s <- r %>% read_html() %>% html_node('p') %>% html_text()

page1_data <- jsonlite::fromJSON(str_match(s, '(\\[.*\\])')[1, 2])
total_rows <- str_match(s, '"totalCount\":(\\d+),')[1, 2] %>% as.integer()
num_pages <- ceiling(total_rows / 50)
My current attempt at combining, on which I would welcome feedback, is below. It keeps all of the returned columns, for all pages, so I have to handle missing columns and differing column order, as well as one column that is itself a data.frame. Since far more columns are returned than are visible on the page, you could simply revise this to subset the returned columns with a mask containing just the columns present in the tabs (see the sketch after the code).
library(httr)
library(jsonlite)
library(magrittr)
library(rvest)
library(stringr)
library(tidyverse)
library(data.table)

headers <- c(
  'User-Agent' = 'Mozilla/5.0',
  'X-Requested-With' = 'XMLHttpRequest'
)

data <- list(
  'country[]' = '6',
  'sector' = '7,5,12,3,8,9,1,6,2,4,10,11',
  'industry' = '81,56,59,41,68,67,88,51,72,47,12,8,50,2,71,9,69,45,46,13,94,102,95,58,100,101,87,31,6,38,79,30,77,28,5,60,18,26,44,35,53,48,49,55,78,7,86,10,1,34,3,11,62,16,24,20,54,33,83,29,76,37,90,85,82,22,14,17,19,43,89,96,57,84,93,27,74,97,4,73,36,42,98,65,70,40,99,39,92,75,66,63,21,25,64,61,32,91,52,23,15,80',
  'equityType' = 'ORD,DRC,Preferred,Unit,ClosedEnd,REIT,ELKS,OpenEnd,Right,ParticipationShare,CapitalSecurity,PerpetualCapitalSecurity,GuaranteeCertificate,IGC,Warrant,SeniorNote,Debenture,ETF,ADR,ETC,ETN',
  'exchange[]' = '109',
  'exchange[]' = '127',
  'exchange[]' = '51',
  'exchange[]' = '108',
  'pn' = '1', # page number; altered in the loop over all pages. 50 results (rows) per page
  'order[col]' = 'eq_market_cap',
  'order[dir]' = 'd'
)

get_data <- function(page_number){
  data['pn'] <- page_number
  r <- httr::POST(url = 'https://www.investing.com/stock-screener/Service/SearchStocks',
                  httr::add_headers(.headers = headers), body = data)
  s <- r %>% read_html() %>% html_node('p') %>% html_text()
  if (page_number == 1) {
    return(s)
  } else {
    return(data.frame(jsonlite::fromJSON(str_match(s, '(\\[.*\\])')[1, 2])))
  }
}

clean_df <- function(df){
  interim <- df['viewData']
  df_minus <- subset(df, select = -c(viewData))
  df_clean <- cbind.data.frame(c(interim, df_minus))
  return(df_clean)
}

initial_data <- get_data(1)
df <- clean_df(data.frame(jsonlite::fromJSON(str_match(initial_data, '(\\[.*\\])')[1, 2])))
total_rows <- str_match(initial_data, '"totalCount\":(\\d+),')[1, 2] %>% as.integer()
num_pages <- ceiling(total_rows / 50)

dfs <- map(.x = 2:num_pages,
           .f = ~ clean_df(get_data(.)))

r <- rbindlist(c(list(df), dfs), use.names = TRUE, fill = TRUE)
write_csv(r, 'data.csv')
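If you only want the columns that are visible in the page's tabs (the masking idea mentioned above), a minimal hedged sketch might look like this; the column names are placeholders, so check names(r) from an actual run:
# Hypothetical column mask: keep only the columns shown in the tabs.
wanted_cols <- c("name_trans", "last", "chg_pct", "eq_market_cap")  # placeholder names
r_small <- as.data.frame(r)[, intersect(wanted_cols, names(r)), drop = FALSE]
write_csv(r_small, 'data_subset.csv')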

Getting Looped Output into an Appended Object

So I am trying to make a basic sensitivity-analysis script. The outputs come out as I want via the print I added at the end of the script. The issue is that I would like a tibble or other object that has all the outputs appended together, which I can export as a csv or xlsx.
I created two functions: sens_analysis, which runs all the code, and multiply_across, which multiplies each possible percentage across each possible column of your table. You need multiply_across to run sens_analysis.
I would normally like a title, but instead I just added an indicator column that I can sort by.
I made everything with mtcars, so it should be easy to replicate. The issue is that I just have a huge print at the end, not an object that I can manipulate or pull from for other analysis.
I have been trying rbind, bind_rows, and appending rows in a variety of ways.
I have also tried building a new object: as you can see in the code (at line 18) I make something called output that I have tried to populate, which hasn't gone well.
rm(list = ls())
library(dplyr)
library(tidyr)
library(purrr)
library(tibble)
library(magrittr)
library(xtable)

data <- mtcars
percent <- c(.05, .1, .15)
goods <- c("hp", "gear", "wt")
weight <- c(6, 7, 8)
disagg <- "cyl"
func <- median

sens_analysis <- function(data = data, goods = goods, weight = weight,
                          disagg = disagg, precent = percent, func = func){
  output <- NULL %>%
    as.tibble()
  basket <- (rbind(goods, weight))
  percent <- c(0, percent, (percent * -1))
  percent_to_1 <- percent + 1
  data_select <- data %>%
    dplyr::select(c(goods, disagg)) %>%
    group_by_at(disagg) %>%
    summarise_at(.vars = goods, .funs = func) %>%
    as_tibble()
  data_select_weight <- purrr::map2(data_select[, -1], as.numeric(basket[2, ]), function(var, weight){
    var * weight
  }) %>% as_tibble %>%
    add_column(data_select[, 1], .before = 1)
  colnames(data_select_weight)[1] <- disagg
  multiply_across(data_select_weight, percent_to_1)
  return(output)
  # output2 <- rbind(output2, output)
}

############################

multiply_across <- function(data = data_select_weight, list = percent_to_1){
  varlist <- names(data[, -1])
  for (i in varlist){
    df1 <- data[, i]
    for (j in list){
      df <- data
      df[, i] <- round(df1 * j, 2)
      df <- mutate(df, total = round(rowSums(df[, -1]), 2)) %>%
        mutate(type = paste0(i, " BY ", (as.numeric(j) - 1) * 100, "% OVER ", disagg)) %>%
        print(df)
      # output <- bind_rows(output, df)
      # output <- bind_rows(output, df)
      # output[[j]] <- df[[j]]
    }
  }
}

##############################################################################################

sens_analysis(data, goods, weight, disagg, percent, func)
The expected result if you just run the code straight up is a bunch of printed tibbles that aren't in an object. But ideally, for future analysis on the data or ease of use, a table of the outputs appended together would be best.
So I figured it out and will add my answer here in case someone else hits this issue.
I created a list within the loops and then bound its elements together.
Just focus on the row binding that happens after the for-loops finish.
multiply_across <- function(data = data_select_weight,
                            list = percent_to_1){
  varlist <- colnames(data[, -1])
  output_list <- list()
  for (i in varlist) {
    df1 <- data[, i]
    for (j in list) {
      name <- paste0(i, " BY ", (as.numeric(j) - 1) * 100, "% OVER ", disagg)
      df <- as_tibble(data)
      df[, i] <- round(df1 * j, 2)
      df <- mutate(df, total = round(rowSums(df[, -1]), 2)) %>%
        mutate(type = paste0(i, " BY ", (as.numeric(j) - 1) * 100, "% OVER ", disagg))
      df <- df[, c(6, 1, 2, 3, 4, 5)]
      output_list[[paste0(i, " BY ", (as.numeric(j) - 1) * 100)]] <-
        (assign(paste0(i, " BY ", (as.numeric(j) - 1) * 100, "% OVER ", disagg), df))
    }
  }
  bind_rows(lapply(output_list,
                   as.data.frame.list,
                   stringsAsFactors = FALSE))
}
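For completeness, a hedged usage sketch (assuming data_select_weight and percent_to_1 have been built as in the question's sens_analysis, and that readr is available for the export; the file name is illustrative):
# Capture the combined table instead of only printing it, then export it.
sensitivity_table <- multiply_across(data_select_weight, percent_to_1)
readr::write_csv(sensitivity_table, "sensitivity_output.csv")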

Iterating through values in R

I'm new-ish to R and am having some trouble iterating through values.
For context: I have data on 60 people over time, and each person has his/her own dataset in a folder (I received the data with id #s 00:59). For each person, there are 2 values I need - time of response and picture response given (a number 1 - 16). I need to convert this data from wide to long format for each person, and then eventually append all of the datasets together.
My problem is that I'm having trouble writing a loop that will do this for each person (i.e. each dataset). Here's the code I have so far:
pam[x] <- fromJSON(file = "PAM_u[x].json")
pam[x]df <- as.data.frame(pam[x])

# Creating long dataframe for times
pam[x]_long_times <- gather(
  select(pam[x]df, starts_with("resp")),
  key = "time",
  value = "resp_times"
)

# Creating long dataframe for pic_nums (affect response)
pam[x]_long_pics <- gather(
  select(pam[x]df, starts_with("pic")),
  key = "picture",
  value = "pic_num"
)

# Combining the two long dataframes so that I have one df per person
pam[x]_long_fin <- bind_cols(pam[x]_long_times, pam[x]_long_pics) %>%
  select(resp_times, pic_num) %>%
  add_column(id = [x], .before = 1)
If you replace [x] in the above code with a person's id# (e.g. 00), the code will run and will give me the dataframe I want for that person. Any advice on how to do this so I can get all 60 people done?
Thanks!
EDIT
So, using library(jsonlite) rather than library(rjson) read the files into the format I needed without having to do all of the manipulation. Thanks all for the responses, but the solution was apparently much easier than I'd thought.
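For anyone hitting the same issue, a minimal hedged sketch of that jsonlite route (the file pattern and the final bind_rows step are assumptions on my part, not from the original post):
library(jsonlite)
library(dplyr)

# Hypothetical file pattern -- adjust to the real naming scheme.
files <- list.files(pattern = "^PAM_u.*\\.json$", full.names = TRUE)

# jsonlite::fromJSON() simplifies JSON arrays to data frames where it can,
# so each file may already come back close to the needed shape.
pam_list <- lapply(files, fromJSON)
pam_df_list <- lapply(pam_list, as.data.frame)

# One long data frame, tagged by each file's position in the list.
pam_all <- bind_rows(pam_df_list, .id = "id")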
I don't know the structure of your json files. If your script is not in the same folder as the json files, try this:
library(jsonlite)

# setup - read files
json_folder <- "U:/test/" # adjust your folder here
files <- list.files(path = json_folder, pattern = "\\.json$", full.names = TRUE)

# import data
pam <- list()
pam_df <- list()
for (i in seq_along(files)) {
  pam[[i]] <- fromJSON(files[i])        # jsonlite's fromJSON() takes the path directly
  pam_df[[i]] <- as.data.frame(pam[[i]])
}
This generally reads all of the json files in the folder and builds a vector of length 60.
Then you sequence along that vector and read each file.
I assume at the end you can do bind_rows, or add your own code inside the for loop. But remember to initialise the objects as empty before the loop starts, e.g. pam_long_pics <- NULL.
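For instance, a minimal sketch of that final combining step (an assumption on my part; it requires that the per-person data frames have compatible columns):
# Combine the per-person data frames into one long data frame,
# tagging each block with its position in the list.
pam_long <- dplyr::bind_rows(pam_df, .id = "id")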
Hope that helped? Let me know.
Something along these lines could work:
#library("tidyverse")
#library("jsonlite")
file_list <- list.files(pattern = "*.json", full.names = TRUE)
Data_raw <- tibble(File_name = file_list) %>%
mutate(File_contents = map(File_name, fromJSON)) %>% # This should result in a nested tibble
mutate(File_contents = map(File_contents, as_tibble))
Data_raw %>%
mutate(Long_times = map(File_contents, ~ gather(key = "time", value = "resp_times", starts_with("resp"))),
Long_pics = map(File_contents, ~ gather(key = "picture", value = "pic_num", starts_with("pic")))) %>%
unnest(Long_times, Long_pics) %>%
select(File_name, resp_times, pic_num)
EDIT: you may or may not need to include as_tibble() after reading in the JSON files, depending on what your data looks like.

API Query for loop

I'm trying to pull some data from an API and throw it all into a single data frame. I'm trying to put a variable into the URL I'm pulling from and then loop it to pull data from 54 keys. Here's what I have so far, with notes.
library("jsonlite")
library("httr")
library("lubridate")
options(stringsAsFactors = FALSE)
url <- "http://api.kuroganehammer.com"
### This gets me a list of 58 observations, I want to use this list to
### pull data for each using an API
raw.characters <- GET(url = url, path = "api/characters")
## Convert the results from unicode to a JSON
text.raw.characters <- rawToChar(raw.characters$content)
## Convert the JSON into an R object. Check the class of the object after
## it's retrieved and reformat appropriately
characters <- fromJSON(text.raw.characters)
class(characters)
## This pulls data for an individual character. I want to get one of
## these for all 58 characters by looping this and replacing the 1 in the
## URL path for every number through 58.
raw.bayonetta <- GET(url = url, path = "api/characters/1/detailedmoves")
text.raw.bayonetta <- rawToChar(raw.bayonetta$content)
bayonetta <- fromJSON(text.raw.bayonetta)
## This is the function I tried to create, but I get a lexical error when
## I call it, and I have no idea how to loop it.
move.pull <- function(x) {
  char.x <- x
  raw.x <- GET(url = url, path = cat("api/characters/", char.x, "/detailedmoves", sep = ""))
  text.raw.x <- rawToChar(raw.x$content)
  char.moves.x <- fromJSON(text.raw.x)
  char.moves.x$id <- x
  return(char.moves.x)
}
The first part of this:
library(jsonlite)
library(httr)
library(lubridate)
library(tidyverse)
base_url <- "http://api.kuroganehammer.com"
res <- GET(url = base_url, path = "api/characters")
content(res, as="text", encoding="UTF-8") %>%
fromJSON(flatten=TRUE) %>%
as_tibble() -> chars
Gets you a data frame of the characters.
This:
pb <- progress_estimated(length(chars$id))

map_df(chars$id, ~{
  pb$tick()$print()
  Sys.sleep(sample(seq(0.5, 2.5, 0.5), 1)) # be kind to the free API
  res <- GET(url = base_url, path = sprintf("api/characters/%s/detailedmoves", .x))
  content(res, as = "text", encoding = "UTF-8") %>%
    fromJSON(flatten = TRUE) %>%
    as_tibble()
}, .id = "id") -> moves
Gets you a data frame of all the "moves" and adds the "id" for the character. You get a progress bar for free, too.
You can then either left_join() the two as needed, or group and nest the moves data into a separate list column. If you want that from the start, you can use map() instead of map_df().
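As a hedged sketch of the join step (an assumption on my part: map_df()'s .id column holds the row position as character, so it is mapped back to the real character id before joining; adjust to your actual data):
# Translate the positional id back to the API's character id,
# then attach the character metadata to every move.
moves_joined <- moves %>%
  mutate(id = chars$id[as.integer(id)]) %>%
  left_join(chars, by = "id")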
Leave in the time pause code. It's a free API and you should likely increase the pause times to avoid DoS'ing their site.

R, creating variables on the fly in a list using assign statement

I want to create variable names on the fly inside a list and assign them values in R, but I am unable to get the desired result. Here is the logic of my code:
Upon the function call dat_in <- readf(1,2), an input file is read based on a product and a site. After reading, a particular column (the 13th, here) is assigned to a variable aot500. I want this variable to be returned from the function for each combination of product and site. For example, I need variables named aot500.AF, aot500.CM, aot500.RB in the list statement to be returned from this function.
I am having trouble with the return statement. There is no error, but there is nothing in dat_in; I expect it to contain dat_in$aot500.AF etc. Please tell me what is wrong in the return statement. Furthermore, I want to read files for all combinations in a single call to the function, say using a for loop, and I wonder how the return statement would handle a list of more variables.
prod <- c('inv','tot')
site <- c('AF','CM','RB')

readf <- function(pp, kk) {
  fname.dsa <- paste("../data/site_data_", prod[pp], "/daily_", site[kk], ".dat", sep = "")
  inp.aod <- read.csv(fname.dsa, skip = 4, sep = ",", stringsAsFactors = FALSE, na.strings = "N/A")
  aot500 <- inp.aod[, 13]
  return(list(assign(paste("aot500", siteabbr[kk], sep = "."), aot500)))
}
There is almost never a need to use assign(); we can solve the problem in two steps: read the files into a list, then name the list elements.
(Not tested as we don't have your files)
prod <- c('inv', 'tot')
site <- c('AF', 'CM', 'RB')

# get combo of site and prod
prod_site <- expand.grid(prod, site)
colnames(prod_site) <- c("prod", "site")

# Step 1: read the files into a list
res <- lapply(1:nrow(prod_site), function(i){
  fname.dsa <- paste0("../data/site_data_",
                      prod_site[i, "prod"],
                      "/daily_",
                      prod_site[i, "site"],
                      ".dat")
  inp.aod <- read.csv(fname.dsa,
                      skip = 4,
                      stringsAsFactors = FALSE,
                      na.strings = "N/A")
  inp.aod[, 13]
})

# Step 2: assign names to the list
names(res) <- paste("aot500", prod_site$prod, prod_site$site, sep = ".")
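A brief usage note on the result (the element names follow from the naming step above):
# Access a single product/site combination by name, or inspect them all.
res[["aot500.inv.AF"]]
str(res)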
I propose two answers, one based on dplyr and one based on base R.
You'll probably have to adapt the filename in the readAOT_500 function to your particular case.
Base R answer
#' Function that reads AOT_500 from the given product and site file
#' @param prodsite character vector containing 2 elements:
#'   the name of a product and the name of a site
readAOT_500 <- function(prodsite,
                        selectedcolumn = c("AOT_500"),
                        path = tempdir()){
  cat(path, prodsite)
  filename <- paste0(path, prodsite[1],
                     prodsite[2], ".csv")
  dtf <- read.csv(filename, stringsAsFactors = FALSE)
  dtf <- dtf[selectedcolumn]
  dtf$prod <- prodsite[1]
  dtf$site <- prodsite[2]
  return(dtf)
}

# Load one file for example
readAOT_500(c("inv", "AF"))

listofsites <- list(c("inv", "AF"),
                    c("tot", "AF"),
                    c("inv", "CM"),
                    c("tot", "CM"),
                    c("inv", "RB"),
                    c("tot", "RB"))

# Load all files in a list of data frames
prodsitedata <- lapply(listofsites, readAOT_500)

# Combine all data frames together
prodsitedata <- Reduce(rbind, prodsitedata)
dplyr answer
I use Hadley Wickham's packages to clean data.
library(dplyr)
library(tidyr)
daily_CM <- read.csv("~/downloads/daily_CM.dat", skip = 4, sep = ",",
                     stringsAsFactors = FALSE, na.strings = "N/A")

# Generate all combinations of product and site.
prodsite <- expand.grid(prod = c('inv', 'tot'),
                        site = c('AF', 'CM', 'RB')) %>%
  # Group variables to use do() later on
  group_by(prod, site)
Create 6 fake files by sampling from the data you provided. You can skip this section when you have real data. I used various sample lengths so that the number of observations differs for each site.
prodsite$samplelength <- sample(1:495, nrow(prodsite))

prodsite %>%
  do(stuff = write.csv(sample_n(daily_CM, .$samplelength),
                       paste0(tempdir(), .$prod, .$site, ".csv")))
Read many files using dplyr::do()
prodsitedata <- prodsite %>%
  do(read.csv(paste0(tempdir(), .$prod, .$site, ".csv"),
              stringsAsFactors = FALSE))

# Select only the columns you are interested in
prodsitedata2 <- prodsitedata %>%
  select(prod, site, AOT_500)
