I am getting stuck trying to incorporate the "pages" element here, along with the "type": the pages element changes, but the type remains constant.
The current code I have maps over the tickers and downloads the data into a list format, but it only uses the default number of pages. How can I map over the tickers as well as the pages?
Adding company_filings(pages, type) doesn't work.
library(edgarWebR)
tickers <- c("63908", "793952")
count <- 100
pages <- 1:4
type <- "10-K"
custom_company_filings <- function(compfilings){
  company_filings(compfilings)
}
x <- Map(custom_company_filings, tickers)
Edit:
What I am trying to achieve is to do something like:
for(tick in tickers){
  for(i in pages){
    company_filings("get company filings")
  }
}
So for each ticker I will collect pages 1, 2, 3, 4 from the company_filings() function, which comes from the edgarWebR package.
The custom_company_filings() function was my attempt at solving the problem, but when I posted here I removed all my attempts. For instance, one attempt was:
custom_company_filings <- function(compfilings, pages){
  company_filings(compfilings, pages)
}
x <- Map(custom_company_filings(page = pages), tickers)
We can use a nested lapply:
library(edgarWebR)
tickers <- c("63908", "793952")
count <- 100
pages <- 1:4
type <- "10-K"
lapply(tickers, function(x){
  lapply(pages, function(y){
    company_filings(x, type = type, count = count, page = y)
  })
})
Or with purrr:
library(purrr)
pmap(expand.grid(tickers, pages, stringsAsFactors = FALSE), ~company_filings(..1, type = type, count = count, page = ..2))
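Either way, the result is a nested list (one inner list of four pages per ticker). If a single flat data frame is preferred, the pieces can be stacked afterwards; a minimal sketch, assuming each company_filings() call returns a data frame with compatible columns:
# res holds the nested lapply result from above
res <- lapply(tickers, function(x){
  lapply(pages, function(y) company_filings(x, type = type, count = count, page = y))
})
flat <- unlist(res, recursive = FALSE) # drop one level of nesting
all_filings <- do.call(rbind, flat)    # stack the page-level data frames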
I'm writing a wrapper for the YouTube Analytics API, and have created a function as follows:
library(httr) # for GET()
yt_request <- function(dimensions = NULL, metrics = NULL, sort = NULL,
                       maxResults = NULL, filtr = NULL, startDate = Sys.Date() - 30,
                       endDate = Sys.Date(), token) {
  url <- paste0("https://youtubeanalytics.googleapis.com/v2/reports?",
                "&ids=channel%3D%3DMINE",
                "&startDate=", startDate,
                "&endDate=", endDate)
  if(!is.null(dimensions)) url <- paste0(url, "&dimensions=", dimensions)
  if(!is.null(metrics)) url <- paste0(url, "&metrics=", metrics)
  if(!is.null(sort)) url <- paste0(url, "&sort=", sort)
  if(!is.null(maxResults)) url <- paste0(url, "&maxResults=", maxResults)
  if(!is.null(filtr)) url <- paste0(url, "&filters=", filtr)
  r <- GET(url, token)
  return(r)
}
This is meant to be a flexible, if not the friendliest, function, because I want to have wrapper functions containing yt_request() that are much more user friendly. For example:
top_videos <- function(...) {
  dim <- "video"
  met <- "views,averageViewDuration"
  maxRes <- 10
  temp <- yt_request(dimensions = dim, metrics = met, maxResults = maxRes, token = myToken)
  return(temp)
}
This works fine and dandy so far, but I also want potential users to have a little flexibility with the results. For example, if they want maxResults <- 20 instead of 10, or different metrics than the ones I specify, I want them to be able to pass their own arguments in the ... of top_videos(...).
How can I check whether someone passes an argument in the ellipsis? If they pass a metric, I want it to override the default I specify; otherwise, go with the default.
EDIT
To clarify: when the user calls the function, they should be able to write something like top_videos(maxResults = 20), and the function should ignore the line maxRes <- 10 and pass maxResults = 20 to yt_request() instead of 10.
We can capture the ... in a list and treat its elements as key/value pairs, extracting elements by name. If a particular named element was not passed, the extraction returns NULL. We make use of this behavior of NULL: concatenate with the default value (10 for maxRes) and select the first element ([1]), so that if the extraction is NULL the default 10 is selected, and otherwise the value passed wins. Likewise, do this for all the objects the OP wants to be able to override.
top_videos <- function(...) {
  lst1 <- list(...) # capture the named arguments passed in ...
  dim <- c(lst1[["dimensions"]], "video")[1]
  met <- c(lst1[["metrics"]], "views,averageViewDuration")[1]
  maxRes <- c(lst1[['maxResults']], 10)[1]
  #temp <- yt_request(dimensions = dim,
  #                   metrics = met, maxResults = maxRes, token = myToken)
  #temp
  maxRes
}
Testing:
top_videos(maxResults = 20)
#[1] 20
top_videos(hello = 5)
#[1] 10
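A related base R pattern, shown only as a sketch: modifyList() merges the user's named arguments over a list of defaults in one step, so each overridable value doesn't need its own c(...)[1] line (this assumes everything passed in ... is named; the body again just returns maxResults for demonstration):
top_videos2 <- function(...) {
  # defaults, overridden by any matching names passed in ...
  defaults <- list(dimensions = "video",
                   metrics = "views,averageViewDuration",
                   maxResults = 10)
  args <- modifyList(defaults, list(...))
  args$maxResults
}
top_videos2(maxResults = 20)
#[1] 20
top_videos2(hello = 5)
#[1] 10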
I have been trying to turn PDF files into data frames using R. I start out by reading the text into R and using data.table to split the data into one list item per page. I am now having trouble writing a loop to combine the questions with their respective continued items. The txt.list object in the code below is a brief example of the format.
### Short list
txt.list <- list('Q1', 'Q2', 'continued page',
'Q3', 'continued page', 'continued page',
'Q4', 'Q5', 'continued page', 'continued page',
'Q6', 'continued page', 'Q7', 'continued page', 'continued page')
### Label pages that continue from the previous
is.continuation <- lapply(txt.list, function(x){ startsWith(x, 'continued')}) # find which pages are continuations
is.continuation <- c(unlist(is.continuation)) # unlist for list item naming
names(txt.list) <- as.character(is.continuation)
print(txt.list)
The result is that each page in the list that continues the corresponding question is given a "TRUE" character label (I know this can be done without labeling the list; I'm just trying to avoid referring to an external vector).
Since each PDF file from this website almost always uses the same format, I am trying to make this work (at least somewhat) for future uses. I've been trying something along the lines of:
new.list <- vector(mode = 'list',
                   length = length(which(names(txt.list) == 'TRUE')))
for(i in 1:length(txt.list)){
  j = i + 1 # pg ahead
  if(names(txt.list)[[j]] == "TRUE"){
    new.list[[i]][[1]] <- txt.list[[i]]
    m = 2 # index ahead
    while(names(txt.list)[[j]] == "TRUE"){
      new.list[[i]][[m]] <- txt.list[[j]]
      m = m + 1
    }
  } else {
    new.list[[i]] <- txt.list[[i]]
  }
}
After a few tries, I'm just completely drawing blanks. Any help would be much appreciated!
It's been a while since I've really worked in R, but double-check your for loop: you need the 1: in for (i in 1:length(...)). Without the 1: part there's no range, so the body runs just once rather than looping.
Your main issue outside of that is that you're writing into new.list at the 'i' location, when that index is only appropriate for reading from txt.list. You should keep a separate tracker for new.list (such as nlSize) and tick it up whenever appropriate.
Another minor issue is the anchor assignment you make before your while loop, which you can avoid.
Finally, I would definitely get away from setting the names as truth values. It would have been better to reference an external vector, though you don't have to do that either.
Just make a function and use it inside your loop.
I put my code in a function called normalizeList and then call it on txt.list. This way you can use it on other similar lists.
normalizeList <- function(lst) {
  # TRUE if a page continues the previous question
  is.continuation <- function(x) startsWith(x, 'continued')
  new.list <- list()
  nlSize <- 0 # separate write index for new.list
  for(i in 1:length(lst)) {
    isLast <- length(lst) == i
    cur <- lst[[i]]
    # peek at the next page; use '' past the end of the list
    nxt <- ifelse(isLast, '', lst[[i+1]])
    if(is.continuation(cur)){
      # append continuation pages to the current question
      new.list[[nlSize]] <- c(new.list[[nlSize]], cur)
      next
    }
    # a new question starts: advance the write index
    nlSize <- nlSize + 1
    new.list[nlSize] <- ifelse(is.continuation(nxt), list(cur), cur)
  }
  new.list
}
normalizeList(txt.list)
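For what it's worth, a compact base R alternative is possible (a sketch using the same txt.list): a page starts a new group exactly when it is not a continuation, so cumsum() over that indicator yields a group id that split() can use.
# group each page with the most recent non-continuation page
is.cont <- sapply(txt.list, startsWith, 'continued')
split(unlist(txt.list), cumsum(!is.cont))
Both approaches should group the pages into seven elements: Q1 alone; Q2 plus one continuation; Q3 plus two; Q4 alone; Q5 plus two; Q6 plus one; and Q7 plus two.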
I am learning data scraping and, on top of that, I am quite a beginner with R (for work I use Stata; I use R only for very specific tasks).
In order to learn scraping, I am exercising with a few pages on Psychology Today.
I have written a function that allows me to scrape information for one therapist and to create a data set with the information collected in this way:
install.packages('rvest') #Installing the rvest package
install.packages('xml2') #Installing the xml2 package
library('rvest') #to scrape
library('xml2') #to handle missing values (it works with html_node, not with html_nodes)
#Specifying the url for desired website to be scraped
url <- 'https://www.psychologytoday.com/us/therapists/THE_ONE_YOU_WANT'
#Reading the HTML code from the website
URL <- read_html(url)
#creating the function
getProfile <- function(profilescrape) {
  ##NAME
  #Using CSS selectors to get the name
  nam_html <- html_node(URL,'.contact-name')
  #Converting the name data to text
  nam <- html_text(nam_html)
  #Let's have a look at the name
  head(nam)
  #Data-Preprocessing: removing '\n' (for the next pieces of information,
  # I will keep \n to help me separate each item
  # within the same type of information)
  nam <- gsub("\n","",nam)
  head(nam)
  #Converting each info from text to factor
  nam <- as.factor(nam)
  #Let's have a look at the name
  head(nam)
  ##MODALITIES
  #Using CSS selectors to get the modality
  mod_html <- html_node(URL,'.attributes-modality .copy-small')
  #Converting the modality data to text
  mod <- html_text(mod_html)
  #Let's have a look at the modalities
  head(mod)
  #Converting each info from text to factor
  mod <- as.factor(mod)
  #Let's have a look at the modalities
  head(mod)
  ##Combining all the lists to form a data frame
  onet_df <- data.frame(Name = nam,
                        Modality = mod)
  ##Structure of the data frame
  str(onet_df)
}
View(onet_df)
This code seems to work well for whichever therapist I choose.
Now, I would like to use this function on multiple profiles to generate one data set with the name and modality of MHPs.
Let's say that I want to apply the above function getProfile to the first 20 therapists in Illinois and put the information for these 20 therapists in a data set called onet_df:
j <- 1
MHP_codes <- c(324585 : 449807) #therapist identifier
withinpage_codes <- c(1 : 20) #therapist running number
for(code1 in withinpage_codes) {
  for(code2 in MHP_codes) {
    URL <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', code2, '?sid=5d87f874630bd&ref=', code1, '&rec_next=1&tr=NextProf')
    record_profile <- getProfile <- function(profilescrape)
    onet_df[[j]] <- rbind.fill(onet_df, record_profile)
    j <- j + 1
  }
}
EDITS START HERE:
This loop does not create any data set; moreover, it does not give any error message.
Would someone be able to help me debug this loop?
Please keep in mind that I am a real beginner.
Following suggestions, I have modified the following at the beginning:
#creating the function
getProfile <- function(URL) {....}
Moreover, I have used two alternative loops:
1st alternative
j <- 1
MHP_codes <- c(324585 : 449807) #therapist identifier
withinpage_codes <- c(1 : 20) #therapist running number
for(code1 in withinpage_codes) {
  for(code2 in MHP_codes) {
    URL <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', code2, '?sid=5d87f874630bd&ref=', code1, '&rec_next=1&tr=NextProf')
    record_profile <- getProfile(URL)
    onet_df[[j]] <- rbind.fill(onet_df, record_profile)
    j <- j + 1
  }
}
which gives the following error message:
Error in UseMethod("xml_find_first") :
no applicable method for 'xml_find_first' applied to an object of class "character"
2nd alternative
MHP_codes <- c(324585, 449807) #therapist identifier
withinpage_codes <- c(1:20) #therapist running number
df_list <- vector(mode = "list",
                  length = length(MHP_codes) * length(withinpage_codes))
j <- 1
for(code1 in withinpage_codes) {
  for(code2 in MHP_codes) {
    URL <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', code2, '?sid=5d87f874630bd&ref=', code1, '&rec_next=1&tr=NextProf')
    df_list[[j]] <- getProfile(URL)
    j <- j + 1
  }
}
final_df <- rbind.fill(df_list)
This loop gives the same error message (please refer to the one above).
Now I just have to figure out why no data set is produced by the loop. There might be two problems: first, something within the loop does not work (I have run both loops on a single existing page, and no data set was produced); second, when I run the loop on a series of links, some of them might be missing, which would produce an error message.
Consider several adjustments:
Adjust the function to receive a URL parameter. Right now, profilescrape is not used anywhere in the function; instead, the function takes whatever URL is assigned in the global environment.
getProfile <- function(URL) {
...
}
Adjust the ending of the function to return the needed object. Without return, R returns the last evaluated expression. Therefore, replace str(onet_df) with return(onet_df).
Pass the dynamic URL in the loop to the function by actually calling it (not re-declaring it):
URL <- paste0(...)
record_profile <- getProfile(URL)
Initialize a list with a specified length (2 x 20) before the loop. Then, on each iteration, assign to the loop index rather than growing the object inside the loop, which is memory-inefficient.
MHP_codes <- c(324585, 449807) #therapist identifier
withinpage_codes <- c(1:20) #therapist running number
df_list <- vector(mode = "list",
                  length = length(MHP_codes) * length(withinpage_codes))
j <- 1
for(code1 in withinpage_codes) {
  for(code2 in MHP_codes) {
    URL <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', code2, '?sid=5d87f874630bd&ref=', code1, '&rec_next=1&tr=NextProf')
    df_list[[j]] <- tryCatch(getProfile(URL),
                             error = function(e) NULL)
    j <- j + 1
  }
}
Call rbind.fill once outside the loop to combine all the data frames together:
final_df <- rbind.fill(df_list)
With that said, consider an apply-family solution, specifically Map (a wrapper around mapply). Doing so, you avoid the bookkeeping of initializing a list and an incrementing variable, and you "hide" the loop in a compact statement.
# ALL POSSIBLE PAIRINGS
web_codes_df <- expand.grid(MHP_codes = c(324585, 449807),
withinpage_codes = c(1:20))
# MOVE URL ASSIGNMENT INSIDE FUNCTION
getProfile <- function(code1, code2) {
  URL <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', code2, '?sid=5d87f874630bd&ref=', code1, '&rec_next=1&tr=NextProf')
  URL <- read_html(URL) # parse the page first; html_node() needs a document, not a character string
  # ...same code as before...
}
# ELEMENT-WISE LOOP PASSING PARAMS IN PARALLEL TO FUNCTION
df_list <- Map(function(code1, code2) tryCatch(getProfile(code1, code2),
                                               error = function(e) NULL),
               code1 = web_codes_df$withinpage_codes,
               code2 = web_codes_df$MHP_codes)
final_df <- rbind.fill(df_list)
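Failed requests come back as NULL in df_list; rbind.fill may tolerate them, but they can also be dropped explicitly before binding (a small, optional sketch):
df_list <- Filter(Negate(is.null), df_list) # keep only successful scrapes
final_df <- rbind.fill(df_list)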
One of the users, Parfait, helped me sort out the issues, so a very big thank-you goes to this user.
Below I post the script; I apologize if it is not precisely commented.
Here is the code.
#Loading packages
library('rvest') #to scrape
library('xml2') #to handle missing values (it works with html_node, not with html_nodes)
library('plyr') #to bind together different data sets
#get working directory
getwd()
setwd("~/YOUR OWN FOLDER HERE")
#DEFINE SCRAPING FUNCTION
getProfile <- function(URL) {
  ##NAME
  #Using CSS selectors to get the name
  nam_html <- html_node(URL,'.contact-name')
  #Converting the name data to text
  nam <- html_text(nam_html)
  #Let's have a look at the name
  head(nam)
  #Data-Preprocessing: removing '\n' (for the next pieces of information,
  # I will keep \n to help me separate each item
  # within the same type of information)
  nam <- gsub("\n","",nam)
  head(nam)
  #Converting each info from text to factor
  nam <- as.factor(nam)
  #Let's have a look at the name
  head(nam)
  #If I need to remove blank space do this:
  #Data-Preprocessing: removing excess spaces
  #variable <- gsub(" ","",variable)
  ##MODALITIES
  #Using CSS selectors to get the modality
  mod_html <- html_node(URL,'.attributes-modality .copy-small')
  #Converting the modality data to text
  mod <- html_text(mod_html)
  #Let's have a look at the modalities
  head(mod)
  #Converting each info from text to factor
  mod <- as.factor(mod)
  #Let's have a look at the modalities
  head(mod)
  ##Combining all the lists to form a data frame
  onet_df <- data.frame(Name = nam,
                        Modality = mod)
  return(onet_df)
}
Then, I apply this function with a loop to a few therapists. For illustrative purposes, I take four adjacent therapist IDs, without knowing a priori whether each of these IDs has actually been assigned (I do this because I want to see what happens if the loop stumbles on a non-existent link).
j <- 1
MHP_codes <- c(163805:163808) #therapist identifier
df_list <- vector(mode = "list", length(MHP_codes))
for(code1 in MHP_codes) {
  URL <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', code1)
  #Reading the HTML code from the website; the read is wrapped in tryCatch too,
  #so a non-existent link yields NULL instead of stopping the loop
  df_list[[j]] <- tryCatch(getProfile(read_html(URL)),
                           error = function(e) NULL)
  j <- j + 1
}
final_df <- rbind.fill(df_list)
save(final_df,file="final_df.Rda")
I have a dataframe with ~9000 rows of human-coded data, two coders per item, so about 4500 unique pairs. I want to break the dataset into each of these pairs (~4500 dataframes), run kripp.alpha on the scores that were assigned, and then save those into a coder sheet I have made. I cannot get a loop to work to do this.
I can get it to work individually, using this:
example.m <- as.matrix(example.m)
s <- kripp.alpha(example.m)
example$alpha <- s$value
However, when trying a loop I am getting either "Error in get(v) : object 'NA' not found" when running this:
for (i in items) {
  v <- i
  v <- v[c("V1","V2")]
  v <- assign(v, as.matrix(get(v)))
  s <- kripp.alpha(v)
  i$alpha <- s$value
}
Or I am getting "In i$alpha <- s$value : Coercing LHS to a list" when running:
for (i in items) {
  i.m <- i[c("V1","V2")]
  i.m <- as.matrix(i.m)
  s <- kripp.alpha(i.m)
  i$alpha <- s$value
}
Here is an example set of data. Items is a list of individual dataframes.
l <- as.data.frame(matrix(c(4,3,3,3,1,1,3,3,3,3,1,1),nrow=2))
t <- as.data.frame(matrix(c(4,3,4,3,1,1,3,3,1,3,1,1),nrow=2))
items <- c("l","t")
I am sure this is a basic question, but what I want is for each file, i, to add a column with the alpha score at the end. Thanks!
Your problem is with scoping and with extracting names from objects when they are referenced through strings. You'd need to eval() some of your objects to make your current approach work.
Here's another solution
library("irr") # For kripp.alpha
# Produce the data
l <- as.data.frame(matrix(c(4,3,3,3,1,1,3,3,3,3,1,1),nrow=2))
t <- as.data.frame(matrix(c(4,3,4,3,1,1,3,3,1,3,1,1),nrow=2))
# Collect the data as a list right away
items <- list(l, t)
Now you can sapply() directly over the elements in the list.
sapply(items, function(v) {
  kripp.alpha(as.matrix(v[c("V1","V2")]))$value
})
which produces
[1] 0.0 -0.5
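If you also want each data frame to carry its alpha as a new column, as described in the question, here is one sketch along the same lines:
alphas <- sapply(items, function(v) {
  kripp.alpha(as.matrix(v[c("V1","V2")]))$value
})
# append each pair's alpha to every row of its data frame
items <- Map(function(df, a) { df$alpha <- a; df }, items, alphas)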
I am using quantmod to adjust for dividends and splits. It seems to work, but I have found the following problem: when adjusting, my SMA(200) historical values are wrong, and they become correct as the date approaches the current date. Please see the code below.
stockData <- new.env() #Make a new environment for quantmod to store data in
symbols = c("IWM","SPY","TLT","TSLA")
nr.of.positions<-3
getSymbols(symbols, src='yahoo',from = "2015-10-01",to = Sys.Date())
for (i in 1:length(symbols)) {
  assign(symbols[i], adjustOHLC(get(symbols[i]),
                                adjust=c("split", "dividend"),
                                use.Adjusted=FALSE,
                                symbol.name=symbols[i]))
}
x <- list()
for (i in 1:length(symbols)) {
  x[[i]] <- get(symbols[i], pos=stockData) # get data from stockData environment
  x[[i]]$sma <- SMA(Cl(x[[i]]),10)
  x[[i]]$smalong <- SMA(Cl(x[[i]]),200)
  x[[i]]$adx <- ADX(HLC(x[[i]]),10)
  x[[i]]$rsi <- RSI(Cl(x[[i]]),14)
  x[[i]]$close <- Cl(x[[i]])
}
You're lucky that your code works. Or maybe you're unlucky, since an error would have let you know you did something wrong.
You create a stockData environment, and the comment says you intended to store the data you pull in it. But you don't specify the stockData environment in your call to getSymbols, or in your calls to assign and get in the first for loop, so they all assign to and get from the global environment.
Your code would be clearer if you avoided using get and assign within a for loop, and instead used the convenience functions lapply and eapply.
stockData <- new.env()
symbols <- c("IWM","SPY","TLT","TSLA")
nr.of.positions <- 3
getSymbols(symbols, from = "2015-10-01", env = stockData)
# loop over objects in an environment with eapply
adj <- eapply(stockData, function(x) {
  symbol <- sub("\\.Close$", "", colnames(Cl(x)))
  adjustOHLC(x, symbol.name=symbol)
})
# loop over list returned by eapply
x <- lapply(adj, function(x) {
  x$sma <- SMA(Cl(x),10)
  x$smalong <- SMA(Cl(x),200)
  x$adx <- ADX(HLC(x),10)
  x$rsi <- RSI(Cl(x),14)
  x$close <- Cl(x)
  x
})
You'll notice the results of my code and your code are the same if you run them each in a clean R session. So the reason your code produced "wrong" results is probably that you had other objects in your workspace being assigned/accessed by your use of get and assign.
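For reference, the original loop pattern would also have worked if every call had named the environment explicitly; a minimal sketch using the same symbols:
getSymbols(symbols, src = 'yahoo', from = "2015-10-01", env = stockData)
for (i in seq_along(symbols)) {
  # read from and write back to stockData, never the global environment
  assign(symbols[i],
         adjustOHLC(get(symbols[i], envir = stockData),
                    adjust = c("split", "dividend"),
                    use.Adjusted = FALSE,
                    symbol.name = symbols[i]),
         envir = stockData)
}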