I am trying to write a for loop that passes each author's short ID to the RePEc database API and retrieves the affiliation data. So far, this is what I have (NB: USERCODE only works from my IP):
url <- "https://api.repec.org/call.cgi?code=USERCODE&getauthorrecordraw="
for(i in 1:length(df_affiliations)){
  Sys.sleep(1)
  affiliation_fun <- paste(url, df_affiliations$author_reg_1[i])
  affiliation_run <- fromJSON(txt=affiliation_fun) %>% select("affiliation") %>% unlist(use.names=FALSE)
  affiliation_1 <- paste(unlist(affiliation_run), collapse =" ")
  df_affiliations$vector <- rbind(affiliation_1)
}
Every time I try this, I either get only the last value or an HTTP Error 400. For loops are not my strong suit. Can anyone figure out where I've gone wrong? Thanks in advance for your help!
Your loop is overwriting your output variable on each iteration. A solution here is to initialize an empty vector outside of the loop and then assign each result to its index:
url <- "https://api.repec.org/call.cgi?code=USERCODE&getauthorrecordraw="
#define empty vector
affiliation_1<-vector(length=length(df_affiliations))
for(i in 1:length(df_affiliations)){
Sys.sleep(1)
affiliation_fun <- paste0(url,df_affiliations$author_reg_1[i])
affiliation_run <- fromJSON(affiliation_fun) %>% select(affiliation) %>% unlist(use.names=FALSE)
affiliation_1[i] <- paste(unlist(affiliation_run), collapse =" ")
}
This is untested since the question is not reproducible, but it should provide a starting point for your final solution.
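Once the loop has finished, the filled vector can be attached back to the data frame; a minimal sketch (the column name affiliation is just an example, not something from your post):
# attach the results as a new column of the data frame
df_affiliations$affiliation <- affiliation_1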
I have written a script which uses a list of URLs as input and then scrapes certain information from the websites. I have done this with a for loop, but the processing time is already very long, and I expect the list to get bigger over time, so I wanted to re-code my script in a more efficient way. My idea was to eliminate the for loop and use pipe operators to reduce the processing time. My original (working) code is as follows:
imo <- c()
mmsi <- c()
for(i in 1:nrow(data)){
  url <- sprintf("https://www.marinevesseltraffic.com/vessels?vessel=%s&flag=&page=1&sort=lenght&direction=desc", data$NAME[i])
  page <- read_html(url)
  CSSextract1 <- html_nodes(page, '.td_imo')
  CSSextract2 <- html_nodes(page, '.td_mmsi')
  imos <- html_text(CSSextract1)[2]
  imo[i] <- imos
  mmsis <- html_text(CSSextract2)[2]
  mmsi[i] <- mmsis
}
data$IMO <- gsub("[\r \n \t]", "", imo)
data$MMSI <- gsub("[\r \n \t]", "", mmsi)
data$NAME <- gsub("\\+", " ", data$NAME)
I have re-written the code, trying to eliminate the for loop, as follows:
CSSex1 <- function(page){
  CSSextract <- html_nodes(page, '.td_imo')
  return(CSSextract)
}
data$url <- sprintf("https://www.marinevesseltraffic.com/vessels?vessel=%s&flag=&page=1&sort=lenght&direction=desc",data$NAME)
data$mmsi <- data$url %>% read_html() %>% CSSex1() %>% html_text()[2]
However, it gives me the error:
Error: `x` must be a string of length 1
I assume that, the way I coded it, the vector (data$url) as a whole is taken as input, so my question is:
Is it possible, and if so how, to take each element from data$url as an input without using a (for) loop?
You may wish to set up url as a column of the data frame (data) and try:
mmsi_func <- function(x) {
z <- x %>%
read_html() %>%
CSSex1() %>%
html_text()
z[2]
}
data <- data %>%
rowwise() %>%
dplyr::mutate(mmsi = mmsi_func(url))
or something along those lines. I am not sure what the expected output is supposed to look like, but if it is a list rather than a vector, you can use this minor adjustment for a list column in the dataframe:
mmsi_func <- function(x) {
z <- x %>%
read_html() %>%
CSSex1() %>%
html_text()
z[2]
}
data <- data %>%
rowwise() %>%
dplyr::mutate(mmsi = list(mmsi_func(url)))
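Another sketch, if you prefer to avoid rowwise(): purrr::map_chr() applies mmsi_func() to each url and returns a character vector. This assumes mmsi_func() always yields a single string per url; a page that fails to load would still throw an error:
library(purrr)

# apply mmsi_func() to every element of the url column
data <- data %>%
  dplyr::mutate(mmsi = map_chr(url, mmsi_func))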
I'm applying a for loop to generate multiple API requests from the RePEc database. Unfortunately, the data is not reproducible because the access code the administrator gave me only works on my IP address, but I'm hoping someone can help me figure out what I've done wrong in my for loop (I'm not great at writing for loops).
Basically, I have a column of 80,000 cells, df$author_reg_2, with a whole lot of NA values. I want to skip the NA values and only generate API requests for cells with an actual value in them.
url <- "https://api.repec.org/call.cgi?code=USERCODE&getauthorrecordraw="
affiliation_2 <- vector(length=length(df$author_reg_2))
for(i in 1:length(df$author_reg_2))
  try({
    Sys.sleep(1)
    if(is.na(df$author_reg_2)) next
    affiliation_fun <- paste0(url, df$author_reg_2[i])
    affiliation_run <- fromJSON(txt=affiliation_fun) %>% select("affiliation")
    affiliation_2[i] <- paste(unlist(affiliation_run), collapse =" ")
    print(i)
  })
Each time I try to run this script, it still generates errors telling me that it's running the NA values.
Any help appreciated! Thank you in advance!
As we are looping over each element, the condition passed to if() must be a single element: if() takes a length-1 logical input and returns one branch, so we need to subset the column with [i]:
url <- "https://api.repec.org/call.cgi?code=USERCODE&getauthorrecordraw="
affiliation_2 <-vector("list", length=length(df$author_reg_2))
for(i in seq_along(df$author_reg_2))
try({
Sys.sleep(1)
if(is.na(df$author_reg_2[i])) next
affiliation_fun <- paste0(url,df$author_reg_2[i])
affiliation_run <- fromJSON(txt=affiliation_fun) %>% select("affiliation")
affiliation_2[[i]] <- paste(unlist(affiliation_run), collapse =" ")
print(i)
})
The list can also hold vectors of different lengths, so there is no need to paste and collapse at the end (if that is the case).
If we want to replace the next so that an NA value is stored in `affiliation_2`:
for(i in seq_along(df$author_reg_2))
  try({
    Sys.sleep(1)
    if(is.na(df$author_reg_2[i])) {
      affiliation_2[[i]] <- NA_character_
    } else {
      affiliation_fun <- paste0(url, df$author_reg_2[i])
      affiliation_run <- fromJSON(txt = affiliation_fun) %>% select("affiliation")
      affiliation_2[[i]] <- paste(unlist(affiliation_run), collapse = " ")
    }
    print(i)
  })
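Once the loop has run, the list can be flattened back into an ordinary column; a sketch (the column name affiliation_result is just an illustration, and this assumes every element was filled, since unlist() silently drops NULL entries left by requests that errored out):
# flatten the list into a character column of the data frame
df$affiliation_result <- unlist(affiliation_2)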
So I am trying to make a basic sensitivity analysis script. The outputs come out as I want via the print I added to the end of the script. The issue is that I would like a tibble or other object that has all the outputs appended together, which I can export as a csv or xlsx.
I created two functions: sens_analysis, which runs all the code, and multiply_across, which multiplies each possible percentage across each relevant column of the table. You need multiply_across to run sens_analysis.
I would normally like a title, but instead I just added an indicator column that I can sort by.
I made everything with mtcars so it should be easy to replicate; the issue is that I just have a huge print at the end, not an object that I can manipulate or pull from for other analysis.
I have been trying rbind, bind_rows, and appending rows in a variety of ways.
I have also tried building a new object: as you can see in the code, I make something called output that I have tried to populate, which hasn't gone well.
rm(list = ls())
library(dplyr)
library(tidyr)
library(purrr)
library(tibble)
library(magrittr)
library(xtable)
data<-mtcars
percent<-c(.05,.1,.15)
goods<-c("hp","gear","wt")
weight<-c(6,7,8)
disagg<-"cyl"
func<-median
sens_analysis <- function(data=data, goods=goods, weight=weight, disagg=disagg, percent=percent, func=func){
  output <- NULL %>%
    as.tibble()
  basket <- rbind(goods, weight)
  percent <- c(0, percent, (percent*-1))
  percent_to_1 <- percent + 1
  data_select <- data %>%
    dplyr::select(c(goods, disagg)) %>%
    group_by_at(disagg) %>%
    summarise_at(.vars = goods, .funs = func) %>%
    as_tibble()
  data_select_weight <- purrr::map2(data_select[,-1], as.numeric(basket[2,]), function(var, weight){
    var*weight
  }) %>% as_tibble %>%
    add_column(data_select[,1], .before = 1)
  colnames(data_select_weight)[1] <- disagg
  multiply_across(data_select_weight, percent_to_1)
  return(output)
  #output2<-rbind(output2,output)
}
############################
multiply_across <- function(data=data_select_weight, list=percent_to_1){
  varlist <- names(data[,-1])
  for(i in varlist){
    df1 <- data[,i]
    for(j in list){
      df <- data
      df[,i] <- round(df1*j, 2)
      df <- mutate(df, total = round(rowSums(df[,-1]), 2)) %>%
        mutate(type = paste0(i, " BY ", (as.numeric(j)-1)*100, "% OVER ", disagg)) %>%
        print(df)
      #output<-bind_rows(output,df)
      #output[[j]]<-df[[j]]
    }
  }
}
##############################################################################################
sens_analysis(data,goods,weight,disagg,percent,func)
The expected result if you just run the code straight up is a bunch of printed tibbles that aren't stored in an object. But ideally, for future analysis on the data or ease of use, a table of the outputs appended together would be best.
So I figured it out and will add my answer here in case someone else hits this issue.
I created a list within the loops and then bound the list elements together.
Just focus on the bind_rows() call after the for loops.
multiply_across <- function(data = data_select_weight,
                            list = percent_to_1){
  varlist <- colnames(data[, -1])
  output_list <- list()
  for (i in varlist) {
    df1 <- data[, i]
    for (j in list) {
      df <- as_tibble(data)
      df[, i] <- round(df1*j, 2)
      df <- mutate(df, total = round(rowSums(df[, -1]), 2)) %>%
        mutate(type = paste0(i, " BY ", (as.numeric(j)-1)*100, "% OVER ", disagg))
      df <- df[, c(6, 1, 2, 3, 4, 5)]
      output_list[[paste0(i, " BY ", (as.numeric(j)-1)*100)]] <- df
    }
  }
  bind_rows(lapply(output_list, as.data.frame.list, stringsAsFactors = FALSE))
}
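To actually export the appended table (the original goal of a csv/xlsx), here is a usage sketch, assuming sens_analysis() is also updated so that it ends with return(multiply_across(data_select_weight, percent_to_1)) instead of returning the empty output object:
# capture the combined table returned by the (updated) sens_analysis()
results <- sens_analysis(data, goods, weight, disagg, percent, func)
# write.csv() is base R; writexl::write_xlsx(results, "sensitivity_analysis.xlsx")
# would cover the xlsx case if that package is installed
write.csv(results, "sensitivity_analysis.csv", row.names = FALSE)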
I am trying to reverse a string without using extra space in R. Below is the code I have so far. My question is how to get the ReverseString function to change the input without using extra space. I even tried using <<- without any luck.
ReverseString <- function(TestString){
  TestString <- unlist(strsplit(TestString, ""))
  Left <- 1
  Right <- length(TestString)
  while (Left < Right){
    Temp <- TestString[Left]
    TestString[Left] <- TestString[Right]
    TestString[Right] <- Temp
    Left <- Left + 1
    Right <- Right - 1
  }
  return(paste(TestString, collapse = ""))
}
## Input
a = "StackOverFlow"
## OutPut
ReverseString(a)
"wolFrevOkcatS"
##
a
"StackOverFlow"
It is always better to take advantage of vectorization in R (instead of for or while loops). So, in base R, without any packages, it would be something like:
ReverseString <- function(x) {
  # strsplit() splits the string into single characters, and rev() reverses their order
  out <- rev(strsplit(x, split = '')[[1]])
  # paste them back together
  paste(out, collapse = '')
}
a <- "StackOverFlow"
ReverseString(a)
#[1] "wolFrevOkcatS"
According to your comment you want to reverse the string without calling any function that does the reversal, i.e. no rev and co. Both of the solutions below do this.
I think you are also trying to modify global a from within the function, which is why you tried <<-. I'm not sure why it didn't work for you, but you might have used it incorrectly.
You should know that using <<- alone does not mean that you are using less space. To really save space you would have to call or modify global a at each step in your function where you call or modify TestString. This would entail some combination of assign, do.call, eval and parse - not to mention all the pasting you would have to do to access elements of a by integer position. Your function would end up bulky, nearly unreadable, and very likely less efficient due to the numerous function calls, despite having saved a negligible amount of space by not storing a copy of a. If you're dead set on creating such an abomination, then take a look at the functions I just listed and figure out how to use them.
Your energy would be better spent improving your string-reversing function in other ways. For example, you can shorten it quite a bit by using a reversed numerical sequence such as 13:1 (length(vec):1 in general) in sapply:
library(stringr)

reverse_string <- function(string) {
  vec <- str_split(string, "")[[1]]
  paste(sapply(length(vec):1, function(i) vec[i]), collapse = "")
}
reverse_string("StackOverFlow")
#### OUTPUT ####
[1] "wolFrevOkcatS"
If your interviewers also have a problem with reversed sequences, then here's another option that's closer to your original code, just a little cleaner. I also did my best to eliminate other areas where "extra space" was being used (both indices stored in a single vector, no more Temp):
reverse_string2 <- function(string){
  vec <- str_split(string, "")[[1]]
  i_vec <- c(1, length(vec))
  while(i_vec[1] < i_vec[2]) {
    vec[i_vec] <- vec[c(i_vec[2], i_vec[1])]
    i_vec <- i_vec + c(1, -1)
  }
  paste(vec, collapse = "")
}
reverse_string2("StackOverFlow")
#### OUTPUT ####
[1] "wolFrevOkcatS"
It can be done easily with stringi
library(stringi)
a <- "StackOverFlow"
stri_reverse(a)
#[1] "wolFrevOkcatS"
I'm not sure I understood the problem exactly, but I think you're looking for a way to reverse the string object and automatically assign it to the original object without having to do a <- ReverseString(a) (assuming this is the reason why you tried using <<-). My solution is to use deparse(substitute()) to read the original variable name inside the function, and assign() (with envir = .GlobalEnv) to write the result over the original variable.
ReverseString <- function(TestString){
  nm <- deparse(substitute(TestString))
  TestString <- unlist(strsplit(TestString, ""))
  Left <- 1
  Right <- length(TestString)
  while (Left < Right){
    Temp <- TestString[Left]
    TestString[Left] <- TestString[Right]
    TestString[Right] <- Temp
    Left <- Left + 1
    Right <- Right - 1
  }
  assign(nm, paste(TestString, collapse = ""), envir = .GlobalEnv)
}
## Input
a = "StackOverFlow"
ReverseString(a)
a
#[1] "wolFrevOkcatS"
I am trying to scrape a website using the following:
industryurl <- "https://finance.yahoo.com/industries"
library(rvest)
read <- read_html(industryurl) %>%
html_table()
library(plyr)
industries <- ldply(read, data.frame)
industries = industries[-1,]
read <- read_html(industryurl)
industryurls <- html_attr(html_nodes(read, "a"), "href")
links <- industryurls[grep("/industry/", industryurls)]
industryurl <- "https://finance.yahoo.com"
links <- paste0(industryurl, links)
links
##############################################################################################
store <- NULL
tbl <- NULL
for(i in links){
store[[i]] = read_html(i)
tbl[[i]] = html_table(store[[i]])
}
#################################################################################################
I am mostly interested in the code between the ########## markers. I want to apply a function instead of a for loop, since I am running into timeout issues with Yahoo, and I want to make the extraction more human-like (it is not too much data).
My question is: how can I take links, apply a function to each element, and set a sort of delay timer so the contents are read in as they are in the for loop?
I can paste my own version of the for loop which does not work.
This is the function I came up with:
##First argument is the link you need
##The second argument is the total time for Sys.sleep
extract_function <- function(define_link, define_time){
  print(paste0("The system will stop for: ", define_time, " seconds"))
  Sys.sleep(define_time)
  first <- read_html(define_link)
  print(paste0("It will now return the table for link ", define_link))
  return(html_table(first))
}
##I added the following tryCatch function
link_try_catch <- function(define_link, define_time){
  out <- tryCatch(extract_function(define_link, define_time),
                  error = function(e) NA)
  return(out)
}
##You can now retrieve the data using the links vector in two ways
##Picking the first ten, so it should not crash on link 5
p <- lapply(1:10, function(i)link_try_catch(links[i],1))
##OR (I subset the vector just for demo purposes)
p2 <- lapply(links[1:10], function(i)extract_function(i,1))
Hope it helps
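As a side note, since the goal is to make the requests look more human, the fixed delay could be swapped for a small random pause. A sketch of that idea (the 1-3 second range and the function name extract_function_random are just illustrative choices, not part of the original code):
extract_function_random <- function(define_link, min_wait = 1, max_wait = 3){
  # random delay between min_wait and max_wait seconds
  wait <- runif(1, min = min_wait, max = max_wait)
  print(paste0("The system will stop for: ", round(wait, 1), " seconds"))
  Sys.sleep(wait)
  html_table(read_html(define_link))
}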