List of Data Frames to One Data Frame - r

Disclaimer: I know that this question has been asked before. The answer provided there worked for me in the past, but for some reason it has stopped working.
I am pulling marketing email statistics from the Mailchimp API, and have been doing so for the last half year or so. In the past two months, however, I believe the structure of what I pull has changed, and as a result my code no longer works and I cannot figure out why. I believe it has something to do with the nested data frames within the list of data frames that I receive.
Here is an example of my code and the resulting list of data frames (sensitive information removed):
library(httr)
library(jsonlite)
library(plyr)
#Opens-----------
opens1 <- GET("https://us4.api.mailchimp.com/3.0/reports/***ReportNumber***/sent-to?count=4000",authenticate('***My Company***', '***My-Password***'))
opens1 <- content(opens1,"text")
opens1 <- fromJSON(opens1)
Then I run opens1 <- ldply(opens1, data.frame), and I receive the following error:
Error in allocate_column(df[[var]], nrows, dfs, var) :
Data frame column 'merge_fields' not supported by rbind.fill
I tried using rbind.fill() and the other methods described in the answer linked at the top of my post, to no avail. What am I misinterpreting about the merge_fields variable (or am I way off entirely), and how do I correct it?
I'm just trying to get one data frame of all of the variables from the opens1 list.
Thanks for any and all help, and please, feel free to ask any clarification questions!

On a quick glance, this seems to work for me:
library(httr)
campaign_id <- "-------"
apikey <- "------"
url <- sprintf("https://us1.api.mailchimp.com/3.0/reports/%s/sent-to", campaign_id)
opens <- GET(url, query = list(apikey = apikey, count = 4000L))
lst <- rjson::fromJSON(content(opens, "text"))
# unlist() flattens each record (including nested fields such as
# merge_fields) into a named vector; the one-row data frames then bind cleanly
df <- dplyr::bind_rows(
  lapply(lst$sent_to, function(x)
    as.data.frame(t(unlist(x)), stringsAsFactors = FALSE)
  ))
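If you'd rather keep jsonlite, flattening while parsing is another option. A minimal sketch, reusing the question's own call and assuming the nested merge_fields data frame column is what rbind.fill chokes on:
library(httr)
library(jsonlite)
opens1 <- GET("https://us4.api.mailchimp.com/3.0/reports/***ReportNumber***/sent-to?count=4000",
              authenticate('***My Company***', '***My-Password***'))
# flatten = TRUE expands nested data frame columns such as merge_fields
# into ordinary top-level columns, so no rbind.fill step is needed
opens1 <- fromJSON(content(opens1, "text"), flatten = TRUE)
df <- opens1$sent_to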

Related

Looping over rows of self-reported job titles to gather publicly available data

Suppose I have a data frame with one case of self-reported job titles (this is done in R):
x <- data.frame("job.title" = c("psychologist"))
I'd like to have this job title entered into a search engine on a website (this part I can do) in order to have data on these jobs pulled into a data frame (this part I can also do).
The following function does this for me:
onet.sum <- function(x) {
  obj1 <- as.list(ONETr::keySearch(x)) # enter self-reported job title into ONET's search engine
  job.title <- obj1[["title"]][1]      # pull best-matching title
  soc.code <- obj1[["code"]][1]        # pull best-matching title's SOC code
  obj4 <- as.data.frame(cbind(job.title, soc.code))
  return(obj4)
}
However, once I add a second job title in a second row...
x <- data.frame("job.title" = c("psychologist", "social worker"))
...I get this system error that I'm not sure how to diagnose.
Space required after the Public Identifier
SystemLiteral " or ' expected
SYSTEM or PUBLIC, the URI is missing
Any advice?
UPDATE
So it turns out that there are two solutions, both of which work only if the job titles I pass contain no spaces:
1. Using lapply(), after making sure the job titles contain no spaces.
So this works:
final_data <- lapply(c("psychologist","socialworker"), onet.sum) %>%
  bind_rows
...but this doesn't work:
final_data <- lapply(c("psychologist","social worker"), onet.sum) %>%
  bind_rows
2. Using purrr's map_df(), which is more flexible:
result <- purrr::map_df(gsub('\\s', '', x$job.title), onet.sum)
You can try with an lapply statement:
result <- do.call(rbind, lapply(x$job.title, onet.sum))
Using purrr::map_df might be shorter.
result <- purrr::map_df(x$job.title, onet.sum)
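If the failures come from the space reaching ONET's endpoint unencoded (an assumption about ONETr's internals, not something confirmed here), percent-encoding the title before the call would preserve the original wording instead of stripping the space. A sketch:
# Hypothetical wrapper: percent-encode the title ("social worker" becomes
# "social%20worker") before handing it to onet.sum
onet.sum.encoded <- function(x) onet.sum(utils::URLencode(x))
result <- purrr::map_df(x$job.title, onet.sum.encoded)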

R Text data scraper loop through dates

I'm doing a little project where the goal is to retrieve data in text format from a website (http://regsho.finra.org/regsho-Index.html).
The website is nice enough to provide the data online, but it is split over several days, each under a different link.
I thought about looping through the dates and store the data with the following code:
#Download the needed data
my_data <- c()
for (i in 01:13){
  my_data <- read.delim(sprintf("http://regsho.finra.org/CNMSshvol202005%i.txt", i), header=TRUE, sep="|")
}
head(my_data)
There are two problems here. First, in the line
for (i in 01:13){ # the dates on the website are 01, 02, 03, ... and the loop seems to omit the leading 0
I've used sprintf() so I can put a variable in the string, but the leading zero gets dropped. Second, in the line
my_data <- read.delim(sprintf("http://regsho.finra.org/CNMSshvol202005%i.txt", i), header=TRUE, sep="|")
the variable my_data is overwritten on every iteration, so it always ends up holding only the last data downloaded.
Could somebody reassure me that I'm going in the right direction? I'm starting to doubt myself here.
Any help would be greatly appreciated!
Thanks in advance
This should give you a leading 0 without using an extra package:
sprintf("%02d", i)
i.e.
sprintf("http://regsho.finra.org/CNMSshvol202005%02d.txt", i)

A continuation of... Extracting data from an API using R

I'm super new at this and working in R for my thesis. The code in this answer finally worked for me (Extracting data from an API using R), but I can't figure out how to add a loop to it. I keep getting the first page of the API when I need all 3360 pages.
Here's the code:
library(httr)
library(jsonlite)
r1 <- GET("http://data.riksdagen.se/dokumentlista/?
sok=&doktyp=mot&rm=&from=2000-01-01&tom=2017-12- 31&ts=&bet=&tempbet=&nr=&org=&iid=&webbtv=&talare=&exakt=&planering=&sort=rel&sortorder=desc&rapport=&utformat=json&a=s#soktraff")
r2 <- rawToChar(r1$content)
class(r2)
r3 <- fromJSON(r2)
r4 <- r3$dokumentlista$dokument
By the time I reach r4, it's already a data frame.
Please and thank you!
Edit: originally, I couldn't get a URL that had the page number in it. Now I have one (below), but I still haven't been able to loop over it.
"http://data.riksdagen.se/dokumentlista/?sok=&doktyp=mot&rm=&from=2000-01-01&tom=2017-12-31&ts=&bet=&tempbet=&nr=&org=&iid=&webbtv=&talare=&exakt=&planering=&sort=rel&sortorder=desc&rapport=&utformat=json&a=s&p="
I think you can extract the url of the next page from r3 as follows:
next_url <- r3$dokumentlista$`#nasta_sida`
# you need to re-check this, but sometimes I'm getting white spaces within the url,
# you may not face this problem, but in any case this line of code solved the issue
next_url <- gsub(' ', '', next_url)
GET(next_url)
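A sketch of how that next-page URL could drive a loop; the stop condition is an assumption (that the field is missing or empty on the last page), so re-check it against the API:
results <- list()
url <- "http://data.riksdagen.se/dokumentlista/?sok=&doktyp=mot&rm=&from=2000-01-01&tom=2017-12-31&ts=&bet=&tempbet=&nr=&org=&iid=&webbtv=&talare=&exakt=&planering=&sort=rel&sortorder=desc&rapport=&utformat=json&a=s"
repeat {
  r3 <- fromJSON(rawToChar(GET(url)$content))
  results[[length(results) + 1]] <- r3$dokumentlista$dokument
  next_url <- r3$dokumentlista$`#nasta_sida`
  if (is.null(next_url) || next_url == "") break  # assumed end-of-pages marker
  url <- gsub(' ', '', next_url)                  # strip stray whitespace, as above
}
# results now holds one data frame per page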
Update
I tried the URL with the page number on 10 pages and it worked:
my_dfs <- lapply(1:10, function(i){
  my_url <- paste0("http://data.riksdagen.se/dokumentlista/?sok=&doktyp=mot&rm=&from=2000-01-01&tom=2017-12-31&ts=&bet=&tempbet=&nr=&org=&iid=&webbtv=&talare=&exakt=&planering=&sort=rel&sortorder=desc&rapport=&utformat=json&a=s&p=", i)
  r1 <- GET(my_url)
  r2 <- rawToChar(r1$content)
  r3 <- fromJSON(r2)
  r4 <- r3$dokumentlista$dokument
  return(r4)
})
Update 2:
The extracted data frames are complex (e.g. some columns are lists of data frames), which is why a simple rbind will not work here. You'll have to do some pre-processing before you stack the data together; something like this would work:
my_dfs %>% lapply(function(df_0){
  # Do some stuff here with the data, and choose the variables you need
  # I chose the first 10 columns to check that I got 200 different observations
  df_0[1:10]
}) %>% do.call(rbind, .)
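Extending the same pattern to all 3360 pages is then mostly a matter of widening the range; a sketch, assuming every page exposes the same leading plain columns and using the page-numbered URL from the question:
library(httr)
library(jsonlite)
base_url <- "http://data.riksdagen.se/dokumentlista/?sok=&doktyp=mot&rm=&from=2000-01-01&tom=2017-12-31&ts=&bet=&tempbet=&nr=&org=&iid=&webbtv=&talare=&exakt=&planering=&sort=rel&sortorder=desc&rapport=&utformat=json&a=s&p="
pages <- lapply(1:3360, function(i) {
  r <- fromJSON(rawToChar(GET(paste0(base_url, i))$content))
  Sys.sleep(0.2)                   # be gentle with the API
  r$dokumentlista$dokument[1:10]   # keep only the plain leading columns, as above
})
result <- do.call(rbind, pages)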

extracting list-in-a-list-in-a-list to build dataframe in R

I am trying to build a data frame with book id, title, author, rating, collection, start and finish date from the LibraryThing api with my personal data. I am able to get a nested list fairly easily, and I have figured out how to build a data frame with everything but the dates (perhaps in not the best way but it works). My issue is with the dates.
The list I'm working with normally has 20 elements, but it adds the startfinishdates element only if I added dates to the book in my account. This is causing two issues:
- If it were always there, I could extract it like everything else; it would be NA most of the time, and I could use cbind to line it up correctly with the other information.
- When I extract it by name, I get an object with fewer elements, and I don't have a way to join it back to everything else (it doesn't have the book id).
Ultimately, I want to build this data frame, and an answer that tells me how to pull out the book id and associate it with each startfinishdate (so I can join on book id) is acceptable. I would just add that to the code I have.
I'm also open to learning a better approach from the jump and re-designing the entire thing as I have not worked with lists much in R and what I put together was after much trial and error. I do want to use R though, as ultimately I am going to use this to create an R Markdown page for my web site (for instance, a plot that shows finish dates of books).
You can run the code below and get the data (no api key required).
library(jsonlite)
library(tidyverse)
library(assertr)
data<-fromJSON("http://www.librarything.com/api_getdata.php?userid=cau83&key=392812157&max=450&showCollections=1&responseType=json&showDates=1")
books.lst<-data$books
#create df from json
create.df <- function(item){
  df <- map_df(.x = books.lst, ~.x[[item]])
  df2 <- t(df)
  return(df2)
}
ids<-create.df(1)
titles<-create.df(2)
ratings<-create.df(12)
authors<-create.df(4)
#need to get the book id when i build the date df's
startdates.df<-map_df(.x=books.lst,~.x$startfinishdates) %>% select(started_stamp,started_date)
finishdates.df<-map_df(.x=books.lst,~.x$startfinishdates) %>% select(finished_stamp,finished_date)
collections.df<-map_df(.x=books.lst,~.x$collections)
#from assertr: will create a vector of same length as df with all values concatenated
collections.v<-col_concat(collections.df, sep = ", ")
#assemble df
books.df<-as.data.frame(cbind(ids,titles,authors,ratings,collections.v))
names(books.df)<-c("ID","Title","Author","Rating","Collections")
books.df<-books.df %>% mutate(ID=as.character(ID),Title=as.character(Title),Author=as.character(Author),
Rating=as.character(Rating),Collections=as.character(Collections))
This approach is outside the tidyverse meta-package; using base R, you can make it work with the following code.
Map applies the user-defined function to each element of data$books and extracts the required fields for your data frame. Reduce then takes all the individual data frames and merges (reduces) them into a single data frame, booksdf.
library(jsonlite)
data<-fromJSON("http://www.librarything.com/api_getdata.php?userid=cau83&key=392812157&max=450&showCollections=1&responseType=json&showDates=1")
booksdf = Reduce(function(x, y){ rbind(x, y) },
  Map(function(x){
    lenofelements = length(x)
    if(lenofelements > 20){
      if(!is.null(x$startfinishdates$started_date)){
        started_date = x$startfinishdates$started_date
      }else{
        started_date = NA
      }
      if(!is.null(x$startfinishdates$started_stamp)){
        started_stamp = x$startfinishdates$started_stamp
      }else{
        started_stamp = NA
      }
      if(!is.null(x$startfinishdates$finished_date)){
        finished_date = x$startfinishdates$finished_date
      }else{
        finished_date = NA
      }
      if(!is.null(x$startfinishdates$finished_stamp)){
        finished_stamp = x$startfinishdates$finished_stamp
      }else{
        finished_stamp = NA
      }
    }else{
      started_stamp = NA
      started_date = NA
      finished_stamp = NA
      finished_date = NA
    }
    book_id = x$book_id
    title = x$title
    author = x$author_fl
    rating = x$rating
    collections = paste(unlist(x$collections), collapse = ",")
    return(data.frame(ID = book_id, Title = title, Author = author, Rating = rating,
                      Collections = collections, Started_date = started_date, Started_stamp = started_stamp,
                      Finished_date = finished_date, Finished_stamp = finished_stamp))
  }, data$books))
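If you'd rather stay in the tidyverse for the date extraction itself, a minimal sketch of the same idea (assuming, as above, that each element of data$books carries book_id and, optionally, a startfinishdates entry):
library(purrr)
library(dplyr)
dates.df <- map_dfr(data$books, function(b){
  out <- data.frame(ID = b$book_id, stringsAsFactors = FALSE)
  sf <- b$startfinishdates
  if(!is.null(sf)) out <- cbind(out, as.data.frame(sf, stringsAsFactors = FALSE))
  out  # books without dates contribute an ID-only row; binding fills the rest with NA
})
The result can then be joined to the main books data frame on the book id, e.g. dplyr::left_join(books.df, dates.df, by = "ID").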

R Apply function to list and create new dataframe

I am wanting to retrieve data from several webpages that is in the same place on all the pages and put it all in one data frame.
I have the following code attempt:
library(XML)
library(plyr)
## the urls
raceyears <- list(url2013, url2012, url2011)
## function that is not producing what I want
raceyearfunction <- function(x){
  page <- readLines(x)
  stats <- page[10:19]
  y <- read.table(textConnection(stats))
  run <- data.frame(y$V1, y$V2)
  colnames(run) <- c("Country","Participants")
  rbind.fill(run)
}
data<-llply(raceyears,raceyearfunction)
This places all the data in multiple columns (two columns for each webpage), but I want one data frame with just two columns (Participants, Country), not one data frame with many columns.
I haven't found a question quite like this already on the site but am open to follow a link. Thank you in advance.
You need to do the row-binding (e.g. with rbindlist or rbind.fill) outside of raceyearfunction; let the function return(run) instead of rbind.fill(run).
Or you can use ldply instead of llply; then it returns the combined data frame directly.
library(XML)
library(plyr)
raceyears <- list(url2013,url2012,url2011)
raceyearfunction <- function(x)
{
  page <- readLines(x)
  stats <- page[10:19]
  y <- read.table(textConnection(stats))
  run <- data.frame(y$V1, y$V2)
  colnames(run) <- c("Country","Participants")
  return(run)
}
data <- ldply(raceyears, raceyearfunction)
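For completeness, the llply version from the question also works if the binding happens afterwards; plyr's rbind.fill accepts a list of data frames as its first argument:
data <- llply(raceyears, raceyearfunction)  # list of per-year data frames
data <- rbind.fill(data)                    # stack into one two-column data frame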
