R get the original index of data frame after subsetting - r

Is it possible to get the original index of a data frame after subsetting? It is being stored somewhere but I am not sure where and how to access it. I understand that there is a better solution if this is part of the algorithm design. I am just curious if anyone knows if it possible.
Example Scenario:
df = data.frame(atr1=integer(),atr2=integer())
for(i in 1:10) {
df <- rbind(df,data.frame(atr1=as.integer(i),atr2=as.integer(i)))
}
View(df)
Note the far left side of output of View function in R studio will show the index (I am not sure how to post an image that only exists on my local machine).
Create a data frame by taking subset of that original data frame:
df_subset <- df[which(df$atr1 > 4),]
View(df_subset)
The output of the View function doesn't index df_subset 1 to 6 as you would access them. The original indices 5 to 10 are maintained. I am curious if it is possible to accesses those indices in some fashion similar to:
df_subset[index,]$<some hidden attribute>

Related

Optimizing data scrape from NHL API using R

I am a novice with R and a total newbie with the NHL API. I wrote an R program to extract all of the goals recorded in the NHL's data repository accessed through the NHL API using the R "nhlapi" package. I have code that works, but it's ugly and slow, and I wanted to see if anyone has suggestions for improving it. I am using the nhl_games_feed function provided by nhlapi to pull all events, from which I select the goals. This function returns a JSON blob (list of lists of lists of lists ...) in R, which I want to convert into a proper data.table.
I pasted a stripped-down version of my code below. I understand that normal practice here would be to include a sample data blob with the code so that other users can recreate my problems, but the data blob is the problem.
When I ran the full version of my code last night, the "Loop through games" portion took about 11 hours, and the "Convert players list to columns" took about 2 hours. Unless I can find a way to push the column or row filtering into the NHL's system, I don't think I am likely to find a way to speed up the "Loop through games" portion. So my first question: Does anyone have any thoughts about how to extract a subset of columns or rows using the NHL API, or do I need to pull everything and process it on my end?
My other question related to the second chunk of code ("Convert players..."), which converts the resulting event data into a single row of scalar elements per event. The event data shows up in lblob_feed[[1]]$liveData$plays$allPlays, which contains one row of scalar elements per event, except that one of the elements is ..$allPlays$players, which is itself a 4x5 dataframe. As a result, the only way that I could find to extract that data into scalar elements is the "Convert players..." loop. Is there a better way to convert this into a simple data.table?
Finally, any tips on other ways to end up with a comprehensive database of NHL events?
require("nhlapi")
require("data.table")
require("tidyverse")
require("hms")
assign("last.warning", NULL, envir = baseenv())
# create small list of selected games, using NHL API game code format
cSelGames <- c(2021020001, 2021020002, 2021020003, 2021020004)
liNumGames <- length(cSelGames)
print(liNumGames)
# 34370 games in the full database
# =============================================================================
# Loop through games
# Pull data for one game per call
Sys.time()
dtGoals <- data.table()
for (liGameNum in 1:liNumGames) {
# Pull the NHL feed blob for one selected game
# 11 hours in the full version
lblob_feed <- nhl_games_feed(gameId = cSelGames[liGameNum])
# Select only the play (event) portion of the feed blob
ldtFeed <- as.data.table(c(lblob_feed[[1]]$gamePk, lblob_feed[[1]]$liveData$plays$allPlays))
setnames(ldtFeed, 1, "gamePk")
# Check for games with no play data - 1995020006 has none and would kill execution
if ('result.eventCode' %in% colnames(ldtFeed)) {
# Check for missing elements in allPlays list
# team.triCode is missing for at least one game, probably for all-star games
if (!('team.triCode' %in% colnames(ldtFeed))) {ldtFeed[, ':=' (team.triCode = NA)]}
if (!('result.strength.code' %in% colnames(ldtFeed))) {ldtFeed[, ':=' (result.strength.code = NA)]}
if (!('result.emptyNet' %in% colnames(ldtFeed))) {ldtFeed[, ':=' (result.emptyNet = NA)]}
# Select the events and columns for the output data table
ldtGoals_new <- ldtFeed[(result.eventTypeId == 'GOAL')
,list(gamePk, result.eventCode, players, result.description
, team.triCode, about.period, about.periodTime
, about.goals.away, about.goals.home, result.strength.code, result.emptyNet)]
# Append the incremental data table to the aggregate data table
dtGoals <- rbindlist(list(dtGoals, ldtGoals_new), use.names=TRUE, fill=TRUE)
}
}
# =============================================================================
# Convert players list to columns
# 2 hours in the full version
# 190686 goals in full table
# For each goal, the player dataframe is 4x5
Sys.time()
dtGoal_player <- data.table()
for (i in 1:dtGoals[,.N]) {
# convert rows with embedded dataframes into multiple rows with scalar elements
s_result.eventCode <- dtGoals[i,result.eventCode]
dtGoal_player_new <- as.data.table(dtGoals[i,players[[1]]])
dtGoal_player_new[, ':=' (result.eventCode=s_result.eventCode)]
dtGoal_player <- rbindlist(list(dtGoal_player, dtGoal_player_new), use.names=TRUE, fill=TRUE)
}
# drop players element
dtGoals[, players:=NULL]
# clean up problem with duplicated rows with playerType=Assist
dtGoal_player[, lag.playerType:=c('nomatch', playerType[-.N]), by=result.eventCode]
dtGoal_player[, playerType2:=ifelse((playerType==lag.playerType),'Assist2',playerType)]
# transpose multiple rows per event into single row with multiple columns for playerType
dtGoal_player_t <- dcast.data.table(dtGoal_player, result.eventCode ~ playerType2
, value.var='player.id', fun.aggregate=max)
# =============================================================================
# Merge players data into dtGoals
Sys.time()
dtGoals <- merge(dtGoals, dtGoal_player_t, by="result.eventCode")
Sys.time()

Using a loop to cycle though API calls in R

I have created a simple script to collect a Youtube Channels statistics. Just wondering how I could loop though a list of channel ID's instead of having to manually change the channel ID each time then re-run the script? I struggle to understand how to write loops in R.
key <- 'MyKey'
channel_id1 <- 'UCLSWNf28X3mVTxTT3_nLCcw'
url <- 'https://www.googleapis.com/youtube/v3/channels?part=statistics'
y <- paste0(url,'&id=',channel_id1,'&key=',key)
yt_channel1 <- fromJSON(txt=y)
yt_d_channel1 <- as.data.frame(do.call(c, unlist(yt_channel1, recursive=FALSE)))
Any way to store all channel ID's of interest in a list or vector then loop though them, storing results into new or the same dataframe?
i.e.
channels <- c('UCLSWNf28X3mVTxTT3_nLCcw', 'UCLSW467236VTxTT3_nLCcw', UHJKHS328787_ndncp')
for i 1:3, {
channels...
do stuff
}
Any help is greatly appreciated.
Yes, store the channel IDs in a column in a data frame. Assuming you have a data frame called my_data_frame with a column ID that contains the IDs, you can loop through the IDs like this:
key <- 'MyKey'
url <- 'https://www.googleapis.com/youtube/v3/channels?part=statistics'
for(i in 1:nrow(my_data_frame)){
y <- paste0(url,'&id=',my_data_frame$ID[i],'&key=',key)
yt_channel1 <- fromJSON(txt=y)
yt_d_channel1 <- as.data.frame(do.call(c, unlist(yt_channel1, recursive=FALSE)))
}
Notice how the ID is referenced using an index i which will count from 1 until the number of rows in your data frame.
Note, this code will not work as you will need to come up with a way to store the results.

extracting list-in-a-list-in-a-list to build dataframe in R

I am trying to build a data frame with book id, title, author, rating, collection, start and finish date from the LibraryThing api with my personal data. I am able to get a nested list fairly easily, and I have figured out how to build a data frame with everything but the dates (perhaps in not the best way but it works). My issue is with the dates.
The list I'm working with normally has 20 elements, but it adds the startfinishdates element only if I added dates to the book in my account. This is causing two issues:
If it was always there, I could extract it like everything else and it would have NA most of the time, and I could use cbind to get it lined up correctly with the other information
When I extract it using the name, and get an object with less elements, I don't have a way to join it back to everything else (it doesn't have the book id)
Ultimately, I want to build this data frame and an answer that tells me how to pull out the book id and associate it with each startfinishdate so I can join on book id is acceptable. I would just add that to the code I have.
I'm also open to learning a better approach from the jump and re-designing the entire thing as I have not worked with lists much in R and what I put together was after much trial and error. I do want to use R though, as ultimately I am going to use this to create an R Markdown page for my web site (for instance, a plot that shows finish dates of books).
You can run the code below and get the data (no api key required).
library(jsonlite)
library(tidyverse)
library(assertr)
data<-fromJSON("http://www.librarything.com/api_getdata.php?userid=cau83&key=392812157&max=450&showCollections=1&responseType=json&showDates=1")
books.lst<-data$books
#create df from json
create.df<-function(item){
df<-map_df(.x=books.lst,~.x[[item]])
df2 <- t(df)
return(df2)
}
ids<-create.df(1)
titles<-create.df(2)
ratings<-create.df(12)
authors<-create.df(4)
#need to get the book id when i build the date df's
startdates.df<-map_df(.x=books.lst,~.x$startfinishdates) %>% select(started_stamp,started_date)
finishdates.df<-map_df(.x=books.lst,~.x$startfinishdates) %>% select(finished_stamp,finished_date)
collections.df<-map_df(.x=books.lst,~.x$collections)
#from assertr: will create a vector of same length as df with all values concatenated
collections.v<-col_concat(collections.df, sep = ", ")
#assemble df
books.df<-as.data.frame(cbind(ids,titles,authors,ratings,collections.v))
names(books.df)<-c("ID","Title","Author","Rating","Collections")
books.df<-books.df %>% mutate(ID=as.character(ID),Title=as.character(Title),Author=as.character(Author),
Rating=as.character(Rating),Collections=as.character(Collections))
This approach is outside the tidyverse meta-package. Using base-R you can make it work using the following code.
Map will apply the user defined function to each element of data$books which is provided in the argument and extract the required fields for your data.frame. Reduce will take all the individual dataframes and merge them (or reduce) to a single data.frame booksdf.
library(jsonlite)
data<-fromJSON("http://www.librarything.com/api_getdata.php?userid=cau83&key=392812157&max=450&showCollections=1&responseType=json&showDates=1")
booksdf=Reduce(function(x,y){rbind(x,y)},
Map(function(x){
lenofelements = length(x)
if(lenofelements>20){
if(!is.null(x$startfinishdates$started_date)){
started_date = x$startfinishdates$started_date
}else{
started_date=NA
}
if(!is.null(x$startfinishdates$started_stamp)){
started_stamp = x$startfinishdates$started_date
}else{
started_stamp=NA
}
if(!is.null(x$startfinishdates$finished_date)){
finished_date = x$startfinishdates$finished_date
}else{
finished_date=NA
}
if(!is.null(x$startfinishdates$finished_stamp)){
finished_stamp = x$startfinishdates$finished_stamp
}else{
finished_stamp=NA
}
}else{
started_stamp = NA
started_date = NA
finished_stamp = NA
finished_date = NA
}
book_id = x$book_id
title = x$title
author = x$author_fl
rating = x$rating
collections = paste(unlist(x$collections),collapse = ",")
return(data.frame(ID=book_id,Title=title,Author=author,Rating=rating,
Collections=collections,Started_date=started_date,Started_stamp=started_stamp,
Finished_date=finished_date,Finished_stamp=finished_stamp))
},data$books))

Looping through variables to produce tables of percentages

I am very new to R and would appreciate any advice. I am from a STATA background and so learning to think in R. I am trying to produce tables of percentages for my 20 binary variables. I have tried a for loop but not sure where I am going wrong as there is no warning message.
for (i in 1:ncol(MAAS1r[varbinary])) {
varprop<- varbinary[i]
my.table<-table(MAAS1r[varprop])
my.prop<-prop.table(my.table)
cbind(my.table, my.prop)
}
Many thanks
I made one with an example extracted from mtcars
this are two variables that are binary (0 or 1), called VS and AM
mtcarsBivar<- mtcars[,c(8,9)]
get names of the columns:
varbinary <- colnames(mtcarsBivar)
use dplyr to do it:
library(dplyr)
make an empty list to populate
Binary_table <- list()
now fill it with the loop:
for (i in 1:length(varbinary)) {
Binary_table[[i]] <- summarise(mtcarsBivar, percent_1 = sum(mtcarsBivar[,1] == 1)/nrow(mtcarsBivar))
}
Transform it to a data frame
Binary_table <- do.call("cbind", Binary_table)
give it the name of the varbinary to the columns
colnames(Binary_table) <- varbinary
this only works if all your variables are binary

R Apply function to list and create new dataframe

I am wanting to retrieve data from several webpages that is in the same place on all the pages and put it all in one data frame.
I have the following code attempt:
library(XML)
library(plyr)
**##the urls**
raceyears<-list(url2013,url2012,url2011)
**##function that is not producing what I want**
raceyearfunction<-function(x){
page<-readLines(x)
stats<-page[10:19]
y<-read.table(textConnection(stats))
run<-data.frame(y$V1,y$V2)
colnames(run)<-c("Country","Participants")
rbind.fill(run)
}
data<-llply(raceyears,raceyearfunction)
This places all the data in multiple columns (two columns for each webpage) but I am wanting all the data in two columns (Participants, Country) one data frame not many columns in one data frame.
I haven't found a question quite like this already on the site but am open to follow a link. Thank you in advance.
You need to use rbindlist outside of raceyearfunction. Let it return(run) without rbind.fill(run).
You can use ldply instead, then it will return binded data.frame already.
library(XML)
library(plyr)
raceyears <- list(url2013,url2012,url2011)
raceyearfunction<-function(x)
{
page <- readLines(x)
stats <- page[10:19]
y <- read.table(textConnection(stats))
run <- data.frame(y$V1,y$V2)
colnames(run) <- c("Country","Participants")
return(run)
}
data<-ldply(raceyears, raceyearfunction)

Resources