Applying a function to multiple lists - r

I am doing research on U.S. Lobbying, who publishes their data as an open API that is very poorly integrated and only seems to allow 250 observations to be downloaded at one time. I would like to compile the whole data set into one data table but am struggling with the last step to do so. This is what I have thus far
base_url <- sample("https://lda.senate.gov/api/v1/contributions/?page=", 10, rep = TRUE) #Set the number between the commas as how many pages you want
numbers <- 1:10 #Set the second number as how many pages you want
pagesize <- sample("&page_size=250", 10, rep = TRUE) #Set the number between the commas as how many pages you want
pages <- data.frame(base_url, numbers, pagesize)
pages$numbers <- as.character(pages$numbers)
pages$url <- with(pages, paste0(base_url, numbers, pagesize)) # creates list of pages you want. the list is titled pages$url
for (i in 1:length(pages$url)) assign(pages$url[i], GET(pages$url[i])) # Creates all the base lists in need of extraction
The last two things I need to do are extract the data table from the created lists and then full join all of them. I know how to join all of them but extracting the data frames is proving to be challenging. basically, to all the created lists I need to apply the function fromJSON(rawToChar(list$content)). I have tried using lapply but have yet to figure it out. any help would be greatly welcomed!

When you were assigning GET(pages$url[i])) to your data frame you were coercing it to a character vector. Better to assign it to a list and keep it as a response:
library(httr)
library(jsonlite)
library(dplyr) # for bind_rows
page_content <- list()
for (i in 1:length(pages$url)) page_content[[i]] <- GET(pages$url[i]) # Creates all the base lists in need of extraction
Then you can use the code you had written - fromJSON(rawToChar()) - to extract it from raw bytes to characters:
results_list <- lapply(
page_content,
\(page) fromJSON(rawToChar(page[["content"]]))["results"][[1]]
)
results_table <- do.call(bind_rows, results_list)
dim(results_table) # 2500 27
names(results_table)
# [1] "url" "filing_uuid" "filing_type" "filing_type_display" "filing_year"
# [6] "filing_period" "filing_period_display" "filing_document_url" "filing_document_content_type" "filer_type"
# [11] "filer_type_display" "dt_posted" "contact_name" "comments" "address_1"
# [16] "address_2" "city" "state" "state_display" "zip"
# [21] "country" "country_display" "registrant" "lobbyist" "no_contributions"
# [26] "pacs" "contribution_items"

Related

Optimizing data scrape from NHL API using R

I am a novice with R and a total newbie with the NHL API. I wrote an R program to extract all of the goals recorded in the NHL's data repository accessed through the NHL API using the R "nhlapi" package. I have code that works, but it's ugly and slow, and I wanted to see if anyone has suggestions for improving it. I am using the nhl_games_feed function provided by nhlapi to pull all events, from which I select the goals. This function returns a JSON blob (list of lists of lists of lists ...) in R, which I want to convert into a proper data.table.
I pasted a stripped-down version of my code below. I understand that normal practice here would be to include a sample data blob with the code so that other users can recreate my problems, but the data blob is the problem.
When I ran the full version of my code last night, the "Loop through games" portion took about 11 hours, and the "Convert players list to columns" took about 2 hours. Unless I can find a way to push the column or row filtering into the NHL's system, I don't think I am likely to find a way to speed up the "Loop through games" portion. So my first question: Does anyone have any thoughts about how to extract a subset of columns or rows using the NHL API, or do I need to pull everything and process it on my end?
My other question related to the second chunk of code ("Convert players..."), which converts the resulting event data into a single row of scalar elements per event. The event data shows up in lblob_feed[[1]]$liveData$plays$allPlays, which contains one row of scalar elements per event, except that one of the elements is ..$allPlays$players, which is itself a 4x5 dataframe. As a result, the only way that I could find to extract that data into scalar elements is the "Convert players..." loop. Is there a better way to convert this into a simple data.table?
Finally, any tips on other ways to end up with a comprehensive database of NHL events?
require("nhlapi")
require("data.table")
require("tidyverse")
require("hms")
assign("last.warning", NULL, envir = baseenv())
# create small list of selected games, using NHL API game code format
cSelGames <- c(2021020001, 2021020002, 2021020003, 2021020004)
liNumGames <- length(cSelGames)
print(liNumGames)
# 34370 games in the full database
# =============================================================================
# Loop through games
# Pull data for one game per call
Sys.time()
dtGoals <- data.table()
for (liGameNum in 1:liNumGames) {
# Pull the NHL feed blob for one selected game
# 11 hours in the full version
lblob_feed <- nhl_games_feed(gameId = cSelGames[liGameNum])
# Select only the play (event) portion of the feed blob
ldtFeed <- as.data.table(c(lblob_feed[[1]]$gamePk, lblob_feed[[1]]$liveData$plays$allPlays))
setnames(ldtFeed, 1, "gamePk")
# Check for games with no play data - 1995020006 has none and would kill execution
if ('result.eventCode' %in% colnames(ldtFeed)) {
# Check for missing elements in allPlays list
# team.triCode is missing for at least one game, probably for all-star games
if (!('team.triCode' %in% colnames(ldtFeed))) {ldtFeed[, ':=' (team.triCode = NA)]}
if (!('result.strength.code' %in% colnames(ldtFeed))) {ldtFeed[, ':=' (result.strength.code = NA)]}
if (!('result.emptyNet' %in% colnames(ldtFeed))) {ldtFeed[, ':=' (result.emptyNet = NA)]}
# Select the events and columns for the output data table
ldtGoals_new <- ldtFeed[(result.eventTypeId == 'GOAL')
,list(gamePk, result.eventCode, players, result.description
, team.triCode, about.period, about.periodTime
, about.goals.away, about.goals.home, result.strength.code, result.emptyNet)]
# Append the incremental data table to the aggregate data table
dtGoals <- rbindlist(list(dtGoals, ldtGoals_new), use.names=TRUE, fill=TRUE)
}
}
# =============================================================================
# Convert players list to columns
# 2 hours in the full version
# 190686 goals in full table
# For each goal, the player dataframe is 4x5
Sys.time()
dtGoal_player <- data.table()
for (i in 1:dtGoals[,.N]) {
# convert rows with embedded dataframes into multiple rows with scalar elements
s_result.eventCode <- dtGoals[i,result.eventCode]
dtGoal_player_new <- as.data.table(dtGoals[i,players[[1]]])
dtGoal_player_new[, ':=' (result.eventCode=s_result.eventCode)]
dtGoal_player <- rbindlist(list(dtGoal_player, dtGoal_player_new), use.names=TRUE, fill=TRUE)
}
# drop players element
dtGoals[, players:=NULL]
# clean up problem with duplicated rows with playerType=Assist
dtGoal_player[, lag.playerType:=c('nomatch', playerType[-.N]), by=result.eventCode]
dtGoal_player[, playerType2:=ifelse((playerType==lag.playerType),'Assist2',playerType)]
# transpose multiple rows per event into single row with multiple columns for playerType
dtGoal_player_t <- dcast.data.table(dtGoal_player, result.eventCode ~ playerType2
, value.var='player.id', fun.aggregate=max)
# =============================================================================
# Merge players data into dtGoals
Sys.time()
dtGoals <- merge(dtGoals, dtGoal_player_t, by="result.eventCode")
Sys.time()

Facing a issue with unique() function in R and class of objects

I'm merging multiple files together and then trying to get the unique data out of a particular column. This idea works perfectly fine, when I'm running the code for a single pattern.
united_tweets <- load_data("united")
nrow(united_tweets)
united_unique <- unique(united_tweets[,2])
But, When i run the same code inside a for loop , the unique function seems to create an error. The output of a unique function, or when i try to get a single column saved , the class of the variable changes from 'list' to 'factor'. Trying to find unique values from it returns NULL values. Can someone point out what is wrong here?
for(i in 1:length(airlines)){
tmp <- load_data(airlines[i])
tweet <- as.list(tmp$text)
print(class(tweet))
tmp1 <- as.list(unique.default(tweet))
print(nrow(tmp1))
}
Here is the code i used. Only two differences from yours, read.csv and length(tmp1).
## file names
airlines = c("Delta03262017123126.csv", "Delta03262017124221.csv")
for(i in 1:length(airlines)){
tmp <- read.csv(airlines[i])
tweet <- as.list(tmp$text)
print(class(tweet))
tmp1 <- as.list(unique.default(tweet))
print(length(tmp1))
}
# [1] "list"
# [1] 1
# [1] "list"
# [1] 3495

R: Reading and writing multiple csv files into a loop then using original names for output

Apologies if this may seem simple, but I can't find a workable answer anywhere on the site.
My data is in the form of a csv with the filename being a name and number. Not quite as simple as having file with a generic word and increasing number...
I've achieved exactly what i want to do with just one file, but the issue is there are a couple of hundred to do, so changing the name each time is quite tedious.
Posting my original single-batch code here in the hopes someone may be able to ease the growing tension of failed searches.
# set workspace
getwd()
setwd(".../Desktop/R Workspace")
# bring in original file, skipping first four rows
Person_7<- read.csv("PersonRound7.csv", header=TRUE, skip=4)
# cut matrix down to 4 columns
Person7<- Person_7[,c(1,2,9,17)]
# give columns names
colnames(Person7) <- c("Time","Spare", "Distance","InPeriod")
# find the empty rows, create new subset. Take 3 rows away for empty lines.
nullrow <- (which(Person7$Spare == "Velocity"))-3
Person7 <- Person7[(1:nullrow), ]
#keep 3 needed columns from matrix
Person7<- Person7[,c(1,3,4)]
colnames(Person7) <- c("Time","Distance","InPeriod")
#convert distance and time columns to factors
options(digits=9)
Person7$Distance <- as.numeric(as.character(Person7$Distance))
Person7$Time <- as.numeric(as.character(Person7$Time))
#Create the differences column for distance
Person7$Diff <- c(0, diff(Person7$Distance))
...whole heap of other stuff...
#export Minutes to an external file
write.csv(Person7_maxs, ".../Desktop/GPS Minutes/Person7.csv")
So the three part issue is as follows:
I can create a list or vector to read through the file names, but not a dataframe for each, each time (if that's even a good way to do it).
The variable names throughout the code will need to change instead of just being "Person1" "Person2", they'll be more like "Johnny1" "Lou23".
Need to export each resulting dataframe to it's own csv file with the original name.
Taking any and all suggestions on board - s.t.ruggling with this one.
Cheers!
Consider using one list of the ~200 dataframes. No need for separate named objects flooding global environment (though list2env still shown below). Hence, use lapply() to iterate through all csv files of working directory, then simply name each element of list to basename of file:
setwd(".../Desktop/R Workspace")
files <- list.files(path=getwd(), pattern=".csv")
# CREATE DATA FRAME LIST
dfList <- lapply(files, function(f) {
df <- read.csv(f, header=TRUE, skip=4)
df <- setNames(df[c(1,2,9,17)], c("Time","Spare","Distance","InPeriod"))
# ...same code referencing temp variable, df
write.csv(df_max, paste0(".../Desktop/GPS Minutes/", f))
return(df)
})
# NAME EACH ELEMENT TO CORRESPONDING FILE'S BASENAME
dfList <- setNames(dfList, gsub(".csv", "", files))
# REFERENCE A DATAFRAME WITH LIST INDEXING
str(dfList$PersonRound7) # PRINT STRUCTURE
View(dfList$PersonRound7) # VIEW DATA FRAME
dfList$PersonRound7$Time # OUTPUT ONE COLUMN
# OUTPUT ALL DFS TO SEPARATE OBJECTS (THOUGH NOT NEEDED)
list2env(dfList, envir = .GlobalEnv)

read multiple ENVI files and combine them in one csv

I'm fairly new in working with R but trying to get this done. I have dozens of ENVI spectral datasets stored in a directory. Each dataset is seperated into two files. They all have the same name convention, i.e.:
ID_YYYYMMDD_350-200nm.asr
ID_YYYYMMDD_350-200nm.hdr
The task is to read the dataset, add two columns (ID and date from filename), and store the results in a *.csv-file. I got this to work for a single file (hardcoded).
library(caTools)
setwd("D:/some/path/software_scripts")
### filename without extension
name <- "011a_20100509_350-2500nm"
### split filename in area-id and date
flaeche<-substr(name, 0, 4)
date <- as.Date((substr(name,6,13)),"%Y%m%d")
### get values from ENVI-file in a matrix
spectrum <- read.ENVI(paste(name,".esl", sep = ""), headerfile=paste(name,".hdr", sep=""))
### add columns
spectrum <- cbind(Flaeche=flaeche,Datum=as.character(date),spectrum)
### CSV-Dataset with all values
write.csv(spectrum, file = name,".csv", sep=",")
I want to combine all available files into one *.csv file. I know that I've to use list.files but have no idea, how to implement the read.ENVI function and add the resulting matrices ongoing to CSV.
Update:
library(caTools)
setwd("D:/some/path/mean")
files <- list.files() # change or leave totally empty if setwd() put you in the right spot
all_names <- sub("^([^.]*).*", "\\1", files) # strip off extensions
name <- unique(all_names) # get rid of duplicates from .esl and .hdr
# wrap your existing code in a function
mungeENVI <- function(name) {
# split filename in area-id and date
flaeche<-substr(name, 0, 4)
date <- as.Date((substr(name,6,13)),"%Y%m%d")
# get values from ENVI-file in a matrix
spectrum <- read.ENVI(paste(name,".esl", sep = ""), headerfile=paste(name,".hdr", sep=""))
# add columns
spectrum <- cbind(Flaeche=flaeche,Datum=as.character(date),spectrum)
return(spectrum)
}
# use lapply to 'loop' over each name
list_of_ENVIs <- lapply(name, mungeENVI) # returns a list
# use do.call(rbind, x) to turn it into a big data.frame
final_df <- do.call(rbind, list_of_ENVIs)
# now write output
write.csv(final_df, "all_results.csv")
you can find a sample dataset here: Sample dataset
I work with a lot of lab data where I can rely on the output files being in a reliable format (same column order, column name, header format, etc). So this is assuming that the .ENVI files you have are similar to that. If your files are not like that, I'm happy to help with that too, I'd just need to see a dummy file or two.
Anyways here's the idea:
library(caTools)
library(lubridate)
library(magrittr)
setwd("~/Binfo/TST/Stack/") # adjust as needed
files <- list.files("data/", full.name = T) # adjust as needed
all_names <- gsub("\\.\\D{3}", "", files) # strip off extensions
names1 <- unique(all_names) # get rid of duplicates
# wrap your existing code in a function
mungeENVI <- function(name) {
# split filename in area-id and date
f <- gsub(".*\\/(\\d{3}\\D)_.*", "\\1", name)
d <- gsub(".*_(\\d+)_.*", "\\1", name) %>% ymd()
# get values from ENVI-file in a matrix
spectrum <- read.ENVI(paste(name,".esl", sep = ""), headerfile=paste(name,".hdr", sep=""))
# add columns
spectrum <- cbind(Flaeche=f,Datum= as.character(d),spectrum)
return(spectrum)
}
# use lapply to 'loop' over each name
list_of_ENVIs <- lapply(names1, mungeENVI) # returns a list
# use do.call(rbind, x) to turn it into a big data.frame
final_df <- do.call(rbind, list_of_ENVIs)
# now write output
write.csv(final_df, "data/all_results.csv")
Let me know if you have any problems and we an go from there. Cheers.
I edited my answer a bit, I think the problem you were hitting is in list.files() it should have had the argument full.name = T. I also adjusted you parsing method to be a little more defensive and use grep capture expressions. I tested the code with your two example files (4 really) but I can build out a large matrix (66743 elements). Also I used lubridate, I think it's a better way to work with dates and times.

R text mining documents from multiple txt files

I have multiple txt files, each referring to a different month of the year (for many years). So, how I could analyze these files (text mining) each of these separately from a unique corpus (or something similar), by taking track of the month-year reference, thank you.
Here is an example I programmed for Game of Thrones subtitles. The subtitles are in the form 60 text files, one file for one episode in the form of S01E01 were we wanted to keep the episode information.
The following code will read the files into a list, and will turn it into a dataframe with episode information and text. You will have to adapt it to your own problem.
library(plyr)
####### Read data ######
filenames <- list.files("Set7/Game of Thrones Subtitles", pattern="*", full.names=TRUE)
filenames_short <- list.files("Set7/Game of Thrones Subtitles", pattern="*", full.names=FALSE)
ldf <- alply(.data=filenames,.margins=1,.fun=scan,what = "character", quiet = T, quote = "")
names(ldf) <- filenames_short
# Loop over all filenames
# Turns list into two columns of a dataframe, episode and word
# create empty dataframe
df_got_subs <- as.data.frame(NULL)
for (i in 1:60) {
# extract listname
# vector with list name
listenname <- filenames_short[i]
vec_listenname <- rep.int(listenname,length(ldf[[i]]))
# Doublecheck
cat("listenname: ",listenname,"\n")
# turn list element into vector
vec_subs <- as.vector(ldf[[i]])
# create dataframe from vectors
df_subs <- cbind.data.frame(vec_listenname,vec_subs,stringsAsFactors=FALSE)
# attach to the "big" dataframe
df_got_subs <- rbind.data.frame(df_got_subs,df_subs)
}
# test datastructure
str(df_got_subs)
# change column names
colnames(df_got_subs) <- c("episode","subs")
The whole text mining we did with the tidytext package from Julia Silge. I didn't post the code because she gives much better examples in this post:
http://juliasilge.com/blog/Life-Changing-Magic/
I hope this helps with your problem.

Resources