Feeding R a list of webpages through a CSV - r

I have created a function that scrapes information and adds it to a data.frame. I want to feed this function a list of URLs from a .csv file, but it does not seem to work once I wrap the code in a function.
IMDB <- function(wp){
for (i in wp){
raw_data <- getURL(i)
data <- fromJSON(raw_data)
data <- as.list(data)
length(data)
final_data <- do.call(rbind, data)
Title <- final_data[c("Title"),]
ScreenWriter <- final_data[c("Writer"),]
Fdata <- cbind(Title,ScreenWriter)
Authors <- rbind(Authors, Fdata)
}
}
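One likely cause, assuming Authors is meant to be the accumulated result: inside the function, Authors is never initialised and never returned, so nothing reaches the caller. A minimal sketch of one way to restructure it follows. The names scrape_titles and fetch_json, and the CSV column name url, are assumptions for illustration and do not come from the original; the default fetcher assumes the RCurl and jsonlite packages are installed.

```r
# Hypothetical helper: downloads one URL and parses the JSON response.
# Assumes the RCurl and jsonlite packages are installed.
fetch_json <- function(u) jsonlite::fromJSON(RCurl::getURL(u))

# Collect Title and Writer for every URL and return them as one data frame.
# `fetch` is a parameter so the download step can be swapped out or tested offline.
scrape_titles <- function(urls, fetch = fetch_json) {
  rows <- lapply(urls, function(u) {
    rec <- fetch(u)
    data.frame(Title = rec[["Title"]],
               ScreenWriter = rec[["Writer"]],
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)  # returned to the caller, unlike the loop-local Authors
}

# Usage (assumes urls.csv has a header row and a column named "url"):
# wp <- read.csv("urls.csv", stringsAsFactors = FALSE)$url
# Authors <- scrape_titles(wp)
```

Building the rows in a list and binding them once also avoids depending on an Authors object that happens to exist in the global environment.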

Related

Extract rows from csv files in R

I want to extract the latitude/longitude values, plus the file name, from a set of CSV files.
I have done the following:
#libraries-----
library(readr)
library("dplyr")
library("tidyverse")
# set wd-----EXAMPLE
setwd("F:/mydata/myfiles/allcsv")
# have R read files as list -----
list <- list.files("F:/mydata/myfiles/allcsv", pattern=NULL, all.files=FALSE,
full.names=FALSE)
list
#lapply function
row.names<- c("Date=0", "Time=3", "Type=2", "Model=1", "Coordinates=nextrow", "Latitude = 38.3356", "Longitude = 51.3323")
AllData <- lapply(list, read.table,
skip=5, header=FALSE, sep=";", row.names=row.names, col.names=NULL)
PulledRows <-
lapply(AllData, function(DF)
DF[fileone$Latitude==38.3356, fileone$Longitude==51.3323]
)
# maybe i need to specify a for loop?
how my data looks
Thank you.
This should work for you. You may have to change the path if the .csv files are not in your working directory, as well as the location where the final results are saved.
results <- data.frame(Latitude=NA,Longitude=NA,FileName=NA) #create empty dataframe
for(i in seq_along(list)){ # loop through each file name in list (created above); seq_along is safe if list is empty
dat <- read_csv(list[i],col_names = FALSE) # read in the ith dataset
df <- data.frame(dat[6,1],dat[7,1],list[i]) # create new dataframe with values from dat
df[,1] <- as.numeric(str_remove(df[,1],'Latitude=')) # remove text and make numeric
df[,2] <- as.numeric(str_remove(df[,2],'Longitude='))
names(df) <- names(results) # having the same column names allows next line
results <- rbind(results,df) # 'stacks' the results dataframe and df dataframe
}
results <- na.omit(results) # remove missing values (first row)
write_csv(results,'desired/path')
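As an alternative sketch in base R, the values can be located by name rather than by row number. This assumes each file contains lines of the form Latitude = 38.3356 and Longitude = 51.3323; the exact file layout is not shown in the question, so the patterns below are assumptions. The function names extract_coord and extract_coords are hypothetical.

```r
# Pull the number that follows "<key> =" from a character vector of lines.
# Returns NA if the key is not found.
extract_coord <- function(lines, key) {
  hit <- grep(paste0("^\\s*", key, "\\s*="), lines, value = TRUE)
  if (length(hit) == 0) return(NA_real_)
  as.numeric(sub("^[^=]*=\\s*([-+0-9.eE]+).*$", "\\1", hit[1]))
}

# Build one row per file: latitude, longitude and the file name.
extract_coords <- function(files) {
  rows <- lapply(files, function(f) {
    lines <- readLines(f, warn = FALSE)
    data.frame(Latitude  = extract_coord(lines, "Latitude"),
               Longitude = extract_coord(lines, "Longitude"),
               FileName  = basename(f),
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# Usage (the directory path is the one from the question):
# files <- list.files("F:/mydata/myfiles/allcsv", full.names = TRUE)
# results <- extract_coords(files)
```

Matching on the key means the result does not depend on the coordinates always sitting in rows 6 and 7.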

Memory and optimization problems in JSTOR's XML big loop

I'm having memory and optimization problems when looping over 200,000 documents of JSTOR data for research. The documents are in XML format. More information can be found here: https://www.jstor.org/dfr/.
In the first step of the code, I transform an XML file into a tidy data frame as follows:
Transform <- function (x)
{
a <- xmlParse (x)
aTop <- xmlRoot (a)
Journal <- xmlValue(aTop[["front"]][["journal-meta"]][["journal-title-group"]][["journal-title"]])
Publisher <- xmlValue (aTop[["front"]][["journal-meta"]][["publisher"]][["publisher-name"]])
Title <- xmlValue (aTop[["front"]][["article-meta"]][["title-group"]][["article-title"]])
Year <- as.integer(xmlValue(aTop[["front"]][["article-meta"]][["pub-date"]][["year"]]))
Abstract <- xmlValue(aTop[["front"]][["article-meta"]][["abstract"]])
Language <- xmlValue(aTop[["front"]][["article-meta"]][["custom-meta-group"]][["custom-meta"]][["meta-value"]])
df <- data.frame (Journal, Publisher, Title, Year, Abstract, Language, stringsAsFactors = FALSE)
df
}
Next, I use this first function to transform a series of XML files into a single data frame:
TransformFiles <- function (pathFiles)
{
files <- list.files(pathFiles, "*.xml")
i = 2
df2 <- Transform (paste(pathFiles, files[i], sep="/", collapse=""))
while (i<=length(files))
{
df <- Transform (paste(pathFiles, files[i], sep="/", collapse=""))
df2[i,] <- df
i <- i + 1
}
data.frame(df2)
}
With more than 100,000 files it takes several hours to run. With 200,000 it eventually breaks or becomes too slow. Even on small sets, the loop noticeably slows down as it progresses. Is there something I'm doing wrong? Could I do something to optimize the code? I've already tried rbind and bind_rows instead of assigning the values directly with df2[i,] <- df.
Avoid growing an object inside a loop, as the assignment df2[i,] <- df does (which, by the way, only works if df has one row): R typically has to copy the data frame each time it is extended, so the loop gets slower as the object grows. Also avoid the bookkeeping that while requires with the iterator i.
Instead, consider building a list of data frames with lapply, which you can then rbind together in one call outside the loop.
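TransformFiles <- function (pathFiles)
{
  # pattern is a regular expression, so match the .xml extension explicitly
  files <- list.files(pathFiles, "\\.xml$", full.names = TRUE)
  # one single-row data frame per file
  df_list <- lapply(files, Transform)
  # bind all rows in a single call
  final_df <- do.call(rbind, unname(df_list))
  # ALTERNATIVES FOR POSSIBLE PERFORMANCE:
  # final_df <- data.table::rbindlist(df_list)
  # final_df <- dplyr::bind_rows(df_list)
  # final_df <- plyr::rbind.fill(df_list)
  final_df  # return the result visibly
}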
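To illustrate the pattern without any XML files, here is a toy sketch. toy_transform is a hypothetical stand-in for Transform that returns a one-row data frame; the point is only that collecting rows in a list and binding once gives the same table as growing a data frame row by row.

```r
# Hypothetical stand-in for Transform(): one input -> one-row data frame.
toy_transform <- function(x) {
  data.frame(id = x, label = paste0("doc", x), stringsAsFactors = FALSE)
}

inputs <- 1:5

# Growing approach (what the question does): the data frame is extended
# on every iteration.
grown <- toy_transform(inputs[1])
for (i in seq_along(inputs)[-1]) {
  grown[i, ] <- toy_transform(inputs[i])
}

# List approach: build all rows first, then bind them in one call.
bound <- do.call(rbind, lapply(inputs, toy_transform))

# Both routes give the same columns.
stopifnot(isTRUE(all.equal(grown$id, bound$id)),
          identical(grown$label, bound$label))
```

On five rows the difference is invisible; the list approach is expected to pay off as the number of files grows.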

R Function with for Loops creates several dataframes, how to have each one have a different name

I have a for loop that loops through a list of urls,
url_list <- c('http://www.irs.gov/pub/irs-soi/04in21id.xls',
'http://www.irs.gov/pub/irs-soi/05in21id.xls',
'http://www.irs.gov/pub/irs-soi/06in21id.xls',
'http://www.irs.gov/pub/irs-soi/07in21id.xls',
'http://www.irs.gov/pub/irs-soi/08in21id.xls',
'http://www.irs.gov/pub/irs-soi/09in21id.xls',
'http://www.irs.gov/pub/irs-soi/10in21id.xls',
'http://www.irs.gov/pub/irs-soi/11in21id.xls',
'http://www.irs.gov/pub/irs-soi/12in21id.xls',
'http://www.irs.gov/pub/irs-soi/13in21id.xls',
'http://www.irs.gov/pub/irs-soi/14in21id.xls',
'http://www.irs.gov/pub/irs-soi/15in21id.xls')
downloads an Excel file from each one, assigns it to a data frame, and performs a set of data-cleaning operations on it.
library(gdata)
for (url in url_list){
test <- read.xls(url)
cols <- c(1,4:5,97:98)
test <- test[-(1:8),cols]
test <- test[1:22,]
test <- test[-4,]
test$Income <-test$Table.2.1...Returns.with.Itemized.Deductions..Sources.of.Income..Adjustments..Itemized.Deductions.by.Type..Exemptions..and.Tax..Items..by.Size.of.Adjusted.Gross.Income..Tax.Year.2015..Filing.Year.2016.
test$Total_returns <- test$X.2
test$return_dollars <- test$X.3
test$charitable_deductions <- test$X.95
test$charitable_deduction_dollars <- test$X.96
test[1:5] <- NULL
}
My problem is that the loop simply writes over the same dataframe for each iteration through the loop. How can I have it assign each iteration through the loop to a data frame with a different name?
Use assign. This question is a duplicate of this post: Change variable name in for loop using R
For your particular case, you can do something like the following:
for (i in 1:length(url_list)){
url = url_list[i]
test <- read.xls(url)
cols <- c(1,4:5,97:98)
test <- test[-(1:8),cols]
test <- test[1:22,]
test <- test[-4,]
test$Income <-test$Table.2.1...Returns.with.Itemized.Deductions..Sources.of.Income..Adjustments..Itemized.Deductions.by.Type..Exemptions..and.Tax..Items..by.Size.of.Adjusted.Gross.Income..Tax.Year.2015..Filing.Year.2016.
test$Total_returns <- test$X.2
test$return_dollars <- test$X.3
test$charitable_deductions <- test$X.95
test$charitable_deduction_dollars <- test$X.96
test[1:5] <- NULL
assign(paste("test", i, sep=""), test)
}
You could write to a list:
result_list <- list()
for (i_url in 1:length(url_list)){
url <- url_list[i_url]
...
result_list[[i_url]] <- test
}
You can also name the list
names(result_list) <- c("df1","df2","df3",...)
Here's another approach with lapply instead of for loops which will write all resulting data.frames as separate list items which can then be re-named (if needed).
url_list <- c('http://www.irs.gov/pub/irs-soi/04in21id.xls',
...
'http://www.irs.gov/pub/irs-soi/15in21id.xls')
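readURLFunc <- function(z){
  # readxl reads local files only, so download to a temporary file first
  tmp <- tempfile(fileext = ".xls")
  download.file(z, tmp, mode = "wb")
  test <- readxl::read_xls(tmp)
  ...
  test[1:5] <- NULL
  return(test)
}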
data_list <- lapply(url_list, readURLFunc)

Creating function to parse multiple xml files in r

I have code that parses a single XML file containing the data feed of a football match. However, I have data for more than 300 games, and I want to apply this code to all of these feeds, since doing it by hand would take a long time. I'm new to data science, and although I have seen other posts about parsing multiple XML files, I don't really know how to change the code so that it suits this data structure.
library(XML)
library(plyr)
library(gdata)
library(reshape)
f24 <- file.choose() #XML FILE TO BE PARSED
grabAll <- function(XML.parsed, field){
parse.field <- xpathSApply(XML.parsed, paste("//", field, "[@*]", sep=""))
results <- t(sapply(parse.field, function(x) xmlAttrs(x)))
if(typeof(results)=="list"){
do.call(rbind.fill, lapply(lapply(results, t), data.frame,
stringsAsFactors=F))
} else {
as.data.frame(results, stringsAsFactors=F)
}
}
#Play-by-Play Parsing
pbpParse <- xmlInternalTreeParse(f24)
eventInfo <- grabAll(pbpParse, "Event")
eventParse <- xpathSApply(pbpParse, "//Event")
NInfo <- sapply(eventParse, function(x) sum(names(xmlChildren(x)) == "Q"))
QInfo <- grabAll(pbpParse, "Q")
EventsExpanded <- as.data.frame(lapply(eventInfo[,1:2], function(x) rep(x, NInfo)), stringsAsFactors=F)
QInfo <- cbind(EventsExpanded, QInfo)
names(QInfo)[c(1,3)] <- c("Eid", "Qid")
QInfo$value <- ifelse(is.na(QInfo$value), 1, QInfo$value)
Qual <- cast(QInfo, Eid ~ qualifier_id)
#FINAL DATA FOR ONE GAME
events <- merge(eventInfo, Qual, by.x="id", by.y="Eid", all.x=T, suffixes=c("", "Q"))
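One way to scale this to many files, sketched under assumptions: wrap the single-game steps in a function that takes a file path and returns the events data frame, then apply it to every file. In the sketch below, parse_all and the source_file column are hypothetical names, and parse_one stands for a user-written wrapper around the code above (it is not defined here). Files that fail to parse are skipped with a warning rather than stopping the whole run.

```r
# Apply a per-file parser to many files and stack the results.
# parse_one: a function(path) returning a data frame for one game.
# bind: how to stack the per-game data frames (base rbind by default).
parse_all <- function(files, parse_one,
                      bind = function(dfs) do.call(rbind, dfs)) {
  results <- lapply(files, function(f) {
    tryCatch({
      df <- parse_one(f)
      df$source_file <- basename(f)  # remember which game each row came from
      df
    }, error = function(e) {
      warning("Skipping ", f, ": ", conditionMessage(e), call. = FALSE)
      NULL
    })
  })
  results <- Filter(Negate(is.null), results)  # drop the failed files
  if (length(results) == 0) return(data.frame())
  bind(results)
}

# Usage (assumes the XML feeds sit in a folder called "feeds" and that
# parseGame wraps the single-game code above):
# files <- list.files("feeds", pattern = "\\.xml$", full.names = TRUE)
# all_events <- parse_all(files, parseGame, bind = plyr::rbind.fill)
```

Base rbind requires every game to produce the same columns. Because the qualifier columns created by cast may differ between games, passing plyr::rbind.fill (already loaded in the question) as bind is probably the safer choice here.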

Dynamic ComboBox in R GUI Development

I am trying to create a simple GUI in R. In this GUI I want to have two comboboxes: 1) a list of the data frames in the current workspace, and 2) the names of the variables in the data frame selected in the first combobox. I am trying to use the gWidgets library, but I am not sure how to do it. Can anyone in this forum help me? Thanks
I want sourceVar to be a combobox populated with the variable names of the data frame passed as the inData argument.
## list data frames in an environment
lsDF <- function(envir=.GlobalEnv) {
varNames <- ls(envir=envir)
dfs <- sapply(varNames, function(i) inherits(get(i,envir=envir),"data.frame"))
varNames[dfs]
}
testFun <- function(inData,sourceVar,targetVar,outData)
{
inData[[targetVar]] <- as.factor(inData[[sourceVar]])
assign(outData, inData, envir = .GlobalEnv)
}
lst <- list()
lst$action <- list(beginning="toFactor(",ending=")")
lst$arguments$inData <- list(type="gcombobox",lsDF())
lst$arguments$sourceVar <- list(type="gedit")
lst$arguments$targetVar <- list(type="gedit")
lst$arguments$outData <- list(type="gedit")
ggenericwidget(lst, container=gwindow("Recode to Factor"))
I wouldn't use ggenericwidget (it isn't supported in gWidgets2). Instead, create a combobox along the lines of cb <- gcombobox(character(0), cont=...), then populate it by calling its [<- method, e.g., cb[] <- lsDF().
Here is a pattern for you to work with:
library(gWidgets2)
data(mtcars)
testdf <- data.frame(a=1:3, b= 1:3)
## list data frame names in envir
dfnms <- names(Filter(is.data.frame, mget(ls())) )
lsNms <- function(d, envir=.GlobalEnv) names(get(d, envir=envir))
w <- gwindow("two combos")
g <- glayout(cont=w)
g[1,1] <- "Data frames:"
g[1,2] <- (dfs <- gcombobox(dfnms, cont=g))
g[2,1] <- "Variables:"
g[2,2] <- (vnames <- gcombobox(character(0), cont=g))
addHandlerChanged(dfs, handler=function(...) vnames[] <- lsNms(svalue(dfs)))

Resources