How to call a script in another script in R

I have created a series of commands in R that get a job done using a specific URL. I would like to iterate the series of commands over a list of URLs that reside in a separate text file. How do I feed the list into the commands one at a time?
I do not know the proper terminology for this programming action. I've looked into scripting and batch programming, but neither is what I want to do.
# URL that comes from list
URL <- "http://www.urlfromlist.com"
# Load URL
theurl <- getURL(URL,.opts = list(ssl.verifypeer = FALSE) )
# Read the tables
tables <- readHTMLTable(theurl)
# Create a list
tables <- list.clean(tables, fun = is.null, recursive = FALSE)
# Convert the list to a data frame
df <- do.call(rbind.data.frame, tables)
# Save dataframe out as a csv file
write.csv(df2, file = dynamicname, row.names=FALSE)
The above code is what I am doing. The first variable needs to be a different URL each time from a list - rinse and repeat. Thanks!
UPDATED CODE - this runs but still does not write out any files.
# Function to pull tables from list of URLs
URLfunction<- function(x){
# URL that comes from list
URL <- x
# Load URL
theurl <- RCurl::getURL(URL,.opts = list(ssl.verifypeer = FALSE) )
# Read the tables
tables <- XML::readHTMLTable(theurl)
# Create a list
tables <- rlist::list.clean(tables, fun = is.null, recursive = FALSE)
# Convert the list to a data frame
df <- do.call(rbind,tables)
# Split date and time column out
df2 <- separate(df, "Date / Time", c("Date", "Time"), sep = " ")
# Fill the missing column with text, in this case shapename
shapename <- qdapRegex::ex_between(URL, "ndxs", ".html")
df2$Shape <- shapename
# Save dataframe out as a csv file
write.csv(result, paste0(shapename, '.csv', row.names=FALSE))
return(df2)
}
URL <- read.csv("PATH", header = FALSE)
purrr::map_df(URL, URLfunction) ## Also tried purrr::map_df(URL[,1], URLfunction)

If I understand your question correctly, my answer should work for your problem.
Libraries used:
library(RCurl)
library(XML)
library(rlist)
library(purrr)
Define the function:
URLfunction <- function(x){
  # URL that comes from list
  URL <- x
  # Load URL
  theurl <- RCurl::getURL(URL, .opts = list(ssl.verifypeer = FALSE))
  # Read the tables
  tables <- XML::readHTMLTable(theurl)
  # Remove NULL elements from the list
  tables <- rlist::list.clean(tables, fun = is.null, recursive = FALSE)
  # Convert the list to a data frame
  df <- do.call(rbind, tables)
  # Return the combined data frame
  return(df)
}
Assume you have data like the example below (I am not sure what your data actually looks like):
URL <- c("https://stackoverflow.com/questions/56139810/how-to-call-a-script-in-another-script-in-r",
"https://stackoverflow.com/questions/56122052/labelling-points-on-a-highcharter-scatter-chart/56123057?noredirect=1#comment98909916_56123057")
result<- purrr::map(URL, URLfunction)
result <- do.call(rbind, result)
write.csv is the last step. If you want a CSV written for each URL, move it into URLfunction:
write.csv(result, file = dynamicname, row.names=FALSE)
Additional
List version
URL <- list("https://stackoverflow.com/questions/56139810/how-to-call-a-script-in-another-script-in-r",
"https://stackoverflow.com/questions/56122052/labelling-points-on-a-highcharter-scatter-chart/56123057?noredirect=1#comment98909916_56123057")
result<- purrr::map_df(URL, URLfunction)
> result
asked today yesterday
1 viewed 35 times <NA>
2 active today <NA>
3 viewed <NA> 34 times
4 active <NA> today
CSV version
URL <- read.csv("PATH",header = FALSE)
result<- purrr::map_df(URL[,1], URLfunction)
> result
asked today yesterday
1 viewed 35 times <NA>
2 active today <NA>
3 viewed <NA> 34 times
4 active <NA> today
Here is an edited version of your code:
URLfunction <- function(x){
  # URL that comes from list
  URL <- x
  # Load URL
  theurl <- RCurl::getURL(URL, .opts = list(ssl.verifypeer = FALSE))
  # Read the tables
  tables <- XML::readHTMLTable(theurl)
  # Remove NULL elements from the list
  tables <- rlist::list.clean(tables, fun = is.null, recursive = FALSE)
  # Convert the list to a data frame
  df <- do.call(rbind, tables)
  # Split the date and time column out
  df2 <- tidyr::separate(df, "Date / Time", c("Date", "Time"), sep = " ")
  # Fill the missing column with text, in this case shapename
  shapename <- unlist(qdapRegex::ex_between(URL, "ndxs", ".html"))
  # qdapRegex::ex_between returns a list; when added to df2 the result could not be saved,
  # so I added 'unlist'.
  df2$Shape <- shapename
  # Save the data frame out as a csv file
  write.csv(df2, paste0(shapename, '.csv'), row.names = FALSE)
  # There were two errors here.
  # First, you named the data frame 'df2', not 'result', so I changed result --> df2.
  # Second, row.names is an argument of write.csv, not of paste0.
  return(df2)
}
After defining the above function:
URL <- c("nuforc.org/webreports/ndxsRectangle.html",
         "nuforc.org/webreports/ndxsRound.html")
RESULT <- purrr::map_df(URL, URLfunction)
Finally, I get the results below:
1. Rectangle.csv and Round.csv files in your save path.
2. A returned row-bound data frame that looks like this (2011 x 8):
> RESULT[1,]
Date Time City State Shape Duration
1 5/2/19 00:20 Honolulu HI Rectangle 3 seconds
Summary
1 Several of rectangles connected in different LED like colors. Such as red, green, blue, etc. ;above Waikiki. ((anonymous report))
Posted
1 5/9/19

Related

Converting text files to excel files in R

I have radiotelemetry data that is downloaded as a series of text files. I was provided with code in 2018 that looped through all the text files and converted them into CSV files. Up until 2021 this code worked. However, now the below code (specifically the lapply loop) returns the following error:
"Error in setnames(x, value) :
Can't assign 1 names to a 4 column data.table"
# set the working directory to the folder that contain this script, must run in RStudio
setwd(dirname(rstudioapi::callFun("getActiveDocumentContext")$path))
# get the path to the master data folder
path_to_data <- paste(getwd(), "data", sep = "/", collapse = NULL)
# extract .TXT file
files <- list.files(path=path_to_data, pattern="*.TXT", full.names=TRUE, recursive=TRUE)
# regular expression of the record we want
regex <- "^\\d*\\/\\d*\\/\\d*\\s*\\d*:\\d*:\\d*\\s*\\d*\\s*\\d*\\s*\\d*\\s*\\d*"
# vector of column names, no whitespace
columns <- c("Date", "Time", "Channel", "TagID", "Antenna", "Power")
# loop through all .TXT files, extract valid records and save to .csv files
lapply(files, function(file){
df <- read_table(file) # read the .TXT file to a DataFrame
dt <- data.table(df) # convert the dataframe to a more efficient data structure
colnames(dt) <- c("columns") # modify the column name
valid <- dt %>% filter(str_detect(col, regex)) # filter based on regular expression
valid <- separate(valid, col, into = columns, sep = "\\s+") # split into columns
towner_name <- str_sub(basename(file), start = 1 , end = 2) # extract tower name
valid$Tower <- rep(towner_name, nrow(valid)) # add Tower column
file_path <- file.path(dirname(file), paste(str_sub(basename(file), end = -5), ".csv", sep=""))
write.csv(valid, file = file_path, row.names = FALSE, quote = FALSE) # save to .csv
})
I looked up possible fixes for this and found that using "setnames(skip_absent=TRUE)" in the loop resolved the setnames error, but it instead gave the error "Error in is.data.frame(x) : argument "x" is missing, with no default"
lapply(files, function(file){
df <- read_table(file) # read the .TXT file to a DataFrame
dt <- data.table(df) # convert the dataframe to a more efficient data structure
setnames(skip_absent = TRUE)
colnames(dt) <- c("col") # modify the column name
valid <- dt %>% filter(str_detect(col, regex)) # filter based on regular expression
valid <- separate(valid, col, into = columns, sep = "\\s+") # split into columns
towner_name <- str_sub(basename(file), start = 1 , end = 2) # extract tower name
valid$Tower <- rep(towner_name, nrow(valid)) # add Tower column
file_path <- file.path(dirname(file), paste(str_sub(basename(file), end = -5), ".csv", sep=""))
write.csv(valid, file = file_path, row.names = FALSE, quote = FALSE) # save to .csv
})
I'm confused as to why this code is no longer working despite working fine last year. Any help would be greatly appreciated!
The error occurred at the line colnames(dt) <- c("columns"), where you provided only one value to rename the (supposedly) 4-column data.table. If you meant to rename a particular column, you can use
colnames(dt)[i] <- c("columns")
where i is the index of the column you are renaming. Alternatively, provide a vector with 4 new names.
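Alternatively, since the later filter and separate steps expect a single text column named col, one sketch that sidesteps the column-count mismatch entirely is to read raw lines instead of letting read_table guess the columns (a swapped-in approach, untested on the original files; regex and columns are the objects defined in the question):
library(data.table)
library(dplyr)
library(stringr)
library(tidyr)
results <- lapply(files, function(file) {
  dt <- data.table(col = readLines(file))            # one text column, so one name suffices
  valid <- dt %>% filter(str_detect(col, regex))     # keep only valid records
  separate(valid, col, into = columns, sep = "\\s+") # split into the six named columns
})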

How should I approach merging (full joining) multiple (>100) CSV files with a common key but inconsistent number of rows?

Before I dive into the question, here is a similar problem that has been asked but does not yet have a solution.
So, I am working in R, and there is a folder in my working directory called columns that contains 198 similar .csv files whose names are 6-digit integers (e.g. 100000) that increase inconsistently (the file names are actually the names of each variable).
Now, I would like to full join them, but somehow I have to import all of those files into R and then join them. Naturally, I thought about using a list to contain those files and then using a loop to join them. This is the code I tried:
#These are the first 3 columns containing identifiers
matrix_starter <- read_csv("files/matrix_starter.csv")
## import_multiple_csv_files_to_R
# Purpose: Import multiple csv files to the Global Environment in R
# set working directory
setwd("columns")
# list all csv files from the current directory
list.files(pattern=".csv$") # use the pattern argument to define a common pattern for import files with regex. Here: .csv
# create a list from these files
list.filenames <- list.files(pattern=".csv$")
#list.filenames
# create an empty list that will serve as a container to receive the incoming files
list.data <- list()
# create a loop to read in your data
for (i in 1:length(list.filenames))
{
list.data[[i]] <- read.csv(list.filenames[i])
list.data[[i]] <- list.data[[i]] %>%
select(`Occupation.Title`,`X2018.Employment`) %>%
rename(`Occupation title` = `Occupation.Title`) #%>%
#rename(list.filenames[i] = `X2018.Employment`)
}
# add the names of your data to the list
names(list.data) <- list.filenames
# now you can index one of your tables like this
list.data$`113300.csv`
# or this
list.data[1]
# source: https://www.edureka.co/community/1902/how-can-i-import-multiple-csv-files-into-r
The chunk above solves the importing part. Now I have a list of .csv files. Next, I would like to join them:
for (i in 1:length(list.filenames)){
matrix_starter <- matrix_starter %>% full_join(list.data[[i]], by = `Occupation title`)
}
However, this does not work nicely. I end up with around 47,000 rows, whereas I expect only around 1,700. Please let me know your opinion.
Reading the files into R as a list and including the file name as a column can be done like this:
files <- list.files(path = path,
full.names = TRUE,
all.files = FALSE)
files <- files[!file.info(files)$isdir]
library(readxl) # read_xls comes from the readxl package
data <- lapply(files,
               function(x) {
                 data <- read_xls(x, sheet = 1)
                 data$File_name <- x # keep the source file name as a column
                 data
               })
I am assuming now that all your excel files have the same structure: the same columns and column types.
If that is the case, you can use dplyr::bind_rows to create one combined data frame.
You could of course also loop through the list and left_join the list elements, e.g. by using Reduce and merge; both options are sketched below.
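A minimal sketch of both options, assuming each element of data shares a key column named "Occupation title" (taken from the question; adjust to your actual key):
library(dplyr)
# Option 1: stack all files into one long data frame
combined <- bind_rows(data)
# Option 2: merge the list elements pairwise on the common key
# (all = TRUE gives a full join; use all.x = TRUE for a left join)
merged <- Reduce(function(x, y) merge(x, y, by = "Occupation title", all = TRUE), data)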
Update based on mihndang's comment: Is this what you are after when you ask whether there is a way to use the file name to name the columns while not including a separate file-name column?
library(dplyr)
library(stringr)
path <- "./files"
files <- list.files(path = path,
full.names = TRUE,
all.files = FALSE)
files <- files[!file.info(files)$isdir]
data <- lapply(files,
function(x) {
read.csv(x, stringsAsFactors = FALSE)
})
col1 <- paste0(str_sub(basename(files[1]), start = 1, end = -5), ": Values")
col2 <- paste0(str_sub(basename(files[1]), start = 1, end = -5), ": Character")
df1 <- data[[1]] %>%
rename(!!col1 := Value,
!!col2 := Character)
I created two simple .csv files in ./files: file1.csv and file2.csv. I read them into a list, extract the first list element (the DF), and work out the column names in two variables. I then rename the columns in the DF by passing the two variables to them. The column names include the file name.
Result:
> View(df1)
> df1
file1: Values file1: Character
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
6 6 f
7 7 g
8 8 h
9 9 i
10 10 j
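To extend this renaming to every element of the list, a small lapply sketch (assuming every file has the same Value and Character columns, and using the dplyr and stringr libraries loaded above; untested):
renamed <- lapply(seq_along(data), function(i) {
  stem <- str_sub(basename(files[i]), start = 1, end = -5)
  rename(data[[i]],
         !!paste0(stem, ": Values") := Value,
         !!paste0(stem, ": Character") := Character)
})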
I guess you are looking for:
result <- Reduce(function(x, y) merge(x, y, by = "Occupation title", all = TRUE), list.data)
which can be done using purrr's reduce as well:
result <- purrr::reduce(list.data, dplyr::full_join, by = "Occupation title")
(Note that the join key must be given as a character string, not in backticks.)
When you do a full join, it adds every combination of matching rows, which inflates the table. If you are looking for unique records, you might want to use a left join instead: keep the data frame whose rows you want as the reference on the left, and join the file you want to add on the right.
Hope this helps.
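For example, a left-join variant of the purrr call above, keeping only the keys present in the first list element:
result <- purrr::reduce(list.data, dplyr::left_join, by = "Occupation title")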

R Import files with differing number of initial rows to skip

I need to read many files into R, do some clean up, and then combine them into one data frame. The files all basically start like this:
=~=~=~=~=~=~=~=~=~=~=~= PuTTY log 2016.07.11 09:47:35 =~=~=~=~=~=~=~=~=~=~=~=
up
Upload #18
Reader: S1 Site: AA
--------- upload 18 start ---------
Type,Date,Time,Duration,Type,Tag ID,Ant,Count,Gap
E,2016-07-05,11:45:44.17,"upload 17 complete"
D,2016-07-05,11:46:24.69,00:00:00.87,HA,900_226000745055,A2,8,1102
D,2016-07-05,11:46:43.23,00:00:01.12,HA,900_226000745055,A2,10,143
The row with column headers is "Type,Date,Time,Duration,Type,Tag ID,Ant,Count,Gap". The data should have 9 columns. The problem is that the number of rows above the header string differs for every file, so I cannot simply use skip = 5. I also only need lines that begin with "D,"; everything else is messages, not data.
What is the best way to read in my files, ensuring that I have 9 columns and skipping all the junk?
I have been using the read_csv function from the readr package because thus far it has produced the fewest formatting issues. But I am open to any new ideas, including a way to read in just the lines that begin with "D,". I toyed with using read.table and skip = grep("Type,", readLines(i)), but it doesn't seem to find the header string correctly. Here's my basic code:
dataFiles <- Sys.glob("*.*")
datalist <- list()
for (i in dataFiles) {
d01 <- read_csv(i, col_names = F, na = "NA", skip = 35)
# do clean-up stuff
datalist[[i]] <- d
}
One other basic R solution is the following: read the file in by lines and get the indices of the rows that begin with "D", plus the header row. Then split these lines by ",", put them in a data.frame, and assign the names from the header row to it.
lines <- readLines(i)
dataRows <- grep("^D,", lines)            # indices of the data rows
headerLine <- lines[grep("Type,", lines)] # the header row
headerNames <- unlist(strsplit(headerLine, split = ","))
data <- as.data.frame(matrix(unlist(strsplit(lines[dataRows], ",")),
                             nrow = length(dataRows), byrow = TRUE))
names(data) <- headerNames
Output:
Type Date Time Duration Type Tag ID Ant Count Gap
1 D 2016-07-05 11:46:24.69 00:00:00.87 HA 900_226000745055 A2 8 1102
2 D 2016-07-05 11:46:43.23 00:00:01.12 HA 900_226000745055 A2 10 143
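A more compact variant of the same idea, assuming every "D," line really has nine comma-separated fields (a sketch using read.csv(text = ...), untested on the real files):
lines <- readLines(i)
header <- lines[grep("^Type,", lines)]
data <- read.csv(text = c(header, grep("^D,", lines, value = TRUE)),
                 check.names = FALSE) # keep names like "Tag ID" intact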
You can use a custom function to loop over each file, keep only the rows that start with D in the Type column, and bind them all together at the end. Drop the bind_rows if you want them as separate list elements.
load_data <- function(path) {
  require(dplyr)
  files <- list.files(path, full.names = TRUE) # file names already include the extension
  read_files <- function(x) {
    data_file <- read.csv(x, stringsAsFactors = FALSE, na.strings = c("", "NA"))
    row.number <- grep("^Type$", data_file[, 1]) # locate the header row
    colnames(data_file) <- data_file[row.number, ]
    data_file <- data_file[-(1:row.number), ]    # drop everything up to and including it
    data_file <- data_file %>%
      filter(grepl("^D", Type))                  # keep only data records
    return(data_file)
  }
  lapply(files, read_files) # return the list of data frames
}
list_of_file <- bind_rows(load_data("YOUR_FOLDER_PATH"))
If your header row always begins with the word Type, you can simply omit the skip option from your initial read and then remove any rows before the header row. Here's some code to get you started (not tested):
dataFiles <- Sys.glob("*.*")
datalist <- list()
for (i in dataFiles) {
  d01 <- read_csv(i, col_names = F, na = "NA")
  headerRow <- which(d01[, 1] == 'Type')
  d01 <- d01[-(1:headerRow), ] # this keeps all rows after the header row
  # do clean-up stuff
  datalist[[i]] <- d01
}
If you want to keep the header, you can use:
for (i in dataFiles) {
  d01 <- read_csv(i, col_names = F, na = "NA")
  headerRow <- which(d01[, 1] == 'Type')
  header <- d01[headerRow, ]                  # get names from the header row first
  d01 <- d01[-(1:headerRow), ]                # then keep all rows after it
  d01 <- setNames(d01, as.character(header))  # assign names
  # do clean-up stuff
  datalist[[i]] <- d01
}

Classifying PDF Text Documents based on the presence/absence of specific words in R

I would like to be able to import PDF documents into R and classify them as either:
Relevant (contains a specific string, for example, "tacos", within the first 100 words)
Irrelevant (DOES NOT contain "tacos" within the first 100 words)
To be more specific, I would like to address the following questions:
Does a package(s) exist in R to perform this basic classification?
If so, is it possible to generate a dataset along those lines in R if I had 2 PDF documents, with Paper1 containing at least one instance of the string "tacos" in the first 100 words and Paper2 containing NO instance of "tacos" in the first 100 words?
Any references to documentation/R packages/sample R code or mock examples related to this type of classification using R would be greatly appreciated! Thanks!
You can use the pdftools library and do something like this:
First, load the library and grab some pdf file names:
library(pdftools)
fns <- list.files("~/Documents", pattern = "\\.pdf$", full = TRUE)
fns <- sample(fns, 5) # sample of 5 pdf filenames...
Then define a function that reads a PDF file in as text and looks at the first n words. (It might be useful to check for errors, like an unknown password or things like that - my example function returns NA for such cases.)
isRelevant <- function(fn, needle, n = 100L, ...) {
  res <- try({
    txt <- pdf_text(fn)
    txt <- scan(text = txt, what = "character", quote = "", quiet = TRUE)
    any(grepl(needle, txt[1:n], ...))
  }, silent = TRUE)
  if (inherits(res, "try-error")) NA else res
}
res <- sapply(fns, isRelevant, needle = "mail", ignore.case=TRUE)
Finally, wrap it up and put it into a data frame:
data.frame(
Document = basename(fns),
Classification = dplyr::if_else(res, "relevant", "not relevant", "unknown")
)
# Document Classification
# 1 a.pdf relevant
# 2 b.pdf not relevant
# 3 c.pdf relevant
# 4 d.pdf not relevant
# 5 e.pdf relevant
While @lukeA beat me to it, I wrote another small function that uses pdftools as well. The only real difference is that lukeA's answer looks at the first n characters, while my script looks at the first n words.
This is how my approach looks:
library(pdftools)
library(dplyr) # for data_frames and bind_rows
# to find the files better
setwd("~/Desktop/pdftask/")
# list all files in the folder "pdfs"
pdf_files <- list.files("pdfs/", full.names = T)
# write a small function that takes a vector of paths to pdf-files, a search term,
# and a number of words (i.e., look at the first 100 words)
search_pdf <- function(pdf_files, search_term, n_words = 100) {
  # loop over the files
  res_list <- lapply(pdf_files, function(file) {
    # use pdftools::pdf_text to extract the text from the pdf
    content <- pdf_text(file)
    # do some cleanup, i.e., remove punctuation and newlines, and lowercase all letters
    content2 <- tolower(content)
    content2 <- gsub("\n", " ", content2) # a space, so words on adjacent lines don't merge
    content2 <- gsub("[[:punct:]]", "", content2)
    # split up the text by whitespace
    content_vec <- strsplit(content2, "\\s+")[[1]]
    # look if the search term is within the first n_words words
    found <- search_term %in% content_vec[1:n_words]
    # create a data_frame that holds our data
    res <- data_frame(file = file,
                      relevance = ifelse(found,
                                         "Relevant",
                                         "Irrelevant"))
    return(res)
  })
  # bind the data to a "tidy" data_frame
  res_df <- bind_rows(res_list)
  return(res_df)
}
search_pdf(pdf_files, search_term = "taco", n_words = 100)
# # A tibble: 3 × 2
# file relevance
# <chr> <chr>
# 1 pdfs//pdf_empty.pdf Irrelevant
# 2 pdfs//pdf_taco1.pdf Relevant
# 3 pdfs//pdf_taco_above100.pdf Irrelevant

Efficiency in assigning programmatically in R

In summary, I have a script for importing lots of data stored in several txt files. In a single file not all the rows are to be put in the same table (a DF for now, switching to DT), so for each file I select all the rows belonging to the same DF, get the DF, and assign the rows to it.
The first time I create a DF named, say, table1 I do:
name <- "table1" # in my code the value of name will depend on different factors
# and is **not** known in advance
assign(name, someRows)
Then, during execution, my code may find (in other files) other lines to be put in the table1 data frame, so:
name <- "table"
assign(name, rbindfill(get(name), someRows))
My question is: is this assign(name, ...) / get(name) pattern the best way of doing assignment programmatically? Thanks
EDIT:
Here is a simplified version of my code (each item in dataSource is the result of read.table() on one single text file):
set.seed(1)
#
dataSource <- list(data.frame(fileType = rep(letters[1:2], each=4),
id = rep(LETTERS[1:4], each=2),
var1 = as.integer(rnorm(8))),
data.frame(fileType = rep(letters[1:2], each=4),
id = rep(LETTERS[1:4], each=2),
var1 = as.integer(rnorm(8))))
# # #
#
library(plyr)
#
tablesnames <- unique(unlist(lapply(dataSource,function(x) as.character(unique(x[,1])))))
for(l in tablesnames){
temp <- lapply(dataSource, function(x) x[x[,1]==l, -1])
if(exists(l)) assign(l, rbind.fill(get(l), rbind.fill(temp))) else assign(l, rbind.fill(temp))
}
#
#
# now two data frames, a and b, are created
#
#
# different method using rbindlist in place of rbind.fill (faster and, so far, I don't have missing columns to fill)
#
rm(a,b)
library(data.table)
#
tablesnames <- unique(unlist(lapply(dataSource,function(x) as.character(unique(x[,1])))))
for(l in tablesnames){
temp <- lapply(dataSource, function(x) x[x[,1]==l, -1])
if(exists(l)) assign(l, rbindlist(list(get(l), rbindlist(temp)))) else assign(l, rbindlist(temp))
}
I would recommend using a named list, and skipping assign and get. Many of the cool R features (lapply for example) work very well on lists, and do not work with assign and get. In addition, you can easily pass a list into a function, while this can be somewhat cumbersome with groups of variables combined with assign and get.
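As a minimal sketch of the named-list idea, applied to the dataSource example above (assuming the first column always holds the table name):
library(data.table)
# stack all files, then split into one table per fileType - no assign()/get() needed
all_rows <- rbindlist(dataSource)
tables <- split(all_rows, by = "fileType", keep.by = FALSE)
tables$a # access a table by name instead of get("a")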
If you want to read a set of files into one big data.frame, I'd use something like this (assuming CSV-like text files):
library(plyr)
list_of_files = list.files(pattern = "*.csv")
big_dataframe = ldply(list_of_files, read.csv)
or if you want to keep the result in a list:
big_list = lapply(list_of_files, read.csv)
and possibly use rbind.fill:
big_dataframe = do.call("rbind.fill", big_list)
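Since the question mentions switching to data.table, an equivalent one-liner (assuming data.table is acceptable here) would be:
library(data.table)
big_dataframe <- rbindlist(big_list, fill = TRUE) # fill = TRUE pads missing columns with NA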
