I have been trying to turn PDF files into data frames using R. I start by reading the text into R and using data.table to split it into one list item per page. I am now having trouble writing a loop that combines each question with its continuation pages. The txt.list object in the code below is a brief example of the format.
### Short list
txt.list <- list('Q1', 'Q2', 'continued page',
'Q3', 'continued page', 'continued page',
'Q4', 'Q5', 'continued page', 'continued page',
'Q6', 'continued page', 'Q7', 'continued page', 'continued page')
### Label pages that continue from the previous
is.continuation <- lapply(txt.list, function(x){ startsWith(x, 'continued')}) # find which pages are continuations
is.continuation <- c(unlist(is.continuation)) # unlist for list item naming
names(txt.list) <- as.character(is.continuation)
print(txt.list)
The result is that each page in the list that continues the preceding question is given the character label "TRUE" (I know this can be done without naming the list items; I'm just trying to avoid referring to an external vector).
Since each pdf file from this website almost always uses the same format, I am trying to make this work (at least somewhat) for future uses. I've been trying something along the lines of:
new.list <- vector(mode = 'list',
length = length(which(names(txt.list) == 'TRUE')))
for(i in 1:length(txt.list)){
j = i + 1 # pg ahead
if(names(txt.list)[[j]] == "TRUE"){
new.list[[i]][[1]] <- txt.list[[i]]
m = 2 # index ahead
while(names(txt.list)[[j]] == "TRUE"){
new.list[[i]][[m]] <- txt.list[[j]]
m = m + 1
}
} else {
new.list[[i]] <- txt.list[[i]]
}
}
After a few tries, I'm just completely drawing blanks. Any help would be much appreciated!
It's been a while since I've really worked in R, but am I misreading your for loop? Don't you need for (i in 1:length(...))? If you don't have the 1: part, then there's no range, and so you won't do any looping.
Your main issue outside of that is that you're pumping your newlist in at the 'i' location, when that variable is only appropriate for reading from txt.list. You should keep a separate tracker for new.list (such as nlSize), and tick it up whenever it's appropriate.
Another minor issue is that you have an anchor before your while loop that you can avoid.
Finally, I would definitely get away from setting the names as truth values. It would have been better to reference an external vector, though you don't have to do that either.
Just make a function and use it inside your loop.
I put my code in a function called normalizeList and then call it on txt.list. This way you can use it on other similar lists.
normalizeList <- function(lst) {
  # TRUE when a page continues the previous question
  is.continuation <- function(x) startsWith(x, 'continued')
  new.list <- list()
  nlSize <- 0  # number of entries written to new.list so far
  for (i in seq_along(lst)) {
    cur <- lst[[i]]
    if (is.continuation(cur) && nlSize > 0) {
      # append the continuation page to the current question
      new.list[[nlSize]] <- c(new.list[[nlSize]], cur)
      next
    }
    # a new question (or a leading continuation page) starts a new entry
    nlSize <- nlSize + 1
    new.list[[nlSize]] <- cur
  }
  new.list
}
normalizeList(txt.list)
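For comparison, the same grouping can also be done without an explicit loop. This is only a sketch of a different technique (cumsum() plus split(), not the loop from the answer above), and the names pages, is.cont and grouped are made up for illustration.

```r
# Example pages in the same format as txt.list from the question
pages <- c('Q1', 'Q2', 'continued page',
           'Q3', 'continued page', 'continued page',
           'Q4', 'Q5', 'continued page', 'continued page',
           'Q6', 'continued page', 'Q7', 'continued page', 'continued page')

# TRUE for pages that continue the previous question
is.cont <- startsWith(pages, 'continued')

# cumsum(!is.cont) increases by one at every new question, so each
# continuation page inherits the group number of its question
grouped <- unname(split(pages, cumsum(!is.cont)))
```

Each element of grouped is then a character vector holding one question followed by its continuation pages. This assumes the first page is a question, not a continuation.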
Here is the set of circumstances that has gotten me stuck:
The Problem
I have written a function in R that needs to run inside a for loop, with the goal of manually adjusting some values and then updating a large nested list in the global environment. I have two functions, more.points() and get.num.pts(), that ask for user input. They are part of a larger function, add.points(), which runs everything and will be wrapped in a for loop. Unfortunately, I cannot figure out how to update the nested list at the correct place in its hierarchy from within the function. I must do it from within the function so that no lines of code run in the for loop after the function call; otherwise readline() fails because it takes the next line of code as the user input. How do I update the hierarchical list object in the correct place from within the add.points() function? assign() does not appear to be up to the task, at least to my limited knowledge. Any help is greatly appreciated. For context, I am making a pipeline for aligning an atlas to brain images so that I can localize fluorescent cells to their respective brain regions.
more.points <- function(){
more.pts <- readline(prompt = "Do you need to add correspondence points to adjust the atlas registration? (y/n): ")
}
get.num.pts <- function(){
num.pts <- readline(prompt = "How many additional points are required? (You will be able to add additional points later if you need to): ")
}
add.points <- function(){
mo.pts <- as.character(more.points());
if(mo.pts == "y" || mo.pts == "Y" || mo.pts == "Yes" || mo.pts == "yes"){
while(mo.pts == "y" || mo.pts == "Y" || mo.pts == "Yes" || mo.pts == "yes") {
#ask for user input about number of new points to be created
n.pts <- as.integer(get.num.pts());
reg.fun.obj <- paste0(n.pts," updated!");
print(reg.fun.obj)
#do other stuff
#assign totally works here just fine because it isnt a hierarchical list being updated
assign("reg.obj", reg.fun.obj, envir = .GlobalEnv);
#Need to update the correct position in the list object hierarchy with new info.
assign(i.data[[reference.df$i[[i]]]][[reference.df$j[[i]]]][[reference.df$k[[i]]]], reg.obj, envir = .GlobalEnv);
#But this cannot take `i.data[[reference.df$i[[i]]]][[reference.df$j[[i]]]][[reference.df$k[[i]]]]` for the name argument. it must be a string.
mo.pts = as.character(more.points())
}
}
}
Reproducible example:
Here is an example of the global environment hierarchical list I need to update from an object within the add.points() function:
#Hierarchical List Object Example
#The image objects have more complexity in my real implementation i.e. image_1 is itself a list object with multiple attributes.
list.i <- c("channel1", "channel2", "channel3")
list.j <- c("m1", "m2", "m3")
list.k <- c("image_1", "image_2", "image_3")
k.tmp <- list()
j.tmp <- list()
i.data <- list()
for(i in seq_along(list.i)){
for(j in seq_along(list.j)){
for(k in seq_along(list.k)){
k.tmp[[k]] <- list.k[[k]]
names(k.tmp)[[k]] <- paste0("img", k)
}
j.tmp[[j]] <- k.tmp
names(j.tmp)[[j]] <- paste0("m", j)
k.tmp <- list()
}
i.data[[i]] <- j.tmp
names(i.data)[[i]] <- paste0("channel", i)
j.tmp <- list()
}
remove(k.tmp,j.tmp)
#Additional example list I am using to know which elements of the hierarchy need to be updated/adjusted as the for loop cycles.
reference.df <- data.frame(i = c(rep(1, 9), rep(2, 9), rep(3, 9)), j = c(rep(c(1, 1, 1, 2, 2, 2, 3, 3, 3),3)), k = c(rep(c(1, 2, 3),9)))
Code to run function:
reg.obj <- i.data[[reference.df$i[[i]]]][[reference.df$j[[i]]]][[reference.df$k[[i]]]]
for(i in seq_along(reference.df$k)){
add.points()
}
Remember: I am unable to run anything after the function call within the for loop because R will interpret the next line as the user input being fed to readline(). The whole point of this loop and function (getting user input, saving, and cycling to the next image for the user to provide input on) would then not happen.
For anyone else who runs into an issue like this. Don't be stupid like me. Use the return() function within your function to convert your variable into an output that you can feed into your nested list thusly:
in the function:
myfun <- function(){
#do stuff to make object containing update
return(update.obj)
}
#run the function:
list[[x]][[y]][[z]] <- myfun()
#is equivalent to below occurring outside a function:
list[[x]][[y]][[z]] <- update.obj
Yes this was stupid but hopefully I helped someone avoid my fundamental mistake here. If you can avoid it, don't use assign() in a function.
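As a rough sketch of what that looks like with a nested list: the function returns the update, and the loop writes it into the right slot. Everything here is a simplified stand-in for the question's objects. make.update() is a hypothetical replacement for the interactive add.points(), and i.data and reference.df are cut down to two images.

```r
# Hypothetical stand-in for add.points(): it returns the update
# instead of calling assign() on the global environment
make.update <- function(n.pts) {
  paste0(n.pts, " updated!")
}

# Cut-down versions of the nested list and the lookup table
i.data <- list(channel1 = list(m1 = list(img1 = "image_1", img2 = "image_2")))
reference.df <- data.frame(i = c(1, 1), j = c(1, 1), k = c(1, 2))

for (r in seq_len(nrow(reference.df))) {
  a <- reference.df$i[r]
  b <- reference.df$j[r]
  d <- reference.df$k[r]
  # the assignment happens in the caller, at the correct position
  i.data[[a]][[b]][[d]] <- make.update(r)
}
```

Because the assignment is part of the same statement as the function call, nothing runs in the loop body after the function returns.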
I'm very new at R and I would like to do a loop in order to return search volume (through an API call) for a list of keywords.
Here the code that I used :
install.packages("SEMrushR")
library(SEMrushR)
mes_keywords_to_check <- readLines("voyage.txt") # List of keywords to check
mes_keywords_to_check <- as.character(mes_keywords_to_check)
Loop
for (i in 1:length(mes_keywords_to_check)) {
test_keyword <- as.character(mes_keywords_to_check[i])
df_test_2 <- keyword_overview_all(test_keyword, "fr","API KEY NUMBER") ##keyword_overview_all is the function from the Semrush package
}
By doing this, I only get the Search Volume for the first keyword in the list. My purpose is of course to get the data required for the full list of keywords.
Here is the table that I get: (screenshot not reproduced)
Do you have any idea how I could solve this issue?
Well, you need to add your results to some kind of container, for example a list. As of now, you have just one object that gets overwritten with the data from the most recent iteration of your loop.
results = list()
for (i in 1:length(mes_keywords_to_check)) {
test_keyword <- as.character(mes_keywords_to_check[i])
df_test_2 <- keyword_overview_all(test_keyword, "fr","API KEY NUMBER") ##keyword_overview_all is the function from the Semrush package
results[[i]] <- df_test_2
}
That said, many R users would suggest avoiding the explicit loop, for example with plyr:
library("plyr")
result <- plyr::ldply(mes_keywords_to_check, function(x) keyword_overview_all(as.character(x), "fr","API KEY NUMBER"))
I did not test this, and it probably needs some tweaking, but it should point you in the right direction.
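A base R version of the same idea, as an untested-against-the-API sketch: fake_overview() below is a hypothetical stand-in for keyword_overview_all() so that the example runs without an API key, and the keywords are made up.

```r
# Hypothetical stand-in for keyword_overview_all(); returns one row per keyword
fake_overview <- function(keyword) {
  data.frame(keyword = keyword, n_chars = nchar(keyword),
             stringsAsFactors = FALSE)
}

keywords <- c("voyage", "vol", "hotel")

# one data frame per keyword, collected in a list
results <- lapply(keywords, fake_overview)

# stack the per-keyword data frames into a single data frame
final_result <- do.call(rbind, results)
```

Collecting into a list first and binding once at the end avoids growing a data frame on every iteration.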
It looks like you're reading in the text file with readLines("voyage.txt"), which returns a character vector with one element per line. These lines are then being passed to the for loop. The code below splits each line into words. There are various approaches, but this one uses a loop within a loop so that you can keep using for() and search line by line, word by word. It also uses a regex to split on non-alphanumeric characters, so punctuation around words is dropped.
mes_lines <- readLines("voyage.txt") # List of keywords to check
mes_lines <- as.character(mes_lines)
search_results <- list()
for (i in 1:length(mes_lines)) {
mes_keywords_to_check <- unlist(strsplit(mes_lines[i],"[^[:alnum:]]")) # split only the current line
mes_keywords_to_check <- mes_keywords_to_check[nchar(mes_keywords_to_check)>0]
if (length(mes_keywords_to_check)==0) next
for (w in 1:length(mes_keywords_to_check))
{
test_keyword <- as.character(mes_keywords_to_check[w])
print(paste0("Checking word=",test_keyword))
df_test_2 <- keyword_overview_all(test_keyword, "fr","API KEY NUMBER") ##keyword_overview_all is the function from the Semrush package
search_results <- append(search_results, list(df_test_2)) # wrap in list() so each data frame stays one element
}
}
search_results
Thanks for pointing to the right direction.
Here is what I did, and this is working:
final_result <- data.frame()
mes_keywords_to_check <- readLines("voyage.txt")
mes_keywords_to_check <- as.character(mes_keywords_to_check)
for (i in 1:length(mes_keywords_to_check)) {
test_keyword <- as.character(mes_keywords_to_check[i])
df_test_2 <- keyword_overview_all(test_keyword, "fr","API KEY")
final_result <- rbind(final_result,df_test_2)
}
I am getting stuck trying to incorporate the "pages" element here, along with the "type" - the pages element changes but the type remains constant.
The current code I have maps over the tickers and downloads the data into a list format. However, it only uses the default number of pages. How can I map over the tickers as well as the pages?
Adding company_filings(pages, type) doesn't work.
library(edgarWebR)
tickers <- c("63908", "793952")
count <- 100
pages <- 1:4
type = "10-K"
custom_company_filings <- function(compfilings){
company_filings(compfilings)
}
x <- Map(custom_company_filings, tickers)
Edit:
What I am trying to achieve is to do something like:
for(tick in tickers){
for(i in pages){
company_filings("get company filings")
}
}
So for each ticker I will collect pages 1, 2, 3 and 4 of the company_filings() function, which comes from the edgarWebR package.
The custom_company_filings() function was my attempt at solving the problem, but when I posted here I removed all my attempts. For instance, one attempt was:
custom_company_filings <- function(compfilings, pages){
company_filings(compfilings, pages)
}
x <- Map(custom_company_filings(page = pages), tickers)
We can use a nested lapply:
library(edgarWebR)
tickers <- c("63908", "793952")
count <- 100
pages <- 1:4
type <- "10-K"
lapply(tickers, function(x){
lapply(pages, function(y){
company_filings(x, type = type, count = count, page = y)
})
})
Or with purrr:
library(purrr)
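# stringsAsFactors = FALSE keeps the tickers as character strings rather than factors
pmap(expand.grid(tickers, pages, stringsAsFactors = FALSE), ~company_filings(..1, type = type, count = count, page = ..2))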
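To see the shape of the result without hitting the network, here is a small sketch of the ticker-by-page grid. fake_filings() is a hypothetical stand-in for company_filings() that only records the arguments it was called with, and the names grid and all_filings are made up for illustration.

```r
# Hypothetical stand-in for company_filings(); just echoes its arguments
fake_filings <- function(x, type, count, page) {
  data.frame(ticker = x, type = type, count = count, page = page,
             stringsAsFactors = FALSE)
}

tickers <- c("63908", "793952")
pages <- 1:4

# every ticker/page combination, one row each
grid <- expand.grid(ticker = tickers, page = pages, stringsAsFactors = FALSE)

res <- lapply(seq_len(nrow(grid)), function(r) {
  fake_filings(grid$ticker[r], type = "10-K", count = 100, page = grid$page[r])
})

# one combined data frame: 2 tickers x 4 pages = 8 rows
all_filings <- do.call(rbind, res)
```

Building the grid first gives a flat list with one element per combination, instead of the list of lists that the nested lapply returns.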
Here is my R Script that works just fine:
perc.rank <- function(x) trunc(rank(x)) / length(x) * 100.0
library(dplyr)
setwd("~/R/xyz")
datFm <- read.csv("yellow_point_02.csv")
datFm <- filter(datFm, HRA_ClassHRA_Final != -9999)
quant_cols <- c("CL_GammaRay_Despiked_Spline_MLR", "CT_Density_Despiked_Spline_FinalMerged",
"HRA_PC_1HRA_Final", "HRA_PC_2HRA_Final","HRA_PC_3HRA_Final",
"SRES_IMGCAL_SHIFT2VL_Slab_SHIFT2CL_DT", "Ultrasonic_DT_Despiked_Spline_MLR")
# add an extra column to datFm to store the quantile value
for (column_name in quant_cols) {
datFm[paste(column_name, "quantile", sep = "_")] <- NA
}
# initialize an empty dataframe with the new column names appended
newDatFm <- datFm[0,]
# get the unique values for the hra classes
hraClassNumV <- sort(unique(datFm$HRA_ClassHRA_Final))
# loop through the vector and create currDatFm and append it to newDatFm
for (i in hraClassNumV) {
currDatFm <- filter(datFm, HRA_ClassHRA_Final == i)
for (column_name in quant_cols) {
currDatFm <- within(currDatFm,
{
CL_GammaRay_Despiked_Spline_MLR_quantile <- perc.rank(currDatFm$CL_GammaRay_Despiked_Spline_MLR)
CT_Density_Despiked_Spline_FinalMerged_quantile <- perc.rank(currDatFm$CT_Density_Despiked_Spline_FinalMerged)
HRA_PC_1HRA_Final_quantile <- perc.rank(currDatFm$HRA_PC_1HRA_Final)
HRA_PC_2HRA_Final_quantile <- perc.rank(currDatFm$HRA_PC_2HRA_Final)
HRA_PC_3HRA_Final_quantile <- perc.rank(currDatFm$HRA_PC_3HRA_Final)
SRES_IMGCAL_SHIFT2VL_Slab_SHIFT2CL_DT_quantile <- perc.rank(currDatFm$SRES_IMGCAL_SHIFT2VL_Slab_SHIFT2CL_DT)
Ultrasonic_DT_Despiked_Spline_MLR_quantile <- perc.rank(currDatFm$Ultrasonic_DT_Despiked_Spline_MLR)
}
)
}
newDatFm <- rbind(newDatFm, currDatFm)
}
newDatFm <- newDatFm[order(newDatFm$Core_Depth),]
# head(newDatFm, 10)
write.csv(newDatFm, file = "Ricardo_quantiles.csv")
I have a few questions though. Every R book or video that I have read or watched recommends using the 'apply' family of language constructs over the classic 'for' loop, stating that apply is much faster.
So the first question is: how would you write it using apply (or tapply or some other apply)?
Second, is it really true that apply is much faster than for? The csv file 'yellow_point_02.csv' has approx. 2500 rows. This script runs almost instantly on my MacBook Pro, which has 16 GB of memory.
Third, see the 'quant_cols' vector? I created it so that I could write a generic loop (for column_name in quant_cols), but I could not make it work. So I hard-coded the column names post-fixed with '_quantile' and called 'perc.rank' many times. Is there a way this could be made dynamic? I tried the 'paste' stuff that I have in my script, but that did not work.
On the positive side though, R seems awesome in its ability to cut through the 'Data Wrangling' tasks with very few statements.
Thanks for your time.
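One possible way to make the column loop dynamic, shown only as a sketch on made-up data: build the new column name with paste0() and fill it with ave(), which applies a function within each group and returns the values in the original row order. The data frame df, its columns g, a and b, and the vector cols are hypothetical stand-ins for datFm, HRA_ClassHRA_Final and quant_cols.

```r
perc.rank <- function(x) trunc(rank(x)) / length(x) * 100.0

# Made-up data: g plays the role of the HRA class column
df <- data.frame(g = c(1, 1, 2, 2),
                 a = c(10, 20, 5, 1),
                 b = c(3, 2, 1, 0))
cols <- c("a", "b")

for (col in cols) {
  # df[[name]] accepts a computed column name, unlike df$name
  df[[paste0(col, "_quantile")]] <- ave(df[[col]], df$g, FUN = perc.rank)
}
```

Because ave() keeps the row order, there is no need to split the data frame per class and rbind the pieces back together.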
I am using R to work with a large JSON object (using the RJSONIO library). As a result, I have a lot of nested lists, which are getting somewhat cumbersome to work with. I have a simplified example below. I am trying to work with this object by creating some form of ‘getter’ and ‘setter’ functions. After looking around, I have found a pretty nice ‘getter’ function that recurses through the object and returns the value of the first matching label. This is especially great because it lends itself to chaining functions together. However, I cannot figure out a way to get the same effect for a ‘setter’ function. Any thoughts on how to create a ‘setter’ function that can be chained together in a similar fashion?
#example, simplified, object
app = list(
1,
2,
d=list(a=123,
b=456,
list(
FirstKey=list(attr1='good stuff', attr2=12345),
SecondKey=list(attr1='also good stuff', attr2=4321)
)
)
)
#Return a function that returns the value
#associated with first label that matches 'name'
getByName <- function(name){
rmatch <- function(x) {
pos <- match(name, names(x))
if (!is.na(pos))
return(x[[pos]])
for (el in x) {
if (class(el) == "list") {
out <- Recall(el)
if (!is.null(out)) return(out)
}
}
}
rmatch
}
getFirstKey <- getByName("FirstKey")
getAttr1 <- getByName("attr1")
getAttr2 <- getByName("attr2")
#I like that I can chain these functions together
getAttr1(getFirstKey(app))
getAttr2(getFirstKey(app))
# I would like to be able to do something like this
# But this won't work
### getAttr1(getFirstKey(app)) <- 9876
# This does work, but I lose the ability to chain functions together
# Closure around a replacement function
setterKeyAttr <- function(keyName, attr){
function(x, value){
x$d[[3]][[keyName]][[attr]] <- value
x
}
}
`setFirstKeyAttr2<-` <- setterKeyAttr("FirstKey", "attr2")
setFirstKeyAttr2(app) <- 22222
#check the answer is correct
getAttr2(getFirstKey(app))
references:
R decorator to change both input and output
http://r.789695.n4.nabble.com/How-to-get-a-specific-named-element-in-a-nested-list-td3037430.html
http://adv-r.had.co.nz/Functions.html
This is what I came up with. It makes the recursive function return the position of the 'name' while still being able to chain the calls together. I am not sure if this is a great way to do it, but it seems to be working. It is based on the fact that app[[c(3,3,1)]] is a valid way to index in R.
rmatch.pos <- function(object, name, seq=NA, level=NULL){
##return the vector of integers corresponding to the first match
##of 'name' to a label in object or NULL if no match is found
##object: a list, likely deeply nested
##name: the name of the label to look for
##seq: starting point to search for 'name' in 'object' i.e. c(2,3,3)
##level: don't touch this; it keeps track of how deep the recursive execution is
##can be chained together to reduce ambiguity of the result:
##obj <- list(a=1, b=list(c=2, d=list(e=1, attr1="really?", f=list(attr1 = "found me!"))))
##obj[[rmatch.pos(obj, "attr1", rmatch.pos(obj, "f"))]]
if(is.null(seq)){
#short circuit if NULL gets passed
#when chaining, this forces the whole 'chain'
#to NULL when any 'link' is NULL
return(NULL)
}
if(is.null(level)){
level <- length(na.omit(seq))
}
if(any(is.na(seq))){
temp <- object
}else{
temp <- object[[seq]]
}
level <- level + 1
pos <- match(name, names(temp))
if(!is.na(pos)){
seq[level] <- pos
return(seq)
}
for(el in seq_along(temp)){
if(class(temp[[el]]) == "list"){
seq[level] <- el
out <- Recall(object, name, seq, level)
if(!is.null(out)){
return(out)
}
}
}
}
###Examples
rmatch.pos(app, "ThirdKey")
rmatch.pos(app, "attr2")
###chaining example
rmatch.pos(app, "attr2", rmatch.pos(app, "FirstKey"))
rmatch.pos(app, "attr2", rmatch.pos(app, "SecondKey"))
rmatch.pos(app, "attr1", rmatch.pos(app, "ERROR"))
rmatch.pos(app, "ERROR", rmatch.pos(app, "attr1"))
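A position vector like this can also go on the left-hand side of an assignment, which gives a setter. The sketch below is self-contained, so it uses a compact hypothetical helper, find.path(), in place of rmatch.pos(), and wraps the assignment in a hypothetical replacement function, `setByName<-`. It only handles the first match and has no starting-point argument for chaining.

```r
# Hypothetical helper: integer path to the first label matching 'name', or NULL
find.path <- function(x, name) {
  pos <- match(name, names(x))
  if (!is.na(pos)) return(pos)
  for (i in seq_along(x)) {
    if (is.list(x[[i]])) {
      sub <- find.path(x[[i]], name)
      if (!is.null(sub)) return(c(i, sub))
    }
  }
  NULL
}

# Hypothetical replacement function: set the first element labelled 'name'
`setByName<-` <- function(x, name, value) {
  path <- find.path(x, name)
  if (is.null(path)) stop("no element named ", name)
  x[[path]] <- value  # [[<- accepts a vector index and recurses into the list
  x
}

app <- list(1, 2,
            d = list(a = 123, b = 456,
                     list(FirstKey = list(attr1 = 'good stuff', attr2 = 12345),
                          SecondKey = list(attr1 = 'also good stuff', attr2 = 4321))))

setByName(app, "attr2") <- 22222
```

After the last line, the attr2 under FirstKey holds the new value, while the one under SecondKey is untouched because only the first match is replaced.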