I have a list of names (e.g. authors) and a PDF file that includes those names. I need to count how many times each author is mentioned in the PDF.
Let's say my table of authors is named "author" and the PDF's text is stored in "pdf" (I have already converted the PDF and stored it in R using pdf_text).
I've tried the following:
author$count <- 0
author$count <- for (i in author$name) { sum(str_count(pdf, i))}
But it didn't work. When I printed author$count, the results were NULL. Is there a way to fix this?
Unlike most other constructs in R, a for loop does not return a useful value: it always evaluates to NULL (invisibly), so assigning its result stores NULL. That makes it much less useful here. In most situations one of the vector mapping functions (lapply, vapply etc.) is better suited to the task.
In your case, vapply does the trick:
author$count <- vapply(author$name, \(i) sum(str_count(pdf, i)), integer(1L))
(If you’re using a version of R older than 4.1, you need to replace \(i) with function (i).)
Note that you do not need to assign 0 to author$count beforehand. That value would be overwritten anyway.
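A minimal, self-contained sketch of the same idea follows. The toy author table and pdf pages are made up for illustration, and count_fixed is a hypothetical base-R stand-in for str_count (it matches literal text rather than a regular expression) so that the example runs without stringr.

```r
# Toy stand-ins for the real data (assumptions, not from the question)
author <- data.frame(name = c("Smith", "Jones", "Nguyen"),
                     stringsAsFactors = FALSE)
pdf <- c("Smith and Jones (2019) extend Smith (2015).",
         "Jones replies to Smith.")

# Hypothetical helper: literal-match counts per page, similar in spirit to str_count()
count_fixed <- function(text, pattern) {
  lengths(regmatches(text, gregexpr(pattern, text, fixed = TRUE)))
}

# for() evaluates to NULL, so assigning its result loses the counts
lost <- for (i in author$name) sum(count_fixed(pdf, i))
is.null(lost)  # TRUE

# vapply() returns exactly one integer per name
author$count <- vapply(author$name,
                       function(i) sum(count_fixed(pdf, i)),
                       integer(1L), USE.NAMES = FALSE)
author
```

USE.NAMES = FALSE only keeps the new column free of element names; the counts are the same either way.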
A note on vapply vs. sapply
vapply ensures that the result of the function call actually conforms to the expected format (here: integer(1L), i.e. every element is a single integer). sapply doesn’t do this, which makes using sapply risky in non-interactive code, since it won’t notify you if there’s an error with the data. purrr::map_* behaves similarly to vapply.
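The difference can be seen in a small sketch. The inputs below are invented for illustration; the point is only how the two functions react to empty or irregular results.

```r
# With empty input, sapply() falls back to a list; vapply() keeps the declared type
empty <- character(0)
s_res <- sapply(empty, nchar)
v_res <- vapply(empty, nchar, integer(1L))
is.list(s_res)     # TRUE
is.integer(v_res)  # TRUE

# vapply() stops with an error when a result does not match the template
bad <- tryCatch(
  vapply(1:3, function(i) seq_len(i), integer(1L)),
  error = function(e) "error"
)

# sapply() silently returns a list for the same call
ragged <- sapply(1:3, function(i) seq_len(i))
is.list(ragged)    # TRUE
```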
With a for loop, the assignment has to happen inside the loop body. Loop over the index sequence so that each result can be stored at its own position:
for(i in seq_along(author$name)) {
author$count[i] <- sum(str_count(pdf, author$name[i]))
}
Related
I'm attempting to write an R script in a way that remains as automated as possible. To this end, I am trying to create a for loop to execute a function on multiple files. The outputs need to be saved as objects for the purposes of the program I am using and therefore each output from the for loop needs to have a distinct name. This is the code I have so far:
filenames <- as.list(Sys.glob("*.ab1"))
SeqOb <- list()
for (i in filenames)
{
SeqOb <- readsangerseq(i)
}
"readsangerseq" is the function I'm attempting to execute to create multiple SeqOb objects. What I've read from other discussions led me to create an empty list in which to store my output objects, but I can't seem to figure out how to make the for loop write them as distinct outputs.
If you would like to continue using the for loop and want distinct outputs instead of a list you may consider using assign(paste()) in order to give each file a unique object name. Although, as a relative newcomer to R myself, I'm starting to learn there are more elegant ways than for loops as well, such as MrFlick's answer.
for (i in seq_along(filenames)) {
  # Read the i-th file (not the index itself) and store it under a unique name
  assign(paste0("SomeNamingRule", i), readsangerseq(filenames[[i]]))
}
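For comparison, a list-based version might look like the sketch below. It is only an illustration: read_one is a hypothetical stand-in for readsangerseq, and the two small files are created on the fly so the example runs anywhere.

```r
# Create two small files so the example is self-contained (illustrative only)
tmp_dir <- tempfile("ab1_demo")
dir.create(tmp_dir)
writeLines("ACGT", file.path(tmp_dir, "sample1.ab1"))
writeLines("TTGA", file.path(tmp_dir, "sample2.ab1"))

filenames <- Sys.glob(file.path(tmp_dir, "*.ab1"))

# Hypothetical stand-in for readsangerseq(); swap in the real reader
read_one <- function(path) readLines(path)

# One list element per file, named after the file it came from
SeqOb <- lapply(filenames, read_one)
names(SeqOb) <- basename(filenames)

SeqOb[["sample1.ab1"]]
```

Each result stays addressable by file name, and nothing is added to the workspace apart from the single list.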
I've noticed that R keeps the index from for loops stored in the global environment, e.g.:
for (ii in 1:5){ }
print(ii)
# [1] 5
Is it common for people to have any need for this index after running the loop?
I never use it, and am forced to remember to add rm(ii) after every loop I run (first, because I'm anal about keeping my namespace clean and second, for memory, because I sometimes loop over lists of data.tables--in my code right now, I have 357MB-worth of dummy variables wasting space).
Is there an easy way to get around this annoyance?
Ideal would be a global option to set (a la options(keep_for_index = FALSE)); something like for(ii in 1:5, keep_index = FALSE) would be acceptable as well.
In order to do what you suggest, R would have to change the scoping rules for for loops. This will likely never happen because I'm sure there is code out there in packages that relies on it. You may not use the index after the for loop, but given that a loop can break at any time, the final iteration value isn't always known ahead of time. And offering this as a global option would again cause problems with existing code in working packages.
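A short sketch of a case where the index is useful after the loop (the numbers are made up for illustration):

```r
# Find the first integer whose square exceeds 20
for (k in 1:10) {
  if (k^2 > 20) break
}
k  # 5: the value at which the loop stopped
```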
As pointed out, it's far more common to use sapply or lapply instead of explicit loops in R. Something like
for(i in 1:4) {
lm(data[, 1] ~ data[, i])
}
becomes
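# lapply keeps each fitted model as one list element;
# sapply would try to simplify the lm objects into a matrix
lapply(1:4, function(i) {
  lm(data[, 1] ~ data[, i])
})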
You shouldn't be afraid of functions in R. After all, R is a functional language.
It's fine to use for loops for more control, but you will have to take care of removing the indexing variable with rm(), as you've pointed out. Unless you're using a different indexing variable in each loop, I'm surprised that they are piling up. I'm also surprised that, if they are data.tables, they are using additional memory, since data.tables don't make deep copies by default as far as I know. The only memory "price" you would pay is a simple pointer.
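As a rough illustration using the built-in mtcars data (chosen here only as an example), the index of an lapply call lives inside the anonymous function and leaves nothing behind:

```r
# Regress mpg (column 1) on each of columns 2 to 4 of mtcars
fits <- lapply(2:4, function(col_idx) {
  lm(mtcars[, 1] ~ mtcars[, col_idx])
})

length(fits)       # one fitted model per column
exists("col_idx")  # FALSE: the index existed only inside the function
```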
I agree with the comments above. Even if you have to use a for loop (using just side effects, not functions' return values), it would be a good idea to structure your code in several functions and store your data in lists.
However, there is a way to "hide" index and all temporary variables inside the loop - by calling the for function in a separate environment:
do.call(`for`, alist(i, 1:3, {
# ...
print(i)
# ...
}), envir = new.env())
But ... if you could put your code in a function, the solution is more elegant:
for_each <- function(x, FUN) {
for(i in x) {
FUN(i)
}
}
for_each(1:3, print)
Note that with a "for_each"-like construct you never even see the index variable.
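Another option, not mentioned in the answers above, is base R's local(), which evaluates a block in a fresh environment. The sketch below is only an illustration; the variable names are made up.

```r
# Everything created inside local() is discarded except the block's value
total <- local({
  acc <- 0
  for (loop_idx in 1:5) {
    acc <- acc + loop_idx
  }
  acc  # value of the block; only this escapes
})

total               # 15
exists("loop_idx")  # FALSE
```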
I'm having a problem with the below function:
ab<-matrix(c(1:20),nrow=4)
rownames(ab)<-c("a","b","c","d")
cd<-c("a","c")
test<-function(x,y,ID_Tag){
for(i in y) {
M_scaled<-t(scale(t(x),center=T))
a<-quantile(M_scaled[match(i,rownames(x)),])
assign(paste0("Probes_",ID_Tag,"_quan_",i),a)
}
}
test(ab,cd,"C1")
x is the dataframe/matrix
y is the string I need to search for in rownames(x)
ID_Tag is the number I use to distinguish my samples from each other.
The function runs, but none of the expected objects exist afterwards.
Hope somebody can help me
When you use assign within a function it will make the assignment to a variable that is accessible within that function only (i.e. it's like using <-). To get around this, you need to specify the envir argument in assign to be either the global environment globalenv() or the parent frame of the function. So try changing your assign statement to
assign(..., envir = parent.frame())
or
assign(..., envir = globalenv())
depending on what you want exactly (in the example you provided they are equivalent). Have a look at ?parent.frame for more info on these. Another possibility is to specify the pos argument in assign, check ?assign.
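The effect of the envir argument can be checked with a small sketch (function and variable names are hypothetical):

```r
make_local <- function() {
  assign("made_locally", 1)  # default: the function's own environment
  invisible(NULL)
}

make_in_caller <- function() {
  assign("made_in_caller", 2, envir = parent.frame())  # the caller's environment
  invisible(NULL)
}

make_local()
make_in_caller()

exists("made_locally")    # FALSE: the variable vanished with the function call
exists("made_in_caller")  # TRUE
```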
As an aside, assigning global objects from within a function can lead to various problems in general. I find it better practice in your example to return a list of objects created in the for loop rather than use assign.
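A sketch of that list-returning variant, using the data from the question; the function name row_quantiles is made up for illustration, and rows are looked up by name rather than with match():

```r
ab <- matrix(1:20, nrow = 4)
rownames(ab) <- c("a", "b", "c", "d")
cd <- c("a", "c")

row_quantiles <- function(x, y) {
  # Scale each row once, outside the per-row work
  m_scaled <- t(scale(t(x), center = TRUE))
  # One quantile vector per requested row, collected in a named list
  out <- lapply(y, function(id) quantile(m_scaled[id, ]))
  names(out) <- y
  out
}

res <- row_quantiles(ab, cd)
res$a
```

The caller decides what to name the result, and nothing is written into the global environment behind its back.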
There seem to be variations of this question, but none seem to address the situation of being in a loop AND naming an output object. How I thought this might work:
for(j in 1:3) {
for(k in 1:17){
extract_[j]km <- extract(RasterStack, SpatialPolygonsDataFrame_[j]km, layer=[k], nl=1, df=TRUE)
}
}
The extract function is from the raster package. I have already created a series of RasterStacks and SpatialPolygons and I want to pass these to a function ("extract") that has several parameters, some of which I wish to manipulate through the loop, and label the output accordingly. This is a breeze in BASH, but I can't figure this out in R.
Ultimately, I'd like to pass strings as well, but another post seems to show the way there.
EDIT: I originally posted the above function as being a single dataframe, when in fact, they are specified objects from the raster package (which are ultimately dataframes).
As Justin points out, working with a list is more in line with R's structure than cluttering the workspace with lots of named variables. With many objects in the workspace, it quickly becomes hard to keep track of which one comes next.
Your way:
for(j in 1:3) {
  assign(
    paste0("extract", j, "km"),   # paste0 avoids the need for sep=""
    your_function(                # placeholder for the function you want to apply
      get(paste0("data", j, "km"))
    )
  )
}
Personally, I prefer working with lists, so below, I convert your data objects to a list and show you how to run a function on all elements of that list. Working in this way usually removes the need to use strings in the "get" and "assign" fashion.
# just converting your variables to a list
data.list <- mget(grep("data",ls(),value=TRUE),envir=.GlobalEnv)
# then output results
result.list <- lapply(data.list,your_function)
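A self-contained sketch of the list workflow follows. The three small data frames and the body of your_function are invented for illustration; only the mget/lapply pattern comes from the answer.

```r
# Toy objects following the data<j>km naming pattern (illustrative only)
data1km <- data.frame(value = c(1, 2, 3))
data2km <- data.frame(value = c(10, 20))
data3km <- data.frame(value = 5)

# Collect the objects by name into one named list
data.list <- mget(paste0("data", 1:3, "km"))

# Placeholder for the real processing step
your_function <- function(d) mean(d$value)

result.list <- lapply(data.list, your_function)
result.list$data2km  # 15
```

Building the names with paste0 instead of grep avoids accidentally picking up unrelated objects whose names happen to contain "data".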
I am trying to source multiple functions, that differ by a number in the name.
For example: func1, func2.
I tried using "func_1", and "func_2", as well as putting the number first, "1func" and "2func". No matter how I index the function names, the source function just reads in one function that it calls "func" - which is not what I want.
I have tried using for-loops and sapply:
for-loop:
func.list <- list.files(path="/some_path",pattern="some pattern",full.names=TRUE)
for(i in 1:length(func.list)){
source(func.list[i])
}
sapply:
sapply(func.list,FUN=source)
I am going to be writing multiple versions of a data correction function, and would really like to be able to index them - because giving a concise, but specific, name would be difficult, and not allow me to selectively source just the function files from their directory.
In my code, func.list gives the output (I have replaced the actual directory because of privacy/contractual issues):
[1] "mypath/1resp.correction.R"
[2] "mypath/2resp.correction.R"
Then when I source func.list with either the for-loop or sapply code (listed above), R only loads one function named resp.correction, with the code body from "2resp.correction.R".
The argument to source is a file name, not a function name. So you cannot be fancy here: you need to provide the exact filenames.
It sounds like your two files contain the definitions of a function with the same name (resp.correction) in both files, so yes, as you source one file after the other, the function is overwritten in your global environment.
You could, inside your loop, reassign the function to a different name:
func.list <- list.files(path="/some_path",pattern="some pattern",full.names=TRUE)
for(i in 1:length(func.list)) {
source(func.list[i], local = TRUE)
assign(paste0("resp.correction", i), resp.correction, envir = .GlobalEnv)
}
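An alternative sketch, assuming the same file layout: source each file into its own environment and keep the versions in a named list, so nothing is overwritten. The two files and their function bodies are created here purely for illustration.

```r
# Two files that both define resp.correction (made-up bodies)
tmp_dir <- tempfile("funcs")
dir.create(tmp_dir)
writeLines("resp.correction <- function(x) x + 1",
           file.path(tmp_dir, "1resp.correction.R"))
writeLines("resp.correction <- function(x) x * 2",
           file.path(tmp_dir, "2resp.correction.R"))

func.list <- list.files(tmp_dir, pattern = "resp\\.correction\\.R$",
                        full.names = TRUE)

# Source each file into a fresh environment and pull the function out of it
versions <- lapply(func.list, function(f) {
  env <- new.env()
  source(f, local = env)
  get("resp.correction", envir = env)
})
names(versions) <- basename(func.list)

versions[["1resp.correction.R"]](10)  # 11
versions[["2resp.correction.R"]](10)  # 20
```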