How does BertModel know to skip the attention_mask argument when applied to a single sentence?

I am creating a class that can generate sentence embeddings for both a single sentence and a list of sentences using a pretrained BertModel. In some sample code, I see the statement
outputs = self.model(tokens_tensor, segments_tensors)
which omits the attention_mask argument. Yet it produces the same result as when I do pass the attention mask tensor:
outputs = self.model(tokens_tensor, attention_tensors, segments_tensors)
When running the code on an entire dataset, attention_tensors is absolutely needed.
I understand why the attention mask is not needed for a single sentence, but how does the Python code know that the second argument is actually segments_tensors, given that the documentation lists the attention mask as the second argument?
https://huggingface.co/transformers/model_doc/bert.html

If attention_mask is not set (and is thus None), it is explicitly set to ones everywhere.
See l. 803 in modeling_bert.py.
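As a rough illustration of the two points in play here, the sketch below uses a toy stand-in for a BERT-style forward() built on plain lists. It is not the library's code: the function body, the parameter order, and the token ids are assumptions made for the example. It shows that a missing mask can be filled with ones, and that Python binds positional arguments purely by their order in the signature, never by what they contain. Parameter order can differ between library versions, so it is worth checking the signature of the installed version.

```python
from typing import List, Optional


def forward(
    input_ids: List[int],
    attention_mask: Optional[List[int]] = None,
    token_type_ids: Optional[List[int]] = None,
) -> dict:
    """Toy stand-in for a BERT-style forward(); NOT the library's code."""
    # A missing mask is replaced with all ones: attend to every token.
    if attention_mask is None:
        attention_mask = [1] * len(input_ids)
    # Missing segment ids default to zeros (a single segment).
    if token_type_ids is None:
        token_type_ids = [0] * len(input_ids)
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "token_type_ids": token_type_ids,
    }


tokens = [101, 7592, 2088, 102]  # made-up token ids for one sentence
segments = [0, 0, 0, 0]          # a single segment

# Keyword arguments make the binding explicit and order-independent.
explicit = forward(tokens, token_type_ids=segments)

# Positionally, the second value binds to whichever parameter comes
# second in the signature, which in this toy function is attention_mask.
positional = forward(tokens, segments)
```

In this sketch, explicit ends up with a mask of ones, while positional silently uses the segment ids as the mask. Passing keyword arguments (attention_mask=..., token_type_ids=...) avoids depending on the parameter order at all.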

Related

With and without quotation marks: what is the difference?

For the following code, I wish to find the cp item which has the lowest xerror:
data(iris)
install.packages("rpart")
library(rpart)
set.seed(161)
tree.model1<-rpart(Sepal.Length~., data = iris)
install.packages("rpart.plot")
library(rpart.plot)
rpart.plot(tree.model1)
tree.model2<-rpart(Sepal.Length~., data = iris, cp=0.005)
tree.model2$cptable
par(mfrow=c(1,2))
rpart.plot(tree.model1)
rpart.plot(tree.model2)
which.min(tree.model2$cptable[,"xerror"])
My question is focused on the last line. If I instead write
which.min(tree.model2$cptable[, xerror])
it doesn't work.
What is the function of the quotation marks here?
R syntax requires quotation marks when indexing by name, because the name is a character string. I assume the confusion is that xerror looks like a name you could use unquoted elsewhere, so you expect the same here. However, an unquoted xerror is treated as an object to look up in your workspace, and no object of that name exists; the quoted "xerror" is a string that is matched against the column names of cptable.
So [] (indexing) does not let you use xerror without quotation marks, but which.min(tree.model2$cptable[,4]) works, for instance, since xerror is the 4th column of the cptable (4 is just another index for "xerror").
You'll start to pick these up as you progress further using R. Another tip would be to neatly write and comment your code so both you and others can understand easily.

Why doesn't R throw an error when I use only the initial part of my column name in a data frame?

I have a data frame containing various columns along with sender_bank_flag. I ran the below two queries on my data frame.
sum(s_50k_sample$sender_bank_flag, na.rm=TRUE)
sum(s_50k_sample$sender_bank, na.rm=TRUE)
I got the same output from both queries even though there is no column named sender_bank in my data frame. I expected an error from the second line. I didn't know R had such functionality! Does anyone know what exactly this functionality is and how it can be better utilized?
Probably worthwhile to augment all comments into an answer.
Both my comment and BenBolker's point to doc page ?Extract:
Under Recursive (list-like) objects:
Both "[[" and "$" select a single element of the list. The main difference is that "$" does not allow computed indices, whereas "[[" does. x$name is equivalent to x[["name", exact = FALSE]]. Also, the partial matching behavior of "[[" can be controlled using the exact argument.
Under Character indices:
Character indices can in some circumstances be partially matched (see ?pmatch) to the names or dimnames of the object being subsetted (but never for subassignment). Unlike S (Becker et al p. 358), R never uses partial matching when extracting by "[", and partial matching is not by default used by "[[" (see argument exact).
Thus the default behaviour is to use partial matching only when extracting from recursive objects (except environments) by "$". Even in that case, warnings can be switched on by options(warnPartialMatchDollar = TRUE).
Note that the manual is rich in information; make sure you fully digest it. I formatted the content, adding Stack Overflow threads after each point where relevant.
The links provided in phiver's comment are worth reading in the long term.

How do you format multiline R package messages?

In developing an R package, I would like to use R's message() or warning() functions to produce output for my package user.
Sometimes these messages can be long. I can do this (text is just trivial example):
message("If you got to this point in the code, it means the matrix was empty. Calculation continues but you should consider re-evaluating an earlier step in the process")
Great... But for style, I also want my lines of code to be less than 80 characters, so they fit nicely in narrow screens, on GitHub, etc. And then I can use an IDE code reflow tool to reformat my message easily if it changes.
So I try this:
message("If you got to this point in the code, it means
the matrix was empty. Calculation continues but you should consider
re-evaluating an earlier step in the process")
This solves my code criteria -- it is less than 80 character lines and can reflow as expected. But that sticks the whitespace right in my message output, which I also don't want:
If you got to this point in the code, it means
the matrix was empty. Calculation continues but you should consider
re-evaluating an earlier step in the process
So I found this handy function called strwrap() that seems to solve the problem:
message(strwrap("If you got to this point in the code, it means
the matrix was empty. Calculation continues but you should consider
re-evaluating an earlier step in the process"))
Output:
If you got to this point in the code, it means the matrix was empty.
Calculation continues but you should considerre-evaluating an earlier
step in the process
Looks nice -- but it eliminated the space between "consider" and "re-evaluating" because that space was at a newline.
Another alternative is to break it up into chunks in the code:
message("If you got to this point in the code, it means ",
"the matrix was empty. Calculation continues but you should consider ",
"re-evaluating an earlier step in the process")
This makes the output look correct, but the text can no longer easily reflow with IDE, etc, because it's not one string, so this doesn't work for me on the dev side.
So: How can I make a nicely formatted message that lets me write the message easily across lines?
I have written this function:
.nicemsg = function(...) {
message(paste(strwrap(...), collapse="\n"))
}
Is there a better way using a built-in so I don't have to include this function in every R package I write?
Using a couple more arguments of strwrap makes this possible:
message(strwrap(..., prefix = " ", initial = ""))
You might be able to improve readability by playing with the order of arguments. I'm not sure if this is better or not.
message(strwrap(prefix = " ", initial = "",
"If you got to this point in the code, it means
the matrix was empty. Calculation continues but you should consider
re-evaluating an earlier step in the process"))
Or, if you prefer wrapping it in a function:
tidymess <- function(..., prefix = " ", initial = ""){
message(strwrap(..., prefix = prefix, initial = initial))
}
You can force linebreaks by adding \n in the string.
message("If you got to this point in the code,\nit means the matrix was empty.\nCalculation continues but you should consider re-evaluating\nan earlier step in the process")
# If you got to this point in the code,
# it means the matrix was empty.
# Calculation continues but you should consider re-evaluating
# an earlier step in the process

What does the "by" argument in ffbase::as.character do?

In the post below,
aggregation using ffdfdply function in R
There is a line like this.
splitby <- as.character(data$Date, by = 250000)
Just out of curiosity, I wonder what by argument means. It seems to be related to ff dataframe but I'm not sure. Google search and R documentation of as.character and as.vector provided no useful information.
I tried some examples, but the calls below all give the same result.
d <- seq.Date(Sys.Date(), Sys.Date()+10000, by = "day")
as.character(d, by=1)
as.character(d, by=10)
as.character(d, by=100)
If anybody could tell me what it is, I'd appreciate it. Thank you in advance.
Since as.character.ff uses the default as.character internally, and since ff vectors can be larger than RAM, the data needs to be processed in chunks. The partition into chunks is handled by the chunk function. In this case, the relevant method is chunk.ff_vector. By default, this calculates the chunk size by dividing getOption("ffbatchbytes") by the record size. However, this behaviour can be overridden by supplying the chunk size using by.
In the example you give, the ff vector will be converted to character 250000 members at a time.
The end result will be the same for any by or without by at all. Larger values will lead to greater temporary use of RAM but potentially quicker operation.
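The idea that the chunk size changes memory use but not the result is language-agnostic, so it is sketched here in Python rather than with ffbase itself. The function names chunks and as_character_chunked are hypothetical and only mimic the described behaviour; they are not part of ff or ffbase.

```python
from typing import Iterator, List, Sequence


def chunks(n: int, by: int) -> Iterator[range]:
    """Yield index ranges of at most `by` elements covering 0..n-1."""
    for start in range(0, n, by):
        yield range(start, min(start + by, n))


def as_character_chunked(values: Sequence[int], by: int) -> List[str]:
    """Convert values to strings one chunk at a time."""
    out: List[str] = []
    for idx in chunks(len(values), by):
        # Only this block is pulled out and converted in one step.
        block = values[idx.start:idx.stop]
        out.extend(str(v) for v in block)
    return out


data = list(range(10))
small = as_character_chunked(data, by=3)     # four small chunks
large = as_character_chunked(data, by=1000)  # one chunk covering everything
```

Both calls produce the same list of strings; only the number of elements handled per step differs.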
First, that function is ffbase::as.character, not plain old base::as.character.
See http://www.inside-r.org/packages/cran/ffbase/docs/as.character.ff
which says
as.character(x, ...)
Arguments:
x: a ff vector
...: other parameters passed on to chunk
So the by argument is being passed through to some chunk function.
Then you need to figure out which package's chunk function is being used. Type ?chunk, tell us which one, then go read its doc to see what its by argument does.

Determining argument descriptions within R

I need a way to determine the description of an argument within R.
For example, if I'm using the function qplot() from the package ggplot2, I need to be able to extract a description of each argument in qplot(). From ggplot2's reference manual, it's easy to see that there are several arguments to consider. One of them is called "data", which is described as: "data frame to use (optional). If not specified, will create one, extracting vectors from the current environment."
Is there a way to get this information from within an R session, rather than by reading a reference manual? Maybe an R function similar to packageDescription(), but for a function's arguments?
Thanks!
edit: I found a variant on my question answered here:
How to access the help/documentation .rd source files in R?
Reading the .Rd files seems like the safest way to get the information I need. For anyone interested, the following code returns a list of arguments and their descriptions, where "package_name" can be any package you want:
db <- tools::Rd_db("package_name")
lapply(db, tools:::.Rd_get_metadata, "arguments")
Thank you for your help, everyone.
From the R console in the Mac GUI R.app: when I look at the text output from help('seq', help_type="text") (which goes to a temporary file), I see that the beginning of the descriptions you want is demarcated by:
_A_r_g_u_m_e_n_t_s: # Those underscores were ^H's before I pasted
And then the arguments appear as name:description pairs:
...: arguments passed to or from methods.
from, to: the starting and (maximal) end values of the sequence. Of
length ‘1’ unless just ‘from’ is supplied as an unnamed
argument.
by: number: increment of the sequence.
length.out: desired length of the sequence. A non-negative number,
which for ‘seq’ and ‘seq.int’ will be rounded up if
fractional.
along.with: take the length from the length of this argument.
When I use a Terminal session to get that same output it appears in the same window but as a Unix help page like:
Arguments:
...: arguments passed to or from methods.
from, to: the starting and (maximal) end values of the sequence. Of
length ‘1’ unless just ‘from’ is supplied as an unnamed
argument.
by: number: increment of the sequence.
length.out: desired length of the sequence. A non-negative number,
which for ‘seq’ and ‘seq.int’ will be rounded up if
fractional.
along.with: take the length from the length of this argument.
I believe these are displayed by whatever system program is called by the value of options("pager"). In my case, that is the program "less".
