Huggingface BERT Tokenizer add new token - bert-language-model

I am using Huggingface BERT for an NLP task. My texts contain names of companies which are split up into subwords.
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
tokenizer.encode_plus("Somespecialcompany")
output: {'input_ids': [101, 2070, 13102, 8586, 4818, 9006, 9739, 2100, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
Now, I would like to add those names to the tokenizer IDs so they are not split up.
tokenizer.add_tokens("Somespecialcompany")
output: 1
This extends the length of the tokenizer from 30522 to 30523.
The desired output would therefore be the new ID:
tokenizer.encode_plus("Somespecialcompany")
output: 30522
But the output is the same as before:
output: {'input_ids': [101, 2070, 13102, 8586, 4818, 9006, 9739, 2100, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
So my question is: what is the right way to add new tokens to the tokenizer so that I can use them with tokenizer.encode_plus() and tokenizer.batch_encode_plus()?

I opened a bug report on GitHub, and apparently I just have to set the special_tokens argument to True:
tokenizer.add_tokens(["somecompanyname"], special_tokens=True)
output: 30522
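The behaviour asked for in the question can be illustrated with a toy model. The sketch below is not the Hugging Face implementation. It is a minimal pure-Python greedy longest-match splitter with a made-up seven-piece vocabulary (toy_vocab, wordpiece and tokenize are hypothetical names), and it assumes that added tokens are looked up as whole words before any subword splitting happens.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word greedily into the longest matching vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            # Pieces that continue a word carry the "##" prefix.
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def tokenize(text, vocab, added_tokens=frozenset()):
    """Lowercase, split on whitespace, and keep added tokens whole."""
    tokens = []
    for word in text.lower().split():
        if word in added_tokens:
            tokens.append(word)  # added tokens bypass subword splitting
        else:
            tokens.extend(wordpiece(word, vocab))
    return tokens


# Made-up vocabulary for illustration only (an assumption, not BERT's real one).
toy_vocab = {"some", "##sp", "##ec", "##ial", "##com", "##pan", "##y"}

before = tokenize("Somespecialcompany", toy_vocab)
after = tokenize("Somespecialcompany", toy_vocab,
                 added_tokens={"somespecialcompany"})
print(before)  # seven subword pieces
print(after)   # one whole token
```

In this toy model the added token has to be stored in lowercase, because the text is lowercased before the lookup. With the real library, the model's embedding matrix also has to grow to cover the new ID, which is what model.resize_token_embeddings(len(tokenizer)) is for.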

Related

Contrast for Limma - Voom

I'm doing a differential expression analysis of RNA-seq data with limma-voom. My data are from a cancer drug study with 49 samples in total; some of the patients are responders and some are not. I need some help building the contrast. I'm dealing with only one factor here, so there are only two groups.
I know it's the simplest type of data, but most of the genes come out as differentially expressed (which should not be the case); only 13% are not differentially expressed, and I think the problem has to do with the contrast. This is the design I made, coded with 1 or 0.
1 for NoResponse means there was no response, and 1 for Response means there was a response.
using dput:
structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), .Dim = c(49L,
2L), .Dimnames = list(c("Pt1", "Pt10", "Pt103", "Pt106", "Pt11",
"Pt17", "Pt2", "Pt24", "Pt26", "Pt27", "Pt28", "Pt29", "Pt31",
"Pt36", "Pt37", "Pt38", "Pt39", "Pt4", "Pt46", "Pt47", "Pt5",
"Pt52", "Pt59", "Pt62", "Pt65", "Pt66", "Pt67", "Pt77", "Pt78",
"Pt79", "Pt8", "Pt82", "Pt84", "Pt85", "Pt89", "Pt9", "Pt90",
"Pt92", "Pt98", "Pt101", "Pt18", "Pt3", "Pt30", "Pt34", "Pt44",
"Pt48", "Pt49", "Pt72", "Pt94"), c("NoResponse", "Response")), assign = c(1L,
1L), contrasts = list(Response = "contr.treatment"))
And here is my code for the analysis itself:
d0 <- DGEList(rawdata)
d0 <- calcNormFactors(d0)
Voom <- voom(d0, design, plot = TRUE)
vfit <- lmFit(Voom, design)
contrast <- makeContrasts(Response - NoResponse,
levels = colnames(coef(vfit)))
vfit <- contrasts.fit(vfit, contrasts = contrast)
efit <- eBayes(vfit)
plotSA(efit, main = 'final model: Mean-Variance trend')
The bioconductor guide didn't help.
Note: The problem is not with the data. The voom plot is very good, I'm just stuck with the contrast which is (I think) making all the mess.
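To make explicit what the contrast Response - NoResponse asks for: with a design that has one indicator column per group and no intercept, the two fitted coefficients for a gene are its two group means, and the contrast is their difference. The sketch below shows this for a single gene in plain Python with made-up numbers. It ignores voom's precision weights and the empirical Bayes step, and every name in it is hypothetical.

```python
def group_means(values, groups):
    """Return the mean of `values` for each group label."""
    totals, counts = {}, {}
    for value, group in zip(values, groups):
        totals[group] = totals.get(group, 0.0) + value
        counts[group] = counts.get(group, 0) + 1
    return {group: totals[group] / counts[group] for group in totals}


def apply_contrast(coefs, contrast):
    """Weighted sum of coefficients, e.g. Response - NoResponse."""
    return sum(weight * coefs[name] for name, weight in contrast.items())


# Hypothetical log-expression values for one gene in five samples.
expr = [5.0, 5.5, 6.0, 7.0, 8.0]
groups = ["NoResponse", "NoResponse", "NoResponse", "Response", "Response"]

# With a one-column-per-group design, the unweighted least-squares
# coefficients are simply the group means.
coefs = group_means(expr, groups)
contrast = {"Response": 1.0, "NoResponse": -1.0}

log_fold_change = apply_contrast(coefs, contrast)
print(log_fold_change)  # difference between the two group means
```

A positive value here means higher expression in responders, so the sign of the result depends only on the order of the two terms in the contrast.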

'Undefined columns selected' error when trying to calculate population attributable risk from a Cox model (using AF::AFcoxph in R)?

For a current project I am trying to calculate the population attributable risk using hazards obtained from a Cox proportional hazards model. There is a function in the package AF that does this specifically. However, when I try to run the code I get an error that says Error in [.data.frame(data, , eventvar) : undefined columns selected, and I have no idea what causes the error.
Some example code:
# Load packages
library(dplyr)
library(magrittr)
library(survival)
library(AF)
# Get data
mydata <- structure(list(id = c(7971001, 3098, 1314, 5178001, 756001, 6787002,
693, 2839001, 1186, 5897002, 6761002, 2839002, 3606001, 4530001,
3094001, 6902001, 489001, 2010, 3451, 4526002, 854001, 1942,
678, 3327, 8381001, 443002, 2920001, 5302001, 6413002, 3645001,
830, 8776001, 7289001, 1198, 3307003, 1159, 5014002, 1727001,
756, 1454, 3198002, 469001, 3823001, 2959001, 3472, 6555002,
3091002, 1047, 2060, 7759001, 906002, 5826002, 6745001, 592001,
3136, 5784001, 1194001, 335001, 2376, 2895, 1627001, 5565002,
1862, 3429, 3425, 5978001, 651, 7833001, 37, 1702, 266, 3282001,
336, 2675001, 804001), exposure = c(1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0,
1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1), event = c(0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1,
0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1,
0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0), time = c(12.7748117727584,
2.08350444900753, 14.8774811772758, 2.06981519507187, 11.0581793292266,
15.4661190965092, 4.90349075975359, 4.67898699520876, 8.4435318275154,
14.1409993155373, 14.1464750171116, 14.4394250513347, 15.6632443531828,
13.2265571526352, 14.839151266256, 9.60164271047228, 11.1567419575633,
14.8692676249144, 14.9322381930185, 5.87268993839836, 14.3928815879535,
14.2012320328542, 10.2724161533196, 13.6317590691307, 13.4401095140315,
12.2929500342231, 5.70841889117043, 14.2368240930869, 14.6858316221766,
15.8083504449008, 14.6255989048597, 15.7015742642026, 8.90349075975359,
15.0609171800137, 4.54483230663929, 1.2703627652293, 9.36892539356605,
10.258726899384, 10.6721423682409, 11.6714579055441, 13.1772758384668,
15.813826146475, 10.8911704312115, 2.51060917180014, 14.5872689938398,
12.5147159479808, 14.1656399726215, 9.18275154004107, 14.2614647501711,
5.8425735797399, 12.2108145106092, 15.9808350444901, 14.3518138261465,
9.29226557152635, 14.1464750171116, 10.113620807666, 7.37850787132101,
9.10061601642711, 14.3326488706366, 11.2689938398357, 13.1060917180014,
4.61875427789186, 8.72005475701574, 14.031485284052, 13.9000684462697,
8.65982203969884, 14.5872689938398, 2.18480492813142, 9.79603011635866,
3.40041067761807, 3.35112936344969, 0.454483230663929, 5.39082819986311,
13.5578370978782, 14.9650924024641)), row.names = c(NA, -75L), class = "data.frame")
# Fit a Cox model
cox_model <- coxph(formula=Surv(time=time, event=event, type="right") ~ 1 + exposure, data=mydata, ties="breslow")
# Calculate PAR
par_model <- AFcoxph(cox_model, data=mydata, exposure ="exposure") # Gives error
par_model <- AFcoxph(cox_model, data=mydata, exposure ="exposure", times="time") # Gives error
par_model <- AFcoxph(cox_model, data=mydata, exposure ="exposure", clusterid="id") # Gives error
par_model <- AFcoxph(cox_model, data=mydata, exposure ="exposure", times="time", clusterid="id") # Gives error
Does anyone have an idea what causes this error?
Make sure that your version of R is up to date.
> sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22000)
##snip##
other attached packages:
[1] AF_0.1.5 survival_3.2-13
There does not seem to be any flaw in your code, since I do not get the errors you are describing.
library(survival)
# install.packages('AF')
library(AF)
cox_model <- coxph(Surv(time, event) ~ exposure, data = mydata, ties="breslow")
par_model <- AFcoxph(cox_model, data = mydata, exposure ="exposure")
par_model_cluster <- AFcoxph(cox_model, data=mydata,
exposure ="exposure", clusterid="id")
# ------------------------------------------------------------------
> identical(par_model, par_model_cluster)
[1] TRUE
# ---------------------------------------
> par_model
Estimated attributable fraction (AF) and standard error :
Time AF Std.Error
4.544832 0.5875401 0.3919930
4.678987 0.5854788 0.3923111
8.659822 0.5830906 0.3926216
9.182752 0.5804960 0.3929152
9.601643 0.5777211 0.3931865
10.258727 0.5747394 0.3934271
12.292950 0.5710542 0.3937199
12.514716 0.5672505 0.3939801
13.106092 0.5631900 0.3981882
13.177276 0.5590750 0.3984770
13.226557 0.5548138 0.3987241
14.439425 0.5467442 0.3994595
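For readers wondering what the AF column represents: as far as I understand the AF package documentation, the attributable fraction at time t compares the risk in the observed population with the risk had nobody been exposed, AF(t) = 1 - (1 - S0(t)) / (1 - S(t)). The sketch below just evaluates that formula in plain Python for made-up survival probabilities. The function name is hypothetical, and this is not how AFcoxph estimates the two survival curves or their standard errors.

```python
def attributable_fraction(surv_factual, surv_counterfactual):
    """Evaluate AF(t) = 1 - (1 - S0(t)) / (1 - S(t)).

    surv_factual:        S(t), survival probability in the observed population
    surv_counterfactual: S0(t), survival probability had nobody been exposed
    """
    risk = 1.0 - surv_factual
    if risk <= 0.0:
        raise ValueError("AF is undefined when no events are expected by time t")
    return 1.0 - (1.0 - surv_counterfactual) / risk


# Made-up survival probabilities at a single time point:
# removing the exposure halves the risk from 0.20 to 0.10.
af = attributable_fraction(0.80, 0.90)
print(round(af, 3))
```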

In-Degree Bonacich Power Centrality in R?

Thank you for your time in advance. I am attempting to identify a method to calculate in-degree Bonacich Power Centrality in R. I'm a long-time UCINET user attempting to make the switch. In UCINET, this is done selecting Beta Centrality (Bonacich Power), and selecting "in-centrality" for the direction.
In R, it doesn't seem as though there is a way to calculate this using either sna or igraph packages. Here it is for bonpow in sna:
bonpow(dat, g=1, nodes=NULL, gmode="digraph", diag=FALSE, tmaxdev=FALSE,
exponent=1, rescale=FALSE, tol=1e-07)
I do specify digraph, but I am not able to replicate the analysis in R.
Similarly, here it is for power_centrality in igraph:
power_centrality(graph, nodes = V(graph), loops = FALSE,
exponent = 1, rescale = FALSE, tol = 1e-07, sparse = TRUE)
Here, there does not seem to be a way to specify that it is a directed graph (although you can specify it when defining the network). However, you can estimate it for betweenness centrality.
In neither case do I seem to be able to specify in-degree or out-degree power centrality. Any help is appreciated. Is there something either in these or in a different package that I may be overlooking?
I'm not sure what you mean by direction, since the original paper, as far as I can tell, does not deal with it. That said, a common trick for statistics that are calculated directly from the adjacency matrix is to "change the direction" by taking the transpose of that matrix (for example, when computing exposure in the netdiffuseR package we allow the user to compute "incoming" or "outgoing" exposure by just taking the transpose of the adjacency matrix). When you take the transpose, you are essentially flipping the directionality of the ties, i.e. i->j turns into j->i.
If that's what UCINET does (again, not completely sure what it is), then you can get the "incoming"/"outgoing" version by transposing the network. Here is a toy example:
# Loading the sna package (btw: igraph's implementation is a copy of
# sna's). I wrap it around suppressMessages to avoid the verbose
# print that the package has
suppressMessages(library(sna))
# This is a random graph I generated with 10 vertices
net <- structure(
c(0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1,
0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1,
0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1,
0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1,
0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0),
.Dim = c(10L, 10L)
)
# Here is the default
bonpow(net)
#> [1] -0.8921521 -0.7658658 -0.9165947 -1.4176664 -0.6151369 -0.7862345
#> [7] -0.9206684 -1.3565601 -1.0347335 -1.0062173
# Here I'm getting the transpose of the adjmat
net2 <- t(net)
# The output is different (as you can see)
bonpow(net2)
#> [1] -0.8969158 -1.1026305 -0.6336011 -0.7158869 -1.2960022 -0.9545159
#> [7] -1.1684592 -0.8845729 -1.0368018 -1.1190876
Created on 2019-11-20 by the reprex package (v0.3.0)
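The effect of the transpose is easiest to see on plain degree counts. The sketch below is a small Python illustration with a made-up three-vertex directed graph (all names are hypothetical): the out-degree of the transposed matrix equals the in-degree of the original, which is the same flip that turns an "outgoing" statistic into an "incoming" one.

```python
def transpose(adj):
    """Flip every tie i -> j into j -> i."""
    return [list(row) for row in zip(*adj)]


def out_degree(adj):
    """Number of ties each vertex sends (row sums)."""
    return [sum(row) for row in adj]


def in_degree(adj):
    """Number of ties each vertex receives (column sums)."""
    return [sum(col) for col in zip(*adj)]


# Hypothetical directed graph: 0 -> 1, 0 -> 2, 1 -> 2
adj = [
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
]

print(out_degree(adj))             # [2, 1, 0]
print(in_degree(adj))              # [0, 1, 2]
print(out_degree(transpose(adj)))  # [0, 1, 2], same as in_degree(adj)
```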

What does this error mean "order(vertex_attr(g, measure), decreasing = TRUE) : argument 1 is not a vector" in R?

I am trying to calculate robustness, a graph theory measure, using R (brainGraph package).
Robustness = robustness(my_networkgraph, type = c("vertex"), measure = ("btwn.cent"))
I get the following error, when I use the above robustness function:
Error in order(vertex_attr(g, measure), decreasing = TRUE) : argument 1 is not a vector
Any idea what I am doing wrong here?
My network, which is a matrix, was converted to an igraph object before robustness was calculated.
My network as a matrix:
mynetwork <- matrix(c(0, 1, 0, 1, 0, 0, 0, 0,
1, 0, 1, 0, 0, 0, 0, 0,
0, 1, 0, 0, 0, 0, 0, 0,
1, 0, 0, 0, 0, 1, 0, 0,
0, 0, 0, 0, 0, 1, 0, 0,
0, 0, 0, 1, 1, 0, 1, 1,
0, 0, 0, 0, 0, 1, 0, 0,
0, 0, 0, 0, 0, 1, 0, 0), nrow = 8)
This matrix was converted as igraph using the following code:
my_networkgraph <-graph_from_adjacency_matrix(mynetwork, mode = c("undirected"),weighted = NULL, diag = TRUE, add.colnames = NULL, add.rownames = NA)
Please help me to understand the above error
Thanks
Priya
There was a bug in the above function. To run the robustness code, you will need to supply a vertex attribute to your network:
V(network)$degree <- degree(network)
V(network)$btwn.cent <- centr_betw(network)$res
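Why the attribute matters can be sketched outside R. A targeted-attack robustness analysis removes vertices in decreasing order of some stored per-vertex score and records the size of the largest remaining connected component, so if the score was never stored there is nothing to sort by. The Python sketch below is not brainGraph's implementation; the function names are hypothetical, and it uses degree instead of betweenness to stay short. The adjacency matrix is the 8-vertex network from the question.

```python
from collections import deque


def largest_component(adj, alive):
    """Size of the largest connected component among the `alive` vertices."""
    seen, best = set(), 0
    for start in alive:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            v = queue.popleft()
            size += 1
            for w, tie in enumerate(adj[v]):
                if tie and w in alive and w not in seen:
                    seen.add(w)
                    queue.append(w)
        best = max(best, size)
    return best


def targeted_attack(adj, vertex_attrs, measure):
    """Remove vertices in decreasing order of a stored attribute."""
    if measure not in vertex_attrs:
        # Analogue of the R error: there is nothing to sort by.
        raise KeyError(f"vertex attribute {measure!r} has not been set")
    scores = vertex_attrs[measure]
    order = sorted(range(len(adj)), key=lambda v: scores[v], reverse=True)
    alive = set(range(len(adj)))
    sizes = [largest_component(adj, alive)]
    for v in order:
        alive.discard(v)
        sizes.append(largest_component(adj, alive))
    return sizes


# The 8-vertex network from the question, as an adjacency matrix.
adj = [
    [0, 1, 0, 1, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 1, 0, 0],
]

# Store the score first, then run the attack on it.
attrs = {"degree": [sum(row) for row in adj]}
sizes = targeted_attack(adj, attrs, "degree")
print(sizes)
```

Calling targeted_attack(adj, attrs, "btwn.cent") without first storing that score raises the KeyError, which mirrors the situation in the question.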

Using car::Anova package for a doubly-multivariate MANOVA in R

I'm trying to run a repeated-measures MANOVA in R, which also contains a number of dependent variables (key outcome variables of behavioural tasks). The repeated-measures are due to a cross-over design, in which individuals took a drug and placebo (in randomised order).
The code I'm running looks like this:
imatrix <- matrix(c(
1, 0, 0, 0, 0, 0, 1,
1, 0, 0, 0, 0, 0, -1,
0, 1, 0, 0, 0, 0, 1,
0, 1, 0, 0, 0, 0, -1,
0, 0, 1, 0, 0, 0, 1,
0, 0, 1, 0, 0, 0, -1,
0, 0, 0, 1, 0, 0, 1,
0, 0, 0, 1, 0, 0, -1,
0, 0, 0, 0, 1, 0, 1,
0, 0, 0, 0, 1, 0, -1,
0, 0, 0, 0, 0, 1, 1,
0, 0, 0, 0, 0, 1, -1
), 12, 7, byrow=TRUE)
colnames(imatrix) <- c("BCST", "CGT", "AST", "AGN", "DDT", "FERT", "NAC")
(imatrix <- list(measure=imatrix[,1:6], condition=imatrix[,7]))
contrasts(condition_factor) <- matrix(c(-1,1,1, -1), ncol=2)
doubly.mod <- lm(cbind(bcst_nac$totPersErr, bcst_placebo$totPersErr,
cantab_nac$CGT.Delay.aversion, cantab_placebo$CGT.Delay.aversion,
cantab_nac$AST.Switching.cost..Mean..correct., cantab_placebo$AST.Switching.cost..Mean..correct.,
cantab_nac$AGN.Affective.response.bias..Mean., cantab_placebo$AGN.Affective.response.bias..Mean.,
aucs_NAC, aucs_placebo,
fert_nac$FERTACCHA, fert_placebo$FERTACCHA) ~ 1)
Manova(doubly.mod, imatrix=imatrix, type =3)
The result is this error: Error in Anova.III.mlm(mod, SSPE, error.df, idata, idesign, icontrasts, :
(list) object cannot be coerced to type 'double'
However, when I change imatrix back from a list to a matrix, I get this error response:
Error in do.call(cbind, imatrix) : second argument must be a list
I've based this off the example from the car::Anova package about doubly multivariate analyses. Please let me know if you can help, or if I can add anything to make this question clearer.
