Huggingface BertForMaskedLM fails after 90+ iterations - jupyter-notebook

I'm using BertForMaskedLM in a Jupyter notebook. The line output = model(masked_arr) stops progressing after 89-95 iterations; each time I run the cell it stalls after a different number of iterations in that range (I verified this with print statements before the line and a counter for the iteration). I'm on an M1 MacBook Air (2020). The cell still shows as running and never raises an error, so I'm wondering whether this is an issue with BertForMaskedLM, with how I'm using it, or a hardware issue.
import numpy as np
import torch
from transformers import BertForMaskedLM

model = BertForMaskedLM.from_pretrained('bert-base-uncased')
imp = {}
for count, id_to_mask in enumerate(unique_ids):
    # mask all occurrences of the word (103 is BERT's [MASK] token id)
    masked_arr = torch.tensor([[103 if x == id_to_mask else x for x in input_ids]])
    # pass the masked document into BertForMaskedLM
    output = model(masked_arr)
    # find indices of the word in the document
    idxs = np.where(input_ids == id_to_mask)[0]
    # find the probability of the word from the BERT output
    prob = output.logits[0][idxs][:, id_to_mask]
    # summation over all occurrences of the word
    sum_prob = prob.sum()
    # debug
    print(count)
    # store in dict
    imp[id_to_mask] = sum_prob
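One thing that may be worth ruling out (an assumption on my part, not something established in the question): each forward pass builds an autograd graph by default, and storing sum_prob (a tensor) in imp keeps those graphs alive, so memory can grow steadily until the kernel stalls. A minimal sketch of the same loop run under torch.no_grad(), storing plain floats, with model, unique_ids and input_ids as defined above:

import numpy as np
import torch

model.eval()  # inference only; disables dropout
imp = {}
with torch.no_grad():  # don't build the autograd graph for these forward passes
    for count, id_to_mask in enumerate(unique_ids):
        masked_arr = torch.tensor([[103 if x == id_to_mask else x for x in input_ids]])
        output = model(masked_arr)
        idxs = np.where(input_ids == id_to_mask)[0]
        prob = output.logits[0][idxs][:, id_to_mask]
        imp[id_to_mask] = prob.sum().item()  # .item() stores a plain float, not a tensor
        print(count)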

Related

xarray chunk dataset PerformanceWarning: Slicing with an out-of-order index is generating more chunks

I am trying to run a simple calculation based on two big gridded datasets in xarray (around 5 GB altogether, daily data from 1850-2100). I keep running out of memory when I try it this way
import xarray as xr

def readin(model):
    # var_obs and var_sim (file paths) are defined elsewhere
    observed = xr.open_dataset(var_obs)
    model_sim = xr.open_dataset(var_sim)
    observed = observed.sel(time=slice('1989', '2010'))
    model_hist = model_sim.sel(time=slice('1989', '2010'))
    model_COR = model_sim
    return observed, model_hist, model_COR

def method(model):
    clim_obs = observed.groupby('time.day').mean(dim='time')
    clim_hist = model_hist.groupby('time.day').mean(dim='time')
    diff_scaling = clim_hist - clim_obs
    bc = model_COR.groupby('time.day') - diff_scaling
    bc[var] = bc[var].where(bc[var] > 0, 0)   # var is the variable name, defined elsewhere
    bc = bc.reset_coords('day', drop=True)

observed, model_hist, model_COR = readin('model')
method('model')
I tried to chunk the (full) model_COR to split up the memory:
model_COR.chunk(chunks={'lat': 20, 'lon': 20})
or across the time dimension:
model_COR.chunk(chunks={'time': 8030})
but everything I tried resulted in
PerformanceWarning: Slicing with an out-of-order index is generating xxx times more chunks
which doesn't exactly sound like the outcome I want. Where am I going wrong here? Happy about any help!
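For what it's worth, two things stand out here (assumptions on my part, not a verified fix): .chunk() returns a new object, so the calls above discard their result, and chunking when the files are opened lets the later groupby/mean run lazily on dask chunks instead of loading everything into memory. A rough sketch with placeholder chunk sizes, reusing var_obs and var_sim from the question:

import xarray as xr

# chunk at open time so downstream selections/groupbys stay lazy (dask-backed);
# the chunk sizes below are placeholders, not tuned values
observed  = xr.open_dataset(var_obs, chunks={'time': 365})
model_sim = xr.open_dataset(var_sim, chunks={'time': 365})

# .chunk() is not in-place: keep the returned object
model_COR = model_sim.chunk({'lat': 20, 'lon': 20})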

Printing output data (data-frames / matrices) along with plots in .pdf format in R

I would like to print the results of my R code, namely a few data frames (approx. 4-5) and a few plots (4 plots), in a PDF file, i.e. I want to export all the output to a PDF using a set of commands that I can run at the end, or from a separate file outside the mother/core/main code (for example by using rmarkdown::render("analysis.R", "pdf_document")). I would also like to achieve the following:
a. I want to print only the output (and not the code). For this, directives like #+echo=FALSE proved useful; I place them at the beginning of my mother code file and run the lines below from an external file:
library(rmarkdown)
render("Analysis.R", output_format = "pdf_document", output_file = "Aalysis_Output",quiet = TRUE)
b. But my tables are not well formatted: they get split into several tables (they are a little wide, so the row names are replicated in each piece and each piece holds just one column; I have three columns in total, so I get three tables).
c. Moreover, I am getting "##" at the start of each line of output, which I want to remove.
Also note that, currently, I don't want to use R Markdown or knitr (as I don't want a .Rmd file etc. for my R code). Is there any way to do this easily, like what we used to do in C programming: open a text file and write the results/output using fprintf etc.?
Below are all my datasets (R data frames) and plots (created with ggplot) that I want printed in a well-formatted PDF file.
## Printing all the results & plots ##
print(base_result, right = T) # Base Results
print(DSA_AP, right = T) # DSA for Drug A vs P
Torn_AP # Tornado for Drug A vs P
print(DSA_AQ, right = T) # DSA for Drug A vs Q
Torn_AQ # Tornado for Drug A vs Q
print(DSA_AR, right = T) # DSA for Drug A vs R
Torn_AR # Tornado for Drug A vs R
print(DSA_AS, right = T) # DSA for Drug A vs S
Torn_AS # Tornado for Drug A vs S
I tried the following code to get the result:
Set-1 code: the issue is that it does not print the tables/data frames (it only produces the plots).
pdf("MyOutput_12.pdf", paper = "A4")
print(base_result, right = T) # Base Results
# Issue: It is only coming as a print in the console and similarly for below tables.
print(DSA_AP, right = T) # DSA for Drug A vs P
Torn_AP # Tornado for Drug A vs P
print(DSA_AQ, right = T) # DSA for Drug A vs Q
Torn_AQ # Tornado for Drug A vs Q
print(DSA_AR, right = T) # DSA for Drug A vs R
Torn_AR # Tornado for Drug A vs R
print(DSA_AS, right = T) # DSA for Drug A vs S
Torn_AS # Tornado for Drug A vs S
dev.off()
Set-2 code: now it prints the tables, but each one is drawn on top of the previous table or plot, i.e. overwriting it (below is the code for printing three datasets and two plots).
library(gridExtra)
pdf("MyOutput_14.pdf", paper = "A4")
grid.table(base_result)
grid.table(DSA_AP) # DSA for Drug A vs P
Torn_AP # Tornado for Drug A vs P
grid.table(DSA_AQ) # DSA for Drug A vs Q
Torn_AQ # Tornado for Drug A vs Q
dev.off()
Moreover, in Set-2 the tables are somewhat big and get truncated, i.e. the right part of each table is missing (cut off in the PDF file); the width etc. of the tables in the PDF output would need adjusting.
Thanks in advance!
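A side note on the Set-2 overwriting (a sketch, not a tested solution): grid.table() draws on the current page of the PDF device, so starting a new page with grid::grid.newpage() before each subsequent table should give every table its own page, while printed ggplot objects already open a new page on their own:

library(gridExtra)
library(grid)

pdf("MyOutput_14.pdf", paper = "A4")
grid.table(base_result)   # first table on page 1
grid.newpage()            # start a fresh page before drawing the next table
grid.table(DSA_AP)        # DSA for Drug A vs P
print(Torn_AP)            # ggplot objects start their own page when printed
grid.newpage()
grid.table(DSA_AQ)        # DSA for Drug A vs Q
print(Torn_AQ)
dev.off()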
A very simple way, if you are using RStudio, is to write your script and then hit Ctrl + Shift + K. This automatically creates a report, and you can choose PDF output if you want. Note: under the hood this still uses R Markdown, but you do not have to write an .Rmd file yourself, so maybe this is okay?
I have found code serving purposes 'a' and 'c'; for 'b' I shortened the column headings and row names of my data frames (output tables) to save space on the page, and now I am getting the required output. I used the following code:
knitr::opts_chunk$set(fig.width = 7, fig.height = 5, fig.align = 'left', dpi = 96, echo = F, comment = "", message = F, warning = F)
# For PDF Output
rmarkdown::render("Model_3.3.R", output_format = "pdf_document", output_file = "Output_3.3_01.pdf")
# For HTML Output
rmarkdown::render("Model_3.3.R", output_format = "html_document", output_file = "Output_3.3_01.html")
Note: the render() calls need to be run outside the mother code, i.e. from a separate file or from the R/RStudio console.

R: use single file while running a for loop on list of files

I am trying to create a loop where I select one file name from a list of file names and use that file to run read.capthist, then discretize, secr.fit, and derived, and save the outputs using save. The list contains 10 files with identical rows and columns; the only difference between them is the geographical coordinates in each row.
The issue I am running into is that capt needs to be a single file (in the secr package they are 'captfile' types), but I don't know how to select a single file from this list and get my loop to recognize it as a single entity.
This is the error I get when I try and select only one file:
Error in read.capthist(female[[i]], simtraps, fmt = "XY", detector = "polygon") :
requires single 'captfile'
I am not a programmer by training, I've learned R on my own and used stack overflow a lot for solving my issues, but I haven't been able to figure this out. Here is the code I've come up with so far:
library(secr)

setwd("./")
files = list.files(pattern = "female*")
lst <- vector("list", length(files))
names(lst) <- files

for (i in 1:length(lst)) {
  capt <- lst[i]
  femsimCH <- read.capthist(capt, simtraps, fmt = 'XY', detector = "polygon")
  femsimdiscCH <- discretize(femsimCH, spacing = 2500, outputdetector = 'proximity')
  fit <- secr.fit(femsimdiscCH, buffer = 15000, detectfn = 'HEX', method = 'BFGS', trace = FALSE, CL = TRUE)
  save(fit, file = "C:/temp/fit.Rdata")
  D.fit <- derived(fit)
  save(D.fit, file = "C:/temp/D.fit.Rdata")
}
simtraps is a list of coordinates.
Ideally I would also like my outputs to have unique identifiers, since I am simulating data and will have to compare all the results; I don't want each iteration to overwrite the previous output.
I know I can use this code by bringing in each file and running it separately (this code works for non-simulation runs of a couple of data sets), but as I'm hoping to run 100 simulations, this would be laborious and prone to mistakes.
Any tips would be greatly appreciated for an R novice!
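One guess at what is going on (an assumption, not an accepted answer): lst is a list of NULLs created with vector("list", ...), so lst[i] hands read.capthist an empty element rather than a file name. Indexing the character vector of file names directly, and building per-file output names, would address both the error and the overwriting of results. A rough sketch, keeping the question's settings:

library(secr)

files <- list.files(pattern = "female")            # capture-file names (regex pattern)
for (i in seq_along(files)) {
  capt <- files[i]                                 # a single captfile name
  femsimCH <- read.capthist(capt, simtraps, fmt = "XY", detector = "polygon")
  femsimdiscCH <- discretize(femsimCH, spacing = 2500, outputdetector = "proximity")
  fit <- secr.fit(femsimdiscCH, buffer = 15000, detectfn = "HEX",
                  method = "BFGS", trace = FALSE, CL = TRUE)
  # derive unique output names from the input file so runs don't overwrite each other
  stem <- tools::file_path_sans_ext(basename(capt))
  save(fit, file = file.path("C:/temp", paste0("fit_", stem, ".Rdata")))
  D.fit <- derived(fit)
  save(D.fit, file = file.path("C:/temp", paste0("D.fit_", stem, ".Rdata")))
}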

Slow bigram frequency function in R

I’m working with Twitter data and I’m currently trying to find frequencies of bigrams in which the first word is “the”. I’ve written a function which seems to do what I want but is extremely slow (originally I wanted the frequencies of all bigrams, but I gave up because of the speed). Is there a faster way of solving this problem? I’ve heard about the RWeka package but have trouble installing it; I get an error (ERROR: dependencies ‘RWekajars’, ‘rJava’ are not available for package ‘RWeka’).
Required libraries: tau and tcltk.
library(tau)    # textcnt()
library(tcltk)  # tkProgressBar()

bigramThe <- function(dataset, column) {
  bidata <- data.frame(x = character(0), y = numeric(0))
  pb <- tkProgressBar(title = "progress bar", min = 0, max = nrow(dataset), width = 300)
  for (i in 1:nrow(dataset)) {
    a <- column[i]
    bi <- textcnt(a, n = 2, method = "string")
    tweetbi <- data.frame(V1 = as.vector(names(bi)), V2 = as.numeric(bi))
    tweetbi$grepl <- grepl("the ", tweetbi$V1)
    tweetbi <- tweetbi[which(tweetbi$grepl == TRUE), ]
    bidata <- rbind(bidata, tweetbi)
    setTkProgressBar(pb, i, label = paste(round(i / nrow(dataset), 0), "% done"))
  }
  aggbi <- aggregate(bidata$V2, by = list(bidata$V1), FUN = sum)
  close(pb)
  return(aggbi)
}
I have almost 500,000 rows of tweets stored in a column that I pass to the function. An example dataset would look like this:
text               userid
tweet text 1       1
tweets text 2      2
the tweet text 3   3
To use RWeka, first run sudo apt-get install openjdk-6-jdk (or install/re-install your JDK in Windows GUI) then try re-installing the package.
Should that fail, use download.file to download the source .zip file and install from source, i.e. install.packages("RWeka.zip", type = "source", repos = NULL).
If you want to speed things up without using a different package, consider using multicore and re-writing the code to use an apply function which can take advantage of parallelism.
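A rough sketch of that suggestion using the parallel package (the core count and the anchored "^the " filter are my assumptions; note that mclapply forks and therefore does not parallelise on Windows):

library(tau)
library(parallel)

# count bigrams per tweet in parallel, keeping only those whose first word is "the"
bigram_the_parallel <- function(texts, cores = 2) {
  per_tweet <- mclapply(texts, function(a) {
    bi <- textcnt(a, n = 2, method = "string")
    bi[grepl("^the ", names(bi))]
  }, mc.cores = cores)
  counts <- unlist(per_tweet)            # named counts; bigram names repeat across tweets
  tapply(counts, names(counts), sum)     # aggregate the counts over all tweets
}

# usage: freqs <- bigram_the_parallel(dataset$text)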
You can get rid of the evil loop structure by collapsing the text column into one long string:
a <- paste(dataset[[column]], collapse = " *** ")
bi <- textcnt(a, n = 2, method = "string")
I expected to also need something like
bi[!grepl("*", names(bi), fixed = TRUE)]
but it turns out that the textcnt method doesn't include bigrams with * in them, so you're good to go.
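Putting the collapsed-string suggestion together (a sketch; the "^the " filter and the use of the text column from the example are my additions):

library(tau)

a  <- paste(dataset$text, collapse = " *** ")   # one long string; tweets separated by ***
bi <- textcnt(a, n = 2, method = "string")      # every bigram counted in a single pass

# keep only bigrams whose first word is "the", most frequent first
the_bigrams <- sort(bi[grepl("^the ", names(bi))], decreasing = TRUE)
head(the_bigrams)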

Crash logs for R/Rstudio/Rstudio-server after suspected memory issue

I have a long-running process in R (via RStudio Server) which I suspect has a memory problem that eventually causes the R session to crash. Unfortunately I am not around to monitor exactly what is going on: do crash logs exist, and where do I find them if they do?
My setup is as follows:
RStudio Server installed on Ubuntu 12.04, in a VMware Player virtual machine.
I access the R session from Firefox on a Windows 7 installation.
I leave the program running overnight and come back to find the following error message in the RStudio interface:
The previous R session was abnormally terminated due to an unexpected crash.
You may have lost workspace data as a result of this crash.
It appears that the following code is causing the problem (not a reproducible sample). The code takes a list of regression formulas (about 250k of them) and a data frame of 1500 rows by 70 columns, and lets you specify the number of cores to be used in the calculation:
get_models_rsquared = function(combination_formula, df, cores = 1){
  require(parallel)  # needed for detectCores() and mclapply(); if parallel processing is not required, mclapply should be changed to lapply
  if (cores == "ALL"){ cores <- detectCores() }
  # using mclapply to calculate linear models in parallel,
  # storing adjusted r-squared and number of missing rows in a list
  combination_fitted_models_rsq = mclapply(seq_along(combination_formula), function(i)
    list(summary(lm(data = df, combination_formula[[i]]))$adj.r.squared,
         length(summary(lm(data = df, combination_formula[[i]]))$na.action)), mc.cores = cores)
  # now store the adjusted r-squared and missing rows of data
  temp_1 = lapply(seq_along(combination_fitted_models_rsq), function(i)
    combination_fitted_models_rsq[[i]][[1]])
  temp_1 = as.numeric(temp_1)
  temp_2 = lapply(seq_along(combination_fitted_models_rsq), function(i)
    combination_fitted_models_rsq[[i]][[2]])
  temp_2 = as.numeric(temp_2)
  # this is the problematic line
  temp_3 = lapply(seq_along(combination_formula), function(i) {
    length(attributes(terms.formula(combination_formula[[i]]))$term.labels)
  }  # tells you the number of predictors in each formula used for linear regression
  )  # end lapply
  result = data.frame(temp_1, temp_2, temp_3)
  names(result) = c("rsquared", "length.na", "number_of_terms")
  return(result)
}
The calculation of temp_3 seems to cause the problem when the function is called. However, everything works properly if you take the temp_3 code out of the function and calculate it after running the function.
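On the original question about crash logs (an assumption about the setup, not a confirmed log location): if the session was killed for lack of memory, the kernel's OOM killer on Ubuntu usually records it in the system log, which can be checked from R once you are back at the machine, provided the file is readable by your user (it may require membership of the adm group):

# look for OOM-killer entries in the Ubuntu system log
syslog <- readLines("/var/log/syslog")
oom <- grep("Out of memory|oom-killer|Killed process", syslog, value = TRUE)
tail(oom)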
