Encoding/tokenizing dataset dictionary (BERT/Huggingface)

I am trying to fine-tune my sentiment analysis model. Therefore, I have split my pandas DataFrame (a column with reviews, a column with sentiment scores) into a train and a test DataFrame and transformed everything into a Dataset Dictionary:
# Create Dataset objects
dataset_train = datasets.Dataset.from_pandas(training_data)
dataset_test = datasets.Dataset.from_pandas(testing_data)
# Get rid of the leftover pandas index column
dataset_train = dataset_train.remove_columns('__index_level_0__')
dataset_test = dataset_test.remove_columns('__index_level_0__')
# Create the DatasetDict
data_dict = datasets.DatasetDict({"train": dataset_train, "test": dataset_test})
I am transforming everything into a dataset dictionary because I am more or less following an existing example and adapting it to my problem. Anyway, I define the function to tokenize:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers import Trainer, TrainingArguments
from sklearn.metrics import accuracy_score, f1_score

num_labels = 5
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
batch_size = 16
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)
# tokenizer matching the checkpoint (not shown in the original snippet)
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize(batch):
    return tokenizer(batch, padding=True, truncation=True)
and call the function with:
data_encoded = data_dict.map(tokenize, batched=True, batch_size=None)
I am getting this error after all this:
ValueError: text input must of type str (single example), List[str] (batch or single pretokenized example) or List[List[str]] (batch of pretokenized examples).
What am I missing? Sorry I am completely new to the whole Huggingface infrastructure…

Found the error on my own: I had to specify the column that has to be tokenized. The correct tokenize function is:
def tokenize(batch):
    return tokenizer(batch["text"], padding=True, truncation=True)
instead of
def tokenize(batch):
    return tokenizer(batch, padding=True, truncation=True)
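For reference, here is a minimal self-contained sketch of the corrected pipeline. The column names "text" and "label" and the toy data are assumptions standing in for whatever the review and score columns are called in the original DataFrame:
import datasets
import pandas as pd
from transformers import AutoTokenizer

model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Toy stand-ins for training_data / testing_data (hypothetical data)
training_data = pd.DataFrame({"text": ["Great product!", "Awful service."], "label": [4, 0]})
testing_data = pd.DataFrame({"text": ["Okay overall."], "label": [2]})

data_dict = datasets.DatasetDict({
    "train": datasets.Dataset.from_pandas(training_data),
    "test": datasets.Dataset.from_pandas(testing_data),
})

def tokenize(batch):
    # With batched=True, batch is a dict mapping column names to lists of
    # values, so the text column has to be selected explicitly
    return tokenizer(batch["text"], padding=True, truncation=True)

data_encoded = data_dict.map(tokenize, batched=True, batch_size=None)
print(data_encoded["train"].column_names)
# e.g. ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask']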

Related

How are vector operations performed on 20newsgroups_vectorized data set?

When I fetch 20newsgroups_vectorized data by
newsgroups = fetch_20newsgroups_vectorized(subset='all')
labels = newsgroups.target_names
target = newsgroups.target
target = pd.DataFrame([labels[i] for i in target], columns=['label'])
data = newsgroups.data
data is a <class 'scipy.sparse.csr.csr_matrix'> with shape (18846, 130107).
How can I subset the data by target names (for example, extract only 'rec.sport.baseball') and use vector operations on those sparse row vectors (for example, calculate the mean vector or the distances)?
Unfortunately, subsetting the data by target name is not available in fetch_20newsgroups_vectorized, but it is available in fetch_20newsgroups; you just have to vectorize the data yourself. Here is how you can do it:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
newsgroups_train = fetch_20newsgroups(subset='all',
                                      categories=['rec.sport.baseball'])
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(newsgroups_train.data)
print(vectors.shape)
# (994, 13986)
Read more here
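As for the vector operations asked about above: scipy's sparse matrices support the usual reductions, and scikit-learn's pairwise distances accept sparse input directly. A short sketch, reusing the vectors matrix from the snippet above:
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

# Mean vector over all 'rec.sport.baseball' documents
mean_vec = np.asarray(vectors.mean(axis=0))  # dense, shape (1, 13986)

# Euclidean distance from every document to the mean vector
dists = euclidean_distances(vectors, mean_vec)
print(dists.shape)
# (994, 1)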

How to use glmulti from python using rpy2?

Consider the following dataframe:
import pickle
a='pickle.loads(b\'\\x80\\x03cpandas.core.frame\\nDataFrame\\nq\\x00)\\x81q\\x01}q\\x02(X\\x05\\x00\\x00\\x00_dataq\\x03cpandas.core.internals.managers\\nBlockManager\\nq\\x04)\\x81q\\x05(]q\\x06(cpandas.core.indexes.base\\n_new_Index\\nq\\x07cpandas.core.indexes.base\\nIndex\\nq\\x08}q\\t(X\\x04\\x00\\x00\\x00dataq\\ncnumpy.core.multiarray\\n_reconstruct\\nq\\x0bcnumpy\\nndarray\\nq\\x0cK\\x00\\x85q\\rC\\x01bq\\x0e\\x87q\\x0fRq\\x10(K\\x01K\\n\\x85q\\x11cnumpy\\ndtype\\nq\\x12X\\x02\\x00\\x00\\x00O8q\\x13K\\x00K\\x01\\x87q\\x14Rq\\x15(K\\x03X\\x01\\x00\\x00\\x00|q\\x16NNNJ\\xff\\xff\\xff\\xffJ\\xff\\xff\\xff\\xffK?tq\\x17b\\x89]q\\x18(X\\x0b\\x00\\x00\\x00priceToBookq\\x19X\\x04\\x00\\x00\\x00betaq\\x1aX\\x0e\\x00\\x00\\x00price to salesq\\x1bX\\x0c\\x00\\x00\\x00gross profitq\\x1cX\\x0c\\x00\\x00\\x0052WeekChangeq\\x1dX\\n\\x00\\x00\\x00market capq\\x1eX\\x04\\x00\\x00\\x00ebitq\\x1fX\\r\\x00\\x00\\x00total revenueq X\\x0c\\x00\\x00\\x00payout ratioq!X\\x08\\x00\\x00\\x00pe ratioq"etq#bX\\x04\\x00\\x00\\x00nameq$Nu\\x86q%Rq&h\\x07cpandas.core.indexes.range\\nRangeIndex\\nq\\\'}q((h$NX\\x05\\x00\\x00\\x00startq)K\\x00X\\x04\\x00\\x00\\x00stopq*K\\x07X\\x04\\x00\\x00\\x00stepq+K\\x01u\\x86q,Rq-e]q.h\\x0bh\\x0cK\\x00\\x85q/h\\x0e\\x87q0Rq1(K\\x01K\\nK\\x07\\x86q2h\\x12X\\x02\\x00\\x00\\x00f8q3K\\x00K\\x01\\x87q4Rq5(K\\x03X\\x01\\x00\\x00\\x00<q6NNNJ\\xff\\xff\\xff\\xffJ\\xff\\xff\\xff\\xffK\\x00tq7b\\x89B0\\x02\\x00\\x00\\xd1#,\\x9b9\\x8c)#Cz\\xe5\\xd5\\x94_\\xf5?\\x92(\\x0ffn9\\xf0?\\n+\\x15TT-\\x17# \\xd5\\xb0\\xdf\\x13\\x03%#u\\xdek\\xad\\xd4\\xb8\\xfb?\\x1c\\xee#\\xb7&\\xbd\\xf3?-\\x98\\xf8\\xa3\\xa8\\xf3\\xf3?H\\xfd\\xf5\\n\\x0b\\xae\\xf1?:;\\x19\\x1c%/\\xf1?\\x9f\\x93\\xde7\\xbe\\xf6\\xf0?\\xbb}V\\x99)\\xad\\xf3?\\xae\\xbby\\xaaC.\\xf3?\\xa5,C\\x1c\\xeb\\xe2\\xf9?d\\x94g^\\x0e\\x13\\x12#\\x9e\\xc7r\\\\\\xd7i\\x06#\\xe4\\xe0\\x0c\\xddp\\xc8\\xcc?%\\x95)\\xe6 \\x18 #\\xa1\\xf4\\x85\\x90\\xf3\\x1e!#y6P\\x85\\xe4\\x89\\x0e#.\\xd9\\xc2=\\xe0\\x1b\\x0c#\\x00\\x00\\x00\\xc6\\x9e\\xe86B\\x00\\x00\\x00fF\\xb83B\\x00\\x00\\x00.\\xdc\\xb6\\x0bB\\x00\\x00\\x80\\x954\\xa5%B\\x00\\x00#\\\'1O3B\\x00\\x00\\x00\\xec\\xed58B\\x00\\x00\\x80\\t\\x93\\xa64B\\xf1\\xda\\x84\\xff\\x9d\\x82\\xd5?f\\xb8>\\x028#\\xa0?\\xc8^\\xef\\xfex/\\xb0\\xbf\\xab\\xd5\\x91\\x02\\x8f\\x18\\xd6?\\xd7\\xc05\\xfb,d\\xd6?r\\x8e\\xb6\\x01\\n\\xbb\\xc8?\\xc0\\xd52\\x00\\xf1F\\xc9?\\x00\\x00\\x00 \\x8b\\x1bqB\\x00\\x00\\x00\\xa0\\x92HKB\\x00\\x00\\x00\\x80\\xcb\\x8a B\\x00\\x00\\x00\\x00\\x98)_B\\x00\\x00\\x00`\\xca+pB\\x00\\x00\\x00\\xa0N\\xe4WB\\x00\\x00\\x00\\x00\\xc0\\x9fQB\\x00\\x00\\x00\\xc5\\x0c\\xc5-B\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00lIv\\xf0A\\x00\\x00\\x00\\xd9\\xb83\\x17B\\x00\\x00\\x80\\xa3\\x1c\\x01$B\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\xc0\\x17\\xcaINB\\x00\\x00\\x00fF\\xb83B\\x00\\x00#\\xdcq\\xaaBB\\x00\\x00\\x00\\x87h\\x00*B\\x00\\x00\\xc0\\xca\\xd3L=B\\x00\\x00\\x00\\xec\\xed58B\\x00\\x00\\x80\\t\\x93\\xa64B\\xa1\\xf81\\xe6\\xae%\\xd0?\\x8b\\xfde\\xf7\\xe4a\\xd9?\\x00\\x00\\x00\\x00\\x00\\x00\\xf8?\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\xf1\\xf4JY\\x868\\xd6?>(\\xc5\\x1ap\\xce\\xd4?\\xc3\\xf5(\\\\\\x8f\\xc2\\xcd?\\xad\\xbf%\\x00\\xff\\xe05#\\xc8$#gaG\\\'#\\x9a\\x99\\x99\\x99\\x99\\x996#{Ic\\xb4\\x8e\\x82>#^+\\xa1\\xbb$\\x8a;#UL\\xa5\\x9fp\\xbe)#0G\\x8f\\xdf\\xdb\\x84(#q8tq9ba]q:h\\x07h\\x08}q;(h\\nh\\x0bh\\x0cK\\x00\\x85q<h\\x0e\\x87q=Rq>(K\\x01K\\n\\x85q?h\\x15\\x89]q#(h\\x19h\\x1ah\\x1bh\\x1ch\\x1dh\\x1eh\\x1fh 
h!h"etqAbh$Nu\\x86qBRqCa}qDX\\x06\\x00\\x00\\x000.14.1qE}qF(X\\x04\\x00\\x00\\x00axesqGh\\x06X\\x06\\x00\\x00\\x00blocksqH]qI}qJ(X\\x06\\x00\\x00\\x00valuesqKh1X\\x08\\x00\\x00\\x00mgr_locsqLcbuiltins\\nslice\\nqMK\\x00K\\nK\\x01\\x87qNRqOuaustqPbX\\x04\\x00\\x00\\x00_typqQX\\t\\x00\\x00\\x00dataframeqRX\\t\\x00\\x00\\x00_metadataqS]qTub.\')'
a=eval(a)
a
I want to run the glmulti function from Python. I tried lots of ways but failed, so as a last resort I went over to R, as follows. First, export the dataset to an Excel file:
a.to_excel('test1.xlsx')
Then go to RStudio:
install.packages("glmulti", "rJava", "readxl")
library("glmulti", "rJava", "readxl")
getwd()
setwd(".Gp\\to\\the\\python directory where you are workingin")
my_data <- read_excel("test1.xlsx", sheet = 1)
Rename the columns of the dataframe, because it does not work with the original column names:
j = 1
for (i in paste0("x", 1:length(my_data))) {
    names(my_data)[j] = i
    j = j + 1
}
Select my y variable and x variables:
y=my_data[,6]
x=my_data[, names(my_data) != names(my_data)[6]]
Finally, I run the function I want in R:
glmulti(names(y), names(x), data=my_data, method="h")
Is there an easier way to run it from Python using rpy2? If so, can you please advise?
Consider converting the pandas data frame into an R data frame with rpy2, and then calling glmulti from the imported package just as you do now.
However, a few notes about R:
In R, as in Python, every function or method comes from a package; the difference is that Python only loads built-in functions (e.g., list, sum, type) by default, whereas R loads its standard library packages (e.g., utils, stats, base) by default, making everyday functions (e.g., read.csv, head, summary) available without explicit imports.
You can qualify a function call with its package name, such as base::names, but unlike in Python this is not required; it is helpful, though, in case of a name collision with another package.
You do not need a for loop to rename all the columns; you can vectorize the new names with base::paste0 and assign them using stats::setNames or base::colnames.
Python Processing
import pandas as pd
import pickle
df_py = eval('pickle.loads(...)')
# RE-ORDER COLUMNS BY MOVING SIXTH COLUMN TO FIRST POSITION
cols = df_py.columns.to_list()
new_order = [cols[5]] + cols[0:5] + cols[6:]
df_py = df_py.reindex(new_order, axis=1)
print(df_py.head(10))
R Processing
from rpy2.robjects import pandas2ri
from rpy2.robjects.packages import importr
utils = importr('utils')
base = importr('base')
stats = importr('stats')
glmulti = importr('glmulti') # DOES NOT REQUIRE rJava PACKAGE BUT DOES REQUIRE Java LANGUAGE
# CONVERT TO R DATAFRAME
pandas2ri.activate()
df_r = pandas2ri.py2ri(df_py) # USING ABOVE PANDAS DATA FRAME
# RENAME COLUMNS y, x1, x2, x3, ...
df_r = stats.setNames(df_r, base.c("y", base.paste0("x", base.seq(1,base.length(df_r)[0]-1))))
print(utils.head(df_r, 10))
# CALL glmulti()
glmulti.glmulti(y = base.names(df_r)[0],
                xr = base.names(df_r)[1:],
                data = df_r,
                method = "h")

backtest package R extracting returns from object

I would like to extract the returns from the backtest package, which are, according to the manual, stored within a five-dimensional array called 'results'.
This is the backtest package:
https://cran.r-project.org/web/packages/backtest/backtest.pdf
A simple example looks like this:
library(backtest)
data(starmine)
bt <- backtest(starmine, in.var = c("smi"),
               ret.var = "ret.0.1.m", date.var = "date",
               id.var = "id", buckets = 10,
               natural = TRUE, by.period = TRUE)
summary(bt)
When you run the summary command, it will print out the return series for each decile. I would like to extract those into a dataframe that I can use for further analysis.
Does someone know how I can access or extract the return series?
The bt object has class backtest (which we can see from class(bt)). The summary() function has a method defined for backtest objects that only prints the information to the screen; if you try to assign it via stuff <- summary(bt), the stuff object will be NULL. To access the data that summary(bt) prints, use the accessor functions created for that class (they are described in ?'backtest-class'). These functions include:
means()
counts()
summary()
marginals()
summaryStats()
turnover()
In order to access the data frame of summary statistics by month printed as the side effect of summary(bt), you can run summaryStats(bt). Please see pages 5-8 of the backtest help files for more information.

Export SpectraObjects to csv in ChemoSpec

I am using ChemoSpec to analyse FTIR spectra in R.
I was able to import several csv files using files2SpectraObject, applied some of the data pre-processing procedures, such as normalization and binning, and generated new SpectraObjects with the results.
Is it possible to export data back to csv format from the generated SpectraObjects?
So far I tried this
write.table(ftirbin, "E:/ftirbin.txt", sep="\t")
and got this:
Error in as.data.frame.default(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors) :
  cannot coerce class '"Spectra"' to a data.frame
Thanks in advance!
G
If you look at ?Spectra you'll see how a Spectra object is stored: the intensity values are in your_object$data, and the frequency values are in your_object$freq. So you can't export the whole object (it's a list, not a data frame), but you can export the pieces. To export the frequencies in the first column and the samples in the following columns, you can do this (the example uses a built-in data set, SrE.IR):
tmp <- cbind(SrE.IR$freq, t(SrE.IR$data))
colnames(tmp) <- c("freq", SrE.IR$names)
tmp <- as.data.frame(tmp) # it was a matrix
Then you can write it out using write.csv or write.table (check the arguments to avoid row numbers).

Rpy2 - List of List of Dataframes

I'm trying to figure out how to use Python to parse XML files into a data structure that I can pass into R.
What I need to create in R is a List of Lists of Dataframes:
Nodes = data.frame()
Edges = data.frame()
NetworkCompListA = list()
NetworkCompListA[['Nodes']] = Nodes
NetworkCompListA[['Edges']] = Edges
Networks = list()
Networks[['NetA']] = NetworkCompListA
Networks[['NetB']] = NetworkCompListB
I know how to create a dataframe from the examples in the Rpy2 documentation.
import rpy2.robjects as robjects
import rpy2.rlike.container as rlc

od = rlc.OrdDict([('value', robjects.IntVector((1, 2, 3))),
                  ('letter', robjects.StrVector(('x', 'y', 'z')))])
df = robjects.DataFrame(od)
How do I insert df into a list, then insert that list into another list in Python, and then write that out to an .RData file to load into another instance of R?
Thanks!
The class ListVector requires an object that implements iteritems() (such as a dict or an OrderedDict). Note that in R, data.frames are just lists with the (loose) constraint that all elements should be vectors of the same length (or a matrix with a matching number of rows), and with row names and column names (the list's names being the column names).
from rpy2.robjects.vectors import ListVector, DataFrame
# rpy2's OrdDict was added because there was no ordered dict
# in Python's stdlib. It should be gone by rpy2-2.5
from collections import OrderedDict
od = OrderedDict((('a', 1), ('b', 2)))
df = DataFrame(od)
od_l = OrderedDict((('df', df),))
df_in_list = ListVector(od_l)
df_in_list_in_list = ListVector(OrderedDict((('df_in_list', df_in_list),)))
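The last part of the question, writing the structure out to an .RData file, is not covered above; one way (a sketch, assuming the standard rpy2 robjects interface and a hypothetical file name) is to bind the object into R's global environment and call base R's save():
from rpy2 import robjects

# Make the nested list visible to R under the name 'Networks'
robjects.globalenv['Networks'] = df_in_list_in_list

# Write it to an .RData file; another R session can then do:
#   load('networks.RData'); Networks$df_in_list$df
robjects.r("save(Networks, file='networks.RData')")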
