Error: can't convert cuda:0 device type tensor to numpy - plot

I want to plot these variables: prediction_py and y_test_py. But I get the error named in the title: TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.

If your data is on the GPU, you cannot convert it directly to a NumPy array. First move it to the CPU, then convert it, like so:
x = torch.randn(2, 3).cuda()
x_np = x.cpu().numpy()
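Applied to the variables in the question, a minimal sketch (assuming prediction_py and y_test_py are CUDA tensors; detach() is included in case they still carry gradients):
import torch
import matplotlib.pyplot as plt
# Placeholders standing in for the question's variables (assumed CUDA tensors)
prediction_py = torch.randn(100).cuda()
y_test_py = torch.randn(100).cuda()
# detach() drops the autograd graph, cpu() copies to host, numpy() converts
plt.plot(y_test_py.detach().cpu().numpy(), label="y_test")
plt.plot(prediction_py.detach().cpu().numpy(), label="prediction")
plt.legend()
plt.show()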

Related

Encoding/tokenizing dataset dictionary (BERT/Huggingface)

I am trying to fine-tune my sentiment analysis model. To that end, I have split my pandas DataFrame (a column with reviews, a column with sentiment scores) into train and test DataFrames and transformed everything into a Dataset dictionary:
#Creating Dataset Objects
dataset_train = datasets.Dataset.from_pandas(training_data)
dataset_test = datasets.Dataset.from_pandas(testing_data)
#Get rid of weird columns
dataset_train = dataset_train.remove_columns('__index_level_0__')
dataset_test = dataset_test.remove_columns('__index_level_0__')
#Create Dataset Dictionary
data_dict = datasets.DatasetDict({"train":dataset_train,"test":dataset_test})
I am transforming everything into a dataset dictionary because I am more or less following existing example code and transferring it to my problem. Anyway, I define the tokenize function:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers import Trainer, TrainingArguments
from sklearn.metrics import accuracy_score, f1_score
num_labels = 5
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
batch_size = 16
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)
def tokenize(batch):
    return tokenizer(batch, padding=True, truncation=True)
and call the function with:
data_encoded = data_dict.map(tokenize, batched=True, batch_size=None)
I am getting this error after all this:
ValueError: text input must of type str (single example), List[str] (batch or single pretokenized example) or List[List[str]] (batch of pretokenized examples).
What am I missing? Sorry I am completely new to the whole Huggingface infrastructure…
Found the error on my own: I had to specify the column to be tokenized. The correct tokenize function is:
def tokenize(batch):
    return tokenizer(batch["text"], padding=True, truncation=True)
instead of
def tokenize(batch):
    return tokenizer(batch, padding=True, truncation=True)
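The underlying reason: with batched=True, Dataset.map passes each batch to the function as a dict mapping column names to lists of values, and a dict is not valid tokenizer input. A small illustration (the column contents here are made up):
# map() with batched=True hands the function a dict of column -> list
batch = {"text": ["great product", "terrible service"], "label": [5, 1]}
tokenizer(batch["text"], padding=True, truncation=True)  # works: List[str]
# tokenizer(batch, ...) would raise the ValueError above: a dict is not
# str, List[str], or List[List[str]]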

How to quickly and easily convert between R and pandas dataframes in Databricks?

I am an R user with minimal python experience. I have some colleagues who use python and I want to be able to easily convert between R and python/pandas dataframes in the same Databricks notebook. I have heard that I have to use spark temp tables to do this and that it is quite straightforward, but I cannot find any complete example code and so far I haven't been able to get it to work.
I get a SparkR dataframe (as I can't get base R data frames to work with registerTempTable()) and convert it to a temp table:
#Cell 1
jdbc_url <- "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb;user=user;password=*****"
df_R <- read.jdbc(jdbc_url, "(SELECT TOP 10 * FROM [schema].[table]) as result" )
SparkR:::registerTempTable(df_R,"df_temptable")
Then I try to read that back in as a pandas dataframe:
%python
#Cell 2:
import pandas as pd
pandas_df = df_temptable.select("*").toPandas()
which results in the error:
NameError: name 'df_temptable' is not defined
How do I successfully convert between R and python dataframes and back within Databricks (I would preferably like to go from a Base R dataframe to a pandas dataframe without using any Scala and in as few steps as possible)?
From the error message NameError: name 'df_temptable' is not defined, the issue is that registerTempTable() registers the table in the Spark catalog, not as a Python variable, so df_temptable cannot be referenced by name directly in Python.
Here is an example of converting Spark DataFrames to and from pandas DataFrames:
%python
import numpy as np
import pandas as pd
# Enable Arrow-based columnar data transfers
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
# Generate a pandas DataFrame
pdf = pd.DataFrame(np.random.rand(100, 3))
# Create a Spark DataFrame from a pandas DataFrame using Arrow
df = spark.createDataFrame(pdf)
# Convert the Spark DataFrame back to a pandas DataFrame using Arrow
result_pdf = df.select("*").toPandas()
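To answer the cross-language question directly: since the temp table lives in the Spark catalog rather than in a Python variable, a minimal sketch (table name df_temptable taken from the question) is to fetch it with spark.sql() and then convert:
%python
# Read the temp table registered in the SparkR cell via the Spark catalog,
# then convert the resulting Spark DataFrame to pandas
pandas_df = spark.sql("SELECT * FROM df_temptable").toPandas()
print(pandas_df.head())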

How to use glmulti from python using rpy2?

Consider the following dataframe:
import pickle
a='pickle.loads(b\'\\x80\\x03cpandas.core.frame\\nDataFrame\\nq\\x00)\\x81q\\x01}q\\x02(X\\x05\\x00\\x00\\x00_dataq\\x03cpandas.core.internals.managers\\nBlockManager\\nq\\x04)\\x81q\\x05(]q\\x06(cpandas.core.indexes.base\\n_new_Index\\nq\\x07cpandas.core.indexes.base\\nIndex\\nq\\x08}q\\t(X\\x04\\x00\\x00\\x00dataq\\ncnumpy.core.multiarray\\n_reconstruct\\nq\\x0bcnumpy\\nndarray\\nq\\x0cK\\x00\\x85q\\rC\\x01bq\\x0e\\x87q\\x0fRq\\x10(K\\x01K\\n\\x85q\\x11cnumpy\\ndtype\\nq\\x12X\\x02\\x00\\x00\\x00O8q\\x13K\\x00K\\x01\\x87q\\x14Rq\\x15(K\\x03X\\x01\\x00\\x00\\x00|q\\x16NNNJ\\xff\\xff\\xff\\xffJ\\xff\\xff\\xff\\xffK?tq\\x17b\\x89]q\\x18(X\\x0b\\x00\\x00\\x00priceToBookq\\x19X\\x04\\x00\\x00\\x00betaq\\x1aX\\x0e\\x00\\x00\\x00price to salesq\\x1bX\\x0c\\x00\\x00\\x00gross profitq\\x1cX\\x0c\\x00\\x00\\x0052WeekChangeq\\x1dX\\n\\x00\\x00\\x00market capq\\x1eX\\x04\\x00\\x00\\x00ebitq\\x1fX\\r\\x00\\x00\\x00total revenueq X\\x0c\\x00\\x00\\x00payout ratioq!X\\x08\\x00\\x00\\x00pe ratioq"etq#bX\\x04\\x00\\x00\\x00nameq$Nu\\x86q%Rq&h\\x07cpandas.core.indexes.range\\nRangeIndex\\nq\\\'}q((h$NX\\x05\\x00\\x00\\x00startq)K\\x00X\\x04\\x00\\x00\\x00stopq*K\\x07X\\x04\\x00\\x00\\x00stepq+K\\x01u\\x86q,Rq-e]q.h\\x0bh\\x0cK\\x00\\x85q/h\\x0e\\x87q0Rq1(K\\x01K\\nK\\x07\\x86q2h\\x12X\\x02\\x00\\x00\\x00f8q3K\\x00K\\x01\\x87q4Rq5(K\\x03X\\x01\\x00\\x00\\x00<q6NNNJ\\xff\\xff\\xff\\xffJ\\xff\\xff\\xff\\xffK\\x00tq7b\\x89B0\\x02\\x00\\x00\\xd1#,\\x9b9\\x8c)#Cz\\xe5\\xd5\\x94_\\xf5?\\x92(\\x0ffn9\\xf0?\\n+\\x15TT-\\x17# \\xd5\\xb0\\xdf\\x13\\x03%#u\\xdek\\xad\\xd4\\xb8\\xfb?\\x1c\\xee#\\xb7&\\xbd\\xf3?-\\x98\\xf8\\xa3\\xa8\\xf3\\xf3?H\\xfd\\xf5\\n\\x0b\\xae\\xf1?:;\\x19\\x1c%/\\xf1?\\x9f\\x93\\xde7\\xbe\\xf6\\xf0?\\xbb}V\\x99)\\xad\\xf3?\\xae\\xbby\\xaaC.\\xf3?\\xa5,C\\x1c\\xeb\\xe2\\xf9?d\\x94g^\\x0e\\x13\\x12#\\x9e\\xc7r\\\\\\xd7i\\x06#\\xe4\\xe0\\x0c\\xddp\\xc8\\xcc?%\\x95)\\xe6 \\x18 #\\xa1\\xf4\\x85\\x90\\xf3\\x1e!#y6P\\x85\\xe4\\x89\\x0e#.\\xd9\\xc2=\\xe0\\x1b\\x0c#\\x00\\x00\\x00\\xc6\\x9e\\xe86B\\x00\\x00\\x00fF\\xb83B\\x00\\x00\\x00.\\xdc\\xb6\\x0bB\\x00\\x00\\x80\\x954\\xa5%B\\x00\\x00#\\\'1O3B\\x00\\x00\\x00\\xec\\xed58B\\x00\\x00\\x80\\t\\x93\\xa64B\\xf1\\xda\\x84\\xff\\x9d\\x82\\xd5?f\\xb8>\\x028#\\xa0?\\xc8^\\xef\\xfex/\\xb0\\xbf\\xab\\xd5\\x91\\x02\\x8f\\x18\\xd6?\\xd7\\xc05\\xfb,d\\xd6?r\\x8e\\xb6\\x01\\n\\xbb\\xc8?\\xc0\\xd52\\x00\\xf1F\\xc9?\\x00\\x00\\x00 \\x8b\\x1bqB\\x00\\x00\\x00\\xa0\\x92HKB\\x00\\x00\\x00\\x80\\xcb\\x8a B\\x00\\x00\\x00\\x00\\x98)_B\\x00\\x00\\x00`\\xca+pB\\x00\\x00\\x00\\xa0N\\xe4WB\\x00\\x00\\x00\\x00\\xc0\\x9fQB\\x00\\x00\\x00\\xc5\\x0c\\xc5-B\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00lIv\\xf0A\\x00\\x00\\x00\\xd9\\xb83\\x17B\\x00\\x00\\x80\\xa3\\x1c\\x01$B\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\xc0\\x17\\xcaINB\\x00\\x00\\x00fF\\xb83B\\x00\\x00#\\xdcq\\xaaBB\\x00\\x00\\x00\\x87h\\x00*B\\x00\\x00\\xc0\\xca\\xd3L=B\\x00\\x00\\x00\\xec\\xed58B\\x00\\x00\\x80\\t\\x93\\xa64B\\xa1\\xf81\\xe6\\xae%\\xd0?\\x8b\\xfde\\xf7\\xe4a\\xd9?\\x00\\x00\\x00\\x00\\x00\\x00\\xf8?\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\xf1\\xf4JY\\x868\\xd6?>(\\xc5\\x1ap\\xce\\xd4?\\xc3\\xf5(\\\\\\x8f\\xc2\\xcd?\\xad\\xbf%\\x00\\xff\\xe05#\\xc8$#gaG\\\'#\\x9a\\x99\\x99\\x99\\x99\\x996#{Ic\\xb4\\x8e\\x82>#^+\\xa1\\xbb$\\x8a;#UL\\xa5\\x9fp\\xbe)#0G\\x8f\\xdf\\xdb\\x84(#q8tq9ba]q:h\\x07h\\x08}q;(h\\nh\\x0bh\\x0cK\\x00\\x85q<h\\x0e\\x87q=Rq>(K\\x01K\\n\\x85q?h\\x15\\x89]q#(h\\x19h\\x1ah\\x1bh\\x1ch\\x1dh\\x1eh\\x1fh 
h!h"etqAbh$Nu\\x86qBRqCa}qDX\\x06\\x00\\x00\\x000.14.1qE}qF(X\\x04\\x00\\x00\\x00axesqGh\\x06X\\x06\\x00\\x00\\x00blocksqH]qI}qJ(X\\x06\\x00\\x00\\x00valuesqKh1X\\x08\\x00\\x00\\x00mgr_locsqLcbuiltins\\nslice\\nqMK\\x00K\\nK\\x01\\x87qNRqOuaustqPbX\\x04\\x00\\x00\\x00_typqQX\\t\\x00\\x00\\x00dataframeqRX\\t\\x00\\x00\\x00_metadataqS]qTub.\')'
a=eval(a)
a
and I want to run the glmulti function on it from Python. I tried lots of ways but failed, so as a last resort I went over to R as follows.
Export the dataset to an Excel file:
a.to_excel('test1.xlsx')
Go to RStudio and install and load the packages (note that install.packages() takes a character vector and library() loads one package at a time):
install.packages(c("glmulti", "rJava", "readxl"))
library(glmulti)
library(rJava)
library(readxl)
getwd()
setwd(".Gp\\to\\the\\python directory where you are workingin")
my_data <- read_excel("test1.xlsx", sheet = 1)
Rename the columns of the dataframe, because it does not work with the original column names:
j = 1
for (i in paste0("x", 1:length(my_data))) {
  names(my_data)[j] = i
  j = j + 1
}
Select my y variable and my x variables:
y=my_data[,6]
x=my_data[, names(my_data) != names(my_data)[6]]
Finally, I run the function I want in R:
glmulti(names(y), names(x), data=my_data, method="h")
Is there an easier way to run it from python using rpy2? If so can you please advise on this?
Consider converting the pandas data frame into an R data frame with rpy2, and then calling glmulti from the imported package just as you do now.
However, a few notes about R:
Every function or method derives from a package, much as in Python, except that Python's built-ins (e.g., list, sum, type) need no import; R's standard library packages (e.g., utils, stats, base) are loaded by default and supply the everyday functions (e.g., read.csv, head, summary).
Though you can qualify each function call with its package name, as in base::names, this is not required as it is in Python, but it is helpful in case of name collisions with other packages.
You do not need a for loop to rename all columns; you can vectorize with base::paste0 and assign with stats::setNames or base::colnames.
Python Processing
import pandas as pd
import pickle
df_py = eval('pickle.loads(...)')  # paste the full pickle string from the question
# RE-ORDER COLUMNS BY MOVING SIXTH COLUMN TO FIRST POSITION
cols = df_py.columns.to_list()
new_order = [cols[5]] + cols[0:5] + cols[6:]
df_py = df_py.reindex(new_order, axis=1)
print(df_py.head(10))
R Processing
from rpy2.robjects import pandas2ri
from rpy2.robjects.packages import importr
utils = importr('utils')
base = importr('base')
stats = importr('stats')
glmulti = importr('glmulti') # DOES NOT REQUIRE rJava PACKAGE BUT DOES REQUIRE Java LANGUAGE
# CONVERT TO R DATAFRAME
pandas2ri.activate()
df_r = pandas2ri.py2ri(df_py) # USING ABOVE PANDAS DATA FRAME
# RENAME COLUMNS y, x1, x2, x3, ...
df_r = stats.setNames(df_r, base.c("y", base.paste0("x", base.seq(1,base.length(df_r)[0]-1))))
print(utils.head(df_r, 10))
# CALL glmulti()
glmulti.glmulti(y = base.names(df_r)[0],
                xr = base.names(df_r)[1:],
                data = df_r,
                method = "h")

Xarray - concatenating slices from multiple files

I'm attempting to concatenate slices from multiple files into one array (initialized as a zeros array) and then write it to a netCDF file. However, I receive the error:
arguments without labels along dimension 'Time' cannot be aligned
because they have different dimension sizes: {365, 30}
I understand the error (isel() reduces the 'Time' dimension to the size of the slice); however, I don't know how to correct or circumvent the problem. Am I approaching this task correctly? Here's a simplified version of the first iteration:
import xarray as xr
import numpy as np
i = 0
PRCP = np.zeros((365, 327, 348))
d = xr.open_dataset("/Path")
d = d.isel(Time=slice(0, -1, 24))
P = d['CUMPRCP'].values
DinM = P.shape[0]
PRCP[i:i+DinM, :, :] = P
i = i + DinM
PRCPxr = xr.DataArray(PRCP.astype('float32'),
                      dims=['Time', 'south_north', 'west_east'])
d['DPRCP'] = PRCPxr
The problem was solved by removing the dims= argument from xr.DataArray(), which had been arbitrarily renaming the dimensions.
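A minimal sketch of that fix as described, reusing the variables from the question; with no dims= given, the new DataArray receives default dimension names rather than a 'Time' dimension of mismatched size:
# Without dims=, the DataArray gets default names (dim_0, dim_1, dim_2),
# so assigning it into the Dataset no longer forces alignment along 'Time'
PRCPxr = xr.DataArray(PRCP.astype('float32'))
d['DPRCP'] = PRCPxr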

Extracting binary data from a mixed data file

I am trying to read binary data from a mixed (ASCII and binary) data file using R; the file is constructed in a pseudo-XML format. My idea was to use the scan function, read the specific lines, and then convert the binary to numeric values, but I can't seem to do this in R. I have a Python script that does the job, but I would like to do it in R; the Python script is below. The binary section within the data file is enclosed by the start and end tags <BinData> and </BinData>.
The data file is a proprietary format containing spectroscopic data, a link to an example data file is included below. To quote the user manual:
Data of BinData elements are written as a binary array of bytes. Each
8 bytes of the binary array represent one double-precision
floating-point value. Therefore the size of the binary array is
NumberOfPoints * 8 bytes. For two-dimensional arrays, data layout
follows row-major form used by SafeArrays. This means that moving to
next array element increments the last index. For example, if a
two-dimensional array (e.g. Data(i,j)) is written in such
one-dimensional binary byte array form, moving to the next 8 byte
element of the binary array increments last index of the original
two-dimensional array (i.e. Data(i,j+1)). After the last element of
the binary array the combination of carriage return and linefeed
characters (ANSI characters 13 and 10) is written.
Thanks for any suggestions in advance!
Link to example data file:
https://docs.google.com/file/d/0B5F27d7b1eMfQWg0QVRHUWUwdk0/edit?usp=sharing
Python script:
import sys, struct, csv

f = open(sys.argv[1], 'rb')
t = f.read()

# header is everything up to and including "<BinData>\r\n"
i = t.find(b"<BinData>") + len(b"<BinData>") + 2  # +2 skips the \r\n line end
header = t[:i]

# the binary payload runs up to the closing tag
t = t[i:]
j = t.find(b"\r\n</BinData>")
bin_data = t[:j]

# each 8-byte chunk is one double-precision float
doubles = []
for k in range(len(bin_data) // 8):
    doubles.append(struct.unpack('d', bin_data[k*8:(k+1)*8])[0])

# footer starts after the \r\n that precedes </BinData>
footer = t[j+2:]

myfile = open("output.csv", 'w', newline='')
wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
wr.writerow(doubles)
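As a side note, a sketch of a more concise alternative for the unpacking step, assuming bin_data from the script above; numpy can decode the whole byte array of little-endian doubles at once:
import numpy as np
# '<f8' = little-endian 8-byte float; truncate to a multiple of 8 bytes
doubles = np.frombuffer(bin_data[:len(bin_data) // 8 * 8], dtype='<f8').tolist()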
I wrote the pack package to make this easier. You still have to search for the start/end of the binary data though.
b <- readBin("120713b01.ols", "raw", 4000)
# raw version of the start of the BinData tag
beg.raw <- charToRaw("<BinData>\r\n")
# only take the first match, in case the binary data randomly contains "<BinData>\r\n"
beg.loc <- grepRaw(beg.raw, b, fixed = TRUE)[1] + length(beg.raw)
# convert the header to text
header <- scan(text = rawToChar(b[1:beg.loc]), what = "", sep = "\n")
# search for "<Number of Points>" tags and calculate the total number of points
numPts <- prod(as.numeric(header[grep("<Number of Points>", header) + 1]))
library(pack)
Data <- unlist(unpack(rep("d", numPts), b[beg.loc:length(b)]))
