CamembertForSequenceClassification: training is not working

I am trying to adapt a notebook based on Hugging Face models: Text Classification on GLUE (https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/text_classification.ipynb#scrollTo=YZbiBDuGIrId)
My goal is to classify sentences into 16 predefined classes.
I followed the notebook step by step. My data looks like this:
id  data    label    langue
0   text_1  label_1  Français
0   text_2  label_2  Français
1   text_3  label_3  Français
import pandas as pd
import numpy as np
from datasets import load_dataset, load_metric, DatasetDict, Features, Value, ClassLabel, Dataset
I have a labeldict like this:
{'label_1': 0,
'label_2': 1,
...}
dataset = load_dataset('csv', sep="|", data_files={"train" : train_paths, "test" : test_paths})
output:
DatasetDict({
    train: Dataset({
        features: ['id', 'data', 'label', 'langue'],
        num_rows: ...
    })
    test: Dataset({
        features: ['id', 'data', 'label', 'langue'],
        num_rows: ...
    })
})
I did everything up to that point as in the notebook, and then I try to run this:
trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset[validation_key],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    callbacks=[MLflowCallback()]
)
trainer.train()
I get this error:
The following columns in the training set don't have a corresponding argument in CamembertForSequenceClassification.forward and have been ignored: langue, id, data.
IndexError: tuple index out of range
What can I do?
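One likely cause (not stated in the post, so take it as an assumption): the label column still holds strings such as 'label_1', while CamembertForSequenceClassification expects integer class ids in [0, num_labels). A minimal sketch of the encoding step, using the labeldict above; the 'camembert-base' checkpoint name is also an assumption:
from transformers import CamembertForSequenceClassification

def encode_labels(example):
    # Replace the string label with the integer id the model expects
    example['label'] = labeldict[example['label']]
    return example

dataset = dataset.map(encode_labels)

# num_labels must match the 16 predefined classes
model = CamembertForSequenceClassification.from_pretrained('camembert-base', num_labels=16)
After this, the warning about langue, id and data being ignored is expected behaviour (the Trainer drops columns that forward does not accept) and is not the error itself.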

Related

When I run the program, the graph from def update_graphs will not populate. Can anyone see anything small that prevents that function from working?

from jupyter_plotly_dash import JupyterDash
import dash
import dash_leaflet as dl
import dash_core_components as dcc
import dash_html_components as html
import plotly.express as px
import dash_table
from dash.dependencies import Input, Output
import base64
import os
import numpy as np
import pandas as pd
from pymongo import MongoClient
from Module import AnimalShelter
username = "username"
password = "password"
animal = AnimalShelter(username, password)
df = pd.DataFrame.from_records(animal.readAll({}))
#########################
# Dashboard Layout / View
#########################
app = JupyterDash('Dash DataTable Only')
image_filename = 'Grazioso Salvare Logo.png' # customer image
encoded_image = base64.b64encode(open(image_filename, 'rb').read())
app.layout = html.Div([
    html.Center(html.Img(src='data:image/png;base64,{}'.format(encoded_image.decode()))),
    html.Center(html.B(html.H1('Kristopher Collis'))),
    html.Hr(),
    html.Div(
        # Radio items to select the rescue filter options
        dcc.RadioItems(
            id='filter-type',
        ),
    ),
    html.Hr(),
    dash_table.DataTable(
        id='datatable-id',
        columns=[
            {"name": i, "id": i, "deletable": False, "selectable": True} for i in df.columns
        ],
        data=df.to_dict('records'),
        editable=False,
        filter_action="native",
        sort_action="native",
        sort_mode="multi",
        column_selectable=False,
        row_selectable="multi",
        row_deletable=False,
        selected_columns=[],
        selected_rows=[],
        page_action="native",
        page_current=0,
        page_size=10,
    ),
    html.Hr(),
    html.Div(className='row',
             style={'display': 'flex'},
             children=[
                 html.Div(
                     id='graph-id',
                     className='col s12 m6'
                 ),
                 html.Div(
                     id='map-id',
                     className='col s12 m6'
                 ),
             ]
             ),
    html.Br(),
    html.Hr(),
])
#############################################
# Interaction Between Components / Controller
#############################################
# Callback for pie chart
@app.callback(
    Output('graph-id', "children"),
    [Input('datatable-id', "derived_viewport_data")])
# Function for update_graphs
def update_graphs(viewData):
    dff = pd.DataFrame.from_dict(viewData)
    names = dff['breed'].value_counts().keys().tolist()
    values = dff['breed'].value_counts().tolist()
    return [
        dcc.Graph(
            id='graph-id',
            fig=px.pie(data_frame=dff, values=values, names=names,
                       color_discrete_sequence=px.colors.sequential.RdBu, width=800, height=500
                       )
        )
    ]
# Callback for update_map
@app.callback(
    Output('map-id', "children"),
    [Input('datatable-id', "derived_viewport_data"),
     Input('datatable-id', 'selected_rows'),
     Input('datatable-id', 'selected_columns')])
# Update function with variables
def update_map(viewData, selected_rows, selected_columns):
    dff = pd.DataFrame.from_dict(viewData)
    # Width, height, center, and zoom level of the map
    return [dl.Map(style={'width': '1000px', 'height': '500px'}, center=[30.75, -97.48], zoom=7,
                   children=[dl.TileLayer(id="base-layer-id"),
                             # Marker with tooltip and popup
                             dl.Marker(position=[dff.iloc[selected_rows[0], 13], dff.iloc[selected_rows[0], 14]], children=[
                                 dl.Tooltip(dff.iloc[selected_rows[0], 4]),
                                 dl.Popup([
                                     html.H4("Animal Name"),
                                     html.P(dff.iloc[selected_rows[0], 9]),
                                 ])
                             ])
                             ])
            ]
app
When I run the program, the geolocation map populates but the graph does not. I was able to populate the graph a couple of times while digging through the Plotly documentation, but I have spent a while trying to figure out why it will not display again. I did attempt fig.show() at the bottom of the update_graphs function; I am not sure whether that was what made it work. I am respectfully requesting help finding the error in the update_graphs function.
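One detail stands out in update_graphs: dcc.Graph takes its Plotly figure through the figure property, not fig, and Dash components raise a TypeError for unexpected keyword arguments, which would abort this callback while leaving the map callback untouched. Giving the inner Graph the same id as its container Div ('graph-id') can also trigger duplicate-id errors. A minimal sketch of the function with those two changes, everything else as posted:
def update_graphs(viewData):
    dff = pd.DataFrame.from_dict(viewData)
    names = dff['breed'].value_counts().keys().tolist()
    values = dff['breed'].value_counts().tolist()
    return [
        dcc.Graph(
            # 'figure=' is the documented property; the duplicate id is dropped
            figure=px.pie(data_frame=dff, values=values, names=names,
                          color_discrete_sequence=px.colors.sequential.RdBu,
                          width=800, height=500)
        )
    ]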

How to add a new column to a PySpark DataFrame with dictionary values?

I was trying to add a new column to my existing DataFrame in PySpark. My DataFrame looks as follows, and I was trying to adapt this post:
Pyspark: Replacing value in a column by searching a dictionary
Fruit
Orange
Orange
Apple
Banana
Apple
The code I was trying is like this:
from pyspark.sql import functions as F
from itertools import chain
simple_dict = {'Orange': 'OR', 'Apple': 'AP', 'Banana': 'BN'}
mapping_expr = F.create_map([F.lit(x) for x in F.chain(*simple_dict.items())])
def addCols(data):
    data = (data.withColumn('Fruit_code', mapping_expr[data['Fruit']]))
    return data
Expected output:
Fruit Fruit_code
Orange OR
Orange OR
Apple AP
Banana BN
Apple AP
I'm getting the error below. I know it's because of the functions module F, but I don't know how to fix it. Can someone help me?
FILE "/MYPROJECT/DATASETS/DERIVED/OPPORTUNITY_WON.PY", LINE 8, IN <MODULE>
MAPPING_EXPR = CREATE_MAP([LIT(X) FOR X IN CHAIN(*SIMPLE_DICT.ITEMS())])
FILE "/MYPROJECT/DATASETS/DERIVED/OPPORTUNITY_WON.PY", LINE 8, IN <LISTCOMP>
MAPPING_EXPR = CREATE_MAP([LIT(X) FOR X IN CHAIN(*SIMPLE_DICT.ITEMS())])
I have modified your code snippet to get it working. The root cause is that chain lives in itertools, not in pyspark.sql.functions, so F.chain does not exist; chain(*simple_dict.items()) flattens the dictionary into key1, value1, key2, value2, ..., which F.create_map then consumes pairwise.
from pyspark.sql import functions as F
from itertools import chain
simple_dict = {'Orange': 'OR', 'Apple': 'AP', 'Banana': 'BN'}
mapping_expr = F.create_map([F.lit(x) for x in chain(*simple_dict.items())])
def addCols(data):
    data = data.withColumn('Fruit_code', mapping_expr[data['Fruit']])
    return data
data = spark.createDataFrame([("Orange", ), ("Apple", ), ("Banana", ), ], ("Fruit", ))
new_data = addCols(data)
new_data.show()
Output
+------+----------+
| Fruit|Fruit_code|
+------+----------+
|Orange| OR|
| Apple| AP|
|Banana| BN|
+------+----------+
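One caveat: mapping_expr[data['Fruit']] yields NULL for any fruit that is not a key in simple_dict. If that can occur in your data, a fallback code can be supplied; the 'UNKNOWN' literal below is a hypothetical placeholder:
data = data.withColumn(
    'Fruit_code',
    F.coalesce(mapping_expr[F.col('Fruit')], F.lit('UNKNOWN')))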

PySpark: how to sum values and produce the top 10

I have a csv file with two fields, a key and a value:
{1Y4dZ123eAMGooBmVzBLUWEZ2JfCCUY91},8.530366
{1YdZ123433MGooBmVzBLUWEZ1234CUY91},8.530366
{1YdZ2344AMGooBmVzBLUWE123JfCCUY91},8.530366
{1YdECDNthiMGooBmVzBLUWEZ2JfCCUY91},8.530366
{1YdZDNHqeAMGooBmVzBLUWEZ2JfCCUY91},8.530366
{1YdZDNHqeAMGooBDJTdBLUWEZ2JfCCUY91},8.530366
{1YdZDNHqeAMGooBmVzBLUWEZ2JfCCUY91},8.530366
{1YdZ123qeAMGooBmVzBLUWEZ2JfCCUY91},8.530366
{1YdZDNHqeAMGooBmVzBLUWEZ2JfCCUY91},8.530366
{1YdZDNHqeAMGooBm123LUWEZ2JfCCUY91},8.530366
{17RJgv5ujkFerSd48Akdd2GneUAW47nphQ},20.0
{17RJgv5ujkFerSd48Akdd2GneUAW47nphQ},20.0
{17RJgv5ujkFerSd48Akdd2GneUAW47nphQ},20.0
{13uZ6tSr5oh1ui9Hd1tEqJKo2AHhJ6JdFS},0.03895804
What I'm trying to do is sum the second column, grouped by the first column, and then derive the 10 keys with the highest totals.
Below is the code I've tried, but I get a 'tuple index out of range' error:
import re
import pyspark
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql.session import SparkSession
sc = pyspark.SparkContext()
spark = SparkSession(sc)
voutFile = sc.textFile("input/voutfiltered.csv")
features = voutFile.map(lambda l: (l.split(',')[0], float(l.split(',')[1])))
top10 = features.takeOrdered(10, key=lambda x: -x[2])
for record in top10:
    print("{}: {};{}".format(record[0], record[1], record[2]))
Any particular reason why you're not using the DataFrame API? It's much more flexible, convenient and faster than the RDD API.
import pyspark.sql.functions as f
# The file has no header row, so name the two columns explicitly
df = (spark.read.format("csv")
      .option("header", "false")
      .load("/path/to/your/file.csv")
      .toDF("key_col", "value_col"))
(df.groupBy(f.col("key_col"))
 .agg(f.sum(f.col("value_col").cast("double")).alias("sum_value_col"))
 .sort(f.col("sum_value_col").desc())
 .limit(10)
 .show())

Python ArcPy - Print Layer with highest field value

I have some python code that goes through layers in my ArcGIS project and prints out the layer names and their corresponding highest value within the field "SUM_USER_VisitCount".
Output Picture
What I want the code to do is only print out the layer name and SUM_USER_VisitCount field value for the one layer with the absolute highest value.
Desired Output
I have been unable to figure out how to achieve this and can't find anything online either. Can someone help me achieve my desired output?
Sorry if the code layout is a little weird. It got messed up when I pasted it into the "code sample"
Here is my code:
import arcpy
import datetime
from datetime import timedelta
import time
# Document start time in order to calculate run time
time1 = time.clock()
# Assign project and map frame
p = arcpy.mp.ArcGISProject(r'E:\arcGIS_Shared\Python\CumulativeHeatMaps.aprx')
m = p.listMaps('Map')[0]
Markets = [3000]
# Centers to loop through
CA_Centers = ['Castro', 'ColeValley', 'Excelsior', 'GlenPark',
              'LowerPacificHeights', 'Marina', 'NorthBeach', 'RedwoodCity', 'SanBruno',
              'DalyCity']
for Market in Markets:
    print(Market)
    for CA_Center in CA_Centers:
        Layers = m.listLayers("CumulativeSumWithin{0}_{1}_Jun2018".format(Market, CA_Center))
        fields = ['SUM_USER_VisitCount']
        for Layer in Layers:
            print(Layer)
            sqlClause = (None, 'ORDER BY ' + 'SUM_USER_VisitCount')  # + ' DESC'
            with arcpy.da.SearchCursor(in_table=Layer, field_names=fields,
                                       sql_clause=sqlClause) as searchCursor:
                print(max(searchCursor))
You can create a dictionary that stores the result from each query and then print out the highest one at the end.
results_dict = {}
for Market in Markets:
    print(Market)
    for CA_Center in CA_Centers:
        Layers = m.listLayers("CumulativeSumWithin{0}_{1}_Jun2018".format(Market, CA_Center))
        fields = ['SUM_USER_VisitCount']
        for Layer in Layers:
            print(Layer)
            sqlClause = (None, 'ORDER BY ' + 'SUM_USER_VisitCount')  # + ' DESC'
            with arcpy.da.SearchCursor(in_table=Layer, field_names=fields,
                                       sql_clause=sqlClause) as searchCursor:
                # Store the maximum once; a cursor is exhausted after one pass,
                # so calling max(searchCursor) a second time would fail
                results_dict[Layer] = max(searchCursor)
                print(results_dict[Layer])
# Get the key for the dictionary item with the highest value
highest_count_layer = max(results_dict, key=results_dict.get)
print(highest_count_layer)
print(results_dict[highest_count_layer])
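Note that SearchCursor rows come back as tuples, so each stored value is a one-element tuple such as (1234,). Tuples compare element-wise, so max() over the dictionary still picks the layer with the highest count; use results_dict[highest_count_layer][0] if you want the bare number.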

How to update holoviews Bars using an ipywidgets SelectionRangeSlider?

I want to select data from a pandas DataFrame in a Jupyter notebook through a SelectionRangeSlider and plot the filtered data as a holoviews bar chart.
Consider the following example:
import numpy as np
import pandas as pd
import datetime
import holoviews as hv
hv.extension('bokeh')
import ipywidgets as widgets
start = int(datetime.datetime(2017,1,1).strftime("%s"))
end = int(datetime.datetime(2017,12,31).strftime("%s"))
size = 100
rints = np.random.randint(start, end + 1, size = size)
df = pd.DataFrame(rints, columns = ['zeit'])
df["bytes"] = np.random.randint(5,20,size=size)
df['who']= np.random.choice(['John', 'Paul', 'George', 'Ringo'], len(df))
df["zeit"] = pd.to_datetime(df["zeit"], unit='s')
df.zeit = df.zeit.dt.date
df.sort_values('zeit', inplace = True)
df = df.reset_index(drop=True)
df.head(2)
This gives the test DataFrame df:
Let's group the data:
data = pd.DataFrame(df.groupby('who')['bytes'].sum())
data.reset_index(level=0, inplace=True)
data.sort_values(by="bytes", inplace=True)
data.head(2)
Now, create the SelectionRangeSlider that is to be used to filter and update the barchart.
%%opts Bars [width=800 height=400 tools=['hover']]
def view2(v):
    x = df[(df.zeit > r2.value[0].date()) & (df.zeit < r2.value[1].date())]
    data = pd.DataFrame(x.groupby('who')['bytes'].sum())
    data.sort_values(by="bytes", inplace=True)
    data.reset_index(inplace=True)
    display(hv.Bars(data, kdims=['who'], vdims=['bytes']))
r2 = widgets.SelectionRangeSlider(options=options, index=index, description='Test')
widgets.interactive(view2, v=r2)
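(options and index are not defined in the snippet above; a plausible construction from the date column, assumed here for completeness, would be:)
options = [(d.strftime('%d.%m.%Y'), pd.Timestamp(d)) for d in sorted(df.zeit.unique())]
index = (0, len(options) - 1)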
(I have already created an issue on github for the slider not displaying the label correctly, https://github.com/jupyter-widgets/ipywidgets/issues/1759)
Problems that persist:
- the image width and size collapse to the default after the first update (is there a way to pass %%opts as an argument to hv.Bars?)
- the y-scale should remain constant (i.e. from 0 to 150 for all updates)
- is there any optimization possible concerning the speed of updates?
Thanks for any help.
Figured out how to do it using bokeh: https://github.com/bokeh/bokeh/issues/7082
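For the first two points, a sketch that skips the %%opts cell magic entirely, assuming a HoloViews version in which the .opts method and redim.range are available:
def view2(v):
    x = df[(df.zeit > r2.value[0].date()) & (df.zeit < r2.value[1].date())]
    data = x.groupby('who')['bytes'].sum().reset_index()
    bars = hv.Bars(data, kdims=['who'], vdims=['bytes'])
    # Attach plot options to the element and pin the y-axis range on every update
    bars = bars.opts(width=800, height=400, tools=['hover']).redim.range(bytes=(0, 150))
    display(bars)
Because the options travel with the element itself, they survive each redraw instead of collapsing back to the defaults.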
