How to add a new column to a PySpark dataframe with dictionary values?

I was trying to add a new column to my existing data frame in PySpark. My data frame looks as follows, and I was trying to do it with the help of this post:
Pyspark: Replacing value in a column by searching a dictionary
Fruit
Orange
Orange
Apple
Banana
Apple
The code I was trying is like this:
from pyspark.sql import functions as F
from itertools import chain

simple_dict = {'Orange': 'OR', 'Apple': 'AP', 'Banana': 'BN'}
mapping_expr = F.create_map([F.lit(x) for x in F.chain(*simple_dict.items())])

def addCols(data):
    data = data.withColumn('Fruit_code', mapping_expr[data['Fruit']])
    return data
Expected output:
Fruit Fruit_code
Orange OR
Orange OR
Apple AP
Banana BN
Apple AP
I'm getting the error below. I know it's because of the function F, but I don't know how to fix it. Can someone help me?
FILE "/MYPROJECT/DATASETS/DERIVED/OPPORTUNITY_WON.PY", LINE 8, IN <MODULE>
MAPPING_EXPR = CREATE_MAP([LIT(X) FOR X IN CHAIN(*SIMPLE_DICT.ITEMS())])
FILE "/MYPROJECT/DATASETS/DERIVED/OPPORTUNITY_WON.PY", LINE 8, IN <LISTCOMP>
MAPPING_EXPR = CREATE_MAP([LIT(X) FOR X IN CHAIN(*SIMPLE_DICT.ITEMS())])

I have modified your code snippet to get it working: chain comes from itertools, not pyspark.sql.functions, so it is called as chain(...) rather than F.chain(...).
from pyspark.sql import functions as F
from itertools import chain

simple_dict = {'Orange': 'OR', 'Apple': 'AP', 'Banana': 'BN'}
mapping_expr = F.create_map([F.lit(x) for x in chain(*simple_dict.items())])

def addCols(data):
    data = data.withColumn('Fruit_code', mapping_expr[data['Fruit']])
    return data

data = spark.createDataFrame([("Orange", ), ("Apple", ), ("Banana", ), ], ("Fruit", ))
new_data = addCols(data)
new_data.show()
Output
+------+----------+
| Fruit|Fruit_code|
+------+----------+
|Orange| OR|
| Apple| AP|
|Banana| BN|
+------+----------+
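Note that any fruit value missing from simple_dict will come back as null from the map lookup. If a fallback code is wanted instead, one option (a sketch; the 'NA' default and the addColsWithDefault name are only for illustration) is to wrap the lookup in F.coalesce:

from pyspark.sql import functions as F

def addColsWithDefault(data):
    # unmatched fruits return null from the map lookup; coalesce substitutes 'NA'
    return data.withColumn(
        'Fruit_code',
        F.coalesce(mapping_expr[data['Fruit']], F.lit('NA'))
    )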

Related

CamembertForSequenceClassification : training is not working

I am trying to use and adapt a notebook based on Hugging Face models: Text Classification on GLUE (https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/text_classification.ipynb#scrollTo=YZbiBDuGIrId)
My goal is to classify a sentence into one of 16 predefined classes.
So I followed the notebook. My data looks like this:
id data label langue
0 text_1 label_1 Français
0 text_2 label_2 Français
1 text_3 label_3 Français
import pandas as pd
import numpy as np
from datasets import load_dataset, load_metric, DatasetDict, Features, Value, ClassLabel, Dataset
I have a labeldict like this
{'label_1': 0,
'label_2': 1,
...}
dataset = load_dataset('csv', sep="|", data_files={"train" : train_paths, "test" : test_paths})
output:
DatasetDict({
    train: Dataset({
        features: ['id', 'data', 'label', 'langue'],
        num_rows: ...
    })
    test: Dataset({
        features: ['id', 'data', 'label', 'langue'],
        num_rows: ...
    })
})
I did everything before this as in the notebook, and when I try to do this:
trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset[validation_key],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    callbacks=[MLflowCallback()]
)
trainer.train()
I get this error: The following columns in the training set don't have a corresponding argument in CamembertForSequenceClassification.forward and have been ignored: langue, id, data. IndexError: tuple index out of range
What can I do?
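One thing worth checking (an assumption, not a confirmed fix): CamembertForSequenceClassification expects the label column to contain integer class ids, so the labeldict above has to be applied before training. A minimal sketch using datasets.map, where label2id stands for the labeldict shown above and 'label' is the column name listed in the DatasetDict:

# hypothetical sketch: encode string labels as integer ids before tokenization/training
# label2id is the labeldict shown above, e.g. {'label_1': 0, 'label_2': 1, ...}
def encode_label(example):
    example["label"] = label2id[example["label"]]
    return example

dataset = dataset.map(encode_label)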

Bokeh LabelSet x axis being datetime

I am new to Bokeh and looking for a way to label each data point. Replicating the examples shown in the documentation, I could not find a solution when the x axis is datetime.
import pandas as mypd
from bokeh.models import LabelSet, ColumnarDataSource
from bokeh.plotting import figure, output_file, show

date_1 = ['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04', '2020-01-05']
sal = mypd.DataFrame(date_1)
sal.columns = ["Date_1"]
sal['Sales'] = [15, 25, 36, 17, 4]
sal['Date_1'] = mypd.to_datetime(sal['Date_1'])

p = figure(x_axis_type="datetime")
p.line(x=sal['Date_1'], y=sal['Sales'])
lab = LabelSet(x=sal['Date_1'], y=sal['Sales'], text=sal['Sales'])
p.add_layout(lab)
show(p)
It is throwing the error
ValueError: expected an element of either String, Dict(Enum('expr', 'field', 'value', 'transform'), Either(String, Instance(Transform), Instance(Expression), Float)) or Float, got 0 2020-01-01
I understand the error is because the LabelSet x coordinate takes numerical data. Is my understanding correct?
If yes, what is the workaround?
I looked at similar questions but could not find a solution for my case.
Similar Query
And this
The simplest solution is to just use a common data source. It also prevents you from embedding the data twice.
import pandas as pd
from bokeh.models import LabelSet, ColumnDataSource
from bokeh.plotting import figure, show
sal = (pd.DataFrame({'Date_1': pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04', '2020-01-05']),
                     'Sales': [15, 25, 36, 17, 4]})
       .set_index('Date_1'))
ds = ColumnDataSource(sal)
p = figure(x_axis_type="datetime")
p.line(x='Date_1', y='Sales', source=ds)
lab = LabelSet(x='Date_1', y='Sales', text='Sales', source=ds)
p.add_layout(lab)
show(p)

How to sum values by key and produce the top 10 using PySpark

I have a csv file with two fields, a key and a value:
{1Y4dZ123eAMGooBmVzBLUWEZ2JfCCUY91},8.530366
{1YdZ123433MGooBmVzBLUWEZ1234CUY91},8.530366
{1YdZ2344AMGooBmVzBLUWE123JfCCUY91},8.530366
{1YdECDNthiMGooBmVzBLUWEZ2JfCCUY91},8.530366
{1YdZDNHqeAMGooBmVzBLUWEZ2JfCCUY91},8.530366
{1YdZDNHqeAMGooBDJTdBLUWEZ2JfCCUY91},8.530366
{1YdZDNHqeAMGooBmVzBLUWEZ2JfCCUY91},8.530366
{1YdZ123qeAMGooBmVzBLUWEZ2JfCCUY91},8.530366
{1YdZDNHqeAMGooBmVzBLUWEZ2JfCCUY91},8.530366
{1YdZDNHqeAMGooBm123LUWEZ2JfCCUY91},8.530366
{17RJgv5ujkFerSd48Akdd2GneUAW47nphQ},20.0
{17RJgv5ujkFerSd48Akdd2GneUAW47nphQ},20.0
{17RJgv5ujkFerSd48Akdd2GneUAW47nphQ},20.0
{13uZ6tSr5oh1ui9Hd1tEqJKo2AHhJ6JdFS},0.03895804
What I'm trying to do is sum up the second column and group by the first column, then derive the top 10 keys with the highest values.
Below is the code I've tried using but I get a 'tuple index out of range' error:
import re
import pyspark
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql.session import SparkSession

sc = pyspark.SparkContext()
spark = SparkSession(sc)

voutFile = sc.textFile("input/voutfiltered.csv")
features = voutFile.map(lambda l: (l.split(',')[0], float(l.split(',')[1])))
top10 = features.takeOrdered(10, key=lambda x: -x[2])
for record in top10:
    print("{}: {};{}".format(record[0], record[1], record[2]))
Any particular reason why you're not using the DataFrame API? It's much more flexible, convenient and faster than the RDD API.
import pyspark.sql.functions as f

df = spark.read.format("csv").option("header", "true").load("/path/to/your/file.csv/")
(df.groupBy(f.col("key_col"))
   .agg(f.sum(f.col("value_col")).alias("sum_value_col"))
   .sort(f.col("sum_value_col").desc())
   .limit(10)
   .show())
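For completeness, the original RDD approach can also be made to work: the 'tuple index out of range' comes from indexing x[2] on two-element tuples, and the per-key sum was missing. A sketch of a fixed RDD version (same file path and column layout as in the question):

import pyspark
from pyspark.sql.session import SparkSession

sc = pyspark.SparkContext()
spark = SparkSession(sc)

voutFile = sc.textFile("input/voutfiltered.csv")
# (key, value) pairs: these tuples only have indices 0 and 1
features = voutFile.map(lambda l: (l.split(',')[0], float(l.split(',')[1])))

# sum the values per key, then take the 10 keys with the largest sums
totals = features.reduceByKey(lambda a, b: a + b)
top10 = totals.takeOrdered(10, key=lambda x: -x[1])
for key, total in top10:
    print("{}: {}".format(key, total))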

How to update holoviews Bars using an ipywidgets SelectionRangeSlider?

I want to select data from some pandas DataFrame in a Jupyter-notebook through a SelectionRangeSlider and plot the filtered data using holoviews bar chart.
Consider the following example:
import numpy as np
import pandas as pd
import datetime
import holoviews as hv
hv.extension('bokeh')
import ipywidgets as widgets
start = int(datetime.datetime(2017,1,1).strftime("%s"))
end = int(datetime.datetime(2017,12,31).strftime("%s"))
size = 100
rints = np.random.randint(start, end + 1, size = size)
df = pd.DataFrame(rints, columns = ['zeit'])
df["bytes"] = np.random.randint(5,20,size=size)
df['who']= np.random.choice(['John', 'Paul', 'George', 'Ringo'], len(df))
df["zeit"] = pd.to_datetime(df["zeit"], unit='s')
df.zeit = df.zeit.dt.date
df.sort_values('zeit', inplace = True)
df = df.reset_index(drop=True)
df.head(2)
This gives the test DataFrame df.
Let's group the data:
data = pd.DataFrame(df.groupby('who')['bytes'].sum())
data.reset_index(level=0, inplace=True)
data.sort_values(by="bytes", inplace=True)
data.head(2)
Now, create the SelectionRangeSlider that is to be used to filter and update the barchart.
%%opts Bars [width=800 height=400 tools=['hover']]
def view2(v):
    x = df[(df.zeit > r2.value[0].date()) & (df.zeit < r2.value[1].date())]
    data = pd.DataFrame(x.groupby('who')['bytes'].sum())
    data.sort_values(by="bytes", inplace=True)
    data.reset_index(inplace=True)
    display(hv.Bars(data, kdims=['who'], vdims=['bytes']))

r2 = widgets.SelectionRangeSlider(options=options, index=index, description='Test')
widgets.interactive(view2, v=r2)
(I have already created an issue on github for the slider not displaying the label correctly, https://github.com/jupyter-widgets/ipywidgets/issues/1759)
Problems that persist:
the image width and size collapse to the default after the first update (is there a way to pass %%opts as an argument to hv.Bars? see the sketch below)
the y scale should remain constant (i.e. from 0 to 150 for all updates)
is there any optimization possible concerning speed of updates?
Thanks for any help.
Figured out how to do it using bokeh: https://github.com/bokeh/bokeh/issues/7082
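Regarding the first two points above, plot options can also be attached to the element itself instead of the %%opts cell magic, and the y range can be pinned so the scale stays constant across updates. A minimal sketch (assuming a reasonably recent HoloViews release) that would replace the display(...) call inside view2:

bars = hv.Bars(data, kdims=['who'], vdims=['bytes'])
# apply the plot options directly on the element instead of the %%opts magic,
# and fix the y range so the scale does not change between updates
bars = bars.opts(width=800, height=400, tools=['hover']).redim.range(bytes=(0, 150))
display(bars)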

Spark: How to translate count(distinct(value)) into the DataFrame API

I'm trying to compare different ways to aggregate my data.
This is my input data with 2 elements (page,visitor):
(PAG1,V1)
(PAG1,V1)
(PAG2,V1)
(PAG2,V2)
(PAG2,V1)
(PAG1,V1)
(PAG1,V2)
(PAG1,V1)
(PAG1,V2)
(PAG1,V1)
(PAG2,V2)
(PAG1,V3)
Using a SQL query in Spark SQL with this code:
import sqlContext.implicits._

case class Log(page: String, visitor: String)

val logs = data.map(p => Log(p._1, p._2)).toDF()
logs.registerTempTable("logs")

val sqlResult = sqlContext.sql(
  """select page
       ,count(distinct visitor) as visitor
     from logs
     group by page
  """)
val result = sqlResult.map(x => (x(0).toString, x(1).toString))
result.foreach(println)
I get this output:
(PAG1,3) // PAG1 has been visited by 3 different visitors
(PAG2,2) // PAG2 has been visited by 2 different visitors
Now, I would like to get the same result using DataFrames and their API, but I can't get the same output:
import sqlContext.implicits._
case class Log(page: String, visitor: String)
val logs = data.map(p => Coppia(p._1,p._2)).toDF()
val result = log.select("page","visitor").groupBy("page").count().distinct
result.foreach(println)
In fact, that's what I get as output:
[PAG1,8] // just the simple page count for every page
[PAG2,4]
What you need is the DataFrame aggregation function countDistinct:
import sqlContext.implicits._
import org.apache.spark.sql.functions._

case class Log(page: String, visitor: String)

val logs = data.map(p => Log(p._1, p._2)).toDF()

val result = logs.select("page", "visitor")
  .groupBy('page)
  .agg('page, countDistinct('visitor))

result.foreach(println)
You can use dataframe's groupBy command twice to do so. Here, df1 is your original input.
val df2 = df1.groupBy($"page",$"visitor").agg(count($"visitor").as("count"))
This command would produce the following result:
page  visitor  count
----  -------  -----
PAG2  V2       2
PAG1  V3       1
PAG1  V1       5
PAG1  V2       2
PAG2  V1       2
Then use the groupBy command again to get the final result.
df2.groupBy($"page").agg(count($"visitor").as("count"))
Final output:
page count
---- ----
PAG1 3
PAG2 2
I think in the newer versions of Spark it is easier. The following is tested with 2.4.0.
1. First, create an array of sample data.
val myArr = Array(
("PAG1","V1"),
("PAG1","V1"),
("PAG2","V1"),
("PAG2","V2"),
("PAG2","V1"),
("PAG1","V1"),
("PAG1","V2"),
("PAG1","V1"),
("PAG1","V2"),
("PAG1","V1"),
("PAG2","V2"),
("PAG1","V3")
)
2. Create a DataFrame
val logs = spark.createDataFrame(myArr)
.withColumnRenamed("_1","page")
.withColumnRenamed("_2","visitor")
3. Now aggregate with the countDistinct Spark SQL function
import org.apache.spark.sql.{functions => F}

logs.groupBy("page")
  .agg(F.countDistinct("visitor").as("visitor"))
  .show()
4. Expected result:
+----+-------+
|page|visitor|
+----+-------+
|PAG1| 3|
|PAG2| 2|
+----+-------+
Use this if you want to display the distinct values of a column
display(sparkDF.select('columnName').distinct())
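For reference, the same distinct-count aggregation can be written in PySpark as well (a sketch, assuming a DataFrame with page and visitor columns as above):

from pyspark.sql import functions as F

# distinct visitor count per page, mirroring the Scala answers above
logs.groupBy("page").agg(F.countDistinct("visitor").alias("visitor")).show()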
