I have a string column that uses "," as a delimiter, and I want to split this column into multiple rows.
Here's the table:
|Token          |Shop|
|---------------|----|
|a              |P   |
|A10,A9a,C1a,F1 |R   |
Expected Output:
|Token|Shop|
|-----|----|
|a    |P   |
|A10  |R   |
|A9a  |R   |
|C1a  |R   |
|F1   |R   |
I tried the logic below using mv-expand, but it doesn't seem to work:
datatable(Tokens:string, Shop:string)["a", "P",
"A10,A9a,C1a,F1", "R" ]
| mv-expand Token = todynamic(Tokens) to typeof(string)
You can use split() to turn the string into an array before mv-expand:
datatable(Tokens:string, Shop:string)["a","P","A10,A9a,C1a,F1","R" ]
| mv-expand token = split(Tokens, ",") to typeof(string)
I have a column 'Apples' in an Azure table that contains this string: "Colour:red,Size:small".
Current situation:
|-----------------------|
| Apples |
|-----------------------|
| Colour:red,Size:small |
|-----------------------|
Desired Situation:
|----------------|
| Colour | Size |
|----------------|
| Red | small |
|----------------|
Please help
I'll answer the title question, as I noticed many people searched for a solution.
The key here is the mv-expand operator (it expands multi-value dynamic arrays or property bags into multiple records):
datatable (str:string)["aaa,bbb,ccc", "ddd,eee,fff"]
| project splitted=split(str, ',')
| mv-expand col1=splitted[0], col2=splitted[1], col3=splitted[2]
| project-away splitted
The project-away operator allows us to select which columns from the input to exclude from the output.
Result:
+--------------------+
| col1 | col2 | col3 |
+--------------------+
| aaa | bbb | ccc |
| ddd | eee | fff |
+--------------------+
This query gave me the desired results:
| parse Apples with "Colour:" AppColour ", Size:" AppSize
Remember to include all the different delimiters preceding each word you want to extract, e.g. ", Size". Mind the space in between.
This helped me, and then I used my intuition to customize the query according to my needs:
https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/parseoperator
I want to delete the empty columns from the dataframe.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Test_Parquet").master("local[*]").getOrCreate()
names = spark.read.csv("name.csv", header="true", inferSchema="true")
names.show()
This is the dataframe created from the name.csv file:
+-------+---+---+---+-----+----+
| Name| 1|Age| 3|Class| _c5|
+-------+---+---+---+-----+----+
|Diwakar| | 25| | 12|null|
|Prabhat| | 27| | 15|null|
| Zyan| | 30| | 17|null|
| Jack| | 35| | 21|null|
+-------+---+---+---+-----+----+
Spark by default named the empty columns 1, 3 and _c5. Can we stop Spark from giving default names to these columns?
I want a dataframe like the one given below:
+-------+---+---+---+-----+----+
| Name| |Age| |Class| |
+-------+---+---+---+-----+----+
|Diwakar| | 25| | 12|null|
|Prabhat| | 27| | 15|null|
| Zyan| | 30| | 17|null|
| Jack| | 35| | 21|null|
+-------+---+---+---+-----+----+
and I want to remove the empty columns in one go, like:
temp = list(set(names.columns))
temp.remove(" ")
names = names.select(temp)
names.show()
+-------+---+-----+
| Name|Age|Class|
+-------+---+-----+
|Diwakar| 25| 12|
|Prabhat| 27| 15|
| Zyan| 30| 17|
| Jack| 35| 21|
+-------+---+-----+
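For what it's worth, here is a minimal sketch (not from the original post) that drops the auto-named columns in one go, assuming the unwanted columns are exactly the ones with purely numeric headers (e.g. "1", "3") or Spark's default "_cN" names:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Test_Parquet").master("local[*]").getOrCreate()
names = spark.read.csv("name.csv", header="true", inferSchema="true")

# Keep only columns with a real header; drop purely numeric headers ("1", "3")
# and the default "_cN" names Spark assigns to unnamed columns.
keep = [c for c in names.columns if not c.strip().isdigit() and not c.startswith("_c")]
names = names.select(keep)
names.show()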
I have a dataset in which I paste values together in a dplyr chain and collapse them with the pipe character (e.g. " | "). If any of the values in the dataset are blank, I just get repeated pipe characters in the pasted list.
Some of the values look like this, for example:
badstring = "| | | | | | GHOULSBY,SCROGGINS | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | CAT,JOHNSON | | | | | | | | | | | | BURGLAR,PALA | | | | | | | | |"
I want to match all the pipes that occur more than once and delete them, so that just the names appear like so:
correctstring = "| GHOULSBY,SCROGGINS | CAT,JOHNSON | |BURGLAR,PALA |"
I tried the following, but to no avail:
mutate(names = gsub('[\\|]{2,}', '', name_list))
The difficulty in this question is in formulating a regex which can selectively remove every pipe, except the ones we want to remain as actual separators between terms. We can match on the following pattern:
\|\s+(?=\|)
and then replace with just an empty string. This pattern will remove any pipe (and any following whitespace) so long as what follows is another pipe. A removal will not occur when a pipe is followed by an actual term, or when it is followed by the end of the string.
badstring = "| | | | | | GHOULSBY,SCROGGINS | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | CAT,JOHNSON | | | | | | | | | | | | BURGLAR,PALA | | | | | | | | |"
result <- gsub("\\|\\s+(?=\\|)", "", badstring, perl=TRUE)
result
[1] "| GHOULSBY,SCROGGINS | CAT,JOHNSON | BURGLAR,PALA |"
Edit:
If you expect inputs like | | | which are devoid of any terms, and you would expect an empty string as the output, then my solution would fail. I don't see an obvious way to modify the above regex, but you can handle this case with one more call to sub:
result <- sub("^\\|$", "", result)
We also might be able to modify the original pattern to use an alternation covering all cases:
result <- gsub("\\|\\s+(?=\\|)|(?:^\\|$)", "", badstring, perl=TRUE)
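For reference, the same pattern works unchanged in other regex engines. Here is a small Python sketch (my own, using an abbreviated input, not part of the R answer above) exercising both the base pattern and the term-free edge case:
import re

badstring = "| | | | GHOULSBY,SCROGGINS | | | | CAT,JOHNSON | | | BURGLAR,PALA | | |"

# Remove any pipe (plus trailing whitespace) that is immediately followed by another pipe.
print(re.sub(r"\|\s+(?=\|)", "", badstring))
# "| GHOULSBY,SCROGGINS | CAT,JOHNSON | BURGLAR,PALA |"

# Edge case: an input with no terms collapses to a single "|", which a
# second substitution turns into an empty string.
leftover = re.sub(r"\|\s+(?=\|)", "", "| | | |")
print(re.sub(r"^\|$", "", leftover))
# ""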
I have a DataFrame containing a "time" column, and I want to add a new column containing the period number after dividing the day into 30-minute periods.
For example, the original DataFrame:
l = [('A','2017-01-13 00:30:00'),('A','2017-01-13 00:00:01'),('E','2017-01-13 14:00:00'),('E','2017-01-13 12:08:15')]
df = spark.createDataFrame(l,['test','time'])
df1 = df.select(df.test,df.time.cast('timestamp'))
df1.show()
+----+-------------------+
|test| time|
+----+-------------------+
| A|2017-01-13 00:30:00|
| A|2017-01-13 00:00:01|
| E|2017-01-13 14:00:00|
| E|2017-01-13 12:08:15|
+----+-------------------+
The desired DataFrame is as follows:
+----+-------------------+------+
|test| time|period|
+----+-------------------+------+
| A|2017-01-13 00:30:00| 2|
| A|2017-01-13 00:00:01| 1|
| E|2017-01-13 14:00:00| 29|
| E|2017-01-13 12:08:15| 25|
+----+-------------------+------+
Are there ways to achieve that?
You can simply utilize the hour and minute inbuilt functions, together with the when inbuilt function, to get your final result:
from pyspark.sql import functions as F
df1.withColumn('period', (F.hour(df1['time'])*2)+1+(F.when(F.minute(df1['time']) >= 30, 1).otherwise(0))).show(truncate=False)
You should get:
+----+---------------------+------+
|test|time |period|
+----+---------------------+------+
|A |2017-01-13 00:30:00.0|2 |
|A |2017-01-13 00:00:01.0|1 |
|E |2017-01-13 14:00:00.0|29 |
|E |2017-01-13 12:08:15.0|25 |
+----+---------------------+------+
I hope the answer is helpful
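An equivalent way to express the same 30-minute bucketing (my own sketch, reusing the df1 defined in the question) is to compute the minute of the day and integer-divide by 30:
from pyspark.sql import functions as F

# period = 1-based index of the 30-minute bucket within the day:
# minutes elapsed since midnight, integer-divided by 30, plus 1.
df1.withColumn(
    "period",
    (F.floor((F.hour("time") * 60 + F.minute("time")) / 30) + 1).cast("integer")
).show(truncate=False)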
This question already has answers here:
Cast column containing multiple string date formats to DateTime in Spark
I have a requirement to convert raw dates to timestamps.
Data:
id,date,date1,date2,date3
1,161129,19960316,992503,20140205
2,961209,19950325,992206,20140503
3,110620,19960522,991610,20131302
4,160928,19930506,992205,20160112
5,021002,20000326,991503,20131112
6,160721,19960909,991212,20151511
7,160721,20150101,990809,20140809
8,100903,20151212,990605,20011803
9,070713,20170526,990702,19911010
Here I have columns "date", "date1", "date2" and "date3", where the dates are in string format. Generally I convert a raw date using unix_timestamp("<col>","<format>").cast("timestamp"), but now I don't want to mention the format; I want a dynamic method, because a few more columns may get added to my table later, and in that case a static method won't work well.
In some columns we have 6-character dates, where the first 2 characters represent the year and the next 4 represent the day and month, i.e. yyddmm or yymmdd.
In other columns we have 8-character dates, where the first 4 characters represent the year and the next 4 represent the day and month, i.e. yyyyddmm or yyyymmdd.
Each column has one consistent format, which needs to be detected dynamically and converted to a timestamp without hard-coding.
The output should be timestamps:
+---+-------------------+-------------------+-------------------+-------------------+
| id| date| date1| date2| date3|
+---+-------------------+-------------------+-------------------+-------------------+
| 1|2016-11-29 00:00:00|1996-03-16 00:00:00|1999-03-25 00:00:00|2014-05-02 00:00:00|
| 2|1996-12-09 00:00:00|1995-03-25 00:00:00|1999-06-22 00:00:00|2014-03-05 00:00:00|
| 3|2011-06-20 00:00:00|1996-05-22 00:00:00|1999-10-16 00:00:00|2013-02-13 00:00:00|
| 4|2016-09-28 00:00:00|1993-05-06 00:00:00|1999-05-22 00:00:00|2016-12-01 00:00:00|
| 5|2002-10-02 00:00:00|2000-03-26 00:00:00|1999-03-15 00:00:00|2013-12-11 00:00:00|
| 6|2016-07-21 00:00:00|1996-09-09 00:00:00|1999-12-12 00:00:00|2015-11-15 00:00:00|
| 7|2016-07-21 00:00:00|2015-01-01 00:00:00|1999-09-08 00:00:00|2014-09-08 00:00:00|
| 8|2010-09-03 00:00:00|2015-12-12 00:00:00|1999-05-06 00:00:00|2001-03-18 00:00:00|
| 9|2007-07-13 00:00:00|2017-05-26 00:00:00|1999-02-07 00:00:00|1991-10-10 00:00:00|
+---+-------------------+-------------------+-------------------+-------------------+
For the above requirement, I have given some conditions in a UDF to find the format of each date column.
from pyspark.sql.functions import udf, unix_timestamp
from pyspark.sql.types import StringType

def udf_1(x):
    if len(x) == 6 and int(x[-2:]) > 12: return "yyMMdd"
    elif len(x) == 8 and int(x[-2:]) > 12: return "yyyyMMdd"
    elif len(x) == 6 and int(x[2:4]) < 12 and int(x[-2:]) > 12: return "yyMMdd"
    elif len(x) == 8 and int(x[4:6]) < 12 and int(x[-2:]) > 12: return "yyyyMMdd"
    elif len(x) == 6 and int(x[2:4]) > 12 and int(x[-2:]) < 12: return "yyddMM"
    elif len(x) == 8 and int(x[4:6]) > 12 and int(x[-2:]) < 12: return "yyyyddMM"
    elif len(x) == 6 and int(x[2:4]) <= 12 and int(x[-2:]) <= 12: return "N"
    elif len(x) == 8 and int(x[4:6]) <= 12 and int(x[-2:]) <= 12: return "NA"
    else: return "null"

udf_2 = udf(udf_1, StringType())
c1 = c.withColumn("date_formate",udf_2("date"))
c2 = c1.withColumn("date1_formate",udf_2("date1"))
c3 = c2.withColumn("date2_formate",udf_2("date2"))
c4 = c3.withColumn("date3_formate",udf_2("date3"))
c4.show()
With the specified conditions, I have extracted formats for some rows; in the cases where both the day and month are <= 12 (so the format is ambiguous), I have given "N" for 6 characters and "NA" for 8 characters.
+------+--------+------+---------+---+------------+-------------+-------------+-------------+
| date| date1| date2| date3| id|date_formate|date1_formate|date2_formate|date3_formate|
+------+--------+------+---------+---+------------+-------------+-------------+-------------+
|161129|19960316|992503| 20140205| 1| yyMMdd| yyyyMMdd| yyddMM| NA|
|961209|19950325|992206| 20140503| 2| N| yyyyMMdd| yyddMM| NA|
|110620|19960522|991610| 20131302| 3| yyMMdd| yyyyMMdd| yyddMM| yyyyddMM|
|160928|19930506|992205| 20160112| 4| yyMMdd| NA| yyddMM| NA|
|021002|20000326|991503| 20131112| 5| N| yyyyMMdd| yyddMM| NA|
|160421|19960909|991212| 20151511| 6| yyMMdd| NA| N| yyyyddMM|
|160721|20150101|990809| 20140809| 7| yyMMdd| NA| N| NA|
|100903|20151212|990605| 20011803| 8| N| NA| N| yyyyddMM|
|070713|20170526|990702|19911010 | 9| yyMMdd| yyyyMMdd| N| yyyyddMM|
+------+--------+------+---------+---+------------+-------------+-------------+-------------+
Now I have taken the extracted format, stored it in a variable, and passed that variable to unix_timestamp to convert the raw date to a timestamp.
r1 = c4.where(~c4.date_formate.isin('NA', 'N'))[['date_formate']].first().date_formate
t_s = unix_timestamp("date", r1).cast("timestamp")
c5 = c4.withColumn("date", t_s)
r2 = c5.where(~c5.date1_formate.isin('NA', 'N'))[['date1_formate']].first().date1_formate
t_s1 = unix_timestamp("date1", r2).cast("timestamp")
c6 = c5.withColumn("date1", t_s1)
r3 = c6.where(~c6.date2_formate.isin('NA', 'N'))[['date2_formate']].first().date2_formate
t_s2 = unix_timestamp("date2", r3).cast("timestamp")
c7 = c6.withColumn("date2", t_s2)
r4 = c7.where(~c7.date3_formate.isin('NA', 'N'))[['date3_formate']].first().date3_formate
t_s3 = unix_timestamp("date3", r4).cast("timestamp")
c8 = c7.withColumn("date3", t_s3)
c8.select("id","date","date1","date2","date3").show()
Output
+---+-------------------+-------------------+-------------------+-------------------+
| id| date| date1| date2| date3|
+---+-------------------+-------------------+-------------------+-------------------+
| 1|2016-11-29 00:00:00|1996-03-16 00:00:00|1999-03-25 00:00:00|2014-05-02 00:00:00|
| 2|1996-12-09 00:00:00|1995-03-25 00:00:00|1999-06-22 00:00:00|2014-03-05 00:00:00|
| 3|2011-06-20 00:00:00|1996-05-22 00:00:00|1999-10-16 00:00:00|2013-02-13 00:00:00|
| 4|2016-09-28 00:00:00|1993-05-06 00:00:00|1999-05-22 00:00:00|2016-12-01 00:00:00|
| 5|2002-10-02 00:00:00|2000-03-26 00:00:00|1999-03-15 00:00:00|2013-12-11 00:00:00|
| 6|2016-07-21 00:00:00|1996-09-09 00:00:00|1999-12-12 00:00:00|2015-11-15 00:00:00|
| 7|2016-07-21 00:00:00|2015-01-01 00:00:00|1999-09-08 00:00:00|2014-09-08 00:00:00|
| 8|2010-09-03 00:00:00|2015-12-12 00:00:00|1999-05-06 00:00:00|2001-03-18 00:00:00|
| 9|2007-07-13 00:00:00|2017-05-26 00:00:00|1999-02-07 00:00:00|1991-10-10 00:00:00|
+---+-------------------+-------------------+-------------------+-------------------+
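Since more date columns may be added later, the same per-column steps can also be written as a loop over whichever columns need converting. A rough sketch (my own, reusing c4 from above; the date_cols list is an assumption standing in for however the columns are discovered):
from pyspark.sql.functions import unix_timestamp

date_cols = ["date", "date1", "date2", "date3"]  # assumed; could also be derived from the schema

result = c4
for col in date_cols:
    # Take the first unambiguous format detected for this column (skip the "N"/"NA" rows).
    fmt = (result
           .where(~result[col + "_formate"].isin("NA", "N"))
           .select(col + "_formate")
           .first()[0])
    result = result.withColumn(col, unix_timestamp(col, fmt).cast("timestamp"))

result.select(["id"] + date_cols).show()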