I have a datetime column in a pandas DataFrame. With this function:
data['yearMonth'] = data.ts_placed.map(lambda x: '{year}-{month}'.format(year=x.year,month=x.month))
I convert the datetime object from
2012-08-06 10:25:39
to
2012-8
What I need is to get the value as
2012-08
You could use string formatting:
data['yearMonth'] = data.ts_placed.map(lambda x: '{year}-{month:02}'.format(year=x.year,month=x.month))
or, if x is a pandas Timestamp or datetime.datetime, use strftime:
data['yearMonth'] = data.ts_placed.map(lambda x: x.strftime('%Y-%m'))
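If ts_placed is already a datetime64 column, pandas can also apply strftime to the whole Series at once via the .dt accessor, which avoids the Python-level map. A minimal sketch, assuming the column name from the question:
import pandas as pd

data = pd.DataFrame({'ts_placed': pd.to_datetime(['2012-08-06 10:25:39'])})
data['yearMonth'] = data.ts_placed.dt.strftime('%Y-%m')  # -> '2012-08'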
I tried to make a tasks.loop() that checks for muted users who need to be unmuted, but I'm running into a few problems. I can't use fetchall(), because it gives me this error:
toremove = muteremove[2]
IndexError: list index out of range
If I use fetchone() it only fetches one user every 10 seconds; how do I fetch all the rows every 10 seconds so I can unmute every user who needs it?
Also, if I use fetchone() it says it can't convert a str into a datetime.datetime object. How can I fix this?
@tasks.loop(seconds=10)
async def muted_user_check(self):
    self.cur.execute("SELECT userId, guildId, expiredAt FROM mutedlist")
    muteremove = self.cur.fetchall()
    if muteremove is None:
        print("No user to unmute :D")
    if muteremove is not None:
        toremove = muteremove[2]
        timenow = datetime.utcnow()
        if timenow > toremove:
            self.cur.execute(f"DELETE FROM mutedlist WHERE guildId = {muteremove[1]} and userId = {muteremove[0]}")
To convert a string into a datetime object, you can use the strptime() method:
from datetime import datetime
def convert(date, format):
    return datetime.strptime(date, format)
[input] convert('22/08/2020', '%d/%m/%Y')
[output] 2020-08-22 00:00:00
The output will be a datetime object that you can format with the strftime() method like so:
#Example
from datetime import datetime
now = datetime.now() #now will be a datetime object
now.strftime('%d/%m/%Y - %H:%M:%S') # DD/MM/YYYY - hours:minutes:seconds
Here's a list of some formats:
%A → Weekday name (%a for abbreviations and %w for numbers)
%-d → Day of the month (1, 2, 3, 4, ...)
%B → Month name (%b for abbreviations and %-m for numbers)
%I → Hour (12h clock)
%p → AM or PM
%H → Hour (24h clock)
%M → Minutes
%S → Seconds
%f → Microseconds
%c → Local date and time representation
Using your code, it would be:
@tasks.loop(seconds=10)
async def muted_user_check(self):
    self.cur.execute("SELECT * FROM mutedlist")
    mute_list = self.cur.fetchall()
    if not mute_list:
        print("No user to unmute :D")
    else:
        timeNow = datetime.utcnow()
        # fetchall() returns a list of rows, so loop over every mute entry
        for mute in mute_list:
            # the expiry is stored as a string, so parse it back into a datetime
            muteExpire = datetime.strptime(mute[3], '%Y-%m-%d %H:%M:%S')
            if timeNow > muteExpire:
                self.cur.execute("DELETE FROM mutedlist WHERE guildId=? AND userId=?", (mute[0], mute[1]))
I'm not sure why my datetime is being printed the way it is. I'm expecting the format "%Y-%M-%D" to give me 2020-05-11.
import datetime
from pyspark.sql.functions import *
currentdate = datetime.datetime.now().strftime("%Y-%M-%D")
print(currentdate)
Output:
2020-09-05/11/20
Try %Y-%m-%d instead of %Y-%M-%D: %M is minutes and %D expands to %m/%d/%y, which is why you get 2020-09-05/11/20.
currentdate = datetime.datetime.now().strftime("%Y-%m-%d")
print (currentdate)
#2020-05-11
#or using spark sql
currentdate=spark.sql("select string(current_date)").collect()[0][0]
print(currentdate)
#2020-05-11
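To see why the original format string produced that output, here is a small illustration with a made-up time (note that %D is passed straight to the platform's C library, so it may not be available everywhere):
import datetime

ts = datetime.datetime(2020, 5, 11, 14, 9, 30)  # hypothetical: 14:09:30 on 2020-05-11
print(ts.strftime("%Y-%M-%D"))  # 2020-09-05/11/20  (%M is minutes, %D is %m/%d/%y)
print(ts.strftime("%Y-%m-%d"))  # 2020-05-11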
I have a csv file with two fields, a key and a value:
{1Y4dZ123eAMGooBmVzBLUWEZ2JfCCUY91},8.530366
{1YdZ123433MGooBmVzBLUWEZ1234CUY91},8.530366
{1YdZ2344AMGooBmVzBLUWE123JfCCUY91},8.530366
{1YdECDNthiMGooBmVzBLUWEZ2JfCCUY91},8.530366
{1YdZDNHqeAMGooBmVzBLUWEZ2JfCCUY91},8.530366
{1YdZDNHqeAMGooBDJTdBLUWEZ2JfCCUY91},8.530366
{1YdZDNHqeAMGooBmVzBLUWEZ2JfCCUY91},8.530366
{1YdZ123qeAMGooBmVzBLUWEZ2JfCCUY91},8.530366
{1YdZDNHqeAMGooBmVzBLUWEZ2JfCCUY91},8.530366
{1YdZDNHqeAMGooBm123LUWEZ2JfCCUY91},8.530366
{17RJgv5ujkFerSd48Akdd2GneUAW47nphQ},20.0
{17RJgv5ujkFerSd48Akdd2GneUAW47nphQ},20.0
{17RJgv5ujkFerSd48Akdd2GneUAW47nphQ},20.0
{13uZ6tSr5oh1ui9Hd1tEqJKo2AHhJ6JdFS},0.03895804
What I'm trying to do is sum up the second column and group by the first column, then derive the top 10 keys with the highest values.
Below is the code I've tried using but I get a 'tuple index out of range' error:
import re
import pyspark
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql.session import SparkSession

sc = pyspark.SparkContext()
spark = SparkSession(sc)

voutFile = sc.textFile("input/voutfiltered.csv")
features = voutFile.map(lambda l: (l.split(',')[0], float(l.split(',')[1])))

top10 = features.takeOrdered(10, key=lambda x: -x[2])
for record in top10:
    print("{}: {};{}".format(record[0], record[1], record[2]))
Any particular reason why you're not using the DataFrame API? It's much more flexible, convenient and faster than the RDD API.
import pyspark.sql.functions as f

# The sample file has no header row, so read it without one and name the columns yourself.
df = (spark.read.format("csv")
      .option("header", "false")
      .load("/path/to/your/file.csv")
      .toDF("key_col", "value_col"))

(df.groupBy(f.col("key_col"))
   .agg(f.sum(f.col("value_col").cast("double")).alias("sum_value_col"))
   .sort(f.col("sum_value_col").desc())
   .limit(10)
   .show())
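For completeness, the 'tuple index out of range' in the original RDD code comes from key=lambda x: -x[2]: each record is a 2-tuple (key, value), so the value is at index 1, not 2. A minimal sketch of the RDD approach that also sums the values per key before taking the top 10, keeping the file path from the question:
from pyspark.sql.session import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

pairs = sc.textFile("input/voutfiltered.csv") \
          .map(lambda l: (l.split(',')[0], float(l.split(',')[1])))

# Sum the values per key, then take the 10 keys with the largest totals.
top10 = pairs.reduceByKey(lambda a, b: a + b) \
             .takeOrdered(10, key=lambda x: -x[1])

for key, total in top10:
    print("{}: {}".format(key, total))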
Edit: Apologies, the sample data frame is a little off. Below is the corrected sample dataframe I'm trying to convert:
Timestamp (CST)
12/8/2018 05:23 PM
11/29/2018 10:20 PM
I tried the following code based on the recommendation below, but it returned null values.
df = df.withColumn('Timestamp (CST)_2', from_unixtime(unix_timestamp(col(('Timestamp (CST)')), "yyyy/MM/dd hh:mm:ss aa"), "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"))
df = df.withColumn("Timestamp (CST)_3", F.to_timestamp(F.col("Timestamp (CST)_2")))
--------------------------------------------------------------------------------
I have a field called "Timestamp (CST)" that is a string. It is in Central Standard Time.
Timestamp (CST)
2018-11-21T5:28:56 PM
2018-11-21T5:29:16 PM
How do I create a new column that takes "Timestamp (CST)", changes it to UTC, and converts it to a datetime on the 24-hour clock?
Below is my desired table and I would like the datatype to be timestamp:
Timestamp (CST)_2
2018-11-21T17:28:56.000Z
2018-11-21T17:29:16.000Z
I tried the following code but all the results came back null:
df = df.withColumn("Timestamp (CST)_2", to_timestamp("Timestamp (CST)", "yyyy/MM/dd h:mm p"))
Firstly, import from_unixtime, unix_timestamp and col using
from pyspark.sql.functions import from_unixtime, unix_timestamp, col
Then, reconstructing your scenario in a DataFrame df_time
>>> cols = ['Timestamp (CST)']
>>> vals = [
... ('2018-11-21T5:28:56 PM',),
... ('2018-11-21T5:29:16 PM',)]
>>> df_time = spark.createDataFrame(vals, cols)
>>> df_time.show(2, False)
+---------------------+
|Timestamp (CST) |
+---------------------+
|2018-11-21T5:28:56 PM|
|2018-11-21T5:29:16 PM|
+---------------------+
Then, my approach would be
>>> df_time_twenfour = df_time.withColumn('Timestamp (CST)', \
... from_unixtime(unix_timestamp(col(('Timestamp (CST)')), "yyyy-MM-dd'T'hh:mm:ss aa"), "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"))
>>> df_time_twenfour.show(2, False)
+------------------------+
|Timestamp (CST) |
+------------------------+
|2018-11-21T17:28:56.000Z|
|2018-11-21T17:29:16.000Z|
+------------------------+
Notes
If you want the time in 24-hour format, use HH instead of hh.
Since you have an AM/PM marker, use aa in yyyy-MM-dd'T'hh:mm:ss aa to parse it.
Your input string has a literal T in it, so you have to include it in the format, as shown above.
The option aa mentioned in #pyy4917's answer might raise legacy time-parser errors on newer Spark versions. To fix it, replace aa with a.
The full code is below:
df_time_twenfour = df_time.withColumn('Timestamp (CST)',
    from_unixtime(unix_timestamp(col('Timestamp (CST)'), "yyyy-MM-dd'T'hh:mm:ss a"),
                  "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"))
I need to standardise and compare date/time fields that are in different timezones, e.g. how do you find the time difference between the following two times?
"18-05-2012 09:29:41 +0800"
"18-05-2012 09:29:21 +0900"
What's the best way to initialise standard variables with the date/time?
The output needs to display the difference and the normalised data in a timezone (e.g. +0100) that is different from both the incoming values and the local environment.
Expected Output:
18-05-2012 02:29:41 +0100
18-05-2012 01:29:21 +0100
Difference: 01:00:20
import java.text.SimpleDateFormat
def dates = ["18-05-2012 09:29:41 +0800",
"18-05-2012 09:29:21 +0900"].collect{
new SimpleDateFormat("dd-MM-yyyy HH:mm:ss Z").parse(it)
}
def dayDiffFormatter = new SimpleDateFormat("HH:mm:ss")
dayDiffFormatter.setTimeZone(TimeZone.getTimeZone("UTC"))
println dates[0]
println dates[1]
println "Difference "+dayDiffFormatter.format(new Date(dates[0].time-dates[1].time))
wow. doesn't look readable, does it?
Or, use the JodaTime package
#Grab( 'joda-time:joda-time:2.1' )
import org.joda.time.*
import org.joda.time.format.*
String a = "18-05-2012 09:29:41 +0800"
String b = "18-05-2012 09:29:21 +0900"
DateTimeFormatter dtf = DateTimeFormat.forPattern( "dd-MM-yyyy HH:mm:ss Z" );
def start = dtf.parseDateTime( a )
def end = dtf.parseDateTime( b )
assert 1 == Hours.hoursBetween( end, start ).hours
Solution:
Groovy/Java Date objects are stored as the number of milliseconds since 1970 and so do not contain any timezone information directly
Use the Date.parse method to initialise the new date from a string in the specified format
Use the SimpleDateFormat class to specify the required output format
Use SimpleDateFormat.setTimeZone to specify the timezone of the output data
By using the Europe/London timezone rather than GMT it will automatically adjust for daylight saving time
See here for a full list of the options for date time patterns
import java.text.SimpleDateFormat
import java.text.DateFormat
//Initialise the dates by parsing to the specified format
Date timeDate1 = new Date().parse("dd-MM-yyyy HH:mm:ss Z","18-05-2012 09:29:41 +0800")
Date timeDate2 = new Date().parse("dd-MM-yyyy HH:mm:ss Z","18-05-2012 09:29:21 +0900")
DateFormat yearTimeformatter = new SimpleDateFormat("dd-MM-yyyy HH:mm:ss Z")
DateFormat dayDifferenceFormatter = new SimpleDateFormat("HH:mm:ss") // All time differences will be less than a day
// The output should contain the format in UK time (including day light savings if necessary)
yearTimeformatter.setTimeZone(TimeZone.getTimeZone("Europe/London"))
// Set to UTC. This is to store only the difference so we don't want the formatter making further adjustments
dayDifferenceFormatter.setTimeZone(TimeZone.getTimeZone("UTC"))
// Calculate difference by first converting to the number of milliseconds
msDiff = timeDate1.getTime() - timeDate2.getTime()
Date differenceDate = new Date(msDiff)
println yearTimeformatter.format(timeDate1)
println yearTimeformatter.format(timeDate2)
println "Difference " + dayDifferenceFormatter.format(differenceDate)