I have a dataframe containing a time series, one column being ISO 8601 datetime strings of the form 2020-12-27T23:59:59+01:00. This is a long-running time series spanning multiple timezone offset changes due to DST (for reference, the data can be found here).
I am trying to parse those into pl.Datetime via pl.col("date").str.strptime(pl.Datetime, fmt="%+")
This used to work but since version 0.15.7 of polars, this throws the following error:
exceptions.ComputeError: Different timezones found during 'strptime' operation.
I also tried an explicit format string fmt="%Y-%m-%dT%H:%M:%S%:z", which yields the same error.
Not sure if this is a bug or user error. I read the release notes for 0.15.7 on GitHub and there are some mentions of ISO 8601 parsing, but nothing that hints at why this would no longer work.
This is due to https://github.com/pola-rs/polars/pull/6434/files
Previously, the timezone was ignored when parsing with '%+'. As of 0.15.17, it is respected.
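For illustration, a minimal sketch of the failure (assuming a post-0.15.17 polars; the example strings are made up to span a DST change, and the exact error text may vary by version):
import polars as pl

# two ISO 8601 strings whose UTC offsets differ across a DST change
df = pl.DataFrame({"date": ["2020-10-25T01:59:59+02:00", "2020-10-25T03:00:00+01:00"]})

# raises ComputeError: Different timezones found during 'strptime' operation
df.select(pl.col("date").str.strptime(pl.Datetime, fmt="%+"))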
In pandas, you could get around this by doing:
In [22]: pd.to_datetime(dfp['date'], utc=True).dt.tz_convert('Europe/Vienna')
Out[22]:
0 2020-12-27 23:59:59+01:00
1 2020-12-27 23:59:59+01:00
2 2020-12-27 23:59:59+01:00
3 2020-12-27 23:59:59+01:00
4 2020-12-27 23:59:59+01:00
...
255355 2023-01-25 23:59:59+01:00
255356 2023-01-25 23:59:59+01:00
255357 2023-01-25 23:59:59+01:00
255358 2023-01-25 23:59:59+01:00
255359 2023-01-25 23:59:59+01:00
Name: date, Length: 255360, dtype: datetime64[ns, Europe/Vienna]
As of polars 0.16.0, you can do
pl.col("date").str.strptime(pl.Datetime, fmt="%+", utc=True)
I have a file named data.json. It has the following contents:
{
"ID":["1","2","3","4","5","6","7","8" ],
"Name":["Rick","Dan","Michelle","Ryan","Gary","Nina","Simon","Guru" ],
"Salary":["623.3","515.2","611","729","843.25","578","632.8","722.5" ],
"StartDate":[ "1/1/2012","9/23/2013","11/15/2014","5/11/2014","3/27/2015","5/21/2013",
"7/30/2013","6/17/2014"],
"Dept":[ "IT","Operations","IT","HR","Finance","IT","Operations","Finance"]
}
In RStudio, I have installed the 'rjson' package and have the following code:
library("rjson")
myData <- fromJSON(file="data.json")
print(myData)
As per the description of the fromJSON() function, it should read the contents of the 'data.json' file into an R object 'myData'. When I executed it, I got the following error:
Error in fromJSON(file = "data.json") :
not all data was parsed (0 chars were parsed out of a total of 3 chars)
I validated the structure of the 'data.json' file on https://jsonlint.com/. It was valid.
I searched stackoverflow.com and got the following page: Error in fromJSON("employee.json") : not all data was parsed (0 chars were parsed out of a total of 13 chars)
My program already follows the answers given there, but the 'data.json' file is still not getting parsed.
I would be grateful if you could point out what mistake I am making in the R program or JSON file as I am new to both.
Thank You.
I can confirm the error for rjson, but jsonlite::fromJSON appears to work.
jsonlite::fromJSON("data.json") |> as.data.frame()
# ID Name Salary StartDate Dept
# 1 1 Rick 623.3 1/1/2012 IT
# 2 2 Dan 515.2 9/23/2013 Operations
# 3 3 Michelle 611 11/15/2014 IT
# 4 4 Ryan 729 5/11/2014 HR
# 5 5 Gary 843.25 3/27/2015 Finance
# 6 6 Nina 578 5/21/2013 IT
# 7 7 Simon 632.8 7/30/2013 Operations
# 8 8 Guru 722.5 6/17/2014 Finance
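Since this JSON stores the salaries and start dates as strings, a possible follow-up (a sketch along the jsonlite route above) is to convert those columns to proper types:
df <- as.data.frame(jsonlite::fromJSON("data.json"))
df$Salary <- as.numeric(df$Salary)                           # "623.3" -> 623.3
df$StartDate <- as.Date(df$StartDate, format = "%m/%d/%Y")   # "1/1/2012" -> 2012-01-01
str(df)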
I have a dataframe which contains a date-and-time column. Let's name this dataframe date_time. Since the column is of factor type, I would like to convert the whole column to numerics without changing anything else, e.g. 2020-01-20 14:02:50 to 20200120140250.
I have about 1000 rows of data. Does anyone know how to produce this output? I have tried as.numeric and gsub but they don't work. I think using POSIXct might work, but I do not understand the reasoning behind it.
example of my data:
2020-07-08 21:40:26
2020-07-08 16:48:57
2020-07-01 15:54:10
2020-07-13 20:27:06
2020-07-27 16:08:12
and the list goes on.
You can try:
gsub("[[:punct:] ]", "", as.character(as.POSIXct("2020-01-20 14:02:50")))
The as.character keeps the printed representation instead of working with the underlying numeric value.
UPDATE:
date_time <- data.frame(time = as.POSIXct(
c("2020-07-08 21:40:26", "2020-07-08 16:48:57", "2020-07-01 15:54:10",
"2020-07-13 20:27:06", "2020-07-27 16:08:12", "2020-01-20 14:02:50")))
date_time$num_time <- gsub("[[:punct:] ]", "", as.character(date_time$time))
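A shorter alternative (a sketch using the same date_time frame as above) is to let format() build the digit string directly, which skips the gsub step:
date_time$num_time <- as.numeric(format(date_time$time, "%Y%m%d%H%M%S"))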
Solution with lubridate
library(lubridate)
dt1 <- as.factor(c("2020-07-08 21:40:26", "2020-07-08 16:48:57", "2020-07-01 15:54:10",
                   "2020-07-13 20:27:06", "2020-07-27 16:08:12"))
dt <- data.frame(date = ymd_hms(dt1))
dt
class(dt$date)
Result
date
1 2020-07-08 21:40:26
2 2020-07-08 16:48:57
3 2020-07-01 15:54:10
4 2020-07-13 20:27:06
5 2020-07-27 16:08:12
> class(dt$date)
[1] "POSIXct" "POSIXt"
Please help!
I have a .csv file with 4 columns: Date, VBLTX, FMAGX and SBUX. The latter three columns are adjusted closing prices of some stocks, and the Date column holds the months from Jan 1998 to Dec 2009. Here are the first couple of rows:
Date |VBLTX |FMAGX |SBUX
1/01/1998 |4.36 |44.38 |4.3
1/02/1998 |4.34 |47.74 |4.66
1/03/1998 |4.35 |47.74 |5.33
I am trying to read this into R as a zoo object that should look like this:
|VBLTX |FMAGX |SBUX
Jan 1998 |4.36 |44.38 |4.3
Feb 1998 |4.34 |47.74 |4.66
Mar 1998 |4.35 |47.74 |5.33
I have no idea how to make this work. I am currently using this line of code:
all_prices <- read.zoo("all_prices.csv", FUN = identity)
And this produces this zoo series:
|V2 |V3 |V4
Apr-00 |4.63 |73.15 |7.12
Apr-01 |5.22 |63.05 |9.11
Apr-02 |5.71 |53.88 |10.74
It appears to have sorted the csv file alphabetically rather than by date. Also, if I scroll through the zoo series, there is a row containing the column names from the csv file.
Any help would be appreciated
Thanks!
If you have "no idea" how to use a command then read the help file for it carefully -- in this case ?read.zoo. Also there is a vignette that comes with zoo entirely devoted to read.zoo examples: vignette("zoo-read") . Also reviewing ?yearmon would be useful here.
Assuming that the input file is as shown reproducibly in the Note at the end, and NOT as shown in the question, it should not have a .csv extension since it is not a CSV file; however, ignoring that, we have the following.
header = TRUE says the first line is a header, FUN = as.yearmon says we want to convert the first column to a yearmon-class time index, and format specifies its format (using the percent codes defined in ?strptime).
library(zoo)
read.zoo("all_prices.csv", header = TRUE, FUN = as.yearmon, format = "%d/%m/%Y")
giving:
VBLTX FMAGX SBUX
Jan 1998 4.36 44.38 4.30
Feb 1998 4.34 47.74 4.66
Mar 1998 4.35 47.74 5.33
Note
Lines <- "
Date VBLTX FMAGX SBUX
1/01/1998 4.36 44.38 4.3
1/02/1998 4.34 47.74 4.66
1/03/1998 4.35 47.74 5.33
"
cat(Lines, file = "all_prices.csv")
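If the file really were comma-separated (as its .csv extension suggests), the same call should work with a sep argument, which read.zoo passes through to read.table; a sketch under that assumption:
read.zoo("all_prices.csv", header = TRUE, sep = ",", FUN = as.yearmon, format = "%d/%m/%Y")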
I have a pandas.DataFrame (df) which consists of some values and a datetime column that is a string at first, which I convert to a Timestamp using
df['datetime'] = pd.to_datetime(df['Time [dd.mm.yyyy hh:mm:ss.ms]'], format="%d.%m.%Y %H:%M:%S.%f")
It seems to work and I can access the new column's elements' properties like obj.day and such, so the resulting column contains Timestamps. When I try to plot this using either pyplot.plot(df['datetime'], df['value_name']) or df.plot(x='datetime', y='value_name'), the picture below is the result. I tried converting the Timestamps using obj.to_pydatetime(), but that did not change anything. The dataframe itself is populated by data coming from csvs. What confuses me is that with certain csvs it works but with others it does not. I am pretty sure that the conversion to Timestamps was successful, but I could be wrong. Also, my time window should be from 2015-2016, not from 1981-1700. If I locate the min and max Timestamp in the DataFrame, I get the right Timestamps in 2015 and 2016 respectively.
Resulting picture from pyplot.plot
Edit:
df.head() gives:
Sweep Time [dd.mm.yyyy hh:mm:ss.ms] Frequency [Hz] Voltage [V]
0 1.0 11.03.2014 10:13:04.270 50.0252 230.529
1 2.0 11.03.2014 10:13:06.254 49.9515 231.842
2 3.0 11.03.2014 10:13:08.254 49.9527 231.754
3 4.0 11.03.2014 10:13:10.254 49.9490 231.678
4 5.0 11.03.2014 10:13:12.254 49.9512 231.719
datetime
0 2014-03-11 10:13:04.270
1 2014-03-11 10:13:06.254
2 2014-03-11 10:13:08.254
3 2014-03-11 10:13:10.254
4 2014-03-11 10:13:12.254
and df.info() gives:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 33270741 entries, 0 to 9140687
Data columns (total 5 columns):
Sweep float64
Time [dd.mm.yyyy hh:mm:ss.ms] object
Frequency [Hz] float64
Voltage [V] float64
datetime datetime64[ns]
dtypes: datetime64[ns](1), float64(3), object(1)
memory usage: 1.5+ GB
I am trying to plot 'Frequency [Hz]' vs 'datetime'.
I think you need set_index and then to set the formatting of both axes:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

df['datetime'] = pd.to_datetime(df['Time [dd.mm.yyyy hh:mm:ss.ms]'],
                                format="%d.%m.%Y %H:%M:%S.%f")
print (df)
df.set_index('datetime', inplace=True)
ax = df['Frequency [Hz]'].plot()
ticklabels = df.index.strftime('%Y-%m-%d')
ax.xaxis.set_major_formatter(ticker.FixedFormatter(ticklabels))
ax.yaxis.set_major_formatter(ticker.FormatStrFormatter('%.2f'))
plt.show()
I have a Formal Class DataFrame object that was uploaded to SparkR from MySQL (via a json file), which contains formatted strings like this:
"2012-07-02 20:14:00"
I need to convert these to a datetime type in SparkR, but this does not seem to be supported yet. Is there an undocumented function or a recipe for doing this with a UDF? (NB: I haven't actually tried creating a SparkR UDF before, so I'm grasping at straws here.)
Spark SQL doesn't support R UDFs but in this particular case you can simply cast to timestamp:
df <- createDataFrame(sqlContext,
                      data.frame(dts = c("2012-07-02 20:14:00", "2015-12-28 00:10:00")))
dfWithTimestamp <- withColumn(df, "ts", cast(df$dts, "timestamp"))
printSchema(dfWithTimestamp)
## root
## |-- dts: string (nullable = true)
## |-- ts: timestamp (nullable = true)
head(dfWithTimestamp)
## dts ts
## 1 2012-07-02 20:14:00 2012-07-02 20:14:00
## 2 2015-12-28 00:10:00 2015-12-28 00:10:00
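If the strings were in some non-default format instead, one possible sketch (assuming SparkR 1.5+, where unix_timestamp is exposed; the format pattern here is illustrative) would be to parse explicitly and cast the resulting epoch seconds:
# hypothetical explicit-format variant
dfWithTs <- withColumn(df, "ts",
                       cast(unix_timestamp(df$dts, "yyyy-MM-dd HH:mm:ss"), "timestamp"))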