I have a formal class 'DataFrame' object that was uploaded to SparkR from MySQL (via a JSON file), which contains formatted strings like this:
"2012-07-02 20:14:00"
I need to convert these to a datetime type in SparkR, but that does not seem to be supported yet. Is there an undocumented function, or a recipe for doing this with a UDF? (N.B. I haven't actually tried creating a SparkR UDF before, so I'm grasping at straws here.)
Spark SQL doesn't support R UDFs, but in this particular case you can simply cast the column to timestamp:
df <- createDataFrame(sqlContext,
                      data.frame(dts = c("2012-07-02 20:14:00", "2015-12-28 00:10:00")))
dfWithTimestamp <- withColumn(df, "ts", cast(df$dts, "timestamp"))
printSchema(dfWithTimestamp)
## root
## |-- dts: string (nullable = true)
## |-- ts: timestamp (nullable = true)
head(dfWithTimestamp)
## dts ts
## 1 2012-07-02 20:14:00 2012-07-02 20:14:00
## 2 2015-12-28 00:10:00 2015-12-28 00:10:00
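Once the column is a real timestamp you can use SparkR's column functions on it. For example, a minimal sketch (assuming SparkR 1.5+, where year() and timestamp comparisons are available; untested):
# keep rows after a cutoff, then pull the year out of the timestamp
recent <- filter(dfWithTimestamp, dfWithTimestamp$ts > "2015-01-01 00:00:00")
withYear <- withColumn(recent, "year", year(recent$ts))
head(withYear)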
Building off of this question (Retrieve modified DateTime of a file from an FTP Server), it's clear how to get the date-modified value. However, the full date is not returned, even though it's visible on the FTP site.
This shows how to get the date-modified values for files at ftp://ftp.FreeBSD.org/pub/FreeBSD/:
library(curl)
library(stringr)
con <- curl("ftp://ftp.FreeBSD.org/pub/FreeBSD/")
dat <- readLines(con)
close(con)
# drop directory entries (lines starting with "d" in the permissions field)
no_dirs <- grep("^d", dat, value=TRUE, invert=TRUE)
# strip the fixed-width permission/owner/size fields (first 43 characters)
date_and_name <- sub("^[[:alnum:][:punct:][:blank:]]{43}", "", no_dirs)
# drop the trailing file name, keeping only the date portion
dates <- sub('\\s[[:alpha:][:punct:]]+$', '', date_and_name)
dates
## [1] "May 07 2015" "Apr 22 15:15" "Apr 22 10:00"
Some dates come back with month, day, and year; others with month, day, and hour:minute.
Looking at the FTP site in a browser, all dates are shown with a full date and time.
I assume it's got something to do with the Unix ls listing convention (explained in "FTP details command doesn't seem to return the year the file was modified, is there a way around this?"). It would be nice to get the full date.
If you use download.file you get an HTML representation of the directory, which you can parse with the xml2 package.
read_ftp <- function(url) {
  tmp <- tempfile()
  download.file(url, tmp, quiet = TRUE)
  # the directory listing arrives as an HTML page
  html <- xml2::read_html(readChar(tmp, 1e6))
  file.remove(tmp)
  lines <- strsplit(xml2::xml_text(html), "[\n\r]+")[[1]]
  # keep only lines that contain a full mm/dd/yyyy date
  lines <- grep("(\\d{2}/){2}\\d{4}", lines, value = TRUE)
  result <- read.table(text = lines, stringsAsFactors = FALSE)
  setNames(result, c("Date", "Time", "Size", "File"))
}
Which allows you to just do this:
read_ftp("ftp://ftp.FreeBSD.org/pub/FreeBSD/")
#> Date Time Size File
#> 1 05/07/2015 12:00AM 4,259 README.TXT
#> 2 04/22/2020 08:00PM 35 TIMESTAMP
#> 3 04/22/2020 08:00PM Directory development
#> 4 04/22/2020 10:00AM 2,325 dir.sizes
#> 5 11/12/2017 12:00AM Directory doc
#> 6 11/12/2017 12:00AM Directory ports
#> 7 04/22/2020 08:00PM Directory releases
#> 8 11/09/2018 12:00AM Directory snapshots
Created on 2020-04-22 by the reprex package (v0.3.0)
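If you then want a real date-time column rather than two strings, the Date and Time fields parse with base R (a small sketch, assuming an English locale for the AM/PM marker):
listing <- read_ftp("ftp://ftp.FreeBSD.org/pub/FreeBSD/")
# "%I:%M%p" handles the 12-hour clock with the AM/PM suffix
listing$Modified <- as.POSIXct(paste(listing$Date, listing$Time),
                               format = "%m/%d/%Y %I:%M%p")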
I want to insert data from an R data frame into a MySQL table.
Everything works fine except for the column geburtsdatum, which is of type timestamp.
The class of the column geburtsdatum in the data frame is "POSIXct" "POSIXt".
The result in the database is always 0000-00-00 00:00:00.
Here is my R session:
library(XLConnect)
excel.file <- file.path("c:/path/test.xlsx")
elements <- readWorksheetFromFile(excel.file, sheet=1)
elements
name nummer geburtsdatum
1 Anton 1 1967-05-11
2 Berti 2 1964-05-14
3 Conni 3 1967-01-01
4 Det 4 1967-01-01
5 Edi 5 1967-01-01
6 Fritzchen 6 1967-01-01
class(elements$geburtsdatum)
[1] "POSIXct" "POSIXt"
library(RMySQL)
library(DBI)
con <- dbConnect(RMySQL::MySQL(), host = "127.0.0.1", user = "root", password = "xxxx", dbname = "test")
dbWriteTable(
+ conn = con,
+ name='testdaten3',
+ value = elements,
+ row.names = FALSE,
+ append = TRUE,
+ field.types = c(
+ name = "varchar(45)",
+ nummer = "tinyint",
+ geburtsdatum = 'timestamp'
+ )
+ )
[1] TRUE
--- end of R session ---
MySQL database table testdaten3:
id name nummer geburtsdatum
1 Anton 1 0000-00-00 00:00:00
2 Berti 2 0000-00-00 00:00:00
3 Conni 3 0000-00-00 00:00:00
4 Det 4 0000-00-00 00:00:00
5 Edi 5 0000-00-00 00:00:00
6 Fritzchen 6 0000-00-00 00:00:00
I have already tried converting the data like this:
elements$geburtsdatum <- format(elements$geburtsdatum,'%Y-%m-%d %H:%M:%S')
But the result was the same.
I use RStudio version 1.1.456 with R 3.5.1 under Windows 8.1 and MySQL Server 5.6.
Can anybody help?
Kind regards
Goetz Edinger
From your example, it seems like geburtsdatum is just a date, with no time value. In that case, why not use as.Date(elements$geburtsdatum) to change it to a date type in your data frame, and then use CONCAT to add it to the MySQL db?
Like this:
CONCAT(geburtsdatum, ' ', '00:00:00')
Basically, you are adding the birthday to a placeholder time value in order to make a timestamp.
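In R, the same idea is a one-liner with paste() (a sketch, using the column names from your example):
# date plus a midnight placeholder gives a valid timestamp string
elements$geburtsdatum <- paste(as.Date(elements$geburtsdatum), "00:00:00")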
Thank you!!
I found the mistake: if I use a date before '1970-01-01 01:00:01', the database changes it to '0000-00-00 00:00:00'. If I use a date equal to '1970-01-01 01:00:01' or newer, the result is correct. It doesn't matter whether I do it via R or via MySQL Workbench.
* PROBLEM SOLVED *
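That matches MySQL's documented TIMESTAMP range, '1970-01-01 00:00:01' UTC through '2038-01-19 03:14:07' UTC (the extra hour you observed is presumably your session time zone). Birth dates are therefore safer in a DATE or DATETIME column; for example (an untested sketch of the same dbWriteTable call):
dbWriteTable(conn = con, name = 'testdaten3', value = elements,
             row.names = FALSE, append = TRUE,
             field.types = c(name = "varchar(45)",
                             nummer = "tinyint",
                             geburtsdatum = "datetime"))  # DATETIME has no 1970 lower bound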
I have 100+ csv files in the current directory, all with the same characteristics. Some examples:
ABC.csv
,close,high,low,open,time,volumefrom,volumeto,timestamp
0,0.05,0.05,0.05,0.05,1405555200,100.0,5.0,2014-07-17 02:00:00
1,0.032,0.05,0.032,0.05,1405641600,500.0,16.0,2014-07-18 02:00:00
2,0.042,0.05,0.026,0.032,1405728000,12600.0,599.6,2014-07-19 02:00:00
...
1265,0.6334,0.6627,0.6054,0.6266,1514851200,6101389.25,3862059.89,2018-01-02 01:00:00
XYZ.csv
,close,high,low,open,time,volumefrom,volumeto,timestamp
0,0.0003616,0.0003616,0.0003616,0.0003616,1412640000,11.21,0.004054,2014-10-07 02:00:00
...
1183,0.0003614,0.0003614,0.0003614,0.0003614,1514851200,0.0,0.0,2018-01-02 01:00:00
The idea is to build a time-series dataset in R as an xts object, so that I can use the PerformanceAnalytics and quantmod libraries. Something like this:
## ABC XYZ ... ... JKL
## 2006-01-03 NaN 20.94342
## 2006-01-04 NaN 21.04486
## 2006-01-05 9.728111 21.06047
## 2006-01-06 9.979226 20.99804
## 2006-01-09 9.946529 20.95903
## 2006-01-10 10.575626 21.06827
## ...
Any ideas? I can provide my attempts if required.
A solution using base R
If you know that your files are all formatted the same way, then you can merge them. Below is what I would have done.
Get a list of files (this assumes that all the .csv files are the ones you actually need and that they are placed in the working directory):
vcfl <- list.files(pattern = "\\.csv$")
Use lapply() to open all the files and store them as data frames:
lsdf <- lapply(vcfl, read.csv)
Merge them. Here I used the column high, but you can apply the same code to any variable (there is likely a solution without a loop):
out_high <- lsdf[[1]][,c("timestamp", "high")]
for (i in 2:length(vcfl)) {
out_high <- merge(out_high, lsdf[[i]][,c("timestamp", "high")], by = "timestamp")
}
Rename the columns using the vector of file names (column 1 is the timestamp, so the value columns start at 2):
names(out_high)[2:(length(vcfl) + 1)] <- sub("\\.csv$", "", vcfl)
You can now use as.xts() from the xts package (https://cran.r-project.org/web/packages/xts/xts.pdf).
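For example, a minimal sketch (assumes the timestamp strings parse with as.POSIXct's default "%Y-%m-%d %H:%M:%S" format, as in your sample files):
library(xts)
# drop the timestamp column from the data and use it as the index
out_xts <- xts(out_high[, -1], order.by = as.POSIXct(out_high$timestamp))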
I guess there is also an alternative solution using the tidyverse (a rough sketch follows); somebody else?
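For what it's worth, an untested sketch of one such approach (assuming purrr and dplyr; full_join() keeps timestamps that are missing from some files, which gives the NA-padded layout shown in the question):
library(purrr)
library(dplyr)

files <- list.files(pattern = "\\.csv$")
out_high <- files %>%
  map(read.csv) %>%                        # read every file
  map2(sub("\\.csv$", "", files),          # name each value column after its file
       ~ setNames(.x[, c("timestamp", "high")], c("timestamp", .y))) %>%
  reduce(full_join, by = "timestamp")      # outer-join everything on the timestamp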
Hope this helps.
I have a data frame data, which contains columns of integers and columns containing a date and a time, as shown:
>head(data,2)
PRESSURE AMBIENT_TEMP OUTLET_PRESSURE COMP_STATUS DATE TIME predict
1 14 65 21 0 2014-01-09 12:45:00 0.6025863
2 17 65 22 0 2014-01-10 06:00:00 0.6657910
Now I'm writing this back to the SQL database with this chunk:
sqlSave(channel,data,tablename = "ANL_ASSET_CO",append = T)
where channel is the connection name. But this gives an error:
[RODBC] Failed exec in Update
22018 1722 [Oracle][ODBC][Ora]ORA-01722: invalid number
But when I exclude the date column, it writes back without any error.
> sqlSave(channel,data[,c(1:4,7)],tablename = "ANL_ASSET_CO",append = T)
> sqlSave(channel,data[,c(1:4,6:7)],tablename = "ANL_ASSET_CO",append = T)
Because of the date column, the data is not written to Oracle (viewed in Oracle SQL Developer); it could be a problem with the hyphens.
How can I write it? Any help!!
>class(data$DATE)
[1] "POSIXct" "POSIXt"
So I had to change the data type to character:
>data$DATE <- as.character(data$DATE)
>sqlSave(channel,data,tablename = "ANL_ASSET_CO",append=T)
This one worked!!
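Alternatively (untested), RODBC's sqlSave() takes a varTypes argument that lets you name the database type per column, which may avoid the character round-trip:
# ask RODBC to bind the DATE column as an Oracle DATE (sketch)
sqlSave(channel, data, tablename = "ANL_ASSET_CO", append = TRUE,
        varTypes = c(DATE = "date"))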
I installed the quantmod package and I'm trying to import a csv file with 1-minute intraday data. Here is a sample GAZP.csv file:
"D";"T";"Open";"High";"Low";"Close";"Vol"
20130902;100100;132.2000000;133.0500000;131.9200000;132.5000000;131760
20130902;100200;132.3700000;132.5700000;132.2500000;132.2900000;66090
20130902;100300;132.3600000;132.5000000;132.2600000;132.4700000;37500
I've tried:
> getSymbols('GAZP',src='csv')
Error in `colnames<-`(`*tmp*`, value = c("GAZP.Open", "GAZP.High", "GAZP.Low", :
length of 'dimnames' [2] not equal to array extent
> getSymbols.csv('GAZP',src='csv')
> # or
> getSymbols.csv('GAZP',env,dir="c:\\!!",extension="csv")
Error in missing(verbose) : 'missing' can only be used for arguments
How should I properly use the getSymbols.csv command to read such data?
@Vladimir, if you are not insisting on using the getSymbols function from the quantmod package, you can import your csv file (assuming it is in your working directory) as a zoo object with this line:
library(zoo)
GAZP <- read.zoo("GAZP.csv", sep = ";", header = TRUE, index.column = list(1, 2),
                 FUN = function(D, T) as.POSIXct(paste(D, T), format = "%Y%m%d %H%M%S"))
and convert it to an xts object if you want:
library(xts)
GAZP.xts <- as.xts(GAZP)
> GAZP
Open High Low Close Vol
2013-09-02 10:01:00 132.20 133.05 131.92 132.50 131760
2013-09-02 10:02:00 132.37 132.57 132.25 132.29 66090
2013-09-02 10:03:00 132.36 132.50 132.26 132.47 37500