I have a table stored in a PostgreSQL database, and one column (named js) has the format shown below.
I am using the following code to import the data:
library('RPostgreSQL')
drv <- dbDriver('PostgreSQL')
CON <- dbConnect(drv,host='bi*********zonaws.com',port=****,user='****',password='*****')
dbGetQuery(CON,'SELECT js FROM ln2 LIMIT 1')
and I get the following result for the first row:
{"id": "pub%2Fc%25C3%25B3nal-o-meara%2F27%2F4a5%2F933", "age": null, "name": "Cónal O'Meara", "emails": [] }#continues but i stop it here. . . .
My question concerns the conversion of some letters. In the original table the name is Cónal O'Meara, but in the data imported into R it is "Cónal O'Meara". How can I overcome that?
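What you are seeing is classic mojibake: UTF-8 bytes rendered as Latin-1. Below is a minimal sketch of two common fixes, assuming the bytes coming back from the driver are valid UTF-8 (connection and table names as in the question):

# Option 1: ask PostgreSQL to send the session data as UTF-8
dbGetQuery(CON, "SET client_encoding = 'UTF8'")

# Option 2: the bytes are already UTF-8 but mislabelled; re-mark them in R
res <- dbGetQuery(CON, 'SELECT js FROM ln2 LIMIT 1')
Encoding(res$js) <- "UTF-8"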
Recent airflow-providers-amazon has deprecated MySQLToS3Operator and introduced SqlToS3Operator, and it now adds an index column at the beginning of the CSV dump.
For example, if I run the following:
sql_to_s3_task = SqlToS3Operator(
    task_id="sql_to_s3_task",
    sql_conn_id=conn_id_name,
    query="SELECT created_at, score FROM my_table",
    s3_bucket=bucket_name,
    s3_key=key,
    replace=True,
)
The S3 file has something like this:
,created_at,score
1,2023-01-01,5
2,2023-01-02,6
The output seems to be a direct dump from Pandas. How can I remove this unwanted preceding index column?
The operator uses a pandas DataFrame under the hood.
You should use pd_kwargs, which lets you pass arguments through to the DataFrame's .to_parquet(), .to_json(), or .to_csv() call.
Since your output is CSV, the relevant pandas.DataFrame.to_csv parameters are:
header: bool or list of str, default True
Write out the column names. If a list of strings is given it is assumed to be aliases for the column names.
index: bool, default True
Write row names (index).
Thus you can do:
sql_to_s3_task = SqlToS3Operator(
    task_id="sql_to_s3_task",
    sql_conn_id=conn_id_name,
    query="SELECT created_at, score FROM my_table",
    s3_bucket=bucket_name,
    s3_key=key,
    replace=True,
    file_format="csv",
    pd_kwargs={"index": False, "header": False},
)
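For reference, the same flags can be tried on a plain DataFrame; this is only an illustration of what pd_kwargs forwards to to_csv, not part of the operator itself:

import pandas as pd

df = pd.DataFrame({"created_at": ["2023-01-01", "2023-01-02"], "score": [5, 6]})

print(df.to_csv())                           # unnamed leading index column, as in the S3 file
print(df.to_csv(index=False))                # index dropped, header kept
print(df.to_csv(index=False, header=False))  # matches the operator call above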
I am attempting to use a .json file I found online, but I'm starting to think that there is an underlying issue with the file. I am not very knowledgeable in .json files, so I am trying to convert it into a CSV file. I have yet to find a website that can do that for me.
I've tried using R to convert the file, since the file is also quite large and I can only assume that most websites have a size limit. I have tried flattening it in R with this code:
library(jsonlite)
library(tidyr)
library(tidyverse)
json_string <- readLines("data.json")
json_data <- fromJSON(json_string)
json_data <- flatten(json_data)
df <- as_data_frame(json_data)
write_csv(df, "output.csv")
but it returns this error:
! Tibble columns must have compatible sizes.
* Size 2: Columns `A-Alrund, God of the Cosmos // A-Hakka, Whispering Raven`, `A-Blessed Hippogriff // A-Tyr's Blessing`, `A-Emerald Dragon // A-Dissonant Wave`, `A-Monster Manual // A-Zoological Study`, `A-Rowan, Scholar of Sparks // A-Will, Scholar of Frost`, and 484 more.
* Size 3: Column `Smelt // Herd // Saw`.
* Size 5: Column `Who // What // When // Where // Why`.
* Size 6: Columns `Everythingamajig`, `Garbage Elemental`, `Ineffable Blessing`, `Knight of the Kitchen Sink`, `Scavenger Hunt`, and 4 more.
i Only values of size one are recycled.
Backtrace:
1. tibble::as_data_frame(json_data)
3. tibble:::as_tibble.list(x, ...)
4. tibble:::lst_to_tibble(x, .rows, .name_repair, col_lengths(x))
5. tibble:::recycle_columns(x, .rows, lengths)
Here is what the first two items of the .json file look like:
{"data": {"\"Ach! Hans, Run!\"": [{"colorIdentity": ["G", "R"], "colors": ["G", "R"], "convertedManaCost": 6.0, "foreignData": [], "identifiers": {"scryfallOracleId": "a2c5ee76-6084-413c-bb70-45490d818374"}, "isFunny": true, "layout": "normal", "legalities": {}, "manaCost": "{2}{R}{R}{G}{G}", "manaValue": 6.0, "name": "\"Ach! Hans, Run!\"", "printings": ["UNH"], "purchaseUrls": {"cardKingdom": "https://mtgjson.com/links/84dfefe718a51cf8", "cardKingdomFoil": "https://mtgjson.com/links/d8c9f3fc1e93c89c", "cardmarket": "https://mtgjson.com/links/b9d69f0d1a9fb80c", "tcgplayer": "https://mtgjson.com/links/c51d2b13ff76f1f0"}, "rulings": [], "subtypes": [], "supertypes": [], "text": "At the beginning of your upkeep, you may say \"Ach! Hans, run! It's the . . .\" and the name of a creature card. If you do, search your library for a card with that name, put it onto the battlefield, then shuffle. That creature gains haste. Exile it at the beginning of the next end step.", "type": "Enchantment", "types": ["Enchantment"]}], "\"Brims\" Barone, Midway Mobster": [{"colorIdentity": ["B", "W"], "colors": ["B", "W"], "convertedManaCost": 5.0, "foreignData": [], "identifiers": {"scryfallOracleId": "c64c31f2-c1be-414e-9dff-c3b77ba97545"}, "isFunny": true, "layout": "normal", "leadershipSkills": {"brawl": false, "commander": true, "oathbreaker": false}, "legalities": {}, "manaCost": "{3}{W}{B}", "manaValue": 5.0, "name": "\"Brims\" Barone, Midway Mobster", "power": "5", "printings": ["UNF"], "purchaseUrls": {"cardKingdom": "https://mtgjson.com/links/d1e320bd9d6813c0", "cardKingdomFoil": "https://mtgjson.com/links/18f86e8a04682c34", "cardmarket": "https://mtgjson.com/links/d5a3d8cfb60767d4", "tcgplayer": "https://mtgjson.com/links/980f45f2bc8c3733"}, "rulings": [], "subtypes": ["Human", "Rogue"], "supertypes": ["Legendary"], "text": "When \"Brims\" Barone, Midway Mobster enters the battlefield, put a +1/+1 counter on each other creature you control that has a hat.\n\"Brims\" Barone, Midway Mobster has menace as long as you're wearing a hat.", "toughness": "4", "type": "Legendary Creature — Human Rogue", "types": ["Creature"]}]}
I am hoping that the resulting csv file has the keys as the column names, and the values to be assigned to the columns based on their keys.
EDIT:
I have now attached a screenshot of what the json_data structure looks like. [Screenshot: structure of json_data]
Assuming it's one of the JSON dumps from Scryfall, try this:
library(jsonlite)
library(tidyverse)

# assumes a single .json dump in the working directory
todo <- list.files(pattern = "\\.json$")
json_data <- fromJSON(todo)
# call jsonlite's flatten explicitly; purrr (loaded via tidyverse) masks it
json_data_flat_jsl <- jsonlite::flatten(json_data)
df <- as_tibble(json_data_flat_jsl)
write_csv(df, "output.csv")
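If jsonlite::flatten() still trips over the ragged card entries (as in the error above), a rough sketch of a more defensive route is to walk the parsed list yourself and keep only the scalar fields. The field handling below assumes the MTG-style dump shown in the question; adjust the filter as needed:

library(jsonlite)
library(purrr)
library(tibble)
library(dplyr)
library(readr)

raw <- fromJSON("data.json", simplifyVector = FALSE)$data

# one row per printing; list-valued fields (printings, rulings, ...) are dropped
df <- imap_dfr(raw, function(printings, card_name) {
  map_dfr(printings, function(p) {
    scalars <- keep(p, ~ !is.list(.x) && length(.x) == 1)
    as_tibble(scalars)
  }) %>% mutate(card = card_name, .before = 1)
})

write_csv(df, "output.csv")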
I have a CSV file that looks like this (it was saved from PySpark output):
name_value
"[quality1 -> good, quality2 -> OK, quality3 -> bad]"
"[quality1 -> good, quality2 -> excellent]"
How can I use PySpark to read this CSV file and convert the name_value column into a map type?
Something like the below:
data = {}
line = '[quality1 -> good, quality2 -> OK, quality3 -> bad]'
parts = line[1:-1].split(',')
for part in parts:
    k, v = part.split('->')
    data[k.strip()] = v.strip()
print(data)
Output:
{'quality1': 'good', 'quality2': 'OK', 'quality3': 'bad'}
A combination of split and regexp_replace cuts the string into key-value pairs. In a second step, each key-value pair is transformed first into a struct and then into a map entry:
from pyspark.sql import functions as F

df = spark.read.option("header", "true").csv(...)
df1 = (df.withColumn("name_value",
                     F.split(F.regexp_replace("name_value", "[\\[\\]]", ""), ","))
         .withColumn("name_value", F.map_from_entries(F.expr(
             """transform(name_value, e -> (regexp_extract(e, '^(.*) ->', 1),
                                            regexp_extract(e, '-> (.*)$', 1)))"""))))
df1 now has the schema:
root
|-- name_value: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
and contains the same data as the original CSV file.
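As an aside, Spark also ships a built-in str_to_map SQL function; a shorter sketch using it (this assumes, as in the sample data, that keys and values contain no commas or arrows themselves):

from pyspark.sql import functions as F

df2 = df.withColumn(
    "name_value",
    F.expr(r"str_to_map(regexp_replace(name_value, '[\\[\\]]', ''), ', ', ' -> ')"),
)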
Question
Using the mongolite package in R, how do you query a database for a given date?
Example Data
Consider a test collection with two entries
library(mongolite)
## create dummy data
df <- data.frame(id = c(1, 2),
                 dte = as.POSIXct(c("2015-01-01", "2015-01-02")))

> df
  id        dte
1  1 2015-01-01
2  2 2015-01-02
## insert into database
mong <- mongo(collection = "test", db = "test", url = "mongodb://localhost")
mong$insert(df)
Mongo shell query
To find the entries after a given date I would use
db.test.find({"dte" : {"$gt" : new ISODate("2015-01-01")}})
How can I reproduce this query in R using mongolite?
R attempts
So far I have tried
qry <- paste0('{"dte" : {"$gt" : new ISODate("2015-01-01")}}')
mong$find(qry)
Error: Invalid JSON object: {"dte" : {"$gt" : new ISODate("2015-01-01")}}
qry <- paste0('{"dte" : {"$gt" : "2015-01-01"}}')
mong$find(qry)
Imported 0 records. Simplifying into dataframe...
data frame with 0 columns and 0 rows
qry <- paste0('{"dte" : {"gt" : ', as.POSIXct("2015-01-01"), '}}')
mong$find(qry)
Error: Invalid JSON object: {"dte" : {"gt" : 2015-01-01}}
qry <- paste0('{"dte" : {"gt" : new ISODate("', as.POSIXct("2015-01-01"), '")}}')
mong$find(qry)
Error: Invalid JSON object: {"dte" : {"gt" : new ISODate("2015-01-01")}}
@user2754799 has the correct method, but I've made a couple of small changes so that it answers my question. If they want to edit their answer with this solution, I'll accept it.
d <- as.integer(as.POSIXct(strptime("2015-01-01","%Y-%m-%d"))) * 1000
## or more concisely
## d <- as.integer(as.POSIXct("2015-01-01")) * 1000
data <- mong$find(paste0('{"dte":{"$gt": { "$date" : { "$numberLong" : "', d, '" } } } }'))
As this question keeps showing up at the top of my Google results when I forget AGAIN how to query dates in mongolite and am too lazy to go find the documentation:
The above MongoDB shell query,
db.test.find({"dte" : {"$gt" : new ISODate("2015-01-01")}})
now translates to
mong$find('{"dte":{"$gt":{"$date":"2015-01-01T00:00:00Z"}}}')
Optionally, you can add milliseconds:
mong$find('{"dte":{"$gt":{"$date":"2015-01-01T00:00:00.000Z"}}}')
If you use the wrong datetime format, you get a helpful error message pointing you to the correct one: use ISO 8601 format yyyy-mm-ddThh:mm plus a timezone, either "Z" or an offset like "+0500".
Of course, this is also documented in the mongolite manual.
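Putting that together with the test collection from the question, a minimal round trip (assuming the same local MongoDB; note that whether the 2015-01-01 row is excluded depends on your session timezone, since POSIXct values are stored as UTC instants):

library(mongolite)

mong <- mongo(collection = "test", db = "test", url = "mongodb://localhost")
# with a UTC session this returns only the second row (dte = 2015-01-02)
mong$find('{"dte":{"$gt":{"$date":"2015-01-01T00:00:00Z"}}}')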
Try mattjmorris's answer from GitHub:
library(GetoptLong)  # provides qq(), which interpolates #{...} into strings

datemillis <- as.integer(as.POSIXct("2015-01-01")) * 1000
data <- data_collection$find(qq('{"createdAt":{"$gt": { "$date" : { "$numberLong" : "#{datemillis}" } } } }'))
reference: https://github.com/jeroenooms/mongolite/issues/5#issuecomment-160996514
Before converting your date by multiplying it by 1000, run options(scipen=1000); without this workaround, some dates will be rendered in scientific notation and produce an invalid query.
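A quick sketch of the failure mode this guards against (format() respects scipen; the exact cutoff depends on the magnitude of the timestamp):

d <- as.numeric(as.POSIXct("2015-06-01", tz = "UTC")) * 1000
format(d)               # may print "1.433117e+12" under default options
options(scipen = 1000)
format(d)               # "1433116800000" - safe to splice into the JSON query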
head(users)
1      jay     chennai
2    kumar   bangalore
3   vinoth      Trichy
4  saswath  perambalur
I want to store this output in a Cassandra table. I tried the lines below to store it:
users.write
.format("org.apache.spark.sql.cassandra")
.options(Map( "table" -> "sparkusers", "keyspace" -> "bigdata"))
.save()
It throws this error:
unexpected symbol in test.write.format("org.apache.spark.sql.cassandra").options
Please help me with this.
You are using the wrong syntax for R (that is the Python/Scala syntax):
read.df(sqlContext, source = "org.apache.spark.sql.cassandra", keyspace = "ks", table = "table")
See the SparkR DataFrame documentation.
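For the write itself, a hedged sketch using SparkR's write.df; the option names here simply mirror the read.df call above and the Scala snippet in the question, so check the connector's documentation for your version:

# assumes `users` is a SparkR DataFrame and the Cassandra connector is on the classpath
write.df(users,
         source = "org.apache.spark.sql.cassandra",
         mode = "append",
         keyspace = "bigdata",
         table = "sparkusers")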