R: JDBC result query using a loop

I am trying to run several queries with one chunk of code to save time. Normally, I enter the variable and time period, then return the results one at a time. There should be a way to iterate several variables through a loop and collect the results in a single data.frame, but I keep getting the error "Error in dbSendQuery(con, sql) : Unable to retrieve JDBC result set..."
#enter sensor
sensor <- a
#enter start
t_start <- c("2022-07-29 04:54:00.0")
#enter end
t_end <-c("2022-08-01 23:59:59.0")
#open con
con <- openConn()
s <- tbl(con, in_schema("doc","sensors"))
consol_final <- s %>%
  filter(timeStamp >= t_start & t_end >= timeStamp,
         topic %in% sensor) %>%
  select("timeStamp", "ID", "running", "value") %>%
  collect() %>%
  arrange(timeStamp)
closeConn
There must be a way to do this much faster with iteration instead of changing the inputs by hand. Here is my attempt, which won't work no matter how much I tinker with it:
solution <- list()
sensor_list <- c(a, b, c, d, e, f) #note each of these is an object containing a topic URL
s <- tbl(con, in_schema("doc","sensors"))
con <- openConn()
for (i in 1:length(sensor_list)){
  consol_final <- s %>%
    filter(timeStamp >= t_start & t_end >= timeStamp,
           topic %in% sensor_list[i]) %>%
    select("timeStamp", "ID", "running", "value") %>%
    collect() %>%
    arrange(timeStamp)
  rbind(consol_final, solution)
}
My understanding of JDBC queries is very limited, so it is difficult to wrap my head around the problem here. There must also be a way to iterate through several time periods stored in a list. If possible, please advise on that as well. Thank you!!
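One pattern worth trying, sketched here only as a rough outline that assumes openConn(), closeConn(), and the sensor objects behave as in the question: collect each sensor's result into a list element and bind the pieces together afterwards, instead of calling rbind() without assigning the result.
library(dplyr)
library(dbplyr)

con <- openConn()                      # open the connection before building the tbl
s <- tbl(con, in_schema("doc", "sensors"))

sensor_list <- list(a, b, c, d, e, f)  # each element holds a topic URL, as in the question

solution <- lapply(sensor_list, function(sen) {
  s %>%
    filter(timeStamp >= t_start & t_end >= timeStamp,
           topic %in% sen) %>%
    select("timeStamp", "ID", "running", "value") %>%
    collect() %>%
    arrange(timeStamp)
})

consol_final <- bind_rows(solution)    # one data.frame covering all sensors
closeConn(con)                         # assuming closeConn() takes the connection
Several time periods could be handled the same way, by looping over a list of start/end pairs (or over both lists with Map()) and binding the results.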

Related

Determine the most popular song based on number of streams by a specific artist in R

I have music data in R and I have to determine the most popular song based on the number of streams for one specific artist. I have to create a new data.frame that only contains the songs from this artist, save it, and sort it by number of streams.
The data provides a list of songs and includes columns such as the number of streams, the name of the song, the name of the artist, etc. I started like this; is there a simpler way to do it?
filter(music_data, artistName == "Billie Eilish")
billie_songs <- data.frame(filter(music_data, artistName == "Billie Eilish"))
billie_songs_ordered <- billie_songs[order(billie_songs$streams, decreasing = TRUE),]
print(paste("Most Popular Song: ", head(billie_songs_ordered$trackName, 1)))
Thank you!
The code you've added looks pretty good. Here are some comments:
filter(music_data, artistName == "Billie Eilish")
# this prints its result when you run it, but the result is not
# assigned with `<-` or `=`, so it is not saved.
# it's good to run code like this in your console, but you don't need it
# in the script file.
billie_songs <- data.frame(filter(music_data, artistName == "Billie Eilish"))
# here you repeat the code above, assigning it. This emphasizes that the
# first line could be deleted. Also `data.frame()` is unnecessary.
# Change to `billie_songs <- filter(music_data, artistName == "Billie Eilish")`
billie_songs_ordered <- billie_songs[order(billie_songs$streams, decreasing = TRUE),]
# this is fine. This is a great way to order rows of data using base R.
# You used `dplyr` above with `filter`, the dplyr way would have you use
# `arrange(billie_songs, desc(streams))` instead
print(paste("Most Popular Song: ", head(billie_songs_ordered$trackName, 1)))
# The `print()` is unnecessary, but this is good
If I were writing it I would use dplyr functions throughout and not save the result at each step, instead using the %>% pipe to chain the commands together, like this:
music_data %>%
  filter(artistName == "Billie Eilish") %>%
  arrange(desc(streams)) %>%
  head(1) %>%
  pull(trackName) %>%
  paste("Most Popular Song:", .)
Or I might use the dplyr convenience function slice_max, which pulls the row with the maximum value of a particular column:
music_data %>%
  filter(artistName == "Billie Eilish") %>%
  slice_max(order_by = streams, n = 1) %>%
  pull(trackName) %>%
  paste("Most Popular Song:", .)

Iterating through values in R

I'm new-ish to R and am having some trouble iterating through values.
For context: I have data on 60 people over time, and each person has his/her own dataset in a folder (I received the data with id #s 00:59). For each person, there are 2 values I need - time of response and picture response given (a number 1 - 16). I need to convert this data from wide to long format for each person, and then eventually append all of the datasets together.
My problem is that I'm having trouble writing a loop that will do this for each person (i.e. each dataset). Here's the code I have so far:
pam[x] <- fromJSON(file = "PAM_u[x].json")
pam[x]df <- as.data.frame(pam[x])
#Creating long dataframe for times
pam[x]_long_times <- gather(
  select(pam[x]df, starts_with("resp")),
  key = "time",
  value = "resp_times"
)
#Creating long dataframe for pic_nums (affect response)
pam[x]_long_pics <- gather(
  select(pam[x]df, starts_with("pic")),
  key = "picture",
  value = "pic_num"
)
#Combining the two long dataframes so that I have one df per person
pam[x]_long_fin <- bind_cols(pam[x]_long_times, pam[x]_long_pics) %>%
  select(resp_times, pic_num) %>%
  add_column(id = [x], .before = 1)
If you replace [x] in the above code with a person's id# (e.g. 00), the code will run and will give me the dataframe I want for that person. Any advice on how to do this so I can get all 60 people done?
Thanks!
EDIT
So, using library(jsonlite) rather than library(rjson) set up the files in the format I needed without having to do all of the manipulation. Thanks all for the responses, but the solution was apparently much easier than I'd thought.
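For reference, a minimal sketch of what that looks like, assuming each file holds a JSON array of records and follows the naming pattern from the code above:
library(jsonlite)

# jsonlite::fromJSON() simplifies a JSON array of records directly into a
# data frame, which is what made the extra reshaping unnecessary here
pam_00 <- fromJSON("PAM_u00.json")
str(pam_00)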
I don't know the structure of your JSON files. If your script is not in the same folder as the JSON files, try this:
library(jsonlite)
# setup - read files
json_folder <- "U:/test/" # adjust your folder here
files <- list.files(path = json_folder, pattern = "\\.json$", full.names = TRUE)
# import data
pam <- NULL
pam_df <- NULL
for (i in seq_along(files)) {
  pam[[i]] <- fromJSON(files[i])   # jsonlite's fromJSON takes the file path directly
  pam_df[[i]] <- as.data.frame(pam[[i]])
}
Here you read all the JSON files in the folder and build a character vector of length 60 (one path per file).
Then you sequence along that vector and read each file.
I assume at the end you can do bind_rows(), or add your own code inside the for loop. But remember to set the data frames to NULL before the loop starts, e.g. pam_long_pics <- NULL.
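A minimal sketch of that last step, assuming pam_df ends up as a list of per-person data frames:
library(dplyr)

# stack the per-person data frames into one long data frame,
# keeping track of which file each row came from
all_people <- bind_rows(pam_df, .id = "file_index")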
Hope that helped? Let me know.
Something along these lines could work:
#library("tidyverse")
#library("jsonlite")
file_list <- list.files(pattern = "*.json", full.names = TRUE)
Data_raw <- tibble(File_name = file_list) %>%
mutate(File_contents = map(File_name, fromJSON)) %>% # This should result in a nested tibble
mutate(File_contents = map(File_contents, as_tibble))
Data_raw %>%
mutate(Long_times = map(File_contents, ~ gather(key = "time", value = "resp_times", starts_with("resp"))),
Long_pics = map(File_contents, ~ gather(key = "picture", value = "pic_num", starts_with("pic")))) %>%
unnest(Long_times, Long_pics) %>%
select(File_name, resp_times, pic_num)
EDIT: you may or may not need to include as_tibble() after reading in the JSON files, depending on what your data looks like.

Efficiently calculate row totals of a wide Spark DF

I have a wide spark data frame of a few thousand columns by about a million rows, for which I would like to calculate the row totals. My solution so far is below. I used:
dplyr - sum of multiple columns using regular expressions and
https://github.com/tidyverse/rlang/issues/116
library(sparklyr)
library(DBI)
library(dplyr)
library(rlang)
sc1 <- spark_connect(master = "local")
wide_df = as.data.frame(matrix(ceiling(runif(2000, 0, 20)), 10, 200))
wide_sdf = sdf_copy_to(sc1, wide_df, overwrite = TRUE, name = "wide_sdf")
col_eqn = paste0(colnames(wide_df), collapse = "+" )
# build up the SQL query and send to spark with DBI
query = paste0("SELECT (",
               col_eqn,
               ") as total FROM wide_sdf")
dbGetQuery(sc1, query)
# Equivalent approach using dplyr instead
col_eqn2 = quo(!! parse_expr(col_eqn))
wide_sdf %>%
  transmute("total" := !!col_eqn2) %>%
  collect() %>%
  as.data.frame()
The problems come when the number of columns is increased. In Spark SQL it seems to be calculated one element at a time, i.e. (((V1 + V2) + V3) + V4)..., which leads to errors due to very deep recursion.
Does anyone have an alternative more efficient approach? Any help would be much appreciated.
You're out of luck here. One way or another you are going to hit some recursion limits (even if you go around the SQL parser, a sufficiently large sum of expressions will crash the query planner). There are some slow solutions available:
Use spark_apply (at the cost of conversion to and from R):
wide_sdf %>% spark_apply(function(df) { data.frame(total = rowSums(df)) })
Convert to long format and aggregate (at the cost of explode and shuffle):
key_expr <- "monotonically_increasing_id() AS key"
value_expr <- paste(
  "explode(array(", paste(colnames(wide_sdf), collapse = ","), ")) AS value"
)
wide_sdf %>%
  spark_dataframe() %>%
  # Add id and explode. We need a separate invoke so id is applied
  # before "lateral view"
  sparklyr::invoke("selectExpr", list(key_expr, "*")) %>%
  sparklyr::invoke("selectExpr", list("key", value_expr)) %>%
  sdf_register() %>%
  # Aggregate by id
  group_by(key) %>%
  summarize(total = sum(value)) %>%
  arrange(key)
To get something more efficient you should consider writing Scala extension and applying sum directly on a Row object, without exploding:
package com.example.sparklyr.rowsum

import org.apache.spark.sql.{DataFrame, Encoders}

object RowSum {
  def apply(df: DataFrame, cols: Seq[String]) = df.map {
    row => cols.map(c => row.getAs[Double](c)).sum
  }(Encoders.scalaDouble)
}
and
invoke_static(
  sc, "com.example.sparklyr.rowsum.RowSum", "apply",
  wide_sdf %>% spark_dataframe
) %>% sdf_register()

How can I read in and bind together several data frames without using a for loop?

for() loops in R always seem to be my go-to for reading in multiple things, but there must be a better way of doing what I want to do.
Let's say I have several data sets that all come from an automated data-pull system:
#Fake data set up
library(dplyr)
library(lubridate)

dir_path <- tempdir()

file_1 <- paste0(Sys.Date() - days(2), ".rds")
file_2 <- paste0(Sys.Date() - days(1), ".rds")
file_3 <- paste0(Sys.Date(), ".rds")

data.frame(thing = rnorm(100)) %>%
  saveRDS(file.path(dir_path, file_1))

data.frame(thing = rnorm(100)) %>%
  saveRDS(file.path(dir_path, file_2))

data.frame(thing = rnorm(100)) %>%
  saveRDS(file.path(dir_path, file_3))
I want to read each of these into my R session, do a small bit of processing to each, then stick them all into the same data frame:
read_in_data <- function(file_name, dir){
  d <- substr(file_name, 1, 10)
  thing <-
    readRDS(file.path(dir, file_name)) %>%
    mutate(date = d)
}
files <- list.files(dir_path, pattern = "^20.*\\.rds$")
this_thing <- NULL
for(i in 1:length(files)){
  this_thing <-
    this_thing %>%
    bind_rows(read_in_data(files[i], dir_path))
}
This is great and does exactly what I want, but I have the sneaking suspicion that, as the number of files I want to read in and bind together grows, the for() loop will end up being very slow.
I could do something like
this_thing <-
read_in_data(files[1], dir_path) %>%
bind_rows(read_in_data(files[2], dir_path)) %>%
bind_rows(read_in_data(files[3], dir_path))
but this is gross and will be impossible to maintain, especially as the number of files I want to read in grows.
How can I get rid of this for loop? I know that growing things in a for() loop is a bad idea but I don't know how else to do this kind of operation. What am I missing? Probably something pretty simple.
I ended up using the purrr package:
library(purrr)

files %>%
  map(safely(read_in_data, quiet = FALSE), dir = dir_path) %>%
  transpose() %>%
  simplify_all() %>%
  .$result %>%
  bind_rows() %>%
  saveRDS(file.path("path to .rds file"))
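If the error handling from safely() isn't needed, a shorter variant (a sketch that assumes read_in_data() and dir_path as defined above) is to map over the files and bind the results directly:
library(purrr)
library(dplyr)

this_thing <- files %>%
  map(read_in_data, dir = dir_path) %>% # read each file into a data frame
  bind_rows()                           # stack them into one data frame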

Using Dates with RSQLite

How do you write a SQL query with a date using RSQLite? Here is an example below. The dbGetQuery call does not return any rows.
require(RSQLite)
require(ggplot2)
data(presidential)
m <- dbDriver("SQLite")
tmpfile <- tempfile('presidential', fileext='.db')
conn <- dbConnect(m, dbname=tmpfile)
dbWriteTable(conn, "presidential", presidential)
dbGetQuery(conn, "SELECT * FROM presidential WHERE Date(start) >= Date('1980-01-01')")
Just to illustrate, this works fine:
tmpfile <- tempfile('presidential', fileext='.db')
conn <- dbConnect(m, dbname=tmpfile)
p <- presidential
p$start <- as.character(p$start)
p$end <- as.character(p$end)
dbWriteTable(conn, "presidential", p)
dbGetQuery(conn, "SELECT * FROM presidential WHERE start >= '1980-01-01'")
You can read about the lack of native date types in SQLite in the docs here. I've been using strings as dates for so long in SQLite that I'd actually forgotten about the issue completely.
And yes, I've written a small R function that converts any Date column in a data frame to character. For simple comparisons, keeping them in YYYY-MM-DD is enough, and if I need to do arithmetic I convert them after the fact in R.
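The answer doesn't show that function, but a minimal sketch of one (the name dates_to_char is just for illustration) might look like this:
# convert every Date column of a data frame to character in ISO format
dates_to_char <- function(df) {
  is_date <- vapply(df, inherits, logical(1), what = "Date")
  df[is_date] <- lapply(df[is_date], format, "%Y-%m-%d")
  df
}

p <- dates_to_char(presidential)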
Following on from @joran's answer, here's a simple way to convert the date columns to strings in a data.frame:
mutate(df, across(where(lubridate::is.Date), ~ format(.x, "%Y.%m.%d")))
I found working with RSQLite and dplyr to be the most convenient way to stay type-consistent using R and SQLite. In particular, extended_types = TRUE ensures that columns of type DATE, DATETIME / TIMESTAMP, and TIME are mapped to the corresponding R classes (RSQLite version 2.2.8 or later).
library(dplyr)
library(RSQLite)
library(ggplot2)
data(presidential)
mydb <- dbConnect(SQLite(), "presidential.sqlite", extended_types = TRUE)
dbWriteTable(mydb, "presidential", presidential)
tbl(mydb, "presidential") %>%
  filter(start >= as.Date("1980-01-01")) %>%
  collect()
You can also formulate the latter collection as a get query:
dbGetQuery(mydb, "SELECT * FROM presidential WHERE start >= CAST('1980-01-01' AS DATE)")
As #joran suggests keeping dates in text in SQLlite seems like the best way to go for the time being.
I used #Richard Knight's approach for conversion in, but with ISO format, to change the date to string before writing the dataframe:
local_df %>% mutate(across(where(lubridate::is.Date), ~ format(.x, "%Y-%m-%d")))
Manipulating the dates remotely can be done using SQL translation, in particular:
remote_df %>% mutate(date_as_number = julianday(date_as_string))
remote_df %>% mutate(date_as_string = date(date_as_number))
N.B. that is date, not as.Date, in the second one. This is because as.Date will get translated to CAST(date_as_number AS DATE), whereas what we want is to use SQLite's date() function with a floating-point number as returned by julianday().
Mapping the remote date strings back into dates can be done automatically if you wrap dplyr::collect():
collect <- function(remote_df, ...) {
  raw = remote_df %>% dplyr::collect(...)

  isoDateString = function(x) return(is.character(x) & all(na.omit(stringr::str_detect(x, "[0-9]{4}-[0-9]{2}-[0-9]{2}"))) & !all(is.na(x)))
  raw = raw %>% mutate(across(where(isoDateString), ~ as.Date(.x, "%Y-%m-%d")))

  maybeJulian = function(x) {return(is.double(x) & all(na.omit(x > 2440587.5)) & all(na.omit(x < 2488069.5)) & !all(is.na(x)))}
  raw = raw %>% mutate(across(matches(".*(D|d)ate.*") & where(maybeJulian), ~ as.Date(.x - 2440587.5, "1970-01-01")))

  return(raw)
}
The apparently random numbers in the maybeJulian function correspond to 1970-01-01 and 2100-01-01.
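Those bounds can be checked in R, since 2440587.5 is the Julian day number of the Unix epoch (1970-01-01):
# subtracting the epoch's Julian day number leaves days since 1970-01-01
as.Date(2440587.5 - 2440587.5, origin = "1970-01-01") # "1970-01-01"
as.Date(2488069.5 - 2440587.5, origin = "1970-01-01") # "2100-01-01"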
