Error when using SparkR insertInto via Databricks

I am trying to insert values from a dataframe into a database table (Impala) using SparkR in a Databricks notebook:
require(SparkR)
test_df <- data.frame(row_no = c(2,3,4,5,6,7,8),
                      row_dat = c('dat_2','dat_3','dat_4','dat_5','dat_6','dat_7','dat_8'))
test_df <- as.data.frame(test_df)
sparkR.session()
insertInto(test_df, "db_name.table_name", overwrite = false)
I get the error: "unable to find an inherited method for function ‘insertInto’ for signature ‘"data.frame", "character"’"
I have checked the connection to this table, and using SparkR::collect I can return its data without a problem. So why isn't the insert working?

Instead of as.data.frame, which returns an R data.frame, you need to use as.DataFrame, which returns a Spark DataFrame that can be used with insertInto (see the docs). Change the code to:
require(SparkR)
test_df <- data.frame(row_no = c(2,3,4,5,6,7,8),
                      row_dat = c('dat_2','dat_3','dat_4','dat_5','dat_6','dat_7','dat_8'))
sparkR.session()
test_df <- as.DataFrame(test_df)
insertInto(test_df, "db_name.table_name", overwrite = FALSE)
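Note that insertInto writes into an existing table, so db_name.table_name must already exist with a compatible schema. A quick way to see the distinction behind the error, as a minimal sketch (assuming an active Spark session and test_df still being the plain R data.frame from the question):
# as.data.frame() keeps a plain R data.frame, which insertInto() cannot dispatch on;
# as.DataFrame() converts it into a SparkDataFrame, which it can
class(as.data.frame(test_df))  # "data.frame"
class(as.DataFrame(test_df))   # "SparkDataFrame"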

Related

SQL Create table from R - String data, right truncation

I am working with Microsoft Azure SQL Database version 12, operating from RStudio Server with the DBI library. I need to create multiple SQL tables from data frames that contain a character variable of length 4000. This can be done as:
# Create dataframe
df <- data.frame("myid" = stringi::stri_rand_strings(5, 4000),
                 "mydate" = c(Sys.time(), Sys.time()-1, Sys.time()-2, Sys.time()-3, Sys.time()-4))

# Create SQL table sschema.ttable
DBI::dbWriteTable(conn = connection,
                  name = DBI::Id(schema = "sschema", table = "ttable"),
                  value = df,
                  overwrite = TRUE)
This fails with the following error
Error in result_insert_dataframe(rs@ptr, values, batch_rows) :
  nanodbc/nanodbc.cpp:1617: 00000: [Microsoft][ODBC Driver 17 for SQL Server]String data, right truncation
I tried:
- Truncating the variables (suboptimal).
- Creating the table, altering the variables to VARCHAR(6000) instead of VARCHAR(255), then appending the dataframe. This results in the same "String data, right truncation" error.
Are there any solutions for creating SQL tables directly from R dataframes?
The answer is to define the variables and their desired SQL column types with the field.types argument, as in:
# Create SQL table sschema.ttable
DBI::dbWriteTable(conn = connection,
                  name = DBI::Id(schema = "sschema", table = "ttable"),
                  field.types = c(myid = "varchar(6000)"),
                  value = df,
                  overwrite = TRUE)
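Since field.types takes a named character vector, several columns can be overridden in one call, and varchar(max) is an option when no upper bound is known. A hedged sketch of that variant (assuming the same connection and df as above; the datetime2 type for mydate is an illustrative choice):
# Override the SQL types of both columns in one call
DBI::dbWriteTable(conn = connection,
                  name = DBI::Id(schema = "sschema", table = "ttable"),
                  field.types = c(myid = "varchar(max)", mydate = "datetime2"),
                  value = df,
                  overwrite = TRUE)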

Issue with SparkR regexp_extract function

Data
I work with a large dataset (280 million rows) for which Spark and R seem to work nicely.
Problem
I've had problems with SparkR's regexp_extract function. I thought it would work analogously to stringr's str_detect, but I haven't managed to get it to work. The documentation for regexp_extract is limited. Could you please give me a hand?
Reprex
Here is a reprex where I try to identify strings that do not contain a space and append " 00:01" as a suffix.
# Load packages
library(tidyverse)
library(sparklyr)
library(SparkR)

# Create data
df <- data.frame(sampletaken = c("06/03/2013", "29/11/2005 8:30", "06/03/2013", "15/01/2007 12:25", "06/03/2013", "15/01/2007 12:25"))

# Create Spark connection
sc <- spark_connect(master = "local", spark_home = spark_home_dir())

# Transfer data to Spark memory
df <- copy_to(sc, df, "df", overwrite = TRUE)

# Modify data
df1 <- df %>%
  dplyr::mutate(sampletaken = ifelse(regexp_extract(sampletaken, " "), sampletaken, paste(sampletaken, "00:01")))

# Collect data as dataframe
df1 <- df1 %>% as.data.frame()
head(df1$sampletaken)
Error
error: org.apache.spark.sql.AnalysisException: cannot resolve '(NOT regexp_extract(df.sampletaken, ' ', 1))' due to data type mismatch: argument 1 requires boolean type, however, 'regexp_extract(df.sampletaken, ' ', 1)' is of string type.; line 1 pos 80;
Solution
# Load packages
library(tidyverse)
library(sparklyr)
library(SparkR)

# Create data
df <- data.frame(sampletaken = c("06/03/2013", "29/11/2005 8:30", "06/03/2013", "15/01/2007 12:25", "06/03/2013", "15/01/2007 12:25"))

# Create Spark connection
sc <- spark_connect(master = "local", spark_home = spark_home_dir())

# Transfer data to Spark memory
df <- copy_to(sc, df, "df", overwrite = TRUE)

# Modify data: rlike() returns a boolean, so it is valid as the test in ifelse()
df1 <- df %>%
  dplyr::mutate(sampletaken1 = ifelse(rlike(sampletaken, " "), sampletaken, paste(sampletaken, "00:01")))

# Collect data as dataframe
df1 <- df1 %>% as.data.frame()
head(df1$sampletaken1)
Probably rlike is what you're after if you're looking for the analogue of str_detect; see the SQL API docs:
str rlike regexp - Returns true if str matches regexp, or false otherwise.
SELECT '%SystemDrive%\Users\John' rlike '%SystemDrive%\\Users.*'
true
On a Column (i.e., in R, rather than in Spark SQL through sql()), it would look like:
rlike(Column, 'regex.*pattern')
# i.e., in magrittr form
Column %>% rlike('regex.*pattern')
Note that like is usually more efficient, if you can use it, since the set of valid like patterns is much smaller.
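For example, mirroring the rlike form above, a sketch of the like equivalent (like patterns use only the % and _ wildcards rather than a full regex):
# matches values containing a space; % matches any sequence of characters
Column %>% like('% %')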
I'm not familiar with SparkR, but it seems that the function regexp_extract returns a string (presumably the matched pattern in the string) instead of a boolean, as required by ifelse.
You may try to match the returned value against the empty string.
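A sketch of that idea (untested; in Spark, regexp_extract returns the empty string when nothing matches, and group index 0 refers to the whole match):
df1 <- df %>%
  dplyr::mutate(sampletaken1 = ifelse(regexp_extract(sampletaken, " ", 0) != "",
                                      sampletaken,
                                      paste(sampletaken, "00:01")))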

Getting "table of extent 0" in Shiny Web app output

I have a data file that I read in within my Shiny server function. I would like to display a frequency table of the two columns the user selects using drop-downs, but I get the error "table of extent 0". I have looked at "R error - Table of extent 0" and "Can't solve table issue", but I have imported my data correctly and the column names match as well. The same line of code works when I run it in the console.
Here is my code:
shinyServer(function(input, output) {
  output$courseData = renderPrint({
    data = read.csv(file = 'FourCourseTableLetterGrades_POLISHED.tsv', sep = '\t', header = TRUE)
    c1 = input$course1
    c2 = input$course2
    tbl = table(data$c1, data$c2)
    tbl
  })
})
Update: this is what the table looks like right now (screenshot not included): the columns are named Var1 and Var2. I would like the output to be in matrix format, just as you get when running the table command in the console. I also don't know why the columns are named Var1 and Var2, or where to change them.
The first problem is that c1 and c2 are character variables, so you need to use [[ ]] instead of $. The second problem is that what you see is the printed format of the result from table; if you would rather have the matrix, you can compute it quite easily with the dplyr package, for example:
library(dplyr)
data = read.csv(file = 'FourCourseTableLetterGrades_POLISHED.tsv', sep = '\t', header = TRUE)
c1 = input$course1
c2 = input$course2
tbl = tibble(data[[c1]], data[[c2]]) %>%
  group_by_all() %>%
  tally() %>%
  tidyr::spread(2, n)
tbl
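On current tidyr versions, pivot_wider is the successor to spread; a sketch of the same reshape under that API (the course1/course2 names are just illustrative labels):
tbl = tibble(course1 = data[[c1]], course2 = data[[c2]]) %>%
  count(course1, course2) %>%
  tidyr::pivot_wider(names_from = course2, values_from = n)
tbl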
hope this helps!!
Using data[[c1]] instead of data$c1 as suggested in the comments removed the error and showed a basic (although malformed) output. I did not understand why at first: $ does not evaluate its argument, so data$c1 looks for a column literally named "c1" and returns NULL, whereas [[ evaluates c1 and uses the column name stored in it.
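A minimal illustration of that difference, with a hypothetical data frame:
df <- data.frame(math = c("A", "B"), physics = c("B", "B"))
col <- "math"
df$col    # NULL: looks for a column literally named "col"
df[[col]] # "A" "B": uses the string stored in col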

Concat_ws() function in Sparklyr is missing

I am following a tutorial on web (Adobe) analytics, where I want to build a Markov Chain Model. (http://datafeedtoolbox.com/attribution-theory-the-two-best-models-for-algorithmic-marketing-attribution-implemented-in-apache-spark-and-r/).
In the example they are using the function concat_ws (from library(sparklyr)). But it looks like the function does not exist: after installing the package and loading the library, I receive an error that the function does not exist...
Comment author of the blog: concat_ws is a Spark SQL function:
https://spark.apache.org/docs/2.2.0/api/java/org/apache/spark/sql/functions.html
So, you’ll have to rely on sparklyr to have that function work.
My question: are there workarounds to get access to the concat_ws() function? I tried searching GitHub (https://github.com/rstudio/sparklyr) for the function (or its source code), unfortunately with no result.
What is the goal of the function?
Concatenates multiple input string columns together into a single string column, using the given separator.
You can simply use paste from base R.
library(sparklyr)
library(dplyr)

config <- spark_config()
sc <- spark_connect(master = "local", config = config)

df <- as.data.frame(cbind(c("1", "2", "3"), c("a", "b", "c")))
sdf <- sdf_copy_to(sc, df, overwrite = TRUE)

sdf %>%
  mutate(concat = paste(V1, V2, sep = "-"))
You cannot find the function because it doesn't exist in the sparklyr package. concat_ws is a Spark SQL function (org.apache.spark.sql.functions.concat_ws).
sparklyr depends on a SQL translation layer: function calls are translated into SQL expressions with dbplyr:
> dbplyr::translate_sql(concat_ws("-", foo, bar))
<SQL> CONCAT_WS('-', "foo", "bar")
This means that the function can be applied only in the sparklyr context:
sc <- spark_connect(master = "local[*]")
df <- copy_to(sc, tibble(x="foo", y="bar"))
df %>% mutate(xy = concat_ws("-", x, y))
# # Source: spark<?> [?? x 3]
# x y xy
# * <chr> <chr> <chr>
# 1 foo bar foo-bar
I had a similar problem with dbplyr (BigQuery database).
Problem
I kept getting the error:
my_dbplyr_object %>%
  mutate(datetime_char = paste(date_char, time_char))
# failed x Function not found: CONCAT_WS at [1:147] [invalidQuery]
Solution
I wrote custom SQL and placed it inside sql().
Example
Once you know the SQL that will generate what you're after (in my case it was CONCAT(date_char, ' ', time_char)), then simply place it inside the sql() function, like so:
my_dbplyr_object %>%
  mutate(datetime_char = sql("CONCAT(date_char, ' ', time_char)"))

Filter table from redshift database using R dplyr

I have a table saved in AWS Redshift that has lots of rows, and I want to collect only a subset of them using the "user_id" column. I am trying to use R with the dplyr library to accomplish this (see below).
conn_dplyr <- src_postgres('dev',
                           host = '****',
                           port = ****,
                           user = "****",
                           password = "****")
df <- tbl(conn_dplyr, "redshift_table")
However, when I try to subset over a collection of user ids, it fails (see below). Can someone help me understand how I might collect the table over a collection of user id elements? The individual calls work, but when I combine them it fails. In this case there are only 2 user ids, but in general there could be hundreds or thousands, so I don't want to do each one individually. Thanks for your help.
df_subset1 <- filter(df, user_id == "2239257806")
df_subset1 <- collect(df_subset1)

df_subset2 <- filter(df, user_id == "22159960")
df_subset2 <- collect(df_subset2)

df_subset_both <- filter(df, user_id == c("2239257806", "22159960"))
df_subset_both <- collect(df_subset_both)
Error in postgresqlExecStatement(conn, statement, ...) :
RS-DBI driver: (could not Retrieve the result : ERROR: operator does not exist: character varying = record
HINT: No operator matches the given name and argument type(s). You may need to add explicit type casts.
)
Try this:
df_subset_both <- filter(df, user_id %in% c("2239257806", "22159960"))
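To check what is actually sent to the database, show_query prints the translated SQL; a sketch (the exact quoting depends on the backend):
df %>%
  filter(user_id %in% c("2239257806", "22159960")) %>%
  show_query()
# <SQL> SELECT * FROM "redshift_table" WHERE ("user_id" IN ('2239257806', '22159960'))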
You can also add the condition to the query you run against Redshift directly:
install.packages("RPostgreSQL")
library(RPostgreSQL)

drv <- dbDriver("PostgreSQL")
conn <- dbConnect(drv, host = 'host link', port = '5439', dbname = 'dbname', user = 'xxx', password = 'yyy')

# dbGetQuery runs the statement and fetches the result as a data frame
# (dbSendQuery alone only returns a result set that still has to be fetched)
df_subset_both <- dbGetQuery(conn, "select * from my_table where user_id in (2239257806, 22159960)")
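If the list of ids runs to hundreds or thousands, the IN clause can be built programmatically; a sketch (assuming ids is a character vector of user ids, which the varchar user_id column in the error above suggests):
ids <- c("2239257806", "22159960")
query <- sprintf("select * from my_table where user_id in (%s)",
                 paste(sprintf("'%s'", ids), collapse = ", "))
df_subset_both <- dbGetQuery(conn, query)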
