dbplyr filter() and lazy evaluation

I have a large MySQL table (92 cols, 3 million rows) that I'm wrangling in R with dbplyr. I'm new to the package and had a question or two about dbplyr's lazy evaluation as I'm experiencing some considerable slowness.
Suppose I have connected to my database and wish to perform a select operation to reduce the number of columns:
results <- my_conn_table %>%
  select(id, type, date)
Despite there being millions of rows, viewing results is near instantaneous (only showing 3 rows for clarity):
> results
# Source: lazy query [?? x 3]
# Database: mysql []
     id type     date
  <int> <chr>    <date>
1     1 Customer 1997-01-04
2     2 Customer 1997-02-08
3     3 Business 1997-03-29
...
However, if I attempt to perform a filter operation with:
results <- my_conn_table %>%
  select(id, type, date) %>%
  filter(type == "Business")
the operation takes a very long time to process (tens of minutes in some instances). Is this long processing time a simple function of nrow ~= 3 million? (In other words, there's nothing I can do about it because the dataset is large.) Or is someone able to suggest some general ways to perhaps make this more performant?
My initial understanding of the lazy evaluation was that the filter() query would only return the top few rows, to prevent the long run time scenario. If I wanted all the data, I could run collect() to gather the results into my local R session (which I would expect to take a considerable amount of time depending on the query).
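For instance, a minimal sketch of the kind of preview I had in mind (reusing the connection above; head() should translate to a LIMIT clause, so only a few rows would come back):
preview <- my_conn_table %>%
  select(id, type, date) %>%
  filter(type == "Business") %>%
  head(10) %>%    # translated to LIMIT 10 on the database side
  collect()       # pull just those rows into R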

Building on @NovaEthos's answer, you can call show_query(results) to get the SQL query that dbplyr generated and is passing to the database. Posting this query here will make it clear whether there is any inefficiency in how the database is being queried.
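For example, using the pipeline from the question:
my_conn_table %>%
  select(id, type, date) %>%
  filter(type == "Business") %>%
  show_query()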
A further thing to investigate is how your data is indexed. Like an index in a book, an index on a database table helps find records faster.
You might only be asking for 1000 records with type = 'business' but if these records only start from row 2 million, then the computer has to scan two-thirds of your data before it finds the first record of interest.
You can add indexes to existing database tables using something like the following:
query <- paste0("CREATE NONCLUSTERED INDEX my_index_name ON ",
                database_name, ".", schema_name, ".", table_name,
                " (", column_name, ");")
DBI::dbExecute(db_connection, query)
Note that this syntax is for SQL Server. Your database may require slightly different syntax. In practice I wrap this in a function with additional checks, such as 'does the indexing column exist in the table?' This is demonstrated here.
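Since the question is about MySQL, a hedged sketch of the MySQL equivalent could look like the following (the index and table names are placeholders, and you need the appropriate privileges):
DBI::dbExecute(db_connection, "CREATE INDEX idx_type ON my_table_name (type);")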

One thing you could try is to run the same query but using the DBI package:
res <- DBI::dbSendQuery(con, "SELECT id, type, date FROM your_table_name WHERE type = 'Business'")
If it's the same speed, it's your data. If not, it's dbplyr. You could also try playing around with different filter parameters (maybe specify a single ID or date and see how long that takes) and see if the problem persists.
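One way to compare, assuming the con object and table name above, is to time a limited fetch:
system.time({
  res <- DBI::dbSendQuery(con, "SELECT id, type, date FROM your_table_name WHERE type = 'Business'")
  out <- DBI::dbFetch(res, n = 1000)  # fetch only the first 1000 matching rows
  DBI::dbClearResult(res)
})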

Related

Snowflake, Python/Jupyter analysis

I am new to Snowflake, and running a query to get a couple of days' data - this returns more than 200 million rows and takes a few days. I tried running the same query in Jupyter - and the kernel restarts/dies before the query ends. Even if it got into Jupyter - I doubt I could analyze the data in any reasonable timeline (but maybe using dask?).
I am not really sure where to start - I am trying to check the data for missing values, and my first instinct was to use Jupyter - but I am lost at the moment.
My next idea is to stay within Snowflake - and check the columns there with case statements (e.g. sum(case when column_value = '' then 1 else 0 end) as number_missing_values).
Does anyone have any ideas/direction I could try - or know if I'm doing something wrong?
Thank you!
Not really the answer you are looking for, but:
sum(case when column_value = '' then 1 else 0 end) as number_missing_values
When you say missing values, note that this will only find values that are an empty string.
This can also be written in a simpler form as:
count_if(column_value = '') as number_missing_values
The database already knows how many rows are in a table, and it knows how many null values each column has. If you are loading data into a table, it might make more sense to not load empty strings and use null instead; then, for no compute cost, you can run:
count(*) - count(column) as number_empty_values
Also of note: if you have two tables in Snowflake, you can compare them via MINUS,
aka
select * from table_1
minus
select * from table_2
This is useful for finding missing rows, though you do have to run it in both directions.
You can also HASH rows, or hash the whole table via HASH_AGG.
But normally when looking for missing data you have an external system, so the driver is 'what can that system handle' and finding common ground.
Also, in the past we were searching for bugs in our processing that caused duplicate data (where we needed/wanted no duplicates), and that is where the above, and COUNT DISTINCT-like commands, come in useful.

How to access unaggregated results when aggregation is needed due to dataset size in R

My task is to get total inbound leads for a group of customers, leads by month for the same group of customers and conversion rate of those leads.
The dataset I'm pulling from is 20 million records so I can't query the whole thing. I have successfully done the first step (getting the total lead count for each org) with this:
inbound_leads <- domo_get_query('6d969e8b-fe3e-46ca-9ba2-21106452eee2',
                                auto_limit = TRUE,
                                query = "select org_id,
                                         COUNT(*)
                                         from table
                                         GROUP BY org_id
                                         ORDER BY org_id")
DOMO is the BI tool I'm pulling from, and domo_get_query is an internal function from a custom library my company built. It takes a query argument (which is a MySQL query) and various others which aren't important right now.
sample data looks like this:
org_id  inserted_at  lead_converted_at
     1  10/17/2021   2021-01-27T03:39:03
     2  10/18/2021   2021-01-28T03:39:03
     1  10/17/2021   2021-01-28T03:39:03
     3  10/19/2021   2021-01-29T03:39:03
     2  10/18/2021   2021-01-29T03:39:03
I have looked through many aggregation tutorials online, but none of them seem to cover how to get data that is needed pre-aggregation (such as the number of leads per month per org, which isn't possible once the aggregation has occurred: in the sample above, aggregating removes the ability to see more than one instance of org_id 1, for example) from a dataset that needs to be aggregated in order to be accessed in the first place. Maybe I just don't understand this enough to know the right questions to ask. Any direction appreciated.
If you're unable to fit your data in memory, you have a few options. You could process the data in batches (e.g. one year at a time) so that it fits in memory. You could use a package like chunked to help.
But in this case I would bet the easiest way to handle your problem is to solve it entirely in your SQL query. To get leads by month, you'll need to truncate your date column and group by org_id, month.
To get the conversion rate for leads in those months, you could add a column (in addition to your count column) that is something like:
sum(case when conversion_date is not null then 1 else 0 end) as convert_count
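Putting both pieces together, a hedged sketch using the column names from the sample data (org_id, inserted_at, lead_converted_at) and the same domo_get_query helper might look like the following; DATE_FORMAT assumes a MySQL-style dialect and that inserted_at is stored as a date:
monthly_leads <- domo_get_query('6d969e8b-fe3e-46ca-9ba2-21106452eee2',
                                auto_limit = TRUE,
                                query = "select org_id,
                                                DATE_FORMAT(inserted_at, '%Y-%m') as lead_month,
                                                COUNT(*) as lead_count,
                                                sum(case when lead_converted_at is not null then 1 else 0 end) as convert_count
                                         from table
                                         GROUP BY org_id, lead_month
                                         ORDER BY org_id, lead_month")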

SparkR DataFrame partitioning issue

In my R script, I have a SparkDataFrame of two columns (time, value) containing data for four different months. Because of the fact that I need to apply my function to an each month separately, I figured I would repartition it into four partitions where each of them would hold data for a separate month.
I created an additional column called partition, having an integer value 0 - 3 and after that called the repartition method by this specific column.
Sadly as it's being described in this topic:
Spark SQL - Difference between df.repartition and DataFrameWriter partitionBy?, with the repartition method we are only sure that all the data with the same key will end up in the same partition, however data with a different key can also end up in the same partition.
In my case, executing code visible below results in creating 4 partitions but populating only 2 of them with data.
I guess I should be using the partitionBy method, however in case of SparkR I have no idea how to do that.
The official documentation states that this method is applied to something called WindowSpec and not a DataFrame.
I would really appreciate some help with this matter as I have no idea how to incorporate this method into my code.
sparkR.session(
  master = "local[*]",
  sparkConfig = list(spark.sql.shuffle.partitions = "4"))

df <- as.DataFrame(inputDat)  # this is a dataframe with the added partition column
repartitionedDf <- repartition(df, col = df$partition)

schema <- structType(
  structField("time", "timestamp"),
  structField("value", "double"),
  structField("partition", "string"))

processedDf <- dapply(repartitionedDf,
                      function(x) { data.frame(produceHourlyResults(x), stringsAsFactors = FALSE) },
                      schema)
You are using the wrong method. If you
need to apply my function to an each month separately
you should use gapply, which
Groups the SparkDataFrame using the specified columns and applies the R function to each group.
df %>% group_by("month") %>% gapply(fun, schema)
or
df %>% gapply("month", fun, schema)
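A hedged sketch of what that could look like for the columns in the question (time, value, partition), reusing the OP's own produceHourlyResults() function; grouping on the partition column and reusing the posted schema are assumptions based on the question's code:
processedDf <- gapply(
  df,
  "partition",                      # group by the month/partition column
  function(key, x) {
    data.frame(produceHourlyResults(x), stringsAsFactors = FALSE)
  },
  schema)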
In my case, executing code visible below results in creating 4 partitions but populating only 2 of them with data.
This suggests hash collisions. Increasing the number of partitions to something reasonably above the number of unique keys should resolve the problem:
spark.sql.shuffle.partitions 17
I guess I should be using the partitionBy method, however
No. partitionBy is used with window functions (SparkR window function).
To address your comment:
i decided to use dapply with separate partitions in order to be able to easily save each month into separate CSV file
Hash partitioner doesn't work like that (see How does HashPartitioner work?).
You can try partitionBy in the writer, but I am not sure if it is directly supported in SparkR. It is supported in structured streaming; for batch you may have to call Java methods or use tables with a metastore:
createDataFrame(iris) %>% createOrReplaceTempView("iris_view")
sql(
"CREATE TABLE iris
USING csv PARTITIONED BY(species)
LOCATION '/tmp/iris' AS SELECT * FROM iris_view"
)

dplyr Filter Database Table with Large Number of Matches

I am working with dplyr and the dbplyr package to interface with my database. I have a table with millions of records. I also have a list of values that correspond to the key in that same table I wish to filter. Normally I would do something like this to filter the table.
library(ROracle)
# connect info omitted
con <- dbConnect(...)
# df with values - my_values
con %>% tbl('MY_TABLE') %>% filter(FIELD %in% my_values$FIELD)
However, that my_values object contains over 500K entries (hence why I don't provide actual data here). This is clearly not efficient when they will basically be put into an IN statement (it essentially hangs). Normally if I were writing SQL, I would create a temporary table and write a WHERE EXISTS clause. But in this instance, I don't have write privileges.
How can I make this query more efficient in R?
Not sure whether this will help, but a few suggestions:
Find other criteria for filtering. For example, if my_values$FIELD is consecutive, or the list of values can be inferred from some other columns, you can get help from a between filter: filter(between(FIELD, a, b)).
Divide and conquer. Split my_values into small batches, make queries for each batch, then combine the results. This may take a while, but should be stable and worth the wait.
Looking at your restrictions, I would approach it similarly to how Polor Beer suggested, but I would send one db command per value using purrr::map and then use dplyr::bind_rows() at the end. This way you'll have nicely piped code that will adapt if your list changes. Not ideal, but unless you're willing to write a SQL table variable manually, I'm not sure of any other solutions.
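A hedged sketch combining both suggestions, mapping over batches of values rather than individual values to keep the number of queries manageable; the names follow the question and the batch size of 1,000 is arbitrary:
library(dplyr)
library(purrr)

# Split the 500K+ values into chunks of ~1000
batches <- split(my_values$FIELD, ceiling(seq_along(my_values$FIELD) / 1000))

# Run one query per batch and row-bind the results (map_dfr = map + bind_rows)
result <- map_dfr(batches, function(vals) {
  con %>%
    tbl('MY_TABLE') %>%
    filter(FIELD %in% !!vals) %>%   # !! forces vals to be inlined into the IN clause
    collect()
})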

How to efficiently merge these data.tables

I want to create a certain data.table to be able to check for missing data.
Missing data in this case does not mean there will be an NA; the entire row will just be left out. So I need to be able to see, for a certain time dependent column, which values are missing for which level of another column. It is also important whether the missing values occur close together or are spread across the dataset.
So I have this 6.000.000x5 data.table (Call it TableA) containing the time dependent variable, an ID for the level and the value N which I would like to add to my final table.
I have another table (TableB) which is 207x2. This couples the ID's for the factor to the columns in TableC.
TableC is 1.500.000x207 of which each of the 207 columns correspond to an ID according to TableB and the rows correspond to the time dependent variable in TableA.
These tables are large and although I recently acquired extra RAM (now totalling 8GB), my computer keeps swapping TableC out; for each write it has to be swapped back in, and it gets swapped out again right after. This swapping is what is consuming all my time: about 1.6 seconds per row of TableA, and as TableA has 6.000.000 rows, this operation would take more than 100 days running non-stop.
Currently I am using a for-loop to loop over the rows of TableA. Doing no operation, this for-loop completes almost instantly. I made a one-line command looking up the correct column and row number for TableC in TableA and TableB, and writing the value from TableA to TableC.
I broke up this one-liner to do a system.time analysis and each step takes about 0 seconds except writing to the big TableC.
This showed that writing the value to the table was the most time consuming and looking at my memory use I can see a huge chunk appearing whenever a write happens and it disappears as soon as it is finished.
library(data.table)

TableA <- data.table("Id" = round(runif(200, 1, 100)),
                     "TimeCounter" = round(runif(200, 1, 50)),
                     "N" = round(rnorm(200, 1, 0.5)))
TableB <- data.table("Id" = c(1:100), "realID" = c(100:1))
TSM <- matrix(0, ncol = nrow(TableB), nrow = 50)
TableC <- as.data.table(TSM)
rm(TSM)

for (row in 1:nrow(TableA)) {
  TableCcol <- TableB[realID == TableA[row, Id], Id]
  TableCrow <- TableA[row, TimeCounter]
  val <- TableA[row, N]
  TableC[TableCrow, TableCcol] <- val
}
Can anyone advise me on how to make this operation faster, by preventing the memory swap at the last step in the for-loop?
Edit: On the advice of @Arun I took some time to develop some dummy data to test on. It is now included in the code given above.
I did not include wanted results because the dummy data is random and the routine does work. It's the speed that is the problem.
Not entirely sure about the results, but give it a shot with the dplyr/tidyr packages, as they seem to be more memory efficient than for loops.
install.packages("dplyr")
install.packages("tidyr")
library(dplyr)
library(tidyr)
TableC <- TableC %>% gather(tableC_id, value, 1:207)
This turns TableC from a 1,500,000x207 table into a long-format 310,500,000x2 table with 'tableC_id' and 'value' columns.
TableD <- TableA %>%
  left_join(TableB, c("LevelID" = "TableB_ID")) %>%
  left_join(TableC, c("TableB_value" = "TableC_id"))
These are a couple of packages I've been using of late, and they seem to be very efficient, but the data.table package is designed specifically for managing large tables, so there could be useful functions there. I'd also take a look at sqldf, which allows you to query your data.frames via SQL commands.
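For illustration, a tiny hedged sqldf example against the dummy tables from the question; it simply pairs each row of TableA with the TableB column id that the loop looks up:
library(sqldf)
sqldf("SELECT b.Id AS target_col, a.TimeCounter, a.N
       FROM TableA a
       JOIN TableB b ON b.realID = a.Id")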
Rethinking my problem I came to a solution which works much faster.
The thing is that it does not follow from the question posed above, because I already did a couple of steps to come to the situation described in my question.
Enter TableX from which I aggregated TableA. TableX contains Id's and TimeCounters and much more, that's why I thought it would be best to create a smaller table containing only the information I needed.
TableX also contains the relevant times while in my question I am using a complete time series from the beginning of time (01-01-1970 ;) ). It was way smarter to use the levels in my TimeCounter column to build my TableC.
Also, I had forced myself to set values individually, while merging is a lot faster in data.table. So my advice is: whenever you need to set a lot of values, try finding a way to merge instead of just copying them individually.
Solution:
# Create a table with time on the row dimension, using only the TimeCounters we find in our original data.
TableC <- data.table(TimeCounter = as.numeric(levels(factor(TableX[, TimeCounter]))))
setkey(TableC, TimeCounter)  # important to set the correct key for the merge

# Loop over all unique Id's (maybe this can be reworked into something *apply()ish)
for (i in levels(factor(TableX[, Id]))) {
  # Count how many samples we have for this Id per TimeCounter
  TableD <- TableX[Id == i, .N, by = TimeCounter]
  setkey(TableD, TimeCounter)  # set key for the merge
  # Merge with Id on the column dimension
  TableC[TableD, paste("somechars", i, sep = "") := N]
}
There could be steps missing in the TimeCounter, so now I have to check for gaps in TableC and insert rows that were missing for all Id's. Then I can finally check where my data gaps are and how big they are.
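One hedged way to expose those gaps afterwards (assuming TimeCounter steps by 1) is to join the keyed TableC against the full range of TimeCounters:
# Rows for missing TimeCounters show up with NA in every Id column.
full_times <- data.table(TimeCounter = seq(min(TableC$TimeCounter), max(TableC$TimeCounter)))
setkey(full_times, TimeCounter)
TableC_full <- TableC[full_times]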
