I currently have the following SQL query:
SELECT sector, COUNT(sector)
FROM clients
GROUP BY sector
ORDER BY COUNT(sector) DESC LIMIT 3;
So I am trying to move from SQL Developer to the tidyverse, but I am encountering difficulties with certain queries because the translation is not always straightforward (or not to me at least), so I was wondering if anyone could help me.
This is just a basic query on a single table, in which I have to see how many clients are in each sector. What is its equivalent in R?
With tidyverse, this can be achieved by getting the frequency count of 'sector', arranging the frequency column 'n' in descending order, and slicing the first 3 rows:
library(tidyverse)
clients %>%
  count(sector) %>%
  arrange(desc(n)) %>%
  slice(1:3)
data
set.seed(24)
clients <- data.frame(sector = sample(letters[1:10], 50, replace = TRUE),
                      val = rnorm(50))
The sqldf library, if you are open to it, would actually let you continue using your SQL syntax:
library(sqldf)
sql <- "SELECT sector, COUNT(sector)
FROM clients
GROUP BY sector
ORDER BY COUNT(sector) DESC LIMIT 3"
result <- sqldf(sql)
The sqldf package runs SQLite under the hood by default, though you can switch to another database flavor if you want to. This suggestion might make sense for you if you have to port a lot of SQL logic over to R but don't want to risk rewriting everything in tidyverse, base R, or another package.
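As a quick sanity check, the sqldf route and the tidyverse route can be run side by side on the dummy clients data from the other answer; this is just a sketch, and barring ties both should return the same three sectors:

library(sqldf)
library(dplyr)

sql_top3 <- sqldf("SELECT sector, COUNT(sector) AS n
                   FROM clients
                   GROUP BY sector
                   ORDER BY COUNT(sector) DESC LIMIT 3")

tidy_top3 <- clients %>%
  count(sector) %>%
  arrange(desc(n)) %>%
  slice(1:3)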
I have a large MySQL table (92 cols, 3 million rows) that I'm wrangling in R with dbplyr. I'm new to the package and had a question or two about dbplyr's lazy evaluation as I'm experiencing some considerable slowness.
Suppose I have connected to my database and wish to perform a select operation to reduce the number of columns:
results <- my_conn_table %>%
  select(id, type, date)
Despite there being millions of rows, viewing results is near instantaneous (only showing 3 rows for clarity):
> results
# Source: lazy query [?? x 3]
# Database: mysql []
id type date
<int> <chr> <date>
1 1 Customer 1997-01-04
2 2 Customer 1997-02-08
3 3 Business 1997-03-29
...
However, if I attempt to perform a filter operation with:
results <- my_conn_table %>%
  select(id, type, date) %>%
  filter(type == "Business")
the operation takes a very long time to process (tens of minutes in some instances). Is this long processing time simply a function of nrow ~= 3 million? (In other words, is there nothing I can do about it because the dataset is large?) Or can someone suggest some general ways to make this more performant?
My initial understanding of lazy evaluation was that the filter() query would only return the top few rows, preventing the long-running scenario. If I wanted all the data, I could run collect() to gather the results into my local R session (which I would expect to take a considerable amount of time, depending on the query).
Building on #NovaEthos's answer, you can call show_query(results) to get the SQL query that dbplyr generated and is passing to the database. Posting that query here would make it clear whether there is any inefficiency in how the database is being queried.
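For example, a minimal sketch using the objects from the question (same table and column names as given there):

results <- my_conn_table %>%
  select(id, type, date) %>%
  filter(type == "Business")

show_query(results)  # prints the SQL dbplyr would send, without executing the query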
A further thing to investigate is how your data is indexed. Like an index in a book, an index on a database table helps find records faster.
You might only be asking for 1000 records with type = 'business' but if these records only start from row 2 million, then the computer has to scan two-thirds of your data before it finds the first record of interest.
You can add indexes to existing database tables using something like the following:
query <- paste0("CREATE NONCLUSTERED INDEX my_index_name ON ",
                database_name, ".", schema_name, ".", table_name,
                " (", column_name, ");")
DBI::dbExecute(db_connection, query)
Note that this syntax is for SQL Server; your database may require slightly different syntax. In practice I wrap this in a function with additional checks, such as 'does the indexing column exist in the table?', as demonstrated here.
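A rough sketch of such a wrapper is below. The function and argument names are illustrative rather than taken from the post above, and it assumes the table is visible to DBI::dbListFields under its bare name:

add_index <- function(db_connection, database_name, schema_name, table_name, column_name,
                      index_name = paste0("ix_", table_name, "_", column_name)) {
  # Basic sanity check: does the indexing column exist in the table?
  stopifnot(column_name %in% DBI::dbListFields(db_connection, table_name))
  query <- paste0("CREATE NONCLUSTERED INDEX ", index_name,
                  " ON ", database_name, ".", schema_name, ".", table_name,
                  " (", column_name, ");")
  DBI::dbExecute(db_connection, query)
}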
One thing you could try is to run the same query but using the DBI package:
res <- DBI::dbSendQuery(con, "SELECT id, type, date FROM your_table_name WHERE type = 'Business'")
If it's the same speed, it's your data. If not, it's dbplyr. You could also try playing around with different filter parameters (maybe specify a single ID or date and see how long that takes) and see if the problem persists.
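Note that dbSendQuery() only sends the query; to make the timing comparison fair you also need to fetch the rows, roughly like this (a sketch reusing the connection from the question):

system.time({
  res <- DBI::dbSendQuery(con, "SELECT id, type, date FROM your_table_name WHERE type = 'Business'")
  out <- DBI::dbFetch(res)  # actually pulls the rows back to R
  DBI::dbClearResult(res)
})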
How do I perform the GROUP BY ... HAVING query using dbplyr in dplyr?
I have a list of IDs and I have to group by IDs which are not in this list.
Is there a way I can directly execute the query with tbl()? If not, what is the dplyr verb for the same?
Using the group_by_if function from dplyr doesn't seem to do it.
I want to execute something like
SELECT * FROM TBL
WHERE YEAR(DATE) = 2001
GROUP BY COL1 HAVING COL2 NOT IN ID_LIST
where ID_LIST is an R vector
For the example you have given, it is not clear to me how
SELECT * FROM TBL
WHERE YEAR(DATE) = 2001
GROUP BY COL1
HAVING COL2 NOT IN ID_LIST
is different from
SELECT * FROM TBL
WHERE YEAR(DATE) = 2001
AND COL2 NOT IN ID_LIST
GROUP BY COL1
Hence #Rohit's suggestion of applying a filter is an effective solution.
HAVING largely operates the same way as WHERE, except that it is applied after aggregation and lets you use aggregate functions in its conditions. See this discussion. In this case you are not using aggregate functions in the HAVING clause, so you should be free to use a WHERE clause instead.
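In dbplyr terms, that means an ordinary filter() before the group_by(). A minimal sketch, assuming con is your connection, ID_LIST is the R vector from the question, and a row count is the aggregate you are after:

library(dplyr)

tbl(con, "TBL") %>%
  filter(YEAR(DATE) == 2001, !(COL2 %in% !!ID_LIST)) %>%  # becomes the WHERE clause; YEAR() is passed through to SQL, !! inlines the local vector
  group_by(COL1) %>%
  summarise(n = n())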
Regarding the nested SQL queries that dbplyr produces: it might seem counterintuitive given the usual emphasis on clean, human-readable code, but for dbplyr's auto-generated queries I recommend not worrying about the quality of the machine-generated code. It is written by a machine and it is (mostly) read by a machine, so its human readability is less important.
Efficiency could be a concern with many layers of nesting. However, dbplyr gained a basic SQL optimiser on 2017-06-09. I have not found (though I have not tested extensively) nested auto-generated queries to perform significantly worse than non-nested, user-written queries. But if performance is critical, you probably want to build your SQL query manually by pasting together text strings in R.
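If you do go the manual route, a rough sketch looks like this (it assumes ID_LIST is numeric, so the values need no quoting, and that con is your DBI connection):

id_sql <- paste0("(", paste(ID_LIST, collapse = ", "), ")")
query <- paste0("SELECT * FROM TBL ",
                "WHERE YEAR(DATE) = 2001 AND COL2 NOT IN ", id_sql)
result <- DBI::dbGetQuery(con, query)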
One final thought - the length of ID_LIST is also important to consider. It is discussed in this question.
I am working with dplyr and the dbplyr package to interface with my database. I have a table with millions of records. I also have a list of values that correspond to the key in that same table I wish to filter. Normally I would do something like this to filter the table.
library(ROracle)
# connect info omitted
con <- dbConnect(...)
# df with values - my_values
con %>% tbl('MY_TABLE') %>% filter(FIELD %in% my_values$FIELD)
However, the my_values object contains over 500K entries (which is why I don't provide actual data here). This is clearly not efficient, since the values will basically be put into an IN statement (it essentially hangs). Normally, if I were writing SQL, I would create a temporary table and write a WHERE EXISTS clause. But in this instance, I don't have write privileges.
How can I make this query more efficient in R?
Not sure whether this will help, but here are a few suggestions:
Find other criteria for filtering. For example, if my_values$FIELD is consecutive, or the list of values can be inferred from some other column, you can get help from the between filter: filter(between(FIELD, a, b)).
Divide and conquer. Split my_values into small batches, make a query for each batch, then combine the results. This may take a while, but it should be stable and worth the wait; a rough sketch of this batching approach is shown below.
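A minimal sketch of the batching idea, assuming con and my_values as in the question (the batch size of 1,000 is arbitrary):

library(dplyr)
library(purrr)

batches <- split(my_values$FIELD, ceiling(seq_along(my_values$FIELD) / 1000))

result <- map(batches, function(vals) {
  con %>%
    tbl("MY_TABLE") %>%
    filter(FIELD %in% !!vals) %>%  # !! inlines the local batch into the generated IN (...) clause
    collect()
}) %>%
  bind_rows()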
Looking at your restrictions, I would approach it similarly to how Polor Beer suggested, but I would send one db command per value using purrr::map() and then combine the results with dplyr::bind_rows() at the end. This way you'll have nicely piped code that will adapt if your list changes. Not ideal, but unless you're willing to write a SQL table variable manually, I'm not sure of any other solutions.
I want to create a certain data.table to be able to check for missing data.
Missing data in this case does not mean there will be an NA; the entire row will just be left out. So, for a certain time-dependent column, I need to be able to see which values are missing for which level of another column. It is also important to know whether the missing values occur in large runs or are spread across the dataset.
So I have this 6,000,000 x 5 data.table (call it TableA) containing the time-dependent variable, an ID for the level, and the value N which I would like to add to my final table.
I have another table (TableB), which is 207 x 2. This couples the IDs for the factor to the columns in TableC.
TableC is 1,500,000 x 207; each of the 207 columns corresponds to an ID according to TableB, and the rows correspond to the time-dependent variable in TableA.
These tables are large, and although I recently acquired extra RAM (now totalling 8 GB), my computer keeps swapping TableC out: for each write it has to be paged back in, and it gets swapped out again afterwards. This swapping is what is consuming all my time, about 1.6 seconds per row of TableA, and as TableA has 6,000,000 rows this operation would take more than 100 days running non-stop.
Currently I am using a for-loop over the rows of TableA. Doing no operation, this for-loop runs almost instantly. I made a one-line command that looks up the correct column and row number for TableC from TableA and TableB and writes the value from TableA to TableC.
I broke this one-liner up for a system.time analysis; each step takes about 0 seconds except writing to the big TableC.
This showed that writing the value to the table is the most time-consuming step; looking at my memory use, I can see a huge chunk appearing whenever a write happens, and it disappears as soon as the write is finished.
library(data.table)

TableA <- data.table("Id" = round(runif(200, 1, 100)),
                     "TimeCounter" = round(runif(200, 1, 50)),
                     "N" = round(rnorm(200, 1, 0.5)))
TableB <- data.table("Id" = c(1:100), "realID" = c(100:1))
TSM <- matrix(0, ncol = nrow(TableB), nrow = 50)
TableC <- as.data.table(TSM)
rm(TSM)

for (row in 1:nrow(TableA))
{
  TableCcol <- TableB[realID == TableA[row, Id], Id]
  TableCrow <- TableA[row, TimeCounter]
  val <- TableA[row, N]
  TableC[TableCrow, TableCcol] <- val
}
Can anyone advise me on how to make this operation faster, by preventing the memory swap at the last step in the for-loop?
Edit: On the advice of #Arun I took some time to develop some dummy data to test on. It is now included in the code given above.
I did not include the expected output because the dummy data is random and the routine does work; it's the speed that is the problem.
Not entirely sure about the results, but give it a shot with the dplyr/tidyr packages, as they seem to be more memory efficient than for loops.
install.packages("dplyr")
install.packages("tidyr")
library(dplyr)
library(tidyr)
TableC <- TableC %>% gather(tableC_id, value, 1:207)
This turns TableC from a 1,500,000 x 207 table into a long-format 310,500,000 x 2 table with 'tableC_id' and 'value' columns.
TableD <- TableA %>%
  left_join(TableB, by = c("LevelID" = "TableB_ID")) %>%
  left_join(TableC, by = c("TableB_value" = "tableC_id"))
These are a couple of packages I've been using of late, and they seem to be very efficient; but the data.table package is designed specifically for managing large tables, so there could be useful functions there too. I'd also take a look at sqldf, which allows you to query your data.frames via SQL commands.
Rethinking my problem, I came to a solution which works much faster.
The thing is, it does not follow from the question posed above, because I had already done a couple of steps to arrive at the situation described in my question.
Enter TableX, from which I aggregated TableA. TableX contains Ids and TimeCounters and much more, which is why I thought it would be best to create a smaller table containing only the information I needed.
TableX also contains only the relevant times, whereas in my question I was using a complete time series from the beginning of time (01-01-1970 ;) ). It was much smarter to use the levels in my TimeCounter column to build TableC.
Also, I had been setting values individually, while merging is a lot faster in data.table. So my advice is: whenever you need to set a lot of values, try to find a way to merge instead of copying them in individually.
Solution:
# Create a table with time on the row dimension, using only the TimeCounters found in the original data.
TableC <- data.table(TimeCounter = as.numeric(levels(factor(TableX[, TimeCounter]))))
setkey(TableC, TimeCounter)  # important to set the correct key for the merge
# Loop over all unique Ids (maybe this can be reworked into something *apply()ish)
for (i in levels(factor(TableX[, Id])))
{
  # Count how many samples we have for this Id and TimeCounter
  TableD <- TableX[Id == i, .N, by = TimeCounter]
  setkey(TableD, TimeCounter)  # set key for merge
  # Merge with Id on the column dimension
  TableC[TableD, paste("somechars", i, sep = "") := N]
}
There could be steps missing in the TimeCounter, so now I have to check for gaps in TableC and insert the rows that were missing for all Ids. Then I can finally check where my data gaps are and how big they are.
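A minimal sketch of that gap-filling step, assuming TimeCounter should run in steps of 1 between its observed minimum and maximum (adjust the sequence to your real time grid):

full_times <- data.table(TimeCounter = seq(min(TableC$TimeCounter), max(TableC$TimeCounter), by = 1))
setkey(full_times, TimeCounter)
TableC_full <- TableC[full_times]  # keyed join keeps every TimeCounter; previously missing rows come back as NA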
I am good with SQL, and I naturally use the sqldf package.
However, it is useful to know the native R way to achieve various SQL commands.
On a data frame column, how can I achieve a count similar to the last command below?
library(ggplot2)
library(reshape2)  # the tips data set used below ships with reshape2
library(sqldf)
head(tips, 3)
sqldf("select count(distinct day) from tips")
OK, I got a little better now and can answer my own question:
d <- table(tips$day)
and then count how many entries that table has with dim(d).
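For completeness, a couple of other native equivalents of count(distinct day) (the second assumes dplyr is installed):

length(unique(tips$day))     # base R
dplyr::n_distinct(tips$day)  # dplyr helper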