How to count distinct entries within a column of a data.frame in R

I am good with SQL, and I naturally use the sqldf package.
However, it is useful to know the native R way to achieve various SQL commands.
On a data.frame column, how can I achieve a similar count as in the last command below?
library(reshape2)  # the tips data set ships with reshape2, not ggplot2
library(sqldf)
head(tips, 3)
sqldf("select count(distinct day) from tips")

OK, I have learned a little more now and can answer my own question.
d <- table(tips$day)
builds a frequency table, and dim(d) (or length(d)) then tells me how many distinct values there are.
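For reference, the same distinct count can be obtained more directly in base R or dplyr; a minimal sketch, assuming the tips data set is loaded as above:
length(unique(tips$day))      # base R: number of distinct values
dplyr::n_distinct(tips$day)   # dplyr equivalent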

Related

Translating from SQL to R

I currently have the following SQL query:
SELECT sector, COUNT(sector)
FROM clients
GROUP BY sector
ORDER BY COUNT(sector) DESC LIMIT 3;
I am trying to move from SQL Developer to the tidyverse, but I am encountering difficulties when trying to run certain sequences, because it is obviously not as straightforward (or not to me, at least).
So I was wondering if anyone could help me.
This is just a basic query against a single table in which I have to see how many clients are in each sector.
What is its equivalent in R?
Can anyone assist me, please?
With the tidyverse, this can be achieved by getting the frequency count of 'sector', arranging the frequency column 'n' in descending order, and slicing the first 3 rows:
library(tidyverse)
clients %>%
count(sector) %>%
arrange(desc(n)) %>%
slice(1:3)
data
set.seed(24)
clients <- data.frame(sector = sample(letters[1:10], 50,
replace = TRUE), val = rnorm(50))
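A slightly more compact variant of the same pipeline, assuming dplyr 1.0 or later for slice_head(); count(sort = TRUE) does the descending arrange for you:
library(dplyr)
clients %>%
  count(sector, sort = TRUE) %>%
  slice_head(n = 3)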
The sqldf library, if you are open to it, would actually let you continue using your SQL syntax:
library(sqldf)
sql <- "SELECT sector, COUNT(sector)
FROM clients
GROUP BY sector
ORDER BY COUNT(sector) DESC LIMIT 3"
result <- sqldf(sql)
The sqldf package runs SQLite under the hood by default, though you can switch to another database flavor if you want to. This suggestion might make sense for you if you have to port a lot of SQL logic over to R but don't want to take the risk of having to rewrite everything using the tidyverse, base R, or another package.
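As a hedged aside on swapping the backend: per my reading of the sqldf documentation, the driver can be chosen via the drv argument (or the sqldf.driver option); treat the exact driver names below as assumptions to verify against ?sqldf, and note that non-SQLite backends need their own driver packages installed.
library(sqldf)
# "SQLite" is the default; "H2", "MySQL" and "PostgreSQL" are assumptions
# taken from the sqldf documentation and require the matching R packages.
result <- sqldf(sql, drv = "SQLite")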

Difference between dates in SQLDF in R

I am using the R package sqldf and am having trouble finding the number of days between two date-time variables. The variables ledger_entry_created_at and created_at are Unix epochs, and when I try to subtract them after casting to julianday, I get a vector of NAs.
I've taken a look at this previous question and didn't find it useful, since my answer needs to be given in SQL for reasons that are outside the scope of this question.
If anyone could help me figure out a way to do this inside SQLDF I would be grateful.
EDIT:
SELECT strftime('%Y-%m-%d %H:%M:%S', l.created_at, 'unixepoch') ledger_entry_created_at,
l.ledger_entry_id, l.account_id, l.amount, a.user_id, u.created_at
FROM ledger l
LEFT JOIN accounts a
ON l.account_id = a.account_id
LEFT JOIN users u
ON a.user_id = u.user_id
This answer is trivial, but if you already have two UNIX timestamps, and you want to find out how many days have elapsed between them, you can simply take the difference in seconds (their original unit), and convert to days, e.g.
SELECT
(l.created_at - u.created_at) / (3600*24) AS diff
-- and maybe other columns here
FROM ledger l
LEFT JOIN accounts a
ON l.account_id = a.account_id
LEFT JOIN users u
ON a.user_id = u.user_id;
I don't know why your current approach is failing, as the timestamps you have in the screen capture should be valid inputs to SQLite's julianday function. But, again, you may not need such a complicated route to get the result you want.
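For completeness, here is a minimal sketch of running that day-difference query through sqldf from R. The three one-row tables are made-up stand-ins with Unix-epoch columns, not the asker's data:
library(sqldf)
# Hypothetical stand-ins for the real tables, with Unix-epoch timestamps.
ledger   <- data.frame(ledger_entry_id = 1, account_id = 1, created_at = 1500000000)
accounts <- data.frame(account_id = 1, user_id = 1)
users    <- data.frame(user_id = 1, created_at = 1499000000)
sqldf("SELECT (l.created_at - u.created_at) / (3600.0 * 24) AS diff_days
       FROM ledger l
       LEFT JOIN accounts a ON l.account_id = a.account_id
       LEFT JOIN users u ON a.user_id = u.user_id")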

dplyr Filter Database Table with Large Number of Matches

I am working with dplyr and the dbplyr package to interface with my database. I have a table with millions of records. I also have a list of values that correspond to the key in that same table I wish to filter. Normally I would do something like this to filter the table.
library(ROracle)
# connect info omitted
con <- dbConnect(...)
# df with values - my_values
con %>% tbl('MY_TABLE') %>% filter(FIELD %in% my_values$FIELD)
However, that my_values object contains over 500K entries (hence why I don't provide actual data here). This is clearly not efficient when the values will basically be put into an IN statement (it essentially hangs). Normally, if I were writing SQL, I would create a temporary table and write a WHERE EXISTS clause. But in this instance, I don't have write privileges.
How can I make this query more efficient in R?
Not sure whether this will help, but a few suggestions:
Find other criteria for filtering. For example, if my_values$FIELD is consecutive or the list of values can be inferred from some other columns, you can use the between() helper: filter(between(FIELD, a, b)).
Divide and conquer. Split my_values into small batches, make queries for each batch, then combine the results. This may take a while, but should be stable and worth the wait.
Looking at your restrictions, I would approach it similarly to how Polor Beer suggested, but I would send one db command per value using purrr::map and then use dplyr::bind_rows() at the end (see the batched sketch below, which combines this with the divide-and-conquer idea). This way you'll have nicely piped code that will adapt if your list changes. Not ideal, but unless you're willing to write a SQL table variable manually, I'm not sure of any other solutions.
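A rough sketch of that batched approach; con, MY_TABLE, FIELD and the batch size are placeholders taken from the question, not tested against a real Oracle backend:
library(dplyr)
library(purrr)
# Split the ~500K values into chunks of, say, 1000 per query (arbitrary size).
batches <- split(my_values$FIELD, ceiling(seq_along(my_values$FIELD) / 1000))
# One query per batch; dbplyr inlines each local vector into an IN (...) clause.
result <- map_dfr(batches, function(vals) {
  tbl(con, "MY_TABLE") %>%
    filter(FIELD %in% vals) %>%
    collect()
})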

How to efficiently merge these data.tables

I want to create a certain data.table to be able to check for missing data.
Missing data in this case does not mean there will be an NA; the entire row will just be left out. So I need to be able to see, for a certain time-dependent column, which values are missing for which level of another column. It is also important whether the missing values are clustered together or spread across the dataset.
So I have this 6,000,000 x 5 data.table (call it TableA) containing the time-dependent variable, an ID for the level, and the value N which I would like to add to my final table.
I have another table (TableB), which is 207 x 2. This couples the IDs for the factor to the columns in TableC.
TableC is 1,500,000 x 207; each of the 207 columns corresponds to an ID according to TableB, and the rows correspond to the time-dependent variable in TableA.
These tables are large, and although I recently acquired extra RAM (now totalling 8 GB), my computer keeps swapping TableC out; for each write it has to be swapped back in, and it is swapped out again afterwards. This swapping is what is consuming all my time: about 1.6 seconds per row of TableA, and as TableA has 6,000,000 rows, this operation would take more than 100 days running non-stop.
Currently I am using a for-loop to loop over the rows of TableA. With no operation inside it, this for-loop runs almost instantly. I then made a one-line command that looks up the correct column and row number for TableC in TableA and TableB and writes the value from TableA to TableC.
I broke up this one-liner to do a system.time analysis, and each step takes about 0 seconds except writing to the big TableC.
This showed that writing the value to the table is the most time-consuming step; looking at my memory use, I can see a huge chunk appearing whenever a write happens, and it disappears as soon as the write is finished.
library(data.table)
# Dummy data: 200 observations with an Id, a TimeCounter and a value N.
TableA <- data.table("Id"=round(runif(200, 1, 100)), "TimeCounter"=round(runif(200, 1, 50)), "N"=round(rnorm(200, 1, 0.5)))
# Lookup table: realID matches the Id in TableA, Id gives the column number in TableC.
TableB <- data.table("Id"=c(1:100),"realID"=c(100:1))
# TableC: one row per TimeCounter, one column per Id, initialised to 0.
TSM <- matrix(0,ncol=nrow(TableB), nrow=50)
TableC <- as.data.table(TSM)
rm(TSM)
for (row in 1:nrow(TableA))
{
# Map the Id in TableA to the corresponding column of TableC via TableB.
TableCcol <- TableB[realID==TableA[row,Id],Id]
# The row of TableC is given by the TimeCounter.
TableCrow <- (TableA[row,TimeCounter])
val <- TableA[row,N]
# This assignment is the slow, memory-hungry step.
TableC[TableCrow,TableCcol] <- val
}
Can anyone advise me on how to make this operation faster, by preventing the memory swap at the last step in the for-loop?
Edit: On the advice of @Arun I took some time to develop some dummy data to test on. It is now included in the code given above.
I did not include wanted results because the dummy data is random and the routine does work. It's the speed that is the problem.
Not entirely sure about the results, but give it a shot with the dplyr/tidyr packages, as they seem to be more memory efficient than for loops.
install.packages("dplyr")
install.packages("tidyr")
library(dplyr)
library(tidyr)
TableC <- TableC %>% gather(tableC_id, value, 1:207)
This turns TableC from 1,500,000 x 207 into a long-format 310,500,000 x 2 table with 'tableC_id' and 'value' columns.
TableD <- TableA %>%
left_join(TableB, c("LevelID" = "TableB_ID")) %>%
left_join(TableC, c("TableB_value" = "tableC_id"))
These are a couple of packages I've been using of late, and they seem to be very efficient, but the data.table package is designed specifically for managing large tables, so there could be useful functions there. I'd also take a look at sqldf, which allows you to query your data.frames via SQL commands.
Rethinking my problem, I came to a solution which works much faster.
The thing is that it does not follow directly from the question posed above, because I had already done a couple of steps to arrive at the situation described in my question.
Enter TableX, from which I aggregated TableA. TableX contains IDs and TimeCounters and much more, which is why I thought it would be best to create a smaller table containing only the information I needed.
TableX also contains only the relevant times, whereas in my question I was using a complete time series from the beginning of time (01-01-1970 ;) ). It was much smarter to use the levels in my TimeCounter column to build my TableC.
Also, I had forced myself to set values individually, while merging is a lot faster in data.table. So my advice is: whenever you need to set a lot of values, try to find a way to merge instead of copying them in one by one.
Solution:
# Create a table with time on the row dimension by just using the TimeCounters we find in our original data.
TableC <- data.table(TimeCounter=as.numeric(levels(factor(TableX[,TimeCounter]))))
setkey(TableC,TimeCounter) # important to set the correct key for merge.
# Loop over all unique Id's (maybe this can be reworked into something *apply()ish)
for (i in levels(factor(TableX[,Id])))
{
# Count how much samples we have for Id and TimeCounter
TableD <- TableX[Id==i,.N,by=TimeCounter]
setkey(TableD,TimeCounter) # set key for merge
# Merge with Id on the column dimension
TableC[TableD,paste("somechars",i,sep=""):=N]
}
There could be steps missing in the TimeCounter, so now I have to check for gaps in TableC and insert the rows which were missing for all IDs; see the sketch below. Then I can finally check where my data gaps are and how big they are.
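A possible sketch of that gap check, continuing from the TableC built above; it assumes TimeCounter should cover every integer between its observed minimum and maximum (adjust the expected sequence to your real time grid):
library(data.table)
# Expected, gap-free sequence of TimeCounters.
all_times <- data.table(TimeCounter = seq(min(TableC$TimeCounter), max(TableC$TimeCounter)))
setkey(all_times, TimeCounter)
# Anti-join: TimeCounters in the expected sequence that are absent from TableC.
gaps <- all_times[!TableC]
# Insert the missing rows (their Id columns become NA) and restore the order.
TableC <- rbindlist(list(TableC, gaps), fill = TRUE)[order(TimeCounter)]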

Using dplyr::filter, how can the output be limited to just first 500 rows?

dplyr is a great and fast library.
Using the %>% operator enables powerful manipulation.
In my first step, I need to limit the output to only 500 rows max (for display purposes).
How can I do that?
par<-filter(pc,Child_Concept_GID==as.character(mcode)) %>% select(Parent_Concept_GID)
what I need is something like
filter(pc,CONDITION,rows=500)
Is there a direct way, or a nice workaround, without making the first step a separate step (outside the %>% "stream")?
There are a couple of ways to do this, assuming you are pipelining your data (using %>%):
top_n(tn) works with grouped data, but it selects rows by value rather than by position, so it will not simply return the first tn rows of data sorted with arrange()
head(500) takes the first 500 rows (can be used after arrange(), for example)
sample_n(size=500) can be used to select 500 arbitrary rows
If you are looking for the R equivalent to SQL's LIMIT, use head().
I think you're actually looking for slice() here.
filter(pc, condition) %>% slice(1:500)
This does not rank the results; it merely pulls a slice by position, in this case positions 1 through 500.
If this is coming from a relational database, head() is a better option.
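A small illustration of why head() is friendlier to a database backend: with dbplyr it is translated into a LIMIT clause, so the restriction happens inside the database. The table below is a hypothetical in-memory stand-in (memdb_frame() needs the RSQLite package):
library(dplyr)
library(dbplyr)
# Hypothetical stand-in for the real pc table behind a db connection.
pc_db <- memdb_frame(Child_Concept_GID = as.character(1:5),
                     Parent_Concept_GID = letters[1:5])
pc_db %>%
  filter(Child_Concept_GID == "3") %>%
  head(500) %>%
  show_query()
# head(500) appears as a LIMIT 500 clause in the generated SQL.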
To limit the output of filter(), one can pipe the result into top_n().
Credit goes to commenter joran.
Solution:
par <- filter(pc, Child_Concept_GID == as.character(mcode)) %>% top_n(500) %>% select(Parent_Concept_GID)
