I'm using RSQLite to import datasets from an SQLite database. There are many millions of observations in the database, so I'd like to do as much of the data selection and aggregation as possible within the database.
At some point I need to aggregate a character variable: for each group, I want the value that occurs most often. How can I edit the following dplyr chain so that it also works with RSQLite?
library(tidyverse)
library(RSQLite)
# Path to Database
DATABASE="./xxx.db"
# Connect Database
mydb <- dbConnect(RSQLite::SQLite(), DATABASE)
# Load Database
data = tbl(mydb, "BigData")
# Query Database
Summary <- data %>%
  filter(year == 2020) %>%
  group_by(Grouping_variable) %>%
  summarize(count = n(),
            Item_variable = names(which.max(table(Item_variable))))
Within R that code would do its job. Querying the database, however, I get the error Error: near "(": syntax error, presumably because table() and which.max() have no SQL translation.
The original pipe contains more filters and steps.
Example Database would basically look like:
data.frame(Grouping_variable=c("A","A","B","C","C","C","D","D","D","D"),
year=c(2019,2020,2019,2020,2020,2020,2020,2020,2020,2021),
Item_variable=c("X","Y","Y","X","X","Y","Y","Y","X","X"))
Grouping_variable year Item_variable
1 A 2019 X
2 A 2020 Y
3 B 2019 Y
4 C 2020 X
5 C 2020 X
6 C 2020 Y
7 D 2020 Y
8 D 2020 Y
9 D 2020 X
10 D 2021 X
Result should look like:
Grouping_variable count Item_variable
<chr> <int> <chr>
1 A 1 Y
2 C 3 X
3 D 3 Y
Assuming that DF is the data frame defined in the question, the SQL below first calculates the count of each item within each group for the year 2020, giving tmp, and then takes the row whose count is maximal, giving tmp2. SQLite guarantees that when group by is combined with max(), the other selected fields come from the row where the maximum was found. tmp2 also takes the sum of the counts per group, and the final select keeps just the desired columns.
library(sqldf)
sql <- "with tmp as (
select Grouping_variable, count(*) count, Item_variable from DF
where year = 2020
group by Grouping_variable, Item_variable
),
tmp2 as (
select Grouping_variable, max(count), sum(count) count, Item_variable
from tmp
group by Grouping_variable
)
select Grouping_variable, count, Item_variable
from tmp2
"
sqldf(sql)
giving:
Grouping_variable count Item_variable
1 A 1 Y
2 C 3 X
3 D 3 Y
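For reference, similar logic can also stay inside a translatable dplyr chain. This is a hedged sketch, not tested against your database: it assumes an SQLite build with window-function support (3.25+), and unlike the SQL above it returns one row per tied item if two items share the maximal count.
library(dplyr)
Summary <- data %>%
  filter(year == 2020) %>%
  count(Grouping_variable, Item_variable) %>%   # GROUP BY + COUNT(*)
  group_by(Grouping_variable) %>%
  mutate(count = sum(n, na.rm = TRUE)) %>%      # window SUM() per group
  filter(n == max(n, na.rm = TRUE)) %>%         # keep the most frequent item(s)
  select(Grouping_variable, count, Item_variable) %>%
  collect()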
Added
Suppose that DF were a table in your database. This code creates such a database.
library(RSQLite)
m <- dbDriver("SQLite")
con <- dbConnect(m, dbname = "database.sqlite")
dbWriteTable(con, 'DF', DF, row.names = FALSE)
dbDisconnect(con)
then this would run the sql command in the sql string defined above on that database and return the result.
library(RSQLite)
m <- dbDriver("SQLite")
con <- dbConnect(m, dbname = "database.sqlite")
result <- dbGetQuery(con, sql)
dbDisconnect(con)
Related
I am reading in data from a .txt file that contains thousands of records
table1 <- read.table("teamwork.txt", sep ="|", fill = TRUE)
Looks like:
f_name l_name hours_worked code
Jim Baker 8.5 T
Richard Copton 4.5 M
Tina Bar 10 S
However I only want to read in data that has a 'S' or 'M' code:
I tried to filter on the code column:
newdata <- subset(table1, code = 'S' |'M')
However I get this issue:
operations are possible only for numeric, logical or complex types
If there are thousands or tens of thousands of records (maybe not for millions), you should just be able to filter after you read in all the data:
> library(tidyverse)
> df %>% filter(code=="S"|code=="M")
# A tibble: 2 x 4
f_name l_name hours_worked code
<fct> <fct> <dbl> <fct>
1 Richard Copton 4.50 M
2 Tina Bar 10.0 S
If you really want to just pull in the rows that meet your condition, try sqldf package as in example here: How do i read only lines that fulfil a condition from a csv into R?
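For completeness, a hedged sketch of that sqldf route applied here (it assumes teamwork.txt has a header row and that the relevant column is literally named code; inside the sql argument the input file is referred to as file):
library(sqldf)
# read.csv.sql filters rows in SQLite before they ever reach R
table1 <- read.csv.sql("teamwork.txt", sep = "|",
                       sql = "select * from file where code in ('S', 'M')")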
You can try
cols_g <- table1[which(table1$code == "S" | table1$code == "M"), ]
OR
cols_g <- subset(table1, code=="S" | code=="M")
OR
library(dplyr)
cols_g <- table1 %>% filter(code=="S" | code=="M")
If you want the result attached to table1 itself, you can assign the output of any of these three methods to table1$cols_g instead of to a standalone cols_g.
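As a side note, the condition in all three variants can be written more compactly with %in%:
cols_g <- subset(table1, code %in% c("S", "M"))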
I have the following xlsx file df.xlsx which looks like this:
client_id dax dpd
1 2000-05-30 7
1 2000-12-31 6
2 2003-05-21 6
3 1999-12-30 5
3 2000-10-30 6
3 2001-12-30 5
4 1999-12-30 5
4 2002-05-30 6
It's about a loan migration from one snapshot to another. The problem is that I don't have all the months in between (i.e. for client_id = 1, dax jumps from 2000-05-30 to 2000-12-31). I have tried several approaches but without result. I need to populate, per client_id, all the months in between the dax values and keep the same dpd as the first month (i.e. for client_id = 1, dpd = 7 for all the added months, with dpd = 6 only on the last snapshot, 2000-12-31). If a client_id appears only once (like client_id = 2) it should remain as it is.
(dpd means days past due aka rating bucket)
I have tried this code:
df2 <- data.frame(dax=seq(min(df$dax), max(df$dax), by="month"))
df3 <- merge(x=df2, y=df, by="dax", all.x=T)
idx <- which(is.na(df3$values))
for (client_id in idx)
df3$values[client_id] <- df3$values[client_id-1]
df3
but the results were not quite what I need.
I appreciate any advice, thank you very much!
If I understand your question correctly, you want to generate a sequence of dates, given the start and end date.
R code to do this would be (insert values from your dataframe):
seq(as.Date("2017-01-30"), as.Date("2017-12-30"), "month")
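One caveat with month-end dates like those in your data: R rolls invalid dates such as Feb 30 forward rather than clamping them, so check the generated sequence:
seq(as.Date("2017-01-30"), by = "month", length.out = 3)
# [1] "2017-01-30" "2017-03-02" "2017-03-30"   (2017-02-30 rolls over into March)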
Edit after comment:
In this case you can split your data by clients first and then generate the sequences:
new_data <- data.frame()
customerslist <- split(YOURDATA, YOURDATA$id)   # one data frame per client
for(i in 1:length(customerslist)){
  # monthly sequence from the client's first to last snapshot date
  dates <- seq(min(as.Date(customerslist[[i]]$dax)), max(as.Date(customerslist[[i]]$dax)), "month")
  id <- rep(customerslist[[i]]$id[1], length(dates))
  dpd <- rep(customerslist[[i]]$dpd[1], length(dates))   # carry the first dpd forward
  add <- cbind(id, as.character(dates), dpd)
  new_data <- rbind(new_data, add)
}
new_data$V2 <- as.Date(new_data$V2)   # the unnamed date column comes back as V2
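A more compact route, as a hedged sketch with tidyr (assuming columns named id, dax, dpd, with dax already a Date; the month-end caveat above applies to the generated sequence as well): complete() inserts the missing months per client and fill() carries the last observed dpd forward.
library(dplyr)
library(tidyr)
df %>%
  group_by(id) %>%
  complete(dax = seq(min(dax), max(dax), by = "month")) %>%
  fill(dpd) %>%
  ungroup()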
I am trying to add a vector which I generated in R to a sqlite table as a new column. For this I wanted to use dplyr (I installed the most recent dev. version along with the dbplyr package according to this post here). What I tried:
library(dplyr)
library(DBI)
#creating initial database and table
dbcon <- dbConnect(RSQLite::SQLite(), "cars.db")
dbWriteTable(dbcon, name = "cars", value = cars)
cars_tbl <- dplyr::tbl(dbcon, "cars")
#new values which I want to add as a new column
new_values <- sample(c("A","B","C"), nrow(cars), replace = TRUE)
#attempt to add new values as column to the table in the database
cars_tbl %>% mutate(new_col = new_values) #not working
What is an easy way to achieve this (not necessarily with dplyr)?
I'm not aware of a way of doing this with dplyr, but you can do it with RSQLite directly. The problem is not actually with RSQLite, but with the fact that I don't know how to pass a local vector to mutate() on a remote table. Note that, in your code, something like this would work:
cars_tbl %>% mutate(new_col = another_column / 3.14)
Anyway, my alternative. I've created a toy cars dataframe.
cars <- data.frame(year=c(1999, 2007, 2009, 2017), model=c("Ford", "Toyota", "Toyota", "BMW"))
I open connection and actually create the table,
dbcon <- dbConnect(RSQLite::SQLite(), "cars.db")
dbWriteTable(dbcon, name = "cars", value = cars)
Add the new column and check,
dbGetQuery(dbcon, "ALTER TABLE cars ADD COLUMN new_col TEXT")
dbGetQuery(dbcon, "SELECT * FROM cars")
year model new_col
1 1999 Ford <NA>
2 2007 Toyota <NA>
3 2009 Toyota <NA>
4 2017 BMW <NA>
And then you can update the new column, but the only tricky thing is that you have to provide a where statement, in this case I use the year.
new_values <- sample(c("A","B","C"), nrow(cars), replace = TRUE)
new_values
[1] "C" "B" "B" "B"
dbGetPreparedQuery(dbcon, "UPDATE cars SET new_col = ? where year=?",
bind.data=data.frame(new_col=new_values,
year=cars$year))
dbGetQuery(dbcon, "SELECT * FROM cars")
year model new_col
1 1999 Ford C
2 2007 Toyota B
3 2009 Toyota B
4 2017 BMW B
As a unique index, you could always use rownames(cars), but you would have to add it as a column in your dataframe and then in your table.
EDIT after suggestion by #krlmlr: indeed much better using dbExecute instead of deprecated dbGetPreparedQuery,
dbExecute(dbcon, "UPDATE cars SET new_col = :new_col where year = :year",
params=data.frame(new_col=new_values,
year=cars$year))
EDIT after comments: I did not think about this a few days ago, but even if it is a SQLite you can use the rowid. I've tested this and it works.
dbExecute(dbcon, "UPDATE cars SET new_col = :new_col where rowid = :id",
params=data.frame(new_col=new_values,
id=rownames(cars)))
Although you have to make sure that the rowid's in the table are the same as your rownames. Anyway you can always get your rowid's like this:
dbGetQuery(dbcon, "SELECT rowid, * FROM cars")
rowid year model new_col
1 1 1999 Ford C
2 2 2007 Toyota B
3 3 2009 Toyota B
4 4 2017 BMW B
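Coming back to dplyr: one hedged alternative that avoids hand-written SQL is to stage the new values in a second table and join, materializing the result with compute(). This is only a sketch; year works as a join key here solely because it is unique in the toy data, and the column is named new_col2 to avoid clashing with the new_col added above.
library(dplyr)
new_tbl <- copy_to(dbcon, data.frame(year = cars$year, new_col2 = new_values),
                   name = "new_values_tmp", temporary = TRUE)
cars_tbl %>%
  left_join(new_tbl, by = "year") %>%
  compute(name = "cars2", temporary = FALSE)   # writes a new table cars2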
This question already has answers here:
How to number/label data-table by group-number from group_by?
I am using a dplyr table in R. Typical fields would be a primary key, an id number identifying a group, a date field, and some values. I did some manipulation that throws out a bunch of data in some preliminary steps.
In order to do the next step of my analysis (in MC Stan), it'll be easier if both the date and the group id fields are integer indices. So basically, I need to re-index them as integers between 1 and the total number of distinct elements (about 750 for group_id and about 250 for date_id; the group_id is already an integer, but the date is not). This is relatively straightforward to do after exporting to a data frame, but I was curious whether it is possible in dplyr.
My attempt at creating a new date_val (called date_val_new) is below. Per the discussion in the comments I have some fake data. I purposefully made the group and date values not be 1 to whatever, but I didn't make the date an actual date. I made the data unbalanced, removing some values to illustrate the issue. The dplyr command re-starts the index at 1 for each new group, regardless of what date_val it is. So every group starts at 1, even if the date is different.
df1 <- data.frame(id = 1:40,
group_id = (10 + rep(1:10, each = 4)),
date_val = (20 + rep(rep(1:4), 10)),
val = runif(40))
for (i in c(5, 17, 33))
{
df1 <- df1[!df1$id == i, ]
}
df_new <- df1 %>%
  group_by(group_id) %>%
  arrange(date_val) %>%
  mutate(date_val_new = row_number(group_id)) %>%
  ungroup()
This uses base R's match() (here wrapped in mutate()):
df1 %>% mutate(date_val_new = match(date_val, unique(date_val)))
Or with a data.table, df1[, date_val_new := .GRP, by=date_val].
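Spelled out, the data.table version reads (hedged sketch):
library(data.table)
setDT(df1)                                   # convert to a data.table by reference
df1[, date_val_new := .GRP, by = date_val]   # .GRP is the running group counter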
Use group_indices_() to generate a unique id for each group:
df1 %>% mutate(date_val_new = group_indices_(., .dots = "date_val"))
Update
Since group_indices() does not handle class tbl_postgres, you could try dense_rank()
copy_to(my_db, df1, name = "df1")
tbl(my_db, "df1") %>%
mutate(date_val_new = dense_rank(date_val))
Or build a custom query using sql()
tbl(my_db, sql("SELECT *,
DENSE_RANK() OVER (ORDER BY date_val) AS DATE_VAL_NEW
FROM df1"))
Alternatively, I think you can try getanID() from the splitstackshape package.
library(splitstackshape)
getanID(df1, "group_id")[]
# id group_id date_val val .id
# 1: 1 11 21 0.01857242 1
# 2: 2 11 22 0.57124557 2
# 3: 3 11 23 0.54318903 3
# 4: 4 11 24 0.59555088 4
# 5: 6 12 22 0.63045007 1
# 6: 7 12 23 0.74571297 2
# 7: 8 12 24 0.88215668 3
In SQLite I would like to find the standard deviation of the first differences of a (logged) series that I define with GROUP BY. My data provider gives me a daily price series, but I would like to find annualized daily volatility (the standard deviation of daily returns -- first difference of the natural log of the series -- over each year). I can bring the data to R, then use ddply(), but I would like to do this entirely in SQLite. I tried the difference() function from the RSQLite.extfunctions package, but my usage is wrong. I expected it to work like diff() in R, but I can't find much documentation.
This generates some data.
stocks <- 5
years <- 5
list.n <- as.list(rep(252, stocks * years))
list.mean <- as.list(rep(0, stocks * years))
list.sd <- as.list(abs(runif(stocks * years, min = 0, max = 0.1)))
list.po <- as.list(runif(n = stocks, min = 25, max = 100))
list.ret <- mapply(rnorm, n = list.n, mean = list.mean, sd = list.sd, SIMPLIFY = F)
my.price <- function(po, ret) po * exp(cumsum(ret))
list.price <- mapply(my.price, po = list.po, ret = list.ret, SIMPLIFY = F)
gvkey <- rep(seq(stocks), each = 252 * years)
day <- rep(seq(252), times = stocks * years)
fyr <- rep(seq(years), times = stocks, each = 252)
data.dly <- data.frame(gvkey, fyr, day, p = unlist(list.price))
Here is how I would do it with ddply() and the result.
# I could do this easily with ddply and subset
library(plyr)
data.dly <- ddply(data.dly, .(gvkey, fyr), transform, vol = sd(diff(log(p))))
data.ann <- subset(data.dly, day == 252)
head(data.ann)
gvkey fyr day p vol
252 1 1 252 86.08568 0.077287182
504 1 2 252 43.32113 0.066741862
756 1 3 252 68.69734 0.084419564
1008 1 4 252 75.37267 0.006003969
1260 1 5 252 17.53583 0.083688727
1512 2 1 252 168.44656 0.035959492
And here is my (failed) SQLite attempt and error.
# but I can't figure it out in SQLite
library(RSQLite)
library(RSQLite.extfuns)
db <- dbConnect(SQLite())
init_extensions(db)
[1] TRUE
dbWriteTable(db, name = "data_dly", value = data.dly)
[1] TRUE
temp <- dbGetQuery(db, "SELECT stdev(difference(log(p))) FROM data_dly GROUP BY gvkey, fyr ORDER BY gvkey, fyr, day")
Error in sqliteExecStatement(con, statement, bind.data) :
RS-DBI driver: (error in statement: wrong number of arguments to function difference())
Does difference() need a comma-separated list of numbers? Can I do this entirely in SQLite, or do I need to do it in R? Thanks!
Try this where data.dly is the data frame in the post:
library(sqldf)
out <- sqldf("select A.gvkey, A.fyr, stdev(log(A.p) - log(B.p)) vol
from `data.dly` A join `data.dly` B
where A.day = B.day + 1
and A.gvkey = B.gvkey
and A.fyr = B.fyr
group by A.gvkey, A.fyr")
This gives:
> head(out)
gvkey fyr vol
1 1 1 0.09312510
2 1 2 0.01905447
3 1 3 0.01651095
4 1 4 0.06962667
5 1 5 0.05243940
6 2 1 0.03039751
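On an SQLite build with window functions (3.25+), a hedged alternative replaces the self-join with LAG(); untested sketch, again relying on the extension stdev() function:
out2 <- sqldf("select gvkey, fyr, stdev(ret) vol
  from (select gvkey, fyr,
               log(p) - lag(log(p)) over (partition by gvkey, fyr order by day) ret
        from `data.dly`)
  where ret is not null
  group by gvkey, fyr")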
The difference() SQL function takes two string arguments (it is a soundex-style comparison) and has a different meaning from R's diff().
Retrieve the data with an SQL command, then do stats using R.
temp <- dbGetQuery(db, "SELECT gvkey, fyr, p FROM data_dly ORDER BY gvkey, fyr, day")
sd(diff(log(temp$p)))
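Note that sd(diff(log(temp$p))) pools everything, including the artificial jump from one stock/year to the next; a hedged per-group version in base R would be:
vols <- aggregate(p ~ gvkey + fyr, data = temp,
                  FUN = function(p) sd(diff(log(p))))
names(vols)[3] <- "vol"   # aggregate names the result column after p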