I have two data frames of about 2 million records each. I used a for loop to look up data from one in the other, but it is too slow. I've created an example to demonstrate what I need to do.
ratings = data.frame(id = c(1,2,2,3,3),
rating = c(1,2,3,4,5),
timestamp = c("2006-11-07 15:33:57","2007-04-22 09:09:16","2010-07-16 19:47:45","2010-07-16 19:47:45","2006-10-29 04:49:05"))
stats = data.frame(primeid = c(1,1,1,2),
period = c(1,2,3,4),
user = c(1,1,2,3),
id = c(1,2,3,2),
timestamp = c("2011-07-01 00:00:00","2011-07-01 00:00:00","2011-07-01 00:00:00","2011-07-01 00:00:00"))
ratings$timestamp = strptime(ratings$timestamp, "%Y-%m-%d %H:%M:%S")
stats$timestamp = strptime(stats$timestamp, "%Y-%m-%d %H:%M:%S")
for (i in(1:nrow(stats)))
{
cat("Processing ",i," ...\r\n")
temp = ratings[ratings$id == stats$id[i],]
stats$idrating[i] = max(temp$rating[temp$timestamp < stats$timestamp[i]])
}
Can someone provide me with an alternative for this? I know apply may work, but I have no idea how to translate the for loop.
UPDATE: Thank you for the help. I am providing more information.
The table stats has unique combinations of primeid,period,user,id.
The table ratings has multiple id records with different ratings and timestamps.
What I want to do is the following: for each id found in stats, find all the records with that id in the ratings table and take the max rating among those whose timestamp is earlier than the timestamp obtained from stats.
I love plyr, and most of the tools created by Hadley Wickham, but I find that it can be painfully slow, especially if I'm trying to split on an ID field. When this happens, I turn to sqldf. I usually get a speed up of 20x.
First I need to use lubridate because sqldf chokes on POSIXlt types:
library(lubridate)
ratings$timestamp = ymd_hms(ratings$timestamp)
stats$timestamp = ymd_hms(stats$timestamp)
Merge the data frames, as Vincent did, and remove the rows violating the date constraint:
tmp <- merge(stats, ratings, by="id")
tmp <- subset(tmp, timestamp.y < timestamp.x )
Lastly, grab the max rating for each ID:
library(sqldf)
sqldf("SELECT *, MAX(rating) AS rating FROM tmp GROUP BY id")
Depending on the ratio of ids to data points this may work better:
r = split(ratings, ratings$id)
stats$idrating = sapply(seq.int(nrow(stats)), function(i) {
rd = r[[stats$id[i]]]
if (length(rd))
max(rd$rating[rd$timestamp < stats$timestamp[i]])
else NA
})
If your IDs are not contiguous integers (you can check that with all(names(r) == seq_along(r))), you'll have to add as.character() when referencing r[[, or use match() once to create the mapping; either will cost you some speed.
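For example (a small illustration, not part of the original code), the mapping can be built once up front:
idx <- match(as.character(stats$id), names(r))  # position of each stats id in the split list
# then use r[[ idx[i] ]] inside the sapply instead of r[[ stats$id[i] ]]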
Obviously, you can do the same without the split, but that's typically slower yet will use less memory:
stats$idrating = sapply(seq.int(nrow(stats)), function(i) {
rd = ratings[ratings$id == stats$id[i],]
if (nrow(rd))
max(rd$rating[rd$timestamp < stats$timestamp[i]])
else NA
})
You can also drop the if if you know there will be no mismatches.
I upvoted the answer provided, although I used another approach to get the same result.
In the merged dataset I first removed the dates that violated the date condition and then ran this:
aggregate(rating ~ id + primeid + period + user, data = new_stats, FUN = max)
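For completeness, the removal step mentioned above presumably looks like the subset used in the sqldf answer (after merging on id, timestamp.x is the stats timestamp and timestamp.y the rating timestamp):
new_stats <- merge(stats, ratings, by = "id")
new_stats <- subset(new_stats, timestamp.y < timestamp.x)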
From a data structure perspective it seems that you want to merge two tables and then perform a split-apply-combine operation.
Instead of looping to check which row belongs to which row, you can simply merge the two tables (much like a JOIN statement in SQL) and then perform a ddply-type operation. I recommend installing the 'plyr' library.
new_stats = merge(stats, ratings, by='id')
library(plyr)
ddply(new_stats,
c('primeid', 'period', 'user'),
function(d)
# max rating among ratings recorded before the stats timestamp
c(max(d$rating[as.Date(d$timestamp.x) > as.Date(d$timestamp.y)])))
If the use of plyr confuses you, please visit this tutorial: http://www.creatapreneur.com/2013/01/split-group-apply/.
Related
I'm a newbie to the bigstatsr package. I have a sqlite database which I want to convert to an FBM matrix of 40k rows (genes) by 60k columns (samples) for later consumption. I found examples of how to populate the matrix with random values, but I'm not sure what would be the best way to populate it with values from my sqlite database.
Currently I do it sequentially, here's some mock code:
library(bigstatsr)
library(RSQLite)
library(dplyr)
number_genes <- 50e3
number_samples <- 70e3
large_genomic_matrix <- bigstatsr::FBM(nrow = number_genes,
ncol = number_samples,
type = "double",
backingfile = "fbm_large_genomic_matrix")
# Code to get a single df at the time
database_connection <- dbConnect(RSQLite::SQLite(), "database.sqlite")
sample_index_counter <- 1
for(current_sample in vector_with_sample_names){
sqlite_df <- dplyr::tbl(database_connection, "genomic_data") %>%
dplyr::filter(sample == current_sample) %>%
dplyr::collect()
large_genomic_matrix[, sample_index_counter] <- sqlite_df$value
sample_index_counter <- sample_index_counter + 1
}
big_write(large_genomic_matrix, "large_genomic_matrix.out", every_nrow = 1000, progress = interactive())
I have two questions:
Is there a way of populating the matrix more efficiently? Not sure if big_apply could be used here, perhaps foreach
Do I always have to use big_write in order to load my matrix later? If so why can't I just use the bk file?
Thanks in advance
That is a very good first try on your own.
What is inefficient here is testing dplyr::filter(sample == current_sample) for every single sample. I would try to use match() first to get the indices. Then, what would be a bit inefficient is to populate each column individually. As you said, you could use big_apply() to do this by blocks.
big_write() is for writing the FBM to some text file (e.g. csv). What you want here is to use FBM()$save() (second line of the example in the README), and then use big_attach() on the .rds file (next line of the README).
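A rough sketch of those two suggestions (untested; it loops over blocks of columns by hand rather than via big_apply(), queries with DBI::dbGetQuery() instead of the dplyr chain, and assumes the genomic_data table and sample/value columns from the mock code, with each sample's rows returned in gene order):
library(bigstatsr)
library(DBI)
library(RSQLite)
con <- dbConnect(RSQLite::SQLite(), "database.sqlite")
block_size <- 500  # samples pulled per query; tune to available memory
blocks <- split(seq_along(vector_with_sample_names),
ceiling(seq_along(vector_with_sample_names) / block_size))
for (block in blocks) {
current_samples <- vector_with_sample_names[block]
# one query per block instead of one per sample
block_df <- DBI::dbGetQuery(con, sprintf(
"SELECT sample, value FROM genomic_data WHERE sample IN (%s)",
paste(sprintf("'%s'", current_samples), collapse = ",")))
# reorder the per-sample vectors by name and fill a whole block of columns at once
value_list <- split(block_df$value, block_df$sample)[current_samples]
large_genomic_matrix[, block] <- do.call(cbind, value_list)
}
large_genomic_matrix$save()  # writes fbm_large_genomic_matrix.rds next to the .bk file
# later: large_genomic_matrix <- big_attach("fbm_large_genomic_matrix.rds")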
I'm using fingertipsR to obtain public health data.
There are indicators at different geographic levels and these indicators are also grouped at profile level.
Here's some code:
library(fingertipsR)
library(fingertipscharts)
library(tidyverse)
library(ggthemes)
fingertips_stats()
inds<-indicators_unique()
profs<-profiles()
It's possible to pull the unique indicators for a profile like this, and then add a column, like this:
smoking<-indicators_unique(ProfileID = 18,DomainID = NULL)%>%mutate(prof_id="18")
What I'd like to do is:
for each unique profile ID, generate a dataframe of indicators. There are 53 unique profiles:
uniqueprofs<-as_tibble(unique(profs$ProfileID))
How can I purrr through this, or loop?
I am routinely stuck on these iteration type problems.
EDIT:
So, if you Ctrl+click on indicators_unique, you'll see the bit:
df <- unique(df[, c("IndicatorID", "IndicatorName")])
I copied all of the function and called it something else:
# hypothetical name for the copied function (the original post doesn't give one)
indicators_unique_with_profile <- function (ProfileID = NULL, DomainID = NULL, path)
{
if (missing(path))
path <- fingertips_endpoint()
#fingertips_ensure_api_available(endpoint = path)
df <- indicators(ProfileID, DomainID, path = path)
df <- unique(df[, c("IndicatorID", "IndicatorName", "ProfileID")])
return(df)
}
And I now get a dataframe containing the ProfileID. If I add "DomainID" I can have that too....
Edit:
Annoyingly, I've asked a similar question and updated it with dplyr group_by and group_walk
I can do this:
inds%>%group_by(ProfileID)%>%group_walk(~ write.csv(.x, paste0(.y$ProfileID, ".csv")))
How do I group_walk and write the dataframes/tibbles to the environment rather than writing them a drive and then loading them in?
Start with some minimal initial code
library(fingertipsR)
library(tidyverse)
profs<-profiles()
indicators_unique is already vectorised, so rather than loading the ProfileIDs into a tibble, put them in a list and then you can do a simple:
unique_profs <- list(unique(profs$ProfileID))
indicators_unique(ProfileID = unique_profs, DomainID = NULL)
The issue is adding your desired prof_id column. I'm not familiar with these packages. Is there any dataframe that links ProfileID to either IndicatorID or IndicatorName that you can do a join on?
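If the vectorised call doesn't carry the profile id through, one alternative (untested against the fingertips API) is to iterate per profile with purrr and keep the id via the .id argument, which mirrors the prof_id column you added manually:
prof_ids <- unique(profs$ProfileID)
inds_by_profile <- prof_ids %>%
purrr::set_names() %>%  # names become the .id column
purrr::map_dfr(~ indicators_unique(ProfileID = .x, DomainID = NULL),
.id = "prof_id")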
First and foremost - thank you for taking your time to view my question, regardless of if you answer or not!
I am trying to create a function that loops through my df and queries the necessary data from SQL using the RODBC package in R. However, I am having trouble setting up the query, since the parameters of the query change with each iteration (example below).
So my df looks like this:
ID Start_Date End_Date
1 2/2/2008 2/9/2008
2 1/1/2006 1/1/2007
1 5/7/2010 5/15/2010
5 9/9/2009 10/1/2009
How would I go about specifying the start date and end date in my sql program?
Here's what I have so far:
data_pull <- function(df) {
a <- data.frame()
b <- data.frame()
for (i in df$id)
{
dbconnection <- odbcDriverConnect(".....")
query <- paste("Select ID, Date, Account_Balance from Table where ID = (",i,") and Date > (",df$Start_Date,") and Date <= (",df$End_Date,")")
a <- sqlQuery(dbconnection, paste(query))
b <- rbind(b,a)
}
return(b)
}
However, this doesn't pull in anything. I believe it has something to do with how I am specifying the start and the end date for the iteration.
If anyone can help on this it would be greatly appreciated. If you need further explanation, please don't hesitate to ask!
A couple of syntax issues arise from the current setup:
LOOP: You do not iterate through the rows of the data frame but only through the atomic ID values in the single column, df$ID. In that same loop you pass the entire vectors df$Start_Date and df$End_Date into the query concatenation.
DATES: Your date formats do not align with most databases' date format of 'YYYY-MM-DD'. And in still others, like Oracle, you need a string-to-date conversion: TO_DATE(mydate, 'YYYY-MM-DD').
A couple of performance / best-practice issues:
PARAMETERIZATION: While parameterization is not needed here for security reasons, since your values are not generated by user input that could inject malicious SQL code, parameterized queries are still advised for maintainability and readability. Hence, consider doing so.
GROWING OBJECTS: According to Patrick Burns' The R Inferno, Circle 2: Growing Objects, R programmers should avoid growing multi-dimensional objects like data frames inside a loop, which can cause excessive copying in memory. Instead, build a list of data frames to rbind once outside the loop (see the sketch below).
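For reference, a rough sketch of the loop rewritten with those two points in mind (not the final recommendation below): dates are formatted as 'YYYY-MM-DD', the query is built per row, and results are collected in a list that is bound once after the loop. The connection string is the question's placeholder.
data_pull <- function(df) {
dbconnection <- odbcDriverConnect(".....")
on.exit(odbcClose(dbconnection))
results <- vector("list", nrow(df))
for (i in seq_len(nrow(df))) {
query <- sprintf("SELECT ID, Date, Account_Balance FROM Table WHERE ID = %s AND Date > '%s' AND Date <= '%s'",
df$ID[i],
format(as.Date(df$Start_Date[i], "%m/%d/%Y")),
format(as.Date(df$End_Date[i], "%m/%d/%Y")))
results[[i]] <- sqlQuery(dbconnection, query)
}
do.call(rbind, results)  # bind once, outside the loop
}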
With that said, you can avoid any looping or list-building by saving your data frame as a database table and then joining it to the final table for a filtered join query import. This assumes your database user has CREATE TABLE and DROP TABLE privileges.
# CONVERT DATE FIELDS TO DATE TYPE
df <- within(df, {
Start_Date = as.Date(Start_Date, format="%m/%d/%Y")
End_Date = as.Date(End_Date, format="%m/%d/%Y")
})
# SAVE DATA FRAME TO DATABASE
sqlSave(dbconnection, df, "myRData", rownames = FALSE, append = FALSE)
# IMPORT JOINED AND DATE FILTERED QUERY
q <- "SELECT ID, Date, Account_Balance
FROM Table t
INNER JOIN myRData r
ON r.ID = t.ID
AND t.Date BETWEEN r.Start_Date AND r.End_Date"
final_df <- sqlQuery(dbconnection, q)
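If the helper table should not persist, it can presumably be dropped again afterwards with RODBC's sqlDrop():
sqlDrop(dbconnection, "myRData")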
I am trying to build a data frame with book id, title, author, rating, collection, start and finish date from the LibraryThing api with my personal data. I am able to get a nested list fairly easily, and I have figured out how to build a data frame with everything but the dates (perhaps in not the best way but it works). My issue is with the dates.
The list I'm working with normally has 20 elements, but it adds the startfinishdates element only if I added dates to the book in my account. This is causing two issues:
If it were always there, I could extract it like everything else; it would be NA most of the time, and I could use cbind to line it up with the other information.
When I extract it by name, I get an object with fewer elements, and I don't have a way to join it back to everything else (it doesn't have the book id).
Ultimately, I want to build this data frame, and an answer that tells me how to pull out the book id and associate it with each startfinishdate (so I can join on book id) is acceptable. I would just add that to the code I have.
I'm also open to learning a better approach from the jump and re-designing the entire thing as I have not worked with lists much in R and what I put together was after much trial and error. I do want to use R though, as ultimately I am going to use this to create an R Markdown page for my web site (for instance, a plot that shows finish dates of books).
You can run the code below and get the data (no api key required).
library(jsonlite)
library(tidyverse)
library(assertr)
data<-fromJSON("http://www.librarything.com/api_getdata.php?userid=cau83&key=392812157&max=450&showCollections=1&responseType=json&showDates=1")
books.lst<-data$books
#create df from json
create.df<-function(item){
df<-map_df(.x=books.lst,~.x[[item]])
df2 <- t(df)
return(df2)
}
ids<-create.df(1)
titles<-create.df(2)
ratings<-create.df(12)
authors<-create.df(4)
#need to get the book id when i build the date df's
startdates.df<-map_df(.x=books.lst,~.x$startfinishdates) %>% select(started_stamp,started_date)
finishdates.df<-map_df(.x=books.lst,~.x$startfinishdates) %>% select(finished_stamp,finished_date)
collections.df<-map_df(.x=books.lst,~.x$collections)
#from assertr: will create a vector of same length as df with all values concatenated
collections.v<-col_concat(collections.df, sep = ", ")
#assemble df
books.df<-as.data.frame(cbind(ids,titles,authors,ratings,collections.v))
names(books.df)<-c("ID","Title","Author","Rating","Collections")
books.df<-books.df %>% mutate(ID=as.character(ID),Title=as.character(Title),Author=as.character(Author),
Rating=as.character(Rating),Collections=as.character(Collections))
This approach is outside the tidyverse meta-package; using base R you can make it work with the following code.
Map will apply the user-defined function to each element of data$books and extract the required fields for your data frame. Reduce will take all the individual data frames and merge (reduce) them into a single data.frame, booksdf.
library(jsonlite)
data<-fromJSON("http://www.librarything.com/api_getdata.php?userid=cau83&key=392812157&max=450&showCollections=1&responseType=json&showDates=1")
booksdf=Reduce(function(x,y){rbind(x,y)},
Map(function(x){
lenofelements = length(x)
if(lenofelements>20){
if(!is.null(x$startfinishdates$started_date)){
started_date = x$startfinishdates$started_date
}else{
started_date=NA
}
if(!is.null(x$startfinishdates$started_stamp)){
started_stamp = x$startfinishdates$started_stamp
}else{
started_stamp=NA
}
if(!is.null(x$startfinishdates$finished_date)){
finished_date = x$startfinishdates$finished_date
}else{
finished_date=NA
}
if(!is.null(x$startfinishdates$finished_stamp)){
finished_stamp = x$startfinishdates$finished_stamp
}else{
finished_stamp=NA
}
}else{
started_stamp = NA
started_date = NA
finished_stamp = NA
finished_date = NA
}
book_id = x$book_id
title = x$title
author = x$author_fl
rating = x$rating
collections = paste(unlist(x$collections),collapse = ",")
return(data.frame(ID=book_id,Title=title,Author=author,Rating=rating,
Collections=collections,Started_date=started_date,Started_stamp=started_stamp,
Finished_date=finished_date,Finished_stamp=finished_stamp))
},data$books))
The idea is as follows. Every patient has a unique patient id, which we call hidenic_id. However this patient may be admitted to the hospital multiple times. On the other hand every entry has unique emtek_id.
Patient 110380 was admitted to the hospital on 4/14/2001 11:08, transferred through the hospital, and discharged on 4/24/2001 18:16. This patient was admitted again on 5/11/2001 23:24, so he has a different emtek_id for that stay; he was discharged from the hospital on 5/25/2001 16:26. So you need to assign the correct emtek_ids by checking the dates: if the date in the combined file is within the admission and discharge time period (or very close, say within 24 hours), we can assign that emtek_id.
How can I assign different emtek_IDs to entries with hidenic_id and admit time?
I had a couple ideas worth sharing.
First, make the emtek_id from the hidenic_id and date. Second, make the emtek_id easy to parse, e.g., emtek_id#dateTime. Third, make the database a global vector. Depending on memory limits, there has to be a faster way than this, but it might give you a few ideas.
The main problems are handling NA values and incorrect hidenic_ids, validating hidenic_id(s), and padding the IDs if you don't have leading characters (which would be a quick fix). Lastly, how do you want to handle input that's incorrect but not NA/null? For instance, say you input "ID" instead of "ID12345": do you want to treat that as a call to assign a new value, or prompt for a correct input or an NA value? I will assume you only feed it correct ID inputs or NA values, but this is my trivializing assumption.
Here's some pseudo-code to start the idea. You choose how to store the data (e.g., a csv file, then use data.table::fread()):
#this file's name is "make.hidenic_id.R"
library(data.table)
library(stringr)
set.seed(101)
# one might want a backup written, perhaps conditionally updating it every hour or so
database.hidenic_id <<-data.table::fread("database.filename.hidenic_id.csv")
database.emtek_id <<-data.table::fread("database.filename.emtek_id.csv")
make.hidenic_Id = function(in.hidenic_id){
if(is.na(in.hidenic_id) | !(in.hidenic_id %in% database.hidenic_id)){
# conditionally make a new hidenic_id that is not already in the database
repeat {
new.hidenic_id = paste0("ID", str_pad(sample.int(99999, 1), 5, pad = "0"))
if (!(new.hidenic_id %in% database.hidenic_id)) break
}
#make new emtek_id
new.emtek_id <- paste0(new.hidenic_id, sep="#", str_sub(Sys.time(),1,16))
#update databases; e.g., c(database.emtek_id, new.emtek_id)
database.hidenic_id <<- c(database.hidenic_id, new.hidenic_id)
database.emtek_id <<- c(database.emtek_id, new.emtek_id)
}else{
new.emtek_id <- paste0(in.hidenic_id, sep="#", str_sub(Sys.time(),1,16))
# update database.emtek_id
database.emtek_id <<- c(database.emtek_id, new.emtek_id)
}
return(new.emtek_id)
}
temp = readline(prompt="Enter hidenic_id OR type \"NA\": ")
new.emtek_id <- make.hidenic_Id(temp)  # use the input to create or look up an ID
data.table::fwrite(database.emtek_id, "database.filename.emtek_id.csv")
data.table::fwrite(database.hidenic_id,"database.filename.hidenic_id.csv")
and call the file with
source("make.hidenic_id.R")
There are a lot of "good-practice" things I don't do here, like managing poor input data or optimizing the searching, but this is a strong start. Another good practice would be longer integers or a different leading string, but you never said we could use the input value to make the IDs.
You could say this was inspired by the census since everything is just one massive string per geographic ID variable.
I was interested in your problem, so I created some mock data and tried to solve it, but I ran into some confusion myself and posted my own question, which I think is what you are asking but more general. You can see the response here: How can I tell if a time point exists between a set of before and after times
My post generates what I believe is what you are starting with, and the accepted answer is what I believe you are looking for. The full code is below. You will need to install zoo and IRanges.
Also, I did this in R version 2.15.3; IRanges did not install properly in 3.0.0.
## package installation
source("http://bioconductor.org/biocLite.R")
biocLite("IRanges")
install.packages("zoo")
## generate the emtek and hidenic file data
library(zoo)
date_string <- paste("2001", sample(12, 10, 3), sample(28,10), sep = "-")
time_string <- c("23:03:20", "22:29:56", "01:03:30", "18:21:03", "16:56:26",
"23:03:20", "22:29:56", "01:03:30", "18:21:03", "16:56:26")
entry_emtek <- strptime(paste(date_string, time_string), "%Y-%m-%d %H:%M:%S")
entry_emtek <- entry_emtek[order(entry_emtek)]
exit_emtek <- entry_emtek + 3600 * 24
emtek_file <- data.frame(emtek_id = 1:10, entry_emtek, exit_emtek)
hidenic_id <- 110380:110479
date_string <- paste("2001", sample(12, 100, replace = TRUE), sample(28,100, replace = T), sep = "-")
time_string <- rep(c("23:03:20", "22:29:56", "01:03:30", "18:21:03", "16:56:26",
"23:03:20", "22:29:56", "01:03:30", "18:21:03", "16:56:26"),10)
hidenic_time <- strptime(paste(date_string, time_string), "%Y-%m-%d %H:%M:%S")
hidenic_time <- hidenic_time[order(hidenic_time)]
hidenic_file <- data.frame(hidenic_id, hidenic_time)
## Find the intersection of emtek and hidenic times. This part was done by user: agstudy
library(IRanges)
## create a time intervals
subject <- IRanges(as.numeric(emtek_file$entry_emtek),
as.numeric(emtek_file$exit_emtek))
## create a time intervals (start=end here)
query <- IRanges(as.numeric(hidenic_file$hidenic_time),
as.numeric(hidenic_file$hidenic_time))
## find overlaps and extract rows (both time point and intervals)
ov <- findOverlaps(query, subject)
emt.ids <- subjectHits(ov)
hid.ids <- queryHits(ov)
cbind(hidenic_file[hid.ids,],emtek_file[emt.ids,])