The idea is as follows. Every patient has a unique patient ID, which we call hidenic_id. However, a patient may be admitted to the hospital multiple times. On the other hand, every admission (entry) has a unique emtek_id.
For example, patient 110380 was admitted to the hospital on 4/14/2001 11:08, transferred through the hospital, and discharged on 4/24/2001 18:16. The same patient was admitted again on 5/11/2001 23:24, so he has a different emtek_id for that stay, and was discharged on 5/25/2001 16:26. So the correct emtek_ids have to be assigned by checking the dates: if the date in the combined file falls within the admission and discharge time period (or very close to it, say within 24 hours), we can assign that emtek_id.
How can I assign the correct emtek_id to each entry, given its hidenic_id and admit time?
I had a couple ideas worth sharing.
First, make the emtek_id from the hidenic_id and the date. Second, make the emtek_id easy to parse, e.g., emtek_id#dateTime. Third, make the database a global vector. Depending on your memory limits there is surely a faster way than this, but it might give you a few ideas.
The main problems are handling NA values and incorrect hidenic_ids, validating the hidenic_id(s), and padding the IDs if they lack leading characters (which would be a quick fix). Lastly, how do you want to handle input that's incorrect but not NA/null? For instance, if you input "ID" instead of "ID12345", do you want to treat that as a request to assign a new value, or prompt for either a correct input or an NA value? I will assume you only feed it correct ID inputs or NA values, but that is a simplifying assumption.
Here's some pseudo-code to start the idea. You choose how to store the data (e.g., a CSV file read with data.table::fread()):
# this file's name is "make.hidenic_id.R"
library(data.table)
library(stringr)
set.seed(101)
# one might want a backup written out, perhaps conditionally updating it every hour or so
database.hidenic_id <<- data.table::fread("database.filename.hidenic_id.csv")[[1]]
database.emtek_id   <<- data.table::fread("database.filename.emtek_id.csv")[[1]]
make.hidenic_id <- function(in.hidenic_id) {
  if (is.na(in.hidenic_id) || !(in.hidenic_id %in% database.hidenic_id)) {
    # conditionally make a new hidenic_id, resampling until it is not already in use
    repeat {
      new.hidenic_id <- paste0("ID", str_pad(sample.int(99999, 1), 5, pad = "0"))
      if (!(new.hidenic_id %in% database.hidenic_id)) break
    }
    # make the new emtek_id
    new.emtek_id <- paste0(new.hidenic_id, "#", str_sub(Sys.time(), 1, 16))
    # update both databases
    database.hidenic_id <<- c(database.hidenic_id, new.hidenic_id)
    database.emtek_id   <<- c(database.emtek_id, new.emtek_id)
  } else {
    new.emtek_id <- paste0(in.hidenic_id, "#", str_sub(Sys.time(), 1, 16))
    # update database.emtek_id only
    database.emtek_id <<- c(database.emtek_id, new.emtek_id)
  }
  return(new.emtek_id)
}
temp <- readline(prompt = "Enter hidenic_id OR type \"NA\": ")
make.hidenic_id(if (identical(temp, "NA")) NA else temp)
data.table::fwrite(list(emtek_id = database.emtek_id), "database.filename.emtek_id.csv")
data.table::fwrite(list(hidenic_id = database.hidenic_id), "database.filename.hidenic_id.csv")
and call the file with
source("make.hidenic_id.R")
There are a lot of "good-practice" things I don't do here to manage poor input data or to optimize the searching, but this is a strong start. Another good practice would be to use longer integers or a different leading string, but you never said whether the input values could be used to build the IDs.
You could say this was inspired by the census since everything is just one massive string per geographic ID variable.
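For instance, a wider padded ID with a different prefix (just a sketch; the "HOSP" prefix and the 9-digit width are arbitrary choices, not anything you specified) could be generated like this:
library(stringr)
# hypothetical wider ID: a 9-digit integer, zero-padded, with a different leading string
paste0("HOSP", str_pad(sample.int(999999999, 1), 9, pad = "0"))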
I was interested in your problem, so I created some mock data and tried to solve it, but I ran into some confusion myself and posted my own question, which I think is what you are asking but more general. You can see the response here: How can I tell if a time point exists between a set of before and after times
My post generates what I believe is what you are starting with and the checked answer is what I believe you are looking for. The full code is below. You will need to install zoo and IRanges.
Also, I did this in version 2.15.3. IRanges did not install properly in 3.0.0.
## package installation
source("http://bioconductor.org/biocLite.R")
biocLite("IRanges")
install.packages("zoo")
## generate the emtek and hidenic file data
library(zoo)
date_string <- paste("2001", sample(12, 10, 3), sample(28,10), sep = "-")
time_string <- c("23:03:20", "22:29:56", "01:03:30", "18:21:03", "16:56:26",
"23:03:20", "22:29:56", "01:03:30", "18:21:03", "16:56:26")
entry_emtek <- strptime(paste(date_string, time_string), "%Y-%m-%d %H:%M:%S")
entry_emtek <- entry_emtek[order(entry_emtek)]
exit_emtek <- entry_emtek + 3600 * 24
emtek_file <- data.frame(emtek_id = 1:10, entry_emtek, exit_emtek)
hidenic_id <- 110380:110479
date_string <- paste("2001", sample(12, 100, replace = TRUE), sample(28,100, replace = T), sep = "-")
time_string <- rep(c("23:03:20", "22:29:56", "01:03:30", "18:21:03", "16:56:26",
"23:03:20", "22:29:56", "01:03:30", "18:21:03", "16:56:26"),10)
hidenic_time <- strptime(paste(date_string, time_string), "%Y-%m-%d %H:%M:%S")
hidenic_time <- hidenic_time[order(hidenic_time)]
hidenic_file <- data.frame(hidenic_id, hidenic_time)
## Find the intersection of emtek and hidenic times. This part was done by user: agstudy
library(IRanges)
## create time intervals for the emtek stays
subject <- IRanges(as.numeric(emtek_file$entry_emtek),
                   as.numeric(emtek_file$exit_emtek))
## create time intervals for the hidenic times (start = end here)
query <- IRanges(as.numeric(hidenic_file$hidenic_time),
                 as.numeric(hidenic_file$hidenic_time))
## find overlaps once and extract the matching rows (both time points and intervals)
ov <- findOverlaps(query, subject)
emt.ids <- subjectHits(ov)
hid.ids <- queryHits(ov)
cbind(hidenic_file[hid.ids,],emtek_file[emt.ids,])
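To actually carry the matched emtek_id back onto the hidenic records (a small follow-up sketch using the objects created above), something like this should work:
## attach the matched emtek_id to the hidenic rows; rows with no matching stay remain NA
hidenic_file$emtek_id <- NA
hidenic_file$emtek_id[hid.ids] <- emtek_file$emtek_id[emt.ids]
head(hidenic_file)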
Related
In one data set I have account numbers (all unique) and the date that some event happened in the account.
In another data set I have the account numbers and an account status indicator, along with the dates the account entered and exited that status level. The accounts are listed multiple times, since the status dates form a history of when each account moved between statuses.
I wish to append to the first data set the status that the account was in on the date the event happened.
I have built a loop that performs this task but given the number of accounts I am working with, the length of the history, and the number of status "switches", the loop takes a few hours to run on my system.
Since the account numbers are the same between the two files, I was wondering if there was a way to use the setkey functionality on account number with the data.table package and a data.table approach to speed up the append process. Basically, within each account I need to find which interval in the second data set the date from the first data set falls in, in order to get the status.
Here is the code I've produced for my loop along with some toy data. I have tried using intervals in the lubridate package but this was giving me some issues with the data.table in the loop so I've gone with the between() command.
Does anyone have any ideas for a more efficient append process?
library(data.table)
library(lubridate)
set.seed(65)
# data set 1
dt1 <- data.table(account=c(1234,1235,1236,1237,1238),
eventDate=c(ymd(20170123),ymd(20170223),ymd(20170114),ymd(20170205),ymd(20170127)))
setkey(dt1,account)
# data set 2
se1 <- seq(from=ymd(20161201),to=ymd(20170228), length.out=4)
se2 <- seq(from=ymd(20170101),to=ymd(20170228), length.out=5)
se3 <- seq(from=ymd(20170103),to=ymd(20170228), length.out=4)
se4 <- seq(from=ymd(20160101),to=ymd(20170228), length.out=3)
se5 <- seq(from=ymd(20161101),to=ymd(20170228), length.out=6)
ss1 <- c(se1[1]-days(23),se1[-length(se1)]+days(1))
ss2 <- c(se2[1]-days(13),se2[-length(se2)]+days(1))
ss3 <- c(se3[1]-days(3),se3[-length(se3)]+days(1))
ss4 <- c(se4[1]-days(53),se4[-length(se4)]+days(1))
ss5 <- c(se5[1]-days(2),se5[-length(se5)]+days(1))
dt2 <- data.table(account=c(rep(1234,4),rep(1235,5),rep(1236,4),rep(1237,3),rep(1238,6)),
status=sample(LETTERS,22, replace=T),
statusStart=c(ss1,ss2,ss3,ss4,ss5),
statusEnd=c(se1,se2,se3,se4,se5))
setkey(dt2,account)
#dt2[,interv:=interval(statusStart,statusEnd)]
# set up and do the loop
accnts <- dt1[,unique(account)]
for(i in 1:length(accnts)){
dt2[ account==accnts[i] &
between(dt1[account==accnts[i],eventDate],statusStart,statusEnd,incbounds=T),
eventDate:=dt1[account==accnts[i],eventDate]]
}
# put it back in the first data set
dt1 <- merge(dt1,dt2[!is.na(eventDate),list(account,eventDate,status)], by=c('account','eventDate'),all.x=T)
Here's one way to do it, using the foverlaps function from data.table:
dt1$statusStart <- dt1$eventDate
dt1$statusEnd <- dt1$eventDate
setkey(dt2, account, statusStart, statusEnd)
foverlaps(dt1, dt2, by.x = c('account', 'statusStart', 'statusEnd'), by.y = c('account', 'statusStart', 'statusEnd'), type = 'within')
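As a small follow-up (a sketch, assuming you store the result), you can then keep just the original event columns plus the matched status and drop the helper interval columns:
res <- foverlaps(dt1, dt2,
                 by.x = c('account', 'statusStart', 'statusEnd'),
                 by.y = c('account', 'statusStart', 'statusEnd'),
                 type = 'within')
# keep only the original event columns plus the matched status
res <- res[, .(account, eventDate, status)]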
I am trying to create minimum convex polygons for a set of GPS coordinates. Each day has 32 coordinates, and I want to create an MCP with 1 day's, 2 days', 3 days'... and so on worth of data. For instance, in the first step I want to include rows 1-32, which I have managed:
mydata <- read.csv("file.csv", stringsAsFactors = FALSE)
mydata <- mydata[1:32, ]
Currently, to select two days' worth of data at a time, I have written:
mydata <- read.csv("file.csv", stringsAsFactors = FALSE)
mydata <- mydata[1:64, ]
Is there a way to automate adding 32 rows at each step (in a loop), rather than running the code manually and changing the amount of data used by hand each time?
I am very new to R so I do not know whether it is possible to do this, the way I thought would work was:
n <- 32
for (i in 1:100) {
mydata <- mydata[1:n, ]
## CREATE MCP AND STORE HOME RANGE OUTPUT
n <- n+32
}
However, it does not seem possible to have n represent a row number this way. Is there a way to do this?
Apologies if this is unclear but as I said I am quite new to using R and really would appreciate any help that can be given.
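A minimal sketch of the looping pattern being asked about (the home-range step is left as a placeholder comment, and the number of days is derived from the row count rather than hard-coded):
mydata <- read.csv("file.csv", stringsAsFactors = FALSE)
n_days <- nrow(mydata) %/% 32          # how many full days of 32 fixes are available
results <- vector("list", n_days)      # somewhere to store each day's output
for (i in seq_len(n_days)) {
  subset_i <- mydata[1:(32 * i), ]     # rows for days 1..i, i.e. 32*i rows
  ## CREATE MCP FROM subset_i AND STORE HOME RANGE OUTPUT, e.g.
  ## results[[i]] <- adehabitatHR::mcp(...)   # placeholder call
}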
I have two different csv files, one is called CA_Storms and one is called CA_adj. CA_Storms has many start and end dates/times for storm events (in one column), and CA_adj has a DateTime column that includes many thousand dates/times. I want to see if any of the dates/times in CA_adj correspond with any of the storm events in CA_Storms. To do this, I am trying to make a new column in CA_adj titled Storm_ID that will identify which storm it corresponds with based on the storm start and end times/dates in CA_Storms.
This is the process I have currently undergone:
#Read the csv files into data frames
CA_Storms <- read.csv(file = "CA_Storms.csv", header = TRUE, stringsAsFactors = FALSE)
CA_adj <- read.csv(file = "CA_adj.csv", header = TRUE, stringsAsFactors = FALSE)
#strptime function (do this for both data sets)
CA_adj$DateTime1 <- strptime(CA_adj$DateTime, format = "%m/%d/%Y %H:%M")
CA_Storms$Start.time1 <- strptime(CA_Storms$Start.time, format = "%m/%d/%Y %H:%M")
CA_Storms$End.time1 <- strptime(CA_Storms$End.time, format = "%m/%d/%Y %H:%M")
#Make a new column into CA_adj that says Storm ID. Have it by
#default hold NAs.
CA_adj$Storm_ID <- NA
#Write a which statement to see if it meets the conditions of greater than
#or equal to start time or less than or equal to end time. Put this through a
#for loop to apply it to every row within CA_adj$DateTime1
for (i in nrow(CA_adj$DateTime1))
{
CA_adj$DateTime1[which(CA_adj$DateTime1 >= CA_Storms$Start.time1 | CA_adj$DateTime1 <= CA_Storms$End.time1), "Storm_ID"]
}
This is not giving me any errors, but it's also not replacing any of the values in the Storm_ID column that I have made. In my Global Environment under "Values" it now just says: i is NULL(empty). I am pretty sure what's missing is an i within the for loop, but I do not know where to put it. I also think the other issue is that it doesn't know what value to replace the NA's in the Storm_ID column with. I would like it to replace the NA's with the correct Storm ID that corresponds with the Storm dates (in CA_Storms$Start.time1 and in CA_Storms$End.Time1). For Dates/Times within CA_adj that do not apply to a storm date, I'd just want it to continue to say NA.
Any guidance on how to do this would be greatly appreciated. I'm new to R, and I've been trying to teach it to myself, which can make figuring out how to do these things on my own a bit difficult.
Thanks so much!
Why not have a look at the lubridate package? It lets you create date/time intervals which can then be tested against a specific date/time with %within%. Your code should end up simpler.
You do need to use the loop index, and you also need to make an assignment to CA_adj$Storm_ID. I'm not certain whether you could also have multiple CA_adj entries falling in the same CA_Storms interval.
library(lubridate)
# make a lubridate interval column in CA_Storms (uses the strptime columns from above)
CA_Storms$interval <- interval(as.POSIXct(CA_Storms$Start.time1), as.POSIXct(CA_Storms$End.time1))
# (or stick with the longer, non-interval code)
# loop through all rows of CA_adj
for (i in seq_len(nrow(CA_adj))) {
  hit <- which(as.POSIXct(CA_adj$DateTime1[i]) %within% CA_Storms$interval)
  # assumes CA_Storms has a Storm_ID column; rows with no matching storm keep NA
  if (length(hit) > 0) CA_adj$Storm_ID[i] <- CA_Storms$Storm_ID[hit[1]]
}
I have two data frames, one with 2 million records and another with 2 million records. I used a for loop to pull data from one into the other, but it is too slow. I've created an example to demonstrate what I need to do.
ratings = data.frame(id = c(1,2,2,3,3),
rating = c(1,2,3,4,5),
timestamp = c("2006-11-07 15:33:57","2007-04-22 09:09:16","2010-07-16 19:47:45","2010-07-16 19:47:45","2006-10-29 04:49:05"))
stats = data.frame(primeid = c(1,1,1,2),
period = c(1,2,3,4),
user = c(1,1,2,3),
id = c(1,2,3,2),
timestamp = c("2011-07-01 00:00:00","2011-07-01 00:00:00","2011-07-01 00:00:00","2011-07-01 00:00:00"))
ratings$timestamp = strptime(ratings$timestamp, "%Y-%m-%d %H:%M:%S")
stats$timestamp = strptime(stats$timestamp, "%Y-%m-%d %H:%M:%S")
for (i in(1:nrow(stats)))
{
cat("Processing ",i," ...\r\n")
temp = ratings[ratings$id == stats$id[i],]
stats$idrating[i] = max(temp$rating[temp$timestamp < stats$timestamp[i]])
}
Can someone provide me with an alternative to this? I know apply may work, but I have no idea how to translate the for loop.
UPDATE: Thank you for the help. I am providing more information.
The table stats has unique combinations of primeid,period,user,id.
The table ratings has multiple id records with different ratings and timestamps.
What I want to do is the following: for each id found in stats, find all the records in the ratings table (id column) and then get the maximum rating among those with a timestamp earlier than the corresponding timestamp from stats.
I love plyr, and most of the tools created by Hadley Wickham, but I find that it can be painfully slow, especially if I'm trying to split on an ID field. When this happens, I turn to sqldf. I usually get a speed up of 20x.
First I need to use lubridate because sqldf chokes on POSIXlt types:
library(lubridate)
ratings$timestamp = ymd_hms(ratings$timestamp)
stats$timestamp = ymd_hms(stats$timestamp)
Merge the dataframes, as Vincent did, and remove those violating the date constraint:
tmp <- merge(stats, ratings, by="id")
tmp <- subset(tmp, timestamp.y < timestamp.x )
Lastly, grab the max rating for each ID:
library(sqldf)
sqldf("SELECT *, MAX(rating) AS rating FROM tmp GROUP BY id")
Depending on the ratio of ids to data points this may work better:
r = split(ratings, ratings$id)
stats$idrating = sapply(seq.int(nrow(stats)), function(i) {
rd = r[[stats$id[i]]]
if (length(rd))
max(rd$rating[rd$timestamp < stats$timestamp[i]])
else NA
})
If your IDs are not contiguous integers (you can check that with all(names(r) == seq_along(r))), you'll have to add as.character() when referencing r[[, or use match() once to create the mapping; either will cost you some speed.
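For example (a sketch of the non-contiguous-ID case):
# index the split list by name rather than by position
rd = r[[as.character(stats$id[i])]]
# or build the position mapping once, outside the sapply(), and use it inside:
idx = match(stats$id, names(r))
rd = r[[idx[i]]]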
Obviously, you can do the same without the split, but that's typically slower yet will use less memory:
stats$idrating = sapply(seq.int(nrow(stats)), function(i) {
rd = ratings[ratings$id == stats$id[i],]
if (nrow(rd))
max(rd$rating[rd$timestamp < stats$timestamp[i]])
else NA
})
You can also drop the if if you know there will be no mismatches.
I upvoted the answer provided, although I used another approach to get to the same result.
In the merged data set I first removed the rows whose dates violated the date condition, and then ran this:
aggregate(rating ~ id + primeid + period + user, data = new_stats, FUN = max)
From a data structure perspective it seems that you want to merge two tables and then perform a split-group-apply method.
Instead of looping to check which row belongs to which, you can simply merge the two tables (much like a JOIN statement in SQL) and then perform a split-group-apply step. I recommend you install the 'plyr' library.
new_stats = merge(stats, ratings, by='id')
library(plyr)
ddply(new_stats,
      c('primeid', 'period', 'user'),
      function(df)
        c(max(df$rating[as.Date(df$timestamp.x) > as.Date(df$timestamp.y)])))
If the use of plyr confuses you, please visit this tutorial: http://www.creatapreneur.com/2013/01/split-group-apply/.
I have a data-frame (3 cols, 12146637 rows) called tr.sql which occupies 184Mb.
(it's backed by SQL, it is the contents of my dataset which I read in via read.csv.sql)
Column 2 is tr.sql$visit_date. SQL cannot natively represent dates as an R Date object, which matters for how I need to process the data.
Hence I want to copy the contents of tr.sql to a new data-frame tr
(where the visit_date column can be natively represented as a Date (chron::Date?). Trust me, this makes exploratory data analysis easier; for now this is how I want to do it. I might use native SQL eventually, but please don't quibble about that for now.)
Here is my solution (thanks to gsk and everyone) + workaround:
tr <- data.frame(customer_id=integer(N), visit_date=integer(N), visit_spend=numeric(N))
# fix up col2's class to be Date
class(tr[,2]) <- 'Date'
then, as a workaround, copy tr.sql -> tr in chunks of (say) N/8 using a for loop, so that the temporary created by the string-to-Date conversion does not run out of memory, with a garbage collection after each chunk:
for (i in 0:7) {
from <- floor(i*N/8)
to <- floor((i+1)*N/8) -1
if (i==7)
to <- N
print(c("Copying tr.sql$visit_date",from,to," ..."))
tr$visit_date[from:to] <- as.Date(tr.sql$visit_date[from:to])
gc()
}
rm(tr.sql)
memsize_gc() ... # only 321 Mb in the end! (was ~1Gb during copying)
The problem is allocating then copying the visit_date column.
Here are the dataset and code; I am having multiple separate problems with this, explained below:
'training.csv' looks like...
customer_id,visit_date,visit_spend
2,2010-04-01,5.97
2,2010-04-06,12.71
2,2010-04-07,34.52
and code:
# Read in as SQL (for memory-efficiency)...
library(sqldf)
tr.sql <- read.csv.sql('training.csv')
gc()
memory.size()
# Count of how many rows we are about to declare
N <- nrow(tr.sql)
# Declare a new empty data-frame with same columns as the source d.f.
# Attempt to declare N Date objects (fails due to bad qualified name for Date)
# ... does this allocate N objects the same as data.frame(colname = numeric(N)) ?
tr <- data.frame(visit_date = Date(N))
tr <- tr.sql[0,]
# Attempt to assign the column - fails
tr$visit_date <- as.Date(tr.sql$visit_date)
# Attempt to append (fails)
> tr$visit_date <- append(tr$visit_date, as.Date(tr.sql$visit_date))
Error in `$<-.data.frame`(`*tmp*`, "visit_date", value = c("14700", "14705", :
replacement has 12146637 rows, data has 0
The line that tries to declare data.frame(visit_date = Date(N)) fails; I don't know the correct namespace-qualified name for the Date object (chron::Date and Dates::Date don't work).
Both the attempt to assign and append fail. Not even sure whether it is legal, or efficient, to use append on a single large column of a data-frame.
Remember these objects are big, so avoid using temporaries.
Thanks in advance...
Try this, ensuring that you are using the most recent version of sqldf (currently version 0.4-1.2).
(If you find you are running out of memory, try putting the database on disk by adding the dbname = tempfile() argument to the read.csv.sql call. If even that fails, then it's so large relative to the available memory that it's unlikely you will be able to do much analysis with it anyway.)
# create test data file
Lines <-
"customer_id,visit_date,visit_spend
2,2010-04-01,5.97
2,2010-04-06,12.71
2,2010-04-07,34.52"
cat(Lines, file = "trainingtest.csv")
# read it back
library(sqldf)
DF <- read.csv.sql("trainingtest.csv", method = c("integer", "Date2", "numeric"))
It doesn't look to me like you've got a data.frame there (N is a vector of length 1). Should be simple:
tr <- tr.sql
tr$visit_date <- as.Date(tr.sql$visit_date)
Or even more efficient:
tr <- data.frame(colOne = tr.sql[,1], visit_date = as.Date(tr.sql$visit_date), colThree = tr.sql[,3])
As a side note, your title says "append" but I don't think that's the operation you want. You're making the data.frame wider, not appending them on to the end (making it longer). Conceptually, this is a cbind() operation.
Try this:
tr <- data.frame(visit_date= as.Date(tr.sql$visit_date, origin="1970-01-01") )
This will succeed if your format is YYYY-MM-DD or YYYY/MM/DD. If not one of those formats then post more details. It will also succeed if tr.sql$visit_date is a numeric vector equal to the number of days after the origin. E.g:
vdfrm <- data.frame(a = as.Date(c(1470, 1475, 1480), origin="1970-01-01") )
vdfrm
a
1 1974-01-10
2 1974-01-15
3 1974-01-20