R - Extract Specific Rows in Dataset And Save Into New Dataframes - r

I'm looking to sort/process some large data before I enter it into a function.
I have a large dataset of log readings, consisting of many unique addresses and timings.
The data looks a bit like this:
UNIQUE_ADDRESS1 24/08/2016 13:01
UNIQUE_ADDRESS2 24/08/2016 13:02
UNIQUE_ADDRESS3 24/08/2016 13:05
UNIQUE_ADDRESS1 25/08/2016 00:00
UNIQUE_ADDRESS2 25/08/2016 00:01
UNIQUE_ADDRESS3 25/08/2016 00:12
I am ultimately running a function that needs individual data frames consisting of a specific unique address and a specific date only.
The data frame will look like this, consisting of all rows which contain the specific unique address AND specific date.
dataframe1 <- [UNIQUE_ADDRESS1 24/08/2016 13:01,
UNIQUE_ADDRESS1 24/08/2016 13:03,
UNIQUE_ADDRESS1 24/08/2016 13:06,
UNIQUE_ADDRESS1 24/08/2016 13:08
... etc]
Where there will be a dataframe2 which corresponds to UNIQUE_ADDRESS1 again, but with all the timings within the date of 25/08/2016 instead. This will be done for each device.
I figured this needs to be done in a loop, but I can't get the syntax done correctly.
So far I am using grep to extract each unique address from the massive log file, like this, to create data frames with each device separately:
device1 <- logfile[grep("^UNIQUE_ADDRESS1", logfile[,2]), ]
Then I have created an array of dates:
dates <- c("23/09/2016", "24/08/2016", "25/08/2016")
I now want to create new data frames that combine each unique address with each date: all the log readings for UNIQUE_ADDRESS1 on date 23/09/2016 in one data frame, then another for 24/08/2016, etc. The same for each UNIQUE_ADDRESS.
I've tried using grep and grepl, but when I have used them in an if loop or ifelse loop they claim that my dates are not in my device data frames (which they definitely are), and any value I try, it returns false no matter what.
Can anybody help me with how I can achieve my aim?
Thanks
Edit
At the moment I'm trying to do this in a for loop, where "device1" contains the unique addresses for device 1, etc. However, when I increment i it'll only save to the same data frame (device1) instead of a new data frame, which is what I need.
for (k in 1:6)
device1 <- device1[grep(dates[i], device1[,4]), ]
device2 <- device2[grep(dates[i], device1[,4]), ]
device3 <- device3[grep(dates[i], device1[,4]), ]
device4 <- device4[grep(dates[i], device1[,4]), ]
device5 <- device5[grep(dates[i], device1[,4]), ]
device6 <- device6[grep(dates[i], device1[,4]), ]
device7 <- device7[grep(dates[i], device1[,4]), ]
device8 <- device8[grep(dates[i], device1[,4]), ]

Assuming that you have your input data in a data frame, e.g.
> myTable
Var1 Var2 Var3
1 UNIQUE_ADDRESS1 24/08/2016 13:01
2 UNIQUE_ADDRESS2 24/08/2016 13:02
3 UNIQUE_ADDRESS3 24/08/2016 13:05
4 UNIQUE_ADDRESS1 25/08/2016 0:00
5 UNIQUE_ADDRESS2 25/08/2016 0:01
6 UNIQUE_ADDRESS3 25/08/2016 0:12
Consider using the dlply function of the R-package plyr.
library(plyr)
myList = dlply(myTable, ~ Var1 + Var2, .fun = identity)
Each element of the list myList will be one of your sub-tables, e.g.
> myList[[1]]
Var1 Var2 Var3
1 UNIQUE_ADDRESS1 24/08/2016 13:01
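If you would rather avoid the plyr dependency, base R's split() can produce a similar list. This is a minimal sketch, assuming the same myTable layout as above (the table is rebuilt here only so the example runs on its own):

```r
# Rebuild the sample table so the example is self-contained
myTable <- data.frame(
  Var1 = rep(c("UNIQUE_ADDRESS1", "UNIQUE_ADDRESS2", "UNIQUE_ADDRESS3"), times = 2),
  Var2 = rep(c("24/08/2016", "25/08/2016"), each = 3),
  Var3 = c("13:01", "13:02", "13:05", "00:00", "00:01", "00:12"),
  stringsAsFactors = FALSE
)

# One data frame per address/date combination; drop = TRUE skips
# combinations that never occur in the data
myList <- split(myTable, list(myTable$Var1, myTable$Var2), drop = TRUE)

# Elements are named "<address>.<date>", so they can be looked up directly
myList[["UNIQUE_ADDRESS1.25/08/2016"]]
```

Keeping the pieces in one named list, rather than in separate device1, device2, ... variables, means the downstream function can be applied with lapply(myList, ...).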

Related

Convert data on pre-post repeated measures from long to wide by filtering data to get time point as value

I have a 14K row table of 370 liver transplant patients with transplant date and various repeated lab tests done before and after the procedure. I want to get pre-transplant, immediate post-transplant, and 3/6/12/18/24/36 month lab results.
ID | Transp Date | Lab Units | Lab Type | Tme | Lab Val
0000001 | 2011-01-11 | | VCA IgG Index | 0 | 6487.0
0000001 | 2011-01-11 | | VCA IgM Index | 0 | 11230.0
0000002 | 2011-01-03 | Copies/mL | CMV Quant PCR | 3 | 100.0
0000002 | 2011-01-03 | Copies/mL | EBV Quant PCR | 3 | 683.0
I did round(datediff) of transplant date and lab test date to get the month timepoint (Tme). My client wants the final table to have one record and all data values per row. Headers something like this:
ID|TrnsplDate|LabType1|Units1|PreVal|Val0|Val3|Val6|Val12|Val18|Val24|Val36|LabType2|Units2|PreVal|Val0|Val3|Val6|Val12|Val18|Val24|Val36|LabType3|Units3|PreVal|Val0|Val3|Val6|Val12|Val18|Val24|Val36|LabType4|Units4|PreVal|Val0|Val3|Val6|Val12|Val18|Val24|Val36
Can anyone knowledgeable in R guide me on where to start? I use Rstudio. Thanks in advance.
Try this, which will put everything for the same ID on one line. You can then adjust column names and order as needed using colnames(df) and indexing (e.g., something like colorder <- c(2,3,5,1,7,12,...); df[, colorder]).
### Set up data
library(lubridate)
df <- data.frame(ID = rep(sprintf("SID%s",seq(1:2)),2),
transdate = seq(mdy("01/01/2000"), mdy("01/4/2000"),1),
labunits = c(NA, NA, rep("Copies/mL",2)),
labtype = c(rep("VCA IgG Index",2),"CMV Quant PCR", "EBV Quant PCR"),
time = c(0,0,2,2),
labval = sample(100:2000, 4))
# Transform
df2 <- tidyr::pivot_wider(df, names_from = labtype, values_from = -ID)
# ----------------------------
# Edit: separate by lab type
df_bylab <- split(df, df$labtype)
# output each lab type to its own CSV, named after the lab type
for (i in seq_along(df_bylab)) {
write.csv(df_bylab[[i]], paste0(names(df_bylab)[i], ".csv"))
}
Based on the limited data provided, I am not sure if some columns could be collapsed (for instance, it seems like you only have one date per SID, so all the date columns could be collapsed into one column).
I would also like to point out this is not an ideal structure for the data, so perhaps your best bet is to try to convince your client otherwise!
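For the per-timepoint columns specifically (Val0, Val3, ...), base R's reshape() is one option. This is a rough sketch, assuming one value per ID, lab type and timepoint. The column names follow the toy data above and the values are invented:

```r
# Toy long-format data: values are made up for illustration only
long <- data.frame(
  ID      = c("SID1", "SID1", "SID2", "SID2"),
  labtype = c("VCA IgG Index", "VCA IgG Index", "CMV Quant PCR", "CMV Quant PCR"),
  time    = c(0, 3, 0, 3),
  labval  = c(10, 20, 30, 40),
  stringsAsFactors = FALSE
)

# One row per ID and lab type, one labval column per timepoint
wide <- reshape(long,
                idvar     = c("ID", "labtype"),
                timevar   = "time",
                v.names   = "labval",
                direction = "wide")

names(wide)
```

The new columns are named from v.names and the time value (labval.0, labval.3), so they can be renamed afterwards to whatever header scheme the client wants.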

creating list of objects with names from a list

Hopefully this is not a duplicate; it is rather difficult to phrase correctly (I am relatively new to R).
So the problem is: I want to use sequences of dates excluding certain weekdays based on the row information. I can use bizdays and create calendar objects on the fly but it is quite inefficient - I would rather have them created before and use as needed. On the other side I do not want to create a calendar for every single object that can happen to occur (too many to bother, combination of all weekdays plus versions with/without holidays).
I can create a dataframe with list of dates between start/end date for every row, but i need to provide a calendar with weekdays
P <- setDT(R)[, list(ID=ID,
dt=bizseq(Start.Date,End.Date, cal)
), by=1:nrow(R)]
To provide a calendar I have to define it like
cal <- Calendar(weekdays=c("monday", "tuesday"))
now a working dataset that could explain what i am struggling with
> M <-c(0,1,1,0)
> T <- c(1,1,1,0)
> W <- c(0,0,0,1)
> df <- data.frame(M,T,W)
> df$S <-paste0("c",df$M,df$T,df$W)
> udf <- unique(df)
> udf
M T W S
1 0 1 0 c010
2 1 1 0 c110
4 0 0 1 c001
using udf i would like to create a list of calendar objects that i can afterwards pass to the bizseq using get(df$S), something along the lines of
require(bizdays)
loop or apply?
.... <- Calendar(weekdays=c(ifelse(udf$M==0,"","monday"), ifelse(udf$T==0,"","tuesday"),ifelse(udf$W==0,"","wednesday")))
So now the right questions;) Firstly - is it the best approach? then if so - how to create these 3 objects under their names ("c101" etc), so for example the c100 will match the calendar with Monday on - it is not a question how to create a calendar as the method above works (it is enough to substitute the dots with the name), but how to create object c101 that would become a calendar if i create names in a dynamic way? I could imagine looping through the rows, but have no idea how to force the resulting object to be named udf$S. Unless you reckon there is any better method of providing the corresponding calendar than get() from a list of pre-created objects (for a dataframe with thousands of dates and combination of days off).
I would like basically to end up with 3 calendar objects named c010, c110, c001, but if the expanded table has more unique options to create all other combinations before i run the setDT() function
Afterthought: I can add ID to the udf and call the calendars by index and then return the index to df, but I wonder if it is possible to create dynamic names of objects just as I tried
NOTE
following Sathish's lead I used what seems sufficient:
for(i in 1:nrow(udf)) {
cal <- Calendar(weekdays=c(ifelse(udf[i,1]==0,"","monday"), ifelse(udf[i,2]==0,"","tuesday"),ifelse(udf[i,3]==0,"","wednesday")))
assign(udf[i,4], cal)
}
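An alternative to assign()/get() is to keep everything in a named list and look entries up by udf$S. The sketch below only builds the weekday vectors in base R; passing each one to Calendar(weekdays = ...) as in the loop above would be the final step. The names day_names and weekday_sets are made up for illustration.

```r
# Rebuild udf from the question
df <- data.frame(M = c(0, 1, 1, 0),
                 T = c(1, 1, 1, 0),
                 W = c(0, 0, 0, 1))
df$S <- paste0("c", df$M, df$T, df$W)
udf <- unique(df)

# Map each flag column to its weekday name
day_names <- c(M = "monday", T = "tuesday", W = "wednesday")

# For every unique row keep only the weekdays whose flag is 1
weekday_sets <- lapply(seq_len(nrow(udf)), function(i) {
  flags <- unlist(udf[i, names(day_names)]) == 1
  unname(day_names[flags])
})
names(weekday_sets) <- udf$S

weekday_sets[["c110"]]
```

A list lookup such as weekday_sets[[df$S[i]]] then takes the place of get(df$S[i]), and no objects need to be created in the global environment under generated names.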

How to deal with non-consecutive (non-daily) dates in R, while looping?

I am trying to write a script that loops through month-end dates and compares associated fields, but I am unable to find a way to do this.
I have my data in a flatfile and subset based on 'theDate'
For instance I have:
date.range <- subset(raw.data, observation_date == theDate)
Say theDate = 2007-01-31
I want to find the next month included in my data flatfile which is 2007-02-28. How can I reference this in my loop?
I currently have:
date.range.t1 <- subset(raw.data, observation_date == theDate+1)
This obviously doesn't work, as my data is not daily.
EDIT:
To make it more clear, my data is like below
ticker observation_date Price
ADB 31/01/2007 1
ALS 31/01/2007 2
ALZ 31/01/2007 3
ADB 28/02/2007 2
ALS 28/02/2007 5
ALZ 28/02/2007 1
I am using a loop, so I want to skip from 31/01/2007 to 28/02/2007 by recognising it is the next date, and use that value to subset my data
First get unique values of date like so:
unique_dates<-unique(raw.data$observation_date)
Then sort these unique dates:
unique_dates_ordered<-unique_dates[order(as.Date(unique_dates, format="%Y-%m-%d"))]
Now you can subset based on the index of unique_dates_ordered i.e.
subset(raw.data, raw.data$observation_date == unique_dates_ordered[i])
Where i = 1 for the first value, i = 2 for the second value etc.
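Put together, the loop might look like the sketch below. It assumes the dates are stored as dd/mm/yyyy text, as in the sample in the question's edit; if your column uses a different format, adjust the format string accordingly.

```r
# Sample data from the question; dates are dd/mm/yyyy text here
raw.data <- data.frame(
  ticker = rep(c("ADB", "ALS", "ALZ"), times = 2),
  observation_date = rep(c("31/01/2007", "28/02/2007"), each = 3),
  Price = c(1, 2, 3, 2, 5, 1),
  stringsAsFactors = FALSE
)

# Convert to Date so the sort is chronological rather than alphabetical
raw.data$observation_date <- as.Date(raw.data$observation_date, format = "%d/%m/%Y")
unique_dates_ordered <- sort(unique(raw.data$observation_date))

# Walk over consecutive pairs: i is the current month-end, i + 1 the next one
for (i in seq_len(length(unique_dates_ordered) - 1)) {
  date.range    <- subset(raw.data, observation_date == unique_dates_ordered[i])
  date.range.t1 <- subset(raw.data, observation_date == unique_dates_ordered[i + 1])
  # compare date.range with date.range.t1 here
}
```

Stopping the loop one element early means unique_dates_ordered[i + 1] always exists.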

How do I generate a dataframe displaying the number of unique pairs between two vectors, for each unique value in one of the vectors?

First of all, I apologize for the title. I really don't know how to succinctly explain this issue in one sentence.
I have a dataframe where each row represents some aspect of a hospital visit by a patient. A single patient might have thousands of rows for dozens of hospital visits, and each hospital visit could account for several rows.
One column is Medical.Record.Number, which corresponds to Patient IDs, and the other is Patient.ID.Visit, which corresponds to an ID for an individual hospital visit. I am trying to calculate the number of hospital visits each patient has had.
For example:
Medical.Record.Number    Patient.ID.Visit
AAAXXX           1111
AAAXXX           1112
AAAXXX           1113
AAAZZZ           1114
AAAZZZ           1114
AAABBB           1115
AAABBB           1116
would produce the following:
Medical.Record.Number   Number.Of.Visits
AAAXXX          3
AAAZZZ          1
AAABBB          2
The solution I am currently using is the following, where "data" is my dataframe:
#this function returns the number of unique hospital visits associated with the
#supplied record number
countVisits <- function(record.number){
visits.by.number <- data$Patient.ID.Visit[which(data$Medical.Record.Number
== record.number)]
return(length(unique(visits.by.number)))
}
recordNumbers <- unique(data$Medical.Record.Number)
visits <- integer()
for (record in recordNumbers){
visits <- c(visits, countVisits(record))
}
visit.counts <- data.frame(recordNumbers, visits)
This works, but it is pretty slow. I am dealing with potentially millions of rows of data, so I'd like something efficient. From what little I know about R, I know there's usually a faster way to do things without using a for-loop.
This essentially looks like a table() operation after you take out duplicates. First, some sample data
#sample data
dd<-read.table(text="Medical.Record.Number Patient.ID.Visit
AAAXXX 1111
AAAXXX 1112
AAAXXX 1113
AAAZZZ 1114
AAAZZZ 1114
AAABBB 1115
AAABBB 1116", header=T)
then you could do
tt <- table(Medical.Record.Number=unique(dd)$Medical.Record.Number)
as.data.frame(tt, responseName="Number.Of.Visits") #to get a data.frame rather than named vector (table)
# Medical.Record.Number Number.Of.Visits
# 1 AAABBB 2
# 2 AAAXXX 3
# 3 AAAZZZ 1
Or you could also think of this as an aggregation problem
aggregate(Patient.ID.Visit~Medical.Record.Number, dd, function(x) length(unique(x)))
# Medical.Record.Number Patient.ID.Visit
# 1 AAABBB 2
# 2 AAAXXX 3
# 3 AAAZZZ 1
There are many ways to do this, and @MrFlick provided a handful of perfectly valid approaches. Personally I'm fond of the data.table package. It's faster on large data frames and I find the logic to be more intuitive than the base functions. I'd check it out if you are having problems with execution time.
library(data.table)
med.dt <- data.table(med_tbl)
num.visits.dt <- med.dt[ , .(num_visits = length(unique(Patient.ID.Visit))),
by = Medical.Record.Number]
data.table should be much faster than data.frame on large tables.

Nesting Data Frames

I have a function wrapping RODBC::sqlQuery that takes a start and end date and returns 5 columns with roughly 1 million rows per call. I need to iterate through a list of about 60 dates, storing the function's resulting data frames in a list.
What I want to know is:
How to pass both start and end date arguments to the function in an apply-style fashion
How to store the resulting data frames neatly (like a table of |date|data.frame.pointer|)
Here's some of the code:
get.data <- function(date.start, date.end) { ... }
date.range <- seq(as.Date("2009-01-01"), Sys.Date(), by="1 month")
And sample output:
get.data(date.start="2009-01-01", date.end='2009-02-01')
date country oId eId pId
1 2009-01-01 Australia 12345 12345 12345
2 ... ... ... ... ...
Thank you for your help. I've been trying to figure out how to do this for hours to no avail.
For what you want, mapply will do the trick:
n <- length(date.range)
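mapply(get.data, date.range[-n], date.range[-1], SIMPLIFY = FALSE)
With SIMPLIFY = FALSE, this returns a list whose elements are the individual returned values from get.data, so in this case you would get a list of data frames. (By default, mapply tries to simplify the results, which can turn equally shaped data frames into a matrix of columns.) A list may well be the most appropriate way of storing the output, but it would depend on what you want to do with it.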
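To keep track of which data frame belongs to which start date, the list can be named. This is a minimal sketch using Map(), which is mapply() with SIMPLIFY = FALSE. The get.data below is a made-up stand-in for the real RODBC wrapper, and the shorter date range is only there to keep the example small:

```r
# Hypothetical stand-in for the real query function: returns a one-row
# data frame describing the requested interval
get.data <- function(date.start, date.end) {
  data.frame(date   = as.Date(date.start),
             n_days = as.numeric(as.Date(date.end) - as.Date(date.start)))
}

date.range <- seq(as.Date("2009-01-01"), as.Date("2009-04-01"), by = "1 month")
n <- length(date.range)

# Pair each start date with the following date; Map always returns a list
results <- Map(get.data, date.range[-n], date.range[-1])

# Name each element by its start date so it can be looked up later
names(results) <- format(date.range[-n])

results[["2009-02-01"]]
```

If a single table is needed later, do.call(rbind, results) stacks the pieces back together.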
