Identify gaps in a continuous time period - r

I have a dataframe with observations of when lines/catheters were attached to IDs.
I need the period of time in days when each ID had a line/catheter attached.
Here is my dput return:
structure(list(ID = c(487622L, 487622L, 487639L, 487639L, 489027L,
489027L, 489027L, 491858L, 491858L, 491858L, 491858L, 491858L,
491858L), Line = c("Central Venous Line", "Central Venous Line",
"Central Venous Line", "Peripherally Inserted Central Catheter (PICC)",
"Haemodialysis Catheter", "Peripherally Inserted Central Catheter (PICC)",
"Haemodialysis Catheter", "Central Venous Line", "Haemodialysis Catheter",
"Central Venous Line", "Haemodialysis Catheter", "Central Venous Line",
"Peripherally Inserted Central Catheter (PICC)"), Start = structure(c(1362528000,
1363219200, 1362268800, 1363219200, 1364774400, 1365120000, 1365465600,
1364688000, 1364688000, 1365724800, 1365724800, 1366848000, 1369353600
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), End = structure(c(1362787200,
1363824000, 1363305600, 1363737600, 1365465600, 1366675200, 1365638400,
1365724800, 1365724800, 1366329600, 1366848000, 1367539200, 1369612800
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), Days = c("3.095138889",
"7.045138889", "11.87777778", "5.736111111", "7.850694444", "18.02083333",
"1.813888889", "12.32986111", "12.71388889", "6.782638889", "13.14027778",
"7.718055556", "3.397222222"), dateOrder = c(1L, 2L, 1L, 2L,
1L, 2L, 3L, 1L, 2L, 3L, 4L, 5L, 6L)), .Names = c("ID", "Line",
"Start", "End", "Days", "dateOrder"), row.names = 79:91, class = "data.frame")
Here is the catch. It does not matter if an ID has more than one line/catheter. I just need to take the earliest start date for each ID, the latest end date for each ID, and calculate the number of continuous days each ID has a line/catheter attached.
The problem is complicated by some cases, e.g. ID 491858. This individual had a line removed (dateOrder = 5) on 2013-05-03 and reinserted on 2013-05-24 for just over 3 days.
My intended approach is to subtract that gap (in days) from the total number of days between min(Start) and max(End).
There are over 20,000 records in the data set.
Here is what I have done so far:
I converted the DF to a list of DFs split by ID.
I intended to apply a function to each DF along these lines:
If the difference in days between the next row's Start date and the current row's End date exceeds 0, then flag that row (add TRUE or some arbitrary value in a new column).
function(y) {
  y$test <- FALSE
  for (i in seq_len(nrow(y) - 1)) {
    # flag rows that are followed by a gap before the next line is inserted
    if (difftime(y$Start[i + 1], y$End[i], units = "days") > 0) {
      y$test[i] <- TRUE
    }
  }
  y
}
Any help would be greatly appreciated.
Thanks.
UPDATE
Ignore the Days column; it is of no use. I intend to aggregate monthly line counts from the unique cases.

I guess something like this might help, unless I've misunderstood something:
unlist(lapply(split(DF, DF$ID), function(x) {
  # total span from first insertion to last removal
  totaldays <- as.numeric(difftime(max(x$End), min(x$Start), units = "days"))
  # gaps between one removal and the next insertion (rows assumed ordered, as in your dput)
  gaps <- as.numeric(difftime(x$Start[-1], x$End[-nrow(x)], units = "days"))
  totaldays - sum(gaps[gaps > 0])
}))
#487622 487639 489027 491858
# 10 17 22 36
DF is your dput.
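If speed becomes an issue with the full 20,000+ records, the same logic translates to data.table. This is only a sketch of the idea above, not part of the original answer; line.days is just a name I made up, and setorder() handles the within-ID ordering explicitly:
library(data.table)
DT <- as.data.table(DF)
setorder(DT, ID, Start)
DT[, {
  total <- as.numeric(difftime(max(End), min(Start), units = "days"))
  gaps  <- as.numeric(difftime(Start[-1], End[-.N], units = "days"))
  .(line.days = total - sum(gaps[gaps > 0]))
}, by = ID]
For the sample data this gives the same totals as above (10, 17, 22, 36).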

If I understand correctly, you want the total number of days that a catheter was present. To do that, I would use plyr:
#assume df is your dput object
library(plyr)
day.summary <- ddply(df, "ID", function(x) data.frame(total.days = sum(as.numeric(x$Days))))
print(day.summary)
ID total.days
1 487622 10.14028
2 487639 17.61389
3 489027 27.68542
4 491858 56.08194

Related

How do I avoid using for-loops

I am currently working on listening data of a music platform in R.
I have a subset (listening.subset) of the total data set. It contains 6 columns (USER, artist, Week, COUNT, user_type, binary).
Each user can either be a focal user, a friend, or a neighbour. There are separate data sets that link focal users to their friends (friend.data) and neighbours (neighbour.data), but I added a column to indicate the type of user.
Now, I have the following for-loop to indicate whether a friend has listened to an artist in the 10 weeks before the focal user has listened to that same artist. If that is the case, the binary column must show a 0, else a 1.
library(dplyr)  # for %>%, filter, group_by
# vlookup() comes from an add-on package (e.g. expss)

listening.subset$binary <- NA
for (i in seq_len(nrow(listening.subset))) {
  test_user <- listening.subset[i, ]
  test_week <- test_user$Week
  test_artist <- test_user$artist
  if (test_user$user_type == "friend") {
    # find the focal user this friend is linked to
    foc <- vlookup(test_user$USER, friend.data, result_column = 1, lookup_column = 2)
    # did this friend's play fall within the 10 weeks before a focal play of the same artist?
    prior_listen <- listening.subset %>%
      filter(USER == foc) %>%
      group_by(artist) %>%
      filter(test_week >= (Week - 10) & test_week <= Week) %>%
      filter(artist == test_artist)
    if (nrow(prior_listen) > 0) {
      listening.subset[i, ]$binary <- 0
    } else {
      listening.subset[i, ]$binary <- 1
    }
  }
}
The problem with this for-loop is that it takes too long to apply to the full data set. Therefore, I want to vectorize it. However, this concept is vague to me, and after reading up on it online I still do not have a clue how to adjust my code.
I hope someone knows how to use vectorization and could help me.
EDIT1: the total data set contains around 50 million entries. However, I could split it up into 10 data sets of 5 million each.
EDIT2: listening.subset:
"clubanddeform", "HyprMusic", "Peter-182", "komosionmel", "SHHitsKaty",
"Sonik_Villa", "Haalf"), artist = c("Justin Timberlake", "Ediya",
"Lady Gaga", "El Guincho", "Lighthouse Family", "Pidżama Porno",
"The Men", "Modest Mouse", "Com Truise", "April Smith and The Great Picture Show"
), Week = c(197L, 213L, 411L, 427L, 443L, 232L, 431L, 312L, 487L,
416L), COUNT = c(1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 6L, 11L), user_type = c("friend",
"friend", "friend", "friend", "neighbour", "friend", "neighbour",
"friend", "focal", "friend"), binary = c(1, 1, 1, 1, NA, 1, NA,
1, NA, 1)), row.names = c(NA, 10L), class = "data.frame")
Week indicates in which week the user listened to the particular band (ranging between 1 and 527), and COUNT is the number of times the user listened to that artist in that week.
Recap: the binary variable should indicate whether the "friend user" listened to the same band as the "focal user" in the 10 weeks before the focal user played that band. The social connections can be found in friend.data, shown below.
structure(list(USER = c("TheMariner", "TheMariner", "TheMariner",
"TheMariner", "TheMariner", "TheMariner", "TheMariner", "TheMariner",
"TheMariner", "TheMariner"), FRIEND = c("npetrovay", "marno25",
"lennonstarr116", "sachmonkey", "andrewripp", "daledrops", "Skittlebite",
"Ego_Trippin", "pistolgal5", "jjollett")), row.names = c(NA,
10L), class = "data.frame")
For each of the 190 focal users (first column), the friends are listed in the second column.
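One way to avoid the row-by-row loop is to turn the lookup into a join: attach each friend's focal user, join the friends' plays to the focal users' plays by artist, and keep only the rows whose weeks fall inside the 10-week window. Below is a rough dplyr sketch of that idea, using the column names from the dput (USER, artist, Week, user_type) and friend.data (USER = focal user, FRIEND = friend); row_id, focal and focal_week are helper names I introduced, and it is untested against the 50-million-row data:
library(dplyr)
# give every row a stable id so matches can be mapped back later
listening.subset <- listening.subset %>% mutate(row_id = row_number())
# attach the focal user each friend belongs to
friend_rows <- listening.subset %>%
  filter(user_type == "friend") %>%
  left_join(friend.data %>% rename(focal = USER, USER = FRIEND), by = "USER")
# the listening events of the focal users themselves
focal_rows <- listening.subset %>%
  select(focal = USER, artist, focal_week = Week)
# friend plays that are followed (within 10 weeks) by a focal play of the same artist
matches <- friend_rows %>%
  inner_join(focal_rows, by = c("focal", "artist")) %>%  # recent dplyr may warn about a many-to-many join; that is expected here
  filter(Week >= focal_week - 10, Week <= focal_week) %>%
  distinct(row_id)
listening.subset <- listening.subset %>%
  mutate(binary = case_when(
    user_type != "friend"      ~ NA_real_,
    row_id %in% matches$row_id ~ 0,
    TRUE                       ~ 1
  ))
If dplyr is still too slow at that scale, the same window condition can be expressed as a non-equi join in data.table.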

Removing duplicates of same date and location (3 columns) in R

I know there are like a million questions about duplicate removal, but unfortunately none of them has helped me so far. I struggle with the following:
I have a data frame (loc) that contains citizen science observations of nature (animals, plants, etc.). It has about 90,000 rows and looks like this:
ID Datum lat long Anzahl Art Gruppe Anrede Wochentag
1 1665376475 2019-05-09 51.30993 9.319896 20 Alytes obstetricans Amphibien Herr Do
2 529728479 2019-05-06 50.58524 8.503332 1 Alytes obstetricans Amphibien Frau Mo
3 1579862637 2019-05-23 50.53925 8.467546 8 Alytes obstetricans Amphibien Herr Do
4 -415013306 2019-05-06 50.58524 8.503332 3 Alytes obstetricans Amphibien Frau Mo
I also made a small sample data frame (loc_sample) of 10 observations and used dput(loc_sample):
structure(list(ID = c(688380991L, -1207894879L, 802295973L, -815104336L, -632066829L, -133354744L, 1929856503L, 952982037L, 1782222413L, 1967897802L),
Datum = structure(c(1559088000, 1558742400, 1557619200, 1557273600, 1557187200, 1557619200, 1557619200, 1557187200, 1557964800, 1556841600),
tzone = "UTC",
class = c("POSIXct", "POSIXt")),
lat = c(52.1236088700115, 51.5928822313012, 53.723426877949, 50.7737623304861, 49.9238597947287, 51.805563222817, 50.1738326622472, 51.2763067511127, 51.395189306337, 51.5732959108075),
long = c(8.62399927116144, 9.89597797393799, 9.04058595819038, 8.20740532922287, 8.29073164862348, 9.9225640296936, 8.79065646492143, 6.40700340270996, 6.47360801696777, 6.25690012620748),
Anzahl = c(2L, 25L, 4L, 1L, 1L, 30L, 2L, 1L, 1L, 1L),
Art = c("Sturnus vulgaris", "Olethreutes arcuella", "Sylvia atricapilla", "Buteo buteo", "Turdus merula", "Orchis mascula subsp. mascula", "Parus major", "Luscinia megarhynchos", "Milvus migrans", "Andrena bicolor"),
Gruppe = c("Voegel", "Schmetterlinge", "Voegel", "Voegel", "Voegel", "Pflanzen", "Voegel", "Voegel", "Voegel", "InsektenSonstige"),
Anrede = c("Herr", "Herr", "Frau", "Herr", "Herr", "Herr", "Herr", "Herr", "Herr", "Herr"),
Wochentag = structure(c(4L, 7L, 1L, 4L, 3L, 1L, 1L, 3L, 5L, 6L),
.Label = c("So", "Mo", "Di", "Mi", "Do", "Fr", "Sa"),
class = c("ordered", "factor"))),
row.names = c(NA, -10L),
class = "data.frame")
For my question only the variables Datum, lat, and long are important. Datum is a date in POSIXct format, while lat and long are both numeric. There are quite a few observations that were reported on the same day from the exact same location. I would like to filter those out and remove them, so I have to check three separate columns and keep only one of each "same-place-same-day" observation.
I already tried putting the three variables in question into one:
loc$dupl <- paste(loc$Datum, loc$lat, loc$long, sep=" ,")
locu <- unique(loc[,2:4])
It seems like I managed to filter the duplicates, but I'm not sure whether that's how it's done correctly.
Also, that gives me a data frame with only Datum, lat and long. As a final result I need the original data frame without the duplicates in date and location, but with all the other information for the unique rows still left.
When I try:
locu <- unique(loc[,2:9])
It gives me all the other columns, but it doesn't remove the date and location duplicates.
Thanks in advance for your help!
This can work:
#Code
new <- loc[!duplicated(paste(loc$Datum,loc$lat,loc$long)),]
To get the full data frame back after finding the duplicates, you could do something like:
loc[!duplicated(loc[,2:4]),]
This code first detects the duplicate rows and then subsets your original data frame.
Note: this code will always keep the first occurrence and delete the duplicates in subsequent rows. If you want to keep a particular ID (e.g. the second one, not the first one), we need a different solution.
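If you already use dplyr, distinct() does the same job and, with .keep_all = TRUE, keeps the remaining columns of the first occurrence:
library(dplyr)
loc_unique <- loc %>%
  distinct(Datum, lat, long, .keep_all = TRUE)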

Iterate by month with lubridate and merge combined

I am trying to write a function that merges two dataframes on two columns found in both. One of the columns is an identifier string and the other is a date.
The first df ("model") includes identifiers, starting dates, and some other relevant info.
The second df ("futurevalues") is a melted df that includes the identifier, multiple months for each identifier, and the relevant value for each identifier-month pair.
I would like to merge in values for each identifier over a certain period of time in the future. For instance, for identifier = Mary and starting month = "2005-01-31" in "model", I would like to pull in the relevant value for the next month and the 11 months after that (so 12 data points for Mary, covering starting month + 1 through starting month + 12).
I can merge my dfs by the two columns to get the as-of date value (see below), but this isn't what I need.
testmerge=merge(model,futurevalues,by=c("month","identifier"),all=TRUE)
To solve this, I am trying to use the lubridate date functions. For instance, the function below will allow me to enter a month (and then lapply across the df maybe) to get the values for each of the starting months (which vary across the df, meaning it's not a standard time period across the entire thing).
monthiterate <- function(x) {
  x %m+% months(1:12)
}
Thanks a lot for your help.
EDIT: adding toy data (first one is model, second one is futurevalues)
structure(list(month = structure(c(12814, 12814, 12814, 12814,
12814, 12814, 12814, 12814, 12814, 12814), class = "Date"), identifier = structure(c(1L,
3L, 2L, 4L, 5L, 7L, 8L, 6L, 9L, 10L), .Label = c("AB1", "AC5",
"BB9", "C99", "D81", "GG8", "Q11", "R45", "ZA1", "ZZ9"), class = "factor"),
value = c(0.831876072999969, 0.218494398256579, 0.550872926656984,
1.81882711231324, -0.245597705276932, -0.964277509916354,
-1.84714556574606, -0.916239506529079, -0.475649743547525,
-0.227721186387637)), .Names = c("month", "identifier", "value"
), class = "data.frame", row.names = c(NA, 10L))
structure(list(identifier = structure(c(1L, 3L, 2L, 4L, 5L, 7L,
8L, 6L, 9L, 10L), .Label = c("AB1", "AC5", "BB9", "C99", "D81",
"GG8", "Q11", "R45", "ZA1", "ZZ9"), class = "factor"), month = structure(c(12814,
13238, 12814, 12814, 12964, 12903, 12903, 12842, 13148, 13148
), class = "Date"), futurereturns = c(-0.503033205660682, 1.22446988772542,
-0.825490985851348, 1.03902417581908, 0.172595565260429, 0.894967582911769,
-0.242324006922964, 0.415520398113024, -0.734437328639625, 2.64184935856802
)), .Names = c("identifier", "month", "futurereturns"), class = "data.frame", row.names
= c(NA, 10L))
You need to create a table of all the combinations of ID and month that you want. Starting with a table of each ID and their starting month:
library(lubridate)
set.seed(1834)
# 3 people, each with a different starting month
x <- data.frame(id = sample(LETTERS, 3)
, month = ymd("2005-01-01") + months(sample(0:11, 3)) - days(1))
> x
id month
1 D 2005-03-31
2 R 2005-07-31
3 Y 2005-02-28
Now add rows for the following two months, per ID. I use dplyr for this kind of thing.
library(dplyr)
y <- x %>%
rowwise %>%
do(data.frame(id = .$id
, month = seq(.$month + days(1)
, by = "1 month"
, length.out = 3) - days(1)))
> y
Source: local data frame [9 x 2]
Groups: <by row>
id month
1 D 2005-03-31
2 D 2005-04-30
3 D 2005-05-31
4 R 2005-07-31
5 R 2005-08-31
6 R 2005-09-30
7 Y 2005-02-28
8 Y 2005-03-31
9 Y 2005-04-30
Now you can use merge() (or left_join() from dplyr) to retrieve the rows you want from the full dataset.
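For example, a minimal sketch of that last step, assuming futurevalues' identifier column is first renamed to id so the keys match the expanded table y (with the toy data above nothing will match because the ids were drawn at random, but with your real identifiers each id/month pair pulls in its futurereturns value):
names(futurevalues)[names(futurevalues) == "identifier"] <- "id"
# base R
merged <- merge(y, futurevalues, by = c("id", "month"), all.x = TRUE)
# or dplyr
merged <- left_join(y, futurevalues, by = c("id", "month"))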

split dataset by day and save it as data frame

I have a dataset with 2 months of data (February and March). How can I split the data into 59 subsets by day and save each as a data frame (28 days for Feb and 31 days for Mar)? Preferably each data frame would be named according to its date, i.e. 20140201, 20140202 and so forth.
df <- structure(list(text = structure(c(4L, 6L, 5L, 2L, 8L, 1L), .Label = c(" Terpilih Jadi Maskapai dengan Pelayanan Kabin Pesawat cont",
"booking number ZEPLTQ I want to cancel their flight because they can not together with my wife and kids",
"Can I change for the traveler details because i choose wrongly for the Mr or Ms part",
"cant do it with cards either", "Coming back home AK", "gotta try PNNL",
"Jadwal penerbangan medanjktsblm tangalmasi ada kah", "Me and my Tart would love to flyLoveisintheAir",
"my flight to Bangkok onhas been rescheduled I couldnt perform seat selection now",
"Pls checks his case as money is not credited to my bank acctThanks\n\nCASLTP",
"Processing fee Whatt", "Tacloban bound aboardto get them boats Boats boats boats Tacloban HeartWork",
"thanks I chatted with ask twice last week and told the same thing"
), class = "factor"), created = structure(c(1L, 1L, 2L, 2L, 3L,
3L), .Label = c("1/2/2014", "2/2/2014", "5/2/2014", "6/2/2014"
), class = "factor")), .Names = c("text", "created"), row.names = c(NA,
6L), class = "data.frame")
You don't need to output multiple dataframes. You only need to select/subset them by year and month of the 'created' field. Here are two ways to do that; option 1 is simpler if you don't plan on needing any more date arithmetic.
# 1. Leave 'created' a string, just use text substitution to extract its month&date components
df$created_mthyr <- gsub( '([0-9]+/)[0-9]+/([0-9]+)', '\\1\\2', df$created )
# 2. If you need to do arbitrary Date arithmetic, convert 'created' field to Date object
# in this case you need an explicit format-string
df$created <- as.Date(df$created, '%m/%d/%Y')
# Now you can do either a) split
split(df, df$created_mthyr)
# specifically if you want to assign the output it creates to 3 dataframes:
df1 <- split(df, df$created_mthyr)[[1]]
df2 <- split(df, df$created_mthyr)[[2]]
df5 <- split(df, df$created_mthyr)[[3]]
# ...or else b) do a Split-Apply-Combine and perform arbitrary command on each separate subset. This is very powerful. See plyr/ddply documentation for examples.
require(plyr)
df1 <- dlply(df, .(created_mthyr))[[1]]
df2 <- dlply(df, .(created_mthyr))[[2]]
df5 <- dlply(df, .(created_mthyr))[[3]]
# output looks like this - strictly you might not want to keep 'created','created_mthyr':
> df1
# text created created_mthyr
#1 cant do it with cards either 1/2/2014 1/2014
#2 gotta try PNNL 1/2/2014 1/2014
> df2
#3
#Coming back home AK
#4 booking number ZEPLTQ I want to cancel their flight because they can not together with my wife and kids
# created created_mthyr
#3 2/2/2014 2/2014
#4 2/2/2014 2/2014
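If you really do want one subset per day, named like 20140201, something along these lines should work. Note it assumes the dates are day/month/year, which is what the February-and-March description suggests, and that list2env() is optional (a named list is usually easier to work with than 59 separate objects):
# parse 'created' as day/month/year and build YYYYMMDD labels
df$created_date <- as.Date(as.character(df$created), '%d/%m/%Y')
by_day <- split(df, format(df$created_date, '%Y%m%d'))
names(by_day)   # "20140201" "20140202" "20140205" for the sample above
# optional: turn the list into individual data frames named by date
# (names starting with a digit need backticks to access, e.g. `20140201`)
list2env(by_day, envir = .GlobalEnv)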

Identify duplicate data with a threshold

I am working with bluetooth sensor data and need to identify possible duplicate readings for each unique ID. The bluetooth sensor made a scan every five seconds, and may pick up the same device in subsequent readings if the device wasn't moving quickly (i.e. sitting in traffic). There may be multiple readings from the same device if that device made a round trip, but those should be separated by several minutes. I can't wrap my head around how to get rid of the duplicate data. I need to calculate a time difference column if the macid's match.
The data has the format:
macid time
00:03:7A:4D:F3:59 82333
00:03:7A:EF:58:6F 223556
00:03:7A:EF:58:6F 223601
00:03:7A:EF:58:6F 232731
00:03:7A:EF:58:6F 232736
00:05:4F:0B:45:F7 164141
And I need to create:
macid time timediff
00:03:7A:4D:F3:59 82333 NA
00:03:7A:EF:58:6F 223556 NA
00:03:7A:EF:58:6F 223601 45
00:03:7A:EF:58:6F 232731 9310
00:03:7A:EF:58:6F 232736 5
00:05:4F:0B:45:F7 164141 NA
My first attempt at this is extremely slow and not really usable:
dedupeIDs <- function(zz) {
  # order by macid and then time
  zz <- zz[order(zz$macid, zz$time), ]
  # time difference to the previous row, with a sentinel for the first row
  zz$timediff <- c(999999, diff(zz$time))
  for (i in 2:nrow(zz)) {
    if (zz[i, "macid"] != zz[i - 1, "macid"]) {
      # first reading for a new device: reset to the sentinel value
      zz[i, "timediff"] <- 999999
    }
  }
  return(zz)
}
I'll then be able to filter the data.frame based on the time difference column.
Sample data:
structure(list(macid = structure(c(1L, 2L, 2L, 2L, 2L, 3L),
.Label = c("00:03:7A:4D:F3:59", "00:03:7A:EF:58:6F",
"00:05:4F:0B:45:F7"), class = "factor"),
time = c(82333, 223556, 223601, 232731, 232736, 164141)),
.Names = c("macid", "time"), row.names = c(NA, -6L),
class = "data.frame")
How about:
x <- structure(list(macid= structure(c(1L, 2L, 2L, 2L, 2L, 3L),
.Label = c("00:03:7A:4D:F3:59", "00:03:7A:EF:58:6F", "00:05:4F:0B:45:F7"),
class = "factor"), time = c(82333, 223556, 223601, 232731, 232736, 164141)),
.Names = c("macid", "time"), row.names = c(NA, -6L), class = "data.frame")
# ensure 'x' is ordered properly
x <- x[order(x$macid,x$time),]
# add timediff column by macid
x$timediff <- ave(x$time, x$macid, FUN=function(x) c(NA,diff(x)))
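From there, the filtering step is one more line. The 120 below is only a placeholder threshold (same units as the time column, apparently seconds); pick whatever separates "still sitting in traffic" from a genuine round trip:
threshold <- 120
# keep the first reading per device (timediff is NA) and any reading far
# enough from the previous one to count as a new trip
x_dedup <- x[is.na(x$timediff) | x$timediff > threshold, ]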
