Identify duplicate data with a threshold

Identify duplicate data with a threshold - r

I am working with bluetooth sensor data and need to identify possible duplicate readings for each unique ID. The bluetooth sensor made a scan every five seconds, and may pick up the same device in subsequent readings if the device wasn't moving quickly (i.e. sitting in traffic). There may be multiple readings from the same device if that device made a round trip, but those should be separated by several minutes. I can't wrap my head around how to get rid of the duplicate data. I need to calculate a time difference column if the macid's match.
The data has the format:
macid time
00:03:7A:4D:F3:59 82333
00:03:7A:EF:58:6F 223556
00:03:7A:EF:58:6F 223601
00:03:7A:EF:58:6F 232731
00:03:7A:EF:58:6F 232736
00:05:4F:0B:45:F7 164141
And I need to create:
macid time timediff
00:03:7A:4D:F3:59 82333 NA
00:03:7A:EF:58:6F 223556 NA
00:03:7A:EF:58:6F 223601 45
00:03:7A:EF:58:6F 232731 9310
00:03:7A:EF:58:6F 232736 5
00:05:4F:0B:45:F7 164141 NA
My first attempt at this is extremely slow and not really usable:
dedupeIDs <- function (zz) {
#Order by macid and then time
zz <- zz[order(zz$macid, zz$time) ,]
zz$timediff <- c(999999, diff(zz$time))
for (i in 2:nrow(zz)) {
if (zz[i, "macid"] == zz[i - 1, "macid"]) {
print("Different IDs")
} else {
zz[i, "timediff"] <- 999999
}
}
return(zz)
}
I'll then be able to filter the data.frame based on the time difference column.
Sample data:
structure(list(macid = structure(c(1L, 2L, 2L, 2L, 2L, 3L),
.Label = c("00:03:7A:4D:F3:59", "00:03:7A:EF:58:6F",
"00:05:4F:0B:45:F7"), class = "factor"),
time = c(82333, 223556, 223601, 232731, 232736, 164141)),
.Names = c("macid", "time"), row.names = c(NA, -6L),
class = "data.frame")

How about:
x <- structure(list(macid= structure(c(1L, 2L, 2L, 2L, 2L, 3L),
.Label = c("00:03:7A:4D:F3:59", "00:03:7A:EF:58:6F", "00:05:4F:0B:45:F7"),
class = "factor"), time = c(82333, 223556, 223601, 232731, 232736, 164141)),
.Names = c("macid", "time"), row.names = c(NA, -6L), class = "data.frame")
# ensure 'x' is ordered properly
x <- x[order(x$macid,x$time),]
# add timediff column by macid
x$timediff <- ave(x$time, x$macid, FUN=function(x) c(NA,diff(x)))

Related

Removing duplicates of same date and location (3 columns) in R

I know there are like a million questions regarding duplicate removal, but unfortunately
none of them helped me so far. I struggle with the following:
I have a data frame (loc) that includes data of citizen science observations of nature (animals, plants, etc.). It has about 90.000 rows and looks like this:
ID Datum lat long Anzahl Art Gruppe Anrede Wochentag
1 1665376475 2019-05-09 51.30993 9.319896 20 Alytes obstetricans Amphibien Herr Do
2 529728479 2019-05-06 50.58524 8.503332 1 Alytes obstetricans Amphibien Frau Mo
3 1579862637 2019-05-23 50.53925 8.467546 8 Alytes obstetricans Amphibien Herr Do
4 -415013306 2019-05-06 50.58524 8.503332 3 Alytes obstetricans Amphibien Frau Mo
I also made a small sample data frame (loc_sample) of 10 observations and used dput(loc_sample):
structure(list(ID = c(688380991L, -1207894879L, 802295973L, -815104336L, -632066829L, -133354744L, 1929856503L, 952982037L, 1782222413L, 1967897802L),
Datum = structure(c(1559088000, 1558742400, 1557619200, 1557273600, 1557187200, 1557619200, 1557619200, 1557187200, 1557964800, 1556841600),
tzone = "UTC",
class = c("POSIXct", "POSIXt")),
lat = c(52.1236088700115, 51.5928822313012, 53.723426877949, 50.7737623304861, 49.9238597947287, 51.805563222817, 50.1738326622472, 51.2763067511127, 51.395189306337, 51.5732959108075),
long = c(8.62399927116144, 9.89597797393799, 9.04058595819038, 8.20740532922287, 8.29073164862348, 9.9225640296936, 8.79065646492143, 6.40700340270996, 6.47360801696777, 6.25690012620748),
Anzahl = c(2L, 25L, 4L, 1L, 1L, 30L, 2L, 1L, 1L, 1L),
Art = c("Sturnus vulgaris", "Olethreutes arcuella", "Sylvia atricapilla", "Buteo buteo", "Turdus merula", "Orchis mascula subsp. mascula", "Parus major", "Luscinia megarhynchos", "Milvus migrans", "Andrena bicolor"),
Gruppe = c("Voegel", "Schmetterlinge", "Voegel", "Voegel", "Voegel", "Pflanzen", "Voegel", "Voegel", "Voegel", "InsektenSonstige"),
Anrede = c("Herr", "Herr", "Frau", "Herr", "Herr", "Herr", "Herr", "Herr", "Herr", "Herr"),
Wochentag = structure(c(4L, 7L, 1L, 4L, 3L, 1L, 1L, 3L, 5L, 6L),
.Label = c("So", "Mo", "Di", "Mi", "Do", "Fr", "Sa"),
class = c("ordered", "factor"))),
row.names = c(NA, -10L),
class = "data.frame")
For my question only the variables Datum, latand long are important. Datum is a date and in the POSIXct format while lat and long are both numeric. There are quite a few observations that were reported on the same day from the exact same location. I would like to filter and remove those. So I have to check three separate columns and keep only one of each "same-place-same-day" observations.
I already tried putting the three variables in question into one:
loc$dupl <- paste(loc$Datum, loc$lat, loc$long, sep=" ,")
locu <- unique(loc[,2:4])
It seems like I managed to filter the duplicates, but I'm actually not sure, if that's how it is done correctly.
Also, that gives me a data frame with only Datum, lat and long. As a final result I need the original data frame without the duplicates in date and location, but with all the other information for the unique rows still left.
When I try:
locu <- unique(loc[,2:9])
It gives me all the other columns, but it doesn't remove the date and location duplicates.
Thanks in advance for your help!

This can work:
#Code
new <- loc[!duplicated(paste(loc$Datum,loc$lat,loc$long)),]

To get the full data frame back after finding the duplicates, you coudl do sth. like:
loc[!duplicated(loc[,2:4]),]
This code first detects the duplicate rows and then subsets your original data frame.
Note: this code will always keep the first occurences and delete the duplicates in subsequent rows. If you want to keep a certain ID (e.g. the second one, not the first one), we need a different solution.

Counting the rows in a data frame based on integer ranges

I have a data frame that lists a bunch of objects and their values.
Name NumCpu MemoryMB
1 BEAVERTN-SVR-C5 1 3072
2 BEAVERTN-SVR-UK 4 4096
3 BEAVERTN-SVR-JV 1 1024
I want to take my data frame and create a new column that groups these numbers by ranges.
Ranges: 0-1024, 1025-2048, 2049-4096
And then output the counts of those ranges into a new data frame:
Range Count
0-1024 1
1025-2048 0
2049-4096 2
I learn by doing, so this is a real work problem I'm trying to use R to solve. Any help greatly appreciated. Thank you!
Data
DF <- structure(list(Name = c("BEAVERTN-SVR-C5", "BEAVERTN-SVR-UK",
"BEAVERTN-SVR-JV"), NumCpu = c(1L, 4L, 1L), MemoryMB = c(3072L,
4096L, 1024L), Range = structure(c(3L, 3L, 1L), .Label = c("(0,1.02e+03]",
"(1.02e+03,2.05e+03]", "(2.05e+03,4.1e+03]"), class = "factor")), .Names = c("Name",
"NumCpu", "MemoryMB", "Range"), row.names = c("1", "2", "3"), class = "data.frame")

R: Need to perform multiple matches for each row in data frame

I have a data frame where for each Filename value, there is a set of values for Compound. Some compounds have a value for IS.Name, which is a value that is one of the Compound values for a Filename.
,Batch,Index,Filename,Sample.Name,Compound,Chrom.1.Name,Chrom.1.RT,IS.Name,IS.RT
1,Batch1,1,Batch1-001,Sample001,Compound1,1,0.639883333,IS-1,0
2,Batch1,1,Batch1-001,Sample001,IS-1,IS1,0.61,NONE,0
For each set of rows with the same Filename value in my data frame, I want to match the IS.Name value with the corresponding Compound value, and put the Chrom.1.RT value from the matched row into the IS.RT cell. For example, in the table above I want to take the Chrom.1.RT value from row 2 for Compound=IS-1 and put it into IS.RT on row 1 like this:
,Batch,Index,Filename,Sample.Name,Compound,Chrom.1.Name,Chrom.1.RT,IS.Name,IS.RT
1,Batch1,1,Batch1-001,Sample001,Compound1,1,0.639883333,IS-1,0.61
2,Batch1,1,Batch1-001,Sample001,IS-1,IS1,0.61,NONE,0
If possible I need to do this in R. Thanks in advance for any help!
EDIT: Here is a larger, more detailed example:
Filename Compound Chrom.1.RT IS.Name IS.RT
1 Sample-001 IS-1 1.32495 NONE NA
2 Sample-001 Compound-1 1.344033333 IS-1 NA
3 Sample-001 IS-2 0.127416667 NONE NA
4 Sample-001 Compound-2 0 IS-2 NA
5 Sample-002 IS-1 1.32495 NONE NA
6 Sample-002 Compound-1 1.344033333 IS-1 NA
7 Sample-002 IS-2 0.127416667 NONE NA
8 Sample-002 Compound-2 0 IS-2 NA
This is chromatography data. For each sample, four compounds are being analyzed, and each compound has a retention time value (Chrom.1.RT). Two of these compounds are references that are used by the other two compounds. For example, compound-1 is using IS-1, while IS-1 does not have a reference (IS). Within each sample I am trying to match up the IS Name to the compound row for it to grab the CHrom.1.RT and put it in the IS.RT field. So for Compound-1, I want to find the Chrom.1.RT value for the Compound with the same name as the IS.Name field (IS-1) and put it in the IS.RT field for Compound-1. The tables I'm working with list all of the compounds together and don't match up the values for the references, which I need to do for the next step of calculating the difference between Chrom.1.RT and IS.RT for each compound. Does that help?
EDIT - Here's the code I found that seems to work:
sampleList<- unique(df1$Filename)
for (i in sampleList){
SampleRows<-which(df1$Filename == sampleList[i])
RefRows <- subset(df1, Filename== sampleList[i])
df1$IS.RT[SampleRows]<- RefRows$Chrom.1.RT[ match(df1$IS.Name[SampleRows], RefRows$Compound)]
}
I'm definitely open to any suggestions to make this more efficient though.

First of all, I suggest in the future you provide your example as the output of dput(df1) as it makes it a lot easier to read it into R instead of the space delimited table you provided
That being said, I've managed to wrangle it into R with the "help" of MS Excel.
df1=structure(list(Filename = structure(c(1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L), .Label = c("Sample-001", "Sample-002"), class = "factor"),
Compound = structure(c(3L, 1L, 4L, 2L, 3L, 1L, 4L, 2L), .Label = c("Compound-1",
"Compound-2", "IS-1", "IS-2"), class = "factor"), Chrom.1.RT = c(1.32495,
1.344033333, 0.127416667, 0, 1.32495, 1.344033333, 0.127416667,
0), IS.Name = structure(c(3L, 1L, 3L, 2L, 3L, 1L, 3L, 2L), .Label = c("IS-1",
"IS-2", "NONE"), class = "factor"), IS.RT = c(NA, NA, NA,
NA, NA, NA, NA, NA)), .Names = c("Filename", "Compound",
"Chrom.1.RT", "IS.Name", "IS.RT"), class = "data.frame", row.names = c(NA,
-8L))
The code below is severely clunky but it does the job.
library("dplyr")
df1=tbl_df(df1)
left_join(df1,left_join(df1%>%select(-Compound),df1%>%group_by(Compound)%>%summarise(unique(Chrom.1.RT)),c("IS.Name"="Compound")))%>%select(-IS.RT)%>%rename(IS.RT=`unique(Chrom.1.RT)`)
Unless I got i wrong, this is what you need?

How to plot events on time scale using R?

I have a csv file like
id,date,event
1,01-01-2014,E1
1,01-02-2014,E2
2,01-03-2014,E1
2,01-04-2014,E1
2,01-05-2014,E2
I would like to plot events using R on time scale. For example x axis would be date and y axis would indicate event happened on a particular date. This would be one graph for one set of id's. In the above data set it would create 2 graphs.
This is little different from time series (i think). Anyway to accomplish this in R?
Thanks

Try:
ddf = structure(list(id = c(1L, 1L, 2L, 2L, 2L), date = structure(1:5, .Label = c("01-01-2014",
"01-02-2014", "01-03-2014", "01-04-2014", "01-05-2014"), class = "factor"),
event = structure(c(1L, 2L, 1L, 1L, 2L), .Label = c("E1",
"E2"), class = "factor")), .Names = c("id", "date", "event"
), class = "data.frame", row.names = c(NA, -5L))
>
ddf$date2 = as.Date(ddf$date, format="%m-%d-%Y")
ddf
id date event date2
1 1 01-01-2014 E1 2014-01-01
2 1 01-02-2014 E2 2014-01-02
3 2 01-03-2014 E1 2014-01-03
4 2 01-04-2014 E1 2014-01-04
5 2 01-05-2014 E2 2014-01-05
>
ggplot(data=ddf, aes(x=date2, y=event, group=factor(id), color=factor(id)))+
geom_line()+
geom_point()+
facet_grid(id~.)
Edit: The code is simple and self-explanatory. Basically the date is kept in x-axis and events in y-axis. For clarity, the graphs are plotted for different ID separately (using facet_grid command), although they can be kept in same graph also, as seen in graph below generated by excluding the facet_grid command in above code:
Here there may be some ambiguity when the lines get overlapping.

Identify gaps in a continuous time period

I have a dataframe with some observations of when lines attached to IDs.
I need the period of time in days when each ID had a line/catheter attached.
Here is my dput return:
structure(list(ID = c(487622L, 487622L, 487639L, 487639L, 489027L,
489027L, 489027L, 491858L, 491858L, 491858L, 491858L, 491858L,
491858L), Line = c("Central Venous Line", "Central Venous Line",
"Central Venous Line", "Peripherally Inserted Central Catheter (PICC)",
"Haemodialysis Catheter", "Peripherally Inserted Central Catheter (PICC)",
"Haemodialysis Catheter", "Central Venous Line", "Haemodialysis Catheter",
"Central Venous Line", "Haemodialysis Catheter", "Central Venous Line",
"Peripherally Inserted Central Catheter (PICC)"), Start = structure(c(1362528000,
1363219200, 1362268800, 1363219200, 1364774400, 1365120000, 1365465600,
1364688000, 1364688000, 1365724800, 1365724800, 1366848000, 1369353600
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), End = structure(c(1362787200,
1363824000, 1363305600, 1363737600, 1365465600, 1366675200, 1365638400,
1365724800, 1365724800, 1366329600, 1366848000, 1367539200, 1369612800
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), Days = c("3.095138889",
"7.045138889", "11.87777778", "5.736111111", "7.850694444", "18.02083333",
"1.813888889", "12.32986111", "12.71388889", "6.782638889", "13.14027778",
"7.718055556", "3.397222222"), dateOrder = c(1L, 2L, 1L, 2L,
1L, 2L, 3L, 1L, 2L, 3L, 4L, 5L, 6L)), .Names = c("ID", "Line",
"Start", "End", "Days", "dateOrder"), row.names = 79:91, class = "data.frame")
Here is the catch. It does not matter if an ID has more than one line/catheter. I just need to take the earliest start date for each ID, the latest end date for each ID, and calculate the number of continuous days each ID has a line/catheter attached.
The problem is confounded by some cases, e.g. ID 491858. This individual had a line removed (dateOrder = 5) on 2013-05-03 and reinserted on 2013-05-24 for just over 3 days.
How I intended to handle this is to subtract the gap (number of days) from the number of days of continuous time between min(Start Date) and max(end date).
There are over 20,000 records in the data set.
Here is what I have done so far:
Converted the DF to a list of DFs based on ID.
I intended to apply a function to each DF something as follows:
If the difference in time (days) between subsequent start date and previous end date for each row exceeds 0, then add TRUE or some arbitrary column value to each data frame.
function(y){
for (i in length(y)){
if(difftime(y$Start[i+1], y$End[i], units='days') > 0){
y$test <- TRUE}
}
}
Any help would be greatly appreciated.
Thanks.
UPDATE
Ignore the days column. It is of no use. I intend to aggregate month line counts from the unique cases.

I guess something like this might help, unless I've misunderstood something:
unlist(lapply(split(DF, DF$ID),
function(x) { totaldays <- max(x$End) - min(x$Start);
x$Start <- c(x$Start[-1], NA);
res <- difftime(x$Start[-length(x$Start)], x$End[-length(x$Start)], units = "days");
res <- res[res > 0];
res <- ifelse(length(res) == 0, 0, res);
return(as.numeric(totaldays - res)) }))
#487622 487639 489027 491858
# 10 17 22 36
DF is your dput.

If I understand correctly, you want the total amount of days that the catheter was present. To do that, I would use plyr
#assume df is your dput object
library(plyr)
day.summary <- ddply(df, "ID", function(x) data.frame(total.days = sum(as.numeric(x$Days))))
print(day.summary)
ID total.days
1 487622 10.14028
2 487639 17.61389
3 489027 27.68542
4 491858 56.08194

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Identify duplicate data with a threshold - r

Related

Removing duplicates of same date and location (3 columns) in R

Counting the rows in a data frame based on integer ranges

R: Need to perform multiple matches for each row in data frame

How to plot events on time scale using R?

Identify gaps in a continuous time period

Categories

Resources