I know there are like a million questions regarding duplicate removal, but unfortunately
none of them helped me so far. I struggle with the following:
I have a data frame (loc) that includes data of citizen science observations of nature (animals, plants, etc.). It has about 90,000 rows and looks like this:
ID Datum lat long Anzahl Art Gruppe Anrede Wochentag
1 1665376475 2019-05-09 51.30993 9.319896 20 Alytes obstetricans Amphibien Herr Do
2 529728479 2019-05-06 50.58524 8.503332 1 Alytes obstetricans Amphibien Frau Mo
3 1579862637 2019-05-23 50.53925 8.467546 8 Alytes obstetricans Amphibien Herr Do
4 -415013306 2019-05-06 50.58524 8.503332 3 Alytes obstetricans Amphibien Frau Mo
I also made a small sample data frame (loc_sample) of 10 observations and used dput(loc_sample):
structure(list(ID = c(688380991L, -1207894879L, 802295973L, -815104336L, -632066829L, -133354744L, 1929856503L, 952982037L, 1782222413L, 1967897802L),
Datum = structure(c(1559088000, 1558742400, 1557619200, 1557273600, 1557187200, 1557619200, 1557619200, 1557187200, 1557964800, 1556841600),
tzone = "UTC",
class = c("POSIXct", "POSIXt")),
lat = c(52.1236088700115, 51.5928822313012, 53.723426877949, 50.7737623304861, 49.9238597947287, 51.805563222817, 50.1738326622472, 51.2763067511127, 51.395189306337, 51.5732959108075),
long = c(8.62399927116144, 9.89597797393799, 9.04058595819038, 8.20740532922287, 8.29073164862348, 9.9225640296936, 8.79065646492143, 6.40700340270996, 6.47360801696777, 6.25690012620748),
Anzahl = c(2L, 25L, 4L, 1L, 1L, 30L, 2L, 1L, 1L, 1L),
Art = c("Sturnus vulgaris", "Olethreutes arcuella", "Sylvia atricapilla", "Buteo buteo", "Turdus merula", "Orchis mascula subsp. mascula", "Parus major", "Luscinia megarhynchos", "Milvus migrans", "Andrena bicolor"),
Gruppe = c("Voegel", "Schmetterlinge", "Voegel", "Voegel", "Voegel", "Pflanzen", "Voegel", "Voegel", "Voegel", "InsektenSonstige"),
Anrede = c("Herr", "Herr", "Frau", "Herr", "Herr", "Herr", "Herr", "Herr", "Herr", "Herr"),
Wochentag = structure(c(4L, 7L, 1L, 4L, 3L, 1L, 1L, 3L, 5L, 6L),
.Label = c("So", "Mo", "Di", "Mi", "Do", "Fr", "Sa"),
class = c("ordered", "factor"))),
row.names = c(NA, -10L),
class = "data.frame")
For my question only the variables Datum, lat and long are important. Datum is a date in POSIXct format, while lat and long are both numeric. There are quite a few observations that were reported on the same day from the exact same location. I would like to filter and remove those. So I have to check three separate columns and keep only one of each set of "same-place-same-day" observations.
I already tried putting the three variables in question into one:
loc$dupl <- paste(loc$Datum, loc$lat, loc$long, sep=" ,")
locu <- unique(loc[,2:4])
It seems like I managed to filter the duplicates, but I'm not sure whether this is the correct way to do it.
Also, that gives me a data frame with only Datum, lat and long. As a final result I need the original data frame without the duplicates in date and location, but with all the other information for the unique rows still left.
When I try:
locu <- unique(loc[,2:9])
It gives me all the other columns, but it doesn't remove the date and location duplicates.
Thanks in advance for your help!
This can work:
#Code
new <- loc[!duplicated(paste(loc$Datum,loc$lat,loc$long)),]
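A small worked check of this idea, as a sketch only: the toy data frame below is invented for illustration (it just mimics some column names of loc), and selecting the key columns by name instead of pasting them together is a variation on the line above, not the same code.

```r
# Toy data mimicking loc; the values are made up for illustration
toy <- data.frame(
  ID     = c(11L, 12L, 13L, 14L),
  Datum  = as.POSIXct(c("2019-05-09", "2019-05-06", "2019-05-23", "2019-05-06"),
                      tz = "UTC"),
  lat    = c(51.30993, 50.58524, 50.53925, 50.58524),
  long   = c(9.319896, 8.503332, 8.467546, 8.503332),
  Anzahl = c(20L, 1L, 8L, 3L)
)

# duplicated() flags the second and later rows of each Datum/lat/long combination
key_cols <- c("Datum", "lat", "long")
is_dup   <- duplicated(toy[, key_cols])

# Subsetting the full data frame keeps every column for the retained rows
toy_unique <- toy[!is_dup, ]
print(toy_unique)
```

Rows 2 and 4 of the toy data share date and coordinates, so only row 4 is dropped. This is also why unique(loc[,2:9]) in the question removes nothing: with the other columns included, those two rows are no longer identical.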
To get the full data frame back after finding the duplicates, you could do something like:
loc[!duplicated(loc[,2:4]),]
This code first detects the duplicate rows and then subsets your original data frame.
Note: this code will always keep the first occurrence and delete the duplicates in subsequent rows. If you want to keep a certain ID (e.g. the second one, not the first one), we need a different solution.
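Two possible ways to keep a different row than the first one, sketched on invented toy data (the values and the choice of Anzahl as the tie-breaker are assumptions for illustration, not something the question asks for):

```r
# Toy data; values are made up for illustration
toy <- data.frame(
  ID     = c(11L, 12L, 13L, 14L),
  Datum  = as.Date(c("2019-05-09", "2019-05-06", "2019-05-23", "2019-05-06")),
  lat    = c(51.30993, 50.58524, 50.53925, 50.58524),
  long   = c(9.319896, 8.503332, 8.467546, 8.503332),
  Anzahl = c(20L, 1L, 8L, 3L)
)
key_cols <- c("Datum", "lat", "long")

# Variant 1: keep the LAST occurrence of each date/location combination
keep_last <- toy[!duplicated(toy[, key_cols], fromLast = TRUE), ]

# Variant 2: keep the row with the largest Anzahl per combination by sorting
# so that this row comes first, then dropping the later duplicates
sorted   <- toy[order(-toy$Anzahl), ]
keep_max <- sorted[!duplicated(sorted[, key_cols]), ]
```

In both variants the row with ID 12 is dropped and ID 14 is kept, because ID 14 comes later and also has the larger count.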
I have a dataset that was recorded by observation (each observation has its own row of data). I am looking to combine/condense these rows by the plant they were found on, which is currently a character variable. All other columns are numerical values.
EX:
This is the raw data
|Sci_Name|Honeybee_count|Other_bee_Obsevrved|Stem_count|
|---|---|---|---|
|Zizia aurea|1|5|10|
|Asclepias viridiflora|15|1|3|
|Viola unknown|0|0|4|
|Zizia aurea|0|2|6|
|Zizia aurea|3|6|3|
|Asclepias viridiflora|8|2|17|
and I want:
|Sci_Name|Honeybee_count|Other_bee_Obsevrved|Stem_count|
|---|---|---|---|
|Zizia aurea|4|13|19|
|Asclepias viridiflora|23|3|20|
|Viola unknown|0|0|4|
I am currently pulling this data from a CSV already in table form. I have been attempting to create a new table/data frame with one entry for each plant species, and blanks/0s for every other variable, which I can then cbind to the original. This, however, has been clunky at best, and I am having trouble figuring out how to have each row check itself. I am open to any approach, let me know what you think!
Thanks :D
We can use the formula method of aggregate from base R. On the right-hand side (rhs) of the ~, specify the grouping variable, and on the left-hand side (lhs), use . to denote the rest of the variables. Specify FUN as sum and it will do the column-wise sum by group.
aggregate(. ~ Sci_Name, df1, sum)
-output
Sci_Name Honeybee_count Other_bee_Obsevrved Stem_count
1 Asclepias viridiflora 23 3 20
2 Viola unknown 0 0 4
3 Zizia aurea 4 13 19
data
df1 <- structure(list(Sci_Name = c("Zizia aurea", "Asclepias viridiflora",
"Viola unknown", "Zizia aurea", "Zizia aurea", "Asclepias viridiflora"
), Honeybee_count = c(1L, 15L, 0L, 0L, 3L, 8L), Other_bee_Obsevrved = c(5L,
1L, 0L, 2L, 6L, 2L), Stem_count = c(10L, 3L, 4L, 6L, 3L, 17L)),
class = "data.frame", row.names = c(NA,
-6L))
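As a cross-check of the aggregate call above, here is a minimal sketch that rebuilds the same data with data.frame() and compares the formula method with rowsum(), another base R function for group-wise sums. The object names by_formula and by_rowsum are made up for this example.

```r
df1 <- data.frame(
  Sci_Name = c("Zizia aurea", "Asclepias viridiflora", "Viola unknown",
               "Zizia aurea", "Zizia aurea", "Asclepias viridiflora"),
  Honeybee_count      = c(1L, 15L, 0L, 0L, 3L, 8L),
  Other_bee_Obsevrved = c(5L, 1L, 0L, 2L, 6L, 2L),
  Stem_count          = c(10L, 3L, 4L, 6L, 3L, 17L)
)

# Formula method, as in the answer: sum every other column by Sci_Name
by_formula <- aggregate(. ~ Sci_Name, df1, sum)

# rowsum() alternative: sums the numeric columns by group;
# the group names end up as row names instead of a column
by_rowsum <- rowsum(df1[, -1], group = df1$Sci_Name)

print(by_formula)
print(by_rowsum)
```

One difference worth knowing: the formula method of aggregate drops rows containing NA by default (na.action = na.omit), so the two approaches can disagree on data with missing values.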
I have a dataset with two columns containing the following: an indicator number and a hashcode
The only problem is that the columns have the same name, but the value can switch columns.
Now I want to merge the columns and keep the number (I don't care about the hashcode)
I saw this question: Merge two columns into one in r
and I tried the coalesce() function, but that only helps when there are NA values, which I don't have. I also looked at the unite function, but according to the cheat sheet documentation it doesn't do what I'm looking for.
My next try was filter_at and other filter functions from the dplyr package.
But that only leaves 150 data points while at the start I have 61k data points.
Code of filter_at I tried:
data <- filter_at(data,vars("hk","hk_1"),all_vars(.>0))
I assumed that a #-string would not be greater than 0, which seems to be true, but it removes more than intended.
I would like to keep hk or hk_1 value which is a number. The other one (the hash) can be removed. Then I want a new column which only contains those numbers.
Sample data
My data looks like this:
HK|HK1
190|#SP0839
190|#SP0340
178|#SP2949
#SP8390|177
#SP2240|212
What I would like to see:
HK
190
190
178
177
212
I hope this provides some insight into the data. There are more columns, such as a description, which mean that the two 190 rows at the start are not duplicates.
We can replace all the values that start with "#" with NA and then use coalesce to select the non-NA value between HK and HK1.
library(dplyr)
df %>%
  mutate_all(~as.character(replace(., grepl("^#", .), NA))) %>%
  mutate(HK = coalesce(HK, HK1)) %>%
  select(HK)
# HK
#1 190
#2 190
#3 178
#4 177
#5 212
data
df <- structure(list(HK = structure(c(4L, 4L, 3L, 2L, 1L), .Label = c("#SP2240",
"#SP8390", "178", "190"), class = "factor"), HK1 = structure(c(2L,
1L, 3L, 4L, 5L), .Label = c("#SP0340", "#SP0839", "#SP2949",
"177", "212"), class = "factor")), class = "data.frame", row.names = c(NA, -5L))
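For comparison, a base R sketch of the same idea without dplyr. It assumes, as in the sample data, that exactly one of the two columns holds the number in each row; the column name HK_num and the helper vector is_hash are invented for this example.

```r
df <- data.frame(
  HK  = c("190", "190", "178", "#SP8390", "#SP2240"),
  HK1 = c("#SP0839", "#SP0340", "#SP2949", "177", "212"),
  stringsAsFactors = FALSE
)

# TRUE where HK holds a hash code (starts with "#") rather than a number
is_hash <- grepl("^#", df$HK)

# Take HK1 where HK is a hash code, otherwise HK; as.character() guards
# against factor columns, whose integer codes ifelse() would otherwise return
df$HK_num <- as.numeric(ifelse(is_hash, as.character(df$HK1), as.character(df$HK)))

print(df$HK_num)
```

If a row had hash codes in both columns, as.numeric() would return NA for it with a warning, which at least makes such rows easy to spot.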
I have a data frame where for each Filename value, there is a set of values for Compound. Some compounds have a value for IS.Name, which is a value that is one of the Compound values for a Filename.
,Batch,Index,Filename,Sample.Name,Compound,Chrom.1.Name,Chrom.1.RT,IS.Name,IS.RT
1,Batch1,1,Batch1-001,Sample001,Compound1,1,0.639883333,IS-1,0
2,Batch1,1,Batch1-001,Sample001,IS-1,IS1,0.61,NONE,0
For each set of rows with the same Filename value in my data frame, I want to match the IS.Name value with the corresponding Compound value, and put the Chrom.1.RT value from the matched row into the IS.RT cell. For example, in the table above I want to take the Chrom.1.RT value from row 2 for Compound=IS-1 and put it into IS.RT on row 1 like this:
,Batch,Index,Filename,Sample.Name,Compound,Chrom.1.Name,Chrom.1.RT,IS.Name,IS.RT
1,Batch1,1,Batch1-001,Sample001,Compound1,1,0.639883333,IS-1,0.61
2,Batch1,1,Batch1-001,Sample001,IS-1,IS1,0.61,NONE,0
If possible I need to do this in R. Thanks in advance for any help!
EDIT: Here is a larger, more detailed example:
Filename Compound Chrom.1.RT IS.Name IS.RT
1 Sample-001 IS-1 1.32495 NONE NA
2 Sample-001 Compound-1 1.344033333 IS-1 NA
3 Sample-001 IS-2 0.127416667 NONE NA
4 Sample-001 Compound-2 0 IS-2 NA
5 Sample-002 IS-1 1.32495 NONE NA
6 Sample-002 Compound-1 1.344033333 IS-1 NA
7 Sample-002 IS-2 0.127416667 NONE NA
8 Sample-002 Compound-2 0 IS-2 NA
This is chromatography data. For each sample, four compounds are analyzed, and each compound has a retention time value (Chrom.1.RT). Two of these compounds are references used by the other two. For example, Compound-1 uses IS-1, while IS-1 itself has no reference (IS). Within each sample I am trying to match the IS.Name to the corresponding compound row, grab its Chrom.1.RT, and put it in the IS.RT field. So for Compound-1, I want to find the Chrom.1.RT value of the compound whose name equals the IS.Name field (IS-1) and put it in the IS.RT field for Compound-1. The tables I'm working with list all of the compounds together and don't match up the values for the references, which I need for the next step of calculating the difference between Chrom.1.RT and IS.RT for each compound. Does that help?
EDIT - Here's the code I found that seems to work:
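sampleList <- unique(df1$Filename)
for (s in sampleList) {
  # Rows of the full data frame that belong to the current sample
  SampleRows <- which(df1$Filename == s)
  RefRows <- subset(df1, Filename == s)
  # Look up each row's IS.Name among this sample's compounds and copy that retention time
  df1$IS.RT[SampleRows] <- RefRows$Chrom.1.RT[match(df1$IS.Name[SampleRows], RefRows$Compound)]
}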
I'm definitely open to any suggestions to make this more efficient though.
First of all, I suggest that in the future you provide your example as the output of dput(df1), as that makes it a lot easier to read into R than the space-delimited table you provided.
That being said, I've managed to wrangle it into R with the "help" of MS Excel.
df1=structure(list(Filename = structure(c(1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L), .Label = c("Sample-001", "Sample-002"), class = "factor"),
Compound = structure(c(3L, 1L, 4L, 2L, 3L, 1L, 4L, 2L), .Label = c("Compound-1",
"Compound-2", "IS-1", "IS-2"), class = "factor"), Chrom.1.RT = c(1.32495,
1.344033333, 0.127416667, 0, 1.32495, 1.344033333, 0.127416667,
0), IS.Name = structure(c(3L, 1L, 3L, 2L, 3L, 1L, 3L, 2L), .Label = c("IS-1",
"IS-2", "NONE"), class = "factor"), IS.RT = c(NA, NA, NA,
NA, NA, NA, NA, NA)), .Names = c("Filename", "Compound",
"Chrom.1.RT", "IS.Name", "IS.RT"), class = "data.frame", row.names = c(NA,
-8L))
The code below is severely clunky but it does the job.
library("dplyr")
df1=tbl_df(df1)
left_join(
  df1,
  left_join(
    df1 %>% select(-Compound),
    df1 %>% group_by(Compound) %>% summarise(unique(Chrom.1.RT)),
    c("IS.Name" = "Compound")
  )
) %>%
  select(-IS.RT) %>%
  rename(IS.RT = `unique(Chrom.1.RT)`)
Unless I got it wrong, this is what you need?
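A vectorised base R alternative, offered as a sketch rather than a tested replacement: it builds a composite key from Filename and Compound and looks up each row's reference with match(). The variable names compound_key and wanted_key are made up, and the "|" separator assumes that character does not occur in the file or compound names.

```r
df1 <- data.frame(
  Filename   = rep(c("Sample-001", "Sample-002"), each = 4),
  Compound   = rep(c("IS-1", "Compound-1", "IS-2", "Compound-2"), times = 2),
  Chrom.1.RT = rep(c(1.32495, 1.344033333, 0.127416667, 0), times = 2),
  IS.Name    = rep(c("NONE", "IS-1", "NONE", "IS-2"), times = 2),
  IS.RT      = NA_real_,
  stringsAsFactors = FALSE
)

# Key identifying each row: the sample plus the compound measured in that row
compound_key <- paste(df1$Filename, df1$Compound, sep = "|")
# Key each row is looking for: the same sample plus the name of its reference
wanted_key   <- paste(df1$Filename, df1$IS.Name, sep = "|")

# match() gives the row index of the reference, or NA where there is none
# (e.g. IS.Name == "NONE"), so those rows keep IS.RT = NA
df1$IS.RT <- df1$Chrom.1.RT[match(wanted_key, compound_key)]

print(df1)
```

Because the key includes Filename, a reference is only ever taken from the same sample, which the group_by(Compound) summary above does not guarantee when retention times differ between samples.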
I have a dataframe with some observations of when lines were attached to IDs.
I need the period of time in days when each ID had a line/catheter attached.
Here is my dput return:
structure(list(ID = c(487622L, 487622L, 487639L, 487639L, 489027L,
489027L, 489027L, 491858L, 491858L, 491858L, 491858L, 491858L,
491858L), Line = c("Central Venous Line", "Central Venous Line",
"Central Venous Line", "Peripherally Inserted Central Catheter (PICC)",
"Haemodialysis Catheter", "Peripherally Inserted Central Catheter (PICC)",
"Haemodialysis Catheter", "Central Venous Line", "Haemodialysis Catheter",
"Central Venous Line", "Haemodialysis Catheter", "Central Venous Line",
"Peripherally Inserted Central Catheter (PICC)"), Start = structure(c(1362528000,
1363219200, 1362268800, 1363219200, 1364774400, 1365120000, 1365465600,
1364688000, 1364688000, 1365724800, 1365724800, 1366848000, 1369353600
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), End = structure(c(1362787200,
1363824000, 1363305600, 1363737600, 1365465600, 1366675200, 1365638400,
1365724800, 1365724800, 1366329600, 1366848000, 1367539200, 1369612800
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), Days = c("3.095138889",
"7.045138889", "11.87777778", "5.736111111", "7.850694444", "18.02083333",
"1.813888889", "12.32986111", "12.71388889", "6.782638889", "13.14027778",
"7.718055556", "3.397222222"), dateOrder = c(1L, 2L, 1L, 2L,
1L, 2L, 3L, 1L, 2L, 3L, 4L, 5L, 6L)), .Names = c("ID", "Line",
"Start", "End", "Days", "dateOrder"), row.names = 79:91, class = "data.frame")
Here is the catch. It does not matter if an ID has more than one line/catheter. I just need to take the earliest start date for each ID, the latest end date for each ID, and calculate the number of continuous days each ID has a line/catheter attached.
The problem is confounded by some cases, e.g. ID 491858. This individual had a line removed (dateOrder = 5) on 2013-05-03 and reinserted on 2013-05-24 for just over 3 days.
How I intended to handle this is to subtract the gap (number of days) from the number of days of continuous time between min(Start Date) and max(end date).
There are over 20,000 records in the data set.
Here is what I have done so far:
Converted the DF to a list of DFs based on ID.
I intended to apply a function to each DF something as follows:
If the difference in time (days) between subsequent start date and previous end date for each row exceeds 0, then add TRUE or some arbitrary column value to each data frame.
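function(y){
  # Flag each row that is followed by a gap before the next line starts
  y$test <- FALSE
  for (i in seq_len(nrow(y) - 1)){
    if(difftime(y$Start[i+1], y$End[i], units='days') > 0){
      y$test[i] <- TRUE}
  }
  y
}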
Any help would be greatly appreciated.
Thanks.
UPDATE
Ignore the days column. It is of no use. I intend to aggregate month line counts from the unique cases.
I guess something like this might help, unless I've misunderstood something:
unlist(lapply(split(DF, DF$ID),
function(x) { totaldays <- max(x$End) - min(x$Start);
x$Start <- c(x$Start[-1], NA);
res <- difftime(x$Start[-length(x$Start)], x$End[-length(x$Start)], units = "days");
res <- res[res > 0];
res <- ifelse(length(res) == 0, 0, res);
return(as.numeric(totaldays - res)) }))
#487622 487639 489027 491858
# 10 17 22 36
DF is your dput.
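One caveat with the code above: ifelse() with a length-one test returns only the first element of res, so only the first gap is subtracted when an ID has several. A possible sketch that handles any number of gaps and overlaps is to sum the time newly covered by each interval. The function name covered_days is hypothetical, and only two of the IDs from the dput are reproduced here to keep the example short.

```r
DF <- data.frame(
  ID    = c(487622L, 487622L, rep(491858L, 6)),
  Start = as.POSIXct(c(1362528000, 1363219200, 1364688000, 1364688000,
                       1365724800, 1365724800, 1366848000, 1369353600),
                     origin = "1970-01-01", tz = "UTC"),
  End   = as.POSIXct(c(1362787200, 1363824000, 1365724800, 1365724800,
                       1366329600, 1366848000, 1367539200, 1369612800),
                     origin = "1970-01-01", tz = "UTC")
)

# Days on which at least one line was attached (union of all intervals)
covered_days <- function(start, end) {
  o <- order(start)
  s <- as.numeric(start[o])  # seconds since the epoch
  e <- as.numeric(end[o])
  # Latest end date seen before each row; -Inf for the first row
  prev_end <- c(-Inf, cummax(e)[-length(e)])
  # Each interval only contributes the part not already covered
  sum(pmax(0, e - pmax(s, prev_end))) / 86400
}

res <- sapply(split(DF, DF$ID), function(x) covered_days(x$Start, x$End))
print(res)
```

For these two IDs the result agrees with the output shown above (10 and 36 days).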
If I understand correctly, you want the total amount of days that the catheter was present. To do that, I would use plyr
#assume df is your dput object
library(plyr)
day.summary <- ddply(df, "ID", function(x) data.frame(total.days = sum(as.numeric(x$Days))))
print(day.summary)
ID total.days
1 487622 10.14028
2 487639 17.61389
3 489027 27.68542
4 491858 56.08194
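The same per-ID sum can be sketched in base R with tapply(), without plyr. Only the first two IDs from the dput are reproduced below, and the name day_summary is made up. Note that summing Days counts overlapping lines more than once, and the question's update says the Days column should be ignored, so treat this as an illustration of the grouping step only.

```r
df <- data.frame(
  ID   = c(487622L, 487622L, 487639L, 487639L),
  Days = c("3.095138889", "7.045138889", "11.87777778", "5.736111111"),
  stringsAsFactors = FALSE
)

# Days is stored as character in the dput, so convert before summing by ID
total_days <- tapply(as.numeric(df$Days), df$ID, sum)

day_summary <- data.frame(
  ID         = as.integer(names(total_days)),
  total.days = as.vector(total_days)
)
print(day_summary)
```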
I am working with bluetooth sensor data and need to identify possible duplicate readings for each unique ID. The bluetooth sensor made a scan every five seconds, and may pick up the same device in subsequent readings if the device wasn't moving quickly (i.e. sitting in traffic). There may be multiple readings from the same device if that device made a round trip, but those should be separated by several minutes. I can't wrap my head around how to get rid of the duplicate data. I need to calculate a time difference column if the macid's match.
The data has the format:
macid time
00:03:7A:4D:F3:59 82333
00:03:7A:EF:58:6F 223556
00:03:7A:EF:58:6F 223601
00:03:7A:EF:58:6F 232731
00:03:7A:EF:58:6F 232736
00:05:4F:0B:45:F7 164141
And I need to create:
macid time timediff
00:03:7A:4D:F3:59 82333 NA
00:03:7A:EF:58:6F 223556 NA
00:03:7A:EF:58:6F 223601 45
00:03:7A:EF:58:6F 232731 9310
00:03:7A:EF:58:6F 232736 5
00:05:4F:0B:45:F7 164141 NA
My first attempt at this is extremely slow and not really usable:
dedupeIDs <- function (zz) {
  # Order by macid and then time
  zz <- zz[order(zz$macid, zz$time) ,]
  zz$timediff <- c(999999, diff(zz$time))
  for (i in 2:nrow(zz)) {
    # A new macid starts on this row, so the difference to the previous row is meaningless
    if (zz[i, "macid"] != zz[i - 1, "macid"]) {
      zz[i, "timediff"] <- 999999
    }
  }
  return(zz)
}
I'll then be able to filter the data.frame based on the time difference column.
Sample data:
structure(list(macid = structure(c(1L, 2L, 2L, 2L, 2L, 3L),
.Label = c("00:03:7A:4D:F3:59", "00:03:7A:EF:58:6F",
"00:05:4F:0B:45:F7"), class = "factor"),
time = c(82333, 223556, 223601, 232731, 232736, 164141)),
.Names = c("macid", "time"), row.names = c(NA, -6L),
class = "data.frame")
How about:
x <- structure(list(macid= structure(c(1L, 2L, 2L, 2L, 2L, 3L),
.Label = c("00:03:7A:4D:F3:59", "00:03:7A:EF:58:6F", "00:05:4F:0B:45:F7"),
class = "factor"), time = c(82333, 223556, 223601, 232731, 232736, 164141)),
.Names = c("macid", "time"), row.names = c(NA, -6L), class = "data.frame")
# ensure 'x' is ordered properly
x <- x[order(x$macid,x$time),]
# add timediff column by macid
x$timediff <- ave(x$time, x$macid, FUN=function(x) c(NA,diff(x)))
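A possible follow-up, sketched under two assumptions that the question does not confirm: that time is clock time written as HHMMSS (in which case a plain numeric difference such as 223601 - 223556 = 45 overstates a 5 second gap), and that readings less than 60 seconds apart count as repeats. The helper to_seconds and the threshold min_gap are hypothetical names.

```r
x <- data.frame(
  macid = c("00:03:7A:4D:F3:59", rep("00:03:7A:EF:58:6F", 4), "00:05:4F:0B:45:F7"),
  time  = c(82333, 223556, 223601, 232731, 232736, 164141)
)
x <- x[order(x$macid, x$time), ]

# ASSUMPTION: time is HHMMSS; convert it to seconds since midnight
to_seconds <- function(t) (t %/% 10000) * 3600 + ((t %/% 100) %% 100) * 60 + t %% 100
x$secs <- to_seconds(x$time)

# Time difference to the previous reading of the same device (NA for the first one)
x$timediff <- ave(x$secs, x$macid, FUN = function(v) c(NA, diff(v)))

# ASSUMPTION: a hypothetical threshold below which a reading is a repeat
min_gap <- 60
deduped <- x[is.na(x$timediff) | x$timediff > min_gap, ]
print(deduped)
```

If the time column is already in seconds, the conversion step can simply be dropped and the filter applied to the timediff column computed above.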