R: select one row from duplicated rows after comparing multiple conditions

I got these duplicated records from a ton of data. Now I need to choose one row from each set of duplicated rows.
ID <- c("6820","6820","17413","17413","38553","38553","52760","52760","717841","717841","717841","747187","747187","747187")
date <- c("2014-06-12","2015-06-11","2014-05-01","2014-05-01","2014-06-12","2015-06-11","2014-10-24","2014-10-24","2014-05-01","2014-05-01","2014-12-02","2014-03-01","2014-05-12","2014-05-12")
type <- c("ST","ST","MC","MC","LC","LC","YA","YA","YA","YA","MC","LC","LC","MC")
level <-c("firsttime","new","new","active","active","active","firsttime","new","active","new","active","new","active","active")
data <- data.frame(ID,date,type,level)
The data frame will look like this:
For each ID, I want to apply the following rules and collect the chosen rows in df.right: if the dates differ, keep all of the rows; if the date is the same, compare type and choose in the order LC > MC > YA > ST (e.g. choose MC over YA); if the type is also the same, compare level and choose in the order active > new > firsttime (e.g. choose new over firsttime).
I tried to use foreach, but this only covers the first step, and it does not work for IDs that have 3 duplicated rows.
foreach (i=unique(data$ID), .combine='rbind') %do% {data[data$ID==i, "date"][1] == data[data$ID==i, "date"][2])
b <- data[data$ID==i,]}
The result should be like this:
Does anybody know how to do this? I'd really appreciate it. Thank you.

The dplyr package is good for this sort of thing
Using factors, you can specify how you want your categories ordered. Then you can pick the first of each type and level for each unique ID/date pair.
library(dplyr)
ID <- c("6820","6820","17413","17413","38553","38553","52760","52760","717841","717841","717841","747187","747187","747187")
date <- c("2014-06-12","2015-06-11","2014-05-01","2014-05-01","2014-06-12","2015-06-11","2014-10-24","2014-10-24","2014-05-01","2014-05-01","2014-12-02","2014-03-01","2014-05-12","2014-05-12")
type <- c("ST","ST","MC","MC","LC","LC","YA","YA","YA","YA","MC","LC","LC","MC")
level <-c("firsttime","new","new","active","active","active","firsttime","new","active","new","active","new","active","active")
type <- factor(type, levels=c("LC", "MC", "YA", "ST"))
level <- factor(level, levels=c("active", "new", "firsttime"))
data <- data.frame(ID,date,type,level)
df.right <- data %>%
  group_by(ID, date) %>%
  filter(type == sort(type)[1]) %>%
  filter(level == sort(level)[1])
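One thing to keep in mind with the two filter() calls is that they keep every row that ties on both type and level. If exactly one row per ID/date pair is wanted, a possible variant (only a sketch, using the same factor-ordering idea with arrange() and distinct(), not something the answer itself proposes) could look like this:

```r
library(dplyr)

# Same example data as in the question
ID <- c("6820","6820","17413","17413","38553","38553","52760","52760",
        "717841","717841","717841","747187","747187","747187")
date <- c("2014-06-12","2015-06-11","2014-05-01","2014-05-01","2014-06-12",
          "2015-06-11","2014-10-24","2014-10-24","2014-05-01","2014-05-01",
          "2014-12-02","2014-03-01","2014-05-12","2014-05-12")
type <- c("ST","ST","MC","MC","LC","LC","YA","YA","YA","YA","MC","LC","LC","MC")
level <- c("firsttime","new","new","active","active","active","firsttime",
           "new","active","new","active","new","active","active")
data <- data.frame(ID, date, type, level, stringsAsFactors = FALSE)

df.right <- data %>%
  # order the factor levels from most to least preferred
  mutate(
    type  = factor(type,  levels = c("LC", "MC", "YA", "ST")),
    level = factor(level, levels = c("active", "new", "firsttime"))
  ) %>%
  # within each ID/date, the preferred row now sorts first
  arrange(ID, date, type, level) %>%
  # keep only the first row of every ID/date combination
  distinct(ID, date, .keep_all = TRUE)
```

Because distinct() keeps the first row it sees for each combination, the arrange() step is what decides which row survives.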

The trick here is to order the levels of type and level by preference, so that sorting puts the preferred row first. Two deduplications then follow: the first removes duplicate rows based on the columns ID, date and type, and the second removes duplicate rows based on the first two columns only:
type = factor(type, levels=c("LC","MC","YA","ST"))
level = factor(level, levels=c("active","new","firsttime"))
data <- data.frame(ID,date,type,level)
# sort so that the preferred type and level come first within each ID/date
d = with(data, data[order(ID, date, type, level),])
# !duplicated() also behaves correctly when there are no duplicates,
# whereas -which(duplicated()) would then drop every row
e = d[!duplicated(d[,1:3]),]
df.right = e[!duplicated(e[,1:2]),]
df.right = df.right[order(as.numeric(as.character(df.right$ID)), df.right$date),]
df.right
Output:
       ID       date type     level
1    6820 2014-06-12   ST firsttime
2    6820 2015-06-11   ST       new
4   17413 2014-05-01   MC    active
5   38553 2014-06-12   LC    active
6   38553 2015-06-11   LC    active
8   52760 2014-10-24   YA       new
9  717841 2014-05-01   YA    active
11 717841 2014-12-02   MC    active
12 747187 2014-03-01   LC       new
13 747187 2014-05-12   LC    active

Related

Use part of row data for new columns in R

I have a very large df with a column that contains the file directory for each row's data.
Example: D:Mouse_2174/experiment/13/trialsummary.txt.1
I would like to create 2 new columns, one with only the mouse ID (2174) and one with the session number (13). There will be different IDs and session numbers based on the row.
I've used sub as recommended here (match part of names in data.frame to new column), but I can only get the subject column to say "D:Mouse_2174". After adding a second line I can get it down to "D:Mous2174".
Is there a way to eliminate all characters before _ and after / to obtain the mouse ID?
For the session number, I'm not quite sure how to handle the multiple / in the directory name.
percent_correct_list$mouse_id <- sub("/.+", "", percent_correct_list$rn)
#gives me D:Mouse_2174
percent_correct_list$mouse_id <- sub("+._", "", percent_correct_list$mouse_id)
#gives me D:Mous2174
Here is sample code for the directories:
df <- data.frame(
rn = c("D:Mouse_2174/iti_intervals/9/trialsummary.txt.1",
"D:Mouse_2181/iti_intervals/33/trialsummary.txt.1",
"D:Mouse_2183/iti_intervals/107/trialsummary.txt.2",
"D:Mouse_2185/iti_intervals/87/trialsummary.txt.1")
)
What I want:
rn    id    session
D:..  2174  9
D:..  2181  33
D:..  2183  107
D:..  2185  87
Maybe there's some way to do this earlier along in the process too (like when I import all the data into a df using lapply - but this is good as well)
This certainly isn't an elegant solution, and it only works if your ID and session are always numbers...
library(dplyr)
library(tidyr)
df <- data.frame(
  rn = c("D:Mouse_2174/iti_intervals/9/trialsummary.txt.1",
         "D:Mouse_2181/iti_intervals/33/trialsummary.txt.1",
         "D:Mouse_2183/iti_intervals/107/trialsummary.txt.2",
         "D:Mouse_2185/iti_intervals/87/trialsummary.txt.1")) %>%
  # Extract all numeric values from the string
  mutate(allnums = regmatches(rn, gregexpr("[[:digit:]]+", rn))) %>%
  # Separate them
  separate(allnums, into = c("id", "session", "idk"), sep = "\\,") %>%
  # Extract them individually
  mutate(id = as.numeric(regmatches(id, gregexpr("[[:digit:]]+", id))),
         session = as.numeric(regmatches(session, gregexpr("[[:digit:]]+", session)))) %>%
  select(-idk)
Output:
1 D:Mouse_2174/iti_intervals/9/trialsummary.txt.1 2174 9
2 D:Mouse_2181/iti_intervals/33/trialsummary.txt.1 2181 33
3 D:Mouse_2183/iti_intervals/107/trialsummary.txt.2 2183 107
4 D:Mouse_2185/iti_intervals/87/trialsummary.txt.1 2185 87
Here's a somewhat long-winded solution, using tidyr::separate. Perhaps there is something more concise/elegant.
It does assume that all values of rn take the same format.
library(dplyr)
library(tidyr)
new_df <- df %>%
  # separate on / into 4 new columns
  separate(rn, into = c(paste0("item", 1:4)), sep = "/", remove = FALSE) %>%
  # remove unwanted columns
  select(-item2, -item4) %>%
  # separate again on _ into 2 new columns
  separate(item1, sep = "_", into = c("prefix", "id")) %>%
  # retain and rename desired columns
  select(rn, id, session = item3)
Result:
rn id session
1 D:Mouse_2174/iti_intervals/9/trialsummary.txt.1 2174 9
2 D:Mouse_2181/iti_intervals/33/trialsummary.txt.1 2181 33
3 D:Mouse_2183/iti_intervals/107/trialsummary.txt.2 2183 107
4 D:Mouse_2185/iti_intervals/87/trialsummary.txt.1 2185 87
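Since the question started from sub(), here is a base R sketch that stays with that approach and uses a capture group for each piece. It assumes every rn has the form shown in the sample data (a prefix ending in "_<id>", then a folder name, then the session number as the third "/"-separated field); the two patterns are my own and would need adjusting for other layouts.

```r
df <- data.frame(
  rn = c("D:Mouse_2174/iti_intervals/9/trialsummary.txt.1",
         "D:Mouse_2181/iti_intervals/33/trialsummary.txt.1",
         "D:Mouse_2183/iti_intervals/107/trialsummary.txt.2",
         "D:Mouse_2185/iti_intervals/87/trialsummary.txt.1"),
  stringsAsFactors = FALSE
)

# mouse ID: the digits between the first "_" and the first "/"
df$id <- as.numeric(sub("^[^_]*_([0-9]+)/.*$", "\\1", df$rn))

# session: the digits that make up the third "/"-separated field
df$session <- as.numeric(sub("^[^/]*/[^/]*/([0-9]+)/.*$", "\\1", df$rn))
```

The "\\1" replacement swaps the whole string for whatever the parentheses captured, which is why nothing before "_" or after "/" is left over.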

Recode values based on look up table with dplyr (R)

A relatively trivial question that has been bothering me for a while, but to which I have not yet found an answer - perhaps because I have trouble verbalizing the problem for search engines.
Here is a column of a data frame that contains identifiers.
data <- data.frame("id" = c("D78", "L30", "F02", "A23", "B45", "T01", "Q38", "S30", "K84", "O04", "P12", "Z33"))
Based on a lookup table, outdated identifiers are to be recoded into new ones. Here is an example look up table.
recode_table <- data.frame("old" = c("A23", "B45", "K84", "Z33"),
"new" = c("A24", "B46", "K88", "Z33"))
What I need now can be done with a merge or a loop. Here a loop example:
for(ID in recode_table$old) {
data[data$id == ID, "id"] <- recode_table[recode_table$old == ID, "new"]
}
But I am looking for a dplyr solution that does not use the "join" family. I would like something like this:
data <- mutate(data, id = ifelse(id %in% recode_table$old, filter(recode_table, old == id) %>% pull(new), id))
Obviously though, I can't use the column name ("id") of the table in order to identify the new ID.
References to corresponding passages in documentations or manuals are also appreciated. Thanks in advance!
You can use recode with unquote splicing (!!!) on a named vector
library(dplyr)
# vector of new IDs
recode_vec <- recode_table$new
# named with old IDs
names(recode_vec) <- recode_table$old
data %>%
mutate(id = recode(id, !!!recode_vec))
# id
# 1 D78
# 2 L30
# 3 F02
# 4 A24
# 5 B46
# 6 T01
# 7 Q38
# 8 S30
# 9 K88
# 10 O04
# 11 P12
# 12 Z33
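If you would rather not build a named vector, a similar result can probably be had with match() inside mutate(). This is only a sketch of an alternative, not part of the answer above: match() looks up each id in the old column, and coalesce() falls back to the original id wherever there was no match.

```r
library(dplyr)

data <- data.frame(
  id = c("D78", "L30", "F02", "A23", "B45", "T01",
         "Q38", "S30", "K84", "O04", "P12", "Z33"),
  stringsAsFactors = FALSE
)
recode_table <- data.frame(
  old = c("A23", "B45", "K84", "Z33"),
  new = c("A24", "B46", "K88", "Z33"),
  stringsAsFactors = FALSE
)

data <- data %>%
  # new[match(...)] is NA for ids that are not in the lookup table,
  # so coalesce() keeps the original id in those rows
  mutate(id = coalesce(recode_table$new[match(id, recode_table$old)], id))
```

This assumes both id and new are character columns; with factors you would want to convert with as.character() first.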

How to count and plot a cumulative number over a date range by groups

I want to find the best way to plot a chart showing the cumulative number of individuals in a group based on the date they came into the group as well as the date they may have left the group. This would be within the minimum and maximum date ranges of the date values. Each row is a person.
group_id Date_started Date_exit
1 2005-06-23 NA
1 2013-03-17 2013-09-20
2 2019-10-24 NA
3 2019-11-27 2019-11-27
4 2019-08-14 NA
3 2018-10-17 NA
4 2018-04-13 2019-10-12
1 2019-07-10 NA
I've considered creating a new data frame with a row per day within the min/max range and then applying some kind of function to tally the groups totals for each row (adding and subtracting from a running total based on whether or not there is a new value in either of the columns) but I'm not sure if one, that's the best way to approach the problem and two, how to practically run the cumulative count function either.
Ultimately though I want to be able to plot this as a line chart so I can see the trends over time for each group as I suspect one or more of them are much more volatile in terms of overall numbers. So again I'm not sure if ggplot2 has something already in place to handle this.
As you mentioned, you will need to create a dataframe with the desired dates and count, for each group, how many individuals are in the group.
I quickly put this together, so I'm sure there's a more optimal solution, but it should be what you're looking for.
library(ggplot2)
library(reshape2) # for melt
# your data
test <- read.table(
text =
"group_id,Date_started,Date_exit
1,2005-06-23,NA
1,2013-03-17,2013-09-20
2,2019-10-24,NA
3,2019-11-27,2019-11-27
4,2019-08-14,NA
3,2018-10-17,NA
4,2018-04-13,2019-10-12
1,2019-07-10,NA",
h = T, sep = ",", stringsAsFactors = F
)
# make date series
from <- min(as.POSIXct(test$Date_started))
to <- max(as.POSIXct(test$Date_started))
datebins <- seq(from, to, by = "1 month")
d_between <- function(d, ds, de){
  if(ds <= d & (de > d | is.na(de)))
    return(TRUE)
  return(FALSE)
}
# make df to plot
df <- data.frame(dates = datebins)
df[,paste0("g", unique(test$group_id))] <- 0
for(i in seq_len(nrow(df))){
  for(j in seq_len(nrow(test))){
    gid <- paste0("g", test$group_id[j])
    df[i, gid] <- df[i, gid] + d_between(df$dates[i], test$Date_started[j], test$Date_exit[j])
  }
}
# plot
ggplot(melt(df, id.vars = "dates"), aes(dates, value, color = variable)) +
  geom_line(size = 1) + theme_bw()
This gives:
Feel free to play with the date bins (in seq()) as necessary.
EDIT : for loop explanation
for(i in seq_len(nrow(df))){
  for(j in seq_len(nrow(test))){
    gid <- paste0("g", test$group_id[j])
    df[i, gid] <- df[i, gid] + d_between(df$dates[i], test$Date_started[j], test$Date_exit[j])
  }
}
The first loop iterates over the chosen dates.
For each date, go through the dataframe of interest (test) with the second for loop and use the custom d_between() function to determine whether or not an individual is part of the group. That function returns a boolean (which can translate to 0/1). The value 0 or 1 is then added to the df dataframe's column corresponding to the appropriate group (with gid) at the date we checked (row i).
Note that I consider individuals part of the group as soon as they join (ds <= d), but no longer part of it on the day they leave (de > d).
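The same membership rule can also be written without the nested loops. The following is a rough base R sketch, not the code from the answer: count_members() and counts are names I made up, the dates are stored as Date rather than POSIXct, and the result is in long format (one row per date and group).

```r
test <- data.frame(
  group_id = c(1, 1, 2, 3, 4, 3, 4, 1),
  Date_started = as.Date(c("2005-06-23", "2013-03-17", "2019-10-24", "2019-11-27",
                           "2019-08-14", "2018-10-17", "2018-04-13", "2019-07-10")),
  Date_exit = as.Date(c(NA, "2013-09-20", NA, "2019-11-27",
                        NA, NA, "2019-10-12", NA))
)

# number of people in group g on day d:
# started on or before d, and either never left or left after d
count_members <- function(d, g) {
  sum(test$group_id == g &
        test$Date_started <= d &
        (is.na(test$Date_exit) | test$Date_exit > d))
}

# one row per (monthly date, group) combination
dates  <- seq(min(test$Date_started), max(test$Date_started), by = "month")
groups <- sort(unique(test$group_id))
counts <- expand.grid(date = dates, group_id = groups)
counts$n <- vapply(
  seq_len(nrow(counts)),
  function(i) count_members(counts$date[i], counts$group_id[i]),
  integer(1)
)
```

Since counts is already in long format, something like ggplot(counts, aes(date, n, color = factor(group_id))) + geom_line() should plot it without a melt() step.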

How to align Dates from different columns in R or Excel

I have a dataset of variables with each variable having a different date span. They are presented as in the following example (taken the first two cases out of 500):
DatesV1 DatesV2
29/12/1995 19/07/2001
02/01/1996 20/07/2001
03/01/1996 23/07/2001
04/01/1996 24/07/2001
05/01/1996 25/07/2001
08/01/1996 26/07/2001
09/01/1996 27/07/2001
10/01/1996 30/07/2001
11/01/1996 31/07/2001
12/01/1996 01/08/2001
15/01/1996 02/08/2001
16/01/1996 03/08/2001
17/01/1996 06/08/2001
What I want to happen is for the dates in DatesV2 to align with the dates in DatesV1. This means that DatesV2 will start with a few NA until the row that the dates align. Like this:
DatesV1 DatesV2
... ...
17/07/2001 NA
18/07/2001 NA
19/07/2001 19/07/2001
20/07/2001 20/07/2001
... ...
In the Example Set, I have the example of exactly what I am trying to do. I can't find a fast computational way to do it in either R or Excel for the 500 variables that I have.
Example Set
I have tried something like this:
nhat<-which(Example$DatesV2[1]==Example$DatesV1)
nend<-which(Example$DatesV1[length(Example$DatesV1)-1]==Example$DatesV2)
Example$Apotelesma<- c(rep(NA,nhat-1),Example$DatesV2[1:nend],NA)
This is an initial test that works for two date columns. The only problem is that the dates appear as numbers.
Here's a possible solution using some re-shaping. I'm using a simple example:
df = data.frame(DatesV1 = c("24/07/2001","25/07/2001","26/07/2001"),
DatesV2 = c("25/07/2001","26/07/2001","27/07/2001"),
DatesV3 = c("26/07/2001","27/07/2001","28/07/2001"),
stringsAsFactors = F)
library(tidyverse)
library(lubridate)
# update to date columns (only if needed)
df = df %>% mutate_all(dmy)
df %>%
  gather() %>%             # reshape dataset
  mutate(id = value) %>%   # use date values as row ids
  spread(key, value) %>%   # reshape again
  select(-id)              # remove ids
# DatesV1 DatesV2 DatesV3
# 1 2001-07-24 <NA> <NA>
# 2 2001-07-25 2001-07-25 <NA>
# 3 2001-07-26 2001-07-26 2001-07-26
# 4 <NA> 2001-07-27 2001-07-27
# 5 <NA> <NA> 2001-07-28
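For comparison, the same alignment can be sketched in base R by building one master vector of all dates and then using match() to place each column's dates on the matching rows. This is just an illustration on the small example above; all_dates and aligned are names I picked, and it assumes each column holds day/month/year strings with no repeated dates inside a column.

```r
df <- data.frame(
  DatesV1 = c("24/07/2001", "25/07/2001", "26/07/2001"),
  DatesV2 = c("25/07/2001", "26/07/2001", "27/07/2001"),
  DatesV3 = c("26/07/2001", "27/07/2001", "28/07/2001"),
  stringsAsFactors = FALSE
)

# parse every column into Date
dates <- lapply(df, as.Date, format = "%d/%m/%Y")

# master vector: every date that occurs in any column, sorted
all_dates <- sort(unique(do.call(c, unname(dates))))

# for each column, look up the master dates; rows without a match become NA
aligned <- as.data.frame(lapply(dates, function(x) x[match(all_dates, x)]))
```

Keeping the columns as Date also avoids the "dates appear as numbers" problem from the question, since the class is never dropped.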
This is an ugly/messy method if you like, but it gets the job done. Any faster and tidier way would be better.
n<-nrow(DataAlignment)
Newdata<-matrix(0,n,ncol(DataAlignment))
loops<-ncol(DataAlignment)-1
for(i in 1:loops){
  nhat<-which(DataAlignment[1,i+1]==DataAlignment[,1]) #finds the position of the first date in column i+1 according to the first column
  nend<-which(DataAlignment[n,1]==DataAlignment[,i+1]) #finds the position of the last date of the first column in column i+1
  if(nhat==1 && nend!=n){ #takes into account when they start at the same time but end on different dates
    Newdata[,i+1]<-c(DataAlignment[c(1:nend),i+1],rep(NA,n-nend))
  } else if(nhat==1 && nend==n){ #takes into account when they start and end at the same time
    Newdata[,i+1]<-c(DataAlignment[,i+1])
  } else { #the column starts later, so pad the top with NA
    Newdata[,i+1]<-c(rep(NA,nhat-1),DataAlignment[c(1:nend),i+1]) #creates the new data
  }
}

How to join data frames based on condition between 2 columns

I am stuck with a project where I need to merge two data frames. They look something like this:
Data1
Traffic Source Registrations Hour Minute
organic 1 6 13
social 1 8 54
Data2
Email Hour2 Minute2
test#domain.com 6 13
test2#domain2.com 8 55
I have the following line of code to merge the 2 data frames:
merge.df <- merge(Data1, Data2, by.x = c( "Hour", "Minute"),
by.y = c( "Hour2", "Minute2"))
It would work great if the variable time (hours & minutes) wasn't slightly off between the two data sets. Is there a way to make the column "Minute" match with "Minute2" if it's + or - one minute off?
I thought I could create 2 new columns for data set one:
Data1
Traffic Source Registrations Hour Minute Minute_plus1 Minute_minus1
organic 1 6 13 14 12
social 1 8 54 55 53
Is it possible to merge the 2 data frames if "Minute2" matches any variable from either "Minute", "Minute_plus1", or "Minute_minus1"? Or is there a more efficient way to accomplish this merge?
For stuff like this I usually turn to SQL:
library(sqldf)
x = sqldf("
SELECT *
FROM Data1 d1 JOIN Data2 d2
ON d1.Hour = d2.Hour2
AND ABS(d1.Minute - d2.Minute2) <= 1
")
Depending on the size of your data, you could also just join on Hour and then filter. Using dplyr:
library(dplyr)
x = Data1 %>%
  left_join(Data2, by = c("Hour" = "Hour2")) %>%
  filter(abs(Minute - Minute2) <= 1)
though you could do the same thing with base functions.
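For completeness, a base R version of the "join on Hour, then filter" idea might look roughly like this. The data frames are retyped from the question (with Traffic.Source as my guess at a syntactic column name), so treat it as a sketch rather than a drop-in:

```r
Data1 <- data.frame(
  Traffic.Source = c("organic", "social"),
  Registrations = c(1, 1),
  Hour = c(6, 8),
  Minute = c(13, 54),
  stringsAsFactors = FALSE
)
Data2 <- data.frame(
  Email = c("test#domain.com", "test2#domain2.com"),
  Hour2 = c(6, 8),
  Minute2 = c(13, 55),
  stringsAsFactors = FALSE
)

# join every pair of rows that share the same hour
joined <- merge(Data1, Data2, by.x = "Hour", by.y = "Hour2")

# keep only the pairs whose minutes are at most one apart
x <- joined[abs(joined$Minute - joined$Minute2) <= 1, ]
```

Like the versions above, this joins on Hour first, so a pair such as 6:59 and 7:00 would not be matched. If that matters, comparing Hour * 60 + Minute on both sides instead would be one way around it.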
