I have a patient data set, and for each ID I need to drop the rows that come after the first occurrence of 1 in the Disease column. For instance:
ID Date Disease
123 02-03-2012 0
123 03-03-2013 1
123 04-03-2014 0
321 03-03-2015 1
423 06-06-2016 1
423 07-06-2017 1
543 08-05-2018 1
543 09-06-2019 0
645 08-09-2019 0
and the expected output I want is
ID Date Disease
123 02-03-2012 0
123 03-03-2013 1
321 03-03-2015 1
423 06-06-2016 1
543 08-05-2018 1
One way with dplyr is to select rows up to the first occurrence of 1 for each ID. IDs with no 1 at all are dropped, because which(Disease == 1)[1] is NA for them.
library(dplyr)
df %>% group_by(ID) %>% filter(row_number() <= which(Disease == 1)[1])
# ID Date Disease
# <int> <fct> <int>
#1 123 02-03-2012 0
#2 123 03-03-2013 1
#3 321 03-03-2015 1
#4 423 06-06-2016 1
#5 543 08-05-2018 1
We can also use slice
df %>% group_by(ID) %>% slice(if(any(Disease == 1)) 1:which.max(Disease) else 0)
data
df <- structure(list(ID = c(123L, 123L, 123L, 321L, 423L, 423L, 543L,
543L, 645L), Date = structure(c(1L, 2L, 4L, 3L, 5L, 6L, 7L, 9L,
8L), .Label = c("02-03-2012", "03-03-2013", "03-03-2015", "04-03-2014",
"06-06-2016", "07-06-2017", "08-05-2018", "08-09-2019", "09-06-2019"
), class = "factor"), Disease = c(0L, 1L, 0L, 1L, 1L, 1L, 1L,
0L, 0L)), class = "data.frame", row.names = c(NA, -9L))
I am not sure why the last line, 645 08-09-2019 0, is missing from your expected result. Disease has not occurred yet for ID 645, so I guess you might have left it out by mistake.
Based on that guess, you can try the base R solution below, using subset + ave. Note that for an ID with several rows and no 1 at all, it keeps only the first row.
dfout <- subset(df,!!ave(Disease,ID,FUN = function(v) !duplicated(cumsum(v)>0)))
such that
> dfout
ID Date Disease
1 123 02-03-2012 0
2 123 03-03-2013 1
4 321 03-03-2015 1
5 423 06-06-2016 1
7 543 08-05-2018 1
9 645 08-09-2019 0
DATA
df <- structure(list(ID = c(123L, 123L, 123L, 321L, 423L, 423L, 543L,
543L, 645L), Date = c("02-03-2012", "03-03-2013", "04-03-2014",
"03-03-2015", "06-06-2016", "07-06-2017", "08-05-2018", "09-06-2019",
"08-09-2019"), Disease = c(0L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 0L
)), class = "data.frame", row.names = c(NA, -9L))
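The two readings of the question differ only in what happens to an ID that never has Disease == 1. A minimal base R sketch of both readings follows. The helper keep_until_first and its keep_if_never flag are hypothetical names introduced here, and Disease is assumed to contain only 0 and 1.

```r
# Same ID and Disease values as the example data above (Date omitted for brevity).
df <- data.frame(
  ID      = c(123L, 123L, 123L, 321L, 423L, 423L, 543L, 543L, 645L),
  Disease = c(0L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 0L)
)

# Hypothetical helper: TRUE for rows up to and including the first 1 per id.
# keep_if_never controls whether ids without any 1 are kept (assumption).
keep_until_first <- function(disease, id, keep_if_never = TRUE) {
  # number of 1s seen strictly before each row, within each id
  seen_before <- ave(disease, id, FUN = function(v) cumsum(v) - v)
  # does this id have a 1 anywhere?
  has_any <- ave(disease, id, FUN = max) > 0
  seen_before == 0 & (keep_if_never | has_any)
}

with_645    <- df[keep_until_first(df$Disease, df$ID), ]
without_645 <- df[keep_until_first(df$Disease, df$ID, keep_if_never = FALSE), ]
```

With keep_if_never = TRUE the row for ID 645 is retained; with FALSE it is dropped, which matches the expected output in the question.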
Related
I'm new to R. I have a set of patient IDs with Disease status, and for each ID I want to drop the rows after the first occurrence of Disease status 1. My data set looks like
ID Date Disease
123 02-03-2012 0
123 03-03-2013 1
123 04-03-2014 0
321 03-03-2015 1
423 06-06-2016 1
423 07-06-2017 1
543 08-05-2018 1
543 09-06-2019 0
645 08-09-2019 0
645 10-10-2018 0
645 11-10-2012 0
Expected Output
ID Date Disease
123 02-03-2012 0
123 03-03-2013 1
321 03-03-2015 1
423 06-06-2016 1
543 08-05-2018 1
645 08-09-2019 0
645 10-10-2018 0
645 11-10-2012 0
Kindly suggest a code that returns the expected output.
Thanks in Advance!
Using dplyr, one way is to keep all rows of an ID if no Disease == 1 occurs in it, and otherwise keep rows only up to the first 1.
library(dplyr)
df %>%
group_by(ID) %>%
filter(if(any(Disease == 1)) row_number() <= match(1, Disease) else TRUE)
# ID Date Disease
# <int> <chr> <int>
#1 123 02-03-2012 0
#2 123 03-03-2013 1
#3 321 03-03-2015 1
#4 423 06-06-2016 1
#5 543 08-05-2018 1
#6 645 08-09-2019 0
#7 645 10-10-2018 0
#8 645 11-10-2012 0
data
df <- structure(list(ID = c(123L, 123L, 123L, 321L, 423L, 423L, 543L,
543L, 645L, 645L, 645L), Date = c("02-03-2012", "03-03-2013",
"04-03-2014", "03-03-2015", "06-06-2016", "07-06-2017", "08-05-2018",
"09-06-2019", "08-09-2019", "10-10-2018", "11-10-2012"), Disease = c(0L,
1L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L)), class = "data.frame", row.names = c(NA,
-11L))
This would do it:
library(dplyr)
set.seed(1012)
datas <- tibble(ids = rep(1:3, each = 3),
times = runif(9, 0, 100),
event = rep(c(0, 1, 0), 3)) %>%
arrange(ids, times)
# keep rows up to and including the first event within each id
datas %>%
group_by(ids) %>%
filter(lag(cumsum(event), default = 0) == 0)
We can use cumsum to create a logical vector for subsetting
library(data.table)
setDT(df)[df[, .I[cumsum(cumsum(Disease)) <= 1], ID]$V1]
# ID Date Disease
#1: 123 02-03-2012 0
#2: 123 03-03-2013 1
#3: 321 03-03-2015 1
#4: 423 06-06-2016 1
#5: 543 08-05-2018 1
#6: 645 08-09-2019 0
#7: 645 10-10-2018 0
#8: 645 11-10-2012 0
Or using dplyr
library(dplyr)
df %>%
group_by(ID) %>%
filter(cumsum(cumsum(Disease)) <=1)
data
df <- structure(list(ID = c(123L, 123L, 123L, 321L, 423L, 423L, 543L,
543L, 645L, 645L, 645L), Date = c("02-03-2012", "03-03-2013",
"04-03-2014", "03-03-2015", "06-06-2016", "07-06-2017", "08-05-2018",
"09-06-2019", "08-09-2019", "10-10-2018", "11-10-2012"), Disease = c(0L,
1L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L)), class = "data.frame",
row.names = c(NA,
-11L))
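To see why the double cumsum works, it helps to trace it on a single group. This is only an illustrative sketch on a made-up vector (disease is not part of the original data), and it assumes the column holds only 0 and 1.

```r
# One hypothetical group's Disease values
disease <- c(0L, 0L, 1L, 0L, 1L)

first_pass  <- cumsum(disease)      # 0 0 1 1 2: running count of 1s
second_pass <- cumsum(first_pass)   # 0 0 1 2 4: exceeds 1 right after the first 1
keep        <- second_pass <= 1     # TRUE TRUE TRUE FALSE FALSE

disease[keep]                       # 0 0 1
```

The first cumsum stays at 0 until the first 1 appears. The second one is still 1 on that row and grows past 1 on every later row, so the comparison keeps everything up to and including the first 1. A group with no 1 at all stays at 0 throughout and is kept whole.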
I have been given a table of data to analyse.
I have a column with start times and a column with finish times given in 3 or 4 digits, example: 3:40pm is 1540, 7:25am is 725.
How can I obtain a new column in my table with the difference of these times, given in minutes?
There are over 2000 entries.
Thanks for any help.
structure(list(schedtime = c(1455L, 1640L, 1245L, 1715L, 1039L,
840L), deptime = c(1455L, 1640L, 1245L, 1709L, 1035L, 839L),
distance = c(184L, 213L, 229L, 229L, 229L, 228L), flightnumber =
c(5935L,
6155L, 7208L, 7215L, 7792L, 7800L), weather = c(0L, 0L, 0L,
0L, 0L, 0L), dayweek = c(4L, 4L, 4L, 4L, 4L, 4L), daymonth = c(1L,
1L, 1L, 1L, 1L, 1L)), row.names = c(NA, 6L), class = "data.frame")
This is an example of the dataset (I'm unsure as to why there are L's after each number; these are not shown in the table). I want, in minutes, deptime (finish) - schedtime (start).
Given the new column, there are 2 values with a schedtime before midnight and a deptime after midnight, for example schedtime 2120 and deptime 0010. The answer to this is given as -1270, considering it to be an extremely early departure. How could I change this to be calculated as +170, a late departure?
An efficient way for a larger dataset, using data.table together with minutes() from lubridate:
library(lubridate)
data.table::setDT(dt)[,time_diff:=minutes(deptime-schedtime)]
> dt
schedtime deptime distance flightnumber weather dayweek daymonth time_diff
1: 1455 1455 184 5935 0 4 1 0S
2: 1640 1640 213 6155 0 4 1 0S
3: 1245 1245 229 7208 0 4 1 0S
4: 1715 1709 229 7215 0 4 1 -6M 0S
5: 1039 1035 229 7792 0 4 1 -4M 0S
6: 840 839 228 7800 0 4 1 -1M 0S
EDIT: to handle cases like 1730 - 1600, where plain subtraction gives 130 although the real difference is 90 minutes, parse the values as times first:
library(data.table)
library(stringr)
setDT(dt)
dt[,schedtime:=str_pad(schedtime, 4, pad = "0")]
dt[,deptime:=str_pad(deptime, 4, pad = "0")]
dt[,time_diff:=difftime(as.ITime(strptime(x = schedtime, format = "%H%M")),as.ITime(strptime(x = deptime, format = "%H%M")),units = "mins")]
> dt
schedtime deptime distance flightnumber weather dayweek daymonth time_diff
1: 1455 1455 184 5935 0 4 1 0 mins
2: 1640 1640 213 6155 0 4 1 0 mins
3: 1245 1245 229 7208 0 4 1 0 mins
4: 1715 1709 229 7215 0 4 1 6 mins
5: 1039 1035 229 7792 0 4 1 4 mins
6: 1730 1600 228 7800 0 4 1 90 mins
dat <- data.frame(c(1540,1820,1330,545,100),c(1850,2150,2325,1330,101))
60*(floor(dat[,2]/100) - floor(dat[,1]/100)) - dat[,1] %% 100 + dat[,2] %% 100
Taking the floor of the value divided by 100 gives the hours. Taking the difference of the hours and multiplying by 60 gives the minutes contributed by the hours. Then you subtract the minutes of the start time and add the minutes of the end time to get the total minutes passed.
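The same arithmetic can be wrapped in a small helper that also covers the midnight case raised in the question. This is a sketch only: hhmm_to_minutes, minutes_between and the wrap_below threshold are hypothetical names, and treating any difference below -12 hours as a departure on the following day is an assumption, not something the data can confirm.

```r
# Convert an HHMM integer (e.g. 725 or 1540) to minutes since midnight.
hhmm_to_minutes <- function(x) (x %/% 100) * 60 + x %% 100

# Signed difference end - start in minutes.
# Differences below wrap_below are assumed to cross midnight (assumption),
# so one full day (1440 minutes) is added to them.
minutes_between <- function(start, end, wrap_below = -720) {
  d <- hhmm_to_minutes(end) - hhmm_to_minutes(start)
  ifelse(d < wrap_below, d + 1440, d)
}

minutes_between(c(1715, 1600, 2120), c(1709, 1730, 10))
# -6 90 170
```

The last pair is the schedtime 2120 / deptime 0010 example from the question: the raw difference is -1270 minutes, which the wrap turns into +170.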
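You can use the lubridate package, which provides very good functionality for time-related data, to express the difference in minutes. Note that subtracting the raw HHMM integers is only correct when both times fall within the same hour (1730 - 1600 gives 130, not 90). Hope this helps.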
library(lubridate)
df$deptime_new <- minutes(df$deptime-df$schedtime)
Data
df <- structure(list(schedtime = c(1455L, 1640L, 1245L, 1715L, 1039L,
840L), deptime = c(1455L, 1640L, 1245L, 1709L, 1035L, 839L),
distance = c(184L, 213L, 229L, 229L, 229L, 228L), flightnumber =
c(5935L,
6155L, 7208L, 7215L, 7792L, 7800L), weather = c(0L, 0L, 0L,
0L, 0L, 0L), dayweek = c(4L, 4L, 4L, 4L, 4L, 4L), daymonth = c(1L,
1L, 1L, 1L, 1L, 1L)), row.names = c(NA, 6L), class = "data.frame")
I have the same query: is there a way to calculate the difference between times in a column and display the answers in a new column, in minutes?
I have the following data
mydat=structure(list(group = c(111L, 111L, 111L, 111L, 111L, 111L,
111L, 333L, 333L, 333L, 333L, 333L, 333L, 333L, 555L, 555L, 555L,
555L, 555L, 555L, 555L), group2 = c(222L, 222L, 222L, 222L, 222L,
222L, 222L, 444L, 444L, 444L, 444L, 444L, 444L, 444L, 666L, 666L,
666L, 666L, 666L, 666L, 666L), action = c(0L, 0L, 0L, 1L, 1L,
0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L
), x1 = c(1L, 2L, 3L, 0L, 0L, 1L, 2L, 1L, 2L, 3L, 0L, 0L, 1L,
2L, 1L, 2L, 3L, 10L, 20L, 1L, 2L)), .Names = c("group", "group2",
"action", "x1"), class = "data.frame", row.names = c(NA, -21L
))
There are two grouping variables (group and group2),
and three groups:
111 222
333 444
555 666
The action column can only take the values 0 and 1.
I need to find the groups where every row with action == 1 has x1 equal to zero.
In our case these are
111 222
333 444
because all of their action == 1 rows have x1 equal to zero.
So I can only work with the 555 666 group,
because it has at least one non-zero x1 value among its action == 1 rows.
The desired output:
mydat1: the groups with at least one non-zero x1 value among the action == 1 rows.
group group2 action x1
555 666 0 1
555 666 0 2
555 666 0 3
555 666 1 **10**
555 666 1 **20**
555 666 0 1
555 666 0 2
mydat2: the groups where all action == 1 rows have x1 equal to zero.
group group2 action x1
111 222 0 1
111 222 0 2
111 222 0 3
111 222 1 **0**
111 222 1 **0**
111 222 0 1
111 222 0 2
333 444 0 1
333 444 0 2
333 444 0 3
333 444 1 **0**
333 444 1 **0**
333 444 0 1
333 444 0 2
If I understand you correctly, your question is:
i need find these groups where for 1 category of action they have
only zero values by x1.
So here is the answer:
library(tidyverse)
mydat %>%
group_by( action ) %>%
filter( action==1 & x1==0 )
and the response is:
group group2 action x1
<int> <int> <int> <int>
1 111 222 1 0
2 111 222 1 0
3 333 444 1 0
4 333 444 1 0
What does this code do?
It groups the rows by action (0 or 1) and then keeps only the rows that satisfy action==1 & x1==0, that is, the action==1 rows whose x1 is also 0.
Can the script return all values of the 555+666 group?
No. It does not return that group, and it should not. Let's write code that filters for 555 and 666:
library(tidyverse)
mydat %>%
group_by( action ) %>%
filter( group==555 | group2==666 )
and the result is:
group group2 action x1
<int> <int> <int> <int>
1 555 666 0 1
2 555 666 0 2
3 555 666 0 3
4 555 666 1 10
5 555 666 1 20
6 555 666 0 1
7 555 666 0 2
So, as you can see, none of these observations fulfils the condition action==1 & x1==0. Therefore, they are not part of the result above.
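For the split into mydat1 and mydat2 that the question describes, a group-level flag is needed rather than a row-level filter. A minimal sketch in base R, so that it runs without extra packages; has_nonzero is a hypothetical helper name, and mydat is rebuilt here with the same values as in the question:

```r
# Same values as the question's mydat
mydat <- data.frame(
  group  = rep(c(111L, 333L, 555L), each = 7),
  group2 = rep(c(222L, 444L, 666L), each = 7),
  action = rep(c(0L, 0L, 0L, 1L, 1L, 0L, 0L), times = 3),
  x1     = c(1L, 2L, 3L, 0L, 0L, 1L, 2L,
             1L, 2L, 3L, 0L, 0L, 1L, 2L,
             1L, 2L, 3L, 10L, 20L, 1L, 2L)
)

# TRUE on every row of a (group, group2) pair that has at least one
# action == 1 row with a non-zero x1
has_nonzero <- ave(mydat$action == 1 & mydat$x1 != 0,
                   mydat$group, mydat$group2, FUN = any)

mydat1 <- mydat[has_nonzero, ]   # groups with a non-zero x1 under action == 1
mydat2 <- mydat[!has_nonzero, ]  # groups where all action == 1 rows have x1 == 0
```

Because the flag is computed per group and then repeated on each row, whole groups move together: here mydat1 holds the seven rows of 555 666 and mydat2 the fourteen rows of the other two groups.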
I have a dataset of game sessions (id, count of sessions, average session duration in seconds, and session date for each id).
Here is a sample of mydat:
mydat=read.csv("C:/Users/Admin/desktop/rty.csv", sep=";",dec=",")
structure(list(udid = c(74385162L, 79599601L, 79599601L, 91475825L,
91475825L, 91492531L, 92137561L, 96308016L, 96308016L, 96308016L,
96308016L, 96308016L, 96495076L, 97135620L, 97135620L, 97135620L,
97135620L, 97135620L, 97135620L, 97135620L, 97135620L, 97135620L,
97135620L, 97165942L), count = c(1L, 1L, 1L, 1L, 3L, 1L, 1L,
2L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), avg_duration = c(39L, 1216L, 568L, 5L, 6L, 79L, 9L, 426L,
78L, 884L, 785L, 785L, 22L, 302L, 738L, 280L, 2782L, 5L, 2284L,
144L, 234L, 231L, 539L, 450L), date = structure(c(13L, 3L, 3L,
1L, 1L, 14L, 2L, 11L, 11L, 11L, 12L, 12L, 9L, 7L, 4L, 4L, 5L,
6L, 8L, 8L, 8L, 8L, 8L, 10L), .Label = c("11.10.16", "12.12.16",
"15.11.16", "15.12.16", "16.12.16", "17.12.16", "18.10.16", "18.12.16",
"21.10.16", "26.10.16", "28.11.16", "29.11.16", "31.10.16", "8.10.16"
), class = "factor")), .Names = c("udid", "count", "avg_duration",
"date"), class = "data.frame", row.names = c(NA, -24L))
I need to calculate the time difference between the first date a player appeared and the last date he was seen.
For example, udid 97135620 first played on 18.10.2016 and was last seen on 18.12.2016, so the difference between the first and last day is 60.9 days,
while udid 74385162 started on 31.10.2016 and did not play afterwards (i.e. he played only once), so the difference between the first and last date is 0.
udid 79599601 has two sessions on one day (i.e. he played twice in one day), so the difference is 1.
In the output I expect this format, with only the last date and the difference between the last day and the first day.
udid count avg_duration date datediff
74385162 1 39 31.10.2016 0
79599601 1 568 15.11.2016 1
91475825 1 5 11.10.2016 1
91492531 1 79 08.10.2016 0
92137561 1 9 12.12.2016 0
96308016 1 785 29.11.2016 1
96495076 1 22 21.10.2016 0
97135620 1 539 18.12.2016 61
97165942 1 450 26.10.2016 0
How can I do that?
This function calculates the difference between first and last session, and only returns the date of the last session:
get_datediff <- function (x) {
  dates <- as.Date(as.character(x$date), "%d.%m.%y")
  x <- x[order(dates), ]
  if (length(x$date)==1) {
    x$datediff <- 0
  } else {
    x$datediff <- max(1, diff(range(dates)))
  }
  x[nrow(x), ]
}
This can then be applied to data for each user, making use of dplyr and magrittr packages:
group_by(mydat, udid) %>% do(get_datediff(.))
# A tibble: 9 x 5
# Groups: udid [9]
udid count avg_duration date datediff
<int> <int> <int> <fctr> <dbl>
1 74385162 1 39 31.10.16 0
2 79599601 1 568 15.11.16 1
3 91475825 3 6 11.10.16 1
4 91492531 1 79 8.10.16 0
5 92137561 1 9 12.12.16 0
6 96308016 1 785 29.11.16 1
7 96495076 1 22 21.10.16 0
8 97135620 1 539 18.12.16 61
9 97165942 1 450 26.10.16 0
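The date arithmetic that get_datediff relies on can be checked in isolation. A small sketch using three of the dates recorded for udid 97135620; the variable names are only for illustration:

```r
# Two-digit years in dd.mm.yy strings need the %y format code
dates <- as.Date(c("18.10.16", "15.12.16", "18.12.16"), format = "%d.%m.%y")

# range() gives first and last date, diff() their distance in days
span <- as.numeric(diff(range(dates)))
span
# 61

# Several sessions on the same day give a span of 0, which the function
# above raises to 1 via max(1, ...)
same_day <- as.Date(c("15.11.16", "15.11.16"), format = "%d.%m.%y")
max(1, as.numeric(diff(range(same_day))))
# 1
```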
The way you describe how your metrics are calculated is confusing, but following what you wrote as closely as possible, I ended up with the following (timeData here stands for the data frame from the question):
dplyr solution:
timeData%>%
mutate(dateFormat = as.Date(date, format = "%d.%m.%y"))%>%
group_by(udid)%>%
arrange(udid,dateFormat)%>%
summarise(dateBetween = difftime(last(dateFormat), first(dateFormat), units = "days"), mean(avg_duration))%>%
left_join((timeData%>%
mutate(dateFormat = as.Date(date, format = "%d.%m.%y"))%>%
select(udid, count,dateFormat)%>%
group_by(udid)%>%
slice(which.min(dateFormat))))
Result:
# A tibble: 9 x 5
udid dateBetween `mean(avg_duration)` count dateFormat
<int> <time> <dbl> <int> <date>
1 74385162 0 days 39.0 1 2016-10-31
2 79599601 0 days 892.0 1 2016-11-15
3 91475825 0 days 5.5 1 2016-10-11
4 91492531 0 days 79.0 1 2016-10-08
5 92137561 0 days 9.0 1 2016-12-12
6 96308016 1 days 591.6 1 2016-11-29
7 96495076 0 days 22.0 1 2016-10-21
8 97135620 61 days 753.9 1 2016-12-18
9 97165942 0 days 450.0 1 2016-10-26
I have a bunch of data frames that look like this in R:
print(output[2])
Button Intensity Acc Intensity RT Time tdelta SubjectID CoupleID PrePost
1: 0 30 0 0.0 0 83325.87 0.000 1531 153 Post
2: 1 30 1 13.5 0 83362.65 36.782 1531 153 Post
3: 1 30 1 15.0 0 83376.68 14.027 1531 153 Post
4: 1 30 1 6.0 0 83392.27 15.585 1531 153 Post
5: 1 30 1 15.0 0 83398.77 6.507 1531 153 Post
print(output[1])
[[1]]
Button Intensity Acc Intensity RT Time tdelta SubjectID CoupleID PrePost
1: 0 30 0 0.0 0 77987.93 0.000 1531 153 Pre
2: 1 30 1 13.5 0 78084.57 96.639 1531 153 Pre
3: 1 30 1 15.0 0 78098.62 14.054 1531 153 Pre
4: 1 30 1 6.0 0 78114.13 15.508 1531 153 Pre
5: 1 30 1 15.0 0 78120.67 6.537 1531 153 Pre
I want to combine them into one big data frame that has the following logic and format:
SubjectID CoupleID PrePost Miss1RT Miss2RT Miss3RT Hit1RT Hit2RT Hit3RT
1531 153 Post 0.00 NA NA NA 36.78 14.027
1531 153 Pre 0.00 NA NA NA 96.638 14.054
If Button == 0, then it's a Miss; if it == 1, then it's a Hit. So, it should be something like:
for row in output[i].rows:
if Button == 0:
Miss1RT = tdelta
elif Button == 1:
Miss1RT = 'NA'
and then a flipped version where if Button is 1, Hit[i]RT is tdelta or else 'NA'.
There are 26 lines per data frame and each row is either a hit or a miss so there will be 26 Miss and 26 Hit columns and each SubjectID gets two rows - one for Pre and one for Post. So the column headers for the final output will be:
SubjectID CoupleID PrePost Miss1RT Miss2RT ...Miss26RT Hit1RT Hit2RT ... Hit26RT
I'm new to R and struggling with the proper syntax.
Something like this should work:
library(reshape2) # assumed source of melt() and dcast(); the original answer does not say
#Get data in structure OP has
output <- list(pre, post)
output2 <- lapply(output, function(x) cbind(x, num = paste0(1:nrow(x), "RT")))
pre_post <- do.call("rbind", output2)
#Perform actual calculations
pre_post$miss <- ifelse(pre_post$Button == 0, pre_post$tdelta, NA)
pre_post$hit <- ifelse(pre_post$Button == 1, pre_post$tdelta, NA)
pre_post_melted <- melt(pre_post, id.vars = c("SubjectID", "CoupleID", "num", "PrePost"), measure.vars = c("hit","miss"))
pre_post_res <- dcast(pre_post_melted, SubjectID + CoupleID + PrePost ~ variable + num, sep = "")
pre_post_res
#SubjectID CoupleID PrePost hit_1RT hit_2RT hit_3RT hit_4RT hit_5RT miss_1RT miss_2RT miss_3RT miss_4RT miss_5RT
#1 1531 153 Post NA 36.782 14.027 15.585 6.507 0 NA NA NA NA
#2 1531 153 Pre NA 96.639 14.054 15.508 6.537 0 NA NA NA NA
We transpose the data to dynamically create all the variables we want. We also stack the data to avoid repeated steps.
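The same stack-then-widen idea can be sketched with base R's reshape(), which avoids the extra package. This is only an illustration on a cut-down version of the data (three rows per condition). The long data frame and its num column are constructed here and are not part of the original answer.

```r
# Cut-down stacked data: first three rows of the Pre and Post tables
long <- data.frame(
  SubjectID = 1531L,
  CoupleID  = 153L,
  PrePost   = rep(c("Pre", "Post"), each = 3),
  num       = rep(paste0(1:3, "RT"), times = 2),  # becomes the column suffix
  Button    = c(0L, 1L, 1L, 0L, 1L, 1L),
  tdelta    = c(0, 96.639, 14.054, 0, 36.782, 14.027)
)

# A row is either a miss (Button == 0) or a hit (Button == 1)
long$Miss <- ifelse(long$Button == 0, long$tdelta, NA)
long$Hit  <- ifelse(long$Button == 1, long$tdelta, NA)

# One row per subject and condition, one Miss/Hit column per trial number
wide <- reshape(
  long[, c("SubjectID", "CoupleID", "PrePost", "num", "Miss", "Hit")],
  idvar = c("SubjectID", "CoupleID", "PrePost"),
  timevar = "num",
  direction = "wide",
  sep = ""
)
names(wide)
```

Each value of num is pasted onto Miss and Hit, which gives column names of the form Miss1RT and Hit1RT as requested in the question.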
Data:
pre <- structure(list(Button = c(0L, 1L, 1L, 1L, 1L), Intensity = c(30L,
30L, 30L, 30L, 30L), Acc = c(0L, 1L, 1L, 1L, 1L), Intensity = c(0,
13.5, 15, 6, 15), RT = c(0L, 0L, 0L, 0L, 0L), Time = c(77987.93,
78084.57, 78098.62, 78114.13, 78120.67), tdelta = c(0, 96.639,
14.054, 15.508, 6.537), SubjectID = c(1531L, 1531L, 1531L, 1531L,
1531L), CoupleID = c(153L, 153L, 153L, 153L, 153L), PrePost = c("Pre",
"Pre", "Pre", "Pre", "Pre")), .Names = c("Button", "Intensity",
"Acc", "Intensity", "RT", "Time", "tdelta", "SubjectID", "CoupleID",
"PrePost"), row.names = c(NA, -5L), class = "data.frame")
post <- structure(list(Button = c(0L, 1L, 1L, 1L, 1L), Intensity = c(30L,
30L, 30L, 30L, 30L), Acc = c(0L, 1L, 1L, 1L, 1L), Intensity = c(0,
13.5, 15, 6, 15), RT = c(0L, 0L, 0L, 0L, 0L), Time = c(83325.87,
83362.65, 83376.68, 83392.27, 83398.77), tdelta = c(0, 36.782,
14.027, 15.585, 6.507), SubjectID = c(1531L, 1531L, 1531L, 1531L,
1531L), CoupleID = c(153L, 153L, 153L, 153L, 153L), PrePost = c("Post",
"Post", "Post", "Post", "Post")), .Names = c("Button", "Intensity",
"Acc", "Intensity", "RT", "Time", "tdelta", "SubjectID", "CoupleID",
"PrePost"), row.names = c(NA, -5L), class = "data.frame")