I have a dataset of game sessions (id, number of sessions, average session duration in seconds, and session date for each id).
Here is a sample of mydat:
mydat=read.csv("C:/Users/Admin/desktop/rty.csv", sep=";",dec=",")
structure(list(udid = c(74385162L, 79599601L, 79599601L, 91475825L,
91475825L, 91492531L, 92137561L, 96308016L, 96308016L, 96308016L,
96308016L, 96308016L, 96495076L, 97135620L, 97135620L, 97135620L,
97135620L, 97135620L, 97135620L, 97135620L, 97135620L, 97135620L,
97135620L, 97165942L), count = c(1L, 1L, 1L, 1L, 3L, 1L, 1L,
2L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), avg_duration = c(39L, 1216L, 568L, 5L, 6L, 79L, 9L, 426L,
78L, 884L, 785L, 785L, 22L, 302L, 738L, 280L, 2782L, 5L, 2284L,
144L, 234L, 231L, 539L, 450L), date = structure(c(13L, 3L, 3L,
1L, 1L, 14L, 2L, 11L, 11L, 11L, 12L, 12L, 9L, 7L, 4L, 4L, 5L,
6L, 8L, 8L, 8L, 8L, 8L, 10L), .Label = c("11.10.16", "12.12.16",
"15.11.16", "15.12.16", "16.12.16", "17.12.16", "18.10.16", "18.12.16",
"21.10.16", "26.10.16", "28.11.16", "29.11.16", "31.10.16", "8.10.16"
), class = "factor")), .Names = c("udid", "count", "avg_duration",
"date"), class = "data.frame", row.names = c(NA, -24L))
I need to calculate the time difference between the first date a player appeared and the last date he was seen.
For example, udid 97135620 first played on 18.10.2016 and was last seen on 18.12.2016, which means the difference between the first and last day = 60.9 days,
whereas udid 74385162 started on 31.10.2016 and did not play afterwards (i.e. he played only once), which means the difference between the first date and the last date = 0.
udid 79599601 has two sessions on the same day (i.e. he played twice in one day), so the difference = 1.
In the output I expect this format, with only the last date and the difference between the last day and the first day:
udid count avg_duration date datediff
74385162 1 39 31.10.2016 0
79599601 1 568 15.11.2016 1
91475825 1 5 11.10.2016 1
91492531 1 79 08.10.2016 0
92137561 1 9 12.12.2016 0
96308016 1 785 29.11.2016 1
96495076 1 22 21.10.2016 0
97135620 1 539 18.12.2016 61
97165942 1 450 26.10.2016 0
How can I do that?
This function calculates the difference between first and last session, and only returns the date of the last session:
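get_datediff <- function (x) {
  # parse the dd.mm.yy strings so they sort and subtract as dates
  dates <- as.Date(as.character(x$date), "%d.%m.%y")
  x <- x[order(dates), ]
  if (length(x$date)==1) {
    # a single session: no span
    x$datediff <- 0
  } else {
    # several sessions: span in days, but at least 1
    x$datediff <- max(1, diff(range(dates)))
  }
  # keep only the row of the last session
  x[nrow(x), ]
}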
This can then be applied to data for each user, making use of dplyr and magrittr packages:
library(dplyr)
group_by(mydat, udid) %>% do(get_datediff(.))
# A tibble: 9 x 5
# Groups: udid [9]
udid count avg_duration date datediff
<int> <int> <int> <fctr> <dbl>
1 74385162 1 39 31.10.16 0
2 79599601 1 568 15.11.16 1
3 91475825 3 6 11.10.16 1
4 91492531 1 79 8.10.16 0
5 92137561 1 9 12.12.16 0
6 96308016 1 785 29.11.16 1
7 96495076 1 22 21.10.16 0
8 97135620 1 539 18.12.16 61
9 97165942 1 450 26.10.16 0
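As a rough sketch of the same idea without a helper function, the per-player span can also be computed in a single summarise() call. The toy data frame `sessions` and the column names `last_day`, `span`, and `n_sessions` are made up for illustration; the rule that several sessions on one day count as a difference of 1 is taken from the question.

```r
library(dplyr)

# Hypothetical toy data in the same dd.mm.yy format as the question
sessions <- data.frame(
  udid = c(1L, 2L, 2L, 3L, 3L),
  avg_duration = c(39L, 1216L, 568L, 302L, 539L),
  date = c("31.10.16", "15.11.16", "15.11.16", "18.10.16", "18.12.16"),
  stringsAsFactors = FALSE
)

result <- sessions %>%
  mutate(day = as.Date(date, format = "%d.%m.%y")) %>%
  group_by(udid) %>%
  summarise(
    last_day   = max(day),                          # date of the last session
    span       = as.numeric(max(day) - min(day)),   # days between first and last
    n_sessions = n()
  ) %>%
  # several sessions on the same day count as 1, a single session as 0
  mutate(datediff = if_else(n_sessions > 1 & span == 0, 1, span))

print(result)
```

Unlike the function above, this keeps one summary row per player rather than the original row of the last session, so other columns would have to be summarised explicitly.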
The way you describe how your metrics are calculated is confusing, but following what you wrote as closely as possible, I ended up with the following:
dplyr solution:
timeData %>%
  mutate(dateFormat = as.Date(date, format = "%d.%m.%y")) %>%
  group_by(udid) %>%
  arrange(udid, dateFormat) %>%
  summarise(dateBetween = difftime(last(dateFormat), first(dateFormat), units = "days"), mean(avg_duration)) %>%
  left_join((timeData %>%
    mutate(dateFormat = as.Date(date, format = "%d.%m.%y")) %>%
    select(udid, count, dateFormat) %>%
    group_by(udid) %>%
    slice(which.min(dateFormat))))
Result:
# A tibble: 9 x 5
udid dateBetween `mean(avg_duration)` count dateFormat
<int> <time> <dbl> <int> <date>
1 74385162 0 days 39.0 1 2016-10-31
2 79599601 0 days 892.0 1 2016-11-15
3 91475825 0 days 5.5 1 2016-10-11
4 91492531 0 days 79.0 1 2016-10-08
5 92137561 0 days 9.0 1 2016-12-12
6 96308016 1 days 591.6 1 2016-11-29
7 96495076 0 days 22.0 1 2016-10-21
8 97135620 61 days 753.9 1 2016-12-18
9 97165942 0 days 450.0 1 2016-10-26
Hi, this is the data (shown in Excel form) that I want to be able to create in R.
To make it clear: every time an observation date has 5 in its Group column, I need the Group_fix column to equal 5 for the following 12-month period of observations.
How can this be done in R? Can we use the ifelse function?
Here is an approach with lag from dplyr.
library(dplyr)
data %>%
  mutate(GroupFix = case_when(Group == 5 |
                                lag(Group, 1) == 5 |
                                lag(Group, 2) == 5 |
                                lag(Group, 3) == 5 |
                                lag(Group, 4) == 5 |
                                lag(Group, 5) == 5 |
                                lag(Group, 6) == 5 |
                                lag(Group, 7) == 5 |
                                lag(Group, 8) == 5 |
                                lag(Group, 9) == 5 |
                                lag(Group, 10) == 5 |
                                lag(Group, 11) == 5 ~ 5,
                              TRUE ~ as.numeric(Group)))
Observation.Date Group GroupFix
1 12/31/19 1 1
2 1/31/20 2 2
3 2/29/20 2 2
4 3/31/20 2 2
5 4/30/20 3 3
6 5/31/20 4 4
7 6/30/20 5 5
8 7/31/20 5 5
9 8/31/20 4 5
10 9/30/20 3 5
11 10/31/20 2 5
12 11/30/20 3 5
13 12/31/20 4 5
14 1/31/21 5 5
15 2/28/21 5 5
16 3/31/21 4 5
17 4/30/21 3 5
18 5/31/21 2 5
19 6/30/21 1 5
20 7/31/21 1 5
21 8/31/21 1 5
22 9/30/21 1 5
23 10/31/21 1 5
24 11/30/21 1 5
25 12/31/21 1 5
26 1/31/22 1 5
27 2/28/22 1 1
Data
data <- structure(list(Observation.Date = structure(c(8L, 1L, 13L, 14L,
16L, 18L, 20L, 22L, 24L, 26L, 4L, 6L, 9L, 2L, 11L, 15L, 17L,
19L, 21L, 23L, 25L, 27L, 5L, 7L, 10L, 3L, 12L), .Label = c("1/31/20",
"1/31/21", "1/31/22", "10/31/20", "10/31/21", "11/30/20", "11/30/21",
"12/31/19", "12/31/20", "12/31/21", "2/28/21", "2/28/22", "2/29/20",
"3/31/20", "3/31/21", "4/30/20", "4/30/21", "5/31/20", "5/31/21",
"6/30/20", "6/30/21", "7/31/20", "7/31/21", "8/31/20", "8/31/21",
"9/30/20", "9/30/21"), class = "factor"), Group = c(1L, 2L, 2L,
2L, 3L, 4L, 5L, 5L, 4L, 3L, 2L, 3L, 4L, 5L, 5L, 4L, 3L, 2L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L)), class = "data.frame", row.names = c(NA,
-27L))
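The chain of lag() calls can also be written as a loop over a trailing window, which may be easier to adjust if the period length changes. This is only a sketch in base R: the names `group`, `window`, and `group_fix` are illustrative, and it assumes one row per month in date order with no missing Group values.

```r
# Group values copied from the data above, in observation order
group <- c(1, 2, 2, 2, 3, 4, 5, 5, 4, 3, 2, 3, 4, 5, 5, 4, 3, 2,
           1, 1, 1, 1, 1, 1, 1, 1, 1)

window <- 12  # current month plus the 11 months before it

group_fix <- vapply(seq_along(group), function(i) {
  start <- max(1, i - window + 1)          # first row of the trailing window
  if (any(group[start:i] == 5)) 5 else group[i]
}, numeric(1))

print(group_fix)
```

With `window <- 12` this looks at the same rows as `Group == 5` combined with `lag(Group, 1)` through `lag(Group, 11)`.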
I have a dataframe which looks like this:
ID Smoker Asthma Age Sex COPD Event_Date
1 1 0 0 65 M 0 12-2009
2 1 0 1 65 M 0 21-2009
3 1 0 1 65 M 0 23-2009
4 2 1 0 67 M 0 19-2010
5 2 1 0 67 M 0 21-2010
6 2 1 1 67 M 1 01-2011
7 2 1 1 67 M 1 02-2011
8 3 2 1 77 F 0 09-2015
9 3 2 1 77 F 1 10-2015
10 3 2 1 77 F 1 10-2015
I would like to know whether it would be possible to combine my rows in order to achieve a dataset like this:
ID Smoker Asthma Age Sex COPD Event_Date
1 0 1 65 M 0 12-2009
2 1 1 67 M 1 19-2010
3 2 1 77 F 1 09-2015
I have tried using the unique function, but it doesn't give me the desired output and repeats the ID across multiple rows.
This is an example of the code I've tried:
Data2<-unique(Data)
I do not just want the first row, because I want to capture the status in each column. For example, taking only the first row would miss the COPD status that appears in later rows for each ID.
Alternative Solution:
library(dplyr)
d %>%
  group_by(ID, Age, Sex, Smoker) %>%
  summarise(Asthma = !is.na(match(1, Asthma)),
            COPD = !is.na(match(1, COPD)),
            Event_Date = first(Event_Date)) %>%
  ungroup() %>%
  mutate_if(is.logical, as.numeric)
# A tibble: 3 x 7
ID Age Sex Smoker Asthma COPD Event_Date
<int> <int> <fct> <int> <dbl> <dbl> <fct>
1 1 65 M 0 1 0 12-2009
2 2 67 M 1 1 1 19-2010
3 3 77 F 2 1 1 09-2015
If you want to get the (first) row for each ID you can try something like this:
d <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L),
Smoker = c(0L, 0L, 0L, 1L, 1L, 1L, 1L, 2L, 2L, 2L),
Asthma = c(0L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 1L, 1L),
Age = c(65L, 65L, 65L, 67L, 67L, 67L, 67L, 77L, 77L, 77L),
Sex = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L),
.Label = c("F", "M"), class = "factor"),
COPD = c(0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 1L, 1L),
Event_Date = structure(c(5L, 7L, 9L, 6L, 8L, 1L, 2L, 3L, 4L, 4L),
.Label = c("01-2011", "02-2011", "09-2015",
"10-2015", "12-2009", "19-2010",
"21-2009", "21-2010", "23-2009"),
class = "factor")),
class = "data.frame",
row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10"))
d[!duplicated(d$ID), ]
# ID Smoker Asthma Age Sex COPD Event_Date
# 1 1 0 0 65 M 0 12-2009
# 4 2 1 0 67 M 0 19-2010
# 8 3 2 1 77 F 0 09-2015
Use max for columns where you need a value that appears in later rows, and dplyr::first for the others. Here is an example:
library(dplyr)
df %>% group_by(ID) %>% summarise(Smoker=first(Smoker), Asthma=max(Asthma, na.rm = TRUE))
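To cover every column at once, the same max/first split can be spelled out with across(). This is a sketch on a cut-down, hypothetical data frame `d_small`; which columns take the first value and which take the maximum is an assumption based on the desired output in the question.

```r
library(dplyr)

# Hypothetical subset of the question's data
d_small <- data.frame(
  ID = c(1L, 1L, 2L, 2L),
  Smoker = c(0L, 0L, 1L, 1L),
  Asthma = c(0L, 1L, 0L, 1L),
  COPD = c(0L, 0L, 0L, 1L),
  Event_Date = c("12-2009", "21-2009", "19-2010", "01-2011"),
  stringsAsFactors = FALSE
)

combined <- d_small %>%
  group_by(ID) %>%
  summarise(
    # columns that are constant per ID, or where the first event is wanted
    across(c(Smoker, Event_Date), first),
    # 0/1 flags: 1 if the condition shows up in any row for that ID
    across(c(Asthma, COPD), ~ max(.x, na.rm = TRUE))
  )

print(combined)
```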
I have a dataset like the one below:
test <- structure(list(SR = c(1L, 1L, 15L, 20L, 20L, 96L, 110L, 110L,
121L, 121L, 130L, 130L, 143L, 143L), Area = structure(c(3L, 3L,
1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 4L, 4L, 2L, 2L), .Label = c("FH",
"MO", "TSC", "WMB"), class = "factor"), Period = structure(c(1L,
2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), .Label = c("First",
"Second"), class = "factor"), count = c(4L, 6L, 3L, 6L, 6L, 3L,
6L, 6L, 6L, 6L, 6L, 6L, 5L, 6L), countTotal = c(10L, 10L, 3L,
12L, 12L, 3L, 12L, 12L, 12L, 12L, 12L, 12L, 11L, 11L), SumTotal = c(1520,
5769.02, 29346.78, 13316.89, 11932.68, 10173.05, 13243.5, 17131.94,
111189.07, 84123.52, 79463.1, 120010.57, 7035.88, 11520)), .Names = c("SR",
"Area", "Period", "count", "countTotal", "SumTotal"), class = "data.frame", row.names = c(NA,
-14L))
SR Area Period count countTotal SumTotal
1 TSC First 4 10 1520.00
1 TSC Second 6 10 5769.02
15 FH First 3 3 29346.78
20 FH First 6 12 13316.89
20 FH Second 6 12 11932.68
96 FH First 3 3 10173.05
110 MO First 6 12 13243.50
110 MO Second 6 12 17131.94
121 FH First 6 12 111189.07
121 FH Second 6 12 84123.52
130 WMB First 6 12 79463.10
130 WMB Second 6 12 120010.57
143 MO First 5 11 7035.88
143 MO Second 6 11 11520.00
I want to convert some of the rows to columns to make the dataset look like this:
SR Area countTotal First.Count Second.Count First.SumTotal Second.SumTotal
1 TSC 10 4 6 1520.00 5769.02
15 FH 3 3 NA 29346.78 NA
20 FH 12 6 6 13316.89 11932.68
96 FH 3 3 NA 10173.05 NA
110 MO 12 6 6 13243.50 17131.94
121 FH 12 6 6 111189.07 84123.52
130 WMB 12 6 6 79463.10 120010.57
143 MO 11 5 6 7035.88 11520.00
I tried using spread from tidyr with this code:
test %>% spread(Period, SumTotal)
but I still get two rows for each SR and Area.
Can someone help?
You need to first gather by the columns you want to spread, and combine the Period column with the variable column, then spread the resulting variable column:
library(dplyr)
library(tidyr)
test %>%
  gather(variable, value, count:SumTotal) %>%
  unite("variable", Period, variable, sep = ".") %>%
  spread(variable, value)
Result:
SR Area First.count First.countTotal First.SumTotal Second.count Second.countTotal
1 1 TSC 4 10 1520.00 6 10
2 15 FH 3 3 29346.78 NA NA
3 20 FH 6 12 13316.89 6 12
4 96 FH 3 3 10173.05 NA NA
5 110 MO 6 12 13243.50 6 12
6 121 FH 6 12 111189.07 6 12
7 130 WMB 6 12 79463.10 6 12
8 143 MO 5 11 7035.88 6 11
Second.SumTotal
1 5769.02
2 NA
3 11932.68
4 NA
5 17131.94
6 84123.52
7 120010.57
8 11520.00
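In tidyr 1.0.0 and later, pivot_wider() can spread several value columns in one step, and it leaves countTotal as an identifying column, as in the desired output. The following is a sketch on a three-row, hypothetical subset called `test_small`; the `names_glue` pattern is chosen to reproduce the `First.count` style of column names.

```r
library(tidyr)

# Hypothetical subset of the question's data
test_small <- data.frame(
  SR = c(1L, 1L, 15L),
  Area = c("TSC", "TSC", "FH"),
  Period = c("First", "Second", "First"),
  count = c(4L, 6L, 3L),
  countTotal = c(10L, 10L, 3L),
  SumTotal = c(1520, 5769.02, 29346.78),
  stringsAsFactors = FALSE
)

wide <- pivot_wider(
  test_small,
  names_from  = Period,                 # one set of columns per period
  values_from = c(count, SumTotal),     # columns to spread
  names_glue  = "{Period}.{.value}"     # e.g. First.count, Second.SumTotal
)

print(wide)
```

SR, Area, and countTotal are not named in `names_from` or `values_from`, so they are treated as identifiers and each SR ends up on a single row.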
Here is my dataset:
structure(list(Date = structure(c(14609, 14609, 14609, 14609, 14699, 14699, 14699, 14699, 14790, 14790, 14790, 14790), class = "Date"),
ID = structure(c(5L, 4L, 6L, 10L, 9L, 3L, 10L, 8L, 7L, 1L,
10L, 2L), .Label = c("B00NYQ2", "B03J9L7", "B05DZD1", "B06HC42",
"B09V3X7", "B09YCC8", "X6114659", "X6478816", "X6556701",
"X6812555"), class = "factor"), Name = structure(c(10L, 4L,
9L, 8L, 7L, 3L, 8L, 6L, 2L, 5L, 8L, 1L), .Label = c("AIRA",
"BOUS", "CSCS", "EVF", "GTB", "JER", "MGB", "MPR", "NVB",
"TTNP"), class = "factor"), Score = c(55.075, 54.5, 53.325,
52.175, 70.275, 69.825, 60.15, 60.025, 56.175, 52.65, 52.175,
52.125), Score.rank = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L,
2L, 3L, 4L)), .Names = c("Date", "ID", "Name", "Score", "Score.rank"), row.names = c(1L, 2L, 3L, 4L, 71L, 72L, 73L, 74L, 156L, 157L, 158L, 159L), class = "data.frame")
I'm trying to find which IDs come in and out when we move into a new period.
What I mean is that I want to check whether the ID was present in the previous period, denoted by "Date".
If it existed in the previous period (date), it should not return anything.
If it did not exist in the previous period, it should return "IN".
I also want to show that if it does not exist in the next period, it should return "OUT".
I.e. this period's OUTs should be equal to the next period's INs.
My expected data frame is supposed to look like this:
Date ID Name Score Score.rank THIS PERIOD NEXT PERIOD
31/12/2009 B09V3X7 TTNP 55.075 1 OUT
31/12/2009 B06HC42 EVF 54.5 2 OUT
31/12/2009 B09YCC8 NVB 53.325 3 OUT
31/12/2009 X6812555 MPR 52.175 4
31/3/2010 X6556701 MGB 70.275 1 IN
31/3/2010 B05DZD1 CSCS 69.825 2 IN OUT
31/3/2010 X6812555 MPR 60.15 3
31/3/2010 X6478816 JER 60.025 4 IN OUT
30/6/2010 X6114659 BOUS 56.175 1 IN
30/6/2010 B00NYQ2 GTB 52.65 2 IN
30/6/2010 X6812555 MPR 52.175 3
30/6/2010 B03J9L7 AIRA 52.125 4 IN
Can somebody point me in the right direction as to how to do this?
Thanks in advance
Your description and example don't match, unfortunately.
Going by your description, it seems you want to tag entry and exit conditions for the IDs.
This can be achieved as:
dft %>%
  group_by(ID) %>%
  dplyr::mutate( This_period = if_else(Date == min(Date), "IN", NA_character_) ) %>%
  dplyr::mutate( Next_period = if_else(Date == max(Date), "OUT", NA_character_))
and returns:
#Source: local data frame [12 x 7]
#Groups: ID [10]
#
# Date ID Name Score Score.rank This_period Next_period
# <date> <fctr> <fctr> <dbl> <int> <chr> <chr>
#1 2009-12-31 B09V3X7 TTNP 55.075 1 IN OUT
#2 2009-12-31 B06HC42 EVF 54.500 2 IN OUT
#3 2009-12-31 B09YCC8 NVB 53.325 3 IN OUT
#4 2009-12-31 X6812555 MPR 52.175 4 IN <NA>
#5 2010-03-31 X6556701 MGB 70.275 1 IN OUT
#6 2010-03-31 B05DZD1 CSCS 69.825 2 IN OUT
#7 2010-03-31 X6812555 MPR 60.150 3 <NA> <NA>
#8 2010-03-31 X6478816 JER 60.025 4 IN OUT
#9 2010-06-30 X6114659 BOUS 56.175 1 IN OUT
#10 2010-06-30 B00NYQ2 GTB 52.650 2 IN OUT
#11 2010-06-30 X6812555 MPR 52.175 3 <NA> OUT
#12 2010-06-30 B03J9L7 AIRA 52.125 4 IN OUT
However, your example suggests you want to exclude min(Date) from the This_period check and max(Date) from the Next_period check. Is that so? If yes, is Score.rank somehow related to Date?
Please clarify.
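If the intended reading is the one from the question (compare each period with the one immediately before and after it, and leave the very first and very last period unflagged), a sketch along these lines might work. The toy data frame and the helper column `period` are made up for illustration, and the code assumes each ID appears at most once per Date.

```r
library(dplyr)

# Hypothetical toy data: three periods, IDs entering and leaving
dft <- data.frame(
  Date = as.Date(c("2009-12-31", "2009-12-31", "2010-03-31",
                   "2010-03-31", "2010-06-30", "2010-06-30")),
  ID = c("A", "B", "B", "C", "B", "D"),
  stringsAsFactors = FALSE
)

periods <- sort(unique(dft$Date))   # ordered list of all periods

flagged <- dft %>%
  mutate(period = match(Date, periods)) %>%      # 1, 2, 3, ... per period
  group_by(ID) %>%
  arrange(period, .by_group = TRUE) %>%
  mutate(
    # IN: not the first period, and the ID was absent one period earlier
    This_period = if_else(
      period > 1 & (is.na(lag(period)) | lag(period) != period - 1),
      "IN", NA_character_),
    # OUT: not the last period, and the ID is absent one period later
    Next_period = if_else(
      period < length(periods) & (is.na(lead(period)) | lead(period) != period + 1),
      "OUT", NA_character_)
  ) %>%
  ungroup() %>%
  arrange(Date, ID)

print(flagged)
```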
Is there a way to fill in implicit missingness for future dates based on id?
For example, imagine an experiment that starts in Jan-2016. I have 3 participants who join at different periods. Subject 1 joins in January and stays until August. Subject 2 joins in March and stays in the experiment until August. Subject 3 also joins in March but drops out sometime in May, so no observations are recorded for the periods May-Aug.
The question is, how do I fill in the dates after subject 3 dropped out of the experiment? Here is some mock data showing what things look like:
Subject Date
1 1 Jan-16
2 1 Feb-16
3 1 Mar-16
4 1 Apr-16
5 1 May-16
6 1 Jun-16
7 1 Jul-16
8 1 Aug-16
9 2 Mar-16
10 2 Apr-16
11 2 May-16
12 2 Jun-16
13 2 Jul-16
14 2 Aug-16
15 3 Mar-16
16 3 Apr-16
structure(list(Subject = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 3L, 3L), Date = structure(c(5L, 4L, 8L, 2L,
9L, 7L, 6L, 3L, 8L, 2L, 9L, 7L, 6L, 3L, 8L, 2L), .Label = c("",
"Apr-16", "Aug-16", "Feb-16", "Jan-16", "Jul-16", "Jun-16", "Mar-16",
"May-16"), class = "factor")), class = "data.frame", row.names = c(NA,
-16L), .Names = c("Subject", "Date"))
And here is what the data should look like:
Subject Date
1 1 Jan-16
2 1 Feb-16
3 1 Mar-16
4 1 Apr-16
5 1 May-16
6 1 Jun-16
7 1 Jul-16
8 1 Aug-16
9 2 Mar-16
10 2 Apr-16
11 2 May-16
12 2 Jun-16
13 2 Jul-16
14 2 Aug-16
15 3 Mar-16
16 3 Apr-16
17 3 May-16
18 3 Jun-16
19 3 Jul-16
20 3 Aug-16
structure(list(Subject = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), Date = structure(c(4L,
3L, 7L, 1L, 8L, 6L, 5L, 2L, 7L, 1L, 8L, 6L, 5L, 2L, 7L, 1L, 8L,
6L, 5L, 2L), .Label = c("Apr-16", "Aug-16", "Feb-16", "Jan-16",
"Jul-16", "Jun-16", "Mar-16", "May-16"), class = "factor")), class = "data.frame", row.names = c(NA,
-20L), .Names = c("Subject", "Date"))
I tried using expand from tidyr and TimeFill from DataCombine package, but the issue with those approaches is that I would get dates for periods before a participant joined an experiment. In this particular instance, I only want the periods to be filled for cases when a participant drops out of an experiment.
The complete function from tidyr is designed for turning implicit missing values into explicit missing values. We then have to filter out the completed rows that fall before a subject's start date. The easiest way seems to be to join on a table of starting values:
library(dplyr)
library(tidyr)
df <- df %>%
filter(Date != '') %>%
droplevels() %>%
group_by(Subject)
df2 <- summarise(df, start = first(Date))
df %>%
complete(Subject, Date) %>%
left_join(df2) %>%
mutate(Date2 = as.Date(paste0('01-', Date), format = '%d-%b-%y'),
start = as.Date(paste0('01-', start), format = '%d-%b-%y')) %>%
filter(Date2 >= start) %>%
arrange(Subject, Date2) %>%
select(-start, -Date2)
Result:
Source: local data frame [20 x 2]
Groups: Subject [3]
Subject Date
<int> <fctr>
1 1 Jan-16
2 1 Feb-16
3 1 Mar-16
4 1 Apr-16
5 1 May-16
6 1 Jun-16
7 1 Jul-16
8 1 Aug-16
9 2 Mar-16
10 2 Apr-16
11 2 May-16
12 2 Jun-16
13 2 Jul-16
14 2 Aug-16
15 3 Mar-16
16 3 Apr-16
17 3 May-16
18 3 Jun-16
19 3 Jul-16
20 3 Aug-16
I use date conversion to do a reliable comparison with the starting date, but you might be able to use the row_numbers somehow. A problem is that complete will rearrange the data.
p.s. Note that your example dput has an empty factor level (""), so I filter that out first.
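An alternative sketch that never creates the unwanted earlier rows is to build, for each subject, a monthly sequence from that subject's first month to the last month in the data. The helper `parse_month` and the toy data are hypothetical; the helper assumes `Mon-yy` strings with English month abbreviations and years in the 2000s, and uses `month.abb` so it does not depend on the locale.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: subject 3 drops out after March
df <- data.frame(
  Subject = c(1L, 1L, 1L, 1L, 2L, 2L, 3L),
  Date = c("Jan-16", "Feb-16", "Mar-16", "Apr-16", "Mar-16", "Apr-16", "Mar-16"),
  stringsAsFactors = FALSE
)

# Turn "Mar-16" into the Date 2016-03-01 (illustrative helper)
parse_month <- function(x) {
  mon <- match(substr(x, 1, 3), month.abb)
  yr  <- 2000L + as.integer(substr(x, 5, 6))
  as.Date(sprintf("%d-%02d-01", yr, mon))
}

obs <- df %>% mutate(month = parse_month(Date))
last_month <- max(obs$month)   # end of the experiment in the data

filled <- obs %>%
  group_by(Subject) %>%
  # one monthly sequence per subject, from their own start to the global end
  summarise(month = list(seq(min(month), last_month, by = "month"))) %>%
  unnest(month) %>%
  mutate(Date = paste0(month.abb[as.integer(format(month, "%m"))],
                       "-", format(month, "%y")))

print(filled)
```

This also fills gaps in the middle of a subject's series, which may or may not be wanted.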