Here is my dataset:
structure(list(Date = structure(c(14609, 14609, 14609, 14609, 14699, 14699, 14699, 14699, 14790, 14790, 14790, 14790), class = "Date"),
ID = structure(c(5L, 4L, 6L, 10L, 9L, 3L, 10L, 8L, 7L, 1L,
10L, 2L), .Label = c("B00NYQ2", "B03J9L7", "B05DZD1", "B06HC42",
"B09V3X7", "B09YCC8", "X6114659", "X6478816", "X6556701",
"X6812555"), class = "factor"), Name = structure(c(10L, 4L,
9L, 8L, 7L, 3L, 8L, 6L, 2L, 5L, 8L, 1L), .Label = c("AIRA",
"BOUS", "CSCS", "EVF", "GTB", "JER", "MGB", "MPR", "NVB",
"TTNP"), class = "factor"), Score = c(55.075, 54.5, 53.325,
52.175, 70.275, 69.825, 60.15, 60.025, 56.175, 52.65, 52.175,
52.125), Score.rank = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L,
2L, 3L, 4L)), .Names = c("Date", "ID", "Name", "Score", "Score.rank"), row.names = c(1L, 2L, 3L, 4L, 71L, 72L, 73L, 74L, 156L, 157L, 158L, 159L), class = "data.frame")
I'm trying to find which IDs come in and out when we go into a new period.
What i mean by that is..i want to compare if the ID was present in the previous period, denoted by "Date".
If it existed in the previous period (date), It should not return anything.
If it did not exist in the previous period, it should return "IN".
I also want to show that if does not exist in the next period, it should return an "OUT".
ie the this period's OUTs should be equal to next periods INs
my expected dataframe is supposed to look like this
Date ID Name Score Score.rank THIS PERIOD NEXT PERIOD
31/12/2009 B09V3X7 TTNP 55.075 1 OUT
31/12/2009 B06HC42 EVF 54.5 2 OUT
31/12/2009 B09YCC8 NVB 53.325 3 OUT
31/12/2009 X6812555 MPR 52.175 4
31/3/2010 X6556701 MGB 70.275 1 IN
31/3/2010 B05DZD1 CSCS 69.825 2 IN OUT
31/3/2010 X6812555 MPR 60.15 3
31/3/2010 X6478816 JER 60.025 4 IN OUT
30/6/2010 X6114659 BOUS 56.175 1 IN
30/6/2010 B00NYQ2 GTB 52.65 2 IN
30/6/2010 X6812555 MPR 52.175 3
30/6/2010 B03J9L7 AIRA 52.125 4 IN
Can somebody point me in the right direction as to how to do this?
Thanks in advance
Your description and example doesn't match, unfortunately.
Considering your description, it seems you want to tag entry and exit conditions for the IDs.
Which can be achieved as:
dft %>%
group_by(ID) %>%
dplyr::mutate( This_period = if_else(Date == min(Date), "IN", NULL) ) %>%
dplyr::mutate( Next_period = if_else(Date == max(Date), "OUT", NULL))
and returns:
#Source: local data frame [12 x 7]
#Groups: ID [10]
#
# Date ID Name Score Score.rank This_period Next_period
# <date> <fctr> <fctr> <dbl> <int> <chr> <chr>
#1 2009-12-31 B09V3X7 TTNP 55.075 1 IN OUT
#2 2009-12-31 B06HC42 EVF 54.500 2 IN OUT
#3 2009-12-31 B09YCC8 NVB 53.325 3 IN OUT
#4 2009-12-31 X6812555 MPR 52.175 4 IN <NA>
#5 2010-03-31 X6556701 MGB 70.275 1 IN OUT
#6 2010-03-31 B05DZD1 CSCS 69.825 2 IN OUT
#7 2010-03-31 X6812555 MPR 60.150 3 <NA> <NA>
#8 2010-03-31 X6478816 JER 60.025 4 IN OUT
#9 2010-06-30 X6114659 BOUS 56.175 1 IN OUT
#10 2010-06-30 B00NYQ2 GTB 52.650 2 IN OUT
#11 2010-06-30 X6812555 MPR 52.175 3 <NA> OUT
#12 2010-06-30 B03J9L7 AIRA 52.125 4 IN OUT
However, your example suggests you want to exclude the min(Date) from this_period check and the max(Date) from the Next_period check. Is it so? if yes, is score.rank somehow related to Date?
please clarify.
Related
I'd like to finally summarize my data after the data cleaning.
Here is my data structure:
structure(list(ID = structure(c(1L, 3L, 4L, 2L, 2L, 3L, 3L, 3L,
3L, 4L, 4L, 4L, 5L, 6L, 6L, 6L, 6L, 6L, 6L, 6L), .Label = c("01002",
"01004", "01005", "01006", "01009", "01011"), class = "factor"),
date = structure(c(17645, 17645, 17645, 17646, 17646, 17646,
17646, 17646, 17646, 17646, 17648, 17646, 17648, 17646, 17648,
17646, 17646, 17646, 17649, 17646), class = "Date"), category = structure(c(1L,
1L, 1L, 2L, 4L, 7L, 3L, 3L, 1L, 6L, 6L, 6L, 7L, 7L, 7L, 6L,
2L, 5L, 3L, 3L), .Label = c("A", "B", "C", "D", "F", "G",
"Q"), class = "factor"), level = c(3000, 3000, 1000, 1000,
1000, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 8000, 9999,
9999, 9999, 300, 300, 300, 9999)), class = "data.frame", row.names = c(NA,
-20L))
Here is the code I have so far:
dataDF %>%
dplyr::group_by(category) %>%
dplyr::summarize(n = n()) %>%
dplyr::mutate(percentage = (prop.table(n))*100) %>%
arrange(desc(n))
with the following result:
category n percentage
<fct> <int> <dbl>
1 A 4 20
2 C 4 20
3 G 4 20
4 Q 4 20
5 B 2 10
6 D 1 5
7 F 1 5
Now I'd like to add a new column with an aggregation of the date.
I need to add for each category the mean counted dates per ID.
Here is how the data should look like (random numbers, not calculated).
category n percentage mean_reported_days_per_ID
<fct> <int> <dbl> <int>
1 A 4 20 2
2 C 4 20 3.4
3 G 4 20 4
4 Q 4 20 1
5 B 2 10 3.5
6 D 1 5 2
7 F 1 5 1.1
I'm not sure how that could be achieved. I tried to add another mutate() and calculate the mean days per ID and add it (with another group by) to the data table.
thx for your help!
You want average unique dates per ID for each category?
You just need to group and summarise twice:
require(dplyr)
dataDF %>%
group_by(ID, category) %>%
summarise(distinctDates = n_distinct(date)) %>%
group_by(category) %>%
summarise(mean(distinctDates))
# category `mean(distinctDates)`
# <fct> <dbl>
# 1 A 1.33
# 2 B 1
# 3 C 1
# 4 D 1
# 5 F 1
# 6 G 1
# 7 Q 1
If you want to join those values to your existing DF, just do
left_join(your.existing.df, this.new.df, by = "category")
I want split an irregular time series into separate events and assign each event a unique numerical ID for each site.
Here is an example data frame:
structure(list(site = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L), .Label = c("AllenBrook", "Eastberk"), class =
"factor"),
timestamp = structure(c(10L, 13L, 8L, 4L, 5L, 6L, 7L, 9L,
11L, 12L, 1L, 2L, 3L), .Label = c("10/1/12 11:29", "10/1/12 14:29",
"10/1/12 17:29", "10/20/12 16:30", "10/20/12 19:30", "10/21/12 1:30",
"10/21/12 4:30", "9/5/12 12:30", "9/5/12 4:14", "9/5/12 6:30",
"9/5/12 7:14", "9/5/12 7:44", "9/5/12 9:30"), class = "factor")), class
= "data.frame", row.names = c(NA,
-13L))
Each event is not the same length or number of timestamps, so I want to split them into separate events if more than 12 hours elapsed between a timestamp and the next timestamp at that site. Each event at the site should receive a unique numerical ID. Here's the outcome I would like:
site timestamp eventid
1 AllenBrook 9/5/12 6:30 1
2 AllenBrook 9/5/12 9:30 1
3 AllenBrook 9/5/12 12:30 1
4 AllenBrook 10/20/12 16:30 2
5 AllenBrook 10/20/12 19:30 2
6 AllenBrook 10/21/12 1:30 2
7 AllenBrook 10/21/12 4:30 2
8 Eastberk 9/5/12 4:14 1
9 Eastberk 9/5/12 7:14 1
10 Eastberk 9/5/12 7:44 1
11 Eastberk 10/1/12 11:29 2
12 Eastberk 10/1/12 14:29 2
13 Eastberk 10/1/12 17:29 2
Any coding solution will do, but bonus points for a tidyverse or data.table solution. Thanks for any help you can provide!
Using data.table, you can perhaps do the following:
library(data.table)
setDT(tmp)[, timestamp := as.POSIXct(timestamp, format="%m/%d/%y %H:%M")][,
eventid := 1L+cumsum(c(0L, diff(timestamp)>720)), by=.(site)]
diff(timestamp) calculates the time difference between adjacent rows. Then we check if the diff is greater than 12h (or 720mins). A common trick in R is to use cumsum to identify when an event happens in a series and group subsequent elements together with this event until the next event happens again. Since cumsum returns 1 less element, we use 0L to pad the beginning. 1+ merely starts the indexing from 1 instead of 0.
output:
site timestamp eventid
1: AllenBrook 2012-09-05 06:30:00 1
2: AllenBrook 2012-09-05 09:30:00 1
3: AllenBrook 2012-09-05 12:30:00 1
4: AllenBrook 2012-10-20 16:30:00 2
5: AllenBrook 2012-10-20 19:30:00 2
6: AllenBrook 2012-10-21 01:30:00 2
7: AllenBrook 2012-10-21 04:30:00 2
8: Eastberk 2012-09-05 04:14:00 1
9: Eastberk 2012-09-05 07:14:00 1
10: Eastberk 2012-09-05 07:44:00 1
11: Eastberk 2012-10-01 11:29:00 2
12: Eastberk 2012-10-01 14:29:00 2
13: Eastberk 2012-10-01 17:29:00 2
data:
tmp <- structure(list(site = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L), .Label = c("AllenBrook", "Eastberk"), class =
"factor"),
timestamp = structure(c(10L, 13L, 8L, 4L, 5L, 6L, 7L, 9L,
11L, 12L, 1L, 2L, 3L), .Label = c("10/1/12 11:29", "10/1/12 14:29",
"10/1/12 17:29", "10/20/12 16:30", "10/20/12 19:30", "10/21/12 1:30",
"10/21/12 4:30", "9/5/12 12:30", "9/5/12 4:14", "9/5/12 6:30",
"9/5/12 7:14", "9/5/12 7:44", "9/5/12 9:30"), class = "factor")), class
= "data.frame", row.names = c(NA,
-13L))
I have dataset with data of gamesessions(id,count of session,averege seconds of session and date of session for each id)
here sample of mydat:
mydat=read.csv("C:/Users/Admin/desktop/rty.csv", sep=";",dec=",")
structure(list(udid = c(74385162L, 79599601L, 79599601L, 91475825L,
91475825L, 91492531L, 92137561L, 96308016L, 96308016L, 96308016L,
96308016L, 96308016L, 96495076L, 97135620L, 97135620L, 97135620L,
97135620L, 97135620L, 97135620L, 97135620L, 97135620L, 97135620L,
97135620L, 97165942L), count = c(1L, 1L, 1L, 1L, 3L, 1L, 1L,
2L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), avg_duration = c(39L, 1216L, 568L, 5L, 6L, 79L, 9L, 426L,
78L, 884L, 785L, 785L, 22L, 302L, 738L, 280L, 2782L, 5L, 2284L,
144L, 234L, 231L, 539L, 450L), date = structure(c(13L, 3L, 3L,
1L, 1L, 14L, 2L, 11L, 11L, 11L, 12L, 12L, 9L, 7L, 4L, 4L, 5L,
6L, 8L, 8L, 8L, 8L, 8L, 10L), .Label = c("11.10.16", "12.12.16",
"15.11.16", "15.12.16", "16.12.16", "17.12.16", "18.10.16", "18.12.16",
"21.10.16", "26.10.16", "28.11.16", "29.11.16", "31.10.16", "8.10.16"
), class = "factor")), .Names = c("udid", "count", "avg_duration",
"date"), class = "data.frame", row.names = c(NA, -24L))
I need calculating the time difference between the first date of the player's appearance and the last date when he was seen.
For example uid 97135620 the first time when he started play was 18.10.2016 and last time he was seen at 18.12.2016, it is mean that the difference between first and last day = 60,9 days,
meanwhile uid74385162 started at 31.10.2016 and after he didn't play(i.e he played one time), it is mean the difference between first data and last data = 0.
id79599601 has two count of session in 1 day(i.e for one day I played 2 times), so the the difference =1
In output i expect this format only with last date and the value of the difference between the last day and the first day.
udid count avg_duration date datediff
74385162 1 39 31.10.2016 0
79599601 1 568 15.11.2016 1
91475825 1 5 11.10.2016 1
91492531 1 79 08.10.2016 0
92137561 1 9 12.12.2016 0
96308016 1 785 29.11.2016 1
96495076 1 22 21.10.2016 0
97135620 1 539 18.12.2016 61
97165942 1 450 26.10.2016 0
How do that?
This function calculates the difference between first and last session, and only returns the date of the last session:
get_datediff <- function (x) {
dates <- as.Date(as.character(x$date), "%d.%m.%y")
x <- x[order(dates), ]
if (length(x$date)==1) {
x$datediff <- 0
} else {
x$datediff <- max(1, diff(range(dates)))
}
x[nrow(x), ]
}
This can then be applied to data for each user, making use of dplyr and magrittr packages:
group_by(mydat, udid) %>% do(get_datediff(.))
# A tibble: 9 x 5
# Groups: udid [9]
udid count avg_duration date datediff
<int> <int> <int> <fctr> <dbl>
1 74385162 1 39 31.10.16 0
2 79599601 1 568 15.11.16 1
3 91475825 3 6 11.10.16 1
4 91492531 1 79 8.10.16 0
5 92137561 1 9 12.12.16 0
6 96308016 1 785 29.11.16 1
7 96495076 1 22 21.10.16 0
8 97135620 1 539 18.12.16 61
9 97165942 1 450 26.10.16 0
The way you describe how your metrics are calculated are confusing, but following what you wrote as closely as possible, I ended up with the following:
dplyr solution:
timeData%>%
mutate(dateFormat = as.Date(date, format = "%d.%m.%y"))%>%
group_by(udid)%>%
arrange(udid,dateFormat)%>%
summarise(dateBetween = difftime(last(dateFormat), first(dateFormat), units = "days"), mean(avg_duration))%>%
left_join((timeData%>%
mutate(dateFormat = as.Date(date, format = "%d.%m.%y"))%>%
select(udid, count,dateFormat)%>%
group_by(udid)%>%
slice(which.min(dateFormat))))
Result:
# A tibble: 9 x 5
udid dateBetween `mean(avg_duration)` count dateFormat
<int> <time> <dbl> <int> <date>
1 74385162 0 days 39.0 1 2016-10-31
2 79599601 0 days 892.0 1 2016-11-15
3 91475825 0 days 5.5 1 2016-10-11
4 91492531 0 days 79.0 1 2016-10-08
5 92137561 0 days 9.0 1 2016-12-12
6 96308016 1 days 591.6 1 2016-11-29
7 96495076 0 days 22.0 1 2016-10-21
8 97135620 61 days 753.9 1 2016-12-18
9 97165942 0 days 450.0 1 2016-10-26
I have a dataframe in the following form:
person currentTest beforeValue afterValue
1 1 A 1.284297055 2.671763513
2 2 A -0.618359548 -2.354926905
3 3 A 0.039457430 -0.091709968
4 4 A -0.448608324 -0.362851832
5 5 A -0.961777124 -1.416284339
6 6 A 0.702471895 2.052181444
7 7 A -0.455222045 -2.125684279
8 8 A -1.231549132 -2.777425148
9 9 A -0.797234990 -0.558306183
10 10 A -0.709734963 -1.244159550
11 1 B -0.472799377 -0.869472343
12 2 B 0.059720737 1.444855389
13 3 B 0.924201532 2.731049485
14 4 B 0.658884183 1.017542475
15 5 B -1.989807256 -4.712671740
16 6 B 0.660241305 1.971232718
17 7 B 0.089636952 -0.564457911
18 8 B -0.828399941 0.507659171
19 9 B -0.838074237 -0.316996942
20 10 B -1.659197101 -3.317623686
...
What I'd like is to get a data frame of:
person A_Before A_After B_Before, B_After, ...
1 1.284297055 2.671763513 -0.472799377 -0.869472343
2 -0.618359548 -2.354926905 0.059720737 1.444855389
...
I've tried gather and spread but that's not quite what I need as there's the creation of new columns. Any suggestions?
The dput version for easy access is below:
resultsData <- dput(resultsData)
structure(list(person = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L,
10L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 1L, 2L, 3L, 4L,
5L, 6L, 7L, 8L, 9L, 10L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L,
10L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 1L, 2L, 3L, 4L,
5L, 6L, 7L, 8L, 9L, 10L), currentTest = structure(c(1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L,
4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 6L,
6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L), .Label = c("A", "B", "C",
"D", "E", "F"), class = "factor"), beforeValue = c(1.28429705541541,
-0.618359548370402, 0.039457429902531, -0.448608324038257, -0.961777123997687,
0.702471895259405, -0.455222044740939, -1.23154913153736, -0.797234989892673,
-0.709734963076803, -0.47279937661921, 0.0597207367403981, 0.924201531911827,
0.658884182599422, -1.98980725637449, 0.660241304554785, 0.0896369516528346,
-0.828399941497236, -0.838074236572976, -1.65919710134782, 0.577469369909437,
1.92748171699512, -0.245593641496638, 0.126104785456265, -0.559338325961641,
1.29802115505785, 0.719406692531958, 0.969414499181256, -0.814697072724845,
0.86465983690719, -0.709539159817187, 1.02775240926492, -0.50490096148732,
0.40769259465753, -0.868531009656408, 0.949518511358715, 2.32458579520932,
-0.257578702370506, -0.789761851618986, 0.0979274657020477, -0.00803566278013502,
1.42984177159549, 1.45485678109231, -0.956556613290905, 0.443323691839299,
-0.261951072972966, -1.30990441429799, 0.0921741874883992, -1.02612779569131,
0.81550719514697, -0.403037731404182, -0.384422139459082, 0.417074857491798,
-1.37128032791855, -0.0796160137501127, 1.35302483988882, -0.752751140138746,
0.812453275384099, -1.32443072805549, -1.66986584340583), afterValue = c(2.67176351335094,
-2.35492690509713, -0.0917099675669388, -0.362851831626841, -1.4162843393352,
2.05218144382074, -2.12568427901904, -2.77742514848958, -0.558306182843248,
-1.24415954975022, -0.869472343362331, 1.44485538931333, 2.73104948477609,
1.01754247530805, -4.71267174035743, 1.9712327179732, -0.564457911016569,
0.507659170771878, -0.31699694238194, -3.31762368638082, 1.09068172988414,
4.37537723545199, -0.116850493406969, 1.9533832597394, -1.69003563933244,
2.62250581307257, -0.00837379068728961, 1.84192937988371, -0.675899868505659,
2.08506660046288, -0.583526785879512, 0.699298693972492, -1.26172199141024,
1.23589313451783, -1.56008919968504, 0.436686458587792, 0.11699090169902,
-1.07206510594109, 1.21204947218164, -0.812406581646911, 0.50373332256566,
-0.084945367568491, -0.236015748624917, -0.479606239480476, -0.596799139055039,
-0.562575023441403, -0.339935276865152, -0.213813544612318, -0.265296303857373,
-1.12545083569158, 0.0105156062602101, 0.635695183644557, 0.767433440961415,
0.16648012185356, 0.544633089427927, -0.904001384160196, -0.429299134808951,
0.764224744168297, -0.166062348771635, -0.101892580202475)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -60L), .Names = c("person",
"currentTest", "beforeValue", "afterValue"))
We can use dcast from reshape2
library(reshape2)
meltdf <- melt(resultsData, id.vars=1:2)
dcast(meltdf, person ~ currentTest + variable)
> dcast(meltdf, person ~ currentTest + variable)
person A_beforeValue A_afterValue B_beforeValue B_afterValue C_beforeValue C_afterValue D_beforeValue D_afterValue E_beforeValue
1 1 1.28429706 2.67176351 -0.47279938 -0.8694723 0.5774694 1.090681730 -0.70953916 -0.5835268 -0.008035663
2 2 -0.61835955 -2.35492691 0.05972074 1.4448554 1.9274817 4.375377235 1.02775241 0.6992987 1.429841772
3 3 0.03945743 -0.09170997 0.92420153 2.7310495 -0.2455936 -0.116850493 -0.50490096 -1.2617220 1.454856781
4 4 -0.44860832 -0.36285183 0.65888418 1.0175425 0.1261048 1.953383260 0.40769259 1.2358931 -0.956556613
5 5 -0.96177712 -1.41628434 -1.98980726 -4.7126717 -0.5593383 -1.690035639 -0.86853101 -1.5600892 0.443323692
6 6 0.70247190 2.05218144 0.66024130 1.9712327 1.2980212 2.622505813 0.94951851 0.4366865 -0.261951073
7 7 -0.45522204 -2.12568428 0.08963695 -0.5644579 0.7194067 -0.008373791 2.32458580 0.1169909 -1.309904414
8 8 -1.23154913 -2.77742515 -0.82839994 0.5076592 0.9694145 1.841929380 -0.25757870 -1.0720651 0.092174187
9 9 -0.79723499 -0.55830618 -0.83807424 -0.3169969 -0.8146971 -0.675899869 -0.78976185 1.2120495 -1.026127796
10 10 -0.70973496 -1.24415955 -1.65919710 -3.3176237 0.8646598 2.085066600 0.09792747 -0.8124066 0.815507195
E_afterValue F_beforeValue F_afterValue
1 0.50373332 -0.40303773 0.01051561
2 -0.08494537 -0.38442214 0.63569518
3 -0.23601575 0.41707486 0.76743344
4 -0.47960624 -1.37128033 0.16648012
5 -0.59679914 -0.07961601 0.54463309
6 -0.56257502 1.35302484 -0.90400138
7 -0.33993528 -0.75275114 -0.42929913
8 -0.21381354 0.81245328 0.76422474
9 -0.26529630 -1.32443073 -0.16606235
10 -1.12545084 -1.66986584 -0.10189258
You can use a combined gather + spread approach; Gather the *Values columns and combine with currentTest to form the new header, then spread to wide format:
resultsData %>%
gather(key, value, -person, -currentTest) %>%
unite(header, c('currentTest', 'key'), sep = "_") %>%
spread(header, value)
# A tibble: 10 x 13
# person A_afterValue A_beforeValue B_afterValue B_beforeValue C_afterValue C_beforeValue
# * <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 2.67176351 1.28429706 -0.8694723 -0.47279938 1.090681730 0.5774694
# 2 2 -2.35492691 -0.61835955 1.4448554 0.05972074 4.375377235 1.9274817
# 3 3 -0.09170997 0.03945743 2.7310495 0.92420153 -0.116850493 -0.2455936
# 4 4 -0.36285183 -0.44860832 1.0175425 0.65888418 1.953383260 0.1261048
# 5 5 -1.41628434 -0.96177712 -4.7126717 -1.98980726 -1.690035639 -0.5593383
# 6 6 2.05218144 0.70247190 1.9712327 0.66024130 2.622505813 1.2980212
# 7 7 -2.12568428 -0.45522204 -0.5644579 0.08963695 -0.008373791 0.7194067
# 8 8 -2.77742515 -1.23154913 0.5076592 -0.82839994 1.841929380 0.9694145
# 9 9 -0.55830618 -0.79723499 -0.3169969 -0.83807424 -0.675899869 -0.8146971
#10 10 -1.24415955 -0.70973496 -3.3176237 -1.65919710 2.085066600 0.8646598
# ... with 6 more variables: D_afterValue <dbl>, D_beforeValue <dbl>, E_afterValue <dbl>,
# E_beforeValue <dbl>, F_afterValue <dbl>, F_beforeValue <dbl>
If you need to rename the columns:
resultsData %>%
gather(key, value, -person, -currentTest) %>%
unite(header, c('currentTest', 'key'), sep = "_") %>%
spread(header, value) %>%
rename_at(vars(matches("Value$")), funs(gsub("Value$", "", .)))
We could do this in a single line using recast
reshape2::recast(resultsData, person ~currentTest + variable, id.var = 1:2)
#person A_beforeValue A_afterValue B_beforeValue B_afterValue C_beforeValue C_afterValue D_beforeValue D_afterValue
#1 1 1.28429706 2.67176351 -0.47279938 -0.8694723 0.5774694 1.090681730 -0.70953916 -0.5835268
#2 2 -0.61835955 -2.35492691 0.05972074 1.4448554 1.9274817 4.375377235 1.02775241 0.6992987
#3 3 0.03945743 -0.09170997 0.92420153 2.7310495 -0.2455936 -0.116850493 -0.50490096 -1.2617220
#4 4 -0.44860832 -0.36285183 0.65888418 1.0175425 0.1261048 1.953383260 0.40769259 1.2358931
#5 5 -0.96177712 -1.41628434 -1.98980726 -4.7126717 -0.5593383 -1.690035639 -0.86853101 -1.5600892
#6 6 0.70247190 2.05218144 0.66024130 1.9712327 1.2980212 2.622505813 0.94951851 0.4366865
#7 7 -0.45522204 -2.12568428 0.08963695 -0.5644579 0.7194067 -0.008373791 2.32458580 0.1169909
#8 8 -1.23154913 -2.77742515 -0.82839994 0.5076592 0.9694145 1.841929380 -0.25757870 -1.0720651
#9 9 -0.79723499 -0.55830618 -0.83807424 -0.3169969 -0.8146971 -0.675899869 -0.78976185 1.2120495
#10 10 -0.70973496 -1.24415955 -1.65919710 -3.3176237 0.8646598 2.085066600 0.09792747 -0.8124066
# E_beforeValue E_afterValue F_beforeValue F_afterValue
#1 -0.008035663 0.50373332 -0.40303773 0.01051561
#2 1.429841772 -0.08494537 -0.38442214 0.63569518
#3 1.454856781 -0.23601575 0.41707486 0.76743344
#4 -0.956556613 -0.47960624 -1.37128033 0.16648012
#5 0.443323692 -0.59679914 -0.07961601 0.54463309
#6 -0.261951073 -0.56257502 1.35302484 -0.90400138
#7 -1.309904414 -0.33993528 -0.75275114 -0.42929913
#8 0.092174187 -0.21381354 0.81245328 0.76422474
#9 -1.026127796 -0.26529630 -1.32443073 -0.16606235
#10 0.815507195 -1.12545084 -1.66986584 -0.10189258
Is there a way to fill in for implicit missingness for future dates based on id?
For example, imagine a experiment that starts in Jan-2016. I have 3 participants that join in at different periods. Subject 1 joins me in Jan and continues to stay until Aug. Subj 2 joins me in March, and stays in the experiment until August. Subject 3 also joins me in March, but drops out sometime in in May, so no observations are recorded for periods May-Aug.
The question is, how do I fill in the dates when subject 3 dropped out of the experiment? Here is some mock data for how things look like:
Subject Date
1 1 Jan-16
2 1 Feb-16
3 1 Mar-16
4 1 Apr-16
5 1 May-16
6 1 Jun-16
7 1 Jul-16
8 1 Aug-16
9 2 Mar-16
10 2 Apr-16
11 2 May-16
12 2 Jun-16
13 2 Jul-16
14 2 Aug-16
15 3 Mar-16
16 3 Apr-16
structure(list(Subject = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 3L, 3L), Date = structure(c(5L, 4L, 8L, 2L,
9L, 7L, 6L, 3L, 8L, 2L, 9L, 7L, 6L, 3L, 8L, 2L), .Label = c("",
"Apr-16", "Aug-16", "Feb-16", "Jan-16", "Jul-16", "Jun-16", "Mar-16",
"May-16"), class = "factor")), class = "data.frame", row.names = c(NA,
-16L), .Names = c("Subject", "Date"))
And here is how the data should look like:
Subject Date
1 1 Jan-16
2 1 Feb-16
3 1 Mar-16
4 1 Apr-16
5 1 May-16
6 1 Jun-16
7 1 Jul-16
8 1 Aug-16
9 2 Mar-16
10 2 Apr-16
11 2 May-16
12 2 Jun-16
13 2 Jul-16
14 2 Aug-16
15 3 Mar-16
16 3 Apr-16
17 3 May-16
18 3 Jun-16
19 3 Jul-16
20 3 Aug-16
structure(list(Subject = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), Date = structure(c(4L,
3L, 7L, 1L, 8L, 6L, 5L, 2L, 7L, 1L, 8L, 6L, 5L, 2L, 7L, 1L, 8L,
6L, 5L, 2L), .Label = c("Apr-16", "Aug-16", "Feb-16", "Jan-16",
"Jul-16", "Jun-16", "Mar-16", "May-16"), class = "factor")), class = "data.frame", row.names = c(NA,
-20L), .Names = c("Subject", "Date"))
I tried using expand from tidyr and TimeFill from DataCombine package, but the issue with those approaches is that I would get dates for periods before a participant joined an experiment. In this particular instance, I only want the periods to be filled for cases when a participant drops out of an experiment.
The complete function from tidyr is designed for turning implicit missing values into explicit missing values. We will have to do some filtering to not include past completion. The easiest way seems to be to do a join on a table with starting values:
library(dplyr)
library(tidyr)
df <- df %>%
filter(Date != '') %>%
droplevels() %>%
group_by(Subject)
df2 <- summarise(df, start = first(Date))
df %>%
complete(Subject, Date) %>%
left_join(df2) %>%
mutate(Date2 = as.Date(paste0('01-', Date), format = '%d-%b-%y'),
start = as.Date(paste0('01-', start), format = '%d-%b-%y')) %>%
filter(Date2 >= start) %>%
arrange(Subject, Date2) %>%
select(-start, -Date2)
Result:
Source: local data frame [20 x 2]
Groups: Subject [3]
Subject Date
<int> <fctr>
1 1 Jan-16
2 1 Feb-16
3 1 Mar-16
4 1 Apr-16
5 1 May-16
6 1 Jun-16
7 1 Jul-16
8 1 Aug-16
9 2 Mar-16
10 2 Apr-16
11 2 May-16
12 2 Jun-16
13 2 Jul-16
14 2 Aug-16
15 3 Mar-16
16 3 Apr-16
17 3 May-16
18 3 Jun-16
19 3 Jul-16
20 3 Aug-16
I use date conversion to do a reliable comparison with the starting date, but you might be able to use the row_numbers somehow. A problem is that complete will rearrange the data.
p.s. Note that your example dput has an empty factor level (""), so I filter that out first.