Choosing a different number of elements from each group in R

I am working on the Kaggle Instacart competition, but I am quite new to R and have run into something I cannot figure out.
I have a dataset with 4 columns. The first column is an order ID (id1). The second column is a product ID (id2). The third column is the probability that I want to select product id2 from order id1, which can be treated simply as a ranking: a higher probability is always selected over a lower one. Finally, the fourth column is the number of products I want to select from the given order (a feature of the order). For example, here are the first 12 rows of the data frame df:
id1 id2 prob num
1 17 13107 0.4756982 3
2 17 21463 0.3724126 3
3 17 38777 0.3534422 3
4 17 21709 0.3364623 3
5 17 47766 0.3364623 3
6 17 39275 0.3165896 3
7 34 16083 0.4093785 4
8 34 39475 0.3892882 4
9 34 47766 0.3892882 4
10 34 2596 0.3837562 4
11 34 21137 0.3762758 4
12 34 47792 0.3737032 4
We can see that for id1 = 17 I want to choose 3 elements, and for id1 = 34 I want to choose 4 elements. The result should then be
ID1 ID2
17 13107, 21463, 38777
34 16083, 39475, 47766, 2596
or something similar to this.
At the moment I have tried using
df %>% group_by(id1) %>% top_n(n = num)
but I get the error
Selecting by num
Error in is_scalar_integerish(n) : object 'num' not found
Anyone know how I would go about doing this?
Thanks

You can pipe the grouped data directly into a summarise statement:
df %>% group_by(id1) %>% summarise(id2 = toString(id2[seq_len(first(num))]))
## A tibble: 2 x 2
# id1 id2
# <int> <chr>
#1 17 13107, 21463, 38777
#2 34 16083, 39475, 47766, 2596
In this statement, first(num) extracts the group's num value, seq_len() turns it into the sequence from 1 to num, and that sequence subsets the first num values of id2.
The toString call then collapses those values into one string per id1 group.
Here's another base R option using aggregate:
aggregate(id2 ~ id1, FUN=toString, subset(df, ave(id1, id1, FUN=seq_along) <= num))
# id1 id2
#1 17 13107, 21463, 38777
#2 34 16083, 39475, 47766, 2596
Please note that I assumed the data was already ordered (as in the example) by decreasing probability.
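If the rows are not already sorted, the same selection can be sketched in base R by ordering first. The toy data frame below is a shuffled subset of the question's rows (with num reduced to 2 for order 34), and the names ord, rank_in_group, and top are made up for this illustration; it is a sketch under those assumptions rather than the answerer's own code.

```r
# Toy data: rows are shuffled, so the order must be established explicitly
df <- data.frame(
  id1  = c(17L, 17L, 17L, 17L, 34L, 34L, 34L),
  id2  = c(21463L, 13107L, 21709L, 38777L, 39475L, 16083L, 2596L),
  prob = c(0.3724126, 0.4756982, 0.3364623, 0.3534422,
           0.3892882, 0.4093785, 0.3837562),
  num  = c(3L, 3L, 3L, 3L, 2L, 2L, 2L)
)

# Sort by order ID, then by decreasing probability within each order
ord <- df[order(df$id1, -df$prob), ]

# Position of each row within its id1 group (1 = highest probability)
rank_in_group <- ave(seq_len(nrow(ord)), ord$id1, FUN = seq_along)

# Keep only the first num rows of each group
top <- ord[rank_in_group <= ord$num, ]

# Collapse the selected product IDs into one string per order
res <- aggregate(id2 ~ id1, data = top, FUN = toString)
print(res)
```

Sorting explicitly removes the dependence on the input order, at the cost of one extra pass over the data.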

In base R, you can use Map on the list of data frames split by ID with split to apply head to select the respective number of rows for each ID. The number of selected rows is supplied by feeding tapply the column of interest and selecting the first value with head. A data.frame with the corresponding rows is returned using do.call with rbind.
do.call(rbind, Map(head, split(dat, dat$id1), tapply(dat$num, dat$id1, head, 1)))
id1 id2 prob num
17.1 17 13107 0.4756982 3
17.2 17 21463 0.3724126 3
17.3 17 38777 0.3534422 3
34.7 34 16083 0.4093785 4
34.8 34 39475 0.3892882 4
34.9 34 47766 0.3892882 4
34.10 34 2596 0.3837562 4
It is a bit simpler to return a named list of the first dat$num elements, where the names in the list correspond to id1.
Map(head, split(dat$id2, dat$id1), tapply(dat$num, dat$id1, head, 1))
$`17`
[1] 13107 21463 38777
$`34`
[1] 16083 39475 47766 2596
data
dat <-
structure(list(id1 = c(17L, 17L, 17L, 17L, 17L, 17L, 34L, 34L,
34L, 34L, 34L, 34L), id2 = c(13107L, 21463L, 38777L, 21709L,
47766L, 39275L, 16083L, 39475L, 47766L, 2596L, 21137L, 47792L
), prob = c(0.4756982, 0.3724126, 0.3534422, 0.3364623, 0.3364623,
0.3165896, 0.4093785, 0.3892882, 0.3892882, 0.3837562, 0.3762758,
0.3737032), num = c(3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L,
4L)), .Names = c("id1", "id2", "prob", "num"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"))

Having one row per ID may seem nice, but a list column often ends up being a pain to work with; it's not "tidy." Here's a simple dplyr pipeline that sticks to the verbs that make sense: separate by group, filter rows, put back together.
df %>%
group_by(id1) %>%
filter(seq_along(num) <= num) %>%
ungroup() %>%
select(id1, id2)
# A tibble: 7 x 2
id1 id2
<int> <int>
1 17 13107
2 17 21463
3 17 38777
4 34 16083
5 34 39475
6 34 47766
7 34 2596

You can try this using @lmo's data:
dat %>% group_by(id1) %>% arrange(-prob) %>% dplyr::summarise(ID2 = paste(id2[1:unique(num)], collapse = ","))

With data.table:
library(data.table)
setDT(df)[order(-prob), .(id2 = toString(head(id2, first(num)))), by = id1]
id1 id2
1: 17 13107, 21463, 38777
2: 34 16083, 39475, 47766, 2596
Here, df is coerced to data.table, ordered by decreasing probability. For each group in id1, the num topmost values are picked and aggregated into one string.
This returns id2 as character. If further processing is required, it may be more useful to keep the id2 values separate in a list column:
setDT(df)[order(-prob), .(id2 = list(head(id2, first(num)))), by = id1]
Data
df <- fread(
"rn id1 id2 prob num
1 17 13107 0.4756982 3
2 17 21463 0.3724126 3
3 17 38777 0.3534422 3
4 17 21709 0.3364623 3
5 17 47766 0.3364623 3
6 17 39275 0.3165896 3
7 34 16083 0.4093785 4
8 34 39475 0.3892882 4
9 34 47766 0.3892882 4
10 34 2596 0.3837562 4
11 34 21137 0.3762758 4
12 34 47792 0.3737032 4")

Related

(R) How to copy paste values from one column based on another column and ID in R

For simplicity reasons, let's assume I have two columns.
First: ID (string of codes such as AA23, BA53, NA, etc.)
Second: Age (18, 32, 55, 23, etc.)
IDs sometimes repeat (e.g., one person, AA23, filled in the survey on several days but was asked for their age only on the first day, not on the second or third).
I want to copy paste values from the Age column based on the ID, so that I have a 'long format' of the dataframe.
dput(data):
structure(list(Code = c("MW68", "AW80", "EW40", "BW60", "Wn36",
"ZK45", "SI55", "MW68", "EW40", "DC06", NA, "IW28"), Age = c("52",
"26", "34", "26", "20", "35", NA, NA, NA, NA, NA, NA)), row.names = c(5L,
6L, 7L, 8L, 9L, 10L, 400L, 401L, 402L, 403L, 404L, 405L), class = "data.frame")
Input:
ID Age
AA23 18
BA53 32
AC13 55
AA23 NA
BA53 NA
AC13 NA
NA 23
AA23 NA
(the trick is that sometimes ID is NA)
And the desired output:
ID Age
AA23 18
BA53 32
AC13 55
AA23 18
BA53 32
AC13 55
NA 23
AA23 18
Thank you in advance!
You can also use the function coalesce, which returns the first non-missing value at each position across its arguments. Here every NA in Age is replaced by the first Age value within its Code group (the grouping variable):
library(dplyr)
df %>%
group_by(Code) %>%
mutate(across(Age, ~ coalesce(.x, first(.x))))
# A tibble: 12 x 2
# Groups: Code [10]
Code Age
<chr> <chr>
1 MW68 52
2 AW80 26
3 EW40 34
4 BW60 26
5 Wn36 20
6 ZK45 35
7 SI55 NA
8 MW68 52
9 EW40 34
10 DC06 NA
11 NA NA
12 IW28 NA
I'm not quite sure I understood correctly what you want to do, but this code looks for rows where Age is NA and fills in the mean Age of the other rows with the same Code. If no row with that Code has an Age anywhere in the table, the mean is taken over nothing and the result is NaN. If different rows with the same Code have different Age values, their mean is filled in, since you didn't specify what to do in such a case.
# Age is stored as character in the example data; convert it so mean() works
data$Age <- as.numeric(data$Age)
for (i in seq_len(nrow(data))) {
  if (!is.na(data$Code[i]) && is.na(data$Age[i])) {
    # which() drops the NA comparisons produced by rows with a missing Code
    data$Age[i] <- mean(data$Age[which(data$Code == data$Code[i])], na.rm = TRUE)
  }
}
This skips rows with NA in the Code column.
Here's a solution based on zoo's function na.locf ("in the case of NA, last observation carried forward"): first you group by Code, then you mutate column Age using ifelse, carrying the last non-NA observation forward:
library(zoo)
data %>%
group_by(Code) %>%
mutate(Age = ifelse(is.na(Age), na.locf(Age), Age))
# A tibble: 12 x 2
# Groups: Code [10]
Code Age
<chr> <chr>
1 MW68 52
2 AW80 26
3 EW40 34
4 BW60 26
5 Wn36 20
6 ZK45 35
7 SI55 NA
8 MW68 52 # <- value `carried forward`
9 EW40 34 # <- value `carried forward`
10 DC06 NA
11 NA NA
12 IW28 NA
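For comparison, the same fill can be sketched in base R with ave. This uses the simplified ID/Age table from the question (renamed to Code/Age, with Age as numeric), and fill_first is a hypothetical helper written for this illustration. Rows with a missing Code keep their own Age, as in the desired output.

```r
# Simplified example from the question; Age is assumed to be numeric here
data <- data.frame(
  Code = c("AA23", "BA53", "AC13", "AA23", "BA53", "AC13", NA, "AA23"),
  Age  = c(18, 32, 55, NA, NA, NA, 23, NA),
  stringsAsFactors = FALSE
)

# Replace NAs in x with the first non-missing value of x, if there is one
fill_first <- function(x) {
  known <- x[!is.na(x)]
  if (length(known) == 0) return(x)
  ifelse(is.na(x), known[1], x)
}

# Rows without a Code are left untouched; the rest are filled per Code
has_code <- !is.na(data$Code)
data$Age[has_code] <- ave(data$Age[has_code], data$Code[has_code], FUN = fill_first)
print(data)
```

Groups that have no Age at all simply stay NA, so no special handling is needed for them.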

Select last observation of a date variable - SPSS or R

I'm relatively new to R, so I realise this type of question is asked often, but I've read a lot of Stack Overflow posts and still can't quite get something to work on my data.
I have data in SPSS, in two datasets imported into R. Both of my datasets include an id (IDC), which I have been using to merge them. Before merging, I need to filter one of the datasets to select specifically the last observation of a date variable.
My dataset, d1, has a longitudinal measure in long format. There are multiple rows per IDC, representing different places of residence (neighborhood). Each row has its own "start_date", which is a variable that is NOT necessarily unique.
As it looks in SPSS:
IDC neighborhood start_date
1 22 08.07.2001
1 44 04.02.2005
1 13 21.06.2010
2 44 24.12.2014
2 3 06.03.2002
3 22 04.01.2006
4 13 08.07.2001
4 2 15.06.2011
In R, the start dates do not look the same; instead each is one long number like "13529462400". I do not understand this format, but I assume it still retains the date order...
Here are all my attempts so far to select the last date. All of them ran without error, but the output wasn't what I wanted. As far as I can tell, none of them changed the number of repetitions of IDC, so none of them actually selected only the last date.
##### attempt 1 --- not working
d1$start_date_filt <- d1$start_date
d1[order(d1$IDC,d1$start_date_filt),] # Sort by ID and week
d1[!duplicated(d1$IDC, fromLast=T),] # Keep last observation per ID)
###### attempt 2--- not working
myid.uni <- unique(d1$IDC)
a<-length(myid.uni)
last <- c()
for (i in 1:a) {
temp<-subset(d1, IDC==myid.uni[i])
if (dim(temp)[1] > 1) {
last.temp<-temp[dim(temp)[1],]
}
else {
last.temp<-temp
}
last<-rbind(last, last.temp)
}
last
##### attempt 3 -- doesn't work
do.call("rbind",
by(d1,INDICES = d1$IDC,
FUN=function(DF)
DF[which.max(DF$start_date),]))
#### attempt 4 -- doesn't work
library(plyr)
ddply(d1,.(IDC), function(X)
X[which.max(X$start_date),])
### merger code -- in case something has to change with that after only the last start_date is selected
merge(d1,d2, IDC)
My goal dataset d1 would look like this:
IDC neighborhood start_date
1 13 21.06.2010
2 44 24.12.2014
3 22 04.01.2006
4 2 15.06.2011
I'm grateful for any help, many thanks <3
Most approaches run into the same problem with this data: your dates are strings in day.month.year format, which do not sort chronologically as text. Any string-based comparison that works here does so only by coincidence, because the maximum day-of-month also happens to fall in the maximum year.
It is generally better to store that column as a Date object in R, so that comparisons are chronological.
dat$start_date <- as.Date(dat$start_date, format = "%d.%m.%Y")
dat
# IDC neighborhood start_date
# 1 1 22 2001-07-08
# 2 1 44 2005-02-04
# 3 1 13 2010-06-21
# 4 2 44 2014-12-24
# 5 2 3 2002-03-06
# 6 3 22 2006-01-04
# 7 4 13 2001-07-08
# 8 4 2 2011-06-15
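If the column instead arrives as a long number such as the 13529462400 mentioned in the question, it is most likely a raw SPSS date value, which counts seconds since 1582-10-14. A minimal sketch of the conversion, assuming that is indeed how the column was imported (the variable names are illustrative):

```r
# SPSS date value as it may appear after import (seconds since 1582-10-14)
spss_seconds <- 13529462400

# Convert seconds to days, then count from the SPSS origin
start_date <- as.Date(spss_seconds / 86400, origin = "1582-10-14")
print(start_date)
```

The same expression works on a whole column, for example on d1$start_date, before any per-ID selection is done.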
From here, things are a bit simpler:
Base R
do.call(rbind, by(dat, dat[,c("IDC"),drop=FALSE], function(z) z[which.max(z$start_date),]))
# IDC neighborhood start_date
# 1 1 13 2010-06-21
# 2 2 44 2014-12-24
# 3 3 22 2006-01-04
# 4 4 2 2011-06-15
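The sort-then-deduplicate idea from the question's first attempt also works once the sorted result is actually stored; in the original attempt the sorted data frame is printed and discarded, so the later line still operates on the unsorted d1. A minimal sketch on the example data (sorted and last_per_id are names chosen for this illustration):

```r
dat <- data.frame(
  IDC = c(1L, 1L, 1L, 2L, 2L, 3L, 4L, 4L),
  neighborhood = c(22L, 44L, 13L, 44L, 3L, 22L, 13L, 2L),
  start_date = as.Date(
    c("08.07.2001", "04.02.2005", "21.06.2010", "24.12.2014",
      "06.03.2002", "04.01.2006", "08.07.2001", "15.06.2011"),
    format = "%d.%m.%Y"
  )
)

# Sort by ID and date, and keep the sorted copy
sorted <- dat[order(dat$IDC, dat$start_date), ]

# Within each IDC the last row is now the latest date
last_per_id <- sorted[!duplicated(sorted$IDC, fromLast = TRUE), ]
print(last_per_id)
```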
dplyr
dat %>%
group_by(IDC) %>%
slice(which.max(start_date)) %>%
ungroup()
# # A tibble: 4 x 3
# IDC neighborhood start_date
# <int> <int> <date>
# 1 1 13 2010-06-21
# 2 2 44 2014-12-24
# 3 3 22 2006-01-04
# 4 4 2 2011-06-15
Data
dat <- structure(list(IDC = c(1L, 1L, 1L, 2L, 2L, 3L, 4L, 4L), neighborhood = c(22L, 44L, 13L, 44L, 3L, 22L, 13L, 2L), start_date = c("08.07.2001", "04.02.2005", "21.06.2010", "24.12.2014", "06.03.2002", "04.01.2006", "08.07.2001", "15.06.2011")), class = "data.frame", row.names = c(NA, -8L))

Randomly sampling groups, followed by sampling within these sampled groups

My dataset contains several groups and each group can have a different number of unique observations. I carry out some calculations by group (simplified in the code below), resulting in a summary value for each group. Next, for the purpose of a bootstrap, I want to:
Randomly sample the groups with replacement (number of sampled groups = equal to number of different groups in the original dataset)
Within these sampled groups, randomly sample observations with replacement (number of sampled observations per group = equal to number of unique observations in that group in the original dataset)
A simplified version of my data set up (data1):
data1:
id group y
1001 1 10
1002 1 15
1003 1 3
3002 2 24
3003 2 15
3005 2 37
3006 2 32
3007 2 11
4001 3 12
4002 3 15
5006 4 7
5007 4 9
5009 4 22
5010 4 19
E.g. based on the dataset example above: there are 4 groups in the original dataset, so I want to sample 4 groups with replacement (e.g. groups sampled = groups 4,3,3,1), and then sample observations/rows from those 4 groups (4 ids from group 4 (e.g. 5007, 5007, 5006, 5009); 2 ids from group 3 (twice, as group 3 was sampled twice), and 3 ids from group 1, all with replacement), and return the sampled rows together in a dataframe (4+2+2+3 = 11 rows).
For the above, I have some code working for these steps separately, but I cannot seem to combine them:
# Calculate group value
y.group <- tapply(data1$y,data1$group,mean)
# Step 1. Sample groups, with replacement:
sampled.group <- sample(1:length(unique(data1$group)),replace=T)
# Step 2. Sample within groups, with replacement
data2 <- data.frame(data1 %>%
group_by(group) %>% # for each group
sample_frac(1, replace = TRUE) %>%
ungroup)
Obviously, the code above in full does not do what I want, as in step 2 the sampled groups from step 1 are ignored since it just uses the original group var (I am aware of this). I have tried to solve this using step 1 and trying to generate a new dataframe containing only the sampled groups' observations (with duplicates if a group was sampled more than once, which is likely to happen), and then apply step 2 to that new dataframe, but I cannot get this to work.
I think I am just on the wrong path or overthinking things. Hopefully you can give me some advice on how to proceed.
Edit: While awaiting any potential solutions, I continued on the question myself and ended up with:
total.result <- c()
for (j in 1:length(unique(data1$group))){
sampled.group <- sample(1:length(unique(data1$group)),size=1,replace=T)
group.result <- sample_n(data1[data1$group==sampled.group,],
size=length(unique(data1$id[data1$group==sampled.group])),replace=T)
total.result <- rbind(total.result,group.result)
}
(So basically using a loop to sample the groups one at a time, creating datasets for each, and then sampling individual rows from those, and finally combining the results with rbind)
However, I think Allan Cameron's solution (see below) is more straightforward, so I have accepted that one as the answer to my question.
I think this is what you're looking for. Let's start with your data in a reproducible format:
data1 <- structure(list(id = structure(1:14, .Label = c("1001", "1002",
"1003", "3002", "3003", "3005", "3006", "3007", "4001", "4002",
"5006", "5007", "5009", "5010"), class = "factor"), group = structure(c(1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 4L), .Label = c("1",
"2", "3", "4"), class = "factor"), y = structure(c(1L, 4L, 8L,
7L, 4L, 10L, 9L, 2L, 3L, 4L, 11L, 12L, 6L, 5L), .Label = c("10",
"11", "12", "15", "19", "22", "24", "3", "32", "37", "7", "9"
), class = "factor")), class = "data.frame", row.names = c(NA,
-14L))
And just to make sure:
data1
#> id group y
#> 1 1001 1 10
#> 2 1002 1 15
#> 3 1003 1 3
#> 4 3002 2 24
#> 5 3003 2 15
#> 6 3005 2 37
#> 7 3006 2 32
#> 8 3007 2 11
#> 9 4001 3 12
#> 10 4002 3 15
#> 11 5006 4 7
#> 12 5007 4 9
#> 13 5009 4 22
#> 14 5010 4 19
We start by splitting the data frame by group into smaller data frames, using the split function. This gives us a list with four data frames, each one containing all the members of its respective group. (The set.seed is there purely to make this example reproducible).
set.seed(69)
split_dfs <- split(data1, data1$group)
Now we can sample this list, giving us a new list of four data frames drawn with replacement from split_dfs. Each one will again contain all the members of its respective group, though of course some whole groups might be sampled more than once, and other whole groups not sampled at all.
sampled_group_dfs <- split_dfs[sample(length(split_dfs), replace = TRUE)]
Now we can sample within each group by sampling with replacement from the rows of each data frame in our new list. We do this for all the data frames in our list by using lapply:
all_sampled <- lapply(sampled_group_dfs, function(x) x[sample(nrow(x), replace = TRUE), ])
All that remains is to stick all the resultant dataframes in this list back together to get our result:
result <- do.call(rbind, all_sampled)
As you can see from the final result, it just so happens that each of the four groups was sampled once (this is just by chance - alter set.seed to get different results). However, within the groups there have clearly been some duplicates drawn. In fact, since R mandates unique row names in a data frame, these are easy to pick out by the .1 that has been appended to the duplicate row names. If you don't like this, you can reset the row names with rownames(result) <- seq(nrow(result))
result
#> id group y
#> 4.14 5010 4 19
#> 4.14.1 5010 4 19
#> 4.11 5006 4 7
#> 4.13 5009 4 22
#> 1.3 1003 1 3
#> 1.3.1 1003 1 3
#> 1.2 1002 1 15
#> 3.9 4001 3 12
#> 3.9.1 4001 3 12
#> 2.5 3003 2 15
#> 2.5.1 3003 2 15
#> 2.6 3005 2 37
#> 2.7 3006 2 32
#> 2.5.2 3003 2 15
Created on 2020-02-15 by the reprex package (v0.3.0)
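For a bootstrap, the split, sample, and rbind steps can be wrapped in a function and repeated. The sketch below assumes plain numeric columns rather than the factors in the dput above; resample_two_stage is a hypothetical helper name, and the overall mean of y is only a stand-in for whatever statistic is actually of interest.

```r
data1 <- data.frame(
  id    = c(1001, 1002, 1003, 3002, 3003, 3005, 3006, 3007,
            4001, 4002, 5006, 5007, 5009, 5010),
  group = c(1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4),
  y     = c(10, 15, 3, 24, 15, 37, 32, 11, 12, 15, 7, 9, 22, 19)
)

# One two-stage resample: draw groups with replacement,
# then draw rows with replacement within each drawn group
resample_two_stage <- function(data, group_col) {
  groups <- split(data, data[[group_col]])
  picked <- groups[sample(length(groups), replace = TRUE)]
  rows <- lapply(picked, function(g) {
    g[sample(nrow(g), replace = TRUE), , drop = FALSE]
  })
  out <- do.call(rbind, rows)
  rownames(out) <- NULL  # drop the suffixed row names
  out
}

set.seed(1)  # only to make the sketch reproducible
boot_means <- replicate(200, mean(resample_two_stage(data1, "group")$y))
print(quantile(boot_means, c(0.025, 0.975)))
```

Because whole groups are drawn first, the number of rows varies from one replicate to the next, which is what the two-stage design in the question implies.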

Group date variable when dates are close

I am trying to write a function or use cut to assign a grouping variable to some date data when those dates are close (with a user-defined meaning of close). For example, I would like to create a common grouping variable for some samples that were collected on consecutive dates. I was thinking cut would work here, but then I realized cut doesn't group values when they are close and instead creates a series of groups based on a sequence.
So take this dataframe for example:
df <- structure(list(Num = c(0.888401849195361, 0.185766335576773,
0.493163562379777, 0.13070688676089, 0.484760325402021, 0.603240836178884,
0.893201333936304, 0.641203448642045, 0.16957180458121, 0.0101411847863346
), Date = structure(c(10592, 10597, 10598, 10605, 10606, 10608,
10609, 10616, 10617, 10618), class = "Date"), day = c(1L, 6L,
7L, 14L, 15L, 17L, 18L, 25L, 26L, 27L)), .Names = c("Num", "Date",
"day"), row.names = c(NA, -10L), class = "data.frame")
If I were to apply the cut function as I understand its usage, like this:
df$cutVar <- cut(df$day, breaks= seq(0, 31, by = 1), right=TRUE)
I would be left with a range that went right through values that I'd prefer to be grouped together. For example, the 6th and 7th should be grouped together based on their proximity to each other. Similar to 14th and 15th and so on.
> df
Num Date day cutVar
1 0.88840185 1999-01-01 1 (0,1]
2 0.18576634 1999-01-06 6 (5,6]
3 0.49316356 1999-01-07 7 (6,7]
4 0.13070689 1999-01-14 14 (13,14]
5 0.48476033 1999-01-15 15 (14,15]
6 0.60324084 1999-01-17 17 (16,17]
7 0.89320133 1999-01-18 18 (17,18]
8 0.64120345 1999-01-25 25 (24,25]
9 0.16957180 1999-01-26 26 (25,26]
10 0.01014118 1999-01-27 27 (26,27]
So the basic question here is how to group a continuous variable (a date in this instance) such that close (defined by the user) numbers are grouped together in a factor range?
Is this something you'd like? Here 3 is a threshold I chose for convenience; it can be any number you prefer:
df$group <- cumsum(c(1, diff.Date(df$Date)) >= 3)
df
Num Date day group
1 0.88840185 1999-01-01 1 0
2 0.18576634 1999-01-06 6 1
3 0.49316356 1999-01-07 7 1
4 0.13070689 1999-01-14 14 2
5 0.48476033 1999-01-15 15 2
6 0.60324084 1999-01-17 17 2
7 0.89320133 1999-01-18 18 2
8 0.64120345 1999-01-25 25 3
9 0.16957180 1999-01-26 26 3
10 0.01014118 1999-01-27 27 3
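The same idea can be wrapped in a small function with the threshold as a parameter. This is a sketch under the assumption that the dates are already sorted; group_by_gap and max_gap are names invented here, and group numbering starts at 1 rather than 0.

```r
# Assign a new group whenever the gap to the previous date exceeds max_gap days
group_by_gap <- function(dates, max_gap = 2) {
  days <- as.numeric(dates)
  stopifnot(!is.unsorted(days))   # the approach relies on sorted input
  gaps <- c(Inf, diff(days))      # Inf makes the first date start group 1
  cumsum(gaps > max_gap)
}

dates <- as.Date("1999-01-01") + c(0, 5, 6, 13, 14, 16, 17, 24, 25, 26)
groups <- group_by_gap(dates, max_gap = 2)
print(data.frame(Date = dates, group = groups))
```

With max_gap = 2 this reproduces the grouping shown above, shifted by one because the first date opens group 1.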

Subtraction on different rows and columns and separated by group

I really hate to ask two questions in a row but this is something that I can’t wrap my head around. So let’s say I have a data frame, as follows:
df
Row# User Morning Evening Measure Date
1 1 NA NA 2/18/11
2 1 50 115 2/19/11
3 1 85 128 2/20/11
4 1 62 NA 2/25/11
5 1 48 100.8 3/8/11
6 1 19 71 3/9/11
7 1 25 98 3/10/11
8 1 NA 105 3/11/11
9 2 48 105 2/18/11
10 2 28 203 2/19/11
11 2 35 80.99 2/21/11
12 2 91 78.25 2/22/11
Is it possible in R to take, for each user, the difference between the evening value of the previous consecutive day (only the previous calendar day, not merely the previous row) and the morning value of the current row? My desired result would be this.
df
Row# User Morning Evening Date Difference
1 1 NA NA 2/18/11 NA
2 1 50 115 2/19/11 NA
3 1 85 129 2/20/11 30
4 1 62 NA 2/25/11 NA
5 1 48 100.8 3/8/11 NA
6 1 19 71 3/9/11 81.8
7 1 25 98 3/10/11 46
8 1 10 105 3/11/11 88
9 2 48 105 2/18/11 NA
10 2 28 203 2/19/11 77
11 2 35 80.99 2/21/11 NA
12 2 91 78.25 2/22/11 -10.01
All I want this to do is to take the morning value and subtract it from the evening value of the previous consecutive day for each user group. As you can see, some parts of my data frame contain NA values in the morning and evening columns. In addition, not all of the dates are consecutive for each user, so NA should be assigned in those cases.
I've tried searching google but there wasn't much information on being able to apply functions to different rows for each group of rows on different columns (if that makes any sense).
My attempts include many variations of this.
df$Difference<-ave((df$Morning,df$Evening),
df$User,
FUN=function(x){
c('NA',diff(df$Evening-df$Morning)),na.rm=T
})
Again, any help would be greatly appreciated. Thanks.
Note: The input data you show and the output data are not the same. There is a NA which is replaced by 10 in output and the last date is 2/14/11 in input and 2/22/11 in output.
I've assumed the output to be the original data to create this answer to match your result.
df$Diff <- c(NA, head(df$Evening, -1) - tail(df$Morning, -1))
# two-digit years need %y; rows that do not follow the previous calendar day get NA
df$Diff[which(c(0, diff(as.Date(as.character(df$Measure_Date),
format="%m/%d/%y"))) != 1)] <- NA
> df
# Row User Morning Evening Measure_Date Diff
# 1 1 1 NA NA 2/18/11 NA
# 2 2 1 50 115.00 2/19/11 NA
# 3 3 1 85 128.00 2/20/11 30.00
# 4 4 1 62 NA 2/25/11 NA
# 5 5 1 48 100.80 3/8/11 NA
# 6 6 1 19 71.00 3/9/11 81.80
# 7 7 1 25 98.00 3/10/11 46.00
# 8 8 1 10 105.00 3/11/11 88.00
# 9 9 2 48 105.00 2/18/11 NA
# 10 10 2 28 203.00 2/19/11 77.00
# 11 11 2 35 80.99 2/21/11 NA
# 12 12 2 91 78.25 2/22/11 -10.01
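@user1342086's edit (which got rejected, but had the right idea):
df$Diff[which(diff(df$User) != 0) + 1] <- NA
takes care of the grouping by "User". The + 1 is needed because diff() flags the last row of a user, while the value to blank out sits on the first row of the next user.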
A blind first shot (untested). Relies on the data frame being already sorted by User and Measure_Date.
# if necessary, transform your dates from factor to Date
df$Measure_Date <- as.Date(as.character(df$Measure_Date), format = "%m/%d/%y")
df <- within(df,
# both conditions are padded with NA so they line up with the rows of df
Difference <- ifelse(c(NA, diff(Measure_Date)) == 1 & c(NA, diff(User)) == 0,
c(NA, head(Evening, -1)) - Morning, NA
)
)
I used plyr, so be sure you have it installed. This solution should work even if user data are mixed (i.e. not in consecutive rows) and dates are not in chronological order.
# Your example data, as you should post it for us to use
df <-
structure(list(User = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L), Morning = c(NA, 50L, 85L, 62L, 48L, 19L, 25L, NA, 48L,
28L, 35L, 91L), Evening = c(NA, 115, 128, NA, 100.8, 71, 98,
105, 105, 203, 80.99, 78.25), Measure_Date = structure(c(1L,
2L, 3L, 5L, 9L, 10L, 6L, 7L, 1L, 2L, 4L, 8L), .Label = c("2/18/11",
"2/19/11", "2/20/11", "2/21/11", "2/25/11", "3/10/11", "3/11/11",
"3/14/11", "3/8/11", "3/9/11"), class = "factor")), .Names = c("User",
"Morning", "Evening", "Measure_Date"), class = "data.frame", row.names = c(NA,
-12L))
# As already stated by Arun, you need the date as class Date
df$Measure_Date <- as.Date(df$Measure_Date, format='%m/%d/%y')
# Use plyr to process the data frame by user
library(package=plyr)
ddply(.data=df, .variables='User',
.fun=function(x){
# Complete sequence of dates for each user
tdf <- data.frame(Measure_Date=seq(from=min(x$Measure_Date),
to=max(x$Measure_Date),
by='1 day'))
# Merge to fill in NAs for unused dates
tdf <- merge(tdf, x, all=TRUE)
# Put desired values side by side
tdf$Evening <- c(NA, tdf$Evening[-length(tdf$Evening)])
# Difference
tdf$Difference <- tdf$Evening - tdf$Morning
# Return desired value to original data
tdf <- tdf[,c('Measure_Date', 'Difference')]
x <- merge(x, tdf)
x
})
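A compact base R sketch of the same calculation, for comparison. It uses a shortened toy version of the question's data (seven rows, dates already parsed), and the helper vectors prev_evening, same_user, and next_day are names made up for this illustration. A difference is kept only when the previous row belongs to the same user and is exactly one calendar day earlier.

```r
df <- data.frame(
  User    = c(1, 1, 1, 1, 2, 2, 2),
  Morning = c(NA, 50, 85, 62, 48, 28, 35),
  Evening = c(NA, 115, 128, NA, 105, 203, 80.99),
  Measure_Date = as.Date(c("2011-02-18", "2011-02-19", "2011-02-20", "2011-02-25",
                           "2011-02-18", "2011-02-19", "2011-02-21"))
)

# Make sure rows are ordered by user, then by date
df <- df[order(df$User, df$Measure_Date), ]

# Evening value of the previous row, aligned with the current row
prev_evening <- c(NA, head(df$Evening, -1))

# TRUE when the previous row is the same user / exactly one day earlier
same_user <- c(FALSE, diff(df$User) == 0)
next_day  <- c(FALSE, diff(as.numeric(df$Measure_Date)) == 1)

df$Difference <- ifelse(same_user & next_day, prev_evening - df$Morning, NA)
print(df)
```

Rows whose previous evening value is itself NA also end up as NA, which matches the desired output in the question.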
