Get the means of subgroups of means in R

I'm new to R and I don't know how to get R to calculate the means of subgroups of means, i.e. means whose inputs are themselves means of smaller subgroups. Let me explain more clearly.
I have a data frame like this:
GROUP WORD WLN
1 1 4
1 1 3
1 1 3
1 2 2
1 2 2
1 2 3
2 3 1
2 3 1
2 3 2
2 4 1
2 4 1
2 4 1
... ... ...
but the real one has a total of 5 groups and 25 words (5 words per group; every word has been assigned a number from 1 to 4 by 5 subjects...).
I need the mean of WLN for every word, which I can get easily with a loop, saving the results in a vector; but then I need a vector with the means of these means according to the group the words belong to. So I need the mean of the word means for group 1, then for group 2, and so on (I hope that's clear).
How can I get this without doing it group by group?

With base R, using aggregate:
> aggregate(WLN~GROUP+WORD, mean, data=df)
GROUP WORD WLN
1 1 1 3.333333
2 1 2 2.333333
3 2 3 1.333333
4 2 4 1.000000
where df is @Metrics' data.
Another alternative is summaryBy from the doBy package:
> library(doBy)
> summaryBy(WLN~GROUP+WORD, FUN=mean, data=df)
GROUP WORD WLN.mean
1 1 1 3.333333
2 1 2 2.333333
3 2 3 1.333333
4 2 4 1.000000
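Either of these gives the per-word means. To also get the mean of those means within each GROUP (the OP's second step), one more aggregate over the per-word result works; a minimal base R sketch:
> word_means <- aggregate(WLN ~ GROUP + WORD, data = df, FUN = mean)
> aggregate(WLN ~ GROUP, data = word_means, FUN = mean)
GROUP WLN
1 1 2.833333
2 2 1.166667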

Assume df is your dataframe:
df<-structure(list(GROUP = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L), WORD = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L,
4L, 4L), WLN = c(4L, 3L, 3L, 2L, 2L, 3L, 1L, 1L, 2L, 1L, 1L,
1L)), .Names = c("GROUP", "WORD", "WLN"), class = "data.frame", row.names = c(NA,
-12L))
plyr solution:
install.packages("plyr")
library(plyr)
ddply(df,.(GROUP,WORD),summarize, meanwln=mean(WLN))
GROUP WORD meanwln
1 1 1 3.333333
2 1 2 2.333333
3 2 3 1.333333
4 2 4 1.000000
data.table solution:
install.packages("data.table")
library(data.table)
df<-data.table(df)
setkey(df,GROUP,WORD)
df[,list(meanwln=mean(WLN)),by="GROUP,WORD"]
GROUP WORD meanwln
1: 1 1 3.333333
2: 1 2 2.333333
3: 2 3 1.333333
4: 2 4 1.000000

with base:
with(df,tapply(WLN,list(GROUP,WORD),mean))
Edit:
If you also want row and column means for the table above, you could do something like this:
x <- with(df,tapply(WLN,list(GROUP,WORD),mean))
addmargins(x, margin = seq_along(dim(x)), FUN = mean, quiet = TRUE)

And now dplyr is even better...
require(dplyr)
tmp <- group_by(df, WORD)
df1 <- summarise(tmp,
                 count = n(),
                 mWLN = mean(WLN, na.rm = TRUE))
df1
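To push this to the second level the OP described (the mean of the word means within each GROUP), here is a minimal sketch assuming current dplyr semantics, where each summarise() peels off the innermost grouping variable:
df %>%
  group_by(GROUP, WORD) %>%
  summarise(mWLN = mean(WLN, na.rm = TRUE)) %>%  # per-word means, still grouped by GROUP
  summarise(group_mean = mean(mWLN))             # mean of those means for each GROUP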

Related

Unique Instances By Two Groups in R

I'm trying to determine the number of unique customers per week per store.
I have a piece of code that accomplishes this task but the tabulation is not what I am looking for.
I have the following table:
store week customer_ID
1 1 1
1 1 1
1 1 2
1 2 1
1 2 2
1 2 3
2 1 1
2 1 1
2 1 2
2 2 2
2 2 3
2 2 3
So for every week I need to count how many unique customers there were.
Say, for example, customer 1 visited in week 1 and then revisited in week 2; that would not count as a new unique visit.
If that same customer visited store 2 in week 1 or any other week, that would count as a unique visit for store 2.
The outcome would look like the following:
store week unique Customers
1 1 2
1 2 1
2 1 2
2 2 1
I used the following, but it's not correct:
agg <- aggregate(data=df, customer_ID~ week+store, function(x) length(unique(x)))
structure(list(store = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L), week = c(1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L,
2L, 2L), customer_ID = c(1L, 1L, 2L, 1L, 2L, 3L, 1L, 1L, 2L,
2L, 3L, 3L)), .Names = c("store", "week", "customer_ID"), class = "data.frame", row.names = c(NA,
-12L))
Here is a base R method. The idea is to split the data into a list of data.frames, one for each store. Assuming the observations are ordered by week, drop duplicated customer_ID values within each store. Each subset is then aggregated with your function, and do.call with rbind puts the results back into a single data.frame:
do.call(rbind, lapply(split(df, df$store),
function(i) aggregate(data=i[!duplicated(i$customer_ID),],
customer_ID ~ week+store, length)))
week store customer_ID
1.1 1 1 2
1.2 2 1 1
2.1 1 2 2
2.2 2 2 1
To make sure that your data.frame is ordered properly before attempting this, you could use order:
df <- df[order(df$store, df$week), ]
In case it is of interest, I put together a data.table solution as well.
library(data.table)
setDT(df)
df[df[, !duplicated(customer_ID), by=store]$V1,
.(newCust=length(customer_ID)), by=.(store, week)]
store week newCust
1: 1 1 2
2: 1 2 1
3: 2 1 2
4: 2 2 1
This method uses the logical vector df[, !duplicated(customer_ID), by=store]$V1 to subset the data to first-occurrence customer IDs within each store, and then counts those new customers by store and week.
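For comparison, the same idea can be sketched in dplyr (assuming the rows are sorted so that a customer's first visit to a store is their earliest week; arrange() takes care of that here):
library(dplyr)
df %>%
  arrange(store, week) %>%              # first occurrence per store = earliest week
  group_by(store) %>%
  filter(!duplicated(customer_ID)) %>%  # keep only each customer's first visit to that store
  group_by(store, week) %>%
  summarise(unique_customers = n())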

R - sample and resample a person-period file

I am working with a gigantic person-period file, and I thought a good way to deal with such a large dataset is to use a sampling and resampling technique.
My person-period file looks like this:
id code time
1 1 a 1
2 1 a 2
3 1 a 3
4 2 b 1
5 2 c 2
6 2 b 3
7 3 c 1
8 3 c 2
9 3 c 3
10 4 c 1
11 4 a 2
12 4 c 3
13 5 a 1
14 5 c 2
15 5 a 3
I actually have two distinct issues.
The first issue is that I am having trouble simply sampling a person-period file.
For example, I would like to sample 2 id-sequences, such as:
id code time
1 a 1
1 a 2
1 a 3
2 b 1
2 c 2
2 b 3
The following line of code works for sampling a person-period file:
dt[which(dt$id %in% sample(dt$id, 2)), ]
However, I would like a dplyr solution because I am interested in resampling, and in particular I would like to use replicate.
I am interested in doing something like replicate(100, sample_n(dt, 2), simplify = FALSE).
I am struggling with the dplyr solution because I am not sure what the grouping variable should be.
library(dplyr)
dt %>% group_by(id) %>% sample_n(1)
gives me an incorrect result because it does not keep the full sequence of each id.
Any clue how I could both sample and resample a person-period file?
data
dt = structure(list(id = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L,
3L, 4L, 4L, 4L, 5L, 5L, 5L), .Label = c("1", "2", "3", "4", "5"
), class = "factor"), code = structure(c(1L, 1L, 1L, 2L, 3L,
2L, 3L, 3L, 3L, 3L, 1L, 3L, 1L, 3L, 1L), .Label = c("a", "b",
"c"), class = "factor"), time = structure(c(1L, 2L, 3L, 1L, 2L,
3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("1", "2",
"3"), class = "factor")), .Names = c("id", "code", "time"), row.names = c(NA,
-15L), class = "data.frame")
I think the idiomatic way would probably look like
set.seed(1)
samp = df %>% select(id) %>% distinct %>% sample_n(2)
left_join(samp, df)
id code time
1 2 b 1
2 2 c 2
3 2 b 3
4 5 a 1
5 5 c 2
6 5 a 3
This extends straightforwardly to more grouping variables and fancier sampling rules.
If you need to do this many times...
nrep = 100
ng = 2
samps = df %>% select(id) %>% distinct %>%
slice(rep(1:n(), nrep)) %>% mutate(r = rep(1:nrep, each = n()/nrep)) %>%
group_by(r) %>% sample_n(ng)
repdat = left_join(samps, df)
# then do stuff with it:
repdat %>% group_by(r) %>% do_stuff
I imagine you are doing some simulations and may want to do the subsetting many times. You probably also want to try this data.table method and utilize the fast binary search feature on the key column:
library(data.table)
setDT(dt)
setkey(dt, id)
replicate(2, dt[list(sample(unique(id), 2))], simplify = F)
#[[1]]
# id code time
#1: 3 c 1
#2: 3 c 2
#3: 3 c 3
#4: 5 a 1
#5: 5 c 2
#6: 5 a 3
#[[2]]
# id code time
#1: 3 c 1
#2: 3 c 2
#3: 3 c 3
#4: 4 c 1
#5: 4 a 2
#6: 4 c 3
We can use filter with sample:
dt %>%
filter(id %in% sample(unique(id),2, replace = FALSE))
NOTE: The OP asked for a dplyr method, and this solution does use dplyr.
If we need replicate, one option would be map from purrr:
library(purrr)
dt %>%
distinct(id) %>%
replicate(2, .) %>%
map(~sample(., 2, replace=FALSE)) %>%
map(~filter(dt, id %in% .))
#$id
# id code time
#1 1 a 1
#2 1 a 2
#3 1 a 3
#4 4 c 1
#5 4 a 2
#6 4 c 3
#$id
# id code time
#1 4 c 1
#2 4 a 2
#3 4 c 3
#4 5 a 1
#5 5 c 2
#6 5 a 3
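Since the OP asked about replicate specifically, the filter() approach above also drops straight into it; a sketch that returns a list of resampled data frames:
set.seed(1)
reps <- replicate(100,
                  dt %>% filter(id %in% sample(unique(id), 2, replace = FALSE)),
                  simplify = FALSE)
reps[[1]]  # first resample: the full person-period block for 2 sampled ids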

Deleting Rows per ID when value gets greater than... minus 2

I have the following data frame
id<-c(1,1,1,1,1,1,1,1,2,2,2,2,3,3,3,3)
time<-c(0,1,2,3,4,5,6,7,0,1,2,3,0,1,2,3)
value<-c(1,1,6,1,2,0,0,1,2,6,2,2,1,1,6,1)
d<-data.frame(id, time, value)
The value 6 appears only once for each id. For every id, I would like to remove all rows after the line with the value 6, except for the first two lines coming after it.
I searched and found a similar problem, but I couldn't adapt the solution myself, so I am using the code from that thread.
In the above case the final data frame should be:
id time value
1 0 1
1 1 1
1 2 6
1 3 1
1 4 2
2 0 2
2 1 6
2 2 2
2 3 2
3 0 1
3 1 1
3 2 6
3 3 1
One of the solutions given seems to get very close to what I need, but I didn't manage to adapt it. Could you help me?
library(plyr)
ddply(d, "id",
function(x) {
if (any(x$value == 6)) {
subset(x, time <= x[x$value == 6, "time"])
} else {
x
}
}
)
Thank you very much.
We could use data.table. Convert the 'data.frame' to a 'data.table' (setDT(d)). Grouped by the 'id' column, get the position of the 'value' that equals 6 and add 2 to it. Take the minimum of that position and the number of rows in the group (.N), build the seq up to it, and use that to subset the dataset. We can also add an if/else condition to check whether there is any 6 in the 'value' column and otherwise return .SD without any subsetting.
library(data.table)
setDT(d)[, if(any(value==6)) .SD[seq(min(c(which(value==6) + 2, .N)))]
else .SD, by = id]
# id time value
# 1: 1 0 1
# 2: 1 1 1
# 3: 1 2 6
# 4: 1 3 1
# 5: 1 4 2
# 6: 2 0 2
# 7: 2 1 6
# 8: 2 2 2
# 9: 2 3 2
#10: 3 0 1
#11: 3 1 1
#12: 3 2 6
#13: 3 3 1
#14: 4 0 1
#15: 4 1 2
#16: 4 2 5
Or, as @Arun mentioned in the comments, we can use head to subset, which would be faster:
setDT(d)[, if(any(value==6)) head(.SD, which(value==6L)+2L) else .SD, by = id]
Or, using dplyr, we group by 'id', get the position of the 'value' equal to 6 with which, add 2, build the seq, and use that numeric index within slice to extract the rows:
library(dplyr)
d %>%
group_by(id) %>%
slice(seq(which(value==6)+2))
# id time value
#1 1 0 1
#2 1 1 1
#3 1 2 6
#4 1 3 1
#5 1 4 2
#6 2 0 2
#7 2 1 6
#8 2 2 2
#9 2 3 2
#10 3 0 1
#11 3 1 1
#12 3 2 6
#13 3 3 1
#14 4 0 1
#15 4 1 2
#16 4 2 5
data
d <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L,
3L, 3L, 3L, 4L, 4L, 4L), time = c(0L, 1L, 2L, 3L, 4L, 0L, 1L,
2L, 3L, 0L, 1L, 2L, 3L, 0L, 1L, 2L), value = c(1L, 1L, 6L, 1L,
2L, 2L, 6L, 2L, 2L, 1L, 1L, 6L, 1L, 1L, 2L, 5L)), .Names = c("id",
"time", "value"), class = "data.frame", row.names = c(NA, -16L))

Conditionally create new columns based on specific numeric values (keys) from an existing column

I have a data.frame df where the column x is populated with integers (1-9). I would like to update columns y and z based on the value of x as follows:
if x is 1,2, or 3 | y = 1 ## if x is 1,4, or 7 | z = 1
if x is 4,5, or 6 | y = 2 ## if x is 2,5, or 8 | z = 2
if x is 7,8, or 9 | y = 3 ## if x is 3,6, or 9 | z = 3
Below is a data.frame with the desired output for y and z
df <- structure(list(x = c(1L, 2L, 3L, 3L, 4L, 2L, 1L, 2L, 5L, 2L,
1L, 6L, 3L, 7L, 3L, 2L, 1L, 4L, 3L, 2L), y = c(1L, 1L, 1L, 1L,
2L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 3L, 1L, 1L, 1L, 2L, 1L, 1L
), z = c(1L, 2L, 3L, 3L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 3L, 3L,
1L, 3L, 2L, 1L, 1L, 3L, 2L)), .Names = c("x", "y", "z"), class = "data.frame", row.names = c(NA,
-20L))
I can write a for-loop with multiple if statements to fill y and z row by row, but that doesn't seem very R-like: it is not vectorized. Is there a method to specify which numeric values correspond to which new numeric values? Something like a map or key indicating what each value becomes based on the original value.
Solution #1: Lookup Vector
Assuming the mismatches I pointed out in my comment are mistakes in the data, and not in the rules, then you can accomplish this as follows:
x2y <- rep(1:3,each=3);
x2z <- rep(1:3,3);
df$y <- x2y[df$x];
df$z <- x2z[df$x];
df1 <- df; ## for identical() calls later
df;
## x y z
## 1 1 1 1
## 2 2 1 2
## 3 3 1 3
## 4 3 1 3
## 5 4 2 1
## 6 2 1 2
## 7 1 1 1
## 8 2 1 2
## 9 5 2 2
## 10 2 1 2
## 11 1 1 1
## 12 6 2 3
## 13 3 1 3
## 14 7 3 1
## 15 3 1 3
## 16 2 1 2
## 17 1 1 1
## 18 4 2 1
## 19 3 1 3
## 20 2 1 2
The above solution depends on the fact that the domain of x consists of contiguous integer values beginning at 1, so a direct index into a "lookup vector" suffices. If x began at a very high number but was still contiguous, you could make this solution work by subtracting one less than the minimum of x before indexing.
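For illustration only (hypothetical values, not part of the OP's data), if x ran from 101 to 109, the same lookup vectors would still work after shifting the index:
x_hi <- df$x + 100L;                   ## hypothetical shifted input
y_hi <- x2y[x_hi - (min(x_hi) - 1L)];  ## subtract one less than the minimum before indexing
z_hi <- x2z[x_hi - (min(x_hi) - 1L)];
identical(y_hi, df$y); identical(z_hi, df$z);
## [1] TRUE
## [1] TRUE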
Solution #2: Lookup Table
If you don't like this assumption, then you can accomplish the task with a lookup table:
library('data.table');
lookup <- data.table(x=1:9,y=x2y,z=x2z,key='x');
lookup;
## x y z
## 1: 1 1 1
## 2: 2 1 2
## 3: 3 1 3
## 4: 4 2 1
## 5: 5 2 2
## 6: 6 2 3
## 7: 7 3 1
## 8: 8 3 2
## 9: 9 3 3
df[c('y','z')] <- lookup[df['x'],.(y,z)];
identical(df,df1);
## [1] TRUE
Or base R approach:
lookup <- data.frame(x=1:9,y=x2y,z=x2z);
lookup;
## x y z
## 1 1 1 1
## 2 2 1 2
## 3 3 1 3
## 4 4 2 1
## 5 5 2 2
## 6 6 2 3
## 7 7 3 1
## 8 8 3 2
## 9 9 3 3
df[c('y','z')] <- lookup[match(df$x,lookup$x),c('y','z')];
identical(df,df1);
## [1] TRUE
Solution #3: Arithmetic Expression
Yet another alternative is to devise arithmetic expressions equivalent to the mapping:
df$y <- (df$x-1L)%/%3L+1L;
df$z <- 3L--df$x%%3L;
identical(df,df1);
## [1] TRUE
This particular solution is dependent on the fact that your mapping happens to possess a regularity that lends itself to arithmetic description.
With regard to implementation, it also takes advantage of a bit of a non-obvious property of R precedence rules (actually this is true of other languages as well, such as C/C++ and Java), namely that unary negative is higher than modulus which is higher than binary subtraction, thus the calculation for df$z is equivalent to 3L-((-df$x)%%3L).
To go into more detail on the z calculation: it is not possible to describe the mapping with a straight df$x%%3, because the inputs 3, 6, and 9 would mod to zero. That could be solved with a simple index-assign operation, but I wanted a simpler, purely arithmetic solution. To get from zero to 3 we could subtract df$x%%3 from 3, but that would mess up (invert) the remaining values. By taking the mod of the negative of the input values we "pre-invert" them; subtracting all of them from 3 then "rights" them again and also converts the zeroes into 3, as desired.
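A quick numeric check of that z arithmetic over the full 1-9 domain (just a verification sketch):
x <- 1:9;
(-x) %% 3L;       ## 2 1 0 2 1 0 2 1 0  (the "pre-inverted" remainders)
3L - (-x) %% 3L;  ## 1 2 3 1 2 3 1 2 3  (the desired z mapping)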

New dataframe with difference between first and last values of repeated measurements?

I am working with time series data and want to calculate the difference between the first and final measurements (both time and value) and put these numbers into a new, simpler dataframe. For example, for this dataframe:
structure(list(time = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), indv = c(1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L), value = c(1L, 3L, 5L, 8L, 3L, 4L,
7L, 8L)), .Names = c("time", "indv", "value"), class = "data.frame", row.names = c(NA,
-8L))
or
time indv value
1 1 1
2 1 3
3 1 5
4 1 8
1 2 3
2 2 4
3 2 7
4 2 8
I can use this code
ddply(test, .(indv), transform, value_change = (value[length(value)] - value[1]), time_change = (time[length(time)] - time[1]))
to give
time indv value value_change time_change
1 1 1 7 3
2 1 3 7 3
3 1 5 7 3
4 1 8 7 3
1 2 3 5 3
2 2 4 5 3
3 2 7 5 3
4 2 8 5 3
However, I would like to eliminate the redundant rows and make a new and simpler dataframe like this
indv time_change value_change
1 3 7
2 3 5
Does anyone have any clever way to do this?
Thanks!
Just replace transform with summarize. You can also make your code a little prettier by using head and tail:
ddply(test, .(indv), summarize,
value_change = tail(value, 1) - head(value, 1),
time_change = tail(time, 1) - head(time, 1))
For maximum readability, write a function:
change <- function(x) tail(x, 1) - head(x, 1)
ddply(test, .(indv), summarize, value_change = change(value),
time_change = change(time))
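If you prefer dplyr, the same idea translates directly; a minimal sketch assuming the dplyr package (first() and last() are its helpers for the first and last elements of a vector):
library(dplyr)
test %>%
  group_by(indv) %>%
  summarise(value_change = last(value) - first(value),
            time_change  = last(time)  - first(time))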
