Sum certain values from changing dataframe in R - r

I have a data frame that I would like to aggregate by adding certain values. Say I have six clusters. I then feed data from each cluster into some function that generates a value x which is then put into the output data frame.
cluster year lambda v e x
1 1 1 -0.12160997 -0.31105287 -0.253391178 15
2 1 2 -0.12160997 -1.06313732 -0.300349972 10
3 1 3 -0.12160997 -0.06704185 0.754397069 40
4 2 1 -0.07378295 -0.31105287 -1.331764904 4
5 2 2 -0.07378295 -1.06313732 0.279413039 19
6 2 3 -0.07378295 -0.06704185 -0.004581941 23
7 3 1 -0.02809310 -0.31105287 0.239647063 28
8 3 2 -0.02809310 -1.06313732 1.284568047 38
9 3 3 -0.02809310 -0.06704185 -0.294881283 18
10 4 1 0.33479251 -0.31105287 -0.480496125 15
11 4 2 0.33479251 -1.06313732 -0.380251626 12
12 4 3 0.33479251 -0.06704185 -0.078851036 34
13 5 1 0.27953088 -0.31105287 1.435456851 100
14 5 2 0.27953088 -1.06313732 -0.795435607 0
15 5 3 0.27953088 -0.06704185 -0.166848530 0
16 6 1 0.29409366 -0.31105287 0.126647655 44
17 6 2 0.29409366 -1.06313732 0.162961658 18
18 6 3 0.29409366 -0.06704185 -0.812316265 13
To aggregate, I then add up the x value for cluster 1 across all three years with seroconv.cluster1=sum(data.all[c(1:3),6]) and repeat for each cluster.
Every time I change the number of clusters right now I have to manually change the addition of the x's. I would like to be able to say n.vec <- seq(6, 12, by=2) and feed n.vec into the functions and get x and have R add up the x values for each cluster every time with the number of clusters changing. So it would do 6 clusters and add up all the x's per cluster. Then 8 and add up the x's and so on.

It seems you are asking for an easy way to split your data up, apply a function (sum in this case) and then combine it all back together. Split apply combine is a common data strategy, and there are several split/apply/combine strategies in R, the most popular being ave in base, the dplyr package and the data.table package.
Here's an example for your data using dplyr:
library(dplyr)
df %>% group_by(cluster, year) %>% summarise_each(funs(sum))

To get the sum of x for each cluster as a vector, you can use tapply:
tapply(df$x, df$cluster, sum)
# 1 2 3 4 5 6
# 65 46 84 61 100 75
If you instead wanted to output as a data frame, you could use aggregate:
aggregate(x~cluster, sum, data=df)
# cluster x
# 1 1 65
# 2 2 46
# 3 3 84
# 4 4 61
# 5 5 100
# 6 6 75

Related

Creating an summary dataset with multiple objects and multiple observations per object

I have a dataset with the reports from a local shop, where each line has a client's ID, date of purchase and total value per purchase.
I want to create a new plot where for each client ID I have all the purchases in the last month or even just sample purchases in a range of dates I choose.
The main problem is that certain customers might buy once a month, while others can come daily - so the number of observations per period of time can vary.
I have tried subsetting my dataset to a specific range of time, but either I choose a specific date - and then I only get a small % of all customers, or I choose a range and get multiple observations for certain customers.
(In this case - I wouldn't mind getting the earliest observation)
An important note: I know how to create a for loop to solve this problem, but since the dataset is over 4 million observations it isn't practical since it would take an extremely long time to run.
A basic example of what the dataset looks like:
ID Date Sum
1 1 1 234
2 1 2 45
3 1 3 1
4 2 4 223
5 3 5 546
6 4 6 12
7 2 1 20
8 4 3 30
9 6 2 3
10 3 5 45
11 7 6 456
12 3 7 65
13 8 8 234
14 1 9 45
15 3 2 1
16 4 3 223
17 6 6 546
18 3 4 12
19 8 7 20
20 9 5 30
21 11 6 3
22 12 6 45
23 14 9 456
24 15 10 65
....
And the new data set would look something like this:
ID 1Date 1Sum 2Date 2Sum 3Date 3Sum
1 1 234 2 45 3 1
2 1 20 4 223 NA NA
3 2 1 5 546 5 45
...
Thanks for your help!
I think you can do this with a bit if help from dplyr and tidyr
library(dplyr)
library(tidyr)
dd %>% group_by(ID) %>% mutate(seq=1:n()) %>%
pivot_wider("ID", names_from="seq", values_from = c("Date","Sum"))
Where dd is your sample data frame above.

R: how to use expand.grid to generate combinations based on group

I am trying to get all combinations of values per group. I want to prevent combination of values between different groups.
To create all combinations of values (no matter which group the value belongs) vaI can use:
expand.grid(value, value)
Awaited result should be the subset of result of previous command.
Example:
#base data
value = c(1,3,5, 1,5,7,9, 2)
group = c("a", "a", "a","b","b","b","b", "c")
base <- data.frame(value, group)
#creating ALL combinations of value
allComb <- expand.grid(base$value, base$value)
#awaited result is subset of allComb.
#Note: first colums shows the number of row from allComb.
#Empty rows are separating combinations per group and are shown only for clarification.
Var1 Var2
1 1 1
2 3 1
3 5 1
11 1 3
12 3 3
13 5 3
21 1 5
22 3 5
23 5 5
34 1 1
35 5 1
36 7 1
37 9 1
44 1 5
45 5 5
46 7 5
47 9 5
54 1 7
55 5 7
56 7 7
57 9 7
64 1 9
65 5 9
66 7 9
67 9 9
78 2 2

r - aggregate / substract two variables, rows

I'm using the aggregate function for calculating the difference for every observation of two variables,so somehow like this (and the I want to save the result as a new variable) :
data1
Group Points_Attempt1 Points_Attempt2
1 1 10 5
2 1 34 23
3 1 50 5
4 1 10 12
5 2 11 21
6 2 23 23
7 2 32 10
8 2 12 10
I'm able to do something like this:
aggregate(data1[c("Points_Attempt1","Points_Attempt2")],list(data1$group),diff)
But I want it for every single observations and I just do not now to select the observations, so somehow the row numbers (here from 1-8).
So I'm searching for the following fourth column (Difference), which I then would like to safe as a new variable:
Group Points_Attempt1 Points_Attempt2 Difference
1 1 10 5 5
2 1 34 23 11
3 1 50 5 45
4 1 10 12 -2
5 2 11 21 -10
6 2 23 23 0
7 2 32 10 22
8 2 12 10 2
I would be highly thankful, if someone could help me with this.
We can use mutate_each
library(dplyr)
data1 %>%
group_by(Group) %>%
mutate_each(funs(c(NA, diff(.))), 2:3)
Or if we need to subtract between the variables,
data1 %>%
mutate(Difference = Points_Attemp1 - Points_Attemp2)

Split data when time intervals exceed a defined value

I have a data frame of GPS locations with a column of seconds. How can I split create a new column based on time-gaps? i.e. for this data.frame:
df <- data.frame(secs=c(1,2,3,4,5,6,7,10,11,12,13,14,20,21,22,23,24,28,29,31))
I would like to cut the data frame when there is a time gap between locations of 3 or more seconds seconds and create a new column entitled 'bouts' which gives a running tally of the number of sections to give a data frame looking like this:
id secs bouts
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
7 7 1
8 10 2
9 11 2
10 12 2
11 13 2
12 14 2
13 20 3
14 21 3
15 22 3
16 23 3
17 24 3
18 28 4
19 29 4
20 31 4
Use cumsum and diff:
df$bouts <- cumsum(c(1, diff(df$secs) >= 3))
Remember that logical values get coerced to numeric values 0/1 automatically and that diff output is always one element shorter than its input.

How to run a loop on different sections of the same data.frame [duplicate]

This question already has answers here:
Grouping functions (tapply, by, aggregate) and the *apply family
(10 answers)
Closed 7 years ago.
Suppose I have a data frame with 2 variables which I'm trying to run some basic summary stats on. I would like to run a loop to give me the difference between minimum and maximum seconds values for each unique value of number. My actual data frame is huge and contains many values for 'number' so subsetting and running individually is not a realistic option. Data looks like this:
df <- data.frame(number=c(1,1,1,2,2,2,2,3,3,4,4,4,4,4,4,5,5,5,5),
seconds=c(1,4,8,1,5,11,23,1,8,1,9,11,24,44,112,1,34,55,109))
number seconds
1 1 1
2 1 4
3 1 8
4 2 1
5 2 5
6 2 11
7 2 23
8 3 1
9 3 8
10 4 1
11 4 9
12 4 11
13 4 24
14 4 44
15 4 112
16 5 1
17 5 34
18 5 55
19 5 109
my current code only returns the value of the difference between minimum and maximum seconds for the entire data fram:
ZZ <- unique(df$number)
for (i in ZZ){
Y <- max(df$seconds) - min(df$seconds)
}
Since you have a lot of data performance should matter and you should use a data.table instead of a data.frame:
library(data.table)
dt <- as.data.table(df)
dt[, .(spread = (max(seconds) - min(seconds))), by=.(number)]
number spread
1: 1 7
2: 2 22
3: 3 7
4: 4 111
5: 5 108

Resources