Creating a summary dataset with multiple objects and multiple observations per object - r

I have a dataset with the reports from a local shop, where each line has a client's ID, date of purchase and total value per purchase.
I want to create a new plot where for each client ID I have all the purchases in the last month or even just sample purchases in a range of dates I choose.
The main problem is that certain customers might buy once a month, while others can come daily - so the number of observations per period of time can vary.
I have tried subsetting my dataset to a specific range of time, but either I choose a specific date - and then I only get a small % of all customers, or I choose a range and get multiple observations for certain customers.
(In this case - I wouldn't mind getting the earliest observation)
An important note: I know how to write a for loop to solve this problem, but with over 4 million observations in the dataset it isn't practical, because it would take an extremely long time to run.
A basic example of what the dataset looks like:
ID Date Sum
1 1 1 234
2 1 2 45
3 1 3 1
4 2 4 223
5 3 5 546
6 4 6 12
7 2 1 20
8 4 3 30
9 6 2 3
10 3 5 45
11 7 6 456
12 3 7 65
13 8 8 234
14 1 9 45
15 3 2 1
16 4 3 223
17 6 6 546
18 3 4 12
19 8 7 20
20 9 5 30
21 11 6 3
22 12 6 45
23 14 9 456
24 15 10 65
....
And the new data set would look something like this:
ID 1Date 1Sum 2Date 2Sum 3Date 3Sum
1 1 234 2 45 3 1
2 1 20 4 223 NA NA
3 2 1 5 546 5 45
...
Thanks for your help!

I think you can do this with a bit of help from dplyr and tidyr:
library(dplyr)
library(tidyr)
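dd %>%
  arrange(ID, Date) %>%              # earliest purchase first within each client
  group_by(ID) %>%
  mutate(seq = row_number()) %>%     # running purchase number per client
  ungroup() %>%
  pivot_wider(id_cols = ID, names_from = seq, values_from = c(Date, Sum))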
Where dd is your sample data frame above.
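If only the earliest purchase per client within a chosen date range is needed, as the question mentions, a vectorized base R sketch avoids both the loop and the reshape. The data frame below copies the first ten rows of the sample, and the bounds `from` and `to` are hypothetical values picked for illustration.

```r
# First ten rows of the sample data from the question
dd <- data.frame(
  ID   = c(1, 1, 1, 2, 3, 4, 2, 4, 6, 3),
  Date = c(1, 2, 3, 4, 5, 6, 1, 3, 2, 5),
  Sum  = c(234, 45, 1, 223, 546, 12, 20, 30, 3, 45)
)

# Hypothetical date window (assumption, not from the question)
from <- 2
to   <- 5

# Keep purchases inside the window, then sort so the earliest comes first per client
in_range <- dd[dd$Date >= from & dd$Date <= to, ]
in_range <- in_range[order(in_range$ID, in_range$Date), ]

# duplicated() flags every row after the first one for each ID
earliest <- in_range[!duplicated(in_range$ID), ]
rownames(earliest) <- NULL
earliest
```

Because `order()` and `duplicated()` are both vectorized, this should scale to millions of rows far better than an explicit loop over clients.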

Related

Is data with individual and event# sequence (not fixed time) considered panel data?

I have a dataset of individual events over a time period. Some example records are shown below; each individual has 2-4 records observed within the period. The event# is ordered by time, but the same event# does not occur on the same date for everyone (A's #1 event occurs on 6/1, while C's #1 event happens on 6/3). Should I analyze the data as an unbalanced panel with two dimensions, individual and event# (i.e., the time dimension)? If not, how should I treat this data? Thanks.
obs ind event# date var1 y
1 A 1 6/1 11 33
2 A 2 6/4 12 23
3 A 3 6/5 13 32
4 A 4 6/5 14 55
5 B 1 6/1 15 44
6 B 2 6/2 18 54
7 C 1 6/3 15 22
8 C 2 6/3 29 55
9 C 3 6/6 31 23
10 D 1 6/3 13 45
11 D 2 6/5 2 12
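As a starting point, one can at least check whether the (individual, event#) pair behaves like a panel index: it should identify each row uniquely, and the panel is unbalanced if individuals have different numbers of events. The sketch below is only a structural check in base R, not a recommendation of an estimation method; the column name `event` stands in for `event#` and is an assumption made for syntactic convenience.

```r
# Sample records from the question (event# renamed to `event`)
events <- data.frame(
  ind   = c("A", "A", "A", "A", "B", "B", "C", "C", "C", "D", "D"),
  event = c(1, 2, 3, 4, 1, 2, 1, 2, 3, 1, 2),
  var1  = c(11, 12, 13, 14, 15, 18, 15, 29, 31, 13, 2),
  y     = c(33, 23, 32, 55, 44, 54, 22, 55, 23, 45, 12)
)

# A valid panel index has no repeated (ind, event) combinations
index_is_unique <- !any(duplicated(events[c("ind", "event")]))

# Number of events per individual; unequal counts mean an unbalanced panel
n_per_ind <- table(events$ind)
is_balanced <- length(unique(as.vector(n_per_ind))) == 1

index_is_unique
n_per_ind
is_balanced
```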

R: Sum column from table 2 based on value in table 1, and store result in table 1

I am an R noob and hope some of you can help me.
I have two data sets:
- store (containing store data, including location coordinates (x,y). The locations are integer values corresponding to grid IDs)
- grid (containing all gridIDs (x,y) as well as a population variable TOT_P for each grid point)
What I want to achieve is this:
For each store I want to loop over the grid data and sum the population of the grid IDs close to the store's grid ID.
I.e., basically SUMIF the grid population variable, with the condition that
grid(x) <= store(x) + 1 &
grid(x) >= store(x) - 1 &
grid(y) <= store(y) + 1 &
grid(y) >= store(y) - 1
How can I accomplish that? My own take has been trying to use different things like merge, sapply, etc, but my R inexperience stops me from getting it right.
Thanks in advance!
Edit:
Sample data:
StoreName StoreX StoreY
Store1 3 6
Store2 5 2
TOT_P GridX GridY
8 1 1
7 2 1
3 3 1
3 4 1
22 5 1
20 6 1
9 7 1
28 1 2
8 2 2
3 3 2
12 4 2
12 5 2
15 6 2
7 7 2
3 1 3
3 2 3
3 3 3
4 4 3
13 5 3
18 6 3
3 7 3
61 1 4
25 2 4
5 3 4
20 4 4
23 5 4
72 6 4
14 7 4
178 1 5
407 2 5
26 3 5
167 4 5
58 5 5
113 6 5
73 7 5
76 1 6
3 2 6
3 3 6
3 4 6
4 5 6
13 6 6
18 7 6
3 1 7
61 2 7
25 3 7
26 4 7
167 5 7
58 6 7
113 7 7
The output I am looking for is
StoreName StoreX StoreY SUM_P
Store1 3 6 721
Store2 5 2 119
I.e., for Store1 it is the sum of TOT_P over the grid cells with X in [2, 4] and Y in [5, 7].
One approach would be to use dplyr to calculate the difference between each store and all grid points and then group and sum based on these new columns.
#import library
library(dplyr)
#create example store table
StoreName<-paste0("Store",1:2)
StoreX<-c(3,5)
StoreY<-c(6,2)
df.store<-data.frame(StoreName,StoreX,StoreY)
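#create example population data (copied example table from OP)
df.pop<-data.frame(
  TOT_P=as.integer(c(8,7,3,3,22,20,9, 28,8,3,12,12,15,7, 3,3,3,4,13,18,3,
                     61,25,5,20,23,72,14, 178,407,26,167,58,113,73,
                     76,3,3,3,4,13,18, 3,61,25,26,167,58,113)),
  GridX=rep(1:7,times=7),  #x runs 1..7 within each row of the grid
  GridY=rep(1:7,each=7))   #y is constant within each row of the grid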
#add dummy column to each table to enable cross join
df.store$k=1
df.pop$k=1
#dplyr to join, calculate absolute distance, filter and sum
df.store %>%
  inner_join(df.pop, by='k') %>%
  mutate(x.diff = abs(StoreX-GridX), y.diff=abs(StoreY-GridY)) %>%
  filter(x.diff<=1, y.diff<=1) %>%
  group_by(StoreName) %>%
  summarise(StoreX=max(StoreX), StoreY=max(StoreY), tot.pop = sum(TOT_P) )
#output:
StoreName StoreX StoreY tot.pop
<fctr> <dbl> <dbl> <int>
1 Store1 3 6 721
2 Store2 5 2 119
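For comparison, the same neighbourhood sum can be sketched in base R without a cross join, by computing one logical mask over the grid per store. This is an alternative to the dplyr approach above, not a replacement for it; the object names `store` and `grid` follow the question's description, and the data is copied from the sample tables.

```r
# Store table from the question
store <- data.frame(
  StoreName = c("Store1", "Store2"),
  StoreX = c(3, 5),
  StoreY = c(6, 2)
)

# 7 x 7 population grid from the question, listed row by row (GridY = 1..7)
grid <- data.frame(
  TOT_P = c(8, 7, 3, 3, 22, 20, 9,
            28, 8, 3, 12, 12, 15, 7,
            3, 3, 3, 4, 13, 18, 3,
            61, 25, 5, 20, 23, 72, 14,
            178, 407, 26, 167, 58, 113, 73,
            76, 3, 3, 3, 4, 13, 18,
            3, 61, 25, 26, 167, 58, 113),
  GridX = rep(1:7, times = 7),
  GridY = rep(1:7, each = 7)
)

# For each store, sum TOT_P over grid cells at most 1 away in both x and y
store$SUM_P <- mapply(function(sx, sy) {
  near <- abs(grid$GridX - sx) <= 1 & abs(grid$GridY - sy) <= 1
  sum(grid$TOT_P[near])
}, store$StoreX, store$StoreY)

store  # SUM_P is 721 for Store1 and 119 for Store2
```

This never materialises the full store-by-grid table, which may matter when both tables are large.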

reshape / restructure the data frame in R [duplicate]

This question already has answers here:
Reshaping multiple sets of measurement columns (wide format) into single columns (long format)
(8 answers)
Closed 6 years ago.
I'm cleaning a dataset, but its layout is not ideal. I have to reshape it, but I don't know how. The following is the original data frame:
Rater Rater ID Ratee1 Ratee2 Ratee3 Ratee1.item1 Ratee1.item2 Ratee2.item1 Ratee2.item2 Ratee3.item1 Ratee3.item2
A 12 701 702 800 1 2 3 4 5 6
B 23 45 46 49 3 3 3 3 3 3
C 24 80 81 28 2 3 4 5 6 9
I am wondering how to reshape it into the following:
Rater Rater ID Ratee item1 item2
A 12 701 1 2
A 12 702 3 4
A 12 800 5 6
B 23 45 3 3
B 23 46 3 3
B 23 49 3 3
C 24 80 2 3
C 24 81 4 5
C 24 28 6 9
This reshaping is a little different from the one in "Reshaping data.frame from wide to long format", because the original data has three parts.
The first part is about the rater's ID (Rater and Rater ID).
The second is about the ratees' IDs (Ratee1, Ratee2, Ratee3).
The third part is the rater's rating of each ratee (Ratee*.item1 and Ratee*.item2).
To make it clearer, let me briefly describe the data collection process.
First, a rater types in his own name and ID,
then nominates three persons (Ratee1 to Ratee3),
and then answers the questions regarding each ratee (for each ratee, there are two questions).
Does anyone know how to reshape this? Thanks!
We can use melt from data.table
library(data.table)
melt(setDT(df1),
     measure.vars = patterns("^Ratee\\d+$", "^Ratee\\d+\\.item1", "^Ratee\\d+\\.item2"),
     value.name = c("Ratee", "item1", "item2"))[, variable := NULL][order(Rater)]
# Rater RaterID Ratee item1 item2
#1: A 12 701 1 2
#2: A 12 702 3 4
#3: A 12 800 5 6
#4: B 23 45 3 3
#5: B 23 46 3 3
#6: B 23 49 3 3
#7: C 24 80 2 3
#8: C 24 81 4 5
#9: C 24 28 6 9
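If data.table is not available, a similar result can be sketched with base R's `reshape()`, passing the three column sets as a list in `varying`. This swaps in a different function from the answer above. The data frame `df1` is rebuilt from the question with the column named `RaterID` (no space), and the helper column name `slot` is an assumption made for illustration.

```r
# Wide data from the question
df1 <- data.frame(
  Rater = c("A", "B", "C"),
  RaterID = c(12, 23, 24),
  Ratee1 = c(701, 45, 80),
  Ratee2 = c(702, 46, 81),
  Ratee3 = c(800, 49, 28),
  Ratee1.item1 = c(1, 3, 2), Ratee1.item2 = c(2, 3, 3),
  Ratee2.item1 = c(3, 3, 4), Ratee2.item2 = c(4, 3, 5),
  Ratee3.item1 = c(5, 3, 6), Ratee3.item2 = c(6, 3, 9)
)

# Each list element holds the wide columns that collapse into one long column
long <- reshape(
  df1,
  direction = "long",
  varying = list(
    c("Ratee1", "Ratee2", "Ratee3"),
    c("Ratee1.item1", "Ratee2.item1", "Ratee3.item1"),
    c("Ratee1.item2", "Ratee2.item2", "Ratee3.item2")
  ),
  v.names = c("Ratee", "item1", "item2"),
  idvar = c("Rater", "RaterID"),
  timevar = "slot"   # 1, 2, 3: which nominee the row came from
)

# Sort by rater, then nominee slot, and drop the helper column
long <- long[order(long$Rater, long$slot), ]
long$slot <- NULL
rownames(long) <- NULL
long
```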

r - aggregate / subtract two variables, rows

I'm using the aggregate function to calculate the difference between two variables for every observation, something like this (and then I want to save the result as a new variable):
data1
Group Points_Attempt1 Points_Attempt2
1 1 10 5
2 1 34 23
3 1 50 5
4 1 10 12
5 2 11 21
6 2 23 23
7 2 32 10
8 2 12 10
I'm able to do something like this:
aggregate(data1[c("Points_Attempt1","Points_Attempt2")],list(data1$Group),diff)
But I want it for every single observation, and I just do not know how to select the observations, i.e. by row number (here 1-8).
So I'm looking for the following fourth column (Difference), which I would then like to save as a new variable:
Group Points_Attempt1 Points_Attempt2 Difference
1 1 10 5 5
2 1 34 23 11
3 1 50 5 45
4 1 10 12 -2
5 2 11 21 -10
6 2 23 23 0
7 2 32 10 22
8 2 12 10 2
I would be highly thankful, if someone could help me with this.
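We can use mutate with across (the current replacement for the deprecated mutate_each and funs) to take successive differences within each group
library(dplyr)
data1 %>%
  group_by(Group) %>%
  mutate(across(c(Points_Attempt1, Points_Attempt2), ~ c(NA, diff(.x))))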
Or if we need to subtract between the variables,
data1 %>%
  mutate(Difference = Points_Attempt1 - Points_Attempt2)
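The two readings of "difference" can also be sketched in base R on the question's data: a row-wise difference between the two columns, and a successive difference within each group. The column name `Step_Attempt1` is a hypothetical name introduced here for the second case.

```r
# Data from the question
data1 <- data.frame(
  Group = rep(1:2, each = 4),
  Points_Attempt1 = c(10, 34, 50, 10, 11, 23, 32, 12),
  Points_Attempt2 = c(5, 23, 5, 12, 21, 23, 10, 10)
)

# Row-wise: subtraction is vectorized, so no aggregate() or loop is needed
data1$Difference <- data1$Points_Attempt1 - data1$Points_Attempt2

# Within-group: change from the previous row of the same group (NA for the first row)
data1$Step_Attempt1 <- ave(data1$Points_Attempt1, data1$Group,
                           FUN = function(x) c(NA, diff(x)))

data1
```

The `Difference` column matches the fourth column requested in the question.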

Split data when time intervals exceed a defined value

I have a data frame of GPS locations with a column of seconds. How can I create a new column that splits the data based on time gaps? For example, for this data frame:
df <- data.frame(secs=c(1,2,3,4,5,6,7,10,11,12,13,14,20,21,22,23,24,28,29,31))
I would like to cut the data frame wherever there is a time gap of 3 or more seconds between locations, and create a new column called 'bouts' that gives a running tally of the sections, so that the data frame looks like this:
id secs bouts
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
7 7 1
8 10 2
9 11 2
10 12 2
11 13 2
12 14 2
13 20 3
14 21 3
15 22 3
16 23 3
17 24 3
18 28 4
19 29 4
20 31 4
Use cumsum and diff:
df$bouts <- cumsum(c(1, diff(df$secs) >= 3))
Remember that logical values get coerced to numeric values 0/1 automatically and that diff output is always one element shorter than its input.
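To see why the one-liner works, it may help to look at the intermediate vectors. The sketch below uses the `secs` values from the question; the names `gaps` and `new_bout` are introduced only for illustration.

```r
secs <- c(1, 2, 3, 4, 5, 6, 7, 10, 11, 12, 13, 14, 20, 21, 22, 23, 24, 28, 29, 31)

# Time elapsed since the previous fix; one element shorter than secs
gaps <- diff(secs)

# TRUE wherever a new bout starts (gap of 3 seconds or more)
new_bout <- gaps >= 3

# Prepend 1 for the first row, then count bout starts cumulatively;
# combining 1 with a logical vector coerces TRUE/FALSE to 1/0
bouts <- cumsum(c(1, new_bout))

data.frame(secs, bouts)
```

The leading `1` both restores the original length and makes the tally start at bout 1 rather than 0.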

Resources