I have data on household purchases, with an individual ID for every receipt and the week of purchase coded as a plain integer.
I need to count the number of receipts from each household during each 4-week period. (The data covers 3 years: 52 weeks in the 1st year, 53 in the 2nd, 48 in the 3rd.) Eventually I want the average number of purchases per 4 weeks for every household. If the solution involves converting to months and counting monthly, that works as well. The dataset has over 100k rows. I'm quite new to R; all suggestions are very much appreciated!
Household<-c(1,2,3,1,1,2,2,2,3,1,3,3)
Week<-c(201501,201501,201501,201502,201502,201502,201502,201503,201503,201504,201504,201504)
Receipt<-c(111,112,113,114,115,116,117,118,119,120,121,121)
df<-data.frame(Household,Week,Receipt)
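This calculates the number of receipts (rows) per household, per 4-week period. Note that floor(Week/4) divides the raw YYYYWW code, so the blocks are not aligned to weeks 1-4, 5-8, ... within a year.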
library(data.table)
setDT(df)
n_reciepts <- df[, .N, by = .(Household, period = floor(Week/4))]
# Household period N
# 1: 1 50375 3
# 2: 2 50375 4
# 3: 3 50375 2
# 4: 1 50376 1
# 5: 3 50376 2
Then you just need to average by household over all periods
avg_n_reciepts <- n_reciepts[, .(avg_reciepts = mean(N)), by = Household]
# Household avg_reciepts
# 1: 1 2
# 2: 2 4
# 3: 3 2
You could also do this in one step
df[, .N, by = .(Household, period = floor(Week/4))
][, .(avg_reciepts = mean(N)), by = Household]
# Household avg_reciepts
# 1: 1 2
# 2: 2 4
# 3: 3 2
dplyr equivalent:
library(dplyr)
df %>%
group_by(Household, period = floor(Week/4)) %>%
count %>%
group_by(Household) %>%
summarise(avg_reciepts = mean(n))
# # A tibble: 3 x 2
# Household avg_reciepts
# <dbl> <dbl>
# 1 1 2
# 2 2 4
# 3 3 2
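If the blocks should line up with weeks 1-4, 5-8, and so on within each year, one option is to split the week code into its year and week parts first. The following is only a sketch in base R: it assumes Week is always coded as YYYYWW, and the column names period and n_receipts are made up for illustration.

```r
# Same toy data as in the question
Household <- c(1, 2, 3, 1, 1, 2, 2, 2, 3, 1, 3, 3)
Week <- c(201501, 201501, 201501, 201502, 201502, 201502,
          201502, 201503, 201503, 201504, 201504, 201504)
Receipt <- c(111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 121)
df <- data.frame(Household, Week, Receipt)

# Assumption: Week is coded as YYYYWW
yr <- df$Week %/% 100   # 2015
wk <- df$Week %% 100    # 1, 2, 3, 4

# Weeks 1-4 -> block 1, weeks 5-8 -> block 2, ...; prefix with the year
df$period <- yr * 100 + (wk - 1) %/% 4 + 1

# Count rows per household and period, then average per household
counts <- aggregate(Receipt ~ Household + period, data = df, FUN = length)
names(counts)[3] <- "n_receipts"
avg <- aggregate(n_receipts ~ Household, data = counts, FUN = mean)
avg
```

As in the answers above, the average is taken only over periods in which a household has at least one receipt. With this coding, a week 53 ends up in a short 14th block of its own.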
I have the following code:
Ni <- 133 # number of individuals
MXmeas <- 10 # number of measurements
# simulate number of observations for each individual
Nmeas <- round(runif(Ni, 1, MXmeas))
# simulate observation moments (under the assumption that everybody has at least one observation)
obs <- unlist(sapply(Nmeas, function(x) c(1, sort(sample(2:MXmeas, x-1, replace = FALSE)))))
# set up dataframe (id, observations)
dat <- data.frame(ID = rep(1:Ni, times = Nmeas), observations = obs)
This results in the following output:
ID observations
1 1
1 3
1 4
1 5
1 6
1 8
However, I also want a variable 'times' that numbers the measurements within each individual (1 for the first, 2 for the second, and so on). But since every ID has a different number of rows, I am not sure how to implement this. Does anybody know how to do that? I want it to look like this:
ID observations times
1 1 1
1 3 2
1 4 3
1 5 4
1 6 5
1 8 6
Using dplyr you could group by ID and use the row number for times:
library(dplyr)
dat |>
group_by(ID) |>
mutate(times = row_number()) |>
ungroup()
With base we could create the sequence based on each of the lengths of the ID variable:
dat$times <- sequence(rle(dat$ID)$lengths)
Output:
# A tibble: 734 × 3
ID observations times
<int> <dbl> <int>
1 1 1 1
2 1 3 2
3 1 9 3
4 2 1 1
5 2 5 2
6 2 6 3
7 2 8 4
8 3 1 1
9 3 2 2
10 3 5 3
# … with 724 more rows
Data (using a seed):
set.seed(1)
Ni <- 133 # number of individuals
MXmeas <- 10 # number of measurements
# simulate number of observations for each individual
Nmeas <- round(runif(Ni, 1, MXmeas))
# simulate observation moments (under the assumption that everybody has at least one observation)
obs <- unlist(sapply(Nmeas, function(x) c(1, sort(sample(2:MXmeas, x-1, replace = FALSE)))))
# set up dataframe (id, observations)
dat <- data.frame(ID = rep(1:Ni, times = Nmeas), observations = obs)
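For completeness, a data.table version of the same per-ID counter might look like the sketch below. The six-row table is a hypothetical stand-in for the simulated data, and times2 is an illustrative column name.

```r
library(data.table)

# Small hypothetical stand-in for the simulated data
dat <- data.table(ID = c(1, 1, 1, 2, 2, 3),
                  observations = c(1, 3, 9, 1, 5, 1))

# Number the rows within each ID
dat[, times := seq_len(.N), by = ID]

# rowid() gives the same counter without an explicit by
dat[, times2 := rowid(ID)]
dat
```

Both variants count rows in their current order, so sort by ID and observations first if the data are not already ordered.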
I collected some data in which observations are distinguished by year, month, and level. I want to assign a code (a simple number) to each row, based on these three columns alone, so that rows sharing the same combination get the same code. Any suggestions on how to proceed?
year <- c("A","J","J","S")
month <- c(2000,2001,2001,2000)
level <- c("high","low","low","low")
site <- c(1,2,3,3)
val1 <- c(1,2,3,0)
df <- data.frame(year,month,level,site,val1)
#Result desired
# df$Unique.code should be 1, 2, 2, 3
dplyr has the cur_group_id() function for this:
library(dplyr)
df %>%
group_by(year, month, level) %>%
mutate(id = cur_group_id())
# # A tibble: 4 × 6
# # Groups: year, month, level [3]
# year month level site val1 id
# <chr> <dbl> <chr> <dbl> <dbl> <int>
# 1 A 2000 high 1 1 1
# 2 J 2001 low 2 2 2
# 3 J 2001 low 3 3 2
# 4 S 2000 low 3 0 3
Or we could coerce a factor into an integer in base:
df$group_id = with(df, as.integer(factor(paste(year, month, level))))
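A data.table counterpart could use the special symbol .GRP, which holds the index of the current group. This is a sketch on the question's data. One difference worth knowing: .GRP numbers groups in order of first appearance, whereas the factor approach and cur_group_id() number them by sorted key. For this example both give 1, 2, 2, 3.

```r
library(data.table)

# Data from the question
dt <- data.table(year  = c("A", "J", "J", "S"),
                 month = c(2000, 2001, 2001, 2000),
                 level = c("high", "low", "low", "low"),
                 site  = c(1, 2, 3, 3),
                 val1  = c(1, 2, 3, 0))

# .GRP is the running group number within the by-grouping
dt[, id := .GRP, by = .(year, month, level)]
dt
```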
Preamble:
The main problem is how to subset a datatable based on IDs, forming subsets within an ID based on consecutive time differences. A hint regarding this would be most welcome.
The complete question/setup:
I have a dataset dt in data.table format that looks like
date id val1 val2
%d.%m.%Y
1 01.01.2000 1 5 10
2 09.01.2000 1 4 9
3 01.08.2000 1 3 8
4 01.01.2000 2 2 7
5 01.01.2000 3 1 6
6 14.01.2000 3 7 5
7 28.01.2000 3 8 4
8 01.06.2000 3 9 3
I want to combine observations (grouped by id) which are not more than two weeks apart (consecutively from observation to observation). By combining I mean that for each subset, I
keep the value of the last observation of val1
replace val2 of the last observation with the sum of all values of val2 of the group
add counter for how many observations came together in this group.
I.e., I want to end up with a dataset like this
date id val1 val2 counter
%d.%m.%Y
2 09.01.2000 1 4 19 2
3 01.08.2000 1 3 8 1
4 01.01.2000 2 2 7 1
7 28.01.2000 3 8 15 3
8 01.06.2000 3 9 3 1
Still, I am trying to wrap my head around data.table functions, particularly .SD and want to solve the issue with these tools.
So far I know
that I can indicate what I mean by first and last using setkey(dt,date)
that I can replace the last val2 of a subset with the sum
dt[, val2 := replace(val2, .N, sum(val2[-.N], na.rm = TRUE)), by=id]
that I get the length of a subset with [.N]
how to delete rows
that I can calculate the difference between two dates with difftime(strptime(dt$date[1],format ="%d.%m.%Y"),strptime(dt$date[2],format ="%d.%m.%Y"),units="weeks")
However I can't get my head around how to subset the observations such that each subset contains only groups of observations of the same id with dates of (consecutive) distance at max 2 weeks.
Any help is appreciated. Many thanks in advance.
The trick is to use cumsum() on a condition. In this case, the condition is that the gap since the previous observation (within the same id) is more than 14 days. Each time the condition is true, the cumulative sum increments, which starts a new group.
library(dplyr)
# df holds the question's data (called dt there)
df %>%
mutate(rownumber = row_number()) %>%
group_by(id) %>%
mutate(interval = as.numeric(as.Date(date, format = "%d.%m.%Y") - as.Date(lag(date), format = "%d.%m.%Y"))) %>%
mutate(interval = ifelse(is.na(interval), 0, interval)) %>%
mutate(group = cumsum(interval > 14) + 1) %>%
ungroup() %>%
group_by(id, group) %>%
summarise(
rownumber = last(rownumber),
date = last(date),
val1 = last(val1),
val2 = sum(val2),
counter = n()
) %>%
select(rownumber, date, id, val1, val2, counter)
Output
rownumber date id val1 val2 counter
<int> <chr> <int> <int> <int> <int>
1 2 09.01.2000 1 4 19 2
2 3 01.08.2000 1 3 8 1
3 4 01.01.2000 2 2 7 1
4 7 28.01.2000 3 8 15 3
5 8 01.06.2000 3 9 3 1
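Since the question asks for data.table tools, the same cumsum() idea could be sketched there as follows. The helper columns d and grp are names chosen for illustration, and the table is retyped from the question.

```r
library(data.table)

# Data retyped from the question
dt <- data.table(
  date = c("01.01.2000", "09.01.2000", "01.08.2000", "01.01.2000",
           "01.01.2000", "14.01.2000", "28.01.2000", "01.06.2000"),
  id   = c(1, 1, 1, 2, 3, 3, 3, 3),
  val1 = c(5, 4, 3, 2, 1, 7, 8, 9),
  val2 = c(10, 9, 8, 7, 6, 5, 4, 3)
)

# Parse the dates and sort within id
dt[, d := as.Date(date, format = "%d.%m.%Y")]
setorder(dt, id, d)

# Start a new group whenever the gap to the previous row exceeds 14 days
dt[, grp := cumsum(c(TRUE, as.numeric(diff(d)) > 14)), by = id]

# Collapse each group: last date and val1, summed val2, row count
res <- dt[, .(date = date[.N], val1 = val1[.N],
              val2 = sum(val2), counter = .N),
          by = .(id, grp)]
res
```

A gap of exactly 14 days (14.01 to 28.01 for id 3) stays in the same group, matching "not more than two weeks apart".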
I am trying to rearrange a dataset with a few thousand observations (to eventually use the drm function in the drc package), and I am tired of doing it in Excel. Within a data frame, I want to add "start" and "end" times (up to Inf) based on the intervals found in a vector within the df. This means I have to add an extra observation (row) in which the last "end" time is Inf. For that last row (the one with Inf) I ALSO need to subtract the total of "value" from an arbitrary number (50 in my example below). All of this should be grouped by two variables ("names" and "reps" in my example). I am hoping there is a solution using group_by, but honestly I'll be overjoyed at any solution!
I have a data set that looks like this;
# data
names<-c(rep("Luke",30), rep("Han", 30), rep("Leia", 30), rep("OB1", 30))
reps<-c(rep("A", 10), rep("B", 10), rep("C", 10))
time<-rep(seq(1:10), 4)
value<-rep(sample(0:5,10,replace=T), 4)
df<-data.frame(names, reps, time, value)
but need it to look like this;
Example of the data structure I need.
I'm at a loss. Please help!
If I have understood you correctly, we can do
library(dplyr)
df1 <- df %>%
group_by(names, reps) %>%
mutate(start = lag(time, default = 0),
end = time)
bind_rows(df1, df1 %>%
group_by(names, reps) %>%
summarise(start = last(time),
end = Inf,
value = 50 - sum(value))) %>% # 50 is the arbitrary total from the question
select(-time) %>%
arrange(names, reps)
# names reps value start end
# <fct> <fct> <int> <dbl> <dbl>
# 1 Han A 2 0 1
# 2 Han A 2 1 2
# 3 Han A 1 2 3
# 4 Han A 1 3 4
# 5 Han A 3 4 5
# 6 Han A 2 5 6
# 7 Han A 0 6 7
# 8 Han A 2 7 8
# 9 Han A 2 8 9
#10 Han A 5 9 10
#11 Han A 30 10 Inf
#.....
We can also do this in data.table. After grouping by 'names' and 'reps', shift 'time' to create 'start', append Inf to 'time' to create 'end', and append 50 minus the sum of 'value' to 'value'.
library(data.table)
setDT(df)[, {stL <- last(time)
enL <- Inf
vL <- 50- sum(value)
.(start = c(shift(time, fill = 0), stL),
end = c(time, enL),
value = c(value, vL))}, .(names, reps)]
# names reps start end value
# 1: Luke A 0 1 0
# 2: Luke A 1 2 3
# 3: Luke A 2 3 3
# 4: Luke A 3 4 4
# 5: Luke A 4 5 0
# ---
#128: OB1 C 6 7 3
#129: OB1 C 7 8 0
#130: OB1 C 8 9 2
#131: OB1 C 9 10 5
#132: OB1 C 10 Inf 27
I have read a few different posts about finding the difference between two different rows in R using dplyr. However, the posts I have seen do not give me quite what I want. I would like to find the difference between the times, and place that difference between n and n+1 in a new variable, on the same row as n, kind of like the duration between n and n+1. All other posts place the elapsed time on the same row as n+1.
Here is some sample data:
df <- read.table(text = c("
id time
1 1
1 4
1 7
2 5
2 10"), header = T)
My desired output:
# id time duration
# 1 1 3
# 1 4 3
# 1 7 NA
# 2 5 5
# 2 10 NA
I have the following code at the moment:
df %>% arrange(id, time) %>% group_by(id) %>% mutate(duration = time - lag(time))
Please let me know how I should change this around. Thanks!
You can use diff(), appending the NA to each group. Just change your mutate() call to
mutate(duration = c(diff(time), NA))
Edit: To clarify, the code above is only the mutate() call at the end of the pipe shown in the question. So the entire operation, based on the code shown in the question, would be
df %>%
arrange(id, time) %>%
group_by(id) %>%
mutate(duration = c(diff(time), NA))
# Source: local data frame [5 x 3]
# Groups: id [2]
#
# id time duration
# <dbl> <dbl> <dbl>
# 1 1 1 3
# 2 1 4 3
# 3 1 7 NA
# 4 2 5 5
# 5 2 10 NA
We can swap lag with lead
df %>%
group_by(id) %>%
mutate(duration = lead(time)- time)
# id time duration
# <int> <int> <int>
#1 1 1 3
#2 1 4 3
#3 1 7 NA
#4 2 5 5
#5 2 10 NA
A corresponding option in data.table would be shift with type = "lead"
library(data.table)
setDT(df)[, duration := shift(time, type = "lead") - time, by = id]
NOTE: In the example, 'id' and 'time' are already in order. If they are not, sort first, as the OP does with arrange(id, time) in the question.
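If neither dplyr nor data.table is available, the same idea can be sketched in base R with ave(), which applies a function within each group and returns the results in the original row order. The data below is retyped from the question.

```r
# Data from the question
df <- data.frame(id = c(1, 1, 1, 2, 2),
                 time = c(1, 4, 7, 5, 10))

# Make sure rows are ordered by id and time
df <- df[order(df$id, df$time), ]

# Within each id: difference to the next time, NA for the last row
df$duration <- ave(df$time, df$id, FUN = function(x) c(diff(x), NA))
df
```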