How can I create a column in R that calculates a value of a row based on the previous row's value? - r

Here's what I would like to a achieve as a function in Excel, but I can't seem to find a solution to do it in R.
This is what I tried to do but it does not seem to allow me to operate with the previous values of the new column I'm trying to make.
Here is a reproducible example:
library(dplyr)
set.seed(42) ## for sake of reproducibility
dat <- data.frame(date=seq.Date(as.Date("2020-12-26"), as.Date("2020-12-31"), "day"))
This would be the output of the dataframe:
dat
date
1 2020-12-26
2 2020-12-27
3 2020-12-28
4 2020-12-29
5 2020-12-30
6 2020-12-31
Desired output:
date periodNumber
1 2020-12-26 1
2 2020-12-27 2
3 2020-12-28 3
4 2020-12-29 4
5 2020-12-30 5
6 2020-12-31 6
My try at this:
dat %>%
mutate(periodLag = dplyr::lag(date)) %>%
mutate(periodNumber = ifelse(is.na(periodLag)==TRUE, 1,
ifelse(date == periodLag, dplyr::lag(periodNumber), (dplyr::lag(periodNumber) + 1))))
Excel formula screenshot:

You could use dplyr's cur_group_id():
library(dplyr)
set.seed(42)
# I used a larger example
dat <- data.frame(date=sample(seq.Date(as.Date("2020-12-26"), as.Date("2020-12-31"), "day"), size = 30, replace = TRUE))
dat %>%
arrange(date) %>% # needs sorting because of the random example
group_by(date) %>%
mutate(periodNumber = cur_group_id())
This returns
# A tibble: 30 x 2
# Groups: date [6]
date periodNumber
<date> <int>
1 2020-12-26 1
2 2020-12-26 1
3 2020-12-26 1
4 2020-12-26 1
5 2020-12-26 1
6 2020-12-26 1
7 2020-12-26 1
8 2020-12-26 1
9 2020-12-27 2
10 2020-12-27 2
11 2020-12-27 2
12 2020-12-27 2
13 2020-12-27 2
14 2020-12-27 2
15 2020-12-27 2
16 2020-12-28 3
17 2020-12-28 3
18 2020-12-28 3
19 2020-12-29 4
20 2020-12-29 4
21 2020-12-29 4
22 2020-12-29 4
23 2020-12-29 4
24 2020-12-29 4
25 2020-12-30 5
26 2020-12-30 5
27 2020-12-30 5
28 2020-12-30 5
29 2020-12-30 5
30 2020-12-31 6

Related

subset R dataframe using multiple functions/constraints [duplicate]

This question already has answers here:
Extract row corresponding to minimum value of a variable by group
(9 answers)
Closed 1 year ago.
I would like to subset a dataframe (dat) in 2 steps. First constraint: keep unique "id". Second constraint: keep the smallest "visit".
For example, for id=S2, I want to keep the 3rd row which has visit #1, rather than row2 with visit #2.
set.seed(42) ## for sake of reproducibility
n <- 6
dat <- data.frame(id=c("s1","s2","s2","s3","s4","s4"),
date=seq.Date(as.Date("2020-12-26"), as.Date("2020-12-31"), "day"),
visit=1:2,
age=sample(18:30, n, replace=TRUE))
#dat
# id date visit age
# 1 s1 2020-12-26 1 18
# 2 s2 2020-12-27 2 22
# 3 s2 2020-12-28 1 18
# 4 s3 2020-12-29 2 26
# 5 s4 2020-12-30 1 27
# 6 s4 2020-12-31 2 21
#desired output:
# id date visit age
# 1 s1 2020-12-26 1 18
# 3 s2 2020-12-28 1 18
# 4 s3 2020-12-29 2 26
# 5 s4 2020-12-30 1 27
base R
dat[ave(dat$visit, dat$id, FUN = function(z) seq_along(z) == which.min(z)) > 0,]
# id date visit age
# 1 s1 2020-12-26 1 18
# 3 s2 2020-12-28 1 18
# 4 s3 2020-12-29 2 26
# 5 s4 2020-12-30 1 27
dplyr
library(dplyr)
dat %>%
group_by(id) %>%
slice(which.min(visit)) %>%
ungroup()
# # A tibble: 4 x 4
# id date visit age
# <chr> <date> <int> <int>
# 1 s1 2020-12-26 1 18
# 2 s2 2020-12-28 1 18
# 3 s3 2020-12-29 2 26
# 4 s4 2020-12-30 1 27
data.table
library(data.table)
as.data.table(dat)[, .SD[which.min(visit),], by = id]
# id date visit age
# <char> <Date> <int> <int>
# 1: s1 2020-12-26 1 18
# 2: s2 2020-12-28 1 18
# 3: s3 2020-12-29 2 26
# 4: s4 2020-12-30 1 27
This seems to be the solution with the simplest syntax:
library(dplyr)
dat %>%
group_by(id) %>%
filter(age == min(age))

How to read a list of list in r

I have a txt file like this:
[["seller_id","product_id","buyer_id","sale_date","quantity","price"],[7,11,49,"2019-01-21",5,3330],[13,32,6,"2019-02-10",9,1089],[50,47,4,"2019-01-06",1,1343],[1,22,2,"2019-03-03",9,7677]]
I would like to read it by R as a table like this:
seller_id
product_id
buyer_id
sale_date
quantity
price
7
11
49
2019-01-21
5
3330
13
32
6
2019-02-10
9
1089
50
47
4
2019-01-06
1
1343
1
22
2
2019-03-03
9
7677
How to write the correct R code? Thanks very much for your time.
An easier option is fromJSON
library(jsonlite)
library(janitor)
fromJSON(txt = "file1.txt") %>%
as_tibble %>%
row_to_names(row_number = 1) %>%
type.convert(as.is = TRUE)
-output
# A tibble: 4 x 6
# seller_id product_id buyer_id sale_date quantity price
# <int> <int> <int> <chr> <int> <int>
#1 7 11 49 2019-01-21 5 3330
#2 13 32 6 2019-02-10 9 1089
#3 50 47 4 2019-01-06 1 1343
#4 1 22 2 2019-03-03 9 7677
You will need to parse the json from arrays into a data frame. Perhaps something like this:
# Get string
str <- '[["seller_id","product_id","buyer_id","sale_date","quantity","price"],[7,11,49,"2019-01-21",5,3330],[13,32,6,"2019-02-10",9,1089],[50,47,4,"2019-01-06",1,1343],[1,22,2,"2019-03-03",9,7677]]'
df_list <- jsonlite::parse_json(str)
do.call(rbind, lapply(df_list[-1], function(x) {
setNames(as.data.frame(x), unlist(df_list[1]))}))
#> seller_id product_id buyer_id sale_date quantity price
#> 1 7 11 49 2019-01-21 5 3330
#> 2 13 32 6 2019-02-10 9 1089
#> 3 50 47 4 2019-01-06 1 1343
#> 4 1 22 2 2019-03-03 9 7677
Created on 2020-12-11 by the reprex package (v0.3.0)
Some base R options using:
gsub + read.table
read.table(
text = gsub('"|\\[|\\]', "", gsub("\\],", "\n", s)),
sep = ",",
header = TRUE
)
gsub + read.csv
read.csv(text = gsub('"|\\[|\\]', "", gsub("\\],", "\n", s)))
which gives
seller_id product_id buyer_id sale_date quantity price
1 7 11 49 2019-01-21 5 3330
2 13 32 6 2019-02-10 9 1089
3 50 47 4 2019-01-06 1 1343
4 1 22 2 2019-03-03 9 7677
Data
s <- '[["seller_id","product_id","buyer_id","sale_date","quantity","price"],[7,11,49,"2019-01-21",5,3330],[13,32,6,"2019-02-10",9,1089],[50,47,4,"2019-01-06",1,1343],[1,22,2,"2019-03-03",9,7677]]'

Selecting distinct entries based on specific variables in R

I want to select distinct entries for my dataset based on two specific variables. I may, in fact, like to create a subset and do analysis using each subset.
The data set looks like this
id <- c(3,3,6,6,4,4,3,3)
date <- c("2017-1-1", "2017-3-3", "2017-4-3", "2017-4-7", "2017-10-1", "2017-11-1", "2018-3-1", "2018-4-3")
date_cat <- c(1,1,1,1,2,2,3,3)
measurement <- c(10, 13, 14,13, 12, 11, 14, 17)
myData <- data.frame(id, date, date_cat, measurement)
myData
myData$date1 <- as.Date(myData$date)
myData
id date date_cat measurement date1
1 3 2017-1-1 1 10 2017-01-01
2 3 2017-3-3 1 13 2017-03-03
3 6 2017-4-3 1 14 2017-04-03
4 6 2017-4-7 1 13 2017-04-07
5 4 2017-10-1 2 12 2017-10-01
6 4 2017-11-1 2 11 2017-11-01
7 3 2018-3-1 3 14 2018-03-01
8 3 2018-4-3 3 17 2018-04-03
#select the last date for the ID in each date category.
Here date_cat is the date category and date1 is date formatted as date. How can I get the last date for each ID in each date_category?
I want my data to show up as
id date date_cat measurement date1
1 3 2017-3-3 1 13 2017-03-03
2 6 2017-4-7 1 13 2017-04-07
3 4 2017-11-1 2 11 2017-11-01
4 3 2018-4-3 3 17 2018-04-03
Thanks!
I am not sure if you want something like below
subset(myData,ave(date1,id,date_cat,FUN = function(x) tail(sort(x),1))==date1)
which gives
> subset(myData,ave(date1,id,date_cat,FUN = function(x) tail(sort(x),1))==date1)
id date date_cat measurement date1
2 3 2017-3-3 1 13 2017-03-03
4 6 2017-4-7 1 13 2017-04-07
6 4 2017-11-1 2 11 2017-11-01
8 3 2018-4-3 3 17 2018-04-03
Using data.table:
library(data.table)
myData_DT <- as.data.table(myData)
myData_DT[, .SD[.N] , by = .(date_cat, id)]
We could create a group with rleid on the 'id' column, slice the last row, remove the temporary grouping column
library(dplyr)
library(data.table)
myData %>%
group_by(grp = rleid(id)) %>%
slice(n()) %>%
ungroup %>%
select(-grp)
# A tibble: 4 x 5
# id date date_cat measurement date1
# <dbl> <chr> <dbl> <dbl> <date>
#1 3 2017-3-3 1 13 2017-03-03
#2 6 2017-4-7 1 13 2017-04-07
#3 4 2017-11-1 2 11 2017-11-01
#4 3 2018-4-3 3 17 2018-04-03
Or this can be done on the fly without creating a temporary column
myData %>%
filter(!duplicated(rleid(id), fromLast = TRUE))
Or using base R with subset and rle
subset(myData, !duplicated(with(rle(id),
rep(seq_along(values), lengths)), fromLast = TRUE))
# id date date_cat measurement date1
#2 3 2017-3-3 1 13 2017-03-03
#4 6 2017-4-7 1 13 2017-04-07
#6 4 2017-11-1 2 11 2017-11-01
#8 3 2018-4-3 3 17 2018-04-03
Using dplyr:
myData %>%
group_by(id,date_cat) %>%
top_n(1,date)

A running sum for daily data that resets when month turns

I have a 2 column table (tibble), made up of a date object and a numeric variable. There is maximum one entry per day but not every day has an entry (ie date is a natural primary key). I am attempting to do a running sum of the numeric column along with dates but with the running sum resetting when the month turns (the data is sorted by ascending date). I have replicated what I want to get as a result below.
Date score monthly.running.sum
10/2/2019 7 7
10/9/2019 6 13
10/16/2019 12 25
10/23/2019 2 27
10/30/2019 13 40
11/6/2019 2 2
11/13/2019 4 6
11/20/2019 15 21
11/27/2019 16 37
12/4/2019 4 4
12/11/2019 24 28
12/18/2019 28 56
12/25/2019 8 64
1/1/2020 1 1
1/8/2020 15 16
1/15/2020 9 25
1/22/2020 8 33
It looks like the package "runner" is possibly suited to this but I don't really understand how to instruct it. I know I could use a join operation plus a group_by using dplyr to do this, but the data set is very very large and doing so would be wildly inefficient. i could also manually iterate through the list with a loop, but that also seems inelegant. last option i can think of is selecting out a unique vector of yearmon objects and then cutting the original list into many shorter lists and running a plain cumsum on it, but that also feels unoptimal. I am sure this is not the first time someone has to do this, and given how many tools there is in the tidyverse to do things, I think I just need help finding the right one. The reason I am looking for a tool instead of using one of the methods I described above (which would take less time than writing this post) is because this code needs to be very very readable by an audience that is less comfortable with code.
We can also use data.table
library(data.table)
setDT(df)[, Date := as.IDate(Date, "%m/%d/%Y")
][, monthly.running.sum := cumsum(score),format(Date, "%Y-%m")][]
# Date score monthly.running.sum
# 1: 2019-10-02 7 7
# 2: 2019-10-09 6 13
# 3: 2019-10-16 12 25
# 4: 2019-10-23 2 27
# 5: 2019-10-30 13 40
# 6: 2019-11-06 2 2
# 7: 2019-11-13 4 6
# 8: 2019-11-20 15 21
# 9: 2019-11-27 16 37
#10: 2019-12-04 4 4
#11: 2019-12-11 24 28
#12: 2019-12-18 28 56
#13: 2019-12-25 8 64
#14: 2020-01-01 1 1
#15: 2020-01-08 15 16
#16: 2020-01-15 9 25
#17: 2020-01-22 8 33
data
df <- structure(list(Date = c("10/2/2019", "10/9/2019", "10/16/2019",
"10/23/2019", "10/30/2019", "11/6/2019", "11/13/2019", "11/20/2019",
"11/27/2019", "12/4/2019", "12/11/2019", "12/18/2019", "12/25/2019",
"1/1/2020", "1/8/2020", "1/15/2020", "1/22/2020"), score = c(7L,
6L, 12L, 2L, 13L, 2L, 4L, 15L, 16L, 4L, 24L, 28L, 8L, 1L, 15L,
9L, 8L)), row.names = c(NA, -17L), class = "data.frame")
Using lubridate, you can extract month and year values from the date, group_by those values and them perform the cumulative sum as follow:
library(lubridate)
library(dplyr)
df %>% mutate(Month = month(mdy(Date)),
Year = year(mdy(Date))) %>%
group_by(Month, Year) %>%
mutate(SUM = cumsum(score))
# A tibble: 17 x 6
# Groups: Month, Year [4]
Date score monthly.running.sum Month Year SUM
<chr> <int> <int> <int> <int> <int>
1 10/2/2019 7 7 10 2019 7
2 10/9/2019 6 13 10 2019 13
3 10/16/2019 12 25 10 2019 25
4 10/23/2019 2 27 10 2019 27
5 10/30/2019 13 40 10 2019 40
6 11/6/2019 2 2 11 2019 2
7 11/13/2019 4 6 11 2019 6
8 11/20/2019 15 21 11 2019 21
9 11/27/2019 16 37 11 2019 37
10 12/4/2019 4 4 12 2019 4
11 12/11/2019 24 28 12 2019 28
12 12/18/2019 28 56 12 2019 56
13 12/25/2019 8 64 12 2019 64
14 1/1/2020 1 1 1 2020 1
15 1/8/2020 15 16 1 2020 16
16 1/15/2020 9 25 1 2020 25
17 1/22/2020 8 33 1 2020 33
An alternative will be to use floor_date function in order ot convert each date as the first day of each month and the calculate the cumulative sum:
library(lubridate)
library(dplyr)
df %>% mutate(Floor = floor_date(mdy(Date), unit = "month")) %>%
group_by(Floor) %>%
mutate(SUM = cumsum(score))
# A tibble: 17 x 5
# Groups: Floor [4]
Date score monthly.running.sum Floor SUM
<chr> <int> <int> <date> <int>
1 10/2/2019 7 7 2019-10-01 7
2 10/9/2019 6 13 2019-10-01 13
3 10/16/2019 12 25 2019-10-01 25
4 10/23/2019 2 27 2019-10-01 27
5 10/30/2019 13 40 2019-10-01 40
6 11/6/2019 2 2 2019-11-01 2
7 11/13/2019 4 6 2019-11-01 6
8 11/20/2019 15 21 2019-11-01 21
9 11/27/2019 16 37 2019-11-01 37
10 12/4/2019 4 4 2019-12-01 4
11 12/11/2019 24 28 2019-12-01 28
12 12/18/2019 28 56 2019-12-01 56
13 12/25/2019 8 64 2019-12-01 64
14 1/1/2020 1 1 2020-01-01 1
15 1/8/2020 15 16 2020-01-01 16
16 1/15/2020 9 25 2020-01-01 25
17 1/22/2020 8 33 2020-01-01 33
A base R alternative :
df$Date <- as.Date(df$Date, "%m/%d/%Y")
df$monthly.running.sum <- with(df, ave(score, format(Date, "%Y-%m"),FUN = cumsum))
df
# Date score monthly.running.sum
#1 2019-10-02 7 7
#2 2019-10-09 6 13
#3 2019-10-16 12 25
#4 2019-10-23 2 27
#5 2019-10-30 13 40
#6 2019-11-06 2 2
#7 2019-11-13 4 6
#8 2019-11-20 15 21
#9 2019-11-27 16 37
#10 2019-12-04 4 4
#11 2019-12-11 24 28
#12 2019-12-18 28 56
#13 2019-12-25 8 64
#14 2020-01-01 1 1
#15 2020-01-08 15 16
#16 2020-01-15 9 25
#17 2020-01-22 8 33
The yearmon class represents year/month objects so just convert the dates to yearmon and accumulate by them using this one-liner:
library(zoo)
transform(DF, run.sum = ave(score, as.yearmon(Date, "%m/%d/%Y"), FUN = cumsum))
giving:
Date score run.sum
1 10/2/2019 7 7
2 10/9/2019 6 13
3 10/16/2019 12 25
4 10/23/2019 2 27
5 10/30/2019 13 40
6 11/6/2019 2 2
7 11/13/2019 4 6
8 11/20/2019 15 21
9 11/27/2019 16 37
10 12/4/2019 4 4
11 12/11/2019 24 28
12 12/18/2019 28 56
13 12/25/2019 8 64
14 1/1/2020 1 1
15 1/8/2020 15 16
16 1/15/2020 9 25
17 1/22/2020 8 33

Calculate average number of individuals present on each date in R

I have a dataset that contains the residence period (start.date to end.date) of marked individuals (ID) at different sites. My goal is to generate a column that tells me the average number of other individuals per day that were also present at the same site (across the total residence period of each individual).
To do this, I need to determine the total number of individuals that were present per site on each date, summed across the total residence period of each individual. Ultimately, I will divide this sum by the total residence days of each individual to calculate the average. Can anyone help me accomplish this?
I calculated the total number of residence days (total.days) using lubridate and dplyr
mutate(total.days = end.date - start.date + 1)
site ID start.date end.date total.days
1 1 16 5/24/17 6/5/17 13
2 1 46 4/30/17 5/20/17 21
3 1 26 4/30/17 5/23/17 24
4 1 89 5/5/17 5/13/17 9
5 1 12 5/11/17 5/14/17 4
6 2 14 5/4/17 5/10/17 7
7 2 18 5/9/17 5/29/17 21
8 2 19 5/24/17 6/10/17 18
9 2 39 5/5/17 5/18/17 14
First of all, it is always advisable to give a sample of the data in a more friendly format using dput(yourData) so that other can easily regenerate your data. Here is the output of dput() you could better be sharing:
> dput(dat)
structure(list(site = c(1, 1, 1, 1, 1, 2, 2, 2, 2), ID = c(16,
46, 26, 89, 12, 14, 18, 19, 39), start.date = structure(c(17310,
17286, 17286, 17291, 17297, 17290, 17295, 17310, 17291), class = "Date"),
end.date = structure(c(17322, 17306, 17309, 17299, 17300,
17296, 17315, 17327, 17304), class = "Date")), class = "data.frame", row.names =
c(NA,
-9L))
To do this easily we first need to unpack the start.date and end.date to individual dates:
newDat <- data.frame()
for (i in 1:nrow(dat)){
expand <- data.frame(site = dat$site[i],
ID = dat$ID[i],
Dates = seq.Date(dat$start.date[i], dat$end.date[i], 1))
newDat <- rbind(newDat, expand)
}
newDat
site ID Dates
1 1 16 2017-05-24
2 1 16 2017-05-25
3 1 16 2017-05-26
4 1 16 2017-05-27
5 1 16 2017-05-28
6 1 16 2017-05-29
7 1 16 2017-05-30
. . .
. . .
Then we calculate the number of other individuals present in each site in each day:
individualCount = newDat %>%
group_by(site, Dates) %>%
summarise(individuals = n_distinct(ID) - 1)
individualCount
# A tibble: 75 x 3
# Groups: site [?]
site Dates individuals
<dbl> <date> <int>
1 1 2017-04-30 1
2 1 2017-05-01 1
3 1 2017-05-02 1
4 1 2017-05-03 1
5 1 2017-05-04 1
6 1 2017-05-05 2
7 1 2017-05-06 2
8 1 2017-05-07 2
9 1 2017-05-08 2
10 1 2017-05-09 2
# ... with 65 more rows
Then, we augment our data with the new information using left_join() and calculate the required average:
newDat <- left_join(newDat, individualCount, by = c("site", "Dates")) %>%
group_by(site, ID) %>%
summarise(duration = max(Dates) - min(Dates)+1,
av.individuals = mean(individuals))
newDat
# A tibble: 9 x 4
# Groups: site [?]
site ID duration av.individuals
<dbl> <dbl> <time> <dbl>
1 1 12 4 0.75
2 1 16 13 0
3 1 26 24 1.42
4 1 46 21 1.62
5 1 89 9 1.33
6 2 14 7 1.14
7 2 18 21 0.875
8 2 19 18 0.333
9 2 39 14 1.14
The final step is to add the required column to the original dataset (dat) again with left_join():
dat %>% left_join(newDat, by = c("site", "ID"))
dat
site ID start.date end.date duration av.individuals
1 1 16 2017-05-24 2017-06-05 13 days 0.000000
2 1 46 2017-04-30 2017-05-20 21 days 1.619048
3 1 26 2017-04-30 2017-05-23 24 days 1.416667
4 1 89 2017-05-05 2017-05-13 9 days 2.333333
5 1 12 2017-05-11 2017-05-14 4 days 2.750000
6 2 14 2017-05-04 2017-05-10 7 days 1.142857
7 2 18 2017-05-09 2017-05-29 21 days 0.857143
8 2 19 2017-05-24 2017-06-10 18 days 0.333333
9 2 39 2017-05-05 2017-05-18 14 days 1.142857

Resources