Remove group of rows by flag indicator in R

I have a dataframe where I have groups of numbers in the unique3 column.
DF <- structure(list(unique1 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L), .Label = c("11/1/2016", "11/10/2016", "11/11/2016",
"11/12/2016", "11/13/2016", "11/14/2016", "11/15/2016", "11/16/2016",
"11/17/2016", "11/18/2016", "11/19/2016", "11/2/2016", "11/20/2016",
"11/21/2016", "11/22/2016", "11/23/2016", "11/24/2016", "11/25/2016",
"11/26/2016", "11/27/2016", "11/28/2016", "11/3/2016", "11/4/2016",
"11/5/2016", "11/6/2016", "11/7/2016", "11/8/2016", "11/9/2016"
),
class = "factor"), unique2 = c(21L, 21L, 21L, 21L, 21L, 21L,
21L, 21L, 31L, 41L), unique3 = c(100001L, 100001L, 100001L, 100001L,
100001L, 100001L, 100001L, 100001L, 100002L, 100003L),
flag = c(NA_integer_,1, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_), value = c(1L,
6L, 18L, 19L, 22L, 29L, 30L, 32L, 1L, 1L)),
.Names = c("unique1","unique2", "unique3", "flag", "value"), row.names = c(NA, 10L), class = "data.frame")
unique1 unique2 unique3 flag value
1 11/1/2016 21 100001 NA 1
2 11/1/2016 21 100001 1 6
3 11/1/2016 21 100001 NA 18
4 11/1/2016 21 100001 NA 19
5 11/1/2016 21 100001 NA 22
6 11/1/2016 21 100001 NA 29
7 11/1/2016 21 100001 NA 30
8 11/1/2016 21 100001 NA 32
9 11/1/2016 31 100002 NA 1
10 11/1/2016 41 100003 NA 1
I basically need to group by the unique3 column so that, if any of the rows for 100001 had a 1 in flag, they would all be removed as a group. (Note that 100001 may not be unique and may repeat for a different value of unique2.)
What I would do is make flag equal to 1 for all rows of that unique3 group, like so:
unique1 unique2 unique3 flag value
1 11/1/2016 21 100001 1 1
2 11/1/2016 21 100001 1 6
3 11/1/2016 21 100001 1 18
4 11/1/2016 21 100001 1 19
5 11/1/2016 21 100001 1 22
6 11/1/2016 21 100001 1 29
7 11/1/2016 21 100001 1 30
8 11/1/2016 21 100001 1 32
9 11/1/2016 31 100002 NA 1
10 11/1/2016 41 100003 NA 1
and then group by and filter to get:
unique1 unique2 unique3 flag value
1 11/1/2016 21 100001 1 1
2 11/1/2016 21 100001 1 6
3 11/1/2016 21 100001 1 18
4 11/1/2016 21 100001 1 19
5 11/1/2016 21 100001 1 22
6 11/1/2016 21 100001 1 29
7 11/1/2016 21 100001 1 30
8 11/1/2016 21 100001 1 32

For the first step (applying the flag uniformly to each group):
DF$flag <- ave(DF$flag, DF$unique3, FUN = function(x) max(c(0,x), na.rm=TRUE))
Then you can filter a few different ways. One option is:
subset(DF, flag == 1)
How it works
ave(v, g1, g2, g3, FUN = f) splits up vector v based on grouping variables; applies a function to each subvector; recombines to return a vector with the same class as v.
max(c(0,x), na.rm=TRUE) removes the NA values, adds a 0 value and then takes the max. If x only contains 1s and NAs, this will return a 1 if x contains any 1 and otherwise returns 0.
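For instance, a minimal sketch of those two pieces on a made-up vector and grouping:
x <- c(NA, 1, NA)
max(c(0, x), na.rm = TRUE)
# [1] 1  -- the vector contains at least one flag
ave(x, c("a", "a", "b"), FUN = function(v) max(c(0, v), na.rm = TRUE))
# [1] 1 1 0  -- each element is replaced by its group's max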
Some alternatives with packages
library(data.table)
DT = setDT(copy(DF))
DT[, flag := max(c(0,flag), na.rm=TRUE), by=unique3][ flag == 1 ]
# or...
library(dplyr)
library(magrittr) # needed for the compound assignment pipe %<>%
DF2 = DF
(DF2 %<>%
  group_by(unique3) %>%
  mutate(flag = max(c(0,flag), na.rm=TRUE))
) %>% filter(flag == 1)
(I'm only creating the DF2 and DT objects here so the code can be run directly without conflicting edits on DF.)

You should be able to do this with just dplyr. Here, I group_by, then use any to return whether or not any values in that column are "1". If you have more complicated criteria in your use case, you could include them here.
DF %>%
  group_by(unique3) %>%
  mutate(newFlag = any(flag == 1, na.rm = TRUE))
returns:
unique1 unique2 unique3 flag value newFlag
<fctr> <int> <int> <dbl> <int> <lgl>
1 11/1/2016 21 100001 NA 1 TRUE
2 11/1/2016 21 100001 1 6 TRUE
3 11/1/2016 21 100001 NA 18 TRUE
4 11/1/2016 21 100001 NA 19 TRUE
5 11/1/2016 21 100001 NA 22 TRUE
6 11/1/2016 21 100001 NA 29 TRUE
7 11/1/2016 21 100001 NA 30 TRUE
8 11/1/2016 21 100001 NA 32 TRUE
9 11/1/2016 31 100002 NA 1 FALSE
10 11/1/2016 41 100003 NA 1 FALSE
where the column newFlag accomplishes what I think you are requesting. You can overwrite flag instead if you prefer.
You can use it to filter as such:
DF %>%
  group_by(unique3) %>%
  mutate(newFlag = any(flag == 1, na.rm = TRUE)) %>%
  filter(newFlag)
From your question, it is unclear whether you want to keep or discard groups that have a flag. If you want to remove them, use filter(!newFlag) instead. In either case, if you want to be rid of the new column after filtering, use select(-newFlag).
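For instance, a sketch of the discard-and-clean-up variant just described:
DF %>%
  group_by(unique3) %>%
  mutate(newFlag = any(flag == 1, na.rm = TRUE)) %>%
  ungroup() %>%
  filter(!newFlag) %>%  # drop groups containing a flag
  select(-newFlag)      # remove the helper column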

Related

How to fill missing values grouped on id and based on time period from index date

I want to fill in missing values for a data.frame based on a period of time within groups of ID.
For the latest registration_dat within each ID group, I want to fill in missing values from previous rows of that ID group, but only if the previous registration_dat is within 1 year of the latest registration_dat in the group.
Sample version of my data:
ID registration_dat value1 value2
1 2020-03-04 NA NA
1 2019-05-06 33 25
1 2019-01-02 32 21
3 2021-10-31 NA NA
3 2018-10-12 33 NA
3 2018-10-10 25 35
4 2020-01-02 NA NA
4 2019-10-31 32 83
4 2019-09-20 33 56
8 2019-12-12 NA NA
8 2019-10-31 NA 43
8 2019-08-12 32 46
Desired output:
ID registration_dat value1 value2
1 2020-03-04 33 25
1 2019-05-06 33 25
1 2019-01-02 32 21
3 2021-10-31 NA NA
3 2018-10-12 33 NA
3 2018-10-10 25 35
4 2020-01-02 32 83
4 2019-10-31 32 83
4 2019-09-20 33 56
8 2019-12-12 32 43
8 2019-10-31 NA 43
8 2019-08-12 32 46
I am later filtering the data so that I get one row per ID, based on the latest registration date, and I want that row to have as little missing data as possible; hence I want to do this for all columns in the dataframe. However, I do not want NA values filled in from previous dates if they are more than 1 year apart from the latest registration date. My dataframe has 14 columns and 3 million+ rows, so I need this to work on a much bigger data.frame than the one shown as an example.
I'd appreciate any ideas!
You can use across() to manipulate multiple columns at the same time. Note that I use date1 - years(1) <= date2 rather than date1 - 365 <= date2 to identify if a date is within 1 year of the latest one, which can take a leap year (366 days) into account.
library(dplyr)
library(lubridate)
df %>%
  group_by(ID) %>%
  arrange(desc(registration_dat), .by_group = TRUE) %>%
  mutate(across(starts_with("value"),
                ~ if_else(row_number() == 1 & is.na(.x) &
                            registration_dat - years(1) <= registration_dat[which.max(!is.na(.x))],
                          .x[which.max(!is.na(.x))], .x))) %>%
  ungroup()
# # A tibble: 12 x 4
# ID registration_dat value1 value2
# <int> <date> <int> <int>
# 1 1 2020-03-04 33 25
# 2 1 2019-05-06 33 25
# 3 1 2019-01-02 32 21
# 4 3 2021-10-31 NA NA
# 5 3 2018-10-12 33 NA
# 6 3 2018-10-10 25 35
# 7 4 2020-01-02 32 83
# 8 4 2019-10-31 32 83
# 9 4 2019-09-20 33 56
# 10 8 2019-12-12 32 43
# 11 8 2019-10-31 NA 43
# 12 8 2019-08-12 32 46
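To check the leap-year point mentioned above concretely, using one of the dates from the data:
library(lubridate)
as.Date("2020-03-04") - years(1)
# [1] "2019-03-04"  -- exactly one calendar year back
as.Date("2020-03-04") - 365
# [1] "2019-03-05"  -- off by a day, because 2020 is a leap year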
Data
df <- structure(list(ID = c(1L, 1L, 1L, 3L, 3L, 3L, 4L, 4L, 4L, 8L,
8L, 8L), registration_dat = structure(c(18325, 18022, 17898,
18931, 17816, 17814, 18263, 18200, 18159, 18242, 18200, 18120
), class = "Date"), value1 = c(NA, 33L, 32L, NA, 33L, 25L, NA,
32L, 33L, NA, NA, 32L), value2 = c(NA, 25L, 21L, NA, NA, 35L,
NA, 83L, 56L, NA, 43L, 46L)), class = "data.frame", row.names = c(NA,-12L))
You could make a small function (f, below) to handle each value column.
First, arrange by date and group by ID, and generate a rowid (this is only to retain your original order):
dat <- dat %>%
  mutate(rowid = row_number()) %>%
  arrange(registration_dat) %>%
  group_by(ID)
Make a function that takes a df and a val column, and returns an updated df with val fixed:
f <- function(df, val) {
  bind_rows(
    df %>% filter(is.na({{val}}) & row_number() != n()),
    df %>% filter(!is.na({{val}}) | row_number() == n()) %>%
      mutate({{val}} := if_else(is.na({{val}}) & registration_dat - lag(registration_dat) < 365,
                                lag({{val}}), {{val}}))
  )
}
}
Apply the function to the columns of interest
dat = f(dat,value1)
dat = f(dat,value2)
If you want, recover the original order
dat %>% arrange(rowid) %>% select(-rowid)
Output:
ID registration_dat value1 value2
<int> <date> <int> <int>
1 1 2020-03-04 33 25
2 1 2019-05-06 33 25
3 1 2019-01-02 32 21
4 3 2021-10-31 NA NA
5 3 2018-10-12 33 NA
6 3 2018-10-10 25 35
7 4 2020-01-02 32 83
8 4 2019-10-31 32 83
9 4 2019-09-20 33 56
10 8 2019-12-12 32 46
11 8 2019-10-31 NA 43
12 8 2019-08-12 32 46
Update:
The OP wants the final row (i.e. the last registration_dat) per ID. With 3 million rows and 14 value columns, I would use data.table and do something like this:
library(data.table)
f <- function(df) {
  df = df[df[1, registration_dat] - registration_dat <= 365]
  df[1, value := df[2:.N][!is.na(value)][1, value]][1]
}
dcast(
melt(setDT(dat), id=c("ID", "registration_dat"))[order(-registration_dat),f(.SD), by=.(ID,variable)],
ID+registration_dat~variable, value.var="value"
)
Output:
ID registration_dat value1 value2
<int> <Date> <int> <int>
1: 1 2020-03-04 33 25
2: 3 2021-10-31 NA NA
3: 4 2020-01-02 32 83
4: 8 2019-12-12 32 43
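As a minimal illustration of the melt/dcast round trip this relies on (toy data, not the OP's):
library(data.table)
toy <- data.table(ID = 1:2,
                  registration_dat = as.Date(c("2020-01-01", "2020-02-01")),
                  value1 = c(1L, NA), value2 = c(NA, 4L))
melt(toy, id = c("ID", "registration_dat"))
#    ID registration_dat variable value
# 1:  1       2020-01-01   value1     1
# 2:  2       2020-02-01   value1    NA
# 3:  1       2020-01-01   value2    NA
# 4:  2       2020-02-01   value2     4
dcast(melt(toy, id = c("ID", "registration_dat")),
      ID + registration_dat ~ variable, value.var = "value")  # back to the wide shape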

Coding the number of visits based on dates and assigning value in new column R

I am relatively new to R and am trying to create a new column for the number of visits (i.e. num_visits) based on the admission dates (i.e. admit_date).
The sample dataframe is below and the number of visits has to be created based on the admit_date column. The admit_dates do not necessarily run in sequence.
subject_id admit_date num_visits
22 2010-10-20 1
23 2010-10-20 1
24 2010-10-21 1
25 2010-10-21 1
22 2010-12-30 3
22 2010-12-22 2
23 2010-12-25 2
30 2011-01-14 1
31 2011-01-14 1
33 2011-02-05 2
33 2011-01-26 1
I know I need to group by subject_id and perhaps get the counts based on the sequence of the dates.
I am stuck after the following code; I'd appreciate any form of help, thank you!
df %>%
  group_by(subject_id) %>%
We can use mutate after grouping by 'subject_id'
library(dplyr)
df %>%
  arrange(subject_id, as.Date(admit_date)) %>%
  group_by(subject_id) %>%
  mutate(num_visits = row_number())
or with data.table, where order(as.IDate(admit_date)) sorts by date, rowid(subject_id) numbers the visits within each subject in that order, and := writes the result back so the original row order is preserved:
library(data.table)
setDT(df)[order(as.IDate(admit_date)), num_visits := rowid(subject_id)][]
# subject_id admit_date num_visits
# 1: 22 2010-10-20 1
# 2: 23 2010-10-20 1
# 3: 24 2010-10-21 1
# 4: 25 2010-10-21 1
# 5: 22 2010-12-30 3
# 6: 22 2010-12-22 2
# 7: 23 2010-12-25 2
# 8: 30 2011-01-14 1
# 9: 31 2011-01-14 1
#10: 33 2011-02-05 2
#11: 33 2011-01-26 1
Data
df <- structure(list(subject_id = c(22L, 23L, 24L, 25L, 22L, 22L, 23L,
30L, 31L, 33L, 33L), admit_date = c("2010-10-20", "2010-10-20",
"2010-10-21", "2010-10-21", "2010-12-30", "2010-12-22", "2010-12-25",
"2011-01-14", "2011-01-14", "2011-02-05", "2011-01-26")), row.names = c(NA,
-11L), class = "data.frame")

Select rows with all longitudinal measurements

I have a longitudinal dataset with ID, Wave (Wave1-4), and Score. Here's sample data with the same structure. The length of the original data is around 2000, with 500 participants total, put in long form.
ID Wave Score
1 1001 1 28
2 1001 2 27
3 1001 3 28
4 1001 4 26
5 1002 1 30
6 1002 3 30
7 1003 1 30
8 1003 2 30
9 1003 3 29
10 1003 4 28
11 1004 1 22
12 1005 1 20
13 1005 2 18
14 1006 1 22
15 1006 2 23
16 1006 3 25
17 1006 4 19
I would like to select the 'ID's with all four measurements of 'Score' available. In other words, I want to select rows of the participants with 'Score' available for all 4 waves.
I've been trying to select rows with 'ID's that have data in all 'Wave's. My attempt so far has been based on this idea: if a participant has all four measurements, the ID will appear in the data four times.
That's why I tried to count the occurrences of each ID,
table(data$id) == 4
and although it showed me whether each ID appears in the data four times, I cannot use that to select the corresponding rows:
all.data <- subset(data, subset=table(data$id) == 4)
This fails because the logical index has one entry per ID while the long-form data has one row per measurement: "Length of logical index must be 1 or 2637, not 828". I need the data in long form for further analysis, so I wish not to reshape it.
You can try:
df[as.logical(with(df, ave(Wave, ID, FUN = function(x) length(x) == 4))), ]
ID Wave Score
1 1001 1 28
2 1001 2 27
3 1001 3 28
4 1001 4 26
7 1003 1 30
8 1003 2 30
9 1003 3 29
10 1003 4 28
14 1006 1 22
15 1006 2 23
16 1006 3 25
17 1006 4 19
Or if you want to keep your basic idea, a slight modification of @jay.sf's code:
df[df$ID %in% names(which(table(df$ID) == 4)), ]
I like your table() approach.
> table(d$ID) == 4
1001 1002 1003 1004 1005 1006
TRUE FALSE TRUE FALSE FALSE TRUE
The interesting IDs are in the names() though. So to get your code to work you could extract the IDs like so
subs <- names(which(table(d$ID) == 4))
and get your desired subset using %in%.
all.data <- subset(d, subset=d$ID %in% subs)
Result
> all.data
ID Wave Score
1 1001 1 28
2 1001 2 27
3 1001 3 28
4 1001 4 26
7 1003 1 30
8 1003 2 30
9 1003 3 29
10 1003 4 28
14 1006 1 22
15 1006 2 23
16 1006 3 25
17 1006 4 19
(BTW: Always make sure with ?<name> that you do not define any existing function names as object names, this will save you a lot of trouble. In your case type ?data in a fresh session before loading the object.)
Data
> dput(d)
structure(list(ID = c(1001L, 1001L, 1001L, 1001L, 1002L, 1002L,
1003L, 1003L, 1003L, 1003L, 1004L, 1005L, 1005L, 1006L, 1006L,
1006L, 1006L), Wave = c(1L, 2L, 3L, 4L, 1L, 3L, 1L, 2L, 3L, 4L,
1L, 1L, 2L, 1L, 2L, 3L, 4L), Score = c(28L, 27L, 28L, 26L, 30L,
30L, 30L, 30L, 29L, 28L, 22L, 20L, 18L, 22L, 23L, 25L, 19L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15", "16", "17"))
Instead of feeding table(data$ID) directly to subset(), try with
ID %in% names(table(data$ID)[table(data$ID)==4])
since table() gives you the number of occurrences for each ID as a named vector.
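Put together, a sketch of how that condition slots into the subset() call from the question:
subs <- names(table(data$ID)[table(data$ID) == 4])
all.data <- subset(data, subset = ID %in% subs)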
This is a quick data.table answer.
library(data.table)
dt <- structure(list(ID = c(1001, 1001, 1001, 1001, 1002, 1002, 1003,
1003, 1003, 1003, 1004, 1005, 1005, 1006, 1006, 1006, 1006),
Wave = c(1, 2, 3, 4, 1, 3, 1, 2, 3, 4, 1, 1, 2, 1, 2, 3,
4), Score = c(28, 27, 28, 26, 30, 30, 30, 30, 29, 28, 22,
20, 18, 22, 23, 25, 19)), row.names = c(NA, -17L), class = c("data.table",
"data.frame"))
dt[ , .(Score, N = uniqueN(.SD)) , by = list(ID), .SDcols = c("Wave")][N == 4,]
ID Score N
1: 1001 28 4
2: 1001 27 4
3: 1001 28 4
4: 1001 26 4
5: 1003 30 4
6: 1003 30 4
7: 1003 29 4
8: 1003 28 4
9: 1006 22 4
10: 1006 23 4
11: 1006 25 4
12: 1006 19 4
For the sake of completeness, here are two data.table solutions. Both identify those IDs which have four distinct values of Wave. One approach uses subsetting, the other one is joining.
Subsetting
library(data.table)
setDT(df)[ID %in% df[, which(uniqueN(Wave) == 4L), by = ID]$ID]
ID Wave Score
1: 1001 1 28
2: 1001 2 27
3: 1001 3 28
4: 1001 4 26
5: 1003 1 30
6: 1003 2 30
7: 1003 3 29
8: 1003 4 28
9: 1006 1 22
10: 1006 2 23
11: 1006 3 25
12: 1006 4 19
Joining
library(data.table)
setDT(df)[df[, .N, .(ID, Wave)][, .N, ID][N == 4L, .(ID)], on = "ID"]
which returns the same result. (The inner chain first counts rows per ID and Wave, then counts those (ID, Wave) combinations per ID; IDs with N == 4 are kept and joined back onto df.)
Data
library(data.table)
df <- fread("
rn ID Wave Score
1 1001 1 28
2 1001 2 27
3 1001 3 28
4 1001 4 26
5 1002 1 30
6 1002 3 30
7 1003 1 30
8 1003 2 30
9 1003 3 29
10 1003 4 28
11 1004 1 22
12 1005 1 20
13 1005 2 18
14 1006 1 22
15 1006 2 23
16 1006 3 25
17 1006 4 19", drop = 1L)

Moving average and moving slope in R

I am looking to separately calculate a 7-day moving average and 7-day moving slope of 'oldvar'.
My sincere apologies that I didn't add the details below in my original post. These are repeated observations for each id which can go from a minimum of 3 observations per id to 100 observations per id. The start day can be different for different IDs, and to make things complicated, the days are not equally spaced, so some IDs have missing days.
Here is the data structure. Please note that 'average' is the variable that I am trying to create as a moving 7-day average for each ID:
id day outcome average
1 1 15 100 NA
2 1 16 110 NA
3 1 17 190 NA
4 1 18 130 NA
5 1 19 140 NA
6 1 20 150 NA
7 1 21 160 140
8 1 22 100 140
9 1 23 180 150
10 1 24 120 140
12 2 16 90 NA
13 2 17 110 NA
14 2 18 120 NA
12 2 20 130 NA
15 3 16 110 NA
16 3 18 200 NA
17 3 19 180 NA
18 3 21 170 NA
19 3 22 180 168
20 3 24 210 188
21 3 25 160 180
22 3 27 200 184
Also, I would appreciate advice on how to calculate a moving 7-day slope in the same way.
Thank you and again many apologies for being unclear the first time around.
The real challenge is to create a data.frame after completing the missing rows. One solution could be to use the zoo library. The rollapply function provides a way to assign NA values for the initial rows.
Using the data from the OP as is, the solution could be:
library(zoo)
library(dplyr)
# Data from OP
df <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L),
day = c(15L,16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 16L, 17L, 18L, 20L,
16L, 18L, 19L, 21L, 22L, 24L, 25L, 27L),
outcome = c(100L, 110L,190L, 130L, 140L, 150L, 160L, 100L, 180L, 120L, 90L, 110L, 120L,
130L, 110L, 200L, 180L, 170L, 180L, 210L, 160L, 200L)),
.Names = c("id", "day", "outcome"), row.names = c(NA, -22L), class = "data.frame")
# Make a complete day grid (no missing days) for each id
df_complete <- merge(
expand.grid(id=unique(df$id), day=min(df$day):max(df$day)),
df, all=TRUE)
# Valid range of day for each ID group
df_id_wise_range <- df %>%
  group_by(id) %>%
  summarise(min_day = min(day), max_day = max(day)) %>%
  as.data.frame()
# id min_day max_day
# 1 1 15 24
# 2 2 16 20
# 3 3 16 27
# Join original df and df_complete and then use df_id_wise_range to
# filter it for valid range of day for each group
df_final <- df_complete %>%
  left_join(df, by = c("id", "day")) %>%
  select(-outcome.y) %>%
  inner_join(df_id_wise_range, by = "id") %>%
  filter(day >= min_day & day <= max_day) %>%
  mutate(outcome = outcome.x) %>%
  select(id, day, outcome) %>%
  as.data.frame()
# Now apply mean to get average
df_average <- df_final %>%
  group_by(id) %>%
  mutate(average = rollapply(outcome, 7, mean, na.rm = TRUE, by = 1,
                             fill = NA, align = "right", partial = 7)) %>%
  as.data.frame()
df_average
# The result
# id day outcome average
#1 1 15 100 NA
#2 1 16 110 NA
#3 1 17 190 NA
#4 1 18 130 NA
#5 1 19 140 NA
#6 1 20 150 NA
#7 1 21 160 140.0
#8 1 22 100 140.0
#9 1 23 180 150.0
#10 1 24 120 140.0
#11 2 16 90 NA
#12 2 17 110 NA
#13 2 18 120 NA
#....
#....
#19 3 19 180 NA
#20 3 20 NA NA
#21 3 21 170 NA
#22 3 22 180 168.0
#23 3 23 NA 182.5
#24 3 24 210 188.0
#25 3 25 160 180.0
#26 3 26 NA 180.0
#27 3 27 200 184.0
The steps to calculate the moving slope are:
First, create a function that returns the slope.
Then, use that function as part of rollapplyr.
# Toy data, assumed here from the printed output below: a = 1:20, b = 21:40,
# with mean_a a right-aligned 7-period rolling mean of a
z2 <- data.frame(a = 1:20, b = 21:40)
z2$mean_a <- rollapplyr(z2$a, 7, mean, fill = NA)
# Function to calculate slope
slop_e <- function(z) coef(lm(b ~ a, as.data.frame(z)))[[2]]
# Apply function
z2$slope <- rollapplyr(zoo(z2), 7, slop_e, by.column = FALSE, fill = NA, align = "right")
z2
a b mean_a slope
1 1 21 NA NA
2 2 22 NA NA
3 3 23 NA NA
4 4 24 NA NA
5 5 25 NA NA
6 6 26 NA NA
7 7 27 4 1
8 8 28 5 1
9 9 29 6 1
10 10 30 7 1
11 11 31 8 1
12 12 32 9 1
13 13 33 10 1
14 14 34 11 1
15 15 35 12 1
16 16 36 13 1
17 17 37 14 1
18 18 38 15 1
19 19 39 16 1
20 20 40 17 1
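Since the question asks for the slope per ID, here is a hedged sketch applying the same slop_e function to the gap-filled df_final from the answer above, regressing outcome on day over each right-aligned 7-row window (this assumes every window keeps at least two non-NA outcomes, since lm() drops NA rows by default):
library(dplyr)
library(zoo)
df_slope <- df_final %>%
  group_by(id) %>%
  filter(n() >= 7) %>%  # rollapply needs at least one full window per id
  mutate(slope = as.numeric(rollapplyr(cbind(a = day, b = outcome), 7, slop_e,
                                       by.column = FALSE, fill = NA))) %>%
  ungroup()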

lapply alternative to for loop to append to data frame

I have a data frame:
df<-structure(list(chrom = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 3L, 3L, 4L, 4L, 4L, 4L), .Label = c("1", "2", "3", "4"), class = "factor"),
pos = c(10L, 200L, 134L, 400L, 600L, 1000L, 20L, 33L, 40L,
45L, 50L, 55L, 100L, 123L)), .Names = c("chrom", "pos"), row.names = c(NA, -14L), class = "data.frame")
> head(df)
chrom pos
1 1 10
2 1 200
3 1 134
4 1 400
5 1 600
6 1 1000
And I want to calculate pos[i+1] - pos[i] on the same chromosome (chrom).
By using a for loop over each chrom level, and another over each row, I get the expected results:
library(dplyr)  # for filter() and arrange()
for (c in levels(df$chrom)){
  df_chrom <- filter(df, chrom == c)
  df_chrom <- arrange(df_chrom, df_chrom$pos)
  for (i in 1:nrow(df_chrom)){
    dist <- (df_chrom$pos[i+1] - df_chrom$pos[i])
    logdist <- log10(dist)
    cat(c, i, df_chrom$pos[i], dist, logdist, "\n")
  }
}
However, I want to save this to a data frame, and I think that lapply or apply is the right way to go about this. I can't work out how to make the pos[i+1] - pos[i] calculation though (seeing as lapply works on each row/column).
Any pointers would be appreciated.
Here's the output from my solution:
chrom index pos dist log10dist
1 1 10 124 2.093422
1 2 134 66 1.819544
1 3 200 200 2.30103
1 4 400 200 2.30103
1 5 600 400 2.60206
1 6 1000 NA NA
2 1 20 13 1.113943
2 2 33 NA NA
3 1 40 5 0.69897
3 2 45 NA NA
4 1 50 5 0.69897
4 2 55 45 1.653213
4 3 100 23 1.361728
4 4 123 NA NA
We could do this using a grouped difference. Convert the 'data.frame' to a 'data.table' (setDT(df)), order by 'pos', and then, grouped by 'chrom', get the difference of 'pos' (diff) along with the log10 of that difference:
library(data.table)
setDT(df)[order(pos), {
  v1 <- diff(pos)
  .(index = seq_len(.N), pos = pos,
    dist = c(v1, NA), logdiff = c(log10(v1), NA))
}, by = chrom]
# chrom index pos dist logdiff
# 1: 1 1 10 124 2.093422
# 2: 1 2 134 66 1.819544
# 3: 1 3 200 200 2.301030
# 4: 1 4 400 200 2.301030
# 5: 1 5 600 400 2.602060
# 6: 1 6 1000 NA NA
# 7: 2 1 20 13 1.113943
# 8: 2 2 33 NA NA
# 9: 3 1 40 5 0.698970
#10: 3 2 45 NA NA
#11: 4 1 50 5 0.698970
#12: 4 2 55 45 1.653213
#13: 4 3 100 23 1.361728
#14: 4 4 123 NA NA
Upon running the OP's code, the printed output is:
#1 1 10 124 2.093422
#1 2 134 66 1.819544
#1 3 200 200 2.30103
#1 4 400 200 2.30103
#1 5 600 400 2.60206
#1 6 1000 NA NA
#2 1 20 13 1.113943
#2 2 33 NA NA
#3 1 40 5 0.69897
#3 2 45 NA NA
#4 1 50 5 0.69897
#4 2 55 45 1.653213
#4 3 100 23 1.361728
#4 4 123 NA NA
We split df by df$chrom (Note that we reorder both df and df$chrom before splitting). Then we go through each of the subgroups (the subgroups are called a in this example) using lapply. On the pos column of each subgroup, we calculate difference (diff) of consecutive elements and take log10. Since diff decreases the number of elements by 1, we add a NA to the end. Finally, we rbind all the subgroups together using do.call.
do.call(rbind,
        lapply(split(df[order(df$chrom, df$pos), ], df$chrom[order(df$chrom, df$pos)]),
               function(a) data.frame(a, dist = c(log10(diff(a$pos)), NA))))
# chrom pos dist
#1.1 1 10 2.093422
#1.3 1 134 1.819544
#1.2 1 200 2.301030
#1.4 1 400 2.301030
#1.5 1 600 2.602060
#1.6 1 1000 NA
#2.7 2 20 1.113943
#2.8 2 33 NA
#3.9 3 40 0.698970
#3.10 3 45 NA
#4.11 4 50 0.698970
#4.12 4 55 1.653213
#4.13 4 100 1.361728
#4.14 4 123 NA
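If you also want the raw distance alongside its log, as in the OP's printed output, a small variation of the same idea (a sketch):
do.call(rbind,
        lapply(split(df[order(df$chrom, df$pos), ], df$chrom[order(df$chrom, df$pos)]),
               function(a) {
                 d <- c(diff(a$pos), NA)  # NA pads the last row of each group
                 data.frame(a, dist = d, logdist = log10(d))
               }))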
