Ignore NA values of a column within a statement - r

Until now I've been working with a medium-sized dataset from an Occupation Survey (around 200 MB in total). Here's the data if you want to review it: https://drive.google.com/drive/folders/1Od8zlOE3U3DO0YRGnBadFz804OUDnuQZ?usp=sharing
I have the following code:
base<-tibble(ID_VIV=datos$ID_VIV, ID_PERSONA=datos$ID_PERSONA, EDAD=datos$EDAD, CONACT=datos$CONACT)
base$maxage <- ave(base$EDAD, base$ID_VIV, FUN=max)
base$Condición_I<-case_when(base$CONACT==32 & base$EDAD>=60 ~ 1,
base$CONACT>=10 & base$EDAD>=60 & base$CONACT<=16 ~ 2,
base$CONACT==20 & base$EDAD>=60 | base$CONACT==31 & base$EDAD>=60 | (base$CONACT>=33 & base$CONACT<=35 & base$EDAD>=60) ~ 3)
base <- subset(base, maxage >= 60)
base<- base %>% group_by(ID_VIV) %>% mutate(Condición_V = if(n_distinct(Condición_I) > 1) 4 else Condición_I)
And ended up with:
# A tibble: 38,307 x 7
# Groups: ID_VIV [10,499]
ID_VIV ID_PERSONA EDAD CONACT maxage Condición_I Condición_V
<chr> <chr> <int> <int> <int> <dbl> <dbl>
1 10010000007 1001000000701 69 32 69 1 1
2 10010000008 1001000000803 83 33 83 3 4
3 10010000008 1001000000802 47 33 83 NA 4
4 10010000008 1001000000801 47 10 83 NA 4
5 10010000012 1001000001204 4 NA 60 NA 4
6 10010000012 1001000001203 2 NA 60 NA 4
7 10010000012 1001000001201 60 10 60 2 4
8 10010000012 1001000001202 21 10 60 NA 4
9 10010000014 1001000001401 67 32 67 1 4
10 10010000014 1001000001402 64 33 67 3 4
The Condición_I column is a code for the labour condition of each individual (row). Some of these individuals share a house (that's why they share ID_VIV). I only care about the individuals who are 60 or older; all the NA values are individuals who live with someone aged 60+, and I do not care about their situation, but I need to keep them. I need the column Condición_V to display another value following these conditions:
Condición_I == 1 ~ 1
Condición_I == 2 ~ 2
Condición_I == 3 ~ 3
Any combination of Condición_I ~ 4
This means that if all the individuals aged 60+ in a house have Condición_I == 1, then Condición_V will be 1, and the same holds for codes 2 and 3. When there is, e.g., one person with Condición_I == 1 and another with Condición_I == 3 in the same house, Condición_V will be 4.
And I'm hoping to get this kind of result:
# A tibble: 38,307 x 7
# Groups: ID_VIV [10,499]
ID_VIV ID_PERSONA EDAD CONACT maxage Condición_I Condición_V
<chr> <chr> <int> <int> <int> <dbl> <dbl>
1 10010000007 1001000000701 69 32 69 1 1
2 10010000008 1001000000803 83 33 83 3 3
3 10010000008 1001000000802 47 33 83 NA 3
4 10010000008 1001000000801 47 10 83 NA 3
5 10010000012 1001000001204 4 NA 60 NA 2
6 10010000012 1001000001203 2 NA 60 NA 2
7 10010000012 1001000001201 60 10 60 2 2
8 10010000012 1001000001202 21 10 60 NA 2
9 10010000014 1001000001401 67 32 67 1 4
10 10010000014 1001000001402 64 33 67 3 4
I know my error is in:
base<- base %>% group_by(ID_VIV) %>% mutate(Condición_V = if(n_distinct(Condición_I) > 1) 4 else Condición_I)
Is there a way to use that line of code while ignoring the NA values, or is my best option to do it another way? I do not have to do it the way I'm trying, and any other approach or help will be much appreciated!

We can wrap the Condición_I column in na.omit and check the number of distinct elements with n_distinct. If it is greater than 1, return 4; otherwise return the first non-NA value of the column
base %>%
group_by(ID_VIV) %>%
mutate(Condición_V = if(n_distinct(na.omit(Condición_I)) > 1)
4 else na.omit(Condición_I)[1])
# A tibble: 10 x 7
# Groups: ID_VIV [4]
# ID_VIV ID_PERSONA EDAD CONACT maxage Condición_I Condición_V
# <chr> <chr> <int> <int> <int> <int> <dbl>
# 1 10010000007 1001000000701 69 32 69 1 1
# 2 10010000008 1001000000803 83 33 83 3 3
# 3 10010000008 1001000000802 47 33 83 NA 3
# 4 10010000008 1001000000801 47 10 83 NA 3
# 5 10010000012 1001000001204 4 NA 60 NA 2
# 6 10010000012 1001000001203 2 NA 60 NA 2
# 7 10010000012 1001000001201 60 10 60 2 2
# 8 10010000012 1001000001202 21 10 60 NA 2
# 9 10010000014 1001000001401 67 32 67 1 4
#10 10010000014 1001000001402 64 33 67 3 4
base <- structure(list(ID_VIV = c("10010000007", "10010000008", "10010000008",
"10010000008", "10010000012", "10010000012", "10010000012", "10010000012",
"10010000014", "10010000014"), ID_PERSONA = c("1001000000701",
"1001000000803", "1001000000802", "1001000000801", "1001000001204",
"1001000001203", "1001000001201", "1001000001202", "1001000001401",
"1001000001402"), EDAD = c(69L, 83L, 47L, 47L, 4L, 2L, 60L, 21L,
67L, 64L), CONACT = c(32L, 33L, 33L, 10L, NA, NA, 10L, 10L, 32L,
33L), maxage = c(69L, 83L, 83L, 83L, 60L, 60L, 60L, 60L, 67L,
67L), Condición_I = c(1L, 3L, NA, NA, NA, NA, 2L, NA, 1L, 3L
)), row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9",
"10"), class = "data.frame")

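The same idea can be written as a small helper. This is only a sketch on toy data: the data frame `toy`, the helper name `collapse_condition`, and the ASCII column names `Condicion_I` and `Condicion_V` are hypothetical and used for illustration. It also makes explicit what happens when a household has no non-NA code at all, a case the question does not cover; returning NA there is an assumption.

```r
library(dplyr)

# Hypothetical toy data: household id and individual condition code
toy <- tibble(
  ID_VIV      = c("a", "b", "b", "c", "c", "d"),
  Condicion_I = c(1, 3, NA, 1, 3, NA)
)

# Collapse the non-NA codes of one household into a single value
collapse_condition <- function(x) {
  vals <- unique(x[!is.na(x)])
  if (length(vals) == 0) {
    NA_real_            # assumption: no code available -> NA
  } else if (length(vals) == 1) {
    as.numeric(vals)    # everyone with a code agrees
  } else {
    4                   # mixed codes in the household
  }
}

toy <- toy %>%
  group_by(ID_VIV) %>%
  mutate(Condicion_V = collapse_condition(Condicion_I)) %>%
  ungroup()
```

Household "b" gets 3 on both rows, "c" gets 4, and "d" stays NA. dplyr's n_distinct() also accepts na.rm = TRUE, which could replace the na.omit() call in the condition.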

How to fill missing values grouped on id and based on time period from index date

I want to fill in missing values for a data.frame based on a period of time within groups of ID.
For the latest registration_dat within the same ID group, I want to fill in with previous values in the ID group but only if the registration_dat is within 1 year of the latest registration_dat in the ID group.
Sample version of my data:
ID registration_dat value1 value2
1 2020-03-04 NA NA
1 2019-05-06 33 25
1 2019-01-02 32 21
3 2021-10-31 NA NA
3 2018-10-12 33 NA
3 2018-10-10 25 35
4 2020-01-02 NA NA
4 2019-10-31 32 83
4 2019-09-20 33 56
8 2019-12-12 NA NA
8 2019-10-31 NA 43
8 2019-08-12 32 46
Desired output:
ID registration_dat value1 value2
1 2020-03-04 33 25
1 2019-05-06 33 25
1 2019-01-02 32 21
3 2021-10-31 NA NA
3 2018-10-12 33 NA
3 2018-10-10 25 35
4 2020-01-02 32 83
4 2019-10-31 32 83
4 2019-09-20 33 56
8 2019-12-12 32 43
8 2019-10-31 NA 43
8 2019-08-12 32 46
I am later filtering the data so that I get one row per ID based on the latest registration date, and I want this row to have as little missing data as possible; hence I want to do this for all columns in the data frame. However, I do not want NA values to be filled in with values from previous dates if they are more than 1 year apart from the latest registration date. My data frame has 14 columns and 3 million+ rows, so I need it to work on a much bigger data.frame than the one shown as an example.
I'd appreciate any ideas!
You can use across() to manipulate multiple columns at the same time. Note that I use date1 - years(1) <= date2 rather than date1 - 365 <= date2 to identify if a date is within 1 year of the latest one, which can take a leap year (366 days) into account.
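library(dplyr)
library(lubridate)
df %>%
  group_by(ID) %>%
  arrange(desc(registration_dat), .by_group = TRUE) %>%
  # for the latest row per ID, fill an NA with the most recent non-NA value if it is within 1 year
  mutate(across(starts_with("value"),
    ~ if_else(row_number() == 1 & is.na(.x) & registration_dat - years(1) <= registration_dat[which.max(!is.na(.x))],
              .x[which.max(!is.na(.x))], .x))) %>%
  ungroup()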
# # A tibble: 12 x 4
# ID registration_dat value1 value2
# <int> <date> <int> <int>
# 1 1 2020-03-04 33 25
# 2 1 2019-05-06 33 25
# 3 1 2019-01-02 32 21
# 4 3 2021-10-31 NA NA
# 5 3 2018-10-12 33 NA
# 6 3 2018-10-10 25 35
# 7 4 2020-01-02 32 83
# 8 4 2019-10-31 32 83
# 9 4 2019-09-20 33 56
# 10 8 2019-12-12 32 43
# 11 8 2019-10-31 NA 43
# 12 8 2019-08-12 32 46
df <- structure(list(ID = c(1L, 1L, 1L, 3L, 3L, 3L, 4L, 4L, 4L, 8L,
8L, 8L), registration_dat = structure(c(18325, 18022, 17898,
18931, 17816, 17814, 18263, 18200, 18159, 18242, 18200, 18120
), class = "Date"), value1 = c(NA, 33L, 32L, NA, 33L, 25L, NA,
32L, 33L, NA, NA, 32L), value2 = c(NA, 25L, 21L, NA, NA, 35L,
NA, 83L, 56L, NA, 43L, 46L)), class = "data.frame", row.names = c(NA,-12L))
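To see why years(1) and a fixed 365 days can differ, here is a small sketch using the first date from the sample data. The variable name `latest` is just for illustration. Because 29 February 2020 falls inside the window, the two cutoffs are one day apart.

```r
library(lubridate)

latest <- as.Date("2020-03-04")

cutoff_period <- latest - years(1)  # calendar year back: 2019-03-04
cutoff_days   <- latest - 365       # fixed 365 days back: 2019-03-05

# Caveat: subtracting years(1) from 29 February gives NA,
# while %m-% rolls back to the last valid day of the month
leap_plain  <- as.Date("2020-02-29") - years(1)     # NA
leap_rolled <- as.Date("2020-02-29") %m-% years(1)  # 2019-02-28
```

If the latest registration date can fall on 29 February, `%m-%` may be the safer choice for the cutoff.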
You could make a small function (f, below) to handle each value column.
Make a grouped ID, and generate a rowid (this is only to retain your original order)
dat <- dat %>%
  mutate(rowid = row_number()) %>%
  arrange(registration_dat) %>%
  group_by(ID)
Make a function that takes a df and a val column, and returns an updated df with val fixed
f <- function(df, val) {
  bind_rows(
    # rows with a missing val that are not the latest row stay unchanged
    df %>% filter(is.na({{val}}) & row_number()!=n()),
    # among the remaining rows, fill the latest row from the previous one if it is less than 365 days older
    df %>% filter(!is.na({{val}}) | row_number()==n()) %>%
      mutate({{val}} := if_else(is.na({{val}}) & registration_dat-lag(registration_dat)<365, lag({{val}}),{{val}}))
  )
}
Apply the function to the columns of interest
dat = f(dat,value1)
dat = f(dat,value2)
If you want, recover the original order
dat %>% arrange(rowid) %>% select(-rowid)
ID registration_dat value1 value2
<int> <date> <int> <int>
1 1 2020-03-04 33 25
2 1 2019-05-06 33 25
3 1 2019-01-02 32 21
4 3 2021-10-31 NA NA
5 3 2018-10-12 33 NA
6 3 2018-10-10 25 35
7 4 2020-01-02 32 83
8 4 2019-10-31 32 83
9 4 2019-09-20 33 56
10 8 2019-12-12 32 46
11 8 2019-10-31 NA 43
12 8 2019-08-12 32 46
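The OP wants the final row (i.e. the last registration_dat) per ID. With 3 million rows and 14 value columns, I would use data.table and do something like this:
library(data.table)
f <- function(df) {
  # keep rows within 365 days of the latest registration (row 1 after ordering)
  df = df[df[1,registration_dat]-registration_dat<=365]
  # latest date paired with the most recent non-missing value (NA if there is none)
  data.table(registration_dat = df[1, registration_dat], value = df[!is.na(value)][1, value])
}
dcast(
  melt(setDT(dat), id=c("ID", "registration_dat"))[order(-registration_dat),f(.SD), by=.(ID,variable)],
  ID+registration_dat~variable, value.var="value"
)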
ID registration_dat value1 value2
<int> <Date> <int> <int>
1: 1 2020-03-04 33 25
2: 3 2021-10-31 NA NA
3: 4 2020-01-02 32 83
4: 8 2019-12-12 32 43
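For comparison, the "one row per ID" result can also be sketched in dplyr. This is only an illustration under the assumption that "within 1 year" means at most 365 days before the latest date, as in the data.table version; the object name `latest` is hypothetical and the data is the sample from the question.

```r
library(dplyr)

df <- data.frame(
  ID = rep(c(1L, 3L, 4L, 8L), each = 3),
  registration_dat = as.Date(c("2020-03-04", "2019-05-06", "2019-01-02",
                               "2021-10-31", "2018-10-12", "2018-10-10",
                               "2020-01-02", "2019-10-31", "2019-09-20",
                               "2019-12-12", "2019-10-31", "2019-08-12")),
  value1 = c(NA, 33L, 32L, NA, 33L, 25L, NA, 32L, 33L, NA, NA, 32L),
  value2 = c(NA, 25L, 21L, NA, NA, 35L, NA, 83L, 56L, NA, 43L, 46L)
)

latest <- df %>%
  group_by(ID) %>%
  # drop rows more than 365 days older than the latest registration
  filter(max(registration_dat) - registration_dat <= 365) %>%
  arrange(desc(registration_dat), .by_group = TRUE) %>%
  summarise(
    # most recent non-missing value per column (NA if none is left)
    across(starts_with("value"), ~ .x[!is.na(.x)][1]),
    registration_dat = first(registration_dat),
    .groups = "drop"
  ) %>%
  relocate(registration_dat, .after = ID)
```

On the sample data this gives 33/25 for ID 1, NA/NA for ID 3, 32/83 for ID 4 and 32/43 for ID 8. For 3 million rows the data.table approach is likely to be faster.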

How to concatenate two pairs of columns by name with shifting rows, in a dataframe with multiple column pairs [duplicate]

I have this dataframe:
id a1 a2 b1 b2 c1 c2
<int> <int> <int> <int> <int> <int> <int>
1 1 83 33 55 33 85 86
2 2 37 0 60 98 51 0
3 3 97 71 85 8 44 40
4 4 51 6 43 15 55 57
5 5 28 53 62 73 70 9
df <- structure(list(id = 1:5, a1 = c(83L, 37L, 97L, 51L, 28L), a2 = c(33L,
0L, 71L, 6L, 53L), b1 = c(55L, 60L, 85L, 43L, 62L), b2 = c(33L,
98L, 8L, 15L, 73L), c1 = c(85L, 51L, 44L, 55L, 70L), c2 = c(86L,
0L, 40L, 57L, 9L)), row.names = c(NA, -5L), class = c("tbl_df",
"tbl", "data.frame"))
I want to:
Combine columns with same starting character to one column by shifting each row of the second column by 1 down and naming the new column with the character of the two columns.
My desired output:
id a b c
<dbl> <dbl> <dbl> <dbl>
1 1 83 55 85
2 1 33 33 86
3 2 37 60 51
4 2 0 98 0
5 3 97 85 44
6 3 71 8 40
7 4 51 43 55
8 4 6 15 57
9 5 28 62 70
10 5 53 73 9
I have tried using the lag function, but I don't know how to combine and shift columns at the same time!
You can use the following solution. I have also modified your data set and added an id column:
df %>%
pivot_longer(!id, names_to = c(".value", NA), names_pattern = "([[:alpha:]])(\\d)")
# A tibble: 10 x 4
id a b c
<int> <int> <int> <int>
1 1 83 55 85
2 1 33 33 86
3 2 37 60 51
4 2 0 98 0
5 3 97 85 44
6 3 71 8 40
7 4 51 43 55
8 4 6 15 57
9 5 28 62 70
10 5 53 73 9
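To make the role of ".value" and the NA in names_to more visible, here is a sketch on a reduced, hypothetical version of the data in which the digit suffix is kept as a column (named `set` here purely for illustration) instead of being dropped. The first regex group becomes the output column name and the second group ends up in `set`.

```r
library(tidyr)

# Reduced toy version of the data (two ids, two column pairs)
df_small <- tibble::tibble(
  id = 1:2,
  a1 = c(83L, 37L), a2 = c(33L, 0L),
  b1 = c(55L, 60L), b2 = c(33L, 98L)
)

long <- pivot_longer(
  df_small, !id,
  names_to = c(".value", "set"),          # group 1 -> column name, group 2 -> `set`
  names_pattern = "([[:alpha:]])(\\d)"
)
```

Replacing "set" with NA in names_to, as in the answer above, discards the suffix and leaves only id, a and b.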
We can pivot_longer, remove the digits from name, then pivot_wider and unnest
df %>% pivot_longer(cols = -id) %>%
  mutate(name = str_remove(name, '[0-9]')) %>%
  pivot_wider(names_from = name, values_fn = list) %>%
  unnest(cols = everything())
# A tibble: 10 x 4
id a b c
<int> <int> <int> <int>
1 1 83 55 85
2 1 33 33 86
3 2 37 60 51
4 2 0 98 0
5 3 97 85 44
6 3 71 8 40
7 4 51 43 55
8 4 6 15 57
9 5 28 62 70
10 5 53 73 9
Doing it as a pivot_longer(), then pivot_wider() is easier to read, but @Anoushiravan R's answer is more direct
df %>%
rownames_to_column(var = "id") %>% # Add the id column (skip this step if df already has one)
pivot_longer(-id) %>% # Make long
mutate(order = str_sub(name, -1), name = str_sub(name, 1, 1)) %>% # Breakout the name column
pivot_wider(names_from = name) %>% # Make wide again
select(-order) # Drop the ordering column
I think ANoushiravan's solution is the tidiest way to do it. We could also use {dplyover} (disclaimer) for this:
library(dplyover) # https://github.com/TimTeaFan/dplyover
df %>%
  group_by(id) %>%
  summarise(across2(ends_with("1"),
                    ends_with("2"),
                    ~ c(.x,.y),
                    .names = "{pre}"))
#> `summarise()` has grouped output by 'id'. You can override using the `.groups` argument.
#> # A tibble: 10 x 4
#> # Groups: id [5]
#> id a b c
#> <int> <int> <int> <int>
#> 1 1 83 55 85
#> 2 1 33 33 86
#> 3 2 37 60 51
#> 4 2 0 98 0
#> 5 3 97 85 44
#> 6 3 71 8 40
#> 7 4 51 43 55
#> 8 4 6 15 57
#> 9 5 28 62 70
#> 10 5 53 73 9
Created on 2021-07-28 by the reprex package (v0.3.0)

Fill in values between start and end value in R

W (blue line below) in my data.frame represents where the water level in the river intersects the elevation profile.
In my data.frame, for each group in ID, I need to fill in values between the start and end value (W)
My data
> head(df, 23)
ID elevation code
1 1 150 <NA>
2 1 140 <NA>
3 1 130 W
4 1 120 <NA>
5 1 110 <NA>
6 1 120 <NA>
7 1 130 W
8 1 140 <NA>
9 1 150 <NA>
10 2 90 <NA>
11 2 80 <NA>
12 2 70 <NA>
13 2 66 W
14 2 60 <NA>
15 2 50 <NA>
16 2 66 W
17 2 70 <NA>
18 2 72 <NA>
19 2 68 W
20 2 65 <NA>
21 2 60 <NA>
22 2 68 W
23 2 70 <NA>
I want the final result to look like below
ID elevation code
1 1 150 <NA>
2 1 140 <NA>
3 1 130 W
4 1 120 W
5 1 110 W
6 1 120 W
7 1 130 W
8 1 140 <NA>
9 1 150 <NA>
10 2 90 <NA>
11 2 80 <NA>
12 2 70 <NA>
13 2 66 W
14 2 60 W
15 2 50 W
16 2 66 W
17 2 70 <NA>
18 2 72 <NA>
19 2 68 W
20 2 65 W
21 2 60 W
22 2 68 W
23 2 70 <NA>
I tried many things but my trials were not successful. Your help will be appreciated.
> dput(df)
structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), elevation = c(150L,
140L, 130L, 120L, 110L, 120L, 130L, 140L, 150L, 90L, 80L, 70L,
66L, 60L, 50L, 66L, 70L, 72L, 68L, 65L, 60L, 68L, 70L), code = c(NA,
NA, "W", NA, NA, NA, "W", NA, NA, NA, NA, NA, "W", NA, NA, "W",
NA, NA, "W", NA, NA, "W", NA)), class = "data.frame", row.names = c(NA,
-23L))
You could do:
df %>%
mutate(code = coalesce(code, c(NA, "W")[cumsum(!is.na(code)) %% 2 + 1]))
ID elevation code
1 1 150 <NA>
2 1 140 <NA>
3 1 130 W
4 1 120 W
5 1 110 W
6 1 120 W
7 1 130 W
8 1 140 <NA>
9 1 150 <NA>
10 2 90 <NA>
11 2 80 <NA>
12 2 70 <NA>
13 2 66 W
14 2 60 W
15 2 50 W
16 2 66 W
17 2 70 <NA>
18 2 72 <NA>
19 2 68 W
20 2 65 W
21 2 60 W
22 2 68 W
23 2 70 <NA>
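The parity trick behind this answer can be seen on a plain vector. The following is a small sketch with made-up values (the vector `code` and the intermediate names are for illustration only). It assumes the markers always come in start/end pairs; with an odd number of markers everything after the last one would be filled.

```r
code <- c(NA, "W", NA, NA, "W", NA, "W", NA, "W")

n_seen <- cumsum(!is.na(code))   # markers seen so far: 0 1 1 1 2 2 3 3 4
inside <- n_seen %% 2 == 1       # odd count = between a start and its end

# keep the existing markers and fill the positions in between
filled <- ifelse(inside | !is.na(code), "W", NA)
```

Positions 2 to 5 and 7 to 9 end up as "W"; positions 1 and 6 stay NA.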
We can try replace + cumsum
df %>%
group_by(ID) %>%
mutate(code = replace(code, cumsum(!is.na(code)) %% 2 == 1, "W")) %>%
ungroup()
which gives
# A tibble: 23 x 3
ID elevation code
<int> <int> <chr>
1 1 150 NA
2 1 140 NA
3 1 130 W
4 1 120 W
5 1 110 W
6 1 120 W
7 1 130 W
8 1 140 NA
9 1 150 NA
10 2 90 NA
# ... with 13 more rows
You can create a helper function that creates a sequence between each start and end and assigns 'W' to it.
assign_w <- function(code) {
  inds <- which(code == 'W')
  # odd positions of inds are starts, even positions are ends
  code[unlist(Map(seq, inds[c(TRUE, FALSE)], inds[c(FALSE, TRUE)]))] <- 'W'
  code
}
and apply it for each ID:
df %>%
  group_by(ID) %>%
  mutate(result = assign_w(code)) %>%
  ungroup()
# ID elevation code result
#1 1 150 <NA> <NA>
#2 1 140 <NA> <NA>
#3 1 130 W W
#4 1 120 <NA> W
#5 1 110 <NA> W
#6 1 120 <NA> W
#7 1 130 W W
#8 1 140 <NA> <NA>
#9 1 150 <NA> <NA>
#10 2 90 <NA> <NA>
#11 2 80 <NA> <NA>
#12 2 70 <NA> <NA>
#13 2 66 W W
#14 2 60 <NA> W
#15 2 50 <NA> W
#16 2 66 W W
#17 2 70 <NA> <NA>
#18 2 72 <NA> <NA>
#19 2 68 W W
#20 2 65 <NA> W
#21 2 60 <NA> W
#22 2 68 W W
#23 2 70 <NA> <NA>
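The indexing inside the helper relies on logical recycling: c(TRUE, FALSE) picks every odd element and c(FALSE, TRUE) every even one. A short sketch on a made-up vector (all names here are for illustration) shows the intermediate steps.

```r
code <- c(NA, "W", NA, "W", NA, "W", NA, NA, "W")

inds   <- which(code == "W")       # positions of the markers: 2 4 6 9
starts <- inds[c(TRUE, FALSE)]     # 1st, 3rd, ... marker: 2 6
ends   <- inds[c(FALSE, TRUE)]     # 2nd, 4th, ... marker: 4 9

# one integer sequence per start/end pair, flattened into a single index
fill_idx <- unlist(Map(seq, starts, ends))
```

Here fill_idx is 2 3 4 6 7 8 9. As with the cumsum approaches, this assumes an even number of markers per group.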
df %>%
group_by(ID) %>%
mutate(water_flag = (1 * !is.na(code)) * if_else(elevation < lag(elevation, default = 0), 1, -1),
water = if_else(cumsum(water_flag) == 1, "W", NA_character_))
First I tried to use fill but had no success. Then I learned about the benefit of R's recycling from the question "Rename first and second occurence of the same specific value in a column iteratively" (thanks to Ronak!).
# rename each pair of 'W' to 'start' and 'end' via recycling; which() skips the NA entries
df$code[which(df$code == 'W')] <- c('start', 'end')
# with different names for the start and end of each sequence, cumsum can build the groups
df <- df %>%
  group_by(grp = cumsum(!is.na(code))) %>%
  dplyr::mutate(code = replace(code, first(code) == 'start', 'W'),
                code = replace(code, code=='end', 'W')) %>%
  ungroup() %>%
  select(-grp)
# A tibble: 23 x 3
ID elevation code
<int> <int> <chr>
1 1 150 NA
2 1 140 NA
3 1 130 W
4 1 120 W
5 1 110 W
6 1 120 W
7 1 130 W
8 1 140 NA
9 1 150 NA
10 2 90 NA
11 2 80 NA
12 2 70 NA
13 2 66 W
14 2 60 W
15 2 50 W
16 2 66 W
17 2 70 NA
18 2 72 NA
19 2 68 W
20 2 65 W
21 2 60 W
22 2 68 W
23 2 70 NA
This answer is similar to @Onyambu's: create an index column (ind) based on a counter that increases by one each time a non-NA is encountered in the 'code' column. If the counter is divisible by 2 (i.e. it is an even number), ind is NA; otherwise ind is "W". Then, if there is a "W" in the 'code' or 'ind' column, set 'code' to "W", and drop the 'ind' column from the data frame.
df %>%
mutate(ind = ifelse(cumsum(!is.na(code)) %% 2 == 0, NA, "W")) %>%
mutate(code = ifelse(ind == "W" | code == "W", "W", NA)) %>%
select(-ind)
#> ID elevation code
#>1 1 150 <NA>
#>2 1 140 <NA>
#>3 1 130 W
#>4 1 120 W
#>5 1 110 W
#>6 1 120 W
#>7 1 130 W
#>8 1 140 <NA>
#>9 1 150 <NA>
#>10 2 90 <NA>
#>11 2 80 <NA>
#>12 2 70 <NA>
#>13 2 66 W
#>14 2 60 W
#>15 2 50 W
#>16 2 66 W
#>17 2 70 <NA>
#>18 2 72 <NA>
#>19 2 68 W
#>20 2 65 W
#>21 2 60 W
#>22 2 68 W
#>23 2 70 <NA>
Although the question has been marked as solved (answer accepted), for future reference there is a function fill_run in the runner package which does exactly this.
fill_run replaces NA values if they are surrounded by a pair of identical values. Since our additional requirement is to look at elevation too, we can do something like this
df %>% group_by(ID) %>%
mutate(code = runner::fill_run(ifelse(!is.na(code), paste(elevation,code), code), only_within = TRUE))
# A tibble: 23 x 3
# Groups: ID [2]
ID elevation code
<int> <int> <chr>
1 1 150 NA
2 1 140 NA
3 1 130 130 W
4 1 120 130 W
5 1 110 130 W
6 1 120 130 W
7 1 130 130 W
8 1 140 NA
9 1 150 NA
10 2 90 NA
# ... with 13 more rows
Needless to say, you can again mutate non-NA values from code to W very easily, if required.

Select top x % of values per group - retain row ID

I am trying to identify the top 15% of scores for each watershed but retain the polygon ID when I print the results.
# here's a small example dataset (called "data"):
polygon watershed score
1 1 61
2 1 81
3 1 16
4 2 18
5 2 12
6 3 78
7 3 81
8 3 20
9 3 97
10 3 95
# I obtain the top 15% using this method:
top15 <- (data %>% select(watershed, score) %>%
group_by(watershed) %>%
arrange(watershed, desc(score)) %>%
filter(score > quantile(score, 0.15)))
# results look like this:
watershed score
<int> <int>
1 1 81
2 1 61
3 2 18
4 3 97
5 3 95
6 3 81
7 3 78
How can I include the column "polygon" when I print the results?
Thanks so much for the help!
In your statement you selected only watershed and score but excluded polygon. So remove the select statement and you should get what you want. Additionally the arrange doesn't add value so I removed it:
mdat <- structure(list(polygon = 1:10,
watershed = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L, 3L),
score = c(61L, 81L, 16L, 18L, 12L, 78L, 81L, 20L, 97L, 95L)),
class = "data.frame", row.names = c(NA, -10L))
mdat %>%
group_by(watershed) %>%
filter(score > quantile(score, 0.15))
# # A tibble: 7 x 3
# # Groups: watershed [3]
# polygon watershed score
# <int> <int> <int>
# 1 1 1 61
# 2 2 1 81
# 3 4 2 18
# 4 6 3 78
# 5 7 3 81
# 6 9 3 97
# 7 10 3 95
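One thing worth checking: score > quantile(score, 0.15) keeps everything above the 15th percentile, i.e. roughly the top 85% of each watershed. If the goal really is the top 15%, a sketch using the 85th percentile could look like this (the object name `top15` follows the question; using >= at the cutoff is an assumption).

```r
library(dplyr)

mdat <- data.frame(
  polygon   = 1:10,
  watershed = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L, 3L),
  score     = c(61L, 81L, 16L, 18L, 12L, 78L, 81L, 20L, 97L, 95L)
)

top15 <- mdat %>%
  group_by(watershed) %>%
  # keep rows at or above the 85th percentile of their watershed
  filter(score >= quantile(score, 0.85)) %>%
  ungroup()
```

With groups this small, only the highest score per watershed survives (polygons 2, 4 and 9).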

lapply alternative to for loop to append to data frame

I have a data frame:
df<-structure(list(chrom = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 3L, 3L, 4L, 4L, 4L, 4L), .Label = c("1", "2", "3", "4"), class = "factor"),
pos = c(10L, 200L, 134L, 400L, 600L, 1000L, 20L, 33L, 40L,
45L, 50L, 55L, 100L, 123L)), .Names = c("chrom", "pos"), row.names = c(NA, -14L), class = "data.frame")
> head(df)
chrom pos
1 1 10
2 1 200
3 1 134
4 1 400
5 1 600
6 1 1000
And I want to calculate pos[i+1] - pos[i] on the same chromosome (chrom)
By using a for loop over each chrom level, and another over each row I get the expected results:
for (c in levels(df$chrom)){
  df_chrom<-filter(df, chrom == c)
  df_chrom<-arrange(df_chrom, pos)
  for (i in 1:nrow(df_chrom)){
    # distance to the next position; NA for the last row of each chromosome
    dist<-(df_chrom$pos[i+1] - df_chrom$pos[i])
    logdist<-log10(dist)
    cat(c, i, df_chrom$pos[i], dist, logdist, "\n")
  }
}
However, I want to save this to a data frame, and I think that lapply or apply is the right way to go about this. I can't work out how to make the pos[i+1] - pos[i] calculation though (seeing as lapply works on each row/column).
Any pointers would be appreciated
Here's the output from my solution:
chrom index pos dist log10dist
1 1 10 124 2.093422
1 2 134 66 1.819544
1 3 200 200 2.30103
1 4 400 200 2.30103
1 5 600 400 2.60206
1 6 1000 NA NA
2 1 20 13 1.113943
2 2 33 NA NA
3 1 40 5 0.69897
3 2 45 NA NA
4 1 50 5 0.69897
4 2 55 45 1.653213
4 3 100 23 1.361728
4 4 123 NA NA
We could do this using a group by difference. Convert the 'data.frame' to 'data.table' (setDT(df)), grouped by 'chrom', order the 'pos', get the difference of 'pos' (diff) and also log of the difference
setDT(df)[order(pos), {v1 <- diff(pos)
.(index = seq_len(.N), pos = pos,
dist = c(v1, NA), logdiff = c(log10(v1), NA))}
, by = chrom]
# chrom index pos dist logdiff
# 1: 1 1 10 124 2.093422
# 2: 1 2 134 66 1.819544
# 3: 1 3 200 200 2.301030
# 4: 1 4 400 200 2.301030
# 5: 1 5 600 400 2.602060
# 6: 1 6 1000 NA NA
# 7: 2 1 20 13 1.113943
# 8: 2 2 33 NA NA
# 9: 3 1 40 5 0.698970
#10: 3 2 45 NA NA
#11: 4 1 50 5 0.698970
#12: 4 2 55 45 1.653213
#13: 4 3 100 23 1.361728
#14: 4 4 123 NA NA
Upon running the OP's code the output printed are
#1 1 10 124 2.093422
#1 2 134 66 1.819544
#1 3 200 200 2.30103
#1 4 400 200 2.30103
#1 5 600 400 2.60206
#1 6 1000 NA NA
#2 1 20 13 1.113943
#2 2 33 NA NA
#3 1 40 5 0.69897
#3 2 45 NA NA
#4 1 50 5 0.69897
#4 2 55 45 1.653213
#4 3 100 23 1.361728
#4 4 123 NA NA
We split df by df$chrom (Note that we reorder both df and df$chrom before splitting). Then we go through each of the subgroups (the subgroups are called a in this example) using lapply. On the pos column of each subgroup, we calculate difference (diff) of consecutive elements and take log10. Since diff decreases the number of elements by 1, we add a NA to the end. Finally, we rbind all the subgroups together using do.call.
do.call(rbind, lapply(split(df[order(df$chrom, df$pos),], df$chrom[order(df$chrom, df$pos)]),
function(a) data.frame(a, dist = c(log10(diff(a$pos)), NA))))
# chrom pos dist
#1.1 1 10 2.093422
#1.3 1 134 1.819544
#1.2 1 200 2.301030
#1.4 1 400 2.301030
#1.5 1 600 2.602060
#1.6 1 1000 NA
#2.7 2 20 1.113943
#2.8 2 33 NA
#3.9 3 40 0.698970
#3.10 3 45 NA
#4.11 4 50 0.698970
#4.12 4 55 1.653213
#4.13 4 100 1.361728
#4.14 4 123 NA
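Since the question already uses dplyr (filter, arrange), the same grouped difference can also be sketched with lead(). The data below is a subset of the question's data (chromosomes 1 and 2 only), and the result name `res` is just for illustration.

```r
library(dplyr)

df <- data.frame(
  chrom = factor(c("1", "1", "1", "2", "2")),
  pos   = c(10L, 200L, 134L, 20L, 33L)
)

res <- df %>%
  group_by(chrom) %>%
  arrange(pos, .by_group = TRUE) %>%
  mutate(index     = row_number(),
         dist      = lead(pos) - pos,   # next position minus current one
         log10dist = log10(dist)) %>%
  ungroup()
```

The last row of each chromosome has no following position, so dist and log10dist are NA there, matching the output of the loop.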
