Replacing NAs with existing data when merging two dataframes in R

I would like to merge two dataframes. They share some variables but not others, have different numbers of rows, and share some rows but not all. Both dataframes have missing data that the other may have.
DF1:
name  age weight height
Tim     7     54    112
Dave    5     50     NA
Larry  NA     42     73
Rob     1     30     43
DF2:
name  age weight height grade
Tim     7     NA    112     2
Dave   NA     50    103     1
Larry   3     NA     73    NA
Rob     1     30     NA    NA
John    6     60     NA     1
Tom     8     61    112     2
I want to merge these two dataframes together by the shared columns (name, age, weight, and height). However, I want NAs to be overridden: if one of the two dataframes has a value where the other has NA, that value should be carried through into the merged dataframe. Ideally, the final dataframe should only have an NA where both DF1 and DF2 had NA in that same location.
Ideal Data Frame:
name  age weight height grade
Tim     7     54    112     2
Dave    5     50    103     1
Larry   3     42     73    NA
Rob     1     30     43    NA
John    6     60     NA     1
Tom     8     61    112     2
I've been using full_join and left_join, but I don't know how to merge these in such a way that NAs are replaced with actual data (if it is present in one of the dataframes). Is there a way to do this?

This is a typical case that rows_patch() from dplyr can handle.
library(dplyr)
rows_patch(df2, df1, by = "name")
name age weight height grade
1 Tim 7 54 112 2
2 Dave 5 50 103 1
3 Larry 3 42 73 NA
4 Rob 1 30 43 NA
5 John 6 60 NA 1
6 Tom 8 61 112 2
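Note that rows_patch() only patches rows of df2 that have a match in df1; here every name in df1 appears in df2. If df1 contained unmatched names, the call would error by default. As a hedged aside, dplyr (>= 1.1.0) provides an unmatched argument to skip such rows instead:
rows_patch(df2, df1, by = "name", unmatched = "ignore")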
Data
df1 <- structure(list(name = c("Tim", "Dave", "Larry", "Rob"), age = c(7L,
5L, NA, 1L), weight = c(54L, 50L, 42L, 30L), height = c(112L,
NA, 73L, 43L)), class = "data.frame", row.names = c(NA, -4L))
df2 <- structure(list(name = c("Tim", "Dave", "Larry", "Rob", "John",
"Tom"), age = c(7L, NA, 3L, 1L, 6L, 8L), weight = c(NA, 50L,
NA, 30L, 60L, 61L), height = c(112L, 103L, 73L, NA, NA, 112L),
grade = c(2L, 1L, NA, NA, 1L, 2L)), class = "data.frame", row.names = c(NA, -6L))

I like the powerjoin package suggested as an answer to the question linked in the first comment; I had never heard of it before.
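For completeness, a sketch of that powerjoin approach (assuming powerjoin's power_full_join() with its conflict argument; coalesce_xy resolves each clashing column by keeping the x value and falling back to y):
library(powerjoin)
power_full_join(df1, df2, by = "name", conflict = coalesce_xy)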
However, if you want to avoid extra packages, you can do it in base R. This approach also avoids having to explicitly name each column, which the dplyr approaches suggested in the comments require (although they could perhaps be modified).
# Load data
df1 <- read.table(text = "name age weight height
Tim 7 54 112
Dave 5 50 NA
Larry NA 42 73
Rob 1 30 43", header=TRUE)
df2 <- read.table(text = "name age weight height grade
Tim 7 NA 112 2
Dave NA 50 103 1
Larry 3 NA 73 NA
Rob 1 30 NA NA
John 6 60 NA 1
Tom 8 61 112 2", header=TRUE)
df3 <- merge(df1, df2, by = "name", all = TRUE, sort=FALSE)
# Coalesce the common columns (a base R stand-in for dplyr::coalesce)
common_cols <- names(df1)[names(df1) != "name"]
df3[common_cols] <- lapply(common_cols, function(col) {
  x <- df3[[paste0(col, ".x")]]
  y <- df3[[paste0(col, ".y")]]
  ifelse(is.na(x), y, x)  # take x, fall back to y where x is NA
})
# Select desired columns
df3[names(df2)]
# name age weight height grade
# 1 Tim 7 54 112 2
# 2 Dave 5 50 103 1
# 3 Larry 3 42 73 NA
# 4 Rob 1 30 43 NA
# 5 John 6 60 NA 1
# 6 Tom 8 61 112 2
There are advantages to using base R, but powerjoin looks like an interesting package too.

Another possible solution:
library(tidyverse)
df2 %>%
  bind_rows(df1) %>%
  group_by(name) %>%
  fill(age:grade, .direction = "updown") %>%
  ungroup() %>%
  distinct()
#> # A tibble: 6 x 5
#> name age weight height grade
#> <chr> <int> <int> <int> <int>
#> 1 Tim 7 54 112 2
#> 2 Dave 5 50 103 1
#> 3 Larry 3 42 73 NA
#> 4 Rob 1 30 43 NA
#> 5 John 6 60 NA 1
#> 6 Tom 8 61 112 2

Related

How to fill missing values grouped on id and based on time period from index date

I want to fill in missing values in a data.frame within groups of ID, based on a time window. For the latest registration_dat in each ID group, I want to fill in NAs with previous values from the same group, but only if the earlier registration_dat is within 1 year of the latest registration_dat in the group.
Sample version of my data:
ID registration_dat value1 value2
1 2020-03-04 NA NA
1 2019-05-06 33 25
1 2019-01-02 32 21
3 2021-10-31 NA NA
3 2018-10-12 33 NA
3 2018-10-10 25 35
4 2020-01-02 NA NA
4 2019-10-31 32 83
4 2019-09-20 33 56
8 2019-12-12 NA NA
8 2019-10-31 NA 43
8 2019-08-12 32 46
Desired output:
ID registration_dat value1 value2
1 2020-03-04 33 25
1 2019-05-06 33 25
1 2019-01-02 32 21
3 2021-10-31 NA NA
3 2018-10-12 33 NA
3 2018-10-10 25 35
4 2020-01-02 32 83
4 2019-10-31 32 83
4 2019-09-20 33 56
8 2019-12-12 32 43
8 2019-10-31 NA 43
8 2019-08-12 32 46
I later filter the data to get one row per unique ID based on the latest registration date, and I want that row to have as little missing data as possible, hence I want to do this for all columns in the dataframe. However, I do not want NA values filled in by values from earlier dates if they are more than 1 year before the latest registration date. My dataframe has 14 columns and 3 million+ rows, so the solution needs to work on a much bigger data.frame than the example shown.
I'd appreciate any ideas!
You can use across() to manipulate multiple columns at the same time. Note that I use date1 - years(1) <= date2 rather than date1 - 365 <= date2 to identify if a date is within 1 year of the latest one, which can take a leap year (366 days) into account.
library(dplyr)
library(lubridate)
df %>%
  group_by(ID) %>%
  arrange(desc(registration_dat), .by_group = TRUE) %>%
  mutate(across(starts_with("value"),
                ~ if_else(row_number() == 1 & is.na(.x) &
                            registration_dat - years(1) <= registration_dat[which.max(!is.na(.x))],
                          .x[which.max(!is.na(.x))], .x))) %>%
  ungroup()
# # A tibble: 12 x 4
# ID registration_dat value1 value2
# <int> <date> <int> <int>
# 1 1 2020-03-04 33 25
# 2 1 2019-05-06 33 25
# 3 1 2019-01-02 32 21
# 4 3 2021-10-31 NA NA
# 5 3 2018-10-12 33 NA
# 6 3 2018-10-10 25 35
# 7 4 2020-01-02 32 83
# 8 4 2019-10-31 32 83
# 9 4 2019-09-20 33 56
# 10 8 2019-12-12 32 43
# 11 8 2019-10-31 NA 43
# 12 8 2019-08-12 32 46
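If, as described, you only need one row per ID at the latest registration_dat afterwards, a hedged sketch (assuming the filled result above is saved as df_filled, a name introduced here) is:
df_filled %>%
  group_by(ID) %>%
  slice_max(registration_dat, n = 1, with_ties = FALSE) %>%  # keep the latest row per ID
  ungroup()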
Data
df <- structure(list(ID = c(1L, 1L, 1L, 3L, 3L, 3L, 4L, 4L, 4L, 8L,
8L, 8L), registration_dat = structure(c(18325, 18022, 17898,
18931, 17816, 17814, 18263, 18200, 18159, 18242, 18200, 18120
), class = "Date"), value1 = c(NA, 33L, 32L, NA, 33L, 25L, NA,
32L, 33L, NA, NA, 32L), value2 = c(NA, 25L, 21L, NA, NA, 35L,
NA, 83L, 56L, NA, 43L, 46L)), class = "data.frame", row.names = c(NA,-12L))
You could make a small function (f, below) to handle each value column.
First, generate a rowid (this is only to retain your original order), arrange by date, and group by ID:
dat <- dat %>%
  mutate(rowid = row_number()) %>%
  arrange(registration_dat) %>%
  group_by(ID)
Make a function that takes a df and a val column, and returns an updated df with val fixed:
f <- function(df, val) {
  bind_rows(
    df %>% filter(is.na({{val}}) & row_number() != n()),
    df %>% filter(!is.na({{val}}) | row_number() == n()) %>%
      mutate({{val}} := if_else(is.na({{val}}) &
                                  registration_dat - lag(registration_dat) < 365,
                                lag({{val}}), {{val}}))
  )
}
Apply the function to the columns of interest
dat <- f(dat, value1)
dat <- f(dat, value2)
If you want, recover the original order
dat %>% arrange(rowid) %>% select(-rowid)
Output:
ID registration_dat value1 value2
<int> <date> <int> <int>
1 1 2020-03-04 33 25
2 1 2019-05-06 33 25
3 1 2019-01-02 32 21
4 3 2021-10-31 NA NA
5 3 2018-10-12 33 NA
6 3 2018-10-10 25 35
7 4 2020-01-02 32 83
8 4 2019-10-31 32 83
9 4 2019-09-20 33 56
10 8 2019-12-12 32 46
11 8 2019-10-31 NA 43
12 8 2019-08-12 32 46
Update:
The OP wants the final row (i.e. the latest registration_dat) per ID. With 3 million rows and 14 value columns, I would use data.table and do something like this:
library(data.table)
f <- function(df) {
  df = df[df[1, registration_dat] - registration_dat <= 365]
  df[1, value := df[2:.N][!is.na(value)][1, value]][1]
}
dcast(
  melt(setDT(dat), id = c("ID", "registration_dat"))[order(-registration_dat), f(.SD), by = .(ID, variable)],
  ID + registration_dat ~ variable, value.var = "value"
)
Output:
ID registration_dat value1 value2
<int> <Date> <int> <int>
1: 1 2020-03-04 33 25
2: 3 2021-10-31 NA NA
3: 4 2020-01-02 32 83
4: 8 2019-12-12 32 43

Create a new variable in a data frame, filling it with the name of the existing variable that holds a non-NA value

I want to create a column C in dfABy containing the name of the existing variable (A or B) that holds the non-NA value in each row. For example, my df is:
>dfABy
A B
56 NA
NA 45
NA 77
67 NA
NA 65
The result I am after is:
> dfABy
A B C
56 NA A
NA 45 B
NA 77 B
67 NA A
NA 65 B
One option using dplyr could be:
df %>%
  rowwise() %>%
  mutate(C = names(.[!is.na(c_across(everything()))]))
A B C
<int> <int> <chr>
1 56 NA A
2 NA 45 B
3 NA 77 B
4 67 NA A
5 NA 65 B
Or with the addition of purrr:
df %>%
  mutate(C = pmap_chr(across(A:B), ~ names(c(...)[!is.na(c(...))])))
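Both rowwise variants assume exactly one non-NA value per row; a row where both A and B are filled would make names() return a length-2 vector and the mutate() would fail. A defensive sketch that keeps only the first non-NA column name:
df %>%
  rowwise() %>%
  mutate(C = names(df)[which(!is.na(c_across(A:B)))[1]]) %>%  # [1] guards against ties
  ungroup()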
You can use max.col over the is.na values to get the column number where a non-NA value is present in each row. From those numbers you can get the column names.
dfABy$C <- names(dfABy)[max.col(!is.na(dfABy))]
dfABy
# A B C
#1 56 NA A
#2 NA 45 B
#3 NA 77 B
#4 67 NA A
#5 NA 65 B
If there is more than one non-NA value in a row, take a look at the ties.method argument in ?max.col on how to handle ties.
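For instance (a small sketch with made-up data), ties.method controls which column wins when a row has several non-NA values:
df_tie <- data.frame(A = c(56, 10), B = c(NA, 20))
names(df_tie)[max.col(!is.na(df_tie), ties.method = "first")]
# [1] "A" "A"    ("first" picks the leftmost non-NA column)
names(df_tie)[max.col(!is.na(df_tie), ties.method = "last")]
# [1] "A" "B"    ("last" picks the rightmost)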
data
dfABy <- structure(list(A = c(56L, NA, NA, 67L, NA), B = c(NA, 45L, 77L,
NA, 65L)), class = "data.frame", row.names = c(NA, -5L))
Using the data.table package I recommend:
library(data.table)
setDT(dfABy)  # := requires a data.table
dfABy[, C := apply(cbind(dfABy), 1, function(x) names(x[!is.na(x)]))]
creating the following output:
A B C
1 56 NA A
2 NA 45 B
3 NA 77 B
4 67 NA A
5 NA 65 B
This is just another solution, though the other proposed solutions are better.
library(dplyr)
library(purrr)
df %>%
  rowwise() %>%
  mutate(C = detect_index(c(A, B), ~ !is.na(.x)),
         C = names(.[C]))
# A tibble: 5 x 3
# Rowwise:
A B C
<dbl> <dbl> <chr>
1 56 NA A
2 NA 45 B
3 NA 77 B
4 67 NA A
5 NA 65 B

Ignore NA values of a column within a statement

Until now I've been working with a medium-sized dataset from an Occupation Survey (around 200 MB total); here's the data if you want to review it: https://drive.google.com/drive/folders/1Od8zlOE3U3DO0YRGnBadFz804OUDnuQZ?usp=sharing
I have the following code:
hogares <- read.csv("/home/servicio/Escritorio/TR_VIVIENDA01.CSV")
personas <- read.csv("/home/servicio/Escritorio/TR_PERSONA01.CSV")
datos <- merge(hogares, personas)
library(dplyr)
base <- tibble(ID_VIV = datos$ID_VIV, ID_PERSONA = datos$ID_PERSONA,
               EDAD = datos$EDAD, CONACT = datos$CONACT)
base$maxage <- ave(base$EDAD, base$ID_VIV, FUN = max)
base$Condición_I <- case_when(
  base$CONACT == 32 & base$EDAD >= 60 ~ 1,
  base$CONACT >= 10 & base$EDAD >= 60 & base$CONACT <= 16 ~ 2,
  base$CONACT == 20 & base$EDAD >= 60 | base$CONACT == 31 & base$EDAD >= 60 |
    (base$CONACT >= 33 & base$CONACT <= 35 & base$EDAD >= 60) ~ 3
)
base <- subset(base, maxage >= 60)
base <- base %>%
  group_by(ID_VIV) %>%
  mutate(Condición_V = if (n_distinct(Condición_I) > 1) 4 else Condición_I)
base$ID_VIV <- as.character(base$ID_VIV)
base$ID_PERSONA <- as.character(base$ID_PERSONA)
base
And ended up with:
# A tibble: 38,307 x 7
# Groups: ID_VIV [10,499]
ID_VIV ID_PERSONA EDAD CONACT maxage Condición_I Condición_V
<chr> <chr> <int> <int> <int> <dbl> <dbl>
1 10010000007 1001000000701 69 32 69 1 1
2 10010000008 1001000000803 83 33 83 3 4
3 10010000008 1001000000802 47 33 83 NA 4
4 10010000008 1001000000801 47 10 83 NA 4
5 10010000012 1001000001204 4 NA 60 NA 4
6 10010000012 1001000001203 2 NA 60 NA 4
7 10010000012 1001000001201 60 10 60 2 4
8 10010000012 1001000001202 21 10 60 NA 4
9 10010000014 1001000001401 67 32 67 1 4
10 10010000014 1001000001402 64 33 67 3 4
The Condición_I column value is a code for the labour conditions of each individual (row). Some of these individuals share a house (that's why they share ID_VIV). I only care about individuals who are 60 or older; all the NAs are individuals who live with someone 60+ and whose situation I do not care about (but I need to keep those rows). I need the Condición_V column to display another value following these conditions:
Condición_I == 1 ~ 1
Condición_I == 2 ~ 2
Condición_I == 3 ~ 3
Any combination of Condición_I ~ 4
This means that if all the 60+ year-old individuals in a house have Condición_I == 1, then Condición_V will be 1, and likewise up to code 3. When there is, e.g., one person with C_I == 1 and another with C_I == 3 in the same house, then Condición_V will be 4.
And I'm hoping to get this kind of result:
A tibble: 38,307 x 7
# Groups: ID_VIV [10,499]
ID_VIV ID_PERSONA EDAD CONACT maxage Condición_I Condición_V
<chr> <chr> <int> <int> <int> <dbl> <dbl>
1 10010000007 1001000000701 69 32 69 1 1
2 10010000008 1001000000803 83 33 83 3 3
3 10010000008 1001000000802 47 33 83 NA 3
4 10010000008 1001000000801 47 10 83 NA 3
5 10010000012 1001000001204 4 NA 60 NA 2
6 10010000012 1001000001203 2 NA 60 NA 2
7 10010000012 1001000001201 60 10 60 2 2
8 10010000012 1001000001202 21 10 60 NA 2
9 10010000014 1001000001401 67 32 67 1 4
10 10010000014 1001000001402 64 33 67 3 4
I know my error is in:
base <- base %>% group_by(ID_VIV) %>% mutate(Condición_V = if (n_distinct(Condición_I) > 1) 4 else Condición_I)
Is there a way to use that line of code while ignoring the NA values, or is it better to do it another way? I do not have to do it the way I'm trying, and any other approach or help will be much appreciated!
We can wrap na.omit around the Condición_I column, check the number of distinct elements with n_distinct, and if it is greater than 1 return 4; otherwise return the first non-NA value of the column.
library(dplyr)
base %>%
  group_by(ID_VIV) %>%
  mutate(Condición_V = if (n_distinct(na.omit(Condición_I)) > 1)
    4 else na.omit(Condición_I)[1])
# A tibble: 10 x 7
# Groups: ID_VIV [4]
# ID_VIV ID_PERSONA EDAD CONACT maxage Condición_I Condición_V
# <chr> <chr> <int> <int> <int> <int> <dbl>
# 1 10010000007 1001000000701 69 32 69 1 1
# 2 10010000008 1001000000803 83 33 83 3 3
# 3 10010000008 1001000000802 47 33 83 NA 3
# 4 10010000008 1001000000801 47 10 83 NA 3
# 5 10010000012 1001000001204 4 NA 60 NA 2
# 6 10010000012 1001000001203 2 NA 60 NA 2
# 7 10010000012 1001000001201 60 10 60 2 2
# 8 10010000012 1001000001202 21 10 60 NA 2
# 9 10010000014 1001000001401 67 32 67 1 4
#10 10010000014 1001000001402 64 33 67 3 4
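One edge case worth noting: if a household had no non-NA Condición_I at all, na.omit() would return a zero-length vector, and indexing it with [1] safely yields NA rather than an error:
na.omit(c(NA_integer_, NA_integer_))[1]
# [1] NA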
data
base <- structure(list(ID_VIV = c("10010000007", "10010000008", "10010000008",
"10010000008", "10010000012", "10010000012", "10010000012", "10010000012",
"10010000014", "10010000014"), ID_PERSONA = c("1001000000701",
"1001000000803", "1001000000802", "1001000000801", "1001000001204",
"1001000001203", "1001000001201", "1001000001202", "1001000001401",
"1001000001402"), EDAD = c(69L, 83L, 47L, 47L, 4L, 2L, 60L, 21L,
67L, 64L), CONACT = c(32L, 33L, 33L, 10L, NA, NA, 10L, 10L, 32L,
33L), maxage = c(69L, 83L, 83L, 83L, 60L, 60L, 60L, 60L, 67L,
67L), Condición_I = c(1L, 3L, NA, NA, NA, NA, 2L, NA, 1L, 3L
)), row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9",
"10"), class = "data.frame")

Separate hour and minutes in R

I have a column for time, but it hasn't been separated by ":" or anything. It looks like this:
person time
1 356
1 931
1 2017
1 2103
2 256
2 1031
2 1517
2 2206
How do I separate them?
There are different ways of approaching the issue. Which method you choose depends on your desired output.
For example, you could use stringr::str_split to split time into a list column of hours and minutes using a positive lookahead:
library(tidyverse)
df %>% mutate(time = str_split(time, "(?=\\d{2}$)"))
# person time
#1 1 3, 56
#2 1 9, 31
#3 1 20, 17
#4 1 2, 13
#5 2 2, 56
#6 2 10, 31
#7 2 15, 17
#8 2 2, 26
Or we can use tidyr::separate to create two new columns hours and minutes
df %>% separate(time, c("hours", "minutes"), sep = "(?=\\d{2}$)")
# person hours minutes
#1 1 3 56
#2 1 9 31
#3 1 20 17
#4 1 2 13
#5 2 2 56
#6 2 10 31
#7 2 15 17
#8 2 2 26
In response to your comment you could use stringr::str_replace
df %>% mutate(time = str_replace(time, "(?=\\d{2}$)", ":"))
# person time
#1 1 3:56
#2 1 9:31
#3 1 20:17
#4 1 2:13
#5 2 2:56
#6 2 10:31
#7 2 15:17
#8 2 2:26
And the same in base R using sub
transform(df, time = sub("(?=\\d{2}$)", ":", time, perl = TRUE))
giving the same result.
Sample data
df <- read.table(text = "
person time
1 356
1 931
1 2017
1 213
2 256
2 1031
2 1517
2 226", header = T)
We can use strptime with sprintf in base R
df[c("hour", "min")] <- unclass(strptime(sprintf("%04d00", df$time),
"%H%M%S"))[c('hour', 'min')]
df
# person time hour min
#1 1 356 3 56
#2 1 931 9 31
#3 1 2017 20 17
#4 1 213 2 13
#5 2 256 2 56
#6 2 1031 10 31
#7 2 1517 15 17
#8 2 226 2 26
Or, if we only need to insert a delimiter:
tmp <- sub('(\\d{2})$', ':\\1', df$time)
tmp
#[1] "3:56" "9:31" "20:17" "2:13" "2:56" "10:31" "15:17" "2:26"
and then it can be separated into two columns with read.table:
read.table(text = tmp, sep=":", header = FALSE, col.names = c('hour', 'min'))
data
df <- structure(list(person = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), time = c(356L,
931L, 2017L, 213L, 256L, 1031L, 1517L, 226L)),
class = "data.frame", row.names = c(NA,
-8L))
Another possibility:
res <- strsplit(gsub("(\\d+(?=\\d{2,}))(\\d{1,})",
                     "\\1:\\2", df$time, perl = TRUE), ":")
df$Minutes <- sapply(res, "[[", 2)
df$Hr <- sapply(res, "[[", 1)
df
Result:
person time Minutes Hr
1 1 356 56 3
2 1 931 31 9
3 1 2017 17 20
4 1 2103 03 21
5 2 256 56 2
6 2 1031 31 10
7 2 1517 17 15
8 2 2206 06 22
Data:
df <-structure(list(person = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), time = c(356L,
931L, 2017L, 2103L, 256L, 1031L, 1517L, 2206L)), row.names = c(NA,
-8L), class = "data.frame")
If you want to show time in HH:MM format, we can use sprintf with sub to insert a colon (:) in between:
sub("(\\d{2})(\\d{2})", "\\1:\\2",sprintf("%04d", df$time))
#[1] "03:56" "09:31" "20:17" "21:03" "02:56" "10:31" "15:17" "22:06"

lapply alternative to for loop to append to data frame

I have a data frame:
df<-structure(list(chrom = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 3L, 3L, 4L, 4L, 4L, 4L), .Label = c("1", "2", "3", "4"), class = "factor"),
pos = c(10L, 200L, 134L, 400L, 600L, 1000L, 20L, 33L, 40L,
45L, 50L, 55L, 100L, 123L)), .Names = c("chrom", "pos"), row.names = c(NA, -14L), class = "data.frame")
> head(df)
chrom pos
1 1 10
2 1 200
3 1 134
4 1 400
5 1 600
6 1 1000
And I want to calculate pos[i+1] - pos[i] within the same chromosome (chrom).
By using a for loop over each chrom level, and another over each row I get the expected results:
for (c in levels(df$chrom)) {
  df_chrom <- filter(df, chrom == c)   # filter() and arrange() are from dplyr
  df_chrom <- arrange(df_chrom, df_chrom$pos)
  for (i in 1:nrow(df_chrom)) {
    dist <- df_chrom$pos[i + 1] - df_chrom$pos[i]
    logdist <- log10(dist)
    cat(c, i, df_chrom$pos[i], dist, logdist, "\n")
  }
}
However, I want to save this to a data frame, and think that lapply or apply is the right way to go about this. I can't work out how to make the pos[i+1] - pos[i] calculation though (seeing as lapply works on each row/column).
Any pointers would be appreciated
Here's the output from my solution:
chrom index pos dist log10dist
1 1 10 124 2.093422
1 2 134 66 1.819544
1 3 200 200 2.30103
1 4 400 200 2.30103
1 5 600 400 2.60206
1 6 1000 NA NA
2 1 20 13 1.113943
2 2 33 NA NA
3 1 40 5 0.69897
3 2 45 NA NA
4 1 50 5 0.69897
4 2 55 45 1.653213
4 3 100 23 1.361728
4 4 123 NA NA
We could do this using a grouped difference. Convert the 'data.frame' to 'data.table' (setDT(df)), and, grouped by 'chrom', order the 'pos', get the difference of 'pos' (diff), and also the log of the difference.
library(data.table)
setDT(df)[order(pos), {
  v1 <- diff(pos)
  .(index = seq_len(.N), pos = pos,
    dist = c(v1, NA), logdiff = c(log10(v1), NA))
}, by = chrom]
# chrom index pos dist logdiff
# 1: 1 1 10 124 2.093422
# 2: 1 2 134 66 1.819544
# 3: 1 3 200 200 2.301030
# 4: 1 4 400 200 2.301030
# 5: 1 5 600 400 2.602060
# 6: 1 6 1000 NA NA
# 7: 2 1 20 13 1.113943
# 8: 2 2 33 NA NA
# 9: 3 1 40 5 0.698970
#10: 3 2 45 NA NA
#11: 4 1 50 5 0.698970
#12: 4 2 55 45 1.653213
#13: 4 3 100 23 1.361728
#14: 4 4 123 NA NA
Upon running the OP's code, the output printed is:
#1 1 10 124 2.093422
#1 2 134 66 1.819544
#1 3 200 200 2.30103
#1 4 400 200 2.30103
#1 5 600 400 2.60206
#1 6 1000 NA NA
#2 1 20 13 1.113943
#2 2 33 NA NA
#3 1 40 5 0.69897
#3 2 45 NA NA
#4 1 50 5 0.69897
#4 2 55 45 1.653213
#4 3 100 23 1.361728
#4 4 123 NA NA
We split df by df$chrom (note that we reorder both df and df$chrom before splitting). Then we go through each of the subgroups (called a in this example) using lapply. On the pos column of each subgroup, we calculate the difference (diff) of consecutive elements and take log10. Since diff decreases the number of elements by 1, we add an NA at the end. Finally, we rbind all the subgroups together using do.call.
do.call(rbind, lapply(split(df[order(df$chrom, df$pos), ],
                            df$chrom[order(df$chrom, df$pos)]),
                      function(a) data.frame(a, dist = c(log10(diff(a$pos)), NA))))
# chrom pos dist
#1.1 1 10 2.093422
#1.3 1 134 1.819544
#1.2 1 200 2.301030
#1.4 1 400 2.301030
#1.5 1 600 2.602060
#1.6 1 1000 NA
#2.7 2 20 1.113943
#2.8 2 33 NA
#3.9 3 40 0.698970
#3.10 3 45 NA
#4.11 4 50 0.698970
#4.12 4 55 1.653213
#4.13 4 100 1.361728
#4.14 4 123 NA
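Since the question already uses dplyr verbs (filter, arrange), a grouped dplyr sketch (assuming dplyr >= 1.0) gives the same per-chromosome differences without an explicit loop:
library(dplyr)
df %>%
  arrange(chrom, pos) %>%
  group_by(chrom) %>%
  mutate(dist = lead(pos) - pos,        # pos[i+1] - pos[i] within each chrom
         logdist = log10(dist)) %>%
  ungroup()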
