R: Overwrite column values with non-NA values from a column in a separate dataframe

I have a dataframe 'df1' with a lot of columns, but the ones of interest are:
Number Code
1      NA
2      NA
3      NA
10     NA
11     AMRO
4      NA
277    NA
2100   BLPH
And I have another dataframe 'df2' with a lot of columns, but the ones of interest are:
Number Code
1      AMCR
2      AMCR
3      BANO
10     BAEA
12     AMRO
4      NA
277    NA
2100   NA
Where the 'Number' values of 'df1' and 'df2' match, I want the 'Code' value from 'df2' to overwrite the 'Code' value in 'df1', as long as the 'Code' value in 'df2' is not NA, so that the final result of 'df1' looks like:
Number Code
1      AMCR
2      AMCR
3      BANO
10     BAEA
11     AMRO
4      NA
277    NA
2100   BLPH
Thank you for your help!

We can do
library(powerjoin)
power_left_join(df1, df2, by = "Number", conflict = coalesce)
-output
Number Code
1 1 AMCR
2 2 AMCR
3 3 BANO
4 10 BAEA
5 11 AMRO
6 4 <NA>
7 277 <NA>
8 2100 BLPH
Or to do an overwrite, use data.table
library(data.table)
setDT(df1)[df2, Code := fcoalesce(Code, i.Code), on = .(Number)]
-output
> df1
Number Code
<int> <char>
1: 1 AMCR
2: 2 AMCR
3: 3 BANO
4: 10 BAEA
5: 11 AMRO
6: 4 <NA>
7: 277 <NA>
8: 2100 BLPH
data
df1 <- structure(list(Number = c(1L, 2L, 3L, 10L, 11L, 4L, 277L, 2100L),
                      Code = c(NA, NA, NA, NA, "AMRO", NA, NA, "BLPH")),
                 class = "data.frame", row.names = c(NA, -8L))
df2 <- structure(list(Number = c(1L, 2L, 3L, 10L, 12L, 4L, 277L, 2100L),
                      Code = c("AMCR", "AMCR", "BANO", "BAEA", "AMRO", NA, NA, NA)),
                 class = "data.frame", row.names = c(NA, -8L))
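If you'd rather not add a package, here is a minimal base R sketch of the same overwrite, assuming 'Number' values are unique in df2:
idx <- match(df1$Number, df2$Number)  # position of each df1 Number in df2 (NA if no match)
repl <- df2$Code[idx]                 # candidate replacement codes
df1$Code <- ifelse(is.na(repl), df1$Code, repl)  # overwrite only with non-NA values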

Here is an alternative approach using bind_cols:
library(dplyr)
bind_cols(df1, df2) %>%
  mutate(Code = coalesce(Code...2, Code...4)) %>%
  select(Number = Number...1, Code)
Number Code
1 1 AMCR
2 2 AMCR
3 3 BANO
4 10 BAEA
5 11 AMRO
6 4 <NA>
7 277 <NA>
8 2100 BLPH
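Note that bind_cols() pairs rows purely by position, so this only works because df1 and df2 happen to be in the same row order. A sketch of a safer variant that joins on 'Number' instead (plain dplyr, same coalesce idea as the powerjoin answer):
library(dplyr)
left_join(df1, df2, by = "Number", suffix = c("", ".y")) %>%
  mutate(Code = coalesce(Code, Code.y)) %>%  # keep df1's Code, fall back to df2's
  select(-Code.y)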

Here is a solution playing with dplyr full_join and inner_join
library(dplyr)
df1 %>%
  full_join(df2) %>%
  na.omit() %>%
  full_join(df1 %>% inner_join(df2)) %>%
  filter(Number %in% df1$Number) %>%
  arrange(Number)
Output
#> Number Code
#> 1 1 AMCR
#> 2 2 AMCR
#> 3 3 BANO
#> 4 4 <NA>
#> 5 10 BAEA
#> 6 11 AMRO
#> 7 277 <NA>
#> 8 2100 BLPH
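Since no by= is given, dplyr infers the join keys from the shared column names (here both Number and Code, which is what makes the trick work). Spelling them out makes that explicit and silences the "Joining, by = ..." messages:
df1 %>%
  full_join(df2, by = c("Number", "Code")) %>%
  na.omit() %>%
  full_join(inner_join(df1, df2, by = c("Number", "Code")),
            by = c("Number", "Code")) %>%
  filter(Number %in% df1$Number) %>%
  arrange(Number)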

Related

How to use R to replace missing values with the sum of previous 4 values in a column?

I have a dataframe that contains (among other things) three columns that have missing values every 5 rows. These missing values need to be replaced with the sum of the previous 4 values in their respective column.
For example, let's say my dataframe looked like this:
id category1 category2 category3
123 5 10 10
123 6 11 15
123 6 12 23
123 4 10 6
123 NA NA NA
567 24 17 15
Those NAs need to represent a "total" based on the sum of the previous 4 values in their column, and this needs to repeat throughout the entire dataframe because the NAs occur every 5 rows. For instance, the three NAs in the mock example above should be replaced with 21, 43, and 54. 5 rows later, the same process will need to be repeated. How can I achieve this?
Another possible solution:
library(dplyr)
df %>%
  group_by(id) %>%
  mutate(across(everything(), ~ if_else(is.na(.x), sum(.x, na.rm = TRUE), .x))) %>%
  ungroup()
#> # A tibble: 6 × 4
#> id category1 category2 category3
#> <int> <int> <int> <int>
#> 1 123 5 10 10
#> 2 123 6 11 15
#> 3 123 6 12 23
#> 4 123 4 10 6
#> 5 123 21 43 54
#> 6 567 24 17 15
The following should work provided there are no NA values within the first 4 rows; it also assumes that the NA values appear in all columns on the same rows.
for (i in 1:nrow(data)) {
  if (is.na(data[i, 2])) {
    # The previous 4 rows are i-4 .. i-1; reaching back 5 rows would
    # pick up the total filled in at the previous NA row
    data[i, 2] <- sum(data[seq(i - 4, i - 1), 2])
    data[i, 3] <- sum(data[seq(i - 4, i - 1), 3])
    data[i, 4] <- sum(data[seq(i - 4, i - 1), 4])
  }
}
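A slightly more general sketch of the same loop that avoids hard-coding each column (value_cols is an assumption; adjust it to your data):
value_cols <- 2:4  # the columns holding category1..category3
for (i in seq_len(nrow(data))) {
  if (is.na(data[i, value_cols[1]])) {
    # Sum the previous 4 rows of every value column at once
    data[i, value_cols] <- colSums(data[seq(i - 4, i - 1), value_cols])
  }
}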
If the NAs appear in the last row of each 'id', we may remove them and do a group-by summarise to create the totals row
library(dplyr)
df1 <- df1 %>%
  na.omit() %>%
  group_by(id) %>%
  summarise(across(everything(), ~ c(.x, sum(.x))), .groups = 'drop')
-output
df1
# A tibble: 7 × 4
id category1 category2 category3
<int> <int> <int> <int>
1 123 5 10 10
2 123 6 11 15
3 123 6 12 23
4 123 4 10 6
5 123 21 43 54
6 567 24 17 15
7 567 24 17 15
Or another approach would be to replace the NA with the sum using na.aggregate from zoo
library(zoo)
df1 %>%
  group_by(id) %>%
  mutate(across(everything(), ~ na.aggregate(.x, FUN = sum))) %>%
  ungroup()
# A tibble: 6 × 4
id category1 category2 category3
<int> <int> <int> <int>
1 123 5 10 10
2 123 6 11 15
3 123 6 12 23
4 123 4 10 6
5 123 21 43 54
6 567 24 17 15
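na.aggregate() replaces each NA with an aggregate of the column's non-NA values (the mean by default); FUN = sum swaps in the sum. A one-line illustration:
library(zoo)
na.aggregate(c(5, 6, 6, 4, NA), FUN = sum)
# [1]  5  6  6  4 21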
data
df1 <- structure(list(id = c(123L, 123L, 123L, 123L, 123L, 567L),
                      category1 = c(5L, 6L, 6L, 4L, NA, 24L),
                      category2 = c(10L, 11L, 12L, 10L, NA, 17L),
                      category3 = c(10L, 15L, 23L, 6L, NA, 15L)),
                 class = "data.frame", row.names = c(NA, -6L))

Replacing NAs with existing data when merging two dataframes in R

I would like to merge two dataframes. There are some shared variables and some different variables, and the dataframes have different numbers of rows. The dataframes share some rows, but not all, and both have missing data that the other may have.
DF1:

name  age weight height
Tim     7     54    112
Dave    5     50     NA
Larry  NA     42     73
Rob     1     30     43
DF2:

name  age weight height grade
Tim     7     NA    112     2
Dave   NA     50    103     1
Larry   3     NA     73    NA
Rob     1     30     NA    NA
John    6     60     NA     1
Tom     8     61    112     2
I want to merge these two dataframes together by the shared columns (name, age, weight, and height). However, I want NAs to be overridden, such that if one of the two dataframes has a value where the other has NA, I want the value to be carried through into the third dataframe. Ideally, the last dataframe should only have NAs when both DF1 and DF2 had NAs in that same location.
Ideal dataframe:

name  age weight height grade
Tim     7     54    112     2
Dave    5     50    103     1
Larry   3     42     73    NA
Rob     1     30     43    NA
John    6     60     NA     1
Tom     8     61    112     2
I've been using full_join and left_join, but I don't know how to merge these in such a way that NAs are replaced with actual data (if it is present in one of the dataframes). Is there a way to do this?
This is a typical case for rows_patch() from dplyr.
library(dplyr)
rows_patch(df2, df1, by = "name")
name age weight height grade
1 Tim 7 54 112 2
2 Dave 5 50 103 1
3 Larry 3 42 73 NA
4 Rob 1 30 43 NA
5 John 6 60 NA 1
6 Tom 8 61 112 2
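Note the argument order: rows_patch(x, y) fills the NAs of x using matching rows of y, keeping all rows of x; that is why df2 (which has all six names) comes first. As a quick sanity check, the only NAs left should be those missing from both inputs:
res <- rows_patch(df2, df1, by = "name")
sum(is.na(res))
# [1] 3  (Larry's grade, Rob's grade, John's height)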
Data
df1 <- structure(list(name = c("Tim", "Dave", "Larry", "Rob"),
                      age = c(7L, 5L, NA, 1L),
                      weight = c(54L, 50L, 42L, 30L),
                      height = c(112L, NA, 73L, 43L)),
                 class = "data.frame", row.names = c(NA, -4L))
df2 <- structure(list(name = c("Tim", "Dave", "Larry", "Rob", "John", "Tom"),
                      age = c(7L, NA, 3L, 1L, 6L, 8L),
                      weight = c(NA, 50L, NA, 30L, 60L, 61L),
                      height = c(112L, 103L, 73L, NA, NA, 112L),
                      grade = c(2L, 1L, NA, NA, 1L, 2L)),
                 class = "data.frame", row.names = c(NA, -6L))
I like the powerjoin package suggested as an answer to the question in the first comment, which I had never heard of before.
However, if you want to avoid extra packages, you can do it in base R. This approach also avoids naming each column explicitly, which the dplyr approaches suggested in the comments require (though they could perhaps be modified).
# Load data
df1 <- read.table(text = "name age weight height
Tim 7 54 112
Dave 5 50 NA
Larry NA 42 73
Rob 1 30 43", header=TRUE)
df2 <- read.table(text = "name age weight height grade
Tim 7 NA 112 2
Dave NA 50 103 1
Larry 3 NA 73 NA
Rob 1 30 NA NA
John 6 60 NA 1
Tom 8 61 112 2", header=TRUE)
df3 <- merge(df1, df2, by = "name", all = TRUE, sort = FALSE)
# Coalesce the common columns (base R stand-in for dplyr::coalesce)
common_cols <- names(df1)[names(df1) != "name"]
df3[common_cols] <- lapply(common_cols, function(col) {
  x <- df3[[paste0(col, ".x")]]
  y <- df3[[paste0(col, ".y")]]
  ifelse(is.na(x), y, x)  # keep x, fall back to y where x is NA
})
# Select desired columns
df3[names(df2)]
# name age weight height grade
# 1 Tim 7 54 112 2
# 2 Dave 5 50 103 1
# 3 Larry 3 42 73 NA
# 4 Rob 1 30 43 NA
# 5 John 6 60 NA 1
# 6 Tom 8 61 112 2
There are advantages to using base R, but powerjoin looks like an interesting package too.
Another possible solution:
library(tidyverse)
df2 %>%
  bind_rows(df1) %>%
  group_by(name) %>%
  fill(age:grade, .direction = "updown") %>%
  ungroup() %>%
  distinct()
#> # A tibble: 6 x 5
#> name age weight height grade
#> <chr> <int> <int> <int> <int>
#> 1 Tim 7 54 112 2
#> 2 Dave 5 50 103 1
#> 3 Larry 3 42 73 NA
#> 4 Rob 1 30 43 NA
#> 5 John 6 60 NA 1
#> 6 Tom 8 61 112 2

How to fill missing values grouped on id and based on time period from index date

I want to fill in missing values for a data.frame based on a period of time within groups of ID.
For the latest registration_dat within the same ID group, I want to fill in with previous values in the ID group but only if the registration_dat is within 1 year of the latest registration_dat in the ID group.
Sample version of my data:
ID registration_dat value1 value2
1 2020-03-04 NA NA
1 2019-05-06 33 25
1 2019-01-02 32 21
3 2021-10-31 NA NA
3 2018-10-12 33 NA
3 2018-10-10 25 35
4 2020-01-02 NA NA
4 2019-10-31 32 83
4 2019-09-20 33 56
8 2019-12-12 NA NA
8 2019-10-31 NA 43
8 2019-08-12 32 46
Desired output:
ID registration_dat value1 value2
1 2020-03-04 33 25
1 2019-05-06 33 25
1 2019-01-02 32 21
3 2021-10-31 NA NA
3 2018-10-12 33 NA
3 2018-10-10 25 35
4 2020-01-02 32 83
4 2019-10-31 32 83
4 2019-09-20 33 56
8 2019-12-12 32 43
8 2019-10-31 NA 43
8 2019-08-12 32 46
I later filter the data to keep one row per ID, based on the latest registration date, and I want that row to have as little missing data as possible; hence I want to do this for all columns in the dataframe. However, I do not want NA values filled in from earlier dates that are more than 1 year before the latest registration date. My dataframe has 14 columns and 3 million+ rows, so this needs to work on a much bigger data.frame than the example shown.
I'd appreciate any ideas!
You can use across() to manipulate multiple columns at the same time. Note that I use date1 - years(1) <= date2 rather than date1 - 365 <= date2 to identify if a date is within 1 year of the latest one, which can take a leap year (366 days) into account.
library(dplyr)
library(lubridate)
df %>%
  group_by(ID) %>%
  arrange(desc(registration_dat), .by_group = TRUE) %>%
  mutate(across(starts_with("value"),
                ~ if_else(row_number() == 1 & is.na(.x) &
                            registration_dat - years(1) <= registration_dat[which.max(!is.na(.x))],
                          .x[which.max(!is.na(.x))],
                          .x))) %>%
  ungroup()
# # A tibble: 12 x 4
# ID registration_dat value1 value2
# <int> <date> <int> <int>
# 1 1 2020-03-04 33 25
# 2 1 2019-05-06 33 25
# 3 1 2019-01-02 32 21
# 4 3 2021-10-31 NA NA
# 5 3 2018-10-12 33 NA
# 6 3 2018-10-10 25 35
# 7 4 2020-01-02 32 83
# 8 4 2019-10-31 32 83
# 9 4 2019-09-20 33 56
# 10 8 2019-12-12 32 43
# 11 8 2019-10-31 NA 43
# 12 8 2019-08-12 32 46
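The trick doing the work above is which.max(!is.na(.x)), which returns the position of the first non-NA value once the group is sorted newest-first. A minimal illustration:
x <- c(NA, NA, 33, 32)
which.max(!is.na(x))     # 3, the first non-NA position
x[which.max(!is.na(x))]  # 33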
Data
df <- structure(list(ID = c(1L, 1L, 1L, 3L, 3L, 3L, 4L, 4L, 4L, 8L, 8L, 8L),
                     registration_dat = structure(c(18325, 18022, 17898, 18931,
                                                    17816, 17814, 18263, 18200,
                                                    18159, 18242, 18200, 18120),
                                                  class = "Date"),
                     value1 = c(NA, 33L, 32L, NA, 33L, 25L, NA, 32L, 33L, NA, NA, 32L),
                     value2 = c(NA, 25L, 21L, NA, NA, 35L, NA, 83L, 56L, NA, 43L, 46L)),
                class = "data.frame", row.names = c(NA, -12L))
You could make a small function (f, below) to handle each value column.
First, group by ID and generate a rowid (this is only to retain your original order):
dat <- dat %>%
  mutate(rowid = row_number()) %>%
  arrange(registration_dat) %>%
  group_by(ID)
Then make a function that takes a df and a val column, and returns an updated df with val fixed:
f <- function(df, val) {
  bind_rows(
    df %>% filter(is.na({{val}}) & row_number() != n()),
    df %>% filter(!is.na({{val}}) | row_number() == n()) %>%
      mutate({{val}} := if_else(is.na({{val}}) &
                                  registration_dat - lag(registration_dat) < 365,
                                lag({{val}}),
                                {{val}}))
  )
}
}
Apply the function to the columns of interest:
dat <- f(dat, value1)
dat <- f(dat, value2)
If you want, recover the original order
dat %>% arrange(rowid) %>% select(-rowid)
Output:
ID registration_dat value1 value2
<int> <date> <int> <int>
1 1 2020-03-04 33 25
2 1 2019-05-06 33 25
3 1 2019-01-02 32 21
4 3 2021-10-31 NA NA
5 3 2018-10-12 33 NA
6 3 2018-10-10 25 35
7 4 2020-01-02 32 83
8 4 2019-10-31 32 83
9 4 2019-09-20 33 56
10 8 2019-12-12 32 46
11 8 2019-10-31 NA 43
12 8 2019-08-12 32 46
Update:
The OP wants the final row (i.e., the one with the latest registration_dat) per ID. With 3 million rows and 14 value columns, I would use data.table and do something like this:
library(data.table)
f <- function(df) {
  df <- df[df[1, registration_dat] - registration_dat <= 365]
  df[1, value := df[2:.N][!is.na(value)][1, value]][1]
}
dcast(
  melt(setDT(dat), id = c("ID", "registration_dat"))[
    order(-registration_dat), f(.SD), by = .(ID, variable)],
  ID + registration_dat ~ variable, value.var = "value"
)
Output:
ID registration_dat value1 value2
<int> <Date> <int> <int>
1: 1 2020-03-04 33 25
2: 3 2021-10-31 NA NA
3: 4 2020-01-02 32 83
4: 8 2019-12-12 32 43

Match and Remove Rows Based on Condition R [duplicate]

This question already has answers here:
Select the row with the maximum value in each group
(19 answers)
Closed 2 years ago.
I've got an interesting one for you all.
First, I want to look through the ID column and identify duplicate values. Once those are identified, the code should compare the incomes of the duplicated IDs and keep the row with the larger income.
So if there are three ID values of 2, it will look for the one with the highest income and keep that row.
ID Income
1 98765
2 3456
2 67
2 5498
5 23
6 98
7 5645
7 67871
9 983754
10 982
10 2374
10 875
10 4744
11 6853
I know it's as easy as subsetting based on a condition, but I don't know how to remove rows based on whether the income in one cell is greater than another (only when the IDs match).
I was thinking of using an ifelse statement to create a new column to identify duplicates (through subsetting or not), then using the new column's values in another ifelse to identify the larger income. From there I could just subset based on the new columns I have created.
Is there a faster, more efficient way of doing this?
The outcome should look like this.
ID Income
1 98765
2 5498
5 23
6 98
7 67871
9 983754
10 4744
11 6853
Thank you
We can slice the rows by checking the highest value in 'Income' grouped by 'ID'
library(dplyr)
df1 %>%
  group_by(ID) %>%
  slice(which.max(Income))
Or using data.table
library(data.table)
setDT(df1)[, .SD[which.max(Income)], by = ID]
Or with base R
df1[with(df1, ave(Income, ID, FUN = max) == Income),]
# ID Income
#1 1 98765
#4 2 5498
#5 5 23
#6 6 98
#8 7 67871
#9 9 983754
#13 10 4744
#14 11 6853
data
df1 <- structure(list(ID = c(1L, 2L, 2L, 2L, 5L, 6L, 7L, 7L, 9L, 10L,
                             10L, 10L, 10L, 11L),
                      Income = c(98765L, 3456L, 67L, 5498L, 23L, 98L, 5645L,
                                 67871L, 983754L, 982L, 2374L, 875L, 4744L, 6853L)),
                 class = "data.frame", row.names = c(NA, -14L))
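One difference between these: if an ID has two rows tied for the maximum Income, the ave() filter keeps both tied rows, while slice(which.max(Income)) keeps only the first. A quick illustration with a hypothetical tie added to the sample data:
df_tie <- rbind(df1, data.frame(ID = 5L, Income = 23L))
# ave() keeps both ID 5 rows, giving 9 rows instead of 8:
nrow(df_tie[with(df_tie, ave(Income, ID, FUN = max) == Income), ])
# [1] 9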
Using order() with duplicated() (base R):
df <- df[order(df$ID, -df$Income), ]
df[!duplicated(df$ID), ]
ID Income
1 1 98765
4 2 5498
5 5 23
6 6 98
8 7 67871
9 9 983754
13 10 4744
14 11 6853
Here is another dplyr method. We can arrange the column and then slice the data frame for the first row.
library(dplyr)
df2 <- df %>%
  arrange(ID, desc(Income)) %>%
  group_by(ID) %>%
  slice(1) %>%
  ungroup()
df2
# # A tibble: 8 x 2
# ID Income
# <int> <int>
# 1 1 98765
# 2 2 5498
# 3 5 23
# 4 6 98
# 5 7 67871
# 6 9 983754
# 7 10 4744
# 8 11 6853
DATA
df <- read.table(text = "ID Income
1 98765
2 3456
2 67
2 5498
5 23
6 98
7 5645
7 67871
9 983754
10 982
10 2374
10 875
10 4744
11 6853",
header = TRUE)
Group_by and summarise from dplyr would work too
df1 %>%
  group_by(ID) %>%
  summarise(Income = max(Income))
ID Income
<int> <dbl>
1 1 98765.
2 2 5498.
3 5 23.
4 6 98.
5 7 67871.
6 9 983754.
7 10 4744.
8 11 6853.
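Note that summarise() collapses each group to a single row and drops all non-grouping columns, so this matches the desired output only because the data has just ID and Income; with more columns, the slice()/which.max() approaches above preserve the entire row.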
Using sqldf: Group by ID and select the corresponding max Income
library(sqldf)
sqldf("select ID,max(Income) from df group by ID")
Output:
ID max(Income)
1 1 98765
2 2 5498
3 5 23
4 6 98
5 7 67871
6 9 983754
7 10 4744
8 11 6853

How can I use merge so that I have data for all times?

I'm trying to reshape my data so that all entities have a value for all possible times (months). Here's what I have:
Class Value month
A 10 1
A 12 3
A 9 12
B 11 1
B 10 8
From the data above, I want to get the following data:
Class Value month
A 10 1
A NA 2
A 12 3
A NA 4
....
A 9 12
B 11 1
B NA 2
....
B 10 8
B NA 9
....
B NA 12
So I want every Class to have a row for each month from 1 to 12. How can I do this? I'm currently trying the merge function, but I'd appreciate any other approaches.
We can use tidyverse
library(tidyverse)
df1 %>%
  complete(Class, month = min(month):max(month)) %>%
  select(all_of(names(df1))) %>% # if we need the same column order
  as.data.frame()                # if needed to convert to 'data.frame'
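A note on the range: month = min(month):max(month) only spans the months actually present in the data (1:12 here because both extremes occur). If the calendar is fixed, writing it literally is safer:
df1 %>%
  complete(Class, month = 1:12)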
In base R using merge (where df is your data):
res <- data.frame(Class = rep(levels(df$Class), each = 12), value = NA, month = 1:12)
merge(df, res, by = c("Class", "month"), all.y = TRUE)[, c(1, 3, 2)]
# Class Value month
# 1 A 10 1
# 2 A NA 2
# 3 A 12 3
# 4 A NA 4
# 5 A NA 5
# 6 A NA 6
# 7 A NA 7
# 8 A NA 8
# 9 A NA 9
# 10 A NA 10
# 11 A NA 11
# 12 A 9 12
# 13 B 11 1
# 14 B NA 2
# 15 B NA 3
# 16 B NA 4
# 17 B NA 5
# 18 B NA 6
# 19 B NA 7
# 20 B 10 8
# 21 B NA 9
# 22 B NA 10
# 23 B NA 11
# 24 B NA 12
df <- structure(list(Class = structure(c(1L, 1L, 1L, 2L, 2L),
                                       .Label = c("A", "B"), class = "factor"),
                     Value = c(10L, 12L, 9L, 11L, 10L),
                     month = c(1L, 3L, 12L, 1L, 8L)),
                .Names = c("Class", "Value", "month"),
                class = "data.frame", row.names = c(NA, -5L))
To add to #akrun's answer, if you want to replace the NA values with 0, you can do the following:
library(dplyr)
library(tidyr)
df1 %>%
  complete(Class, month = min(month):max(month)) %>%
  mutate(Value = ifelse(is.na(Value), 0, Value))
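tidyr's replace_na() expresses the same fill a bit more directly (same result, assuming Value is the only column to fill):
df1 %>%
  complete(Class, month = min(month):max(month)) %>%
  replace_na(list(Value = 0))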
