Match and Remove Rows Based on Condition R [duplicate] - r

This question already has answers here:
Select the row with the maximum value in each group
(19 answers)
Closed 2 years ago.
I've got an interesting one for you all.
I'm looking to first: Look through the ID column and identify duplicate values. Once those are identified, the code should go through the income of the duplicated values and keep the row with the larger income.
So if there are three ID values of 2, it will look for the one with the highest income and keep that row.
ID Income
1 98765
2 3456
2 67
2 5498
5 23
6 98
7 5645
7 67871
9 983754
10 982
10 2374
10 875
10 4744
11 6853
I know it's as easy as subsetting based on a condition, but I don't know how to remove rows based on whether the income in one cell is greater than the other (only when the IDs match).
I was thinking of using an ifelse statement to create a new column to identify duplicates (through subsetting or not) then use the new column's values to ifelse again to identify the larger income. From there I can just subset based on the new columns I have created.
Is there a faster, more efficient way of doing this?
The outcome should look like this.
ID Income
1 98765
2 5498
5 23
6 98
7 67871
9 983754
10 4744
11 6853
Thank you

We can slice the rows by checking the highest value in 'Income' grouped by 'ID'
library(dplyr)
df1 %>%
group_by(ID) %>%
slice(which.max(Income))
Or using data.table
library(data.table)
setDT(df1)[, .SD[which.max(Income)], by = ID]
Or with base R
df1[with(df1, ave(Income, ID, FUN = max) == Income),]
# ID Income
#1 1 98765
#4 2 5498
#5 5 23
#6 6 98
#8 7 67871
#9 9 983754
#13 10 4744
#14 11 6853
data
df1 <- structure(list(ID = c(1L, 2L, 2L, 2L, 5L, 6L, 7L, 7L, 9L, 10L,
10L, 10L, 10L, 11L), Income = c(98765L, 3456L, 67L, 5498L, 23L,
98L, 5645L, 67871L, 983754L, 982L, 2374L, 875L, 4744L, 6853L)),
class = "data.frame", row.names = c(NA,
-14L))
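The approaches above differ when a group has tied maxima: which.max() returns only the first maximum, while the ave() comparison keeps every row equal to the group maximum. A minimal base R sketch of that difference, using a hypothetical data frame df_tie that is not part of the original question:

```r
# Hypothetical data: ID 2 has two rows sharing the maximum Income
df_tie <- data.frame(ID = c(1L, 2L, 2L, 3L),
                     Income = c(10L, 50L, 50L, 7L))

# ave() comparison: keeps every row whose Income equals its group maximum
keep_all <- df_tie[with(df_tie, ave(Income, ID, FUN = max) == Income), ]

# order() + duplicated(): keeps exactly one row per ID
ordered  <- df_tie[order(df_tie$ID, -df_tie$Income), ]
keep_one <- ordered[!duplicated(ordered$ID), ]

nrow(keep_all)  # both tied rows for ID 2 are retained
nrow(keep_one)  # one row per ID
```

If exactly one row per ID is required, the which.max() or duplicated() variants are probably the safer choice.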

order with duplicated (base R)
df=df[order(df$ID,-df$Income),]
df[!duplicated(df$ID),]
ID Income
1 1 98765
4 2 5498
5 5 23
6 6 98
8 7 67871
9 9 983754
13 10 4744
14 11 6853

Here is another dplyr method. We can arrange the column and then slice the data frame for the first row.
library(dplyr)
df2 <- df %>%
arrange(ID, desc(Income)) %>%
group_by(ID) %>%
slice(1) %>%
ungroup()
df2
# # A tibble: 8 x 2
# ID Income
# <int> <int>
# 1 1 98765
# 2 2 5498
# 3 5 23
# 4 6 98
# 5 7 67871
# 6 9 983754
# 7 10 4744
# 8 11 6853
DATA
df <- read.table(text = "ID Income
1 98765
2 3456
2 67
2 5498
5 23
6 98
7 5645
7 67871
9 983754
10 982
10 2374
10 875
10 4744
11 6853",
header = TRUE)

Group_by and summarise from dplyr would work too
df1 %>%
group_by(ID) %>%
summarise(Income=max(Income))
ID Income
<int> <dbl>
1 1 98765.
2 2 5498.
3 5 23.
4 6 98.
5 7 67871.
6 9 983754.
7 10 4744.
8 11 6853.
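For comparison, the same per-group maximum can be sketched in base R with aggregate(). This assumes only ID and Income are needed; as with summarise(), any other columns would be dropped. The data frame below is a shortened copy of the question's data.

```r
# Shortened copy of the question's data
df <- data.frame(ID = c(1L, 2L, 2L, 2L, 5L),
                 Income = c(98765L, 3456L, 67L, 5498L, 23L))

# One row per ID holding the maximum Income
res <- aggregate(Income ~ ID, data = df, FUN = max)
res
```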

Using sqldf: Group by ID and select the corresponding max Income
library(sqldf)
sqldf("select ID,max(Income) from df group by ID")
Output:
ID max(Income)
1 1 98765
2 2 5498
3 5 23
4 6 98
5 7 67871
6 9 983754
7 10 4744
8 11 6853

Related

How to use R to replace missing values with the sum of previous 4 values in a column?

I have a dataframe that contains (among other things) three columns that have missing values every 5 rows. These missing values need to be replaced with the sum of the previous 4 values in their respective column.
For example, let's say my dataframe looked like this:
id category1 category2 category3
123 5 10 10
123 6 11 15
123 6 12 23
123 4 10 6
123 NA NA NA
567 24 17 15
Those NAs need to represent a "total" based on the sum of the previous 4 values in their column, and this needs to repeat throughout the entire dataframe because the NAs occur every 5 rows. For instance, the three NAs in the mock example above should be replaced with 21, 43, and 54. 5 rows later, the same process will need to be repeated. How can I achieve this?
Another possible solution:
library(dplyr)
df %>%
group_by(id) %>%
mutate(across(everything(), ~ if_else(is.na(.x), sum(.x, na.rm = TRUE), .x))) %>%
ungroup
#> # A tibble: 6 × 4
#> id category1 category2 category3
#> <int> <int> <int> <int>
#> 1 123 5 10 10
#> 2 123 6 11 15
#> 3 123 6 12 23
#> 4 123 4 10 6
#> 5 123 21 43 54
#> 6 567 24 17 15
The following should work if there are no NA values within the first 4 rows, assuming the NA values appear in all columns at the same time.
for(i in seq_len(nrow(data))){
if(is.na(data[i, 2])){
# sum the 4 rows immediately before the NA row (i-4 to i-1)
data[i, 2] <- sum(data[seq(i-4, i-1), 2])
data[i, 3] <- sum(data[seq(i-4, i-1), 3])
data[i, 4] <- sum(data[seq(i-4, i-1), 4])
}
}
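A loop-free variant can be sketched in base R by labelling each run of 5 rows as a block and filling the NA with the block sum. This assumes the NA really is every fifth row, as the question states; the second block of values below is made up for illustration.

```r
# First block is from the question; the second block is hypothetical
df <- data.frame(
  id = rep(c(123L, 567L), each = 5),
  category1 = c(5L, 6L, 6L, 4L, NA, 1L, 2L, 3L, 4L, NA)
)

# Block index: rows 1-5 -> 0, rows 6-10 -> 1, and so on
block <- (seq_len(nrow(df)) - 1) %/% 5

# Within each block, replace the NA with the sum of the other values
df$category1 <- ave(df$category1, block, FUN = function(x) {
  x[is.na(x)] <- sum(x, na.rm = TRUE)
  x
})
df
```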
If the NAs appear at the end row for each 'id', we may remove it and do a group by summarise to create a row
library(dplyr)
df1 <- df1 %>%
na.omit %>%
group_by(id) %>%
summarise(across(everything(), ~ c(.x, sum(.x))), .groups = 'drop')
-output
df1
# A tibble: 7 × 4
id category1 category2 category3
<int> <int> <int> <int>
1 123 5 10 10
2 123 6 11 15
3 123 6 12 23
4 123 4 10 6
5 123 21 43 54
6 567 24 17 15
7 567 24 17 15
Or another approach would be to replace the NA with the sum using na.aggregate from zoo
library(zoo)
df1 %>%
group_by(id) %>%
mutate(across(everything(), na.aggregate, FUN = sum)) %>%
ungroup
# A tibble: 6 × 4
id category1 category2 category3
<int> <int> <int> <int>
1 123 5 10 10
2 123 6 11 15
3 123 6 12 23
4 123 4 10 6
5 123 21 43 54
6 567 24 17 15
data
df1 <- structure(list(id = c(123L, 123L, 123L, 123L, 123L, 567L),
category1 = c(5L,
6L, 6L, 4L, NA, 24L), category2 = c(10L, 11L, 12L, 10L, NA, 17L
), category3 = c(10L, 15L, 23L, 6L, NA, 15L)),
class = "data.frame", row.names = c(NA,
-6L))

Replacing NAs with existing data when merging two dataframes in R

I would like to merge two dataframes. There are some shared variables and some different variables, and there are different numbers of rows in each dataframe. The dataframes share some rows, but not all. And both dataframes have missing data that the other may have.
DF1:
name age weight height
Tim 7 54 112
Dave 5 50 NA
Larry NA 42 73
Rob 1 30 43
DF2:
name age weight height grade
Tim 7 NA 112 2
Dave NA 50 103 1
Larry 3 NA 73 NA
Rob 1 30 NA NA
John 6 60 NA 1
Tom 8 61 112 2
I want to merge these two dataframes together by the shared columns (name, age, weight, and height). However, I want NAs to be overridden, such that if one of the two dataframes has a value where the other has NA, I want the value to be carried through into the third dataframe. Ideally, the last dataframe should only have NAs when both DF1 and DF2 had NAs in that same location.
Ideal Data Frame
name age weight height grade
Tim 7 54 112 2
Dave 5 50 103 1
Larry 3 42 73 NA
Rob 1 30 43 NA
John 6 60 NA 1
Tom 8 61 112 2
I've been using full_join and left_join, but I don't know how to merge these in such a way that NAs are replaced with actual data (if it is present in one of the dataframes). Is there a way to do this?
This is a typical case that rows_patch() from dplyr can treat.
library(dplyr)
rows_patch(df2, df1, by = "name")
name age weight height grade
1 Tim 7 54 112 2
2 Dave 5 50 103 1
3 Larry 3 42 73 NA
4 Rob 1 30 43 NA
5 John 6 60 NA 1
6 Tom 8 61 112 2
Data
df1 <- structure(list(name = c("Tim", "Dave", "Larry", "Rob"), age = c(7L,
5L, NA, 1L), weight = c(54L, 50L, 42L, 30L), height = c(112L,
NA, 73L, 43L)), class = "data.frame", row.names = c(NA, -4L))
df2 <- structure(list(name = c("Tim", "Dave", "Larry", "Rob", "John",
"Tom"), age = c(7L, NA, 3L, 1L, 6L, 8L), weight = c(NA, 50L,
NA, 30L, 60L, 61L), height = c(112L, 103L, 73L, NA, NA, 112L),
grade = c(2L, 1L, NA, NA, 1L, 2L)), class = "data.frame", row.names = c(NA, -6L))
I like the powerjoin package suggested as an answer to the question in the first comment, which I had never heard of before.
However, if you want to avoid using extra packages, you can do it in base R. This approach also avoids having to explicitly name each column - the dplyr approaches suggested in the comments do not do that, although perhaps could be modified.
# Load data
df1 <- read.table(text = "name age weight height
Tim 7 54 112
Dave 5 50 NA
Larry NA 42 73
Rob 1 30 43", header=TRUE)
df2 <- read.table(text = "name age weight height grade
Tim 7 NA 112 2
Dave NA 50 103 1
Larry 3 NA 73 NA
Rob 1 30 NA NA
John 6 60 NA 1
Tom 8 61 112 2", header=TRUE)
df3 <- merge(df1, df2, by = "name", all = TRUE, sort=FALSE)
# Coalesce the common columns in base R: take the .x value unless it is NA, else the .y value
common_cols <- names(df1)[names(df1)!="name"]
df3[common_cols] <- lapply(common_cols, function(col) {
x <- df3[[paste0(col, ".x")]]
y <- df3[[paste0(col, ".y")]]
ifelse(is.na(x), y, x)
})
# Select desired columns
df3[names(df2)]
# name age weight height grade
# 1 Tim 7 54 112 2
# 2 Dave 5 50 103 1
# 3 Larry 3 42 73 NA
# 4 Rob 1 30 43 NA
# 5 John 6 60 NA 1
# 6 Tom 8 61 112 2
There are advantages to using base R, but powerjoin looks like an interesting package too.
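The NA-overriding step on its own can be sketched in base R with ifelse(). The helper name coalesce2 is hypothetical, and the two vectors are the age columns as they would appear after the full merge above.

```r
# Hypothetical helper: take x where it is present, otherwise fall back to y
coalesce2 <- function(x, y) ifelse(is.na(x), y, x)

# age from df1 (.x) and df2 (.y) after a full merge by name
age_x <- c(7L, 5L, NA, 1L, NA, NA)
age_y <- c(7L, NA, 3L, 1L, 6L, 8L)

age <- coalesce2(age_x, age_y)
age
```

A position stays NA only when both inputs are NA there, which matches the requirement in the question.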
Another possible solution:
library(tidyverse)
df2 %>%
bind_rows(df1) %>%
group_by(name) %>%
fill(age:grade, .direction = "updown") %>%
ungroup %>%
distinct
#> # A tibble: 6 x 5
#> name age weight height grade
#> <chr> <int> <int> <int> <int>
#> 1 Tim 7 54 112 2
#> 2 Dave 5 50 103 1
#> 3 Larry 3 42 73 NA
#> 4 Rob 1 30 43 NA
#> 5 John 6 60 NA 1
#> 6 Tom 8 61 112 2

How to fill missing values grouped on id and based on time period from index date

I want to fill in missing values for a data.frame based on a period of time within groups of ID.
For the latest registration_dat within the same ID group, I want to fill in with previous values in the ID group but only if the registration_dat is within 1 year of the latest registration_dat in the ID group.
Sample version of my data:
ID registration_dat value1 value2
1 2020-03-04 NA NA
1 2019-05-06 33 25
1 2019-01-02 32 21
3 2021-10-31 NA NA
3 2018-10-12 33 NA
3 2018-10-10 25 35
4 2020-01-02 NA NA
4 2019-10-31 32 83
4 2019-09-20 33 56
8 2019-12-12 NA NA
8 2019-10-31 NA 43
8 2019-08-12 32 46
Desired output:
ID registration_dat value1 value2
1 2020-03-04 33 25
1 2019-05-06 33 25
1 2019-01-02 32 21
3 2021-10-31 NA NA
3 2018-10-12 33 NA
3 2018-10-10 25 35
4 2020-01-02 32 83
4 2019-10-31 32 83
4 2019-09-20 33 56
8 2019-12-12 32 43
8 2019-10-31 NA 43
8 2019-08-12 32 46
I am later filtering the data so that I get one unique ID based on the latest registration date, and I want this row to have as little missing data as possible, hence I want to do this for all columns in the dataframe. However, I do not want NA values being filled in by values from previous dates if they are more than 1 year apart from the latest registration date. My dataframe has 14 columns and 3 million+ rows, so I would need it to work on a much bigger data.frame than the one shown as an example.
I'd appreciate any ideas!
You can use across() to manipulate multiple columns at the same time. Note that I use date1 - years(1) <= date2 rather than date1 - 365 <= date2 to identify if a date is within 1 year of the latest one, which can take a leap year (366 days) into account.
library(dplyr)
library(lubridate)
df %>%
group_by(ID) %>%
arrange(desc(registration_dat), .by_group = TRUE) %>%
mutate(across(starts_with("value"),
~ if_else(row_number() == 1 & is.na(.x) & registration_dat - years(1) <= registration_dat[which.max(!is.na(.x))],
.x[which.max(!is.na(.x))], .x))) %>%
ungroup()
# # A tibble: 12 x 4
# ID registration_dat value1 value2
# <int> <date> <int> <int>
# 1 1 2020-03-04 33 25
# 2 1 2019-05-06 33 25
# 3 1 2019-01-02 32 21
# 4 3 2021-10-31 NA NA
# 5 3 2018-10-12 33 NA
# 6 3 2018-10-10 25 35
# 7 4 2020-01-02 32 83
# 8 4 2019-10-31 32 83
# 9 4 2019-09-20 33 56
# 10 8 2019-12-12 32 43
# 11 8 2019-10-31 NA 43
# 12 8 2019-08-12 32 46
Data
df <- structure(list(ID = c(1L, 1L, 1L, 3L, 3L, 3L, 4L, 4L, 4L, 8L,
8L, 8L), registration_dat = structure(c(18325, 18022, 17898,
18931, 17816, 17814, 18263, 18200, 18159, 18242, 18200, 18120
), class = "Date"), value1 = c(NA, 33L, 32L, NA, 33L, 25L, NA,
32L, 33L, NA, NA, 32L), value2 = c(NA, 25L, 21L, NA, NA, 35L,
NA, 83L, 56L, NA, 43L, 46L)), class = "data.frame", row.names = c(NA,-12L))
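The leap-year point made in this answer can be illustrated with base R dates alone. The sketch below uses seq() with by = "-1 year" as a stand-in for lubridate's years(1), and the earlier date is a hypothetical one chosen to fall exactly one calendar year before the latest registration of ID 1.

```r
latest  <- as.Date("2020-03-04")   # latest registration for ID 1
earlier <- as.Date("2019-03-04")   # hypothetical: exactly one calendar year before

# One calendar year back (stand-in for latest - years(1))
one_year_back <- seq(latest, by = "-1 year", length.out = 2)[2]

# 29 Feb 2020 lies in between, so the two dates are 366 days apart
within_365_days      <- latest - 365 <= earlier
within_calendar_year <- one_year_back <= earlier

within_365_days       # FALSE
within_calendar_year  # TRUE
```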
You could make a small function (f, below) to handle each value column.
Make a grouped ID, and generate a rowid (this is only to retain your original order)
dat <- dat %>%
mutate(rowid = row_number()) %>%
arrange(registration_dat) %>%
group_by(ID)
Make a function that takes a df and a val column, and returns an updated df with val fixed
f <- function(df, val) {
bind_rows(
df %>% filter(is.na({{val}}) & row_number()!=n()),
df %>% filter(!is.na({{val}}) | row_number()==n()) %>%
mutate({{val}} := if_else(is.na({{val}}) & registration_dat-lag(registration_dat)<365, lag({{val}}),{{val}}))
)
}
Apply the function to the columns of interest
dat = f(dat,value1)
dat = f(dat,value2)
If you want, recover the original order
dat %>% arrange(rowid) %>% select(-rowid)
Output (note that value2 for ID 8 on 2019-12-12 comes out as 46 here, whereas the desired output has 43):
ID registration_dat value1 value2
<int> <date> <int> <int>
1 1 2020-03-04 33 25
2 1 2019-05-06 33 25
3 1 2019-01-02 32 21
4 3 2021-10-31 NA NA
5 3 2018-10-12 33 NA
6 3 2018-10-10 25 35
7 4 2020-01-02 32 83
8 4 2019-10-31 32 83
9 4 2019-09-20 33 56
10 8 2019-12-12 32 46
11 8 2019-10-31 NA 43
12 8 2019-08-12 32 46
Update:
The OP wants the final row (i.e., the last registration_dat) per ID. With 3 million rows and 14 value columns, I would use data.table and do something like this:
library(data.table)
f <- function(df) {
df = df[df[1,registration_dat]-registration_dat<=365]
df[1,value:=df[2:.N][!is.na(value)][1,value]][1]
}
dcast(
melt(setDT(dat), id=c("ID", "registration_dat"))[order(-registration_dat),f(.SD), by=.(ID,variable)],
ID+registration_dat~variable, value.var="value"
)
Output:
ID registration_dat value1 value2
<int> <Date> <int> <int>
1: 1 2020-03-04 33 25
2: 3 2021-10-31 NA NA
3: 4 2020-01-02 32 83
4: 8 2019-12-12 32 43

Convert data from wide format to long format with multiple measure columns [duplicate]

This question already has answers here:
wide to long multiple measures each time
(5 answers)
Closed 1 year ago.
I want to do this but the exact opposite. So say my dataset looks like this:
ID X_1990 X_2000 X_2010 Y_1990 Y_2000 Y_2010
A 1 4 7 10 13 16
B 2 5 8 11 14 17
C 3 6 9 12 15 18
but with a lot more measure variables (i.e. also Z_1990, etc.). How can I get it so that the year becomes a variable and it will keep the different measures, like this:
ID Year X Y
A 1990 1 10
A 2000 4 13
A 2010 7 16
B 1990 2 11
B 2000 5 14
B 2010 8 17
C 1990 3 12
C 2000 6 15
C 2010 9 18
You may use pivot_longer with names_sep argument.
tidyr::pivot_longer(df, cols = -ID, names_to = c('.value', 'Year'), names_sep = '_')
# ID Year X Y
# <chr> <chr> <int> <int>
#1 A 1990 1 10
#2 A 2000 4 13
#3 A 2010 7 16
#4 B 1990 2 11
#5 B 2000 5 14
#6 B 2010 8 17
#7 C 1990 3 12
#8 C 2000 6 15
#9 C 2010 9 18
data
It is easier to help if you provide data in a reproducible format
df <- structure(list(ID = c("A", "B", "C"), X_1990 = 1:3, X_2000 = 4:6,
X_2010 = 7:9, Y_1990 = 10:12, Y_2000 = 13:15, Y_2010 = 16:18),
row.names = c(NA, -3L), class = "data.frame")
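If tidyr is not available, a similar result can be sketched with base R's reshape(), which can guess the measure names and years from column names split on "_". This is an alternative to the pivot_longer() call above, not the answer's own method; the final ordering step only makes the rows line up with the desired output.

```r
df <- data.frame(ID = c("A", "B", "C"),
                 X_1990 = 1:3, X_2000 = 4:6, X_2010 = 7:9,
                 Y_1990 = 10:12, Y_2000 = 13:15, Y_2010 = 16:18)

# Split each varying column name on "_" into measure (X, Y) and Year
long <- reshape(df, direction = "long",
                varying = names(df)[-1], sep = "_",
                timevar = "Year", idvar = "ID")

# Sort by ID then Year and drop the generated row names
long <- long[order(long$ID, long$Year), ]
rownames(long) <- NULL
long
```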

How to reorder column values in ascending order that are separated by "," and only keep the first value in R

I have a column in a df that consists of values like so:
ID
2
NA
1
3
4
5,7
9,6,10
12
15
16
17
NA
19
22,23
I would like to reorder every row based on ascending order. Note - this column is a "character" based field and some rows are already in the correct order.
From there, I only want to keep the first value and remove the others.
Desired output:
ID
2
NA
1
3
4
5
6
12
15
16
17
NA
19
22
You can split the data on comma, sort them and extract the 1st value.
df$ID <- sapply(strsplit(df$ID, ','), function(x) sort(as.numeric(x))[1])
# ID
#1 2
#2 NA
#3 1
#4 3
#5 4
#6 5
#7 6
#8 12
#9 15
#10 16
#11 17
#12 NA
#13 19
#14 22
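The as.numeric() conversion matters because the column is character: sorting the split pieces as strings compares them character by character. A small sketch using the "9,6,10" value from the question:

```r
parts <- strsplit("9,6,10", ",")[[1]]

# Character sort: "10" sorts before "6" because "1" < "6"
first_as_text <- sort(parts)[1]

# Numeric sort: the smallest number comes first
first_as_number <- sort(as.numeric(parts))[1]

first_as_text
first_as_number
```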
A couple of tidyverse alternatives.
library(tidyverse)
#1.
#Same as base R but in tidyverse
df %>% mutate(ID = map_dbl(str_split(ID, ','), ~sort(as.numeric(.x))[1]))
#2.
df %>%
mutate(row = row_number()) %>%
separate_rows(ID, sep = ',', convert = TRUE) %>%
group_by(row) %>%
summarise(ID = min(ID)) %>%
select(-row)
Here is another tidyverse solution, making use of dplyr, purrr, stringr and readr:
library(tidyverse)
df %>%
mutate(ID = map_chr(str_split(ID, ","), ~
toString(sort(as.numeric(.x)))),
ID = parse_number(ID))
output:
ID
1 2
2 NA
3 1
4 3
5 4
6 5
7 6
8 12
9 15
10 16
11 17
12 NA
13 19
14 22
We may use the minimum instead of sorting / extracting:
DF <- transform(DF, ID=sapply(strsplit(ID, ','), \(x) min(as.double(x))))
DF
# ID
# 1 2
# 2 NA
# 3 1
# 4 3
# 5 4
# 6 5
# 7 6
# 8 12
# 9 15
# 10 16
# 11 17
# 12 NA
# 13 19
# 14 22
We could use str_extract, which takes the first number in each string without sorting (so "9,6,10" gives 9 rather than 6)
library(stringr)
library(dplyr)
df1 %>%
mutate(ID = as.numeric(str_extract(ID, '\\d+')))
-output
ID
1 2
2 NA
3 1
4 3
5 4
6 5
7 9
8 12
9 15
10 16
11 17
12 NA
13 19
14 22
data
df1 <- structure(list(ID = c("2", NA, "1", "3", "4", "5,7", "9,6,10",
"12", "15", "16", "17", NA, "19", "22,23")), class = "data.frame", row.names = c(NA,
-14L))
