Rank within groups in R with special NA handling - r

I have a dataframe like this:
df <- data.frame(
A = rep(c("A", "B", "C", "D"), each = 3),
B = rep(c("V1", "V2", "V3"), 4),
C = c(1,2,3,5,2,NA,4,6,7,3,7,8)
)
# Output
A B C
1 A V1 1
2 A V2 2
3 A V3 3
4 B V1 5
5 B V2 2
6 B V3 NA
7 C V1 4
8 C V2 6
9 C V3 7
10 D V1 3
11 D V2 7
12 D V3 8
My goal is to receive the ranks of the values in column C, grouped by column B, with the largest value in each group getting rank 1. An NA value should not take part in the ranking at all; the RANK column should then contain NA, NULL or something similar for that row. Ties should receive the average rank.
The result should look like:
A B C RANK
1 A V1 1 4
2 A V2 2 3.5
3 A V3 3 3
4 B V1 5 1
5 B V2 2 3.5
6 B V3 NA NA
7 C V1 4 2
8 C V2 6 2
9 C V3 7 2
10 D V1 3 3
11 D V2 7 1
12 D V3 8 1

We can group by 'B' and rank on '-C' (so that the largest value gets rank 1), using a logical condition in i to select only the non-NA elements of 'C', and assign (:=) the ranks to create the 'RANK' column. Rows that are not selected in i, i.e. those where 'C' is NA, are left as NA in the new column.
library(data.table)
setDT(df)[!is.na(C), RANK := rank(-C) , B]
df
# A B C RANK
# 1: A V1 1 4.0
# 2: A V2 2 3.5
# 3: A V3 3 3.0
# 4: B V1 5 1.0
# 5: B V2 2 3.5
# 6: B V3 NA NA
# 7: C V1 4 2.0
# 8: C V2 6 2.0
# 9: C V3 7 2.0
#10: D V1 3 3.0
#11: D V2 7 1.0
#12: D V3 8 1.0

Using the ave() function from base R to rank the C values within the groups of B.
First approach (an improved version of the second approach; credit: Henrik):
df$Rank <- with(df, ave(C, B, FUN=function(x) rank(-x, na.last = "keep",
ties.method = "average")))
Second approach:
df$Rank <- with(df, ave(C, B, FUN=function(x) rank(-x, ties.method = "average")))
df$Rank[is.na(df$C)] <- NA
Output for both approaches:
df
# A B C Rank
# 1 A V1 1 4.0
# 2 A V2 2 3.5
# 3 A V3 3 3.0
# 4 B V1 5 1.0
# 5 B V2 2 3.5
# 6 B V3 NA NA
# 7 C V1 4 2.0
# 8 C V2 6 2.0
# 9 C V3 7 2.0
# 10 D V1 3 3.0
# 11 D V2 7 1.0
# 12 D V3 8 1.0
Finally, the dplyr approach, which gives the same ranks (here in a column named rank):
library(dplyr)
df %>% group_by(B) %>% mutate(rank = rank(-C, na.last = "keep",
ties.method = "average"))
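To see why na.last = "keep" is enough on its own, it can help to look at a single group in isolation. The sketch below uses only base R and the C values of group V3 from the question (3, NA, 7, 8); the names x and r are just for illustration.

```r
# C values for group B == "V3" in the question's data (rows 3, 6, 9, 12)
x <- c(3, NA, 7, 8)

# Negate so that the largest value gets rank 1. "keep" leaves the NA in place
# instead of ranking it last; tied values would receive their average rank.
r <- rank(-x, na.last = "keep", ties.method = "average")
print(r)
```

The NA is neither ranked nor counted, so the remaining three values are ranked 1 to 3, which matches the V3 rows of the expected output.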

Related

Join specific columns of matching rows

I have this data frame:
patientcA 1 2 NA NA b c
patientcB NA NA 3 4 b c
patientdA 3 3 NA NA d e
patientdB NA NA 5 6 d e
How can I join columns 2, 3, 4 and 5 for those rows which match in column 1 except for the last character? In this case, the first two rows match except for the last character, and so do the last two. So my expected output would be:
patientcA 1 2 3 4 b c
patientcB 1 2 3 4 b c
patientdA 3 3 5 6 d e
patientdB 3 3 5 6 d e
I have tried something like this, but I don't know what to write as the else argument. Moreover, I don't think this is the best approach:
new_data$first_column<-ifelse(grepl('A$', original_data$first), original_data$first, ?)
You might consider a tidyverse approach that uses separate() to put the last character of column 1 into a new column, and fill() to replace NA with the values from the same patient.
library(tidyverse)
df %>%
separate(V1, into = c("patient", "letter"), sep = -1) %>%
group_by(patient) %>%
fill(V2:V5, .direction = "downup")
Output
patient letter V2 V3 V4 V5 V6 V7
<chr> <chr> <int> <int> <int> <int> <chr> <chr>
1 patientc A 1 2 3 4 b c
2 patientc B 1 2 3 4 b c
3 patientd A 3 3 5 6 d e
4 patientd B 3 3 5 6 d e
You could write a vectorized function like CC() below, that completes columns, then split-apply-combine with by.
CC <- Vectorize(function(x) if (any(is.na(x))) rep(x[!is.na(x)], length(x)) else x)
res <- do.call(rbind.data.frame, by(dat, substr(dat$V1, 8, 8), CC))
res
# V1 V2 V3 V4 V5 V6 V7
# c.1 patientcA 1 2 3 4 b c
# c.2 patientcB 1 2 3 4 b c
# d.1 patientdA 3 3 5 6 d e
# d.2 patientdB 3 3 5 6 d e
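For comparison, here is a minimal base R sketch of the same idea that needs no packages. It assumes the data frame is called dat with columns V1 to V7, as in the output above. The question does not show how the data was created, so the data.frame() call is a reconstruction, and fill_group, key and cols are illustrative names.

```r
# Reconstructed example data (an assumption; the question only shows printed rows)
dat <- data.frame(
  V1 = c("patientcA", "patientcB", "patientdA", "patientdB"),
  V2 = c(1, NA, 3, NA),
  V3 = c(2, NA, 3, NA),
  V4 = c(NA, 3, NA, 5),
  V5 = c(NA, 4, NA, 6),
  V6 = c("b", "b", "d", "d"),
  V7 = c("c", "c", "e", "e")
)

# Grouping key: the patient name without its last character
key <- sub(".$", "", dat$V1)

# Within one group, replace NA by the first non-NA value of that column;
# values that are already present are left untouched.
fill_group <- function(x) {
  x[is.na(x)] <- x[!is.na(x)][1]
  x
}

cols <- c("V2", "V3", "V4", "V5")
dat[cols] <- lapply(dat[cols], function(col) ave(col, key, FUN = fill_group))
print(dat)
```

Unlike the separate() approach, this keeps V1 intact and only uses the shortened name as a grouping key.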

Updating rows of a data.frame with the rows of another data.frame

I want to update the rows of data.frame df1 with the rows of data.frame df2. Any hint?
df1 <-
data.frame(
"V1" = LETTERS[1:4]
, "V2" = 1:4
, "V3" = 7:10
)
df1
V1 V2 V3
1 A 1 7
2 B 2 8
3 C 3 9
4 D 4 10
df2 <-
data.frame(
"V1" = c("A","D")
, "V2" = c(5, 7)
, "V3" = c(12, 15)
)
df2
V1 V2 V3
1 A 5 12
2 D 7 15
Required Output
V1 V2 V3
1 A 5 12
2 B 2 8
3 C 3 9
4 D 7 15
Use rows_update(), available since dplyr 1.0.0:
library(dplyr)
rows_update(df1, df2)
Matching, by = "V1"
V1 V2 V3
1 A 5 12
2 B 2 8
3 C 3 9
4 D 7 15
Try this (it assumes the matching keys appear in the same order in df1 and df2):
df1[df1$V1 %in% df2$V1,c('V2','V3')] <- df2[df2$V1 %in% df1$V1,c('V2','V3')]
V1 V2 V3
1 A 5 12
2 B 2 8
3 C 3 9
4 D 7 15
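The %in% assignment pairs rows by position, so it relies on the matching keys appearing in the same order in both data frames. A sketch that looks up each key explicitly with match() avoids that assumption. Here df2 is deliberately given in a different order than in the question, and idx and found are illustrative names.

```r
df1 <- data.frame(V1 = LETTERS[1:4], V2 = 1:4, V3 = 7:10)

# Same update rows as in the question, but listed in reverse order
df2 <- data.frame(V1 = c("D", "A"), V2 = c(7, 5), V3 = c(15, 12))

# For every row of df2, find the position of its key in df1 (NA if absent)
idx <- match(df2$V1, df1$V1)
found <- !is.na(idx)

# Overwrite only the rows of df1 whose key was found in df2
df1[idx[found], c("V2", "V3")] <- df2[found, c("V2", "V3")]
print(df1)
```

Keys in df2 that do not occur in df1 are skipped rather than appended, which mirrors what the question asks for.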

How to fill NA rows by conditions from columns in R

Here is an example:
df<-data.frame(v1=rep(1:2, 4),
v2=rep(c("a", "b"), each=4),
v3=paste0(rep(1:2, each=4), rep(c("m", "n", "o", "p"), each=2)),
v4=c(1,2, NA, NA, 3,4, NA,NA),
v5=c(5,6, NA, NA, 7,8, NA,NA),
v6=c(9,10, NA, NA, 11,12, NA,NA))
df
v1 v2 v3 v4 v5 v6
1 1 a 1m 1 5 9
2 2 a 1m 2 6 10
3 1 a 1n NA NA NA
4 2 a 1n NA NA NA
5 1 b 2o 3 7 11
6 2 b 2o 4 8 12
7 1 b 2p NA NA NA
8 2 b 2p NA NA NA
What I want is this: if v1, v2 and v3 are the same, ignoring the last letter of v3, fill the NAs from the rows that are not NA. In this case, the NAs in row 3 should be filled from row 1, because both rows share the key 1a1 once the last letter of v3 (m or n) is ignored. So the desired output would be:
v1 v2 v3 v4 v5 v6
1 1 a 1m 1 5 9
2 2 a 1m 2 6 10
3 1 a 1n 1 5 9
4 2 a 1n 2 6 10
5 1 b 2o 3 7 11
6 2 b 2o 4 8 12
7 1 b 2p 3 7 11
8 2 b 2p 4 8 12
I don't know but I think this is a simpler way of producing your results
library(tidyverse)
df %>%
group_by(v1,v2) %>%
fill(v4:v6)
Adding the v3 logic
df %>%
mutate(v7 = v3 %>% as.character() %>% parse_number()) %>%
group_by(v1,v2,v7) %>%
fill(v4:v6) %>%
select(-v7)
Here is a solution that recodes v3 into a variable that only takes the numeric part into account.
library(dplyr)
library(tidyr) # provides fill()
library(stringr)
# Extract the numeric part of the string in v3
df$v7 <- str_extract(df$v3, "[[:digit:]]+")
df %>%
group_by(v1,v2,v7) %>%
fill(v4:v6)
Here's a solution using data.table and zoo which ignores v3 column's last letter:
library(data.table)
setDT(df)[, match_cols := paste0(v1, v2, substr(v3, 1, nchar(as.character(v3)) - 1))
][, id := .GRP, by = match_cols
][, v4 := zoo::na.locf(v4, na.rm = FALSE), by = id
][, v5 := zoo::na.locf(v5, na.rm = FALSE), by = id
][, v6 := zoo::na.locf(v6, na.rm = FALSE), by = id
][, c("match_cols", "id") := NULL]
df
# v1 v2 v3 v4 v5 v6
#1: 1 a 1m 1 5 9
#2: 2 a 1m 2 6 10
#3: 1 a 1n 1 5 9
#4: 2 a 1n 2 6 10
#5: 1 b 2o 3 7 11
#6: 2 b 2o 4 8 12
#7: 1 b 2p 3 7 11
#8: 2 b 2p 4 8 12
Using na.locf from zoo
library(zoo)
library(data.table)
setDT(df)[, na.locf(.SD),.(v1, v2)]
# v1 v2 v3 v4 v5 v6
#1: 1 a 1m 1 5 9
#2: 1 a 1n 1 5 9
#3: 2 a 1m 2 6 10
#4: 2 a 1n 2 6 10
#5: 1 b 2o 3 7 11
#6: 1 b 2p 3 7 11
#7: 2 b 2o 4 8 12
#8: 2 b 2p 4 8 12
If we want to add the condition in 'v3'
setDT(df)[, names(df)[4:6] := na.locf(.SD),.(v1, v2, sub("\\D+", "", v3))][]
# v1 v2 v3 v4 v5 v6
#1: 1 a 1m 1 5 9
#2: 2 a 1m 2 6 10
#3: 1 a 1n 1 5 9
#4: 2 a 1n 2 6 10
#5: 1 b 2o 3 7 11
#6: 2 b 2o 4 8 12
#7: 1 b 2p 3 7 11
#8: 2 b 2p 4 8 12

data.table merge() with NA in by column

I'm trying to join two tables where the join column has some NA values, such that when an NA is encountered the record is padded with NAs, i.e.
Given:
> x = data.table(c(1,2,3,NA,5), c("a","b","c","d","e"))
> x
V1 V2
1: 1 a
2: 2 b
3: 3 c
4: NA d
5: 5 e
> y = data.table(c(NA,2,3,4,5), c("A","B","C","D","E"))
> y
V1 V2
1: NA A
2: 2 B
3: 3 C
4: 4 D
5: 5 E
I want my output to be:
> z = data.table(c(NA,NA,1,2,3,4,5),c("d",NA,"a","b","c",NA,"e"),c(NA,"A",NA,"B","C","D","E"))
> z
V1 V2 V3
1: NA d NA
2: NA NA A
3: 1 a NA
4: 2 b B
5: 3 c C
6: 4 NA D
7: 5 e E
I thought merge() could be used to do this. But I can't get it to produce the output I expect:
> merge(x,y, by=c("V1"), all=TRUE)
V1 V2.x V2.y
1: NA d A
2: 1 a NA
3: 2 b B
4: 3 c C
5: 4 NA D
6: 5 e E
I really don't like that it merges based on the NA value as if it was a match, and when I do this in a larger table with several NA's, it seems to iterate over all possible combinations of column values for V1 and V2 given an NA key. Any help would be appreciated.
The data.frame method of merge() has an incomparables argument, which the data.table method of merge() doesn't have.
So, using the data.frame method:
merge.data.frame(x, y, by = "V1", all = TRUE, incomparables = NA)
gives the intended result:
V1 V2.x V2.y
1 1 a <NA>
2 2 b B
3 3 c C
4 4 <NA> D
5 5 e E
6 NA d <NA>
7 NA <NA> A
NOTE: According to a GitHub issue, the data.table developers are planning to add an incomparables argument to merge.data.table in the future.
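The difference is easy to check on plain data frames. The sketch below rebuilds x and y from the question with data.frame() instead of data.table(), so that merge() dispatches to the data.frame method, and compares the row counts with and without incomparables = NA. The names default_merge and na_safe_merge are illustrative.

```r
x <- data.frame(V1 = c(1, 2, 3, NA, 5), V2 = c("a", "b", "c", "d", "e"))
y <- data.frame(V1 = c(NA, 2, 3, 4, 5), V2 = c("A", "B", "C", "D", "E"))

# Default: the NA keys are treated as equal and matched to each other
default_merge <- merge(x, y, by = "V1", all = TRUE)

# incomparables = NA: NA keys never match, so each NA row is kept on its own
na_safe_merge <- merge(x, y, by = "V1", all = TRUE, incomparables = NA)

print(nrow(default_merge))
print(nrow(na_safe_merge))
```

The default merge collapses the two NA keys into one joined row (six rows in total), while the second call keeps them apart and yields the seven rows asked for in the question.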

Replace NAs in specific columns by the median of the same columns

I would like to replace the NAs in v1 to v4 with the median of those same columns, computed per row (per id).
Here is some example data:
library(dplyr)
id <- c(1,2,3,4)
v1 <- c(1,3,0,2)
v2 <- c(NA,1,NA,2)
v3 <- c(2,4,1,2)
v4 <- c(NA,1,0,2)
v5 <- c(5,1,NA,2)
v6 <- c(7,1,9,NA)
df <- data.frame(id, v1, v2, v3,v4,v5,v6)
df_pre <- df %>% group_by(id) %>% mutate(Median_v1_v4 = median(c(v1,v2,v3,v4), na.rm=TRUE))
This is what data looks like now:
id v1 v2 v3 v4 v5 v6 Median_v1_v4
1 1 NA 2 NA 5 7 1.5
2 3 1 4 1 1 1 2.0
3 0 NA 1 0 NA 9 0.0
4 2 2 2 2 2 NA 2.0
This is what I want the data to look like:
id v1 v2 v3 v4 v5 v6 Median_v1_v4
1 1 1.5 2 1.5 5 7 1.5
2 3 1.0 4 1.0 1 1 2.0
3 0 0.0 1 0.0 NA 9 0.0
4 2 2.0 2 2.0 2 NA 2.0
What about this solution:
df[,2:5] <- t( apply(df[,2:5], 1, function(x) {
x[is.na(x)] <- median(x, na.rm = TRUE)
return(x)}
) )
df
id v1 v2 v3 v4 v5 v6
1 1 1 1.5 2 1.5 5 7
2 2 3 1.0 4 1.0 1 1
3 3 0 0.0 1 0.0 NA 9
4 4 2 2.0 2 2.0 2 NA
Adjusted from: Replace NA values by row means
PS: I saw the comment (@Sai Saran) too late; this is an adjustment of the solution in the link above.
You can try
library(tidyverse)
df %>%
gather(k, v, -id) %>%
group_by(id) %>%
mutate(Median=median(v[k %in% c("v1", "v2", "v3","v4")], na.rm = T)) %>%
mutate(v=ifelse(is.na(v) & k %in% c("v1", "v2", "v3","v4"), Median, v)) %>%
spread(k, v)
# A tibble: 4 x 8
# Groups: id [4]
id Median v1 v2 v3 v4 v5 v6
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1.5 1 1.5 2 1.5 5 7
2 2 2 3 1 4 1 1 1
3 3 0 0 0 1 0 NA 9
4 4 2 2 2 2 2 2 NA
Take a look into this code.
library(tidyverse)
id <- c(1,2,3,4)
v1 <- c(1,3,0,2)
v2 <- c(NA,1,NA,2)
v3 <- c(2,4,1,2)
v4 <- c(NA,1,0,2)
v5 <- c(5,1,NA,2)
v6 <- c(7,1,9,NA)
df <- data.frame(id, v1, v2, v3,v4,v5,v6)
df_pre <- df %>%
group_by(id) %>%
mutate(Median_v1_v4 = median(c(v1,v2,v3,v4), na.rm=TRUE))
df_pre %>%
# funs() is deprecated since dplyr 0.8.0; a formula (lambda) does the same job
mutate_at(vars(v1,v2,v3,v4),
~ replace(., is.na(.), Median_v1_v4)) -> df_pre
