Join specific columns of matching rows - r

I have this data frame:
patientcA 1 2 NA NA b c
patientcB NA NA 3 4 b c
patientdA 3 3 NA NA d e
patientdB NA NA 5 6 d e
How can I join columns 2, 3, 4 and 5 for those rows that match in column 1 except for the last character? In this case, the first two rows match except for the last character, and so do the last two. So my expected output would be:
patientcA 1 2 3 4 b c
patientcB 1 2 3 4 b c
patientdA 3 3 5 6 d e
patientdB 3 3 5 6 d e
I have tried something like this, but I don't know what to write as the else argument. Moreover, I don't think this is the best approach:
new_data$first_column<-ifelse(grepl('A$', original_data$first), original_data$first, ?)

You might consider a tidyverse approach that uses separate() to put the last character of column 1 into a new column, and fill() to replace NA with values for the same patient.
library(tidyverse)
df %>%
separate(V1, into = c("patient", "letter"), sep = -1) %>%
group_by(patient) %>%
fill(V2:V5, .direction = "downup")
Output
patient letter V2 V3 V4 V5 V6 V7
<chr> <chr> <int> <int> <int> <int> <chr> <chr>
1 patientc A 1 2 3 4 b c
2 patientc B 1 2 3 4 b c
3 patientd A 3 3 5 6 d e
4 patientd B 3 3 5 6 d e

You could write a vectorized function like CC() below that completes columns, then split-apply-combine with by().
CC <- Vectorize(function(x) if (any(is.na(x))) rep(x[!is.na(x)], length(x)) else x)
res <- do.call(rbind.data.frame, by(dat, substr(dat$V1, 8, 8), CC))
res
# V1 V2 V3 V4 V5 V6 V7
# c.1 patientcA 1 2 3 4 b c
# c.2 patientcB 1 2 3 4 b c
# d.1 patientdA 3 3 5 6 d e
# d.2 patientdB 3 3 5 6 d e
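Neither answer defines its input, so here is a minimal base R sketch that can be run on its own. It assumes the question's data sits in a data frame called dat with columns V1 to V7 (names taken from the printed output above, not confirmed by the asker), and that each patient has at most one non-NA value per column. The helper names grp and fill_cols are made up for illustration.

```r
# Sample data rebuilt from the question (column names V1..V7 are assumed)
dat <- data.frame(
  V1 = c("patientcA", "patientcB", "patientdA", "patientdB"),
  V2 = c(1, NA, 3, NA),
  V3 = c(2, NA, 3, NA),
  V4 = c(NA, 3, NA, 5),
  V5 = c(NA, 4, NA, 6),
  V6 = c("b", "b", "d", "d"),
  V7 = c("c", "c", "e", "e")
)

# Patient key: V1 without its last character
grp <- sub(".$", "", dat$V1)

# Within each patient, replace every value by the first non-NA value
fill_cols <- c("V2", "V3", "V4", "V5")
dat[fill_cols] <- lapply(dat[fill_cols], function(col) {
  ave(col, grp, FUN = function(x) x[!is.na(x)][1])
})

dat
```

Using sub() to drop the last character avoids hard-coding its position, which may matter if patient names differ in length.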

How can I stack my dataset so each observation relates to all other observations but itself?

I would like to stack my dataset so all observations relate to all other observations but itself.
Suppose I have the following dataset:
df <- data.frame(id = c("a", "b", "c", "d" ),
x1 = c(1,2,3,4))
df
id x1
1 a 1
2 b 2
3 c 3
4 d 4
I would like observation a to be related to b, c, and d, and the same for every other observation. The result should look something like this:
id x1 id2 x2
1 a 1 b 2
2 a 1 c 3
3 a 1 d 4
4 b 2 a 1
5 b 2 c 3
6 b 2 d 4
7 c 3 a 1
8 c 3 b 2
9 c 3 d 4
10 d 4 a 1
11 d 4 b 2
12 d 4 c 3
So observation a is related to b, c, d. Observation b is related to a, c, d. And so on. Any ideas?
Another option:
library(dplyr)
left_join(df, df, by = character()) %>%
filter(id.x != id.y)
Or
output <- merge(df, df, by = NULL)
output = output[output$id.x != output$id.y,]
Thanks @ritchie-sacramento, I didn't know the by = NULL option for merge before, and thanks @zephryl for the by = character() option for dplyr joins.
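As far as I know, dplyr 1.1.0 and later deprecate by = character() in favour of a dedicated cross_join() function. A minimal sketch of the same idea, assuming a dplyr version that provides cross_join():

```r
library(dplyr)

df <- data.frame(id = c("a", "b", "c", "d"),
                 x1 = c(1, 2, 3, 4))

# Every row paired with every row, then drop the self-pairs
pairs <- cross_join(df, df) %>%
  filter(id.x != id.y)

pairs
```

The shared column names get the default .x and .y suffixes, so the result has columns id.x, x1.x, id.y and x1.y.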
tidyr::expand_grid() accepts data frames, which can then be filtered to remove rows that share the id:
library(tidyr)
library(dplyr)
expand_grid(df, df, .name_repair = make.unique) %>%
filter(id != id.1)
# A tibble: 12 × 4
id x1 id.1 x1.1
<chr> <dbl> <chr> <dbl>
1 a 1 b 2
2 a 1 c 3
3 a 1 d 4
4 b 2 a 1
5 b 2 c 3
6 b 2 d 4
7 c 3 a 1
8 c 3 b 2
9 c 3 d 4
10 d 4 a 1
11 d 4 b 2
12 d 4 c 3
You can use combn() to get all combinations of row indices, then assemble your dataframe from those:
rws <- cbind(combn(nrow(df), 2), combn(nrow(df), 2, rev))
df2 <- cbind(df[rws[1, ], ], df[rws[2, ], ])
# clean up row and column names
rownames(df2) <- 1:nrow(df2)
colnames(df2) <- c("id", "x1", "id2", "x2")
df2
id x1 id2 x2
1 a 1 b 2
2 a 1 c 3
3 a 1 d 4
4 b 2 c 3
5 b 2 d 4
6 c 3 d 4
7 b 2 a 1
8 c 3 a 1
9 d 4 a 1
10 c 3 b 2
11 d 4 b 2
12 d 4 c 3

How to remove rows in a data frame that contain all zeros, all NAs, or a combination of zeros and NAs in R

I have a large data frame with 10000000 rows and 150 columns. In the dataset, there are specific rows that contain all zeros, all NAs, or a combination of zeros and NAs. A sample data frame is shown below:
df <- data.frame(x = c('q', 'w', 'e', 'r', 't', 'y'),
                 a = c('a', 'b', 'c', 'd', 'e', 'f'),
                 b = c(0, 1, 2, 3, 0, 5),
                 c = c(0, 3, 2, 4, 0, 'NA'),
                 d = c(0, 2, 5, 7, 'NA', 5),
                 e = c(0, 5, 'NA', 3, 0, 'NA'),
                 f = c(0, 7, 4, 3, 'NA', 7))
The desired output is as follows:
df1 <- data.frame(x = c('w', 'e', 'r', 'y'),
                  a = c('b', 'c', 'd', 'f'),
                  b = c(1, 2, 3, 5),
                  c = c(3, 2, 4, 'NA'),
                  d = c(2, 5, 7, 5),
                  e = c(5, 'NA', 3, 'NA'),
                  f = c(7, 4, 3, 7))
i.e.
df <-
w b 1 3 2 5 7
e c 2 2 5 NA 4
r d 3 4 7 3 3
y f 5 NA 5 NA 7
I tried multiple possible solutions from Stack Overflow, such as
df %>%
filter(if_all(everything(), ~ !is.na(.x)))
or
df %>%
filter_if(is.numeric,
~ !is.na(.))
but could not solve the problem.
You can use apply() row-wise, combining all() and na.omit():
df[apply(df[,-c(1,2)],1,\(r) all(na.omit(r)!=0)),]
Output:
x a b c d e f
2 w b 1 3 2 5 7
3 e c 2 2 5 NA 4
4 r d 3 4 7 3 3
6 y f 5 NA 5 NA 7
We may use vectorized operations, as it is a big dataset:
library(dplyr)
df %>%
filter(!if_all(where(is.numeric), ~ is.na(.x)|.x %in% 0))
Output:
x a b c d e f
1 w b 1 3 2 5 7
2 e c 2 2 5 NA 4
3 r d 3 4 7 3 3
4 y f 5 NA 5 NA 7
Or with data.table:
library(data.table)
setDT(df)[df[, !Reduce(`&`, lapply(.SD, \(x) is.na(x)| x %in% 0)),
.SDcols = is.numeric]]
Output:
x a b c d e f
<char> <char> <num> <char> <char> <char> <char>
1: w b 1 3 2 5 7
2: e c 2 2 5 NA 4
3: r d 3 4 7 3 3
4: y f 5 NA 5 NA 7
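One thing worth noting about the sample data: 'NA' in quotes is a character string, not a missing value, so columns c to f are stored as character (the <char> headers in the data.table output reflect this). The sketch below assumes the real data holds genuine numeric columns with true NA values, and shows a base R filter built on rowSums(). The names num and keep are only illustrative.

```r
# Same sample data, but with real NA values so c..f stay numeric (an assumption)
df <- data.frame(
  x = c("q", "w", "e", "r", "t", "y"),
  a = c("a", "b", "c", "d", "e", "f"),
  b = c(0, 1, 2, 3, 0, 5),
  c = c(0, 3, 2, 4, 0, NA),
  d = c(0, 2, 5, 7, NA, 5),
  e = c(0, 5, NA, 3, 0, NA),
  f = c(0, 7, 4, 3, NA, 7)
)

# Numeric columns as a matrix
num <- as.matrix(df[vapply(df, is.numeric, logical(1))])

# Keep a row if at least one numeric value is neither NA nor zero
keep <- rowSums(!is.na(num) & num != 0) > 0

res <- df[keep, ]
res
```

Because the test is a single matrix operation rather than a per-row function call, it may scale better than apply() on 10 million rows, though that is worth measuring on the real data.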

Keeping all NAs in dplyr distinct function

I have a data.frame (the eBird basic dataset) in which many observers may upload a record of the same sighting to the database. In that case the event is given a "group identifier"; when the record is not from a group session, an NA appears instead. I'm trying to filter out the duplicates from group events while keeping all the NAs, and I'd like to do this without splitting the data frame in two:
library(dplyr)
set.seed(1)
df <- tibble(
x = sample(c(1:6, NA), 30, replace = T),
y = sample(c(letters[1:4]), 30, replace = T)
)
df %>% count(x,y)
gives:
> df %>% count(x,y)
# A tibble: 20 x 3
x y n
<int> <chr> <int>
1 1 a 1
2 1 b 2
3 2 a 1
4 2 b 1
5 2 c 1
6 2 d 3
7 3 a 1
8 3 b 1
9 3 c 4
10 4 d 1
11 5 a 1
12 5 b 2
13 5 c 1
14 5 d 1
15 6 a 1
16 6 c 2
17 NA a 1
18 NA b 2
19 NA c 2
20 NA d 1
I don't want rows with NA in x to be grouped together, as happened here with the "NA b" and "NA c" combinations. The distinct function has no option for leaving NAs out of the comparison. Is splitting the data frame the only solution?
With distinct, an option is to create a new column based on the NA elements in 'x':
library(dplyr)
df %>%
mutate(x1 = row_number() * is.na(x)) %>%
distinct %>%
select(-x1)
Or we can use duplicated with an OR (|) condition inside filter to keep all NA elements in 'x':
df %>%
filter(is.na(x)|!duplicated(cur_data()))
# A tibble: 20 x 2
# x y
# <int> <chr>
# 1 1 b
# 2 4 b
# 3 NA a
# 4 1 d
# 5 2 c
# 6 5 a
# 7 NA d
# 8 3 c
# 9 6 b
#10 2 b
#11 3 b
#12 1 c
#13 5 d
#14 2 d
#15 6 d
#16 2 a
#17 NA c
#18 NA a
#19 1 a
#20 5 b
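To my knowledge, cur_data() is deprecated from dplyr 1.1.0 onwards in favour of pick(). The sketch below shows the same filter written with pick(everything()) on a small made-up tibble (not the asker's data), assuming a dplyr version that provides pick():

```r
library(dplyr)

# Small illustrative data: one duplicated non-NA row, two identical NA rows
df <- tibble(
  x = c(1, 1, NA, NA, 2),
  y = c("a", "a", "b", "b", "c")
)

# Keep every row with NA in x, and only the first copy of all other rows
res <- df %>%
  filter(is.na(x) | !duplicated(pick(everything())))

res
```

Here the second "1 a" row is dropped while both "NA b" rows survive.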

Remove semi duplicate rows in R

I have the following data.frame.
a <- c(rep("A", 3), rep("B", 3), rep("C",2), "D")
b <- c(NA,1,2,4,1,NA,2,NA,NA)
c <- c(1,1,2,4,1,1,2,2,2)
d <- c(1,2,3,4,5,6,7,8,9)
df <-data.frame(a,b,c,d)
a b c d
1 A NA 1 1
2 A 1 1 2
3 A 2 2 3
4 B 4 4 4
5 B 1 1 5
6 B NA 1 6
7 C 2 2 7
8 C NA 2 8
9 D NA 2 9
I want to remove duplicate rows (based on columns a and c) so that the rows with values in column b are kept. In this example, rows 1, 6, and 8 are removed.
One way to do this is to order by 'a', 'b' and a logical vector based on 'b', so that the 'NA' elements come last within each group of 'a'. Then apply duplicated on 'a' and 'c' and keep only the non-duplicate elements:
df1 <- df[order(df$a, df$b, is.na(df$b)),]
df2 <- df1[!duplicated(df1[c('a', 'c')]),]
df2
# a b c d
#2 A 1 1 2
#3 A 2 2 3
#5 B 1 1 5
#4 B 4 4 4
#7 C 2 2 7
#9 D NA 2 9
setdiff(seq_len(nrow(df)), row.names(df2) )
#[1] 1 6 8
First create two datasets, one with the values of column a that occur more than once and one with the values that occur only once:
x = df[df$a %in% names(which(table(df$a) > 1)), ]
x1 = df[df$a %in% names(which(table(df$a) ==1)), ]
Now use the na.omit function on x to delete the rows with NA, and then rbind x and x1 into the final data set.
rbind(na.omit(x),x1)
Answer:
a b c d
2 A 1 1 2
3 A 2 2 3
4 B 4 4 4
5 B 1 1 5
7 C 2 2 7
9 D NA 2 9
You can use dplyr to do this.
df %>% distinct(a, c, .keep_all = TRUE)
Output
a b c d
1 A NA 1 1
2 A 2 2 3
3 B 4 4 4
4 B 1 1 5
5 C 2 2 7
6 D NA 2 9
There are other options in dplyr, check this question for details: Remove duplicated rows using dplyr
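Note that distinct() keeps the first row of each (a, c) combination, which is why row 1 (with NA in b) survives in the output above. A possible adjustment, sketched here under the assumption that any non-NA b should win over an NA, is to sort the NA rows to the end of each combination before calling distinct():

```r
library(dplyr)

df <- data.frame(
  a = c(rep("A", 3), rep("B", 3), rep("C", 2), "D"),
  b = c(NA, 1, 2, 4, 1, NA, 2, NA, NA),
  c = c(1, 1, 2, 4, 1, 1, 2, 2, 2),
  d = c(1, 2, 3, 4, 5, 6, 7, 8, 9)
)

# Within each (a, c) combination, put rows with NA in b last,
# then keep the first row of each combination
res <- df %>%
  arrange(a, c, is.na(b)) %>%
  distinct(a, c, .keep_all = TRUE)

res
```

This drops original rows 1, 6 and 8, as requested, and keeps row 9 because it is the only row for its combination. The row order follows the sort rather than the original order.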

Keep rows in dataframe for the last n appearances of a value in a column

I have a fairly large dataframe with about 1M columns, and I need to remove many rows from it. It is difficult to describe in the title alone, but easier to show an example and then explain:
temp = data.frame(a = c(1,1,1,1,1,2,2,2,2,3,3,3,3,3,3), b = LETTERS[1:15])
temp
a b
1 1 A
2 1 B
3 1 C
4 1 D
5 1 E
6 2 F
7 2 G
8 2 H
9 2 I
10 3 J
11 3 K
12 3 L
13 3 M
14 3 N
15 3 O
With this, I want to keep only the rows corresponding to the last 3 appearances of each unique number in column a. That is, I am trying to obtain a dataframe that looks like this:
my_final_df
a b
1 1 C
2 1 D
3 1 E
4 2 G
5 2 H
6 2 I
7 3 M
8 3 N
9 3 O
For my full dataframe, data anywhere besides the last 3 rows for a certain number in the 'a' column is noise, which is why I want to remove it. I think I need to create a boolean vector of some sort to do this, and then subset my_df with the boolean vector, but I'm not sure how.
We can do this compactly in data.table
library(data.table)
setDT(temp)[, tail(.SD, 3) , a]
# a b
#1: 1 C
#2: 1 D
#3: 1 E
#4: 2 G
#5: 2 H
#6: 2 I
#7: 3 M
#8: 3 N
#9: 3 O
Or an option using tidyverse with top_n
library(tidyverse)
temp %>%
group_by(a) %>%
top_n( 3, rank(row_number()))
# a b
# <dbl> <fctr>
#1 1 C
#2 1 D
#3 1 E
#4 2 G
#5 2 H
#6 2 I
#7 3 M
#8 3 N
#9 3 O
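In dplyr 1.0.0 and later, top_n() is superseded and slice_tail() expresses "last n rows per group" directly. A minimal sketch, assuming such a dplyr version is installed:

```r
library(dplyr)

temp <- data.frame(a = c(1,1,1,1,1,2,2,2,2,3,3,3,3,3,3),
                   b = LETTERS[1:15])

# Last 3 rows of each group, in their original order
res <- temp %>%
  group_by(a) %>%
  slice_tail(n = 3) %>%
  ungroup()

res
```

Groups with fewer than 3 rows are returned whole, which is probably the desired behaviour here.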
With dplyr we can group by a and select the last 3 rows using slice and tail.
library(dplyr)
temp %>%
group_by(a) %>%
slice(tail(1:n(), 3))
# a b
# <dbl> <fctr>
#1 1 C
#2 1 D
#3 1 E
#4 2 G
#5 2 H
#6 2 I
#7 3 M
#8 3 N
#9 3 O
You can split by a and then keep last three rows for each sub group
do.call(rbind, lapply(split(temp, temp$a), function(x) tail(x,3)))
# a b
#1.3 1 C
#1.4 1 D
#1.5 1 E
#2.7 2 G
#2.8 2 H
#2.9 2 I
#3.13 3 M
#3.14 3 N
#3.15 3 O
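Since the question asks about building a boolean vector, here is a base R sketch of that idea using ave(). It counts each row's position from the end of its group and keeps positions 1 to 3. The name keep is only illustrative.

```r
temp <- data.frame(a = c(1,1,1,1,1,2,2,2,2,3,3,3,3,3,3),
                   b = LETTERS[1:15])

# For each row, its position counted from the end of its group in a
from_end <- ave(seq_along(temp$a), temp$a,
                FUN = function(i) rev(seq_along(i)))

# TRUE for the last 3 rows of each group
keep <- from_end <= 3

res <- temp[keep, ]
res
```

Unlike the split-and-rbind version, this keeps the original row names (3, 4, 5, 7, ...) and does not depend on the rows of a group being contiguous.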
