Remove duplicate elements by row in a data frame - r

I need to replace duplicate elements with NA, row by row, in a data frame, with the remaining values shifted to the left. Base R, tidyverse, or data.table solutions are all welcome. Thank you. Example:
library(tibble)
#input data.frame
tribble(
~x, ~y, ~z,
1, 2, 3,
1, 1, NA,
4, 1, 4,
2, 2, 3
)
#desired output
tribble(
~x, ~y, ~z,
1, 2, 3,
1, NA, NA,
4, 1, NA,
2, 3, NA
)

Here is a base R option: loop over the rows with apply, replace the duplicated elements with NA, concatenate (c) the non-NA elements followed by the NA elements, transpose (t) the result, and assign it back to the original dataset (here called df1):
df1[] <- t(apply(df1, 1, function(x) {
x1 <- replace(x, duplicated(x), NA)
c(x1[!is.na(x1)], x1[is.na(x1)])
}))
df1
# A tibble: 4 x 3
# x y z
# <dbl> <dbl> <dbl>
#1 1 2 3
#2 1 NA NA
#3 4 1 NA
#4 2 3 NA
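As a self-contained variant of the approach above, the sketch below uses a plain data.frame and a helper named dedupe_row (a hypothetical name, not from the original answer) that keeps the first occurrence of each non-NA value and pads the rest of the row with NA. It assumes all columns share one type, since apply converts the data frame to a matrix.

```r
# Input data from the question, as a base data.frame
df1 <- data.frame(
  x = c(1, 1, 4, 2),
  y = c(2, 1, 1, 2),
  z = c(3, NA, 4, 3)
)

# Hypothetical helper: keep the first occurrence of each non-NA value,
# then pad with NA so the row keeps its original length
dedupe_row <- function(x) {
  kept <- unique(x[!is.na(x)])
  c(kept, rep(NA, length(x) - length(kept)))
}

# apply() returns one column per input row, so transpose before assigning back
df1[] <- t(apply(df1, 1, dedupe_row))
df1
```

Because unique drops repeats and the NA padding is appended afterwards, the surviving values end up left-aligned, as in the desired output.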

Related

Get last entry of a range with identical numbers in R, vectorized

I’ve got this data:
tribble(
~ranges, ~last,
0, NA,
1, NA,
1, NA,
1, NA,
1, NA,
2, NA,
2, NA,
2, NA,
3, NA,
3, NA
)
and I want to fill the last column only in the row where each value of the ranges column occurs for the last time. That means it should look like this:
tribble(
~ranges, ~last,
0, 0,
1, NA,
1, NA,
1, NA,
1, 1,
2, NA,
2, NA,
2, 2,
3, NA,
3, 3
)
So far I came up with a row-wise approach:
for (r in seq.int(max(tmp$ranges))) {
print(r)
range <- which(tmp$ranges == r) |> max()
tmp$last[range] <- r
}
The main issue is that it is terribly slow. I am looking for a vectorized approach to this issue. Any creative solution out there?
Here's a dplyr solution:
library(dplyr)
tmp %>%
group_by(ranges) %>%
mutate(
last = case_when(row_number() == n() ~ ranges, TRUE ~ NA_real_)
) %>%
ungroup()
# # A tibble: 10 × 2
# ranges last
# <dbl> <dbl>
# 1 0 0
# 2 1 NA
# 3 1 NA
# 4 1 NA
# 5 1 1
# 6 2 NA
# 7 2 NA
# 8 2 2
# 9 3 NA
# 10 3 3
Or we could do something clever with base R for the same result. Here we calculate the difference of ranges to identify when the next row is different (i.e., the last of a group). We then stick a TRUE on the end so the last row is included. This assumes your data is already sorted by ranges.
tmp$last = ifelse(c(diff(tmp$ranges) != 0, TRUE), tmp$ranges, NA)
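If the data might not be sorted by ranges, one possible alternative is to flag the last occurrence of each value with duplicated and its fromLast argument. This is a sketch, not part of the original answer; the tmp data frame is rebuilt here from the question's example so the block runs on its own.

```r
# Example data from the question
tmp <- data.frame(
  ranges = c(0, 1, 1, 1, 1, 2, 2, 2, 3, 3),
  last = NA_real_
)

# duplicated(..., fromLast = TRUE) scans from the end, so the final
# occurrence of each value is the only one NOT marked as a duplicate
is_last <- !duplicated(tmp$ranges, fromLast = TRUE)

# Copy the value into last only on those rows
tmp$last <- ifelse(is_last, tmp$ranges, NA)
tmp
```

Unlike the diff version, this marks the last occurrence of a value anywhere in the column, so on unsorted data the two approaches give different results.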
Using replace:
library(dplyr)
df %>%
group_by(ranges) %>%
mutate(last = replace(last, n(), ranges[n()]))
Using ifelse:
library(dplyr)
df %>%
group_by(ranges) %>%
mutate(last = ifelse(row_number() == n(), ranges, NA))
Using tail:
library(dplyr)
df %>%
group_by(ranges) %>%
mutate(last = c(last[-n()], tail(ranges, 1)))
output
ranges last
<dbl> <dbl>
1 0 0
2 1 NA
3 1 NA
4 1 NA
5 1 1
6 2 NA
7 2 NA
8 2 2
9 3 NA
10 3 3

Different cells between two data frames

I need the differences between two data frames. setdiff() gives me the modified and new rows, but it returns each modified row in full, whereas I want only the cells that differ. How can I do this? Assume the number of columns is the same.
Input data:
df1 <- data.frame(ID = c(1, 2, 3),
A = c(1, 2, 3),
B = c(1, 2, NA))
df2 <- data.frame(ID = c(1, 2, 3, 4),
A = c(1, 2, 3, 4),
B = c(1, 2, 3, NA))
newdata = setdiff(df2,df1) # does not give the result I expect
As a result it should be such dataframe:
result <- data.frame(ID = c(3, 4),
A = c(NA, 4),
B = c(3, NA))
Column ID should be preserved and always should contain value.
Summary:
The output should contain only new or modified rows from df2.
In modified rows, only the modified or new cells should be displayed.
Values in the ID column should be displayed even if they are not modified.
Would compare or compare_df help? How can I do this?
You can do this in separate steps, since you are applying different logic to different columns (ID versus A and B); I don't see how to achieve it in a single operation over all columns.
df1 <- data.frame(ID = c(1, 2, 3),
A = c(1, 2, 3),
B = c(1, 2, NA))
df2 <- data.frame(ID = c(1, 2, 3, 4),
A = c(1, 2, 3, 4),
B = c(1, 2, 3, NA))
newdata = setdiff(df2,df1)
newdata
ID A B
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 NA
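You can apply your logic to columns A and B, and not to ID. Note that df1 has fewer rows than df2, so each == comparison recycles df1's values and emits a warning: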
newdata$A[which(df2$A == df1$A)] <- NA
newdata$B[which(df2$B == df1$B)] <- NA
newdata
ID A B
1 1 NA NA
2 2 NA NA
3 3 NA 3
4 4 4 NA
newdata[3:4,]
There are wizards far better than me that might opine, but I see no way to do this in one pass with the ID restriction.
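For a version that does not depend on row order or recycling, one option is to align the two data frames by ID first and then blank the unchanged cells column by column. The following is a sketch under the assumption that ID is unique in both data frames; the names idx, value_cols, and changed are illustrative, not from the original answer.

```r
df1 <- data.frame(ID = c(1, 2, 3),
                  A = c(1, 2, 3),
                  B = c(1, 2, NA))
df2 <- data.frame(ID = c(1, 2, 3, 4),
                  A = c(1, 2, 3, 4),
                  B = c(1, 2, 3, NA))

# Row of df1 that corresponds to each row of df2 (NA when the ID is new)
idx <- match(df2$ID, df1$ID)

value_cols <- setdiff(names(df2), "ID")
result <- df2
changed <- is.na(idx)  # rows with a new ID are always kept

for (col in value_cols) {
  old <- df1[[col]][idx]
  new <- df2[[col]]
  # A cell is unchanged when both values are equal or both are NA
  same <- (!is.na(old) & !is.na(new) & old == new) | (is.na(old) & is.na(new))
  result[[col]][same] <- NA
  changed <- changed | !same
}

# Keep only new or modified rows; ID is left untouched
result <- result[changed, ]
rownames(result) <- NULL
result
```

Tracking changed separately from the blanked cells means a row whose value was changed to NA is still kept, which a check on the non-NA cells alone would miss.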

Concatenate tibbles with different columns

I want to concatenate two tibbles that have:
a few common columns,
a few columns with the same name but different values,
one column that is specific to each tibble.
I have created an example below. Can someone please show how to create the desired tibble?
Thanks
library(tidyverse)
# common columns in both tibble
x <- c(1, 2, 3)
y <- c(2, 3, 4)
# common column name and different value for each tibble
v <- c(15, 10, 20)
# specific column to tibble
t_a <- c(4, 5, 6)
tbl_a <- tibble(x, y, v, t_a)
# common column name and different value for each tibble
v <- c(7, 11, 13)
# specific column to tibble
t_b<- c(9, 14, 46)
tbl_b <- tibble(x, y, v, t_b)
# concatenate tbl such output looks like this
x <- c(1, 2, 3, 1, 2, 3)
y <- c(2, 3, 4, 2, 3, 4)
v <- c(15, 10, 20, 7, 11, 13)
t <- c(4, 5, 6, 9, 14, 46)
name <- c("a", "a", "a", "b", "b", "b")
# desired output
tbl <- tibble(x, y, v, t, name)
Here, we can bind the datasets together and use pivot_longer
library(dplyr)
library(tidyr)
bind_rows(tbl_a, tbl_b) %>%
pivot_longer(cols = c(t_a, t_b), names_to = c('.value', 'name'),
names_sep="_", values_to = 't', values_drop_na = TRUE)
Output:
# A tibble: 6 x 5
# x y v name t
# <dbl> <dbl> <dbl> <chr> <dbl>
#1 1 2 15 a 4
#2 2 3 10 a 5
#3 3 4 20 a 6
#4 1 2 7 b 9
#5 2 3 11 b 14
#6 3 4 13 b 46
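The same result can also be reached without reshaping, by renaming the table-specific column and tagging each table before stacking them. The sketch below uses base R data frames instead of tibbles, and standardise is a hypothetical helper name; it assumes the specific column always follows the t_<suffix> naming pattern from the question.

```r
x <- c(1, 2, 3)
y <- c(2, 3, 4)
tbl_a <- data.frame(x, y, v = c(15, 10, 20), t_a = c(4, 5, 6))
tbl_b <- data.frame(x, y, v = c(7, 11, 13), t_b = c(9, 14, 46))

# Hypothetical helper: rename t_<suffix> to t and record the suffix in name
standardise <- function(tbl, suffix) {
  names(tbl)[names(tbl) == paste0("t_", suffix)] <- "t"
  tbl$name <- suffix
  tbl
}

# After renaming, both tables have identical columns and can be stacked
tbl <- rbind(standardise(tbl_a, "a"), standardise(tbl_b, "b"))
tbl
```

This keeps the column order x, y, v, t, name from the desired output, whereas the pivot_longer version places name before t.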

Populating variable elements based on the value of another variable in R

I have a dataframe with some missing values. I want to fill these missing values based on the value of another variable in my dataframe but am not able to work out the code.
library(tidyr)
farm<- c(1, 1, 2, 3, 3, 3, 4)
region<- c(NA, NA, NA, NA, NA, NA, 'Woods')
test<- c('x', 'y', 'x', 'x', 'y', 'y',
'x')
i=1:2
j=3
df = data.frame(farm, region, test)
df
Here is the result:
farm region test
1 1 <NA> x
2 1 <NA> y
3 2 <NA> x
4 3 <NA> x
5 3 <NA> y
6 3 <NA> y
7 4 Woods x
I would like to populate region with "mac" if farm is 1 or 2, and with "sto" if farm is 3. I have tried the following code:
df <- transform(df,region=if (df$farm==i) "mac" else NA)
df
to get started but am getting:
farm region test
<dbl> <chr> <fctr>
1 mac x
1 mac y
2 mac x
3 mac x
3 mac y
3 mac y
4 mac x
As you can see it is populating "mac" beyond the 1 or 2 variable elements for farm. Any advice would be much appreciated.
Using base R (no packages are needed):
farm<- c(1, 1, 2, 3, 3, 3, 4)
region<- c(NA, NA, NA, NA, NA, NA, 'Woods')
test<- c('x', 'y', 'x', 'x', 'y', 'y',
'x')
df = data.frame(farm, region, test)
df$region <- ifelse(df$farm == 1|df$farm ==2,"mac",
ifelse(df$farm == 3, "sto", as.character(df$region)))
df
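You can also use nested ifelse calls to overwrite the region column. Note that with NA as the final fallback, rows that match neither condition (here farm 4, "Woods") become NA: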
df$region <- ifelse(df$farm == 1 | df$farm == 2,'mac',ifelse(df$farm == 3, 'sto',NA))
Using case_when() from dplyr:
library(dplyr)
df$region <- case_when(df$farm==1 | df$farm==2 ~ "mac",
df$farm==3 ~ "sto",
TRUE ~ as.character(df$region))
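When the number of farm-to-region rules grows, a named lookup vector can replace the chain of conditions. This is a sketch rather than one of the original answers; lookup and mapped are illustrative names, and stringsAsFactors = FALSE is set explicitly so that region stays a character column on older R versions.

```r
farm <- c(1, 1, 2, 3, 3, 3, 4)
region <- c(NA, NA, NA, NA, NA, NA, "Woods")
test <- c("x", "y", "x", "x", "y", "y", "x")
df <- data.frame(farm, region, test, stringsAsFactors = FALSE)

# Hypothetical lookup table: farm id (as a name) -> region label
lookup <- c("1" = "mac", "2" = "mac", "3" = "sto")

# Indexing by name returns NA for farms that are not in the table
mapped <- unname(lookup[as.character(df$farm)])

# Keep the existing region wherever no rule applies
df$region <- ifelse(is.na(mapped), df$region, mapped)
df
```

Adding a rule then means adding one entry to lookup rather than nesting another ifelse.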

Subsetting data in R to remove rows if values for two variables are NA

I want to remove all rows from my dataset that are NA in two columns. If a row has a non-NA value in either column, I want to keep it. How do I do this?
You can do this with filter from dplyr:
library(tidyverse)
df <- data.frame(a = c(2, 4, 6, NA, 3, NA),
b = c(5, 4, 8, NA, 6, 7))
df1 <- df %>%
filter(!is.na(a) | !is.na(b))
And you get:
> df1
a b
1 2 5
2 4 4
3 6 8
4 3 6
5 NA 7
Here are a couple of base R suggestions. Loop over the columns of the dataset with lapply, converting each to a logical vector with is.na; collapse those vectors elementwise with Reduce and `&` to flag the rows where every column is NA; then negate the result and use it to subset the dataset:
df[!Reduce(`&`, lapply(df, is.na)),]
Or apply rowSums to the logical matrix !is.na(df) to count the non-NA values in each row, and keep the rows where that count is greater than 0:
df[rowSums(!is.na(df))>0,]
data
df <- data.frame(a = c(2, 4, 6, NA, 3, NA),
b = c(5, 4, 8, NA, 6, 7))
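Both base R suggestions test every column of df. If the data frame has further columns and only two of them should be checked, the same rowSums idea can be limited to those columns. The sketch below adds an id column to the example data for illustration; cols and kept are assumed names.

```r
# Example data with an extra column that should not affect the filter
df <- data.frame(id = 1:6,
                 a = c(2, 4, 6, NA, 3, NA),
                 b = c(5, 4, 8, NA, 6, 7))

# Only these columns are checked for missing values
cols <- c("a", "b")

# Keep rows with at least one non-NA value among the chosen columns
kept <- df[rowSums(!is.na(df[cols])) > 0, ]
kept
```

Here row 4 is dropped because both a and b are NA, while row 6 is kept because b has a value.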
