How to remove duplicate pair-wise columns [duplicate] - r

Consider the following dataframe:
df <- data.frame(V1 = c("A", "A", "B", "B", "C", "C"),
V2 = c("B", "C", "A", "C", "A", "B"),
n = c(1, 3, 1, 2, 3, 2))
How can I remove rows that repeat the same pair in reverse order (for example A-B and B-A), so that the output looks like:
# V1 V2 n
#1 A B 1
#2 A C 3
#3 B C 2
I tried unique() and duplicated() to no avail.

Not sure if this is the simplest way of doing it (transposing can be computationally expensive) but this would work with your data frame:
df <- data.frame(V1 = c("A", "A", "B", "B", "C", "C"),
V2 = c("B", "C", "A", "C", "A", "B"),
n = c(1, 3, 1, 2, 3, 2))
First, sort the data frame row-wise, so your value-pairs become true duplicates.
df <- data.frame(t(apply(df, 1, sort)))
Then you can just apply the unique function.
df <- unique(df)
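If your column names and order are important, you'll have to re-establish those. Because apply() converts each row to character before sorting, the n values end up in the first column (digits sort ahead of the letters here) and are no longer numeric.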
names(df) <- c("n", "V1", "V2")
df <- df[, c("V1", "V2", "n")]
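A sketch of an alternative that avoids the character coercion, assuming the pair columns are character vectors (stringsAsFactors = FALSE is set explicitly for that reason): put each pair into a canonical order with pmin() and pmax(), then drop repeated keys with duplicated(). The names lo, hi and dedup are illustrative only.

```r
df <- data.frame(V1 = c("A", "A", "B", "B", "C", "C"),
                 V2 = c("B", "C", "A", "C", "A", "B"),
                 n  = c(1, 3, 1, 2, 3, 2),
                 stringsAsFactors = FALSE)

# Order each pair so that A-B and B-A produce the same key.
lo <- pmin(df$V1, df$V2)
hi <- pmax(df$V1, df$V2)

# Keep the first row seen for each unordered pair; n stays numeric.
dedup <- df[!duplicated(data.frame(lo, hi)), ]
dedup
```

This keeps the first occurrence of each pair, so the V1/V2 orientation of the surviving rows is whatever appeared first in the data.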

Another option would be to reshape the dataset ('df') to wide format with xtabs(n ~ ...), set the lower triangle of the resulting matrix to 0, and remove the rows with "Freq" equal to 0.
m1 <- xtabs(n~V1+V2, df)
m1[lower.tri(m1)] <- 0
subset(as.data.frame(m1), Freq!=0)
# V1 V2 Freq
#4 A B 1
#7 A C 3
#8 B C 2

Related

Replace Value in column based on another column With R [duplicate]

I'm trying to replace the value of a column based on the data in a different column, but it's not working. Here's some example data.
df <- data.frame(Col1 = 1:10,
Col2 = c("a", "a", "a", "b", "b", "c", "c", "d", "d", "d"),
Col3 = c("11%", "12%", "13%", "14%", "15%", "16%", "17%", "18%", "19%", "20%"))
If the value of Col2 is b, I need to change the value of Col3 to NA or 0 (NA is more accurate, but for what I'm doing, a 0 will also work). Col3 holds percentages; I know I stored them as strings here.
I tried doing this a few ways, most recently of which is the loop I have listed below. I'm open to any solution on this though. Is my loop not working because I'm not defining a pattern?
for(i in df){
if(df$Col2 == "b"){
str_replace(df$Col3, replacement = NA)
}
}
print(df)
Here's a base R solution:
df$Col3[df$Col2 == 'b'] <- NA
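As for why the loop in the question has no effect: for (i in df) iterates over the columns rather than the rows, df$Col2 == "b" is a length-10 logical vector (which if() rejects with an error in R 4.2.0 and later), str_replace() is called without a pattern, and its result is never assigned back. A minimal sketch of a row-wise loop that does work, shown only for comparison with the vectorised assignment; loop_df and vec_df are illustrative names.

```r
df <- data.frame(Col1 = 1:10,
                 Col2 = c("a", "a", "a", "b", "b", "c", "c", "d", "d", "d"),
                 Col3 = c("11%", "12%", "13%", "14%", "15%",
                          "16%", "17%", "18%", "19%", "20%"),
                 stringsAsFactors = FALSE)

# Row-wise loop: test one element at a time and assign the result back.
loop_df <- df
for (i in seq_len(nrow(loop_df))) {
  if (loop_df$Col2[i] == "b") {
    loop_df$Col3[i] <- NA
  }
}

# Vectorised equivalent: a logical index selects the rows to overwrite.
vec_df <- df
vec_df$Col3[vec_df$Col2 == "b"] <- NA

identical(loop_df, vec_df)
```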
Here's a dplyr/tidyverse solution:
library(dplyr)
df %>% mutate(Col3 = ifelse(Col2 == 'b',NA_character_,Col3))
(Original, but less efficient case_when solution)
df %>%
mutate(Col3 = case_when(Col2 == 'b' ~ NA_character_,
TRUE ~ Col3))
This gives us:
Col1 Col2 Col3
1 1 a 11%
2 2 a 12%
3 3 a 13%
4 4 b <NA>
5 5 b <NA>
6 6 c 16%
7 7 c 17%
8 8 d 18%
9 9 d 19%
10 10 d 20%
A dplyr solution with numeric percentages, using ifelse() instead of case_when():
library(dplyr)
df <- data.frame(Col1 = 1:10,
Col2 = c("a", "a", "a", "b", "b", "c", "c", "d", "d", "d"),
Col3 = seq(.11, .2, by = .01))
df %>%
mutate(Col3 = ifelse(Col2 == 'b', NA, Col3))
pkpto39,
Try this:
library('tidyverse')
df <- data.frame(Col1 = 1:10,
Col2 = c("a", "a", "a", "b", "b", "c", "c", "d", "d", "d"),
Col3 = c("11%", "12%", "13%", "14%", "15%", "16%", "17%", "18%", "19%", "20%"), stringsAsFactors = FALSE)
df <- df %>% mutate(Col3 = ifelse(Col2 == "b", NA, Col3))
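Since the question notes that Col3 really holds percentages, here is a hedged sketch that converts the strings to numeric proportions first and then blanks the "b" rows. It assumes every value has the form "NN%"; the name pct is illustrative.

```r
df <- data.frame(Col1 = 1:10,
                 Col2 = c("a", "a", "a", "b", "b", "c", "c", "d", "d", "d"),
                 Col3 = c("11%", "12%", "13%", "14%", "15%",
                          "16%", "17%", "18%", "19%", "20%"),
                 stringsAsFactors = FALSE)

# Strip the literal "%" and convert to a proportion between 0 and 1.
pct <- as.numeric(sub("%", "", df$Col3, fixed = TRUE)) / 100

# Blank out the rows where Col2 is "b".
pct[df$Col2 == "b"] <- NA

df$Col3 <- pct
df
```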

How to delete all the duplicates row based on two columns?

I have a data frame from which I want to delete duplicate rows, but only if the value in another column is the same for all of those rows. (To be clearer: I want to delete the duplicated rows of a "Name" when every one of its rows has the same "Number" value.)
Here is an example of my data frame:
df <- data.frame("Name" = c("a", "a", "b", "b", "b", "c", "c", "c"),
"Number" = c(1, 1, 1, 2, 3, 4, 5, 5), stringsAsFactors = FALSE)
And the result I expect is :
result <- data.frame("Name" = c("b", "b", "b", "c", "c", "c"),
"Number" = c(1, 2, 3, 4, 5, 5), stringsAsFactors = FALSE)
We can group_by Name and remove groups which have more than 1 row and have only one distinct value.
library(dplyr)
df %>%
group_by(Name) %>%
filter(!(n_distinct(Number) == 1 & n() > 1))
# Name Number
# <chr> <dbl>
#1 b 1
#2 b 2
#3 b 3
#4 c 4
#5 c 5
#6 c 5
and using base R ave, the same logic can be written as
df[with(df, !as.logical(ave(Number, Name, FUN = function(x)
length(unique(x)) == 1 & length(x) > 1))), ]
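To check the logic against the expected result given in the question, the ave() version can be unpacked step by step. This is a sketch using the question's data; drop_group, flag, out and expected are illustrative names.

```r
df <- data.frame(Name = c("a", "a", "b", "b", "b", "c", "c", "c"),
                 Number = c(1, 1, 1, 2, 3, 4, 5, 5),
                 stringsAsFactors = FALSE)

# TRUE for a group that has several rows but only one distinct Number.
drop_group <- function(x) length(unique(x)) == 1 && length(x) > 1

# ave() recycles the per-group answer to every row (stored as 0/1).
flag <- ave(df$Number, df$Name, FUN = drop_group)

out <- df[!as.logical(flag), ]
rownames(out) <- NULL

expected <- data.frame(Name = c("b", "b", "b", "c", "c", "c"),
                       Number = c(1, 2, 3, 4, 5, 5),
                       stringsAsFactors = FALSE)
all.equal(out, expected)
```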
Here is a solution with data.table
library("data.table")
df <- data.table("Name" = c("a", "a", "b", "b", "b"),
"Number" = c(1, 1, 2, 2, 3))
df[, if (uniqueN(Number)!=1 || .N==1) .SD, Name]
and here is a solution with base R:
df <- data.frame("Name" = c("a", "a", "b", "b", "b"),
"Number" = c(1, 1, 2, 2, 3), stringsAsFactors = FALSE)
df[as.logical(ave(df$Number, df$Name, FUN=function(x) length(unique(x))!=1 || length(x)==1)),]
We can use data.table methods
library(data.table)
setDT(df)[, .SD[uniqueN(Number) > 1] , Name]
# Name Number
#1: b 1
#2: b 2
#3: b 3
#4: c 4
#5: c 5
#6: c 5

Conditional search, match and replace values between data frames

I have two dataframes as shown below. I would like to replace text (cells) in dataframe 1 with corresponding values taken from dataframe 2 when there is a match. I have tried to give a simple example below.
I have some limited experience with R but can't think of an easy solution right away. Any help or suggestions would be much appreciated.
input_1 = data.frame(col1 = c("ex1", "ex2", "ex3", "ex4"),
col2 = c("A", "B", "C", "D"),
col3 = c("B", "E", "F", "D"))
input_2 = data.frame(colx = c("A", "B", "C", "D", "E", "F"),
coly = c(1, 2, 3, 4, 5, 6))
output = data.frame(col1 = c("ex1", "ex2", "ex3", "ex4"),
col2 = c(1, 2, 3, 4),
col3 = c(2, 5, 6, 4))
Here's a tidyverse solution :
library(tidyverse)
mutate_at(input_1, -1, ~deframe(input_2)[as.character(.)])
# col1 col2 col3
# 1 ex1 1 2
# 2 ex2 2 5
# 3 ex3 3 6
# 4 ex4 4 4
deframe builds a named vector from a two-column data frame, which is more convenient for lookups in this case.
as.character is necessary if the columns are factors (the data.frame() default before R 4.0.0).
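The named-vector idea behind deframe can be sketched in base R with setNames(); this is an illustration of the same lookup, not the answer's code, and lookup and out are illustrative names.

```r
input_1 <- data.frame(col1 = c("ex1", "ex2", "ex3", "ex4"),
                      col2 = c("A", "B", "C", "D"),
                      col3 = c("B", "E", "F", "D"),
                      stringsAsFactors = FALSE)
input_2 <- data.frame(colx = c("A", "B", "C", "D", "E", "F"),
                      coly = c(1, 2, 3, 4, 5, 6),
                      stringsAsFactors = FALSE)

# Named vector: names are the keys (colx), values are the replacements (coly).
lookup <- setNames(input_2$coly, input_2$colx)

# Index the lookup by name for every column except the first.
out <- input_1
out[-1] <- lapply(input_1[-1], function(x) unname(lookup[as.character(x)]))
out
```

Keys that are missing from the lookup come back as NA rather than raising an error.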
Example using tidyverse. My solution joins input_1 to input_2 twice, matching a different column each time. The last steps of the pipe keep only the needed columns and rename them.
library(tidyverse)
input_1 = data.frame(col1 = c("ex1", "ex2", "ex3", "ex4"),
col2 = c("A", "B", "C", "D"),
col3 = c("B", "E", "F", "D"))
input_2 = data.frame(colx = c("A", "B", "C", "D", "E", "F"),
coly = c(1, 2, 3, 4, 5, 6))
output = data.frame(col1 = c("ex1", "ex2", "ex3", "ex4"),
col2 = c(1, 2, 3, 4),
col3 = c(2, 5, 6, 4))
input_1 %>% inner_join(input_2, by = c("col2" = "colx")) %>%
inner_join(input_2, by = c("col3" = "colx")) %>%
select(col1, coly.x, coly.y) %>%
magrittr::set_colnames(c("col1", "col2", "col3"))
One approach using base R would be to loop over columns where we want to change values using lapply, match the values with input_2$colx and get the corresponding coly value.
input_1[-1] <- lapply(input_1[-1], function(x) input_2$coly[match(x, input_2$colx)])
input_1
# col1 col2 col3
#1 ex1 1 2
#2 ex2 2 5
#3 ex3 3 6
#4 ex4 4 4
Actually, you could get away without using lapply: you can unlist the values directly and match them.
input_1[-1] <- input_2$coly[match(unlist(input_1[-1]), input_2$colx)]

Extract list of values from column based upon other column

The following code:
df <- data.frame(
"letter" = c("a", "b", "c", "d", "e", "f"),
"score" = seq(1,6)
)
Results in the following dataframe:
letter score
1 a 1
2 b 2
3 c 3
4 d 4
5 e 5
6 f 6
I want to get the scores for a sequence of letters, for example the scores of c("f", "a", "d", "e"). It should result in c(6, 1, 4, 5).
What's more, I want to get the scores for c("c", "o", "f", "f", "e", "e"). Now the o is not in the letter column so it should return NA, resulting in c(3, NA, 6, 6, 5, 5).
What is the best way to achieve this? Can I use dplyr for this?
We can use match to create an index and extract the corresponding 'score'. If there is no match, it returns NA by default.
df$score[match(v1, df$letter)]
#[1] 3 NA 6 6 5 5
df$score[match(v2, df$letter)]
#[1] 6 1 4 5
data
v1 <- c("c", "o", "f", "f", "e", "e")
v2 <- c("f", "a", "d", "e")
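The match() lookup can be wrapped in a small helper for reuse. This is a sketch; score_for and its lookup_df argument are hypothetical names, not part of the original answer.

```r
df <- data.frame(letter = c("a", "b", "c", "d", "e", "f"),
                 score = seq(1, 6),
                 stringsAsFactors = FALSE)

# Return the score for each requested letter, in the order requested.
# Letters absent from lookup_df$letter yield NA.
score_for <- function(wanted, lookup_df = df) {
  lookup_df$score[match(wanted, lookup_df$letter)]
}

score_for(c("f", "a", "d", "e"))
score_for(c("c", "o", "f", "f", "e", "e"))
```

match() returns the position of the first match only, so if a letter appeared more than once in the lookup table, the first row's score would be used.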
If you want to use dplyr I would use a join:
df <- data.frame(
"letter" = c("a", "b", "c", "d", "e", "f"),
"score" = seq(1, 6)
)
library(dplyr)
df2 <- data.frame(letter = c("c", "o", "f", "f", "e", "e"))
left_join(df2, df, by = "letter")
letter score
1 c 3
2 o NA
3 f 6
4 f 6
5 e 5
6 e 5

Mean by factor level for last three rows

I have a dataframe, which looks like this (but has more factor levels and values)
ID <- c("A", "A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C", "C", "C")
Value <- rep(1:5)
test <- cbind.data.frame(ID, Value)
I would like to calculate the mean of the first 3 and last 3 values (rows) of each factor level.
For the first 3 values I used ddply:
library(plyr)
mean_start <- ddply(test, .(ID), summarise, mean_start = mean(Value[1:3]))
This works great. But how can I refer to the last 3 rows, given that each factor level has a different amount of rows?
Using head and tail:
library(plyr)
(means <- ddply(test, .(ID), summarise, mean_start = mean(head(Value, 3)), mean_end = mean(tail(Value, 3))))
# ID mean_start mean_end
# 1 A 2.000000 4
# 2 B 2.000000 3
# 3 C 2.666667 4
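The same summary can be sketched in base R without plyr by splitting Value by ID. One reason to prefer head() and tail() over Value[1:3]: for a group with fewer than three rows they return only the rows that exist, whereas Value[1:3] pads with NA and the mean becomes NA. The names by_id and means are illustrative.

```r
ID <- c("A", "A", "A", "A", "A", "B", "B", "B", "B",
        "C", "C", "C", "C", "C", "C")
# Same values as the recycled rep(1:5) in the question, made explicit.
Value <- rep(1:5, length.out = length(ID))
test <- data.frame(ID, Value)

# One vector of values per ID, in the original row order.
by_id <- split(test$Value, test$ID)

means <- data.frame(
  ID = names(by_id),
  mean_start = sapply(by_id, function(v) mean(head(v, 3))),
  mean_end = sapply(by_id, function(v) mean(tail(v, 3))),
  row.names = NULL
)
means
```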
