Conditional search, match and replace values between data frames - r

I have two dataframes as shown below. I would like to replace text (cells) in dataframe 1 with corresponding values taken from dataframe 2 when there is a match. I have tried to give a simple example below.
I have some limited experience with R but cant think of an easy solution right away. Any help/suggestions will be much appreciated.
input_1 = data.frame(col1 = c("ex1", "ex2", "ex3", "ex4"),
col2 = c("A", "B", "C", "D"),
col3 = c("B", "E", "F", "D"))
input_2 = data.frame(colx = c("A", "B", "C", "D", "E", "F"),
coly = c(1, 2, 3, 4, 5, 6))
output = data.frame(col1 = c("ex1", "ex2", "ex3", "ex4"),
col2 = c(1, 2, 3, 4),
col3 = c(2, 5, 6, 4))

Here's a tidyverse solution :
library(tidyverse)
mutate_at(input_1, -1, ~deframe(input_2)[as.character(.)])
# col1 col2 col3
# 1 ex1 1 2
# 2 ex2 2 5
# 3 ex3 3 6
# 4 ex4 4 4
deframe builds a named vector from a data frame, more convenient in this case.
as.character is necessary as you have factor columns

Example using tidyverse. My solution involved merging twice to input_2, but matching different columns. The last pipe cleans the data frame and renames the columns.
library(tidyverse)
input_1 = data.frame(col1 = c("ex1", "ex2", "ex3", "ex4"),
col2 = c("A", "B", "C", "D"),
col3 = c("B", "E", "F", "D"))
input_2 = data.frame(colx = c("A", "B", "C", "D", "E", "F"),
coly = c(1, 2, 3, 4, 5, 6))
output = data.frame(col1 = c("ex1", "ex2", "ex3", "ex4"),
col2 = c(1, 2, 3, 4),
col3 = c(2, 5, 6, 4))
input_1 %>% inner_join(input_2, by = c("col2" = "colx")) %>%
inner_join(input_2, by = c("col3" = "colx")) %>%
select(col1, coly.x, coly.y) %>%
magrittr::set_colnames(c("col1", "col2", "col3"))

One approach using base R would be to loop over columns where we want to change values using lapply, match the values with input_2$colx and get the corresponding coly value.
input_1[-1] <- lapply(input_1[-1], function(x) input_2$coly[match(x, input_2$colx)])
input_1
# col1 col2 col3
#1 ex1 1 2
#2 ex2 2 5
#3 ex3 3 6
#4 ex4 4 4
Actually, you could go away without using lapply, you could directly unlist the values and match
input_1[-1] <- input_2$coly[match(unlist(input_1[-1]), input_2$colx)]

Related

Replace Value in column based on another column With R [duplicate]

This question already has answers here:
Update a Value in One Column Based on Criteria in Other Columns
(4 answers)
Closed 2 years ago.
I'm trying to replace the value of a column based on the data in a different column, but it's not working. Here's some example data.
df <- data.frame(Col1 = 1:10,
Col2 = c("a", "a", "a", "b", "b", "c", "c", "d", "d", "d"),
Col3 = c("11%", "12%", "13%", "14%", "15%", "16%", "17%", "18%", "19%", "20%"))
If the value of Col2 is b, I need to change the value of Col3 to NA or 0 (NA is more accurate, but for what I'm doing, a 0 will also work). Column 3 is percents, I know I used strings here.
I tried doing this a few ways, most recently of which is the loop I have listed below. I'm open to any solution on this though. Is my loop not working because I'm not defining a pattern?
for(i in df){
if(df$Col2 == "b"){
str_replace(df$Col3, replacement = NA)
}
}
print(df)
Here's a base R solution:
df$Col3[df$Col2 == 'b'] <- NA
Here's a dplyr/tidyverse solution:
library(dplyr)
df %>% mutate(Col3 = ifelse(Col2 == 'b',NA_character_,Col3))
(Original, but less efficient case_when solution)
df %>%
mutate(Col3 = case_when(Col2 == 'b' ~ NA_character_,
TRUE ~ Col3))
This gives us:
Col1 Col2 Col3
1 1 a 11%
2 2 a 12%
3 3 a 13%
4 4 b <NA>
5 5 b <NA>
6 6 c 16%
7 7 c 17%
8 8 d 18%
9 9 d 19%
10 10 d 20%
A base dplyr solution, using ifelse() instead of case_when():
library(dplyr)
df <- data.frame(Col1 = 1:10,
Col2 = c("a", "a", "a", "b", "b", "c", "c", "d", "d", "d"),
Col3 = seq(.11, .2, by = .1))
df %>%
mutate(Col3 = ifelse(Col2 == 'b', NA, Col2))
pkpto39,
Try this:
library('tidyverse')
df <- data.frame(Col1 = 1:10,
Col2 = c("a", "a", "a", "b", "b", "c", "c", "d", "d", "d")
Col3 = c("11%", "12%", "13%", "14%", "15%", "16%", "17%", "18%", "19%", "20%"), stringsAsFactors = FALSE)
df <- df %>% mutate(Col3 = ifelse(Col2 == "b", NA, Col3))

How to aggregate the data by filtering it out?

I have a data set something like this:
df_A <- tribble(
~product_name, ~position, ~cat_id, ~pr,
"A", 1, 1, "X",
"A", 4, 2, "X",
"A", 3, 3, "X",
"B", 4, 5, NA,
"B", 6, 6, NA,
"C", 3, 1, "Y",
"C", 5, 2, "Y",
"D", 6, 2, "Z",
"D", 4, 8, "Z",
"D", 3, 9, "Z",
)
Now, I want to look up 1 and 2 in the cat_id, and find their position in the position for each product_name. If there is no 1 or 2 in the cat_id, then only these three variable will be returned to NA. Please see my desired data set to get a better understanding:
desired <- tribble(
~product_name, ~position_1, ~position_2, ~pr,
"A", 1, 4, "X",
"B", NA, NA, NA,
"C", 3, 5, "Y",
"D", NA, 6, "Z",
)
How can I get it?
We can filter the rows based on the 'cat_id', then if some of the 'product_name' are missing, use complete to expand the dataset and use pivot_wider to reshape into 'wide' format
library(dplyr)
library(tidyr)
library(stringr)
df_A %>%
filter(cat_id %in% 1:2) %>%
mutate(cat_id = str_c('position_', cat_id)) %>%
complete(product_name = unique(df_A$product_name)) %>%
pivot_wider(names_from = cat_id, values_from = position) %>%
select(-`NA`)
# A tibble: 4 x 4
# product_name pr position_1 position_2
# <chr> <chr> <dbl> <dbl>
#1 A X 1 4
#2 B <NA> NA NA
#3 C Y 3 5
#4 D Z NA 6
Or using reshape/subset from base R
reshape(merge(data.frame(product_name = unique(df_A$product_name)),
subset(df_A, cat_id %in% 1:2), all.x = TRUE),
idvar = c('product_name', 'pr'), direction = 'wide', timevar = 'cat_id')[-5]

How to delete all the duplicates row based on two columns?

I have a data frame where I want to delete duplicates rows, but I want to delete them only if a value from another column is the same for all the rows. (To be more clear I want to delete the duplicates rows which have the same "Number" value for all rows)
There is a example of my data frame :
df <- data.frame("Name" = c("a", "a", "b", "b", "b", "c", "c", "c"),
"Number" = c(1, 1, 1, 2, 3, 4, 5, 5), stringsAsFactors = FALSE)
And the result I expect is :
result <- data.frame("Name" = c("b", "b", "b", "c", "c", "c"),
"Number" = c(1, 2, 3, 4, 5, 5), stringsAsFactors = FALSE)
We can group_by Name and remove groups which have more than 1 row and have only one distinct value.
library(dplyr)
df %>%
group_by(Name) %>%
filter(!(n_distinct(Number) == 1 & n() > 1))
# Name Number
# <chr> <dbl>
#1 b 2
#2 b 2
#3 b 3
and using base R ave, the same logic can be written as
df[with(df, !as.logical(ave(Number, Name, FUN = function(x)
length(unique(x)) == 1 & length(x) > 1))), ]
Here is a solution with data.table
library("data.table")
df <- data.table("Name" = c("a", "a", "b", "b", "b"),
"Number" = c(1, 1, 2, 2, 3))
df[, if (uniqueN(Number)!=1 || .N==1) .SD, Name]
and here is a solution with base R:
df <- data.frame("Name" = c("a", "a", "b", "b", "b"),
"Number" = c(1, 1, 2, 2, 3), stringsAsFactors = FALSE)
df[as.logical(ave(df$Number, df$Name, FUN=function(x) length(unique(x))!=1 || length(x)==1)),]
We can use data.table methods
library(data.table)
setDT(df)[, .SD[uniqueN(Number) > 1] , Name]
# Name Number
#1: b 1
#2: b 2
#3: b 3
#4: c 4
#5: c 5
#6: c 5

Find rows in data frame with certain columns are duplicated, then combine the the elements in other columns [duplicate]

This question already has answers here:
Collapse text by group in data frame [duplicate]
(2 answers)
Aggregating by unique identifier and concatenating related values into a string [duplicate]
(4 answers)
Closed 3 years ago.
I have one data frame, I want to find the rows where both columns A and B are duplicated, and then combine the rows by combing the elements in C column together.
My example:
DF = cbind.data.frame(A = c(1, 1, 2, 3, 3),
B = c("a", "b", "a", "c", "c"),
C = c("M", "N", "X", "M", "N"))
My expected result:
DFE = cbind.data.frame(A = c(1, 1, 2, 3),
B = c("a", "b", "a", "c"),
C = c("M", "N", "X", "M; N"))
Thanks a lot
Without packages:
DF <- aggregate(C ~ A + B, FUN = function(x) paste(x, collapse = "; "), data = DF)
Output:
A B C
1 1 a M
2 2 a X
3 1 b N
4 3 c M; N
Or with data.table:
setDT(DF)[, .(C = paste(C, collapse = "; ")), by = .(A, B)]
This is a tidyverse based solution where you can use paste with collapse after grouping it.
library(dplyr)
DF = cbind.data.frame(A = c(1, 1, 2, 3, 3),
B = c("a", "b", "a", "c", "c"),
C = c("M", "N", "X", "M", "N"))
DFE = cbind.data.frame(A = c(1, 1, 2, 3),
B = c("a", "b", "a", "c"),
C = c("M", "N", "X", "M; N"))
DF %>%
group_by(A,B) %>%
summarise(C = paste(C, collapse = ";"))
#> # A tibble: 4 x 3
#> # Groups: A [3]
#> A B C
#> <dbl> <fct> <chr>
#> 1 1 a M
#> 2 1 b N
#> 3 2 a X
#> 4 3 c M;N
Created on 2019-03-19 by the reprex package (v0.2.1)

How to remove duplicate pair-wise columns [duplicate]

This question already has an answer here:
Select equivalent rows [A-B & B-A] [duplicate]
(1 answer)
Closed 7 years ago.
Consider the following dataframe:
df <- data.frame(V1 = c("A", "A", "B", "B", "C", "C"),
V2 = c("B", "C", "A", "C", "A", "B"),
n = c(1, 3, 1, 2, 3, 2))
How can I remove duplicate pair-wise columns so that the output looks like:
# V1 V2 n
#1 A B 1
#2 A C 3
#3 B C 2
I tried unique() and duplicated() to no avail.
Not sure if this is the simplest way of doing it (transposing can be computationally expensive) but this would work with your data frame:
df <- data.frame(V1 = c("A", "A", "B", "B", "C", "C"),
V2 = c("B", "C", "A", "C", "A", "B"),
n = c(1, 3, 1, 2, 3, 2))
First, sort the data frame row-wise, so your value-pairs become true duplicates.
df <- data.frame(t(apply(df, 1, sort)))
Then you can just apply the unique function.
df <- unique(df)
If your column names and order are important, you'll have to re-establish those.
names(df) <- c("n", "V1", "V2")
df <- df[, c("V1", "V2", "n")]
Another option would be to reshape (xtabs(n~..)) the dataset ('df') to wide format, set the lower triangular matrix to 0, and remove the rows with "Freq" equal to 0.
m1 <- xtabs(n~V1+V2, df)
m1[lower.tri(m1)] <- 0
subset(as.data.frame(m1), Freq!=0)
# V1 V2 Freq
#4 A B 1
#7 A C 3
#8 B C 2

Resources