I have a dataframe that's similar to what's below:
num <- c(1, 2, 3, 4)
name <- c("A", "B", "C", "A")
df <- cbind(num, name)
I'm looking to essentially turn this into:
num <- c(1, 2, 3, 4)
name <- c("A1", "B", "C", "A2")
df <- cbind(num, name)
How would I do this automatically, since my actual data is much larger?
Puginablanket,
See below for two solutions, one using the plyr package and the other using base R's by and do.call functions.
eg <- data.frame(num = c(1, 2, 3, 4, 5),
                 name = c("A", "B", "C", "A", "B"),
                 stringsAsFactors = FALSE)
do.call(rbind, by(eg, eg$name, function(x) {
  x$name2 <- paste0(x$name, 1:nrow(x))
  x
}))
plyr::ddply(eg, "name", function(x) {
  x$name2 <- paste0(x$name, 1:nrow(x))
  x
})
Depending on your application, it might make sense to create a separate column which tracks this duplication (so that you're not using string parsing at a later step to pull it back apart).
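For instance, a minimal sketch of that idea on the eg data above, keeping the within-name counter as its own integer column (the column name dup_idx is just illustrative):
# track the duplicate index separately instead of encoding it in the string
eg$dup_idx <- ave(seq_along(eg$name), eg$name, FUN = seq_along)
eg
##   num name dup_idx
## 1   1    A       1
## 2   2    B       1
## 3   3    C       1
## 4   4    A       2
## 5   5    B       2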
It might be worth considering the built-in make.unique(), although it doesn't do exactly what the OP wants (it doesn't label the first duplicated value, so that it can be run multiple times in succession). A little bit of extra trickiness is also required since name is a factor:
df <- data.frame(num = c(1, 2, 3, 4),
                 name = c("A", "B", "C", "A"))
df <- transform(df, name = factor(make.unique(
  as.character(name), sep = "")))
## num name
## 1 1 A
## 2 2 B
## 3 3 C
## 4 4 A1
I converted your matrix to a data frame:
df <- data.frame(num, name)
# Get duplicate names
ext <- as.numeric(ave(as.character(df$name), df$name,
                      FUN = function(x) cumsum(duplicated(x)) + 1))
nms <- df$name[ext > 1]
# Add back into the data
df$newname <- ifelse(df$name %in% nms, paste0(df$name, ext), as.character(df$name))
Here's a one-line solution, assuming you really do have a data.frame rather than a matrix (a matrix is what is returned by your cbind() command):
df <- data.frame(num=1:4, name=c('A','B','C','A') );
transform(df, name = paste0(name, ave(c(name), name, FUN = function(x) if (length(x) > 1) seq_along(x) else '')));
## num name
## 1 1 A1
## 2 2 B
## 3 3 C
## 4 4 A2
I frequently need to recode some (not all!) values in a data frame column based on a look-up table. I'm not satisfied by the ways I know of to solve the problem. I'd like to be able to do it in a clear, stable, and efficient way. Before I write my own function, I'd want to make sure I'm not duplicating something standard that's already out there.
## Toy example
data = data.frame(
  id = 1:7,
  x = c("A", "A", "B", "C", "D", "AA", ".")
)
lookup = data.frame(
  old = c("A", "D", "."),
  new = c("a", "d", "!")
)
## desired result
# id x
# 1 1 a
# 2 2 a
# 3 3 B
# 4 4 C
# 5 5 d
# 6 6 AA
# 7 7 !
I can do it with a join, coalesce, unselect as below, but this isn't as clear as I'd like - too many steps.
## This works, but is more steps than I want
library(dplyr)
data %>%
  left_join(lookup, by = c("x" = "old")) %>%
  mutate(x = coalesce(new, x)) %>%
  select(-new)
It can also be done with dplyr::recode, as below, converting the lookup table to a named lookup vector. I prefer lookup as a data frame, but I'm okay with the named-vector solution. My concern here is that recode is in the Questioning lifecycle phase, so I'm worried that this method isn't stable.
lookup_v = pull(lookup, new) %>% setNames(lookup$old)
data %>%
  mutate(x = recode(x, !!!lookup_v))
It could also be done with, say, stringr::str_replace, but using regex for whole-string matching isn't efficient. I suppose there is also forcats::fct_recode, a stable version of recode, but I don't want a factor output (though mutate(x = as.character(fct_recode(x, !!!lookup_v))) is perhaps my favorite option so far...).
I had hoped that the new-ish rows_update() family of dplyr functions would work, but it is strict about column names, and I don't think it can update the column it's joining on. (And it's Experimental, so doesn't yet meet my stability requirement.)
Summary of my requirements:
A single data column is updated based off of a lookup data frame (preferably) or named vector (allowable)
Not all values in the data are included in the lookup--the ones that are not present are not modified
Must work on character class input. Working more generally is a nice-to-have.
No dependencies outside of base R and tidyverse packages (though I'd also be interested in seeing a data.table solution)
No functions used that are in lifecycle phases like superseded or questioning. Please note any experimental lifecycle functions, as they have future potential.
Concise, clear code
I don't need extreme optimization, but nothing wildly inefficient (like regex when it's not needed)
A direct data.table solution, without %in%.
Depending on the length of the lookup / data tables, adding keys could improve performance substantially, but this isn't the case on this simple example.
library(data.table)
setDT(data)
setDT(lookup)
## If needed
# setkey(data,x)
# setkey(lookup,old)
data[lookup, x:=new, on=.(x=old)]
data
id x
1: 1 a
2: 2 a
3: 3 B
4: 4 C
5: 5 d
6: 6 AA
7: 7 !
Benchmarking
Expanding the original dataset to 10M rows, 15 runs using microbenchmark gave the following results on my computer:
Note that the forcats::fct_recode and dplyr::recode solutions mentioned by the OP have also been included. Neither works with the updated data, because the named vector that resolves to . = ! will throw an error; that is why the results are tested on the original dataset.
data = data.frame(
  id = 1:5,
  x = c("A", "A", "B", "C", "D")
)
lookup = data.frame(
  old = c("A", "D"),
  new = c("a", "d")
)
set.seed(1)
data <- data[sample(1:5, 1E7, replace = T),]
dt_lookup <- data.table::copy(lookup)
dplyr_coalesce <- function(){
  library(dplyr)
  lookupV <- setNames(lookup$new, lookup$old)
  data %>%
    dplyr::mutate(x = coalesce(lookupV[ x ], x))
}
datatable_in <- function(){
  library(data.table)
  lookupV <- setNames(lookup$new, lookup$old)
  setDT(dt_data)
  dt_data[ x %in% names(lookupV), x := lookupV[ x ] ]
}
datatable <- function(){
  library(data.table)
  setDT(dt_data)
  setDT(dt_lookup)
  ## If needed
  # setkey(dt_data, x)
  # setkey(dt_lookup, old)
  dt_data[dt_lookup, x := new, on = .(x = old)]
}
purrr_modify_if <- function(){
  library(dplyr)
  library(purrr)
  lookupV <- setNames(lookup$new, lookup$old)
  data %>%
    dplyr::mutate(x = modify_if(x, x %in% lookup$old, ~ lookupV[.x]))
}
stringr_str_replace_all_update <- function(){
  library(dplyr)
  library(stringr)
  lookupV <- setNames(lookup$new, do.call(sprintf, list("^\\Q%s\\E$", lookup$old)))
  data %>%
    dplyr::mutate(x = str_replace_all(x, lookupV))
}
base_named_vector <- function(){
  lookupV <- c(with(lookup, setNames(new, old)), rlang::set_names(setdiff(unique(data$x), lookup$old)))
  lookupV[data$x]
}
base_ifelse <- function(){
  lookupV <- setNames(lookup$new, lookup$old)
  with(data, ifelse(x %in% lookup$old, lookupV[x], x))
}
plyr_mapvalues <- function(){
  library(plyr)
  data %>%
    dplyr::mutate(x = plyr::mapvalues(x, lookup$old, lookup$new, warn_missing = FALSE))
}
base_match <- function(){
  tochange <- match(data$x, lookup$old, nomatch = 0)
  data$x[tochange > 0] <- lookup$new[tochange]
}
base_local_safe_lookup <- function(){
  lv <- structure(lookup$new, names = lookup$old)
  safe_lookup <- function(val) {
    new_val <- lv[val]
    unname(ifelse(is.na(new_val), val, new_val))
  }
  safe_lookup(data$x)
}
dplyr_recode <- function(){
  library(dplyr)
  lookupV <- setNames(lookup$new, lookup$old)
  data %>%
    dplyr::mutate(x = recode(x, !!!lookupV))
}
base_for <- function(){
  for (i in seq_len(nrow(lookup))) {
    data$x[data$x == lookup$old[i]] = lookup$new[i]
  }
}
datatable_for <- function(){
  library(data.table)
  setDT(dt_data)
  for (i in seq_len(nrow(lookup))) {
    dt_data[x == lookup$old[i], x := lookup$new[i]]
  }
}
forcats_fct_recode <- function(){
  library(dplyr)
  library(forcats)
  lookupV <- setNames(lookup$new, lookup$old)
  data %>%
    dplyr::mutate(x = as.character(fct_recode(x, !!!lookupV)))
}
datatable_set <- function(){
  library(data.table)
  setDT(dt_data)
  tochange <- dt_data[, chmatch(x, lookup$old, nomatch = 0)]
  set(dt_data, i = which(tochange > 0), j = "x", value = lookup$new[tochange])
}
library(microbenchmark)
bench <- microbenchmark(dplyr_coalesce(),
                        datatable(),
                        datatable_in(),
                        datatable_for(),
                        base_for(),
                        purrr_modify_if(),
                        stringr_str_replace_all_update(),
                        base_named_vector(),
                        base_ifelse(),
                        plyr_mapvalues(),
                        base_match(),
                        base_local_safe_lookup(),
                        dplyr_recode(),
                        forcats_fct_recode(),
                        datatable_set(),
                        times = 15L,
                        setup = dt_data <- data.table::copy(data))
bench$expr <- forcats::fct_rev(forcats::fct_reorder(bench$expr, bench$time, mean))
ggplot2::autoplot(bench)
Thanks to @Waldi and @nicola for advice on implementing the data.table solutions in the benchmark.
Combination of a named vector and coalesce:
# make lookup vector
lookupV <- setNames(lookup$new, lookup$old)
data %>%
  mutate(x = coalesce(lookupV[ x ], x))
# id x
# 1 1 a
# 2 2 a
# 3 3 B
# 4 4 C
# 5 5 d
Or data.table:
library(data.table)
setDT(data)
data[ x %in% names(lookupV), x := lookupV[ x ] ]
This post might have a better solution for data.table - "update on merge":
R data table: update join
A base R option using %in% and match - thanks to @LMc & @nicola
tochange <- match(data$x, lookup$old, nomatch = 0)
data$x[tochange > 0] <- lookup$new[tochange]
One more data.table option using set() and chmatch
library(data.table)
setDT(data)
tochange <- data[, chmatch(x, lookup$old, nomatch = 0)]
set(data, i = which(tochange > 0), j = "x", value = lookup$new[tochange])
Result
data
# id x
#1 1 a
#2 2 a
#3 3 B
#4 4 C
#5 5 d
#6 6 AA
#7 7 !
modify_if
You could use purrr::modify_if to only apply the named vector to values that exist in it. Though not a specified requirement, it has the benefit of the .else argument, which allows you to apply a different function to values not in your lookup.
I also wanted to include the use of tibble::deframe here to create the named vector. It is slower than setNames, though.
lookupV <- deframe(lookup)
data %>%
  mutate(x = modify_if(x, x %in% lookup$old, ~ lookupV[.x]))
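For instance, a sketch of that .else idea (the paste0 flagging is just an arbitrary illustration, not something the question requires):
data %>%
  mutate(x = modify_if(x, x %in% lookup$old, ~ lookupV[.x],
                       .else = ~ paste0(.x, "?")))  # flag values missing from the lookup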
str_replace_all
Alternatively, you could use stringr::str_replace_all, which can take a named vector for the replacement argument.
data %>%
  mutate(x = str_replace_all(x, lookupV))
Update
To accommodate the change in your edited example, the named vector used in str_replace_all needs to be modified: each pattern must match the entire literal string, so that "A" is not substituted inside "AA" and "." does not replace everything:
lookupV <- setNames(lookup$new, do.call(sprintf, list("^\\Q%s\\E$", lookup$old)))
data %>%
  mutate(x = str_replace_all(x, lookupV))
left_join
Using dplyr::left_join, this is very similar to the OP's solution, but it uses the .keep argument of mutate, so it has fewer steps. This argument is currently in the experimental lifecycle, so it is not included in the benchmark (though it lands around the middle of the posted solutions).
left_join(data, lookup, by = c("x" = "old")) %>%
  mutate(x = coalesce(new, x), .keep = "unused")
Base R
Named Vector
Create a substitution value for every unique value in your dataframe.
lookupV <- c(with(lookup, setNames(new, old)), setNames(nm = setdiff(unique(data$x), lookup$old)))
data$x <- lookupV[data$x]
ifelse
with(data, ifelse(x %in% lookup$old, lookupV[x], x))
Another clear option is a for-loop with subsetting, looping through the rows of the lookup table. This will almost always be quicker with data.table because of auto-indexing, or if you set the key (see ?data.table::setkey) ahead of time. It will, of course, get slower as the lookup table gets longer; I would guess an update-join would be preferred if there is a long lookup table.
Base R:
for (i in seq_len(nrow(lookup))) {
  data$x[data$x == lookup$old[i]] <- lookup$new[i]
}
data$x
# [1] "a" "a" "B" "C" "d" "AA" "!"
Or the same logic with data.table:
library(data.table)
setDT(data)
for (i in seq_len(nrow(lookup))) {
  data[x == lookup$old[i], x := lookup$new[i]]
}
data$x
# [1] "a" "a" "B" "C" "d" "AA" "!"
Data:
data = data.frame(
  id = 1:7,
  x = c("A", "A", "B", "C", "D", "AA", ".")
)
lookup = data.frame(
  old = c("A", "D", "."),
  new = c("a", "d", "!")
)
Another base solution, with a lookup vector:
## Toy example
data = data.frame(
  id = 1:5,
  x = c("A", "A", "B", "C", "D"),
  stringsAsFactors = FALSE
)
lookup = data.frame(
  old = c("A", "D"),
  new = c("a", "d"),
  stringsAsFactors = FALSE
)
lv <- structure(lookup$new, names = lookup$old)
safe_lookup <- function(val) {
  new_val <- lv[val]
  unname(ifelse(is.na(new_val), val, new_val))
}
data$x <- safe_lookup(data$x)
A dplyr + plyr solution that is in line with all your bullet points (if you consider plyr part of the tidyverse):
data <- data %>%
  dplyr::mutate(
    x = plyr::mapvalues(x, lookup$old, lookup$new) # add warn_missing = FALSE to silence warnings
  )
I basically share the same problem. Although dplyr::recode is in the "questioning" lifecycle phase, I don't expect it to become deprecated. At some point it might be superseded, but even in that case it should still be usable. Therefore I'm using a wrapper around dplyr::recode which allows the use of named vectors and/or two vectors (which could be a lookup table).
library(dplyr)
library(rlang)
recode2 <- function(x, new, old = NULL, .default = NULL, .missing = NULL) {
  if (!rlang::is_named(new) && !is.null(old)) {
    new <- setNames(new, old)
  }
  do.call(dplyr::recode,
          c(.x = list(x),
            .default = list(.default),
            .missing = list(.missing),
            as.list(new)))
}
data = data.frame(
  id = 1:7,
  x = c("A", "A", "B", "C", "D", "AA", ".")
)
lookup = data.frame(
  old = c("A", "D", "."),
  new = c("a", "d", "!")
)
# two vectors new / old
data %>%
  mutate(x = recode2(x, lookup$new, lookup$old))
#> id x
#> 1 1 a
#> 2 2 a
#> 3 3 B
#> 4 4 C
#> 5 5 d
#> 6 6 AA
#> 7 7 !
# named vector
data %>%
  mutate(x = recode2(x, c("A" = "a",
                          "D" = "d",
                          "." = "!")))
#> id x
#> 1 1 a
#> 2 2 a
#> 3 3 B
#> 4 4 C
#> 5 5 d
#> 6 6 AA
#> 7 7 !
Created on 2021-04-21 by the reprex package (v0.3.0)
I want to count the number of unique edges in an undirected network, e.g., net:
x y
1 A B
2 B A
3 A B
There should be only one unique edge for this matrix, because the edges A-B and B-A are the same in an undirected network.
For the directed network I can get the number of unique edges by:
nrow(unique(net[, c("x", "y")]))
But this doesn't work for the undirected network.
Given that you are working with networks, an igraph solution:
library(igraph)
as_data_frame(simplify(graph_from_data_frame(dat, directed=FALSE)))
Then use nrow
Explanation
dat %>%
  graph_from_data_frame(., directed = FALSE) %>% # convert to undirected graph
  simplify %>%                                   # remove loops / multiple edges
  as_data_frame                                  # return remaining edges
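Putting it together to get the count (with the question's three-row edge list, this returns 1):
nrow(as_data_frame(simplify(graph_from_data_frame(dat, directed = FALSE))))
## [1] 1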
Try this,
df <- data.frame(x=c("A", "B", "A"), y = c("B", "A", "B"))
unique(apply(df, 1, function(x) paste(sort(unlist(strsplit(x, " "))),collapse = " ")))
[1] "A B"
So how does this work?
We are applying a function to each row of the data frame, so we can take each row at a time.
Take the second row of the df,
df[2,]
  x y
2 B A
We then split (strsplit) this and unlist it into a vector of individual letters (we use as.matrix to isolate the elements):
unlist(strsplit(as.matrix(df[2,]), " "))
[1] "B" "A"
Use the sort function to put them into alphabetical order, then paste them back together:
paste(sort(unlist(strsplit(as.matrix(df[2,]), " "))), collapse = " ")
[1] "A B"
The apply function then does this for all the rows (we set the margin index to 1 to work row-wise), and the unique function identifies the unique edges.
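Since the question asks for the number of unique edges, the count is then just the length of that unique vector:
length(unique(apply(df, 1, function(x) paste(sort(unlist(strsplit(x, " "))), collapse = " "))))
## [1] 1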
Extension
This can be extended to n variables, for example n=3,
df <- data.frame(x=c("A", "B", "A"), y = c("B", "A", "B"), z = c("C", "D", "D"))
unique(apply(df, 1, function(x) paste(sort(unlist(strsplit(x, " "))),collapse = " ")))
[1] "A B C" "A B D"
If more letters are needed, just combine two letters like the following,
df <- data.frame(x=c("A", "BC", "A"), y = c("B", "A", "BC"))
df
x y
1 A B
2 BC A
3 A BC
unique(apply(df, 1, function(x) paste(sort(unlist(strsplit(x, " "))),collapse = " ")))
[1] "A B" "A BC"
Old version
Using the tidyverse package, create a function called rev that orders our edges (note that this masks base::rev). Then use mutate to create a new column combining the x and y columns in a way that works well with the rev function, run the new column through the function, and find the unique pairs.
library(tidyverse)
rev <- function(x){
  unname(sapply(x, function(x) {
    paste(sort(trimws(strsplit(x[1], ',')[[1]])), collapse = ',')
  }))
}
df <- data.frame(x=c("A", "B", "A"), y = c("B", "A", "B"))
rows <- df %>%
  mutate(both = c(paste(x, y, sep = ", ")))
unique(rev(rows$both))
Here is a solution that does not involve igraph, all inside one pipe:
df = tibble(x=c("A", "B", "A"), y = c("B", "A", "B"))
It is possible to use group_by(), then sort() the combinations of values and paste() them into a new column via mutate(). unique() handles "true" duplicates (A-B, A-B will get into one group).
df %>%
  group_by(x, y) %>%
  mutate(edge_id = paste(sort(unique(c(x, y))), collapse = " "))
When you have properly sorted edge names in a new column, it's quite straightforward to count unique values or filter duplicates out of your data frame.
If you have additional variables for edges, just add them into grouping.
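For example, a sketch of the counting step using dplyr's n_distinct() on that new column (the name n_edges is just illustrative):
df %>%
  group_by(x, y) %>%
  mutate(edge_id = paste(sort(unique(c(x, y))), collapse = " ")) %>%
  ungroup() %>%
  summarise(n_edges = n_distinct(edge_id))
## returns 1 for the example data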
If you're not using {igraph} or just want to know how to do it cleanly without any dependencies...
Here's your data...
your_edge_list <- data.frame(x = c("A", "B", "A"),
                             y = c("B", "A", "B"),
                             stringsAsFactors = FALSE)
your_edge_list
#> x y
#> 1 A B
#> 2 B A
#> 3 A B
and here's a step-by-step breakdown...
`%>%` <- magrittr::`%>%`
your_edge_list %>%
  apply(1L, sort) %>%               # sort dyads
  t() %>%                           # transpose resulting matrix to get the original shape back
  unique() %>%                      # get the unique rows
  as.data.frame() %>%               # back to data frame
  setNames(names(your_edge_list))   # reset column names
#> x y
#> 1 A B
If we drop the pipes, the core of it looks like this...
unique(t(apply(your_edge_list, 1, sort)))
#> [,1] [,2]
#> [1,] "A" "B"
And we can wrap it up in a function that 1) handles both directed and undirected, 2) handles data frames and (the more common) matrices, and 3) can drop loops...
simplify_edgelist <- function(el, directed = TRUE, drop_loops = TRUE) {
  stopifnot(ncol(el) == 2)
  if (drop_loops) {
    el <- el[el[, 1] != el[, 2], ]
  }
  if (directed) {
    out <- unique(el)
  } else {
    out <- unique(t(apply(el, 1, sort)))
  }
  colnames(out) <- colnames(el)
  if (is.data.frame(el)) {
    as.data.frame(out, stringsAsFactors = FALSE)
  } else {
    out
  }
}
el2 <- rbind(your_edge_list,
             data.frame(x = c("C", "C"), y = c("C", "A"), stringsAsFactors = FALSE))
el2
#> x y
#> 1 A B
#> 2 B A
#> 3 A B
#> 4 C C
#> 5 C A
simplify_edgelist(el2, directed = FALSE)
#> x y
#> 1 A B
#> 5 A C
I have a list of dfs. The dfs all have the same column names. I would like to:
(1) Change one of the column names to the name of the df within the list
(2) full_join all the dfs after name change
Example of my list:
my_list <- list(one = data.frame(Type = c(1,2,3), Class = c("a", "a", "b")),
                two = data.frame(Type = c(1,2,3), Class = c("a", "a", "b")))
Output that I want:
data.frame(Type = c(1,2,3),
           one = c("a", "a", "b"),
           two = c("a", "a", "b"))
  Type one two
1    1   a   a
2    2   a   a
3    3   b   b
You could possibly use dplyr::bind_rows combined with tidyr::spread to achieve the same result (if you are happy to consider alternative approaches). For example:
library(tidyverse)
my_list %>% bind_rows(.id = "groups") %>% spread(groups, Class)
#> Type one two
#> 1 1 a a
#> 2 2 a a
#> 3 3 b b
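Note that spread() has since been superseded by pivot_wider(); on tidyr >= 1.0.0, an equivalent sketch of the same idea is:
my_list %>%
  bind_rows(.id = "groups") %>%
  pivot_wider(names_from = groups, values_from = Class)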
The first step can be tricky, but it's simple if you iterate over names(my_list).
transformed <- sapply(names(my_list), function(name) {
  df <- my_list[[name]]
  colnames(df)[colnames(df) == 'Class'] <- name
  df
}, simplify = FALSE, USE.NAMES = TRUE)
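If you prefer purrr for the iteration, a roughly equivalent sketch of the renaming step uses imap(), which passes each element along with its name (this relies on tidy-eval's := to use the string .y as the new column name):
library(purrr)
transformed <- imap(my_list, ~ dplyr::rename(.x, !!.y := Class))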
With purrr::reduce and dplyr::full_join the result can be obtained:
purrr::reduce(transformed, dplyr::full_join)
# Type one two
# 1 1 a a
# 2 2 a a
# 3 3 b b
I have a data frame and would like to perform some specific operations on it.
dat <- data.frame(Name = LETTERS[1:3],
                  Val1 = rnorm(3),
                  Val2 = rnorm(3))
# > dat
# Name Val1 Val2
# 1 A -1.055050 0.4499766
# 2 B 0.414994 -0.5999369
# 3 C -1.311374 -0.3967634
I would like to do the following:
Pair-wise divide Val1 across the Names, e.g.
AB1 <- dat[dat$Name == "A", "Val1"] / dat[dat$Name == "B", "Val1"]
AC1 <- dat[dat$Name == "A", "Val1"] / dat[dat$Name == "C", "Val1"]
BC1 <- dat[dat$Name == "B", "Val1"] / dat[dat$Name == "C", "Val1"]
Pair-wise divide Val2 across the Names, e.g.
AB2 <- dat[dat$Name == "A", "Val2"] / dat[dat$Name == "B", "Val2"]
AC2 <- dat[dat$Name == "A", "Val2"] / dat[dat$Name == "C", "Val2"]
BC2 <- dat[dat$Name == "B", "Val2"] / dat[dat$Name == "C", "Val2"]
Subtract 2 from 1, e.g.
AB3 <- AB1 - AB2
AC3 <- AC1 - AC2
BC3 <- BC1 - BC2
The above works fine, but I'd like to implement this in a smarter and more scalable way (e.g., many more Names and Vals), as well as storing the output in a data.frame where it is easier to programmatically extract values.
Finally, an even better solution would do this for the following data
dat2 <- data.frame(Region = rep(LETTERS[24:26], each = 3),
                   Name = rep(LETTERS[1:3], 3),
                   Val1 = rep(rnorm(3), 3),
                   Val2 = rep(rnorm(3), 3))
> dat2
# Region Name Val1 Val2
# 1 X A 2.1098629 0.5779044
# 2 X B 0.5937334 0.1410554
# 3 X C 0.2819461 -1.1769578
# 4 Y A 2.1098629 0.5779044
# 5 Y B 0.5937334 0.1410554
# 6 Y C 0.2819461 -1.1769578
# 7 Z A 2.1098629 0.5779044
# 8 Z B 0.5937334 0.1410554
# 9 Z C 0.2819461 -1.1769578
Where the operations are the same as above but grouped by Region, so the output would be something like
> output
# Region AB3 AC3 BC3
# 1 X ? ? ?
# 2 Y ? ? ?
# 3 Z ? ? ?
where the ? are the actual results.
combn is a work-horse here, which can be used to generate unique pairwise combinations:
combn(as.character(dat$Name), 2, simplify=FALSE)
#[[1]]
#[1] "A" "B"
#
#[[2]]
#[1] "A" "C"
#
#[[3]]
#[1] "B" "C"
You can then also pass the results of these pairwise combinations to a function:
set.seed(1) ## for reproducibility
combn(
  as.character(dat$Name),
  2,
  FUN = function(x) do.call(`-`, dat[dat$Name == x[1], -1] / dat[dat$Name == x[2], -1])
)
#[1] -8.2526585 2.6940335 0.1818427
AB3
#[1] -8.252659
AC3
#[1] 2.694033
BC3
#[1] 0.1818427
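To make the results easier to extract programmatically (as the question asks), one small extension is to name the vector, reusing combn() to build the pair labels (res is just an illustrative name):
res <- combn(
  as.character(dat$Name),
  2,
  FUN = function(x) do.call(`-`, dat[dat$Name == x[1], -1] / dat[dat$Name == x[2], -1])
)
names(res) <- paste0(combn(as.character(dat$Name), 2, FUN = paste0, collapse = ""), "3")
res
#        AB3        AC3        BC3
# -8.2526585  2.6940335  0.1818427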
With data.table, you can do it using the code below:
library(data.table)
dat <- data.table(Region = rep(LETTERS[24:26], each = 3),
                  Name = rep(LETTERS[1:3], 3),
                  Val1 = rep(rnorm(3), 3),
                  Val2 = rep(rnorm(3), 3))
dat2 <- merge(dat, dat, by = "Region", allow.cartesian = TRUE)[Name.x < Name.y]
dat2[, Val1Ratio := Val1.x / Val1.y]
dat2[, Val2Ratio := Val2.x / Val2.y]
dat2[, Diff := Val1Ratio - Val2Ratio]
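If you want the wide Region-by-pair layout from the question, a possible final step is to build a pair label and reshape with data.table's dcast() (the Pair column and the AB3/AC3/BC3 names are built from Name.x and Name.y):
dat2[, Pair := paste0(Name.x, Name.y, "3")]
dcast(dat2, Region ~ Pair, value.var = "Diff")
# gives one row per Region with columns AB3, AC3, BC3 (values depend on the rnorm draw)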
@thelatemail's answer worked well on the first part of my question. I originally tried to add the information below as an edit to their answer to address the second part of my question. That edit was rejected, so I'm putting it here as an answer:
One way to use @thelatemail's answer to address the second part of the question (relating to dat2 and grouping by Region) is the following:
library("dplyr")
Regions <- unique(dat2$Region)
out <- data.frame(Region = Regions, AB3 = NA, AC3 = NA, BC3 = NA)
for (i in 1:length(Regions)){
  dat2temp <- dat2 %>% filter(Region == Regions[i])
  out[i, 2:4] <- combn(
    as.character(dat2temp$Name),
    2,
    FUN = function(x) do.call(`-`, dat2temp[dat2temp$Name == x[1], -(1:2)] / dat2temp[dat2temp$Name == x[2], -(1:2)])
  )
}
> out
# Region AB3 AC3 BC3
# 1 X -4.368693 -0.4772375 0.3004291
# 2 Y -4.368693 -0.4772375 0.3004291
# 3 Z -4.368693 -0.4772375 0.3004291
There are probably better solutions that avoid a for-loop, and I'd love to hear them. I expect there's a cleaner solution that uses dplyr::group_by or cut or similar; one possible sketch follows below.
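For instance, a loop-free sketch with dplyr's group_modify(), reusing @thelatemail's combn() idea per Region (the AB3/AC3/BC3 names are built from the pairs; this assumes Name values are unique within each Region):
library(dplyr)
dat2 %>%
  group_by(Region) %>%
  group_modify(function(d, key) {
    # d holds each Region's rows, minus the grouping column
    pairs <- combn(as.character(d$Name), 2, FUN = paste0, collapse = "")
    vals <- combn(
      as.character(d$Name),
      2,
      FUN = function(x) do.call(`-`, d[d$Name == x[1], -1] / d[d$Name == x[2], -1])
    )
    setNames(as.data.frame(t(vals)), paste0(pairs, "3"))
  })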
I have a data frame with the following variables:
df <- data.frame(ID = seq(1:5),
                 Price.A = c(10,12,14,16,18),
                 Price.B = c(6,7,9,8,5),
                 Price.C = c(27,26,25,24,23),
                 Choice = c("A", "A", "B", "B", "C"))
I want to create a variable called Expenditure, which picks the value from Price.A, Price.B or Price.C depending on the value of the variable Choice.
I tried to create it with the following code:
df$Expenditure <- with(df, get(paste("Price.", Choice, sep ="")))
However, that returns the value of Price.A for all observations.
In my real application, instead of A, B and C, I have hundreds of names, so an ifelse command is not feasible.
Does anyone know how to do that?
It would probably make more sense to reshape your data. Currently your data is not in a "tidy" format
library(dplyr)
library(tidyr)
df %>%
  gather(Price, Expenditure, -ID, -Choice) %>%
  filter(Price == paste0("Price.", Choice)) %>%
  select(-Price)
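gather() has since been superseded; with tidyr >= 1.0.0, the same reshape can be sketched with pivot_longer():
df %>%
  pivot_longer(starts_with("Price"), names_to = "Price", values_to = "Expenditure") %>%
  filter(Price == paste0("Price.", Choice)) %>%
  select(-Price)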
Otherwise, you could use matrix indexing on a matrix:
cols <- grep("Price", names(df), value=T)
mm <- as.matrix(df[, cols])
colidx <- match(paste0("Price.", df$Choice), cols)
df$Expenditure <- mm[cbind(1:length(colidx), colidx)]
df$Expenditure[df$Choice=="A"] <- df$Price.A[df$Choice=="A"]
df$Expenditure[df$Choice=="B"] <- df$Price.B[df$Choice=="B"]
df$Expenditure[df$Choice=="C"] <- df$Price.C[df$Choice=="C"]
Here's how to scale it up with a loop:
df$Expenditure <- NA
for(i in unique(df$Choice)){
  j <- paste0("Price.", i)
  df$Expenditure[df$Choice == i] <- df[df$Choice == i, colnames(df) == j]
}
ID Price.A Price.B Price.C Choice Expenditure
1 1 10 6 27 A 10
2 2 12 7 26 A 12
3 3 14 9 25 B 9
4 4 16 8 24 B 8
5 5 18 5 23 C 23
You could easily wrap this into a function and use apply if you prefer.
There are also lots of more complicated ways to do this, though I think it's terrible practice to use a third-party package for this when base R does a wonderful job. Here's one:
df <- data.frame(ID = seq(1:5),
                 PriceA = c(10,12,14,16,18),
                 PriceB = c(6,7,9,8,5),
                 PriceC = c(27,26,25,24,23),
                 Choice = c("A", "A", "B", "B", "C"))
require(sqldf)
df$Expenditure <- sqldf("SELECT
  CASE
    WHEN Choice == 'A' THEN PriceA
    WHEN Choice == 'B' THEN PriceB
    WHEN Choice == 'C' THEN PriceC
  END
  FROM df")[[1]]
Here are a couple of *apply based approaches:
df$Expenditure <- sapply(seq_along(df[[1]]), function(i) {
  df[i, sprintf("Price.%s", df$Choice[i])]
})
df$Expenditure <- mapply(function(x, y) {
  df[x, sprintf("Price.%s", y)]
}, row.names(df), df$Choice)
The second one assumes your object has the default row.names of 1:nrow(df).
How about
for (i in 1:nrow(df)) {
  df$Expenditure[i] <- with(df[i, ], get(paste("Price.", Choice, sep = "")))
}