Setdiff within mutate function in R

I have a data frame with three columns. Each row contains three unique numbers between 1 and 5 (inclusive).
df <- data.frame(a = c(1, 4, 2),
                 b = c(5, 3, 1),
                 c = c(3, 1, 5))
I want to use mutate to create two additional columns that, for each row, contain (in ascending order) the two numbers between 1 and 5 that do not appear in the initial three columns. The desired data frame for the example would be:
df2 <- data.frame(a = c(1, 4, 2),
                  b = c(5, 3, 1),
                  c = c(3, 1, 5),
                  d = c(2, 2, 3),
                  e = c(4, 5, 4))
I tried the mutate call below, using setdiff, to accomplish this, but it returned NAs rather than the values I was looking for:
df <- df %>% mutate(d = setdiff(c(a, b, c), c(1:5))[1],
                    e = setdiff(c(a, b, c), c(1:5))[2])
I can get around this by looping through each row (or using an apply function) but would prefer a mutate approach if possible.
Thank you for your help!

Base R:
cbind(df, t(apply(df, 1, setdiff, x = 1:5)))
# a b c 1 2
# 1 1 5 3 2 4
# 2 4 3 1 2 5
# 3 2 1 5 3 4
Warning: if there are any non-numeric columns, apply will happily up-convert everything (it coerces the data frame to a matrix internally).
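One way to avoid that here (a sketch, assuming the three numeric columns are a, b and c) is to run apply() on those columns only and name the result explicitly:
num_cols <- c("a", "b", "c")   # columns holding the observed numbers
missing_vals <- t(apply(df[num_cols], 1, setdiff, x = 1:5))
cbind(df, setNames(as.data.frame(missing_vals), c("d", "e")))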

We can use pmap to loop over the rows, create a list column and then unnest it to create two new columns
library(dplyr)
library(purrr)
library(tidyr)
df %>%
  mutate(out = pmap(., ~ setdiff(1:5, c(...)) %>%
                      as.list %>%
                      set_names(c('d', 'e')))) %>%
  unnest_wider(out)
# A tibble: 3 x 5
# a b c d e
# <dbl> <dbl> <dbl> <int> <int>
#1 1 5 3 2 4
#2 4 3 1 2 5
#3 2 1 5 3 4
Or using base R
df[c('d', 'e')] <- do.call(rbind, lapply(asplit(df, 1), function(x) setdiff(1:5, x)))
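For completeness, a rowwise() sketch that stays close to the attempt in the question; note the setdiff() arguments are reversed compared with the original code, and the apply/pmap answers above will generally scale better:
library(dplyr)
df %>%
  rowwise() %>%
  mutate(d = setdiff(1:5, c(a, b, c))[1],    # the two values of 1:5 missing from this row,
         e = setdiff(1:5, c(a, b, c))[2]) %>% # returned in ascending order
  ungroup()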

Related

Matching old and new column names in R

I have a tibble which has column names containing spaces & special characters which make it a hassle to work with. I want to change these column names to easier to use names while I'm working with the data, and then change them back to the original names at the end for display. Ideally, I want to be able to do this as part of a pipe, however I haven't figured out how to do it with rename_with().
Sample data:
df <- tibble(oldname1 = seq(1:10),
             oldname2 = letters[seq(1:10)],
             oldname3 = LETTERS[seq(1:10)])
cols_lookup <- tibble(old_names = c("oldname4", "oldname2", "oldname1"),
                      new_names = c("newname4", "newname2", "newname1"))
Desired output:
> head(df_renamed)
# A tibble: 6 x 3
newname1 newname2 oldname3
<int> <chr> <chr>
1 1 a A
2 2 b B
3 3 c C
4 4 d D
5 5 e E
6 6 f F
Some columns are removed & reordered during this work so when converting them back there will be entries in the cols_lookup table which are no longer in df. There are also new columns created in df which I want to remain named the same.
I am aware there are similar questions which have already been asked, however the answers either don't work well with tibbles or in a pipe (eg. those using match()), or don't work if the columns aren't all present in the same order in both tables.
We can use rename_at. From the master lookup table, filter the rows where the names of the dataset have a match (filtered_lookup), then use that in rename_at, where we specify the 'old_names' in vars() and replace them with the 'new_names':
library(dplyr)
filtered_lookup <- cols_lookup %>%
  filter(old_names %in% names(df))
df %>%
  rename_at(vars(filtered_lookup$old_names), ~ filtered_lookup$new_names)
Or use rename_with with the same logic:
df %>%
  rename_with(.fn = ~ filtered_lookup$new_names, .cols = filtered_lookup$old_names)
Or another option is rename with splicing (!!!) from a named vector:
library(tibble)
df %>%
  rename(!!! deframe(filtered_lookup[2:1]))
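If your dplyr and tidyselect versions are recent enough (a sketch, assuming dplyr >= 1.0), rename() with any_of() silently skips lookup entries that are missing from the data, so the pre-filtering step isn't needed:
library(dplyr)
library(tibble)
df %>%
  rename(any_of(deframe(cols_lookup[2:1])))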
You can use rename_() with setNames() (note that rename_() has been deprecated since dplyr 0.7):
cols_lookup <- tibble(old_names = c("oldname3", "oldname2", "oldname1"),
                      new_names = c("newname3", "newname2", "newname1"))
rename_(df, .dots = setNames(cols_lookup$old_names, cols_lookup$new_names))
Output:
# A tibble: 10 x 3
newname1 newname2 newname3
<int> <chr> <chr>
1 1 a A
2 2 b B
3 3 c C
4 4 d D
5 5 e E
6 6 f F
7 7 g G
8 8 h H
9 9 i I
10 10 j J
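Since rename_() is deprecated, roughly the same call can be written with splicing in current dplyr (a sketch, assuming all old_names are present in df, as in the lookup above):
library(dplyr)
df %>%
  rename(!!! setNames(cols_lookup$old_names, cols_lookup$new_names))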

How to keep one instance or more of the values in one column when removing duplicate rows?

I'm trying to remove rows with duplicate values in one column of a data frame, while making sure every value in that column stays represented: a value should appear once for each distinct non-missing value it has in a second column, and exactly once (with NA) if all of its values in that second column are missing. Take for example the following data frame:
toy <- data.frame(Group = c(1,1,2,2,2,3,3,4,5,5,6,7,7),
                  Class = c("a",NA,"a","b",NA,NA,NA,NA,"a","b","a","a","a"))
I would like to end up with this:
ideal <- data.frame(Group = c(1,2,2,3,4,5,5,6,7),
                    Class = c("a","a","b",NA,NA,"a","b","a","a"))
I tried transforming the data frame into a data table and following the advice here, like this:
library(data.table)
toy.dt <- as.data.table(toy)
toy.dt[, .(Class = if(all(is.na(Class))) NA_character_ else na.omit(Class)), by = Group]
but duplicates weren't handled as needed: value 7 in the column 'Group' should appear only once in the resulting data.
It would be a bonus if the solution doesn't require transforming the data into a data table.
Here is one way using base R. We first drop the NA rows in toy and keep only the unique rows. We then left join the result with the unique Group values, so that groups whose Class is entirely missing come back as NA.
df1 <- unique(na.omit(toy))
merge(unique(subset(toy, select = Group)), df1, all.x = TRUE)
# Group Class
#1 1 a
#2 2 a
#3 2 b
#4 3 <NA>
#5 4 <NA>
#6 5 a
#7 5 b
#8 6 a
#9 7 a
Same logic using dplyr functions:
library(dplyr)
toy %>%
  na.omit() %>%
  distinct() %>%
  right_join(toy %>% distinct(Group))
If you would like to try a tidyverse approach (the filter drops NA rows only in groups that also contain a non-missing Class, and distinct() then removes exact duplicates):
library(tidyverse)
toy %>%
  group_by(Group) %>%
  filter(!(is.na(Class) & sum(!is.na(Class)) > 0)) %>%
  distinct()
Output
# A tibble: 9 x 2
# Groups: Group [7]
Group Class
<dbl> <chr>
1 1 a
2 2 a
3 2 b
4 3 NA
5 4 NA
6 5 a
7 5 b
8 6 a
9 7 a
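As a side note on the data.table attempt in the question, it appears to only be missing a unique() around na.omit(Class). A sketch:
library(data.table)
toy.dt <- as.data.table(toy)
toy.dt[, .(Class = if (all(is.na(Class))) NA_character_ else unique(na.omit(Class))), by = Group]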

Product of several columns on a data frame by a vector using dplyr

I would like to multiply several columns of a data frame by the values of a vector (all values within a given column should be multiplied by the same value, which differs from column to column), while keeping the other columns as they are.
Since I'm using dplyr extensively I thought it might be useful to use the mutate_each function, so I can modify all columns at the same time, but I am completely lost on the syntax of the funs() part.
On the other hand, I've read this solution which is simple and works fine, but only works for all columns instead of the selected ones.
That's what I've done so far:
Imagine that I want to multiply all columns in df but letters by weight_df vector as follows:
df = data.frame(
  letters = c("A", "B", "C", "D"),
  col1 = c(3, 3, 2, 3),
  col2 = c(2, 2, 3, 1),
  col3 = c(4, 1, 1, 3)
)
> df
letters col1 col2 col3
1 A 3 2 4
2 B 3 2 1
3 C 2 3 1
4 D 3 1 3
>
weight_df = c(1:3)
If I use select before applying mutate_each I get rid of the letters column (as expected), which is not what I want (apart from the fact that the vector is applied on a per-row basis rather than per column, and I want the opposite):
df = df %>%
  select(-letters) %>%
  mutate_each(funs(. * weight_df))
> df
col1 col2 col3
1 3 2 4
2 6 4 2
3 6 9 3
4 3 1 3
But if I don't select any particular columns, all values within letters are lost (which makes a lot of sense, by the way), and that's not what I want either (again, apart from the fact that the vector is applied on a per-row basis rather than per column, and I want the opposite):
df = df %>%
  mutate_each(funs(. * weight_df))
> df
letters col1 col2 col3
1 NA 3 2 4
2 NA 6 4 2
3 NA 6 9 3
4 NA 3 1 3
(Please note that this is a very simple dataframe and the original one has way more rows and columns -which unfortunately are not labeled in such an easy way and no patterns can be obtained)
The problem here is that you are basically trying to operate over rows rather than columns, hence methods such as mutate_* won't work directly. If you are not satisfied with the many vectorized approaches proposed in the linked question, then using the tidyverse (and assuming that letters is a unique identifier), one way to achieve this is to convert to long form first, multiply a single column by group, and then convert back to wide (this probably won't be overly efficient, though):
library(tidyr)
library(dplyr)
df %>%
  gather(variable, value, -letters) %>%
  group_by(letters) %>%
  mutate(value = value * weight_df) %>%
  spread(variable, value)
#Source: local data frame [4 x 4]
#Groups: letters [4]
# letters col1 col2 col3
# * <fctr> <dbl> <dbl> <dbl>
# 1 A 3 4 12
# 2 B 3 4 3
# 3 C 2 6 3
# 4 D 3 2 9
Using sweep() on the numeric columns only (the pipe is just for readability). This gives flexibility for choosing columns and returns the new values along with all the other, non-numeric columns:
index <- which(sapply(df, is.numeric))
df[, index] <- df[, index] %>% sweep(2, weight_df, FUN = "*")
> df
letters col1 col2 col3
1 A 3 4 12
2 B 3 4 3
3 C 2 6 3
4 D 3 2 9
Or try this:
library(plyr)
library(dplyr)
df %>% select_if(is.numeric) %>% adply(., 1, function(x) x * weight_df)
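With dplyr 1.0 or later (mutate_each() and funs() are deprecated there), a sketch using across() and cur_column(); the names given to the weights are an assumption about which columns should be scaled:
library(dplyr)
weights <- c(col1 = 1, col2 = 2, col3 = 3)   # named version of weight_df
df %>%
  mutate(across(all_of(names(weights)), ~ .x * weights[[cur_column()]]))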

Translate a data frame of observations into ranks

I have a data set like this:
df <- data.frame(situation1 = rnorm(30),
                 situation2 = rnorm(30),
                 situation3 = rnorm(30),
                 models = c(rep("A", 10), rep("B", 10), rep("C", 10)))
where I compare three models (A,B,C) in three situations. I have 10 measurements for each model.
I now want to summarise this into ranks, i.e. how often each model wins in each situation. A win is defined by having the highest value.
A final output could be something like this:
model situation1 situation2 situation3
A 4 3 3
B 7 1 2
C 1 4 5
In base R, max.col() gives the column index of each row-wise maximum; we convert that to a situation name and tabulate it against models:
table(df$models, colnames(df[-4])[max.col(df[-4])])
# situation1 situation2 situation3
# A 2 4 4
# B 4 5 1
# C 2 4 4
Results may change from your OP, since you didn't set a seed.
Here is an option using data.table: do.call(pmax, .SD) computes the row-wise maximum, Map(`==`, .SD, ...) flags which situation attains it, and lapply(..., sum) counts the wins per model.
library(data.table)
setDT(df)[, lapply(Map(`==`, .SD, list(do.call(pmax, .SD))), sum), models]
Here's a dplyr option:
df %>%
  group_by(models) %>%
  mutate_all(funs(. == pmax(situation1, situation2, situation3))) %>%
  summarise_all(sum)
Or possibly a little more efficient:
df %>%
  mutate_at(vars(-models), funs(. == pmax(situation1, situation2, situation3))) %>%
  group_by(models) %>%
  summarise_all(sum)
# A tibble: 3 × 4
# models situation1 situation2 situation3
# <chr> <int> <int> <int>
#1 A 3 3 3
#2 B 3 5 1
#3 C 6 1 2
If you're looking for the minimum, use pmin instead of pmax. And in case there may be NAs, use the na.rm-argument in pmax/pmin.
Final note: the result doesn't match OP's because the sample data was generated without setting a seed.
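Since funs() is deprecated in recent dplyr, an equivalent sketch using across():
library(dplyr)
df %>%
  group_by(models) %>%
  summarise(across(starts_with("situation"),
                   ~ sum(.x == pmax(situation1, situation2, situation3))))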
