Merge 3 columns based on unique values? - r

I am trying to merge 3 columns into a single one. The values in each column are separated by ";", and the new column needs to split the values from all 3 columns and keep only the unique ones. I know how to merge the columns, but I am struggling to split the row values across the 3 columns, find the unique values, and put them in another column.
Here is the dummy data
n = c(2, 3, 5,10)
s = c("aa;bb;cc", "bb;dd;aa", "NA","xx;nn")
b = c("aa;bb;cc", "bb;dd;cc", "zz;bb;yy","NA")
t = c("aa;bb;cc", "bb;dd", "kk","NA")
df = data.frame(n, s, b,t)
> df
n s b t
1 2 aa;bb;cc aa;bb;cc aa;bb;cc
2 3 bb;dd;aa bb;dd;cc bb;dd
3 5 NA zz;bb;yy kk
4 10 xx;nn NA NA
The expected output is
> df
n finalcol
1 2 aa;bb;cc
2 3 bb;dd;aa;cc
3 5 zz;bb;yy;kk
4 10 xx;nn
What I have so far performs a simple merge:
dff = df %>% unite(finalcol, c(s,b,t), sep = ";", remove = TRUE)

Since you mentioned unite, I want to show a solution using separate, the complement of unite.
This solution stays within the tidyverse, which makes it easy to follow step by step. @d.b's answer in the comments works perfectly, is compact, and probably runs faster, but it has a steeper learning curve. With a piped tidyverse solution, you can run each line and inspect the intermediate result.
This solution first separates the terms, then converts the data from wide to long data format with gather, so that we can do operations such as check for and handle NAs and "NA"s, drop_na, and then distinct, to get unique values only (per group with the same "id" i.e. items from the same original line). Then, it uses summarise and paste to go back to the original format, but could also use spread then unite. (Note that na.rm=TRUE is an upcoming feature of unite https://github.com/tidyverse/tidyr/issues/203)
Sources: I used these handy dplyr and tidyr reference sheets:
https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf
https://github.com/rstudio/cheatsheets/raw/master/data-import.pdf and I also worked out the solution based on the comments, questions, and answers here: How do I remove NAs with the tidyr::unite function?
# Load packages and data
library(tidyverse)
df = data.frame(n = c(2, 3, 5,10),
s = c("aa;bb;cc", "bb;dd;aa", "NA","xx;nn"),
b = c("aa;bb;cc", "bb;dd;cc", "zz;bb;yy","NA"),
t = c("aa;bb;cc", "bb;dd", "kk", NA))
# Solution
dff <- df %>%
separate(col = "s", into = c("s1", "s2", "s3")) %>%
separate(col = "b", into = c("b1", "b2", "b3")) %>%
separate(col = "t", into = c("t1", "t2", "t3")) %>% # Solution here could be enhanced to take in n columns and put them into however many columns as needed, using map or apply.
rowid_to_column('id') %>%
gather(key, value, -(id:n)) %>%
mutate_at(vars(value), na_if, "NA") %>%
drop_na(value) %>%
group_by(id) %>%
distinct(value, .keep_all = TRUE) %>%
summarise(n = first(n), finalcol = paste(value, collapse = ';')) %>%
ungroup() %>%
select(-id)
#> Warning: Expected 3 pieces. Missing pieces filled with `NA` in 2 rows [3,
#> 4].
#> Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [4].
#> Warning: Expected 3 pieces. Missing pieces filled with `NA` in 2 rows [2,
#> 3].
dff
#> # A tibble: 4 x 2
#> n finalcol
#> <dbl> <chr>
#> 1 2 aa;bb;cc
#> 2 3 bb;dd;aa;cc
#> 3 5 zz;bb;yy;kk
#> 4 10 xx;nn
Created on 2019-03-26 by the reprex package (v0.2.1)
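For comparison, a compact base R variant of the same idea (split on ";", drop NAs and "NA" strings, keep unique values, paste back together) might look like the sketch below. It is not necessarily what @d.b posted in the comments; the helper name merge_unique is made up for illustration.

```r
# Same dummy data as above; stringsAsFactors = FALSE keeps the columns as
# character on R versions before 4.0.
df <- data.frame(
  n = c(2, 3, 5, 10),
  s = c("aa;bb;cc", "bb;dd;aa", "NA", "xx;nn"),
  b = c("aa;bb;cc", "bb;dd;cc", "zz;bb;yy", "NA"),
  t = c("aa;bb;cc", "bb;dd", "kk", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: takes the cell values of one row, splits them on `sep`,
# drops real NAs and literal "NA" strings, and pastes the unique pieces back.
merge_unique <- function(..., sep = ";") {
  parts <- unlist(strsplit(c(...), sep, fixed = TRUE))
  parts <- parts[!is.na(parts) & parts != "NA"]
  paste(unique(parts), collapse = sep)
}

# mapply() walks the three columns in parallel, one row at a time.
df$finalcol <- mapply(merge_unique, df$s, df$b, df$t, USE.NAMES = FALSE)
result <- df[, c("n", "finalcol")]
result
```

Because merge_unique accepts any number of arguments, the same call would extend to more than three columns by passing them to mapply().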

Related

How to join multiple columns together on blanks of one column in R

This is my dataframe:
df <- data.frame(option_1 = c("Box 1", "", ""), option_2 = c("", 4, ""), Width = c("","",3))
I want to get this data frame:
option_1
1 Box 1
2 4
3 3
I'm doing this on a much bigger dataframe with 5+ columns I'm merging on blanks with respect to the option_1 column. I have tried using coalesce, but some of the columns won't "merge" on the blanks. For example:
df %>%
mutate(option_value_1 = coalesce(option_value_1, option_value_2, option_value_3, option_value_4, option_value_5, option_value_6, option_value_7))
option_value_5 wouldn't come together with option_value_1 on the blanks, but the other option values did. Should I put the vectors in a list then use coalesce?
We convert the blank ("") to NA and coalesce with the big-bang (!!!) splice operator. According to ?"!!!"
The big-bang operator !!! forces-splice a list of objects. The elements of the list are spliced in place, meaning that they each become one single argument.
library(dplyr)
df %>%
na_if("") %>%
transmute(option_1 = coalesce(!!! .))
-output
option_1
1 Box 1
2 4
3 3
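One caveat, as far as I know: recent dplyr releases (1.1.0 and later) are stricter about na_if() and may reject a whole data frame as its first argument. A minimal sketch that applies na_if() column by column instead, and then splices the cleaned columns into coalesce(), could look like this (the names cleaned and result are only for illustration):

```r
library(dplyr)

df <- data.frame(option_1 = c("Box 1", "", ""),
                 option_2 = c("", 4, ""),
                 Width    = c("", "", 3),
                 stringsAsFactors = FALSE)

# Replace "" with NA one column at a time; all columns here are character.
cleaned <- df %>%
  mutate(across(everything(), ~ na_if(.x, "")))

# Splice the columns into coalesce() so each becomes one argument;
# the first non-missing value per row wins.
result <- data.frame(option_1 = coalesce(!!!as.list(cleaned)))
result
```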
If we are interested only in the 'option' columns, subset the columns (we can also use invoke with coalesce):
library(purrr)
df %>%
na_if("") %>%
mutate(option_1 = invoke(coalesce,
across(starts_with("option"))), .keep = "unused")
With a base R approach:
df <- data.frame(option_1 = apply(df, 1, \(x) paste(x, collapse = "")))
df
#> option_1
#> 1 Box 1
#> 2 4
#> 3 3
Or using tidyverse:
df %>%
rowwise %>%
transmute(option_1 = str_c(c_across(everything()), collapse = "")) %>%
ungroup

Sum duplicated columns in dataframe in R

Hello, I have the following dataframe:
colnames(tv_viewing_time) <- c("channel_1", "channel_2", "channel_1", "channel_2")
Each row gives the viewing time for an individual on channel 1 and channel 2. For instance, for individual 1 I get:
tv_viewing_time[1,] <- c(1,2,4,5)
What I would like is actually a dataframe that sums up the values of duplicated columns.
I.e. I would get
colnames(tv_viewing_time) <- c("channel_1", "channel_2")
Where for instance for individual 1 i would get :
tv_viewing_time[1,] <- c(5,7)
As all two row entries are summed when they correspond to duplicated column names.
I have looked for an answer but all suggested on other threads did not work for my dataframe case.
Note that there are many more duplicated columns, so i am looking for a solution that can be efficiently applied to all my duplicates.
We could use split.default with rowSums
sapply(split.default(tv_viewing_time,
sub("\\.\\d+$", "", names(tv_viewing_time))), rowSums)
-output
# channel_1 channel_2
# 5 7
Or using tidyverse
library(dplyr)
library(tidyr)
library(stringr)
tv_viewing_time %>%
pivot_longer(cols = everything()) %>%
group_by(name = str_remove(name, "\\.\\d+$")) %>%
summarise(value = sum(value)) %>%
pivot_wider(names_from = name, values_from = value)
# A tibble: 1 x 2
# channel_1 channel_2
# <dbl> <dbl>
#1 5 7
data
tv_viewing_time <- data.frame(channel_1 = 1, channel_2 = 2,
channel_1 = 4, channel_2 = 5)
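The regex in both answers is needed because data.frame() makes duplicated names unique by default (channel_1.1, channel_2.1), so the suffix has to be stripped before grouping. If the data really carries identical column names, a sketch along the following lines, assuming the frame was built with check.names = FALSE, can group on the names directly. The second row of values is invented for illustration.

```r
# check.names = FALSE keeps the duplicated column names exactly as given.
tv_viewing_time <- data.frame(channel_1 = c(1, 10), channel_2 = c(2, 20),
                              channel_1 = c(4, 40), channel_2 = c(5, 50),
                              check.names = FALSE)

# split.default() splits the data frame column-wise by name;
# rowSums() then adds up the columns within each group.
summed <- sapply(split.default(tv_viewing_time, names(tv_viewing_time)), rowSums)
summed
```

With more than one row, sapply() returns a matrix with one column per distinct channel name; wrap it in as.data.frame() if a data frame is needed.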

Map readr::type_convert to specific columns only

readr::type_convert guesses the class of each column in a data frame. I would like to apply type_convert to only some columns in a data frame (to preserve other columns as character). MWE:
# A data frame with multiple character columns containing numbers.
df <- data.frame(A = letters[1:10],
B = as.character(1:10),
C = as.character(1:10))
# This works
df %>% type_convert()
Parsed with column specification:
cols(
A = col_character(),
B = col_double(),
C = col_double()
)
A B C
1 a 1 1
2 b 2 2
...
However, I would like to only apply the function to column B (this is a stylised example; there may be multiple columns to try and convert). I tried using purrr::map_at as well as sapply, as follows:
# This does not work
map_at(df, "B", type_convert)
Error in .f(.x[[i]], ...) : is.data.frame(df) is not TRUE
# This does not work
sapply(df["B"], type_convert)
Error in FUN(X[[i]], ...) : is.data.frame(df) is not TRUE
Is there a way to apply type_convert selectively to only some columns of a data frame?
Edit: @ekoam provides an answer for type_convert. However, applying this answer to many columns would be tedious. It might be better to use the base::type.convert function, which can be mapped:
purrr::map_at(df, "B", type.convert) %>%
bind_cols()
# A tibble: 10 x 3
A B C
<chr> <int> <chr>
1 a 1 1
2 b 2 2
Try this:
df %>% type_convert(cols(B = "?", C = "?", .default = "c"))
This guesses the types of B and C; any other character column stays as is (.default = "c"). The tricky part is that if any column is not of a character type, then type_convert will also leave it as is. So if you really have to use type_convert, you may first have to convert all columns to character.
type_convert does not seem to support it. One trick which I have used a few times is using combination of select & bind_cols as shown below.
df %>%
select(B) %>%
type_convert() %>%
bind_cols(df %>% select(-B))
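Another option that keeps the original column order is to combine base::type.convert with dplyr's across(). This is only a sketch, assuming dplyr 1.0.0 or later; as.is = TRUE is passed explicitly because newer R versions warn when it is left unspecified.

```r
library(dplyr)

df <- data.frame(A = letters[1:10],
                 B = as.character(1:10),
                 C = as.character(1:10),
                 stringsAsFactors = FALSE)

# Convert only the selected columns; list more inside c() as needed.
result <- df %>%
  mutate(across(c(B), ~ type.convert(.x, as.is = TRUE)))

str(result)
```

Here only B is converted (to integer), while A and C remain character.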

adding values using rowSums and tidyverse

I am having some issues trying to sum a bunch of columns in R. I am analyzing a huge dataset, so I am reproducing a sample of fake data.
Here's what the data looks like (I have 800 columns).
library(data.table)
dataset <- data.table(name = c("A", "B", "C", "D"), a1 = 1:4, a2 = c(1,2,NaN,5), a3 = 1:4, a4 = 1:4, a5 = c(1,2,NA,5), a6 = 1:4, a8 = 1:4)
dataset
What I want to do is sum the columns in buckets of 100 columns so, for example, all the values in the first row between the first column and the column 100, all the values in the first row between the column 1 and the column 200, all the values in the second row between the first column and the column 100, etc.
Using the sample data I've come with this solution using rowSums.
dataset %>%
mutate_if(~!is.numeric(.x), as.numeric) %>%
mutate_all(funs(replace_na(., 0))) %>%
mutate(sum = rowSums(.[,paste("a", 1:3, sep="")])) %>%
mutate(sum1 = rowSums(.[,paste("a", 4:5, sep="")])) %>%
mutate(sum2 = rowSums(.[,paste("a", 6:8, sep="")]))
but I am getting the following error:
Error in `[.data.frame`(., , paste("a", 6:8, sep = "")) : undefined columns selected
as the data does not include column a7.
The original data is missing a bunch of columns between a1 and a800 so solving this would be key to make it work.
What would it be the best way to approach and solve this error?
Also, I have a few more questions regarding the code I've written:
Is there a smarter way to select the column a1 and a100 instead of using this approach .[,paste("a", 1:3, sep="")]? I am interested in selected the column by name. I do not want to select it by the position of the column because sometimes a100 does not mean that is the column 100.
Also, I am converting the NAs and the NaNs to 0 in order to be able to sum the rows. I am doing it this way mutate_all(funs(replace_na(., 0))), losing my first row than contains the names of the values. What would it be the best way to replace NA and NaN without mutating the string values of the first row to 0?
The type of the columns I am adding is integer as I converted them beforehand mutate_if(~!is.numeric(.x), as.numeric) . Should I follow the same approach in case I have dbl?
Thank you!
Here is one way to do this. After transforming the data to a longer format, we create groups of n rows for each name and take the sum.
library(dplyr)
library(tidyr)
n <- 2 #No of columns to bucket. Change this to 100 for your case.
dataset %>%
pivot_longer(cols = -name, names_to = 'col') %>%
group_by(name) %>%
group_by(grp = rep(seq_len(n()), each = n, length.out = n()), .add = TRUE) %>% # .add needs dplyr >= 1.0.0; older versions used add = TRUE
summarise(value = sum(value, na.rm = TRUE)) %>%
#If needed in wider format again
pivot_wider(names_from = grp, values_from = value, names_prefix = 'col')
# name col1 col2 col3 col4
# <chr> <dbl> <dbl> <dbl> <dbl>
#1 A 2 2 2 1
#2 B 4 4 4 2
#3 C 3 6 3 3
#4 D 9 8 9 4
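The approach above buckets columns by position. Since the question asks to select by name because some columns (such as a7) are missing, a variant that derives the bucket from the number in the column name might look like this sketch. It assumes every value column is named "a" followed by an integer; bucket_size and the idx/bucket column names are illustrative, and a plain data.frame stands in for the data.table.

```r
library(dplyr)
library(tidyr)

dataset <- data.frame(name = c("A", "B", "C", "D"),
                      a1 = 1:4, a2 = c(1, 2, NaN, 5), a3 = 1:4, a4 = 1:4,
                      a5 = c(1, 2, NA, 5), a6 = 1:4, a8 = 1:4,
                      stringsAsFactors = FALSE)

bucket_size <- 3 # Illustrative; would be 100 for the real data.

result <- dataset %>%
  pivot_longer(cols = -name, names_to = "col") %>%
  # Take the number from the column name, so a8 lands in the a7..a9 bucket
  # even though a7 and a9 do not exist.
  mutate(idx = as.integer(sub("^a", "", col)),
         bucket = (idx - 1) %/% bucket_size + 1) %>%
  group_by(name, bucket) %>%
  # na.rm = TRUE skips both NA and NaN, so no replace_na() step is needed
  # and the name column is never touched.
  summarise(value = sum(value, na.rm = TRUE), .groups = "drop")

result
```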

Return column names based on condition

I have a dataset with 18 columns from which I need to return the names of the column(s) with the highest value for each observation; a simple example is below. I came across this answer, and it almost does what I need, but in some cases I need to combine the names (like ab in maxcol below). How should I do this?
Any suggestions would be greatly appreciated! If it's possible it would be easier for me to understand a tidyverse based solution as I'm more familiar with that than base.
Edit: I forgot to mention that some of the columns in my data have NAs.
library(dplyr, warn.conflicts = FALSE)
#turn this
Df <- tibble(a = 4:2, b = 4:6, c = 3:5)
#into this
Df <- tibble(a = 4:2, b = 4:6, c = 3:5, maxcol = c("ab", "b", "b"))
Created on 2018-10-30 by the reprex package (v0.2.1)
Continuing from the answer in the linked post, we can do
Df$maxcol <- apply(Df, 1, function(x) paste0(names(Df)[x == max(x)], collapse = ""))
Df
# a b c maxcol
# <int> <int> <int> <chr>
#1 4 4 3 ab
#2 3 5 4 b
#3 2 6 5 b
For every row, we check which position has max values and paste the names at that position together.
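Since the question's edit mentions NAs, note that max(x) returns NA as soon as a row contains a missing value. A hedged sketch of an NA-tolerant variant is shown below; the helper name max_names and the example data with NAs are made up for illustration.

```r
# Example data with missing values (invented for illustration).
Df <- data.frame(a = c(4, 3, NA), b = c(4, 5, 6), c = c(3, NA, 5))

# Hypothetical helper: names of the entries equal to the row maximum,
# ignoring NAs. which() drops the NA comparisons.
max_names <- function(x) {
  paste0(names(x)[which(x == max(x, na.rm = TRUE))], collapse = "")
}

# apply() passes each row as a named numeric vector.
Df$maxcol <- apply(Df, 1, max_names)
Df
```

A row consisting only of NAs would still need separate handling, because max() with na.rm = TRUE then returns -Inf with a warning.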
If you prefer the tidyverse approach
library(tidyverse)
Df %>%
mutate(row = row_number()) %>%
gather(values, key, -row) %>%
group_by(row) %>%
mutate(maxcol = paste0(values[key == max(key)], collapse = "")) %>%
spread(values, key) %>%
ungroup() %>%
select(-row)
# maxcol a b c
# <chr> <int> <int> <int>
#1 ab 4 4 3
#2 b 3 5 4
#3 b 2 6 5
We first convert dataframe from wide to long using gather, then group_by each row we paste column names for max key and then spread the long dataframe to wide again.
Here's a solution I found that loops through column names in case you find it hard to wrap your head around spread/gather (pivot_wider/longer)
out_df <- Df %>%
# calculate rowwise maximum
rowwise() %>%
mutate(rowmax = max(c_across(everything()))) %>%
# create empty maxcol column
mutate(maxcol = "")
# loop through column names
for (colname in colnames(Df)) {
out_df <- out_df %>%
# if the value at the specified column name is the maximum, paste it to the maxcol
mutate(maxcol = ifelse(.data[[colname]] == rowmax, paste0(maxcol, colname), maxcol))
}
# remove rowmax column if no longer needed
out_df <- out_df %>%
select(-rowmax)
