Collapse rows based on another column duplicate value in R

I have a df:
A <- c("A", "A123", "A123", "B123", "B123", "B")
B <- c("NA", "as", "bp", "df", "kl", "c")
df <- data.frame(A, B)
and I would like to create a df in which the output would be
A <- c("A", "A123", "B123", "B")
C <- c("NA", "as;bp", "df;kl", "c")
df2 <- data.frame(A,C)
The new column C is built from column A: wherever A has duplicates, the matching values in column B are combined into one string, and values in B that correspond to a single/unique value in A are carried over to column C unchanged.
Any help with code that produces column C would be appreciated, as I don't even know where to begin.
Thank you!

Use the tidyverse with reframe (dplyr 1.1.0 or later) to paste the non-missing 'B' values for each 'A' group; if all values in a group are missing, return the B values unchanged
library(dplyr)
library(stringr)
df %>%
  reframe(C = if (all(is.na(B))) B else
    str_c(B[complete.cases(B)], collapse = ";"), .by = "A")
-output
A C
1 A NA
2 A123 as;bp
3 B123 df;kl
4 B c

Using aggregate and paste
setNames(aggregate(. ~ A, df, paste0, collapse=";"), c("A", "C"))
A C
1 A NA
2 A123 as;bp
3 B c
4 B123 df;kl
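If the original group order matters (aggregate sorts the groups, as the output above shows), the same collapse can be sketched in base R with tapply. This is only an illustration on the question's sample data; the helper name grp is made up here, and setting the factor levels from unique(df$A) is an assumption about the order you want.

```r
A <- c("A", "A123", "A123", "B123", "B123", "B")
B <- c("NA", "as", "bp", "df", "kl", "c")
df <- data.frame(A, B, stringsAsFactors = FALSE)

# grp is a hypothetical helper: levels follow first appearance in A,
# so the result is not re-sorted alphabetically
grp <- factor(df$A, levels = unique(df$A))

# paste each group's B values together, separated by ";"
C <- tapply(df$B, grp, paste, collapse = ";")

# as.vector() drops the names that tapply attaches to its result
df2 <- data.frame(A = levels(grp), C = as.vector(C), stringsAsFactors = FALSE)
df2
```

Groups with a single row pass through paste unchanged, so no special case is needed for the unique values of A.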

Related

Merge two datasets using multiple column checks

I have two data frames, with one ID variable in the first df ("ID") and three in the second df ("SIC", "Ur", "Sonst"). I am trying to merge these two datasets by checking whether the "ID" in the first df matches SIC, Ur, or Sonst in the respective row. Here is my reproducible example:
df1 <- data.frame(ID = c("A", "B", "C","D"),
Value=c(1:4))
df2 <- data.frame(SIC = c("B", NA,NA,NA,NA,NA),
Ur = c(NA, "C", NA,NA,NA,NA),
Sonst=c(NA,NA,"A",NA,NA,NA),
Age=c(14:19))
I want a final df that keeps every ID and all the information from the first df (as it is the target df), plus the corresponding Age wherever ID matches SIC, Ur, or Sonst. I have tried dplyr and merge approaches but did not come up with a proper solution. I'm thankful for any suggestions.
An approach using dplyr and left_join with tidyr's unite
library(dplyr)
library(tidyr)
left_join(df1, df2 %>% unite("ID", SIC:Sonst, na.rm = TRUE))
Joining with `by = join_by(ID)`
ID Value Age
1 A 1 16
2 B 2 14
3 C 3 15
4 D 4 NA
Or use an inner_join if you only want A, B and C to show up
inner_join(df1, df2 %>% unite("ID", SIC:Sonst, na.rm = TRUE))
Joining with `by = join_by(ID)`
ID Value Age
1 A 1 16
2 B 2 14
3 C 3 15
The most convenient way is probably the tidyverse approach nicely shown by Andre Wildberg. You can also do it with the base R merge() function, but in your case you first need nested ifelse() calls to put the non-missing value from the three df2 columns into a single column:
df2$ID <- ifelse(!is.na(df2$SIC), df2$SIC,
ifelse(!is.na(df2$Ur), df2$Ur, df2$Sonst))
Merge the two dfs:
df3 <- merge(df1, df2, by= "ID", all.x = TRUE)
Discard unwanted columns from merged data (df3):
df3 <- df3[, c("ID", "Value", "Age")]
df3
ID Value Age
A 1 16
B 2 14
C 3 15
D 4 NA
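The nested ifelse() gets long if more key columns are added. A minimal base R sketch of the same idea, assuming each df2 row carries at most one non-missing key and each ID appears at most once in df2 (first_non_na and key are hypothetical names, not part of any package):

```r
df1 <- data.frame(ID = c("A", "B", "C", "D"), Value = 1:4,
                  stringsAsFactors = FALSE)
df2 <- data.frame(SIC   = c("B", NA, NA, NA, NA, NA),
                  Ur    = c(NA, "C", NA, NA, NA, NA),
                  Sonst = c(NA, NA, "A", NA, NA, NA),
                  Age   = 14:19,
                  stringsAsFactors = FALSE)

# first_non_na is a hypothetical helper: for each row, take the first
# non-missing value among the vectors passed in
first_non_na <- function(...) {
  Reduce(function(a, b) ifelse(is.na(a), b, a), list(...))
}

key <- first_non_na(df2$SIC, df2$Ur, df2$Sonst)

# match() gives the position of each df1$ID in key, or NA if absent
df1$Age <- df2$Age[match(df1$ID, key)]
df1
```

Because match() only returns the first hit, this is a lookup rather than a full join; if an ID can occur in several df2 rows, merge() or left_join() is the safer choice.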

R coalesce two columns but keep both values if not NA

I have a dataframe with two columns of related data. I want to create a third column that combines them, but there are lots of NAs in one or both columns. If both columns have a non-NA value, I want the new third column to paste both values. If either of the first two columns has an NA, I want the third column to contain just the non-NA value. An example with a toy data frame is below:
x <- c("a", NA, "c", "d")
y <- c("l", "m", NA, "o")
df <- data.frame(x, y)
# this is the new column I want to produce from columns x and y above
df$z <- c("al", "m", "c", "do")
I thought coalesce would solve my problem, but I can't find a way to keep both values if there is a value in both columns. Thanks in advance for any assistance.
One possible solution (note that gsub also removes any literal "NA" text inside real values):
df$z <- gsub("NA", "", paste0(df$x, df$y))
Another possible solution:
library(dplyr)
df %>%
mutate(z = ifelse(is.na(x) | is.na(y), coalesce(x,y), paste0(x,y)))
An option with unite
library(tidyr)
library(dplyr)
df %>%
unite(z, everything(), na.rm = TRUE, sep = "", remove = FALSE)
z x y
1 al a l
2 m <NA> m
3 c c <NA>
4 do d o
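For completeness, a base R sketch that avoids pattern matching on the text "NA" by blanking the missing values before pasting. It uses the toy data from the question; keeping NA when both columns are missing is an assumption about the desired behaviour, since the question does not cover that case.

```r
x <- c("a", NA, "c", "d")
y <- c("l", "m", NA, "o")
df <- data.frame(x, y, stringsAsFactors = FALSE)

# replace NA with "" before pasting, so literal "NA" text in real
# values is never touched
df$z <- paste0(ifelse(is.na(df$x), "", df$x),
               ifelse(is.na(df$y), "", df$y))

# assumption: a row with NA in both columns should stay NA, not ""
df$z[is.na(df$x) & is.na(df$y)] <- NA
df
```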

mutate and if_any with condition over multiple columns

I tried to combine mutate, case_when and if_any to create a variable = 1 if any of the variables whose name begins with "string" is equal to a specific string.
I can't figure out what I'm missing in the combination of these conditions.
I'm trying:
df <-data.frame(string1= c("a","b", "c"), string2= c("d", "a", "f"), string3= c("a", "d", "c"), id= c(1,2,3))
df <- df%>%
mutate(cod = case_when(if_any(starts_with("string") == "a" ~1 )))
The syntax was slightly wrong, but you were close. Note that if_any works like across, i.e. if_any(columns, condition), and the condition should be given as a function: function(x), \(x), or a ~ formula.
library(dplyr)
df %>%
  mutate(cod = case_when(if_any(starts_with("string"), ~ .x == "a") ~ 1))
string1 string2 string3 id cod
1 a d a 1 1
2 b a d 2 1
3 c f c 3 NA
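If it helps to see what if_any is doing, here is a rough base R equivalent on the same data: test every "string" column against "a", then check row by row whether any test was TRUE. The names cols and hit are made up for this sketch.

```r
df <- data.frame(string1 = c("a", "b", "c"),
                 string2 = c("d", "a", "f"),
                 string3 = c("a", "d", "c"),
                 id = c(1, 2, 3),
                 stringsAsFactors = FALSE)

# logical index of the columns whose name begins with "string"
cols <- startsWith(names(df), "string")

# df[cols] == "a" is a logical matrix; a row sum above 0 means at
# least one column in that row equals "a"
hit <- unname(rowSums(df[cols] == "a") > 0)

# 1 where any column matched, NA otherwise (same as the case_when above)
df$cod <- ifelse(hit, 1, NA)
df
```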

How to rename multiple Columns in R?

My goal is to get a concise way to rename multiple columns in a data frame. Let's consider a small data frame df as below:
df <- data.frame(a=1, b=2, c=3)
df
Let's say we want to change the names from a, b, and c to Y, Z, and E respectively.
Defining a character vector containing old names and new names.
df_names <- c(Y = "a", Z = "b", E = "c")
I would use this to rename the columns,
rename(df, !!!names)
df
suggestions?
One more !:
df <- data.frame(a=1, b=2, c=3)
df_names <- c(Y = "a", Z ="b", E = "c")
library(dplyr)
df %>% rename(!!!df_names)
## Y Z E
##1 1 2 3
A non-tidy way might be through match:
names(df) <- names(df_names)[match(names(df), df_names)]
df
## Y Z E
##1 1 2 3
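One caveat with the match approach: match returns NA for any column that is not in df_names, so those columns would end up named NA. A small sketch that renames only the columns found in the lookup; the extra column d is added here purely for illustration, and idx and found are made-up names.

```r
# d is an extra column (not in the question) that has no new name
df <- data.frame(a = 1, b = 2, c = 3, d = 4)
df_names <- c(Y = "a", Z = "b", E = "c")

# position of each current name in the lookup, NA if it is not listed
idx <- match(names(df), df_names)
found <- !is.na(idx)

# rename only the columns that have an entry in the lookup
names(df)[found] <- names(df_names)[idx[found]]
names(df)
```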
You could try:
sample(LETTERS[which(LETTERS %in% names(df) == FALSE)], size= length(names(df)), replace = FALSE)
[1] "S" "D" "N"
Here you don't really care what the new names are, since you're using sample. Otherwise, use a straightforward names(df) <- c('name1', 'name2'...

extract a column in dataframe based on condition for another column R

I want to extract a column from a dataframe in R based on a condition for another column in the same dataframe, the dataframe is given below.
b <- c(1,2,3,4)
g <- c("a", "b" ,"b", "c")
df <- data.frame(b,g)
row.names(df) <- c("aa", "bb", "cc" , "dd")
I want to extract all values for column b as a dataframe (with rownames) where column g has value 'b'.
My required output is given below:
df
b
bb 2
cc 3
I have tried several methods like which or subset, but they did not work. I also searched Stack Overflow for an answer but could not find one. Is there a way to do it?
Thanks,
You can use the subset function in base R -
subset(df, g == 'b', select = b)
# b
#bb 2
#cc 3
Using data.table
library(data.table)
setDT(df, key = 'g')['b', .(b)]
b
1: 2
2: 3
Or with collapse
library(collapse)
sbt(df, g == 'b', b)
b
1 2
2 3
This is the basic way of slicing data in R
df[df$g == 'b',]['b']
Or the tidyverse answer
library(dplyr)
df %>%
  filter(g == 'b') %>%
  select(b)
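The base R slicing can also be written as a single bracket call. A minimal sketch on the question's data; drop = FALSE is what keeps the result a data frame (with its row names) instead of collapsing the single column to a vector. The name res is only for this example.

```r
b <- c(1, 2, 3, 4)
g <- c("a", "b", "b", "c")
df <- data.frame(b, g)
row.names(df) <- c("aa", "bb", "cc", "dd")

# rows where g equals "b", column b only;
# drop = FALSE stops the single column from collapsing to a plain vector
res <- df[df$g == "b", "b", drop = FALSE]
res
```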
