I am trying to merge rows that belong to the same group, which I pieced together from several Stack Overflow questions (Question1, Question2, Question3). Those questions cover what I want, except that my data frame also contains some empty fields and I don't want to merge those. I only want to merge the similar/duplicate rows based on Col1 that contain values, not the ones that are empty or NA. I use the code below, but it also merges the cells that are empty or NA.
merge_my_rows <- df %>%
  group_by(Col1) %>%
  summarise(Col2 = paste(Col2, collapse = ","))
Below are the sample df and the output df that I want.
Col1   Col2
F212   ALICE
D23    John
C64    NA
F212   BOB
C64    NA
D23    JohnY
D19    Marquis
Output df

Col1   Col2
F212   ALICE, BOB
D23    John, JohnY
C64    NA
C64    NA
D19    Marquis
You can add a helper grouping column, na.grp, that gives each NA in Col2 a unique number and all non-NA elements a common number (0), so that NA rows are never collapsed together.
library(dplyr)
df %>%
  group_by(Col1, na.grp = ifelse(is.na(Col2), cumsum(is.na(Col2)), 0)) %>%
  summarise(Col2 = paste(Col2, collapse = ", "), .groups = "drop") %>%
  select(-na.grp)
# # A tibble: 5 × 2
# Col1 Col2
# <chr> <chr>
# 1 C64 NA
# 2 C64 NA
# 3 D19 Marquis
# 4 D23 John, JohnY
# 5 F212 ALICE, BOB
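If the original appearance order of Col1 matters (as in the desired output above), one option, sketched here rather than required, is to append an arrange() on match() against the input order:

df %>%
  group_by(Col1, na.grp = ifelse(is.na(Col2), cumsum(is.na(Col2)), 0)) %>%
  summarise(Col2 = paste(Col2, collapse = ", "), .groups = "drop") %>%
  select(-na.grp) %>%
  # restore first-appearance order of Col1 (group_by sorts groups alphabetically)
  arrange(match(Col1, unique(df$Col1)))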
Data
df <- read.table(text = "
Col1 Col2
F212 ALICE
D23 John
C64 NA
F212 BOB
C64 NA
D23 JohnY
D19 Marquis", header = TRUE)
Using reframe
library(dplyr)
df %>%
  reframe(Col2 = if (all(is.na(Col2))) Col2 else toString(Col2[!is.na(Col2)]),
          .by = "Col1")
-output
Col1 Col2
1 F212 ALICE, BOB
2 D23 John, JohnY
3 C64 <NA>
4 C64 <NA>
5 D19 Marquis
Related
ID <- c("IDa", "IDb","IDc","IDe","IDd","IDe")
names1 <- c("robin", "bob", "eric", "charlie", "robin", "gabby")
matrix1 <- matrix(names1, 1, 6)
colnames(matrix1) <- c("IDa", "IDb", "IDc","IDe", "IDd", "IDe")
This is the output:

IDa    IDb   IDc    IDe       IDd    IDe
robin  bob   eric   charlie   robin  gabby
But I want it to look like this:

IDa    IDb   IDc    IDe       IDd
robin  bob   eric   charlie   robin
                    gabby
We may split by ID and then cbind after padding each element with NA up to the maximum length:

lst1 <- split(names1, ID)
do.call(cbind, lapply(lst1, `length<-`, max(lengths(lst1))))
-output
IDa IDb IDc IDd IDe
[1,] "robin" "bob" "eric" "robin" "charlie"
[2,] NA NA NA NA "gabby"
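As an aside on the padding step: the replacement function length<- extends a vector to the requested length, filling the new positions with NA; a minimal illustration (x is just a throwaway example vector):

x <- c("charlie", "gabby")
`length<-`(x, 3)
# [1] "charlie" "gabby"   NA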
Another option:
library(reshape2)
library(tidyverse)
melt(matrix1) %>%
  select(-Var1) %>%
  group_by(Var2) %>%
  mutate(id = row_number()) %>%
  pivot_wider(names_from = Var2, values_from = value) %>%
  select(-id)
IDa IDb IDc IDe IDd
<chr> <chr> <chr> <chr> <chr>
1 robin bob eric charlie robin
2 NA NA NA gabby NA
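For what it's worth, here is a sketch of the same reshape without reshape2, building a long data frame from the ID and names1 vectors directly (assumes tidyr >= 1.0):

library(dplyr)
library(tidyr)
tibble(ID, names1) %>%
  group_by(ID) %>%
  mutate(row = row_number()) %>%   # index the duplicates within each ID
  ungroup() %>%
  pivot_wider(names_from = ID, values_from = names1) %>%
  select(-row)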
I want to divide both numeric columns by 10, but not the third, character column.
current dataframe:
col1 col2 col3
100 10 cat
200 20 dog
300 30 NA
desired:
col1 col2 col3
10 1 cat
20 2 dog
300 30 NA
my current code, which doesn't take col3 into account:

DB <- BD %>% mutate(col1 = col1 / 10) %>% mutate(col2 = col2 / 10)
Please help with a solution. Thank you
Here is an idea via dplyr:
library(dplyr)
dat %>%
  mutate_at(vars(-3), list(~ ifelse(!is.na(col3), ./10, .)))
# col1 col2 col3
#1 10 1 cat
#2 20 2 dog
#3 300 30 <NA>
Using base R:
ok  <- !is.na(dat$col3)
num <- sapply(dat, is.numeric)
dat[ok, num] <- dat[ok, num]/10
dat
# col1 col2 col3
# 1 10 1 cat
# 2 20 2 dog
# 3 300 30 <NA>
Data:
dat <- read.table(header=T, text="col1 col2 col3
100 10 cat
200 20 dog
300 30 NA")
Try it this way:
library(tidyverse)
dat %>%
  pivot_longer(-col3) %>%
  mutate(value = ifelse(!is.na(col3), value / 10, value)) %>%
  pivot_wider(id_cols = col3, names_from = name, values_from = value)
In base R, we can do the assignment directly with a logical index:
dat[!is.na(dat$col3), 1:2] <- dat[!is.na(dat$col3), 1:2]/10
Or using data.table, updating only the rows where col3 is not NA:
library(data.table)
setDT(dat)[!is.na(col3), (1:2) := .SD/10, .SDcols = 1:2]
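For completeness, a sketch of the same operation with across() on newer dplyr (>= 1.0), using the dat from the Data block above:

library(dplyr)
dat %>%
  # divide every numeric column by 10, but only on rows where col3 is not NA
  mutate(across(where(is.numeric), ~ ifelse(!is.na(col3), .x / 10, .x)))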
I would like to make a new column by appending to a list conditional on the values of other columns. If possible, I would like to do so in dplyr. Sample input and desired output is below.
Suppose a dataframe newdata:
col1 col2 col3 col4
dog cat NA NA
NA cat foo bar
dog NA NA NA
NA cat NA NA
Here is my desired output, with the new column newCol:
col1 col2 col3 col4 newCol
dog cat NA NA (dog, cat)
NA cat foo bar (cat, foo, bar)
dog NA NA NA (dog)
NA cat NA NA (cat)
I have tried using ifelse within mutate and case_when within mutate, but neither allows concatenation to a list. Here is my (unsuccessful) attempt with case_when:
newdata = newdata %>% mutate(
  newCol = case_when(
    col1 == "dog" ~ c("dog"),
    col2 == "cat" ~ c(newCol, "cat"),
    col3 == "foo" ~ c(newCol, "foo"),
    col4 == "bar" ~ c(newCol, "bar")
  )
)
I tried a similar approach with an ifelse statement for each column but also could not append to the list.
In the Note at the end we show the input data used here. It is as in the question except we have added a row of NAs at the end to show that all solutions work in that case too.
We show both list and character column solutions. The question specifically refers to list so this is the assumed desired output but if it was intended that newCol be a character vector then we show that as well.
This is easy enough to do with base functions that we show that first; however, we also redo it in the tidyverse, although that involves significantly more code.
1) base We can use apply like this:
reduce <- function(x) unname(x[!is.na(x)])
DF$newCol <- apply(DF, 1, reduce)
giving the following where newCol is a list whose first component is c("dog", "cat"), etc.
col1 col2 col3 col4 newCol
1 dog cat <NA> <NA> dog, cat
2 <NA> cat foo bar cat, foo, bar
3 dog <NA> <NA> <NA> dog
4 <NA> cat <NA> <NA> cat
5 <NA> <NA> <NA> <NA>
The last line of code could alternately be:
DF$newCol <- lapply(split(DF, 1:nrow(DF)), reduce)
The question refers to concatenating to a list so I assume that a list is wanted for newCol but if a string is wanted then use this for reduce instead:
reduce_ch <- function(x) sprintf("(%s)", toString(x[!is.na(x)]))
apply(DF, 1, reduce_ch)
2) tidyverse Using dplyr/tidyr/tibble we gather the data to long form, remove the NAs, nest it, sort it back into the original order and join it back onto DF.
library(dplyr)
library(tibble)
library(tidyr)
DF %>%
  rownames_to_column %>%
  gather(colName, Value, -rowname) %>%
  na.omit %>%
  select(-colName) %>%
  nest(Value, .key = newCol) %>%
  arrange(rowname) %>%
  left_join(cbind(DF %>% rownames_to_column), .) %>%
  select(-rowname)
giving:
col1 col2 col3 col4 newCol
1 dog cat <NA> <NA> dog, cat
2 <NA> cat foo bar cat, foo, bar
3 dog <NA> <NA> <NA> dog
4 <NA> cat <NA> <NA> cat
5 <NA> <NA> <NA> <NA> NULL
If character output is wanted then use this instead:
DF %>%
  rownames_to_column %>%
  gather(colName, Value, -rowname) %>%
  select(-colName) %>%
  group_by(rowname) %>%
  summarize(newCol = sprintf("(%s)", toString(na.omit(Value)))) %>%
  ungroup %>%
  { cbind(DF, .) } %>%
  select(-rowname)
giving:
col1 col2 col3 col4 newCol
1 dog cat <NA> <NA> (dog, cat)
2 <NA> cat foo bar (cat, foo, bar)
3 dog <NA> <NA> <NA> (dog)
4 <NA> cat <NA> <NA> (cat)
5 <NA> <NA> <NA> <NA> ()
Note
The input DF in reproducible form:
Lines <- "col1 col2 col3 col4
dog cat NA NA
NA cat foo bar
dog NA NA NA
NA cat NA NA
NA NA NA NA"
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE)
Solution using na.omit() and paste() with collapse argument:
apply(newdata, 1,
      function(x) paste0("(", paste(na.omit(x), collapse = ", "), ")"))
[1] "(dog, cat)" "(cat, foo, bar)" "(dog)" "(cat)"
This looks like a use case for tidyr::unite. You'll still need to do some dplyr cleanup at the end, but this should work for now.
library(tibble)
library(dplyr)
library(tidyr)
df <- tribble(~col1, ~col2, ~col3, ~col4,
              "dog", "cat", NA, NA,
              NA, "cat", "foo", "bar",
              "dog", NA, NA, NA,
              NA, "cat", NA, NA)

df %>%
  unite(newCol, col1, col2, col3, col4,
        remove = FALSE,
        sep = ', ') %>%
  # Replace NAs and "NA, "s with ''
  mutate(newCol = gsub('NA[, ]*', '', newCol)) %>%
  # Replace ', ' with '' if it is at the end of the line
  mutate(newCol = gsub(', $', '', newCol)) %>%
  # Add the parentheses on either side
  mutate(newCol = paste0('(', newCol, ')'))
#> # A tibble: 4 x 5
#> newCol col1 col2 col3 col4
#> <chr> <chr> <chr> <chr> <chr>
#> 1 (dog, cat) dog cat <NA> <NA>
#> 2 (cat, foo, bar) <NA> cat foo bar
#> 3 (dog) dog <NA> <NA> <NA>
#> 4 (cat) <NA> cat <NA> <NA>
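As a side note, on tidyr >= 1.0.0 unite() gained an na.rm argument, which would make the gsub() cleanup unnecessary; a hedged sketch with the same df:

df %>%
  unite(newCol, col1:col4, sep = ', ', remove = FALSE, na.rm = TRUE) %>%
  mutate(newCol = paste0('(', newCol, ')'))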
Also for what it's worth, other people are discussing this problem!
I am trying to implement an exclusive full join in R. The code below works correctly, but I am not sure it is the right approach, since the filter lists many conditions. This is only sample code, so it doesn't have many columns, but in the real scenario we have many columns, and adding them all to the filter would get unwieldy. Is there a better approach?
library(tidyverse)
persons = data.frame(
  name = c("Ponting", "Clarke", "Dave", "Bevan"),
  age = c(24, 32, 26, 29),
  col1 = c(1, 2, 3, 4),
  col2 = c("a", "z", "h", "p")
)
person_sports = data.frame(
  name = c("Ponting", "Dave", "Roshan"),
  sports = c("soccer", "tennis", "boxing"),
  rank = c(8, 4, 1),
  col3 = c("usa", "australia", "england"),
  col4 = c("a", "f1", "z2")
)
persons %>%
  full_join(person_sports, by = c("name")) %>%
  filter((is.na(age) & is.na(col1) & is.na(col2)) |
         (is.na(sports) & is.na(rank) & is.na(col3) & is.na(col4)))
Try using complete.cases. It returns a logical vector where FALSE indicates that at least one column in that row is NA.
persons %>% full_join(person_sports, by = c("name")) %>% .[!complete.cases(.), ]
# name age col1 col2 sports rank col3 col4
# 2 Clarke 32 2 z <NA> NA <NA> <NA>
# 4 Bevan 29 4 p <NA> NA <NA> <NA>
# 5 Roshan NA NA <NA> boxing 1 england z2
As an alternative, which works similarly to the above, use filter_all and any_vars from the dplyr package.
persons %>% full_join(person_sports, by = c("name")) %>% filter_all(any_vars(is.na(.)))
# name age col1 col2 sports rank col3 col4
# 1 Clarke 32 2 z <NA> NA <NA> <NA>
# 2 Bevan 29 4 p <NA> NA <NA> <NA>
# 3 Roshan NA NA <NA> boxing 1 england z2
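On newer dplyr (>= 1.0.4), where filter_all() and any_vars() are superseded, the same row filter can be written with if_any(); a sketch:

persons %>%
  full_join(person_sports, by = "name") %>%
  filter(if_any(everything(), is.na))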
Finally, since you mentioned your actual dataset is much bigger, you might want to compare to a data.table solution and see what works best in your real world data.
library(data.table)
setDT(persons)
setDT(person_sports)
merge(persons, person_sports, by = "name", all = TRUE) %>% .[!complete.cases(.)]
# name age col1 col2 sports rank col3 col4
# 1: Bevan 29 4 p NA NA NA NA
# 2: Clarke 32 2 z NA NA NA NA
# 3: Roshan NA NA NA boxing 1 england z2
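Another way to look at an exclusive full join, offered here only as a sketch of an alternative, is as the union of the two anti-joins; bind_rows() fills the unmatched columns with NA:

bind_rows(
  anti_join(persons, person_sports, by = "name"),
  anti_join(person_sports, persons, by = "name")
)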
Suppose I have the following df
df <- data.frame(col1 = c(1, 3, 1), col2 = c(2, 4, 2), col3 = c(NA, NA, "c"))
> df
col1 col2 col3
1 1 2 <NA>
2 3 4 <NA>
3 1 2 c
My goal is to delete all duplicate rows based on col1 and col2 such that the "longer" row (the one with fewer NAs) survives. In this case, the first row should be deleted. I tried
df[duplicated(df[, 1:2]), ]
but this gives me only the third row (and not the third and the second one). How to do it properly?
EDIT: The real df has 15 columns, of which the first 13 are used for identifying duplicates. In the last two columns roughly 2/3 of the rows are filled with NAs (the first 13 columns do not contain any NAs). Thus, my example df was misleading in the sense that there are two columns to be excluded for identifying the duplicates. I am sorry for that.
You can try this:
library(dplyr)
df %>%
  group_by(col1, col2) %>%
  slice(which.min(is.na(col3)))
or this :
df %>%
  group_by(col1, col2) %>%
  arrange(col3) %>%
  slice(1)
# # A tibble: 2 x 3
# # Groups: col1, col2 [2]
# col1 col2 col3
# <dbl> <dbl> <fctr>
# 1 1 2 c
# 2 3 4 NA
A GENERAL SOLUTION
In this most general solution only one row is kept per value of col1; add col2 to the grouping variables (see the comment in the code below) to keep one row per col1/col2 pair. Within each group, the row with the fewest NAs survives.
df %>%
  mutate(nna = rowSums(is.na(df))) %>%
  group_by(col1) %>%   # or group_by(col1, col2)
  slice(which.min(nna)) %>%
  select(-nna)
df <- data.frame(col1 = c(1, 3, 1), col2 = c(2, 4, 2), col3 = c(NA, NA, "c"))
df <- df[order(df$col3),]
duplicates <- duplicated(df[,1:2])
duplicates_sub <- subset(df , duplicates == FALSE)
> duplicates_sub
col1 col2 col3
3 1 2 c
2 3 4 <NA>
EDIT: Keep all non-NA rows
df <- data.frame(col1 = c(1, 3, 1,3, 1), col2 = c(2, 4, 2,4, 2), col3 = c("a", NA, "c",NA, "b"))
df <- df[order(df$col3),]
duplicates <- duplicated(df[,1:2]) & is.na(df[,3])
duplicates_sub <- subset(df , duplicates == FALSE)
> duplicates_sub
col1 col2 col3
1 1 2 a
5 1 2 b
3 1 2 c
2 3 4 <NA>
You can sort NAs to the top or bottom before dropping dupes:
# in base, which puts NAs last
odf = df[do.call(order, df), ]
odf[!duplicated(odf[, c("col1", "col2")]), ]
# col1 col2 col3
# 3 1 2 c
# 2 3 4 <NA>
# or with data.table, which puts NAs first
library(data.table)
DF = setorder(data.table(df))
unique(DF, by=c("col1", "col2"), fromLast=TRUE)
# col1 col2 col3
# 1: 1 2 c
# 2: 3 4 NA
This approach is less direct with dplyr: distinct() has no fromLast argument, although arrange(across(everything())) (dplyr >= 1.0) sorts by all columns and places NAs last, so keeping the first row of each group still works.
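For reference, a hedged dplyr sketch of that idea (dplyr >= 1.0), sorting so NAs come last and keeping the first row per col1/col2 pair:

library(dplyr)
df %>%
  arrange(across(everything())) %>%        # arrange() puts NAs last
  distinct(col1, col2, .keep_all = TRUE)   # keeps the first row per pair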