Find all indices of duplicates and write them in new columns - r

I have a data.frame with a single column, a vector of strings.
These strings have duplicate values.
I want to find the strings that occur more than once in this vector and write the positions (indices) of all their occurrences into new columns.
So for example consider I have:
DT <- data.frame(string = c("A","B","C","D","E","F","A","C","F","Z","A"))
I want to get:
string match1 match2 match3 matchx....
A 1 7 11
B 2 NA NA
C 3 8 NA
D 4 NA NA
E 5 NA NA
F 6 9 NA
A 1 7 11
C 3 8 NA
F 6 9 NA
Z 10 NA NA
A 1 7 11
The vector is much longer than in this example, and I do not know the maximum number of columns I will need.
What would be the most efficient way to do this?
I know that there is the duplicated function, but I am not sure how to use it to get the result I want here.
Many thanks!

Here's one way of doing this. I'm sure a data.table one liner follows.
DT<- data.frame(string=c("A","B","C","D","E","F","A","C","F","Z","A"))
# find matches
rbf <- sapply(DT$string, FUN = function(x, DT) which(DT %in% x), DT = DT$string)
# fill in NAs to have a pretty matrix
out <- sapply(rbf, FUN = function(x, mx) c(x, rep(NA, length.out = mx - length(x))), max(sapply(rbf, length)))
# bind it to the original data
cbind(DT, t(out))
string 1 2 3
1 A 1 7 11
2 B 2 NA NA
3 C 3 8 NA
4 D 4 NA NA
5 E 5 NA NA
6 F 6 9 NA
7 A 1 7 11
8 C 3 8 NA
9 F 6 9 NA
10 Z 10 NA NA
11 A 1 7 11

Here is one option with data.table. After grouping by 'string', get the sequence (seq_len(.N)) and row index (.I), then dcast to 'wide' format and join with the original dataset on the 'string'
library(data.table)
dcast(setDT(DT)[, .(seq_len(.N),.I), string],string ~ paste0("match", V1))[DT, on = "string"]
# string match1 match2 match3
# 1: A 1 7 11
# 2: B 2 NA NA
# 3: C 3 8 NA
# 4: D 4 NA NA
# 5: E 5 NA NA
# 6: F 6 9 NA
# 7: A 1 7 11
# 8: C 3 8 NA
# 9: F 6 9 NA
#10: Z 10 NA NA
#11: A 1 7 11
Or another option would be to split the sequence of rows by 'string', pad the shorter list elements with NA up to the maximum length, and merge with the original dataset (using base R methods)
lst <- split(seq_len(nrow(DT)), DT$string)
merge(DT, do.call(rbind, lapply(lst, `length<-`, max(lengths(lst)))),
by.x = "string", by.y = "row.names")
data
DT<- data.frame(string=c("A","B","C","D","E","F","A","C",
"F","Z","A"), stringsAsFactors=FALSE)

And here's one that uses tidyverse tools ( not quite a one-liner ;) ):
library( tidyverse )
DT %>% group_by( string ) %>%
do( idx = which(DT$string == unique(.$string)) ) %>%
ungroup %>% unnest %>% group_by( string ) %>%
mutate( m = stringr::str_c( "match", 1:n() ) ) %>%
spread( m, idx )
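For comparison, here is a base R sketch that keeps the original row order. The split-based answer above joins with merge, which sorts by the key column by default, so its rows come back ordered by string. This variant builds the same padded matrix and then looks rows up by name instead of merging. The names padded, m and res are made up for illustration.

```r
DT <- data.frame(string = c("A","B","C","D","E","F","A","C","F","Z","A"),
                 stringsAsFactors = FALSE)

# Row positions for each distinct string
lst <- split(seq_len(nrow(DT)), DT$string)

# Pad every element with NA up to the longest one, one matrix row per string
padded <- do.call(rbind, lapply(lst, `length<-`, max(lengths(lst))))
colnames(padded) <- paste0("match", seq_len(ncol(padded)))

# Look up each original row by its string, so the row order is preserved
m <- padded[as.character(DT$string), , drop = FALSE]
rownames(m) <- NULL
res <- cbind(DT, m)
res
```

Because the positions are computed once per distinct string, this avoids repeating the search for every row.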

Related

R - Merging rows with numerous NA values to another column

I would like to ask the R community for help with finding a solution for my data, where any consecutive row with numerous NA values is combined and put into a new column.
For example:
df <- data.frame(A = c(1,2,3,4,5,6), B = c(2, NA, NA, 5, NA, NA), C = c(1,2,NA,4,5,NA), D = c(3,NA,5,NA,NA,NA))
A B C D
1 1 2 1 3
2 2 NA 2 NA
3 3 NA NA 5
4 4 5 4 NA
5 5 NA 5 NA
6 6 NA NA NA
Must be transformed to this:
A B C D E
1 1 2 1 3 2 NA 2 NA 3 NA NA 5
2 4 5 4 NA 5 NA 5 NA 6 NA NA NA
I would like to do the following:
Identify consecutive rows that have more than 1 NA value -> combine entries from those consecutive rows into a single combined entry
Place the above combined entry in new column "E" on the prior row
This is quite complex (for me!) and I am wondering if anyone can offer any help with this. I have searched for some similar problems, but have been unable to find one that produces a similar desired output.
Thank you very much for your thoughts--
Using tidyr and dplyr:
Concatenate values for each row.
Keep the concatenated values only for rows with more than one NA.
Group each “good” row with all following “bad” rows.
Use a grouped summarize() to concatenate “bad” row values to a single string.
library(dplyr)
library(tidyr)
df %>%
unite("E", everything(), remove = FALSE, sep = " ") %>%
mutate(
E = if_else(
rowSums(across(!E, is.na)) > 1,
E,
""
),
new_row = cumsum(E == "")
) %>%
group_by(new_row) %>%
summarize(
across(A:D, first),
E = trimws(paste(E, collapse = " "))
) %>%
select(!new_row)
# A tibble: 2 × 5
A B C D E
<dbl> <dbl> <dbl> <dbl> <chr>
1 1 2 1 3 2 NA 2 NA 3 NA NA 5
2 4 5 4 NA 5 NA 5 NA 6 NA NA NA
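The grouping step in the answer above rests on cumsum: counting the "good" rows seen so far gives every good row, and the bad rows that follow it, the same group id. A minimal base R sketch of just that step, where the logical vector bad is a hypothetical stand-in for the "more than one NA" test:

```r
# Hypothetical flags: TRUE marks a "bad" row (more than one NA)
bad <- c(FALSE, TRUE, TRUE, FALSE, TRUE, TRUE)

# Each good row starts a new group; the bad rows after it inherit its id
group <- cumsum(!bad)

# Row numbers collected per group
rows_by_group <- split(seq_along(bad), group)
rows_by_group
```

This assumes the first row is a good one; a leading bad row would end up in a group numbered 0.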

merge columns that have the same name r

I am working in R with a dataset that is created from mongodb with the use of mongolite.
I am getting a list that looks like so:
_id A B A B A B NA NA
1 a 1 b 2 e 5 NA NA
2 k 4 l 3 c 3 d 4
I would like to merge the dataset to look like this:
_id A B
1 a 1
2 k 4
1 b 2
2 l 3
1 e 5
2 c 3
1 NA NA
2 d 4
The NAs in the last columns are there because the columns are named from the first entry and if a later entry has more columns than that they don't get names assigned to them, (if I get help for this as well it would be awesome but it's not the reason I am here).
Also the number of columns might differ for different subsets of the dataset.
I have tried melt(), but since it is a list and not a dataframe it doesn't work as expected. I have tried stack(), but it didn't work because the columns have the same name and some of them don't even have a name.
I know this is a very weird situation and appreciate any help.
Thank you.
Using library(magrittr) for the pipe; fread() and setDF() come from data.table.
data:
df <- fread("
_id A B A B A B NA NA
1 a 1 b 2 e 5 NA NA
2 k 4 l 3 c 3 d 4 ",header=T)
setDF(df)
Code:
df2 <- df[,-1]
odds<- df2 %>% ncol %>% {(1:.)%%2} %>% as.logical
even<- df2 %>% ncol %>% {!(1:.)%%2}
cbind(df[,1,drop=F],
A=unlist(df2[,odds]),
B=unlist(df2[,even]),
row.names=NULL)
result:
# _id A B
# 1 1 a 1
# 2 2 k 4
# 3 1 b 2
# 4 2 l 3
# 5 1 e 5
# 6 2 c 3
# 7 1 <NA> NA
# 8 2 d 4
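The same odd/even split can be sketched in base R alone, using seq() with a step of 2 instead of the modulo pipeline. The column names id, a1, b1, a2, b2 are made up here so that the example runs without duplicated headers; it assumes the A and B columns strictly alternate after the id column.

```r
df <- data.frame(id = 1:2,
                 a1 = c("a", "k"), b1 = c(1, 4),
                 a2 = c("b", "l"), b2 = c(2, 3),
                 stringsAsFactors = FALSE)

df2 <- df[-1]                             # everything except the id column
odd_idx  <- seq(1, ncol(df2), by = 2)     # positions of the "A" columns
even_idx <- seq(2, ncol(df2), by = 2)     # positions of the "B" columns

long <- data.frame(
  id = rep(df$id, times = length(odd_idx)),      # ids repeat once per pair
  A  = unlist(df2[odd_idx],  use.names = FALSE), # stack A columns
  B  = unlist(df2[even_idx], use.names = FALSE), # stack B columns
  stringsAsFactors = FALSE
)
long
```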
We can use data.table, assuming A and B always follow each other. I created an example with two sets of NAs in the header. With grep we can find the columns that fread has named V8 etc. Using R's recycling of vectors, you can rename multiple headers in one go. If in your case these are named differently, change the pattern in the grep command. Then we reshape the data to long format with melt.
library(data.table)
df <- fread("
_id A B A B A B NA NA NA NA
1 a 1 b 2 e 5 NA NA NA NA
2 k 4 l 3 c 3 d 4 e 5",
header = TRUE)
df
_id A B A B A B A B A B
1: 1 a 1 b 2 e 5 <NA> NA <NA> NA
2: 2 k 4 l 3 c 3 d 4 e 5
# assuming A B are always following each other. Can be done in 1 statement.
cols <- names(df)
cols[grep(pattern = "^V", x = cols)] <- c("A", "B")
names(df) <- cols
# melt data (if df is a data.frame, replace df with setDT(df))
df_melted <- melt(df, id.vars = 1,
measure.vars = patterns(c('A', 'B')),
value.name=c('A', 'B'))
df_melted
_id variable A B
1: 1 1 a 1
2: 2 1 k 4
3: 1 2 b 2
4: 2 2 l 3
5: 1 3 e 5
6: 2 3 c 3
7: 1 4 <NA> NA
8: 2 4 d 4
9: 1 5 <NA> NA
10: 2 5 e 5
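The renaming step works because R recycles the two-element replacement across all matched positions. A small sketch on a plain character vector, assuming the unnamed columns came in as V8 to V11:

```r
# Hypothetical header as it might look after reading: four auto-named columns
cols <- c("_id", "A", "B", "A", "B", "A", "B", "V8", "V9", "V10", "V11")

# Four positions match, two replacement values: c("A", "B") is recycled twice
cols[grep(pattern = "^V", x = cols)] <- c("A", "B")
cols
```

Recycling is silent only when the number of matched columns is a multiple of the replacement length; with an odd number of unnamed columns R would warn and the A/B pairing would no longer hold.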
Thank you for your help, they were great inspirations.
Even though @Andre Elrico's solution worked better on the reproducible example, @phiver's solution worked better on my overall problem.
By using both those I came up with the following.
library(data.table)
#The data were in a list of lists called list for this example
temp <- as.data.table(matrix(t(sapply(list, '[', seq(max(sapply(list, length))))),
nrow = m))
# m here is the number of lists in list
cols <- names(temp)
cols[grep(pattern = "^V", x = cols)] <- c("B", "A")
#They need to be the opposite way because the first column is going to be substituted with id, and this way they fall on the correct column after that
cols[1] <- "id"
names(temp) <- cols
l <- melt.data.table(temp, id.vars = 1,
measure.vars = patterns(c("A", "B")),
value.name = c("A", "B"))
That way I can use this also if I have more than 2 columns that I need to manipulate like that.
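The padding idiom in the first step can be tried in isolation: indexing a vector past its end with `[` returns NA, so every record is stretched to the length of the longest one. A minimal sketch with a made-up list called records (the code above uses a list named list, which shadows the base function):

```r
# Hypothetical records of unequal length
records <- list(c("1", "a", "1", "b", "2"),
                c("2", "k", "4"))

n <- max(lengths(records))

# Indexing past the end yields NA, so each record is padded to length n;
# sapply returns one column per record, hence the transpose
padded <- t(sapply(records, `[`, seq_len(n)))
padded
```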

Appending data frames in R based on column names

I am relatively new to R, so bear with me. I have a list of data frames that I need to combine into one data frame. so:
dfList <- list(
df1 = data.frame(x=letters[1:2],y=1:2),
df2 = data.frame(x=letters[3:4],z=3:4)
)
comes out as:
$df1
x y
1 a 1
2 b 2
$df2
x z
1 c 3
2 d 4
and I want them to combine common columns and add anything not already there. the result would be:
final result
x y z
1 a 1
2 b 2
3 c 3
4 d 4
Is this even possible?
Yep, it's pretty easy, actually:
library(dplyr)
df_merged <- bind_rows(dfList)
df_merged
x y z
1 a 1 NA
2 b 2 NA
3 c NA 3
4 d NA 4
And if you don't want NA in the empty cells, you can replace them like this:
df_merged[is.na(df_merged)] <- 0 # or whatever you want to replace NA with
Just using do.call with rbind.fill (from the plyr package)
library(plyr)
do.call(rbind.fill,dfList)
x y z
1 a 1 NA
2 b 2 NA
3 c NA 3
4 d NA 4
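If adding a package is not an option, the same fill-then-bind idea can be sketched in base R: collect the union of column names, add the missing ones as NA to each data frame, and rbind. This is only an illustration of the idea, not how rbind.fill is implemented; the names all_cols, filled and combined are made up.

```r
dfList <- list(
  df1 = data.frame(x = letters[1:2], y = 1:2, stringsAsFactors = FALSE),
  df2 = data.frame(x = letters[3:4], z = 3:4, stringsAsFactors = FALSE)
)

# Union of all column names, in order of first appearance
all_cols <- unique(unlist(lapply(dfList, names)))

# Add each missing column as NA, then put columns in a common order
filled <- lapply(dfList, function(d) {
  for (nm in setdiff(all_cols, names(d))) d[[nm]] <- NA
  d[all_cols]
})

combined <- do.call(rbind, filled)
rownames(combined) <- NULL
combined
```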
You could do that with base function merge():
merge(dfList$df1, dfList$df2, by = "x", all = TRUE)
# x y z
# 1 a 1 NA
# 2 b 2 NA
# 3 c NA 3
# 4 d NA 4
Or with dplyr package with function full_join:
dplyr::full_join(dfList$df1, dfList$df2, by = "x")
# x y z
# 1 a 1 NA
# 2 b 2 NA
# 3 c NA 3
# 4 d NA 4
They both join everything that is in both data.frames.
Hope that works for you.
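One caveat worth checking before relying on merge or full_join here: both are joins on x, so they give the appended result only while the x values in the two data frames do not overlap. A small sketch with made-up data where "b" appears in both inputs:

```r
df1 <- data.frame(x = c("a", "b"), y = 1:2, stringsAsFactors = FALSE)
df2 <- data.frame(x = c("b", "c"), z = 3:4, stringsAsFactors = FALSE)

# A full join matches the shared key "b" into a single row
joined <- merge(df1, df2, by = "x", all = TRUE)
joined
nrow(joined)  # 3 rows, whereas appending the two data frames would give 4
```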

R: Combine columns ignoring NAs

I have a dataframe with a few columns, where for each row only one column can have a non-NA value. I want to combine the columns into one, keeping only the non-NA value, similar to this post:
Combine column to remove NA's
However, in my case, some rows may contain only NAs, so in the combined column, we should keep an NA, like this (adapted from the post I mentioned):
data <- data.frame('a' = c('A','B','C','D','E','F'),
'x' = c(1,2,NA,NA,NA,NA),
'y' = c(NA,NA,3,NA,NA,NA),
'z' = c(NA,NA,NA,4,5,NA))
So I would have
a x y z
1 A 1 NA NA
2 B 2 NA NA
3 C NA 3 NA
4 D NA NA 4
5 E NA NA 5
6 F NA NA NA
And I would like to get
'a' 'mycol'
A 1
B 2
C 3
D 4
E 5
F NA
The solution from the post mentioned above does not work in my case because of row F, it was:
cbind(data[1], mycol = na.omit(unlist(data[-1])))
Thanks!
Using base R...
data$mycol <- apply(data[,2:4], 1, function(x) x[!is.na(x)][1])
data
a x y z mycol
1 A 1 NA NA 1
2 B 2 NA NA 2
3 C NA 3 NA 3
4 D NA NA 4 4
5 E NA NA 5 5
6 F NA NA NA NA
One option is coalesce from dplyr
library(tidyverse)
data %>%
transmute(a, mycol = coalesce(!!! rlang::syms(names(.)[-1])))
# a mycol
#1 A 1
#2 B 2
#3 C 3
#4 D 4
#5 E 5
#6 F NA
Or we can use max.col from base R
cbind(data[1], mycol= data[-1][cbind(1:nrow(data),
max.col(!is.na(data[-1])) * NA^!rowSums(!is.na(data[-1]))+1)])
# a mycol
#1 A 1
#2 B 2
#3 C 3
#4 D 4
#5 E 5
#6 F NA
Or only with rowSums
v1 <- rowSums(data[-1], na.rm = TRUE)
cbind(data[1], mycol = v1 * NA^!v1)
Or another option is pmax
cbind(data[1], mycol = do.call(pmax, c(data[-1], na.rm = TRUE)))
or pmin
cbind(data[1], mycol = do.call(pmin, c(data[-1], na.rm = TRUE)))
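A note on the rowSums variant above: it treats a row sum of 0 as "all NA", so a genuine 0 in the data would be turned into NA. The sketch below changes the first x value to 0 (a hypothetical modification of the question's data) and compares that variant with a small base R coalesce built from Reduce and ifelse. The helper name coalesce_cols is made up.

```r
data <- data.frame(a = c("A", "B", "C", "D", "E", "F"),
                   x = c(0, 2, NA, NA, NA, NA),   # 0 instead of 1: hypothetical
                   y = c(NA, NA, 3, NA, NA, NA),
                   z = c(NA, NA, NA, 4, 5, NA),
                   stringsAsFactors = FALSE)

# Take the first non-NA value across the columns, left to right
coalesce_cols <- function(cols) {
  Reduce(function(a, b) ifelse(is.na(a), b, a), cols)
}
by_coalesce <- coalesce_cols(data[-1])

# The rowSums variant from the answer, for comparison
v1 <- rowSums(data[-1], na.rm = TRUE)
by_rowsums <- v1 * NA^!v1

cbind(data[1], by_coalesce, by_rowsums)
```

With values that are never 0 the two agree, so the rowSums shortcut is fine for the data in the question.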

R- Perform operations on column and place result in a different column, with the operation specified by the output column's name

I have a dataframe with 3 columns- L1, L2, L3- of data and empty columns labeled L1+L2, L2+L3, L3+L1, L1-L2, etc. combinations of column operations. Is there a way to check the column name and perform the necessary operation to fill that new column with data?
I am thinking:
-use match to find the appropriate original columns and using a for loop to iterate over all of the columns in this search?
so if the column I am attempting to fill is L1+L2 I would have something like:
apply(dataframe[, c(i, j)], 1, sum)
It seems strange that you would store your operations in your column names, but I suppose it is possible to achieve:
As always, sample data helps.
## Creating some sample data
mydf <- setNames(data.frame(matrix(1:9, ncol = 3)),
c("L1", "L2", "L3"))
## The operation you want to do...
morecols <- c(
combn(names(mydf), 2, FUN=function(x) paste(x, collapse = "+")),
combn(names(mydf), 2, FUN=function(x) paste(x, collapse = "-"))
)
## THE FINAL SAMPLE DATA
mydf[, morecols] <- NA
mydf
# L1 L2 L3 L1+L2 L1+L3 L2+L3 L1-L2 L1-L3 L2-L3
# 1 1 4 7 NA NA NA NA NA NA
# 2 2 5 8 NA NA NA NA NA NA
# 3 3 6 9 NA NA NA NA NA NA
One solution could be to use eval(parse(...)) within lapply to perform the calculations and store them to the relevant column.
mydf[morecols] <- lapply(names(mydf[morecols]), function(x) {
with(mydf, eval(parse(text = x)))
})
mydf
# L1 L2 L3 L1+L2 L1+L3 L2+L3 L1-L2 L1-L3 L2-L3
# 1 1 4 7 5 8 11 -3 -6 -3
# 2 2 5 8 7 10 13 -3 -6 -3
# 3 3 6 9 9 12 15 -3 -6 -3
dfrm <- data.frame( L1=1:3, L2=1:3, L3=3+1, `L1+L2`=NA,
`L2+L3`=NA, `L3+L1`=NA, `L1-L2`=NA,
check.names=FALSE)
dfrm
#------------
L1 L2 L3 L1+L2 L2+L3 L3+L1 L1-L2
1 1 1 4 NA NA NA NA
2 2 2 4 NA NA NA NA
3 3 3 4 NA NA NA NA
#-------------
dfrm[, 4:7] <- lapply(names(dfrm[, 4:7]),
function(nam) eval(parse(text=nam), envir=dfrm) )
dfrm
#-----------
L1 L2 L3 L1+L2 L2+L3 L3+L1 L1-L2
1 1 1 4 2 5 5 0
2 2 2 4 4 6 6 0
3 3 3 4 6 7 7 0
I chose to use eval(parse(text=...)) rather than with, since the use of with is specifically cautioned against in its help page. I'm not sure I can explain why the eval(..., target_dfrm) form should be any safer, though.
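If evaluating column names as code feels uncomfortable, one alternative is to split each name on its operator and look the operator up with match.fun. This is a sketch under narrow assumptions: every target name has the form left-operator-right with exactly one "+" or "-", and both sides are existing column names. The helper apply_named_op is made up for illustration.

```r
mydf <- data.frame(L1 = 1:3, L2 = 4:6, L3 = 7:9)
targets <- c("L1+L2", "L2-L3")

# Split "L1+L2" into its operands and apply the operator to those columns
apply_named_op <- function(df, nm) {
  op <- if (grepl("+", nm, fixed = TRUE)) "+" else "-"
  parts <- strsplit(nm, op, fixed = TRUE)[[1]]
  match.fun(op)(df[[parts[1]]], df[[parts[2]]])
}

mydf[targets] <- lapply(targets, apply_named_op, df = mydf)
mydf
```

Unlike eval(parse(...)), this only ever calls "+" or "-", so an unexpected column name cannot run arbitrary code.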
