R Compare non side-by-side duplicates in 2 columns - r

There are many similar questions but I'd like to compare 2 columns and delete all the duplicates in both columns so that all that is left is the unique observations in each column. Note: Duplicates are not side-by-side. If possible, I would also like a list of the duplicates (not just TRUE/FALSE). Thanks!
C1 C2
1 a z
2 c d
3 f a
4 e c
would become
C1 C2
1 f z
2 e d
with duplicate list
duplicates: a, c

Here is another answer
where_dupe <- which(apply(df, 2, duplicated), arr.ind = TRUE)
This gives you the location (row and column) of the elements that are duplicated within a column of your original data frame.
col_unique <- setdiff(seq_len(ncol(df)), where_dupe[, "col"])
This gives you the columns that had no duplicates.
You can get their values by indexing:
df[, col_unique]

Here is a base R method using duplicated and lapply.
temp <- unlist(df)
# get duplicated elements
myDupeVec <- unique(temp[duplicated(temp)])
# get list without duplicates
noDupesList <- lapply(df, function(i) i[!(i %in% myDupeVec)])
noDupesList
$C1
[1] "f" "e"
$C2
[1] "z" "d"
data
df <- read.table(header=T, text=" C1 C2
1 a z
2 c d
3 f a
4 e c ", as.is=TRUE)
Note that this returns a list. This is a much more flexible structure, because in general a value may be repeated more than once within a particular variable, so the columns can end up with different lengths. If that is not the case, you can use do.call and data.frame to put the result into a rectangular structure.
do.call(data.frame, noDupesList)
C1 C2
1 f z
2 e d
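Since the question only has two columns, the list of duplicates can also be sketched with intersect. This is just an illustration on the sample data from the question; the names dupes and uniques are made up here, and unlike the unlist approach above it only catches values shared between the two columns, not repeats inside a single column.

```r
# Sample data from the question
df <- data.frame(C1 = c("a", "c", "f", "e"),
                 C2 = c("z", "d", "a", "c"),
                 stringsAsFactors = FALSE)

# Values that occur in both columns
dupes <- intersect(df$C1, df$C2)

# Drop those values from each column; a list is returned because the
# columns are not guaranteed to keep the same length
uniques <- lapply(df, function(col) col[!col %in% dupes])

dupes
uniques
```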

Related

How to find the longest sequence of non-NA rows in R?

I have an ordered dataframe with many variables, and am looking to extract the data from all columns associated with the longest sequence of non-NA rows for one particular column. Is there an easy way to do this? I have tried the na.contiguous() function but my data is not formatted as a time series.
My intuition is to create a running counter which determines whether a row has NA or not, and then will determine the count for the number of consecutive rows without an NA. I would then put this in an if statement to keep restarting every time an NA is encountered, outputting a dataframe with the lengths of every sequence of non-NAs, which I could use to find the longest such sequence. This seems very inefficient so I'm wondering if there is a better way!
If I understand this phrase correctly:
[I] am looking to extract the data from all columns associated with the longest sequence of non-NA rows for one particular column
You have a column of interest, call it WANT, and are looking to isolate all columns from the single row that ends the longest run of consecutive non-NA values in WANT.
Example data
df <- data.frame(A = LETTERS[1:10],
B = LETTERS[1:10],
C = LETTERS[1:10],
WANT = LETTERS[1:10],
E = LETTERS[1:10])
set.seed(123)
df[sample(1:nrow(df), 2), 4] <- NA
# A B C WANT E
#1 A A A A A
#2 B B B B B
#3 C C C <NA> C
#4 D D D D D
#5 E E E E E
#6 F F F F F
#7 G G G G G
#8 H H H H H
#9 I I I I I # want to isolate this row (#9) since most non-NA in WANT
#10 J J J <NA> J
Here you would want all I values as it is the row with the longest running non-NA values in WANT.
If my interpretation of your question is correct, we can extend the excellent answer found here to your situation. This creates a data frame with a running tally of consecutive non-NA values for each column.
The benefit of using this is that it counts consecutive non-NA runs across all columns (of any type, i.e., character or numeric), so you can then index on whatever column you want using which.max().
# from @jay.sf at https://stackoverflow.com/questions/61841400/count-consecutive-non-na-items
res <- as.data.frame(lapply(lapply(df, is.na), function(x) {
r <- rle(x)
s <- sapply(r$lengths, seq_len)
s[r$values] <- lapply(s[r$values], `*`, 0)
unlist(s)
}))
# index using which.max()
want_data <- df[which.max(res$WANT), ]
#> want_data
# A B C WANT E
#9 I I I I I
If this isn't correct, please edit your question for clarity.
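If what you actually want is every row in the longest non-NA stretch (rows 4 to 9 in the example), and not only the last one, the same rle idea can be sketched as follows. The helper name longest_non_na_run is made up for this illustration, and the sketch assumes that ties should go to the first such run.

```r
# Hypothetical helper: return all rows of `data` that belong to the longest
# run of consecutive non-NA values in column `col`
longest_non_na_run <- function(data, col) {
  r <- rle(!is.na(data[[col]]))            # runs of non-NA (TRUE) / NA (FALSE)
  if (!any(r$values)) return(data[0, , drop = FALSE])  # column is all NA
  ends   <- cumsum(r$lengths)              # last row of every run
  starts <- ends - r$lengths + 1           # first row of every run
  best   <- which.max(ifelse(r$values, r$lengths, 0))  # longest non-NA run
  data[starts[best]:ends[best], , drop = FALSE]
}

# Small data set with NA in rows 3 and 10 of WANT, as in the example above
df <- data.frame(A = LETTERS[1:10], WANT = LETTERS[1:10])
df$WANT[c(3, 10)] <- NA

run_rows <- longest_non_na_run(df, "WANT")
run_rows
```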

Split column into vectors by group R - independent of column order

Edit
This question seems to be a duplicate of the question How to group a vector into a list of vectors?, and the answer split(df$b, df$id) was suggested. Although I was initially happy with that solution, I realized that the given answers do not fully address my question. In the question below, I would like to obtain a list in which the vector elements are assigned to the value of a third column (in my example df$a). This is important, as otherwise the order of df$b plays a role. Obviously I can arrange by df$a and then call split(), but maybe there is another way of doing that.
My sample df:
df <- data_frame(id = paste0('id',rep(1:2, each = 5)), a = rep(letters[1:5],2),b=c(1:5,5:1))
Df should be grouped by ID (in df$id). I would like to create a list of vectors for each group (id) element that contains the values of df$b. My approach
library(dplyr) # provides %>%
library(tidyr)
spread_df <- df %>% spread(id, b) # makes a new column for each id
list_group_elements <- list() # create the result list before filling it
# loop over the columns of spread_df
for (i in seq_along(spread_df)) {
  list_group_elements[i] <- list(spread_df[[i]])
  # I want each vector to be identified by the identifier of column df$a,
  # therefore:
  names(list_group_elements[[i]]) <- list_group_elements[[1]]
}
This results in :
list_group_elements
[[1]]
a b c d e
"a" "b" "c" "d" "e"
[[2]]
a b c d e
1 2 3 4 5
[[3]]
a b c d e
5 4 3 2 1
I don't need the first element of the list, but the rest is basically what I need. I have the peculiar impression that my approach is somewhat not ideal and if someone has an idea to improve this, (e.g., with dplyr?) this would be highly appreciated. Why do I want this: I made a function that uses vectors as arguments and I would like to run this function over certain columns from dataframes - but only using the grouped values as arguments and not the entire column.
You may make df$b a named vector using setNames, and then split it into a list:
split(setNames(df$b, df$a), df$id)
# $id1
# a b c d e
# 1 2 3 4 5
#
# $id2
# a b c d e
# 5 4 3 2 1
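To make the result independent of the row order, as the edit to the question asks, one option is to sort each named vector by its names after splitting. This is only a sketch: it uses a base data.frame instead of data_frame, and the names reordered and by_id are made up here.

```r
# Same data as in the question, but with the rows deliberately reversed
df <- data.frame(id = paste0("id", rep(1:2, each = 5)),
                 a  = rep(letters[1:5], 2),
                 b  = c(1:5, 5:1),
                 stringsAsFactors = FALSE)
reordered <- df[nrow(df):1, ]

# Name each value of b by a, split by id, then sort every vector by its names
by_id <- split(setNames(reordered$b, reordered$a), reordered$id)
by_id <- lapply(by_id, function(v) v[order(names(v))])
by_id
```

Because the values carry their names along, the function that consumes the vectors can also look them up by name and ignore position entirely.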
One way is
lapply(unique(df$id), function(L) df$b[df$id == L]) # df$id is character here, so levels() would return NULL
[[1]]
[1] 1 2 3 4 5
[[2]]
[1] 5 4 3 2 1
Consider by, an object-oriented wrapper for tapply, designed to split a data frame by factor(s):
by(df, df$id, FUN=function(i) i$b)

Extracting one row from multiple data tables and combining into a new data table while keeping column names

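Three text files are in the same directory ("data001.txt", "data002.txt", "data003.txt"). I wrote a loop to read each data file and generate three data tables:
library(data.table) # for setnames()
files <- list.files(pattern = "^data") # the three data files
for(i in files) {
  x <- read.delim(i, header = FALSE, sep = "\t", na.strings = "*")
  setnames(x, 2, i) # name the second column after the file
  assign(i, x) # create an object named after the file
}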
So let's say each individual table looks something like this:
var1 var2 var3
row1 2 1 3
I've used rbind to combine all of the tables...
combined <- do.call(rbind, mget(ls(pattern="^data")))
and get something like this:
var1 var2 var3
row1 2 1 3
var1 var2 var3
row1 3 2 4
var1 var2 var3
row1 1 3 5
leaving me with superfluous column names. At the moment I can get around this by just deleting that specific row containing the column names, but it's a bit clunky.
colnames(combined) <- combined[1, ] # make the first row the column names
combined <- combined[-1, ] # delete the now-unnecessary first row
toKeep <- seq(1, nrow(combined), 2) # the data rows are now the odd rows; the even rows hold the repeated column names
combined <- combined[toKeep, ] # keep only the data rows
This does give me what I want...
var1 var2 var3
row1 2 1 3
row1 3 2 4
row1 1 3 5
But I feel like a better way would simply be to extract the values of "row1" as a vector or as a list or whatever, and combine them all together into one data table. I feel like there is a quick and easy way to do this but I haven't been able to find anything yet. I've had a look here and here and here.
One possibility is to take the second row (that I want), and convert it into a matrix (then transpose it to make it a row instead of column!?) and rbind:
data001.txt <- as.matrix(data001.txt[2,])
data001.txt <- t(data001.txt)
combined <- rbind(data001.txt, data002.txt)
This gives me more or less what I want, except without the column name headers (e.g. var1, var2, var3).
v1 v2 v3
2 1 3
3 2 4
Any ideas? Would this second method work well if there is some way to add the column names? I feel like it's less clunky than the first method. Thanks for any input :)
edit - solved in answer below.
Figured it out. It required converting to a data matrix and using setnames from the data.table package. Say you have a range of text data files like the one that follows, and you want to extract just the seventh column (the one with the numbers, not letters) and combine them in their own data table, including the row names:
chemical1 a b c d e 1 g h i j k l m
chemical2 a b c d e 2 g h i j k l m
chemical3 a b c d e 3 g h i j k l m
chemical4 a b c d e 4 g h i j k l m
chemical5 a b c d e 5 g h i j k l m
Read each file with row.names = 1 and header = FALSE. The first column then becomes the row names, so the seventh column of the file is column six of the data frame.
library(data.table) # for setnames()
setwd("directory")
files <- list.files(pattern = "data") # take all files with 'data' in their name
for(i in files) {
  x <- read.delim(i, row.names = 1, header = FALSE, sep = "\t", na.strings = "*")
  setnames(x, 6, i) # if the data you want is in column six. Sets data file name as the column name.
  x <- as.matrix(x[6]) # just take the sixth column with the numeric data (drop everything else)
  x <- t(x) # transpose (if you want..)
  assign(i, x)
}
combined <- do.call(rbind, mget(ls(pattern="^data"))) # combine the data matrices into one table
write.table(combined, file="filename.csv", sep=",", row.names=TRUE, col.names = NA)
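For what it's worth, the same result can probably be had without assign() and mget() by collecting the rows in a list. The following is only a sketch under assumptions: it writes three small made-up files to a temporary directory so that it runs on its own, the files have three tab-separated columns instead of fourteen, and the wanted numbers sit in the last column.

```r
# Made-up demo files in a temporary directory (stand-ins for the real data files)
tmp <- tempfile("demo")
dir.create(tmp)
for (k in 1:3) {
  writeLines(paste(c("chemical1", "chemical2"), "a", c(k, k + 10), sep = "\t"),
             file.path(tmp, sprintf("data%03d.txt", k)))
}

files <- list.files(tmp, pattern = "^data", full.names = TRUE)

# Read each file and keep one column as a vector named by the row names
rows <- lapply(files, function(f) {
  x <- read.delim(f, row.names = 1, header = FALSE, sep = "\t", na.strings = "*")
  setNames(x[[2]], rownames(x))   # column 2 holds the numbers in the demo files
})

# One row per file; the column names come from the row names of the first file
combined <- do.call(rbind, rows)
rownames(combined) <- basename(files)
combined
```

Keeping everything in a list avoids creating one global object per file, and the column index is the only thing to change for the real data.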

How can I conditionally copy over rows of column A to column B in R?

I want to copy column A to column B, but for certain rows of column A, I want to make a stringsplit change as it copies over to column B. Can I do this without using a for loop (namely, can I do this using mutate in dplyr)?
I want to split on ':' (if found) and take the 2nd element of the strsplit and put it in B.
Sample result:
A B
1 a a
2 b b
3 c:d d
This honestly does not require big guns like dplyr. We can split column A and take the last element of the result:
DF = read.table(text="A B
a a
b b
c:d d",header=TRUE,stringsAsFactors=FALSE)
DF$NewCol=do.call(rbind,lapply(DF[,"A"],function(x) { z=unlist(strsplit(x,":")); z[length(z)] } ))
DF
# A B NewCol
# a a a
# b b b
#c:d d d
You could use lapply:
DF = read.table(text="A B
a a
b b
c:d d",header=TRUE,stringsAsFactors=FALSE)
DF$NewCol<-do.call(rbind,lapply(strsplit(DF [,'A'],split=":"), function(x) tail(x,1)))
DF
A B NewCol
1 a a a
2 b b b
3 c:d d d
Hard to beat simple regex here:
df$B = sub('.*:', '', df$A)
Perhaps:
dfrm$B[grepl("[:]", dfrm$A)] <-
sapply( strsplit( as.character(dfrm$A)[grepl("[:]", dfrm$A)], split="[:]"), "[", 2)
Translated into English that says replace only the B items where A has a ":" in it and do so with the second item of the list formed when those A items in the same row are split on the regex character-class containing only the ":" character.
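Note that the answers above agree on the sample data but not in general: the regex and the "last element" versions keep whatever follows the final colon, while the "[", 2 version keeps the second piece. A quick sketch of the difference, using a made-up vector with one extra value containing two colons:

```r
# Sample values from the question plus one made-up value with two colons
A <- c("a", "b", "c:d", "e:f:g")

# Everything after the last colon (unchanged when there is no colon)
last_piece <- sub(".*:", "", A)

# The second piece after splitting on ":" (unchanged when there is no colon)
has_colon    <- grepl(":", A, fixed = TRUE)
second_piece <- ifelse(has_colon,
                       sapply(strsplit(A, ":", fixed = TRUE), `[`, 2),
                       A)

last_piece
second_piece
```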
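We can use str_extract. Note that this pattern keeps only the final word character of each string, which is enough for the sample data: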
library(stringr)
df1$B <- str_extract(df1$A, "\\w$")
df1$B
#[1] "a" "b" "d"

Replacing value of one df column only in specific rows

I have a vector, index, that corresponds to the rows of a data frame that I want to modify in one specific column:
index <- c(1,3,6)
dm <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "one two three four
x y z k
a b c r
s e t f
e d f a
a e d r
q t j i")
Now I want to modify column "three" only for rows 1, 3 and 6 replacing whatever value in it with "A".
Should I use apply?
There is no need for apply. You could simply use the following:
dm$three[index] <- "A"
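A small runnable check of this, rebuilding the data from the question as a plain data.frame with character columns (an assumption, since the question does not show how dm was created):

```r
# Data from the question, rebuilt by hand
dm <- data.frame(one   = c("x", "a", "s", "e", "a", "q"),
                 two   = c("y", "b", "e", "d", "e", "t"),
                 three = c("z", "c", "t", "f", "d", "j"),
                 four  = c("k", "r", "f", "a", "r", "i"),
                 stringsAsFactors = FALSE)
index <- c(1, 3, 6)

# Replace column "three" only in the rows listed in index
dm$three[index] <- "A"
dm$three
```

The equivalent matrix-style form dm[index, "three"] <- "A" does the same thing. If the column were a factor, "A" would first have to be added as a level.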

Resources