Unique Data Frame by Prioritizing a Value in R

I have the following data frame in R:
A<-c(1,0,0,1,0)
B<-c("A","A","B","B","C")
df<-cbind(A,B)
and I want to get the unique rows of this data frame, prioritizing a value in column A:
a value of 1 should be kept rather than a value of 0.
I tried the following code:
uniq<-unique(subset(df, df[,1]==1))
and the result is:
A B
[1,] "1" "A"
[2,] "1" "B"
But I want:
A B
[1,] "1" "A"
[2,] "1" "B"
[3,] "0" "C"
How can I achieve this? Thanks in advance.

First, your df is actually a matrix (cbind returns one), so you could start with df <- data.frame(df, stringsAsFactors = FALSE).
Then sort so that A == 1 comes first and finally weed out the duplicates:
df <- df[order(df[["A"]], decreasing = TRUE), ]
df[!duplicated(df[["B"]]), ]
A B
1 1 A
4 1 B
5 0 C

You can use aggregate, if you make sure you have a data frame and not a matrix:
df <- data.frame(A, B, stringsAsFactors = FALSE)
aggregate(A ~ B, df, max)
# B A
# 1 A 1
# 2 B 1
# 3 C 0
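If you also want the original A, B column order back, a small follow-up (not part of the original answer) is to reorder the aggregated result:
res <- aggregate(A ~ B, df, max)
res[, c("A", "B")]
#   A B
# 1 1 A
# 2 1 B
# 3 0 C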
If you want to prioritize a value and simple sorting isn't good enough (because you want to prioritize a character or factor value, or a numeric value that is not the min/max, or you want to leave the order of the other values intact), you can use:
df2 <- df[order(df$A!=1),]
df2 <- df2[!duplicated(df2[["B"]]), ]
which is a minor twist on @snoram's answer.
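For example, if the priority value were a character instead (say a hypothetical priority value "C" in column B, purely to illustrate the pattern), the same ordering trick applies because FALSE sorts before TRUE:
# rows with B == "C" float to the top; the other rows keep their relative order
df[order(df$B != "C"), ]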

First sort the data by the first column (in decreasing order), then remove the rows with a duplicated value in the second column:
df <- df[order(df[,1], decreasing = TRUE),]
df[!duplicated(df[,2]),]
A B
[1,] "1" "A"
[2,] "1" "B"
[3,] "0" "C"

I think with the help of data.table you will be able to do it:
A <- c(1,0,0,1,0)
B <- c("A","A","B","B","C")
df <- data.frame(A = A, B = B, stringsAsFactors = FALSE)
df1 <- dplyr::arrange(df, desc(A), B)
library(data.table)
DT <- data.table(df1)
setkey(DT, B)
d <- DT[J(unique(B)), mult = "first"]  # desc(A) puts the A == 1 row first within each B
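If I have read the intent right, d should then keep the A == 1 row wherever one exists, something like:
d
#    A B
# 1: 1 A
# 2: 1 B
# 3: 0 C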

tidyverse solution
library(tidyverse)
df %>% as.data.frame( stringsAsFactors = FALSE ) %>%
arrange( B, desc(A) ) %>%
filter( !duplicated(B) )
# A B
# 1 1 A
# 2 1 B
# 3 0 C

Related

What's the R function used to find unique and distinct value in a column? [duplicate]

I have multiple observations of one species with different observers / groups of observers and want to create a list of all unique observers. My data look like this:
data <- read.table(text="species observer
1 A,B
1 A,B
1 B,E
1 B,E
1 D,E,A,C,C
1 F" , header = TRUE, stringsAsFactors = FALSE)
My output should return a list of all unique observers - so:
A,B,C,D,E,F
I tried to split the data in the observer column using the following command, but that only returns the unique combinations of observers:
all_observers <- unique(strsplit(as.character(data$observer), ","))
all_observers
[[1]]
[1] "A" "B"
[[2]]
[1] "B" "E"
[[3]]
[1] "D" "E" "A" "C" "C"
[[4]]
[1] "F"
You're almost there, you just need to unlist before you do the unique:
all_observers <- unique(unlist(strsplit(as.character(data$observer), ",")))
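For the example data this should give each observer exactly once, something like:
all_observers
# [1] "A" "B" "E" "D" "C" "F"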
We can use separate_rows on 'observer', get the distinct rows, group by 'species', and paste the 'observer' values together:
library(tidyverse)
data %>%
  separate_rows(observer) %>%
  distinct %>%
  group_by(species) %>%
  summarise(observer = toString(observer))
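For this data that should collapse to one row per species, roughly:
# A tibble: 1 x 2
#   species observer
#     <int> <chr>
# 1       1 A, B, E, D, C, F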
You could also use scan()
unique(scan(text=data$observer, what="", sep=","))
# Read 14 items
# [1] "A" "B" "E" "D" "C" "F"

Subsetting a data frame using another data frame

I'm having trouble with something that shouldn't be that hard to work out. What I would like to do is subset a data.frame using another data.frame, and more precisely, using a certain parameter.
Here goes the example:
df1<- t(data.frame(A=c("ABI", "BET", "ALN", "SPH", "PTI", "DIC", "PTD"), B=c("0.5","3","0","0","5","0","15"), C=c("0","0","3","15","15","0","0"), D=c("0.5","0.5","0.5","0","0","0","0"), E=c("37.5","37.5","0.5","62.5","0.5","0.5","1")))
df2<- data.frame(A=c("ABI", "BET", "ALN", "SPH", "PTI", "DIC", "PTD"), B=c("vasc", "vasc","vasc","spha", "moss","moss","moss"), C=c("a", "a", "b", "a", "c","d","a"))
Now, let's say that I want to keep in my df1 only the objects of A (here they are species) that are "vasc" in df2.
For that I've tried a few things such as:
df3 <- subset(df2, B=="vasc")
df4 <- df1[,c(df1, as.vector(df2))]
But in doing so, I get an error of this type:
Error in df1[, c(df1, as.vector(df2))] : invalid subscript type 'list'
Therefore, I've tried to unlist my data frames, but nothing seems to work. I've been on this problem for a while now, and I did explore the forum to see if anyone had an elegant solution, but it seems not.
Another way of doing this subsetting was the following bit of code, but it didn't work either, even though I felt closer to the solution:
try11 <- list(df2, df1)%>% rbindlist(., fill=T) # with df1 not transposed
df11 <- try11[try11=="vasc",]
I hope the code is good enough and my explanation clear enough.
Thank you!
You might try:
library(data.table)
setDT(df1)
setDT(df2)
dtPruned <- df1[A %in% df2[B == "vasc", A]]
Make sure to remove the t() call in your df1 definition for this to work, however. Basically, what it's doing is selecting the A column in df2 where B = "vasc". It then selects the rows from df1 where A is in those A's from df2.
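Assuming df1 is built without the t() call, dtPruned should end up with the three vascular species, roughly:
dtPruned
#      A   B C   D    E
# 1: ABI 0.5 0 0.5 37.5
# 2: BET   3 0 0.5 37.5
# 3: ALN   0 3 0.5  0.5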
You can do it with dplyr (assuming df1 is created without the t() call):
library(dplyr)
species <- as.character(df2[df2$B == "vasc", 1])
df1 %>%
filter(A %in% species)
## A tibble: 3 x 5
# A B C D E
# <fct> <fct> <fct> <fct> <fct>
#1 ABI 0.5 0 0.5 37.5
#2 BET 3 0 0.5 37.5
#3 ALN 0 3 0.5 0.5
PS
Your data contains only factors; you may want to store the numbers in a numeric class.
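A minimal sketch of that conversion, assuming df1 is built as a plain data frame (without t()) so that columns B through E hold the numbers:
# factor -> character -> numeric, applied to every column except A
df1[-1] <- lapply(df1[-1], function(x) as.numeric(as.character(x)))
str(df1)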
This should do it. First we create a character vector (x) of all the A values where B == vasc in df2. Then we select the columns of df1 whose A row is in x:
# Create a character vector of all A values when B == vasc
x <- as.character(df2[df2$B == "vasc", 1])
# Select columns where row A == x
df1[, which(df1[1, ] %in% x)]
[,1] [,2] [,3]
A "ABI" "BET" "ALN"
B "0.5" "3" "0"
C "0" "0" "3"
D "0.5" "0.5" "0.5"
E "37.5" "37.5" "0.5"
If we avoid the t call, we can do:
df1[df1$A %in% df2[df2$B == "vasc", 1], ]
A B C D E
1 ABI 0.5 0 0.5 37.5
2 BET 3 0 0.5 37.5
3 ALN 0 3 0.5 0.5
We could transpose the data frame to retain the same format as above:
t(df1[df1$A %in% df2[df2$B == "vasc", 1], ])
1 2 3
A "ABI" "BET" "ALN"
B "0.5" "3" "0"
C "0" "0" "3"
D "0.5" "0.5" "0.5"
E "37.5" "37.5" "0.5"
Data:
df1 <- t(data.frame(
A = c("ABI", "BET", "ALN", "SPH", "PTI", "DIC", "PTD"),
B = c("0.5","3","0","0","5","0","15"),
C = c("0","0","3","15","15","0","0"),
D = c("0.5","0.5","0.5","0","0","0","0"),
E = c("37.5","37.5","0.5","62.5","0.5","0.5","1")
)
)
df2 <- data.frame(
A = c("ABI", "BET", "ALN", "SPH", "PTI", "DIC", "PTD"),
B = c("vasc", "vasc","vasc","spha", "moss","moss","moss"),
C = c("a", "a", "b", "a", "c","d","a")
)

In R, how can I create subset data frame with all duplicate observations? [duplicate]

This question already has answers here:
How to output duplicated rows
(6 answers)
Subset data frame based on number of rows per group
(4 answers)
Closed 5 years ago.
Lots of questions out there touch the topic of duplicate observations, but none of them has worked for me so far.
In this question I learned how to select all duplicates from a vector:
# vector
id <- c("a","b","b","c","c","c","d","d","d","d")
#To return ALL duplicated values by specifying fromLast argument:
id[duplicated(id) | duplicated(id, fromLast=TRUE)]
## [1] "b" "b" "c" "c" "c" "d" "d" "d" "d"
#Yet another way to return ALL duplicated values, using %in% operator:
id[id %in% unique(id[duplicated(id)])]
## [1] "b" "b" "c" "c" "c" "d" "d" "d" "d"
Now having a data frame like this one:
dat <- data.frame(x = c(1, 1, 2, 2, 3),
y = c(5, 5, 6, 7, 8),
z = c('a', 'b', 'c', 'd', 'e'))
How could I select all observations that simultaneously have duplicate values of x and y, irrespective of z?
Another option using dplyr
library(dplyr)
dat %>% group_by(x,y) %>% filter(n()>1)
# A tibble: 2 x 3
# x y z
# <dbl> <dbl> <fctr>
#1 1 5 a
#2 1 5 b
You can use data.table like so:
library(data.table)
setDT(dat)
# selects all (x,y) pairs that occur more than once
dat[ , if (.N > 1L) .SD, by = .(x, y)]
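That should return the same two rows, with the grouping columns first, something like:
#    x y z
# 1: 1 5 a
# 2: 1 5 b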
In base R:
dat[ave(seq_along(dat$x), dat$x, dat$y, FUN = length) > 1, ]
x y z
1 1 5 a
2 1 5 b

R - How to replace a set of values with another set of values

I have this problem which I can solve manually but I feel there must be an easier answer.
I want to replace 2 with A, 22 with B, 4 with C, ...
I have all the values I want to replace in one array and the replacement values in another [A,B,C,D,...].
Is there an easy way to perform this replacement?
Thanks,
Miguel
You can use named vectors.
x <- sample(1:10, 8)
x
# [1] 4 9 6 10 8 3 7 5
y <- c("A", "B", "C")
names(y) <- 1:3
x[x%in%names(y)] <- y[x[x%in%names(y)]]
x
# [1] "4" "9" "6" "10" "8" "C" "7" "5"
Use data.table
DT[column1 == "2", column1 := "A"]
If you want to replace a group of them at a time, use setkey from data.table and merge the dataset with a reference dataset:
setkeyv(DT, 'column1')
setkeyv(referenceSet, 'oldVars')
merge(DT, referenceSet, all.x = TRUE)
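A minimal self-contained sketch of that merge approach, with DT and referenceSet invented here from the question's values:
library(data.table)
DT <- data.table(column1 = c("2", "22", "4", "7"))        # hypothetical data
referenceSet <- data.table(oldVars = c("2", "22", "4"),
                           newVars = c("A", "B", "C"))    # hypothetical lookup
setkeyv(DT, 'column1')
setkeyv(referenceSet, 'oldVars')
merge(DT, referenceSet, by.x = "column1", by.y = "oldVars", all.x = TRUE)
#    column1 newVars
# 1:       2       A
# 2:      22       B
# 3:       4       C
# 4:       7    <NA>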

Split string in each column for several columns

I have this table (data1) with four columns
SNP rs6576700 rs17054099 rs7730126
sample1 G-G T-T G-G
I need to separate columns 2-4 into two columns each, so the new output has 7 columns, like this:
SNP rs6576700 rs6576700 rs17054099 rs17054099 rs7730126 rs7730126
sample1 G G T T G G
With the following function I could split all the columns at once, but the output is not what I need:
split <- function(x){
  x <- as.character(x)
  strsplit(x, split = "-")
}
data2 <- apply(data1[,-1], 2, split)
data2
$rs17054099
$rs17054099[[1]]
[1] "T" "T"
$rs7730126
$rs7730126[[1]]
[1] "G" "G"
$rs6576700
$rs6576700[[1]]
[1] "C" "C"
On Stack Overflow I found a method to convert the output of strsplit to a data frame, but the rs numbers are in rows, not in columns (I got a similar output with other methods in the thread "strsplit by row and distribute results by column in data.frame").
> n <- max(sapply(data2, length))
> l <- lapply(data2, function(X) c(X, rep(NA, n - length(X))))
> data.frame(t(do.call(cbind, l)))
t.do.call.cbind..l..
rs17054099 T, T
rs7730126 G, G
rs2061700 C, C
If I do not use the transpose (the t() in t(do.call(cbind, l))), the output is a list that I cannot write to a file.
I would like to have the solution in R to make it part of a pipeline.
I forgot to say that I need to apply this to a million columns.
This is straightforward using the splitstackshape::cSplit function. Just specify the column indices within the splitCols parameter and the separator within the sep parameter, and you're done. It will even number your new column names so you will be able to distinguish between them. I've specified type.convert = FALSE so T values won't become TRUE. The default direction is wide, so you don't need to specify it.
library(splitstackshape)
cSplit(data1, 2:4, sep = "-", type.convert = FALSE)
# SNP rs6576700_1 rs6576700_2 rs17054099_1 rs17054099_2 rs7730126_1 rs7730126_2
# 1: sample1 G G T T G G
Here's a solution, as per the provided link, using the tstrsplit function from the devel version of data.table on GitHub. Here we define the index by subsetting the column names first, and then number the new columns using paste0. This is a bit more cumbersome approach, but its advantage is that it updates your original data in place instead of creating a copy of the whole data set.
library(data.table) ## V1.9.5+
indx <- names(data1)[2:4]
setDT(data1)[, paste0(rep(indx, each = 2), 1:2) := sapply(.SD, tstrsplit, "-"), .SDcols = indx]
data1
# SNP rs6576700 rs17054099 rs7730126 rs65767001 rs65767002 rs170540991 rs170540992 rs77301261 rs77301262
# 1: sample1 G-G T-T G-G G G T T G G
Here you want to use apply over the rows instead of columns:
df <- rbind(c("SNP", "rs6576700", "rs17054099", "rs7730126"),
c("sample1", "G-G", "T-T", "G-G"),
c("sample2", "C-C", "T-T", "G-C"))
t(apply(df[-1,], 1, function(col) unlist(strsplit(col, "-"))))
# [,1] [,2] [,3] [,4] [,5] [,6] [,7]
#[1,] "sample1" "G" "G" "T" "T" "G" "G"
#[2,] "sample2" "C" "C" "T" "T" "G" "C"
