Unique on a dataframe with only selected columns - r

I have a dataframe with >100 columns, and I would like to find the unique rows by comparing only two of the columns. I'm hoping this is an easy one, but I can't get it to work with unique or duplicated myself.
In the example below, I would like to deduplicate using only id and id2:
data.frame(id=c(1,1,3),id2=c(1,1,4),somevalue=c("x","y","z"))
id id2 somevalue
1 1 x
1 1 y
3 4 z
I would like to obtain either:
id id2 somevalue
1 1 x
3 4 z
or:
id id2 somevalue
1 1 y
3 4 z
(I have no preference which of the unique rows is kept)

Ok, if it doesn't matter which value in the non-duplicated column you select, this should be pretty easy:
dat <- data.frame(id=c(1,1,3),id2=c(1,1,4),somevalue=c("x","y","z"))
> dat[!duplicated(dat[,c('id','id2')]),]
id id2 somevalue
1 1 1 x
3 3 4 z
Inside the duplicated call, I'm simply passing only those columns from dat that I don't want duplicates of. This code will automatically always select the first of any ambiguous values. (In this case, x.)
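If the last row of each duplicate group should be kept instead of the first, duplicated() accepts a fromLast argument. A minimal sketch on the same example data (the name keep_last is only illustrative):

```r
dat <- data.frame(id = c(1, 1, 3), id2 = c(1, 1, 4),
                  somevalue = c("x", "y", "z"), stringsAsFactors = FALSE)

# Scan from the bottom, so the last row of each (id, id2) pair is the one kept
keep_last <- dat[!duplicated(dat[, c("id", "id2")], fromLast = TRUE), ]
keep_last$somevalue
# "y" "z"
```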

Here are a couple of dplyr options that keep one row per unique combination of id and id2 (df is the data frame):
library(dplyr)
df %>% distinct(id, id2, .keep_all = TRUE)
df %>% group_by(id, id2) %>% filter(row_number() == 1)
df %>% group_by(id, id2) %>% slice(1)

Using unique():
dat <- data.frame(id=c(1,1,3),id2=c(1,1,4),somevalue=c("x","y","z"))
dat[row.names(unique(dat[,c("id", "id2")])),]

A minor update to @Joran's code.
With the code below, you avoid the ambiguity in somevalue by returning only the unique combinations of the two columns:
dat <- data.frame(id=c(1,1,3), id2=c(1,1,4) ,somevalue=c("x","y","z"))
dat[row.names(unique(dat[,c("id", "id2")])), c("id", "id2")]
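If only the distinct (id, id2) pairs are needed, the row-name lookup is arguably unnecessary, since unique() on the two-column subset already returns them. A short sketch (key_pairs is a made-up name):

```r
dat <- data.frame(id = c(1, 1, 3), id2 = c(1, 1, 4),
                  somevalue = c("x", "y", "z"), stringsAsFactors = FALSE)

# unique() on the key columns alone gives one row per (id, id2) combination
key_pairs <- unique(dat[, c("id", "id2")])
key_pairs
```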

Related

R - Identifying only strings ending with A and B in a column

I have a column in a data frame in R that contains sample names. Some names are identical except that they end in A or B at the end, and some samples repeat themselves, like this:
df <- data.frame(Samples = c("S_026A", "S_026B", "S_028A", "S_028B", "S_038A", "S_040_B", "S_026B", "S_38A"))
What I am trying to do is to isolate all sample names that have an A and B at the end and not include the sample names that only have either A or B.
The end result of what I'm looking for would look like this:
"S_026" and "S_028" as these are the only ones that have A and B at the end.
All I seem to find is how to remove duplicates, and removing duplicates would only give me "S_026B" and "S_38A" in this case.
Alternatively, I have tried stripping the trailing A's and B's and then counting how many times each remaining name occurs (looking for counts > 2), but again, this does not give me the desired result.
Any suggestions?
We could group by the substring that excludes the last character, use substring to get the last character of each name, and keep only the groups whose last characters include both 'A' and 'B':
library(dplyr)
df %>%
group_by(grp = substr(Samples, 1, nchar(Samples)-1)) %>%
filter(all(c("A", "B") %in% substring(Samples, nchar(Samples)))) %>%
ungroup %>%
select(-grp)
-output
# A tibble: 5 x 1
Samples
<chr>
1 S_026A
2 S_026B
3 S_028A
4 S_028B
5 S_026B
You can extract the last character of Samples into a separate column, keep only the groups that have both 'A' and 'B', and then keep only the unique values.
library(dplyr)
library(tidyr)
df %>%
extract(Samples, c('value', 'last'), '(.*)(.)') %>%
group_by(value) %>%
filter(all(c('A', 'B') %in% last)) %>%
ungroup %>%
distinct(value)
# value
# <chr>
#1 S_026
#2 S_028
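The same grouping idea can be sketched in base R without dplyr or tidyr. The names stem, suffix, has_both and paired are illustrative only:

```r
df <- data.frame(Samples = c("S_026A", "S_026B", "S_028A", "S_028B",
                             "S_038A", "S_040_B", "S_026B", "S_38A"),
                 stringsAsFactors = FALSE)

n      <- nchar(df$Samples)
stem   <- substr(df$Samples, 1, n - 1)   # name without its last character
suffix <- substring(df$Samples, n)       # last character only

# For each stem, TRUE when both "A" and "B" occur among its suffixes
has_both <- tapply(suffix, stem, function(s) all(c("A", "B") %in% s))
paired   <- names(has_both)[has_both]
paired
```

Note that "S_038A" and "S_38A" are treated as different stems, exactly as in the answers above.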

To create a frequency table with dplyr to count the factor levels and missing values and report it

Some questions are similar to this topic (here or here, as an example) and I know one solution that works, but I want a more elegant response.
I work in epidemiology and I have variables 1 and 0 (or NA). Example:
Does the patient have cancer?
NA or 0 is no
1 is yes
Let's say I have several variables in my dataset and I want to count only the "1" values in each of them. It's a classic frequency table, but dplyr is making things more complicated than I imagined at first glance.
My code is working:
dataset %>%
select(VISimpair, HEARimpai, IntDis, PhyDis, EmBehDis, LearnDis,
ComDis, ASD, HealthImpair, DevDelays) %>% # replace to your needs
summarise_all(funs(sum(1-is.na(.))))
And you can reproduce this code here:
library(tidyverse)
dataset <- data.frame(var1 = rep(c(NA,1),100), var2=rep(c(NA,1),100))
dataset %>% select(var1, var2) %>% summarise_all(funs(sum(1-is.na(.))))
What I really want is to select the variables I need, count how many 0 (or NA) values and how many 1 values each one has, and report the result as a table.
Thanks.
What about the following frequency table per variable?
First, I edit your sample data to also include 0's and load the necessary libraries.
library(tidyr)
library(dplyr)
dataset <- data.frame(var1 = rep(c(NA,1,0),100), var2=rep(c(NA,1,0),100))
Second, I convert the data using gather to make it easier to group_by later for the frequency table created by count, as mentioned by CPak.
dataset %>%
select(var1, var2) %>%
gather(var, val) %>%
mutate(val = factor(val)) %>%
group_by(var, val) %>%
count()
# A tibble: 6 x 3
# Groups: var, val [6]
var val n
<chr> <fct> <int>
1 var1 0 100
2 var1 1 100
3 var1 NA 100
4 var2 0 100
5 var2 1 100
6 var2 NA 100
A quick and dirty method to do this is to coerce your input into factors:
dataset$var1 = as.factor(dataset$var1)
dataset$var2 = as.factor(dataset$var2)
summary(dataset$var1)
summary(dataset$var2)
summary() reports the number of occurrences of each factor level, including the NAs.
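Since the question treats NA and 0 alike as "no", a base R sketch that collapses them into one count may be closer to what was asked. This assumes every selected variable holds only 0, 1 or NA; the names yes, no and freq are illustrative:

```r
dataset <- data.frame(var1 = rep(c(NA, 1, 0), 100),
                      var2 = rep(c(NA, 1, 0), 100))

# Count the 1s per column; NA comparisons are dropped by na.rm = TRUE
yes <- colSums(dataset == 1, na.rm = TRUE)

# Everything else (0 or NA) counts as "no"
no <- nrow(dataset) - yes

freq <- rbind(no = no, yes = yes)
freq
```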

Select distinct rows in a data frame with only NA Values in R

I have a data frame with 3 cols.
ID1 <- c(1,1,2,2,3,4)
ID2 <- c(11,NA,12,NA,NA,NA)
Val <- c("A","B","C","D","E","F")
DF <- data.frame(ID1,ID2,Val, stringsAsFactors=FALSE)
Now, I need to extract the unique rows that have ID2 as NA. In this case, the desired output is a data frame with two rows, i.e. ID1 = 3 and 4. I tried the subset command below, which returns all four rows with NA. I'm looking for a way to achieve the desired output.
DF2 <- subset(DF , is.na(ID2))
If by unique rows you mean unique values of ID1, then this code does the trick:
DF[which(!duplicated(DF$ID1) & is.na(DF$ID2)),]
ID1 ID2 Val
5 3 NA E
6 4 NA F
If you prefer using subset, then this code gives the same output:
subset(DF , !duplicated(ID1) & is.na(ID2))
Try:
library(dplyr)
DF %>%
group_by(ID1) %>%
filter(n() == 1 & is.na(ID2))
Define a function to look up ID1 groups which have all NAs in ID2, and then return the unique rows of them.
library(dplyr)
select_na <- function(df_sub) {
if (any(!is.na(df_sub$ID2))) {
return(df_sub[0,])
}
else {
return(unique(df_sub))
}
}
DF %>%
group_by(ID1) %>%
do(select_na(.))
gives exactly what you want.
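The group-wise check (keep an ID1 only when all of its ID2 values are NA) can also be sketched in base R with ave(). Unlike the !duplicated(ID1) approach, this does not depend on the non-NA row appearing first within each ID1. The names all_na and result are illustrative:

```r
DF <- data.frame(ID1 = c(1, 1, 2, 2, 3, 4),
                 ID2 = c(11, NA, 12, NA, NA, NA),
                 Val = c("A", "B", "C", "D", "E", "F"),
                 stringsAsFactors = FALSE)

# For every row, TRUE when all ID2 values in that row's ID1 group are NA
all_na <- as.logical(ave(is.na(DF$ID2), DF$ID1, FUN = all))

result <- unique(DF[all_na, ])
result
```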

Conditionally mutate columns based on column class

My question is based on a previous topic posted here: Mutating multiple columns in a data frame
Suppose I have a tibble as follows:
id char_var_1 char_var_2 num_var_1 num_var_2 ... x_var_n
1 ... ... ... ... ...
2 ... ... ... ... ...
3 ... ... ... ... ...
where id is the key and char_var_x is a character variable and num_var_x is a numerical variable. I have 346 columns in total and I want to write a function that scales all the numerical variables except the id column. I'm looking for an elegant way to mutate these columns using pipes and dplyr functions.
Obviously the following works for all numeric variables:
pre_process_data <- function(dt)
{
# scale numeric variables
dt %>% mutate_if(is.numeric, scale)
}
But I'm looking for a way to exclude id column from scaling and retain the original values and at the same time scale all other numerical variables. Is there an elegant way to do this?
Try the code below; the answer is similar to the select_if post:
library(dplyr)
# Using @Psidom's example data: https://stackoverflow.com/a/48408027
df %>%
mutate_if(function(col) is.numeric(col) &
!all(col == .$id), scale)
# id a b
# 1 1 a -1
# 2 2 b 0
# 3 3 c 1
This is not a canonical way to do it, but with a little hack you can use mutate_at: build the integer indices of the columns to mutate with which, using manually constructed column-selection conditions:
df = data.frame(id = 1:3, a = letters[1:3], b = 2:4)
df %>%
mutate_at(vars(which(sapply(., is.numeric) & names(.) != 'id')), scale)
# id a b
#1 1 a -1
#2 2 b 0
#3 3 c 1
How about the "make the column you're interested in a character, then change it back" approach?
dt %>%
mutate(id = as.character(id)) %>%
mutate_if(is.numeric, scale) %>%
mutate(id = as.numeric(id))
You can use dplyr's across():
df %>% mutate(across(c(where(is.numeric),-id),scale))
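For comparison, the same exclusion can be sketched in base R without dplyr. scale() returns a one-column matrix, so the as.numeric() wrapper here is an addition (not part of the answers above) that keeps the columns as plain numeric vectors; num_cols is an illustrative name:

```r
df <- data.frame(id = 1:3, a = letters[1:3], b = 2:4)

# Numeric columns other than the key column "id"
num_cols <- setdiff(names(df)[sapply(df, is.numeric)], "id")

# Scale each selected column and flatten the matrix result back to a vector
df[num_cols] <- lapply(df[num_cols], function(x) as.numeric(scale(x)))
df$b
# -1 0 1
```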

dplyr n_distinct with condition

Using dplyr to summarise a dataset, I want to call n_distinct to count the number of unique occurrences in a column. However, I also want to do another summarise() for all unique occurrences in a column where a condition in another column is satisfied.
Example dataframe named "a":
A B
1 Y
2 N
3 Y
1 Y
a %>% summarise(count = n_distinct(A))
However I also want to add a count of n_distinct(A) where B == "Y"
The result should be:
count
3
when you add the condition the result should be:
count
2
The end result I am trying to achieve is both statements merged into one call that gives me a result like
count_all count_BisY
3 2
What is the appropriate way to go about this with dplyr?
This produces the distinct A counts by each value of B using dplyr.
library(dplyr)
a %>%
group_by(B) %>%
summarise(count = n_distinct(A))
This produces the result:
Source: local data frame [2 x 2]
B count
(fctr) (int)
1 N 1
2 Y 2
To produce the desired output added above using dplyr, you can do the following:
a %>% summarise(count_all = n_distinct(A), count_BisY = length(unique(A[B == 'Y'])))
This produces the result:
count_all count_BisY
1 3 2
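The conditional count can also be written with n_distinct() itself by subsetting A inside the call. A minimal sketch, assuming the example data frame is built as below (the question shows it only as a printed table):

```r
library(dplyr)

# Example data reconstructed from the table in the question
a <- data.frame(A = c(1, 2, 3, 1),
                B = c("Y", "N", "Y", "Y"),
                stringsAsFactors = FALSE)

res <- a %>%
  summarise(count_all  = n_distinct(A),
            count_BisY = n_distinct(A[B == "Y"]))  # distinct A among rows where B is "Y"
res
```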
An alternative is to use the uniqueN function from data.table inside dplyr:
library(dplyr)
library(data.table)
a %>% summarise(count_all = n_distinct(A), count_BisY = uniqueN(A[B == 'Y']))
which gives:
count_all count_BisY
1 3 2
You can also do everything with data.table:
library(data.table)
setDT(a)[, .(count_all = uniqueN(A), count_BisY = uniqueN(A[B == 'Y']))]
which gives the same result.
Filtering the dataframe before performing the summarise also works:
a %>%
filter(B=="Y") %>%
summarise(count = n_distinct(A))
We can also use aggregate from base R
aggregate(cbind(count=A)~B, a, FUN=function(x) length(unique(x)))
# B count
#1 N 1
#2 Y 2
Based on the OP's expected output
data.frame(count=length(unique(a$A)),
count_BisY = length(unique(a$A[a$B=="Y"])))
