find duplicates with grouped variables - r

I guess this can be done with dplyr and some check for duplicates, but I don't know how to handle multiple columns while also distinguishing between the values of a grouping variable.
I have a df that looks like this:
from to group
1 2 metro
2 4 metro
3 4 metro
4 5 train
6 1 train
8 7 train
I want to find the ids which exist in more than one group.
The expected result for the sample df is 1 and 4, because they appear in both the metro and the train group.
Thank you in advance!

Using base R we can split the first two columns based on group and find the intersecting value between the groups using intersect
Reduce(intersect, split(unlist(df[1:2]), df$group))
#[1] 1 4
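The one-liner above relies on split() recycling the group labels across the stacked 'from' and 'to' values. A minimal sketch that spells out each step, assuming the sample data from the question (the intermediate names ids, grp, ids_by_group and shared are only for illustration):

```r
# Sample data from the question (named df here to match the call above)
df <- data.frame(
  from  = c(1L, 2L, 3L, 4L, 6L, 8L),
  to    = c(2L, 4L, 4L, 5L, 1L, 7L),
  group = c("metro", "metro", "metro", "train", "train", "train"),
  stringsAsFactors = FALSE
)

# Stack 'from' and 'to' into one vector of ids (length 12)
ids <- unlist(df[c("from", "to")], use.names = FALSE)

# Repeat the group labels once per stacked column so they line up with ids
grp <- rep(df$group, times = 2)

# One vector of ids per group
ids_by_group <- split(ids, grp)

# Ids that occur in every group
shared <- Reduce(intersect, ids_by_group)
shared
```

With only two groups, "in every group" and "in more than one group" are the same thing. With three or more groups, Reduce(intersect, ...) would only return ids present in all of them, so a count-based approach may be the better fit there.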

We gather the 'from' and 'to' columns into 'long' format, group by 'val', filter the groups having more than one unique 'group' value, then pull the unique 'val' elements
library(dplyr)
library(tidyr)
df1 %>%
  gather(key, val, from:to) %>%
  group_by(val) %>%
  filter(n_distinct(group) > 1) %>%
  distinct(val) %>%
  pull(val)
#[1] 1 4
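In current tidyr, gather() is superseded by pivot_longer(). A sketch of the same pipeline with pivot_longer(), assuming the sample data from the question and reusing the column names 'key' and 'val' from the code above:

```r
library(dplyr)
library(tidyr)

# Sample data from the question
df1 <- data.frame(
  from  = c(1L, 2L, 3L, 4L, 6L, 8L),
  to    = c(2L, 4L, 4L, 5L, 1L, 7L),
  group = c("metro", "metro", "metro", "train", "train", "train"),
  stringsAsFactors = FALSE
)

shared <- df1 %>%
  # Stack 'from' and 'to' into a single 'val' column
  pivot_longer(c(from, to), names_to = "key", values_to = "val") %>%
  group_by(val) %>%
  # Keep ids that appear under more than one group label
  filter(n_distinct(group) > 1) %>%
  distinct(val) %>%
  pull(val)

shared
```

This should return the same ids as the gather() version.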
Or using base R, we can use table to find the frequency of each id within each group, and get the ids that occur in more than one group
out <- with(df1, colSums(table(rep(group, 2), unlist(df1[1:2])) > 0)) > 1
names(which(out))
#[1] "1" "4"
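To see what the table call is doing, here is the same idea broken into steps. This is only a sketch on the sample data; the names tab, present and n_groups are made up for illustration:

```r
# Sample data from the question
df1 <- data.frame(
  from  = c(1L, 2L, 3L, 4L, 6L, 8L),
  to    = c(2L, 4L, 4L, 5L, 1L, 7L),
  group = c("metro", "metro", "metro", "train", "train", "train"),
  stringsAsFactors = FALSE
)

ids <- unlist(df1[c("from", "to")], use.names = FALSE)
grp <- rep(df1$group, times = 2)

# Contingency table: one row per group, one column per id
tab <- table(grp, ids)

# TRUE where an id occurs at least once in a group
present <- tab > 0

# Number of groups in which each id occurs
n_groups <- colSums(present)

# Ids that occur in more than one group
shared <- as.integer(names(n_groups)[n_groups > 1])
shared
```

Unlike intersecting all groups, counting groups per id still works as intended when there are three or more groups.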
data
df1 <- structure(list(from = c(1L, 2L, 3L, 4L, 6L, 8L), to = c(2L, 4L,
4L, 5L, 1L, 7L), group = c("metro", "metro", "metro", "train",
"train", "train")), class = "data.frame", row.names = c(NA, -6L
))

Convert the data to long format and count the unique groups per value, using data.table. melt converts to long format, and data.table allows filtering in the i part of df1[i, j, by], grouping in the by part, and pulling values in the j part.
library(data.table)
library(magrittr)
setDT(df1)
melt(df1, 'group') %>%
.[, .(n = uniqueN(group)), value] %>%
.[n > 1, unique(value)]
# [1] 1 4


Looking for an efficient way of making a new data frame of totals across categories in R

Total R beginner here, looking for the quickest / most sensible way to do this:
I have a data frame that looks similar to this (but much longer):
dataframe:
date a b c
1/1/2021 4 3 2
1/2/2021 2 2 1
1/3/2021 5 3 5
I am attempting to create a new data frame showing totals for a, b, and c (which go on for a while), and don't need the dates. I want to make a data frame that would look like this:
letter total
a 11
b 8
c 8
So far, the closest I have got to this is by writing a pipe like this:
dataframe <- totals %>%
summarize(total_a = sum(a), total_b = sum(b), total_c = sum(c))
which almost gives me what I want, a data frame that looks like this:
|a|b|c|
|:-:|:-:|:-:|
|11|8|8|
Is there a way (besides manually typing out a new data frame for totals) to quickly turn my totals table into the format I'm looking for? Or is there a better way to write the pipe that will give me the table I want? I want to use these totals to make a pie chart but am running into problems when I attempt to make a pie chart out of the table like I have it now. I really appreciate any help in advance and hope I was able to explain what I'm trying to do correctly.
One efficient way is to use colSums from base R, which gives the sum of each column, excluding the date column (hence the -1 in df[,-1]). Then I use stack to put the result into long format. The [,2:1] just changes the order of the columns, so that letter is first and total is second. I wrap this in setNames to rename the columns.
setNames(nm=c("letter", "total"),stack(colSums(df[,-1]))[,2:1])
letter total
1 a 11
2 b 8
3 c 8
Or with tidyverse, we can get the sum of every column, except for date. Then, we can put it into long format using pivot_longer.
library(dplyr)
library(tidyr)
df %>%
  summarise(across(-date, sum)) %>%
  pivot_longer(everything(), names_to = "letter", values_to = "total")
Or another option using data.table:
library(data.table)
dt <- as.data.table(df)
melt(dt[,-1][, lapply(.SD, sum)], id.vars=integer(), variable.name = "letter", value.name = "total")
Data
df <- structure(list(date = c("1/1/2021", "1/2/2021", "1/3/2021"),
a = c(4L, 2L, 5L), b = c(3L, 2L, 3L), c = c(2L, 1L, 5L)),
class = "data.frame", row.names = c(NA, -3L))
Try this :
totals %>% select(a:c) %>% colSums() %>% as.list() %>% as_tibble() %>%
pivot_longer(everything(), names_to = "letter", values_to = "total")
Actually totals %>% select(a:c) %>% colSums() gives what you need as a named vector and the next steps are to turn that into a tibble again. You can skip that part if you don't need it.
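If the named vector is all you need as a starting point, it can also be turned into the two-column layout without any packages. A small sketch on the sample data, where the names sums and totals_long are just illustrative:

```r
# Sample data from the question
df <- data.frame(
  date = c("1/1/2021", "1/2/2021", "1/3/2021"),
  a = c(4L, 2L, 5L),
  b = c(3L, 2L, 3L),
  c = c(2L, 1L, 5L),
  stringsAsFactors = FALSE
)

# Named numeric vector of column totals, without the date column
sums <- colSums(df[, -1])

# Names become the 'letter' column, values become the 'total' column
totals_long <- data.frame(
  letter = names(sums),
  total  = unname(sums),
  stringsAsFactors = FALSE
)
totals_long
```

For the pie chart mentioned in the question, something like pie(totals_long$total, labels = totals_long$letter) should then work with base graphics.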

Filtering a large data frame based on column values using R

I have a very large data frame with about 502493 rows and 261 columns. I want to filter it to keep the IDs with specific codes (codes starting with 'E'). This is what my data looks like:
IDs code1 code2
1 C443 E109
2 AX31 M223
1 E341 QWE1
3 E131 M223
My required output is IDs with codes starting with 'E' only.
IDs code
1 E109
1 E341
3 E131
I am trying to use the 'filter' of dplyr package but not getting the required output.
Thanks in advance
We can reshape to 'long' format with pivot_longer and filter by creating a logical vector from the first character extracted (with substr)
library(dplyr)
library(tidyr)
df1 %>%
  pivot_longer(cols = starts_with("code"),
               values_to = 'code', names_to = NULL) %>%
  filter(substr(code, 1, 1) == "E")
-output
# A tibble: 3 × 2
IDs code
<int> <chr>
1 1 E109
2 1 E341
3 3 E131
If the data is really big, we may filter before the pivot_longer to keep only the rows having at least one code that starts with 'E'
df1 %>%
  filter(if_any(starts_with('code'), ~ substr(., 1, 1) == 'E')) %>%
  pivot_longer(cols = starts_with("code"),
               values_to = 'code', names_to = NULL) %>%
  filter(substr(code, 1, 1) == "E")
If the data is very big, another option is data.table. Convert the data.frame to 'data.table' (setDT), loop across the columns of interest (.SDcols) with lapply, replace the elements that do not start with "E" with NA, then use fcoalesce (called via do.call) to get the first non-NA element for each row
library(data.table)
na.omit(setDT(df1)[, .(IDs, code = do.call(fcoalesce,
lapply(.SD, function(x) replace(x, substr(x, 1, 1) != "E",
NA)))), .SDcols = patterns("code")])
-output
IDs code
1: 1 E109
2: 1 E341
3: 3 E131
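Because fcoalesce returns only the first non-NA value per row, a row with 'E' codes in more than one column would contribute just one of them. If every match should be kept, a base R sketch along these lines may help. It assumes the code columns all start with "code", and the helper names code_cols, long and res are only for illustration:

```r
# Sample data from the question
df1 <- data.frame(
  IDs   = c(1L, 2L, 1L, 3L),
  code1 = c("C443", "AX31", "E341", "E131"),
  code2 = c("E109", "M223", "QWE1", "M223"),
  stringsAsFactors = FALSE
)

# Columns holding codes (assumption: their names start with "code")
code_cols <- grep("^code", names(df1), value = TRUE)

# Stack the code columns, remembering the original row and ID of each value
long <- data.frame(
  row  = rep(seq_len(nrow(df1)), times = length(code_cols)),
  IDs  = rep(df1$IDs, times = length(code_cols)),
  code = unlist(df1[code_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Keep non-missing codes that start with "E"
keep <- !is.na(long$code) & startsWith(long$code, "E")
res <- long[keep, ]

# Restore the original row order and drop the helper column
res <- res[order(res$row), c("IDs", "code")]
rownames(res) <- NULL
res
```

On 261 columns this builds one long vector of all codes, so on very wide data it may be worth filtering the rows first, as in the dplyr variant above.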
data
df1 <- structure(list(IDs = c(1L, 2L, 1L, 3L), code1 = c("C443", "AX31",
"E341", "E131"), code2 = c("E109", "M223", "QWE1", "M223")),
class = "data.frame", row.names = c(NA,
-4L))

How to find the clusters that produce the maximum colMeans in R?

I have a data frame like
V1 V2 V3
1 1 1 2
2 0 1 0
3 3 0 3
....
and I have a vector of the same length as the number of rows in the data frame (it's the cluster from kmeans, if that matters)
[1] 2 2 1...
From those I can get the colMeans for each cluster, like
cm1 <- colMeans(df[fit$cluster==1,])
cm2 <- colMeans(df[fit$cluster==2,])
(I don't think I should do that part explicitly, but that's how I'm thinking about the problem.)
What I want is to get, for each column of the data frame, the value from the vector for which the colMeans is the maximum. Also I'd like to do (separately is fine) the second-highest, third, etc. So in the example I would want the output to be a vector with one element for each column of the data frame:
1 2 1...
because for the first column of the data frame, the column mean for the first cluster is 3, while the column mean for the second cluster is 0.5.
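If the cluster vector has the same length as the number of rows of 'df', split the data by cluster into a list, compute the column means within each cluster and stack them into long format, bind the pieces together with a 'cluster' column, then apply which.max for each column ('ind'). This returns the position of the cluster with the largest mean, which matches the cluster label when the labels are 1, 2, ..., k as in kmeans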
lst1 <- lapply(split(df, fit$cluster), function(x) stack(colMeans(x)))
dat <- do.call(rbind, Map(cbind, cluster = names(lst1), lst1))
aggregate(values ~ ind, dat, FUN = which.max)
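An alternative way to think about it is to build a small matrix of cluster means (clusters in rows, columns of the data in columns) and rank within each column. The following is only a sketch on the sample data: cl stands in for fit$cluster, nth_best is a made-up helper, and it assumes there are at least two clusters so that sapply returns a matrix:

```r
# Sample data from the question
df <- data.frame(V1 = c(1L, 0L, 3L), V2 = c(1L, 1L, 0L), V3 = c(2L, 0L, 3L))
cl <- c(2, 2, 1)  # stands in for fit$cluster

# Rows are clusters, columns are the columns of df
means <- sapply(df, function(col) tapply(col, cl, mean))

# Hypothetical helper: label of the cluster with the n-th largest mean per column
nth_best <- function(m, n) {
  pos <- apply(m, 2, function(x) order(x, decreasing = TRUE)[n])
  setNames(as.integer(rownames(m)[pos]), colnames(m))
}

best   <- nth_best(means, 1)  # cluster with the highest mean per column
second <- nth_best(means, 2)  # cluster with the second-highest mean per column
best
second
```

Looking the label up through rownames(means) instead of using the position directly keeps the result correct even if the cluster labels are not 1, 2, ..., k.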
If we need to return more than one cluster per column (e.g. the highest and second-highest means), create the 'cluster' column in the data, reshape to 'long' format (or use summarise/across), group by 'cluster' and 'name', get the mean of 'value', arrange by 'name' and by 'value' in descending order, then return the first n rows per 'name' with slice_head
library(dplyr)
library(tidyr)
df %>%
  mutate(cluster = fit$cluster) %>%
  pivot_longer(cols = -cluster) %>%
  group_by(cluster, name) %>%
  summarise(value = mean(value), .groups = 'drop') %>%
  arrange(name, desc(value)) %>%
  group_by(name) %>%
  slice_head(n = 2)
data
df <- structure(list(V1 = c(1L, 0L, 3L), V2 = c(1L, 1L, 0L), V3 = c(2L,
0L, 3L)), class = "data.frame", row.names = c("1", "2", "3"))
fit <- structure(list(cluster = c(2, 2, 1)), class = "data.frame",
row.names = c(NA,
-3L))

Keep columns while grouping

I am a beginner in R and was looking for help online, but the examples I found among similar titles don't quite fit my needs, because they only deal with a few columns.
I have a data.frame T1 with over 100 columns and what I am looking for is something like a summary, but I want to retain every other column after the summary. I thought about using aggregate but since it's not a function, I am uncertain. The most promising approach I can think of is shown below.
T2 <- T1 %>% group_by(ID) %>% summarise(AGI = paste(AGI, collapse = "; "))
The summary works the way I want, but I lose any other column.
I definitely appreciate any kind of advice! Thank you very much
Expanding on TTS's comment, if you want to keep any other column you have to use mutate instead of summarise because, as the documentation says, summarise() creates a new data frame.
You should therefore use
T1 %>% group_by(ID) %>% mutate(AGI = paste(AGI, collapse = "; ")) %>% ungroup()
Data
T1 <- structure(list(ID = c(1L, 1L, 2L, 3L, 3L, 4L), UniProt_Accession = c("P25702",
"F4HWZ6", "Q9C5M0", "Q9SR37", "Q9LKR7", "Q9FXI7"), AGI = c("AT1G54630",
"AT1G54630", "AT5G19760", "AT3G09260", "AT5G28510", "AT1G19890"
)), class = "data.frame", row.names = c(NA, -6L))
Output
# A tibble: 6 x 3
# ID UniProt_Accession AGI
# <int> <chr> <chr>
# 1 1 P25702 AT1G54630; AT1G54630
# 2 1 F4HWZ6 AT1G54630; AT1G54630
# 3 2 Q9C5M0 AT5G19760
# 4 3 Q9SR37 AT3G09260; AT5G28510
# 5 3 Q9LKR7 AT3G09260; AT5G28510
# 6 4 Q9FXI7 AT1G19890
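If the goal is instead one row per ID while still keeping the other columns, one possibility is to collapse every non-ID column in the same way. This is a sketch with base R's aggregate on the sample data; collapse_unique is a made-up helper, and dropping repeated values with unique() is an assumption about what is wanted:

```r
# Sample data from the answer above
T1 <- data.frame(
  ID = c(1L, 1L, 2L, 3L, 3L, 4L),
  UniProt_Accession = c("P25702", "F4HWZ6", "Q9C5M0", "Q9SR37", "Q9LKR7", "Q9FXI7"),
  AGI = c("AT1G54630", "AT1G54630", "AT5G19760", "AT3G09260", "AT5G28510", "AT1G19890"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: paste the distinct values of a column together
collapse_unique <- function(x) paste(unique(x), collapse = "; ")

# Apply the helper to every column except ID, once per ID
T2 <- aggregate(
  T1[setdiff(names(T1), "ID")],
  by  = list(ID = T1$ID),
  FUN = collapse_unique
)
T2
```

With over 100 columns this collapses all of them without naming each one. Whether pasting values together makes sense for every column depends on the data, so it may be safer to restrict the column selection to those where it does.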

R: dplyr arrange by row number

I am trying to order a dataset according to values in columns in ascending order.
I have a dataset with 1 row and 3000+ columns. I guess I can just change it to a list and use .[[n]] but I was wondering if there was another way.
The data looks something like this, only with more columns and values.
structure(list(a = -0.00106163456888295, b = -4.11357273721094e-05,
c = -0.000181424293930435), row.names = 1L, class = "data.frame")
I expect something like this:
b c a
1 -4.1135727372109401e-05 -0.00018142429393043499 -0.00106163456888295
I understand you can arrange by column number by doing the following:
.[[column number]]
for example:
mtcars %>% arrange(.[[2]])
what is the row number equivalent?
If I understand you correctly, you want to order the columns based on the values in the single row.
z <- structure(list(a = -0.00106163456888295, b = -4.11357273721094e-05,
c = -0.000181424293930435), row.names = 1L, class = "data.frame")
Base R:
z[,order(z[1,])]
# a c b
# 1 -0.00106163457 -0.000181424294 -0.0000411357274
Tidyverse:
library(dplyr)
z %>%
  # select_at() is superseded; select() accepts the integer positions directly
  select(order(unlist(.)))
Note: I think your expected output might not be correct, as the values are not ordered. Your intended output:
c(-0.000181424293930435, -0.00106163456888295, -4.11357273721094e-05)
# [1] -0.0001814242939 -0.0010616345689 -0.0000411357274
diff(c(-0.000181424293930435, -0.00106163456888295, -4.11357273721094e-05))
# [1] -0.000880210275 0.001020498842
shows the first value is greater than the second, but the second is less than the third. If they were ordered, I would expect the diff to be always-nonnegative; if reverse-ordered, diff should be always-nonpositive.
We can unlist the first row, order and use that in select
library(dplyr)
df1 %>%
select(order(-unlist(.[1,])))
# b c a
#1 -4.113573e-05 -0.0001814243 -0.001061635
This can also be used as a general solution, i.e. if we want to order the columns based on a particular row
n <- 3
mtcars %>%
select(order(-unlist(.[n,])))
Or reshape to 'long' and then use arrange, get the column names and then select
library(tidyr)
df1 %>%
pivot_longer(everything()) %>%
arrange(desc(value)) %>%
pull(name) %>%
select(df1, .)
# b c a
#1 -4.113573e-05 -0.0001814243 -0.001061635
Or use enframe, then arrange, pull the 'name' column and use that in select
library(tibble)
as.list(df1) %>%
enframe %>%
unnest(c(value)) %>%
arrange(desc(value)) %>%
pull(name) %>%
select(df1, .)
Or if we just want to move the column 'c' to the front
df1 %>%
select(c, everything())
# c a b
#1 -0.0001814243 -0.001061635 -4.113573e-05
In base R, we can do
df1[order(-unlist(df1[1,]))]
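The base R idea can be wrapped in a small function so that the row and the direction are arguments. This is only a sketch: order_cols_by_row is a made-up name, and it assumes all columns are numeric, since unlist() would otherwise coerce the row to a common type:

```r
# Hypothetical helper: reorder the columns of df by the values in row n
order_cols_by_row <- function(df, n = 1, decreasing = FALSE) {
  vals <- unlist(df[n, , drop = FALSE])
  df[, order(vals, decreasing = decreasing), drop = FALSE]
}

# Sample data from the question
df1 <- data.frame(
  a = -0.00106163456888295,
  b = -4.11357273721094e-05,
  c = -0.000181424293930435
)

asc  <- order_cols_by_row(df1)                     # smallest value first
desc <- order_cols_by_row(df1, decreasing = TRUE)  # largest value first
names(asc)
names(desc)
```

Using the decreasing argument instead of negating the values is a design choice here. Both give the same order for numeric data.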
data
df1 <- structure(list(a = -0.00106163456888295, b = -4.11357273721094e-05,
c = -0.000181424293930435), row.names = 1L, class = "data.frame")
