R: unique observations conditional on another variable - from rows into additional columns

I am new to R and am struggling to find a suitable solution for the following problem.
My dataframe looks approximately like this:
ID Att
1 a
1 b
1 c
2 d
3 e
3 f
4 g
I would like to convert it into a new df of the following form:
ID Att_1 Att_2 ... Att_n
1 a b c
2 d N/A N/A
3 e f N/A
4 g N/A N/A
The number of columns depends on the maximum count of unique 'Att' values per 'ID' (here three). The number of columns in the new dataframe (i.e. 'n') should be determined automatically, along the lines of:
max_ID_count <- table(df$ID)
n <- max(max_ID_count)
Thanks a lot!

We can create a sequence column and then spread
library(tidyverse)
df %>%
  group_by(ID) %>%
  mutate(rn = paste0("Att_", row_number())) %>%
  spread(rn, Att)
# A tibble: 4 x 4
# Groups: ID [4]
# ID Att_1 Att_2 Att_3
# <int> <chr> <chr> <chr>
#1 1 a b c
#2 2 d <NA> <NA>
#3 3 e f <NA>
#4 4 g <NA> <NA>
Or with dcast from data.table
library(data.table)
dcast(setDT(df), ID ~ paste0("Att_", rowid(ID)), value.var = "Att")
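In more recent tidyr versions spread is superseded by pivot_wider, so a rough equivalent of the first approach (a sketch, assuming tidyr >= 1.0.0) would be:
library(dplyr)
library(tidyr)
df %>%
  group_by(ID) %>%
  mutate(rn = paste0("Att_", row_number())) %>%   # same sequence column as above
  ungroup() %>%
  pivot_wider(names_from = rn, values_from = Att)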


Keep last non-missing observation for all variables by group

My data has multiple columns and some of those columns have missing values in different rows. I would like to group (collapse) the data by the variable "g", keeping the last non-missing observation of each variable.
Input:
d <- data.table(a=c(1,NA,3,4),b=c(1,2,3,4),c=c(NA,NA,'c',NA),g=c(1,1,2,2))
Desired output
d_g <- data.table(a=c(1,4),b=c(2,4),c=c(NA,'c'),g=c(1,2))
A data.table (or dplyr) solution is preferred here.
Note: this is related to this question, but the main answers there seem to cause unnecessary NAs in some groups.
Using data.table:
library(data.table)
d[, lapply(.SD, function(x) last(na.omit(x))), g]
# g a b c
#1: 1 1 2 <NA>
#2: 2 4 4 c
One option using dplyr could be:
library(dplyr)
d %>%
  group_by(g) %>%
  summarise(across(everything(), ~ if(all(is.na(.))) NA else last(na.omit(.))))
g a b c
<dbl> <dbl> <dbl> <chr>
1 1 1 2 <NA>
2 2 4 4 c
In base R, aggregate could be used:
aggregate(.~g, d, function(x) tail(x[!is.na(x)], 1), na.action = NULL)
# g a b c
#1 1 1 2
#2 2 4 4 c
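Another option, staying in the tidyverse, is to carry the last non-missing value forward with tidyr::fill and then keep the final row of each group; this is only a sketch and assumes dplyr >= 1.0.0 for slice_tail:
library(dplyr)
library(tidyr)
d %>%
  group_by(g) %>%
  fill(a, b, c, .direction = "down") %>%   # fill NAs with the last preceding non-missing value
  slice_tail(n = 1) %>%                    # the last row per group now holds the filled values
  ungroup()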

How to pass a column name to group_by from a variable

I want to extract the maximum value of a column for each group of a data frame.
I have the column name in a variable which I want to pass to the group_by condition, but it is failing.
I have the data frame below:
df <- read.table(header = TRUE, text = 'Gene Value
A 12
A 10
B 3
B 5
B 6
C 1
D 3
D 4')
The column names are stored in the variables below:
columnselected <- c("Value")
groupbycol <- c("Gene")
My code is:
df %>% group_by(groupbycol) %>% top_n(1, columnselected)
This code gives an error. The desired output is:
Gene Value
A 12
B 6
C 1
D 4
You need to convert the column names to symbols using sym() and then evaluate them with !!
library(dplyr)
df %>% group_by(!!sym(groupbycol)) %>% top_n(1, !!sym(columnselected))
# Gene Value
# <fct> <int>
#1 A 12
#2 B 6
#3 C 1
#4 D 4
We can use group_by_at, which accepts column names as strings, without needing an additional package:
library(dplyr)
df %>%
  group_by_at(groupbycol) %>%
  top_n(1, !! as.name(columnselected))
# A tibble: 4 x 2
# Groups: Gene [4]
# Gene Value
# <fct> <int>
#1 A 12
#2 B 6
#3 C 1
#4 D 4
NOTE: There would be many dupes for this post :=)
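On dplyr 1.0.0 and later you can also skip sym/!! entirely with the .data pronoun, here sketched with slice_max (which supersedes top_n):
library(dplyr)
df %>%
  group_by(.data[[groupbycol]]) %>%            # look the column up by the string it holds
  slice_max(.data[[columnselected]], n = 1)    # keep the row(s) with the maximum value per group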

Drop list columns from dataframe using dplyr and select_if

Is it possible to drop all list columns from a dataframe using dplyr select, similar to dropping a single column?
library(tidyverse)
df <- tibble(
  a = LETTERS[1:5],
  b = 1:5,
  c = list('bob', 'cratchit', 'rules!', 'and', 'tiny tim too')
)
df %>%
  select_if(-is.list)
Error in -is.list : invalid argument to unary operator
This seems to be a workable workaround, but I was wondering whether it can be done with select_if.
df %>%
  select(-which(map(df, class) == 'list'))
Use Negate
df %>%
  select_if(Negate(is.list))
# A tibble: 5 x 2
a b
<chr> <int>
1 A 1
2 B 2
3 C 3
4 D 4
5 E 5
There is also purrr::negate that would give the same result.
We can use Filter from base R
Filter(Negate(is.list), df)
# A tibble: 5 x 2
# a b
# <chr> <int>
#1 A 1
#2 B 2
#3 C 3
#4 D 4
#5 E 5
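On dplyr 1.0.0 and later, select_if is superseded by select() combined with where(); a sketch under that assumption:
library(dplyr)
df %>%
  select(!where(is.list))   # keep every column that is not a list-column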

R - Display unique values in a column rather than count them, within summarize (dplyr pipe)

I would like to reshape my data so that distinct values in one column, related to another column, are displayed in newly created columns.
df
A B
1 <NA> <NA>
2 a b
3 a d
4 b c
similar to:
> df %>%
+ group_by(A) %>%
+ summarise(n_distinct(B))
# A tibble: 3 x 2
A `n_distinct(B)`
<chr> <int>
1 a 2
2 b 1
3 NA 1
But instead of counting the occurrences, can I just display the actual values in new columns?
Something like the below:
df
A B
1 <NA> <NA>
2 a b d
4 b c
I tried spread, but it is not working; the error below comes up:
Error: Duplicate identifiers for rows
Both my columns are factors, but they can be reclassified if need be.
Thank you!
library(dplyr)
library(tidyr)
df %>%
  group_by(A) %>%
  summarise(B = paste0(unique(B), collapse = ',')) %>%
  separate(B, into = paste0('B', 1:2))
# A tibble: 3 x 3
A B1 B2
<chr> <chr> <chr>
1 a b d
2 b c NA
3 NA NA NA
Warning message:
Expected 2 pieces. Missing pieces filled with `NA` in 2 rows [2, 3].
Here is an option using spread after creating a sequence column
library(tidyverse)
df %>%
  group_by(A) %>%
  mutate(n1 = paste0("B", row_number())) %>%
  ungroup() %>%
  spread(n1, B)
# A tibble: 3 x 3
# A B1 B2
# <fct> <fct> <fct>
#1 a b d
#2 b c <NA>
#3 <NA> <NA> <NA>
data
df <- data.frame(A = c(NA, 'a', 'a', 'b'), B = c(NA, 'b', 'd', 'c'))

Grouping a dataframe into 12 groups with the same column values

I have a large dataset with about 15 columns and more than 3 million rows.
Because the dataset is so big, I would like to use multidplyr on it.
Because of the data, it would be impossible to just split my data frame into 12 parts. Let's say that there are columns col1 and col2, each of which has several different values, but the values repeat (in each column separately).
How can I make 12 (or n) similarly sized groups, each of which contains rows that have the same values in both col1 and col2?
Example: let's say one of the possible values in col1 is foo and one in col2 is bar. Then those rows would be grouped together; all rows with these values would be in one group.
So that the question makes sense, there are always more than 12 unique combinations of col1 and col2.
I would try to do something with for and while loops if this were Python, but as this is R, there is probably another way.
Try this:
# As you provided no example data, I created some data repeated three times.
# I used dplyr within the tidyverse, then grouped by the columns and sampled
# the data at random with n = 2 rows per group.
library(tidyverse)
df <- data.frame(a=rep(LETTERS,3), b=rep(letters,3))
# the data:
df %>%
  arrange(a, b) %>%
  group_by(a, b) %>%
  mutate(n = 1:n())
# A tibble: 78 x 3
# Groups: a, b [26]
a b n
<fctr> <fctr> <int>
1 A a 1
2 A a 2
3 A a 3
4 B b 1
5 B b 2
6 B b 3
7 C c 1
8 C c 2
9 C c 3
10 D d 1
# ... with 68 more rows
Sampling the data down to two rows per group at random:
set.seed(123)
df %>%
  arrange(a, b) %>%
  group_by(a, b) %>%
  mutate(n = 1:n()) %>%
  sample_n(2)
# A tibble: 52 x 3
# Groups: a, b [26]
a b n
<fctr> <fctr> <int>
1 A a 1
2 A a 2
3 B b 2
4 B b 3
5 C c 3
6 C c 1
7 D d 2
8 D d 3
9 E e 2
10 E e 1
# ... with 42 more rows
# Create sample data
library(dplyr)
df <- data.frame(a = rep(LETTERS, 3), b = rep(letters, 3),
                 nobs = sample(1:100, 26*3, replace = TRUE), stringsAsFactors = FALSE)
# Get all unique combinations of col1 and col2
combos <- df %>%
  group_by(a, b) %>%
  summarize(n = sum(nobs)) %>%
  as.data.frame(.)
top12 <- combos %>%
  arrange(desc(n)) %>%
  top_n(12, n)
top12
l <- list()
for (i in 1:11) {
  l[[i]] <- combos[combos$a == top12[i, "a"] & combos$b == top12[i, "b"], ]
}
l[[12]] <- combos %>%
  anti_join(top12, by = c("a", "b"))
l
# This produces a list 'l' that contains twelve data frames: the top 11 most commonly occurring pairs of col1 and col2, and all the rest of the data in the 12th list element.
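If the goal is really n similar-sized chunks of the original rows (rather than the summary table), one hedged option is to number the col1/col2 combinations and assign them round-robin to buckets; cur_group_id() needs dplyr >= 1.0.0, and the sizes are only similar if the combinations have roughly comparable row counts:
library(dplyr)
n_groups <- 12
df_binned <- df %>%
  group_by(a, b) %>%
  mutate(combo_id = cur_group_id()) %>%              # one id per unique (a, b) combination
  ungroup() %>%
  mutate(bucket = (combo_id - 1) %% n_groups + 1)    # spread combinations round-robin over buckets
parts <- split(df_binned, df_binned$bucket)          # a list of ~12 data frames; a combination is never split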
