Is there a way to group_by values in R?

I have a data frame that looks something like this:
Column A        Column B
Accepted        Did not accept
Did not accept  Did not accept
and so on.
Is there a way to reshape the data into a summary table that counts the number of Accepted and Did not accept values in each column, something like this?
Accepted or not?  Column A  Column B
Accepted          10        2
Did not accept    5         5

In base R, use sapply with table -
sapply(df, function(x) table(factor(x, levels = c('Accepted', 'Did not accept'))))
#                Column.A Column.B
# Accepted              1        0
# Did not accept        1        2
In tidyverse -
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(cols = everything()) %>%
  pivot_wider(names_from = name, values_from = name,
              values_fn = length, values_fill = 0)
#  value          Column.A Column.B
#  <chr>             <int>    <int>
#1 Accepted              1        0
#2 Did not accept        1        2
data
It is easier to help if you provide data in a reproducible format
df <- structure(list(Column.A = c("Accepted", "Did not accept"), Column.B = c("Did not accept",
"Did not accept")), row.names = c(NA, -2L), class = "data.frame")
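The factor(x, levels = ...) step in the base R answer is what keeps categories with a count of zero in the result. A minimal sketch of the difference, reusing the toy data above (the names lv and counts are only for illustration):

```r
# Same toy data as in the answer above
df <- data.frame(
  Column.A = c("Accepted", "Did not accept"),
  Column.B = c("Did not accept", "Did not accept"),
  stringsAsFactors = FALSE
)

# Without fixed levels, table() only reports values that occur in the column,
# so Column.B has no "Accepted" entry at all
table(df$Column.B)

# With fixed levels, a category that never occurs is reported as 0
lv <- c("Accepted", "Did not accept")
counts <- sapply(df, function(x) table(factor(x, levels = lv)))
counts
```

Because every column then yields a table of the same length, sapply can simplify the result to a matrix with one row per category.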

Convert wide-to-long, then tabulate:
table(stack(df))
#                 ind
# values           Column.A Column.B
#   Accepted              1        0
#   Did not accept        1        2
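To see why this works, it may help to look at the intermediate long format. The sketch below assumes the columns are character vectors (stack skips factor columns); long and tab are illustrative names:

```r
df <- data.frame(
  Column.A = c("Accepted", "Did not accept"),
  Column.B = c("Did not accept", "Did not accept"),
  stringsAsFactors = FALSE
)

# stack() returns one row per cell:
#   values = the cell contents, ind = the name of the column it came from
long <- stack(df)
long

# table() on a two-column data frame cross-tabulates values against ind
tab <- table(long)
tab
```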

Related

Assign consecutive trial numbers to data frame beginning at match

I want to assign consecutive trial numbers (1-16) to a long dataframe depending on certain values of other variables.
It should look like this (simplified):
value trial_no
videoA 1
other 1
videoB 2
other 2
other 2
videoC 3
...
This basically does what I want, but it assigns the row numbers instead of consecutive trial numbers.
df2 <- df1 %>%
  mutate(trial_no = case_when(grepl('video', value) ~ row_number())) %>%
  fill(trial_no)
This might do what I want, but it assigns 16 to all rows.
for (vid in c(1:16)) {
  df2 <- df1 %>%
    mutate(trial_no = case_when(grepl("video", value) ~ vid)) %>%
    fill(trial_no)
}
I'm pretty sure there is an easy solution to this.
Any help is very much appreciated.
Use grepl to flag the video rows, then take the cumulative sum of the TRUEs:
transform(dat, trial_no=cumsum(grepl('video', value)))
#    value trial_no
# 1 videoA        1
# 2  other        1
# 3 videoB        2
# 4  other        2
# 5  other        2
# 6 videoC        3
Data:
dat <- structure(list(value = c("videoA", "other", "videoB", "other",
"other", "videoC")), class = "data.frame", row.names = c(NA,
-6L))
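Since the question uses dplyr, the same idea can be sketched inside mutate. The loop in the question ends with 16 everywhere because each iteration rebuilds df2 from df1, so only the last value of vid survives. A running count of the matches avoids both problems:

```r
library(dplyr)

# Same toy data as above, under the name used in the question
df1 <- data.frame(
  value = c("videoA", "other", "videoB", "other", "other", "videoC")
)

# grepl() gives TRUE on every row that starts a new trial;
# cumsum() turns those TRUEs into a running trial counter
df2 <- df1 %>%
  mutate(trial_no = cumsum(grepl("video", value)))

df2
```

Note that any rows before the first video row would get trial number 0 with this approach.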

extract component from a list of file names in a column to create a new column in R

I have a data frame with numerous columns and rows. One column in particular, "Filename", has information that I would like to extract into a new column, "ID".
A Filename
1 Sample.2020-03-16_2345_WES01_FF001_089-267/2355245_H445.FASTQs/AA56789_1.clipped.fastq.gz
2 Sample.2020-03-15_2355_WES01_FF001_089-267/2345245_H345.FASTQs/AA52399_1.clipped.fastq.gz
My new df2 I would like to create is
A ID
1 AA56789
2 AA52399
I do not have any code written for this, as I am just beginning to understand sub and gsub. Any help would be appreciated, thank you!
We could use basename with trimws
df2 <- cbind(df1['A'], ID = trimws(basename(df1$Filename), whitespace = "_.*"))
-output
df2
A ID
1 1 AA56789
2 2 AA52399
data
df1 <- structure(list(A = 1:2, Filename = c("Sample.2020-03-16_2345_WES01_FF001_089-267/2355245_H445.FASTQs/AA56789_1.clipped.fastq.gz",
"Sample.2020-03-15_2355_WES01_FF001_089-267/2345245_H345.FASTQs/AA52399_1.clipped.fastq.gz"
)), class = "data.frame", row.names = c(NA, -2L))
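The two steps can be looked at separately. This is a sketch on a single path from the example data; it assumes a version of R in which trimws accepts the whitespace argument, and the names path, file and id are only for illustration:

```r
path <- "Sample.2020-03-16_2345_WES01_FF001_089-267/2355245_H445.FASTQs/AA56789_1.clipped.fastq.gz"

# Step 1: basename() drops everything up to and including the last "/"
file <- basename(path)
file

# Step 2: trimws() treats the pattern "_.*" as the thing to trim, so the
# first underscore and everything after it is removed from the right side
id <- trimws(file, whitespace = "_.*")
id
```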
You can use a regular expression to extract the value of interest. In base R, using sub:
cbind(df[1], ID = sub('.*/(\\w+)_.*', '\\1', df$Filename))
# A ID
#1 1 AA56789
#2 2 AA52399
A tidyverse-solution:
library(dplyr)
library(stringr)
df %>%
  mutate(ID = str_replace(Filename, '.*/(\\w+)_.*', '\\1'), .keep = "unused")
which is basically the same as Ronak Shah's answer, or
df %>%
  mutate(ID = str_extract(Filename, "(?<=/)(\\w+)(?=_\\d+\\.clipped)"), .keep = "unused")
Both return
# A tibble: 2 x 2
      A ID
  <dbl> <chr>
1     1 AA56789
2     2 AA52399

convert values in a column consists of text map into integers and aggregate them

I have a dataframe as following:
data.frame("id" = 1:2, "tag" = c("a,b,c","a,d"))
id tag
1 a,b,c
2 a,d
In tag, a and b count as lan, and c and d count as con. I want to count the number of lan and con values in each row and store them in two separate columns, as in the table below:
id  tag    lan_count  con_count
1   a,b,c  2          1
2   a,d    1          1
Could you please give me advice how to do this.
You can also use the following code:
library(dplyr)
library(tidyr)
df <- data.frame("id" = 1:2, "tag" = c("a,b,c","a,d"))
df %>%
  separate_rows(tag, sep = ",") %>%
  group_by(id) %>%
  add_count(tag) %>%
  pivot_wider(id_cols = id, names_from = tag, values_from = n) %>%
  rowwise() %>%
  mutate(lan_count = sum(c_across(a:b), na.rm = TRUE),
         con_count = sum(c_across(c:d), na.rm = TRUE)) %>%
  select(-c(a:d))
# A tibble: 2 x 3
# Rowwise: id
     id lan_count con_count
  <int>     <int>     <int>
1     1         2         1
2     2         1         1
The main issue here is that your data is untidy. So my solution is in two parts: first, tidy the data and then summarise it. Once the data is tidy, the summary is trivial.
library(tidyverse)
# Adjust to suit your real data
maxCols <- 10
d <- data.frame(id = 1:2, tag = c("a,b,c","a,d"))
d %>%
  separate(
    tag,
    sep = ",",
    into = paste0("Element", 1:maxCols),
    extra = "drop",
    fill = "right",
    remove = FALSE
  ) %>%
  pivot_longer(
    cols = starts_with("Element"),
    values_to = "Value",
    names_prefix = "Element"
  ) %>%
  select(-name) %>%
  # Remove unused Values
  filter(!is.na(Value)) %>%
  # At this point the data frame is tidy
  group_by(tag) %>%
  # Translate tags into "categories". Add more if required, or write a function
  mutate(
    lan = Value %in% c("a", "b"),
    con = Value %in% c("c", "d")
  ) %>%
  # Adjust the column specification if more categories are added.
  # Or use a factor instead of binary indicators
  summarise(across(lan:con, sum))
# A tibble: 2 x 3
  tag     lan   con
* <fct> <int> <int>
1 a,b,c     2     1
2 a,d       1     1
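If more categories are added later, the tag-to-category translation could also live in a named lookup vector instead of hard-coded column ranges. The following is only a base R sketch of that idea; the lookup vector category and the helper count_cat are hypothetical names, not part of either answer:

```r
d <- data.frame(id = 1:2, tag = c("a,b,c", "a,d"), stringsAsFactors = FALSE)

# Hypothetical lookup: each tag element maps to one category
category <- c(a = "lan", b = "lan", c = "con", d = "con")

# Split every tag string into its elements
parts <- strsplit(d$tag, ",", fixed = TRUE)

# Count how many elements of each row fall into a given category;
# elements missing from the lookup become NA and are ignored
count_cat <- function(p, cat) sum(category[p] == cat, na.rm = TRUE)

d$lan_count <- vapply(parts, count_cat, integer(1), cat = "lan")
d$con_count <- vapply(parts, count_cat, integer(1), cat = "con")

d
```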

Return column names based on condition

I have a dataset with 18 columns from which I need to return the column names with the highest value(s) for each observation; a simple example is below. I came across this answer, and it almost does what I need, but in some cases I need to combine the names (like ab in maxcol below). How should I do this?
Any suggestions would be greatly appreciated! If it's possible it would be easier for me to understand a tidyverse based solution as I'm more familiar with that than base.
Edit: I forgot to mention that some of the columns in my data have NAs.
library(dplyr, warn.conflicts = FALSE)
#turn this
Df <- tibble(a = 4:2, b = 4:6, c = 3:5)
#into this
Df <- tibble(a = 4:2, b = 4:6, c = 3:5, maxcol = c("ab", "b", "b"))
Created on 2018-10-30 by the reprex package (v0.2.1)
Continuing from the answer in the linked post, we can do
Df$maxcol <- apply(Df, 1, function(x) paste0(names(Df)[x == max(x)], collapse = ""))
Df
#      a     b     c maxcol
#  <int> <int> <int> <chr>
#1     4     4     3 ab
#2     3     5     4 b
#3     2     6     5 b
For every row, we check which position has max values and paste the names at that position together.
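The question's edit mentions NAs, which the comparison above does not handle: max(x) becomes NA as soon as a row contains one. A possible adjustment, sketched on made-up data with a few NAs added (this data is an assumption, not from the question), is to use na.rm = TRUE and wrap the comparison in which():

```r
# Made-up example with missing values
Df <- data.frame(a = c(4, 3, NA), b = c(4, 5, 6), c = c(3, NA, 5))

Df$maxcol <- apply(Df, 1, function(x) {
  # max() ignores NAs; which() drops positions where the comparison is NA
  hits <- which(x == max(x, na.rm = TRUE))
  paste0(names(x)[hits], collapse = "")
})

Df
```

A row consisting only of NAs would produce a warning from max and an empty string in maxcol, so that case may need separate handling.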
If you prefer the tidyverse approach
library(tidyverse)
Df %>%
  mutate(row = row_number()) %>%
  gather(values, key, -row) %>%
  group_by(row) %>%
  mutate(maxcol = paste0(values[key == max(key)], collapse = "")) %>%
  spread(values, key) %>%
  ungroup() %>%
  select(-row)
#  maxcol     a     b     c
#  <chr>  <int> <int> <int>
#1 ab         4     4     3
#2 b          3     5     4
#3 b          2     6     5
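We first convert the data frame from wide to long using gather. Then, grouping by row, we paste together the column names that hold the maximum, and finally spread the long data frame back to wide. Note that in this call the column holding the original column names is called values and the column holding the numbers is called key.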
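With newer versions of tidyr, the same reshaping can be written with pivot_longer and pivot_wider. This is a sketch of that translation rather than a tested drop-in replacement; the column names column and value are chosen here for readability:

```r
library(dplyr)
library(tidyr)

Df <- tibble(a = 4:2, b = 4:6, c = 3:5)

res <- Df %>%
  mutate(row = row_number()) %>%
  # wide to long: one line per (row, column) pair
  pivot_longer(-row, names_to = "column", values_to = "value") %>%
  group_by(row) %>%
  # paste the names of all columns that hold the row maximum
  mutate(maxcol = paste0(column[which(value == max(value, na.rm = TRUE))],
                         collapse = "")) %>%
  ungroup() %>%
  # long back to wide
  pivot_wider(names_from = column, values_from = value) %>%
  select(-row)

res
```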
Here's a solution I found that loops through column names in case you find it hard to wrap your head around spread/gather (pivot_wider/longer)
out_df <- Df %>%
  # calculate rowwise maximum
  rowwise() %>%
  mutate(rowmax = max(across())) %>%
  # create empty maxcol column
  mutate(maxcol = "")
# loop through column names
for (colname in colnames(Df)) {
  out_df <- out_df %>%
    # if the value at the specified column name is the maximum, paste it to the maxcol
    mutate(maxcol = ifelse(.data[[colname]] == rowmax, paste0(maxcol, colname), maxcol))
}
# remove rowmax column if no longer needed
out_df <- out_df %>%
  select(-rowmax)

Separating Column Based on First Value of String

I have an ID variable that I am trying to separate into two separate columns based on their prefix being either a 1 or 2.
An example of my data is:
STR_ID
1434233
2343535
1243435
1434355
I have tried countless ways to try to separate these variables into columns based on their prefixes, but cannot seem to figure it out. Any ideas on how I would do this? Thank you in advance.
We create a grouping variable with substr by extracting the first character/digit of 'STR_ID', and spread it to 'wide' format
library(tidyverse)
df1 %>%
  group_by(grp = paste0('grp', substr(STR_ID, 1, 1))) %>%
  mutate(i = row_number()) %>%
  spread(grp, STR_ID) %>%
  select(-i)
# A tibble: 3 x 2
#     grp1    grp2
#    <int>   <int>
#1 1434233 2343535
#2 1243435      NA
#3 1434355      NA
data
df1 <- structure(list(STR_ID = c(1434233L, 2343535L, 1243435L, 1434355L
)), class = "data.frame", row.names = c(NA, -4L))
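The same grouping idea can be sketched in base R with split, padding the shorter group with NA so the groups fit side by side. The names groups, n and wide are only for illustration:

```r
df1 <- data.frame(STR_ID = c(1434233L, 2343535L, 1243435L, 1434355L))

# Group the IDs by their first digit (substr coerces the integers to text)
groups <- split(df1$STR_ID, substr(df1$STR_ID, 1, 1))

# Pad every group with NA up to the length of the longest one
n <- max(lengths(groups))
wide <- as.data.frame(lapply(groups, `length<-`, n))
names(wide) <- paste0("grp", names(groups))

wide
```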
