How to classify a data frame based on columns in R? [duplicate]

This question already has answers here:
Assign unique ID based on two columns [duplicate]
(2 answers)
Closed 3 years ago.
I have a data frame with columns like this:
gene col1 col2 type
------------------------------
gene_1 a b 1
gene_2 aa bb 2
gene_3 a b 1
gene_4 aa bb 2
I want to derive the column "type" from the columns "col1" and "col2", so I need a classification based on those two columns. How should I do this in R?
Thanks a lot.

Based on the output, an option is to create group indices from the columns 'col1' and 'col2'
library(dplyr)
df1 %>%
mutate(type = group_indices(., col1, col2))
#    gene col1 col2 type
#1 gene_1 a b 1
#2 gene_2 aa bb 2
#3 gene_3 a b 1
#4 gene_4 aa bb 2
If there are multiple names, then one option is to convert the string column names to symbols and then evaluate (!!!)
df1 %>%
mutate(type = group_indices(., !!! rlang::syms(names(.)[2:3])))
Or in data.table
library(data.table)
setDT(df1)[, type := .GRP, .(col1, col2)]
data
df1 <- structure(list(gene = c("gene_1", "gene_2", "gene_3", "gene_4"
), col1 = c("a", "aa", "a", "aa"), col2 = c("b", "bb", "b", "bb"
), type = c(1L, 2L, 1L, 2L)), class = "data.frame", row.names = c(NA,
-4L))
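Newer dplyr releases (1.0.0 and later) deprecate passing columns to group_indices() directly, so here is a minimal sketch of the same idea with group_by() and cur_group_id(). The result name res is only illustrative, and the input is rebuilt without the type column so the snippet runs on its own.

```r
library(dplyr)

# Same example data as above, without the 'type' column we want to derive
df1 <- data.frame(
  gene = c("gene_1", "gene_2", "gene_3", "gene_4"),
  col1 = c("a", "aa", "a", "aa"),
  col2 = c("b", "bb", "b", "bb"),
  stringsAsFactors = FALSE
)

res <- df1 %>%
  group_by(col1, col2) %>%          # one group per unique (col1, col2) pair
  mutate(type = cur_group_id()) %>% # integer id of the current group
  ungroup()

res$type
```

Group ids follow the sorted order of the grouping keys, so they may not match the order in which the pairs first appear in the data.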

Related

Rename columns of a dataframe based on another dataframe except columns not in that dataframe in R

Given two dataframes df1 and df2 as follows:
df1:
df1 <- structure(list(A = 1L, B = 2L, C = 3L, D = 4L, G = 5L), class = "data.frame", row.names = c(NA,
-1L))
Out:
A B C D G
1 1 2 3 4 5
df2:
df2 <- structure(list(Col1 = c("A", "B", "C", "D", "X"), Col2 = c("E",
"Q", "R", "Z", "Y")), class = "data.frame", row.names = c(NA,
-5L))
Out:
Col1 Col2
1 A E
2 B Q
3 C R
4 D Z
5 X Y
I need to rename the columns of df1 using df2, except column G, since it is not in df2's Col1.
I used df2$Col2[match(names(df1), df2$Col1)], based on an answer to a related question, but it returns "E" "Q" "R" "Z" NA: as you can see, column G becomes NA. I would like it to keep its original name.
The expected result:
E Q R Z G
1 1 2 3 4 5
How could I deal with this issue? Thanks.

One option is to use na.omit (it's a little bit messy):
colnames(df1)[na.omit(match(names(df1), df2$Col1))] <- df2$Col2[na.omit(match(names(df1), df2$Col1))]
df1
E Q R Z G
1 1 2 3 4 5
I was able to reproduce your error with
df2 <- data.frame(
Col1 = c("H","I","K","A","B","C","D"),
Col2 = c("a1","a2","a3","E","Q","R","Z")
)
The problem is the order of the arguments df2$Col1 and names(df1) in match.
na.omit(match(names(df1), df2$Col1))
gives [1] 4 5 6 7, which are positions in df2, not in df1; df1 has only 5 columns, so indices 6 and 7 do not exist there.
To index df1, swap the arguments in match: na.omit(match(df2$Col1, names(df1))) gives [1] 1 2 3 4
colnames(df1)[na.omit(match(df2$Col1, names(df1)))] <- df2$Col2[na.omit(match(names(df1), df2$Col1))]
This will work, provided the matching names appear in the same relative order in names(df1) and df2$Col1.
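The na.omit approach pairs two separately computed index vectors by position, so it relies on the shared names being in the same order in both data frames. A minimal sketch that avoids this uses a single match() result for both sides. The reordered df2 below is a hypothetical test case, and idx and found are illustrative names.

```r
df1 <- data.frame(A = 1L, B = 2L, C = 3L, D = 4L, G = 5L)

# Hypothetical lookup table with extra rows and a different order than df1
df2 <- data.frame(
  Col1 = c("H", "I", "K", "D", "C", "B", "A"),
  Col2 = c("a1", "a2", "a3", "Z", "R", "Q", "E"),
  stringsAsFactors = FALSE
)

idx <- match(names(df1), df2$Col1)  # position in df2 for each df1 column, NA if absent
found <- !is.na(idx)                # columns of df1 that have a new name

# Rename only the matched columns; unmatched ones (here G) keep their name
names(df1)[found] <- df2$Col2[idx[found]]
names(df1)
```

Because the same idx drives both the columns being renamed and the new names being looked up, the order of rows in df2 no longer matters.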

A solution using the rename_with function from the dplyr package.
library(dplyr)
df3 <- df2 %>%
filter(Col1 %in% names(df1))
df4 <- df1 %>%
rename_with(.cols = df3$Col1, .fn = function(x) df3$Col2[df3$Col1 %in% x])
df4
# E Q R Z G
# 1 1 2 3 4 5
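As a shorter variant, rename() also accepts a named character vector wrapped in any_of(), which silently skips lookup entries that are not columns of the data. This is a sketch assuming dplyr 1.0.0 or later; lookup and renamed are illustrative names.

```r
library(dplyr)

df1 <- data.frame(A = 1L, B = 2L, C = 3L, D = 4L, G = 5L)
df2 <- data.frame(
  Col1 = c("A", "B", "C", "D", "X"),
  Col2 = c("E", "Q", "R", "Z", "Y"),
  stringsAsFactors = FALSE
)

# Named vector: names are the new column names, values are the old ones
lookup <- setNames(df2$Col1, df2$Col2)

# any_of() ignores entries such as "X" that do not exist in df1
renamed <- df1 %>% rename(any_of(lookup))
names(renamed)
```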

How to fill a data frame from another one in R?

I want to fill df2 with information from df1.
df1 as below
ID Mutation
1 A
2 B
2 C
3 A
df2 as below
ID A B C
1
2
3
For example, if mutation A is found in ID 1, then I want it marked as "Y" in df2.
So the df2 result should be
ID A B C
1  Y
2    Y Y
3  Y
I have hundreds of IDs and more than 20 mutations. How can I efficiently achieve this in R? Thanks!

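Using data.table you can try
library(data.table)
setDT(df)
# cast to wide format: one column per mutation, NA where the mutation is absent
df2 <- dcast(df, formula = ID ~ Mutation, value.var = "Mutation")
# replace NA with a blank and every present value with "Y"
df2[, c("A", "B", "C") := lapply(.SD, function(x) ifelse(is.na(x), " ", "Y")), ID]
df2
#Output
   ID A B C
1:  1 Y
2:  2   Y Y
3:  3 Y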

Create a new column with value 'Y' and cast the data in wide format.
library(dplyr)
library(tidyr)
df %>%
mutate(value = 'Y') %>%
pivot_wider(names_from = Mutation, values_from = value, values_fill = '')
# ID A B C
# <int> <chr> <chr> <chr>
#1 1 "Y" "" ""
#2 2 "" "Y" "Y"
#3 3 "Y" "" ""
data
df <- structure(list(ID = c(1L, 2L, 2L, 3L), Mutation = c("A", "B",
"C", "A")), class = "data.frame", row.names = c(NA, -4L))
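If you would rather avoid extra packages, the same wide layout can be sketched in base R with table(), which counts each ID/Mutation pair. The names present, marks and df2 are illustrative, and this assumes every mutation of interest occurs at least once in the data.

```r
df <- data.frame(
  ID = c(1L, 2L, 2L, 3L),
  Mutation = c("A", "B", "C", "A"),
  stringsAsFactors = FALSE
)

# Cross-tabulate ID by Mutation, then turn the counts into TRUE/FALSE
present <- unclass(table(df$ID, df$Mutation)) > 0

# "Y" where the mutation was seen for that ID, empty string otherwise
marks <- ifelse(present, "Y", "")

df2 <- data.frame(
  ID = as.integer(rownames(marks)),
  marks,
  stringsAsFactors = FALSE
)
df2
```

Since table() builds the columns from the values it finds, this scales to 20 or more mutations without listing them by hand.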

Count number of occurrences of two column cases

I have a dataframe:
ID col1 col2
1 LOY A
2 LOY B
3 LOY B
4 LOY B
5 LOY A
I want to count the number of occurrences of each unique combination of col1 and col2. So, the desired result is:
event count
loy-a 2
loy-b 3
How could I do that?

You can also try:
library(dplyr)
#Code
new <- df %>% group_by(event=tolower(paste0(col1,'-',col2))) %>%
summarise(count=n())
Output:
# A tibble: 2 x 2
event count
<chr> <int>
1 loy-a 2
2 loy-b 3
Some data used:
#Data
df <- structure(list(ID = 1:5, col1 = c("LOY", "LOY", "LOY", "LOY",
"LOY"), col2 = c("A", "B", "B", "B", "A")), class = "data.frame", row.names = c(NA,
-5L))

Here is an option where we convert the columns to lower case, then get the count and unite the 'col1', 'col2' to a single 'event' column
library(dplyr)
library(tidyr)
df1 %>%
mutate(across(c(col1, col2), tolower)) %>%
count(col1, col2) %>%
unite(event, col1, col2, sep='-')
-output
# event n
#1 loy-a 2
#2 loy-b 3
NOTE: Returns the OP's expected output
Or using base R
with(df1, table(tolower(paste(col1, col2, sep='-'))))
data
df1 <- structure(list(ID = 1:5, col1 = c("LOY", "LOY", "LOY", "LOY",
"LOY"), col2 = c("A", "B", "B", "B", "A")),
class = "data.frame", row.names = c(NA,
-5L))
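For reference, count() can build the key and name the result column in one call, which gives the event/count column names asked for in the question. This is a sketch assuming a dplyr version where count() has the name argument; res is an illustrative name.

```r
library(dplyr)

df1 <- data.frame(
  ID = 1:5,
  col1 = c("LOY", "LOY", "LOY", "LOY", "LOY"),
  col2 = c("A", "B", "B", "B", "A"),
  stringsAsFactors = FALSE
)

# Build the lower-case "col1-col2" key inline and count rows per key
res <- df1 %>%
  count(event = tolower(paste(col1, col2, sep = "-")), name = "count")
res
```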

How to aggregate undirected combinations in R [duplicate]

This question already has answers here:
Create unique identifier from the interchangeable combination of two variables
(2 answers)
Closed 6 years ago.
I have a dataframe of 3 columns
A B 1
A B 1
A C 1
B A 1
I want to aggregate it such that it considers combinations A-B and B-A to be the same, resulting in
A B 3
A C 1
How do I go about this?

Use pmin and pmax on the first two columns and then do the group-by-count:
library(dplyr)
df %>% group_by(G1 = pmin(V1, V2), G2 = pmax(V1, V2)) %>% summarise(Count = sum(V3))
Source: local data frame [2 x 3]
Groups: G1 [?]
G1 G2 Count
(chr) (chr) (int)
1 A B 3
2 A C 1
Corresponding data.table solution would be:
library(data.table)
setDT(df)
df[, .(Count = sum(V3)), .(G1 = pmin(V1, V2), G2 = pmax(V1, V2))]
G1 G2 Count
1: A B 3
2: A C 1
Data:
df <- structure(list(V1 = c("A", "A", "A", "B"), V2 = c("B", "B", "C",
"A"), V3 = c(1L, 1L, 1L, 1L)), .Names = c("V1", "V2", "V3"), row.names = c(NA,
-4L), class = "data.frame")
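The same pmin/pmax trick also works in base R with aggregate(). This is a minimal sketch; G1, G2 and res are illustrative names, and it assumes V1 and V2 are character columns so that pmin and pmax compare them alphabetically.

```r
df <- data.frame(
  V1 = c("A", "A", "A", "B"),
  V2 = c("B", "B", "C", "A"),
  V3 = c(1L, 1L, 1L, 1L),
  stringsAsFactors = FALSE
)

# Put each pair in a canonical order so A-B and B-A share one key
df$G1 <- pmin(df$V1, df$V2)
df$G2 <- pmax(df$V1, df$V2)

# Sum V3 within each unordered pair
res <- aggregate(V3 ~ G1 + G2, data = df, FUN = sum)
res
```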

Merge the rows in R with the same row name, concatenating the content in the column [duplicate]

This question already has answers here:
Collapse / concatenate / aggregate a column to a single comma separated string within each group
(6 answers)
Closed 7 years ago.
I need help merging the rows with the same name by concatenating the content in one of the columns. For example, in my data frame, df, the rows with the same name match completely across the columns except in col3. I want to merge the rows with the same rowname and concatenate the contents of col3, separated by a comma, to get the result shown below. Thank you for your help.
df
rowname col1 col2 col3
pat 122 A T
bus 222 G C
pat 122 A G
result
rowname col1 col2 col3
pat 122 A T,G
bus 222 G C

Try
aggregate(col3~., df, FUN=toString)
# rowname col1 col2 col3
#1 pat 122 A T, G
#2 bus 222 G C
Or using dplyr
library(dplyr)
df %>%
group_by_(.dots=names(df)[1:3]) %>%
summarise(col3=toString(col3))
# rowname col1 col2 col3
#1 bus 222 G C
#2 pat 122 A T, G
data
df <- structure(list(rowname = c("pat", "bus", "pat"), col1 = c(122,
222, 122), col2 = c("A", "G", "A"), col3 = c("T", "C", "G")),
.Names = c("rowname",
"col1", "col2", "col3"), row.names = c(NA, -3L), class = "data.frame")
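group_by_() is deprecated in current dplyr, so here is a sketch of the same grouping written with across() and all_of(), assuming dplyr 1.0.0 or later. It uses paste(collapse = ",") instead of toString() to get "T,G" without the space, as in the question's expected result; res is an illustrative name.

```r
library(dplyr)

df <- data.frame(
  rowname = c("pat", "bus", "pat"),
  col1 = c(122, 222, 122),
  col2 = c("A", "G", "A"),
  col3 = c("T", "C", "G"),
  stringsAsFactors = FALSE
)

res <- df %>%
  group_by(across(all_of(names(df)[1:3]))) %>%  # group by the first three columns
  summarise(col3 = paste(col3, collapse = ","), .groups = "drop")
res
```

The rows come back sorted by the grouping columns, so bus is listed before pat.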