Select values shared between two data frames in R

I have two data frames. One has a column of 100 genes; the other has a column of 700 rows, where each row holds several genes separated by commas. I do not know how to select the genes in each row of dataframe 2 based on the gene column in dataframe 1. In other words, for each row of dataframe 2 I want only the genes that also appear in the gene column of dataframe 1.
dataframe1:
column gene:
a
b
c
d
e
f
dataframe2:
column gene:
row1"a,b,c,d,r,t,y"
row2"c,g,h,k,l,a,b,c,p"
I only want the comma-separated genes in each row of dataframe2 that are in the gene column of dataframe1; genes in dataframe2 that are not in dataframe1 should be removed.

Using tidyverse:
library(tidyverse)
library(rebus)
#>
#> Attaching package: 'rebus'
#> The following object is masked from 'package:stringr':
#>
#> regex
#> The following object is masked from 'package:ggplot2':
#>
#> alpha
dataframe1 <- tibble(gene = c("a", "b", "c", "d", "e", "f"))
dataframe2 <- tibble(gene = c("a,b,c,d,r,t,y","c,g,h,k,l,a,b,c,p"))
result <- str_extract_all(dataframe2$gene, rebus::or1(dataframe1$gene)) %>%
map(~ reduce(.x, str_c, sep = ','))
mutate(dataframe2, gene = result) %>% unnest(c(gene))
#> # A tibble: 2 x 1
#> gene
#> <chr>
#> 1 a,b,c,d
#> 2 c,a,b,c
Created on 2021-06-28 by the reprex package (v2.0.0)
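One caveat with the regex approach, sketched here in base R under the assumption that the pattern built from the gene list is a plain alternation such as (?:a|b|c): with real gene symbols, a short name can match inside a longer one. The gene names below are hypothetical and only illustrate the effect; adding word boundaries (\b) is one way to restrict matches to whole entries, as long as the names contain only word characters.

```r
genes <- c("TP53", "TP5")            # hypothetical gene names, not from the question
x <- "TP53BP1,TP5,TP53"

# Plain alternation: also matches the "TP53" prefix of "TP53BP1"
loose <- paste0("(?:", paste(genes, collapse = "|"), ")")
loose_hits <- regmatches(x, gregexpr(loose, x, perl = TRUE))[[1]]

# Same alternation wrapped in word boundaries: whole entries only
strict <- paste0("\\b(?:", paste(genes, collapse = "|"), ")\\b")
strict_hits <- regmatches(x, gregexpr(strict, x, perl = TRUE))[[1]]

loose_hits   # "TP53" "TP5" "TP53"
strict_hits  # "TP5" "TP53"
```

With the single-letter genes in the question this does not matter, which is why the output above is correct.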

Loop over each row of dataframe2$gene with sapply and keep only those values which are %in% dataframe1$gene, after using strsplit to get each comma-separated value.
dataframe1 <- data.frame(gene = c("a", "b", "c", "d", "e", "f"),
stringsAsFactors=FALSE)
dataframe2 <- data.frame(gene = c("a,b,c,d,r,t,y", "c,g,h,k,l,a,b,c,p"),
stringsAsFactors=FALSE)
dataframe2$gene_sub <- sapply(
strsplit(dataframe2$gene, ","),
function(x) paste(x[x %in% dataframe1$gene], collapse=",")
)
dataframe2
## gene gene_sub
##1 a,b,c,d,r,t,y a,b,c,d
##2 c,g,h,k,l,a,b,c,p c,a,b,c
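The same idea can be wrapped in a small reusable function. This is only a sketch: the name filter_genes and the trimws step (to tolerate entries written as "a, b") are my own additions, not part of the answer above.

```r
# Hypothetical helper: keep only the separated entries that occur in `allowed`.
filter_genes <- function(x, allowed, sep = ",") {
  vapply(
    strsplit(x, sep, fixed = TRUE),
    function(parts) {
      parts <- trimws(parts)                        # tolerate "a, b" style input
      paste(parts[parts %in% allowed], collapse = sep)
    },
    character(1)
  )
}

allowed <- c("a", "b", "c", "d", "e", "f")
filtered <- filter_genes(c("a,b,c,d,r,t,y", "c,g,h,k,l,a,b,c,p", "x, y"), allowed)
filtered
# [1] "a,b,c,d" "c,a,b,c" ""
```

A row with no matching gene becomes an empty string rather than NA, and repeated genes (the second "c" in row 2) are kept; wrap the subset in unique() if duplicates should be dropped.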

A tidyverse option using the data from @thelatemail's answer.
library(dplyr)
library(tidyr)
dataframe2 %>%
mutate(row = row_number()) %>%
separate_rows(gene, sep = ',') %>%
left_join(dataframe1 %>%
mutate(gene_sub = gene), by = 'gene') %>%
group_by(row) %>%
summarise(across(c(gene, gene_sub), ~toString(na.omit(.)))) %>%
select(-row)
# gene gene_sub
# <chr> <chr>
#1 a, b, c, d, r, t, y a, b, c, d
#2 c, g, h, k, l, a, b, c, p c, a, b, c

Related

Convert every n # of rows to columns and stack them in R?

I have a tab-delimited text file with a series of timestamped data. I've read it into R using read.delim() and it gives me all the data as characters in a single column. Example:
df <- data.frame(c("2017","A","B","C","2018","X","Y","Z","2018","X","B","C"))
colnames(df) <- "col1"
df
I want to convert every n # of rows (in this case 4) to columns and stack them without using a for loop. Desired result:
col1 <- c("2017","2018","2018")
col2 <- c("A","X","X")
col3 <- c("B","Y","B")
col4 <- c("C","Z","C")
df2 <- data.frame(col1, col2, col3, col4)
df2
I created a for loop, but it can't handle the millions of rows in my df. Should I convert to a matrix? Would converting to a list help? I tried as.matrix(read.table()) and unlist() but without success.
You could use tidyr to reshape the data into the form you want. You will first need to mutate the data to identify which output row each value belongs to and which column it goes in.
Assuming you know there are 4 values per record (n = 4), you could do something like the following with the help of the dplyr package.
library(tidyr)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
n <- 4
df <- data.frame(x = c("2017","A","B","C","2018","X","Y","Z","2018","X","B","C")) %>%
mutate(cols = rep(1:n, n()/n),
id = rep(1:(n()/n), each = n))
pivot_wider(df, id_cols = id, names_from = cols, values_from = x, names_prefix = "cols")
#> # A tibble: 3 × 5
#> id cols1 cols2 cols3 cols4
#> <int> <chr> <chr> <chr> <chr>
#> 1 1 2017 A B C
#> 2 2 2018 X Y Z
#> 3 3 2018 X B C
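The two helper columns can also be derived from the position of each value with integer arithmetic, which may make the indexing easier to follow. A minimal base R sketch, reusing the example vector (the variable names are mine):

```r
x <- c("2017","A","B","C","2018","X","Y","Z","2018","X","B","C")
n <- 4

i    <- seq_along(x) - 1   # zero-based position of each value
cols <- i %% n + 1         # column index: 1, 2, 3, 4, 1, 2, ...
id   <- i %/% n + 1        # record index: 1, 1, 1, 1, 2, 2, ...

data.frame(x, cols, id)
```

This assumes length(x) is a multiple of n, as the rep() calls above do.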
Or, in base R, you could use the split function on the vector and then use do.call to make the data frame:
df <- data.frame(x = c("2017","A","B","C","2018","X","Y","Z","2018","X","B","C"))
split_df <- setNames(split(df$x, rep(1:4, 3)), paste0("cols", 1:4))
do.call("data.frame", split_df)
#> cols1 cols2 cols3 cols4
#> 1 2017 A B C
#> 2 2018 X Y Z
#> 3 2018 X B C
Created on 2022-02-01 by the reprex package (v2.0.1)
The easiest way would be to create a matrix with matrix(ncol=x, byrow=TRUE) and then convert it back to a data.frame. It should be quite fast too.
df |>
unlist() |>
matrix(ncol=4, byrow = TRUE) |>
as.data.frame() |>
setNames(paste0('col', 1:4))
col1 col2 col3 col4
1 2017 A B C
2 2018 X Y Z
3 2018 X B C
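For reuse, the matrix approach could be wrapped in a function that also checks the input length. This is a sketch; the name rows_to_columns and the prefix argument are hypothetical.

```r
# Hypothetical wrapper around the matrix(..., byrow = TRUE) approach.
rows_to_columns <- function(x, n, prefix = "col") {
  # Fail early if the vector cannot be cut into complete records of n values
  stopifnot(length(x) %% n == 0)
  m <- matrix(x, ncol = n, byrow = TRUE)
  setNames(as.data.frame(m, stringsAsFactors = FALSE),
           paste0(prefix, seq_len(n)))
}

x <- c("2017","A","B","C","2018","X","Y","Z","2018","X","B","C")
out <- rows_to_columns(x, 4)
out
```

Because matrix() only reinterprets the vector's layout, there is no per-row loop, which is why this tends to scale to long inputs.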

R merge rows with same row names and column name [duplicate]

This question already has answers here:
Transpose / reshape dataframe without "timevar" from long to wide format
(9 answers)
Closed 2 years ago.
I have data like this:
How can I reshape the data by merging the rows that have the same row name and column name, like this:
I trust the allele information is missing from your data.
If it is added to the data as follows:
data['allele']=c('a1','a2','a1','a2')
then the following solves the problem easily.
It is basically wide to long, followed by joining the SNP and allele columns, and then wide again.
library(tidyr)
long=data %>% gather(snp, value, -c(Pedigree,allele))
long_joined=unite(long, snp, c(snp, allele), remove=TRUE)
spread(long_joined, key = snp, value = value)
Maybe you can try aggregate with unlist:
> aggregate(.~P,df,unlist)
P S1.1 S1.2 S2.1 S2.2
1 a C C G G
2 b C C T T
Data
> dput(df)
structure(list(P = c("a", "a", "b", "b"), S1 = c("C", "C", "C",
"C"), S2 = c("G", "G", "T", "T")), class = "data.frame", row.names = c(NA,
-4L))
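One thing to be aware of with this approach: although the result prints with five columns, S1 and S2 are matrix columns, so the data frame itself has only three. A short sketch of how to check that and flatten it into ordinary columns (the names res and flat are mine):

```r
df <- data.frame(P  = c("a", "a", "b", "b"),
                 S1 = c("C", "C", "C", "C"),
                 S2 = c("G", "G", "T", "T"),
                 stringsAsFactors = FALSE)

res <- aggregate(. ~ P, df, unlist)
ncol(res)            # 3: P plus two matrix columns
is.matrix(res$S1)    # TRUE

# data.frame() splits each matrix column into separate columns
flat <- do.call(data.frame, res)
names(flat)
```

This relies on every group having the same number of rows; otherwise aggregate cannot simplify the per-group results into a matrix.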
Solution using dplyr which is part of the tidyverse collection of R packages.
library(dplyr)
Data:
bar <- "Pedigree SNP1 SNP2
'Individual 1' C G
'Individual 1' C G
'Individual 2' C T
'Individual 2' C T"
foo <- read.table(text=bar, header = TRUE)
Code:
foo %>%
group_by(Pedigree) %>%
mutate(id = row_number()) %>%
pivot_wider(names_from = id, values_from = SNP1:SNP2, names_prefix = ".a")
Output:
#> # A tibble: 2 x 5
#> # Groups: Pedigree [2]
#> Pedigree SNP1_.a1 SNP1_.a2 SNP2_.a1 SNP2_.a2
#> <fct> <fct> <fct> <fct> <fct>
#> 1 Individual 1 C C G G
#> 2 Individual 2 C C T T
Created on 2020-07-26 by the reprex package (v0.3.0)

Organize subgroup strings (text)

I am trying to convert something like this df format:
df <- data.frame(first = c("a", "a", "b", "b", "b", "c"),
words =c("about", "among", "blue", "but", "both", "cat"))
df
first words
1 a about
2 a among
3 b blue
4 b but
5 b both
6 c cat
into the following format:
df1
first words
1 a about, among
2 b blue, but, both
3 c cat
I have tried
aggregate(words ~ first, data = df, FUN = list)
first words
1 a 1, 2
2 b 3, 5, 4
3 c 6
and tidyverse:
df %>%
group_by(first) %>%
group_rows()
Any suggestions would be appreciated!
A data.table solution:
library(data.table)
df <- data.frame(first = c("a", "a", "b", "b", "b", "c"),
words =c("about", "among", "blue", "but", "both", "cat"))
df <- setDT(df)[, lapply(.SD, toString), by = first]
df
# first words
# 1: a about, among
# 2: b blue, but, both
# 3: c cat
# convert back to a data.frame if you want
setDF(df)
Using tidyverse, after the group_by, use summarise to either paste the values together
library(dplyr)
df %>%
group_by(first) %>%
summarise(words = toString(words))
# A tibble: 3 x 2
# first words
# <fct> <chr>
#1 a about, among
#2 b blue, but, both
#3 c cat
or keep it as a list column
df %>%
group_by(first) %>%
summarise(words = list(words))
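The aggregate call from the question can also be made to work in base R by swapping FUN = list for toString. A minimal sketch, assuming the columns are character rather than factor (the "1, 2" output in the question looks like factor codes, which suggests the strings had been converted to factors):

```r
df <- data.frame(first = c("a", "a", "b", "b", "b", "c"),
                 words = c("about", "among", "blue", "but", "both", "cat"),
                 stringsAsFactors = FALSE)   # keep strings as character

res <- aggregate(words ~ first, data = df, FUN = toString)
res
#   first           words
# 1     a    about, among
# 2     b blue, but, both
# 3     c             cat
```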

How to add new column to R dataframe based on values in multiple columns

I have created the following dataframe
df<-data.frame("A"<-(1:5), "B"<-c("A","B", "C", "B",'C' ), "C"<-c("A", "A",
"B", 'B', "B"))
names(df)<-c("A", "B", "C")
I am trying to obtain the duplicated values between columns A and C and add the corresponding values in column B. The expected dataframe should be
df2<- "B" "Dupvalues"
1 A
4 B
I am unable to do this. I request some help here.
df<-data.frame(A = (1:5),
B = c("A","B", "C", "B",'C' ),
C = c("A", "A","B", 'B', "B"), stringsAsFactors = F)
library(dplyr)
df %>%
filter(B == C) %>% # keep rows when B equals C
group_by(A) %>% # for each A
transmute(DupValues = B) %>% # keep the duplicate value
ungroup() # forget the grouping
# # A tibble: 2 x 2
# A DupValues
# <int> <chr>
# 1 1 A
# 2 4 B
Note that this works if your variables are character variables, not factors.
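For comparison, the same filter can be written in base R. This is a sketch; the as.character calls are my addition so the comparison also works if B and C happen to be factors with different level sets.

```r
df <- data.frame(A = 1:5,
                 B = c("A", "B", "C", "B", "C"),
                 C = c("A", "A", "B", "B", "B"),
                 stringsAsFactors = FALSE)

# Compare as character so factor columns with different levels do not error
same <- as.character(df$B) == as.character(df$C)

dup <- df[same, c("A", "B")]      # keep A and the duplicated value
names(dup)[2] <- "DupValues"
dup
#   A DupValues
# 1 1         A
# 4 4         B
```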

left_join two data frames and overwrite

I'd like to merge two data frames where df2 overwrites any values that are NA or present in df1. Merge data frames and overwrite values provides a data.table option, but I'd like to know if there is a way to do this with dplyr. I've tried all of the _join options but none seem to do this. Is there a way to do this with dplyr?
Here is an example:
df1 <- data.frame(y = c("A", "B", "C", "D"), x1 = c(1,2,NA, 4))
df2 <- data.frame(y = c("A", "B", "C"), x1 = c(5, 6, 7))
Desired output:
y x1
1 A 5
2 B 6
3 C 7
4 D 4
I think what you want is to keep the values of df2 and only add the rows of df1 that are not present in df2, which is what anti_join does:
"anti_join return all rows from x where there are not matching values in y, keeping just columns from x."
My solution:
library(dplyr)
df3 <- anti_join(df1, df2, by = "y") %>% bind_rows(df2)
Warning messages:
1: In anti_join_impl(x, y, by$x, by$y) :
joining factors with different levels, coercing to character vector
2: In rbind_all(x, .id) : Unequal factor levels: coercing to character
> df3
Source: local data frame [4 x 2]
y x1
(chr) (dbl)
1 D 4
2 A 5
3 B 6
4 C 7
This line gives the desired output (in a different order), but you should pay attention to the warning messages: when working with your dataset, be sure to read y as a character variable.
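If the original row order of df1 matters, one alternative is to overwrite in place by key. The following is a base R sketch rather than a dplyr solution; it assumes y is a unique key in both data frames, and the helper names idx and found are mine.

```r
df1 <- data.frame(y = c("A", "B", "C", "D"), x1 = c(1, 2, NA, 4),
                  stringsAsFactors = FALSE)
df2 <- data.frame(y = c("A", "B", "C"), x1 = c(5, 6, 7),
                  stringsAsFactors = FALSE)

idx   <- match(df2$y, df1$y)   # row in df1 for each df2 key, NA if absent
found <- !is.na(idx)

result <- df1
result$x1[idx[found]] <- df2$x1[found]   # overwrite rows present in both
result <- rbind(result, df2[!found, ])   # append keys that exist only in df2
result
#   y x1
# 1 A  5
# 2 B  6
# 3 C  7
# 4 D  4
```

Reading y as character avoids the factor-level warnings shown above.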
This is the idiom I now use because it also keeps columns that are not part of the update table. I use different names from the OP's, but the flavor is similar.
The one extra thing I do is create a variable for the keys used in the join, since I use it in a few spots. Otherwise, it does what is desired.
On its own it doesn't handle conditions such as "update this row only if a value is NA"; you should apply that condition when creating the join table.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
.keys <- c("key1", "key2")
.base_table <- tribble(
~key1, ~key2, ~val1, ~val2,
"A", "a", 0, 0,
"A", "b", 0, 1,
"B", "a", 1, 0,
"B", "b", 1, 1)
.join_table <- tribble(
~key1, ~key2, ~val2,
"A", "b", 100,
"B", "a", 111)
# This works
df_result <- .base_table %>%
# Pull off rows from base table that match the join table
semi_join(.join_table, .keys) %>%
# Drop cols from base table that are in join table, except for the key columns
select(-matches(setdiff(names(.join_table), .keys))) %>%
# Left join on the join table columns
left_join(.join_table, .keys) %>%
# Remove the matching rows from the base table, and bind on the newly joined result from above.
bind_rows(.base_table %>% anti_join(.join_table, .keys))
df_result %>%
print()
#> # A tibble: 4 x 4
#> key1 key2 val1 val2
#> <chr> <chr> <dbl> <dbl>
#> 1 A b 0 100
#> 2 B a 1 111
#> 3 A a 0 0
#> 4 B b 1 1
Created on 2019-12-12 by the reprex package (v0.3.0)
