I have data like this:
How can I reshape the data by merging the rows that share the same row name and column name, like this:
I believe your data is missing allele information.
If you add it to the data as follows:
data['allele']=c('a1','a2','a1','a2')
then the following solves the problem easily.
Basically: go from wide to long, join the SNP and allele columns, and then go wide again.
library(tidyr)
long=data %>% gather(snp, value, -c(Pedigree,allele))
long_joined=unite(long, snp, c(snp, allele), remove=TRUE)
spread(long_joined, key = snp, value = value)
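As a hedged aside: gather() and spread() still work, but the tidyr documentation describes them as superseded by pivot_longer() and pivot_wider(). A minimal sketch of the same three steps with the newer functions follows. The data frame below is an assumption, rebuilt from the sample data shown in the other answers, because the question's original table is not available here.

```r
library(tidyr)

# Assumed input: two rows per individual, one row per allele
data <- data.frame(
  Pedigree = c("Individual 1", "Individual 1", "Individual 2", "Individual 2"),
  SNP1 = c("C", "C", "C", "C"),
  SNP2 = c("G", "G", "T", "T"),
  stringsAsFactors = FALSE
)
data$allele <- c("a1", "a2", "a1", "a2")

# Wide to long: one row per (Pedigree, allele, snp)
long <- pivot_longer(data, c(SNP1, SNP2), names_to = "snp", values_to = "value")

# Join SNP name and allele label into one key such as "SNP1_a1"
long_joined <- unite(long, "snp", snp, allele)

# Long to wide again: one column per joined key
wide <- pivot_wider(long_joined, names_from = snp, values_from = value)
wide
```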
Maybe you can try aggregate with unlist:
> aggregate(.~P,df,unlist)
P S1.1 S1.2 S2.1 S2.2
1 a C C G G
2 b C C T T
Data
> dput(df)
structure(list(P = c("a", "a", "b", "b"), S1 = c("C", "C", "C",
"C"), S2 = c("G", "G", "T", "T")), class = "data.frame", row.names = c(NA,
-4L))
Solution using dplyr and tidyr, which are part of the tidyverse collection of R packages.
library(dplyr)
library(tidyr) # provides pivot_wider()
Data:
bar <- "Pedigree SNP1 SNP2
'Individual 1' C G
'Individual 1' C G
'Individual 2' C T
'Individual 2' C T"
foo <- read.table(text=bar, header = TRUE)
Code:
foo %>%
group_by(Pedigree) %>%
mutate(id = row_number()) %>%
pivot_wider(names_from = id, values_from = SNP1:SNP2, names_prefix = ".a")
Output:
#> # A tibble: 2 x 5
#> # Groups: Pedigree [2]
#> Pedigree SNP1_.a1 SNP1_.a2 SNP2_.a1 SNP2_.a2
#> <fct> <fct> <fct> <fct> <fct>
#> 1 Individual 1 C C G G
#> 2 Individual 2 C C T T
Created on 2020-07-26 by the reprex package (v0.3.0)
Hi, I have two dataframes. Where the id matches, I want to replace the values in table a with those from table b.
A sample dataset is here:
a = tibble(id = c(1, 2,3),
type = c("a", "x", "y"))
b= tibble(id = c(1,3),
type =c("d", "n"))
I'm expecting output like the following:
c= tibble(id = c(1,2,3),
type = c("d", "x", "n"))
In dplyr v1.0.0, the rows_update() function was introduced for this purpose:
rows_update(a, b)
# Matching, by = "id"
# # A tibble: 3 x 2
# id type
# <dbl> <chr>
# 1 1 d
# 2 2 x
# 3 3 n
Here is an option using dplyr::left_join and dplyr::coalesce
library(dplyr)
a %>%
rename(old = type) %>%
left_join(b, by = "id") %>%
mutate(type = coalesce(type, old)) %>%
select(-old)
# A tibble: 3 × 2
# id type
# <dbl> <chr>
#1 1 d
#2 2 x
#3 3 n
The idea is to join a with b on column id; then replace missing values in type from b with values from a (column old is the old type column from a, avoiding duplicate column names).
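The same look-up-then-fill idea can also be sketched in base R with match(), without any join. This is only an illustration under the assumption that id is unique in b; the helper names idx and hit are made up for the example.

```r
a <- data.frame(id = c(1, 2, 3), type = c("a", "x", "y"), stringsAsFactors = FALSE)
b <- data.frame(id = c(1, 3), type = c("d", "n"), stringsAsFactors = FALSE)

# For each id in a, the row position of the same id in b (NA if absent)
idx <- match(a$id, b$id)

# Rows of a that have a partner in b
hit <- !is.na(idx)

# Overwrite only those rows; unmatched rows keep their old value
a$type[hit] <- b$type[idx[hit]]
a
```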
I have very basic knowledge of R. I have two tabs (A and B) with rows I want to compare - some values match and some don't. I want R to find the matching elements and add the text value "E" to a pre-existing row in tab A if this is the case.
Example:
Tab A
ID Existing?
1 A
2 B
3 C
4 D
5 E
Tab B
ID
1 D
2 B
3 Y
4 A
5 W
Upon match:
Tab A
ID Existing?
1 A E
2 B E
3 C
4 D E
5 E
I have found information online on how to match tables but none on how to write new information when the match takes place.
Please explain like I'm 5... I have no programming background.
Thank you in advance!
Use match to get the elements in df1$ID that are also in df2$ID, and ifelse to recode the values that are both in df1 and in df2 with "E", and NA otherwise.
df1 <- data.frame(ID = LETTERS[1:5])
df2 <- data.frame(ID = c("D", "B", "Y", "A", "W"))
df1$Existing <- ifelse(match(df1$ID, df2$ID), "E", NA)
ID Existing
1 A E
2 B E
3 C <NA>
4 D E
5 E <NA>
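To unpack that one-liner a little: match() returns, for each ID in df1, its position in df2$ID, or NA when it is not found, and ifelse() treats any found position as TRUE. A small sketch that spells out the intermediate steps (the names pos and found are only for illustration):

```r
df1 <- data.frame(ID = LETTERS[1:5], stringsAsFactors = FALSE)
df2 <- data.frame(ID = c("D", "B", "Y", "A", "W"), stringsAsFactors = FALSE)

# Position of each df1$ID inside df2$ID; NA means "not in df2"
pos <- match(df1$ID, df2$ID)

# TRUE where a position was found
found <- !is.na(pos)

# Write "E" for matches and leave NA elsewhere
df1$Existing <- ifelse(found, "E", NA)
df1
```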
Another solution - using dplyr - would be to join the two dataframes, where you have added the column Existing to the one being joined:
library(dplyr, warn.conflicts = FALSE)
df1 <- tibble(ID = LETTERS[1:5])
df2 <- tibble(ID = c("D", "B", "Y", "A", "W"))
df1 %>%
left_join(df2 %>% mutate(Existing = "E"))
#> Joining, by = "ID"
#> # A tibble: 5 x 2
#> ID Existing
#> <chr> <chr>
#> 1 A E
#> 2 B E
#> 3 C <NA>
#> 4 D E
#> 5 E <NA>
This will set all matching IDs to E and all non-matching to NA.
# data
tab1 <- structure(list(ID = c("A", "B", "C", "D", "E"), Existing = c(NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_)), class = "data.frame", row.names = c(NA,
-5L))
tab2 <- structure(list(ID = c("D", "B", "Y", "A", "W")), class = "data.frame", row.names = c(NA,
-5L))
There are many ways to skin this cat. In base-R, you could try, e.g.,
tab1$Existing[tab1$ID %in% tab2$ID] <- 'E'
In practice, for anything more complicated than tables with 6 rows, you could try dplyr:
library(dplyr)
tab1 %>% mutate(Existing = ifelse(ID %in% tab2$ID, 'E',NA))
Another useful tool -- with a slightly differing syntax -- is data.table.
library(data.table)
setDT(tab1) -> tab1
setDT(tab2) -> tab2
tab1[,Existing := ifelse(tab1$ID %in% tab2$ID, 'E',NA)]
Note that here mutate and := play roughly the same role. If you work more with R, you will probably develop an affinity for one of the "dialects" above.
EDIT: To drop the rows with NA values (in dplyr), you could either do:
tab1 %>% mutate(Existing = ifelse(ID %in% tab2$ID, 'E',NA)) %>%
filter(!is.na(Existing))
Or piggy-backing on @jpiversen's solution:
df1 %>%
inner_join(df2 %>% mutate(Existing = "E"))
I'm not sure if I phrased my question properly, so let me give a simplified example:
Given a dataset as follows:
dat <- tibble(X = c("A", "B", "B", "C", "A"),
Y = c("B", "A", "C", "A", "C"))
how can I compute a pair variable that represents whatever is in X and Y in a given row, but without generating duplicates, as here:
dat$pair <- c("A-B", "A-B", "B-C", "C-A", "C-A")
dat
# A tibble: 5 × 3
X Y pair
<chr> <chr> <chr>
1 A B A-B
2 B A A-B
3 B C B-C
4 C A C-A
5 A C C-A
I can compute a pairing with paste0, but it will introduce duplicates (C-A is the same as A-C for me) that I want to avoid:
> dat <- mutate(dat, pair = paste0(X, "-", Y))
> dat
# A tibble: 5 × 3
X Y pair
<chr> <chr> <chr>
1 A B A-B
2 B A B-A
3 B C B-C
4 C A C-A
5 A C A-C
We can use pmin and pmax to sort the two values in each row elementwise, and then paste them.
transform(dat, pair = paste(pmin(X, Y), pmax(X, Y), sep = '-'))
# X Y pair
#1 A B A-B
#2 B A A-B
#3 B C B-C
#4 C A A-C
#5 A C A-C
If you prefer dplyr this can be written as -
library(dplyr)
dat %>% mutate(pair = paste(pmin(X, Y), pmax(X, Y), sep = '-'))
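As a small sketch of why this works: pmin() and pmax() compare the two vectors position by position, so each row's smaller value always lands on the left. The example below assumes X and Y are character columns (if they are factors, converting with as.character() first is probably needed), and the names lo, hi and pair_counts are only illustrative.

```r
dat <- data.frame(X = c("A", "B", "B", "C", "A"),
                  Y = c("B", "A", "C", "A", "C"),
                  stringsAsFactors = FALSE)

# Row-wise smaller and larger of the two strings
lo <- pmin(dat$X, dat$Y)
hi <- pmax(dat$X, dat$Y)

# Order-independent label for each row
dat$pair <- paste(lo, hi, sep = "-")

# How often each unordered pair occurs
pair_counts <- table(dat$pair)
pair_counts
```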
I sorted the two values within each row before pasting them:
dat <- data.frame(X = c("A", "B", "B", "C", "A"),
Y = c("B", "A", "C", "A", "C"))
library(dplyr)
dat %>%
rowwise %>%
mutate(pair = paste0(sort(c(as.character(X),as.character(Y)),decreasing = F),collapse = '-')) %>%
ungroup
Output:
X Y pair
<fct> <fct> <chr>
1 A B A-B
2 B A A-B
3 B C B-C
4 C A A-C
5 A C A-C
With dplyr and tidyr you could try:
library(dplyr)
library(tidyr)
dat %>%
rowwise() %>%
mutate(pair = list(c(X, Y)),
pair = list(sort(pair)),
pair = list(paste(pair, collapse = "-"))) %>%
select(pair) %>%
distinct() %>%
unnest(pair)
#> # A tibble: 3 x 1
#> pair
#> <chr>
#> 1 A-B
#> 2 B-C
#> 3 A-C
Created on 2021-08-27 by the reprex package (v2.0.0)
data
dat <- data.frame(X = c("A", "B", "B", "C", "A"),
Y = c("B", "A", "C", "A", "C"))
I have two dataframes. One has a column of 100 genes; the other has a column of 700 rows, where each row holds several genes separated by commas. How can I select the genes in each row of dataframe 2 according to the gene column in dataframe 1? In other words, for each row of dataframe 2, I want only the genes that also appear in the gene column of dataframe 1.
dataframe1:
column gene:
a
b
c
d
e
f
dataframe2:
column gene:
row1 "a,b,c,d,r,t,y"
row2 "c,g,h,k,l,a,b,c,p"
For each row of dataframe2, I only want the comma-separated genes that are in the gene column of dataframe1; genes that are in dataframe2 but not in dataframe1 should be removed.
Using tidyverse:
library(tidyverse)
library(rebus)
#>
#> Attaching package: 'rebus'
#> The following object is masked from 'package:stringr':
#>
#> regex
#> The following object is masked from 'package:ggplot2':
#>
#> alpha
dataframe1 <- tibble(gene = c("a", "b", "c", "d", "e", "f"))
dataframe2 <- tibble(gene = c("a,b,c,d,r,t,y","c,g,h,k,l,a,b,c,p"))
result <- str_extract_all(dataframe2$gene, rebus::or1(dataframe1$gene)) %>%
map(~ reduce(.x, str_c, sep = ','))
mutate(dataframe2, gene = result) %>% unnest(c(gene))
#> # A tibble: 2 x 1
#> gene
#> <chr>
#> 1 a,b,c,d
#> 2 c,a,b,c
Created on 2021-06-28 by the reprex package (v2.0.0)
Loop over each row of dataframe2$gene with sapply: strsplit each row into its comma-separated values, and keep only those values which are %in% dataframe1$gene.
dataframe1 <- data.frame(gene = c("a", "b", "c", "d", "e", "f"),
stringsAsFactors=FALSE)
dataframe2 <- data.frame(gene = c("a,b,c,d,r,t,y", "c,g,h,k,l,a,b,c,p"),
stringsAsFactors=FALSE)
dataframe2$gene_sub <- sapply(
strsplit(dataframe2$gene, ","),
function(x) paste(x[x %in% dataframe1$gene], collapse=",")
)
dataframe2
## gene gene_sub
##1 a,b,c,d,r,t,y a,b,c,d
##2 c,g,h,k,l,a,b,c,p c,a,b,c
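If repeated genes within a row should be collapsed (row 2 above keeps "c" twice), the same split-and-filter idea can be wrapped in a small helper. This is only a sketch: keep_known() is a hypothetical name, and dropping duplicates and trimming whitespace are assumptions that go beyond what the question asked for.

```r
# Hypothetical helper: keep only the genes in `x` that appear in `known`
keep_known <- function(x, known, sep = ",") {
  parts <- strsplit(x, sep, fixed = TRUE)
  vapply(parts, function(p) {
    p <- trimws(p)                      # tolerate "a, b, c" style input
    paste(unique(p[p %in% known]), collapse = sep)
  }, character(1))
}

known_genes <- c("a", "b", "c", "d", "e", "f")
rows <- c("a,b,c,d,r,t,y", "c,g,h,k,l,a,b,c,p")
filtered <- keep_known(rows, known_genes)
filtered
```

Because whole elements are compared with %in% rather than matched by regular expression, a gene name that merely contains another name as a substring is not picked up by accident.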
A tidyverse option using data from @thelatemail.
library(dplyr)
library(tidyr)
dataframe2 %>%
mutate(row = row_number()) %>%
separate_rows(gene, sep = ',') %>%
left_join(dataframe1 %>%
mutate(gene_sub = gene), by = 'gene') %>%
group_by(row) %>%
summarise(across(c(gene, gene_sub), ~toString(na.omit(.)))) %>%
select(-row)
# gene gene_sub
# <chr> <chr>
#1 a, b, c, d, r, t, y a, b, c, d
#2 c, g, h, k, l, a, b, c, p c, a, b, c
Suppose I have a data frame with a single column that contains letters a, b, c, d, e.
a
b
c
d
e
In R, is it possible to extract a single letter, such as 'a', and produce all possible paired combinations between 'a' and the other letters (with no duplications)? Could the combn command be used in this case?
a b
a c
a d
a e
We can use data.frame
data.frame(col1 = 'a', col2 = setdiff(df1$V1, "a"))
Output:
col1 col2
1 a b
2 a c
3 a d
4 a e
data
df1 <- structure(list(V1 = c("a", "b", "c", "d", "e")),
class = "data.frame", row.names = c(NA,
-5L))
Update:
With .before=1 argument the code is shorter :-)
df %>%
mutate(col_a = first(col1), .before=1) %>%
slice(-1)
With dplyr you can:
library(dplyr)
df %>%
mutate(col2 = first(col1)) %>%
slice(-1) %>%
select(col2, col1)
Output:
col2 col1
<chr> <chr>
1 a b
2 a c
3 a d
4 a e
You could use
expand.grid(x=df[1,], y=df[2:5,])
which returns
x y
1 a b
2 a c
3 a d
4 a e
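On the question of whether combn can be used: it can, although it generates every pair first, so the rows involving 'a' have to be filtered out afterwards. A minimal sketch, assuming the single column is named V1 as in the data shown earlier (all_pairs and a_pairs are illustrative names):

```r
df1 <- data.frame(V1 = c("a", "b", "c", "d", "e"), stringsAsFactors = FALSE)

# All unordered pairs, one pair per row (combn returns one pair per column)
all_pairs <- t(combn(df1$V1, 2))

# Keep only the pairs that contain "a"
has_a <- all_pairs[, 1] == "a" | all_pairs[, 2] == "a"
a_pairs <- all_pairs[has_a, , drop = FALSE]
a_pairs
```

For a single fixed letter this does more work than the setdiff approach, so it is mainly useful when pairs for several letters are needed from the same table.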