I am analyzing a dataset and need to find matching samples between two versions of the data.
They (should) contain the same expression data, but they have different sample identifiers. Let's say the first dataframe looks like this:
gene sample expression
1 a a 1
2 a b 2
3 a c 3
4 a d 4
5 a e 5
6 a f 6
7 a g 7
8 a h 8
9 a i 9
10 a j 10
11 a k 11
12 a l 12
13 a m 13
14 a n 14
I made the dataframe for one gene, but you can imagine that this is a large dataset containing ~20k genes. What I need to do is find the closest match in gene expression so I know which samples correspond. The second dataframe might look like this:
gene sample expression
1 a z 1.5
2 a y 2.5
3 a x 3
4 a w 4.5
5 a v 5.7
6 a u 6.2
7 a t 7.8
8 a s 8.1
9 a r 9.8
10 a q 10.5
11 a p 11
12 a o 12
13 a 2 13.3
14 a 4 14.4
What I need to do is write a function (or something like that) that tries to match the expressions of genes in a dataframe as closely as possible (for all genes) and reports the sample identifiers with the closest match. I'm quite new to R and could use a little help.
I would like the output to look like this:
gene sample expression sample2
1 a z 1 z
2 a y 2 y
3 a x 3 x
4 a w 4 w
5 a v 5 v
6 a u 6 u
7 a t 7 t
8 a s 8 s
9 a r 9 r
10 a q 10 q
11 a p 11 p
12 a o 12 o
13 a 2 13 2
14 a 4 14 4
That is, an extra column per sample that specifies the closest match in gene expression across all genes. The extra column must be created based on all genes, not on one gene.
Here are two options. In your example, it looks like there are always whole number matches, so you could join by whole number. Alternatively, you could try to extract the closest number. I use floor because it looks like you want 1.5 to be joined to 1 and not 2.
library(tidyverse)
#extract closest whole number
df1 |>
mutate(sample2 = map_chr(expression,
\(x)df2$sample[which.min(abs(x - floor(df2$expression)))]))
#> # A tibble: 14 x 4
#> gene sample expression sample2
#> <chr> <chr> <dbl> <chr>
#> 1 a a 1 z
#> 2 a b 2 y
#> 3 a c 3 x
#> 4 a d 4 w
#> 5 a e 5 v
#> 6 a f 6 u
#> 7 a g 7 t
#> 8 a h 8 s
#> 9 a i 9 r
#> 10 a j 10 q
#> 11 a k 11 p
#> 12 a l 12 o
#> 13 a m 13 2
#> 14 a n 14 4
#join by whole number
left_join(df1,
df2 |>
mutate(expression = as.numeric(gsub("^(.*)\\.\\d+$", "\\1", expression))) |>
select(sample2 = sample, expression),
by = "expression")
#> # A tibble: 14 x 4
#> gene sample expression sample2
#> <chr> <chr> <dbl> <chr>
#> 1 a a 1 z
#> 2 a b 2 y
#> 3 a c 3 x
#> 4 a d 4 w
#> 5 a e 5 v
#> 6 a f 6 u
#> 7 a g 7 t
#> 8 a h 8 s
#> 9 a i 9 r
#> 10 a j 10 q
#> 11 a k 11 p
#> 12 a l 12 o
#> 13 a m 13 2
#> 14 a n 14 4
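Both options above match on a single gene. Since the question asks for a match based on all genes, one possible approach (a minimal sketch, not taken from the answers above) is to reshape each version into a gene-by-sample matrix and pick, for every sample in the first version, the sample in the second version with the smallest summed squared difference across genes. The toy values, the noise vector, and the names m1, m2, d and matches are all made up for illustration.

```r
# Hypothetical toy data: 3 genes x 4 samples (values invented for illustration)
genes <- c("a", "b", "c")
base <- matrix(c(1, 5, 9,
                 2, 7, 1,
                 8, 3, 4,
                 6, 6, 2),
               nrow = 3, dimnames = list(genes, c("s1", "s2", "s3", "s4")))
noise <- c(0.2, -0.1, 0.3, 0.1, 0.2, -0.2, -0.3, 0.1, 0.2, 0.1, -0.1, 0.3)

df1 <- data.frame(gene = rep(genes, times = 4),
                  sample = rep(colnames(base), each = 3),
                  expression = as.vector(base),
                  stringsAsFactors = FALSE)
# Second version: same samples in a different order, renamed, slightly perturbed
df2 <- data.frame(gene = rep(genes, times = 4),
                  sample = rep(c("w", "x", "y", "z"), each = 3),
                  expression = as.vector(base[, c(3, 1, 4, 2)]) + noise,
                  stringsAsFactors = FALSE)

# Reshape long -> gene x sample matrices (assumes one value per gene/sample pair)
m1 <- tapply(df1$expression, list(df1$gene, df1$sample), mean)
m2 <- tapply(df2$expression, list(df2$gene, df2$sample), mean)
m2 <- m2[rownames(m1), , drop = FALSE]  # align gene order between versions

# d[i, j] = summed squared difference between sample i of df1 and sample j of df2
d <- sapply(colnames(m2), function(j) colSums((m1 - m2[, j])^2))

# Closest df2 sample for each df1 sample (greedy; not forced to be one-to-one)
matches <- data.frame(sample = rownames(d),
                      sample2 = colnames(d)[apply(d, 1, which.min)],
                      stringsAsFactors = FALSE)
result <- merge(df1, matches, by = "sample")
```

With ~20k genes the same idea applies, though a correlation-based similarity (for example cor(m1, m2)) may be a better choice if the two versions differ in scale.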
I have two phylogenetic trees which have the same topology (except for branch lengths):
In R using ape:
t1 <- ape::read.tree(file="",text="(((HS:72,((CP:30,CL:30.289473923):32,RN:62):10):2,(CS:63,BS:63):11):5,LA:79);")
t2 <- ape::read.tree(file="",text="(((((CP:39,CL:39):29,RN:68):9,HS:77):5,(BS:63,CS:63):19):14,LA:96);")
> ape::all.equal.phylo(t1,t2,use.edge.length = F,use.tip.label = T)
[1] TRUE
I want to compute the mean branch lengths across the two. The problem is that, although their topologies are identical, the order in which their nodes are represented is not, and not all tree nodes are labeled tips, so I don't think there's a simple join solution:
> head(tidytree::as_tibble(t1))
# A tibble: 6 x 4
parent node branch.length label
<int> <int> <dbl> <chr>
1 10 1 72 HS
2 12 2 30 CP
3 12 3 30.3 CL
4 11 4 62 RN
5 13 5 63 CS
6 13 6 63 BS
> tail(tidytree::as_tibble(t1))
# A tibble: 6 x 4
parent node branch.length label
<int> <int> <dbl> <chr>
1 8 8 NA NA
2 8 9 5 NA
3 9 10 2 NA
4 10 11 10 NA
5 11 12 32 NA
6 9 13 11 NA
> head(tidytree::as_tibble(t2))
# A tibble: 6 x 4
parent node branch.length label
<int> <int> <dbl> <chr>
1 12 1 39 CP
2 12 2 39 CL
3 11 3 68 RN
4 10 4 77 HS
5 13 5 63 BS
6 13 6 63 CS
> tail(tidytree::as_tibble(t2))
# A tibble: 6 x 4
parent node branch.length label
<int> <int> <dbl> <chr>
1 8 8 NA NA
2 8 9 14 NA
3 9 10 5 NA
4 10 11 9 NA
5 11 12 29 NA
6 9 13 19 NA
So it's not clear to me how I'd correspond between any pair of branch lengths in order to take their mean.
Any idea how to match them or reorder t2 according to t1?
Supposedly phytools' matchNodes method is meant for that but it doesn't seem like it's getting it right:
phytools::matchNodes(t1, t2,method = "descendants")
tr1 tr2
[1,] 8 8
[2,] 9 9
[3,] 10 10
[4,] 11 11
[5,] 12 12
[6,] 13 13
At least I'd expect it to correspond the tips correctly, meaning:
dplyr::left_join(dplyr::filter(tidytree::as_tibble(t1),!is.na(label)) %>% dplyr::select(node,label) %>% dplyr::rename(t1.node=node),
+ dplyr::filter(tidytree::as_tibble(t2),!is.na(label)) %>% dplyr::select(node,label) %>% dplyr::rename(t2.node=node))
Joining, by = "label"
# A tibble: 7 x 3
t1.node label t2.node
<int> <chr> <int>
1 1 HS 4
2 2 CP 1
3 3 CL 2
4 4 RN 3
5 5 CS 6
6 6 BS 5
7 7 LA 7
But that's not happening.
Ultimately, the information for matching is in these tree tibbles because they list the parent of each node, but in practice using that information to match the nodes probably requires some recursive steps.
It seems that ape's makeNodeLabel with method = "md5sum", which labels each internal node with an MD5 digest of its descendant tip labels, achieves that:
t1 <- ape::read.tree(file="",text="(((HS:72,((CP:30,CL:30.289473923):32,RN:62):10):2,(CS:63,BS:63):11):5,LA:79);")
t2 <- ape::read.tree(file="",text="(((((CP:39,CL:39):29,RN:68):9,HS:77):5,(BS:63,CS:63):19):14,LA:96);")
dplyr::left_join(tidytree::as_tibble(ape::makeNodeLabel(t1, method = "md5sum")) %>% dplyr::select(node,label) %>% dplyr::rename(t1.node=node),
tidytree::as_tibble(ape::makeNodeLabel(t2, method = "md5sum")) %>% dplyr::select(node,label) %>% dplyr::rename(t2.node=node))
Joining, by = "label"
# A tibble: 13 x 3
t1.node label t2.node
<int> <chr> <int>
1 1 HS 4
2 2 CP 1
3 3 CL 2
4 4 RN 3
5 5 CS 6
6 6 BS 5
7 7 LA 7
8 8 da5f57f0a757f7211fcf84c540d9531a 8
9 9 9bffe86cf0a2650b6a3f0d494c0183a9 9
10 10 bcf7b41992a064acd2e3e66fee7fe2d4 10
11 11 d50e0698114c621b49322697267900b7 11
12 12 f0a8c7fa67831514e65cdadbc68c3d31 12
13 13 82ab4cf8ae4a4df14cf87a48dc2638e0 13
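Once every node has a label that depends only on the tips below it, taking the mean branch length is a plain join. The sketch below illustrates the recursive idea in base R without ape: it builds a key for each node from its sorted descendant tip labels, joins the two edge tables on that key, and averages. The edge tables are transcribed by hand from the two Newick strings above, and clade_key, tips_below and both are hypothetical helper names. It treats the trees as rooted.

```r
# Edge tables transcribed from the two Newick strings (root row omitted)
t1 <- data.frame(
  parent = c(10, 12, 12, 11, 13, 13, 8, 8, 9, 10, 11, 9),
  node = c(1:7, 9:13),
  branch.length = c(72, 30, 30.289473923, 62, 63, 63, 79, 5, 2, 10, 32, 11),
  label = c("HS", "CP", "CL", "RN", "CS", "BS", "LA", rep(NA, 5)),
  stringsAsFactors = FALSE
)
t2 <- data.frame(
  parent = c(12, 12, 11, 10, 13, 13, 8, 8, 9, 10, 11, 9),
  node = c(1:7, 9:13),
  branch.length = c(39, 39, 68, 77, 63, 63, 96, 14, 5, 9, 29, 19),
  label = c("CP", "CL", "RN", "HS", "BS", "CS", "LA", rep(NA, 5)),
  stringsAsFactors = FALSE
)

# Key each node by the sorted set of tip labels that descend from it
clade_key <- function(tree) {
  tips_below <- function(n) {
    kids <- tree$node[tree$parent == n]
    if (length(kids) == 0) return(tree$label[tree$node == n])  # n is a tip
    unlist(lapply(kids, tips_below))                           # recurse into children
  }
  vapply(tree$node,
         function(n) paste(sort(tips_below(n)), collapse = "|"),
         character(1))
}

t1$key <- clade_key(t1)
t2$key <- clade_key(t2)

# Join on the clade key, then average the two branch lengths
both <- merge(t1, t2, by = "key", suffixes = c(".t1", ".t2"))
both$mean.length <- (both$branch.length.t1 + both$branch.length.t2) / 2
both[order(both$node.t1), c("node.t1", "node.t2", "key", "mean.length")]
```

The same join works on the makeNodeLabel output above by using the label column as the key instead of the hand-built one.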
This may have a simple answer, but after a few hours of searching I still cannot find it. Basically I need to turn a wide dataset into a long-format dataset, but with multiple variables. My dataset structure looks like this:
df1 <- data.frame(id = c(1,2,3),
sex = c("M","F","M"),
day0s = c(21,25,15),
day1s = c(20,30,18),
day2s = c(18,18,17),
day0t = c(2,5,7),
day1t = c(3,6,5),
day2t = c(3,8,7))
df1
id sex day0s day1s day2s day0t day1t day2t
1 M 21 20 18 2 3 3
2 F 25 30 18 5 6 8
3 M 15 18 17 7 5 7
Basically, 3 subjects have taken a math test (s) and a history test (t) every day for 3 days.
I tried to use gather from tidyr to turn it into long form, but I don't know how to assign the s and t variables to the same day. I also coded a new variable day with just day0 = 0, day1 = 1 and day2 = 2.
dfl <- df1 %>%
gather(day, value, - c(id,sex))
dfl
id sex variable value day
1 M day0s 21 0
1 M day1s 20 1
1 M day2s 18 2
1 M day0t 2 0
1 M day1t 3 1
1 M day2t 3 2
2 F day0s 25 0
2 F day1s 30 1
2 F day2s 18 2
2 F day0t 5 0
2 F day1t 6 1
2 F day2t 8 2
3 M day0s 15 0
3 M day1s 18 1
3 M day2s 17 2
3 M day0t 7 0
3 M day1t 5 1
3 M day2t 7 2
Ideally in the end it should look like this.
id sex day s t
1 M 0 21 2
1 M 1 20 3
1 M 2 18 3
2 F 0 25 5
2 F 1 30 6
2 F 2 18 8
3 M 0 15 7
3 M 1 18 5
3 M 2 17 7
Do you have any suggestions on how to achieve this?
You can use {tidyr}'s pivot_longer here.
If your actual variables are named a bit differently, you can adapt the regex to your case; an online regex tester is a convenient place to try out changes. (Note that in R the backslash has to be escaped, hence the double backslash in \\d+ and \\w+.)
In general, the names_pattern argument works by matching each parenthesized group of the regex to the corresponding entry of the names_to argument, so that here:
(\\d+) -> becomes the variable day. The regex \d+ matches one or more digits.
(\\w+) -> becomes ".value". The regex \w+ matches one or more word characters. Thanks to r2evans for pointing out the ".value" argument, which spares one further reshape. The documentation states that .value "tells pivot_longer() that that part of the column name specifies the “value” being measured (which will become a variable in the output)." In practice, the result is that the text matched by that group (here mt and ht) is used as the column names in the output.
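To see what each capture group picks up before handing the pattern to pivot_longer, the same regex can be applied to the column names directly in base R. This is only an illustration of the capture groups; cols, day and measure are hypothetical names.

```r
# Column names as in the example data below
cols <- c("day0mt", "day1mt", "day2ht")
pattern <- "day(\\d+)(\\w+)"

# \\1 is the first group (the day number), \\2 the second (the measure suffix)
day <- sub(pattern, "\\1", cols)
measure <- sub(pattern, "\\2", cols)

data.frame(column = cols, day = day, measure = measure)
```

The day values become the entries of the day column, while the distinct measure values become the names of the output columns because of ".value".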
library(dplyr)
library(tidyr)
df1 <- data.frame(id = c(1,2,3),
sex = c("M","F","M"),
day0mt = c(21,25,15),
day1mt = c(20,30,18),
day2mt = c(18,18,17),
day0ht = c(2,5,7),
day1ht = c(3,6,5),
day2ht = c(3,8,7))
df1
#> id sex day0mt day1mt day2mt day0ht day1ht day2ht
#> 1 1 M 21 20 18 2 3 3
#> 2 2 F 25 30 18 5 6 8
#> 3 3 M 15 18 17 7 5 7
df1 %>%
pivot_longer(cols = starts_with("day"),
names_pattern = "day(\\d+)(\\w+)",
names_to = c("day", ".value"))
#> # A tibble: 9 x 5
#> id sex day mt ht
#> <dbl> <chr> <chr> <dbl> <dbl>
#> 1 1 M 0 21 2
#> 2 1 M 1 20 3
#> 3 1 M 2 18 3
#> 4 2 F 0 25 5
#> 5 2 F 1 30 6
#> 6 2 F 2 18 8
#> 7 3 M 0 15 7
#> 8 3 M 1 18 5
#> 9 3 M 2 17 7
Created on 2021-06-20 by the reprex package (v2.0.0)
Note that in newer versions of tidyr, gather and spread are superseded by pivot_longer and pivot_wider.
Using the latest development version of data.table (1.14.1), which adds some cool new melt features.
Use data.table::update.dev.pkg() to install the dev version.
library(data.table)
# data.table 1.14.1 IN DEVELOPMENT built 2021-06-22 09:38:23 UTC
dcast(
melt(setDT(df1), measure.vars = measure(day, type, pattern="^day(.)(.)")),
... ~ type, value.var = "value")
id sex day s t
1: 1 M 0 21 2
2: 1 M 1 20 3
3: 1 M 2 18 3
4: 2 F 0 25 5
5: 2 F 1 30 6
6: 2 F 2 18 8
7: 3 M 0 15 7
8: 3 M 1 18 5
9: 3 M 2 17 7
Here is a way. It first reshapes to long format, separates the day* column into day number and suffix columns and reshapes back to wide format.
library(dplyr)
library(tidyr)
library(stringr)
df1 %>%
pivot_longer(cols = starts_with("day")) %>%
mutate(day = str_extract(name, "\\d+"),
suffix = str_extract(name, "[^[:digit:]]+$")) %>%
select(-name) %>%
pivot_wider(
id_cols = -c(value, suffix),
names_from = suffix,
values_from = value
)
## A tibble: 9 x 5
# id sex day s t
# <dbl> <chr> <chr> <dbl> <dbl>
#1 1 M 0 21 2
#2 1 M 1 20 3
#3 1 M 2 18 3
#4 2 F 0 25 5
#5 2 F 1 30 6
#6 2 F 2 18 8
#7 3 M 0 15 7
#8 3 M 1 18 5
#9 3 M 2 17 7
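For completeness, the same reshape can also be sketched with base R's reshape(), which needs no extra packages. This is an alternative to the answers above, not something they propose; the column groups are listed explicitly in varying, and long is a hypothetical name.

```r
df1 <- data.frame(id = c(1, 2, 3),
                  sex = c("M", "F", "M"),
                  day0s = c(21, 25, 15), day1s = c(20, 30, 18), day2s = c(18, 18, 17),
                  day0t = c(2, 5, 7), day1t = c(3, 6, 5), day2t = c(3, 8, 7),
                  stringsAsFactors = FALSE)

# One vector of wide column names per output variable, in day order
long <- reshape(df1,
                direction = "long",
                idvar = "id",
                varying = list(c("day0s", "day1s", "day2s"),
                               c("day0t", "day1t", "day2t")),
                v.names = c("s", "t"),
                timevar = "day",
                times = 0:2)

# reshape() stacks by day; reorder to match the desired layout
long <- long[order(long$id, long$day), ]
rownames(long) <- NULL
long
```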
I am reading an Excel file that has multiple sheets.
file_to_read <- "./file_name.xlsx"
# Get all names of sheets in the file
sheet_names <- readxl::excel_sheets(file_to_read)
# Loop through sheets
L <- lapply(sheet_names, function(x) {
all_cells <-
tidyxl::xlsx_cells(file_to_read, sheets = x)
})
L here has all the sheets. Now I need to combine the data from every sheet, across all columns and rows, into one file. To be exact, I want to sum the values in matching columns and rows across the sheets.
I will give a simple example to make it clear.
For example, this table is in one sheet:
df1 <- data.frame(x = 1:5, y = 2:6, z = 3:7)
rownames(df1) <- LETTERS[1:5]
df1
M x y z
A 1 2 3
B 2 3 4
C 3 4 5
D 4 5 6
E 5 6 7
The second table is in the next sheet:
df2 <- data.frame(x = 1:5, y = 2:6, z = 3:7, w = 8:12)
rownames(df2) <- LETTERS[3:7]
df2
M x y z w
C 1 2 3 8
D 2 3 4 9
E 3 4 5 10
F 4 5 6 11
G 5 6 7 12
My goal is to combine (sum) the matched records in all 100 tables from one Excel file to get one big table that has the total sum of each value.
The final table should be like this:
M x y z w
A 1 2 3 0
B 2 3 4 0
C 4 6 8 8
D 6 8 10 9
E 8 10 12 10
F 4 5 6 11
G 5 6 7 12
Is there a way to achieve this in R? I am not an expert in R, but I would like to know how to read all the sheets, compute the sums, and then save the output to a file.
Thank you
Since you have stated that you have hundreds of sheets, I suggest importing all of them into one single list, say my.list, in R (as suggested in the readxl documentation) and following this strategy instead of binding the data frames two at a time:
df1 <- read.table(text = 'M x y z
A 1 2 3
B 2 3 4
C 3 4 5
D 4 5 6
E 5 6 7', header = T)
df2 <- read.table(text = 'M x y z w
C 1 2 3 8
D 2 3 4 9
E 3 4 5 10
F 4 5 6 11
G 5 6 7 12', header = T)
library(tibble)
library(tidyverse)
my.list <- list(df1, df2)
map_dfr(my.list, ~.x)
#> M x y z w
#> 1 A 1 2 3 NA
#> 2 B 2 3 4 NA
#> 3 C 3 4 5 NA
#> 4 D 4 5 6 NA
#> 5 E 5 6 7 NA
#> 6 C 1 2 3 8
#> 7 D 2 3 4 9
#> 8 E 3 4 5 10
#> 9 F 4 5 6 11
#> 10 G 5 6 7 12
map_dfr(my.list , ~ .x) %>%
group_by(M) %>%
summarise(across(everything(), ~ sum(.x, na.rm = TRUE)))
#> # A tibble: 7 x 5
#> M x y z w
#> <chr> <int> <int> <int> <int>
#> 1 A 1 2 3 0
#> 2 B 2 3 4 0
#> 3 C 4 6 8 8
#> 4 D 6 8 10 9
#> 5 E 8 10 12 10
#> 6 F 4 5 6 11
#> 7 G 5 6 7 12
Created on 2021-05-26 by the reprex package (v2.0.0)
One approach that will work is these steps:
read each sheet into a list
convert each sheet into a long format
bind into a single data frame
group by and sum over that long data frame
cast back to tabular format
That should work for N sheets with any combination of row and column headers in those sheets. E.g.
file <- "D:\\Book1.xlsx"
sheet_names <- readxl::excel_sheets(file)
sheet_data <- lapply(sheet_names, function(sheet_name) {
readxl::read_xlsx(path = file, sheet = sheet_name)
})
# use pivot_longer on each sheet to make long data
long_sheet_data <- lapply(sheet_data, function(data) {
long <- tidyr::pivot_longer(
data = data,
cols = !M,
names_to = "col",
values_to = "val"
)
})
# combine into a single tibble
long_data = dplyr::bind_rows(long_sheet_data)
# sum up matching pairs of `M` and `col`
# (attach dplyr so that `%>%` and group_by() are available)
library(dplyr)
summarised <- long_data %>%
group_by(M, col) %>%
dplyr::summarise(agg = sum(val))
# convert to a tabular format
tabular <- summarised %>%
tidyr::pivot_wider(
names_from = col,
values_from = agg,
values_fill = 0
)
tabular
I get this output with a spreadsheet using your initial inputs:
> tabular
# A tibble: 7 x 5
# Groups: M [7]
M x y z w
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 1 2 3 0
2 B 2 3 4 0
3 C 4 6 8 8
4 D 6 8 10 9
5 E 8 10 12 10
6 F 4 5 6 11
7 G 5 6 7 12
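If the tables are purely numeric and carry their row labels as row names (as in the question's df1 and df2), the same sum can also be sketched in base R by accumulating into one zero-filled matrix indexed by the union of row and column names. This is an alternative sketch, not part of the answers above; tables, all_rows, all_cols and total are hypothetical names.

```r
# Example tables from the question, with row labels stored as row names
df1 <- data.frame(x = 1:5, y = 2:6, z = 3:7)
rownames(df1) <- LETTERS[1:5]
df2 <- data.frame(x = 1:5, y = 2:6, z = 3:7, w = 8:12)
rownames(df2) <- LETTERS[3:7]

tables <- list(df1, df2)  # in practice: one entry per sheet

# Union of all row and column names seen in any table
all_rows <- sort(unique(unlist(lapply(tables, rownames))))
all_cols <- unique(unlist(lapply(tables, colnames)))

# Start from zeros and add each table into its matching cells
total <- matrix(0, nrow = length(all_rows), ncol = length(all_cols),
                dimnames = list(all_rows, all_cols))
for (tb in tables) {
  r <- rownames(tb)
  cl <- colnames(tb)
  total[r, cl] <- total[r, cl] + as.matrix(tb)
}
total
```

The result can then be saved with write.csv(total, ...) using whatever output path suits the project.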
You could use dplyr and tidyr to get your desired result:
Given
df <- data.frame(subject=c(rep("Mother", 2), rep("Child", 2)), modifier=c("chart2", "child", "tech", "unkn"), mother_chart2=1:4, mother_child=5:8, child_tech=9:12, child_unkn=13:16)
> df
subject modifier mother_chart2 mother_child child_tech child_unkn
1 Mother chart2 1 5 9 13
2 Mother child 2 6 10 14
3 Child tech 3 7 11 15
4 Child unkn 4 8 12 16
and
df2 <- data.frame(subject=c(rep("Mother", 2), rep("Child", 2)), modifier=c("chart", "child", "tech", "unkn"), mother_chart=101:104, mother_child=105:108, child_tech=109:112, child_unkn=113:116)
> df2
subject modifier mother_chart mother_child child_tech child_unkn
1 Mother chart 101 105 109 113
2 Mother child 102 106 110 114
3 Child tech 103 107 111 115
4 Child unkn 104 108 112 116
Then
library(dplyr)
library(tidyr)
df2_tmp <- df2 %>%
pivot_longer(cols = -c("subject", "modifier"))
df %>%
pivot_longer(cols = -c("subject", "modifier")) %>%
full_join(df2_tmp, by=c("subject", "modifier", "name")) %>%
mutate(across(starts_with("value"), ~ replace_na(., 0)),
sum = value.x + value.y) %>%
select(-value.x, -value.y) %>%
pivot_wider(names_from=name, values_from=sum, values_fill=0)
returns
# A tibble: 5 x 7
subject modifier mother_chart2 mother_child child_tech child_unkn mother_chart
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mother chart2 1 5 9 13 0
2 Mother child 2 112 120 128 102
3 Child tech 3 114 122 130 103
4 Child unkn 4 116 124 132 104
5 Mother chart 0 105 109 113 101
I have data that looks like this:
The sample data can be created with the following code:
ID<-c(1,1,1,1,2,2,2,3,3,3,4,4,4,4)
Days<-c(-5,1,18,30,1,8,16,1,8,6,-6,1,7,15)
Event_P<-c("","","P","","","","P","","","P","","","P","P")
Event_N<-c("","","","","N","","N","","","N","N","","N","N")
Event_C<-c("C","","C","","","","C","","","C","","","","")
Sample.data <- data.frame(ID, Days, Event_P, Event_N,Event_C)
I want to build a variable "Event" that captures all events. The final result should look like this:
What should I do? I would like to know as many ways as possible. Thanks.
One option could be using apply() like this. The suggestion from @AllanCameron is also a great choice. Here is the code as an option for you:
#Vectors
ID<-c(1,1,1,1,2,2,2,3,3,3,4,4,4,4)
Days<-c(-5,1,18,30,1,8,16,1,8,6,-6,1,7,15)
Event_P<-c("","","P","","","","P","","","P","","","P","P")
Event_N<-c("","","","","N","","N","","","N","N","","N","N")
Event_C<-c("C","","C","","","","C","","","C","","","","")
#Data
Sample.data <- data.frame(ID, Days, Event_P, Event_N,Event_C,stringsAsFactors = F)
#Option 1
index <- which(grepl('Event',names(Sample.data)))
Sample.data$Event <- apply(Sample.data[,index],1,function(x) paste0(x[x!=''],collapse='/'))
Output:
ID Days Event_P Event_N Event_C Event
1 1 -5 C C
2 1 1
3 1 18 P C P/C
4 1 30
5 2 1 N N
6 2 8
7 2 16 P N C P/N/C
8 3 1
9 3 8
10 3 6 P N C P/N/C
11 4 -6 N N
12 4 1
13 4 7 P N P/N
14 4 15 P N P/N
Duck's answer is very good, but you mentioned you want as many ways as possible so here are two more ways:
You could also use tidyverse's mutate with paste, or base R's interaction, to combine the columns, then use gsub to clear out the leftover separators:
ID<-c(1,1,1,1,2,2,2,3,3,3,4,4,4,4)
Days<-c(-5,1,18,30,1,8,16,1,8,6,-6,1,7,15)
Event_P<-c("","","P","","","","P","","","P","","","P","P")
Event_N<-c("","","","","N","","N","","","N","N","","N","N")
Event_C<-c("C","","C","","","","C","","","C","","","","")
Sample.data <- data.frame(ID, Days, Event_P, Event_N,Event_C)
library(tidyverse)
Sample.data %>%
mutate(Event = paste(Event_P, Event_N, Event_C, sep='/'),
Event = gsub('^/|^//|/$|//$', '', Event),
Event = gsub('//', '/', Event))
#> ID Days Event_P Event_N Event_C Event
#> 1 1 -5 C C
#> 2 1 1
#> 3 1 18 P C P/C
#> 4 1 30
#> 5 2 1 N N
#> 6 2 8
#> 7 2 16 P N C P/N/C
#> 8 3 1
#> 9 3 8
#> 10 3 6 P N C P/N/C
#> 11 4 -6 N N
#> 12 4 1
#> 13 4 7 P N P/N
#> 14 4 15 P N P/N
Sample.data$Event <-
interaction(Sample.data$Event_P, Sample.data$Event_N, Sample.data$Event_C, sep = '/') %>%
gsub('^/|^//|/$|//$', '', .) %>%
gsub('//', '/', .)
Sample.data
#> ID Days Event_P Event_N Event_C Event
#> 1 1 -5 C C
#> 2 1 1
#> 3 1 18 P C P/C
#> 4 1 30
#> 5 2 1 N N
#> 6 2 8
#> 7 2 16 P N C P/N/C
#> 8 3 1
#> 9 3 8
#> 10 3 6 P N C P/N/C
#> 11 4 -6 N N
#> 12 4 1
#> 13 4 7 P N P/N
#> 14 4 15 P N P/N
Created on 2020-09-18 by the reprex package (v0.3.0)
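What the pattern in gsub('^/|^//|/$|//$', '', ...) does is:
^/|^//: Take out any / or // that starts the string
/$|//$: Take out any / or // that ends the string
The second gsub('//', '/', ...) then collapses a doubled // left in the middle of the string (from an empty middle column) into a single /.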
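The two clean-up steps can be tried on a few raw strings of the kind paste(..., sep = '/') produces. The sketch below uses a slightly more general pattern (/+ for "one or more slashes") instead of listing / and // separately; that variant is an assumption of this example rather than the pattern used in the answer, and raw, trimmed and cleaned are hypothetical names.

```r
# Raw strings as produced by paste(Event_P, Event_N, Event_C, sep = "/")
raw <- c("//C", "P//C", "/N/", "P/N/", "P/N/C", "//")

# Step 1: drop any run of slashes at the start or at the end
trimmed <- gsub("^/+|/+$", "", raw)

# Step 2: collapse any run of slashes left in the middle to a single slash
cleaned <- gsub("/+", "/", trimmed)

data.frame(raw = raw, cleaned = cleaned)
```

Using /+ also covers cases with more than three event columns, where three or more slashes in a row could occur.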