How to use a statistical function and subset data simultaneously in R?

I have data that looks like this (dat)
region muscle protein
head cerebrum 78
head cerebrum 56
head petiole 1
head petiole 2
tail pectoral 3
tail pectoral 4
I want to take the mean of the protein values for cerebrum. I looked up different ways to subset data here and here, but there does not seem to be a straightforward way of doing it. Right now, I'm doing this:
datcerebrum <- dat[which(dat$muscle == "cerebrum"),]
mean(datcerebrum$protein)
I tried to condense this into one line:
mean(dat[which(dat$muscle == "cerebrum"),])
But it returns NA with a warning that the argument is not numeric or logical. Is there an easy way to achieve this?

We can use aggregate from base R
aggregate(protein ~muscle, dat, mean)
# muscle protein
#1 cerebrum 67.0
#2 pectoral 3.5
#3 petiole 1.5
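The one-line attempt in the question fails because mean() is handed a whole data frame rather than a numeric vector. Selecting the protein column inside the subset avoids that. A minimal base R sketch, assuming the data frame is rebuilt by hand as below (the names m_cerebrum and by_muscle are only illustrative):

```r
# Rebuild the example data from the question
dat <- data.frame(
  region  = c("head", "head", "head", "head", "tail", "tail"),
  muscle  = c("cerebrum", "cerebrum", "petiole", "petiole", "pectoral", "pectoral"),
  protein = c(78, 56, 1, 2, 3, 4),
  stringsAsFactors = FALSE
)

# Subset the numeric column, not the data frame, then take the mean
m_cerebrum <- mean(dat$protein[dat$muscle == "cerebrum"])
m_cerebrum
# [1] 67

# The same idea for every muscle at once
by_muscle <- tapply(dat$protein, dat$muscle, mean)
by_muscle
# cerebrum pectoral  petiole
#     67.0      3.5      1.5
```

A logical index is enough here; which() is only needed if the condition can produce NA values that should be dropped.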

I'd do this with the tidyverse package dplyr:
library(readr)
library(dplyr)
fwf <- "head cerebrum 78
head cerebrum 56
head petiole 1
head petiole 2
tail pectoral 3
tail pectoral 4"
dat <- read_fwf(fwf, fwf_empty(fwf, col_names = c("region", "muscle", "protein")))
# The above code is just to create your data frame - please provide reproducible data!
dat %>% filter(muscle == "cerebrum") %>% summarise(m = mean(protein))
#> # A tibble: 1 x 1
#> m
#> <dbl>
#> 1 67
You could even do it for every muscle at once:
dat %>% group_by(muscle) %>% summarise(m = mean(protein))
#> # A tibble: 3 x 2
#> muscle m
#> <chr> <dbl>
#> 1 cerebrum 67.0
#> 2 pectoral 3.5
#> 3 petiole 1.5

Solution using data.table:
# Load required library
library(data.table)
# Transform your data into a data.table object
setDT(dat)
# Subset cerebrum and mean protein values
dat[muscle == "cerebrum", mean(protein)]

Related

Sort strings based on string patterns in R

I have a data.frame that looks like df.
I want to sort the genes column so that the genes matching the AT1G... pattern come first.
library(tidyverse)
df <- tibble(genes=c("18S","ACLA","AT1G25240","AT1G25241","AT1G25242"), functions=c("ribosome","dunno","flowering","O2","photosynthesis"))
df
#> # A tibble: 5 × 2
#> genes functions
#> <chr> <chr>
#> 1 18S ribosome
#> 2 ACLA dunno
#> 3 AT1G25240 flowering
#> 4 AT1G25241 O2
#> 5 AT1G25242 photosynthesis
Created on 2022-09-28 with reprex v2.0.2
I want my data to look like this:
genes functions
AT1G25240 flowering
AT1G25241 O2
AT1G25242 photosynthesis
ACLA dunno
18S ribosome
Any idea or help is highly appreciated!
The rationale is that, in a huge data set, I want to see the core genes that start with AT.. first.
If you sort (arrange) by the presence of the pattern using grepl, then FALSE (pattern not found) sorts first. If we negate the result, the matching rows come first, which is what you want:
df %>%
arrange(!grepl("^AT1G", genes))
# # A tibble: 5 x 2
# genes functions
# <chr> <chr>
# 1 AT1G25240 flowering
# 2 AT1G25241 O2
# 3 AT1G25242 photosynthesis
# 4 18S ribosome
# 5 ACLA dunno
You can add other arguments to arrange for secondary sorts, e.g., arrange(!grepl(..), genes, functions).
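The same idea works without dplyr. The sketch below uses base R's order() with the negated grepl() result as the primary key and the gene name as a secondary key. The names is_core and sorted are illustrative, and method = "radix" is chosen here only so that the character ordering does not depend on the locale:

```r
# Example data from the question, as a plain data frame
df <- data.frame(
  genes     = c("18S", "ACLA", "AT1G25240", "AT1G25241", "AT1G25242"),
  functions = c("ribosome", "dunno", "flowering", "O2", "photosynthesis"),
  stringsAsFactors = FALSE
)

# TRUE for genes that start with the core pattern
is_core <- grepl("^AT1G", df$genes)

# FALSE sorts before TRUE, so negate to put matches first;
# ties are broken by the gene name itself
sorted <- df[order(!is_core, df$genes, method = "radix"), ]
sorted$genes
# [1] "AT1G25240" "AT1G25241" "AT1G25242" "18S"       "ACLA"
```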

Divide multiple variable values by a specific value in R

I'm trying to pull off something simple but can't seem to get my head around it. My data looks like this:
|Assay|Sample|Number|
|A|1|10|
|B|1|25|
|C|1|30|
|A|2|45|
|B|2|65|
|C|2|8|
|A|3|10|
|B|3|81|
|C|3|12|
What I need to do is to divide each "Number" value for each sample by the value of the respective assay A. That is, for sample 1, I would like to have 10/10, 25/10 and 30/10. Then for sample 2, I would need 45/45, 65/45 and 8/45 and so on with the rest of the samples.
I have already tried doing:
mutate(Normalised = Number/Number[Assay == "A"])
as suggested in another post but the results are not correct.
Any help would be great. Thank you very much!
Using dplyr:
library(dplyr)
df <- data.frame(Assay=rep(c('A','B','C'),3),
Sample=rep(1:3,each=3),
Number=c(10,25,30,45,65,8,10,81,12))
df <- df %>%
group_by(Sample) %>%
arrange(Assay) %>%
mutate(Normalised=Number/first(Number)) %>%
ungroup() %>%
arrange(Sample)
This gives:
> df
# A tibble: 9 × 4
Assay Sample Number Normalised
<chr> <int> <dbl> <dbl>
1 A 1 10 1
2 B 1 25 2.5
3 C 1 30 3
4 A 2 45 1
5 B 2 65 1.44
6 C 2 8 0.178
7 A 3 10 1
8 B 3 81 8.1
9 C 3 12 1.2
Note: I added arrange(Assay) just to make sure "A" is always the first row within each group. Also, arrange(Sample) is there just to get the output in the same order as it was but it doesn't really need to be there if you don't care about the display order.
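The attempt in the question goes wrong because, without grouping, Number[Assay == "A"] is a length-3 vector (10, 45, 10) that gets recycled down the rows instead of being matched to each sample. A sketch of an alternative that does not depend on row order, in base R, looks up each sample's assay-A value by name (ref is an illustrative name, and the sketch assumes exactly one assay-A row per sample):

```r
df <- data.frame(
  Assay  = rep(c("A", "B", "C"), 3),
  Sample = rep(1:3, each = 3),
  Number = c(10, 25, 30, 45, 65, 8, 10, 81, 12)
)

# One reference value per sample: the Number measured for assay "A",
# stored in a vector named by sample
ref <- with(df[df$Assay == "A", ], setNames(Number, Sample))

# Look up the reference for every row by its sample and divide
df$Normalised <- df$Number / unname(ref[as.character(df$Sample)])
df$Normalised[1:3]
# [1] 1.0 2.5 3.0
```

In dplyr the equivalent would be to call group_by(Sample) before the mutate() from the question, so that the comparison Assay == "A" is evaluated within each sample.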

Data wrangling into long format in R

I have a source dataset from a Nature article. I was wondering how I could extract the values from rows 4 and 12 into a long data format with the relevant assigned group (i.e. Inefficient/Efficient).
This is the code I have used to get the data into R.
# load the required libraries
library(ggsignif)
library(readxl)
library(svglite)
library(tidyverse)
library(tidyr)
library(dplyr)
# The paper from which the figure is taken is Tasdogan et al. (2020)
# Metabolic heterogeneity confers differences in melanoma metastatic potential
# The figure is 2b and can be accessed at
# https://www.nature.com/articles/s41586-019-1847-2#MOESM3
# The link to the raw data used in the article is given below and directly imported for plotting
url <-'https://static-content.springer.com/esm/art%3A10.1038%2Fs41586-019-1847-2/MediaObjects/41586_2019_1847_MOESM3_ESM.xlsx'
#create a dataframe from the Excel data
temp <- tempfile()
download.file(url, temp, mode='wb')
myData <- read_excel(path = temp)
I can't figure out how to insert an image of the dataset, but it should show up with the previous code. I need columns 2-31 for efficient and 2 to 37 for inefficient.
I hope that's enough information for people to understand what I'm talking about.
This data is really not structured well for a general read like that, but I'll try to make do:
### myData <- read_excel(...)
Data_wide<- myData[c(2:4,10:12), c(2:37)]
tmp <- as.data.frame(t(Data_wide))
head(tmp)
# V1 V2 V3 V4 V5 V6
# ...2 Efficient #1 0.47699999999999998 Inefficient #1 0.48499999999999999
# ...3 Efficient #2 0.376 Inefficient #2 0.47399999999999998
# ...4 Efficient #3 0.496 Inefficient #3 0.48799999999999999
# ...5 Efficient #4 0.32500000000000001 Inefficient #4 0.45600000000000002
# ...6 Efficient #5 8.8999999999999996E-2 Inefficient #5 0.53100000000000003
# ...7 Efficient #6 4.5999999999999999E-2 Inefficient #6 0.318
tmp <- rbind(tmp[,1:3], setNames(tmp[,4:6], names(tmp)[1:3]))
head(tmp)
# V1 V2 V3
# ...2 Efficient #1 0.47699999999999998
# ...3 Efficient #2 0.376
# ...4 Efficient #3 0.496
# ...5 Efficient #4 0.32500000000000001
# ...6 Efficient #5 8.8999999999999996E-2
# ...7 Efficient #6 4.5999999999999999E-2
tmp <- tmp[complete.cases(tmp),]
tmp$V3 <- as.numeric(tmp$V3)
rownames(tmp) <- NULL
head(tmp,3); tail(tmp,3)
# V1 V2 V3
# 1 Efficient #1 0.477
# 2 Efficient #2 0.376
# 3 Efficient #3 0.496
# V1 V2 V3
# 64 Inefficient #34 0.2451
# 65 Inefficient #35 0.2450
# 66 Inefficient #36 0.2529
With this structure, you can subset (remove V2, though I wonder why you feel it is not important) and rename (colnames(tmp) <- c(...)).
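For readers who cannot download the workbook, the transpose-and-stack steps can be tried on a small stand-in. The sketch below is only an illustration: the wide data frame is a hand-made toy that mimics the sheet's layout (group, mouse ID, and value in rows, with one block per group), using a few values taken from the output above. The column names Group, Mouse_ID and Glucose_m6 are assumptions, not names from the source file.

```r
# Toy stand-in for the sheet: rows 1-3 hold the Efficient block,
# rows 4-6 the Inefficient block; the Efficient block is one column shorter
wide <- data.frame(
  X1 = c("Efficient", "#1", "0.477", "Inefficient", "#1", "0.485"),
  X2 = c("Efficient", "#2", "0.376", "Inefficient", "#2", "0.474"),
  X3 = c(NA, NA, NA, "Inefficient", "#3", "0.488"),
  stringsAsFactors = FALSE
)

# Transpose so that each mouse becomes a row (columns V1..V6)
tmp <- as.data.frame(t(wide), stringsAsFactors = FALSE)

# Stack the second block (V4..V6) underneath the first (V1..V3)
long <- rbind(tmp[, 1:3], setNames(tmp[, 4:6], names(tmp)[1:3]))

# Drop the padding rows, rename, and convert the values to numeric
long <- long[complete.cases(long), ]
names(long) <- c("Group", "Mouse_ID", "Glucose_m6")
long$Glucose_m6 <- as.numeric(long$Glucose_m6)
rownames(long) <- NULL
long
```

The result has one row per mouse with its group label, which is the long format the question asks for.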
Although it might not be pretty, I believe this would be your solution using only readxl and tidyverse packages:
# Select first set of rows with group and value
set1 <-
myData %>%
filter(row_number() %in% c(2, 4))
# Select second set of rows with group and value
set2 <-
myData %>%
filter(row_number() %in% c(10, 12))
# Join both sets of data, so that all group labels are in one row and all values are in one row.
left_join(set1, set2, by = "Fractional enrichment of glucose m+6 in primary subcutaneous tumors after [U-13C]glucose infusion") %>%
# pivot the table to a long format, with the old column names in `name` and the cell contents in `value`
pivot_longer(cols = !`Fractional enrichment of glucose m+6 in primary subcutaneous tumors after [U-13C]glucose infusion`) %>%
# pivot wider so that the group labels and the values end up in separate columns
pivot_wider(names_from = `Fractional enrichment of glucose m+6 in primary subcutaneous tumors after [U-13C]glucose infusion`, values_from = value) %>%
# Remove old column names/numbers
select(-name)
# A tibble: 72 x 2
Group `Glucose m+6`
<chr> <chr>
1 Inefficient 0.48499999999999999
2 Inefficient 0.47399999999999998
3 Inefficient 0.48799999999999999
4 Inefficient 0.45600000000000002
5 Inefficient 0.53100000000000003
6 Inefficient 0.318
7 Inefficient 0.26600000000000001
8 Inefficient 0.30399999999999999
9 Inefficient 0.309
10 Inefficient 0.33
# ... with 62 more rows
A clean way to address your problem is to use the packages tidyxl and unpivotr.
They may seem rather complicated at first, but they are probably the cleanest way to handle Excel files. I left some comments to help you go through the code.
I suggest you have a look at the unpivotr vignettes.
# libraries
library(tidyverse)
library(tidyxl)
library(unpivotr)
# download data
url <-'https://static-content.springer.com/esm/art%3A10.1038%2Fs41586-019-1847-2/MediaObjects/41586_2019_1847_MOESM3_ESM.xlsx'
temp <- tempfile()
download.file(url, temp, mode='wb')
# read excel file
myData <- xlsx_cells(path = temp)
# select the sheet
figure1a <- myData %>% filter(sheet == "Figure 1 A")
# you can visualize data in an excel-like format with
# View(rectify(figure1a))
# since the sheet is composed of two tables
# get the top-left corner of each table (where in the first column you find Group)
corners <- figure1a %>% filter(character == "Group")
# partition the spreadsheet based on the corners you just got
# select the rows you will need
partitions <- figure1a %>% filter(row %in% c(3:5, 11:13)) %>% partition(corners)
# get the two partitions and edit them
# with purrr::map it will be easy
df <- partitions$cells %>%
# the first column for each partition shows the headers
map(behead, "left", "header") %>%
# the first row for each partition shows the Group: Efficient/Inefficient
map(behead, "up", "Group") %>%
# the second row for each partition shows the mouse id
# and bind the edited partitions together
map_dfr(behead, "up", "Mouse_ID") %>%
# select the columns we need
select(Group, Mouse_ID, Glucose_m6 = numeric)
# the final result
df
#> # A tibble: 66 x 3
#> Group Mouse_ID Glucose_m6
#> <chr> <chr> <dbl>
#> 1 Efficient #1 0.477
#> 2 Efficient #2 0.376
#> 3 Efficient #3 0.496
#> 4 Efficient #4 0.325
#> 5 Efficient #5 0.089
#> 6 Efficient #6 0.046
#> 7 Efficient #7 0.213
#> 8 Efficient #8 0.082
#> 9 Efficient #9 0.359
#> 10 Efficient #10 0.306
#> # ... with 56 more rows
Created on 2021-11-04 by the reprex package (v2.0.0)

Running Shannon and Simpson: vegan package

I am interested in biodiversity index calculations using the vegan package. The Simpson index works, but I get no results from the Shannon argument. I was hoping somebody knows the solution.
What I have tried is converting my data.frame into the vegan package test data format using the code below:
Plot <- c(1,1,2,2,3,3,3)
species <- c( "Aa","Aa", "Aa","Bb","Bb","Rr","Xx")
count <- c(3,2,1,4,2,5,7)
veganData <- data.frame(Plot,species,count)
matrify(veganData )
diversity(veganData,"simpson")
diversity(veganData,"shannon", base = exp(1))
1. I get the following results, so I think it produces all the Simpson indices:
> diversity(veganData,"simpson")
simpson.D simpson.I simpson.R
1 1.00 0.00 1.0
2 0.60 0.40 1.7
3 0.35 0.65 2.8
2. But when I run it for the Shannon index, I get the following message:
> diversity(veganData,"shannon")
data frame with 0 columns and 3 rows
I am not sure why it's not working. Do we need to make any changes to the data formatting when switching methods?
Your data need to be in the wide format. Also, the counts must be either totals or averages (not repeated counts for the same plot).
library(dplyr); library(tidyr); library(vegan)
df <- veganData %>%
group_by(Plot, species) %>%
summarise(count = sum(count)) %>%
ungroup %>%
spread(species, count, fill=0)
df
# # A tibble: 3 x 5
# Plot Aa Bb Rr Xx
# <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 5 0 0 0
# 2 2 1 4 0 0
# 3 3 0 2 5 7
diversity(df[,-1], "shannon")
# [1] 0.0000000 0.5004024 0.9922820
To check that the calculation is correct, note that the Shannon index is computed as -1 x the sum of Pi*ln(Pi), where Pi is the proportion of species i in the plot:
# For plot 3:
-1*(
(2/(2+5+7))*log((2/(2+5+7))) + #Pi*lnPi of Bb
(5/(2+5+7))*log((5/(2+5+7))) + #Pi*lnPi of Rr
(7/(2+5+7))*log((7/(2+5+7))) #Pi*lnPi of Xx
)
# [1] 0.992282
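The hand check for plot 3 can be extended to every plot with a small helper. This is a sketch in base R, not vegan's implementation: shannon is a hypothetical helper name, and the counts matrix is typed in from the wide table above.

```r
# Wide-format counts: one row per plot, one column per species
counts <- matrix(
  c(5, 0, 0, 0,
    1, 4, 0, 0,
    0, 2, 5, 7),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("1", "2", "3"), c("Aa", "Bb", "Rr", "Xx"))
)

# Shannon index H = -sum(p_i * ln(p_i)); species with a zero count are
# dropped because they contribute nothing to the sum
shannon <- function(x) {
  p <- x[x > 0] / sum(x)
  -sum(p * log(p))
}

H <- apply(counts, 1, shannon)
round(H, 7)
#         1         2         3
# 0.0000000 0.5004024 0.9922820
```

These values agree with the diversity(df[,-1], "shannon") output shown above.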

Function in R (Merge Bases)

I have the following data in R.
table1<-data.frame(group=c(1,1,1,2,2,2),price=c(10,20,30,10,20,30),
visits=c(100,200,300,150,250,350))
table1<-table1 %>% arrange(price) %>% split(.$group)
$`1`
group price visits
1 1 10 100
3 1 20 200
5 1 30 300
$`2`
group price visits
2 2 10 150
4 2 20 250
6 2 30 350
group_1<-data.frame(case_1=c(0.2,0.3,0.4),case_2=c(0.22,0.33,0.44))
group_2<-data.frame(case_1=c(0.3,0.4,0.5),case_2=c(0.33,0.44,0.55))
So, the question is: how can I do the following operations without repeating them four times? I suppose that an apply function, or similar, would suit better.
sum(table1$`1`[,c("group")] * group_1[,c("case_1")])
sum(table1$`1`[,c("group")] * group_1[,c("case_2")])
sum(table1$`2`[,c("group")] * group_2[,c("case_1")])
sum(table1$`2`[,c("group")] * group_2[,c("case_2")])
After going through the data you provided step by step and working out what you are trying to do, here is a suggestion using mapply.
group_list <- list(group_1, group_2)
mapply(function(x, y) colSums(x * y),split(table1$group, table1$group),group_list)
# 1 2
#case_1 0.90 2.40
#case_2 0.99 2.64
We put the group data frames in one list, group_list, split table1$group by group, multiply the corresponding pairs with mapply, and take the column-wise sums. If I have understood you correctly, this is what you need; let me know if it is otherwise.
Based on the initial dataset, we can do this using group_by operations
library(tidyverse)
bind_rows(group_1, group_2) %>%
bind_cols(table1['group'], .) %>%
mutate(case_1 = group*case_1, case_2 = group*case_2) %>%
group_by(group) %>%
summarise(across(c(case_1, case_2), sum))
# A tibble: 2 × 3
# group case_1 case_2
# <dbl> <dbl> <dbl>
#1 1 0.9 0.99
#2 2 2.4 2.64
data
table1<-data.frame(group=c(1,1,1,2,2,2),price=c(10,20,30,10,20,30),
visits=c(100,200,300,150,250,350))
