Function in R (Merge Bases) - r

I have the following bases in R.
table1<-data.frame(group=c(1,1,1,2,2,2),price=c(10,20,30,10,20,30),
visits=c(100,200,300,150,250,350))
table1<-table1 %>% arrange(price) %>% split(.$group)
$`1`
group price visits
1 1 10 100
3 1 20 200
5 1 30 300
$`2`
group price visits
2 2 10 150
4 2 20 250
6 2 30 350
group_1<-data.frame(case_1=c(0.2,0.3,0.4),case_2=c(0.22,0.33,0.44))
group_2<-data.frame(case_1=c(0.3,0.4,0.5),case_2=c(0.33,0.44,0.55))
So, the question is How can I do the following operation without repeating it four times. I suppose that an apply function, or similar, will suit better.
sum(table1$`1`[,c("group")] * group_1[,c("case_1")])
sum(table1$`1`[,c("group")] * group_1[,c("case_2")])
sum(table2$`1`[,c("group")] * group_2[,c("case_1")])
sum(table2$`1`[,c("group")] * group_2[,c("case_2")])

After going through step-by-step in the data you have provided and understanding what you are trying to do. Here is a suggestion using mapply.
group_list <- list(group_1, group_2)
mapply(function(x, y) colSums(x * y),split(table1$group, table1$group),group_list)
# 1 2
#case_1 0.90 2.40
#case_2 0.99 2.64
We take the groups in one list say group_list. Split table1 by group and perform multiplication between them using mapply and take the column-wise sum. If I have understood you correctly, this is what you needed let me know if it is otherwise.

Based on the initial dataset, we can do this using group_by operations
library(tidyverse)
bind_rows(group_1, group_2) %>%
bind_cols(table1['group'], .) %>%
mutate(case_1 = group*case_1, case_2 = group*case_2) %>%
group_by(group) %>%
summarise_each(funs(sum))
# A tibble: 2 × 3
# group case_1 case_2
# <dbl> <dbl> <dbl>
#1 1 0.9 0.99
#2 2 2.4 2.64
data
table1<-data.frame(group=c(1,1,1,2,2,2),price=c(10,20,30,10,20,30),
visits=c(100,200,300,150,250,350))

Related

Divide multiple variable values by a specific value in R

I'm trying to pull something that is simple but can't seem to get my head over it. My data looks like this
|Assay|Sample|Number|
|A|1|10|
|B|1|25|
|C|1|30|
|A|2|45|
|B|2|65|
|C|2|8|
|A|3|10|
|B|3|81|
|C|3|12|
What I need to do is to divide each "Number" value for each sample by the value of the respective assay A. That is, for sample 1, I would like to have 10/10, 25/10 and 30/10. Then for sample 2, I would need 45/45, 65/45 and 8/45 and so on with the rest of the samples.
I have already tried doing:
mutate(Normalised = Number/Number[Assay == "A"])
as suggested in another post but the results are not correct.
Any help would be great. Thank you very much!
Using dplyr
df <- data.frame(Assay=rep(c('A','B','C'),3),
Sample=rep(1:3,each=3),
Number=c(10,25,30,45,65,8,10,81,12))
df <- df %>%
group_by(Sample) %>%
arrange(Assay) %>%
mutate(Normalised=Number/first(Number)) %>%
ungroup() %>%
arrange(Sample)
gives out
> df
# A tibble: 9 × 4
Assay Sample Number Normalised
<chr> <int> <dbl> <dbl>
1 A 1 10 1
2 B 1 25 2.5
3 C 1 30 3
4 A 2 45 1
5 B 2 65 1.44
6 C 2 8 0.178
7 A 3 10 1
8 B 3 81 8.1
9 C 3 12 1.2
Note: I added arrange(Assay) just to make sure "A" is always the first row within each group. Also, arrange(Sample) is there just to get the output in the same order as it was but it doesn't really need to be there if you don't care about the display order.

running Shannon and Simpson : Vegan package

I am interested in biodiversity index calculations using vegan
package. The simpsons index works but no results from Shannon
argument. I was hoping somebody know the solution
What I have tried is that I have converted data. frame into vegan
package test data format using code below
Plot <- c(1,1,2,2,3,3,3)
species <- c( "Aa","Aa", "Aa","Bb","Bb","Rr","Xx")
count <- c(3,2,1,4,2,5,7)
veganData <- data.frame(Plot,species,count)
matrify(veganData )
diversity(veganData,"simpson")
diversity(veganData,"shannon", base = exp(1))
1. I get the following results, so I think it produces all
simpsons indices
> diversity(veganData,"simpson")
simpson.D simpson.I simpson.R
1 1.00 0.00 1.0
2 0.60 0.40 1.7
3 0.35 0.65 2.8
2. But when I run for Shannon index get the following
message
> diversity(veganData,"shannon")
data frame with 0 columns and 3 rows
I am not sure why its not working ? do we need to make any changes
in data formatting while switching the methods?
Your data need to be in the wide format. Also the counts must be either in total or averages (not repeated counts for the same plot).
library(dply); library(tidyr)
df <- veganData %>%
group_by(Plot, species) %>%
summarise(count = sum(count)) %>%
ungroup %>%
spread(species, count, fill=0)
df
# # A tibble: 3 x 5
# Plot Aa Bb Rr Xx
# <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 5 0 0 0
# 2 2 1 4 0 0
# 3 3 0 2 5 7
diversity(df[,-1], "shannon")
# [1] 0.0000000 0.5004024 0.9922820
To check if the calculation is correct, note the Shannon calculation is carried out as -1 x summation of Pi*lnPi
# For plot 3:
-1*(
(2/(2+5+7))*log((2/(2+5+7))) + #Pi*lnPi of Bb
(5/(2+5+7))*log((5/(2+5+7))) + #Pi*lnPi of Rr
(7/(2+5+7))*log((7/(2+5+7))) #Pi*lnPi of Xx
)
# [1] 0.992282

group by in R dplyr for more than one variable on unique value of other variable

I have a dataset with three columns as below:
data <- data.frame(
grpA = c(1,1,1,1,1,2,2,2),
idB = c(1,1,2,2,3,4,5,6),
valueC = c(10,10,20,20,10,30,40,50),
otherD = c(1,2,3,4,5,6,7,8)
)
valueC is unique to each unique value of idB.
I want to use dplyr pipe (as the rest of my code is in dplyr) and use group_by on grpA to get a new column with sum of valueC values for each group.
The answer should be like:
newCol <- c(40,40,40,40,40,120,120,120)
but with data %>% group_by(grpA) %>%
mutate(newCol=sum(valueC), I get newCol <- c(70,70,70,70,70,120,120,120)
How do I include unique value of idB? Is there anything else I can use instead of group_by in dplyr %>% pipe.
I cant use summarise as I need to keep values in otherD intact for later use.
Other option I have is to create newCol separately through sql and then merge with left join. But I am looking for a better solution inline.
If it has been answered before, please refer me to the link as I could not find any relevant answer to this issue.
We need unique with match
data %>%
group_by(grpA) %>%
mutate(ind = sum(valueC[match(unique(idB), idB)]))
# A tibble: 8 x 5
# Groups: grpA [2]
# grpA idB valueC otherD ind
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 1 10 1 40
#2 1 1 10 2 40
#3 1 2 20 3 40
#4 1 2 20 4 40
#5 1 3 10 5 40
#6 2 4 30 6 120
#7 2 5 40 7 120
#8 2 6 50 8 120
Or another option is to get the distinct rows by 'grpA', 'idB', grouped by 'grpA', get the sum of 'valueC' and left_join with the original data
data %>%
distinct(grpA, idB, .keep_all = TRUE) %>%
group_by(grpA) %>%
summarise(newCol = sum(valueC)) %>%
left_join(data, ., by = 'grpA')

How to use a statistic function and subsetting data simultaneously in R?

I have data that looks like this (dat)
region muscle protein
head cerebrum 78
head cerebrum 56
head petiole 1
head petiole 2
tail pectoral 3
tail pectoral 4
I want to take the mean of protein values of cerebrum. I tried to look up different ways to subset data here and here. But there does not seem a straightforward way of doing it. Right now, I'm doing this:
datcerebrum <- dat[which(dat$muscle == "cerebrum"),]
mean(datcerebrum$protein)
I try to condense this one line :
mean(dat[which(dat$muscle == "cerebrum"),])
But it throws out a NA with a warning that argument is not numeric or logical. Is there an easy way to achieve this?
We can use aggregate from base R
aggregate(protein ~muscle, dat, mean)
# muscle protein
#1 cerebrum 67.0
#2 pectoral 3.5
#3 petiole 1.5
I'd do this with the tidyverse package dplyr:
library(readr)
library(dplyr)
fwf <- "head cerebrum 78
head cerebrum 56
head petiole 1
head petiole 2
tail pectoral 3
tail pectoral 4"
dat <- read_fwf(fwf, fwf_empty(fwf, col_names = c("region", "muscle", "protein")))
# The above code is just to create your data frame - please provide reproducible data!
dat %>% filter(muscle == "cerebrum") %>% summarise(m = mean(protein))
#> # A tibble: 1 x 1
#> m
#> <dbl>
#> 1 67
You could even do it for every muscle at once:
dat %>% group_by(muscle) %>% summarise(m = mean(protein))
#> # A tibble: 3 x 2
#> muscle m
#> <chr> <dbl>
#> 1 cerebrum 67.0
#> 2 pectoral 3.5
#> 3 petiole 1.5
Solution using data.table:
# Load required library
library(data.table)
# Transform you data into a data.table object
setDT(dat)
# Subset cerebrum and mean protein values
data[muscle == "cerebrum"][, mean(protein)]

translate data frame of observations into ranks

I have a data set like this:
df <- data.frame(situation1=rnorm(30),
situation2=rnorm(30),
situation3=rnorm(30),
models=c(rep("A",10), rep("B",10), rep("C", 10)))
where I compare three models (A,B,C) in three situations. I have 10 measurements for each model.
I now want to summarise this into ranks, i.e. how often each models wins in each situtation. Win is defined by the highest value.
A final output could be something like this:
model situation1 situtation2 situtation3
A 4 3 3
B 7 1 2
C 1 4 5
In base R:
table(df$models,colnames(df[-4])[max.col(df[-4])])
# situation1 situation2 situation3
# A 2 4 4
# B 4 5 1
# C 2 4 4
Results may change from your OP, since you didn't set a seed.
Here is an option using data.table
library(data.table)
setDT(df)[, lapply(Map(`==`, .SD, list(do.call(pmax, .SD))), sum), models]
Here's a dplyr option:
df %>%
group_by(models) %>%
mutate_all(funs(. == pmax(situation1, situation2, situation3))) %>%
summarise_all(sum)
Or possibly a little more efficient:
df %>%
mutate_at(vars(-models), funs(. == pmax(situation1, situation2, situation3))) %>%
group_by(models) %>%
summarise_all(sum)
## A tibble: 3 × 4
# models situation1 situation2 situation3
# <chr> <int> <int> <int>
#1 A 3 3 3
#2 B 3 5 1
#3 C 6 1 2
If you're looking for the minimum, use pmin instead of pmax. And in case there may be NAs, use the na.rm-argument in pmax/pmin.
Final note: the result doesn't match OP's because the sample data was generated without setting a seed.

Resources