I am interested in biodiversity index calculations using the vegan
package. The Simpson index works, but I get no results from the Shannon
index. I am hoping somebody knows the solution.
What I have tried: I converted my data.frame into vegan's test data
format using the code below.
Plot <- c(1,1,2,2,3,3,3)
species <- c( "Aa","Aa", "Aa","Bb","Bb","Rr","Xx")
count <- c(3,2,1,4,2,5,7)
veganData <- data.frame(Plot,species,count)
matrify(veganData)  # matrify() is from the labdsv package
diversity(veganData,"simpson")
diversity(veganData,"shannon", base = exp(1))
1. I get the following results, so I think it produces all the
Simpson indices:
> diversity(veganData,"simpson")
simpson.D simpson.I simpson.R
1 1.00 0.00 1.0
2 0.60 0.40 1.7
3 0.35 0.65 2.8
2. But when I run the Shannon index, I get the following
message:
> diversity(veganData,"shannon")
data frame with 0 columns and 3 rows
I am not sure why it's not working. Do we need to make any changes
to the data formatting when switching methods?
Your data need to be in wide format. The counts must also be totals or averages (not repeated counts for the same plot).
library(dplyr); library(tidyr)
df <- veganData %>%
group_by(Plot, species) %>%
summarise(count = sum(count)) %>%
ungroup() %>%
spread(species, count, fill=0)
df
# # A tibble: 3 x 5
# Plot Aa Bb Rr Xx
# <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 5 0 0 0
# 2 2 1 4 0 0
# 3 3 0 2 5 7
diversity(df[,-1], "shannon")
# [1] 0.0000000 0.5004024 0.9922820
To check that the calculation is correct, note that the Shannon index is computed as H = -sum(Pi * ln(Pi)), where Pi is the proportion of the plot's total count belonging to species i:
# For plot 3:
-1*(
(2/(2+5+7))*log((2/(2+5+7))) + #Pi*lnPi of Bb
(5/(2+5+7))*log((5/(2+5+7))) + #Pi*lnPi of Rr
(7/(2+5+7))*log((7/(2+5+7))) #Pi*lnPi of Xx
)
# [1] 0.992282
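As a sanity check in the other direction, the Simpson index works on the same wide matrix; vegan's "simpson" option returns 1 - sum(Pi^2). The commented values below are computed by hand from the df above rather than copied from a session:
diversity(df[,-1], "simpson")
# e.g. for plot 3: 1 - ((2/14)^2 + (5/14)^2 + (7/14)^2)
# [1] 0.0000000 0.3200000 0.6020408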
Related
I have a data frame in long format and I want to replace missing values with random numbers, but I want to do this group-wise with different settings...
library(dplyr)
set.seed(1)
imp_df <-
data.frame(exp=rep(letters[1:3], each=2),
rep=1:2,
mean=1:6,
sd=seq(0,0.5,0.1))
df <-
data.frame(
exp=rep(letters[1:3], each=20),
rep=1:2,
int=rnorm(60,10,5)
)
df[sample(1:60,25,replace=F), 'int'] <- NA
So my data looks like the above: imp_df holds the settings for the rnorm function based on the experiment exp and the replicate rep.
My data frame then has some missing values, and I want to replace the NAs with the random numbers.
How can I do this using dplyr or tidyr?
Edit
After the answer from @starja, I found a quick, but maybe slow, solution using rowwise together with left_join.
df %>%
left_join(imp_df) %>%
rowwise() %>%
mutate(imp.int=if_else(
is.na(int),
rnorm(1, mean, sd),
int
)) %>%
print(n=60)
Are there other ways to do this?
Edit 2
Since the rowwise approach is pretty slow and I couldn't get it working within plain dplyr code, I used a for loop to go through the imputation settings in imp_df.
This is a pretty quick solution, but not as readable as I was hoping:
df$imp.int <- df$int
for(line in 1:nrow(imp_df)) {
imp_settings <- as.list(imp_df[line,])
rows_missing_values <- which(
df$exp == imp_settings$exp &
df$rep == imp_settings$rep &
is.na(df$imp.int)
)
df$imp.int[rows_missing_values] <-
stats::rnorm(length(rows_missing_values), imp_settings$mean, imp_settings$sd)
}
So we first add a column imp.int for the imputed values, then apply the different imputation settings line by line, replacing the NAs for each group.
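Since rnorm() is itself vectorised over mean and sd, the whole imputation can also be done in a single mutate, without rowwise or a loop. This is a sketch assuming the df and imp_df defined above:
library(dplyr)
df %>%
  left_join(imp_df, by = c("exp", "rep")) %>%
  mutate(imp.int = coalesce(int, rnorm(n(), mean, sd)))  # keep int where present, otherwise take the per-row draw
Here rnorm(n(), mean, sd) recycles the joined mean/sd columns element-wise, so each row draws from its own group's distribution.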
This could also be done:
library(dplyr)
library(purrr)
df %>%
left_join(imp_df, by = c("exp", "rep")) %>%
mutate(int = ifelse(is.na(int),
                    map2_dbl(mean, sd, ~ rnorm(1, .x, .y)), int))  # map2_dbl keeps the column numeric
exp rep int mean sd
1 a 1 1 1 0.0
2 a 2 10.91822 2 0.1
3 a 1 5.821857 1 0.0
4 a 2 17.9764 2 0.1
5 a 1 11.64754 1 0.0
6 a 2 5.897658 2 0.1
7 a 1 12.43715 1 0.0
8 a 2 13.69162 2 0.1
9 a 1 12.87891 1 0.0
10 a 2 1.986482 2 0.1
I guess there are cleverer solutions out there that use vectorisation, but if you don't have super large data, I like to use a purrr::map function for this together with a small custom-made function:
library(dplyr)
set.seed(1)
imp_df <-
data.frame(exp=rep(letters[1:3], each=2),
rep=1:2,
mean=1:6,
sd=seq(0,0.5,0.1))
df <-
data.frame(
exp=rep(letters[1:3], each=20),
rep=1:2,
int=rnorm(60,10,5)
)
df[sample(1:60,25,replace=F), 'int'] <- NA
# Impute a single value: draw from rnorm() if x is missing, otherwise keep x
replace_fun <- function(x, mean, sd) {
  if (is.na(x)) {
    rnorm(1, mean, sd)
  } else {
    x
  }
}
df %>%
left_join(imp_df, by = c("exp", "rep")) %>%
mutate(int = purrr::pmap_dbl(list(int, mean, sd), replace_fun)) %>%
head()
#> exp rep int mean sd
#> 1 a 1 1.000000 1 0.0
#> 2 a 2 10.918217 2 0.1
#> 3 a 1 5.821857 1 0.0
#> 4 a 2 17.976404 2 0.1
#> 5 a 1 11.647539 1 0.0
#> 6 a 2 5.897658 2 0.1
Created on 2021-05-27 by the reprex package (v0.3.0)
(If you want, you can remove the mean/sd columns with select(-c(mean, sd)).)
I want time series correlations in a grouped data frame. Here's a sample dataset:
library(dplyr)
x <- cbind(expand.grid(type = letters[1:4], time = 1:4, kind = letters[5:8]), value = rnorm(64)) %>% arrange(type, time, kind)
which produces 64 rows of the variables type, time, kind and value.
I want a time series correlation of the values for each kind grouped by type. Think of each type and time combination as an ordered vector of 4 values. I group by type and time, then arrange by kind, then remove kind.
y <- x %>% group_by(type) %>% arrange(type, time, kind) %>% select(-kind)
I can then group y by type and time and nest so that all the values are together in the data variable, regroup by type only, and create a new variable, ahead, which is the lead of data.
z <- y %>% group_by(type, time) %>% nest(value) %>% group_by(type) %>% mutate(ahead = lead(data))
Now I want to run mutate(R = cor(data, ahead)), but I can't seem to get the syntax correct.
I've also tried mutate(R = cor(data$value, ahead$value)) and mutate(R = cor(data[1]$value, ahead[1]$value)), to no avail.
The error I get from cor is: supply both 'x' and 'y' or a matrix-like 'x'.
How do I reference the data and ahead variables as vectors to run with cor?
Ultimately, I'm looking for a 16 row data frame with columns type, time, and R where R is a single correlation value.
Thank you for your attention.
We can use map2_dbl from purrr to pass data and ahead at the same time to the cor function.
library(dplyr)
z %>%
mutate(R = purrr::map2_dbl(data, ahead, cor)) %>%
select(-data, -ahead)
# type time R
# <fct> <int> <dbl>
# 1 a 1 0.358
# 2 a 2 -0.0498
# 3 a 3 -0.654
# 4 a 4 1
# 5 b 1 -0.730
# 6 b 2 0.200
# 7 b 3 -0.928
# 8 b 4 1
# 9 c 1 0.358
#10 c 2 0.485
#11 c 3 -0.417
#12 c 4 1
#13 d 1 0.140
#14 d 2 -0.448
#15 d 3 -0.511
#16 d 4 1
In base R, we can use mapply
z$R <- mapply(cor, z$data, z$ahead)
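Either way, once R is assigned, you can drop the list columns to get the 16-row frame the question asks for:
z[, c("type", "time", "R")]  # just type, time and the correlation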
I am working with weather data and trying to find the first time a temperature is negative for each winter season. I have a data frame with a column for the winter season (1,2,3,etc.), the temperature, and the ID.
I can get the first time the temperature is negative with this code:
FirstNegative <- min(which(df$temp < 0))
but it only returns the first value, and not one for each season.
I know I somehow need to group_by season, but how do I incorporate this?
For example,
season<-c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5)
temp<-c(2,-1,0,-1,3,-1,0,-1,2,-1,4,5,-1,-1,2)
ID<-c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)
df <- data.frame(season, temp, ID)
Ideally I want a table that looks like this from the above dummy code:
season id_firstnegative
[1,] 1 2
[2,] 2 4
[3,] 3 8
[4,] 4 10
[5,] 5 13
A base R option using subset and aggregate
aggregate(ID ~ season, subset(df, temp < 0), head, 1)
# season ID
#1 1 2
#2 2 4
#3 3 8
#4 4 10
#5 5 13
library(dplyr)
season<-c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5)
temp<-c(2,-1,0,-1,3,-1,0,-1,2,-1,4,5,-1,-1,2)
ID<-c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)
df<-as.data.frame(cbind(season,temp,ID))
df %>%
dplyr::filter(temp < 0) %>%
group_by(season) %>%
dplyr::filter(row_number() == 1) %>%
ungroup()
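A slightly more idiomatic spelling of the second filter is slice(1), which keeps the first row of each group:
df %>%
  dplyr::filter(temp < 0) %>%
  group_by(season) %>%
  slice(1) %>%
  ungroup()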
As you said, I believe you could solve this by grouping by season and taking the first ID whose temperature is below zero within each group. However, the ordering of your data matters, so ensure that each season is ordered correctly before using this possible solution.
library(dplyr)
library(tibble)
season<-c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5)
temp<-c(2,-1,0,-1,3,-1,0,-1,2,-1,4,5,-1,-1,2)
ID<-c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)
df<- tibble(season,temp,ID)
df <- df %>%
group_by(season) %>%
mutate(firstNeg = ID[which(temp<0)][1]) %>%
distinct(season, firstNeg) # Combine only unique values of these columns for reduced output
This will provide output like:
# A tibble: 5 x 2
# Groups: season [5]
season firstNeg
<dbl> <dbl>
1 1 2
2 2 4
3 3 8
4 4 10
5 5 13
I have data that looks like this (dat):
region muscle protein
head cerebrum 78
head cerebrum 56
head petiole 1
head petiole 2
tail pectoral 3
tail pectoral 4
I want to take the mean of the protein values for cerebrum. I tried to look up different ways to subset data here and here, but there does not seem to be a straightforward way of doing it. Right now, I'm doing this:
datcerebrum <- dat[which(dat$muscle == "cerebrum"),]
mean(datcerebrum$protein)
I tried to condense this into one line:
mean(dat[which(dat$muscle == "cerebrum"),])
But it returns NA with a warning that the argument is not numeric or logical. Is there an easy way to achieve this?
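The warning occurs because dat[which(dat$muscle == "cerebrum"), ] is still a whole data frame (including the character columns region and muscle), and mean() needs a numeric vector. Subsetting just the protein column does give a base R one-liner:
mean(dat$protein[dat$muscle == "cerebrum"])
# [1] 67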
We can use aggregate from base R
aggregate(protein ~muscle, dat, mean)
# muscle protein
#1 cerebrum 67.0
#2 pectoral 3.5
#3 petiole 1.5
I'd do this with the tidyverse package dplyr:
library(readr)
library(dplyr)
fwf <- "head cerebrum 78
head cerebrum 56
head petiole 1
head petiole 2
tail pectoral 3
tail pectoral 4"
dat <- read_fwf(fwf, fwf_empty(fwf, col_names = c("region", "muscle", "protein")))
# The above code is just to create your data frame - please provide reproducible data!
dat %>% filter(muscle == "cerebrum") %>% summarise(m = mean(protein))
#> # A tibble: 1 x 1
#> m
#> <dbl>
#> 1 67
You could even do it for every muscle at once:
dat %>% group_by(muscle) %>% summarise(m = mean(protein))
#> # A tibble: 3 x 2
#> muscle m
#> <chr> <dbl>
#> 1 cerebrum 67.0
#> 2 pectoral 3.5
#> 3 petiole 1.5
Solution using data.table:
# Load required library
library(data.table)
# Transform you data into a data.table object
setDT(dat)
# Subset the cerebrum rows and take the mean of their protein values
dat[muscle == "cerebrum", mean(protein)]
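As in the dplyr answer, data.table can also compute the mean for every muscle at once:
dat[, .(m = mean(protein)), by = muscle]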
I have the following data in R.
library(dplyr)
table1 <- data.frame(group=c(1,1,1,2,2,2), price=c(10,20,30,10,20,30),
                     visits=c(100,200,300,150,250,350))
table1 <- table1 %>% arrange(price) %>% split(.$group)
$`1`
group price visits
1 1 10 100
3 1 20 200
5 1 30 300
$`2`
group price visits
2 2 10 150
4 2 20 250
6 2 30 350
group_1<-data.frame(case_1=c(0.2,0.3,0.4),case_2=c(0.22,0.33,0.44))
group_2<-data.frame(case_1=c(0.3,0.4,0.5),case_2=c(0.33,0.44,0.55))
So the question is: how can I do the following operation without repeating it four times? I suppose an apply-family function, or something similar, would be a better fit.
sum(table1$`1`[,c("group")] * group_1[,c("case_1")])
sum(table1$`1`[,c("group")] * group_1[,c("case_2")])
sum(table1$`2`[,c("group")] * group_2[,c("case_1")])
sum(table1$`2`[,c("group")] * group_2[,c("case_2")])
After going step by step through the data you provided and understanding what you are trying to do, here is a suggestion using mapply.
group_list <- list(group_1, group_2)
# table1 here is the original (unsplit) data frame -- see the data block at the end
mapply(function(x, y) colSums(x * y), split(table1$group, table1$group), group_list)
# 1 2
#case_1 0.90 2.40
#case_2 0.99 2.64
We put the two group data frames in one list, group_list, split table1's group column by group, multiply the pieces element-wise with mapply, and take the column-wise sums. If I have understood you correctly, this is what you needed; let me know if it is otherwise.
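As a quick cross-check against the manual sums in the question (again using the unsplit table1), the first entry matches:
sum(table1$group[table1$group == 1] * group_1$case_1)
# [1] 0.9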
Based on the initial dataset, we can do this using group_by operations
library(tidyverse)
bind_rows(group_1, group_2) %>%
bind_cols(table1['group'], .) %>%
mutate(case_1 = group*case_1, case_2 = group*case_2) %>%
group_by(group) %>%
summarise(across(everything(), sum))  # summarise_each(funs(sum)) in older dplyr
# A tibble: 2 × 3
# group case_1 case_2
# <dbl> <dbl> <dbl>
#1 1 0.9 0.99
#2 2 2.4 2.64
data
table1<-data.frame(group=c(1,1,1,2,2,2),price=c(10,20,30,10,20,30),
visits=c(100,200,300,150,250,350))