R falsely identifying abundance data as presence/absence data (fossil package)

I am attempting to use the spp.est function of the package called "fossil" in RStudio. I have created a matrix called "akimiskibb" of abundance data with species as the columns and sites as the rows. When I try to use the function spp.est, I type this:
spp.est(akimiskibb, rand = 10, abund = TRUE, counter = FALSE, max.est = 'all')
The problem comes in because my abundance data has a lot of zeroes, so I get this error message:
Error in if (max(x) == 1) warning("cannot use incidence data for abundance-based analyses. If the data is incidence based, please run this function again with the option of 'abund=FALSE'") :
missing value where TRUE/FALSE needed
This function has worked in the past with matrices with a lot of zeroes (which are also abundance data, not presence/absence). I don't know what I am doing wrong.
Has anyone experienced something similar and found a way around this?
Thank you,
Kayla
Data:
matrix format:
         sp1 sp2 sp3 sp4 sp5 sp6 sp7 sp8 sp9 sp10 sp11 sp12 sp13 sp14 sp15 sp16 sp17
sample1    0   0   0   0   0   0   0   1   0    0    0    0    0    0    1    0    0
sample 2   0   0   0   1   0   0   1  25   7    0   18   12    0    0    0    1    1
sample3    0   0   0   0   0   0   0   3   0    0    3    1    0    0    0    5    4
         sp18 sp19 sp20 sp21 sp22 sp23 sp24 sp25 sp26 sp27 sp28 sp29 sp30 sp31 sp32
sample1     0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
sample 2    1    0    1    0    0    0    0    0    0    0    0    3    2    0    3
sample3     0    0    1    0    0   11    0    0    0    0    0    0    0    0    1
         sp33 sp34 sp35 sp36 sp37 sp38 sp39 sp40 sp41 sp42 sp43  X
sample1     0    0    0    0    0    0    0    0    0    0    0 NA
sample 2    0    0    3    2    1    0    0    1    8    0    0 NA
sample3     0    0    0    0    0    0    0    0    0    0    0 NA
dput:
structure(list(sp1 = c(0L, 0L, 0L), sp2 = c(0L, 0L, 0L), sp3 = c(0L,
0L, 0L), sp4 = c(0L, 1L, 0L), sp5 = c(0L, 0L, 0L), sp6 = c(0L,
0L, 0L), sp7 = c(0L, 1L, 0L), sp8 = c(1L, 25L, 3L), sp9 = c(0L,
7L, 0L), sp10 = c(0L, 0L, 0L), sp11 = c(0L, 18L, 3L), sp12 = c(0L,
12L, 1L), sp13 = c(0L, 0L, 0L), sp14 = c(0L, 0L, 0L), sp15 = c(1L,
0L, 0L), sp16 = c(0L, 1L, 5L), sp17 = c(0L, 1L, 4L), sp18 = c(0L,
1L, 0L), sp19 = c(0L, 0L, 0L), sp20 = c(0L, 1L, 1L), sp21 = c(0L,
0L, 0L), sp22 = c(0L, 0L, 0L), sp23 = c(0L, 0L, 11L), sp24 = c(0L,
0L, 0L), sp25 = c(0L, 0L, 0L), sp26 = c(0L, 0L, 0L), sp27 = c(0L,
0L, 0L), sp28 = c(0L, 0L, 0L), sp29 = c(0L, 3L, 0L), sp30 = c(0L,
2L, 0L), sp31 = c(0L, 0L, 0L), sp32 = c(0L, 3L, 1L), sp33 = c(0L,
0L, 0L), sp34 = c(0L, 0L, 0L), sp35 = c(0L, 3L, 0L), sp36 = c(0L,
2L, 0L), sp37 = c(0L, 1L, 0L), sp38 = c(0L, 0L, 0L), sp39 = c(0L,
0L, 0L), sp40 = c(0L, 1L, 0L), sp41 = c(0L, 8L, 0L), sp42 = c(0L,
0L, 0L), sp43 = c(0L, 0L, 0L), X = c(NA, NA, NA)), .Names = c("sp1",
"sp2", "sp3", "sp4", "sp5", "sp6", "sp7", "sp8", "sp9", "sp10",
"sp11", "sp12", "sp13", "sp14", "sp15", "sp16", "sp17", "sp18",
"sp19", "sp20", "sp21", "sp22", "sp23", "sp24", "sp25", "sp26",
"sp27", "sp28", "sp29", "sp30", "sp31", "sp32", "sp33", "sp34",
"sp35", "sp36", "sp37", "sp38", "sp39", "sp40", "sp41", "sp42",
"sp43", "X"), class = "data.frame", row.names = c("sample1",
"sample 2", "sample3"))
packages used:
fossil (made in R version 3.4.4)
Version of R: R x64 3.4.1

OK, not familiar with this package so thank you for introducing it to me. Always great to see how R is being used in so many disciplines.
I see two issues with your application. First, according to the fossil documentation for spp.est, your data needs to have samples as columns and species as rows. Second, the "species" X contains only NA values. You need to remove it because the function can't handle missing values: max(x) returns NA when x contains an NA, so the check if (max(x) == 1) shown in your error message fails with "missing value where TRUE/FALSE needed".
Here's the code:
library(fossil)
library(tidyverse)
# transpose so that species are rows, then drop the all-NA "X" row
akimiskibb <- as.data.frame(t(akimiskibb)) %>%
  filter(!is.na(sample1))
spp.est(akimiskibb, rand = 10, abund = TRUE, counter = FALSE, max.est = 'all')
# which results in
N.obs S.obs S.obs(+95%) S.obs(-95%) Chao1 Chao1(upper) Chao1(lower)
[1,] 1 11.8 25.30995 -1.709948 20.90 28.38331 13.41669
[2,] 2 16.0 25.46770 6.532301 27.15 33.37680 20.92320
[3,] 3 20.0 20.00000 20.000000 26.00 29.24037 22.75963
ACE ACE(upper) ACE(lower) Jack1 Jack1(upper) Jack1(lower)
[1,] 14.56286 31.16289 -2.037175 16.82501 36.43129 -2.781268
[2,] 19.19286 30.64000 7.745713 22.21008 35.28448 9.135677
[3,] 22.28571 22.28571 22.285714 25.95082 25.95082 25.950820
attr(,"data.type")
[1] "abundance"
Warning messages:
1: In ests[[k]](b) :
This data appears to be presence/absence based, but this estimator is for abundance data only
2: In ests[[k]](b) :
This data appears to be presence/absence based, but this estimator is for abundance data only
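To illustrate why the original call failed, here is a minimal sketch in base R. It uses a small made-up data frame rather than the real akimiskibb data, and the names x, res, keep and clean are my own assumptions. It does not call spp.est itself; it only reproduces the max(x) == 1 check from the error message and shows one possible way to drop all-NA columns before transposing.

```r
# Made-up stand-in for the real data: samples as rows, species as columns,
# plus a stray all-NA column "X" like the one in the question
x <- data.frame(sp1 = c(0L, 1L, 0L),
                sp2 = c(1L, 25L, 3L),
                X   = c(NA, NA, NA),
                row.names = c("sample1", "sample 2", "sample3"))

# max() propagates NA, so a test like `if (max(x) == 1)` has no TRUE/FALSE answer
res <- tryCatch(
  if (max(x) == 1) "incidence" else "abundance",
  error = function(e) conditionMessage(e)
)
print(res)

# Drop columns that are entirely NA, then transpose so that
# species are rows and samples are columns
keep  <- colSums(is.na(x)) < nrow(x)
clean <- t(as.matrix(x[, keep, drop = FALSE]))
print(clean)
```

After this clean-up, max(clean) is an ordinary number again, so the check can be evaluated.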

Related

How to change the reference level for Risk Ratio in logistic regression in R?

Survey.ID Quit Boss Subord Subord2 Subord3 Subord4
1 1 0 0 0 0 1 0
2 2 1 0 0 1 0 0
3 3 0 0 0 0 0 0
4 4 0 0 0 0 1 0
5 5 0 0 0 0 1 0
6 6 1 0 0 0 1 0
I have a df above. Each of the variables is a binary variable that categorizes if someone is a boss, or a certain level of subordinate. I am trying to see what is most predictive of someone quitting the past month. I am using the logistic regression
model <- glm(Quit ~ Subord, family=binomial, data = df)
summary(model)
exp(cbind(RR = coef(model), confint.default(model)))
I would like to find the Relative Risk (RR) for each group of employees: Boss, Subord, Subord2, Subord3, Subord4. However, I would like the reference group to be Subord4. I believe the reference is currently set to Boss. How do I change this?
I think this might help
Libraries
library(tidyverse)
Sample data
df <-
structure(list(id = 1:6, Quit = c(0L, 1L, 0L, 0L, 0L, 1L), Boss = c(0L,
0L, 0L, 0L, 0L, 0L), Subord = c(0L, 0L, 0L, 0L, 0L, 0L), Subord2 = c(0L,
1L, 0L, 0L, 0L, 0L), Subord3 = c(1L, 0L, 0L, 1L, 1L, 1L), Subord4 = c(0L,
0L, 0L, 0L, 0L, 0L)), class = "data.frame", row.names = c(NA,
-6L))
Code
df %>%
  # Create a single column of variables: Boss Subord Subord2 Subord3 Subord4
  pivot_longer(cols = -c(id, Quit)) %>%
  # Keep only the rows with value = 1
  filter(value == 1) %>%
  # Make "Subord4" the baseline, as the first level of the factor
  mutate(name = fct_relevel(as.factor(name), "Subord4")) %>%
  glm(data = ., formula = Quit ~ name, family = binomial)
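The same releveling idea can also be sketched in base R with relevel(). The data below are hypothetical (one role label per person, with every role present), and the names roles and fit are my own. Note that exponentiated coefficients from a binomial glm with the default logit link are odds ratios rather than relative risks, so treat this only as an illustration of how the reference level is chosen.

```r
# Hypothetical data: 4 people per role, not the asker's survey
roles <- data.frame(
  role = rep(c("Boss", "Subord", "Subord2", "Subord3", "Subord4"), each = 4),
  Quit = c(0, 0, 0, 1,   0, 0, 1, 1,   0, 1, 1, 1,   0, 0, 0, 1,   0, 0, 1, 1)
)

# Make "Subord4" the first factor level, i.e. the reference group
roles$role <- relevel(factor(roles$role), ref = "Subord4")
print(levels(roles$role))

fit <- glm(Quit ~ role, family = binomial, data = roles)

# Odds ratios of each role relative to Subord4 (no coefficient for Subord4 itself)
print(exp(coef(fit)))
```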

how to make a vector of different values in R for linear regression

I have a matrix (data1) looking like:
gene S869 S907 S909 S016 S090 S160
1 S1 0 0 0 0.000000 0 0
2 S2 0 0 0 0.000000 0 0
3 S3 0 0 0 0.423405 0 0
4 S4 0 0 0 0.000000 0 0
5 S5 0 0 0 0.000000 0 0
6 S6 0 0 0 0.000000 0 0
I have another dataset (data2) looking like:
Cultivar Dose
S869 10
S907 5
S909 7
S016 19
S090 15
S160 12
Then I want to do a linear regression using
for (gene in 1:ngenes){
model = lm(Dose~X[gene,])
}
I want to check regression between genes and dose using these two datasets. So that I get p values for all genes for dose across all varieties. Thank you in advance!
We can use Map to loop over the 'Cultivar' and 'Dose' columns of 'data2', build a subset of 'data1' containing the matching column plus a 'Dose' column, and apply lm to each:
Map(function(x, y) lm(Dose ~ ., transform(data1[x], Dose = y)),
as.character(data2$Cultivar), data2$Dose)
data
data1 <- structure(list(gene = c("S1", "S2", "S3", "S4", "S5", "S6"),
S869 = c(0L, 0L, 0L, 0L, 0L, 0L), S907 = c(0L, 0L, 0L, 0L,
0L, 0L), S909 = c(0L, 0L, 0L, 0L, 0L, 0L), S016 = c(0, 0,
0.423405, 0, 0, 0), S090 = c(0L, 0L, 0L, 0L, 0L, 0L), S160 = c(0L,
0L, 0L, 0L, 0L, 0L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
data2 <- structure(list(Cultivar = c("S869", "S907", "S909", "S016", "S090",
"S160"), Dose = c(10L, 5L, 7L, 19L, 15L, 12L)), class = "data.frame",
row.names = c(NA,
-6L))
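If the goal is instead one regression per gene, as in the loop from the question, a sketch along the following lines might be closer. The expression values here are made up (the sample data above are almost all zero, which leaves nothing to regress on), and expr, dose and pvals are hypothetical names. A gene with constant expression has no estimable slope, so the sketch returns NA for it.

```r
# Made-up expression matrix: rows = genes, columns = cultivars
expr <- rbind(
  S1 = c(1.0, 0.4, 0.7, 2.1, 1.4, 1.2),
  S2 = c(0.2, 0.9, 0.1, 0.5, 0.3, 0.8),
  S3 = c(0, 0, 0, 0, 0, 0)              # constant gene: slope not estimable
)
colnames(expr) <- c("S869", "S907", "S909", "S016", "S090", "S160")

# Dose per cultivar, reordered to match the columns of expr
dose <- c(S869 = 10, S907 = 5, S909 = 7, S016 = 19, S090 = 15, S160 = 12)
dose <- dose[colnames(expr)]

# One simple linear regression per gene; keep the p-value of the slope
pvals <- apply(expr, 1, function(x) {
  coefs <- coef(summary(lm(dose ~ x)))
  if ("x" %in% rownames(coefs)) coefs["x", "Pr(>|t|)"] else NA_real_
})
print(pvals)
```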

how to make a vector of different values in R

I have a matrix (data1) looking like:
gene S869 S907 S909 S016 S090 S160
1 S1 0 0 0 0.000000 0 0
2 S2 0 0 0 0.000000 0 0
3 S3 0 0 0 0.423405 0 0
4 S4 0 0 0 0.000000 0 0
5 S5 0 0 0 0.000000 0 0
6 S6 0 0 0 0.000000 0 0
I want to make a vector (Say X) for this matrix. I have another dataset (data2) looking like:
Cultivar Dose
S869 10
S907 5
S909 7
S016 19
S090 15
S160 12
Then I want to do a linear regression using
for (gene in 1:ngenes){
model = lm(Dose~X[gene,])
}
and also want to get the p values from above regression. Thank you in advance!
To use the 'Cultivar' values as column names, we can loop over the corresponding 'Cultivar' and 'Dose' values with Map, extract the matching column from 'data1', add a 'Dose' column with transform, fit the model with lm, and extract the p-values from the coefficient table:
lst1 <- Map(function(x, y) {
tmpdat <- transform(as.data.frame(data1)[x], Dose = y)
model <- lm(Dose ~ ., data = tmpdat)
coef(summary(model))[, "Pr(>|t|)"]}, data2$Cultivar, data2$Dose)
lst1
#$S869
#[1] 1.218223e-78
#$S907
#[1] 1.218223e-78
#$S909
#[1] 2.265096e-79
#...
data
data1 <- structure(list(gene = c("S1", "S2", "S3", "S4", "S5", "S6"),
S869 = c(0L, 0L, 0L, 0L, 0L, 0L), S907 = c(0L, 0L, 0L, 0L,
0L, 0L), S909 = c(0L, 0L, 0L, 0L, 0L, 0L), S016 = c(0, 0,
0.423405, 0, 0, 0), S090 = c(0L, 0L, 0L, 0L, 0L, 0L), S160 = c(0L,
0L, 0L, 0L, 0L, 0L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
data2 <- structure(list(Cultivar = c("S869", "S907", "S909", "S016", "S090",
"S160"), Dose = c(10L, 5L, 7L, 19L, 15L, 12L)), class = "data.frame",
row.names = c(NA,
-6L))
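For the first part of the question (building X), one possible sketch is to turn the data frame into a numeric matrix with the gene names as row names, so that X[gene, ] gives one numeric vector per gene. It assumes the same data1 as above; X is simply the name used in the question.

```r
data1 <- data.frame(
  gene = c("S1", "S2", "S3", "S4", "S5", "S6"),
  S869 = c(0, 0, 0, 0, 0, 0),
  S907 = c(0, 0, 0, 0, 0, 0),
  S909 = c(0, 0, 0, 0, 0, 0),
  S016 = c(0, 0, 0.423405, 0, 0, 0),
  S090 = c(0, 0, 0, 0, 0, 0),
  S160 = c(0, 0, 0, 0, 0, 0)
)

# Drop the character column, keep the numeric ones, and label rows by gene
X <- as.matrix(data1[, -1])
rownames(X) <- data1$gene

# One gene's values across cultivars, as a named numeric vector
print(X["S3", ])
```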

String matching where strings contain punctuation

I want to find a case insensitive match using grepl().
I have the following list of keywords that I want to find in a Text column of my data frame df.
# There is a long list of words, but for simplification I have provided only a subset.
I, I'm, the, and, to, a, of
I want to have the counts of these words separately for each of the data rows.
I define this word list to be used in the code as:
word_list = c('\\bI\\b','\\bthe\\b','\\band\\b','\\bto\\b','\\ba\\b','\\bof\\b')
# Note that I'm is not currently in this word_list
In my dataframe df I add the columns as below to keep the counts of above words:
df$I = 0
df$IM = 0 # this is where I need help
df$THE = 0
df$AND = 0
df$TO = 0
df$A = 0
df$OF = 0
Then I use the following for-loop for each word of the word list to iterate over each row of the required column.
# for each word of my word_list
for (i in 1:length(word_list)){
# to search in each row of text response
for(j in 1:nrow(df)){
if(grepl(word_list[i], df$Text[j], ignore.case = T)){
df[j,i+4] = (df[j,i+4]) # 4 is added to go to the specific column
}#if
}#for
}#for
For a reproducible example dput(df) is as below:
dput(df)
structure(list(cluster3 = c(2L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L), userID = c(3016094L, 3042038L, 3079341L, 3079396L, 3130832L, 3130864L, 3148118L, 3148914L, 3149040L, 3150222L), Text = structure(c(3L, 4L, 2L, 9L, 6L, 10L, 7L, 1L, 5L, 8L), .Label = c("I'm alright","I'm stressed", "I am a good person.", "I don't care", "I have a difficult task", "I like it", "I think it doesn't matter", "Let's not argue about this", "Let's see if I can run", "No, I'm not in a mood"), class = "factor"), I = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), IM = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), AND = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), THE = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), TO = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), OF = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), class = "data.frame", row.names = c(NA, -10L))
I would suggest a more streamlined approach:
## use a named vector for the word patterns
## with the column names you want to add to `df`
word_list = c('I' = '\\bi\\b', 'THE' = '\\bthe\\b', 'AND' = '\\band\\b',
              'TO' = '\\bto\\b', 'A' = '\\ba\\b', 'OF' = '\\bof\\b', 'IM' = '\\bim\\b')
## use `stringr::str_count` instead of `grepl`
## sapply does the looping and result gathering for us
library(stringr)
results = sapply(word_list, str_count,
string = gsub("[[:punct:]]", "", tolower(df$Text))
)
results
# I THE AND TO A OF IM
# [1,] 1 3 2 1 1 1 0
# [2,] 0 0 1 0 0 0 0
# [3,] 0 0 0 0 0 0 0
# [4,] 2 2 3 2 1 1 1
# [5,] 0 0 0 1 1 0 0
# [6,] 0 3 2 2 0 0 0
# [7,] 1 3 0 1 1 0 0
# [8,] 1 2 0 1 1 1 0
# [9,] 0 0 0 0 0 0 0
# [10,] 0 0 0 1 2 0 0
## put the results into the data frame based on the names
df[colnames(results)] = data.frame(results)
Since we rely on str_count which is vectorized, this should be much faster than the row-by-row approach.
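If you would rather keep the punctuation and avoid an extra package, a base R variant is sketched below. This is a different technique from the str_count approach above: it counts matches with gregexpr() and regmatches(), and uses a Perl lookahead so that the "I" in "I'm" is not also counted as a standalone "I". The text vector is a small subset of the example data, and count_matches, patterns and counts are names I made up.

```r
text <- c("I'm alright", "I am a good person.", "No, I'm not in a mood")

# Named patterns; (?!') means "not followed by an apostrophe"
patterns <- c(I = "\\bI\\b(?!')", IM = "\\bI'm\\b", A = "\\ba\\b")

# Count case-insensitive matches of one pattern in every element of x
count_matches <- function(pattern, x) {
  hits <- gregexpr(pattern, x, ignore.case = TRUE, perl = TRUE)
  lengths(regmatches(x, hits))
}

# One column per pattern, one row per text
counts <- sapply(patterns, count_matches, x = text)
print(counts)
```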
I was able to get my code working by putting the expression in double quotes:
word_list = c('\\bI\\b',"\\bI'm\\b",'\\bthe\\b','\\band\\b','\\bto\\b','\\ba\\b','\\bof\\b')

Replace value in one data table based on NA in another data table

I've got genotyping data from several overlapping NPs/individuals which I am attempting to compare.
As you can see in the data below, e[1,2] and e[2,3] are NA. I want to replace the corresponding entries d[1,2] and d[2,3] (both currently 1) with NA.
d <- structure(list(`100099681` = c(0L, 2L, 0L), `101666591` = c(1L, 1L, 0L), `102247652` = c(1L, 1L, 1L), `102284616` = c(0L, 1L, 0L), `103582612` = c(0L, 1L, 1L), `104344528` = c(2L, 1L, 0L), `105729734` = c(1L, 0L, 1L), `109897137` = c(0L, 0L, 2L), `112768301` = c(0L, 1L, 1L), `114724443` = c(1L, 1L, 1L), `114826164` = c(1L, 0L, 1L), `115358770` = c(0L, 2L, 0L), `115399788` = c(1L, 1L, 0L), `118669033` = c(0L, 1L, 1L), `118875482` = c(2L, 1L, 0L), `119366362` = c(0L, 2L, 0L), `119627971` = c(0L, 1L, 1L), `120295351` = c(0L, 2L, 0L), `120998030` = c(0L, 0L, 2L)), .Names = c("100099681", "101666591", "102247652", "102284616", "103582612", "104344528", "105729734", "109897137", "112768301", "114724443", "114826164", "115358770", "115399788", "118669033", "118875482", "119366362", "119627971", "120295351", "120998030"), row.names = c("7:100038150_C", "7:100079759_T", "7:100256942_A"), class = "data.frame")
> d
# 100099681 101666591 102247652 102284616 103582612 104344528 105729734 109897137 112768301 114724443 114826164 115358770 115399788 118669033 118875482 119366362 119627971 120295351 120998030
#7:100038150_C 0 1 1 0 0 2 1 0 0 1 1 0 1 0 2 0 0 0 0
#7:100079759_T 2 1 1 1 1 1 0 0 1 1 0 2 1 1 1 2 1 2 0
#7:100256942_A 0 0 1 0 1 0 1 2 1 1 1 0 0 1 0 0 1 0 2
e<- structure(list(`100099681` = c(1L, 1L, 0L), `101666591` = c(NA, 1L, 1L), `102247652` = c(0L, NA, 0L), `102284616` = c(1L, 1L, 0L), `103582612` = c(1L, 0L, 1L), `104344528` = c(1L, 0L, 1L), `105729734` = c(0L, 0L, 1L), `109897137` = c(1L, 1L, 0L), `112768301` = c(0L, 1L, 1L), `114724443` = c(0L, 2L, 0L), `114826164` = c(0L, 0L, 2L), `115358770` = c(0L, 0L, 2L), `115399788` = c(0L, 2L, 0L), `118669033` = c(0L, 0L, 2L), `118875482` = c(0L, 1L, 1L), `119366362` = c(2L, 1L, 0L), `119627971` = c(0L, 1L, 1L), `120295351` = c(0L, 2L, 0L), `120998030` = c(0L, 2L, 1L)), .Names = c("100099681", "101666591", "102247652", "102284616", "103582612", "104344528", "105729734", "109897137", "112768301", "114724443", "114826164", "115358770", "115399788", "118669033", "118875482", "119366362", "119627971", "120295351", "120998030"), row.names = c("7:100038150_C", "7:100079759_T", "7:100256942_A"), class = "data.frame")
> e
# 100099681 101666591 102247652 102284616 103582612 104344528 105729734 109897137 112768301 114724443 114826164 115358770 115399788 118669033 118875482 119366362 119627971 120295351 120998030
#7:100038150_C 1 NA 0 1 1 1 0 1 0 0 0 0 0 0 0 2 0 0 0
#7:100079759_T 1 1 NA 1 0 0 0 1 1 2 0 0 2 0 1 1 1 2 2
#7:100256942_A 0 1 0 0 1 1 1 0 1 0 2 2 0 2 1 0 1 0 1
Thus my expected output would be
> expected_d
# 100099681 101666591 102247652 102284616 103582612 104344528 105729734 109897137 112768301 114724443 114826164 115358770 115399788 118669033 118875482 119366362 119627971 120295351 120998030
#7:100038150_C 0 NA 1 0 0 2 1 0 0 1 1 0 1 0 2 0 0 0 0
#7:100079759_T 2 1 NA 1 1 1 0 0 1 1 0 2 1 1 1 2 1 2 0
#7:100256942_A 0 0 1 0 1 0 1 2 1 1 1 0 0 1 0 0 1 0 2
I've gotten this far:
g <- which(is.na(e), arr.ind=TRUE)
> g
# row col
#7:100038150_C 1 2
#7:100079759_T 2 3
I then tried to use apply to replace those locations with "TEST" (or NA, for that matter), but this only returns the values and leaves the data frame unchanged:
apply(g, 1, function(x){
e[x[1], x[2]] <- "TEST" }
)
#> apply(g, 1, function(x){ e[x[1], x[2]] <- "TEST" })
#7:100038150_C 7:100079759_T
# "TEST" "TEST"
I will be running this bit of code over several million rows/columns so speed will be an issue.
Thank you in advance:)
We can try doing
NA^(is.na(e))*d
If memory is an issue
d[] <- Map(function(x,y) NA^(is.na(y))* x, d, e)
Another way based on your approach,
d[which(is.na(e), arr.ind = T)] <- NA
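As a small self-contained check of the indexing idea, the sketch below uses two made-up 3 x 3 data frames (not the genotype data from the question) and assigns through the logical mask is.na(e) directly. The apply attempt in the question has no effect because the assignment happens on a copy inside the function; assigning on the data frame itself avoids that. The name masked is my own.

```r
# Made-up stand-ins with the same shape of problem
d <- data.frame(a = c(0L, 2L, 0L), b = c(1L, 1L, 0L), c = c(1L, 1L, 1L))
e <- data.frame(a = c(1L, 1L, 0L), b = c(NA, 1L, 1L), c = c(0L, NA, 0L))

# is.na(e) is a logical matrix; use it to blank out the same cells of d
masked <- d
masked[is.na(e)] <- NA
print(masked)
```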
