I'm an R novice, so I can't make a sample data frame for you, for which I apologize. I'm doing bacterial community analysis and I have a table with one column per species and one row per sample. Each column name is a species identifier, and the cells hold the abundance of that species in that sample. My goal is to identify the most abundant species (column) for each sample (row). I think a data frame that lists each sample (row) alongside the column identifier of its most abundant species would be the most useful!
Attempts so far (using the phyloseq package, although the solution doesn't have to use it):
beagle <- names(sort(taxa_sums(top.pdy.phyl),T, function(x) which(max==x)))
beagle2 <- names(taxa_sums(top.pdy.phyl),T, function(x) colnames(which.max(top.pdy.phyl))))
Any help would be appreciated! Thank you!
How about:
names(top.pdy.phyl)[ apply(top.pdy.phyl, 1, which.max) ]
and
apply(top.pdy.phyl, 1 , max)
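To get the one-row-per-sample data frame the question asks for, the two calls above can be combined. This is only a sketch on a toy table: `abund`, its species names and its values are made up for illustration, and it assumes the abundance table is a plain data frame (or matrix) with samples in rows rather than a phyloseq object.

```r
# Toy abundance table (hypothetical values): rows are samples, columns are species
abund <- data.frame(
  sp_A = c(10, 0, 3),
  sp_B = c(2, 7, 3),
  sp_C = c(1, 5, 9),
  row.names = c("sample1", "sample2", "sample3")
)

# Column index of the largest value in each row (the first column wins on ties)
top_idx <- apply(abund, 1, which.max)

# One row per sample: its most abundant species and that species' abundance
top_species <- data.frame(
  sample    = rownames(abund),
  species   = names(abund)[top_idx],
  abundance = apply(abund, 1, max)
)
top_species
```

Note that `which.max` reports only the first maximum, so samples with tied species would need separate handling.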
I have a large dataset, let's call it df1 (4226 observations X 186 variables).
I used a package called naniar to assess missingness, and created a dataset that shows, for each observation, the percentage of missing data. I then filtered that dataset to keep only the observations (rows) with less than 50% missing data. Finally, I created a dataset containing just the row numbers of all rows that meet the missingness criterion; we can call this df2.
Now, I want to create a subset of dataset df1 using the data in df2 (2044 observations X 1 variable).
Can anyone help me here?
I have tried something like:
df3 <- df2[df2$row %in% df1]
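The attempt above indexes df2 rather than df1 and has no comma to separate rows from columns. A minimal sketch of the subset, assuming df2 has a column called `row` (the name used in the attempt) that holds row positions in df1's original order; the toy data here are made up:

```r
# Toy stand-ins (hypothetical): df1 is the full data, df2 holds the row numbers to keep
df1 <- data.frame(id = 1:6, value = c(5, NA, 7, NA, 9, 11))
df2 <- data.frame(row = c(1, 3, 6))

# Index df1 by the stored row numbers; the trailing comma keeps all columns
df3 <- df1[df2$row, ]

# Equivalent logical form, closer to the %in% idea in the attempt
df3_alt <- df1[seq_len(nrow(df1)) %in% df2$row, ]
```

If df1 has been sorted or filtered since the row numbers were recorded, the positions will no longer line up, so the subset should be taken from the same version of df1 that naniar saw.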
I used modified Whittaker plots to measure the rate of species accumulation as I increased my search area (https://www.researchgate.net/figure/Diagram-of-modified-Whittaker-plot-and-subplot-establishment_fig1_11253141).
The tricky part is that I need to count only new occurrences of species as they are encountered within a given Whittaker plot site. I don't want to just count the total number of species in each subplot and then add them up, because I would be multi-counting common species that occur in more than one subplot.
I've tried looking around for anything similar, and while I can find information on simulating species-area curves, I can't find anything that helps me use real data (https://www.r-bloggers.com/2012/08/r-for-ecologists-simulating-species-area-curves-linear-vs-nonlinear-regression/).
I'm aware of the "specaccum" command in the vegan package, but am not sure how I would use it here.
Ultimately I need to go from a binary community matrix of species occurrences to a single column dependent variable of species richness that adds only occurrences of never-before-seen species to the total from the prior row. Hope that made sense!
Data can be found here, access set to anyone with the link: https://docs.google.com/spreadsheets/d/1gwyhgLNvTt1yriLX9qeMtsJnI5t540o-fmwzDUKGEAg/edit?usp=sharing
In my code I used a .csv file. Unfortunately, all I've been able to figure out is getting my data into a binary occurrence matrix and calculating individual species richness for each subplot. I can't figure out the code logic to tell R to proceed from the top row down, adding only new species occurrences to the richness column rather than just adding up the total number of species in a given subplot as it does now.
Code:
#Load full whitaker dataset
whitakern <- read.csv("WhitakerPlotsKern2019.csv")
#remove i.. from 1st column name... this may or may not be necessary with downloaded Google Sheets data
colnames(whitakern)[1] <- gsub('^...','',colnames(whitakern)[1])
#Plot # 9 was untreated/invaded and was the only replicate of its type; it must be removed
whitakern <- subset(whitakern, SiteType!="NoTrt")
#For species richness and abundance datasets to be combined in the same spreadsheet,
#richness hits were recorded as '.' rather than '1'. So we need to convert these back to usable binary counts:
whitakern[whitakern == '.'] <- '1'
#separate out count data from treatment variables
spac.spp <- whitakern[,-c(1:8)]
spac.trt <- whitakern[,c(1:8)]
#make sure count data is in correct (numeric) format
spac.spp[sapply(spac.spp, is.character)] <-lapply(spac.spp[sapply(spac.spp, is.character)], as.numeric)
#now we need to convert abundance counts to richness counts.
#Any abundance hit >1 needs to be replaced with 1 so we can calculate species richness rather than relative abundance
#(this step has to come after spac.spp has been created and converted to numeric)
spac.spp[spac.spp > 1] <- 1
spac.trt$Richness <- rowSums(spac.spp[,c(1:93)], na.rm=TRUE)
#remove columns that are not de facto plant species, and remove the "Hits" column
spac.trt <- spac.trt[,-c(8:13)]
spac.dat <- cbind(spac.trt, spac.spp)
write.csv(spac.dat, "Whitaker-Data.csv")
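One possible way to get the running richness described above is to take a cumulative maximum down each species column, so a species counts as "seen" from the first row it appears in, and then sum across columns. This is a sketch on a toy matrix, not tested against the linked dataset: `occ`, `site` and `cum_richness` are hypothetical names, and it assumes the rows are already in the order in which the subplots were surveyed.

```r
# Toy binary occurrence matrix (hypothetical): rows are subplots in survey order
occ <- matrix(
  c(1, 0, 0, 0,
    1, 1, 0, 0,
    0, 1, 0, 0,
    0, 0, 1, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("subplot", 1:4), paste0("sp", 1:4))
)
site <- c("A", "A", "B", "B")  # hypothetical site label for each row

cum_richness <- function(m) {
  m[is.na(m)] <- 0                       # treat missing cells as absences
  seen <- apply(m, 2, cummax)            # 1 from the first row a species appears in
  rowSums(matrix(seen, nrow = nrow(m)))  # running count of distinct species
}

# Restart the running count within each site
richness <- unsplit(
  lapply(split(seq_len(nrow(occ)), site),
         function(i) cum_richness(occ[i, , drop = FALSE])),
  site
)
richness
```

With the real data, `occ` would be the species columns (`spac.spp`) and `site` the site column from `spac.trt`. The vegan function `specaccum()` also has a `method = "collector"` option that adds sites in the order they appear in the data, which may be worth comparing against.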
I have an 11 x 8 data frame of numeric values in R and I want a single standard deviation across all of its values. However, sd() won't take the whole data frame, only individual columns, and I need every data value used. How do I turn this data frame into one column so that all values are used when finding the standard deviation? Hope this makes sense.
#generate data
df <- data.frame(matrix(rbinom(8*11, 1, .5), ncol=8))
#get sd
sd(unlist(df))
edit: just saw the comment where user fra got there first
For each of the 4 genes (each gene is in its own column), I need to test whether its mean expression is equal for patients with stable and progressive disease, and store the corresponding p-value. Can someone help me, please? The language is R.
Here is a picture of my data frame:
Suppose this is your data frame:
df = data.frame(y=sample(c("progres.","stable"),100,replace=TRUE),matrix(rnorm(100*4),ncol=4))
colnames(df)[-1] = c("X1000_at","X1001_at","X1002_at","X1003_at")
If you just need the p.value, you can do:
apply(df[,-1],2,function(i)t.test(i ~ df$y)[["p.value"]])
X1000_at X1001_at X1002_at X1003_at
0.14861795 0.11653459 0.01820033 0.41873270
In the above, you iterate through the gene columns, t.test between groups demarcated by the y column and capture only the p value.
I would like to subset a data frame by group means. I want to subset all data values greater than the group mean. The code I have tried is:
data<-read.csv("TreeData.csv")
library(plyr)
#Calculating the group means
MDBH<-ddply(data, .(PLTPA),summarise, MDBH=mean(DBH))
MDBH
dataDHT<-subset(data,DBH>MDBH)
#The subset data is incorrect: it excluded some values greater than the mean
#and included some values less than the mean.
dataDHT
The data set I created for this problem is at:
https://www.dropbox.com/s/ejnjhg4ogk2g4rw/TreeData.csv?dl=0
Thank you in advance for the help.
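A likely cause is that `DBH > MDBH` compares each tree against the whole summary table instead of against the mean of its own group. One way around this, sketched here on made-up values rather than the linked file, is to use `ave()` to attach each row's group mean before subsetting. The column names `PLTPA` and `DBH` come from the code above; the toy data are hypothetical.

```r
# Toy stand-in for TreeData.csv (hypothetical values), using the PLTPA and DBH columns
data <- data.frame(
  PLTPA = c("p1", "p1", "p1", "p2", "p2", "p2"),
  DBH   = c(10, 20, 30, 2, 4, 9)
)

# ave() returns each row's group mean (mean is its default), aligned with the rows of data
data$MDBH <- ave(data$DBH, data$PLTPA)

# Keep the trees whose DBH is above the mean of their own plot
dataDHT <- subset(data, DBH > MDBH)
dataDHT
```

Because the group mean is stored as a column with one value per row, the comparison inside `subset()` is made row by row.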