Aggregating columns according to their names in R

French student here, so my English is not that great, sorry.
We transformed a data set of species and their locations into a data set of the origins of these species and their locations.
The data set has about 600 columns named U, A, W, L or E (species origin), each containing a 0 or 1 (presence / absence of a species at the location),
and 2 columns with coordinates (of the data collection station).
It has more than 8000 rows, one for each station where data was found.
A simplified version of the data set would look like this:
[Longitude] [Latitude] [A][U][U][L][E][A][U] ... [+600]
[1,] -5.89 35.71 0 0 1 0 0 1 1
[2,] -5.89 35.81 0 1 0 0 0 0 1
[3,] -5.89 36.01 1 0 0 1 1 1 0
[4,] -5.89 36.1 0 0 0 1 0 1 0
[1,] -5.89 36.21 1 1 1 0 0 1 1
[2,] -5.79 35.81 1 1 0 1 0 1 0
[3,] -5.79 35.91 0 1 0 0 0 0 1
[4,] -5.79 36.01 1 1 0 1 0 1 0
[+8000]
What we want is some sort of conditional sum, where all columns of the same origin are grouped into one column and their contents summed, like so:
[Longitude] [Latitude] [A][U][L][W][E]
[1,] -5.89 35.71 12 6 5 0 13
[2,] -5.89 35.81 5 1 8 10 20
[3,] -5.89 36.01 1 28 3 6 2
[4,] -5.89 36.1 4 25 0 1 11
[1,] -5.89 36.21 9 1 9 3 5
[2,] -5.79 35.81 6 5 12 1 8
[3,] -5.79 35.91 5 2 7 15 10
[4,] -5.79 36.01 10 3 5 12 4
[+8000]
Only the A, U, L, E and W columns must be summed.
Longitude, Latitude and the number of rows must be kept the same.
We tried aggregate and tapply, without success, but maybe a loop is needed...
Any ideas?
Thanks
MacOS's function with our names (especeOri in place of df, espagg in place of df.agg):
espagg <- aggregate(. ~ Longitude + Latitude,
especeOri,
FUN = sum)
aggregate.columns <- function(especeOri, column.names)
{
for (column.name in column.names) {
especeOri[[column.name]] <- rowSums(subset(espagg, select = grep(paste(column.name, ".*", sep = ""), colnames(especeOri))))
}
return(especeOri)
}
aggregate.column.names <- c("A", "U", "L", "E", "W")
espagg <- aggregate.columns(espagg, aggregate.column.names)
espagg <- subset(especeOri, select = c("Longitude", "Latitude", aggregate.column.names))
View(espagg)
dput of the data set
dput(especeOri[1:10,1:20])
structure(list(Longitude = c(-5.89, -5.89, -5.89, -5.89, -5.89,
-5.79, -5.79, -5.79, -5.79, -5.69), Latitude = c(35.71, 35.81,
36.01, 36.11, 36.21, 35.81, 35.91, 36.01, 36.11, 35.81), L = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), U = c(0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L), U.1 = c(0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
0L, 1L), A = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), U.2 = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), E = c(0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L), U.3 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L), E.1 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), U.4 = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), U.5 = c(0L, 0L, 0L, 0L,
1L, 0L, 0L, 0L, 0L, 0L), U.6 = c(1L, 0L, 0L, 0L, 0L, 1L, 0L,
0L, 0L, 1L), L.1 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
U.7 = c(0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L), U.8 = c(0L,
0L, 0L, 1L, 1L, 0L, 0L, 0L, 1L, 0L), U.9 = c(1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L), U.10 = c(1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L), A.1 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L), U.11 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L)), row.names = c(NA, 10L), class = "data.frame")
google drive with all the data sets, a few explanations and our script.
https://drive.google.com/drive/folders/1fnWnZZDC3gyWTtSoqi_l7Wuhje5qpJmL?usp=sharing
EDIT : added some values for longitude and latitude to illustrate and a screenshot

Here is a tidyverse solution using the data you provided.
library(dplyr)
library(tidyr)
fish <- read.table("Data_fish.txt", header = TRUE)
traits <- read.table("Data_traits.txt", header = TRUE)
fish %>%
  pivot_longer(-c(ID_cellule, Longitude, Latitude), names_to = "Species", values_to = "Occur") %>%
  mutate(ID_cellule = factor(ID_cellule, levels = unique(ID_cellule))) %>% # use factor to fix the display order as-is
  left_join(traits %>% select(Species, Origin), by = "Species") %>%
  group_by(ID_cellule, Longitude, Latitude, Origin) %>%
  summarise(Occur = sum(Occur)) %>%
  pivot_wider(names_from = "Origin", values_from = "Occur")
Output
# A tibble: 8,154 x 8
# Groups: ID_cellule, Longitude, Latitude [8,154]
ID_cellule Longitude Latitude A E L U W
<fct> <dbl> <dbl> <int> <int> <int> <int> <int>
1 ID1 -5.89 35.7 8 10 0 178 0
2 ID2 -5.89 35.8 11 10 0 234 0
3 ID3 -5.89 36.0 9 11 0 195 0
4 ID4 -5.89 36.1 12 10 0 227 0
5 ID5 -5.89 36.2 13 17 0 268 0
6 ID6 -5.79 35.8 9 8 0 205 0
7 ID7 -5.79 35.9 8 9 0 168 0
8 ID8 -5.79 36.0 11 14 0 262 0
9 ID9 -5.79 36.1 10 10 0 193 0
10 ID10 -5.69 35.8 9 10 0 230 0

The following should do the job.
df <- data.frame(Longitude = c(-5.89, -5.89, -5.89, -5.89, -5.89, -5.79, -5.79, -5.79, -5.89, -5.89),
Latitude = c(35.71, 35.81, 36.01, 36.1, 36.21, 35.81, 35.91, 36.01, 35.71, 35.81),
A = c(0, 0, 1, 0, 1, 1, 0, 1, 1, 1),
U = c(0, 1, 0, 0, 1, 1, 1, 1, 1, 1),
U = c(1, 0, 0, 0, 1, 0, 0, 0, 1, 1),
L = c(0, 0, 1, 1, 0, 1, 0, 1, 1, 1),
E = c(0, 0, 1, 0, 0, 0, 0, 0, 1, 1),
A = c(1, 0, 1, 1, 1, 1, 0, 1, 1, 1),
U = c(1, 1, 0, 0, 1, 0, 1, 0, 1, 1))
df.agg <- aggregate(. ~ Longitude + Latitude,
df,
FUN = sum)
# Anchor the patterns so that "L" does not also match Longitude and Latitude
df.agg$A <- rowSums(subset(df.agg, select = grep("^A(\\.[0-9]+)?$", colnames(df.agg))))
df.agg$U <- rowSums(subset(df.agg, select = grep("^U(\\.[0-9]+)?$", colnames(df.agg))))
df.agg$L <- rowSums(subset(df.agg, select = grep("^L(\\.[0-9]+)?$", colnames(df.agg))))
df.agg$E <- rowSums(subset(df.agg, select = grep("^E(\\.[0-9]+)?$", colnames(df.agg))))
df.agg <- subset(df.agg, select = c(Longitude, Latitude, A, U, L, E))
Update
The OP asked for a solution that does not require writing out the rowSums code for each origin explicitly, because with this many columns that is inconvenient. The following should do the job.
df <- structure(list(Longitude = c(-5.89, -5.89, -5.89, -5.89, -5.89,
-5.79, -5.79, -5.79, -5.79, -5.69), Latitude = c(35.71, 35.81,
36.01, 36.11, 36.21, 35.81, 35.91, 36.01, 36.11, 35.81), L = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), U = c(0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L), U.1 = c(0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
0L, 1L), A = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), U.2 = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), E = c(0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L), U.3 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L), E.1 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), U.4 = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), U.5 = c(0L, 0L, 0L, 0L,
1L, 0L, 0L, 0L, 0L, 0L), U.6 = c(1L, 0L, 0L, 0L, 0L, 1L, 0L,
0L, 0L, 1L), L.1 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
U.7 = c(0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L), U.8 = c(0L,
0L, 0L, 1L, 1L, 0L, 0L, 0L, 1L, 0L), U.9 = c(1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L), U.10 = c(1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L), A.1 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L), U.11 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L)), row.names = c(NA, 10L), class = "data.frame")
df.agg <- aggregate(. ~ Longitude + Latitude,
df,
FUN = sum)
# This function aggregates rows if their column names have the same start.
# Suppose we have a data frame with column names A, A.1, and A.2. Then,
# the rows of these columns are aggregated using sum. So,
# A 1 1 0
# A.1 2 1 0
# A.2 0 0 1
# becomes
# A 3 2 1
aggregate.columns <- function(df.my, column.names)
{
  for (column.name in column.names) {
    # Columns named like "A.1", "A.2", ...: prefix, literal dot, digits
    suffixed <- grep(paste0("^", column.name, "\\.[0-9]+$"), colnames(df.my))
    df.my[[column.name]] <- df.my[[column.name]] +
      rowSums(df.my[, suffixed, drop = FALSE])
  }
  return(df.my)
}
aggregate.column.names <- c("A", "U", "L", "E")
df.agg <- aggregate.columns(df.agg, aggregate.column.names)
df.agg <- subset(df.agg, select = c("Longitude", "Latitude", aggregate.column.names))
df.agg
The key to making this work is this line.
grep(paste0("^", column.name, "\\.[0-9]+$"), colnames(df.my))
It returns the positions of all columns whose names start with the current value of the variable column.name followed by a dot and a sequence of digits, e.g. when the value of column.name is A, then A.1, A.345, A.67, A.9798 and A.111111 are all matched. Please check!
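To check what a pattern of this kind selects, it can be run against the column names from the dput above with value = TRUE. This is only a sketch: the anchored pattern with an escaped dot is a stricter variant, and the cols vector is copied by hand from the sample data.

```r
# Column names copied from the dput(especeOri[1:10, 1:20]) sample
cols <- c("Longitude", "Latitude", "L", "U", "U.1", "A", "U.2", "E", "U.3",
          "E.1", "U.4", "U.5", "U.6", "L.1", "U.7", "U.8", "U.9", "U.10",
          "A.1", "U.11")

# Anchored pattern: prefix, a literal dot, then digits only
u_suffixed <- grep("^U\\.[0-9]+$", cols, value = TRUE)
l_suffixed <- grep("^L\\.[0-9]+$", cols, value = TRUE)

# Unanchored pattern: also picks up Longitude and Latitude
l_loose <- grep("L.*", cols, value = TRUE)

print(u_suffixed)
print(l_suffixed)
print(l_loose)
```

The last call shows why an unanchored pattern is risky for the L origin: the coordinate columns would be summed along with the species columns.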
Update 3
After the OP provided the data, I came up with the following. It includes a function for renaming, which is necessary because the data frame has columns with identical names. For example, this function transforms a sequence of column names A, A, A, A into A, A.1, A.2, A.3.
climate <- read.table("Data_climate.txt", header = TRUE)
poissons <- read.table("Data_fish.txt", header = TRUE)
traitsNA <- read.table("Data_traits.txt", header = TRUE)
especes <- poissons [,-2]
especes2 <- especes [,-2]
especes3 <- especes2 [,-1]
colnames(especes3) <- traitsNA$Origin
especes44<-cbind(climate$Latitude,especes3)
especeOri <- cbind(climate$Longitude,especes44)
origine <- cbind(climate$ID_cellule,especeOri)
colnames(origine)[1] <- "ID_cellule"
colnames(origine)[2] <- "Longitude"
colnames(origine)[3] <- "Latitude"
colnames(especeOri)[1] <- "Longitude"
colnames(especeOri)[2] <- "Latitude"
rename.columns <- function(df)
{
unique.column.names <- unique(colnames(df))
for (unique.column.name in unique.column.names)
{
idxs.columns <- which(colnames(df) == unique.column.name)
df.tmp.with.new.col.names <- subset(df, select = idxs.columns)
colnames(df)[idxs.columns] <- colnames(df.tmp.with.new.col.names)
}
return(df)
}
especeOri <- rename.columns(especeOri)
espagg <- aggregate(. ~ Longitude + Latitude,
especeOri,
FUN = sum)
# This function aggregates rows if their column names have the same start.
# Suppose we have a data frame with column names A, A.1, and A.2. Then,
# the rows of these columns are aggregated using sum. So,
# A 1 1 0
# A.1 2 1 0
# A.2 0 0 1
# becomes
# A 3 2 1
aggregate.columns <- function(df.my, column.names)
{
  for (column.name in column.names) {
    # Columns named like "A.1", "A.2", ...: prefix, literal dot, digits
    suffixed <- grep(paste0("^", column.name, "\\.[0-9]+$"), colnames(df.my))
    df.my[[column.name]] <- df.my[[column.name]] +
      rowSums(df.my[, suffixed, drop = FALSE])
  }
  return(df.my)
}
aggregate.column.names <- c("A", "U", "L", "E", "W")
espagg <- aggregate.columns(espagg, aggregate.column.names)
# Keep the coordinates and the aggregated origin columns
espagg <- subset(espagg, select = c("Longitude", "Latitude", aggregate.column.names))
HTH!
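If the rows have to stay exactly as they are (no grouping by coordinates), the same per-origin sums can probably be obtained without aggregate(). The following is a minimal base R sketch on made-up toy data, not the real data set: make.unique() produces the A, A.1, A.2 style names, and split.default() groups the columns by the name with the numeric suffix stripped.

```r
# Made-up toy data with duplicated origin names, as in the question
toy <- data.frame(
  Longitude = c(-5.89, -5.89, -5.79),
  Latitude  = c(35.71, 35.81, 35.81),
  A = c(0, 0, 1), U = c(0, 1, 1), U = c(1, 0, 0),
  L = c(0, 0, 1), E = c(0, 0, 0), A = c(1, 0, 1), U = c(1, 1, 0),
  check.names = FALSE
)

# Make the names unique: A, U, U.1, L, E, A.1, U.2
names(toy) <- make.unique(names(toy))

# Species columns and the origin each one belongs to
species_cols <- setdiff(names(toy), c("Longitude", "Latitude"))
origin <- sub("\\.[0-9]+$", "", species_cols)

# Split the columns by origin and sum each group row by row
sums <- sapply(split.default(toy[species_cols], origin), rowSums)

# One row per station, one column per origin
result <- cbind(toy[c("Longitude", "Latitude")], sums)
print(result)
```

The number and order of rows are untouched here, since nothing is grouped by Longitude and Latitude.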

Related

Non-overlapping sliding window based on index

For a data.frame, df, which has an index column and a value column, I would like to calculate, e.g., the mean of the values in non-overlapping sliding windows, with the window size based on the units in the index column (for example, windows which cover 10 units in the index).
There is runner::runner and slider::slide_index which allow you to slide in windows based on an index column, but I don't see a way to make the windows non-overlapping.
df = structure(list(V3 = c(17054720L, 17075353L, 17087656L, 17099107L,
17152611L, 17154984L, 17178213L, 17256231L, 17264565L, 17280822L,
17281931L, 17285949L, 17289118L, 17294251L, 17301217L, 17301843L,
17304246L, 17304887L, 17306104L, 17310741L, 17312596L, 17315102L,
17315503L, 17317233L, 17318150L, 17319156L, 17326181L, 17326432L,
17394989L, 17395610L, 17396612L, 17397875L, 17398508L, 17398800L,
17398812L, 17399211L, 17405173L, 17407349L, 17407566L, 17409897L,
17410373L, 17412216L, 17412806L, 17414103L, 17414640L, 17415572L,
17426401L, 17427037L, 17429384L, 17429434L, 17433210L, 17434084L,
17436846L, 17441524L, 17442154L, 17443131L, 17445502L, 17446157L,
17446914L, 17450515L, 17452966L, 17462185L, 17467411L, 17467684L,
17470779L, 17475921L, 17488195L, 17489577L, 17489890L, 17490932L,
17492203L, 17492452L, 17493792L, 17494101L, 17494547L, 17524203L,
17525584L, 17525970L, 17529814L, 17541673L, 17545859L, 17557144L,
17567699L, 17575800L, 17580394L, 17580813L, 17585441L, 17586471L,
17587680L, 17587975L, 17589209L, 17589246L, 17593685L, 17594915L,
17597462L, 17599844L, 17603801L, 17605824L, 17611515L, 17615213L
), V1 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L,
0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 1L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 1L)), row.names = c(NA, -100L), class = "data.frame")
What about something like this:
library(dplyr)
df <- data.frame(
  index = 0:99,
  value = 1:100)
df %>%
  mutate(window = floor(index/10)) %>%
  group_by(window) %>%
  summarise(value = mean(value),
            n = n())
# # A tibble: 10 × 3
# window value n
# <dbl> <dbl> <int>
# 1 0 5.5 10
# 2 1 15.5 10
# 3 2 25.5 10
# 4 3 35.5 10
# 5 4 45.5 10
# 6 5 55.5 10
# 7 6 65.5 10
# 8 7 75.5 10
# 9 8 85.5 10
# 10 9 95.5 10
In the answer above, the index is divided by the window width and wrapped in floor(), which rounds the result down to the nearest integer, so every observation gets the number of the window it falls in. This assumes that the index values are consecutive integers. If the index is not sequential, an alternative is to make it so, as follows:
df <- data.frame(
index = sample(0:1000, 100, replace=FALSE),
value = 1:100)
df %>%
arrange(index) %>%
mutate(obs = seq_along(index)-1,
window = floor(obs/10)) %>%
group_by(window) %>%
summarise(value = mean(value),
n = n())
# A tibble: 10 × 3
# window value n
# <dbl> <dbl> <int>
# 1 0 38.2 10
# 2 1 50.1 10
# 3 2 63.6 10
# 4 3 64.9 10
# 5 4 44 10
# 6 5 41.5 10
# 7 6 65.4 10
# 8 7 45.1 10
# 9 8 48.9 10
# 10 9 43.3 10
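If the windows are instead meant to span a fixed number of index units, whatever number of rows falls in each, integer division of the index itself should also do on a non-consecutive index. A minimal base R sketch, assuming made-up data and a window width of 10 units (the names toy and width are only for illustration):

```r
# Made-up data with gaps in the index
toy <- data.frame(
  index = c(3, 7, 12, 18, 19, 34, 35, 51),
  value = c(1, 2, 3, 4, 5, 6, 7, 8)
)
width <- 10

# Window number = how many full widths fit below the index value
toy$window <- toy$index %/% width

# Mean of the values within each window
res <- aggregate(value ~ window, data = toy, FUN = mean)
print(res)
```

Windows that contain no observations (here 2 and 4) simply do not appear in the result.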

How to properly convert dataframe from integer to numeric in R?

I have an abundance dataframe imported as a csv with column and row headers. When imported, running str(data) shows each column as int. I can't use vegan's specaccum unless the data is numeric. After converting my dataframe to numeric it still produces the following error:
Error in colSums(x) : 'x' must be numeric
My sample and code:
Structure of my dataframe before any conversion:
> str(data)
'data.frame': 180 obs. of 727 variables:
$ Sample : Factor w/ 180 levels "Sample1","Sample2",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Abrostola : int 0 0 0 0 0 0 0 0 0 0 ...
$ Abrus : int 0 0 1 0 0 0 0 0 0 0 ...
$ Acanthamoeba : int 0 0 0 0 0 0 0 0 0 0 ...
$ Acanthopagrus : int 0 0 0 0 0 0 1 0 0 0 ...
$ Acetilactobacillus : int 0 1 0 0 0 0 0 0 0 0 ...
$ Acetobacter : int 0 0 0 0 0 0 0 0 0 0 ...
Then:
data2 <- data[-1] ## to get rid of factor column
data2 <- lapply(data2, as.numeric)
> str(data2)
List of 726
$ Abrostola : num [1:180] 0 0 0 0 0 0 0 0 0 0 ...
$ Abrus : num [1:180] 0 0 1 0 0 0 0 0 0 0 ...
$ Acanthamoeba : num [1:180] 0 0 0 0 0 0 0 0 0 0 ...
$ Acanthopagrus : num [1:180] 0 0 0 0 0 0 1 0 0 0 ...
$ Acetilactobacillus : num [1:180] 0 1 0 0 0 0 0 0 0 0 ...
$ Acetobacter : num [1:180] 0 0 0 0 0 0 0 0 0 0 ...
$ Achromobacter : num [1:180] 0 0 0 0 0 0 0 0 0 0 ...
$ Acinetobacter : num [1:180] 0 0 0 0 0 0 0 0 0 0 ...
Next I tried running the very basic vegan command:
mycurve <- specaccum(comm = data2, method = "random", permutations = "1000")
But it gives the same error. I don't get it - my df is clearly converted to numeric so what is the issue??
EDIT
Prior to fixing my dataframe into numeric, I was using:
mycurve <- specaccum(comm = data[-1], method = "random", permutations = "1000") ## without prior removal of factor column
But it was giving the following error:
Error in nperm + EPS : non-numeric argument to binary operator
$ Acetobacter : num [1:180] 0 0 0 0 0 0 0 0 0 0 ...
I am not sure why it is targeting this particular column, it looks exactly the same as everything else. No columns are empty (i.e. no column sums equal 0 as I thought that would be causing an issue). I checked for weird symbols/whitespace - the columns do not have anything out of the ordinary. There are no empty cells either with "NA".
Output of dput(head(data)) but due to body limit in this post I had to truncate the output.
structure(list(Sample = structure(1:6, .Label = c("Sample1", "Sample2",
"Sample3", "Sample4", "Sample5", "Sample6", "Sample180"), class = "factor"), Abrostola = c(0L,
0L, 0L, 0L, 0L, 0L), Abrus = c(0L, 0L, 1L, 0L, 0L, 0L), Acanthamoeba = c(0L,
0L, 0L, 0L, 0L, 0L), Acanthopagrus = c(0L, 0L, 0L, 0L, 0L, 0L
), Acetilactobacillus = c(0L, 1L, 0L, 0L, 0L, 0L), Acetobacter = c(0L,
0L, 0L, 0L, 0L, 0L), Achromobacter = c(0L, 0L, 0L, 0L, 0L, 0L
), Acinetobacter = c(0L, 0L, 0L, 0L, 0L, 0L), Acipenser = c(0L,
0L, 0L, 0L, 0L, 0L), Acomys = c(0L, 0L, 0L, 0L, 0L, 0L), Acremonium = c(0L,
0L, 0L, 0L, 0L, 0L), Acromyrmex = c(0L, 0L, 0L, 0L, 0L, 0L),
Acropora = c(0L, 0L, 0L, 0L, 0L, 0L), Actinidia = c(0L, 0L,
0L, 0L, 0L, 0L), Actinobacillus = c(0L, 0L, 0L, 0L, 0L, 0L
), Acyrthosiphon = c(0L, 0L, 0L, 0L, 0L, 0L), Acytostelium = c(0L,
1L, 0L, 0L, 0L, 0L), Aedes = c(0L, 0L, 0L, 0L, 0L, 0L), Aegilops = c(0L,
0L, 0L, 0L, 0L, 0L), Aeromonas = c(0L, 0L, 0L, 0L, 5L, 0L
), Ageratum = c(0L, 0L, 0L, 0L, 0L, 0L), Aggregatibacter = c(0L,
0L, 0L, 0L, 0L, 0L), Albugo = c(0L, 0L, 0L, 0L, 0L, 0L),
Alcaligenes = c(0L, 0L, 0L, 0L, 0L, 0L), Alcanivorax = c(0L,
0L, 0L, 0L, 0L, 0L), Allygidius = c(0L, 0L, 0L, 0L, 0L, 0L
), Amblyraja = c(0L, 0L, 0L, 0L, 0L, 0L), Amoebogregarina = c(0L,
0L, 1L, 1L, 0L, 0L), Amphidinium = c(0L, 0L, 0L, 0L, 0L,
0L), Amphiprion = c(0L, 0L, 0L, 0L, 0L, 0L), Amphipyra = c(0L,
1L, 1L, 1L, 0L, 1L), Amycolatopsis = c(0L, 0L, 0L, 0L, 0L,
0L), Ananas = c(1L, 1L, 1L, 1L, 0L, 0L), Anas = c(0L, 0L,
0L, 0L, 0L, 0L), Andhravirus = c(0L, 0L, 0L, 0L, 0L, 0L),
Andrena = c(0L, 0L, 0L, 0L, 0L, 0L), Anolis = c(0L, 0L, 0L,
0L, 0L, 0L), Anopheles = c(0L, 1L, 0L, 0L, 0L, 0L), Anoplophora = c(0L,
0L, 0L, 0L, 0L, 0L), Anoxybacillus = c(0L, 0L, 0L, 0L, 0L,
0L), Anthocharis = c(0L, 1L, 2L, 1L, 0L, 1L), Aphanomyces = c(0L,
0L, 0L, 1L, 0L, 0L), Aphyllon = c(0L, 0L, 0L, 0L, 0L, 0L),
Apilactobacillus = c(2L, 0L, 0L, 0L, 0L, 0L), Apotomis = c(0L,
0L, 0L, 0L, 0L, 0L), Apteryx = c(0L, 0L, 0L, 0L, 0L, 0L),
Aquila = c(0L, 0L, 0L, 0L, 0L, 0L), Arabidopsis = c(0L, 0L,
0L, 0L, 0L, 0L), Arabis = c(0L, 0L, 0L, 0L, 0L, 0L), Arachis = c(0L,
0L, 0L, 0L, 0L, 0L), Arctia = c(0L, 0L, 0L, 0L, 0L, 0L)), row.names = c(NA, 6L), class = "data.frame")
SOLVED
The code I was using for mycurve worked with other method options even with permutations quoted, but with method = "random" it only worked once I removed the quotes around the permutations value. I did not even need to convert the data to numeric.
mycurve <- specaccum(comm = data[-1], method = "random", permutations = 1000) ## REMOVED QUOTES IN PERMUTATIONS and used with INT dataframe
According to ?specaccum
permutations - Number of permutations with method = "random". Usually an integer giving the number permutations, but can also be a list of control values for the permutations as returned by the function how, or a permutation matrix where each row gives the permuted indices.
specaccum(comm = data[-1], method = "random", permutations = 1000)
-output
Species Accumulation Curve
Accumulation method: random, with 719 permutations
Call: specaccum(comm = data[-1], method = "random", permutations = 1000)
Sites 1.000000 2.000000 3.000000 4.000000 5.000000 6
Richness 3.502086 5.731572 7.297636 8.598053 9.831711 11
sd 1.894776 1.483049 1.228892 1.143491 0.897720 0
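As a side note on the conversion attempt in the question: lapply() returns a plain list, which is why str(data2) printed "List of 726" rather than a data frame. If a numeric data frame is ever needed, assigning into data2[] keeps the data frame structure. A small sketch with made-up counts (only the column names are borrowed from the question):

```r
# Made-up stand-in for the abundance data (integer counts)
counts <- data.frame(Abrus = c(0L, 1L, 0L), Ananas = c(1L, 1L, 0L))

# lapply() alone gives a plain list, no longer a data frame
as_list <- lapply(counts, as.numeric)

# Assigning into kept[] replaces the columns but keeps the data frame
kept <- counts
kept[] <- lapply(counts, as.numeric)

str(as_list)
str(kept)
```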

How to replace NAs with row means if proportion of row-wise NAs is below a certain threshold?

Apologies for the somewhat cumbersome question, but I am currently working on a mental health study. For one of the mental health screening tools there are 15 variables, each of which can have values of 0-3. The total score for each row/participant is then assigned by taking the sum of these 15 variables. The documentation for this tool states that if more than 20% of the values for a particular row/participant are missing, the total score should be taken as missing also, however if fewer than 20% of the values for a row are missing, each missing value should be assigned the mean of the remaining values for that row.
I decided that to do this I would have to calculate the proportion of NAs for each participant, calculate the mean of all 15 variables excluding NAs for each participant, and then use a conditional mutate statement (or something similar) that checked if the proportion of NAs was less than 20% and if so replaced NAs for the relevant columns with the mean value for that row, before finding the sum of all 15 variables for each row. The dataset also contains other columns besides these 15, so applying a function to all of the columns would not be useful.
To calculate the mean score without NAs I did the following:
mental$somatic_mean <- rowMeans(mental [, c("var1", "var2", "var3",
"var4", "var5", "var6", "var7", "var8", "var9", "var10", "var11",
"var12","var13", "var14", "var15")], na.rm=TRUE)
And to calculate the proportion of NAs for each participant:
mental$somatic_na <- rowMeans(is.na(mental [, c("var1", "var2",
"var3", "var4", "var5", "var6", "var7", "var8", "var9", "var10", "var11",
"var12", "var13", "var14", "var15")]))
However, when I attempted the mutate() statement to alter the rows where fewer than 20% of values were NA, I could not find any code that works. I have tried a lot of permutations by this point, including the following for each variable:
mental_recode <- mental %>%
rowwise() %>%
mutate(var1 = if(somatic_na<0.2)
replace_na(list(var1= somatic_mean)))
Which returns:
"no applicable method for 'replace_na' applied to an object of class "list""
and attempting to do them all together without using mutate():
mental %>%
rowwise() %>%
if(somatic_na<0.2)
replace_na(list(var1 = somatic_mean, var2=
somatic_mean, var3 = somatic_mean, var4 = somatic_mean, var5 =
somatic_mean, var6 = somatic_mean, var7 = somatic_mean, var8 =
somatic_mean, var9 = somatic_mean, var10 = somatic_mean, var11 =
somatic_mean, var12 = somatic_mean, var13 = somatic_mean, var14 =
somatic_mean, var15 = somatic_mean ))
Which returns:
Error in if (.) somatic_na < 0.2 else replace_na(mental, list(var1 = somatic_mean, :
argument is not interpretable as logical
In addition: Warning message:
In if (.) somatic_na < 0.2 else replace_na(mental, list(var1 = somatic_mean, :
the condition has length > 1 and only the first element will be used
I also tried using if_else() in conjunction with mutate() and setting the value to NA if the condition was not met, but could not get that to work after various permutations and error messages either.
EDIT: Dummy data can be generated by the following:
mental <- structure(list(id = 1:21, var1 = c(0L, 0L, 1L, 1L, 1L, 0L, 0L,
NA, 0L, 0L, 0L, 0L, 0L, 0L, NA, 0L, 0L, 0L,
0L, 0L, 0L), var2 = c(0L,
0L, 1L, 1L, 1L, 0L, 0L, 2L, 1L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L,
2L, 0L, 1L, 1L), var3 = c(0L, 0L, 0L, 1L, 1L, 0L, 1L, 2L, 1L,
1L, 0L, 0L, 1L, 0L, 1L, 1L, 1L, 2L, 0L, 1L, 1L), var4 = c(1L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, NA, 0L, 0L, 0L,
0L, 1L, 0L, 0L), var5 = c(0L, 0L, 0L, 1L, NA, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), var6 = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L), var7 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, NA, 0L, 0L, 0L, 0L, 0L, NA, 0L), var8 = c(0L,
0L, 0L, 0L, NA, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L), var9 = c(0L, 0L, 0L, 0L, 1L, 0L, 0L, 1L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L), var10 = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, NA, 0L, 0L, 0L,
0L, 0L, NA, 0L), var11 = c(1L, 0L, 1L, 1L, 1L, 0L, 0L, 1L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, NA, 0L), var12 = c(1L,
0L, 1L, 1L, NA, 0L, 0L, NA, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L,
1L, 0L, 1L, 1L), var13 = c(1L, 0L, 1L, 0L, 1L, 0L, 1L, 1L, 1L,
0L, 1L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, NA, 0L), var14 = c(1L,
0L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 1L, 0L,
2L, 0L, 1L, 0L), var15 = c(1L, 0L, 2L, NA, NA, 0L, NA, 0L, 0L,
0L, 0L, 0L, NA, NA, 0L, NA, NA, NA, NA, NA, 0L)), .Names = c("id",
"var1", "var2", "var3", "var4", "var5", "var6", "var7", "var8",
"var9", "var10", "var11", "var12", "var13", "var14", "var15"), class =
"data.frame", row.names = c(NA,
-21L))
Does anyone know of code that would work for this sort of situation?
Thanks in advance!
Here is a way to do it all in one chain with dplyr, using your supplied data frame.
First create a vector of all column names of interest:
name_col <- colnames(mental)[2:16]
And now use dplyr
library(dplyr)
mental %>%
# First create the column of row means
mutate(somatic_mean = rowMeans(.[name_col], na.rm = TRUE)) %>%
# Now calculate the proportion of NAs
mutate(somatic_na = rowMeans(is.na(.[name_col]))) %>%
# Create this column for filtering out later
mutate(somatic_usable = ifelse(somatic_na < 0.2,
"yes", "no")) %>%
# Make the following replacement on a row basis
rowwise() %>%
  mutate(across(all_of(name_col), # Designate eligible columns to check for NAs
                ~ replace(.x,
                          is.na(.x) & somatic_na < 0.2, # Both conditions need to be met
                          somatic_mean))) %>% # What we are subbing the NAs with
ungroup() # Now ungroup the 'rowwise' in case you need to modify further
Now, if you wanted to only select the entries that have less than 20% NAs, you can pipe the above into the following:
filter(somatic_usable == "yes")
Also of note, if you wanted to instead make the condition less than or equal to 20%, you would need to replace the two somatic_na < 0.2 with somatic_na <= 0.2.
Hope this helps!
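For comparison, the same replacement can be sketched in base R by working on the item columns as a matrix. This is only an illustration under assumptions: impute_row_means and max_na_prop are made-up names, and the toy data has 10 items so that one missing value is 10% and three are 30%.

```r
# Replace NAs by the row mean, but only in rows whose NA share is below the cutoff
impute_row_means <- function(data, var_names, max_na_prop = 0.2) {
  m <- as.matrix(data[var_names])
  na_prop  <- rowMeans(is.na(m))           # share of missing items per row
  row_mean <- rowMeans(m, na.rm = TRUE)    # mean of the observed items per row
  fill <- is.na(m) & (na_prop < max_na_prop)
  m[fill] <- row_mean[row(m)[fill]]        # pick the mean of the matching row
  data[var_names] <- as.data.frame(m)
  data
}

# Made-up toy data: row 1 has 10% NAs, row 2 has 30%, row 3 has none
toy <- data.frame(
  id = 1:3,
  matrix(c(1, 1, 1, 1, 1, 1, 1, 1, 1, NA,
           NA, NA, NA, 2, 2, 2, 2, 2, 2, 2,
           0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
         nrow = 3, byrow = TRUE)
)

imputed <- impute_row_means(toy, paste0("X", 1:10))
print(imputed)
```

Rows at or above the cutoff keep their NAs, so a later rowSums() without na.rm returns NA for them.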
Here's a way using just base R, relying on the mathematical properties of sums and means:
# generate fake data
set.seed(123)
dat <- data.frame(
ID = 1:10,
matrix(sample(c(0:3, NA), 10 * 15, TRUE), nrow = 10, ncol = 15),
'another_var' = 'foo',
'second_var' = 'bar',
stringsAsFactors = FALSE
)
var_names <- paste0('X', 1:15)
# add number of NAs to data
dat$na_num <- rowSums(is.na(dat[var_names]))
# add row sum
dat$row_sum <- rowSums(dat[var_names], na.rm = TRUE)
# add row mean
dat$row_mean <- rowMeans(dat[var_names], na.rm = TRUE)
# add final sum
dat$final_sum <- dat$row_sum + dat$row_mean * dat$na_num
# recode final sum to be NA if prop > .2
dat$final_sum <- ifelse(rowMeans(is.na(dat[var_names])) > .2,
NA,
dat$final_sum)
Here's a function that does the same thing; you pass your data and a character vector of your variable names.
total_sum_calculation <- function(data, var_names){
# add number of NAs to data
na_num <- rowSums(is.na(data[var_names]))
# add row sum
row_sum <- rowSums(data[var_names], na.rm = TRUE)
# add row mean
row_mean <- rowMeans(data[var_names], na.rm = TRUE)
# add final sum
final_sum <- row_sum + row_mean * na_num
# recode final sum to be NA if prop > .2
ifelse(rowMeans(is.na(data[var_names])) > .2,
NA,
final_sum)
}
v_names <- paste0('var', 1:15)
total_sum_calculation(data = mental, var_names = v_names)
[1] 6.000000 0.000000 8.000000 7.500000 NA 0.000000 3.214286 9.230769 6.000000 2.000000 1.000000 0.000000 4.285714
[14] NA 5.357143 5.357143 5.357143 9.642857 1.071429 NA 3.000000
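The property being used is that filling each NA with the row mean and then summing gives the same total as row_sum + row_mean * na_num. A quick check on one made-up row (the values are invented purely for illustration):

```r
# One made-up row of item scores with a missing item
x <- c(1, 0, 2, NA, 1)

n_na     <- sum(is.na(x))
row_sum  <- sum(x, na.rm = TRUE)
row_mean <- mean(x, na.rm = TRUE)

# Total via the shortcut formula
via_formula <- row_sum + row_mean * n_na

# Total via explicit mean imputation
x_filled <- replace(x, is.na(x), row_mean)
via_imputation <- sum(x_filled)

print(c(via_formula, via_imputation))
```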

How do you run a chisq.test on two different dplyr outputs and then summarize it in a table?

The question I have is related to one I posted a while back here
Jaap was fantastic, in that he helped me create this wonderful output of summary tables of counts and frequencies (percentages) of categorical variables.
The "real data" that I'm analysing is from two different hospitals, each with a different frequency of drugs given, and the same drugs were not always given in both.
The summary from Jaap's func function from here looks as follows, and the whole data.frame is provided below (hospitals number one and two):
id AB1 AB2 AB3 AB4 AB5 AB6 AB7 AB8 AB9 AB10 AB11 AB12 AB13 total perc
1 1st gen Cephalosporin 4 0 0 1 1 0 0 0 0 0 0 0 0 6 1.9
2 3rd gen Cephalosporin 44 7 8 1 3 2 0 0 0 0 0 0 0 65 20.5
3 4th gen Cephalosporin 3 3 0 1 2 1 0 0 0 0 0 0 0 10 3.2
Now I'd like to run a chisq.test (or Fisher's test if a frequency is lower than 5) for each of the variables (names) found in the id column, using the total frequency found in the total column, comparing hospital one versus hospital two.
So in layman's terms I want to answer the following question: "Were 1st gen Cephalosporins given more frequently in hospital one compared to hospital two?" etc.
As some ids may not be identical between hospitals, I anticipate this may return a NULL calculation.
Ideally, I'd then like to summarize all of these findings in a table with the corresponding p-value, looking as follows:
id Hospital One Total Frequency Hospital Two Total Frequency p-value
xyz 15 30 0.01
Thank you very much for your help.
All the data can be found below.
Cheers
EDIT following Khashaa's issues raised:
This is simply a mock-output (ideally, what I would like to have).
id Hospital One Total Frequency Hospital Two Total Frequency p-value
xyz n i x.xx
As mentioned, the p-value should be derived from a chisq.test or fisher.test.
I gather the output will have to be generated somehow this way, with Hospital #1 called hosp1 and Hospital #2 called hosp2
# first take those columns of the dplyr output you're interested in
hosp1_sel<-hosp1[,c("id","total")]
hosp2_sel<-hosp2[,c("id","total")]
#then merge the data.frames to one so you can perform analysis on one dataframe
new_df <- merge(hosp1_sel, hosp2_sel, by=0)
#this looks like this
> new_df
Row.names id.x total.x id.y total.y
1 1 1st gen Cephalosporin 6 3rd gen Cephalosporin 19
2 10 Trimethoprim 2 Polypeptide 1
3 11 Ureidopenicillin 46 Rifamycin 1
4 12 Carbapenem 19 Tetracycline 1
5 13 Fluorquinolone 17 Lincosamide 1
6 14 Nitromidazole 12 Quinolone 2
7 15 Antifungal 6 Sulfonamides 2
8 16 Oxazolidinone 2 Nitroimidazole 1
9 17 Rifamycin 1 Polymyxine 1
10 18 Polypeptide 1 Colistin 1
11 2 3rd gen Cephalosporin 65 Carbapenem 37
12 3 4th gen Cephalosporin 10 Fluoroquinolone 24
13 4 Aminoglycoside 31 Glycopeptide 32
14 5 Clindamycin 2 Penicillin 29
15 6 Glycopeptide 55 Ureidopenicillin 36
16 7 Macrolide 3 Lipopeptide 4
17 8 Penicillin 36 Macrolid 2
18 9 Tetracycline 2 Aminoglycoside 9
This is where I get stuck.
In my mind I'd now have to make this data.frame wider, to then be able to run something like:
chisq.test(hosp1$Ureidopenicillin, hosp2$Ureidopenicillin)
To determine, if "Ureidopenicillins" were given more frequently in hospital #1 compared to hospital #2 and so forth.
The issue is, that this actually is comparing "counts" and not "proportions" from a contingency table though...
Any ideas?
O.
Hospital #1 data.frame:
structure(list(id = structure(1:19, .Label = c("1st gen Cephalosporin",
"3rd gen Cephalosporin", "4th gen Cephalosporin", "Aminoglycoside",
"Clindamycin", "Glycopeptide", "Macrolide", "Penicillin", "Tetracycline",
"Trimethoprim", "Ureidopenicillin", "Carbapenem", "Fluorquinolone",
"Nitromidazole", "Antifungal", "Oxazolidinone", "Rifamycin",
"Polypeptide", "Lipopeptide "), class = "factor"), AB1 = c(4L,
44L, 3L, 1L, 1L, 7L, 1L, 7L, 2L, 1L, 12L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L), AB2 = c(0L, 7L, 3L, 7L, 0L, 16L, 2L, 9L, 0L, 0L,
9L, 1L, 2L, 6L, 0L, 0L, 0L, 0L, 0L), AB3 = c(0L, 8L, 0L, 5L,
1L, 13L, 0L, 5L, 0L, 0L, 12L, 4L, 1L, 2L, 0L, 0L, 0L, 0L, 0L),
AB4 = c(1L, 1L, 1L, 6L, 0L, 5L, 0L, 8L, 0L, 0L, 5L, 3L, 4L,
1L, 1L, 1L, 1L, 0L, 0L), AB5 = c(1L, 3L, 2L, 2L, 0L, 4L,
0L, 1L, 0L, 0L, 2L, 4L, 1L, 1L, 2L, 0L, 0L, 0L, 0L), AB6 = c(0L,
2L, 1L, 3L, 0L, 5L, 0L, 1L, 0L, 0L, 2L, 1L, 1L, 2L, 1L, 0L,
0L, 0L, 0L), AB7 = c(0L, 0L, 0L, 1L, 0L, 0L, 0L, 3L, 0L,
0L, 2L, 2L, 2L, 0L, 0L, 1L, 0L, 1L, 0L), AB8 = c(0L, 0L,
0L, 3L, 0L, 1L, 0L, 0L, 0L, 0L, 2L, 1L, 1L, 0L, 0L, 0L, 0L,
0L, 1L), AB9 = c(0L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L,
0L, 2L, 2L, 0L, 0L, 0L, 0L, 0L, 0L), AB10 = c(0L, 0L, 0L,
1L, 0L, 2L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L,
0L), AB11 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 2L, 0L, 1L, 0L, 0L, 0L, 0L), AB12 = c(0L, 0L, 0L, 0L,
0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L
), AB13 = c(0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 1L, 0L, 0L, 0L, 0L), total = c(6, 65, 10, 31, 2,
55, 3, 36, 2, 2, 46, 19, 17, 12, 6, 2, 1, 1, 1), perc = c(1.9,
20.5, 3.2, 9.8, 0.6, 17.4, 0.9, 11.4, 0.6, 0.6, 14.5, 6,
5.4, 3.8, 1.9, 0.6, 0.3, 0.3, 0.3)), class = "data.frame", .Names = c("id",
"AB1", "AB2", "AB3", "AB4", "AB5", "AB6", "AB7", "AB8", "AB9",
"AB10", "AB11", "AB12", "AB13", "total", "perc"), row.names = c(NA,
-19L))
Hospital #2 data.frame:
structure(list(id = structure(1:18, .Label = c("3rd gen Cephalosporin",
"Carbapenem", "Fluoroquinolone", "Glycopeptide", "Penicillin",
"Ureidopenicillin", "Lipopeptide", "Macrolid", "Aminoglycoside",
"Polypeptide", "Rifamycin", "Tetracycline", "Lincosamide", "Quinolone",
"Sulfonamides", "Nitroimidazole", "Polymyxine", "Colistin"), class = "factor"),
AB1 = c(9L, 3L, 1L, 7L, 16L, 22L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L), AB2 = c(2L, 17L, 5L, 8L, 2L, 9L,
1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), AB3 = c(1L,
9L, 4L, 5L, 3L, 1L, 1L, 1L, 3L, 1L, 1L, 1L, 0L, 0L, 0L, 0L,
0L, 0L), AB4 = c(1L, 3L, 3L, 7L, 4L, 3L, 0L, 0L, 2L, 0L,
0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L), AB5 = c(3L, 1L, 4L, 1L,
4L, 1L, 2L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 2L, 0L, 0L, 0L),
AB6 = c(3L, 2L, 4L, 1L, 0L, 0L, 0L, 0L, 3L, 0L, 0L, 0L, 0L,
0L, 0L, 1L, 1L, 0L), AB7 = c(0L, 2L, 3L, 3L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L), total = c(19,
37, 24, 32, 29, 36, 4, 2, 9, 1, 1, 1, 1, 2, 2, 1, 1, 1),
perc = c(9.4, 18.2, 11.8, 15.8, 14.3, 17.7, 2, 1, 4.4, 0.5,
0.5, 0.5, 0.5, 1, 1, 0.5, 0.5, 0.5)), class = "data.frame", .Names = c("id",
"AB1", "AB2", "AB3", "AB4", "AB5", "AB6", "AB7", "total", "perc"
), row.names = c(NA, -18L))
Merging the two data frames with
library(dplyr)
hospital <- left_join(hosp1, hosp2, by = "id") %>% select(id, total.x, total.y)
resulted in
#id total.x total.y
#1 1st gen Cephalosporin 6 NA
#2 3rd gen Cephalosporin 65 19
#3 4th gen Cephalosporin 10 NA
#4 Aminoglycoside 31 9
#5 Clindamycin 2 NA
#6 Glycopeptide 55 32
#7 Macrolide 3 NA
#8 Penicillin 36 29
#9 Tetracycline 2 1
#10 Trimethoprim 2 NA
#11 Ureidopenicillin 46 36
#12 Carbapenem 19 37
#13 Fluorquinolone 17 NA
#14 Nitromidazole 12 NA
#15 Antifungal 6 NA
#16 Oxazolidinone 2 NA
#17 Rifamycin 1 1
#18 Polypeptide 1 1
#19 Lipopeptide 1 NA
It is strange that so many NAs are produced for hosp2. On closer inspection, the id labels are inconsistent between the two data frames. For instance, row 14 in hosp1 is Nitromidazole whereas row 16 in hosp2 is Nitroimidazole, and I am not sure whether they refer to the same medication.
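One quick way to list the labels that fail to match is setdiff. The sketch below uses a handful of labels copied from the two dput() outputs above; the vector names ids1 and ids2 are made up for illustration. Note that "Lipopeptide " in hosp1 carries a trailing space, which trimws can remove, whereas spelling variants still need a manual decision.

```r
# A subset of the id labels from hosp1 and hosp2 (copied from the dput() output).
ids1 <- c("Macrolide", "Fluorquinolone", "Nitromidazole", "Lipopeptide ", "Penicillin")
ids2 <- c("Macrolid", "Fluoroquinolone", "Nitroimidazole", "Lipopeptide", "Penicillin")

# Labels in hospital 1 with no exact match in hospital 2.
unmatched <- setdiff(ids1, ids2)
print(unmatched)

# Stripping surrounding whitespace resolves the "Lipopeptide " case only;
# the spelling variants remain unmatched.
unmatched_trimmed <- setdiff(trimws(ids1), trimws(ids2))
print(unmatched_trimmed)
```

Whether the remaining near-duplicates really denote the same drug class is something only the data owner can confirm before recoding them.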
Anyway, though I have some doubts about your use of chisq.test, the desired output can be produced as follows:
# p-value for one pair of totals; NA when either total is missing
pval <- function(x, y){
  if (is.na(x) || is.na(y)) return(NA_real_)
  chisq.test(c(x, y))$p.value
}
# apply pval row by row over the two total columns
hospital$p_value <- mapply(pval, hospital$total.x, hospital$total.y)
colnames(hospital) <- c("id", "Hospital One Total Frequency", "Hospital Two Total Frequency", "p-value")
The final output looks like this:
> hospital
# id Hospital One Total Frequency Hospital Two Total Frequency p-value
#1 1st gen Cephalosporin 6 NA NA
#2 3rd gen Cephalosporin 65 19 5.193805e-07
#3 4th gen Cephalosporin 10 NA NA
#4 Aminoglycoside 31 9 5.042182e-04
#5 Clindamycin 2 NA NA
#6 Glycopeptide 55 32 1.366852e-02
#7 Macrolide 3 NA NA
#8 Penicillin 36 29 3.852612e-01
#9 Tetracycline 2 1 5.637029e-01
#10 Trimethoprim 2 NA NA
#11 Ureidopenicillin 46 36 2.694564e-01
#12 Carbapenem 19 37 1.615693e-02
#13 Fluorquinolone 17 NA NA
#14 Nitromidazole 12 NA NA
#15 Antifungal 6 NA NA
#16 Oxazolidinone 2 NA NA
#17 Rifamycin 1 1 1.000000e+00
#18 Polypeptide 1 1 1.000000e+00
#19 Lipopeptide 1 NA NA
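For what it is worth, calling chisq.test on a plain vector of two counts, as pval does, runs a goodness-of-fit test against an equal split of the two counts. The sketch below reproduces that by hand for the "3rd gen Cephalosporin" row; the variable names are made up for illustration.

```r
# Totals for "3rd gen Cephalosporin" taken from the merged table above.
obs <- c(65, 19)

# Under the null hypothesis both counts are expected to be equal.
expected <- rep(sum(obs) / 2, 2)

# Pearson statistic with length(obs) - 1 = 1 degree of freedom.
stat <- sum((obs - expected)^2 / expected)
p_manual <- pchisq(stat, df = 1, lower.tail = FALSE)

# The same p-value straight from chisq.test.
p_builtin <- chisq.test(obs)$p.value
print(c(manual = p_manual, builtin = p_builtin))
```

This test only asks whether the two totals differ from a 50/50 split. It does not account for the overall number of prescriptions in each hospital, which may or may not be what is wanted here.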

Create a single output matrix using apply function

Dear programming gods,
I would like to perform a series of Chi-square tests in R (one test for each column of my species Presence/Absence data.frame) using a function that can yield a single matrix (or data.frame, ideally) which lists as output the species (column name), Chi-square test statistic, df, and p.value.
My species data snippet (actual dimensions = 50x131):
Species<-structure(list(Acesac = c(0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 1L, 1L
), Allpet = c(0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L), Ambser = c(0L,
0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L), Anoatt = c(0L, 0L, 0L, 1L, 0L,
1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L,
0L, 1L, 1L, 1L), Aritri = c(0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L,
0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L
)), .Names = c("Acesac", "Allpet", "Ambser", "Anoatt", "Aritri"
), row.names = c("BS1", "BS10", "BS2", "BS3", "BS4", "BS5", "BS6",
"BS7", "BS8", "BS9", "LC1", "LC10", "LC2", "LC3", "LC4", "LC5",
"LC6", "LC7", "LC8", "LC9", "TR1", "TR10", "TR2", "TR3", "TR4"
), class = "data.frame")
My environmental data snippet:
Env<-structure(list(Rock = structure(1:25, .Label = c("BS1", "BS10",
"BS2", "BS3", "BS4", "BS5", "BS6", "BS7", "BS8", "BS9", "LC1",
"LC10", "LC2", "LC3", "LC4", "LC5", "LC6", "LC7", "LC8", "LC9",
"TR1", "TR10", "TR2", "TR3", "TR4", "TR5", "TR6", "TR7", "TR8",
"TR9", "WD1", "WD10", "WD2", "WD3", "WD4", "WD5", "WD6", "WD7",
"WD8", "WD9", "WW1", "WW10", "WW2", "WW3", "WW4", "WW5", "WW6",
"WW7", "WW8", "WW9"), class = "factor"), Climbed = structure(c(1L,
2L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 2L,
1L, 2L, 1L, 1L, 2L, 2L, 1L, 2L), .Label = c("climbed", "unclimbed"
), class = "factor")), .Names = c("Rock", "Climbed"), row.names = c(NA,
25L), class = "data.frame")
The following apply function code performs a chi-sq test on each species (column) by first creating a contingency table with the number of occurrences of a given species on climbed vs. unclimbed rocks (Env$Climbed).
apply(Species, 2, function(x) {
Table<-table(Env$Climbed, x)
Test<-chisq.test(Table, corr = TRUE)
out <- data.frame("Chi.Square" = round(Test$statistic,3)
, "df" = Test$parameter
, "p.value" = round(Test$p.value, 3)
)
})
This yields a separate data.frame for each species (column). I would like to get a single data.frame that also includes the column name of each species. Something like this:
mydf<-data.frame("spp"= colnames(Species[1:25,]), "Chi.sq"=c(1:25), "df"=
c(1:25),"p.value"= c(1:25))
Should this be done with ddply or adply? Or just a loop? (I tried, but failed.) I reviewed a posting on a similar topic (Chi Square Analysis using for loop in R), but could not make it work for my purposes.
Thank you for your time and expertise!
TC
Don't use apply on data.frames. It internally coerces to a matrix, which can have unintended consequences for some data structures (e.g. factors). It is also not memory-efficient.
If you want to apply a function by column, use lapply (as a data.frame is a list)
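A small sketch of the coercion issue, using a made-up two-column data.frame: apply first converts the whole data.frame to a character matrix, so both columns lose their type, while sapply (or lapply) sees each column as it is.

```r
# Toy data.frame with one numeric and one factor column (illustrative only).
df <- data.frame(n = c(1.5, 2.5), f = factor(c("a", "b")))

# apply() goes through as.matrix(), which turns everything into character.
via_apply <- apply(df, 2, class)
print(via_apply)

# sapply() iterates over the columns of the list, keeping their classes.
via_sapply <- sapply(df, class)
print(via_sapply)
```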
You can use plyr::ldply to automagically return a data.frame rather than a list.
# rewrite the function so `Env$Climbed` is not hard coded....
my_fun <- function(x, y) {
  # contingency table of the grouping variable y against presence/absence x
  Table <- table(y, x)
  Test <- chisq.test(Table, correct = TRUE)
  # one-row data.frame, returned explicitly
  data.frame("Chi.Square" = round(Test$statistic, 3)
             , "df" = Test$parameter
             , "p.value" = round(Test$p.value, 3)
  )
}
library(plyr)
results <- ldply(Species,my_fun, y = Env$Climbed)
results
# .id Chi.Square df p.value
# 1 Acesac 0.000 1 1.000
# 2 Allpet 0.000 1 1.000
# 3 Ambser 0.000 1 1.000
# 4 Anoatt 0.338 1 0.561
# 5 Aritri 0.085 1 0.770
If you save the result of your apply as
kk <- apply(Species, 2, function(x) {...})
Then you can finish the transformation with
do.call(rbind, Map(function(x,y) cbind(x, species=y), kk, names(kk)))
Here we just append the name of the species to each data.frame and combine all the rows with rbind.
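A self-contained sketch of the same pattern, assuming two of the species columns and the Climbed factor from the question. Here kk is built with lapply rather than apply, and the names Species2, Climbed and chisq_row are made up for illustration.

```r
# Two species columns and the grouping factor, copied from the question's data.
Species2 <- data.frame(
  Anoatt = c(0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,1,1),
  Aritri = c(0,0,0,0,1,1,1,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,1,1)
)
Climbed <- factor(c(1,2,2,1,2,1,2,1,2,1,1,2,2,1,2,1,2,1,2,1,1,2,2,1,2),
                  labels = c("climbed", "unclimbed"))

# Hypothetical helper: one-row data.frame of chi-square results for one species.
# suppressWarnings() hides the small-expected-count warning for this tiny sample.
chisq_row <- function(x, y) {
  test <- suppressWarnings(chisq.test(table(y, x), correct = TRUE))
  data.frame(Chi.Square = round(unname(test$statistic), 3),
             df = unname(test$parameter),
             p.value = round(test$p.value, 3))
}

# Named list with one data.frame per species.
kk <- lapply(Species2, chisq_row, y = Climbed)

# Append each list name as a column, then stack all rows.
combined <- do.call(rbind, Map(function(x, y) cbind(x, species = y), kk, names(kk)))
print(combined)
```

Map walks over the data.frames and their names in parallel, so each row is tagged with the species it came from before rbind stacks them.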
You can also try
kk <- apply(Species,2,....)
library(plyr)
ldply(kk,.id='spp')
spp Chi.Square df p.value
1 Acesac 0.000 1 1.000
2 Allpet 0.000 1 1.000
3 Ambser 0.000 1 1.000
4 Anoatt 0.338 1 0.561
5 Aritri 0.085 1 0.770
Update:
library(plyr)
library(reshape2)
ddply(setNames(melt(Species), c("spp", "value")), .(spp), function(x) {
  Test <- chisq.test(table(Env$Climbed, x$value), correct = TRUE)
  data.frame(Chi.Square = round(Test$statistic, 3),
             df = Test$parameter,
             p.value = round(Test$p.value, 3))
})

Resources