Applying a function repeatedly to many subjects


I have a data frame as follows,
> mydata
date station treatment subject par
A a 0 R1 1.3
A a 0 R1 1.4
A a 1 R2 1.4
A a 1 R2 1.1
A b 0 R1 1.5
A b 0 R1 1.8
A b 1 R2 2.5
A b 1 R2 9.5
B a 0 R1 0.3
B a 0 R1 8.2
B a 1 R2 7.3
B a 1 R2 0.2
B b 0 R1 9.4
B b 0 R1 3.2
B b 1 R2 3.5
B b 1 R2 2.4
....
where:
date is a factor with 2 levels A/B;
station is a factor with 2 levels a/b;
treatment is a factor with 2 levels 0/1;
subject are the replicates R1 to R20 assigned to treatment (10 to treatment 0 and 10 to treatment 1);
and
par is my parameter, a repeated measurement of particle size for each subject at each date and station.
What I need to do is:
divide par into 10 equal bins and count the number of observations in each bin. This has to be done in subsets of mydata defined by a combination of date, station and subject. The final outcome has to be a data frame myres as follows:
> myres
date station treatment bin.centre freq
A a 0 1.2 4
A a 0 1.3 3
A a 0 1.4 2
A a 0 1.5 1
A a 1 1.2 4
A a 1 1.3 3
A a 1 1.4 2
A a 1 1.5 1
B b 0 2.3 5
B b 0 2.4 4
B b 0 2.5 3
B b 0 2.6 2
B b 1 2.3 5
B b 1 2.4 4
B b 1 2.5 3
B b 1 2.6 2
....
This is what I've done so far:
# define the number of bins
num.bins <- 10
# define the width of each bin
bin.width <- (max(par) - min(par)) / num.bins
# define the lower and upper boundaries of each bin
bins <- seq(from = min(par), to = max(par), by = bin.width)
# define the centre of each bin
bin.centre <- seq(min(bins) + bin.width/2, max(bins) - bin.width/2, by = bin.width)
# create a vector to store the frequency in each bin (one entry per bin)
freq <- numeric(length(bins) - 1)
# this loop counts the particles between the lower and upper boundaries
# of each bin and stores the result in freq
# (note: par < bins[i+1] leaves the single largest value out of the last bin)
for (i in seq_along(freq)) {
  freq[i] <- length(which(par >= bins[i] & par < bins[i + 1]))
}
# create the data frame with the results
res <- data.frame(bin.centre, freq)
My first approach was to subset mydata manually, using subset(), for each combination of subject, station and date, apply the above sequence of commands to each subset, and then build the final data frame by combining each single res using rbind(). This procedure was very convoluted and prone to the propagation of errors.
What I would like to do is automate the above procedure so that it calculates the binned frequency distribution for each subject. My intuition is that the best way to do this is to create a function for estimating this particle distribution and then apply it to each subject via a for loop. However, I am not sure how to do this. Any suggestions would be really appreciated.
thanks
matteo.

You can do this in a few steps using the functionality in the plyr package. This allows you to split your data into the desired chunks, apply a statistic to each chunk, and combine the results.
First I set up some dummy data:
set.seed(1)
n <- 100
dat <- data.frame(
date=sample(LETTERS[1:2], n, replace=TRUE),
station=sample(letters[1:2], n, replace=TRUE),
treatment=sample(0:1, n, replace=TRUE),
subject=paste("R", sample(1:2, n, replace=TRUE), sep=""),
par=runif(n, 0, 5)
)
head(dat)
date station treatment subject par
1 A b 0 R2 3.2943880
2 A a 0 R1 0.9253498
3 B a 1 R1 4.7718907
4 B b 0 R1 4.4892425
5 A b 0 R1 4.7184853
6 B a 1 R2 3.6184538
Now I use the function in base called cut to divide your par into equal sized bins:
dat$bin <- cut(dat$par, breaks=10)
Now for the fun bit. Load package plyr and use the function ddply to split, apply and combine. Because you want a frequency count, we can use the function length to count the number of rows that fall in each bin within each chunk:
library(plyr)
res <- ddply(dat, .(date, station, treatment, bin),
summarise, freq=length(treatment))
head(res)
date station treatment bin freq
1 A a 0 (0.00422,0.501] 1
2 A a 0 (0.501,0.998] 2
3 A a 0 (1.5,1.99] 4
4 A a 0 (1.99,2.49] 2
5 A a 0 (2.49,2.99] 2
6 A a 0 (2.99,3.48] 1
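The question also asked for a reusable function that reports bin centres rather than interval labels. A minimal base R sketch of that idea follows. It assumes the 10 bins are shared across all subsets (computed once from the full range of par), and the names bin_freq, pieces and myres are hypothetical helpers chosen for illustration, not part of the answer above.

```r
# Hypothetical helper: count the values of x in fixed bins, reported by bin centre
bin_freq <- function(x, breaks) {
  h <- hist(x, breaks = breaks, plot = FALSE)
  data.frame(bin.centre = h$mids, freq = h$counts)
}

# Dummy data in the same spirit as above
set.seed(1)
n <- 100
dat <- data.frame(
  date    = sample(LETTERS[1:2], n, replace = TRUE),
  station = sample(letters[1:2], n, replace = TRUE),
  subject = paste0("R", sample(1:2, n, replace = TRUE)),
  par     = runif(n, 0, 5)
)

# Assumption: one common set of 10 equal-width bins for every subset
breaks <- seq(min(dat$par), max(dat$par), length.out = 11)

# Split by date/station/subject, apply the helper, and stack the results
pieces <- split(dat, list(dat$date, dat$station, dat$subject), drop = TRUE)
res_list <- lapply(pieces, function(d) {
  data.frame(date = d$date[1], station = d$station[1], subject = d$subject[1],
             bin_freq(d$par, breaks))
})
myres <- do.call(rbind, res_list)
rownames(myres) <- NULL
head(myres)
```

If each subset should instead get its own 10 bins, the breaks could be computed inside the anonymous function from d$par, at the cost of bin centres that are no longer comparable between subsets.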

Related

Transform community data into wide-format for vegan package

I am trying to analyze some community data with the vegan package. I have my data in the wrong format, and am looking for ways to change the format.
What I have is something like this:
Habitat Species Abundance
1 A 3
2 B 2
3 C 1
1 D 5
2 A 8
3 F 4
And what I think I need is:
Habitat Species A Species B Species C Species D Species F
1 3 0 0 5 0
2 8 ...... etc
3 0
Or is there any other format that vegan can take? I am trying to calculate similarity in species composition between habitats.

The function matrify() in the labdsv package does exactly this for community analyses.
It takes a data.frame in three-column form (sample.id, taxon, abundance), converts it into full matrix form, and then exports it as a data.frame with the appropriate row.names and column names.
In other words, it converts your data from long to wide format so that each row represents a sample (in your case "habitat"; sometimes this would be a "plot"), each column represents a species, and each cell shows the abundance of the given cell's species (column) in the given cell's habitat (row).
Example:
dat <- data.frame(Habitat = c('Hab1','Hab1','Hab2','Hab2','Hab2','Hab3','Hab3'),
Species = c('Sp1','Sp2','Sp1','Sp3','Sp4','Sp2','Sp3'),
Abundance = c(1,2,1,3,2,2,1))
print(dat)
Habitat Species Abundance
1 Hab1 Sp1 1
2 Hab1 Sp2 2
3 Hab2 Sp1 1
4 Hab2 Sp3 3
5 Hab2 Sp4 2
6 Hab3 Sp2 2
7 Hab3 Sp3 1
library(labdsv)
matrify(dat)
Sp1 Sp2 Sp3 Sp4
Hab1 1 2 0 0
Hab2 1 0 3 2
Hab3 0 2 1 0
Bonus:
I rewrote matrify many years ago so that it could handle longitudinal community data.
Specifically, my matrify2() function creates rows for each plot-year combination (i.e., resampled rows for the same plot) by duplicating plot (or habitat) row monikers and adding a Year column.
Below is the code:
#Create data.frame with PLOT, YEAR, and ABUNDANCE for each SPEC:
#Creates function that can sort the data.frame output by:
#Columns = individual SPECS, #Rows = plot by Year
#Note: Code modified from matrify() function from labdsv package (v. 1.6-1)
matrify2 <- function(data) {
#Data must have columns: plot, SPEC, abundance measure,Year
if (ncol(data) != 4)
stop("data frame must have four column format")
plt <- factor(data[, 1])
spc <- factor(data[, 2])
abu <- data[, 3]
yrs <- factor(data[, 4])
plt.codes <- sort(levels(factor(plt))) ##object with sorted plot numbers
spc.codes <- levels(factor(spc)) ##object with sorted SPEC names
yrs.codes <- sort(levels(factor(yrs))) ##object with sorted sampling Years
taxa <- matrix(0, nrow = length(plt.codes)*length(yrs.codes), ncol = length(spc.codes)) ##Create empty matrix with proper dimensions (unique(plotxYear) by # of SPEC)
plt.list <- rep(plt.codes,length(yrs.codes)) ##Create a list of all the plot numbers (in order of input data) to add as an ID column at end of function
yrs.list <- rep(yrs.codes,each=length(plt.codes)) ##Create a list of all the Year numbers (in order of input data) to add as an ID column at end of function
col <- match(spc, spc.codes) ##object that determines the alphabetical order ranking of each SPEC in the spc.code list
row.plt <- match(plt, plt.codes) ##object that determines the rank order ranking of each plot of the input data in the plt.code list
row.yrs <- match(yrs,yrs.codes) ##object that determines the rank order ranking of each Year of the input data in the yrs.code list
for (i in 1:length(abu)) {
row <- (row.plt[i])+length(plt.codes)*(row.yrs[i]-1) ##Determine row number by assuming each row represents a specific plot & year in an object of rep(plot,each=Year)
if(!is.na(abu[i])) { ##Only use the value if it is not NA (all NA values are ignored)
taxa[row, col[i]] <- sum(taxa[row, col[i]], abu[i]) ##Add abundance measure of row i to the proper SPEC column and plot/Year row. Sum across all identical individuals.
}
}
taxa <- data.frame(taxa) ##Convert to data.frame for easier manipulation
taxa <- cbind(plt.list,yrs.list,taxa) ##Add ID columns for plot and Year to each row already representing the abundance of Each SPEC of that given plot/Year.
names(taxa) <- c('Plot','Year',spc.codes)
taxa
}
Example:
dat.y <- data.frame(Habitat = c('Hab1','Hab1','Hab2','Hab2','Hab2','Hab3','Hab3','Hab1','Hab1','Hab2','Hab2','Hab2','Hab3','Hab3'),
Species = c('Sp1','Sp2','Sp1','Sp3','Sp4','Sp2','Sp3','Sp1','Sp2','Sp1','Sp3','Sp4','Sp2','Sp3'),
Abundance = c(1,2,1,3,2,2,1,1,2,1,3,2,2,1),
Year = c(1,1,1,1,1,1,1,2,2,2,2,2,2,2))
print(dat.y)
Habitat Species Abundance Year
1 Hab1 Sp1 1 1
2 Hab1 Sp2 2 1
3 Hab2 Sp1 1 1
4 Hab2 Sp3 3 1
5 Hab2 Sp4 2 1
6 Hab3 Sp2 2 1
7 Hab3 Sp3 1 1
8 Hab1 Sp1 1 2
9 Hab1 Sp2 2 2
10 Hab2 Sp1 1 2
11 Hab2 Sp3 3 2
12 Hab2 Sp4 2 2
13 Hab3 Sp2 2 2
14 Hab3 Sp3 1 2
matrify2(dat.y)
Plot Year Sp1 Sp2 Sp3 Sp4
1 Hab1 1 1 2 0 0
2 Hab2 1 1 0 3 2
3 Hab3 1 0 2 1 0
4 Hab1 2 1 2 0 0
5 Hab2 2 1 0 3 2
6 Hab3 2 0 2 1 0
Also, FYI, you should get to know labdsv according to the vegan documentation:
Together with the labdsv package, the vegan package provides most standard tools of descriptive community analysis.

You probably want to spread your data. For example:
library(tidyr)
mydata %>%
spread(Species, Abundance)
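For readers who prefer to avoid extra packages, the same long-to-wide step can be sketched in base R with xtabs(). The data frame below retypes the example from the question, and the names long, wide and wide_df are placeholders chosen for this sketch.

```r
# The long-format example from the question
long <- data.frame(
  Habitat   = c(1, 2, 3, 1, 2, 3),
  Species   = c("A", "B", "C", "D", "A", "F"),
  Abundance = c(3, 2, 1, 5, 8, 4)
)

# Cross-tabulate: one row per habitat, one column per species,
# cells hold the summed abundance (0 where a species was not recorded)
wide <- xtabs(Abundance ~ Habitat + Species, data = long)

# Convert the table to an ordinary data frame with habitats as row names
wide_df <- as.data.frame.matrix(wide)
wide_df
```

A sites-by-species table of this shape is the layout that community analysis functions such as vegan's vegdist() generally expect, although it is worth checking the documentation of the specific function you plan to call.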

This is what I would do, using dcast (from the reshape2 package, so load it first with library(reshape2)):
Create a data sample: cc=data.frame(habitat=c(1,2,3,1,2,3),species=c('a','e','a','e','g','a'), abundance=sample(1:10000,6)).
Output looks like this (ignore the first column, as it is an automatic index created by the output operation in R; what matters is the columns):
cc
habitat species abundance
1 1 a 7814
2 2 e 7801
3 3 a 9510
4 1 e 7443
5 2 g 2160
6 3 a 4026
Now melt: m=melt(cc, id.vars=c("habitat","species")). Output:
habitat species variable value
1 1 a abundance 7814
2 2 e abundance 7801
3 3 a abundance 9510
4 1 e abundance 7443
5 2 g abundance 2160
6 3 a abundance 4026
Now reshape: dcast(m,habitat~species,fun.aggregate=mean), which yields:
habitat a e g
1 1 7814 7443 NaN
2 2 NaN 7801 2160
3 3 6768 NaN NaN
Kf

Sample random rows in dataframe, where number of samples exceeds number of rows. Assign sampling probability

Consider the following example data, stored in a dataframe called df
df
x y
2 4
1 5
0 8
As you can see, there are 3 rows in this dataframe. What I'd like to do is take 100 row samples, where each row has an equal probability of being selected (in this case 1/3). My output, let's call it df_result, would look something like this:
df_result
x y
0 8
2 4
0 8
1 5
1 5
2 4
etc..... until 100 samples are taken.
I saw this previous stackoverflow post which detailed how to take random samples for a dataframe: df[sample(nrow(df), 3), ]
However, when I tried to sample 100 rows, this (predictably) did not work, and did not allow for the sampling probability to be assigned.
Any tips?
Thanks

df <- read.table(header = TRUE,
text = "x y
2 4
1 5
0 8")
set.seed(1)
df[sample(nrow(df), 10, replace=TRUE), ]
x y
1 2 4
2 1 5
2.1 1 5
3 0 8
1.1 2 4
3.1 0 8
3.2 0 8
2.2 1 5
2.3 1 5
1.2 2 4
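The question also asked for 100 samples and for a way to assign the sampling probability. A minimal sketch of both follows, using the prob argument of sample(). The weights in probs are made-up values for illustration; leaving prob out gives each row the equal 1/3 chance described in the question.

```r
df <- data.frame(x = c(2, 1, 0), y = c(4, 5, 8))

set.seed(1)
# Assumed, illustrative weights: one probability per row of df
probs <- c(0.5, 0.3, 0.2)

# Draw 100 row indices with replacement, weighted by probs
idx <- sample(nrow(df), size = 100, replace = TRUE, prob = probs)
df_result <- df[idx, ]

# Drop the "1.1", "2.3"-style row names created by repeated rows
rownames(df_result) <- NULL
head(df_result)
```

Sampling with replacement is what allows the number of samples to exceed the number of rows; without replace = TRUE, sample() stops with an error once size is larger than nrow(df).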

Frequency table with ddply function

ID<-c("R1","R2","R2","R3","R3","R4","R4","R4","R4","R3","R3","R3","R3","R2","R2","R2","R5","R6")
event<-c("a","b","b","M","s","f","y","b","a","a","a","a","s","c","c","b","m","a")
df<-data.frame(ID,event)
1. How can I modify the code below to get this table? 2. How can I get the average frequency for each event? For example, the average frequency for a would be (1+3+1+1)/4.
ddply(df,.(ID),summarise,N=sum(!is.na(ID)),frequency=length(event))
ID N Number-event-level levels frequency
R1 1 1 a a=1
R2 5 2 b,c b=3,c=2
R3 6 3 M,a,s M=1,a=3,s=2
R4 4 4 f,y,b,a f=1,y=1,b=1,a=1
R5 1 1 m m=1
R6 1 1 a a=1

Here's an answer for the first question:
ddply(df,.(ID),summarise,
N=length(event),
Number.event.level=length(unique(event)),
levels=paste(sort(unique(event)),collapse=","),
frequency=paste(paste(sort(unique(event)),table(event)[table(event)>0],sep="="),collapse=","))
# ID N Number.event.level levels frequency
# 1 R1 1 1 a a=1
# 2 R2 5 2 b,c b=3,c=2
# 3 R3 6 3 a,M,s a=3,M=1,s=2
# 4 R4 4 4 a,b,f,y a=1,b=1,f=1,y=1
# 5 R5 1 1 m m=1
# 6 R6 1 1 a a=1
For your second question, it seems like you want to get the average frequency when the frequency is greater than 0. If that's the case, you can do this:
apply(table(df),2,function(x) mean(x[x>0]))
# a b c f m M s y
# 1.5 2.0 2.0 1.0 1.0 1.0 2.0 1.0
Update
If you want to do that last part for each level of a third variable and you still want to use ddply() you could do the following:
df1 <- rbind(df,df)
df1$cat <- rep(c("a","b"),each=nrow(df))
ddply(df1,.(cat),function(y) apply(table(y),2,function(x) mean(x[x>0])))
# cat a b c f m M s y
# 1 a 1.5 2 2 1 1 1 2 1
# 2 b 1.5 2 2 1 1 1 2 1
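To see where a value such as 1.5 for a comes from, it can help to look at the underlying ID by event contingency table directly. The following base R sketch rebuilds the data from the question and works through event a by hand; tab, a_counts and avg_a are names introduced only for this illustration.

```r
ID <- c("R1","R2","R2","R3","R3","R4","R4","R4","R4","R3","R3","R3","R3","R2","R2","R2","R5","R6")
event <- c("a","b","b","M","s","f","y","b","a","a","a","a","s","c","c","b","m","a")
df <- data.frame(ID, event)

# Rows are IDs, columns are events, cells are counts
tab <- table(df$ID, df$event)

# Counts of event "a" per ID: R1 = 1, R3 = 3, R4 = 1, R6 = 1, the rest 0
a_counts <- tab[, "a"]

# Average over the IDs in which "a" occurs at all: (1 + 3 + 1 + 1) / 4
avg_a <- mean(a_counts[a_counts > 0])
avg_a
```

The apply() call in the answer does the same thing for every column of the table at once.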

for loop through data frame and looping with unique values

I'm trying to work on code to build a function for three stage cluster sampling, however, I am just working with dummy data right now so I can understand what is going into my function.
I am working on for loops and have a data frame with grouped values:
Cluster group value value.K.bar value.M.bar N.bar
1 1 A 1 1.5 2.5 4
2 1 A 2 1.5 2.5 4
3 1 B 3 4.0 2.5 4
4 1 B 4 4.0 2.5 4
5 2 B 5 4.0 6.0 4
6 2 C 6 6.5 6.0 4
7 2 C 7 6.5 6.0 4
and I am trying to run the for loop
n <- dim(data)[1]
e <- 0
total <- 0
for(i in 1:n) {e = data.y$value.M.bar[i] - data$N.bar[i]
total = total + e^2}
My question is: Is there a way to run the same loop but for the unique values in the group? Say by:
Group 'A', 'B', 'C'
Any help would be greatly appreciated!
Edit: for correct language

You can use by, for example, to apply your function to each group. First I wrap your code in a function that takes data as input.
get.total <- function(data){
n <- dim(data)[1]
e <- 0
total <- 0
for(i in 1:n) {
e <- data$value.M.bar[i] - data$N.bar[i] ## corrected: the original used data.y instead of data
total <- total + e^2
}
total
}
Then, to compute total for each group, you do this:
by(data,data$group,FUN=get.total)
data$group: A
[1] 4.5
----------------------------------------------------------------------------------------------------
data$group: B
[1] 8.5
----------------------------------------------------------------------------------------------------
data$group: C
[1] 8
Better still, here is a vectorized version of your function:
by(data,data$group,
function(dat)with(dat, sum((value.M.bar - N.bar)^2)))
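If a plain named vector is more convenient than the by() output, the same grouped sum of squares can be sketched with tapply(). The data frame below retypes the example from the question, and totals is a name chosen for this sketch.

```r
data <- data.frame(
  Cluster     = c(1, 1, 1, 1, 2, 2, 2),
  group       = c("A", "A", "B", "B", "B", "C", "C"),
  value       = 1:7,
  value.K.bar = c(1.5, 1.5, 4, 4, 4, 6.5, 6.5),
  value.M.bar = c(2.5, 2.5, 2.5, 2.5, 6, 6, 6),
  N.bar       = 4
)

# Squared differences for every row, then summed within each group
totals <- tapply((data$value.M.bar - data$N.bar)^2, data$group, sum)
totals
#   A   B   C
# 4.5 8.5 8.0
```

Individual groups can then be picked out by name, for example totals[["B"]].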

Calculate the difference between pairs of consecutive rows in a data frame - R

I have a data.frame in which each gene name is repeated and contains values for 2 conditions:
df <- data.frame(gene=c("A","A","B","B","C","C"),
condition=c("control","treatment","control","treatment","control","treatment"),
count=c(10, 2, 5, 8, 5, 1),
sd=c(1, 0.2, 0.1, 2, 0.8, 0.1))
gene condition count sd
1 A control 10 1.0
2 A treatment 2 0.2
3 B control 5 0.1
4 B treatment 8 2.0
5 C control 5 0.8
6 C treatment 1 0.1
I want to calculate whether there is an increase or decrease in "count" after treatment and mark the genes as such and/or subset them. That is (pseudo code, where geneRow1 is the control row and geneRow2 the treatment row):
for each unique(gene) do
if df[geneRow2,3]-df[geneRow1,3] > 0 then gene is "up"
else gene is "down"
This is what it should look like in the end (the last column is optional):
up-regulated
gene condition count sd regulation
B control 5 0.1 up
B treatment 8 2.0 up
down-regulated
gene condition count sd regulation
A control 10 1.0 down
A treatment 2 0.2 down
C control 5 0.8 down
C treatment 1 0.1 down
I have been racking my brain with this, including playing with ddply, and I've failed to find a solution. Please help a hapless biologist.
Cheers.

The plyr solution would look something like:
library(plyr)
reg.fun <- function(x) {
  # treatment minus control: a positive difference means the count went up after treatment
  reg.diff <- x$count[x$condition=='treatment'] - x$count[x$condition=='control']
  x$regulation <- ifelse(reg.diff > 0, 'up', 'down')
  x
}
ddply(df, .(gene), reg.fun)
gene condition count sd regulation
1 A control 10 1.0 down
2 A treatment 2 0.2 down
3 B control 5 0.1 up
4 B treatment 8 2.0 up
5 C control 5 0.8 down
6 C treatment 1 0.1 down
You could also think about doing this with a different package and/or with data in a different shape:
df.w <- reshape(df, direction='wide', idvar='gene', timevar='condition')
library(data.table)
DT <- data.table(df.w, key='gene')
DT[, regulation:=ifelse(count.treatment-count.control > 0, 'up', 'down'), by=gene]
gene count.control sd.control count.treatment sd.treatment regulation
1: A 10 1.0 2 0.2 down
2: B 5 0.1 8 2.0 up
3: C 5 0.8 1 0.1 down

Something like this:
df$up.down <- with( df, ave(count, gene,
FUN=function(diffs) c("up", "down")[1+(diff(diffs) < 0) ]) )
spltdf <- split(df, df$up.down)
> df
gene condition count sd up.down
1 A control 10 1.0 down
2 A treatment 2 0.2 down
3 B control 5 0.1 up
4 B treatment 8 2.0 up
5 C control 5 0.8 down
6 C treatment 1 0.1 down
> spltdf
$down
gene condition count sd up.down
1 A control 10 1.0 down
2 A treatment 2 0.2 down
5 C control 5 0.8 down
6 C treatment 1 0.1 down
$up
gene condition count sd up.down
3 B control 5 0.1 up
4 B treatment 8 2.0 up
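The diff()-based approach relies on the control row coming directly before the treatment row for every gene. A base R sketch that instead pairs the rows by their condition label is shown below, under the assumption that each gene has exactly one control and one treatment row; ctrl, trt and m are names introduced for this illustration.

```r
df <- data.frame(gene = c("A", "A", "B", "B", "C", "C"),
                 condition = c("control", "treatment", "control",
                               "treatment", "control", "treatment"),
                 count = c(10, 2, 5, 8, 5, 1),
                 sd = c(1, 0.2, 0.1, 2, 0.8, 0.1))

# Pair control and treatment counts by gene, independent of row order
ctrl <- df[df$condition == "control", c("gene", "count")]
trt  <- df[df$condition == "treatment", c("gene", "count")]
m <- merge(ctrl, trt, by = "gene", suffixes = c(".control", ".treatment"))

# "up" when the count is higher after treatment, otherwise "down"
m$regulation <- ifelse(m$count.treatment > m$count.control, "up", "down")

# Carry the label back to the original long-format rows and split on it
df$regulation <- m$regulation[match(df$gene, m$gene)]
split(df, df$regulation)
```

Note that a gene whose count does not change at all would be labelled "down" by this rule; a third label could be added if ties matter for the analysis.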
