Hi,
I am trying to calculate the organism quantity per class over an entire depth range (e.g., from 10 m to 90 m). I have the abundance at certain depths (e.g., 10, 30 and 90 m) and I use an integrate function which calculates the average abundance between each pair of depths, multiplied by the difference between those depths. The values are then summed over the whole water column to get a total abundance.
Here is an example (only a tiny part of a bigger data set with several locations, years, classes and depths):
View(df)
Class Depth organismQuantity
1 Ciliates 10 1608.89
2 Ciliates 30 2125.09
3 Ciliates 90 1184.92
4 Dinophyceae 10 0.00
5 Dinoflagellates 30 28719.60
6 Dinoflagellates 90 4445.26
integrate = function(x) {
  averages = (x$organismQuantity[1:length(x)-1] + x$organismQuantity[2:length(x)]) / 2
  sum(averages * diff(x$Depth))
}
result = ddply(df, .(Class), integrate)
print(result)
But I get this result, with a warning message for the classes that come out as NA:
Class V1
1 Ciliates 136640.1
2 Dinoflagellates NA
3 Dinophyceae 0.0
Warning messages:
1: In averages * diff(x$Depth) :
longer object length is not a multiple of shorter object length
I don't understand why Dinoflagellates got an NA value... It is the same for several other classes in my complete data set (for some classes the integration works, for others I get the warning message).
Thank you for the help!!
Cheers,
Lucie
Here is a way using the function trapz from package caTools, adapted to the problem. (The NAs in your version come from the indexing: for a data frame, length(x) is the number of columns, not the number of rows, and 1:length(x)-1 parses as (1:length(x)) - 1, so the subscripts run past the end of the smaller groups and return NA.)
# Adapted from trapz in library(caTools)
# Author(s): Jarek Tuszynski
trapz <- function(DF, x, y){
  # trapezoidal rule: sum over consecutive points of (x2 - x1) * (y1 + y2) / 2
  x <- DF[[x]]
  y <- DF[[y]]
  idx <- seq_along(x)[-1]
  as.double( (x[idx] - x[idx-1]) %*% (y[idx] + y[idx-1]) ) / 2
}
library(plyr)
ddply(df, .(Class), trapz, x = "Depth", y = "organismQuantity")
# Class V1
#1 Ciliates 136640.1
#2 Dinoflagellates 994945.8
#3 Dinophyceae NA
Data
df <- read.table(text = "
Class Depth organismQuantity
1 Ciliates 10 1608.89
2 Ciliates 30 2125.09
3 Ciliates 90 1184.92
4 Dinophyceae 10 0.00
5 Dinoflagellates 30 28719.60
6 Dinoflagellates 90 4445.26
", header = TRUE)
I'm currently working on a project where I need to calculate the rolling minimum over a spatial window of 30 meters (a square around each central point). In my data frame, each point has X and Y coordinates and a variable Z for which I'm trying to get the rolling minimum.
So far I have accomplished this using for loops with conditionals and data.table filtering. This takes some time, especially when the data sets have over a million points. I would really appreciate some tips on how to improve the performance of this code.
d = 1
attach(data)
#### OPTION 1 - CONDITIONAL ####
op1 = NULL
for (i in 1:nrow(data)) {
  op1[i] <- min(
    ifelse(POINT_X >= POINT_X[i] - d,
           ifelse(POINT_X <= POINT_X[i] + d,
                  ifelse(POINT_Y >= POINT_Y[i] - d,
                         ifelse(POINT_Y <= POINT_Y[i] + d, Z, Z[i]),
                         Z[i]), Z[i]), Z[i]),
    na.rm = TRUE)
}
#### OPTION 2 - SUBSET ####
library(data.table)
setDT(data)
local_min = function(i) {
  x = POINT_X[i]
  y = POINT_Y[i]
  base = data[POINT_X %inrange% c(x - d, x + d) &
              POINT_Y %inrange% c(y - d, y + d)]
  min(base$Z, na.rm = TRUE)
}
op2 = NULL
for (i in 1:nrow(data)) {
  op2[i] <- local_min(i)
}
I've tried other alternatives, but the most common rolling-statistic functions in R are based on index windows rather than on the values of other variables. Here's some data to try the code above with d = 1. I would be really grateful if you could help me improve this process.
data = data.frame(POINT_X = rep(1:5, each = 5),
                  POINT_Y = rep(1:5, 5),
                  Z = 1:25)
The desired output should look like this:
> op1
[1] 1 1 2 3 4 1 1 2 3 4 6 6 7 8 9 11 11 12 13 14 16 16 17 18 19
I think it's important to note that option 1 is currently faster than option 2. Thanks in advance for your attention. :)
You could use a non-equi join: join the table to itself, matching every point whose coordinates fall within ±d of each point, then take the minimum of Z per point:
library(data.table)
d = 1
data[, `:=`(xmin = POINT_X - d,
            xmax = POINT_X + d,
            ymin = POINT_Y - d,
            ymax = POINT_Y + d)]
data[data, on = .(POINT_X >= xmin,
                  POINT_X <= xmax,
                  POINT_Y >= ymin,
                  POINT_Y <= ymax)][
  , .(rollmin = min(Z)), by = .(POINT_X, POINT_Y)][
  , rollmin]
#[1] 1 1 2 3 4 1 1 2 3 4 6 6 7 8 9 11 11 12 13 14 16 16 17 18 19
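As a quick sanity check (a sketch; op1 is the loop result from the question, and op3 is the chained result above assigned to a variable):
op3 <- data[data, on = .(POINT_X >= xmin, POINT_X <= xmax,
                         POINT_Y >= ymin, POINT_Y <= ymax)][
  , .(rollmin = min(Z)), by = .(POINT_X, POINT_Y)][, rollmin]
all.equal(op1, op3)
# should be TRUE on the toy data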
I am attempting to write my own functions for the total sum of squares, within sum of squares, and between sum of squares in RStudio for my own implementation of k-means.
I've successfully written the function for the within sum of squares, but I'm having difficulty with the total sum of squares (and thus the BSS). The result I get is significantly larger than what R's own kmeans function computes. I'm confused because I am following exactly what the formulas say. Here is my data:
A =
36 3
73 3
30 3
49 3
47 11
47 11
0 7
46 5
16 3
52 4
0 8
21 3
0 4
57 6
31 5
0 6
40 3
31 5
38 4
0 5
59 4
61 6
48 7
29 2
0 4
19 4
19 3
48 9
48 4
21 5
where each column is a feature. This is the function I've created so far for the TSS:
tot_sumoSq <- function(data){
  avg = mean( as.matrix(data) )
  r = matrix(avg, nrow(data), ncol(data))
  tot_sumoSq = sum( (data - r)^2 )
}
I receive the result 24342.4, but R gives 13244.8. Am I completely missing something?
The latter value is calculated using the column means. If you use those when calculating the means, you'll get the same answer.
avg = colMeans(data)
r = matrix(avg, nrow(data), ncol(data), byrow=T)
[1] 13244.8
Maybe there is something wrong in your program: you subtract a matrix from a data frame. Use the following instead -
tot_sumoSq <- function(data){
  data = as.matrix(data)
  x = sum((data - mean(data))^2)
  return(x)
}
On my side it gives the correct answer.
I found a solution to my issue by combining the solutions provided by the first two commenters. I now see what my previous mistake was and would like to clear up any confusion for future scientists.
tot_sumoSq <- function(data){
  avg = colMeans(data)
  r = matrix(avg, nrow(data), ncol(data), byrow = TRUE)
  data = as.matrix(data)
  return( sum( (data - r)^2 ) )
}
Each column is the entire sample for a different feature, so when we calculate the mean of each column, we get that feature's mean across the whole sample. My conceptual mistake earlier was to pool both features and calculate a single overall mean.
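As a cross-check (a sketch with dummy features; kmeans()$totss uses the same definition, the sum of squared deviations from the column means, whatever the number of centers):
set.seed(1)
A <- data.frame(f1 = rnorm(30, 30, 15), f2 = rnorm(30, 5, 2))  # dummy data
all.equal(tot_sumoSq(A), kmeans(A, centers = 3)$totss)
# should be TRUE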
I have a dataframe where I have values and, for each value, the counts associated with that value, so plotting counts against values gives me the histogram. I have three types: a, b, and c.
value counts type
0 139648267 a
1 34945930 a
2 5396163 a
3 1400683 a
4 485924 a
5 204631 a
6 98599 a
7 53056 a
8 30929 a
9 19556 a
10 12873 a
11 8780 a
12 6200 a
13 4525 a
14 3267 a
15 2489 a
16 1943 a
17 1588 a
... ... ...
How do I get from this to a CDF?
So far, my approach is super inefficient: I first write a function that sums up the counts up to each value:
get_cumulative <- function(x) {
  result <- numeric(nrow(x))
  for (i in seq_along(result)) {
    # sum the counts of all rows whose value is <= the current value
    result[i] <- sum(x[x$value <= x$value[i], ]$counts)
  }
  x$cumulative <- result
  x
}
Then I wrap this in a ddply call that splits by type, roughly as in the sketch below. This is obviously not the best way, and I'd love any suggestions on how to proceed.
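That wrapper is roughly this (a hypothetical sketch, using the column names shown above):
library(plyr)
result <- ddply(df, .(type), get_cumulative)  # per type, but still O(n^2) within each group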
You can use ave and cumsum (assuming your data is in df and sorted by value):
transform(df, cdf=ave(counts, type, FUN=function(x) cumsum(x) / sum(x)))
Here is a toy example:
df <- data.frame(counts=sample(1:100, 10), type=rep(letters[1:2], each=5))
transform(df, cdf=ave(counts, type, FUN=function(x) cumsum(x) / sum(x)))
that produces:
counts type cdf
1 55 a 0.2750000
2 61 a 0.5800000
3 27 a 0.7150000
4 20 a 0.8150000
5 37 a 1.0000000
6 45 b 0.1836735
7 79 b 0.5061224
8 12 b 0.5551020
9 63 b 0.8122449
10 46 b 1.0000000
If your data is in the data.frame DF, the following should do (cumsum has to be applied to the numeric counts, since the non-numeric type column cannot be summed; divide by the group totals if you want a normalised CDF):
do.call(rbind, lapply(split(DF$counts, DF$type), cumsum))
The HistogramTools package on CRAN has several functions for converting between histograms and CDFs, calculating information loss or error margins, and plotting the results.
If you have a histogram h, then calculating the empirical CDF of the underlying dataset is as simple as:
library(HistogramTools)
h <- hist(runif(100), plot=FALSE)
plot(HistToEcdf(h))
If you first need to convert your input data of breaks and counts into an R histogram object, see the PreBinnedHistogram function.
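For the data in the question, that might look like the following sketch (an assumption: the 18 type-a rows shown cover the integer values 0 to 17, so each integer gets its own bin and breaks runs one element longer than counts):
library(HistogramTools)
a <- df[df$type == "a", ]  # hypothetical: the question's value/counts/type data in df
h <- PreBinnedHistogram(breaks = 0:18, counts = a$counts)
plot(HistToEcdf(h))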
I searched SO, but I could not find code applicable to my question. It is similar to this one: Linear Regression calculation several times in one dataframe
I got a dataframe of LR coefficients following Andrie's code:
Cddply <- ddply(test, .(sumtest), function(test)coef(lm(Area~Conc, data=test)))
sumtest (Intercept) Conc
1 -108589.2726 846.0713372
2 -49653.18701 811.3982918
3 -102598.6252 832.6419926
4 -72607.4017 727.0765558
5 54224.28878 391.256075
6 -42357.45407 357.0845661
7 -34171.92228 367.3962888
8 -9332.569856 289.8631555
9 -7376.448899 335.7047756
10 -37704.92277 359.1457617
My question is how to apply each of these LR models (1-10) to specific row intervals in another dataframe, in order to get x, the independent variable, into a third column. For example, I would like to apply sumtest 1 to samples 6:29, sumtest 2 to samples 35:50, sumtest 3 to samples 56:79, and so on, in alternating intervals of 24 and 16 samples. The sample numbers repeat after 200, so sumtest 9 will again apply to samples 6:29.
Sample Area
6 236211
7 724919
8 1259814
9 1574722
10 268836
11 863818
12 1261768
13 1591845
14 220322
15 608396
16 980182
17 1415859
18 276276
19 724532
20 1130024
21 1147840
22 252051
23 544870
24 832512
25 899457
26 285093
27 4291007
28 825922
29 865491
35 246707
36 538092
37 767269
38 852410
39 269152
40 971471
41 1573989
42 1897208
43 261321
44 481486
45 598617
46 769240
47 229695
48 782691
49 1380597
50 1725419
The resulting dataframe would look like this:
Sample Area Calc
6 236211 407.5312917
7 724919 985.1525288
8 1259814 1617.363812
9 1574722 1989.564693
10 268836 446.0919309
...
35 246707 365.2452551
36 538092 724.3591324
37 767269 1006.805521
38 852410 1111.736505
39 269152 392.9073207
Thank you for your assistance.
Is this what you want? I made up a slightly larger dummy data set of 'area' to make it easier to see how the code works.
# create 400 rows of area data
set.seed(123)
df <- data.frame(area = round(rnorm(400, mean = 1000000, sd = 100000)))
# "sample numbers repeats after 200" -> add a sample nr 1-200, 1-200
df$sample_nr <- 1:200
# create a factor which cuts the vector of sample_nr into pieces of length 16, 24, 16, 24...
# repeat until the total length of the pieces is 200,
# i.e. 5 repeats of (16, 24)
grp <- cut(df$sample_nr, breaks = c(-Inf, cumsum(rep(c(16, 24), 5))))
# add a numeric version of the chunks to data frame
# this number indicates the model from which coefficients will be used
# row 1-16 (16 rows): model 1; row 17-40 (24 rows): model 2;
# row 41-56 (16 rows): model 3; and so on.
df$mod <- as.numeric(grp)
# read coefficients
coefs <- read.table(text = "intercept beta_conc
1 -108589.2726 846.0713372
2 -49653.18701 811.3982918
3 -102598.6252 832.6419926
4 -72607.4017 727.0765558
5 54224.28878 391.256075
6 -42357.45407 357.0845661
7 -34171.92228 367.3962888
8 -9332.569856 289.8631555
9 -7376.448899 335.7047756
10 -37704.92277 359.1457617", header = TRUE)
# add model number
coefs$mod <- rownames(coefs)
head(df)
head(coefs)
# join area data and coefficients by model number
# (use 'join' instead of merge to avoid sorting)
library(plyr)
df2 <- join(df, coefs)
# calculate conc from area and model coefficients
# area = intercept + beta_conc * conc
# conc = (area - intercept) / beta_conc
df2$conc <- (df2$area - df2$intercept) / df2$beta_conc
head(df2, 41)
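A quick consistency check (a sketch): plugging conc back into each line should reproduce the original area values exactly.
# area = intercept + beta_conc * conc, by construction
all.equal(df2$area, df2$intercept + df2$beta_conc * df2$conc)
# should be TRUE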
I have a dataframe like the one below. I need to extract subsets of it based on the regions that are available in RL.
>avg_data
region SN value
beta 1 32
alpha 2 44
beta 3 55
beta 4 60
atp 5 22
> RL
V1
1 beta
2 alpha
These subsets should go into something like REGR[beta], which should contain the beta-related rows as below:
region SN value
beta 1 32
beta 3 55
beta 4 60
Similarly for REGR[alpha]
region SN value
alpha 2 44
So that I can pass REGR as an argument for plotting graphs.
REGR <- data.frame()
for (i in levels(RL$V1)) {
  REGR[i, ] <- avg_data[avg_data$region == i, ]
}
I made some mistake in the above code; please correct me. Thank you!
The split function may be of interest to you. From the help page, split divides the data in the vector x into the groups defined by f.
So for your data, it may look something like:
> split(avg_data, avg_data$region)
$alpha
region SN value
2 alpha 2 44
$atp
region SN value
5 atp 5 22
$beta
region SN value
1 beta 1 32
3 beta 3 55
4 beta 4 60
If you want to filter out the records that do not occur in RL, I'd probably do that in a preprocessing step using the %in% function and [ for extraction:
x <- avg_data[avg_data$region %in% RL$V1,]
#-----
region SN value
1 beta 1 32
2 alpha 2 44
3 beta 3 55
That's what I'd feed to split if you want to drop atp.
The approach above may be overkill if you just want to plot. Here's an example using sapply to iterate over each level of region and make a plot:
sapply(unique(x$region), function(z)
  plot(x[x$region == z, "value"], main = z[1]))
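If you would rather build the named list structure the question asks for (a sketch; assumes region is character, or apply droplevels() first so unused factor levels don't create empty list elements):
REGR <- split(x, x$region)  # REGR$beta, REGR$alpha, ...
lapply(names(REGR), function(nm)
  plot(REGR[[nm]]$value, main = nm))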