How to write an equation as a function argument in R?

The function I created runs correctly on my computer. However, when I check for errors with devtools::check_built() I run into problems with the myeq argument. The letters d and h represent my columns simple1$`Diameter (cm)` and simple1$`Height (m)`, respectively.
indvol(x = simple1, mens="plot", myeq = 0.000065661*d^(2.475293)*h^(0.300022))
> head(simple1)
Plot Individual Specie Height (m) Diameter (cm)
1 1 1 Cariocar braziliense 7.5 22.60
2 1 1 Cariocar braziliense 7.5 25.78
3 1 1 Carioca braziliense 7.5 46.15
4 1 1 Carioca braziliense 7.5 9.55
5 1 2 Qualya parvifora 2.0 5.73
6 1 3 Magoni pubescens 4.0 5.73
The following error appears:
Error in indvol(x = simple1, mens = "plot", myeq = 6.5661e-05 * d^(2.475293) * :
object 'd' not found
Execution halted
The part dealing with the myeq argument within the function is:
if (mens == "plot") {
  colnames(x)[5] <- "d"
  colnames(x)[4] <- "h"
  d <- x[, 5]
  h <- x[, 4]
}
x$`Volume (m3)` <- eval(substitute(myeq), envir = x); x
Would anyone know how to fix this problem?

You have to capture the expression without evaluating it. You can do this with match.call():
indvol <- function(x, mens, myeq)
{
  mc <- as.list(match.call()[-1])
  if (mens == "plot")
  {
    colnames(x)[5] <- "d"
    colnames(x)[4] <- "h"
    d <- x[, 5]
    h <- x[, 4]
  }
  x$`Volume (m3)` <- eval(mc$myeq, envir = x)
  return(x)
}
So the function now works as expected:
indvol(x = simple1, mens="plot", myeq = 0.000065661*d^(2.475293)*h^(0.300022))
#> Plot Individual Specie h d Volume (m3)
#> 1 1 1 Cariocar braziliense 7.5 22.60 0.270184396
#> 2 1 1 Cariocar braziliense 7.5 25.78 0.374268984
#> 3 1 1 Carioca braziliense 7.5 46.15 1.581822308
#> 4 1 1 Carioca braziliense 7.5 9.55 0.032036171
#> 5 1 2 Qualya parvifora 2.0 5.73 0.006085243
#> 6 1 3 Magoni pubescens 4.0 5.73 0.007491927
As a design point though, it might be better to let users refer directly to the column names of the data frame passed into the function.
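For instance, a hypothetical variant (a sketch, not the original indvol()) could take the diameter and height column names as arguments and bind them to d and h when evaluating the captured equation:
indvol2 <- function(x, dcol, hcol, myeq) {
  mc <- as.list(match.call()[-1])
  # Evaluate the captured equation with d and h bound to the chosen columns
  vals <- list(d = x[[dcol]], h = x[[hcol]])
  x$`Volume (m3)` <- eval(mc$myeq, envir = vals, enclos = parent.frame())
  x
}
indvol2(simple1, dcol = "Diameter (cm)", hcol = "Height (m)",
        myeq = 0.000065661*d^(2.475293)*h^(0.300022))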
Data
simple1 <- structure(list(Plot = c(1L, 1L, 1L, 1L, 1L, 1L), Individual = c(1L,
1L, 1L, 1L, 2L, 3L), Specie = c("Cariocar braziliense", "Cariocar braziliense",
"Carioca braziliense", "Carioca braziliense", "Qualya parvifora",
"Magoni pubescens"), `Height (m)` = c(7.5, 7.5, 7.5, 7.5, 2, 4),
`Diameter (cm)` = c(22.6, 25.78, 46.15, 9.55, 5.73, 5.73)),
class = "data.frame", row.names = c("1", "2", "3", "4", "5", "6"))

Related

Sum rows of a list of data frames in R

I have a list of data frames. Each data frame is a stock quote whose row names are dates and whose columns are buy price, sell price, shares and PL.
I want to obtain a column that contains every PL's contribution to the total positive daily PL.
Making it simpler, I have the following data:
mylist= structure(list(`1` = structure(list(ID = c('2009-01-02', '2009-01-03', '2009-01-04', '2009-01-05'), Income = c(100, 200, 300, 400)), .Names = c("Date", "Income"), row.names = c(1L, 2L, 3L, 4L), class = "data.frame"), `2` = structure(list(ID = c('2009-01-02', '2009-01-03', '2009-01-04', '2009-01-05'), Income = c(500, -600, 700, 800)), .Names = c("Date", "Income"), row.names = c(1L, 2L, 3L, 4L), class = "data.frame"), `3` = structure(list(ID = c('2009-01-02', '2009-01-03', '2009-01-04'), Income = c(100, 200, 300)), .Names = c("Date", "Income"), row.names = c(1L, 2L, 3L), class = "data.frame")))
Which looks like this:
$`1`
Date Income
1 2009-01-02 100
2 2009-01-03 200
3 2009-01-04 300
4 2009-01-05 400
$`2`
Date Income
1 2009-01-02 500
2 2009-01-03 -600
3 2009-01-04 700
4 2009-01-05 800
$`3`
Date Income
1 2009-01-02 100
2 2009-01-03 200
3 2009-01-04 300
I want to obtain something that looks like this:
$`1`
Date Income Perc
1 2009-01-02 100 1.00
2 2009-01-03 200 0.20
3 2009-01-04 300 0.23
4 2009-01-05 400 0.33
$`2`
Date Income Perc
1 2009-01-02 500 0.83
2 2009-01-03 -600 -1.50
3 2009-01-04 700 0.54
4 2009-01-05 800 0.67
$`3`
Date Income Perc
1 2009-01-02 100 0.17
2 2009-01-03 200 0.20
3 2009-01-04 300 0.23
I have two solutions for your problem. I highly recommend combining your data frames into one master data frame to reduce the complexity of the code, if at all possible. I am sure there are better solutions to the "Separate Data Frames" problem, but most of them will involve multiple loops and thus hurt performance.
Data
mylist= structure(list(`1` = structure(list(ID = c('2009-01-02', '2009-01-03', '2009-01-04', '2009-01-05'), Income = c(100, 200, 300, 400)), .Names = c("Date", "Income"), row.names = c(1L, 2L, 3L, 4L), class = "data.frame"), `2` = structure(list(ID = c('2009-01-02', '2009-01-03', '2009-01-04', '2009-01-05'), Income = c(500, -600, 700, 800)), .Names = c("Date", "Income"), row.names = c(1L, 2L, 3L, 4L), class = "data.frame"), `3` = structure(list(ID = c('2009-01-02', '2009-01-03', '2009-01-04'), Income = c(100, 200, 300)), .Names = c("Date", "Income"), row.names = c(1L, 2L, 3L), class = "data.frame")))
Combined Data Frame
library(dplyr)
# add an ID to each data frame
for (i in seq_along(mylist)) {
  mylist[[i]] <- cbind(mylist[[i]], stock_id = names(mylist)[i])
}
# create data frame with all observations
my_data_frame <- do.call(rbind, mylist)
rownames(my_data_frame) <- NULL
my_data_frame %>%
  group_by(Date) %>%
  mutate(Perc = Income/sum(Income[Income > 0]))
# A tibble: 11 x 4
# Groups: Date [4]
Date Income stock_id Perc
<chr> <dbl> <chr> <dbl>
1 2009-01-02 100 1 0.143
2 2009-01-03 200 1 0.5
3 2009-01-04 300 1 0.231
4 2009-01-05 400 1 0.333
5 2009-01-02 500 2 0.714
6 2009-01-03 -600 2 -1.5
7 2009-01-04 700 2 0.538
8 2009-01-05 800 2 0.667
9 2009-01-02 100 3 0.143
10 2009-01-03 200 3 0.5
11 2009-01-04 300 3 0.231
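If you still need one data frame per stock afterwards, the combined result can be split straight back apart; a minimal sketch using base split() (perc is an illustrative name):
perc <- my_data_frame %>%
  group_by(Date) %>%
  mutate(Perc = Income/sum(Income[Income > 0])) %>%
  ungroup()
split(perc, perc$stock_id)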
Separate Data Frames
library(dplyr)
all_dates <- unique(unlist(lapply(mylist, function(x) unique(x$Date))))
for (i in seq_along(mylist)) {
  mylist[[i]] <- cbind(mylist[[i]], stock_id = names(mylist)[i])
}
perc_all <- list()
for (i in seq_along(all_dates)) {
  temporary <- lapply(mylist, function(x) x[x$Date == all_dates[i], ])
  all_obs_date <- do.call(rbind, temporary)
  all_obs_date$Perc <- all_obs_date$Income/sum(all_obs_date$Income[all_obs_date$Income > 0])
  perc_all[[i]] <- all_obs_date
}
perc_final <- do.call(rbind, perc_all)
lapply(mylist, function(x) {
  left_join(x, perc_final) %>% select(-stock_id)
})
$`1`
Date Income Perc
1 2009-01-02 100 0.1428571
2 2009-01-03 200 0.5000000
3 2009-01-04 300 0.2307692
4 2009-01-05 400 0.3333333
$`2`
Date Income Perc
1 2009-01-02 500 0.7142857
2 2009-01-03 -600 -1.5000000
3 2009-01-04 700 0.5384615
4 2009-01-05 800 0.6666667
$`3`
Date Income Perc
1 2009-01-02 100 0.1428571
2 2009-01-03 200 0.5000000
3 2009-01-04 300 0.2307692

Calculate total sum of squares between clusters in R

My objective is to compare which of the two clustering methods I've used, cluster_method_1 or cluster_method_2, has the larger between-cluster sum of squares, in order to identify which one achieves better separation.
I'm basically looking for an efficient way to calculate the distance between each point of cluster 1 and all points of cluster 2,3,4, and so on.
example dataframe:
structure(list(x1 = c(0.01762376, -1.147739752, 1.073605848,
2.000420899, 0.01762376, 0.944438811, 2.000420899, 0.01762376,
-1.147739752, -1.147739752), x2 = c(0.536193126, 0.885609849,
-0.944699546, -2.242627057, -1.809984553, 1.834120637, 0.885609849,
0.96883563, 0.186776403, -0.678508604), x3 = c(0.64707104, -0.603759684,
-0.603759684, -0.603759684, -0.603759684, 0.64707104, -0.603759684,
-0.603759684, -0.603759684, 1.617857394), x4 = c(-0.72712328,
0.72730861, 0.72730861, -0.72712328, -0.72712328, 0.72730861,
0.72730861, -0.72712328, -0.72712328, -0.72712328), cluster_method_1 = structure(c(1L,
3L, 3L, 3L, 2L, 2L, 3L, 2L, 1L, 4L), .Label = c("1", "2", "4",
"6"), class = "factor"), cluster_method_2 = structure(c(5L, 3L,
1L, 3L, 4L, 2L, 1L, 1L, 1L, 6L), .Label = c("1", "2", "3", "4",
"5", "6"), class = "factor")), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
x1 x2 x3 x4 cluster_method_1 cluster_method_2
<dbl> <dbl> <dbl> <dbl> <fct> <fct>
1 0.0176 0.536 0.647 -0.727 1 5
2 -1.15 0.886 -0.604 0.727 4 3
3 1.07 -0.945 -0.604 0.727 4 1
4 2.00 -2.24 -0.604 -0.727 4 3
5 0.0176 -1.81 -0.604 -0.727 2 4
6 0.944 1.83 0.647 0.727 2 2
7 2.00 0.886 -0.604 0.727 4 1
8 0.0176 0.969 -0.604 -0.727 2 1
9 -1.15 0.187 -0.604 -0.727 1 1
10 -1.15 -0.679 1.62 -0.727 6 6
The within sum-of-squares for a cluster S_i can be written as the sum of all pairwise squared Euclidean distances within the cluster, divided by twice the number of points: WSS(S_i) = sum_{x,y in S_i} ||x - y||² / (2 |S_i|) (see e.g. the Wikipedia article on k-means clustering).
For convenience we define a function calc_SS that returns the within sum-of-squares for a (numeric) data.frame:
calc_SS <- function(df) sum(as.matrix(dist(df)^2)) / (2 * nrow(df))
It's then straightforward to calculate the within-cluster sum-of-squares for every cluster of every method:
library(tidyverse)
df %>%
  gather(method, cluster, cluster_method_1, cluster_method_2) %>%
  group_by(method, cluster) %>%
  nest() %>%
  transmute(
    method,
    cluster,
    within_SS = map_dbl(data, ~calc_SS(.x))) %>%
  spread(method, within_SS)
## A tibble: 6 x 3
# cluster cluster_method_1 cluster_method_2
# <chr> <dbl> <dbl>
#1 1 1.52 9.99
#2 2 10.3 0
#3 3 NA 10.9
#4 4 15.2 0
#5 5 NA 0
#6 6 0 0
The total within sum-of-squares is then just the sum of the within sum-of-squares for every cluster
df %>%
  gather(method, cluster, cluster_method_1, cluster_method_2) %>%
  group_by(method, cluster) %>%
  nest() %>%
  transmute(
    method,
    cluster,
    within_SS = map_dbl(data, ~calc_SS(.x))) %>%
  group_by(method) %>%
  summarise(total_within_SS = sum(within_SS)) %>%
  spread(method, total_within_SS)
## A tibble: 1 x 2
# cluster_method_1 cluster_method_2
# <dbl> <dbl>
#1 27.0 20.9
By the way, we can confirm that calc_SS does indeed return the within sum-of-squares using the iris dataset:
set.seed(2018)
df2 <- iris[, 1:4]
kmeans <- kmeans(as.matrix(df2), 3)
df2$cluster <- kmeans$cluster
df2 %>%
  group_by(cluster) %>%
  nest() %>%
  mutate(within_SS = map_dbl(data, ~calc_SS(.x))) %>%
  arrange(cluster)
## A tibble: 3 x 3
# cluster data within_SS
# <int> <list> <dbl>
#1 1 <tibble [38 × 4]> 23.9
#2 2 <tibble [62 × 4]> 39.8
#3 3 <tibble [50 × 4]> 15.2
kmeans$withinss
#[1] 23.87947 39.82097 15.15100
The total sum of squares, sum_x sum_y ||x-y||², is constant.
It can be computed trivially from the variance.
If you now subtract the within-cluster sum of squares, where x and y belong to the same cluster, the between-cluster sum of squares remains.
With this approach, the computation takes O(n) time instead of O(n²).
Corollary: the solution with the smallest WCSS has the largest BCSS.
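A minimal sketch of that identity in R, reusing calc_SS() and the example df from the answer above (the tss line is the O(n), variance-based part; tss, wcss and bcss are illustrative names):
tss <- sum(scale(df[, 1:4], center = TRUE, scale = FALSE)^2)          # total SS from centered columns
wcss <- sum(sapply(split(df[, 1:4], df$cluster_method_1), calc_SS))   # sum of within-cluster SS
bcss <- tss - wcss                                                    # between-cluster SS by subtraction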
Consider the package clValid. It calculates a large number of indices for validating clustering. The Dunn index is particularly appropriate for what you are trying to do: the documentation describes it as the ratio of the smallest distance between observations not in the same cluster to the largest intra-cluster distance. The package documentation can be found at https://cran.r-project.org/web/packages/clValid/clValid.pdf.
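For example, a sketch against the example df from above (d_mat is an illustrative name; a higher Dunn index suggests better separation):
library(clValid)
d_mat <- dist(df[, 1:4])
dunn(distance = d_mat, clusters = as.integer(df$cluster_method_1))
dunn(distance = d_mat, clusters = as.integer(df$cluster_method_2))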

Error with a function inside lapply() in R

I'm having a very strange error in a script that used to work perfectly, and I don't know what the problem is. I start by creating a very long list, called lst, of data frames that all have the same number of columns. Then I want to build a summary table with means and SDs. Here is the script for that:
w1 <- lapply(lst, function(i) t(cbind(Mean = colMeans(i[, c(6,7,8,9)], na.rm = TRUE),
                                      Sds = colSds(as.matrix(i[, c(6,7,8,9)]), na.rm = TRUE),
                                      N = length(i[,2]),
                                      len.max = max(i[,6]))))
The column numbers are correct. However, when I run the script I am first dropped into the debugger, and when I stop it I get this error message:
Error in t(cbind(Mean = colMeans(i[, c(6, 7, 8, 9)], na.rm = TRUE), Sds = colSds(as.matrix(i[, :
error in evaluating the argument 'x' in selecting a method for function 't': Error in `[.data.frame`(i, , c(6, 7, 8, 9)) : undefined columns selected
I don't know what's wrong with the function. I searched the internet and saw a suggestion to change as.matrix to data.matrix, but that didn't do the trick.
Indeed, I get the same problem with another, very similar function:
a1 <- lapply(lst, function(i) t(cbind(l1 = NROW(which(i[,6]>1)),
                                      l1.05 = NROW(which(i[,6]<=1)) - NROW(which(i[,6]>0.5)),
                                      l05.03 = NROW(which(i[,6]>0.3)) - NROW(which(i[,6]<=0.5)),
                                      l03 = NROW(which(i[,6]<=0.3)))))
With the same outcome:
Error in t(cbind(l1 = NROW(which(i[, 6] > 1)), l1.05 = NROW(which(i[, :
error in evaluating the argument 'x' in selecting a method for function 't': Error in `[.data.frame`(i, , 6) : undefined columns selected
Can someone point out what the problem is? Do you need some data? Thanks!
I'm working with the latest RStudio and with the following packages:
plyr, matrixStats, dplyr
Here is an example of the list:
> lst
[[1]]
X Chr new pos1 pos2 len nsnp n.ind per.ind
1 1 1 1 12900000 13700000 0.9 284.7560 23.77778 7.952434
2 2 1 2 17000000 17300000 0.4 126.5582 16.00000 5.351171
3 3 1 3 21200000 21500000 0.4 126.5582 40.75000 13.628763
4 4 1 4 45300000 45700000 0.5 158.1978 23.20000 7.759197
5 5 1 5 45900000 46600000 0.8 253.1165 31.12500 10.409699
[[2]]
X Chr new pos1 pos2 len nsnp n.ind per.ind
1 1 1 1 12900000 13700000 0.9 312.90267 24.44444 4.288499
2 2 1 2 21200000 21500000 0.4 139.06785 38.00000 6.666667
3 3 1 3 32600000 33000000 0.5 173.83482 28.40000 4.982456
4 4 1 4 35800000 36100000 0.4 139.06785 37.25000 6.535088
5 5 1 5 36300000 36300000 0.1 34.76696 22.00000 3.859649
[[3]]
X Chr new pos1 pos2 len nsnp n.ind per.ind
1 1 1 1 35700000 36500000 0.9 287.4214 12.22222 11.42264
2 2 1 2 45900000 46600000 0.8 255.4857 12.50000 11.68224
3 3 1 3 49400000 50700000 1.4 447.1000 21.78571 20.36048
4 4 1 4 51000000 52000000 1.1 351.2929 16.00000 14.95327
5 5 1 5 52200000 53000000 0.9 287.4214 19.66667 18.38006
dput(lst[1:3])
list(structure(list(X = 1:5, Chr = c(1L, 1L, 1L, 1L, 1L), new = 1:5,
pos1 = c(12900000, 1.7e+07, 21200000, 45300000, 45900000),
pos2 = c(13700000, 17300000, 21500000, 45700000, 46600000
), len = c(0.9, 0.4, 0.4, 0.5, 0.8), nsnp = c(284.756031128405,
126.558236057069, 126.558236057069, 158.197795071336, 253.116472114137
), n.ind = c(23.7777777777778, 16, 40.75, 23.2, 31.125),
per.ind = c(7.95243403939056, 5.35117056856187, 13.628762541806,
7.75919732441472, 10.4096989966555)), .Names = c("X", "Chr",
"new", "pos1", "pos2", "len", "nsnp", "n.ind", "per.ind"), row.names = c(NA,
5L), class = "data.frame"), structure(list(X = 1:5, Chr = c(1L,
1L, 1L, 1L, 1L), new = 1:5, pos1 = c(12900000, 21200000, 32600000,
35800000, 36300000), pos2 = c(13700000, 21500000, 3.3e+07, 36100000,
36300000), len = c(0.9, 0.4, 0.5, 0.4, 0.1), nsnp = c(312.90267141585,
139.0678539626, 173.83481745325, 139.0678539626, 34.76696349065
), n.ind = c(24.4444444444444, 38, 28.4, 37.25, 22), per.ind = c(4.28849902534113,
6.66666666666667, 4.98245614035088, 6.53508771929825, 3.85964912280702
)), .Names = c("X", "Chr", "new", "pos1", "pos2", "len", "nsnp",
"n.ind", "per.ind"), row.names = c(NA, 5L), class = "data.frame"),
structure(list(X = 1:5, Chr = c(1L, 1L, 1L, 1L, 1L), new = 1:5,
pos1 = c(35700000, 45900000, 49400000, 5.1e+07, 52200000
), pos2 = c(36500000, 46600000, 50700000, 5.2e+07, 5.3e+07
), len = c(0.9, 0.8, 1.4, 1.1, 0.9), nsnp = c(287.421428571429,
255.485714285714, 447.1, 351.292857142857, 287.421428571429
), n.ind = c(12.2222222222222, 12.5, 21.7857142857143,
16, 19.6666666666667), per.ind = c(11.4226375908619,
11.6822429906542, 20.3604806408545, 14.9532710280374,
18.380062305296)), .Names = c("X", "Chr", "new", "pos1",
"pos2", "len", "nsnp", "n.ind", "per.ind"), row.names = c(NA,
5L), class = "data.frame"))
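Since `undefined columns selected` means at least one element of lst has fewer columns than the indices c(6,7,8,9) require, a quick diagnostic sketch is to tabulate the column counts (ncols is an illustrative name):
ncols <- sapply(lst, ncol)
table(ncols)
which(ncols < 9)  # indices of the offending data frames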

R - ddply summarise using nlevels() does not work

When using the plyr package to summarise my data, it seems impossible to use the nlevels() function.
The structure of my data set is as follows:
>aer <- read.xlsx("XXXX.xlsx", sheetIndex=1)
>aer$ID <- as.factor(aer$ID)
>aer$description <- as.factor(aer$description)
>head(aer)
ID SOC start end days count severity relation
1 1 410 2015-04-21 2015-04-28 7 1 1 3
2 1 500 2015-01-30 2015-05-04 94 1 1 3
3 1 600 2014-11-25 2014-11-29 4 1 1 3
4 1 600 2015-01-02 2015-01-07 5 1 1 3
5 1 600 2015-01-26 2015-03-02 35 1 1 3
6 1 600 2015-04-14 2015-04-17 3 1 1 3
> dput(head(aer,4))
structure(list(ID = structure(c(1L, 1L, 1L, 1L), .Label = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "12", "13", "14",
"15"), class = "factor"), SOC = c(410, 500, 600, 600),
start = structure(c(16546, 16465, 16399, 16437), class = "Date"),
end = structure(c(16553, 16559, 16403, 16442), class = "Date"),
days = c(7, 94, 4, 5), count = c(1, 1, 1, 1), severity = c(1,
1, 1, 1), relation = c(3, 3, 3, 3)), .Names = c("ID", "SOC",
"description", "start", "end", "days", "count", "severity", "relation"
), row.names = c(NA, 4L), class = "data.frame")
What I would like to know is how many levels of the ID variable exist in each of the data sections created when dividing the data set by the variable SOC. I want to summarise this information together with some other variables in a new data set. Therefore, I would like to use the plyr package like so:
summaer2 <- ddply(aer, c("SOC"), summarise,
                  participants = nlevels(ID),
                  events = sum(count),
                  min_duration = min(days),
                  max_duration = max(days),
                  max_severity = max(severity))
This returns the following error:
Error in Summary.factor(c(4L, 5L, 11L, 11L, 14L, 14L), na.rm = FALSE) :
‘max’ not meaningful for factors
Could someone give me advice on how to reach my goal, or tell me what I'm doing wrong?
Many thanks in advance!
Update:
Substituting nlevels(ID) with length(unique(ID)) gives me the desired output. (nlevels() counts all declared factor levels, including those not observed in a subgroup, whereas length(unique()) counts only the values actually present.)
> head(summaer2)
SOC participants events min_duration max_duration max_severity
1 100 4 7 1 62 2
2 410 9 16 1 41 2
3 431 2 2 109 132 1
4 500 5 9 23 125 2
5 600 8 19 1 35 1
6 1040 1 1 98 98 2
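For comparison, the same summary with dplyr (a sketch assuming the same aer data frame; n_distinct() plays the role of length(unique())):
library(dplyr)
aer %>%
  group_by(SOC) %>%
  summarise(participants = n_distinct(ID),
            events = sum(count),
            min_duration = min(days),
            max_duration = max(days),
            max_severity = max(severity))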

Using ddply to find the correlation of a data frame for separate groups in R [duplicate]

This question already has answers here:
spearman correlation by group in R
(4 answers)
Closed 8 years ago.
ds <- structure(list(GPA = c(1.78, 2.38, 2.43, 1.98, 1.56, 2.32, 1.96,
2.73, 2, 3.59), STUDY_STAGE = c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L), OLAGG = c(18, 14, 14, 17, 17, 16, 16, 15, 14, 15)), .Names = c("GPA",
"STUDY_STAGE", "OLAGG"), row.names = c(NA, 10L), class = "data.frame")
I've made reference to this post
spearman correlation by group in R
However, when I attempted to find the correlation within each sub group of STUDY_STAGE (there are 3), I obtained the same value for all of them.
ddply(ds,.(STUDY_STAGE), summarise, cor(ds$GPA, ds$OLAGG, method = "spearman"))
STUDY_STAGE ..1
1 1 -0.2805924
2 2 -0.2805924
3 3 -0.2805924
Additional information on the data frame:
str(ds)
'data.frame': 3167 obs. of 3 variables:
$ GPA : num 1.78 2.38 2.43 1.98 1.56 2.32 1.96 2.73 2 3.59 ...
$ STUDY_STAGE: int 3 3 3 3 3 3 3 3 3 3 ...
$ OLAGG : num 18 14 14 17 17 16 16 15 14 15 ...
Just to show that they should have different correlation values:
ds.yr1<-ds[ds$STUDY_STAGE=="Yr 1",]
cor(ds.yr1$GPA, ds.yr1$OLAGG)
[1] -0.3313926
ds.yr2<-ds[ds$STUDY_STAGE=="Yr 2",]
cor(ds.yr2$GPA, ds.yr2$OLAGG)
[1] -0.2905399
Full data is available here: https://dl.dropboxusercontent.com/u/64487083/R/mydata.csv
Question:
How can I find the correlation for all the 3 different study_stage?
Thank you all for your time and effort!
By using ds$GPA and ds$OLAGG, we are calculating the correlation over the whole columns instead of by group. Inside summarise, refer to the bare column names so that ddply supplies each group's subset:
ds <- read.csv("mydata.csv") #full data from the link
cor(ds$GPA, ds$OLAGG, method='spearman')
#[1] -0.2805924
ddply(ds,.(STUDY_STAGE), summarise, Cor=cor(GPA, OLAGG, method = "spearman"))
# STUDY_STAGE Cor
#1 Yr 1 -0.3337192
#2 Yr 2 -0.2803793
#3 Yr 3 -0.2090219
cor(ds.yr1$GPA, ds.yr1$OLAGG, method='spearman')
#[1] -0.3337192
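The same per-group computation with dplyr, for reference (a sketch assuming the same ds):
library(dplyr)
ds %>%
  group_by(STUDY_STAGE) %>%
  summarise(Cor = cor(GPA, OLAGG, method = "spearman"))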
