I've been trying to create an indexed stock chart as part of a project while learning R. I'd like to plot indexed values, so I want to create a vector of indexed values for each of my stocks. I tried the following:
indeksih <- apply(kombo, huhtamaki, FUN = huhtamaki/huhtamaki[1])
However, this gives me Error in Ops.data.frame(huhtamaki, huhtamaki[1]) :
‘/’ only defined for equally-sized data frames
This is what my data looks like:
head(kombo)
Date Huhtamaki Sampo Kone
1 2019-12-30 41.38 38.91 58.28
2 2019-12-27 41.84 39.07 59.14
3 2019-12-23 41.66 39.13 59.02
4 2019-12-20 41.57 39.22 59.06
5 2019-12-19 40.69 38.99 58.32
6 2019-12-18 40.74 38.41 57.68
We can use
indexksi <- kombo$Huhtamaki/kombo$Huhtamaki[1]
Simply dividing the column by the first element of the column:
kombo[,"Huhtamaki"]/kombo[1, "Huhtamaki"]
If you want to do it on many columns a data.table approach can be useful
library(data.table)
setDT(kombo)
kombo[, lapply(.SD, function(x) x/x[1]), .SDcols = setdiff(names(kombo), "Date")]
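For comparison, the same idea can be sketched in base R without any packages. This is only an illustration: the three-row `kombo` below is copied from the head() output in the question, and `price_cols` and `indexed` are names made up here. Note that the rows are sorted newest first, so `x[1]` is the most recent price; sort by Date first if the index should start from the oldest observation.

```r
# Toy copy of the first three rows shown in the question
kombo <- data.frame(
  Date = as.Date(c("2019-12-30", "2019-12-27", "2019-12-23")),
  Huhtamaki = c(41.38, 41.84, 41.66),
  Sampo = c(38.91, 39.07, 39.13),
  Kone = c(58.28, 59.14, 59.02)
)

# Every column except Date is treated as a price series (assumption)
price_cols <- setdiff(names(kombo), "Date")

# Divide each price column by its own first element
indexed <- kombo
indexed[price_cols] <- lapply(kombo[price_cols], function(x) x / x[1])
```

The first row of every price column in `indexed` is then 1, and the Date column is left untouched.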
I suspect I'm Doing It Wrong, but I'd like to pass a character vector as an argument to a function in ddply. There's a lot of Q&A on removing quotes, etc. but none of it seems to work for me (eg. Remove quotes from a character vector in R and http://r.789695.n4.nabble.com/Pass-character-vector-to-function-argument-td3045226.html).
# reproducible data
df1<-data.frame(a=sample(1:50,10),b=sample(1:50,10),c=sample(1:50,10),d=(c("a","b","c","a","a","b","b","a","c","d")))
df2<-data.frame(a=sample(1:50,9),b=sample(1:50,9),c=sample(1:50,9),d=(c("e","f","g","e","e","f","f","e","g")))
df3<-data.frame(a=sample(1:50,8),b=sample(1:50,8),c=sample(1:50,8),d=(c("h","i","j","h","h","i","i","h")))
#make a list
list.1<-list(df1=df1,df2=df2,df3=df3)
# desired output
lapply(list.1, function(x) ddply(x, .(d), function(x) data.frame(am=mean(x$a), bm=mean(x$b), cm=mean(x$c))))
$df1
d am bm cm
1 a 31.00000 29.25000 18.50000
2 b 31.66667 24.33333 34.66667
3 c 18.50000 5.50000 24.50000
4 d 36.00000 39.00000 43.00000
$df2
d am bm cm
1 e 18.25000 32.50000 18
2 f 27.66667 41.33333 24
3 g 25.00000 7.50000 42
$df3
d am bm cm
1 h 36.00000 25.00000 20.50000
2 i 25.33333 37.33333 24.33333
3 j 32.00000 32.00000 46.00000
But my actual use-case has many new columns and different types of calculations that I want to calculate in the ddply function. So I want to do something like:
# here's a simple version of a function that I want to send to ddply
func <- "am=mean(x$a), bm=mean(x$b), cm=mean(x$c)"
# here's how I imagine it might work
lapply(list.1, function(x) ddply(x, .(d), function(x) data.frame(func)) )
# not the desired outcome...
$df1
d func
1 a am=mean(x$a), bm=mean(x$b), cm=mean(x$c)
2 b am=mean(x$a), bm=mean(x$b), cm=mean(x$c)
3 c am=mean(x$a), bm=mean(x$b), cm=mean(x$c)
4 d am=mean(x$a), bm=mean(x$b), cm=mean(x$c)
$df2
d func
1 e am=mean(x$a), bm=mean(x$b), cm=mean(x$c)
2 f am=mean(x$a), bm=mean(x$b), cm=mean(x$c)
3 g am=mean(x$a), bm=mean(x$b), cm=mean(x$c)
$df3
d func
1 h am=mean(x$a), bm=mean(x$b), cm=mean(x$c)
2 i am=mean(x$a), bm=mean(x$b), cm=mean(x$c)
3 j am=mean(x$a), bm=mean(x$b), cm=mean(x$c)
I've tried noquote, deparse, eval(as.symbol()), do.call(data.frame, ...) and some of the methods here: https://github.com/hadley/devtools/wiki/Evaluation on func to no avail. The solution might be obvious at this point (ie. melt everything!), but in case it's not, here's a longer example that's closer to my use case:
# sample data
s <- 23 # number of samples
r <- 10 # number of runs per sample
el <- 17 # number of elements
mydata <- data.frame(ID = unlist(lapply(LETTERS[1:s], function(x) rep(x, r))),
run = rep(1:r, s))
# insert fake element data
mydata[letters[1:el]] <- lapply(1:el, function(i) rnorm(s*r, runif(1)*i^2))
# generate all combinations of 5 runs from ten runs
su <- 5 # number of runs to sample from ten runs
idx <- combn(unique(mydata$run), su)
# RSE function
RSE <- function(x) {100*( (sd(x)/sqrt(length(x)))/mean(x) )}
# make a list of dfs for all samples for each combination of five runs
# to prepare to calculate RSEs
combys1 <- lapply(1:ncol(idx), function(i) mydata[mydata$run %in% idx[,i],] )
# make a list of dfs with RSE for each ID, for each combination of runs
combys2 <- lapply(1:length(combys1), function(i) ddply(combys1[[i]], "ID", summarise, RSEa=RSE(a), RSEb=RSE(b), RSEc=RSE(c), meana=mean(a), meanb=mean(b), meanc=mean(c)))
I want to replace RSEa=RSE(a), RSEb=RSE(b), RSEc=RSE(c), meana=mean(a), meanb=mean(b), meanc=mean(c) in the last line above with the object doRSE from here, to avoid lots of typing:
# prepare to calculate new columns with RSE and means
RSEs <- sapply(3:ncol(mydata), function(j) paste0("RSE",names(mydata[j])))
RSExs <- sapply(3:ncol(mydata), function(j) paste0("RSE(",names(mydata[j]),")"))
doRSE <- paste0(sapply(1:length(RSEs), function(x) paste0(RSEs[x],"=",RSExs[x])), collapse=",", sep="")
I'm open to solutions involving base, data.table and dirty tricks. Seems like these are close to what I want, but I can't quite translate them to my problem:
Pass character argument and evaluate,
Force evaluation of multiple variables using vector of character,
Using a vector of characters that correspond to an expression as an argument to a function
UPDATE Here's the catch: I want to be able to modify the func in the simple example (or doRSE in my use-case) to create a bunch of new columns that result from various calculations on the existing columns to explore the data. I want a workflow that allows the resulting dataframes to have new columns that were not in the original dataframes. Sorry that wasn't more clear in the original question. I can't see how to adapt @Marius' answer to do this, but @mnel's is helpful (see update below)
Working through @mnel's excellent dirty tricks, with some minor fixes I can get the desired result on my use-case:
# @mnel's solution, adapted (no period before eval)
combys2 <- lapply(combys1, function(x) do.call(ddply,c(.data = quote(x),
.variables = quote(.(ID)), .fun = quote(summarize),
eval(parse(text = sprintf('.(%s)', doRSE ))))))
head(combys2)
[[1]]
ID RSEa RSEb RSEc RSEd RSEe RSEf RSEg RSEh RSEi
1 A 168.30658 21.68632 5.657228 5.048057 4.162017 2.9581874 1.849009 0.6925148 0.4393491
2 B 26.55071 26.20427 4.782578 4.385409 2.342764 2.1813874 2.719625 1.1576681 0.6427935
3 C 73.83165 14.47216 8.154435 6.273202 3.046978 1.2179457 2.811405 1.1401837 0.8167067
4 D 31.96170 57.89260 9.438220 7.388410 3.755772 0.8601780 3.724875 0.8358204 0.9939387
5 E 63.22537 60.35532 5.839690 11.691304 3.828430 0.9217787 4.204300 0.8217187 0.7876634
6 F 56.37635 65.37907 4.149568 5.496308 2.227544 2.1548455 2.847291 1.1956212 0.2506518
7 G 69.32232 23.63214 4.255847 7.979225 4.917660 1.6185960 3.156521 0.3265555 0.8133279
8 H 29.82015 40.74184 7.372100 7.464792 2.749862 0.6054420 4.061368 0.9973909 1.3807720
9 I 50.58114 19.53732 2.989920 9.767678 4.000249 1.7451322 1.175397 0.9952093 0.9095086
10 J 92.96462 39.77475 6.140688 10.295668 3.407726 2.4663758 3.030444 0.5743419 0.9296482
11 K 90.72381 42.25092 2.483069 6.781054 3.142082 1.8080633 2.891740 1.1996176 0.8525290
12 L -385.24547 40.81267 4.506087 8.148382 2.976488 0.8304432 2.234134 0.2108664 0.4979777
13 M 22.77743 33.98332 2.913926 8.764639 2.307293 0.8366635 3.229944 1.0003125 0.3878567
14 N 66.75163 34.16087 6.611326 13.865377 1.285522 1.3863958 4.165575 0.7379386 0.4515194
15 O 37.37188 100.57479 5.738877 5.724862 2.839638 1.1366610 3.186332 0.7383855 0.3954544
16 P 17.08913 26.62210 6.060130 4.110893 2.688908 2.6970727 1.609043 1.3860834 0.8780010
17 Q 13.96392 74.92279 5.469304 8.467638 2.974131 1.2135436 3.284564 0.6232778 1.0759226
18 R 42.59899 30.75952 4.842832 8.764158 1.874020 1.5791048 3.427342 1.4479638 0.2964455
19 S 26.03307 15.56352 6.968717 7.783876 4.439733 2.0764179 4.683080 0.7459654 1.1268772
20 T 71.57945 33.81362 7.147049 11.201551 2.128315 2.2051611 2.419805 0.2688807 1.1559635
21 U 73.93002 11.77155 7.738910 7.207041 1.478491 1.4409844 4.042419 0.5883490 0.5585716
22 V 67.93166 39.54994 5.701551 8.636122 2.472963 1.6514199 2.627965 1.0359048 0.8747136
23 W 11.23057 12.51272 7.003448 7.424559 4.102693 0.6614847 2.246305 1.3422405 0.2665246
RSEj RSEk RSEl RSEm RSEn RSEo RSEp RSEq
1 0.6366733 0.3713819 2.1993487 0.3865293 0.5436581 0.9187585 0.4344699 0.8915868
2 0.3445095 0.2932025 1.8563179 0.5397595 1.0433388 0.3533622 0.1942316 0.1941072
3 0.2720344 0.5507595 2.0305726 0.4377259 0.8589854 0.5690906 0.1397337 0.4043247
4 0.6606667 0.6769112 3.4737352 0.5674656 1.2519256 0.8718298 0.1162969 0.8287504
5 0.4620774 0.5598069 1.9236112 0.7990046 0.9832732 0.6847352 0.4070675 0.9005185
6 0.7981610 0.4005493 0.9721068 0.2770989 1.7054674 0.3110139 0.4521183 0.8740444
7 0.3969116 0.4717575 4.1341106 0.7510628 0.9998299 0.5342292 0.4319642 1.1861705
8 0.2963956 0.2652221 0.4775827 0.2617120 0.8261874 0.5266087 0.1900943 0.2350553
9 0.2609359 0.5431035 2.6478440 0.1606919 0.7407281 0.6802262 0.1802069 0.7438792
10 0.4239787 0.8753544 3.4218030 0.5467869 0.7404017 0.5581173 0.3682014 0.6361436
11 0.4188502 0.8629862 4.4181479 0.1623873 0.8018811 0.5873609 0.3592134 0.5357984
12 0.5790265 0.5009210 3.7534287 0.1933726 0.5809601 0.5777868 0.3400925 0.4783890
13 0.3562582 0.2552756 2.1393219 0.1849345 0.5796194 0.6129469 0.3363311 0.4382125
14 0.7921502 0.6147990 2.9054634 0.5852325 1.4954072 0.9983203 0.2937837 0.7654504
15 0.5840424 0.2757707 1.5695675 0.3305385 0.8712636 0.5816490 0.1985457 0.7213289
16 0.3301280 0.3008273 2.9014987 0.4540833 0.5966479 0.9042004 0.1631630 0.7262141
17 0.5882511 0.2820978 3.0652666 0.4518936 1.3168151 0.4749311 0.2244693 0.6583083
18 0.4048816 0.3708787 3.2207478 0.2603412 1.3168318 0.3318745 0.3120436 0.6210711
19 0.4425123 0.3602076 3.7609863 0.5399527 0.8302572 0.3246904 0.1952143 0.2915325
20 0.5877835 0.6339015 1.6908570 0.3223056 0.5239339 0.6607198 0.2808094 0.3697380
21 0.4454056 0.7733354 4.3433420 0.4391075 0.5503594 0.5893406 0.2262403 0.2361512
22 0.9583940 0.6365843 3.0033951 0.6507968 0.8610046 0.6363198 0.2866719 0.5736855
23 0.4969730 0.3895182 2.0021608 0.3354475 1.4398250 0.7386870 0.2458906 0.3414804
...
...
You can do some ugly computing on the language using quote and plyr's .() function.
Reading https://github.com/hadley/devtools/wiki/Computing-on-the-language will probably help you understand whether you really want to do this.
Anyway, one approach is to use .() to create your list of arguments and rely on how summarize works, e.g.
.(am=mean(a), bm=mean(b), cm=mean(c))
and if you really wanted to use a character string
foo<- "am=mean(a), bm=mean(b), cm=mean(c)"
eval(parse(text = sprintf('.(%s)', foo )))
Use quote liberally to create your list to be passed to do.call
for example
lapply(list.1, function(x) do.call(ddply,c(.data = quote(x),
.variables = quote(.(d)), .fun = quote(summarize),
.(am=mean(a), bm=mean(b), cm=mean(c)))))
Oh boy is that ugly.
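To see what the eval(parse(...)) step does on its own, here is a minimal base R sketch. It is not the plyr call above: it wraps the string in list() instead of .() and evaluates it directly inside a toy data frame `df` (made up for illustration). Since the string is executed as code, this should only be done with strings you wrote yourself.

```r
# The calculations, written as a single character string
foo <- "am = mean(a), bm = mean(b)"

# Toy data frame whose columns the string refers to (made up for illustration)
df <- data.frame(a = c(1, 3), b = c(2, 6))

# Build the text "list(am = mean(a), bm = mean(b))", parse it into an
# expression, and evaluate it with the columns of df visible as variables
res <- eval(parse(text = sprintf("list(%s)", foo)), envir = df)
```

`res` is a named list with one element per calculation, which is the shape that summarize-style functions build their new columns from.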
Or, you could use data.tables
library(data.table)
listDT <- lapply(list.1, data.table)
lapply(listDT, function(x) x[,lapply(.SD, mean), by = 'd'])
or
mystuff <- sprintf('list(%s)', foo)
lapply(listDT, function(x) x[, eval(parse(text = mystuff)), by = 'd'])
However, if you had all the same columns in all your data.tables, it would be more efficient to create one large data.table (with an identifier for each element of the list) and work on that.
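A minimal sketch of that last suggestion, assuming the list elements share the same columns. The two small data frames and the column name `source` are made up for illustration; rbindlist() with idcol fills the identifier from the names of the list.

```r
library(data.table)

# Toy list of data frames with identical columns (made up for illustration)
list.1 <- list(
  df1 = data.frame(a = c(1, 3, 5), d = c("x", "x", "y")),
  df2 = data.frame(a = c(2, 4), d = c("z", "z"))
)

# Stack everything into one data.table; 'source' records which list
# element each row came from
big <- rbindlist(list.1, idcol = "source")

# One grouped call replaces the lapply over the list
res <- big[, .(am = mean(a)), by = .(source, d)]
```

The result has one row per combination of list element and group, so it can be split back by `source` if separate tables are still needed.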
Here's a ddply function that calculates the mean for all the columns that aren't d in your dataframes:
lapply(list.1,
function(x) {
ddply(
x,
.(d),
function(df_part) {
result_df <- data.frame(d=df_part$d[1])
non_d_cols <- colnames(df_part)[! colnames(df_part) == "d"]
for (col in non_d_cols) {
col_mean <- mean(df_part[[col]])
col_name <- paste0(col, "_mean")
result_df[[col_name]] <- col_mean
}
return(result_df)
})
})
That seems to me like the simplest way to do it, and it should generalize well to other calculations you might want to do on those columns. Maybe you could pass in a character vector argument of the columns you want to calculate the mean for, and use that in place of non_d_cols.
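That last idea could look roughly like the sketch below. The function name `means_by_group` and its arguments `cols` and `group` are hypothetical, and the small data frame is made up for illustration; only ddply itself comes from plyr.

```r
library(plyr)

# Hypothetical helper: mean of the columns named in 'cols', per level of 'group'
means_by_group <- function(df, cols, group = "d") {
  ddply(df, group, function(part) {
    out <- lapply(part[cols], mean)          # one mean per requested column
    names(out) <- paste0(cols, "_mean")      # e.g. a -> a_mean
    as.data.frame(out)                       # one-row result for this group
  })
}

# Toy data (made up for illustration)
df <- data.frame(a = c(1, 3, 5), b = c(2, 4, 6), d = c("x", "x", "y"))
res <- means_by_group(df, c("a", "b"))
```

To apply it to every data frame in a list, something like lapply(list.1, means_by_group, cols = c("a", "b", "c")) should work, and swapping mean for another function changes the calculation in one place.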
I'm new to R, and I wrote some code to summarize data from .csv file according to my needs.
here is the code.
raw <- read.csv("trees.csv")
looks like this
SNAME CNAME FAMILY PLOT INDIVIDUAL CAP H
1 Alchornea triplinervia (Spreng.) M. Arg. Tainheiro Euphorbiaceae 5 176 15 9.5
2 Andira fraxinifolia Benth. Angelim Fabaceae 3 321 12 6.0
3 Andira fraxinifolia Benth. Angelim Fabaceae 3 326 14 7.0
4 Andira fraxinifolia Benth. Angelim Fabaceae 3 327 18 5.0
5 Andira fraxinifolia Benth. Angelim Fabaceae 3 328 12 6.0
6 Andira fraxinifolia Benth. Angelim Fabaceae 3 329 21 7.0
#add 2 other columns
for (i in 1:nrow(raw)) {
raw$VOLUME[i] <- treeVolume(raw$CAP[i],raw$H[i])
raw$BASALAREA[i] <- treeBasalArea(raw$CAP[i])
}
#here comes.
I need a new data frame, with the mean of columns H and CAP and the sums of columns VOLUME and BASALAREA. This dataframe is grouped by column SNAME and subgrouped by column PLOT.
plotSummary = merge(
aggregate(raw$CAP ~ raw$SNAME * raw$PLOT, raw, mean),
aggregate(raw$H ~ raw$SNAME * raw$PLOT, raw, mean))
plotSummary = merge(
plotSummary,
aggregate(raw$VOLUME ~ raw$SNAME * raw$PLOT, raw, sum))
plotSummary = merge(
plotSummary,
aggregate(raw$BASALAREA ~ raw$SNAME * raw$PLOT, raw, sum))
The functions treeVolume and treeBasalArea just return numbers.
treeVolume <- function(radius, height) {
return (0.000074230*radius**1.707348*height**1.16873)
}
treeBasalArea <- function(radius) {
return (((radius**2)*pi)/40000)
}
I'm sure that there is a better way of doing this, but how?
I can't manage to read your example data in, but I think I've made something that generally represents it...so give this a whirl. This answer builds off of Greg's suggestion to look at plyr and the functions ddply to group by segments of your data.frame and numcolwise to calculate your statistics of interest.
#Sample data
set.seed(1)
dat <- data.frame(sname = rep(letters[1:3],2), plot = rep(letters[1:3],2),
CAP = rnorm(6),
H = rlnorm(6),
VOLUME = runif(6),
BASALAREA = rlnorm(6)
)
#Calculate mean for all numeric columns, grouping by sname and plot
library(plyr)
ddply(dat, c("sname", "plot"), numcolwise(mean))
#-----
sname plot CAP H VOLUME BASALAREA
1 a a 0.4844135 1.182481 0.3248043 1.614668
2 b b 0.2565755 3.313614 0.6279025 1.397490
3 c c -0.8280485 1.627634 0.1768697 2.538273
EDIT - response to updated question
Ok - now that your question is more or less reproducible, here's how I'd approach it. First of all, you can take advantage of the fact that R is vectorized, meaning that you can calculate ALL of the values for VOLUME and BASALAREA in one pass, without looping through each row. For that bit, I recommend the transform function:
dat <- transform(dat, VOLUME = treeVolume(CAP, H), BASALAREA = treeBasalArea(CAP))
Secondly, realizing that you intend to calculate different statistics for CAP & H and then VOLUME & BASALAREA, I recommend using the summarize function, like this:
ddply(dat, c("sname", "plot"), summarize,
meanCAP = mean(CAP),
meanH = mean(H),
sumVOLUME = sum(VOLUME),
sumBASAL = sum(BASALAREA)
)
Which will give you an output that looks like:
sname plot meanCAP meanH sumVOLUME sumBASAL
1 a a 0.5868582 0.5032308 9.650184e-06 7.031954e-05
2 b b 0.2869029 0.4333862 9.219770e-06 1.407055e-05
3 c c 0.7356215 0.4028354 2.482775e-05 8.916350e-05
The help pages for ?ddply, ?transform, ?summarize should be insightful.
Look at the plyr package. It will split the data by the SNAME variable for you, then you give it code to do the set of summaries that you want (mixing mean and sum and whatever), then it will put the pieces back together for you. You probably want either the 'ddply' or the 'daply' function in that package.
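If you would rather stay in base R, the four aggregate/merge steps from the question can be reduced to two aggregate calls by putting several columns on the left-hand side with cbind(). This is a sketch only: the three-row `raw` is a made-up stand-in for trees.csv, and the helper functions are the ones from the question rewritten with ^.

```r
# Helpers from the question (using ^ for powers)
treeVolume <- function(radius, height) 0.000074230 * radius^1.707348 * height^1.16873
treeBasalArea <- function(radius) (radius^2 * pi) / 40000

# Toy stand-in for trees.csv (values are made up for illustration)
raw <- data.frame(
  SNAME = c("A", "A", "B"),
  PLOT = c(1, 1, 2),
  CAP = c(10, 20, 30),
  H = c(1, 3, 5)
)

# Vectorized: no row-by-row loop needed
raw <- transform(raw, VOLUME = treeVolume(CAP, H), BASALAREA = treeBasalArea(CAP))

# Means for CAP and H, sums for VOLUME and BASALAREA, per SNAME and PLOT
means <- aggregate(cbind(CAP, H) ~ SNAME + PLOT, data = raw, FUN = mean)
sums <- aggregate(cbind(VOLUME, BASALAREA) ~ SNAME + PLOT, data = raw, FUN = sum)
plotSummary <- merge(means, sums, by = c("SNAME", "PLOT"))
```

Using bare column names with data = raw (rather than raw$CAP inside the formula) also keeps the output column names clean.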
I would like to know the number of unique dams which gave birth on each of the birth dates recorded. My data frame is similar to this one:
dam <- c("2A11","2A11","2A12","2A12","2A12","4D23","4D23","1X23")
bdate <- c("2009-10-01","2009-10-01","2009-10-01","2009-10-01",
"2009-10-01","2009-10-03","2009-10-03","2009-10-03")
mydf <- data.frame(dam,bdate)
mydf
# dam bdate
# 1 2A11 2009-10-01
# 2 2A11 2009-10-01
# 3 2A12 2009-10-01
# 4 2A12 2009-10-01
# 5 2A12 2009-10-01
# 6 4D23 2009-10-03
# 7 4D23 2009-10-03
# 8 1X23 2009-10-03
I used aggregate(dam ~ bdate, data=mydf, FUN=length) but it counts all the dams that gave birth on a particular date
bdate dam
1 2009-10-01 5
2 2009-10-03 3
Instead, I need to have something like this:
mydf2
bdate dam
1 2009-10-01 2
2 2009-10-03 2
Your help is very much appreciated!
What about:
aggregate(dam ~ bdate, data=mydf, FUN=function(x) length(unique(x)))
You could also run unique on the data first:
aggregate(dam ~ bdate, data=unique(mydf[c("dam","bdate")]), FUN=length)
Then you could also use table instead of aggregate, though the output is a little different.
> table(unique(mydf[c("dam","bdate")])$bdate)
2009-10-01 2009-10-03
2 2
This is just an example of how to think of the problem and one of the approaches on how to solve it.
split.mydf <- with(mydf, split(x = mydf, f = bdate)) #each list element has only one date.
# it's just a matter of counting unique dams
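# take the dam column: unique() on a whole data frame keeps unique rows,
# and length() of a data frame would count its columns, not its rows
unique.mydf <- lapply(X = split.mydf, FUN = function(x) unique(x$dam))
#and then count the number of unique elements
unilen.mydf <- lapply(unique.mydf, length)
#you can do these two last steps in one go like so
lapply(split.mydf, FUN = function(x) length(unique(x$dam)))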
as.data.frame(unlist(unilen.mydf)) #data.frame is just a special list, so this is water to your mill
unlist(unilen.mydf)
2009-10-01 2
2009-10-03 2
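The split, count and combine steps above can also be collapsed into a single tapply call in base R. A small sketch using the data from the question; the name `n_dams` is made up here.

```r
# Data from the question
dam <- c("2A11", "2A11", "2A12", "2A12", "2A12", "4D23", "4D23", "1X23")
bdate <- c("2009-10-01", "2009-10-01", "2009-10-01", "2009-10-01",
           "2009-10-01", "2009-10-03", "2009-10-03", "2009-10-03")
mydf <- data.frame(dam, bdate)

# For each birth date, count the distinct dams
n_dams <- tapply(mydf$dam, mydf$bdate, function(x) length(unique(x)))
```

`n_dams` is a named vector with one entry per birth date, which can be turned into a data frame with data.frame(bdate = names(n_dams), dam = as.vector(n_dams)) if needed.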
In dplyr you can use n_distinct :
library(tidyverse)
mydf %>%
group_by(bdate) %>%
summarize(dam = n_distinct(dam))
I have a data frame in R with the following structure.
> testData
date exch.code comm.code oi
1 1997-12-30 CBT 1 468710
2 1997-12-23 CBT 1 457165
3 1997-12-19 CBT 1 461520
4 1997-12-16 CBT 1 444190
5 1997-12-09 CBT 1 446190
6 1997-12-02 CBT 1 443085
....
77827 2004-10-26 NYME 967 10038
77828 2004-10-19 NYME 967 9910
77829 2004-10-12 NYME 967 10195
77830 2004-09-28 NYME 967 9970
77831 2004-08-31 NYME 967 9155
77832 2004-08-24 NYME 967 8655
What I want to do is produce a table that shows, for a given date and commodity, the total oi across every exchange code. So, the rows would be made up of
unique(testData$date)
and the columns would be
unique(testData$comm.code)
and each cell would be the total oi over all exch.codes on a given day.
Thanks,
The plyr package is good at this, and you should get this done with a single ddply() call. Something like (untested)
ddply(testData, .(date,comm.code), function(x) sum(x$oi))
should work.
# get it all aggregated
dfl <- aggregate(oi ~ date + comm.code, testData, sum)
# rearrange it so that it's like you requested
# (assumes every date appears for every comm.code, so oi fills the matrix evenly)
uc <- unique(dfl$comm.code)
dfw <- with( dfl, data.frame(date = unique(date), matrix(oi, ncol = length(uc))) )
names(dfw) <- c( 'date', uc)
This will be much much faster than the equivalent plyr command. And, there are ways to rearrange it in one liners. The rearranging part is very fast.
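One such one-liner in base R is xtabs(), which sums oi for each date/commodity pair and lays the result out with dates as rows and commodity codes as columns. A sketch with a tiny made-up `testData` (the dates, codes and values are invented for illustration); combinations that never occur come out as 0 rather than NA.

```r
# Tiny stand-in for testData (values are made up for illustration)
testData <- data.frame(
  date = c("1997-12-30", "1997-12-30", "1997-12-23"),
  exch.code = c("CBT", "NYME", "CBT"),
  comm.code = c(1, 1, 967),
  oi = c(10, 5, 7)
)

# Rows are dates, columns are commodity codes, cells are total oi
wide <- xtabs(oi ~ date + comm.code, data = testData)

# Optional: convert the contingency table to a plain data frame
wide_df <- as.data.frame.matrix(wide)
```

Unlike the matrix() rearrangement, this does not require every date to be present for every commodity code.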
A data.table solution
library(data.table)
DT <- data.table(testData)
DT[,sum(oi), by = list(date,comm.code)]