Say I have an "integer" factor vector of length 5:
vecFactor = c(1,3,2,2,3)
and a numeric data vector of length 5:
vecData = c(1.3,4.5,6.7,3,2)
How can I find the average of the data in each factor, so that I would get a result of:
Factor 1: Average = 1.3
Factor 2: Average = 4.85
Factor 3: Average = 3.25
tapply(vecData, vecFactor, FUN=mean)
1 2 3
1.30 4.85 3.25
I sometimes use a linear model instead of tapply, since it is quite flexible (for instance, if you need to add weights). Don't forget the "-1" in the formula: it drops the intercept, so each coefficient is a group mean.
lm(vecData~factor(vecFactor)-1)$coef
factor(vecFactor)1 factor(vecFactor)2 factor(vecFactor)3
1.30 4.85 3.25
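As a quick check of the weights remark, here is a small sketch that compares the lm coefficients with tapply, then adds weights and compares against weighted.mean per group. The weight vector w is made up for illustration and is not part of the question.

```r
vecFactor <- c(1, 3, 2, 2, 3)
vecData   <- c(1.3, 4.5, 6.7, 3, 2)

# Unweighted: the no-intercept coefficients are the group means
plain_lm     <- coef(lm(vecData ~ factor(vecFactor) - 1))
plain_tapply <- tapply(vecData, vecFactor, mean)

# Hypothetical weights, one per observation (not from the question)
w <- c(1, 2, 1, 3, 1)

# Weighted: the coefficients become per-group weighted means
weighted_lm <- coef(lm(vecData ~ factor(vecFactor) - 1, weights = w))
weighted_direct <- sapply(split(seq_along(vecData), vecFactor),
                          function(i) weighted.mean(vecData[i], w[i]))
```

Both pairs should agree up to floating-point error, which is what makes the lm form convenient once weights enter the picture.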
To get a tidy table, try the aggregate function with a data.frame:
ddf = data.frame(vecData, vecFactor)
aggregate(vecData~vecFactor, data=ddf, mean)
vecFactor vecData
1 1 1.30
2 2 4.85
3 3 3.25
data.table can also be used for this:
library(data.table)
ddt = data.table(ddf)
ddt[,list(meanval=mean(vecData)),by=vecFactor]
vecFactor meanval
1: 1 1.30
2: 3 3.25
3: 2 4.85
I have the following dataset with the number of times ("n") the variable "V0220" appears in the data by the "id_municipio", but this variable has two types: 1 and 2. Moreover, I have the weight ("peso_amostral") of each observation.
id_municipio peso_amostral v0220 n
1100015 2.04 2 1
1100015 2.68 1 1
1100015 3.45 2 1
1100015 4.51 1 1
1100015 4.62 2 1
1100015 4.75 1 1
What I would like to do is the following:
id_municipio 2 1
1100015 X Y
Therefore, I want to calculate the weighted mean for each type (2 or 1) of the variable "V0220", by id_municipio. Note that "X" and "Y" are the weighted mean values for types 2 and 1, respectively. I want to do this in R.
You can do this with dcast from data.table. Replace fun.aggregate with whatever function you need.
library(data.table)
dcast(data,
      id_municipio ~ v0220,
      fun.aggregate = mean,
      value.var = "peso_amostral")
OUTPUT:
id_municipio 1 2
1 1100015 3.98 3.37
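For anyone without data.table at hand, a base R sketch of the same wide layout is below. It rebuilds the six rows shown in the question and uses tapply with two grouping variables. Like the dcast call, it takes a plain mean of peso_amostral, and mean can be swapped for another function.

```r
# The six rows shown in the question
data <- data.frame(
  id_municipio  = rep(1100015, 6),
  peso_amostral = c(2.04, 2.68, 3.45, 4.51, 4.62, 4.75),
  v0220         = c(2, 1, 2, 1, 2, 1),
  n             = rep(1, 6)
)

# Rows: id_municipio, columns: v0220, cells: mean of peso_amostral
wide <- with(data, tapply(peso_amostral, list(id_municipio, v0220), mean))
round(wide, 2)
```

The result is a matrix with one row per id_municipio and one column per value of v0220.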
My dataset has 523 rows and 93 columns and it looks like this:
data <- structure(list(`2018-06-21` = c(0.6959635416667, 0.22265625,
0.50341796875, 0.982942708333301, -0.173828125, -1.229259672619
), `2018-06-22` = c(0.6184895833333, 0.16796875, 0.4978841145833,
0.0636718750000007, 0.5338541666667, -1.3009207589286), `2018-06-23` = c(1.6165364583333,
-0.375, 0.570800781250002, 1.603515625, 0.5657552083333, -0.9677734375
), `2018-06-24` = c(1.3776041666667, -0.03125, 0.7815755208333,
1.5376302083333, 0.5188802083333, -0.552966889880999), `2018-06-25` = c(1.7903645833333,
0.03125, 0.724609375, 1.390625, 0.4928385416667, -0.723074776785701
)), row.names = c(NA, 6L), class = "data.frame")
Each row is a city, and each column is a day of the year.
After calculating the row average in this way
data$mn <- apply(data, 1, mean)
I want to create another column, data$duration, that gives the average length of the runs of consecutive days on which the values are greater than data$mn.
I tried with this code:
data$duration <- apply(data[-6], 1, function(x) with(rle(x > data$mean), mean(lengths[values])))
But it does not seem to work. In particular, it appears that rle( x > data$mean) fails to recognize the end of a row.
What are your suggestions?
Many thanks
EDIT
Reference dataframe has been changed into a [6x5]
The main challenge you're facing in your code is getting apply (which focuses on one row at a time) to look at the right values of the mean. We can avoid this entirely by keeping the mean out of the data frame, and doing the comparison data > mean to the whole data frame at once. The new columns can be added at the end:
mn = rowMeans(data)
dur = apply(data > mn, 1, function(x) with(rle(x), mean(lengths[values])))
dur
# 1 2 3 4 5 6
# 3.0 1.5 2.0 3.0 4.0 2.0
data = cbind(data, mean = mn, duration = dur)
print(data, digits = 2)
# 2018-06-21 2018-06-22 2018-06-23 2018-06-24 2018-06-25 mean duration
# 1 0.70 0.618 1.62 1.378 1.790 1.2198 3.0
# 2 0.22 0.168 -0.38 -0.031 0.031 0.0031 1.5
# 3 0.50 0.498 0.57 0.782 0.725 0.6157 2.0
# 4 0.98 0.064 1.60 1.538 1.391 1.1157 3.0
# 5 -0.17 0.534 0.57 0.519 0.493 0.3875 4.0
# 6 -1.23 -1.301 -0.97 -0.553 -0.723 -0.9548 2.0
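To see what the rle step does on a single row, here is a minimal sketch using the above/below pattern of row 2 from the example (TRUE where the value exceeds the row mean). The last two lines cover an edge case that is not in the example data: a row with no value above its mean.

```r
# Row 2 of the example: is each day above the row mean?
above <- c(TRUE, TRUE, FALSE, FALSE, TRUE)

runs <- rle(above)
runs$lengths   # 2 2 1
runs$values    # TRUE FALSE TRUE

# Keep only the TRUE runs and average their lengths
mean_run <- mean(runs$lengths[runs$values])   # 1.5

# A row with no value above its mean has no TRUE runs, so the result is NaN
none   <- rle(c(FALSE, FALSE, FALSE))
no_run <- mean(none$lengths[none$values])     # NaN
```

If NaN is not wanted in the duration column, it can be replaced afterwards, for example with 0.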
I have a large dataframe that looks like this:
group_id distance metric
1 1.1 0.85
1 1.1 0.37
1 1.7 0.93
1 2.3 0.45
...
1 6.3 0.29
1 7.9 0.12
2 2.5 0.78
2 2.8 0.32
...
The dataframe is already sorted by group_id and then distance. I want to know the efficient dplyr or data.table equivalent of the following operations:
Within each group_id:
Let the unique and sorted values of distance within the current group_id be d1,d2,...,d_n.
For each d in d1,d2,...,d_n: Compute some function f on all values of metric whose distance value is less than d. The function f is a custom user defined function, that takes in a vector and returns a scalar. Assume that the function f is well defined on an empty vector.
So, in the example above, the desired dataframe would look like:
group_id distance_less_than metric
1 1.1 f(empty vector)
1 1.7 f(0.85, 0.37)
1 2.3 f(0.85, 0.37, 0.93)
...
1 7.9 f(0.85, 0.37, 0.93, 0.45,...,0.29)
2 2.5 f(empty vector)
2 2.8 f(0.78)
...
Notice how distance values can be repeated, like the value 1.1 under group 1. In such cases, both of the rows should be excluded when the distance is less than 1.1 (in this case this results in an empty vector).
A possible approach is to use the non-equi join available in data.table. The left table is the set of unique combinations of group_id and distance, and the right table supplies all rows whose distance is less than the left table's distance.
f <- sum
DT[unique(DT, by=c("group_id", "distance")), on=.(group_id, distance<distance), allow.cartesian=TRUE,
f(metric), by=.EACHI]
output:
group_id distance V1
1: 1 1.1 NA
2: 1 1.7 1.22
3: 1 2.3 2.15
4: 1 6.3 2.60
5: 1 7.9 2.89
6: 2 2.5 NA
7: 2 2.8 0.78
data:
library(data.table)
DT <- fread("group_id distance metric
1 1.1 0.85
1 1.1 0.37
1 1.7 0.93
1 2.3 0.45
1 6.3 0.29
1 7.9 0.12
2 2.5 0.78
2 2.8 0.32")
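For checking results on a small sample, a plain base R version of the same computation can be handy. This is only a sketch, not a fast alternative: it loops over groups and unique distances, using the same eight rows and f <- sum. One difference worth knowing: here f really is called on an empty vector, so sum gives 0 where the join output shows NA.

```r
df <- data.frame(
  group_id = c(1, 1, 1, 1, 1, 1, 2, 2),
  distance = c(1.1, 1.1, 1.7, 2.3, 6.3, 7.9, 2.5, 2.8),
  metric   = c(0.85, 0.37, 0.93, 0.45, 0.29, 0.12, 0.78, 0.32)
)
f <- sum

# For each group, apply f to the metrics with distance strictly below each unique distance
res <- do.call(rbind, lapply(split(df, df$group_id), function(g) {
  d <- unique(g$distance)
  data.frame(
    group_id           = g$group_id[1],
    distance_less_than = d,
    value              = vapply(d, function(di) f(g$metric[g$distance < di]), numeric(1))
  )
}))
res
```

The vapply call assumes f returns a single number; a different return type would need a different template in place of numeric(1).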
I don't think this will be faster than the data.table option, but here is one way using dplyr:
library(dplyr)
df %>%
  group_by(group_id) %>%
  mutate(new = purrr::map_dbl(distance, ~f(metric[distance < .])))
where f is your function. map_dbl expects the function to return a double. If your function returns a different type, you might want to use map_int, map_chr or similar.
If you want to keep only one entry per distance, you can drop the repeats using filter and duplicated:
df %>%
  group_by(group_id) %>%
  mutate(new = purrr::map_dbl(distance, ~f(metric[distance < .]))) %>%
  filter(!duplicated(distance))
I'm using LSD.test from the agricolae package.
Below is a reproducible example:
library('agricolae')
group <- c(1,1,1,2,2,2,3,3,3)
variable <- c(1,2,1.5,10,11,12,22,23,21)
df <- data.frame(cbind(group,variable))
model <- aov(variable~group,data=df)
LSD.test(model,"group",p.adj="bonferroni")
I'm getting the below output which is great
$statistics
MSerror Df Mean CV t.value MSD
0.8035714 7 11.5 7.794969 3.127552 2.289134
$parameters
test p.ajusted name.t ntr alpha
Fisher-LSD bonferroni group 3 0.05
$means
variable std r LCL UCL Min Max Q25 Q50 Q75
1 1.5 0.5 3 0.2761907 2.723809 1 2 1.25 1.5 1.75
2 11.0 1.0 3 9.7761907 12.223809 10 12 10.50 11.0 11.50
3 22.0 1.0 3 20.7761907 23.223809 21 23 21.50 22.0 22.50
$comparison
NULL
$groups
variable groups
3 22.0 a
2 11.0 b
1 1.5 c
attr(,"class")
[1] "group"
I wanted to extract the median and letter from this output.
To extract the median of group 3 for example, I used this function
output [[5]][[1]][[1]]
that gives this output
[1] 22
Till now, everything is fine. I'll explain the problem and ask the question below.
Now, I need to extract the letter as well.
I tried the following code
output [[5]][[2]][[1]]
[1] a
Levels: a b c
My question is:
Is there any way to get rid of the "Levels: a b c" line in the output and get only the letter?
Many thanks in advance.
as.character(output [[5]][[2]][[1]])
Solved it, thanks to @Tim Biegeleisen's comment.
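A small sketch of why this works, using a hand-built stand-in for the $groups element rather than a real LSD.test result. It assumes the groups column is a factor, as the "Levels" line in the question suggests; the column names are copied from the printout.

```r
# Hypothetical stand-in for output$groups (column names taken from the printout)
groups <- data.frame(
  variable  = c(22, 11, 1.5),
  groups    = factor(c("a", "b", "c")),
  row.names = c("3", "2", "1")
)

groups[[2]][[1]]                           # a factor: prints with "Levels: a b c"
letter <- as.character(groups[[2]][[1]])
letter                                     # "a"

# The same element by name rather than by position
letter_by_name <- as.character(groups$groups[1])
```

Indexing by name is a little easier to read than the chained [[ ]] positions, and it keeps working if the order of the list elements changes.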
I am looking for an explicit function to subscript elements in R, say subscript(x,i) to mean x[i].
The reason that I need this traces back to a piece of code using dplyr and magrittr pipe operator, which is not a pipe, and where I need to divide by the first element of each column.
pipedDF <- rawdata %>% filter, merge, summarize, dcast %>%
mutate_each( funs(./subscript(., 1) ), -index)
I think this would do the trick and keep that pipe syntax which people like.
Without dplyr it would look like this...
Example,
> df
index a b c
1 1 6.00 5.0 4
2 2 7.50 6.0 5
3 3 5.00 4.5 6
4 4 9.00 7.0 7
> data.frame(sapply(df, function(x)x/x[1]))
index a b c
1 1 1.00 1.0 1.00
2 2 1.25 1.2 1.25
3 3 0.83 0.9 1.50
4 4 1.50 1.4 1.75
You should be able to use '[', as in
x<-5:1
'['(x,2)
# [1] 4
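Applied to the example from the question, that could look like the sketch below. The helper name first is made up here; the data frame is the one printed in the question.

```r
df <- data.frame(index = 1:4,
                 a = c(6, 7.5, 5, 9),
                 b = c(5, 6, 4.5, 7),
                 c = c(4, 5, 6, 7))

# `[`(x, 1) is the same call as x[1]
first <- function(x) `[`(x, 1)

# Divide every column except index by its first element
scaled <- df
scaled[-1] <- lapply(df[-1], function(x) x / first(x))
scaled
```

Backticks are the usual way to quote the operator; plain quotes, as in '['(x, 2), work as well.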