I am using the 'diamonds' dataset from ggplot2 and want to find the average of the 'carat' column. However, I want the average within each 0.1-wide interval:
Between
0.2 and 0.29
0.3 and 0.39
0.4 and 0.49
etc.
You can use the aggregate function to compute the mean by group, where each group is calculated as carat %/% 0.1 * 0.1:
library(ggplot2)
averageBy <- 0.1
aggregate(diamonds$carat, list(diamonds$carat %/% averageBy * averageBy), mean)
This gives the mean for each 0.1-wide bin:
Group.1 x
1 0.2 0.2830764
2 0.3 0.3355529
3 0.4 0.4181711
4 0.5 0.5341423
5 0.6 0.6821408
6 0.7 0.7327491
...
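One caveat worth checking: 0.1 has no exact binary representation, so carat %/% 0.1 can put a value that sits exactly on a bin edge (such as 0.3) into the bin below. Here is a minimal sketch of a workaround, assuming carat is recorded to two decimals as it is in diamonds. The helper name bin_floor and the toy vector are made up for illustration.

```r
# Hypothetical helper: floor x to a multiple of `width`, doing the division on
# integer-valued hundredths so that edge values such as 0.3 stay in their own bin.
bin_floor <- function(x, width = 0.1) {
  hundredths <- round(x * 100)          # 0.30 -> 30, 0.29 -> 29
  step <- round(width * 100)            # 0.1  -> 10
  (hundredths %/% step) * step / 100    # 30 -> 0.3, 29 -> 0.2
}

# Toy data, not the real diamonds values
carat <- c(0.23, 0.29, 0.30, 0.31, 0.39, 0.40)
binned <- aggregate(carat, list(bin = bin_floor(carat)), mean)
print(binned)
```

On the real data the same helper would slot into the original call, e.g. aggregate(diamonds$carat, list(bin_floor(diamonds$carat)), mean).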
I am trying to remove outliers from a dataset s consisting of 3 variables:
id consumption period
a 0.1 summer
a 0.2 summer
b 0.3 summer
a 0.4 winter
b 10 winter
c 12 winter
I used outliers <- s$consumption[!s$consumption %in% boxplot.stats(s$consumption)$out] to remove the outliers from s and got something like this:
consumption
0.1
0.2
0.3
0.4
However, I want to get something like this below:
id consumption period
a 0.1 summer
a 0.2 summer
b 0.3 summer
a 0.4 winter
But filtering with the $out component only gives me back the numeric column, not the rows with their factor columns.
I found a solution which is to find the min of the output I got from outliers <- s$consumption[!s$consumption %in% boxplot.stats(s$consumption)$out], which is l in this case:
consumption
0.1
0.2
0.3
0.4
By knowing my min value, I can then take a subset of s by setting a condition where consumption has to be less than min(l).
new <- subset(s, consumption < min(l))
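A min-based subset like this only works when every outlier lies on one side of the retained values. As an alternative sketch, the same logical test can index the rows of s directly, which keeps id and period as well. The data below are made up for illustration (the six sample rows above are too few for boxplot.stats to flag anything), and out_values, keep and s_clean are just names chosen for the example.

```r
# Illustrative data in the same shape as s (values are made up for this sketch)
s <- data.frame(
  id = c("a", "a", "b", "a", "b", "c", "a", "b", "b", "c"),
  consumption = c(0.1, 0.2, 0.3, 0.4, 0.2, 0.3, 0.25, 0.35, 10, 12),
  period = c(rep("summer", 5), rep("winter", 5))
)

# Values that boxplot.stats() flags as outliers (beyond 1.5 * IQR from the hinges)
out_values <- boxplot.stats(s$consumption)$out

# Logical row index: TRUE for rows to keep; the trailing comma keeps all columns
keep <- !s$consumption %in% out_values
s_clean <- s[keep, ]
print(s_clean)
```

Because the index is applied to the whole data frame rather than to s$consumption alone, the result has the same columns as s.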
I have a data frame that looks like this:
Subject Time Freq1 Freq2 ...
A 6:20 0.6 0.1
A 6:30 0.1 0.5
A 6:40 0.6 0.1
A 6:50 0.6 0.1
A 7:00 0.3 0.4
A 7:10 0.1 0.5
A 7:20 0.1 0.5
B 6:00 ... ...
I need to delete the rows whose time is not between 7:00 and 7:30. So in this case, all the 6:00, 6:10, 6:20... rows.
I have tried creating a data frame with just the times I want to keep, but R does not seem to recognize the times as numbers or as names. And I get the same error when trying to directly remove the ones I don't need. It is probably quite simple, but I haven't found any solution.
Any suggestions?
We can convert the time column to the Period class from the lubridate package and then filter the data frame based on that column.
library(dplyr)
library(lubridate)
dat2 <- dat %>%
  mutate(HM = hm(Time)) %>%
  filter(HM < hm("7:00") | HM > hm("7:30")) %>%
  select(-HM)
dat2
# Subject Time Freq1 Freq2
# 1 A 6:20 0.6 0.1
# 2 A 6:30 0.1 0.5
# 3 A 6:40 0.6 0.1
# 4 A 6:50 0.6 0.1
# 5 B 6:00 NA NA
DATA
dat <- read.table(text = "Subject Time Freq1 Freq2
A '6:20' 0.6 0.1
A '6:30' 0.1 0.5
A '6:40' 0.6 0.1
A '6:50' 0.6 0.1
A '7:00' 0.3 0.4
A '7:10' 0.1 0.5
A '7:20' 0.1 0.5
B '6:00' NA NA",
header = TRUE)
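Note that the filter above keeps the rows outside 7:00 to 7:30. If the goal is to keep only the rows inside that window, the condition needs to be inverted. Here is a rough base R sketch of both directions, assuming Time is stored as "H:MM" text. The helper to_minutes and the names inside, kept_inside and kept_outside are hypothetical.

```r
# Same layout as dat; Time is stored as text such as "6:20"
dat <- data.frame(
  Subject = c("A", "A", "A", "A", "A", "A", "A", "B"),
  Time = c("6:20", "6:30", "6:40", "6:50", "7:00", "7:10", "7:20", "6:00"),
  Freq1 = c(0.6, 0.1, 0.6, 0.6, 0.3, 0.1, 0.1, NA),
  Freq2 = c(0.1, 0.5, 0.1, 0.1, 0.4, 0.5, 0.5, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: "H:MM" -> minutes since midnight, so times compare as numbers
to_minutes <- function(x) {
  parts <- strsplit(as.character(x), ":", fixed = TRUE)
  vapply(parts, function(p) as.numeric(p[1]) * 60 + as.numeric(p[2]), numeric(1))
}

mins <- to_minutes(dat$Time)
inside <- mins >= to_minutes("7:00") & mins <= to_minutes("7:30")

kept_inside <- dat[inside, ]    # rows from 7:00 to 7:30
kept_outside <- dat[!inside, ]  # rows outside that range, as in dat2
print(kept_inside)
```

Converting to a plain number of minutes avoids any dependency on how a time class handles comparison, at the cost of losing the nicer printing that lubridate provides.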
I am trying to meta-analyze p-values from different studies. I have a data frame
DF1
p-value1 p-value2 pvalue3 m
0.1 0.2 0.3 a
0.2 0.3 0.4 b
0.3 0.4 0.5 c
0.4 0.4 0.5 a
0.6 0.7 0.9 b
0.6 0.7 0.3 c
I am trying to add a fourth column with the meta-analyzed p-value of p-value1 to pvalue3.
I tried to use the metap package:
p <- rbind(DF1$p-value1, DF1$p-value2, DF1$p-value3)
pv <- split(p, p$m)
library(metap)
for (i in 1:length(pv))
{pvalue <- sumlog(pv[[i]]$pvalue)}
But it results in one p value. Thank you for any help.
You can try
apply(DF1[,1:3], 1, sumlog)
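For reference, sumlog implements Fisher's sum-of-logs method, and the row-wise idea can be sketched in base R without the package. This is an illustration under assumptions: the columns are renamed p1 to p3 because names like p-value1 are not syntactic in R, and fisher_p and p_meta are made-up names.

```r
# Hypothetical helper: Fisher's method for one set of p-values
fisher_p <- function(p) {
  stat <- -2 * sum(log(p))                              # chi-square statistic
  pchisq(stat, df = 2 * length(p), lower.tail = FALSE)  # combined p-value
}

# Data from the question, with syntactic column names (an assumption)
DF1 <- data.frame(
  p1 = c(0.1, 0.2, 0.3, 0.4, 0.6, 0.6),
  p2 = c(0.2, 0.3, 0.4, 0.4, 0.7, 0.7),
  p3 = c(0.3, 0.4, 0.5, 0.5, 0.9, 0.3),
  m  = c("a", "b", "c", "a", "b", "c")
)

# One combined p-value per row, stored as a new column
DF1$p_meta <- apply(DF1[, c("p1", "p2", "p3")], 1, fisher_p)
print(DF1)
```

The loop in the question overwrites pvalue on every pass, which is why only one value survives. Assigning the result of apply to a column keeps one value per row.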
I am trying to run a simulation using the following code and data. For each iteration it draws n from rpois and then n values from rbeta. This part works fine.
The only issue is that for each of the n values it should get an id from the table below by probability-weighted (id_prob) sampling with sample, but for some reason it gets only one ID for all n values.
library(parallel)
Sims <- 10000
cl <- makeCluster(num_cores)
clusterEvalQ(cl, library(evir))
clusterExport(cl, varlist = c("Sims", "ID", "id_prob", "beta_a", "beta_b"))
set.seed(0)
system.time(x1 <- parLapply(cl, 1:Sims, function(i) {
  id <- sample(ID, 1, replace = TRUE, prob = id_prob)
  n <- rpois(1, 9)
  rbeta(n, beta_a[id], beta_b[id])
}))
ID Rate id_prob Beta_a Beta_b
1 1.5 16.7% 0.5 0.5
2 2 22.2% 0.4 0.4
3 1 11.1% 0.3 0.3
4 1.5 16.7% 0.6 0.6
5 2 22.2% 0.1 0.1
6 1 11.1% 0.2 0.2
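A likely cause is that sample(ID, 1, ...) draws a single ID before n is known, so every value in that iteration shares it. Here is a minimal serial sketch that draws n first and then samples n IDs, assuming id_prob equals Rate / sum(Rate) (which matches the percentages in the table). simulate_once and sims are hypothetical names.

```r
# Table from the question; id_prob is assumed to be Rate / sum(Rate)
ID <- 1:6
Rate <- c(1.5, 2, 1, 1.5, 2, 1)
id_prob <- Rate / sum(Rate)
beta_a <- c(0.5, 0.4, 0.3, 0.6, 0.1, 0.2)
beta_b <- c(0.5, 0.4, 0.3, 0.6, 0.1, 0.2)

# One iteration: draw n first, then one ID per value, then one beta draw per ID
simulate_once <- function() {
  n <- rpois(1, 9)
  id <- sample(ID, n, replace = TRUE, prob = id_prob)
  data.frame(id = id, value = rbeta(n, beta_a[id], beta_b[id]))
}

set.seed(0)
sims <- lapply(1:200, function(i) simulate_once())
print(head(sims[[1]]))
```

rbeta is vectorized over its shape arguments, so beta_a[id] and beta_b[id] supply one pair of parameters per draw. The same function body should carry over to parLapply. For reproducible draws on the workers, parallel::clusterSetRNGStream may be worth a look, since set.seed on the master does not seed them.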
I've combined the outputs for each user and item (for a recommendation system) into this all x all R data.table. For each row in this table, I need to calculate the correlation between user scores 1,2,3 & item scores 1,2,3 (e.g. for the first row what is the correlation between 0.5,0.6,-0.2 and 0.2,0.8,-0.3) to see how well the user and the item match.
user item user_score_1 user_score_2 user_score_3 item_score_1 item_score_2 item_score_3
A 1 0.5 0.6 -0.2 0.2 0.8 -0.3
A 2 0.5 0.6 -0.2 0.4 0.1 -0.8
A 3 0.5 0.6 -0.2 -0.2 -0.4 -0.1
B 1 -0.6 -0.1 0.9 0.2 0.8 -0.3
B 2 -0.6 -0.1 0.9 0.4 0.1 -0.8
B 3 -0.6 -0.1 0.9 -0.2 -0.4 -0.1
I have a solution that works - which is:
scoresDT[, cor(c(user_score_1,user_score_2,user_score_3), c(item_score_1,item_score_2,item_score_3)), by= .(user, item)]
...where scoresDT is my data.table.
This is all well and good, and it works...but I can't get it to work with dynamic variables instead of hard coding in the variable names.
Normally in a data.frame I could create a list and just input that, but as it's character format, the data.table doesn't like it. I've tried using a list with "with=FALSE" and have had some success when trying basic subsetting of the data.table but not with the correlation syntax that I need...
Any help is much, much appreciated!
Thanks,
Andrew
Here's what I would do:
mDT = melt(scoresDT,
id.vars = c("user","item"),
measure.vars = patterns("item_score_", "user_score_"),
value.name = c("item_score", "user_score")
)
mDT[, cor(item_score, user_score), by=.(user,item)]
user item V1
1: A 1 0.8955742
2: A 2 0.9367659
3: A 3 -0.8260332
4: B 1 -0.6141324
5: B 2 -0.9958706
6: B 3 0.5000000
I'd keep the data in its molten/long form, which fits more naturally with R and data.table functionality.
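If the wide layout has to stay, here is a rough base R sketch of the "dynamic variable names" idea: build the column names at run time and correlate row by row. It is an illustration rather than a data.table idiom, and the names scores, user_cols, item_cols and score_cor are made up.

```r
# Data from the question as a plain data frame
scores <- data.frame(
  user = c("A", "A", "A", "B", "B", "B"),
  item = c(1, 2, 3, 1, 2, 3),
  user_score_1 = c(0.5, 0.5, 0.5, -0.6, -0.6, -0.6),
  user_score_2 = c(0.6, 0.6, 0.6, -0.1, -0.1, -0.1),
  user_score_3 = c(-0.2, -0.2, -0.2, 0.9, 0.9, 0.9),
  item_score_1 = c(0.2, 0.4, -0.2, 0.2, 0.4, -0.2),
  item_score_2 = c(0.8, 0.1, -0.4, 0.8, 0.1, -0.4),
  item_score_3 = c(-0.3, -0.8, -0.1, -0.3, -0.8, -0.1)
)

# Column names built at run time instead of being hard-coded
user_cols <- grep("^user_score_", names(scores), value = TRUE)
item_cols <- grep("^item_score_", names(scores), value = TRUE)

# Row-wise correlation between the two sets of columns
u <- as.matrix(scores[, user_cols])
v <- as.matrix(scores[, item_cols])
scores$score_cor <- vapply(seq_len(nrow(scores)),
                           function(i) cor(u[i, ], v[i, ]), numeric(1))
print(scores[, c("user", "item", "score_cor")])
```

This relies on the user and item columns sorting into the same order, so that user_score_k lines up with item_score_k.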