Find unique combinations based on two columns and calculate the mean in R

I have a problem in R, which I can't seem to solve.
I have the following dataframe:
[Image 1: the original data frame]
I would like to:
Find the unique combinations of the columns 'Species' and 'Effects'
Report the concentration belonging to each unique combination
If a unique combination is present more than once, calculate the mean concentration
And I would like to get the following dataframe:
[Image 2: the desired result]
I have tried the following script to get the unique combinations:
UniqueCombinations <- Data[!duplicated(Data[,1:2]),]
but I don't know how to proceed from there.
Thanks in advance for your answers!
Tina

Create some example data:
dat <- data.frame(Species = rep.int(LETTERS[1:4], c(4, 1, 3, 2)),
                  Effect = c(rep("Reproduction", 3), "Growth", "Growth",
                             "Reproduction", "Mortality", "Mortality",
                             "Growth", "Growth"),
                  Concentration = rnorm(10))
You can use the function aggregate:
aggregate(Concentration ~ Species + Effect, dat, mean)

Try the following (thanks to Brandon Bertelsen for the nice comment):
Creating your data:
foo = data.frame(Species = c(rep("A", 4), "B", rep("C", 3), "D", "D"),
                 Effect = c(rep("Reproduction", 3), rep("Growth", 2),
                            "Reproduction", rep("Mortality", 2), rep("Growth", 2)),
                 Concentration = c(1.2, 1.4, 1.3, 1.5, 1.6, 1.2, 1.1, 1, 1.3, 1.4))
Using the great plyr package for a bit of magic :)
library(plyr)
ddply(foo, .(Species,Effect), function(x) mean(x[,"Concentration"]))
And here is a slightly more complicated but cleaner version (thanks again to Brandon Bertelsen):
ddply(foo, .(Species,Effect), summarize, mean=mean(Concentration))
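The same grouped mean is also a one-liner in plyr's successor dplyr (a sketch using the foo data from above):
library(dplyr)
# group by the unique Species/Effect combinations, then average within each
foo %>%
  group_by(Species, Effect) %>%
  summarise(Concentration = mean(Concentration))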

Just for fun before I call it a night.... Assuming your data.frame is called "dat", here are two more options:
A data.table solution.
library(data.table)
datDT <- data.table(dat, key="Species,Effect")
datDT[, list(Concentration = mean(Concentration)), by = key(datDT)]
#    Species       Effect Concentration
# 1:       A       Growth          1.50
# 2:       A Reproduction          1.30
# 3:       B       Growth          1.60
# 4:       C    Mortality          1.05
# 5:       C Reproduction          1.20
# 6:       D       Growth          1.35
An sqldf solution.
library(sqldf)
sqldf("select Species, Effect,
avg(Concentration) `Concentration`
from dat
group by Species, Effect")
#   Species       Effect Concentration
# 1       A       Growth          1.50
# 2       A Reproduction          1.30
# 3       B       Growth          1.60
# 4       C    Mortality          1.05
# 5       C Reproduction          1.20
# 6       D       Growth          1.35

Related

How to replicate this function in R many times?

I need to replicate the following function many times using different elements. I am very new to R, and the only way I know how to do it is with copy-paste.
I need to calculate the proportion that each program represents in its area, excluding the "Undecided" entries of the area. And I need the proportions for each area stored in a separate list or vector for further calculations.
df2 = data.frame(area = rep(c("Eng", "Hum"), each = 3),
                 program = c("Chem", "Mech", "Undecided", "Hist", "Law", "Undecided"))
df2
  area   program
1  Eng      Chem
2  Eng      Mech
3  Eng Undecided
4  Hum      Hist
5  Hum       Law
6  Hum Undecided
p.Mech = with(df2, sum(program=="Mech" & area=="Eng") / (sum(area=="Eng") - sum(program=="Undecided" & area=="Eng")))
p.Chem = with(df2, sum(program=="Chem" & area=="Eng") / (sum(area=="Eng") - sum(program=="Undecided" & area=="Eng")))
p.Hist = with(df2, sum(program=="Hist" & area=="Hum") / (sum(area=="Hum") - sum(program=="Undecided" & area=="Hum")))
p.Law  = with(df2, sum(program=="Law" & area=="Hum") / (sum(area=="Hum") - sum(program=="Undecided" & area=="Hum")))
In my real data I have 9 areas and about 5 programs for each area.
This is my first post ever on Stack Exchange, so sorry if the question is too dumb or doesn't belong here. I hope someone can help.
I'd probably just do the following. Remove undecided entries, tally up the results by program and area, and calculate proportions by area.
df2 = data.frame(area = rep(c("Eng", "Hum"), each = 3),
                 program = c("Chem", "Mech", "Undecided", "Hist", "Law", "Undecided"))
library(data.table)
library(magrittr)
dt2 <- as.data.table(df2) # just converting to a data.table
dt2 %>%
  .[program != "Undecided"] %>%
  .[, .N, keyby = .(area, program)] %>%
  .[, P := N / sum(N), keyby = "area"] %>%
  .[] # just for displaying
#>    area program N   P
#> 1:  Eng    Chem 1 0.5
#> 2:  Eng    Mech 1 0.5
#> 3:  Hum    Hist 1 0.5
#> 4:  Hum     Law 1 0.5
Created on 2019-03-12 by the reprex package (v0.2.1)
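Since you also wanted the proportions of each area stored in a separate list or vector for further calculations, you could split the result by area (a sketch, assuming the pipeline above is assigned to res instead of just printed):
res <- dt2[program != "Undecided"][, .N, keyby = .(area, program)]
res[, P := N / sum(N), by = "area"]
# named list with one vector of proportions per area, elements named by program
split(setNames(res$P, res$program), res$area)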

R Viz for comparing multiple paired values

I have an Excel file with data for 2 DBs and multiple measures (Msr) for each. There is classic ratio data, Num/Denom = Rate, for each. Can anybody suggest what visualization I can use in R to graphically find big differences (let's say 10%+) for each measure between the Test and X1 databases?
So we compare Denom, Num, and Rate between lines 1 and 2, then lines 3 and 4, then lines 5 and 6, and so on.
I tried to do this in Excel but read that R could be much better for this purpose. Most paired visualizations I have found are scatter displays; I need something more traditional, e.g. in my sample we could mark X1.SRB.Rate as low.
In my example I have 3 measures; in reality there could be 30. Thanks much for the info.
M
db <- c('test','x1','test','x1','test','x1')
msr <- c('BCS','BCS','CCS','CCS','SRB','SRB')
denom <- c(11848,11049,35836,38458,54160,56387)
num <- c(5255,6376,16908,18124,26253,15000)
rate <- c(44.35,57.71,47.18,47.13,48.47,26.6)
df <- data.frame(db,msr,denom,num,rate)
df
    db msr denom   num  rate
1 test BCS 11848  5255 44.35
2   x1 BCS 11049  6376 57.71
3 test CCS 35836 16908 47.18
4   x1 CCS 38458 18124 47.13
5 test SRB 54160 26253 48.47
6   x1 SRB 56387 15000 26.60
If I understood correctly, this should do what you want. I reshaped the data so you have one row per msr with separate columns for each db. I used data.table for its performance.
library(data.table)
db <- c('test','x1','test','x1','test','x1')
msr <- c('BCS','BCS','CCS','CCS','SRB','SRB')
denom <- c(11848,11049,35836,38458,54160,56387)
num <- c(5255,6376,16908,18124,26253,15000)
rate <- c(44.35,57.71,47.18,47.13,48.47,26.6)
df <- data.frame(db,msr,denom,num,rate)
#set as a data.table
setDT(df)
#cast into one row per MSR - fill in with the "rate" variable
out <- dcast(msr ~ db, data = df, value.var = "rate")
#Compute difference
out[, test_x1_diff := test - x1]
#keep only measures where the absolute difference is >= 10
out[abs(test_x1_diff) >= 10]
#>    msr  test    x1 test_x1_diff
#> 1: BCS 44.35 57.71       -13.36
#> 2: SRB 48.47 26.60        21.87
Created on 2019-01-11 by the reprex package (v0.2.1)
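If you also want the graphical view you asked about, one simple option is a paired dot plot of the rates, one row per measure, so a large test-vs-x1 gap shows up as a long segment (a sketch using ggplot2 and the df from above):
library(ggplot2)
# connect the two databases' rates for each measure with a grey segment,
# then overlay coloured points, one per database
ggplot(df, aes(x = rate, y = msr)) +
  geom_line(aes(group = msr), colour = "grey60") +
  geom_point(aes(colour = db), size = 3)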

Graphing multiple lines (999) on the same xy-plane in R

I have two variables that I want to bootstrap, graphing the linear regression line from each resampled data set on the same xy-plane.
I was thinking that I could hold each resulting intercept and slope from lm(), but I don't know how I could graph that information for each resulting pair in the same graph. I know that abline() can do one pair, but not all of them. Feel free to throw anything at me.
T <- 999  # number of bootstrap replicates (note: T otherwise means TRUE in R)
N <- 1000
intercept_stuff <- rep(NA, T)
opp_stuff <- rep(NA, T)
for (t in 1:T) {
  idx <- sample(1:N, size = N, replace = TRUE)
  fit <- lm(oppose_any ~ local_topic, data = facebook[idx, ])
  intercept_stuff[t] <- fit$coefficients[1]
  opp_stuff[t] <- fit$coefficients[2]
}
Here is an example of how to do multiple pairs of lines on ggplot with some simulated data. Hopefully this will give you some useful clues:
library(reshape2)
library(tibble)
library(ggplot2)
# simulate some data
obs <- c(1:90)
values1 <- rnorm(90,mean=0,sd=1)
values2 <- rnorm(90,mean=5,sd=2)
values3 <- rnorm(90,mean=10,sd=3)
df <- as.tibble(cbind(obs,values1,values2,values3))
It looks like this:
> df
# A tibble: 90 x 4
    obs values1 values2 values3
  <dbl>   <dbl>   <dbl>   <dbl>
1    1.  -0.162    7.47   10.7
2    2.   0.518    5.17    7.61
3    3.   1.52     7.66    4.42
# ... with 80 more rows
Then melt it into long form:
m.df <- melt(df, id = "obs", measure.vars = c("values1", "values2", "values3"))
to look like this:
> m.df
  obs variable       value
1   1  values1 -0.16228714
2   2  values1  0.51755370
3   3  values1  1.52433685
4   4  values1 -1.82067006
5   5  values1 -1.42180601
...
Then plot many lines (as long as there are groups like the color group here, they will be unique lines):
ggplot(m.df,aes(x=obs,y=value,color=variable)) + geom_line()
And here is the plot: [image omitted; three coloured lines, one per variable]
Good luck!
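For your actual bootstrap output you don't even need to reshape: ggplot2's geom_abline can draw one line per intercept/slope pair (a sketch, assuming intercept_stuff and opp_stuff were filled by the loop in your question):
library(ggplot2)
boot_lines <- data.frame(intercept = intercept_stuff, slope = opp_stuff)
# one semi-transparent regression line per bootstrap replicate,
# drawn over the original data points
ggplot(facebook, aes(x = local_topic, y = oppose_any)) +
  geom_point(alpha = 0.3) +
  geom_abline(data = boot_lines,
              aes(intercept = intercept, slope = slope),
              alpha = 0.05, colour = "steelblue")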

Calculating the mean vectors for a two-factorial table

I am trying to calculate the mean reagent vectors across the variables RBC, WBC, and hemoglobin. I am fairly new to R so my question is: Can you show me an easier way to do the following calculations in R? The data is from Table 6.19 of Rencher. I am trying to practice doing the computations in R as I follow the examples in Rencher.
reagent.dat <- read.table("https://dl.dropboxusercontent.com/u/28713619/reagent.dat")
colnames(reagent.dat) <- c("reagent", "subject", "RBC", "WBC", "hemoglobin")
reagent.dat$reagent <- factor(reagent.dat$reagent)
reagent.dat$subject <- factor(reagent.dat$subject)
library(plyr)
library(dplyr)
library(reshape2)
# Calculate the means per variable, across reagents
reagent.datm <- melt(reagent.dat)
group.means <- ddply(reagent.datm, c("variable", "reagent"), summarise, mean = mean(value))
group.means <- tbl_df(group.means)
newdata <- group.means %>% select(reagent, mean)
# Store the group means into a matrix
y_bar <- matrix(c(rep(NA, times=12)), ncol=4)
for (i in 1:4)
  y_bar[, i] <- as.matrix(filter(newdata, reagent == i)$mean, ncol = 1)
y_bar
The dplyr package can actually simplify your code quite easily and is definitely worth learning because of how powerful it can be. As an example:
reagent.dat <- read.table("https://dl.dropboxusercontent.com/u/28713619/reagent.dat")
colnames(reagent.dat) <- c("reagent", "subject", "RBC", "WBC", "hemoglobin")
#Using dplyr
library(dplyr)
reagentmeans <- reagent.dat %>%
  select(reagent, RBC, WBC, hemoglobin) %>%
  group_by(reagent) %>%
  summarize(mean_RBC = mean(RBC), mean_WBC = mean(WBC),
            mean_hemoglobin = mean(hemoglobin))
> reagentmeans
Source: local data frame [4 x 4]

  reagent mean_RBC mean_WBC mean_hemoglobin
   (fctr)    (dbl)    (dbl)           (dbl)
1       1    7.290   4.9535          15.310
2       2    7.210   4.8985          15.725
3       3    7.055   4.8810          15.595
4       4    7.025   4.8915          15.765
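And if you still want the means arranged as the y_bar matrix from your loop (one column of RBC/WBC/hemoglobin means per reagent), you can transpose the summary (a sketch based on the reagentmeans result above):
# drop the reagent column and transpose: rows = variables, columns = reagents
y_bar <- t(as.matrix(reagentmeans[, -1]))
y_bar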
You can also use data.table:
library(data.table)
setDT(reagent.dat)[, lapply(.SD, mean), by = reagent, .SDcols = c('RBC', 'WBC', 'hemoglobin')]
#   reagent   RBC    WBC hemoglobin
#1:       1 7.290 4.9535     15.310
#2:       2 7.210 4.8985     15.725
#3:       3 7.055 4.8810     15.595
#4:       4 7.025 4.8915     15.765

Produce a precision weighted average among rows with repeated observations

I have a dataframe similar to the one generated below. Some individuals have more than one observation for a particular variable, and each variable has an associated standard error (SE) for the estimate. I would like to create a new dataframe that contains only a single row for each individual. For individuals with more than one observation, such as Kim or Bob, I need to calculate a precision-weighted average based on the standard errors of the estimates, along with a variance for the newly calculated weighted mean. For example, for Bob's var1, I would want his var1 value in the new dataframe to be:
weighted.mean(c(example$var1[2], example$var1[10]),
              c(1/example$SE1[2], 1/example$SE1[10]))
and for Bob's new SE1, which would be the variance of the weighted mean, to be:
1/sum(1/example$SE1[2] + 1/example$SE1[10])
I have tried using the aggregate function and am able to calculate the arithmetic mean of the values, but the simple function I wrote does not use the standard errors nor can it deal with the NAs.
aggregate(example[,1:4], by = list(example[,5]), mean)
Would appreciate any help in developing some code to work through this problem. Here is the example dataset.
set.seed(1562)
example = data.frame(rnorm(10, 8, 2))
colnames(example)[1] = "var1"
example$SE1 = rnorm(10, 2, 1)
example$var2 = rnorm(10, 8, 2)
example$SE2 = rnorm(10, 2, 1)
example$id = c("Kim", "Bob", "Joe", "Sam", "Kim", "Kim", "Joe", "Sara", "Jeff", "Bob")
example$SE1[5]=NA
example$var1[5]=NA
example$SE2[10]=NA
example$var2[10]=NA
example
        var1      SE1      var2        SE2   id
1   9.777769 2.451406  6.363250  2.2739566  Kim
2   8.753078 2.174308  6.219770  1.4978380  Bob
3   7.977356 2.107739  6.835998  2.1647437  Joe
4  11.113048 2.713242 11.091650  1.7018666  Sam
5         NA       NA 11.769884 -0.1310218  Kim
6   5.271308 1.831475  6.818854  3.0294338  Kim
7   7.770062 2.094850  6.387607  0.2272348  Joe
8   9.837612 1.956486  8.517445  3.5126378 Sara
9   4.637518 2.516896  7.173460  2.0292454 Jeff
10  9.004425 1.592312        NA         NA  Bob
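As a concrete check of the formulas above using Bob's two rows (a quick sketch; the numbers are copied from the printout):
# Bob's precision-weighted mean of var1: approx. 8.8982
weighted.mean(c(8.753078, 9.004425), c(1/2.174308, 1/1.592312))
# and Bob's new SE1: approx. 0.91917
1/sum(1/2.174308 + 1/1.592312)
Both values match the Bob row in the results further down.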
I like the plyr package for these sorts of problems. It should be functionally equivalent to aggregate, but I think it is nice and convenient to use. There are lots of examples and a great ~20 page intro to plyr on the website. For this problem, since the data starts as a data.frame and you want another data.frame on the other end, we use ddply()
library(plyr)
#f1()
ddply(example, "id", summarize,
      newMean = weighted.mean(x = var1, 1/SE1, na.rm = TRUE),
      newSE = 1/sum(1/SE1, na.rm = TRUE)
)
Which returns:
    id newMean   newSE
1  Bob  8.8982 0.91917
2 Jeff  4.6375 2.51690
3  Joe  7.8734 1.05064
4  Kim  7.1984 1.04829
5  Sam 11.1130 2.71324
6 Sara  9.8376 1.95649
Also check out ?summarize and ?transform for some other good background. You can also pass an anonymous function to the plyr functions if necessary for more complicated tasks.
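plyr has since been superseded by dplyr, where the same computation reads almost identically (a sketch of the equivalent call):
library(dplyr)
example %>%
  group_by(id) %>%
  summarise(newMean = weighted.mean(var1, 1/SE1, na.rm = TRUE),
            newSE = 1/sum(1/SE1, na.rm = TRUE))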
Or use the data.table package, which can prove faster for some tasks:
library(data.table)
dt <- data.table(example, key="id")
#f2()
dt[, list(newMean = weighted.mean(var1, 1/SE1, na.rm = TRUE),
          newSE = 1/sum(1/SE1, na.rm = TRUE)),
   by = "id"]
A quick benchmark:
library(rbenchmark)
#f1 = plyr, #f2 = data.table
benchmark(f1(), f2(),
          replications = 1000,
          order = "elapsed",
          columns = c("test", "elapsed", "relative"))
  test elapsed relative
2 f2()   3.580   1.0000
1 f1()   6.398   1.7872
So data.table is ~1.8x faster than plyr for this dataset on my simple laptop.
