Mean of variable by two factors - r

I have the following data:
a <- c(1,1,1,1,2,2,2,2)
b <- c(2,4,6,8,2,3,4,1)
c <- factor(c("A","B","A","B","A","B","A","B"))
df <- data.frame(
sp=a,
length=b,
method=c)
I can use the following to get a count of the number of samples of each species by method:
n <- with(df,tapply(sp,method,function(x) count(x)))
How do I also get the mean length by method for each species?

Personally I would use aggregate:
aggregate(length ~ sp, data = df, FUN= "mean" )
# by species only
# sp length
#1 1 5.0
#2 2 2.5
aggregate(length ~ sp + method, data = df, FUN= "mean" )
# by species and method
# sp method length
#1 1 A 4
#2 2 A 3
#3 1 B 6
#4 2 B 2
To get the count and the mean together, you may want:
aggregate(length ~ method, data = df, function(x) c(m = mean(x), counts = length(x)) )
# counts and mean for each method
# method length.m length.counts
#1 A 3.5 4.0
#2 B 4.0 4.0
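Since the question started from tapply(), here is a minimal base R sketch of the same two-factor summary, laid out as a species-by-method matrix. It rebuilds the data from the question; the names means and counts are only illustrative.

```r
# Data from the question
df <- data.frame(
  sp     = c(1, 1, 1, 1, 2, 2, 2, 2),
  length = c(2, 4, 6, 8, 2, 3, 4, 1),
  method = factor(c("A", "B", "A", "B", "A", "B", "A", "B"))
)

# A list of two grouping vectors gives a 2-D result: rows = sp, columns = method
means <- with(df, tapply(length, list(sp = sp, method = method), mean))

# Number of samples per species and method
counts <- with(df, table(sp, method))

means
counts
```

The matrix layout is handy for a quick look; aggregate() is probably the better fit when a long data frame is needed for further processing.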

The plyr package is very helpful for this kind of task:
library(plyr)
new.df <- ddply(df, c("method", "sp"), summarise,
mean.length=mean(length),
max.length=max(length),
n.obs=length(length))
gives you
> new.df
method sp mean.length max.length n.obs
1 A 1 4 6 2
2 A 2 3 4 2
3 B 1 6 8 2
4 B 2 2 3 2
More examples at http://www.inside-r.org/packages/cran/plyr/docs/ddply.

Related

Adding column with information from another dataframe R

I have two data frames and I need to combine information from them.
Here the first df where I have different points (1,2,3..):
eleno elety resno
1 N 1
2 CA 1
3 C 1
4 O 1
5 CB 1
6 CG 1
The second one indicates distances between points, "eleno" represents the first point and "ele2" the second one:
eleno ele2 values
<chr> <chr> <dbl>
1 2 1.46
1 3 2.46
1 4 2.86
1 5 2.46
1 6 3.83
1 7 4.47
I'd like to add a new column to the first df with info from the second. For example, for point 1 I'd like to have -2 (second point):1.46 (distance), -3:2.46, -4:2.86 and so on, preferably in a single column.
Something like this
eleno elety resno dist
1 N 1 -2:1.46, -3:2.46, -4:2.86 ...
2 CA 1
3 C 1
4 O 1
5 CB 1
6 CG 1
Thank you!
If I understand your preference to one column, then a possibility without dplyr is as follows. First, we create the new column by concatenating the ele2 and values columns from df2 using the paste() function, with a colon as the separator:
new_column <- paste(-df2$ele2, df2$values, sep = ":")
Then, we use cbind() to bind it to df1:
new_df1 <- cbind(df1, ele2_values = new_column)
This will give us a new data frame like so:
eleno elety resno ele2_values
1 1 N 1 -2:1.46
2 2 CA 1 -3:2.46
3 3 C 1 -4:2.86
4 4 O 1 -5:2.46
5 5 CB 1 -6:3.83
6 6 CG 1 -7:4.47
Here is the data that I used, based on what you have given:
df1 <- data.frame(
eleno = 1:6,
elety = c("N", "CA", "C", "O", "CB", "CG"),
resno = rep(1, 6)
)
df2 <- data.frame(
eleno = rep(1, 6),
ele2 = 2:7,
values = c(1.46, 2.46, 2.86, 2.46, 3.83, 4.47)
)
If we want to get this column as a single element for each point, we can modify our code in the following manner:
Instantiate new_column as an empty vector:
new_column <- vector()
Then call some variant of *apply() or use a for loop to subset the original data frame by points, while applying our original code and appending our singular character elements back to new_column:
lapply(unique(df2$eleno), FUN = function(x) {
subset <- subset(df2, eleno == x)
new_elem <- paste(-subset$ele2, subset$values, sep = ":", collapse = ", ")
new_column <<- c(new_column, new_elem)
})
Once this operation is complete, we use cbind() as before to bind new_column to df1:
new_df1 <- cbind(df1, ele2_values = new_column)
Our output is as follows,
eleno elety resno ele2_values
1 1 N 1 -2:1.13703411305323, -3:6.22299404814839, -4:6.09274732880294, -5:6.23379441676661, -6:8.60915383556858, -7:6.40310605289415
2 2 CA 1 -2:0.094957563560456, -3:2.32550506014377, -4:6.66083758231252, -5:5.14251141343266, -6:6.93591291783378, -7:5.44974835589528
3 3 C 1 -2:2.82733583590016, -3:9.23433484276757, -4:2.92315840255469, -5:8.37295628152788, -6:2.86223284667358, -7:2.66820780001581
4 4 O 1 -2:1.86722789658234, -3:2.32225910527632, -4:3.16612454829738, -5:3.02693370729685, -6:1.59046002896503, -7:0.399959180504084
5 5 CB 1 -2:2.18799541005865, -3:8.10598552459851, -4:5.25697546778247, -5:9.14658166002482, -6:8.3134504687041, -7:0.45770263299346
6 6 CG 1 -2:4.56091482425109, -3:2.65186671866104, -4:3.04672203026712, -5:5.0730687007308, -6:1.81096208281815, -7:7.59670635452494
Here is my random data that I used for df2 in this case:
set.seed(1234)
df2 <- data.frame(
eleno = rep(1:6, rep(6, 6)),
ele2 = 2:7,
values = runif(length(rep(1:6, rep(6, 6)))) * 10
)
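The cbind() step above relies on the rows of df1 being in the same order as unique(df2$eleno). As a hedged alternative, here is a minimal base R sketch that builds one string per point with aggregate() and attaches it by key with merge(), so row order does not matter. It uses the small example data from earlier in this answer; the column names pair and dist are my own choice.

```r
df1 <- data.frame(
  eleno = 1:6,
  elety = c("N", "CA", "C", "O", "CB", "CG"),
  resno = rep(1, 6)
)
df2 <- data.frame(
  eleno  = rep(1, 6),
  ele2   = 2:7,
  values = c(1.46, 2.46, 2.86, 2.46, 3.83, 4.47)
)

# "-<second point>:<distance>" for every row; paste0() also works
# if ele2 is stored as character, as in the question
df2$pair <- paste0("-", df2$ele2, ":", df2$values)

# Collapse all pairs belonging to the same first point into one string
dist <- aggregate(pair ~ eleno, data = df2, FUN = paste, collapse = ", ")
names(dist)[2] <- "dist"

# Left join: points without any distances get NA
res <- merge(df1, dist, by = "eleno", all.x = TRUE)
res
```

If the key columns have different types in the two data frames (for example character in one and integer in the other), it is safer to convert them to a common type before merging.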

calculate means and occurrences from multiple matrices

I have a number of matrices that all hold the same type of elements but have different lengths. The columns are the same in all files (let's call them "A" and "B"), and the rows are mostly the same elements across files, but not always.
Here are some example data (in the form of dataframes)
df1 <- data.frame(A = 1:3, B = 3:1)
rownames(df1)=c("alpha","beta","gamma")
df2 <- data.frame(A = 1:5,B = 5:1)
rownames(df2)=c("alpha","beta","delta","gamma","zeta")
df3 <- data.frame(A = 1:7, B = 7:1)
rownames(df3)=c("alpha","beta","delta","gamma","zeta","theta","epsilon")
As you can see, "alpha", "beta" and "gamma" are always present, but many of the other rows are not.
I would like to calculate 2 things:
First, the average values of the A and B columns across all matrices, ideally as a matrix ave.matr that has every row name together with the mean of columns "A" and "B":
A B
alpha 1 7
beta 2 6
delta 3 5
gamma 4 4
zeta 5 3
theta 6 2
epsilon 7 1
(where the above numbers are the mean values of all matrices)
Second, an occurrence matrix, let's call it occur.matr, that counts the number of occurrences of each row across all matrices. It should look like this:
A B
alpha 3
beta 3
delta 2
gamma 3
zeta 2
theta 1
epsilon 1
I have started working on this today but I cannot figure out how to do it.
I started by creating a list and a matrix with the unique rownames from all matrices
list=c(rownames(df1),rownames(df2),rownames(df3))
unique=unique(list)
avematr<-matrix(NA,nrow=length(unique),ncol=2)
My next step would be to make the rownames of all matrices identical. I tried with match but I cannot figure it out, and at this point I don't even know if this is the best strategy.
All the similar questions out there are about merging the matrices (which is not what I want to do).
Any help is greatly appreciated.
Here is a tidyverse approach:
library(tidyverse)
df1 <- data.frame(A = 1:3, B = 3:1)
rownames(df1)=c("alpha","beta","gamma")
df2 <- data.frame(A = 1:5,B = 5:1)
rownames(df2)=c("alpha","beta","delta","gamma","zeta")
df3 <- data.frame(A = 1:7, B = 7:1)
rownames(df3)=c("alpha","beta","delta","gamma","zeta","theta","epsilon")
dat <- list(df1, df2, df3) %>%
map_dfr(rownames_to_column, var = "id")
avg_dat <- dat %>%
group_by(id) %>%
summarise(A = mean(A),
B = mean(B))
#> `summarise()` ungrouping output (override with `.groups` argument)
avg_dat
#> # A tibble: 7 x 3
#> id A B
#> <chr> <dbl> <dbl>
#> 1 alpha 1 5
#> 2 beta 2 4
#> 3 delta 3 4
#> 4 epsilon 7 1
#> 5 gamma 3.67 2.33
#> 6 theta 6 2
#> 7 zeta 5 2
occ_dat <- dat %>% count(id)
occ_dat
#> id n
#> 1 alpha 3
#> 2 beta 3
#> 3 delta 2
#> 4 epsilon 1
#> 5 gamma 3
#> 6 theta 1
#> 7 zeta 2
Created on 2021-01-27 by the reprex package (v0.3.0)
If you want to stick to base R:
For the averaging task it makes things easier when you add your rowname as a column. This prevents autonumbering of rownames when combining the dataframes. You then can simply loop over every unique rowname and construct the averages. A quick and dirty solution could look like this:
df1 <- data.frame(A = 1:3, B = 3:1)
rownames(df1)=c("alpha","beta","gamma")
df2 <- data.frame(A = 1:5,B = 5:1)
rownames(df2)=c("alpha","beta","delta","gamma","zeta")
df3 <- data.frame(A = 1:7, B = 7:1)
rownames(df3)=c("alpha","beta","delta","gamma","zeta","theta","epsilon")
add_row_names_to_df <- function(df) {
df$rn <- rownames(df)
return(df)
}
new_df <- rbind(add_row_names_to_df(df1),
add_row_names_to_df(df2),
add_row_names_to_df(df3))
avg_df <- as.data.frame(matrix(unique(new_df$rn),
nrow = length(unique(new_df$rn)),
ncol = 3))
for(i in 1:nrow(avg_df)) {
avg_df[i,] <- c(avg_df[i,1],
mean(new_df$A[new_df$rn==avg_df[i,1]]),
mean(new_df$B[new_df$rn==avg_df[i,1]]))
}
colnames(avg_df) <- c("rowname", "avgA", "avgB")
avg_df
results in:
rowname avgA avgB
1 alpha 1 5
2 beta 2 4
3 gamma 3.66666666666667 2.33333333333333
4 delta 3 4
5 zeta 5 2
6 theta 6 2
7 epsilon 7 1
For the occurrence matrix you can use the table() function from base R:
as.matrix(table(c(rownames(df1),rownames(df2),rownames(df3))))
yields:
[,1]
alpha 3
beta 3
delta 2
epsilon 1
gamma 3
theta 1
zeta 2
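The loop above stores the averages as character strings, because each row is filled from a vector that also holds the row name. As a hedged alternative, here is a minimal base R sketch that stacks the data frames with an explicit id column and lets aggregate() do the grouping, which keeps the means numeric. The names stacked, ave_df and occur are illustrative only.

```r
df1 <- data.frame(A = 1:3, B = 3:1,
                  row.names = c("alpha", "beta", "gamma"))
df2 <- data.frame(A = 1:5, B = 5:1,
                  row.names = c("alpha", "beta", "delta", "gamma", "zeta"))
df3 <- data.frame(A = 1:7, B = 7:1,
                  row.names = c("alpha", "beta", "delta", "gamma",
                                "zeta", "theta", "epsilon"))

# Move the row names into a regular column, then stack everything
stacked <- do.call(rbind, lapply(list(df1, df2, df3), function(d) {
  data.frame(id = rownames(d), A = d$A, B = d$B)
}))

# Mean of A and B for every row name (sorted alphabetically by id)
ave_df <- aggregate(cbind(A, B) ~ id, data = stacked, FUN = mean)

# How many of the data frames contain each row name
occur <- table(stacked$id)

ave_df
occur
```

If a matrix with row names is needed rather than a data frame, something like as.matrix(ave_df[, c("A", "B")]) with rownames set from ave_df$id should work.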

Conditional Subsetting based on column numbers

I need to pull out the rows of my data where columns don't match. For example, if the first column X holds an identifier such as 1, then every row with that identifier should have the same value in column Y:
X <- rep(1:4, times=2, each=2)
Y <- rep(c("Dave","Sam","Sam","Sam"))
Z <- as.data.frame(cbind(X,Y))
head(Z)
In this example I would like to keep the rows where X is 1 or 3, since the Y values for those identifiers don't all agree, and drop the rows where X is 2 (or 4), where they do. It would be great to have a function for this kind of problem, as I need to apply it to a larger data frame.
Thanks,
With dplyr:
df <- data.frame(x = rep(1:4, times=2, each=2),
y = rep(c("Dave","Sam","Sam","Sam")))
library(dplyr)
df %>%
group_by(x) %>%
filter(any(y != lag(y), na.rm = TRUE))
#> Source: local data frame [8 x 2]
#> Groups: x [2]
#>
#> x y
#> <int> <fctr>
#> 1 1 Dave
#> 2 1 Sam
#> 3 3 Dave
#> 4 3 Sam
#> 5 1 Dave
#> 6 1 Sam
#> 7 3 Dave
#> 8 3 Sam
I tested a few cases, but I am not sure this covers every edge case.
This is the way I would do it, though there may be a more elegant way. Is this what you need?
X <- rep(1:4, times=2, each=2)
Y <- rep(c("Dave","Sam","Sam","Sam"))
Z <- as.data.frame(cbind(X,Y))
head(Z)
# First Create Concatenated column
Z$XY <- paste(Z$X, Z$Y)
# Eliminate all duplicates
Z_unique <- unique(Z)
# Find number of occurrences of each X value
n_occur <- data.frame(table(Z_unique$X))
# Pull only those that have occurred more than once
n_occur[n_occur$Freq > 1,]
# Subset the output to only those values
output <- Z[Z$X %in% n_occur$Var1[n_occur$Freq > 1],]
We can use data.table
library(data.table)
setDT(df)[, .SD[any(!y == shift(y))], x]
# x y
#1: 1 Dave
#2: 1 Sam
#3: 1 Dave
#4: 1 Sam
#5: 3 Dave
#6: 3 Sam
#7: 3 Dave
#8: 3 Sam
data
df <- data.frame(x = rep(1:4, times=2, each=2),
y = rep(c("Dave","Sam","Sam","Sam")))
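For completeness, here is a minimal base R sketch of the same idea without extra packages: count the distinct y values inside each x group with ave() and keep the groups that have more than one. The names n_distinct_y and keep are my own, and the data is rebuilt so the snippet runs on its own.

```r
df <- data.frame(
  x = rep(1:4, times = 2, each = 2),
  y = rep(c("Dave", "Sam", "Sam", "Sam"), times = 4),
  stringsAsFactors = FALSE
)

# For every row, the number of distinct y values within its x group.
# ave() is given row indices so that the result stays numeric.
n_distinct_y <- ave(seq_along(df$y), df$x,
                    FUN = function(i) length(unique(df$y[i])))

# Keep only the groups whose y values disagree
keep <- df[n_distinct_y > 1, ]
keep
```

Unlike the comparisons against lag() or shift(), this does not depend on the order of the rows inside a group.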

Counting amount of zeros within a "melted" data frame

Hi, I am learning R and I am trying to count how many zeros I have within the melted data. I want to know how many zeros correspond to column b and how many to column c, and print the two results.
I generated an example:
library(reshape)
library(plyr)
library(dplyr)
id = c(1,2,3,4,5,6,7,8,9,10)
b = c(0,0,5,6,3,7,2,8,1,8)
c = c(0,4,9,87,0,87,0,4,5,0)
test = data.frame(id,b,c)
test_melt = melt(test, id.vars = "id")
test_melt
I imagine I should write an if statement for that, something like if (test$value == 0){print()}, but how can I tell R to count zeros for columns that have been melted?
With your data:
test_melt %>%
group_by(variable) %>%
summarize(zeroes = sum(value == 0))
# # A tibble: 2 x 2
# variable zeroes
# <fctr> <int>
# 1 b 2
# 2 c 4
Base R:
aggregate(test_melt$value, by = list(variable = test_melt$variable),
FUN = function(x) sum(x == 0))
# variable x
# 1 b 2
# 2 c 4
... and for curiosity:
library(microbenchmark)
microbenchmark(
dplyr = group_by(test_melt, variable) %>% summarize(zeroes = sum(value == 0)),
base1 = aggregate(test_melt$value, by = list(variable = test_melt$variable), FUN = function(x) sum(x == 0)),
# @PankajKaundal's suggested "formula" notation reads easier
base2 = aggregate(value ~ variable, test_melt, function(x) sum(x == 0))
)
# Unit: microseconds
# expr min lq mean median uq max neval
# dplyr 916.421 986.985 1069.7000 1022.1760 1094.7460 2272.636 100
# base1 647.658 682.302 783.2065 715.3045 765.9940 1905.411 100
# base2 813.219 867.737 950.3247 897.0930 959.8175 2017.001 100
sum(test_melt$value==0)
This gives the total number of zeros across both columns (6 here); it does not split the count by variable.
This might help. Is this what you're looking for?
> test_melt[4] <- 1
> test_melt2 <- aggregate(V4 ~ value + variable, test_melt, sum)
> test_melt2
value variable V4
1 0 b 2
2 1 b 1
3 2 b 1
4 3 b 1
5 5 b 1
6 6 b 1
7 7 b 1
8 8 b 2
9 0 c 4
10 4 c 2
11 5 c 1
12 9 c 1
13 87 c 2
V4 is the count
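Two more base R one-liners, as a rough sketch: tapply() on the melted frame, and colSums() on the original wide frame if it is still available. To keep the snippet self-contained, the melted frame is rebuilt by hand here instead of calling melt(); the names zeros_melted and zeros_wide are illustrative.

```r
id <- 1:10
b  <- c(0, 0, 5, 6, 3, 7, 2, 8, 1, 8)
c  <- c(0, 4, 9, 87, 0, 87, 0, 4, 5, 0)
test <- data.frame(id, b, c)

# Same shape as the output of melt(test, id.vars = "id")
test_melt <- data.frame(
  id       = rep(id, times = 2),
  variable = factor(rep(c("b", "c"), each = length(id))),
  value    = c(b, c)
)

# Zeros per original column, from the melted data
zeros_melted <- tapply(test_melt$value == 0, test_melt$variable, sum)

# Same counts straight from the wide data
zeros_wide <- colSums(test[c("b", "c")] == 0)

zeros_melted
zeros_wide
```

Both rely on the fact that summing a logical vector counts its TRUE values, so no if statement is needed.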

Simplifying the process of creating a summary table

I am pretty sure I am complicating things. I have a data frame with p variables (here: v1 to v3) and two factor variables (here: sex and unemp):
> head(df)
sex unemp v1 v2 v3
1 0 0 2 4 4
2 0 0 2 1 1
3 1 0 3 3 5
4 1 1 2 3 5
5 0 0 1 2 5
6 1 0 3 5 4
I would now like to transform my data (i.e. compute the median and mean and then rearrange the summary table) in such a way that the resulting data frame looks like this (for men or women):
> df.res.men
median.unemp.1 median.unemp.0 mean.unemp.1 mean.unemp.0
v1 2.0 2.0 2.666667 2.391304
v2 2.0 3.5 2.500000 3.369565
v3 4.5 3.0 4.166667 2.956522
Here is the full code:
library(plyr)
## generate data
set.seed(1)
df <- data.frame(sex=rbinom(100, 1, 0.5),
unemp=rbinom(100, 1, 0.2),
v1=sample(1:5, 100, replace=TRUE),
v2=sample(1:5, 100, replace=TRUE),
v3=sample(1:5, 100, replace=TRUE)
)
head(df)
## compute mean and median for all variables by sex and unemp
df.mean <- ddply(df, .(unemp, sex), .fun=colMeans, na.rm=TRUE)
df.mean
df.median <- ddply(df, .(unemp, sex), .fun=function(x)apply(x,2,median, na.rm=TRUE))
df.median
## rearrange summary table
df.res.men <- cbind(t(subset(df.median, sex==0 & unemp==1)),
t(subset(df.median, sex==0 & unemp==0)),
t(subset(df.mean, sex==0 & unemp==1)),
t(subset(df.mean, sex==0 & unemp==0)))
df.res.men <- df.res.men[-c(1:2),]
colnames(df.res.men) <- c("median.unemp.1", "median.unemp.0",
"mean.unemp.1", "mean.unemp.0")
df.res.men
Here is one approach
library(plyr); library(reshape2)
dfm <- melt(df, id = c('sex', 'unemp'))
df2 <- ddply(dfm, .(variable, unemp, sex), summarize,
avg = mean(value), med = median(value))
df2m <- melt(df2, id = 1:3, variable.name = 'sum_fun')
df_0 <- dcast(df2m, sex + variable ~ sum_fun + unemp, subset = .(sex == 0))
sex variable avg_0 avg_1 med_0 med_1
1 0 v1 2.794872 3.0000 3 3.5
2 0 v2 3.102564 2.8750 3 3.0
3 0 v3 3.205128 3.1875 3 4.0
Here's a two-line solution using reshape alone. The default column names need a bit of work, but the syntax of the melt() and cast() statements is nicely expressive.
(One important note: unlike reshape, reshape2 cannot take a vector of summary functions as its fun.aggregate argument, as I've done below with c(mean, median). Thanks to Ramnath for pointing that out.)
library(reshape)
dmelt <- melt(df, id=c('sex', 'unemp'))
# Results for sex 0
cast(dmelt, variable ~ unemp, c(mean, median), subset = sex==0)
# variable 0_mean 0_median 1_mean 1_median
# 1 v1 2.391304 2.0 2.666667 2.0
# 2 v2 3.369565 3.5 2.500000 2.0
# 3 v3 2.956522 3.0 4.166667 4.5
# Results for sex 1
cast(dmelt, variable ~ unemp, c(mean, median), subset = sex==1)
# variable 0_mean 0_median 1_mean 1_median
# 1 v1 3.027778 3 2.416667 2.0
# 2 v2 2.638889 2 2.750000 3.0
# 3 v3 3.027778 3 2.583333 2.5
Solution without reshaping data.
f <- function(x) rbind(each(mean,median)(na.omit(x)))
#
# This should work but it doesn't.
# It almost works, except that the output is not labelled with the function names
#
df.res <- ddply(df,.(unemp, sex),.fun=numcolwise(f))
#
# Some workaround
#
df.res <- dlply(df,.(unemp, sex),.fun=numcolwise(f))
df.res <- cbind(attr(df.res,"split_labels"),do.call(rbind,df.res))
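As a further option, here is a minimal sketch that needs neither plyr nor reshape: split one sex by unemp and apply the summary function column by column with sapply(). The helper stat_by_unemp and the other names are my own, and I assume sex == 0 is the group of interest, as in the question's df.res.men. No output is shown because the exact values depend on the random number generator of the R version in use.

```r
set.seed(1)
df <- data.frame(sex   = rbinom(100, 1, 0.5),
                 unemp = rbinom(100, 1, 0.2),
                 v1 = sample(1:5, 100, replace = TRUE),
                 v2 = sample(1:5, 100, replace = TRUE),
                 v3 = sample(1:5, 100, replace = TRUE))

men  <- df[df$sex == 0, ]
vars <- c("v1", "v2", "v3")

# Rows = variables, columns = levels of unemp
stat_by_unemp <- function(fun) {
  sapply(split(men[vars], men$unemp), function(d) sapply(d, fun, na.rm = TRUE))
}

med <- stat_by_unemp(median)
avg <- stat_by_unemp(mean)

res <- cbind(med, avg)
colnames(res) <- c(paste0("median.unemp.", colnames(med)),
                   paste0("mean.unemp.", colnames(avg)))
res
```

Swapping men for df[df$sex == 1, ] gives the table for the other group; wrapping the whole thing in a function of sex would avoid the repetition.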

Resources