I am trying to get means from columns in a data frame based on the unique values of another column. In this example, I am trying to get the mean of column b and column c based on the unique values in column a. I thought the .(a) would make it calculate by unique value in a (it does give the unique values of a), but it just returns a single mean for the whole of column b or c.
df2<-data.frame(a=seq(1:5),b=c(1:10), c=c(11:20))
simVars <- c("b", "c")
for ( var in simVars ){
print(var)
dat = ddply(df2, .(a), summarize, mean_val = mean(df2[[var]])) ## my script
assign(var, dat)
}
c
a mean_val
1 15.5
2 15.5
3 15.5
4 15.5
5 15.5
How can I have it take an average for the column based on the unique value from column a?
thanks
You don't need a loop. Just calculate the means of b and c within a single call to ddply and the means will be calculated separately for each value of a. And, as @Gregor said, you don't need to re-specify the data frame name inside mean():
ddply(df2, .(a), summarise,
mean_b=mean(b),
mean_c=mean(c))
a mean_b mean_c
1 1 3.5 13.5
2 2 4.5 14.5
3 3 5.5 15.5
4 4 6.5 16.5
5 5 7.5 17.5
UPDATE: To get separate data frames for each column of means:
# Add a few additional columns to the data frame
df2 = data.frame(a=seq(1:5),b=c(1:10), c=c(11:20), d=c(21:30), e=c(31:40))
# New data frame with means by each level of column a
library(dplyr)
dfmeans = df2 %>%
group_by(a) %>%
summarise_each(funs(mean))
# Separate each column of means into a separate data frame and store it in a list:
means.list = lapply(names(dfmeans)[-1], function(x) {
cbind(dfmeans[,"a"], dfmeans[,x])
})
means.list
[[1]]
a b
1 1 3.5
2 2 4.5
3 3 5.5
4 4 6.5
5 5 7.5
[[2]]
a c
1 1 13.5
2 2 14.5
3 3 15.5
4 4 16.5
5 5 17.5
[[3]]
a d
1 1 23.5
2 2 24.5
3 3 25.5
4 4 26.5
5 5 27.5
[[4]]
a e
1 1 33.5
2 2 34.5
3 3 35.5
4 4 36.5
5 5 37.5
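Note that summarise_each() and funs() have since been deprecated in dplyr; on dplyr 1.0 or later the same means-by-group step can be written with across(). A minimal sketch of the equivalent call (the rest of the answer works unchanged on the resulting dfmeans):
library(dplyr)
dfmeans = df2 %>%
  group_by(a) %>%
  summarise(across(everything(), mean))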
I have a wide table with more than 22 columns. This table is the result of fuzzymatch, which is why it's in wide format. The column names are shown below, in order (I will create a sample data frame further down for better demonstration):
[1] "shift_date.x" "shift" "ageyrs" "site" "level"
[6] "crowded_shift" "time" "dd" "AE" "ageyrs_start"
[11] "ageyrs_end" "time_start" "time_end" "shift_date.y" "shift_n"
[16] "ageyrs_n" "site_n" "level_n" "crowded_shift_n" "los_n"
[21] "dd_n" "AE_n"
What I want to do is to break this data frame at column 14 and append the columns from there to the end ("shift_date.y" to "AE_n") as new rows at the bottom of the first section of the table (i.e. change it to long format). The problem is that the first section has 13 columns but the second part has only 9, and I am not sure how to combine them (which is probably why plain subsetting and rbind don't work).
As an example, imagine we have the following data frame:
shift <- c (2,1,0)
ageyrs <- c(12.2,13,14)
site <- c(0,1,3)
level <- c (1,5,6)
ageyrs_s <- c (2,4,5)
ageyrs_n <- c (4,6,8)
shift2 <- c (2,1,0)
ageyrs2 <- c(12.2,13,14)
site2 <- c(0,1,3)
level2 <- c (1,5,6)
a <- data.frame(shift, ageyrs, site, level, ageyrs_s, ageyrs_n, shift2, ageyrs2, site2, level2)
shift ageyrs site level ageyrs_s ageyrs_n shift2 ageyrs2 site2 level2
1 2 12.2 0 1 2 4 2 12.2 0 1
2 1 13.0 1 5 4 6 1 13.0 1 5
3 0 14.0 3 6 5 8 0 14.0 3 6
Now I want to break this data frame at the "shift2" column and create a data frame like the one shown below:
shift ageyrs site level ageyrs_s ageyrs_n
1 2 12.2 0 1 2 4
2 1 13.0 1 5 4 6
3 0 14.0 3 6 5 8
4 2 12.2 0 1 NA NA
5 1 13.0 1 5 NA NA
6 0 14.0 3 6 NA NA
Any suggestions on how to resolve this?
We can use split.default from base R to split the data into a list of data.frames based on the column names with trailing digits removed, unlist each list element, pad the shorter vectors with NA, and then combine everything into a single data.frame:
nm1 <- sub("\\d+$", "", names(a))
lst1 <- lapply(split.default(a, nm1),
unlist, use.names = FALSE)
out <- data.frame(lapply(lst1, `length<-`, max(lengths(lst1))))[unique(nm1)]
Output:
out
# shift ageyrs site level ageyrs_s ageyrs_n
#1 2 12.2 0 1 2 4
#2 1 13.0 1 5 4 6
#3 0 14.0 3 6 5 8
#4 2 12.2 0 1 NA NA
#5 1 13.0 1 5 NA NA
#6 0 14.0 3 6 NA NA
Or using tidyverse: rename the unsuffixed columns so they carry a 1 suffix, then use pivot_longer() with a lookaround names_sep that splits each name between its letters and its trailing digit, so that shift1/shift2 (etc.) collapse into a single shift column:
library(dplyr)
library(tidyr)
library(stringr)
a %>%
rename_at(vars(shift:level), ~ str_c(., '1')) %>%
pivot_longer(cols = -c(ageyrs_s, ageyrs_n), names_to = c(".value", 'grp'),
names_sep = "(?<=[a-z])(?=[0-9])")
Try this. You can use bind_rows() and setNames() to define common names so that the two column blocks can be stacked properly:
library(dplyr)
#Code
newa <- a %>% select(shift:ageyrs_n) %>%
bind_rows(a %>% select(shift2:level2) %>% setNames(gsub('2','',names(.))))
Output:
shift ageyrs site level ageyrs_s ageyrs_n
1 2 12.2 0 1 2 4
2 1 13.0 1 5 4 6
3 0 14.0 3 6 5 8
4 2 12.2 0 1 NA NA
5 1 13.0 1 5 NA NA
6 0 14.0 3 6 NA NA
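Applied to the wide table from the question, the same bind_rows()/rename idea might look like the sketch below. Here wide_df is a hypothetical name for your 22-column table, the column ranges are taken from the names listed in the question, and any column that exists in only one block (e.g. los) simply ends up as NA in the rows coming from the other block:
library(dplyr)
long_df <- wide_df %>%
  select(shift_date.x:time_end) %>%                 # first block: columns 1-13
  rename(shift_date = shift_date.x) %>%
  bind_rows(wide_df %>%
              select(shift_date.y:AE_n) %>%         # second block: columns 14-22
              rename(shift_date = shift_date.y) %>%
              rename_with(~ sub("_n$", "", .x)))    # drop the _n suffix so names match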
Probably this is not that complex, but I couldn't figure out how to write a concise title explaining it:
I'm trying to use the aggregate function in R to return (1) the lowest value of a given column (val) by category (cat.2) in a data frame and (2) the value of another column (cat.1) on the same row. I know how to do part #1, but I can't figure out part #2.
The data:
cat.1<-c(1,2,3,4,5,1,2,3,4,5)
cat.2<-c(1,1,1,2,2,2,2,3,3,3)
val<-c(10.1,10.2,9.8,9.7,10.5,11.1,12.5,13.7,9.8,8.9)
df<-data.frame(cat.1,cat.2,val)
> df
cat.1 cat.2 val
1 1 1 10.1
2 2 1 10.2
3 3 1 9.8
4 4 2 9.7
5 5 2 10.5
6 1 2 11.1
7 2 2 12.5
8 3 3 13.7
9 4 3 9.8
10 5 3 8.9
I know how to use aggregate to return the minimum value for each cat.2:
> aggregate(df$val, by=list(df$cat.2), FUN=min)
Group.1 x
1 1 9.8
2 2 9.7
3 3 8.9
The second part of it, which I can't figure out, is to return the value in cat.1 on the same row of df where aggregate found min(df$val) for each cat.2. Not sure I'm explaining it well, but this is the intended result:
> ...
Group.1 x cat.1
1 1 9.8 3
2 2 9.7 4
3 3 8.9 5
Any help much appreciated.
If we need the output in that format, we can merge the aggregate result back with the original dataset:
merge(aggregate(df$val, by=list(df$cat.2), FUN=min),
df, by.x = c('Group.1', 'x'), by.y = c('cat.2', 'val'))
# Group.1 x cat.1
#1 1 9.8 3
#2 2 9.7 4
#3 3 8.9 5
But this can be done more easily with dplyr, using slice to pick the row with the minimum value of 'val' after grouping by 'cat.2':
library(dplyr)
df %>%
group_by(cat.2) %>%
slice(which.min(val))
# A tibble: 3 x 3
# Groups: cat.2 [3]
# cat.1 cat.2 val
# <dbl> <dbl> <dbl>
#1 3 1 9.8
#2 4 2 9.7
#3 5 3 8.9
Or with data.table
library(data.table)
setDT(df)[, .SD[which.min(val)], cat.2]
Or in base R, this can be done with ave
df[with(df, val == ave(val, cat.2, FUN = min)),]
# cat.1 cat.2 val
#3 3 1 9.8
#4 4 2 9.7
#10 5 3 8.9
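If you are on dplyr 1.0 or later, slice_min() expresses the same idea a bit more directly (a small sketch; note that the default with_ties = TRUE returns every row tied for the minimum):
library(dplyr)
df %>%
  group_by(cat.2) %>%
  slice_min(val, n = 1) %>%
  ungroup()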
data.frame(c = c(1,7,11,4,5,5))
c
1 1
2 7
3 11
4 4
5 5
6 5
Desired data frame:
c c.90th
1 1 NA
2 7 1
3 11 6.4
4 4 10.2
5 5 9.8
6 5 9.4
For the first row, I want it to look at the previous rows (none) and get their 90th quantile: NA.
For the second row, I want it to look at the previous row (1) and get its 90th quantile: 1.
For the third row, I want it to look at the previous rows (1, 7) and get their 90th quantile: 6.4.
etc.
A solution using data.table that also works by groups:
library(data.table)
dt <- data.table(c = c(1,7,11,4,5,5),
group = c(1, 1, 1, 2, 2, 2))
cumquantile <- function(y, prob) {
sapply(seq_along(y), function(x) quantile(y[0:(x - 1)], prob))
}
dt[, c90 := cumquantile(c, 0.9)]
dt[, c90_by_group := cumquantile(c, 0.9), by = group]
> dt
c group c90 c90_by_group
1: 1 1 NA NA
2: 7 1 1.0 1.0
3: 11 1 6.4 6.4
4: 4 2 10.2 NA
5: 5 2 9.8 4.0
6: 5 2 9.4 4.9
Try:
dff <- data.frame(c = c(1,7,11,4,5,5))
dff$c.90th <- sapply(1:nrow(dff),function(x) quantile(dff$c[0:(x-1)],0.9,names=F))
Output:
c c.90th
1 NA
7 1.0
11 6.4
4 10.2
5 9.8
5 9.4
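If you prefer dplyr, the same expanding-quantile idea works inside mutate() after group_by(). The sketch below uses a slightly more defensive variant of the helper that returns NA explicitly when there are no previous values, and assumes a grouping column as in the data.table answer:
library(dplyr)
dff <- data.frame(c = c(1, 7, 11, 4, 5, 5),
                  group = c(1, 1, 1, 2, 2, 2))
# 90th percentile of all previous values within the group; NA when there are none
cumquantile2 <- function(y, prob) {
  vapply(seq_along(y), function(i) {
    prev <- y[seq_len(i - 1)]
    if (length(prev) == 0) NA_real_ else unname(quantile(prev, prob))
  }, numeric(1))
}
dff %>%
  group_by(group) %>%
  mutate(c.90th = cumquantile2(c, 0.9)) %>%
  ungroup()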
I have a data frame (df) like this, which is just a sample:
group value
1 12.1
1 10.3
1 NA
1 11.0
1 13.5
2 11.7
2 NA
2 10.4
2 9.7
Namely,
df<-data.frame(group=c(1,1,1,1,1,2,2,2,2), value=c(12.1, 10.3, NA, 11.0, 13.5, 11.7, NA, 10.4, 9.7))
Desired output is:
group value order
1 12.1 3
1 10.3 1
1 NA NA
1 11.0 2
1 13.5 4
2 11.7 3
2 NA NA
2 10.4 2
2 9.7 1
Namely, I want to find the rank of the "value"s, starting from the smallest value, within each "group".
How can I do that with R? I will be very glad for any help. Thanks a lot.
We could use ave from base R to create the rank column ("order1") of "value" by "group". If we need NAs where "value" is NA, we can assign them afterwards (df$order1[is.na(df$value)] <- NA):
df$order1 <- with(df, ave(value, group, FUN=rank))
df$order1[is.na(df$value)] <- NA
Or using data.table
library(data.table)
setDT(df)[, order1:=rank(value)* NA^(is.na(value)), by = group][]
# group value order1
#1: 1 12.1 3
#2: 1 10.3 1
#3: 1 NA NA
#4: 1 11.0 2
#5: 1 13.5 4
#6: 2 11.7 3
#7: 2 NA NA
#8: 2 10.4 2
#9: 2 9.7 1
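As a side note, rank() also accepts na.last = "keep", which returns NA for NA inputs directly, so the two base R steps above can be folded into one (a small sketch):
df$order1 <- with(df, ave(value, group, FUN = function(v) rank(v, na.last = "keep")))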
You can use the rank() function applied to one group at a time to get your desired result. My solution for doing this is to write a small helper function and call that function in a for loop. I'm sure there are other more elegant means using various R libraries, but here is a solution using only base R.
# Read in the data (the df defined in the question above works just as well)
df <- read.table('~/Desktop/stack_overflow28283818.csv', sep = ',', header = TRUE)
#helper function
rankByGroup <- function(df = NULL, grp = 1)
{
rank(df[df$group == grp, 'value'])
}
# Remove NAs
df.na <- df[is.na(df$value),]
df.0 <- df[!is.na(df$value),]
# For loop over groups to list the ranks
for(grp in unique(df.0$group))
{
df.0[df.0$group == grp, 'order'] <- rankByGroup(df.0, grp)
print(grp)
}
# Append NAs
df.na$order <- NA
df.out <- rbind(df.0,df.na)
#re-sort for ordering given in OP (probably not really required)
df.out <- df.out[order(as.numeric(rownames(df.out))),]
This gives exactly the output desired, although I suspect that maintaining the position of the NAs in the data may not be necessary for your application.
> df.out
group value order
1 1 12.1 3
2 1 10.3 1
3 1 NA NA
4 1 11.0 2
5 1 13.5 4
6 2 11.7 3
7 2 NA NA
8 2 10.4 2
9 2 9.7 1
I have a data frame that looks like this:
site date var dil
1 A 7.4 2
2 A 6.5 2
1 A 7.3 3
2 A 7.3 3
1 B 7.1 1
2 B 7.7 2
1 B 7.7 3
2 B 7.4 3
I need to add a column called wt to this data frame that contains the weighting factor needed to calculate the weighted mean. This weighting factor has to be derived for each combination of site and date.
The approach I'm using is to first build a function that calculates the weighting factor:
> weight <- function(dil){
dil/sum(dil)
}
and then apply the function to each combination of site and date:
> df$wt <- ddply(df,.(date,site),.fun=weight)
but I get this error message:
Error in FUN(X[[1L]], ...) :
only defined on a data frame with all numeric variables
You are almost there. Modify your code to use the transform function. This allows you to add columns to the data.frame inside ddply:
weight <- function(x) x/sum(x)
ddply(df, .(date,site), transform, weight=weight(dil))
site date var dil weight
1 1 A 7.4 2 0.40
2 1 A 7.3 3 0.60
3 2 A 6.5 2 0.40
4 2 A 7.3 3 0.60
5 1 B 7.1 1 0.25
6 1 B 7.7 3 0.75
7 2 B 7.7 2 0.40
8 2 B 7.4 3 0.60
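For reference, if you are using dplyr rather than plyr, a minimal equivalent sketch (rebuilding the data frame shown in the question) would be:
library(dplyr)
# Data frame as shown in the question
df <- data.frame(site = c(1, 2, 1, 2, 1, 2, 1, 2),
                 date = c("A", "A", "A", "A", "B", "B", "B", "B"),
                 var  = c(7.4, 6.5, 7.3, 7.3, 7.1, 7.7, 7.7, 7.4),
                 dil  = c(2, 2, 3, 3, 1, 2, 3, 3))
df %>%
  group_by(date, site) %>%
  mutate(wt = dil / sum(dil)) %>%   # weighting factor within each site/date combination
  ungroup()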