Say I have a data frame of numeric values and a second data frame of numeric values to be used as weights, built like this:
Monday <- c(1, 1, 10)
Tuesday <- c(1, 2, 3)
df <- data.frame(Monday, Tuesday)
Monday <- c(10, 10, 1)
Tuesday <- c(1, 1, 1)
df_weights <- data.frame(Monday, Tuesday)
How can I summarize each column of the first data frame using weighted mean with the corresponding column in the second data frame as a source of the values for the weights?
In addition, I would like both the mean and the weighted mean in a single data frame. How could I use something like summarize_all with two functions to do this?
Is it something like this?
library(dplyr)
library(Hmisc)
bind_cols(df, rename_all(df_weights, function(x) paste0(x, ".wt"))) %>%
  summarise(Monday = wtd.mean(Monday, w = Monday.wt),
            Tuesday = wtd.mean(Tuesday, w = Tuesday.wt))
## Monday Tuesday
##1 1.428571 2
Or possibly something more general without dplyr:
Map(function(x) wtd.mean(df[[x]], w = df_weights[[x]]), colnames(df))
## $Monday
## [1] 1.428571
##
## $Tuesday
## [1] 2
Getting the mean and the weighted mean together is a little trickier, but purrr can help generalize the previous answer. I don't know if the structure of the result matches your needs:
purrr::map_dfr(colnames(df),
               function(x) list(column = x,
                                mean = mean(df[[x]]),
                                wmean = wtd.mean(df[[x]], w = df_weights[[x]])))
## # A tibble: 2 x 3
## column mean wmean
## <chr> <dbl> <dbl>
##1 Monday 4 1.43
##2 Tuesday 2 2
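For completeness, a minimal sketch of the same idea without Hmisc, iterating over the two data frames column-wise (this assumes the columns of df and df_weights line up by position):
library(purrr)
# weighted mean of each column of df, using the matching column of df_weights;
# stats::weighted.mean is the base equivalent of Hmisc::wtd.mean here
map2_dbl(df, df_weights, weighted.mean)
## should give c(Monday = 1.428571, Tuesday = 2)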
I have a dataframe (df1) and have calculated the deciles for each row using the following:
#create a function to calculate the deciles
decilefun <- function(x) as.integer(cut(x, unique(quantile(x, probs=0:10/10)), include.lowest=TRUE))
# convert df1 to matrix
mat1 <- as.matrix(df1)
#apply the function I created above to calculate deciles
df1_deciles <- apply(mat1, 1, decilefun)
#add the rownames back in
rownames(df1_deciles) <- row.names(df1)
#convert to dataframe
df1_deciles <- as.data.frame(df1_deciles)
str(df1_deciles) # to show what the data looks like
#'data.frame': 157 obs. of 3321 variables:
# $ Variable1 : int 10 10 4 4 5 8 8 8 6 3 ...
# $ Variable2 : int 8 3 9 7 2 8 9 5 8 2 ...
# $ Variable3 : int 8 4 7 7 2 9 10 3 8 3 ...
I have another dataframe (df2) with the same rownames (Variable1, Variable2, etc.) but a different number of columns.
I would like to use the same decile cuts that were used for df1 on this second dataframe, but I'm not sure how to do it. I am actually not even sure how to determine/export what the cuts were on the original data that resulted in the df1_deciles dataframe I created. What I mean is: how do I export an object that tells me what range of values for Variable1 in df1 was assigned to a decile value of 1, a decile value of 2, and so on?
I do not want to use the 'decilefun' function I created on df2, but instead want to use the variability and range information from df1.
This is my first question on the platform so I hope it is clear and I hope I have provided enough information. I have tried to find answers on the platform but have not found one. I appreciate any help on this.
Using data.table:
##
# create an artificial dataset with the structure you describe
#
set.seed(1)
df1 <- data.frame(Variable.1=rnorm(1000), variable.2=runif(1000), variable.3=rgamma(1000, scale=10, shape=5))
df1 <- t(df1)
##
#
df2 <- data.frame(Variable.1=rnorm(1000, -1), variable.2=runif(1000), variable.3=rgamma(1000, scale=20, shape=5))
df2 <- t(df2)
##
# you start here
# assumes df1 and df2 have structure described in problem
# data in rows, not columns
#
library(data.table)
df1 <- as.data.table(t(df1)) # transpose: put data in columns
brks <- lapply(df1, quantile, probs=(0:10)/10) # named decile break points for each variable (the data rows of the original df1)
df2 <- as.data.table(df2, keep.rownames = TRUE) # keep df2 data in rows: 1000 columns here
result <- df2[ # this does all the work
, .(value= unlist(.SD),
decile=cut(unlist(.SD), breaks=c(-Inf, brks[[rn]], +Inf), labels=c('below', names(brks[[rn]])[2:11], 'above'))
)
, by=.(rn)]
result[, .N, keyby=.(rn, decile)] # validate that result is reasonable
Applying deciles from one dataset to another has the nuance that some values in the new dataset might be outside the range of the original data. The test data here demonstrates this problem. Variable.1 in df2 has values lower than any in df1, and variable.3 in df2 has values larger than any in df1.
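As an aside, the brks object built above is essentially the exportable set of cut points the question asks about. A minimal sketch for re-using one variable's breaks on new values (new_vals is a hypothetical vector) is:
b <- brks[["Variable.1"]]   # decile break points saved from df1
new_vals <- rnorm(20, -1)   # hypothetical new data for the same variable
# bin the new values with the old cut points; -Inf/+Inf catch values
# outside the range of the original data, as noted above
cut(new_vals, breaks = c(-Inf, b, Inf), labels = FALSE)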
I am extremely new to R and programming, so I don't even know how to describe my question very clearly, excuse me for using an example to further explain what I mean:
Say I have a data frame with 2 columns: the first holds 10 different countries and the second is a happiness rating (0-10). The country column can contain lots of repeated values, e.g.:
Column titles: Country Happiness
1st Column content: A,C,A,B,B,B,C,A,D,D....
2nd Column content: 10,9,3,4,4,5,6,9,10,6...
What I want to achieve is: get the mean/median/mode for countries A, B, C and D respectively. So far, using the describe() function, I can only get these statistics for all the numbers together, rather than by country.
I wonder if there is a function to achieve this directly, or should I create subsets of each country first? How should I do it?
Many thanks.
You can do this best with dplyr but first you will have to write a function for the mode:
getmode <- function(v) {
  uniqv <- unique(v[!is.na(v)])             # distinct non-NA values
  uniqv[which.max(table(match(v, uniqv)))]  # the value that occurs most often
}
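For example (made-up values), the helper returns the most frequent non-NA value:
getmode(c(1, 2, 2, NA, 3))
## [1] 2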
Now you can group_by the grouping variable Country and use summarise to calculate the statistics:
library(dplyr)
df %>%
group_by(Country) %>%
summarise(Mean = mean(Happiness),
Median = median(Happiness),
Mode = getmode(Happiness))
Result:
# A tibble: 4 x 4
Country Mean Median Mode
* <chr> <dbl> <dbl> <int>
1 A 2.5 2.5 2
2 B 2 2 2
3 C 3 3 3
4 D 3.5 3.5 5
Data:
set.seed(12)
df <- data.frame(
Country = sample(LETTERS[1:4], 10, replace = T),
Happiness = sample(1:5, 10, replace = T)
)
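If you prefer to stay in base R, a roughly equivalent sketch using aggregate() with the same getmode helper would be:
aggregate(Happiness ~ Country, data = df,
          FUN = function(x) c(Mean = mean(x), Median = median(x), Mode = getmode(x)))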
I am new to R so any help is greatly appreciated!
I have a data frame of 278,800 observations for each of my 10 variables, and I am trying to create an 11th variable that sums every 200 observations (or rows) of a specific variable/column (i.e. rows 1:200, 201:400, 401:600, etc.), similar to the OFFSET function in Excel.
I have tried subsetting my data to just the variable of interest with the aim of adding a new variable that sums every 200 rows, but I cannot figure it out. I understand my new "variable" will have 1,394 data points (278,800 / 200). I have tried the rollapply function, but the output does not sum in blocks of 200; it sums rows 1:200, 2:201, 3:202, etc.
Thanks,
E
rollapply has a by= argument for that. Here is a smaller example using n = 3 instead of n = 200. Note that 1+2+3=6, 4+5+6=15, 7+8+9=24 and 10+11+12=33.
# test data
DF <- data.frame(x = 1:12)
library(zoo)
n <- 3
rollapply(DF$x, n, sum, by = n)
## [1] 6 15 24 33
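If the vector length is an exact multiple of n, an equivalent base R sketch is to reshape into an n-row matrix and take column sums:
colSums(matrix(DF$x, nrow = n))
## [1]  6 15 24 33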
First let's generate some data and get a label for each group:
library(tidyverse)
df <-
rnorm(1000) %>%
as_tibble() %>%
mutate(grp = floor(1 + (row_number() - 1) / 200))
> df
# A tibble: 1,000 x 2
value grp
<dbl> <dbl>
1 -1.06 1
2 0.668 1
3 -2.02 1
4 1.21 1
...
1000 0.78 5
This creates 1,000 random N(0,1) values, turns them into a data frame, and then adds an incrementing numeric label for each group of 200 rows.
df %>%
group_by(grp) %>%
summarize(grp_sum = sum(value))
# A tibble: 5 x 2
grp grp_sum
<dbl> <dbl>
1 1 9.63
2 2 -12.8
3 3 -18.8
4 4 -8.93
5 5 -25.9
That group-by on grp and sum gives one row per group; if you would rather have a plain vector of the results, add pull():
df %>%
group_by(grp) %>%
summarize(grp_sum = sum(value)) %>%
pull(grp_sum)
[1] 9.62529 -12.75193 -18.81967 -8.93466 -25.90523
I created a vector with 278,800 observations (a):
a <- rnorm(278800)
b <- NULL  # initialize the column of interest
j <- 1
for (i in seq(1, length(a), by = 200)) {
  b[j] <- sum(a[i:(i + 199)])  # b is your column of interest
  j <- j + 1
}
View(b)
I am trying to simplify a large dataset (52k+ rows) by finding the maximum value for every two week interval. I have already assigned week number values to every row and used the aggregate() function to find the maximum value for each week.
Simplified sample data:
week <- c(1:5, 5, 7:10)
conc <- rnorm(mean=50, sd=20, n=10)
df <- data.frame(week,conc)
aggregate(df, by=list(week), FUN=max)
However, I am stuck on how to further aggregate based on two-week intervals (e.g. weeks 1 & 2, weeks 3 & 4, ...). It's not as simple as combining every other row, since not every week was sampled exactly once.
I'm assuming there's a simple solution, I just haven't found it yet.
Thanks!
week <- c(1:5, 5, 7:10)
bi_week <- (week + 1) %/% 2  # maps weeks 1-2 to 1, 3-4 to 2, and so on
conc <- rnorm(mean = 50, sd = 20, n = 10)
df <- data.frame(week, bi_week, conc)
aggregate(df, by = list(bi_week), FUN = max)
Use pracma::ceil to grab each bi-weekly pair
library(pracma)
aggregate(df, by=list(ceil(df$week/2)), FUN=max)
Output
Group.1 week conc
1 1 2 76.09191
2 2 4 50.20154
3 3 5 54.93041
4 4 8 69.17820
5 5 10 74.67518
ceil(df$week/2)
# 1 1 2 2 3 3 4 4 5 5
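Base R's ceiling() produces the same grouping without the extra package, so this sketch should give identical results:
aggregate(df, by = list(ceiling(df$week / 2)), FUN = max)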
library(purrr)
library(dplyr)
Odds <- seq(1, max(week), by = 2)
Evens <- seq(2, max(week), by = 2)
map2(.x = Odds, .y = Evens, .f = function(x, y) {
  df %>% filter(week == x | week == y) %>% select(conc) %>% max
})
I first made vectors of odd and even week numbers. Then, using the purrr package, I fed these pairwise (1 & 2, then 3 & 4, etc.) into a function that uses the dplyr package to keep just those weeks, select the conc values and take the max.
Here is the output:
> map2(.x=Odds,.y=Evens, .f=function(x,y) {df %>% filter(week==x | week==y) %>% select(conc) %>% max})
[[1]]
[1] 68.38759
[[2]]
[1] 56.9231
[[3]]
[1] 77.23965
[[4]]
[1] 49.39443
[[5]]
[1] 49.38465
Note: you could use map2_dbl in place of map2 and get a numeric vector instead.
Say I have a series of dates, and I want to break them into groups (let's call the groups "epochs"). My first idea of how to do this would be to create a variable that indicates which epoch a date belongs in. The following code shows what I want.
library(dplyr)
library(mosaic)
library(magrittr)
# Generate 1,000,000 random dates
set.seed(919)
df <- data.frame(dates = runif(1e6, -100, 100) + as.Date("2015-12-18"))
# Set two arbitrary dates as cutoffs
e1 <- as.Date("2015-10-01")
e2 <- as.Date("2015-12-20")
# Add a variable that indicates what the lowest cutoff date was
system.time(df %<>% mutate(epoch = derivedFactor(epoch.1 = dates < e1,
epoch.2 = dates < e2,
.method = "first",
.default = "epoch.3")))
# user system elapsed
# 341.86 0.16 344.70
But this is very slow -- about 5 minutes on my laptop. I imagine there is a faster way to do this. For example, my naive guess would be that you could sort the data by date, find the last row where dates < e1, and then mark all the preceding rows as a 1, etc. But maybe someone on here knows a better or more elegant way to do this?
I think you're overthinking this. I did it in base R, but presumably you could do this in dplyr too. Just order the data, and then set the factor levels you want in decreasing order.
Conceptually, you just set everything to the most recent epoch, 3. Then, you go through and find all the rows that are less than the epoch 2 cutoff, and then change those to 2. Then, repeat the same process with 1.
# Generate 1,000,000 random dates
set.seed(919)
test.data <- data.frame(row_id = 1:1000000,dates = runif(1e6, -100, 100) + as.Date("2015-12-18"))
# Set two arbitrary dates as cutoffs
e1 <- as.Date("2015-10-01")
e2 <- as.Date("2015-12-20")
test.data <- test.data[order(test.data$dates),]
test.data$epoch <- 3
test.data[test.data$dates < e2,"epoch"] <- 2
test.data[test.data$dates < e1,"epoch"] <- 1
table(test.data$epoch)
As Ben Bolker pointed out, you can use findInterval to do this:
df %<>% mutate(epoch = findInterval(df$dates, c(e1, e2)))
head(df, 10)
## dates epoch
## 1 2016-03-15 2
## 2 2016-01-02 2
## 3 2016-01-30 2
## 4 2015-10-03 1
## 5 2015-09-17 0
## 6 2016-02-11 2
## 7 2015-12-05 1
## 8 2015-12-15 1
## 9 2016-03-11 2
## 10 2015-10-21 1
On my machine, this takes much less than 0.1 second.
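If you want the same epoch.1/epoch.2/epoch.3 labels as the derivedFactor version, one possible sketch is to wrap the findInterval result in factor():
df %<>% mutate(epoch = factor(findInterval(dates, c(e1, e2)),
                              levels = 0:2,
                              labels = c("epoch.1", "epoch.2", "epoch.3")))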