Restructuring correlation data in R

Let's say I have a time series and, in each iteration, I take a fixed portion of it and calculate the correlation matrix. Assume there are only three elements, which are identified by name in the correlation matrix. I want to give them sequential numbers, so the first element is 1, the second is 2, and so forth. Then I want a data frame that expands these matrices into rows.
In each row, the first column is the "from" element, the second is the "to" element, the third is the correlation value, and the fourth is the time. I can supply the times as input and repeat each one twice as many times as there are elements. I realize that each correlation value will appear twice, with "from" and "to" swapped, and that is what I am looking for. How can I construct this? Here is my data, where g.list is a list of correlation matrices:
> dput(g.list)
list(structure(c(1, 0.352209944821856, 0.802051885793422, 0.352209944821857,
1, 0.827370298950111, 0.802051885793422, 0.827370298950111, 1
), .Dim = c(3L, 3L), .Dimnames = list(c("jpm", "gs", "ms"), c("jpm",
"gs", "ms"))), structure(c(1, 0.670163753398499, 0.753168359152204,
0.6701637533985, 1, 0, 0.753168359152202, 0, 1), .Dim = c(3L,
3L), .Dimnames = list(c("jpm", "gs", "ms"), c("jpm", "gs", "ms"
))), structure(c(1, 0.681190013681026, 0.153608963486821, 0.681190013681026,
1, 0.82058156983829, 0.153608963486822, 0.82058156983829, 1), .Dim = c(3L,
3L), .Dimnames = list(c("jpm", "gs", "ms"), c("jpm", "gs", "ms"
))))

Are you looking for this?
result <- do.call(rbind, Map(function(x, y)
  cbind(which(x < 1, arr.ind = TRUE), value = x[x != 1], year = y),
  g.list, 2018:2020))
result
# row col value year
#gs 2 1 0.352 2018
#ms 3 1 0.802 2018
#jpm 1 2 0.352 2018
#ms 3 2 0.827 2018
#jpm 1 3 0.802 2018
#gs 2 3 0.827 2018
#gs 2 1 0.670 2019
#ms 3 1 0.753 2019
#jpm 1 2 0.670 2019
#ms 3 2 0.000 2019
#jpm 1 3 0.753 2019
#gs 2 3 0.000 2019
#gs 2 1 0.681 2020
#ms 3 1 0.154 2020
#jpm 1 2 0.681 2020
#ms 3 2 0.821 2020
#jpm 1 3 0.154 2020
#gs 2 3 0.821 2020
To keep only the lower-triangle values and avoid duplicates, you can overwrite the upper triangle with 1 so that it is filtered out:
do.call(rbind, Map(function(x, y) {
  x[upper.tri(x)] <- 1
  cbind(which(x < 1, arr.ind = TRUE), value = x[x != 1], year = y)
}, g.list, 2018:2020))
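If you want an actual data frame with the column names from the question (from, to, value, time), here is a minimal sketch along the same lines. The two matrices, the `times` vector and the helper name `to_long` are made up for illustration; they stand in for g.list and whatever time labels you have. It selects cells by position (off the diagonal) rather than by value, so an off-diagonal correlation that happens to be exactly 1 is not dropped.

```r
# Two small correlation matrices standing in for g.list (made-up values)
nm <- c("jpm", "gs", "ms")
m1 <- matrix(c(1, 0.35, 0.80,
               0.35, 1, 0.83,
               0.80, 0.83, 1), 3, 3, dimnames = list(nm, nm))
m2 <- matrix(c(1, 0.67, 0.75,
               0.67, 1, 0,
               0.75, 0, 1), 3, 3, dimnames = list(nm, nm))
mats  <- list(m1, m2)
times <- c(2018, 2019)  # assumed time labels, one per matrix

# Hypothetical helper: one row per off-diagonal cell of a single matrix
to_long <- function(m, time) {
  off <- row(m) != col(m)           # everything except the diagonal
  data.frame(from  = row(m)[off],   # sequential number of the row element
             to    = col(m)[off],   # sequential number of the column element
             value = m[off],
             time  = time)
}

long <- do.call(rbind, Map(to_long, mats, times))
long
```

Each matrix contributes 6 rows here (3 x 3 minus the diagonal), so every pair shows up twice with from and to swapped, which is what the question asks for.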

Related

Concat two different Data frames horizontally [duplicate]

I have two lists named h and g.
They each contain 244 dataframes and they look like the following:
h[[1]]
year avg hr sal
1 2010 0.300 31 2000
2 2011 0.290 30 4000
3 2012 0.275 14 600
4 2013 0.280 24 800
5 2014 0.295 18 1000
6 2015 0.330 26 7000
7 2016 0.315 40 9000
g[[1]]
year pos fld
1 2010 A 0.990
2 2011 B 0.995
3 2013 C 0.970
4 2014 B 0.980
5 2015 D 0.990
I want to cbind these two dataframes.
But as you see, they have different number of rows.
I want to combine these dataframes so that the rows with the same year will be combined in one row. And I want the empty spaces to be filled with NA.
The result I expect looks like this:
year avg hr sal pos fld
1 2010 0.300 31 2000 A 0.990
2 2011 0.290 30 4000 B 0.995
3 2012 0.275 14 600 NA NA
4 2013 0.280 24 800 C 0.970
5 2014 0.295 18 1000 B 0.980
6 2015 0.330 26 7000 D 0.990
7 2016 0.315 40 9000 NA NA
Also, I want to repeat this for all the 244 dataframes in each list, h and g.
I'd like to make a new list named final which contains the 244 combined dataframes.
How can I do this...?
All answers will be greatly appreciated :)
I think you should use merge instead:
merge(df1, df2, by = "year", all = TRUE)
For your data:
df1 = data.frame(matrix(0, 7, 4))
names(df1) = c("year", "avg", "hr", "sal")
df1$year = 2010:2016
df1$avg = c(.3, .29, .275, .280, .295, .33, .315)
df1$hr = c(31, 30, 14, 24, 18, 26, 40)
df1$sal = c(2000, 4000, 600, 800, 1000, 7000, 9000)
df2 = data.frame(matrix(0, 5, 3))
names(df2) = c("year", "pos", "fld")
df2$year = c(2010, 2011, 2013, 2014, 2015)
df2$pos = c('A', 'B', 'C', 'B', 'D')
df2$fld = c(.99,.995,.97,.98,.99)
cbind is meant to column-bind two data frames that are compatible in every sense, in particular with the same number of rows in the same order. What you want is an actual merge, where no rows from either data frame are discarded and missing values are filled with NA.
We can use Map with cbind.fill (from rowr) to cbind the corresponding 'data.frame' from 'h' and 'g'.
library(rowr)
Map(cbind.fill, h, g, MoreArgs = list(fill=NA))
Update
Based on the expected output shown, it seems the OP wanted a merge instead of a cbind:
f1 <- function(...) merge(..., all = TRUE, by = 'year')
Map(f1, h, g)
#[[1]]
# year avg hr sal pos fld
#1 2010 0.300 31 2000 A 0.990
#2 2011 0.290 30 4000 B 0.995
#3 2012 0.275 14 600 <NA> NA
#4 2013 0.280 24 800 C 0.970
#5 2014 0.295 18 1000 B 0.980
#6 2015 0.330 26 7000 D 0.990
#7 2016 0.315 40 9000 <NA> NA
Or, as @Colonel Beauvel mentioned, this can be made more compact:
Map(merge, h, g, by='year', all=TRUE)
data
h <- list(structure(list(year = 2010:2016, avg = c(0.3, 0.29, 0.275,
0.28, 0.295, 0.33, 0.315), hr = c(31L, 30L, 14L, 24L, 18L, 26L,
40L), sal = c(2000L, 4000L, 600L, 800L, 1000L, 7000L, 9000L)), .Names = c("year",
"avg", "hr", "sal"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7")))
g <- list(structure(list(year = c(2010L, 2011L, 2013L, 2014L, 2015L
), pos = c("A", "B", "C", "B", "D"), fld = c(0.99, 0.995, 0.97,
0.98, 0.99)), .Names = c("year", "pos", "fld"), class = "data.frame",
row.names = c("1",
"2", "3", "4", "5")))
Here is how you could do this with tidyverse tools:
library(tidyverse)
h <- list()
g <- list()
h[[1]] <- tribble(
~year, ~avg, ~hr, ~sal,
2010, 0.300, 31, 2000,
2011, 0.290, 30, 4000,
2012, 0.275, 14, 600,
2013, 0.280, 24, 800,
2014, 0.295, 18, 1000,
2015, 0.330, 26, 7000,
2016, 0.315, 40, 9000
)
g[[1]] <- tribble(
~year, ~pos, ~fld,
2010, "A", 0.990,
2011, "B", 0.995,
2013, "C", 0.970,
2014, "B", 0.980,
2015, "D", 0.990
)
map2(h, g, left_join, by = "year")
Which produces:
[[1]]
# A tibble: 7 x 6
year avg hr sal pos fld
<dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 2010 0.3 31 2000 A 0.99
2 2011 0.290 30 4000 B 0.995
3 2012 0.275 14 600 NA NA
4 2013 0.28 24 800 C 0.97
5 2014 0.295 18 1000 B 0.98
6 2015 0.33 26 7000 D 0.99
7 2016 0.315 40 9000 NA NA
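One thing to keep in mind with a left join: it keeps only the years that are present in the left data frame, so a year that appears only in g would be dropped. A full merge keeps both. A small sketch with made-up data (base R merge is used here only to keep it dependency-free; all.x = TRUE keeps the same rows a left join would):

```r
# Made-up data: 2013 exists only in g1
h1 <- data.frame(year = c(2010, 2011, 2012), avg = c(0.300, 0.290, 0.275))
g1 <- data.frame(year = c(2010, 2013), pos = c("A", "C"))

left <- merge(h1, g1, by = "year", all.x = TRUE)  # rows of h1 only
full <- merge(h1, g1, by = "year", all = TRUE)    # rows from both sides

left$year  # 2010 2011 2012
full$year  # 2010 2011 2012 2013
```

With the example data in the question every year in g also appears in h, so both approaches give the same result there.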

Determining the percentage of values in each column for each cluster

I need to determine the percentage of values in each column for each cluster with condition. Reproducible example is below. I have a table like this:
> tab
GI RT TR VR Cluster_number
1 1000086986 0.5814 0.5814 0.628 1
10 1000728257 0.5814 0.5814 0.628 1
13 1000074769 0.7879 0.7879 0.443 2
14 1000498642 0.7879 0.7879 0.443 2
22 1000074765 0.7941 0.3600 0.533 3
26 1000597385 0.7941 0.3600 0.533 3
31 1000502373 0.5000 0.5000 0.607 4
32 1000532631 0.6875 0.7059 0.607 4
33 1000597694 0.5000 0.5000 0.607 4
34 1000598724 0.5000 0.5000 0.607 4
And I need a table like this:
> tab1
Cluster_number RT_cond TR_cond VR_cond
1 1 0 0 100
2 2 100 100 0
3 3 100 0 0
4 4 25 25 100
Each value is the percentage of GI entries in that cluster for which RT >= 0.6, TR >= 0.6 and VR >= 0.6, respectively. For example, in the first cluster all RT values are below 0.6, so the first row has 0 in RT_cond; in the fourth cluster, one of the four TR values is >= 0.6, so the corresponding value in the final table is 25. How can I do this?
You can group_by Cluster_number and use across to calculate the percentage:
library(dplyr)
df %>%
group_by(Cluster_number) %>%
summarise(across(RT:VR, ~mean(. >= 0.6) * 100, .names = '{col}_cond'))
#In older version of dplyr use summarise_at
#summarise_at(vars(RT:VR), ~mean(. >= 0.6) * 100)
# Cluster_number RT_cond TR_cond VR_cond
# <int> <dbl> <dbl> <dbl>
#1 1 0 0 100
#2 2 100 100 0
#3 3 100 0 0
#4 4 25 25 100
In base R, we can use aggregate:
aggregate(cbind(RT, TR, VR)~Cluster_number, df, function(x) mean(x >= 0.6) * 100)
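Both versions rely on the same trick: a comparison returns a logical vector, and the mean of a logical vector is the share of TRUE values. A tiny sketch using the RT values of cluster 4 from the question:

```r
rt <- c(0.5000, 0.6875, 0.5000, 0.5000)  # RT values of cluster 4

rt >= 0.6                      # FALSE  TRUE FALSE FALSE
pct <- mean(rt >= 0.6) * 100   # one TRUE out of four
pct                            # 25
```

If a column can contain NA, mean(x >= 0.6, na.rm = TRUE) would compute the share among the non-missing values only; whether that is the percentage you want depends on your data.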
data
df <- structure(list(GI = c(1000086986L, 1000728257L, 1000074769L,
1000498642L, 1000074765L, 1000597385L, 1000502373L, 1000532631L,
1000597694L, 1000598724L), RT = c(0.5814, 0.5814, 0.7879, 0.7879,
0.7941, 0.7941, 0.5, 0.6875, 0.5, 0.5), TR = c(0.5814, 0.5814,
0.7879, 0.7879, 0.36, 0.36, 0.5, 0.7059, 0.5, 0.5), VR = c(0.628,
0.628, 0.443, 0.443, 0.533, 0.533, 0.607, 0.607, 0.607, 0.607
), Cluster_number = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 4L)),
class = "data.frame", row.names = c("1", "10", "13", "14", "22",
"26", "31", "32", "33", "34"))
With the dplyr package you can use a group_by statement followed by summarise, and then rename the columns of interest with the rename_with function:
library(dplyr)
tab %>%
group_by(Cluster_number) %>%
summarise(across(c(RT, TR, VR), ~mean(. >= 0.6)*100)) %>%
rename_with(~paste0(., "_cond"), c(RT, TR, VR))
# A tibble: 4 x 4
# Cluster_number RT_cond TR_cond VR_cond
# <int> <dbl> <dbl> <dbl>
# 1 1 0 0 100
# 2 2 100 100 0
# 3 3 100 0 0
# 4 4 25 25 100

R Aggregate over multiple columns

I'm currently working with a large data frame of 75 columns and roughly 9,500 rows. This data frame contains observations for every day from 1995 to 2019 for several observation points.
Edit: The print from dput(head(df))
> dput(head(df))
structure(list(date = structure(c(9131, 9132, 9133, 9134, 9135,
9136), class = "Date"), x1 = c(50.75, 62.625, 57.25, 56.571,
36.75, 39.125), x2 = c(62.25, 58.714, 49.875, 56.375, 43.25,
41.625), x3 = c(90.25, NA, 70.125, 75.75, 83.286, 98.5),
x4 = c(60, 72, 68.375, 65.5, 63.25, 55.875), x5 = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), xn = c(53.25,
61.143, 56.571, 58.571, 36.25, 44.375), year = c(1995, 1995, 1995, 1995,
1995, 1995), month = c(1, 1, 1, 1, 1, 1), day = c(1, 2, 3,
4, 5, 6)), row.names = c(NA, -6L), class = c("tbl_df", "tbl",
"data.frame"))
The dataframe looks like this sample from it:
date x1 x2 x3 x4 x5 xn year month day
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1995-01-01 50.8 62.2 90.2 60 NA 53.2 1995 1 1
2 1999-08-02 62.6 58.7 NA 72 NA 61.1 1999 8 2
3 2001-09-03 57.2 49.9 70.1 68.4 NA 56.6 2001 9 3
4 2008-05-04 56.6 56.4 75.8 65.5 NA 58.6 2008 5 4
5 2012-04-05 36.8 43.2 83.3 63.2 NA 36.2 2012 4 5
6 2019-12-31 39.1 41.6 98.5 55.9 NA 44.4 2019 12 31
str(df)
tibble [9,131 x 75] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ date : Date[1:9131], format: "1995-01-01" "1995-01-02" ...
$ x1 : num [1:9131] 50.8 62.6 57.2 56.6 36.8 ...
$ x2 : num [1:9131] 62.2 58.7 49.9 56.4 43.2 ...
xn
$ year : num [1:9131] 1995 1995 1995 1995 1995 ...
$ month : num [1:9131] 1 1 1 1 1 1 1 1 1 1 ...
$ day : num [1:9131] 1 2 3 4 5 6 7 8 9 10 ...
My goal is to get, for every observation point xn, the number of observations per year that exceed a certain limit.
So far I have tried to do this with the aggregate function.
To get the mean for every year, I used the following command:
aggregate(list(df), by=list(year=df$year), mean, na.rm=TRUE)
This works perfectly; I get the mean for every year for every observation point.
To get the count for one station, I used the following code:
aggregate(list(x1=df$x1), by=list(year=df$year), function(x) sum(rle(x)$values>120, na.rm=TRUE))
which results in this print:
year x1
1 1995 52
2 1996 43
3 1997 44
4 1998 42
5 1999 38
6 2000 76
7 2001 52
8 2002 58
9 2003 110
10 2004 34
11 2005 64
12 2006 46
13 2007 46
14 2008 17
15 2009 41
16 2010 30
17 2011 40
18 2012 47
19 2013 40
20 2014 21
21 2015 56
22 2016 27
23 2017 45
24 2018 22
25 2019 45
So far, so good. I know I could extend the code by adding (.., x2=df$x2, x3=df$x3, .., xn) to the list argument in the code above. I tried that and it works.
But how do I get them all at once?
I tried the following codes:
aggregate(.~(date, year, month, day), by=list(year=df$year), function(x) sum(rle(x)$values>120, na.rm=TRUE))
Error: unexpected ',' in "aggregate(.~(date,"
aggregate(.~date+year+month+day, by=list(year=df$year), function(x) sum(rle(x)$values>120, na.rm=TRUE))
Error in as.data.frame.default(data, optional = TRUE) :
cannot coerce class ‘"function"’ to a data.frame
aggregate(. ~ date + year + month + day, data = df,by=list(year=df$year), function(x) sum(rle(x)$values>120, na.rm=TRUE))
Error in aggregate.data.frame(lhs, mf[-1L], FUN = FUN, ...) :
arguments must have same length
Unfortunately, none of them works. Could someone please give me a hint as to where my mistake is?
Here is an answer that uses base R. Since none of the values in the example data is above 120, we use a threshold of 70 instead.
data <- structure(
list(
date = structure(c(9131, 9132, 9133, 9134, 9135,
9136), class = "Date"),
x1 = c(50.75, 62.625, 57.25, 56.571,
36.75, 39.125),
x2 = c(62.25, 58.714, 49.875, 56.375, 43.25,
41.625),
x3 = c(90.25, NA, 70.125, 75.75, 83.286, 98.5),
x4 = c(60, 72, 68.375, 65.5, 63.25, 55.875),
x5 = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_),
xn = c(53.25,
61.143, 56.571, 58.571, 36.25, 44.375),
year = c(1995, 1995, 1995, 1995,
1995, 1995),
month = c(1, 1, 1, 1, 1, 1),
day = c(1, 2, 3,
4, 5, 6)
),
row.names = c(NA,-6L),
class = c("tbl_df", "tbl",
"data.frame"
))
First, we create a subset of the data that contains all columns with x in their names.
theCols <- data[, colnames(data)[grepl("x", colnames(data))]]
Second, we convert the values to TRUE or FALSE based on whether they are greater than 70, and cbind() the year onto the resulting logical values.
x_logical <- cbind(year = data$year, as.data.frame(apply(theCols, 2, function(x) x > 70)))
Finally, we use aggregate() across all columns other than year and sum the columns.
aggregate(x_logical[2:ncol(x_logical)], by = list(x_logical$year), sum, na.rm = TRUE)
...and the output:
Group.1 x1 x2 x3 x4 x5 xn
1 1995 0 0 5 1 0 0
>
Note that by using colnames() to select the columns whose names contain x, and ncol() in the aggregate() call, we make this a general solution that will handle a varying number of x locations.
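The same idea can also be collapsed into a single aggregate() call by counting inside the function. This is only a sketch: the data frame, the threshold `limit` and the helper name `xcols` are made up for illustration. It uses the `by =` interface rather than a formula because the formula interface drops rows with NA by default, which would be a problem for a column like x5 that is entirely NA.

```r
# Made-up data: two years, two stations, some missing values
df <- data.frame(
  year = c(1995, 1995, 1995, 1996, 1996),
  x1   = c(50, 75, 80, 60, 71),
  x2   = c(NA, 90, 10, NA, NA)
)
limit <- 70                                   # assumed threshold
xcols <- grep("^x", names(df), value = TRUE)  # columns whose name starts with "x"

# Count, per year and per station, the values above the limit
counts <- aggregate(df[xcols], by = list(year = df$year),
                    FUN = function(v) sum(v > limit, na.rm = TRUE))
counts
#   year x1 x2
# 1 1995  2  1
# 2 1996  1  0
```

Using "^x" anchors the match at the start of the name, so a column that merely contains an x somewhere else would not be picked up.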
Two tidyverse solutions
A tidyverse solution to the same problem is as follows. It includes the following steps.
Use mutate() with across() to create the TRUE / FALSE versions of the x variables. Note that across() requires dplyr 1.0.0 or later.
Use pivot_longer() to allow us to summarise() multiple measures without a lot of complicated code.
Use pivot_wider() to convert the data back to one column for each x measurement.
...and the code is:
# install.packages("dplyr") # across() needs dplyr >= 1.0.0
library(dplyr)
library(tidyr)
library(lubridate)
data %>%
mutate(across(starts_with("x"), ~ . > 70)) %>%
select(-year,-month,-day) %>% group_by(date) %>%
pivot_longer(starts_with("x"),names_to = "measure",values_to = "value") %>%
mutate(year = year(date)) %>% group_by(year,measure) %>%
select(-date) %>%
summarise(value = sum(value,na.rm=TRUE)) %>%
pivot_wider(id_cols = year,names_from = "measure",
values_from = value)
...and the output, which matches the Base R solution that I originally posted:
`summarise()` regrouping output by 'year' (override with `.groups` argument)
# A tibble: 1 x 7
# Groups: year [1]
year x1 x2 x3 x4 x5 xn
<dbl> <int> <int> <int> <int> <int> <int>
1 1995 0 0 5 1 0 0
>
...and here's an edited version of the other answer that will also produce the same results as above. This solution implements pivot_longer() before creating the logical variable for exceeding the threshold, so it does not require the across() function. Also note that since this uses 120 as the threshold value and none of the data meets this threshold, the sums are all 0.
df_example %>%
pivot_longer(x1:x5) %>%
mutate(greater_120 = value > 120) %>%
group_by(year,name) %>%
summarise(sum_120 = sum(greater_120,na.rm = TRUE)) %>%
pivot_wider(id_cols = year,names_from = "name", values_from = sum_120)
...and the output:
`summarise()` regrouping output by 'year' (override with `.groups` argument)
# A tibble: 1 x 6
# Groups: year [1]
year x1 x2 x3 x4 x5
<dbl> <int> <int> <int> <int> <int>
1 1995 0 0 0 0 0
>
Conclusions
As usual, there are many ways to accomplish a given task in R. Depending on one's preferences, the problem can be solved with Base R or the tidyverse. One of the quirks of the tidyverse is that some operations such as summarise() are much easier to perform on narrow format tidy data than on wide format data. Therefore, it's important to be proficient with tidyr::pivot_longer() and pivot_wider() when working in the tidyverse.
That said, with the production release of dplyr 1.0.0, the team at RStudio continues to add features that facilitate working with wide format data.
This should solve your problem
library(tidyverse)
library(lubridate)
df_example <- structure(list(date = structure(c(9131, 9132, 9133, 9134, 9135,
9136), class = "Date"), x1 = c(50.75, 62.625, 57.25, 56.571,
36.75, 39.125), x2 = c(62.25, 58.714, 49.875, 56.375, 43.25,
41.625), x3 = c(90.25, NA, 70.125, 75.75, 83.286, 98.5),
x4 = c(60, 72, 68.375, 65.5, 63.25, 55.875), x5 = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), xn = c(53.25,
61.143, 56.571, 58.571, 36.25, 44.375), year = c(1995, 1995, 1995, 1995,
1995, 1995), month = c(1, 1, 1, 1, 1, 1), day = c(1, 2, 3,
4, 5, 6)), row.names = c(NA, -6L), class = c("tbl_df", "tbl",
"data.frame"))
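df_example %>%
pivot_longer(starts_with("x")) %>% # also picks up xn, which x1:x5 would skip
mutate(greater_120 = value > 120) %>%
group_by(year = year(date), name) %>% # group by station as well as by year
summarise(sum_120 = sum(greater_120, na.rm = TRUE))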

Mutate top n rows without throwing away the other rows

I have the following data.frame below. I would like to create a new column w (for weight). w should equal 1 / n for the industries that have the n highest returns for each given date and should equal 0 for the rest of the industries. I can group_by(date) and use top_n(3, wt = return) to filter the top industries and then mutate(w = 1/n), but how can I mutate without throwing away the other industries where w = 0?
structure(list(date = structure(c(16556, 16556, 16556, 16556,
16556, 16556, 16556, 16556, 16556, 16556, 16587, 16587, 16587,
16587, 16587, 16587, 16587, 16587, 16587, 16587, 16617, 16617,
16617, 16617, 16617, 16617, 16617, 16617, 16617, 16617), class = "Date"),
industry = c("Hlth", "Txtls", "BusEq", "Fin", "ElcEq", "Food",
"Beer", "Books", "Cnstr", "Carry", "Clths", "Txtls", "Fin",
"Games", "Cnstr", "Meals", "Hlth", "Hshld", "Telcm", "Rtail",
"Smoke", "Games", "Clths", "Rtail", "Servs", "Meals", "Food",
"Hlth", "Beer", "Trans"), return = c(4.89, 4.37, 4.02, 2.99,
2.91, 2.03, 2, 1.95, 1.86, 1.75, 4.17, 4.09, 1.33, 1.26,
0.42, 0.29, 0.08, -0.11, -0.45, -0.48, 9.59, 6, 5.97, 5.78,
5.3, 4.15, 4.04, 3.67, 3.51, 3.27)), row.names = c(NA, -30L
), class = c("tbl_df", "tbl", "data.frame"))
# A tibble: 30 x 3
date industry return
<date> <chr> <dbl>
1 2015-05-01 Hlth 4.89
2 2015-05-01 Txtls 4.37
3 2015-05-01 BusEq 4.02
4 2015-05-01 Fin 2.99
5 2015-05-01 ElcEq 2.91
6 2015-05-01 Food 2.03
7 2015-05-01 Beer 2
8 2015-05-01 Books 1.95
9 2015-05-01 Cnstr 1.86
10 2015-05-01 Carry 1.75
# ... with 20 more rows
EDIT: How would you handle ties? Suppose there is a tie for third place. The third place weight should be split between 3rd and 4th place (assuming only 2 are tied) with weights of (1/n)/2. The 1st and 2nd place weights stay at 1/n.
EDIT: Suppose n = 3. The top 3 A2 values for each A1 should get a weight w of 1/3 if there are no ties. If there is a tie for 3rd place (T3), then we have (1st, 2nd, T3, T3) and I would like weights to be 1/3, 1/3, 1/6, 1/6 to maintain a total weight of 1. This is only for 3rd place however. (1st, T2, T2) should have weights of 1/3, 1/3, 1/3. (T1, T1, T2, T2) should have weights of 1/3, 1/3, 1/6, 1/6, etc.
structure(list(A1 = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L), .Label = c("A", "B"), class = "factor"), A2 = c(1, 3, 3,
4, 5, 6, 7, 8, 8)), row.names = c(NA, -9L), class = "data.frame")
The output for df should be:
> df
A1 A2 w
1 A 1 0
2 A 3 0.1666
3 A 3 0.1666
4 A 4 0.3333
5 A 5 0.3333
6 B 6 0
7 B 7 0.3333
8 B 8 0.3333
9 B 8 0.3333
We could create a condition with ifelse. After grouping by 'date', arrange the dataset by 'date' and by 'return' in descending order, then create 'w' with the condition that if row_number() is less than or equal to 'n', 'return' is divided by 'n'; otherwise 'w' is 0.
n <- 3
df1 %>%
group_by(date) %>%
arrange(date, -return) %>%
mutate(w = ifelse(row_number() <= n, return/n, 0))
If we are using top_n, create the column 'w' in the filtered dataset and join it with the original (replace_na comes from tidyr):
df1 %>%
group_by(date) %>%
top_n(return, n = 3) %>%
mutate(w = return/n()) %>%
right_join(df1) %>%
mutate(w = replace_na(w, 0))
We can group by date, sort the return variable, take the last 3 entries (the top 3), and return return/n for those rows or 0 otherwise.
library(dplyr)
n <- 3
df %>%
group_by(date) %>%
mutate(w = ifelse(return %in% tail(sort(return), n), return/n, 0))
# date industry return w
# <date> <chr> <dbl> <dbl>
# 1 2015-05-01 Hlth 4.89 1.63
# 2 2015-05-01 Txtls 4.37 1.46
# 3 2015-05-01 BusEq 4.02 1.34
# 4 2015-05-01 Fin 2.99 0
# 5 2015-05-01 ElcEq 2.91 0
# 6 2015-05-01 Food 2.03 0
# 7 2015-05-01 Beer 2 0
#....
The base R equivalent of the same logic, using ave:
ave(df$return, df$date, FUN = function(x) ifelse(x %in% tail(sort(x), n), x/n, 0))
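Note that the snippets above compute return/n, while the question as worded asks for a weight of 1/n for the top n industries. If the plain 1/n is what is wanted, a minimal base R variation could look like this. The data is a trimmed-down, made-up stand-in for the question's data frame, and ties are broken by order of appearance, which is an assumption on my part:

```r
n <- 3
# Trimmed-down stand-in for the question's data (two dates, five rows each)
df <- data.frame(
  date   = as.Date(rep(c("2015-05-01", "2015-06-01"), each = 5)),
  return = c(4.89, 4.37, 4.02, 2.99, 2.91, 4.17, 4.09, 1.33, 1.26, 0.42)
)

# Rank returns within each date (1 = highest), then weight the top n equally
rk   <- ave(-df$return, df$date, FUN = function(x) rank(x, ties.method = "first"))
df$w <- ifelse(rk <= n, 1 / n, 0)
df
```

All rows are kept, the top three per date get 1/3 each, and the rest get 0, so the weights sum to 1 within each date.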
EDIT
As mentioned in the comments, in case of ties the OP wants to return (1/n)/2, i.e. divide by the number of ties.
For this I have created a new, simpler data frame that makes it easy to see what is going on.
df <- data.frame(A1 = rep(c("A", "B"),c(5, 4)), A2 = 1:9)
df$A2[2] <- 3
If we use the current code it gives
df %>%
group_by(A1) %>%
mutate(w = ifelse(A2 %in% tail(sort(A2), n), A2/n, 0))
# A tibble: 9 x 3
# Groups: A1 [2]
# A1 A2 w
# <fct> <dbl> <dbl>
#1 A 1 0
#2 A 3 1
#3 A 3 1
#4 A 4 1.33
#5 A 5 1.67
#6 B 6 0
#7 B 7 2.33
#8 B 8 2.67
#9 B 9 3
which is not what we want. To fix that, we can additionally group by A2 and, only for rows where w != 0, divide w by the number of occurrences of A2.
df %>%
group_by(A1) %>%
mutate(w = ifelse(A2 %in% tail(sort(A2), n), A2/n, 0)) %>%
group_by(A2) %>%
mutate(w1 = ifelse(w != 0, w/n(), w)) %>%
ungroup()
# A1 A2 w w1
# <fct> <dbl> <dbl> <dbl>
#1 A 1 0 0
#2 A 3 1 0.5
#3 A 3 1 0.5
#4 A 4 1.33 1.33
#5 A 5 1.67 1.67
#6 B 6 0 0
#7 B 7 2.33 2.33
#8 B 8 2.67 2.67
#9 B 9 3 3
Another EDIT
It turns out we only want to split w within the last (lowest-ranked) group of the top n, and the w values in each group should sum to 1. For the updated dataset we can do:
n <- 3
temp_df <- df %>%
group_by(A1) %>%
top_n(n, A2)
temp_df %>%
arrange(A1, A2) %>%
mutate(w = ifelse(A2 == A2[1],
(1 - (1/n * sum(A2 != A2[1])))/sum(A2 == A2[1]), 1/n)) %>%
bind_rows(anti_join(df, temp_df) %>%
mutate(w = 0)
) %>%
arrange(A1, A2)
# A1 A2 w
# <fct> <dbl> <dbl>
#1 A 1 0
#2 A 3 0.167
#3 A 3 0.167
#4 A 4 0.333
#5 A 5 0.333
#6 B 6 0
#7 B 7 0.333
#8 B 8 0.333
#9 B 8 0.333
Let's try another variation where all the values in a group are the same.
df1 = df
df1$A2[6:9] <- 10
temp_df <- df1 %>%
group_by(A1) %>%
top_n(n, A2)
temp_df %>%
arrange(A1, A2) %>%
mutate(w = ifelse(A2 == A2[1],
(1 - (1/n * sum(A2 != A2[1])))/sum(A2 == A2[1]), 1/n)) %>%
bind_rows(anti_join(df1, temp_df) %>%
mutate(w = 0)
) %>%
arrange(A1, A2)
# A1 A2 w
# <fct> <dbl> <dbl>
#1 A 1 0
#2 A 3 0.167
#3 A 3 0.167
#4 A 4 0.333
#5 A 5 0.333
#6 B 10 0.25
#7 B 10 0.25
#8 B 10 0.25
#9 B 10 0.25
The logic is: we select the top 3 A2 values within each group using top_n, which keeps ties. Using anti_join, we get all the rows that are not in the top 3 and assign them a fixed weight w of 0. Among the top-3 rows, the rows tied at the lowest value share the weight that remains after 1/n has been assigned to each of the other rows.
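The same logic can be sketched without joins as a small helper applied per group with ave. The function name `tie_weights` is made up, and it assumes there are no NA values; for groups with fewer than n rows it falls back to the smallest value as the cutoff, which is my assumption, not something the question specifies.

```r
# Hypothetical helper: 1/n for values strictly above the n-th largest value,
# and the remaining weight shared equally among values tied at that cutoff
tie_weights <- function(x, n = 3) {
  cutoff <- sort(x, decreasing = TRUE)[min(n, length(x))]  # n-th largest value
  above  <- x > cutoff
  tied   <- x == cutoff
  w <- numeric(length(x))                        # everything else stays at 0
  w[above] <- 1 / n
  w[tied]  <- (1 - sum(above) / n) / sum(tied)   # split what is left
  w
}

# The tie example from the question
df <- data.frame(A1 = rep(c("A", "B"), c(5, 4)),
                 A2 = c(1, 3, 3, 4, 5, 6, 7, 8, 8))
df$w <- ave(df$A2, df$A1, FUN = tie_weights)
df
```

For group A this gives 0, 1/6, 1/6, 1/3, 1/3 and for group B 0, 1/3, 1/3, 1/3, matching the expected output in the question; a group where all four values are equal gets 0.25 each.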

R - Adding numbers within a data frame cell together

I have a data frame in which the values are stored as characters. However, many values contain two numbers that need to be added together. Example:
2014 Q1 Sales 2014 Q2 Sales 2014 Q3 Sales 2014 Q4 Sales
Product 1 3+6 2+10 8 13+2
Product 2 6 4+0 <NA> 5
Product 3 <NA> 5+9 3+1 11
Is there a way to go through the whole data frame and replace all cells containing characters like "3+6" with new values equal to their sum? I assume this would involve coercing the characters to numeric or integers, but I don't know how that would be possible for values with the + sign in them. I would like the example data frame to end up looking like this:
2014 Q1 Sales 2014 Q2 Sales 2014 Q3 Sales 2014 Q4 Sales
Product 1 9 12 8 15
Product 2 6 4 <NA> 5
Product 3 <NA> 14 4 11
Here's an easier example:
dat <- data.frame(a=c("3+6", "10"), b=c("12", NA), c=c("3+4", "5+6"))
dat
## a b c
## 1 3+6 12 3+4
## 2 10 <NA> 5+6
apply(dat, 1:2, function(x) eval(parse(text=x)))
## a b c
## [1,] 9 12 7
## [2,] 10 NA 11
Using R itself to do the computation with eval and parse does the trick.
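Since eval(parse(...)) will run whatever text happens to be in a cell, you may prefer a version that only splits on "+" and sums the pieces. A minimal sketch, where the helper name `sum_cell` is made up and it is assumed that every non-missing cell contains only numbers separated by "+":

```r
# Hypothetical helper: "3+6" -> 9, "10" -> 10, NA -> NA
sum_cell <- function(s) {
  if (is.na(s)) return(NA_real_)
  sum(as.numeric(strsplit(s, "+", fixed = TRUE)[[1]]))
}

dat <- data.frame(a = c("3+6", "10"), b = c("12", NA), c = c("3+4", "5+6"),
                  stringsAsFactors = FALSE)

# Apply cell by cell, column by column; as.character() also covers factor columns
dat[] <- lapply(dat, function(col)
  vapply(as.character(col), sum_cell, numeric(1), USE.NAMES = FALSE))
dat
#    a  b  c
# 1  9 12  7
# 2 10 NA 11
```

Assigning into dat[] keeps the data frame structure, so the result has the same row and column names as the input, with numeric columns.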
Here is one option with gsubfn that avoids eval(parse(...)). We convert the 'data.frame' to a 'matrix' (as.matrix(dat)), match the first number ([0-9]+) and capture it as a group using parentheses ((..)), followed by +, followed by a second captured number, and replace the match with the sum of the two after converting them to numeric. The output can be assigned back to the original dataset to get the same structure as in 'dat'.
library(gsubfn)
dat[] <- as.numeric(gsubfn('([0-9]+)\\+([0-9]+)',
~as.numeric(x)+as.numeric(y), as.matrix(dat)))
dat
# 2014 Q1 Sales 2014 Q2 Sales 2014 Q3 Sales 2014 Q4 Sales
#Product 1 9 12 8 15
#Product 2 6 4 NA 5
#Product 3 NA 14 4 11
Or we can loop the columns with lapply and perform the replacement with gsubfn for each of the columns.
dat[] <- lapply(dat, function(x) as.numeric(gsubfn('([0-9]+)\\+([0-9]+)',
~as.numeric(x)+as.numeric(y), as.character(x))))
data
dat <- structure(list(`2014 Q1 Sales` = structure(c(1L, 2L, NA), .Label = c("3+6",
"6"), class = "factor"), `2014 Q2 Sales` = structure(1:3, .Label = c("2+10",
"4+0", "5+9"), class = "factor"), `2014 Q3 Sales` = structure(c(2L,
NA, 1L), .Label = c("3+1", "8"), class = "factor"), `2014 Q4 Sales` = structure(c(2L,
3L, 1L), .Label = c("11", "13+2", "5"), class = "factor")), .Names = c("2014 Q1 Sales",
"2014 Q2 Sales", "2014 Q3 Sales", "2014 Q4 Sales"), class = "data.frame", row.names = c("Product 1",
"Product 2", "Product 3"))

Resources