I have the following two data frames:
df1 <- data.frame(Category = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
Date = c(2001, 2002, 2003, 2001, 2002, 2003, 2001, 2002, 2003),
Beta1 = c(1, 3, 4, 4, 5, 3, 5, 3, 1),
Beta2 = c(2, 4, 6, 1, 1, 2, 5, 4, 2))
df2 <- data.frame(Date = c(2001, 2002, 2003),
Column1 = c(10, 20, 30),
Column2 = c(40, 50, 60))
Say I assign category A to Column1 and category C to Column2. I want to multiply the row value from Column1 with the row betas from category A, if the dates match. Similarly, I want to multiply the row value from Column2 with the row betas from category C, if the dates match.
The match between a category and a column is of my own choosing. Assigning this myself won't be a problem, I think, because I have relatively few columns.
Preferably, I want the output to look like this:
results <- data.frame(Date = c(2001, 2002, 2003),
Column1_categoryA_beta1 = c(10, 60, 120),
Column1_categoryA_beta2 = c(20, 80, 180),
Column2_categoryC_beta1 = c(200, 150, 60),
Column2_categoryC_beta2 = c(200, 200, 120))
Any help on how best to approach this problem is very much appreciated!
With some data wrangling using tidyr and dplyr this can be achieved like so:
df1 <- data.frame(Category = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
Date = c(2001, 2002, 2003, 2001, 2002, 2003, 2001, 2002, 2003),
Beta1 = c(1, 3, 4, 4, 5, 3, 5, 3, 1),
Beta2 = c(2, 4, 6, 1, 1, 2, 5, 4, 2))
df2 <- data.frame(Date = c(2001, 2002, 2003),
Column1 = c(10, 20, 30),
Column2 = c(40, 50, 60))
library(dplyr)
library(tidyr)
df2_long <- df2 %>%
pivot_longer(-Date, names_to = "Column", values_to = "Value") %>%
mutate(Category = ifelse(Column == "Column1", "A", "C"))
df2_long %>%
left_join(df1) %>%
mutate(Beta1 = Value * Beta1,
Beta2 = Value * Beta2) %>%
select(Date, Category, Column, Beta1, Beta2) %>%
pivot_wider(id_cols = Date, names_from = c("Column", "Category"), values_from = c("Beta1", "Beta2"))
#> Joining, by = c("Date", "Category")
#> Warning: Column `Category` joining character vector and factor, coercing into
#> character vector
#> # A tibble: 3 x 5
#> Date Beta1_Column1_A Beta1_Column2_C Beta2_Column1_A Beta2_Column2_C
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2001 10 200 20 200
#> 2 2002 60 150 80 200
#> 3 2003 120 60 180 120
Created on 2020-04-14 by the reprex package (v0.3.0)
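If more than two columns need to be assigned, the ifelse step could be swapped for a small lookup table. The following is only a sketch under that assumption: the `mapping` data frame and the `result` name are hypothetical, not part of the answer above, and the explicit `by` arguments are added to avoid the join message.

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(Category = rep(c("A", "B", "C"), each = 3),
                  Date = rep(c(2001, 2002, 2003), times = 3),
                  Beta1 = c(1, 3, 4, 4, 5, 3, 5, 3, 1),
                  Beta2 = c(2, 4, 6, 1, 1, 2, 5, 4, 2))
df2 <- data.frame(Date = c(2001, 2002, 2003),
                  Column1 = c(10, 20, 30),
                  Column2 = c(40, 50, 60))

# Hypothetical lookup table: one row per (column, category) assignment
mapping <- data.frame(Column = c("Column1", "Column2"),
                      Category = c("A", "C"))

result <- df2 %>%
  pivot_longer(-Date, names_to = "Column", values_to = "Value") %>%
  inner_join(mapping, by = "Column") %>%               # attach the chosen category
  inner_join(df1, by = c("Date", "Category")) %>%      # match betas on date and category
  mutate(across(starts_with("Beta"), ~ .x * Value)) %>%
  pivot_wider(id_cols = Date,
              names_from = c(Column, Category),
              values_from = starts_with("Beta"))
```

Adding another assignment then only means adding a row to `mapping`.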
One way to get there while keeping the Category variable in the final data frame is the following:
df3 <- left_join(df1, df2, by = "Date")
df4 <- df3 %>%
group_by(Date, Category) %>%
mutate(Col1Bet1 = Column1 * Beta1,
Col1Bet2 = Column1 * Beta2,
Col2Bet1 = Column2 * Beta1,
Col2Bet2 = Column2 * Beta2)
which gives the following:
# A tibble: 9 x 10
# Groups: Date, Category [9]
Category Date Beta1 Beta2 Column1 Column2 Col1Bet1 Col1Bet2 Col2Bet1 Col2Bet2
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 2001 1 2 10 40 10 20 40 80
2 A 2002 3 4 20 50 60 80 150 200
3 A 2003 4 6 30 60 120 180 240 360
4 B 2001 4 1 10 40 40 10 160 40
5 B 2002 5 1 20 50 100 20 250 50
6 B 2003 3 2 30 60 90 60 180 120
7 C 2001 5 5 10 40 50 50 200 200
8 C 2002 3 4 20 50 60 80 150 200
9 C 2003 1 2 30 60 30 60 60 120
This could be a start. The resulting data frame has all the information you want, just in another format.
df3 <- merge(df1, df2)
df3$b1 <- ifelse(df3$Category=="A", df3$Beta1*df3$Column1, ifelse(df3$Category=="C", df3$Beta1*df3$Column2, NA))
df3$b2 <- ifelse(df3$Category=="A", df3$Beta2*df3$Column1, ifelse(df3$Category=="C", df3$Beta2*df3$Column2, NA))
# Date Category Beta1 Beta2 Column1 Column2 b1 b2
# 1 2001 A 1 2 10 40 10 20
# 2 2001 C 5 5 10 40 200 200
# 3 2001 B 4 1 10 40 NA NA
# 4 2002 A 3 4 20 50 60 80
# 5 2002 B 5 1 20 50 NA NA
# 6 2002 C 3 4 20 50 150 200
# 7 2003 B 3 2 30 60 NA NA
# 8 2003 A 4 6 30 60 120 180
# 9 2003 C 1 2 30 60 60 120
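To get from the long format to the wide layout shown in the question using base R only, one possible sketch is a small helper that picks the betas of one category, aligns them to df2 by date, and multiplies them by one column. The helper name `scale_betas` is hypothetical, and the sketch assumes each category has exactly one row per date.

```r
df1 <- data.frame(Category = rep(c("A", "B", "C"), each = 3),
                  Date = rep(c(2001, 2002, 2003), times = 3),
                  Beta1 = c(1, 3, 4, 4, 5, 3, 5, 3, 1),
                  Beta2 = c(2, 4, 6, 1, 1, 2, 5, 4, 2))
df2 <- data.frame(Date = c(2001, 2002, 2003),
                  Column1 = c(10, 20, 30),
                  Column2 = c(40, 50, 60))

# Hypothetical helper: multiply the betas of one category by one df2 column
scale_betas <- function(column, category) {
  betas <- df1[df1$Category == category, ]
  # Align the beta rows to the dates in df2
  betas <- betas[match(df2$Date, betas$Date), c("Beta1", "Beta2")]
  out <- betas * df2[[column]]   # each beta column is multiplied row by row
  names(out) <- paste(column, paste0("category", category),
                      tolower(names(out)), sep = "_")
  out
}

results <- cbind(Date = df2$Date,
                 scale_betas("Column1", "A"),
                 scale_betas("Column2", "C"))
rownames(results) <- NULL
```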
library(data.table)
table1 <- data.table(id1 = c(1324, 2324, 29, 29, 1010, 1010),
type = c(1, 1, 2, 1, 1, 1),
class = c("A", "A", "B", "D", "D", "A"),
number = c(1, 98, 100, 100, 70, 70))
table2 <- data.table(id2 = c(1998, 1998, 2000, 2000, 2000, 2010, 2012, 2012),
type = c(1, 1, 3, 1, 1, 5, 1, 1),
class = c("D", "A", "D", "D", "A", "B", "A", "A"),
min_number = c(34, 0, 20, 45, 5, 23, 1, 1),
max_number = c(50, 100, 100, 100, 100, 9, 10, 100))
> table1
id1 type class number
1: 1324 1 A 1
2: 2324 1 A 98
3: 29 2 B 100
4: 29 1 D 100
5: 1010 1 D 70
6: 1010 1 A 70
> table2
id2 type class min_number max_number
1: 1998 1 D 34 50
2: 1998 1 A 0 100
3: 2000 3 D 20 100
4: 2000 1 D 45 100
5: 2000 1 A 5 100
6: 2010 5 B 23 9
7: 2012 1 A 1 10
8: 2012 1 A 1 100
I have two tables, and I would like to merge them based on type, class, and whether number lies between min_number and max_number. Then I would like to create a new variable nMatch that stores the number of unique id2s that match with each id1.
setindexv(table2, c("type", "class"))
for (t1_row in seq_len(nrow(table1))) {
print(t1_row)
set(
table1, t1_row, "matches",
table2[table1[t1_row], on = c("type", "class", "max_number >= number", "min_number <= number"), .(list(id2))]
)
}
> table1[, .(nMatch = uniqueN(unlist(matches), na.rm = TRUE)), by = .(id1)]
id1 nMatch
1: 1324 2
2: 2324 3
3: 29 1
4: 1010 3
The approach above is row-by-row as suggested here, but my real dataset has millions of rows. What's another way of doing this that's faster?
You can use a data.table non-equi join, with the range conditions given in on = .(...), to merge the two tables in a single step:
table1[
table2,
.(id1, id2),
on = .(type, class, number >= min_number, number <= max_number),
nomatch = NULL
][
,
.(nMatch = uniqueN(id2)),
id1
]
which gives
id1 nMatch
1: 1324 2
2: 1010 3
3: 2324 3
4: 29 1
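To sanity-check the non-equi join on a small sample, the same matching rule can be written in base R as an equi-join on type and class followed by a range filter. This is only a verification sketch: it materialises every candidate pair, so it is not meant for millions of rows, and the names `cand`, `hits` and `res` are made up for the example.

```r
table1 <- data.frame(id1 = c(1324, 2324, 29, 29, 1010, 1010),
                     type = c(1, 1, 2, 1, 1, 1),
                     class = c("A", "A", "B", "D", "D", "A"),
                     number = c(1, 98, 100, 100, 70, 70))
table2 <- data.frame(id2 = c(1998, 1998, 2000, 2000, 2000, 2010, 2012, 2012),
                     type = c(1, 1, 3, 1, 1, 5, 1, 1),
                     class = c("D", "A", "D", "D", "A", "B", "A", "A"),
                     min_number = c(34, 0, 20, 45, 5, 23, 1, 1),
                     max_number = c(50, 100, 100, 100, 100, 9, 10, 100))

# All pairs that agree on type and class
cand <- merge(table1, table2, by = c("type", "class"))

# Keep pairs where number lies in [min_number, max_number], then deduplicate
in_range <- cand$number >= cand$min_number & cand$number <= cand$max_number
hits <- unique(cand[in_range, c("id1", "id2")])

# Count the distinct id2 values per id1
res <- aggregate(id2 ~ id1, data = hits, FUN = length)
names(res) <- c("id1", "nMatch")
```

On the sample tables this should reproduce the counts shown above, with the rows sorted by id1.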
An option with the tidyverse (join_by() needs dplyr 1.1.0 or later):
library(dplyr)
library(tidyr)
left_join(table1, table2, by =
join_by(type, class, number >= min_number, number <= max_number)) %>%
distinct(id1, id2) %>%
drop_na %>%
count(id1, name = "nMatch")
Output:
id1 nMatch
<num> <int>
1: 29 1
2: 1010 3
3: 1324 2
4: 2324 3
I am trying to subtract the value of one group from another. I am hoping to use the tidyverse.
df <- structure(list(A = c(1, 1, 1, 2, 2, 2, 3, 3, 3), group = c("a",
"b", "c", "a", "b", "c", "a", "b", "c"), value = c(10, 11, 12,
11, 40, 23, 71, 72, 91)), class = "data.frame", row.names = c(NA,
-9L))
That is my data. Within each value of A, I want to subtract the value of group a from the values of groups b and c, and store the difference in one variable.
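Base R solution
# Mask every non-'a' value with NA first, so that the per-A mean picks up only the 'a' value.
# (Indexing x with df$group == 'a' inside FUN would mix a group-sized vector with a full-length mask.)
df$new <- df$value - ave(ifelse(df$group == 'a', df$value, NA), df$A, FUN = function(x) mean(x, na.rm = TRUE))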
> df
A group value new
1 1 a 10 0
2 1 b 11 1
3 1 c 12 2
4 2 a 11 0
5 2 b 40 29
6 2 c 23 12
7 3 a 71 0
8 3 b 72 1
9 3 c 91 20
dplyr method (assuming there is exactly one 'a' row for each value of A; otherwise value[group == 'a'] does not have length one and the subtraction is no longer well defined)
df %>% group_by(A) %>% mutate(new = ifelse(group != 'a', value - value[group == 'a'], value) )
# A tibble: 9 x 4
# Groups: A [3]
A group value new
<dbl> <chr> <dbl> <dbl>
1 1 a 10 10
2 1 b 11 1
3 1 c 12 2
4 2 a 11 11
5 2 b 40 29
6 2 c 23 12
7 3 a 71 71
8 3 b 72 1
9 3 c 91 20
Or, if you want the difference for every row, including the 'a' rows (where it is zero):
df %>% group_by(A) %>% mutate(new = value - value[group == 'a'] )
# A tibble: 9 x 4
# Groups: A [3]
A group value new
<dbl> <chr> <dbl> <dbl>
1 1 a 10 0
2 1 b 11 1
3 1 c 12 2
4 2 a 11 0
5 2 b 40 29
6 2 c 23 12
7 3 a 71 0
8 3 b 72 1
9 3 c 91 20
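If it is not certain that every value of A has exactly one 'a' row, a lookup with match makes that assumption explicit. This is a sketch, not part of the answers above; `baseline` is a hypothetical name, and a value of A without an 'a' row would simply get NA.

```r
df <- data.frame(A = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                 group = c("a", "b", "c", "a", "b", "c", "a", "b", "c"),
                 value = c(10, 11, 12, 11, 40, 23, 71, 72, 91))

# One baseline row per A: the rows of group 'a'
baseline <- df[df$group == "a", ]

# Fail early if some A has more than one 'a' row
stopifnot(!anyDuplicated(baseline$A))

# Look up each row's baseline by A and subtract it
df$new <- df$value - baseline$value[match(df$A, baseline$A)]
```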
I only used data.table rather than data.frame because I'm more familiar with it.
library(data.table)
data <- setDT(structure(list(A = c(1, 1, 1, 2, 2, 2, 3, 3, 3), group = c("a",
"b", "c", "a", "b", "c", "a", "b", "c"), value = c(10, 11, 12,
11, 40, 23, 71, 72, 91)), class = "data.frame", row.names = c(NA,-9L)))
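# Grouped update by reference: within each A, subtract the 'a' value from every row.
# This avoids the explicit loop and does not require A to be numbered 1, 2, 3, ...
data[, substraction := value - value[group == 'a'], by = A]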
Here is a piece of my data:
data_x <- tribble(
~price, ~bokey, ~id, ~cost, ~revenue,
1, "a", 10, 0.20, 30,
2, "b", 20, 0.30, 60,
3, "c", 20, 0.30, 40,
4, "d", 10, 0.20, 100,
5, "e", 30, 0.10, 40,
6, "f", 10, 0.20, 10,
1, "g", 20, 0.30, 80,
2 , "h", 10, 0.20, 20,
3, "h", 30, 0.10, 20,
3, "i", 20, 0.30, 40,
)
As you can see, there are three different ids: 10, 20, and 30. In the real data, there are almost 100 ids. I want to aggregate the data by these ids. Because I don't know how to do it in a loop, I created some subsets by hand:
data_10 <- data_x %>% filter(id == 10)
data_20 <- data_x %>% filter(id == 20)
data_30 <- data_x %>% filter(id == 30)
Here is the aggregated data:
data_agg <- data_10 %>%
group_by(priceseg = cut(as.numeric(price), c(0, 1, 3, 5, 6))) %>%
summarise(price_n = n_distinct(bokey),
Cost = sum(cost, na.rm = T),
Revenue = sum(revenue, na.rm = T),
clicks = n_distinct(bokey)) %>%
mutate(price_n2 = round(100 * prop.table(price_n), 2),
(zet = Cost/Revenue))
But I want to have one more column that shows the id. Here is the desired data:
data_desired <- tribble(
~id, ~priceseg, ~price_n, ~Cost, ~Revenue, ~clicks, ~price_n2, ~`(zet = Cost/Revenue)`
10, (0,1] 1 0.2 30 1 25 0.00667
10, (1,3] 1 0.2 20 1 25 0.01
10, (3,5] 1 0.2 100 1 25 0.002
10, (5,6] 1 0.2 10 1 25 0.02
20,
20,
.
.
30,
)
How can I get it?
Since you are already using dplyr, just add id as one of the grouping variables (no need to previously separate your data):
data_agg <- data_x %>%
group_by(id, priceseg = cut(as.numeric(price), c(0, 1, 3, 5, 6))) %>%
summarise(price_n = n_distinct(bokey),
Cost = sum(cost, na.rm = T),
Revenue = sum(revenue, na.rm = T),
clicks = n_distinct(bokey)) %>%
mutate(price_n2 = round(100 * prop.table(price_n), 2),
(zet = Cost/Revenue))
# A tibble: 8 x 8
# Groups: id [3]
# id priceseg price_n Cost Revenue clicks price_n2 `(zet = Cost/Revenue)`
# <dbl> <fct> <int> <dbl> <dbl> <int> <dbl> <dbl>
# 1 10 (0,1] 1 0.2 30 1 25 0.00667
# 2 10 (1,3] 1 0.2 20 1 25 0.01
# 3 10 (3,5] 1 0.2 100 1 25 0.002
# 4 10 (5,6] 1 0.2 10 1 25 0.02
# 5 20 (0,1] 1 0.3 80 1 25 0.00375
# 6 20 (1,3] 3 0.900 140 3 75 0.00643
# 7 30 (1,3] 1 0.1 20 1 50 0.005
# 8 30 (3,5] 1 0.1 40 1 50 0.0025
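The percentages in price_n2 add up to 100 within each id because summarise drops only the last grouping level, so the following mutate still runs per id. The sketch below illustrates this on made-up data; `toy`, `seg` and `share` are hypothetical names, and `.groups = "drop_last"` just spells out the default behaviour.

```r
library(dplyr)

# Hypothetical toy data: two segments for id 10, one for id 20
toy <- tibble(id = c(10, 10, 20),
              seg = c("x", "y", "x"),
              n = c(1, 3, 5))

out <- toy %>%
  group_by(id, seg) %>%
  summarise(n = sum(n), .groups = "drop_last") %>%  # result is still grouped by id
  mutate(share = n / sum(n))                        # so sum(n) is taken per id

group_vars(out)
out$share
```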
Another option is to split the data by id and loop over the pieces with map_dfr, passing .id so that the names of the split list become the id column:
library(dplyr)
library(purrr)
data_x %>%
split(.$id) %>%
map_dfr(~
.x %>%
group_by(priceseg = cut(as.numeric(price), c(0, 1, 3, 5, 6))) %>%
summarise(price_n = n_distinct(bokey),
Cost = sum(cost, na.rm = T),
Revenue = sum(revenue, na.rm = T),
clicks = n_distinct(bokey)) %>%
mutate(price_n2 = round(100 * prop.table(price_n), 2),
(zet = Cost/Revenue)), .id = "id" )
# A tibble: 8 x 8
# id priceseg price_n Cost Revenue clicks price_n2 `(zet = Cost/Revenue)`
# <chr> <fct> <int> <dbl> <dbl> <int> <dbl> <dbl>
#1 10 (0,1] 1 0.2 30 1 25 0.00667
#2 10 (1,3] 1 0.2 20 1 25 0.01
#3 10 (3,5] 1 0.2 100 1 25 0.002
#4 10 (5,6] 1 0.2 10 1 25 0.02
#5 20 (0,1] 1 0.3 80 1 25 0.00375
#6 20 (1,3] 3 0.900 140 3 75 0.00643
#7 30 (1,3] 1 0.1 20 1 50 0.005
#8 30 (3,5] 1 0.1 40 1 50 0.0025
The cut step can also be replaced with findInterval
NOTE: The idea of split/map is based on the OP's title about looping and getting the output
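As a rough sketch of that substitution: with left.open = TRUE, findInterval returns the index of the same right-closed interval that cut assigns, only as an integer rather than a labelled factor. The `price` and `breaks` vectors below are taken from the example; the variable names are chosen for illustration.

```r
price <- c(1, 2, 3, 4, 5, 6, 1, 2, 3, 3)
breaks <- c(0, 1, 3, 5, 6)

# cut() labels each value with its interval (0,1], (1,3], (3,5], (5,6]
seg_factor <- cut(price, breaks)

# findInterval() with left.open = TRUE gives the index of the same interval
seg_index <- findInterval(price, breaks, left.open = TRUE)

# The factor codes and the interval indices agree
identical(as.integer(seg_factor), seg_index)
```

The integer index is enough for grouping; the interval labels are only needed if they should appear in the output.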
I am really an R newbie, but please help me to complete this assignment.
I have the following sample of a dataset on "polityScore", and I need to create a new variable called "politicalChange" that is based on the yearly changes in the first variable, following these conditions:
if polityScore in A in year1 + 1 > polityScore in A in year1---> "democratization"
if polityScore in A in year1 + 1 < polityScore in A in year1---> "autocratization"
if polityScore in A in year1 + 1 = polityScore in A in year1---> "no change"
the data:
country, date, polityScore, politicalChange
A ,2000 ,5 ,
A ,2001 ,6 ,
A ,2002 ,4 ,
A ,2003 ,5 ,
A ,2004 ,5 ,
A ,2005 ,7 ,
B ,2000 ,5 ,
B ,2001 ,6 ,
B ,2002 ,4 ,
B ,2003 ,5 ,
B ,2004 ,5 ,
B ,2005 ,7 ,
Thank you!
You probably want something like the code below, which uses the dplyr package. First group by country so that the if_else logic is applied within each country. The if_else then compares each polityScore with the polityScore from one year earlier (via lag) and fills in "democratization", "autocratization" or "no change" accordingly. The first value in each group will be NA, because there is no earlier year to compare with. This assumes the rows are sorted by date within each country.
If you want "no change" instead of NA, add default = first(polityScore) to the lag call.
library(dplyr)
df1 %>%
group_by(country) %>%
mutate(politicalChange = if_else(polityScore > lag(polityScore), "democratization",
if_else(polityScore < lag(polityScore), "autocratization", "no change")))
# A tibble: 12 x 4
# Groups: country [2]
country date polityScore politicalChange
<chr> <dbl> <dbl> <chr>
1 A 2000 5 NA
2 A 2001 6 democratization
3 A 2002 4 autocratization
4 A 2003 5 democratization
5 A 2004 5 no change
6 A 2005 7 democratization
7 B 2000 5 NA
8 B 2001 6 democratization
9 B 2002 4 autocratization
10 B 2003 5 democratization
11 B 2004 5 no change
12 B 2005 7 democratization
For readability of your rules, you could also use case_when instead of nested if_else. Note that the TRUE fallback also catches the NA comparisons, so the first year of each country becomes "no change".
df1 %>%
group_by(country) %>%
mutate(politicalChange = case_when(polityScore > lag(polityScore) ~ "democratization",
polityScore < lag(polityScore) ~ "autocratization",
TRUE ~ "no change"))
# A tibble: 12 x 4
# Groups: country [2]
country date polityScore politicalChange
<chr> <dbl> <dbl> <chr>
1 A 2000 5 no change
2 A 2001 6 democratization
3 A 2002 4 autocratization
.....
data:
df1 <- structure(list(country = c("A", "A", "A", "A", "A", "A", "B",
"B", "B", "B", "B", "B"), date = c(2000, 2001, 2002, 2003, 2004,
2005, 2000, 2001, 2002, 2003, 2004, 2005), polityScore = c(5,
6, 4, 5, 5, 7, 5, 6, 4, 5, 5, 7), politicalChange = c(NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), row.names = c(NA, -12L
), class = "data.frame")
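For comparison, the same rule can be sketched in base R with ave, diff and sign. This is an alternative under the same assumption that rows are sorted by date within each country; the intermediate `change` column and the `labels` vector are hypothetical names.

```r
df1 <- data.frame(country = rep(c("A", "B"), each = 6),
                  date = rep(2000:2005, times = 2),
                  polityScore = rep(c(5, 6, 4, 5, 5, 7), times = 2))

# Year-on-year change within each country; the first year has no predecessor
df1$change <- ave(df1$polityScore, df1$country,
                  FUN = function(x) c(NA, diff(x)))

# sign() gives -1, 0 or 1; shifting by 2 turns that into an index 1, 2 or 3
labels <- c("autocratization", "no change", "democratization")
df1$politicalChange <- labels[sign(df1$change) + 2]
```

The first year of each country stays NA, as in the if_else version.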
P.S.
Check bookdown.org for many books on R that can help you further.
I have a data frame, and I'd like to create a new column that gives the sum of a numeric variable grouped by factors. So something like this:
BEFORE:
data1 <- data.frame(month = c(1, 1, 2, 2, 3, 3),
sex = c("m", "f", "m", "f", "m", "f"),
value = c(10, 20, 30, 40, 50, 60))
AFTER:
data2 <- data.frame(month = c(1, 1, 2, 2, 3, 3),
sex = c("m", "f", "m", "f", "m", "f"),
value = c(10, 20, 30, 40, 50, 60),
sum = c(30, 30, 70, 70, 110, 110))
In Stata you can do this with the egen command quite easily. I've tried the aggregate function, and the ddply function but they create entirely new data frames, and I just want to add a column to the existing one.
You are looking for ave:
> data2 <- transform(data1, sum=ave(value, month, FUN=sum))
month sex value sum
1 1 m 10 30
2 1 f 20 30
3 2 m 30 70
4 2 f 40 70
5 3 m 50 110
6 3 f 60 110
data1$sum <- ave(data1$value, data1$month, FUN=sum) is useful if you don't want to use transform
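One thing to watch with ave and FUN = sum: a single NA in value makes the sum of its whole group NA. A hedged sketch of one way around this is to pass na.rm = TRUE through an anonymous function. The NA below is made up for illustration and is not in the original data.

```r
# Same shape as data1, but with one hypothetical missing value
data1 <- data.frame(month = c(1, 1, 2, 2),
                    sex = c("m", "f", "m", "f"),
                    value = c(10, NA, 30, 40))

# Group sums that skip missing values
data1$sum <- ave(data1$value, data1$month,
                 FUN = function(x) sum(x, na.rm = TRUE))
```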
data.table is also helpful:
library(data.table)
DT <- data.table(data1)
DT[, sum:=sum(value), by=month]
UPDATE
We can also use a tidyverse approach which is simple, yet elegant:
> library(tidyverse)
> data1 %>%
group_by(month) %>%
mutate(sum=sum(value))
# A tibble: 6 x 4
# Groups: month [3]
month sex value sum
<dbl> <fct> <dbl> <dbl>
1 1 m 10 30
2 1 f 20 30
3 2 m 30 70
4 2 f 40 70
5 3 m 50 110
6 3 f 60 110