I conducted an analysis for some M&A-Deals. My current output looks like this:
Deal-Nr | Event-Date | Target-Nation | CAR | SIC
----------------------------------------------------
1 | 01-01-1999 | Italy | 5.1% | 201
2 | 02-01-1999 | Germany | 2.3% | 202
3 | 06-01-1999 | Spain | 1.5% | 201
4 | 10-09-1999 | Germany | 0.3% | 201
5 | 15-09-1999 | UK | 1.1% | 201
6 | 25-10-2000 | Spain | 0.8% | 201
However, for my final analysis I want to exclude all deals within the same SIC code that do not have at least 180 trading days between them. In this case, I would want to exclude deal 3 from the analysis (deals 1 and 3 have the same SIC code and are fewer than 180 days apart). The code should then continue to the next deal within that SIC code and remove it (<180 days) or keep it (>180 days). This should be done for all the different SIC codes in my analysis.
As I'm rather new to R, I'm reaching out for help. Thank you so much for your support.
Edit:
As requested below, here is some further information. I'm interested in the deals that are in the same SIC code and >180 days apart. In the table above, this would mean removing row (3) and row (5). Once a deal is more than 180 days apart, the subsequent dates should be checked in turn.
First, your Event.Date column needs to be a real date, not a string. I'm inferring day-month-year. From there, we need to group by SIC and calculate the difference in dates. This assumes the rows are already sorted by date within each SIC.
base R
dat$Event.Date <- as.Date(dat$Event.Date, format = "%d-%m-%Y")
keep <- ave(as.numeric(dat$Event.Date), dat$SIC, FUN = function(z) c(TRUE, diff(z) >= 180)) > 0
dat[keep,]
# Deal.Nr Event.Date Target.Nation CAR SIC
# 1 1 1999-01-01 Italy 5.1% 201
# 2 2 1999-01-02 Germany 2.3% 202
# 4 4 1999-09-10 Germany 0.3% 201
# 6 6 2000-10-25 Spain 0.8% 201
dplyr
library(dplyr)
dat %>%
  mutate(Event.Date = as.Date(Event.Date, format = "%d-%m-%Y")) %>%
  group_by(SIC) %>%
  filter(c(TRUE, diff(Event.Date) >= 180)) %>%
  ungroup()
# # A tibble: 4 x 5
# Deal.Nr Event.Date Target.Nation CAR SIC
# <int> <date> <chr> <chr> <int>
# 1 1 1999-01-01 Italy 5.1% 201
# 2 2 1999-01-02 Germany 2.3% 202
# 3 4 1999-09-10 Germany 0.3% 201
# 4 6 2000-10-25 Spain 0.8% 201
data.table
library(data.table)
as.data.table(dat
  )[, Event.Date := as.Date(Event.Date, format = "%d-%m-%Y")
  ][, .SD[c(TRUE, diff(Event.Date) >= 180),], by = .(SIC)]
# SIC Deal.Nr Event.Date Target.Nation CAR
# 1: 201 1 1999-01-01 Italy 5.1%
# 2: 201 4 1999-09-10 Germany 0.3%
# 3: 201 6 2000-10-25 Spain 0.8%
# 4: 202 2 1999-01-02 Germany 2.3%
Data
dat <- structure(list(Deal.Nr = 1:6, Event.Date = c("01-01-1999", "02-01-1999", "06-01-1999", "10-09-1999", "15-09-1999", "25-10-2000"), Target.Nation = c("Italy", "Germany", "Spain", "Germany", "UK", "Spain"), CAR = c("5.1%", "2.3%", "1.5%", "0.3%", "1.1%", "0.8%"), SIC = c(201L, 202L, 201L, 201L, 201L, 201L)), row.names = c(NA, -6L), class = "data.frame")
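One caveat on the approaches above: diff() compares each deal with the row immediately before it, whether or not that row was kept, and it counts calendar days rather than trading days. If the intent is to compare each deal with the last deal that was actually kept in the same SIC code, a small loop is one way to do it. The following is a minimal sketch under that assumption; keep_spaced is a hypothetical helper name, not part of any package, and it still counts calendar days.

```r
# Sample data from the question (only the columns needed here)
dat <- data.frame(
  Deal.Nr = 1:6,
  Event.Date = c("01-01-1999", "02-01-1999", "06-01-1999",
                 "10-09-1999", "15-09-1999", "25-10-2000"),
  SIC = c(201L, 202L, 201L, 201L, 201L, 201L),
  stringsAsFactors = FALSE
)
dat$Event.Date <- as.Date(dat$Event.Date, format = "%d-%m-%Y")

# Hypothetical helper: TRUE for every date that is at least `gap` days
# after the last *kept* date (not merely the previous row).
keep_spaced <- function(days, gap = 180) {
  days <- as.numeric(days)
  keep <- logical(length(days))
  last_kept <- -Inf                 # the earliest date is always kept
  for (i in order(days)) {          # visit rows in chronological order
    if (days[i] - last_kept >= gap) {
      keep[i] <- TRUE
      last_kept <- days[i]
    }
  }
  keep
}

# Apply the helper within each SIC code; ave() returns 0/1, hence the > 0
keep <- ave(as.numeric(dat$Event.Date), dat$SIC, FUN = keep_spaced) > 0
result <- dat[keep, ]
```

On the sample data this keeps deals 1, 2, 4 and 6, the same rows as the diff() versions. The two rules differ when, for example, three deals fall on days 0, 100 and 200: the diff() rule drops both later deals, while the loop keeps the third because it is 200 days after the last kept one.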
I have a dataframe that looks like this:
+------------+
|site |
+------------+
|JPN Tokyo |
|AUS Sydney |
|CHN Beijing |
But I'd like to make duplicate rows of the existing rows but with the 2nd and 3rd character changed to lowercase such that the dataframe becomes like this:
+------------+
|site |
+------------+
|JPN Tokyo |
|Jpn Tokyo |
|AUS Sydney |
|Aus Sydney |
|CHN Beijing |
|Chn Beijing |
Would anyone have an idea how to do that?
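We expand the rows with uncount so that each row appears twice, flag the second copy of each 'site' with duplicated, and, for those flagged rows, convert everything after the first character of the leading word to lower case using sub (with perl = TRUE) inside case_when.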
library(dplyr)
library(tidyr)
library(stringr)
df1 <- df1 %>%
uncount(2) %>%
mutate(site = case_when(duplicated(site)
~ sub("^(.)(\\w+)", "\\1\\L\\2", site, perl = TRUE),
TRUE ~ site))
-output
df1
# A tibble: 6 x 1
site
<chr>
1 JPN Tokyo
2 Jpn Tokyo
3 AUS Sydney
4 Aus Sydney
5 CHN Beijing
6 Chn Beijing
data
df1 <- structure(list(site = c("JPN Tokyo", "AUS Sydney", "CHN Beijing"
)), class = "data.frame", row.names = c(NA, -3L))
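To see what the regular expression does on its own, here is a small sketch applied to a plain character vector (the vector x is just the question's sample values). With perl = TRUE, \\1 is the first character and \\L lower-cases whatever follows it in the replacement, here the rest of the first word captured as \\2.

```r
x <- c("JPN Tokyo", "AUS Sydney", "CHN Beijing")

# "^(.)" captures the first character, "(\\w+)" the rest of the first word;
# the remainder of the string is not matched and is left untouched.
converted <- sub("^(.)(\\w+)", "\\1\\L\\2", x, perl = TRUE)
converted
# [1] "Jpn Tokyo"   "Aus Sydney"  "Chn Beijing"
```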
Edit: @AnilGoyal suggested using map_dfr, which reduces the call to a single line.
library(tidyverse)
data <-
tribble(
~site,
'JPN Tokyo',
'AUS Sydney',
'CHN Beijing' )
#option1
map_dfr(data$site, ~list(sites = c(.x, str_to_title(.x))))
#> # A tibble: 6 x 1
#> sites
#> <chr>
#> 1 JPN Tokyo
#> 2 Jpn Tokyo
#> 3 AUS Sydney
#> 4 Aus Sydney
#> 5 CHN Beijing
#> 6 Chn Beijing
#option2
map(data$site, ~rbind(.x, str_to_title(.x))) %>%
reduce(rbind) %>%
tibble(site = .)
#> # A tibble: 6 x 1
#> site[,1]
#> <chr>
#> 1 JPN Tokyo
#> 2 Jpn Tokyo
#> 3 AUS Sydney
#> 4 Aus Sydney
#> 5 CHN Beijing
#> 6 Chn Beijing
Created on 2021-06-08 by the reprex package (v2.0.0)
You can use substr to replace characters at specific positions.
df1 <- df
substr(df1$site, 2, 3) <- tolower(substr(df1$site, 2, 3))
df1
# site
#1 Jpn Tokyo
#2 Aus Sydney
#3 Chn Beijing
res <- rbind(df1, df)
res[order(res$site), , drop = FALSE]
# site
#2 Aus Sydney
#5 AUS Sydney
#3 Chn Beijing
#6 CHN Beijing
#1 Jpn Tokyo
#4 JPN Tokyo
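Note that ordering by site sorts the result alphabetically, so the original row order is lost. If the original order matters, one option is to interleave the two vectors with c(rbind(...)). This is a minimal base R sketch, assuming the data frame has the single site column from the question; the name lower is only for illustration.

```r
df <- data.frame(site = c("JPN Tokyo", "AUS Sydney", "CHN Beijing"),
                 stringsAsFactors = FALSE)

# Lower-case characters 2 and 3 in a copy of the column
lower <- df$site
substr(lower, 2, 3) <- tolower(substr(lower, 2, 3))

# rbind() stacks the two vectors as rows; c() then reads the matrix
# column by column, which alternates original and modified values.
res <- data.frame(site = c(rbind(df$site, lower)), stringsAsFactors = FALSE)
res
```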
I have the following data and am looking to create the "Final Col" shown below using dplyr in R. I would appreciate your ideas.
| Year | Week | MainCat|Qty |Final Col |
|:----: |:------: |:-----: |:-----:|:------------:|
| 2017 | 1 | Edible |69 |69/(69+12) |
| 2017 | 2 | Edible |12 |12/(69+12) |
| 2017 | 1 | Flowers|88 |88/(88+47) |
| 2017 | 2 | Flowers|47 |47/(88+47) |
| 2018 | 1 | Edible |90 |90/(90+35) |
| 2018 | 2 | Edible |35 |35/(90+35) |
| 2018 | 1 | Flowers|78 |78/(78+85) |
| 2018 | 2 | Flowers|85 |85/(78+85) |
This can be done with a grouped operation: group by 'Year' and 'MainCat', then divide 'Qty' by the group's sum of 'Qty' to create the 'Final' column.
library(dplyr)
df1 <- df1 %>%
group_by(Year, MainCat) %>%
mutate(Final = Qty/sum(Qty))
You can use prop.table :
library(dplyr)
df %>% group_by(Year, MainCat) %>% mutate(Final = prop.table(Qty))
# Year Week MainCat Qty Final
# <int> <int> <chr> <int> <dbl>
#1 2017 1 Edible 69 0.852
#2 2017 2 Edible 12 0.148
#3 2017 1 Flowers 88 0.652
#4 2017 2 Flowers 47 0.348
#5 2018 1 Edible 90 0.72
#6 2018 2 Edible 35 0.28
#7 2018 1 Flowers 78 0.479
#8 2018 2 Flowers 85 0.521
You can also do this in base R :
df$Final <- with(df, ave(Qty, Year, MainCat, FUN = prop.table))
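A quick way to check any of these versions is to confirm that the shares add up to 1 within each Year/MainCat group. The sketch below rebuilds the question's table by hand (so the data frame construction is an assumption about how the data is stored) and uses the base R version.

```r
df <- data.frame(
  Year    = rep(c(2017L, 2018L), each = 4),
  Week    = rep(1:2, times = 4),
  MainCat = rep(rep(c("Edible", "Flowers"), each = 2), times = 2),
  Qty     = c(69, 12, 88, 47, 90, 35, 78, 85),
  stringsAsFactors = FALSE
)

# Share of each row within its Year/MainCat group
df$Final <- with(df, ave(Qty, Year, MainCat, FUN = prop.table))

# Sum of the shares per group; every cell should equal 1
group_totals <- tapply(df$Final, list(df$Year, df$MainCat), sum)
group_totals
```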
This is a continued question from the post Remove the first row from each group if the second row meets a condition
Below is a sample dataset:
df <- data.frame(id=c("9","9","9","5","5","4","4","4","4","4","20","20"),
Date=c("11/29/2018","11/29/2018","11/29/2018","2/13/2019","2/13/2019",
"6/15/2018","6/20/2018","8/17/2018","8/20/2018","8/23/2018","12/25/2018","12/25/2018"),
Buyer= c("John","John","John","Maria","Maria","Sandy","Sandy","Sandy","Sandy","Sandy","Paul","Paul"),
Amount= c("959","1158","596","922","922","1849","4193","4256","65","100","313","99"), stringsAsFactors = F) %>%
group_by(Buyer,id) %>% mutate(diffs = c(NA, diff(as.Date(Date, format = "%m/%d/%Y"))))
which would look like:
| id | Date | Buyer | diff | Amount |
|----|:----------:|------:|------|--------|
| 9 | 11/29/2018 | John | NA | 959 |
| 9 | 11/29/2018 | John | 0 | 1158 |
| 9 | 11/29/2018 | John | 0 | 596 |
| 5 | 2/13/2019 | Maria | 76 | 922 |
| 5 | 2/13/2019 | Maria | 0 | 922 |
| 4 | 6/15/2018 | Sandy | -243 | 1849 |
| 4 | 6/20/2018 | Sandy | 5 | 4193 |
| 4 | 8/17/2018 | Sandy | 58 | 4256 |
| 4 | 8/20/2018 | Sandy | 3 | 65 |
| 4 | 8/23/2018 | Sandy | 3 | 100 |
| 20 | 12/25/2018 | Paul | 124 | 313 |
| 20 | 12/25/2018 | Paul | 0 | 99 |
For each buyer and id, I need to retain the records where consecutive rows are at most 5 days apart and their amounts sum to more than 5000. For example, buyer 'Sandy' with id '4' has two transactions of 1849 and 4193 on '6/15/2018' and '6/20/2018', which are 5 days apart; since these two amounts sum to more than 5000, the output should contain these records. The same buyer 'Sandy' with id '4' has further transactions of 4256, 65 and 100 on '8/17/2018', '8/20/2018' and '8/23/2018', each 3 days apart, but the output should not contain these records because their sum is below 5000.
The final output would look like:
| id | Date | Buyer | diff | Amount |
|----|:---------:|------:|------|--------|
| 4 | 6/15/2018 | Sandy | -243 | 1849 |
| 4 | 6/20/2018 | Sandy | 5 | 4193 |
Changing Date from character to Date and Amount from character to numeric:
df$Date <- as.Date(df$Date, format = '%m/%d/%Y')  # %Y: four-digit year (%y expects only two digits)
df$Amount<-as.numeric(df$Amount)
Here I group the dataset by id, arrange it by Date, and create a rank within each id (for example, Sandy gets ranks 1 through 5 for the 5 different days on which she shopped). I then define a new variable, ConsecutiveSum, which adds each row's Amount to the previous row's Amount (lag gives you the previous row). The ifelse statement sets ConsecutiveSum to 0 when there is no previous row. The last step just enforces your conditions:
df %>%
group_by(id) %>%
arrange(Date) %>%
mutate(rank=dense_rank(Date)) %>%
mutate(ConsecutiveSum = ifelse(is.na(lag(Amount)),0,Amount + lag(Amount , default = 0)))%>%
filter(diffs<=5 & ConsecutiveSum>=5000 | ConsecutiveSum==0 & lead(ConsecutiveSum)>=5000)
# id Date Buyer Amount diffs rank ConsecutiveSum
# <chr> <chr> <chr> <dbl> <dbl> <int> <dbl>
# 1 4 6/15/2018 Sandy 1849 NA 1 0
# 2 4 6/20/2018 Sandy 4193 5 2 6042
I would use a combination of techniques available in tidyverse:
First, create a grouping variable (new_id) that starts a new group whenever the id changes or the gap to the previous row exceeds 5 days. Then sum Amount within each combination of id and new_id, keep the combinations whose total is > 5000, and use a join or semi_join to filter the original rows down to those combinations.
ids is a dataset that holds the total Amount (dollar) for each id and new_id, filtered to dollar > 5000. This gives you the id and new_id combinations that meet your criteria.
library(tidyverse)
df <- data.frame(id=c("9","9","9","5","5","4","4","4","4","4","20","20"),
  Date=c("11/29/2018","11/29/2018","11/29/2018","2/13/2019","2/13/2019",
         "6/15/2018","6/20/2018","8/17/2018","8/20/2018","8/23/2018","12/25/2018","12/25/2018"),
  Buyer= c("John","John","John","Maria","Maria","Sandy","Sandy","Sandy","Sandy","Sandy","Paul","Paul"),
  Amount= c(959,1158,596,922,922,1849,4193,4256,65,100,313,99), stringsAsFactors = F) %>%
  group_by(Buyer,id) %>% mutate(diffs = c(NA, diff(as.Date(Date, format = "%m/%d/%Y"))))
# tf1 flags the first row of each Buyer/id group; tf2 flags a gap of more than 5 days.
# id is character, so the lag default is a character value too.
df1 <- df %>% mutate(Date = as.Date(Date , format = "%m/%d/%Y"),
  tf1 = (id != lag(id, default = "")),
  tf2 = (is.na(diffs) | diffs > 5))
# A new run starts wherever either flag is TRUE
df1$new_id <- cumsum(df1$tf1 + df1$tf2 > 0)
>df1
id Date Buyer Amount diffs days_post tf1 tf2 new_id
<chr> <date> <chr> <dbl> <dbl> <date> <lgl> <lgl> <int>
1 9 2018-11-29 John 959 NA 2018-12-04 TRUE TRUE 1
2 9 2018-11-29 John 1158 0 2018-12-04 FALSE FALSE 1
3 9 2018-11-29 John 596 0 2018-12-04 FALSE FALSE 1
4 5 2019-02-13 Maria 922 NA 2019-02-18 TRUE TRUE 2
5 5 2019-02-13 Maria 922 0 2019-02-18 FALSE FALSE 2
6 4 2018-06-15 Sandy 1849 NA 2018-06-20 TRUE TRUE 3
7 4 2018-06-20 Sandy 4193 5 2018-06-25 FALSE FALSE 3
8 4 2018-08-17 Sandy 4256 58 2018-08-22 FALSE TRUE 4
9 4 2018-08-20 Sandy 65 3 2018-08-25 FALSE FALSE 4
10 4 2018-08-23 Sandy 100 3 2018-08-28 FALSE FALSE 4
11 20 2018-12-25 Paul 313 NA 2018-12-30 TRUE TRUE 5
12 20 2018-12-25 Paul 99 0 2018-12-30 FALSE FALSE 5
ids <- df1 %>%
group_by(id, new_id) %>%
summarise(dollar = sum(Amount)) %>%
ungroup() %>% filter(dollar > 5000)
id new_id dollar
<chr> <int> <dbl>
1 4 3 6042
df1 %>% semi_join(ids, by = c("id", "new_id"))
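The core of this approach is the cumsum step: every row that starts a new run is flagged TRUE, and the running total of those flags becomes the run identifier. Here is a minimal base R sketch of that idea, using only Sandy's five rows from the question; the variable names (new_run, run_id, run_total) are made up for illustration.

```r
# Sandy's rows: day gaps to the previous transaction, and amounts
diffs  <- c(NA, 5, 58, 3, 3)
amount <- c(1849, 4193, 4256, 65, 100)

# A run starts at the first row (NA gap) or after a gap of more than 5 days
new_run <- is.na(diffs) | diffs > 5

# Running count of run starts gives each row its run number: 1 1 2 2 2
run_id <- cumsum(new_run)

# Total amount per run, repeated on every row of that run
run_total <- ave(amount, run_id, FUN = sum)

# Keep only rows that belong to a run totalling more than 5000
kept <- amount[run_total > 5000]
kept
```

Here the first run totals 6042 and is kept, while the second totals 4421 and is dropped, matching the ids table above.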
I've hit a bit of a brick wall with my code below. Essentially, dftable should be a filtered dataframe containing clicks on a widget (I loop through the columns for each widget).
I then want to get the sum of all pageviews the widget was active on (it's not on all pages, and I filter as such to exclude those where it is NA). However, dfviews just returns all pageviews, as opposed to filtering on where the widget is not NA.
Any guidance would be appreciated:
mixpanelData example:
--------------------------------------------------------------
| Group | Date | WidgetClick | Widget2Click | ViewedPageResult
--------------------------------------------------------------
| ABC | 01/01/2017 | 123456 | NA | 1450544
--------------------------------------------------------------
| ABN | 01/01/2017 | NA | 1245 | 4560000
--------------------------------------------------------------
| ABN | 01/02/2017 | NA | 1205 | 4561022
--------------------------------------------------------------
| BNN | 01/02/2017 | 1044 | NA | 4561021
--------------------------------------------------------------
And my ideal output would be along the lines of... (with proportions, which is fine as I can handle these)
WidgetClick CSV
--------------------------------------------------------------
Date | WidgetClick | ViewedPageResult
--------------------------------------------------------------
01/01/2017 | 123456 | 1450544
------------------------------------------------------------
01/02/2017 | 1044 | 4561021
--------------------------------------------------------------
WidgetClick 2 CSV
--------------------------------------------------------------
|Date | Widget2Click | ViewedPageResult
--------------------------------------------------------------
01/01/2017 | 1245 | 4560000
--------------------------------------------------------------
01/02/2017 | 1205 | 4561022
--------------------------------------------------------------
Code is provided below...
vars = colnames(mixpanelData)
vars =vars[-c(1,2)]
k = 1
for (v in vars) {
filename <- paste(v,k,".csv",sep="")
dftable <- mixpanelData %>% filter(!is.na(v)) %>% group_by(Date) %>% summarise_(clicksum=interp(~sum(var, na.rm = TRUE), var = as.name(v)))
dfviews <- mixpanelData %>% filter(!is.na(v)) %>% group_by(Date) %>% summarise(viewsum=sum((ViewedPageResult)))
total <- merge(dftable,dfviews,by="Date")
total <- mutate(total, proportion = clicksum / viewsum * 100)
write.csv(total, file = filename,row.names=FALSE, na="")
k <- k +1 }
In your desired results, you show two separate tables. But you also mention that you have several widgets, so separate tables might not be ideal. I'll first show how to get separate tables and then how to calculate for all widgets at once.
Separate tables
Using dplyr and tidyr, you can use filter to get your two tables like so:
library(dplyr);library(tidyr)
df <- read.table(text="Group Date WidgetClick Widget2Click ViewedPageResult
ABC 01/01/2017 123456 NA 1450544
ABN 01/01/2017 NA 1245 4560000
ABN 01/02/2017 NA 1205 4561022
BNN 01/02/2017 1044 NA 4561021",header=TRUE,
stringsAsFactors=FALSE)
df%>% filter(!is.na(WidgetClick)) %>% select(-Widget2Click)
Group Date WidgetClick ViewedPageResult
1 ABC 01/01/2017 123456 1450544
2 BNN 01/02/2017 1044 4561021
df%>% filter(!is.na(Widget2Click)) %>% select(-WidgetClick)
Group Date Widget2Click ViewedPageResult
1 ABN 01/01/2017 1245 4560000
2 ABN 01/02/2017 1205 4561022
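As for why the loop in the question returns all page views: inside the loop, v holds a column name as a string, so is.na(v) tests that string rather than the column, and filter(!is.na(v)) keeps every row. One way to refer to a column by its name inside dplyr verbs is the .data[[v]] pronoun. The following is a minimal sketch, assuming dplyr 1.0 or later (for the .groups argument); the data frame is rebuilt from the question's example and the list name results is made up.

```r
library(dplyr)

mixpanelData <- data.frame(
  Group = c("ABC", "ABN", "ABN", "BNN"),
  Date = c("01/01/2017", "01/01/2017", "01/02/2017", "01/02/2017"),
  WidgetClick = c(123456, NA, NA, 1044),
  Widget2Click = c(NA, 1245, 1205, NA),
  ViewedPageResult = c(1450544, 4560000, 4561022, 4561021),
  stringsAsFactors = FALSE
)

widget_cols <- c("WidgetClick", "Widget2Click")

# One summary table per widget, keeping only rows where that widget is not NA
results <- lapply(widget_cols, function(v) {
  mixpanelData %>%
    filter(!is.na(.data[[v]])) %>%            # filters on the column named by v
    group_by(Date) %>%
    summarise(clicksum = sum(.data[[v]]),
              viewsum  = sum(ViewedPageResult),
              .groups  = "drop") %>%
    mutate(proportion = clicksum / viewsum * 100)
})
names(results) <- widget_cols
```

Each element of results could then be passed to write.csv, as in the original loop.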
Single table
To get all the results in a single table, you first need to gather the Widget*Click columns into key/value pairs and then filter:
df%>%
gather(Widget_number,Click,starts_with("Widget"))%>%
filter(!is.na(Click))
Group Date ViewedPageResult Widget_number Click
1 ABC 01/01/2017 1450544 WidgetClick 123456
2 BNN 01/02/2017 4561021 WidgetClick 1044
3 ABN 01/01/2017 4560000 Widget2Click 1245
4 ABN 01/02/2017 4561022 Widget2Click 1205
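In current tidyr releases, gather() is marked as superseded in favour of pivot_longer(). A roughly equivalent sketch with pivot_longer() is shown below, assuming tidyr 1.0 or later; the row order differs from the gather() output because pivot_longer() works row by row, and values_drop_na = TRUE replaces the separate filter step.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  Group = c("ABC", "ABN", "ABN", "BNN"),
  Date = c("01/01/2017", "01/01/2017", "01/02/2017", "01/02/2017"),
  WidgetClick = c(123456, NA, NA, 1044),
  Widget2Click = c(NA, 1245, 1205, NA),
  ViewedPageResult = c(1450544, 4560000, 4561022, 4561021),
  stringsAsFactors = FALSE
)

# Stack the Widget* columns into name/value pairs and drop the NA clicks
long <- df %>%
  pivot_longer(cols = starts_with("Widget"),
               names_to = "Widget_number",
               values_to = "Click",
               values_drop_na = TRUE)
long
```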
EDIT
To summarise the number of clicks per month per widget, you can use mutate to add a Year_month column using as.yearmon from the zoo package. Then group_by Widget_number and Year_month and summarise to get the total clicks per month. Other calculations, such as proportions, can go inside the same summarise call. I assumed the date format was "%m/%d/%Y"; check that this matches your data.
library(zoo)
df%>%
gather(Widget_number,Click,starts_with("Widget"))%>%
filter(!is.na(Click)) %>%
mutate(Year_month=as.yearmon(as.Date(Date,"%m/%d/%Y"))) %>%
group_by(Widget_number,Year_month) %>%
summarise(Sum_clicks=sum(Click,na.rm=TRUE))
Widget_number Year_month Sum_clicks
<chr> <S3: yearmon> <int>
1 Widget2Click Jan 2017 2450
2 WidgetClick Jan 2017 124500