I have a data file that looks like this. The question: at what date does the total incidence (cumulative sum) for Venus and for Mar exceed 2000?
I've created a simple example using an index instead of a Date column:
df <- data.frame(country = c(rep("Mar", 10), rep("Venus", 10)),
                 incidence = runif(20, 0, 30),
                 index = seq(1, 20, 1))
library(dplyr)
df %>%
  group_by(country) %>%
  mutate(cumInc = cumsum(incidence)) %>%
  filter(cumInc > 100) %>%
  filter(index == min(index))
country incidence index cumInc
<fct> <dbl> <dbl> <dbl>
1 Mar 29.2 10 108.
2 Venus 22.5 16 110.
You can just change 100 to your threshold and change index to date to get the first Date for Venus and for Mar when the cumulative sum exceeds the given threshold. So e.g.:
df %>%
group_by(country) %>%
mutate(cumInc = cumsum(incidence)) %>%
filter(cumInc > **Your Threshold**) %>%
filter(date==min(date))
If you want to obtain a data.frame afterwards, you can simply add %>% as.data.frame().
If you want to save the result, just use something like:
result <- df %>%
group_by(...
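For completeness, a sketch of the full saved pipeline (assuming the real data has a date column and using the question's threshold of 2000):

result <- df %>%
  group_by(country) %>%
  mutate(cumInc = cumsum(incidence)) %>%
  filter(cumInc > 2000) %>%        # threshold from the question
  filter(date == min(date)) %>%    # assumes a `date` column in the real data
  as.data.frame()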
Background
I've got this dataset d:
d <- data.frame(ID = c("a","a","a","a","a","a","b","b"),
                event = c("G12","R2","O99","B4","B4","A24","L5","J15"),
                stringsAsFactors = FALSE)
It's got 2 people (IDs) in it, and they each have some events.
The problem
I'm trying to get the average number (count) of events per person, along with the standard deviation of those counts, all in one result (it can be a data frame or not, doesn't matter). Here person a has 6 events and person b has 2, so the mean is 4 and the SD is sd(c(6, 2)) ≈ 2.83.
In other words I'm looking for something like this:
| Mean | SD |
|------|------|
| 4.00 | 2.83 |
What I've tried
I'm not far off, I don't think -- it's just that I've got 2 separate pieces of code doing these calculations. Here's the mean:
d %>%
  group_by(ID) %>%
  summarise(event = length(event)) %>%
  summarise(ratio = mean(event))
# A tibble: 1 x 1
ratio
<dbl>
1 4
And here's the SD:
d %>%
  group_by(ID) %>%
  summarise(event = length(event)) %>%
  summarise(sd = sd(event))
# A tibble: 1 x 1
sd
<dbl>
1 2.83
But when I try to pipe them together like so...
d %>%
  group_by(ID) %>%
  summarise(event = length(event)) %>%
  summarise(ratio = mean(event)) %>%
  summarise(sd = sd(event))
... I get an error:
Error in `h()`:
! Problem with `summarise()` column `sd`.
i `sd = sd(event)`.
x object 'event' not found
Any insight?
You have to put the last two calls to summarise() in the same call. The only columns remaining after a summarise() are the ones you named and the grouping columns, so after your second summarise() the event column no longer exists.
library(dplyr)
d <- data.frame(ID = c("a","a","a","a","a","a","b","b"),
                event = c("G12","R2","O99","B4","B4","A24","L5","J15"),
                stringsAsFactors = FALSE)
d %>%
  group_by(ID) %>%
  # the next summarise will be within ID
  summarise(event = length(event)) %>%
  # this summarise is overall
  summarise(sd = sd(event),
            ratio = mean(event))
#> # A tibble: 1 × 2
#> sd ratio
#> <dbl> <dbl>
#> 1 2.83 4
The code is a bit confusing because you are reusing the name event for the count, and the first summarise() runs within groups while the second runs over the whole table. This version is a little easier to read and gives the same result:
d %>%
  count(ID) %>%
  summarise(sd = sd(n),
            ratio = mean(n))
Created on 2022-05-25 by the reprex package (v2.0.1)
I apologize for my bad English, but I really need your help.
I have a .csv dataset with two columns, year and value, containing monthly precipitation amounts from 1900 to 2019.
It looks like this:
year value
190001 100
190002 39
190003 78
190004 45
...
201912 25
I need to create two new datasets: the first one with the data for every year from July (07) to September (09) and the second one from January (01) to March (03).
Also, I need to summarize the data for every year (meaning I need only one value per year).
So I have data for summer 1900-2019 and winter 1900-2019.
You can use the dplyr and stringr packages to achieve what you need. I created a mock data set first:
library(dplyr)
library(stringr)
df <- data.frame(time = 190001:201912,  # mock data; impossible month codes (e.g. 190013) are dropped by the filters below
                 value = runif(length(190001:201912), 0, 100))
After that, we create two separate columns for month and year:
df$year  <- as.numeric(str_extract(df$time, "^...."))  # first four characters
df$month <- as.numeric(str_extract(df$time, "..$"))    # last two characters
At this point, we can filter:
df_1 <- df %>% filter(between(month, 7, 9))  # July-September
df_2 <- df %>% filter(between(month, 1, 3))  # January-March
... and summarize each subset so that there is one value per year:
df_1 <- df_1 %>% group_by(year) %>% summarise(value = sum(value))
df_2 <- df_2 %>% group_by(year) %>% summarise(value = sum(value))
Another approach, using the tidyverse:
library(tidyverse)
dat <- tribble(
  ~year, ~value,
  190001, 100,
  190002, 39,
  190003, 78,
  190004, 45)
Splitting the year variable into a month and year variable:
dat_prep <- dat %>%
  mutate(month = str_remove(year, "^\\d{4}"),  # remove the first 4 digits
         year = str_remove(year, "\\d{2}$"),   # remove the last 2 digits
         across(everything(), as.numeric))
dat_prep %>%
  filter(month %in% 7:9) %>%  # months Jul-Sep; the Jan-Mar variant follows below
  group_by(year) %>%
  summarize(value = sum(value))
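The same pattern with 1:3 gives the January-March subset:

dat_prep %>%
  filter(month %in% 1:3) %>%  # months Jan-Mar
  group_by(year) %>%
  summarize(value = sum(value))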
I am trying to create the following formula:
Interest Expense / (Total Debt summed over all years / number of years)
i.e., the interest expense divided by the average Total Debt.
The data looks like the following:
GE2017 GE2016 GE2015 GE2014
Interest Expense -2753000 -2026000 -1706000 -1579000
Long Term Debt 108575000 105080000 144659000 186596000
Short/Current Long Term Debt 134591000 136211000 197602000 261424000
Total_Debt 243166000 241291000 342261000 448020000
GOOG2017 GOOG2016 GOOG2015 GOOG2014
Interest Expense -109000 -124000 -104000 -101000
Long Term Debt 3943000 3935000 1995000 2992000
Short/Current Long Term Debt 3969000 3935000 7648000 8015000
Total_Debt 7912000 7870000 9643000 11007000
NVDA2018 NVDA2017 NVDA2016 NVDA2015
Interest Expense -61000 -58000 -47000 -46000
Long Term Debt 1985000 1985000 7000 1384000
Short/Current Long Term Debt 2000000 2791000 1434000 1398000
Total_Debt 3985000 4776000 1441000 2782000
That is, for GE, I am trying to take the interest expense for the latest year (-2753000) and divide it by the average of Total Debt across all 4 years for GE.
So:
-2753000 / mean(c(243166000, 241291000, 342261000, 448020000)) ≈ -0.0086
However, I am running into problems with group_by() when taking the average, since GE and the other firms have different column names due to the different years.
cost_of_debt %>%
  t() %>%
  data.frame() %>%
  rownames_to_column('rn') %>%
  group_by(rn)
# calculation here
Secondly, if possible, I would like to do the same calculation as above but using only the last two years of each firm:
-2753000 / mean(c(243166000, 241291000)) ≈ -0.0114
Would perhaps a grepl function work here?
I have a vector called symbols.
symbols <- c("NVDA", "GOOG", "GE")
Data:
cost_of_debt <- structure(list(GE2017 = c(-2753000, 108575000, 134591000, 243166000
), GE2016 = c(-2026000, 105080000, 136211000, 241291000), GE2015 = c(-1706000,
144659000, 197602000, 342261000), GE2014 = c(-1579000, 186596000,
261424000, 448020000), GOOG2017 = c(-109000, 3943000, 3969000,
7912000), GOOG2016 = c(-124000, 3935000, 3935000, 7870000), GOOG2015 = c(-104000,
1995000, 7648000, 9643000), GOOG2014 = c(-101000, 2992000, 8015000,
11007000), NVDA2018 = c(-61000, 1985000, 2e+06, 3985000), NVDA2017 = c(-58000,
1985000, 2791000, 4776000), NVDA2016 = c(-47000, 7000, 1434000,
1441000), NVDA2015 = c(-46000, 1384000, 1398000, 2782000)), .Names = c("GE2017",
"GE2016", "GE2015", "GE2014", "GOOG2017", "GOOG2016", "GOOG2015",
"GOOG2014", "NVDA2018", "NVDA2017", "NVDA2016", "NVDA2015"), row.names = c("Interest Expense",
"Long Term Debt", "Short/Current Long Term Debt", "Total_Debt"
), class = "data.frame")
For the first case, after converting the row names to a column (rownames_to_column() from tibble), separate that column into 'firm' and 'year' by splitting at the junction between the end of the firm name and the start of the year. Then, grouped by 'firm', create a 'New' column as the ratio of 'Interest.Expense' to the mean of 'Total_Debt'. For the second case, arrange by 'year' and divide 'Interest.Expense' by the mean of the last two 'Total_Debt' values for each 'firm'.
library(dplyr)
library(tidyr)   # for separate()
library(tibble)  # for rownames_to_column()
cost_of_debt %>%
  t() %>%
  data.frame() %>%
  rownames_to_column('rn') %>%
  separate(rn, into = c("firm", "year"),
           "(?<=[A-Z])(?=[0-9])", convert = TRUE) %>%
  group_by(firm) %>%
  mutate(New = Interest.Expense / mean(Total_Debt)) %>%
  arrange(firm, year) %>%
  mutate(NewLast = Interest.Expense / mean(tail(Total_Debt, 2)))
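If you only want one row per firm (matching the worked -2753000 / average example), a possible follow-up, continuing with the libraries loaded above, is to keep each firm's latest year:

cost_of_debt %>%
  t() %>%
  data.frame() %>%
  rownames_to_column('rn') %>%
  separate(rn, into = c("firm", "year"),
           "(?<=[A-Z])(?=[0-9])", convert = TRUE) %>%
  group_by(firm) %>%
  mutate(New = Interest.Expense / mean(Total_Debt)) %>%
  arrange(firm, year) %>%
  mutate(NewLast = Interest.Expense / mean(tail(Total_Debt, 2))) %>%
  filter(year == max(year)) %>%  # latest year within each firm
  select(firm, year, New, NewLast)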
I think you need to clean your data first so that it is easier to see what is an observation and what is a variable (Google "tidy data" :)). Here is my solution: first I make the data tidy, then the calculations are straightforward.
library(tidyverse)
library(stringr)
# Clean and make the data tidy
cost_of_debt <- cost_of_debt %>%
  rownames_to_column(var = "indicator") %>%  # before as_tibble(), which drops row names
  as_tibble() %>%
  mutate(indicator = str_replace_all(indicator, regex("\\s|\\/"), "_")) %>%
  gather(k, value, -indicator) %>%
  separate(k, into = c("company", "year"), -4) %>%
  spread(indicator, value) %>%
  rename_all(tolower)
Results in the data looking like this:
company year interest_expense long_term_debt short_current_long_term_debt total_debt
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 GE 2014 -1579000 186596000 261424000 448020000
2 GE 2015 -1706000 144659000 197602000 342261000
3 GE 2016 -2026000 105080000 136211000 241291000
4 GE 2017 -2753000 108575000 134591000 243166000
5 GOOG 2014 -101000 2992000 8015000 11007000
Then we can answer your question:
cost_of_debt <- cost_of_debt %>%
  group_by(company) %>%
  mutate(int_over_totdept4 = interest_expense / mean(total_debt),
         int_over_totdept2 = interest_expense / mean(total_debt[year %in% c("2017", "2016")]))
# NB: NVDA's data covers 2015-2018, so the hardcoded years are not the latest two for every firm
Which gives a data frame (with your new variables furthest to the right):
company year interest_expense long_term_debt short_current_long_term_debt total_debt int_over_totdept4 int_over_totdept2
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 GE 2014 -1579000 186596000 261424000 448020000 -0.00495 -0.00652
2 GE 2015 -1706000 144659000 197602000 342261000 -0.00535 -0.00704
3 GE 2016 -2026000 105080000 136211000 241291000 -0.00636 -0.00836
4 GE 2017 -2753000 108575000 134591000 243166000 -0.00864 -0.0114
5 GOOG 2014 -101000 2992000 8015000 11007000 -0.0111 -0.0128
And if you want the summarized form of your two questions:
# First question:
cost_of_debt %>% filter(company == "GE", year == "2017") %>% select(company, year, int_over_totdept4)
# Second question:
cost_of_debt %>% filter(year == "2017") %>% select(company, year, int_over_totdept2)
I have the code below that pulls the time series for a stock and groups everything into 'buys' and 'sells' buckets (based on whether each closing price is lower or higher than the next day's close).
library(dplyr)
library(data.table)
library(quantmod)
library(zoo)
# enter tickers to download time-series data
e <- new.env()
getSymbols("SBUX", env = e)
pframe <- do.call(merge, as.list(e))
#head(pframe)
# get a subset of data
df = pframe$SBUX.Close
colnames(df)[1] <- "Close"
head(df)
# Assign groupings
addGrps <- transform(df,Group = ifelse(Close < lead(Close), "S", "B"))
# create subsets
buys <- addGrps[addGrps$Group == 'B',]
sells <- addGrps[addGrps$Group == 'S',]
Now I am trying to group the results into daily profits and losses (Diff) and find the cumulative sum of each.
I think it should be something like this, but something is off, and I'm not sure what it is.
# find daily differences
df <- df %>%
  mutate(Diff = addGrps$Close - lead(addGrps$Close))
# get up and down price movements
ups <- filter(df, Diff > 0)
downs <- filter(df, Diff <= 0)
# cumulative sums of longs and shorts
longs <- cumsum(ups$Diff)
shorts <- cumsum(downs$Diff)
I'm not sure I'm totally following your question/problem, and it seems like there is some unnecessary code. For example, not all of those packages are needed (at least, not yet), and it's not clear why the two subset data frames for the buys and sells are needed. At the very least, the following cleans up some of what you've done so far and gets the data into an easy-to-work-with data frame. With some clarification, maybe this is a start.
library(quantmod)
library(tidyverse) # rather than just dplyr
# pull the SBUX data as a data frame and create the necessary new columns:
df <- data.frame(getSymbols(Symbols = 'SBUX', env = NULL)) %>%  # pull the raw data
  rownames_to_column('date') %>%                                # convert the row index to a column
  select(date, close = SBUX.Close) %>%                          # keep only the SBUX.Close column and rename it
  mutate(group = ifelse(close < lead(close), 's', 'b')) %>%     # assign the sell or buy group
  mutate(diff = close - lead(close)) %>%                        # create the diff calculation
  mutate(movement = ifelse(diff > 0, 'up', 'down')) %>%         # create the movement classification
  as_tibble()                                                   # tbl_df() is deprecated; use as_tibble()
# just to view the new data frame:
df %>% head(5)
# A tibble: 5 x 5
date close group diff movement
<chr> <dbl> <chr> <dbl> <chr>
1 2007-01-03 17.6 s -0.0200 down
2 2007-01-04 17.6 b 0.0750 up
3 2007-01-05 17.6 b 0.0650 up
4 2007-01-08 17.5 b 0.0750 up
5 2007-01-09 17.4 b 0.0550 up
# calculate the sums of the diff by the movement up or down:
df %>%
  filter(!is.na(movement)) %>%  # removes the last date; it cannot have a lead closing price
  group_by(movement) %>%
  summarize(cum_sum = sum(diff))
# A tibble: 2 x 2
movement cum_sum
<chr> <dbl>
1 down -489.
2 up 455.
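If the goal is a running cumulative sum within each movement group over time, rather than a single total (one reading of "cumulative sum of each"), a minimal sketch using the same data frame:

df %>%
  filter(!is.na(movement)) %>%        # drop the final date with no lead close
  group_by(movement) %>%
  mutate(cum_sum = cumsum(diff)) %>%  # running total within 'up' and within 'down'
  ungroup()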
I've got a data frame (dfdat) with two categorical variables, location and employmentstatus.
I'd like to generate a data frame with the proportions of employment status for each location.
mydf_wide (achieved outcome) is almost what I'm looking for. The problem is that employmentstatus is a variable with two levels, yet there are three rows in mydf_wide. I don't understand why that is, because I'd have expected something similar to mytable (expected outcome).
Any help would be much appreciated.
Starting point (df):
dfdat <- data.frame(location = c("GA","GA","MA","OH","RI","GA","AZ","MA","OH","RI"),
                    employmentstatus = c(1,2,1,2,1,1,1,2,1,1))
Expected outcome (table):
mytable <- table(dfdat$employmentstatus,dfdat$location)
mytable <- round(100*(prop.table(mytable, 2)),1)
Achieved outcome (df):
library(dplyr)
mydf <- dfdat %>%
  group_by(location, employmentstatus) %>%
  summarise(n = n()) %>%
  mutate(freq = round((n / sum(n) * 100), 1))
library(tidyr)
mydf_wide <- spread(mydf, location, freq)
mydf_wide <- as.data.frame(mydf_wide)
We need to do a second group_by with 'location' to get the sum. Also, instead of grouping and then creating 'n', the count() function can be used:
dfdat %>%
  count(location, employmentstatus) %>%
  group_by(location) %>%
  mutate(n = round(100 * n / sum(n), 2)) %>%
  spread(location, n, fill = 0)
# A tibble: 2 x 6
# employmentstatus AZ GA MA OH RI
#* <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 100 66.67 50 50 100
#2 2 0 33.33 50 50 0
If we are using the OP's code, then remove the 'n' column before doing the spread:
dfdat %>%
  group_by(location, employmentstatus) %>%
  summarise(n = n()) %>%
  mutate(freq = round((n / sum(n) * 100), 1)) %>%
  select(-n) %>%
  spread(location, freq, fill = 0)
Or update the 'n' column with the output of round() and then spread. The extra 'n' column is what produced the third row in mydf_wide: spread() keeps all remaining columns as row identifiers, so each distinct combination of employmentstatus and n became its own row.
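A sketch of that variant, reusing the OP's pipeline:

dfdat %>%
  group_by(location, employmentstatus) %>%
  summarise(n = n()) %>%
  mutate(n = round(100 * n / sum(n), 1)) %>%  # overwrite n so no extra column survives
  spread(location, n, fill = 0)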