I'm looking for ideas on how to write a pivot_longer call for this data.
I've already searched the internet but can't find any examples similar to my data, where a Metric column is itself split into 3 different columns of months.
My desired final output has seven columns: region, month, and the five metrics.
How should I formulate the pivot_longer and pivot_wider syntax to clean my data so that I can visualize it?
The tricky part isn't pivot_longer. You first have to clean your Excel spreadsheet, i.e. get rid of empty rows and merge the two header rows containing the names of the variables and the dates.
One approach to achieve your desired result may look like so:
library(readxl)
library(tidyr)
library(janitor)
library(dplyr)
x <- read_excel("data/Employment.xlsx", skip = 3, col_names = FALSE) %>%
# Get rid of empty rows and cols
janitor::remove_empty()
# Make column names
col_names <- data.frame(t(x[1:2,])) %>%
fill(1) %>%
unite(name, 1:2, na.rm = TRUE) %>%
pull(name)
x <- x[-c(1:2),]
names(x) <- col_names
# Convert to long and values to numerics
x %>%
pivot_longer(-Region, names_to = c(".value", "months"), names_sep = "_") %>%
separate(months, into = c("month", "year")) %>%
mutate(across(!c(Region, month, year), as.numeric))
#> # A tibble: 6 × 8
#> Region month year `Total Population … `Labor Force Part… `Employment Rat…
#> <chr> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 Philippin… April 2020f 73722. 55.7 82.4
#> 2 Philippin… Janu… 2021p 74733. 60.5 91.3
#> 3 Philippin… April 2021p 74971. 63.2 91.3
#> 4 National … April 2020f 9944. 54.2 87.7
#> 5 National … Janu… 2021p 10051. 57.2 91.2
#> 6 National … April 2021p 10084. 60.1 85.6
#> # … with 2 more variables: Unemployment Rate <dbl>, Underemployment Rate <dbl>
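Since the code above needs the original Excel file, here is a minimal self-contained sketch of the same header-merging idea on made-up data. The data frame raw, its column names and all values are assumptions for illustration only; it just mimics a sheet where the first row holds the metric names (with gaps where cells were merged) and the second row holds the months.

```r
library(dplyr)
library(tidyr)

# Hypothetical raw sheet: row 1 = metric names (merged cells come in as NA),
# row 2 = months, remaining rows = data. Everything is read as text.
raw <- data.frame(
  V1 = c("Region", NA, "North", "South"),
  V2 = c("Population", "April 2020", "100", "200"),
  V3 = c(NA, "April 2021", "110", "210"),
  V4 = c("Employment Rate", "April 2020", "90.5", "88.1"),
  V5 = c(NA, "April 2021", "91.0", "89.4")
)

# Transpose the two header rows, fill the metric name down,
# and glue metric and month together with "_"
header <- data.frame(t(raw[1:2, ])) %>%
  fill(1) %>%
  unite(name, 1:2, na.rm = TRUE) %>%
  pull(name)

# Drop the header rows and apply the merged names
body <- raw[-(1:2), ]
names(body) <- header

# ".value" turns the part before "_" into columns, the rest becomes "month"
tidy <- body %>%
  pivot_longer(-Region, names_to = c(".value", "month"), names_sep = "_") %>%
  mutate(across(!c(Region, month), as.numeric))
```

The names_sep = "_" split only works here because the metric names themselves contain no underscore; if they did, names_pattern would probably be the safer choice.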
I am using dplyr to aggregate my dataframe, so it shows percentages of people choosing specific protein design tasks by company size. I have different dummy variables for protein design tasks, because this was a multiple choice question in a survey.
I figured out a way to do this, but my code is very long, because I aggregate the data per task and then join all these separate dataframes together into one. I’m curious whether there is a more elegant (shorter) way to do this?
library(tidyverse)
EarlyAccess <- read_csv("https://dropbox.com/s/antzwk1jh4ldrhi/EarlyAccess_anon.csv?dl=1")
#################### STABILITY ################################################
Proportions_tasks_stability <- EarlyAccess %>%
select(size, Improving.stability..generic..thermal..pH.) %>%
group_by(size, Improving.stability..generic..thermal..pH.) %>%
summarise(count_var_stability=n())%>%
mutate(total_group_by_size = sum(count_var_stability)) %>%
mutate(pc_var_stability=count_var_stability/sum(count_var_stability)*100) %>%
filter(Improving.stability..generic..thermal..pH.=="Improving stability (generic, thermal, pH)") %>%
select(size, Improving.stability..generic..thermal..pH., pc_var_stability)
######################## ACTIVITY #############################################
Proportions_tasks_activity <- EarlyAccess %>%
select(size, Improving.activity ) %>%
group_by(size, Improving.activity) %>%
summarise(count_var_activity=n())%>%
mutate(total_group_by_size = sum(count_var_activity)) %>%
mutate(pc_var_activity=count_var_activity/sum(count_var_activity)*100) %>%
filter(Improving.activity=="Improving activity") %>%
select(size, Improving.activity, pc_var_activity)
######################## BINDING AFFINITY ######################################
Proportions_tasks_binding.affinity<- EarlyAccess %>%
select(size, Improving.binding.affinity ) %>%
group_by(size, Improving.binding.affinity) %>%
summarise(count_var_binding.affinity=n())%>%
mutate(total_group_by_size = sum(count_var_binding.affinity)) %>%
mutate(pc_var_binding.affinity=count_var_binding.affinity/sum(count_var_binding.affinity)*100) %>%
filter(Improving.binding.affinity=="Improving binding affinity") %>%
select(size, Improving.binding.affinity, pc_var_binding.affinity)
# Then join them
Protein_design_tasks <- Proportions_tasks_stability %>%
inner_join(Proportions_tasks_activity, by = "size") %>%
inner_join(Proportions_tasks_binding.affinity, by = "size")
Using the datafile you provided, this should give the percentages of the selected category within each column for each size:
library(tidyverse)
df <-
read_csv("https://dropbox.com/s/antzwk1jh4ldrhi/EarlyAccess_anon.csv?dl=1")
df |>
group_by(size) |>
summarise(
pc_var_stability = sum(
Improving.stability..generic..thermal..pH. == "Improving stability (generic, thermal, pH)",
na.rm = TRUE
) / n() * 100,
pc_var_activity = sum(Improving.activity == "Improving activity",
na.rm = TRUE) / n() * 100,
pc_var_binding.affinity = sum(
Improving.binding.affinity == "Improving binding affinity",
na.rm = TRUE
) / n() * 100
)
#> # A tibble: 7 × 4
#> size pc_var_stability pc_var_activity pc_var_binding.affinity
#> <chr> <dbl> <dbl> <dbl>
#> 1 1000-10000 43.5 47.8 34.8
#> 2 10000+ 65 65 70
#> 3 11-50 53.8 53.8 46.2
#> 4 2-10 51.1 46.8 46.8
#> 5 200-1000 64.7 52.9 52.9
#> 6 50-200 42.1 42.1 36.8
#> 7 Just me 48.5 39.4 54.5
Looking at your data, each column has either the string value you're testing for or NA, so you could make it even shorter/tidier just by counting non-NAs in relevant columns:
df |>
group_by(size) |>
summarise(across(
c(
Improving.stability..generic..thermal..pH.,
Improving.activity,
Improving.binding.affinity
),
\(val) 100 * sum(!is.na(val)) / n()
))
If what you're aiming to do is summarise across all columns then the latter method may work best - there are several ways of specifying which columns you want and so you don't necessarily need to type all names and values in. You might also find it clearest to make calculating and formatting all percentages a named function to call:
library(tidyverse)
df <-
read_csv("https://dropbox.com/s/antzwk1jh4ldrhi/EarlyAccess_anon.csv?dl=1",
show_col_types = FALSE)
perc_nonmissing <- function(val) {
sprintf("%.1f%%", 100 * sum(!is.na(val)) / n())
}
df |>
group_by(size) |>
summarise(across(-c(1:2), perc_nonmissing))
#> # A tibble: 7 × 12
#> size Disco…¹ Searc…² Under…³ Impro…⁴ Impro…⁵ Impro…⁶ Impro…⁷ Impro…⁸ Impro…⁹
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 1000-… 21.7% 17.4% 43.5% 47.8% 39.1% 43.5% 30.4% 39.1% 39.1%
#> 2 10000+ 40.0% 55.0% 55.0% 65.0% 70.0% 65.0% 20.0% 30.0% 40.0%
#> 3 11-50 30.8% 26.9% 42.3% 53.8% 38.5% 53.8% 15.4% 30.8% 38.5%
#> 4 2-10 38.3% 40.4% 48.9% 46.8% 36.2% 51.1% 23.4% 31.9% 42.6%
# etc.
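As one example of specifying columns without typing them all out, a tidyselect helper such as starts_with() can be used inside across(). This is only a sketch on a made-up survey table (the survey object and its values are assumptions, not the real file), and it uses mean(!is.na(val)), which should give the same share as sum(!is.na(val)) / n() within each group.

```r
library(dplyr)

# Hypothetical multiple-choice data: a column holds its label when ticked, NA otherwise
survey <- tibble(
  size = c("2-10", "2-10", "11-50", "11-50"),
  Improving.activity = c("Improving activity", NA,
                         "Improving activity", "Improving activity"),
  Improving.binding.affinity = c(NA, NA, "Improving binding affinity", NA)
)

# Percentage of non-missing answers per company size, for every "Improving" column
result <- survey |>
  group_by(size) |>
  summarise(across(starts_with("Improving"), \(val) 100 * mean(!is.na(val))))
```

Other helpers such as contains(), matches() or where(is.character) can be swapped in the same way, depending on how the columns are named.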
I'm trying to make a dataframe pulled from an excel file more user-friendly by creating a "Type" column.
The data can be found here: https://www.dmo.gov.uk/data/pdfdatareport?reportCode=D1A (direct download excel link here: https://www.dmo.gov.uk/umbraco/surface/DataExport/GetDataExport?reportCode=D1A&exportFormatValue=xls&parameters=%26COBDate%3D11%2F04%2F2011)
As you can probably see, the type of data is all grouped together in column A, like so:
What I'd like to do is change the title "Conventional Gilts" to "Name", and create a "Type" column that holds the different categories pulled from their group titles. In the linked file, the "Types" would be: "Ultra-Short", "Short", "Medium", "Long", "Index-linked Gilts (3-month Indexation Lag)", "Undated Gilts (non "rump")", and ""Rump" Gilts".
While I feel I would need to do some form of pattern matching using a function like grepl, I'm not sure how I can achieve this in a 'dynamic' way (one that adapts if new categories are created).
Any advice on how to achieve this (or even achieve this in a function) would be greatly appreciated.
I don't know about a single function to do all this; the data is haphazardly arranged and needs to be fixed "manually", for example:
library(readxl)
library(tidyverse)
gilts <- read_xls("C:/Users/Administrator/Documents/gilts.xls")
gilts %>%
filter(!apply(gilts, 1, function(x) all(is.na(x)))) %>%
filter(seq(nrow(.)) < 44) %>%
select(1:7) %>%
filter(seq(nrow(.)) != 1) %>%
setNames(unlist(slice(., 1))) %>%
filter(seq(nrow(.)) != 1) %>%
mutate(splitter = cumsum(is.na(`ISIN Code`))) %>%
group_by(splitter) %>%
mutate(Type = first(`Conventional Gilts`)) %>%
summarize(across(everything(), ~.x[-1])) %>%
ungroup() %>%
select(-1) %>%
select(c(8, 1:7)) %>%
rename(Name = `Conventional Gilts`) %>%
mutate(across(c(4, 5, 7),
~ as.Date(as.numeric(.x), origin = "1899-12-30"))) %>%
mutate(across(contains("million"), as.numeric))
#> `summarise()` has grouped output by 'splitter'. You can override using the
#> `.groups` argument.
#> # A tibble: 37 x 8
#> Type Name ISIN ~1 Redempti~2 First Is~3 Divid~4 Current/~5 Total~6
#> <chr> <chr> <chr> <date> <date> <chr> <date> <dbl>
#> 1 Ultra-Short 9% Conv~ GB0002~ 2011-07-12 1987-07-12 12 Jan~ 2011-07-01 7312.
#> 2 Ultra-Short 3¼% Tre~ GB00B3~ 2011-12-07 2008-11-14 7 Jun/~ 2011-05-26 15747
#> 3 Ultra-Short 5% Trea~ GB0030~ 2012-03-07 2001-05-25 7 Mar/~ 2011-08-26 26867.
#> 4 Ultra-Short 5¼% Tre~ GB00B1~ 2012-06-07 2007-03-16 7 Jun/~ 2011-05-26 25612.
#> 5 Ultra-Short 4½% Tre~ GB00B2~ 2013-03-07 2008-03-05 7 Mar/~ 2011-08-26 33787.
#> 6 Ultra-Short 8% Trea~ GB0008~ 2013-09-27 1993-04-01 27 Mar~ 2011-09-16 8378.
#> 7 Ultra-Short 2¼% Tre~ GB00B3~ 2014-03-07 2009-03-20 7 Mar/~ 2011-08-26 29123.
#> 8 Short 5% Trea~ GB0031~ 2014-09-07 2002-07-25 7 Mar/~ 2011-08-26 36579.
#> 9 Short 2¾% Tre~ GB00B4~ 2015-01-22 2009-11-04 22 Jan~ 2011-07-13 28181.
#> 10 Short 4¾% Tre~ GB0033~ 2015-09-07 2003-09-26 7 Mar/~ 2011-08-26 33650.
#> # ... with 27 more rows, and abbreviated variable names 1: `ISIN Code`,
#> # 2: `Redemption Date`, 3: `First Issue Date`, 4: `Dividend Dates`,
#> # 5: `Current/Next \nEx-dividend Date`,
#> # 6: `Total Amount in Issue \n(£ million nominal)`
Created on 2022-10-30 with reprex v2.0.2
Different approach, premised on the fact that all the gilts start with numbers and the types do not. Makes use of janitor which has super helpful functions for cleaning up messy imported data like this.
library(tidyverse)
library(readxl)
library(janitor)
import_gilts <- read_excel("20221031 - Gilts in Issue.xls.xls", skip = 7)
gilts <- import_gilts %>%
filter(!str_detect(coalesce(`Conventional Gilts`, ""), "^Note|^Page")) %>%
rename(Name = `Conventional Gilts`) %>%
remove_empty(which = "rows") %>%
mutate(Type = case_when(str_detect(Name, "^[^0-9]") ~ Name,
TRUE ~ NA_character_),
.before = Name) %>%
fill(Type, .direction = "down") %>%
arrange(desc(...9)) %>%
row_to_names(row_number = 2) %>%
rename(Type = 1,
Name = 2) %>%
filter(Type != Name)
Quick draft so there's certainly room for improvement.
Should be able to be turned into a function as long as the number of imported columns and number of rows to skip reading in the file stay the same.
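To show the core idea in isolation (rows that don't start with a digit are group titles, which get copied into a Type column and filled down), here is a small sketch on invented data. The gilt names and amounts below are placeholders, not values from the DMO file.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical import: group titles and gilts share one column
raw <- tibble(
  Name = c("Ultra-Short", "9% Gilt A", "5% Gilt B", "Short", "4% Gilt C"),
  amount = c(NA, 100, 200, NA, 300)
)

tidy <- raw %>%
  # A row that does not start with a digit is treated as a group title
  mutate(Type = if_else(str_detect(Name, "^[0-9]"), NA_character_, Name),
         .before = Name) %>%
  # Carry each title down to the gilts that follow it
  fill(Type, .direction = "down") %>%
  # Drop the title rows themselves
  filter(Type != Name)
```

Because the titles are detected by pattern rather than listed by hand, a new category should be picked up automatically as long as it doesn't start with a number.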
I am a beginner with R and have found myself repeatedly running into a problem of this kind. Say I have a dataframe with columns:
company, shares_2010, shares_2011, ... , shares_2020, share_price_2010, ... , share_price_2020
TeslaInc 1000 1200 2000 8 40
.
.
.
I then want to go ahead and calculate the market value in each year. Ordinarily I would do it this way:
dataframe <- dataframe %>%
mutate(value_2010 = shares_2010*share_price_2010,
value_2011 = shares_2011*share_price_2011,
.
:
value_2020 = shares_2020*share_price_2020)
Clearly, all of this is rather cumbersome to type out each time and it cannot be made dynamic with respect to the number of time periods included. Is there any clever way to do these operations in one line instead? I am suspecting something may be possible to do with a combination of starts_with() and some lambda function, but I just haven't been able to figure out how to make the correct things multiply yet. Surely the tidyverse must have a better way to do this?
Any help is much appreciated!
You're right, this is a very common situation in data management.
Let's make a minimal, reproducible example:
dat <- data.frame(
company = c("TeslaInc", "Merta"),
shares_2010 = c(1000L, 1500L),
shares_2011 = c(1200L, 1100L),
shareprice_2010 = 8:7,
shareprice_2011 = c(40L, 12L)
)
dat
#> company shares_2010 shares_2011 shareprice_2010 shareprice_2011
#> 1 TeslaInc 1000 1200 8 40
#> 2 Merta 1500 1100 7 12
This dataset has two issues:
1. It's in a wide format. This is relatively easy to visualise for humans, but it's not ideal for data analysis. We can fix this with pivot_longer() from tidyr.
2. Each column actually contains two variables: measure (shares or share price) and year. We can fix this with separate() from the same package.
library(tidyr)
dat_reshaped <- dat |>
pivot_longer(shares_2010:shareprice_2011) |>
separate(name, into = c("name", "year")) |>
pivot_wider(names_from = name, values_from = value)
dat_reshaped
#> # A tibble: 4 × 4
#> company year shares shareprice
#> <chr> <chr> <int> <int>
#> 1 TeslaInc 2010 1000 8
#> 2 TeslaInc 2011 1200 40
#> 3 Merta 2010 1500 7
#> 4 Merta 2011 1100 12
The last pivot_wider() is needed to have shares and shareprice as two separate columns, for ease of further calculations.
We can finally use mutate() to calculate in one go all the new values.
dat_reshaped |>
dplyr::mutate(value = shares * shareprice)
#> # A tibble: 4 × 5
#> company year shares shareprice value
#> <chr> <chr> <int> <int> <int>
#> 1 TeslaInc 2010 1000 8 8000
#> 2 TeslaInc 2011 1200 40 48000
#> 3 Merta 2010 1500 7 10500
#> 4 Merta 2011 1100 12 13200
I recommend you read this chapter of R4DS to better understand these concepts - it's worth the effort!
I think further analysis will be simpler if you reshape your data long.
Here, we can extract shares, share_price, and year from the header names using pivot_longer. I specify that I want to split each header into two pieces at the last _, and that the first piece (aka .value, that is, shares or share_price) should become the column name, placed next to the year that came from the end of the header.
Then the calculation is a simple one-liner.
library(tidyr); library(dplyr)
data.frame(company = "Tesla",
shares_2010 = 5, shares_2011 = 6,
share_price_2010 = 100, share_price_2011 = 110) %>%
pivot_longer(-company,
names_to = c(".value", "year"),
names_pattern = "(.*)_(.*)") %>%
mutate(value = shares * share_price)
# A tibble: 2 × 5
company year shares share_price value
<chr> <chr> <dbl> <dbl> <dbl>
1 Tesla 2010 5 100 500
2 Tesla 2011 6 110 660
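If the wide layout with value_2010, value_2011, ... columns is still wanted at the end, the long result can be pivoted back. A minimal sketch, reusing the toy Tesla data from above (the object names wide, long and wide_again are just for illustration):

```r
library(tidyr)
library(dplyr)

wide <- data.frame(company = "Tesla",
                   shares_2010 = 5, shares_2011 = 6,
                   share_price_2010 = 100, share_price_2011 = 110)

# Long format: one row per company and year, then the one-line calculation
long <- wide %>%
  pivot_longer(-company,
               names_to = c(".value", "year"),
               names_pattern = "(.*)_(.*)") %>%
  mutate(value = shares * share_price)

# Back to wide: with several values_from columns, the new names are
# built as <column>_<year>, e.g. value_2010
wide_again <- long %>%
  pivot_wider(names_from = year,
              values_from = c(shares, share_price, value))
```

This works for any number of years without changing the code, which was the dynamic part the question asked about.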
I agree with the other posts about pivoting this data into a longer format. Just to add a different approach that works well with this type of example: you can create a list of expressions and then use the splice operator !!! to evaluate these expressions within your context:
library(purrr)
library(dplyr)
library(rlang)
library(glue)
lexprs <- set_names(2010:2011, paste0("value_", 2010:2011)) %>%
map_chr(~ glue("shares_{.x} * share_price_{.x}")) %>%
parse_exprs()
df %>%
mutate(!!! lexprs)
Output
company shares_2010 shares_2011 share_price_2010 share_price_2011 value_2010
1 TeslaInc 1000 1200 8 40 8000
2 Merta 1500 1100 7 12 10500
value_2011
1 48000
2 13200
Data
Thanks to Andrea M
structure(list(company = c("TeslaInc", "Merta"), shares_2010 = c(1000L,
1500L), shares_2011 = c(1200L, 1100L), share_price_2010 = 8:7,
share_price_2011 = c(40L, 12L)), class = "data.frame", row.names = c(NA,
-2L))
How it works
With this usage, the splice operator takes a named list of expressions. The names of the list become the variable names and the expressions are evaluated in the context of your mutate statement.
> lexprs
$value_2010
shares_2010 * share_price_2010
$value_2011
shares_2011 * share_price_2011
To see how this injection will resolve, we can use rlang::qq_show:
> rlang::qq_show(df %>% mutate(!!! lexprs))
df %>% mutate(value_2010 = shares_2010 * share_price_2010, value_2011 = shares_2011 *
share_price_2011)
It is indeed likely you may need to have your data in a long format. But in case you don't, you can do this:
# thanks Andrea M!
df <- data.frame(
company=c("TeslaInc", "Merta"),
shares_2010=c(1000L, 1500L),
shares_2011=c(1200L, 1100L),
share_price_2010=8:7,
share_price_2011=c(40L, 12L)
)
years <- sub('shares_', '', grep('^shares_', names(df), value = TRUE))
for (year in years) {
df[[paste0('value_', year)]] <-
df[[paste0('shares_', year)]] * df[[paste0('share_price_', year)]]
}
If you wanted to avoid the loop (for (...) {...}) you can use this instead:
sp <- df[, paste0('shares_', years)] * df[, paste0('share_price_', years)]
names(sp) <- paste0('value_', years)
df <- cbind(df, sp)
I have a dataframe with variables from COMPUSTAT containing data on various accounting items, including SG&A expenses from different companies.
I want to create a new variable in the dataframe which accumulates the SG&A expenses for each company in chronological order. I use PERMNO codes as the unique ID for each company.
I have tried this code, however it does not seem to work:
crsp.comp2$cxsgaq <- crsp.comp2 %>%
group_by(permno) %>%
arrange(date) %>%
mutate_at(vars(xsgaq), cumsum(xsgaq))
(xsgaq is the COMPUSTAT variable for SG&A expenses)
Thank you very much for your help
Your example code is attempting to write the entire dataframe crsp.comp2 into a single variable, crsp.comp2$cxsgaq.
In addition, mutate_at() expects a function as its second argument rather than the call cumsum(xsgaq). In your situation, use the standard mutate() function and create the cxsgaq variable there.
crsp.comp2 <- crsp.comp2 %>%
group_by(permno) %>%
arrange(date) %>%
mutate(cxsgaq = cumsum(xsgaq))
Reproducible example with iris dataset:
library(tidyverse)
iris %>%
group_by(Species) %>%
arrange(Sepal.Length) %>%
mutate(C.Sepal.Width = cumsum(Sepal.Width))
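To make the result easy to check by hand, here is the same pattern on a tiny made-up panel. The expenses table and its values are assumptions; only the column names permno, date and xsgaq follow the question. I sort within each group with .by_group = TRUE so that each company's rows end up together, which is a presentation choice and not required for the cumulative sum itself.

```r
library(dplyr)

# Hypothetical quarterly SG&A data for two firms, deliberately out of order
expenses <- tibble(
  permno = c(1, 2, 1, 2, 1),
  date = as.Date(c("2020-06-30", "2020-03-31", "2020-03-31",
                   "2020-06-30", "2020-09-30")),
  xsgaq = c(20, 5, 10, 7, 30)
)

cumulative <- expenses %>%
  group_by(permno) %>%
  arrange(date, .by_group = TRUE) %>%   # chronological order within each firm
  mutate(cxsgaq = cumsum(xsgaq)) %>%    # running total restarts for each firm
  ungroup()
```

Note that cumsum() propagates missing values, so if xsgaq contains NAs every later value in that firm's running total becomes NA as well.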
Building on the answer from @m-viking, if using the WRDS PostgreSQL server, you would simply use window_order (from dbplyr) in place of arrange. (I use the Compustat firm identifier gvkey in place of permno so that this code works, but the idea is the same.)
library(dplyr, warn.conflicts = FALSE)
library(DBI)
pg <- dbConnect(RPostgres::Postgres(),
bigint = "integer", sslmode='allow')
fundq <- tbl(pg, sql("SELECT * FROM comp.fundq"))
comp2 <-
fundq %>%
filter(indfmt == "INDL", datafmt == "STD",
consol == "C", popsrc == "D")
comp2 <-
comp2 %>%
group_by(gvkey) %>%
dbplyr::window_order(datadate) %>%
mutate(cxsgaq = cumsum(xsgaq))
comp2 %>%
filter(!is.na(xsgaq)) %>%
select(gvkey, datadate, xsgaq, cxsgaq)
#> # Source: lazy query [?? x 4]
#> # Database: postgres [iangow@wrds-pgdata.wharton.upenn.edu:9737/wrds]
#> # Groups: gvkey
#> # Ordered by: datadate
#> gvkey datadate xsgaq cxsgaq
#> <chr> <date> <dbl> <dbl>
#> 1 001000 1966-12-31 0.679 0.679
#> 2 001000 1967-12-31 1.02 1.70
#> 3 001000 1968-12-31 5.86 7.55
#> 4 001000 1969-12-31 7.18 14.7
#> 5 001000 1970-12-31 8.25 23.0
#> 6 001000 1971-12-31 7.96 30.9
#> 7 001000 1972-12-31 7.55 38.5
#> 8 001000 1973-12-31 8.53 47.0
#> 9 001000 1974-12-31 8.86 55.9
#> 10 001000 1975-12-31 9.59 65.5
#> # … with more rows
Created on 2021-04-05 by the reprex package (v1.0.0)
I have written the following script to get the data into a longer format. How can I get the data frame arranged by variable rather than by Date? That is, I should first get the data for variable A for all the dates, followed by variable X.
library(lubridate)
library(tidyverse)
set.seed(123)
DF <- data.frame(Date = seq(as.Date("1979-01-01"), to = as.Date("1979-12-31"), by = "day"),
A = runif(365,1,10), X = runif(365,5,15)) %>%
pivot_longer(-Date, names_to = "Variables", values_to = "Values")
Maybe I haven't understood correctly, but you can sort your data by the Variables column using the arrange() function.
library(tidyverse)
DF <- DF %>%
arrange(Variables)
This results in:
# A tibble: 730 x 3
Date Variables Values
<date> <chr> <dbl>
1 1979-01-01 A 3.59
2 1979-01-02 A 8.09
3 1979-01-03 A 4.68
4 1979-01-04 A 8.95
5 1979-01-05 A 9.46
6 1979-01-06 A 1.41
7 1979-01-07 A 5.75
8 1979-01-08 A 9.03
9 1979-01-09 A 5.96
10 1979-01-10 A 5.11
# ... with 720 more rows
In base R, we can use
DF1 <- DF[order(DF$Variables),]
Am I missing something? This is it.
arrange(DF, Variables, Date) %>% select(Variables, everything())
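For a version that can be checked by eye, here is the same idea on three days of made-up values instead of random numbers (the objects long and by_variable are illustrative names only). Sorting on Date as a second key makes the within-variable order explicit instead of relying on the order the rows happened to come in.

```r
library(dplyr)
library(tidyr)

# Hypothetical small example: two variables observed on three dates
long <- data.frame(
  Date = as.Date("1979-01-01") + 0:2,
  A = c(1, 2, 3),
  X = c(10, 20, 30)
) %>%
  pivot_longer(-Date, names_to = "Variables", values_to = "Values")

# All rows for A first (in date order), then all rows for X
by_variable <- long %>%
  arrange(Variables, Date)
```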