Efficient way to repeat operations with columns with similar name in R - r

I am a beginner with R and have found myself repeatedly running into a problem of this kind. Say I have a dataframe with columns:
company, shares_2010, shares_2011, ... , shares_2020, share_price_2010, ... , share_price_2020
TeslaInc 1000 1200 2000 8 40
.
.
.
I then want to go ahead and calculate the market value in each year. Ordinarily I would do it this way:
dataframe <- dataframe %>%
mutate(value_2010 = shares_2010*share_price_2010,
value_2011 = shares_2011*share_price_2011,
.
:
value_2020 = shares_2020*share_price_2020)
Clearly, all of this is rather cumbersome to type out each time and it cannot be made dynamic with respect to the number of time periods included. Is there any clever way to do these operations in one line instead? I am suspecting something may be possible to do with a combination of starts_with() and some lambda function, but I just haven't been able to figure out how to make the correct things multiply yet. Surely the tidyverse must have a better way to do this?
Any help is much appreciated!

You're right, this is a very common situation in data management.
Let's make a minimal, reproducible example:
dat <- data.frame(
company = c("TeslaInc", "Merta"),
shares_2010 = c(1000L, 1500L),
shares_2011 = c(1200L, 1100L),
shareprice_2010 = 8:7,
shareprice_2011 = c(40L, 12L)
)
dat
#> company shares_2010 shares_2011 shareprice_2010 shareprice_2011
#> 1 TeslaInc 1000 1200 8 40
#> 2 Merta 1500 1100 7 12
This dataset has two issues:
It's in a wide format. This is relatively easy to visualise for humans, but it's not ideal for data analysis. We can fix this with pivot_longer() from tidyr.
Each column actually contains two variables: measure (share or share price) and year. We can fix this with separate() from the same package.
library(tidyr)
dat_reshaped <- dat |>
pivot_longer(shares_2010:shareprice_2011) |>
separate(name, into = c("name", "year")) |>
pivot_wider(everything(), values_from = value, names_from = name)
dat_reshaped
#> # A tibble: 4 × 4
#> company year shares shareprice
#> <chr> <chr> <int> <int>
#> 1 TeslaInc 2010 1000 8
#> 2 TeslaInc 2011 1200 40
#> 3 Merta 2010 1500 7
#> 4 Merta 2011 1100 12
The last pivot_wider() is needed to have shares and shareprice as two separate columns, for ease of further calculations.
We can finally use mutate() to calculate in one go all the new values.
dat_reshaped |>
dplyr::mutate(value = shares * shareprice)
#> # A tibble: 4 × 5
#> company year shares shareprice value
#> <chr> <chr> <int> <int> <int>
#> 1 TeslaInc 2010 1000 8 8000
#> 2 TeslaInc 2011 1200 40 48000
#> 3 Merta 2010 1500 7 10500
#> 4 Merta 2011 1100 12 13200
I recommend you read this chapter of R4DS to better understand these concepts - it's worth the effort!

I think further analysis will be simpler if you reshape your data long.
Here, we can extract the shares, share_price, and year from the header names using pivot_longer. Here, I specify that I want to split the headers into two pieces separated by _, and I want to put the name (aka .value) from the beginning of the header (that is, share or share_price) next to the year that came from the end of the header.
Then the calculation is a simple one-liner.
library(tidyr); library(dplyr)
data.frame(company = "Tesla",
shares_2010 = 5, shares_2011 = 6,
share_price_2010 = 100, share_price_2011 = 110) %>%
pivot_longer(-company,
names_to = c(".value", "year"),
names_pattern = "(.*)_(.*)") %>%
mutate(value = shares * share_price)
# A tibble: 2 × 5
company year shares share_price value
<chr> <chr> <dbl> <dbl> <dbl>
1 Tesla 2010 5 100 500
2 Tesla 2011 6 110 660

I agree with the other posts about pivoting this data into a longer format. Just to add a different approach that works well with this type of example: you can create a list of expressions and then use the splice operator !!! to evaluate these expressions within your context:
library(purrr)
library(dplyr)
library(rlang)
library(glue)
lexprs <- set_names(2010:2011, paste0("value_", 2010:2011)) %>%
map_chr(~ glue("shares_{.x} * share_price_{.x}")) %>%
parse_exprs()
df %>%
mutate(!!! lexprs)
Output
company shares_2010 shares_2011 share_price_2010 share_price_2011 value_2010
1 TeslaInc 1000 1200 8 40 8000
2 Merta 1500 1100 7 12 10500
value_2011
1 48000
2 13200
Data
Thanks to Andrea M
structure(list(company = c("TeslaInc", "Merta"), shares_2010 = c(1000L,
1500L), shares_2011 = c(1200L, 1100L), share_price_2010 = 8:7,
share_price_2011 = c(40L, 12L)), class = "data.frame", row.names = c(NA,
-2L))
How it works
With this usage, the splice operator takes a named list of expressions. The names of the list become the variable names and the expressions are evaluated in the context of your mutate statement.
> lexprs
$value_2010
shares_2010 * share_price_2010
$value_2011
shares_2011 * share_price_2011
To see how this injection will resolve, we can use rlang::qq_show:
> rlang::qq_show(df %>% mutate(!!! lexprs))
df %>% mutate(value_2010 = shares_2010 * share_price_2010, value_2011 = shares_2011 *
share_price_2011)

It is indeed likely you may need to have your data in a long format. But in case you don't, you can do this:
# thanks Andrea M!
df <- data.frame(
company=c("TeslaInc", "Merta"),
shares_2010=c(1000L, 1500L),
shares_2011=c(1200L, 1100L),
share_price_2010=8:7,
share_price_2011=c(40L, 12L)
)
years <- sub('shares_', '', grep('^shares_', names(df), value=T))
for (year in years) {
df[[paste0('value_', year)]] <-
df[[paste0('shares_', year)]] * df[[paste0('share_price_', year)]]
}
If you wanted to avoid the loop (for (...) {...}) you can use this instead:
sp <- df[, paste0('shares_', years)] * df[, paste0('share_price_', years)]
names(sp) <- paste0('value_', years)
df <- cbind(df, sp)

Related

How best to do this pivot operation in R

Below is the sample data and the desired outcome. This is a much simplified version of the actual data set. In the actual data set, there are 20 years and 4 quarters apiece. Looking to have each unique company entry listed once and the employment data series running from beginning to end from left to right. In the event that there is no data for Vision Inc in 2019 quarter 3, then I would want it to return a O and not an NA.
library(tidyverse)
library(dplyr)
legalname <- c("Vision Inc.","Expedia","Strong Enterprise","Vision Inc.","Expedia","Strong Enterprise")
year <- c(2019,2019,2019,2019,2019,2019)
quarter <- c(1,1,1,2,2,2)
cnty <- c(031,029,027,031,029,027)
naics <- c(345110,356110,362110,345110,356110,345110)
mnth1emp <- c (11,13,15,15,17,20)
mnth2emp <- c(12,14,15,16,18,22)
mnth3emp <-c(13,15,15,17,21,29)
employers <- data.frame(legalname,year,quarter,naics,mnth1emp,mnth2emp,mnth3emp)
Desired Outcome
legalname cnty naics 2019m1 2019m2 2019m3 2019m4 2019m5 2019m6
Vision Inc 031 345110 11 12 13 15 16 17
Expedia 029 356110 13 14 15 17 18 21
I first pivot to a long form, then arrange by legalname and year(just to double-check that they are in numerical order). Then, I create a unique month series for each year for each company. Then, I drop quarter and pivot back to wide form and put name and year together, and finally replace NA with 0. Here, I'm assuming that you want each unique naics on it's own row.
library(tidyverse)
employers %>%
pivot_longer(starts_with("mnth")) %>%
arrange(legalname, year) %>%
group_by(legalname, year, naics) %>%
mutate(name = paste0("m", 1:n())) %>%
select(-quarter) %>%
pivot_wider(names_from = c("year", "name"), names_sep = "", values_from = "value") %>%
mutate(across(everything(), ~replace_na(.,0)))
Output
legalname naics `2019m1` `2019m2` `2019m3` `2019m4` `2019m5` `2019m6`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Expedia 356110 13 14 15 17 18 21
2 Strong Enterprise 362110 15 15 15 0 0 0
3 Strong Enterprise 345110 0 0 0 20 22 29
4 Vision Inc. 345110 11 12 13 15 16 17
Does this work for you?
First pivot longer to get the months and values in a quarter; and then pivot wider to get the wide format you want.
employers %>%
filter(legalname != "Strong Enterprise") %>%
pivot_longer(mnth1emp:mnth3emp, names_to = "mnth", values_to = "value") %>%
mutate(month_in_quarter = as.numeric(str_extract(mnth, "\\d")),
month =str_c("m", month_in_quarter + 3*(quarter - 1))) %>%
select(-c(month_in_quarter, mnth)) %>%
pivot_wider(c(legalname,cnty, naics), names_from = c(year, month),
values_from = value,
values_fill = 0)
values_fill will fill NAs with 0s.
perhaps try this.
I found a way to get the pivot right in R. I used the library("pivottabler") with the data.frame "bhmtrains". This worked now.
library(pivottabler)
qhpvt(bhmtrains, c("=","TOC"), "TrainCategory",
c("Mean Speed"="mean(SchedSpeedMPH, na.rm=TRUE)", "Std Dev
Speed"="sd(SchedSpeedMPH, na.rm=TRUE)"),
formats=list("%.0f", "%.1f"), totals=list("", "TrainCategory"="All",
"Categories"))
my results out of the code

Pivot_longer and Pivot wider syntax

I want to ask for ideas on creating a syntax to pivot_longer given on this.
I've already tried researching in the internet but I can't seem to find any examples that is similar to my data given where it has a Metric column which is also seperated in 3 different columns of months.
My desire final output is to have seven columns consisting of (regions,months, and the five Metrics)
How to formulate the pivot_longer and pivot_wider syntax to clean my data in order for me to visualize it?
The tricky part isn't pivot_longer. You first have to clean your Excel spreadsheet, i.e. get rid of empty rows and merge the two header rows containing the names of the variables and the dates.
One approach to achieve your desired result may look like so:
library(readxl)
library(tidyr)
library(janitor)
library(dplyr)
x <- read_excel("data/Employment.xlsx", skip = 3, col_names = FALSE) %>%
# Get rid of empty rows and cols
janitor::remove_empty()
# Make column names
col_names <- data.frame(t(x[1:2,])) %>%
fill(1) %>%
unite(name, 1:2, na.rm = TRUE) %>%
pull(name)
x <- x[-c(1:2),]
names(x) <- col_names
# Convert to long and values to numerics
x %>%
pivot_longer(-Region, names_to = c(".value", "months"), names_sep = "_") %>%
separate(months, into = c("month", "year")) %>%
mutate(across(!c(Region, month, year), as.numeric))
#> # A tibble: 6 × 8
#> Region month year `Total Population … `Labor Force Part… `Employment Rat…
#> <chr> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 Philippin… April 2020f 73722. 55.7 82.4
#> 2 Philippin… Janu… 2021p 74733. 60.5 91.3
#> 3 Philippin… April 2021p 74971. 63.2 91.3
#> 4 National … April 2020f 9944. 54.2 87.7
#> 5 National … Janu… 2021p 10051. 57.2 91.2
#> 6 National … April 2021p 10084. 60.1 85.6
#> # … with 2 more variables: Unemployment Rate <dbl>, Underemployment Rate <dbl>

Sum table values ​per day

I have a table as shown in the image, where each comment has a publication date, with year, month, day and time, I would like to add the sentiment values ​​by day.
this is how the table is composed
serie <- data.frame(comments$created_time,sentiment2$positive-sentiment2$negative)
Using dplyr you can do:
library(dplyr)
df %>%
group_by(as.Date(comments.created_time)) %>%
summarize(total = sum(sentiment))
Here is some sample data that will help others to troubleshoot and understand the data:
df <- tibble(comments.created_time = c("2015-01-26 22:43:00",
"2015-01-26 22:44:00",
"2015-01-27 22:43:00",
"2015-01-27 22:44:00",
"2015-01-28 22:43:00",
"2015-01-28 22:44:00"),
sentiment = c(1,3,5,1,9,1))
Using the sample data will yield:
# A tibble: 3 × 2
`as.Date(comments.created_time)` total
<date> <dbl>
1 2015-01-26 4
2 2015-01-27 6
3 2015-01-28 10

What is an efficient programming way to find the closest time of a dataset to a reference (larger) dataset

I am searching for an efficient way to find the closest times of a small dataset (x) in comparison to a large dataset (a). The result has to be an index of the length of (a). I have already created a function which works very nicely, however, it is absolutely useless for large data as it takes days to process.
Here is my function: function(x, a, which = TRUE,na.rm=FALSE){
if("POSIXt" %in% class(x)) x <- as.numeric(x)
if("POSIXt" %in% class(a)) a <- as.numeric(a)
sapply(a, function(y) DescTools::Closest(x, y, which = TRUE,na.rm=FALSE)[1])
}
both datasets x and a are filtered and therefore have no consistent time stemp but they are filtered after the same requirements.
vector a contains 20 Hz data with a length of 16020209 and x contains 30 sec data with a length of 26908.
Any suggestions are very much appreciated! Thank you :)
One can use a rolling join from data.table:
library(data.table)
set.seed(1) # reproduciblity on Stackoverflow
DF_A <- data.table(x = seq(-500, by = 0.5, length.out = 26908),
idx = seq_len(26908))
DF_HZ <- data.table(x = round(runif(16020209, first(DF_A$x), last(DF_A$x)), 3),
idx_hz = seq_len(16020209))
DF_HZ[, x_hz := x + 0] # so we can check
DF_A[, x_a := x + 0] # so we can check
setkey(DF_A, x)
setkey(DF_HZ, x)
# The order(idx_hz) returns the result in the same order as
# DF_HZ but it is not necessary to match joins.
DF_A[DF_HZ, roll = "nearest"][order(idx_hz)]
#> x idx x_a idx_hz x_hz
#> 1: 3072.021 7145 3072.0 1 3072.021
#> 2: 4506.369 10014 4506.5 2 4506.369
#> 3: 7206.883 15415 7207.0 3 7206.883
#> 4: 11718.574 24438 11718.5 4 11718.574
#> 5: 2213.328 5428 2213.5 5 2213.328
#> ---
#> 16020205: 10517.477 22036 10517.5 16020205 10517.477
#> 16020206: 11407.776 23817 11408.0 16020206 11407.776
#> 16020207: 12051.919 25105 12052.0 16020207 12051.919
#> 16020208: 3482.463 7966 3482.5 16020208 3482.463
#> 16020209: 817.366 2636 817.5 16020209 817.366
Created on 2020-11-11 by the reprex package (v0.3.0)
On my machine, the above (not including the creation of the dummy data) takes about 3 s.
I would use something like a SQL full join for this task since the second df is small - though it depends on your data size and ram. Here is a simple example with test data:
library(dplyr)
# demo tibbles
tab1 <- tibble::tribble(
~time_1, ~VALUE_1,
"2020-11-01", 268L,
"2020-11-02", 479L,
"2020-11-03", 345L,
"2020-11-04", 567L,
"2020-11-05", 567L) %>%
dplyr::mutate(time_1 = as.Date(time_1))
tab2 <- tibble::tribble(
~time_2, ~VALUE_2,
"2020-11-01", 268L,
"2020-11-02", 479L) %>%
dplyr::mutate(time_2 = as.Date(time_2))
# calculations
tab1 %>%
dplyr::mutate(ID = dplyr::row_number()) %>% # Build ID from row number
dplyr::full_join(tab2, by = character()) %>%
dplyr::mutate(DIF = abs(time_1 - time_2)) %>%
dplyr::group_by(ID) %>%
dplyr::slice_min(order_by = DIF, n = 1)
time_1 VALUE_1 ID time_2 VALUE_2 DIF
<date> <int> <int> <date> <int> <drtn>
1 2020-11-01 268 1 2020-11-01 268 0 days
2 2020-11-02 479 2 2020-11-02 479 0 days
3 2020-11-03 345 3 2020-11-02 479 1 days
4 2020-11-04 567 4 2020-11-02 479 2 days
5 2020-11-05 567 5 2020-11-02 479 3 days
If size turns out to be a problem you yould split the large data.frame in smaller once and the run it with a loop. In this case parallel processing would be a great option since by splitting the large DF calculations can be run independently.

Convert data.frame wide to long while concatenating date formats

In R (or other language), I want to transform an upper data frame to lower one.
How can I do that?
Thank you beforehand.
year month income expense
2016 07 50 15
2016 08 30 75
month income_expense
1 2016-07 50
2 2016-07 -15
3 2016-08 30
4 2016-08 -75
Well, it seems that you are trying to do multiple operations in the same question: combine dates columns, melt your data, some colnames transformations and sorting
This will give your expected output:
library(tidyr); library(reshape2); library(dplyr)
df %>% unite("date", c(year, month)) %>%
mutate(expense=-expense) %>% melt(value.name="income_expense") %>%
select(-variable) %>% arrange(date)
#### date income_expense
#### 1 2016_07 50
#### 2 2016_07 -15
#### 3 2016_08 30
#### 4 2016_08 -75
I'm using three different libraries here, for better readability of the code. It might be possible to do it with base R, though.
Here's a solution using only two packages, dplyr and tidyr
First, your dataset:
df <- dplyr::data_frame(
year =2016,
month = c("07", "08"),
income = c(50,30),
expense = c(15, 75)
)
The mutate() function in dplyr creates/edits individual variables. The gather() function in tidyr will bring multiple variables/columns together in the way that you specify.
df <- df %>%
dplyr::mutate(
month = paste0(year, "-", month)
) %>%
tidyr::gather(
key = direction, #your name for the new column containing classification 'key'
value = income_expense, #your name for the new column containing values
income:expense #which columns you're acting on
) %>%
dplyr::mutate(income_expense =
ifelse(direction=='expense', -income_expense, income_expense)
)
The output has all the information you'd need (but we will clean it up in the last step)
> df
# A tibble: 4 × 4
year month direction income_expense
<dbl> <chr> <chr> <dbl>
1 2016 2016-07 income 50
2 2016 2016-08 income 30
3 2016 2016-07 expense -15
4 2016 2016-08 expense -75
Finally, we select() to drop columns we don't want, and then arrange it so that df shows the rows in the same order as you described in the question.
df <- df %>%
dplyr::select(-year, -direction) %>%
dplyr::arrange(month)
> df
# A tibble: 4 × 2
month income_expense
<chr> <dbl>
1 2016-07 50
2 2016-07 -15
3 2016-08 30
4 2016-08 -75
NB: I guess that I'm using three libraries, including magrittr for the pipe operator %>%. But, since the pipe operator is the best thing ever, I often forget to count magrittr.

Resources