Unable to execute STL decomposition properly - r

From this time series:
australia_data <- tourism %>%
select(Quarter, Trips) %>%
summarise(TotalTrips = sum(Trips))
> head(australia_data)
# A tsibble: 6 x 4 [1D]
# Key: Region, Purpose [1]
# Groups: Region [1]
Region Purpose Quarter TotalTrips
<chr> <chr> <date> <dbl>
1 Adelaide Business 1998-01-01 135.
2 Adelaide Business 1998-04-01 110.
3 Adelaide Business 1998-07-01 166.
4 Adelaide Business 1998-10-01 127.
5 Adelaide Business 1999-01-01 137.
6 Adelaide Business 1999-04-01 200.
I want to do an STL decomposition in order to get seasonally adjusted data:
australia_data_dcmp <- australia_data %>%
model(STL(TotalTrips))
but I'm unable to get the components:
> components(australia_data_dcmp)
Error: Problem with `mutate()` column `cmp`.
i `cmp = map(.fit, components)`.
x no applicable method for 'components' applied to an object of class "null_mdl"
> head(augment(australia_data_dcmp))
# A tsibble: 6 x 8 [1D]
# Key: Region, Purpose, .model [1]
Region Purpose .model Quarter TotalTrips .fitted .resid .innov
<chr> <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl>
1 Adelaide Business STL(TotalTrips) 1998-01-01 135. NA NA NA
2 Adelaide Business STL(TotalTrips) 1998-04-01 110. NA NA NA
3 Adelaide Business STL(TotalTrips) 1998-07-01 166. NA NA NA
4 Adelaide Business STL(TotalTrips) 1998-10-01 127. NA NA NA
5 Adelaide Business STL(TotalTrips) 1999-01-01 137. NA NA NA
6 Adelaide Business STL(TotalTrips) 1999-04-01 200. NA NA NA
Can someone explain the mistake I'm committing, please?
Best regards

The tourism object you show is not what you get when using the latest versions of the various packages loaded by fpp3. This is what I get.
library(fpp3)
#> ── Attaching packages ──────────────────────────────────────────── fpp3 0.4.0 ──
#> ✓ tibble 3.1.5 ✓ tsibble 1.1.0
#> ✓ dplyr 1.0.7 ✓ tsibbledata 0.3.0
#> ✓ tidyr 1.1.4 ✓ feasts 0.2.2
#> ✓ lubridate 1.8.0 ✓ fable 0.3.1
#> ✓ ggplot2 3.3.5
#> ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
#> x lubridate::date() masks base::date()
#> x dplyr::filter() masks stats::filter()
#> x tsibble::intersect() masks base::intersect()
#> x tsibble::interval() masks lubridate::interval()
#> x dplyr::lag() masks stats::lag()
#> x tsibble::setdiff() masks base::setdiff()
#> x tsibble::union() masks base::union()
australia_data <- tourism %>%
select(Quarter, Trips) %>%
summarise(TotalTrips = sum(Trips))
australia_data
#> # A tsibble: 80 x 2 [1Q]
#> Quarter TotalTrips
#> <qtr> <dbl>
#> 1 1998 Q1 23182.
#> 2 1998 Q2 20323.
#> 3 1998 Q3 19827.
#> 4 1998 Q4 20830.
#> 5 1999 Q1 22087.
#> 6 1999 Q2 21458.
#> 7 1999 Q3 19914.
#> 8 1999 Q4 20028.
#> 9 2000 Q1 22339.
#> 10 2000 Q2 19941.
#> # … with 70 more rows
Created on 2021-11-01 by the reprex package (v2.0.1)
Perhaps you are overwriting the tourism object with a grouped version. Or perhaps you are using an old version of the tsibble package, where keys were not dropped by summarise().
In any case, without a reproducible example it is hard to provide more substantial help.
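For reference, here is a minimal sketch of the intended workflow on an unmodified tourism object (the component column names, such as season_adjust, are those produced by feasts' STL):

```r
library(fpp3)

# Aggregate every series to a single quarterly total;
# with current tsibble versions, summarise() drops the keys
australia_data <- tourism %>%
  summarise(TotalTrips = sum(Trips))

# Fit the STL decomposition and extract its components
australia_data_dcmp <- australia_data %>%
  model(STL(TotalTrips))

components(australia_data_dcmp) %>%
  select(Quarter, TotalTrips, trend, season_adjust)
```

The season_adjust column is the seasonally adjusted series the question asks for.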


How to forecast with lagged external regressors using fable::VAR

I'd like to use lagged external regressors in my VAR forecast. Using the VAR() function from the fable package, I am able to fit a model, but I can't use it to forecast, as it returns NAs for the dependent variables. My reprex follows examples from Forecasting: Principles and Practice v3.
Thanks in advance for any guidance.
require(fpp3)
#> Loading required package: fpp3
#> ── Attaching packages ──────────────────────────────────────────── fpp3 0.4.0 ──
#> ✔ tibble 3.1.7 ✔ tsibble 1.0.1
#> ✔ dplyr 1.0.9 ✔ tsibbledata 0.3.0
#> ✔ tidyr 1.1.3 ✔ feasts 0.2.2
#> ✔ lubridate 1.7.10 ✔ fable 0.3.1
#> ✔ ggplot2 3.3.5
#> ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
#> ✖ lubridate::date() masks base::date()
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ tsibble::intersect() masks base::intersect()
#> ✖ tsibble::interval() masks lubridate::interval()
#> ✖ dplyr::lag() masks stats::lag()
#> ✖ tsibble::setdiff() masks base::setdiff()
#> ✖ tsibble::union() masks base::union()
us_change <- fpp3::us_change
fit <- us_change %>%
model(
xregs_lag1 = VAR(vars(Consumption, Income) ~ xreg(Unemployment, lag(Unemployment, 1)))
)
fit
#> # A mable: 1 x 1
#> xregs_lag1
#> <model>
#> 1 <VAR(5) w/ mean>
new_data_ex <- new_data(us_change, 4) %>%
mutate(Unemployment = mean(us_change$Unemployment))
#############
# Here I tried creating a new_data frame that included one lag of Unemployment, and pass that to the new_data argument of forecast, but it doesn't work either
#
# new_data_ex_lags <- us_change %>%
# tail(1) %>%
# bind_rows(new_data_ex) %>%
# select(colnames(new_data_ex))
#############
fit %>%
select(xregs_lag1) %>%
forecast(new_data = new_data_ex)
#> # A fable: 4 x 6 [1Q]
#> # Key: .model [1]
#> .model Quarter .distribution .mean_Consumption .mean_Income Unemployment
#> <chr> <qtr> <dist> <dbl> <dbl> <dbl>
#> 1 xregs_lag1 2019 Q3 MVN[2] NA NA 0.00101
#> 2 xregs_lag1 2019 Q4 MVN[2] NA NA 0.00101
#> 3 xregs_lag1 2020 Q1 MVN[2] NA NA 0.00101
#> 4 xregs_lag1 2020 Q2 MVN[2] NA NA 0.00101
fit %>%
select(xregs_lag1) %>%
report()
#> Series: Consumption, Income
#> Model: VAR(5) w/ mean
#>
#> Coefficients for Consumption:
#> lag(Consumption,1) lag(Income,1) lag(Consumption,2) lag(Income,2)
#> 0.1156 0.1062 0.1479 0.0079
#> s.e. 0.0772 0.0483 0.0753 0.0509
#> lag(Consumption,3) lag(Income,3) lag(Consumption,4) lag(Income,4)
#> 0.2248 -0.0207 -0.0729 -0.0544
#> s.e. 0.0730 0.0499 0.0746 0.0500
#> lag(Consumption,5) lag(Income,5) constant Unemployment
#> -0.0217 0.0327 0.3923 -0.8602
#> s.e. 0.0708 0.0491 0.0923 0.1331
#> lag(Unemployment, 1)
#> 0.4563
#> s.e. 0.1402
#>
#> Coefficients for Income:
#> lag(Consumption,1) lag(Income,1) lag(Consumption,2) lag(Income,2)
#> 0.3715 -0.2991 0.0836 -0.0410
#> s.e. 0.1212 0.0758 0.1182 0.0799
#> lag(Consumption,3) lag(Income,3) lag(Consumption,4) lag(Income,4)
#> 0.4531 -0.1445 0.2481 -0.2475
#> s.e. 0.1145 0.0783 0.1170 0.0785
#> lag(Consumption,5) lag(Income,5) constant Unemployment
#> -0.1270 -0.1878 0.6142 -0.1100
#> s.e. 0.1111 0.0771 0.1449 0.2089
#> lag(Unemployment, 1)
#> -0.0401
#> s.e. 0.2201
#>
#> Residual covariance matrix:
#> Consumption Income
#> Consumption 0.2602 0.1341
#> Income 0.1341 0.6410
#>
#> log likelihood = -350.43
#> AIC = 760.86 AICc = 772.34 BIC = 858.74
Created on 2022-07-22 by the reprex package (v2.0.0)
Using lag() with VAR() models was not fully implemented, but I have added support for this in the development version of the fable package (https://github.com/tidyverts/fable/commit/bb15c9462b80850565aee13d8f9b33e49dfd0f33).
There are some other changes not yet pushed to CRAN such as how forecast means are represented in the fable, but the code is otherwise the same.
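Until that commit reaches CRAN, the development version can be installed from GitHub with a standard remotes call (assuming the remotes package is available):

```r
# install the development version of fable, which includes the lag() fix
remotes::install_github("tidyverts/fable")
```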
require(fpp3)
#> Loading required package: fpp3
#> ── Attaching packages ──────────────────────────────────────────── fpp3 0.4.0 ──
#> ✔ tibble 3.1.7 ✔ tsibble 1.1.1
#> ✔ dplyr 1.0.9 ✔ tsibbledata 0.4.0
#> ✔ tidyr 1.2.0 ✔ feasts 0.2.2
#> ✔ lubridate 1.8.0 ✔ fable 0.3.1.9000
#> ✔ ggplot2 3.3.6
#> ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
#> ✖ lubridate::date() masks base::date()
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ tsibble::intersect() masks base::intersect()
#> ✖ tsibble::interval() masks lubridate::interval()
#> ✖ dplyr::lag() masks stats::lag()
#> ✖ tsibble::setdiff() masks base::setdiff()
#> ✖ tsibble::union() masks base::union()
us_change <- fpp3::us_change
fit <- us_change %>%
model(
xregs_lag1 = VAR(vars(Consumption, Income) ~ xreg(Unemployment, lag(Unemployment, 1)))
)
fit
#> # A mable: 1 x 1
#> xregs_lag1
#> <model>
#> 1 <VAR(5) w/ mean>
new_data_ex <- new_data(us_change, 4) %>%
mutate(Unemployment = mean(us_change$Unemployment))
#############
# Here I tried creating a new_data frame that included one lag of Unemployment, and pass that to the new_data argument of forecast, but it doesn't work either
#
# new_data_ex_lags <- us_change %>%
# tail(1) %>%
# bind_rows(new_data_ex) %>%
# select(colnames(new_data_ex))
#############
fit %>%
select(xregs_lag1) %>%
forecast(new_data = new_data_ex)
#> Warning in if (is_transformed) {: the condition has length > 1 and only the
#> first element will be used
#> # A fable: 4 x 5 [1Q]
#> # Key: .model [1]
#> .model Quarter .distribution .mean[,"Consumption… [,"Income"] Unemployment
#> <chr> <qtr> <dist> <dbl> <dbl> <dbl>
#> 1 xregs_lag1 2019 Q3 MVN[2] 0.548 0.657 0.00101
#> 2 xregs_lag1 2019 Q4 MVN[2] 0.679 0.316 0.00101
#> 3 xregs_lag1 2020 Q1 MVN[2] 0.763 0.832 0.00101
#> 4 xregs_lag1 2020 Q2 MVN[2] 0.697 0.733 0.00101
fit %>%
select(xregs_lag1) %>%
report()
#> Series: Consumption, Income
#> Model: VAR(5) w/ mean
#>
#> Coefficients for Consumption:
#> lag(Consumption,1) lag(Income,1) lag(Consumption,2) lag(Income,2)
#> 0.1156 0.1062 0.1479 0.0079
#> s.e. 0.0772 0.0483 0.0753 0.0509
#> lag(Consumption,3) lag(Income,3) lag(Consumption,4) lag(Income,4)
#> 0.2248 -0.0207 -0.0729 -0.0544
#> s.e. 0.0730 0.0499 0.0746 0.0500
#> lag(Consumption,5) lag(Income,5) constant Unemployment
#> -0.0217 0.0327 0.3923 -0.8602
#> s.e. 0.0708 0.0491 0.0923 0.1331
#> lag(Unemployment, 1)
#> 0.4563
#> s.e. 0.1402
#>
#> Coefficients for Income:
#> lag(Consumption,1) lag(Income,1) lag(Consumption,2) lag(Income,2)
#> 0.3715 -0.2991 0.0836 -0.0410
#> s.e. 0.1212 0.0758 0.1182 0.0799
#> lag(Consumption,3) lag(Income,3) lag(Consumption,4) lag(Income,4)
#> 0.4531 -0.1445 0.2481 -0.2475
#> s.e. 0.1145 0.0783 0.1170 0.0785
#> lag(Consumption,5) lag(Income,5) constant Unemployment
#> -0.1270 -0.1878 0.6142 -0.1100
#> s.e. 0.1111 0.0771 0.1449 0.2089
#> lag(Unemployment, 1)
#> -0.0401
#> s.e. 0.2201
#>
#> Residual covariance matrix:
#> Consumption Income
#> Consumption 0.2602 0.1341
#> Income 0.1341 0.6410
#>
#> log likelihood = -350.43
#> AIC = 760.86 AICc = 772.34 BIC = 858.74
Created on 2022-07-23 by the reprex package (v2.0.1)

read file from google drive

I have a spreadsheet uploaded as a CSV file in Google Drive, unlocked so users can read from it.
This is the link to the csv file:
https://docs.google.com/spreadsheets/d/170235QwbmgQvr0GWmT-8yBsC7Vk6p_dmvYxrZNfsKqk/edit?usp=sharing
I am trying to read it from R but I am getting a long list of error messages. I am using:
id = "170235QwbmgQvr0GWmT-8yBsC7Vk6p_dmvYxrZNfsKqk"
read.csv(sprintf("https://docs.google.com/spreadsheets/d/uc?id=%s&export=download", id))
Could someone suggest how to read files from google drive directly into R?
I would try to publish the sheet as a CSV file (doc), and then read it from there.
It seems like your file is already published as a CSV. So, this should work. (Note that the URL ends with /pub?output=csv)
read.csv("https://docs.google.com/spreadsheets/d/170235QwbmgQvr0GWmT-8yBsC7Vk6p_dmvYxrZNfsKqk/pub?output=csv")
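If you prefer to keep the file id in a variable, as in the question, the same published-CSV URL can be built with sprintf() (note the corrected function name and URL pattern):

```r
id <- "170235QwbmgQvr0GWmT-8yBsC7Vk6p_dmvYxrZNfsKqk"
url <- sprintf("https://docs.google.com/spreadsheets/d/%s/pub?output=csv", id)
dat <- read.csv(url)
```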
To read the CSV file faster, you can use vroom, which is even faster than fread(). See here.
Now using vroom,
library(vroom)
vroom("https://docs.google.com/spreadsheets/d/170235QwbmgQvr0GWmT-8yBsC7Vk6p_dmvYxrZNfsKqk/pub?output=csv")
#> Rows: 387048 Columns: 14
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (6): StationCode, SampleID, WeatherCode, OrganismCode, race, race2
#> dbl (7): WaterTemperature, Turbidity, Velocity, ForkLength, Weight, Count, ...
#> date (1): SampleDate
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 387,048 × 14
#> StationCode SampleDate SampleID WeatherCode WaterTemperature Turbidity
#> <chr> <date> <chr> <chr> <dbl> <dbl>
#> 1 Gate 11 2000-04-25 116_00 CLD 13.1 2
#> 2 Gate 5 1995-04-26 117_95 CLR NA 2
#> 3 Gate 2 1995-04-21 111_95 W 10.4 12
#> 4 Gate 6 2008-12-13 348_08 CLR 49.9 1.82
#> 5 Gate 5 1999-12-10 344_99 CLR 7.30 1.5
#> 6 Gate 6 2012-05-25 146_12 CLR 55.5 1.60
#> 7 Gate 10 2011-06-28 179_11 RAN 57.3 3.99
#> 8 Gate 11 1996-04-25 116_96 CLR 13.8 21
#> 9 Gate 9 2007-07-02 183_07 CLR 56.6 2.09
#> 10 Gate 6 2009-06-04 155_09 CLR 58.6 3.08
#> # … with 387,038 more rows, and 8 more variables: Velocity <dbl>,
#> # OrganismCode <chr>, ForkLength <dbl>, Weight <dbl>, Count <dbl>,
#> # race <chr>, year <dbl>, race2 <chr>
Created on 2022-07-08 by the reprex package (v2.0.1)

How to merge columns with the same names from two different data frames and print the entries they have in common

I currently have this code:
install.packages(c("httr", "jsonlite", "tidyverse"))
library(httr)
library(jsonlite)
library(tidyverse)
res1<-GET("https://rss.applemarketingtools.com/api/v2/us/music/most-played/100/songs.json")
res1
rawToChar(res1$content)
data1 = fromJSON(rawToChar(res1$content))
us100<-data1$feed$results
res2 <- GET("https://rss.applemarketingtools.com/api/v2/gb/music/most-played/100/songs.json")
data2<-fromJSON(rawToChar(res2$content))
uk100<-data2$feed$results
I want to compare the two data frames and make a new one containing the artist names and song names that both data frames have in common. How do I do this?
I think you're just looking for an inner_join
us100 %>% inner_join(uk100, by = "id") %>% as_tibble()
#> # A tibble: 16 x 21
#> artistName.x id name.x releaseDate.x kind.x artistId.x artistUrl.x
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Jack Harlow 1618~ First~ 2022-04-08 songs 1047679432 https://mu~
#> 2 Harry Styles 1615~ As It~ 2022-03-31 songs 471260289 https://mu~
#> 3 Lil Baby 1618~ In A ~ 2022-04-08 songs 1276656483 https://mu~
#> 4 Lauren Spencer-Smith 1618~ Flowe~ 2022-04-14 songs 1462708784 https://mu~
#> 5 Glass Animals 1508~ Heat ~ 2020-06-29 songs 528928008 https://mu~
#> 6 Carolina Gaitán - ~ 1594~ We Do~ 2021-11-19 songs 1227636438 https://mu~
#> 7 The Kid LAROI & Jus~ 1574~ STAY 2021-07-09 songs 1435848034 https://mu~
#> 8 Frank Ocean 1440~ Lost 2012-07-10 songs 442122051 https://mu~
#> 9 Elton John & Dua Li~ 1578~ Cold ~ 2021-08-13 songs 54657 https://mu~
#> 10 Tate McRae 1606~ she's~ 2022-02-04 songs 1446365464 https://mu~
#> 11 Adele 1590~ Easy ~ 2021-10-14 songs 262836961 https://mu~
#> 12 Lil Tjay 1613~ In My~ 2022-04-01 songs 1436446949 https://mu~
#> 13 Lizzo 1619~ About~ 2022-04-14 songs 472949623 https://mu~
#> 14 Tiësto & Ava Max 1590~ The M~ 2021-11-04 songs 4091218 https://mu~
#> 15 Ed Sheeran 1581~ Shive~ 2021-09-09 songs 183313439 https://mu~
#> 16 The Weeknd 1488~ Blind~ 2019-11-29 songs 479756766 https://mu~
#> # ... with 14 more variables: contentAdvisoryRating.x <chr>,
#> # artworkUrl100.x <chr>, genres.x <list>, url.x <chr>, artistName.y <chr>,
#> # name.y <chr>, releaseDate.y <chr>, kind.y <chr>, artistId.y <chr>,
#> # artistUrl.y <chr>, contentAdvisoryRating.y <chr>, artworkUrl100.y <chr>,
#> # genres.y <list>, url.y <chr>
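If you only want the shared artist and song names rather than all 21 columns, one option (a sketch; the .x suffix comes from the join above) is to select them after joining:

```r
us100 %>%
  inner_join(uk100, by = "id") %>%
  transmute(artistName = artistName.x, name = name.x) %>%
  as_tibble()
```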

split char currency into two separate columns in data frame

I have this data frame df
Items Item Code Prices
1 Beds 1630 $135.60
2 Big Shelve 1229 89.5USD
3 Small Shelve 1229 ¥3680.03
4 Chair 445 92.63€
5 Desk 802 206.43 euro
6 Lamp 832 25307.1 JPY
I want to split the Prices column into three columns: Prices, Currency, and Exchange rate from USD, like this:
Items Item Code Prices Currency Exchange rates
1 Beds 1630 135.60 USD 1.00
2 Big Shelve 1229 89.50 USD 1.00
3 Small Shelve 1229 3680.03 JPY 115.71
4 Chair 445 92.63 EUR 0.90
5 Desk 802 206.43 EUR 0.90
6 Lamp 832 25307.10 JPY 115.71
I tried using tidyr::separate(), but it would split at the comma instead.
If I try using gsub(), it gives me this error:
> df2 <- df %>%
+ mutate(price = as.numeric(gsub'[$,€,¥,]','', df$Col3))
Error: unexpected string constant in:
"df2 <- df %>%
mutate(price = as.numeric(gsub'[$,€,¥,]'"
Any ideas what to do? Also, how would I be able to match the currency to the correct items?
This should solve the problem. Using the quantmod package, you can get the current exchange rate and add that into the data:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(stringr)
library(tidyr)
library(quantmod)
#> Loading required package: xts
#> Loading required package: zoo
#>
#> Attaching package: 'zoo'
#> The following objects are masked from 'package:base':
#>
#> as.Date, as.Date.numeric
#>
#> Attaching package: 'xts'
#> The following objects are masked from 'package:dplyr':
#>
#> first, last
#> Loading required package: TTR
#> Registered S3 method overwritten by 'quantmod':
#> method from
#> as.zoo.data.frame zoo
dat <- tibble::tribble(
~Items, ~"Item Code", ~Prices,
"Beds", 1630, "$135.60",
"Big Shelve", 1229, "89.5USD",
"Small Shelve", 1229, "¥3680.03",
"Chair", 445, "92.63€",
"Desk", 802, "206.43 euro",
"Lamp", 832, "25307.1 JPY")
dat <- dat %>%
mutate(currency = c(trimws(str_extract_all(Prices, "[^\\d\\.]+", simplify = TRUE))),
currency = case_when(currency %in% c("€", "euro") ~ "EUR",
currency == "$" ~ "USD",
currency == "¥" ~ "JPY",
TRUE ~ currency),
Prices = as.numeric(str_extract_all(Prices, "\\d+\\.\\d+", simplify=TRUE)),
xr = paste0("USD", currency, "=X")) %>%
left_join(getQuote(unique(.$xr)) %>% as_tibble(rownames = "xr") %>% select(xr, Last)) %>%
select(-xr) %>%
rename("Exchange rates" = "Last")
#> Joining, by = "xr"
dat
#> # A tibble: 6 × 5
#> Items `Item Code` Prices currency `Exchange rates`
#> <chr> <dbl> <dbl> <chr> <dbl>
#> 1 Beds 1630 136. USD 1
#> 2 Big Shelve 1229 89.5 USD 1
#> 3 Small Shelve 1229 3680. JPY 116.
#> 4 Chair 445 92.6 EUR 0.902
#> 5 Desk 802 206. EUR 0.902
#> 6 Lamp 832 25307. JPY 116.
Created on 2022-03-03 by the reprex package (v2.0.1)
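As an aside, the "unexpected string constant" error in the question comes from calling gsub without parentheses. A minimal corrected version, stripping every character that is not a digit or a decimal point, looks like this:

```r
prices <- c("$135.60", "89.5USD", "¥3680.03", "25307.1 JPY")
# keep only digits and the decimal point, then convert
as.numeric(gsub("[^0-9.]", "", prices))
# returns c(135.6, 89.5, 3680.03, 25307.1)
```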

step_pca() arguments are not being applied

I'm new to tidymodels, but apparently the step_pca() arguments such as num_comp or threshold are not being applied when the recipe is trained. As in the example below, I'm still getting 4 components despite setting num_comp = 2.
library(tidyverse)
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
rec <- recipe( ~ ., data = USArrests) %>%
step_normalize(all_numeric()) %>%
step_pca(all_numeric(), num_comp = 2)
prep(rec) %>% tidy(number = 2, type = "coef") %>%
pivot_wider(names_from = component, values_from = value, id_cols = terms)
#> # A tibble: 4 x 5
#> terms PC1 PC2 PC3 PC4
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Murder -0.536 0.418 -0.341 0.649
#> 2 Assault -0.583 0.188 -0.268 -0.743
#> 3 UrbanPop -0.278 -0.873 -0.378 0.134
#> 4 Rape -0.543 -0.167 0.818 0.0890
The full PCA is determined (so you can still compute the variances of each term) and num_comp only specifies how many of the components are retained as predictors. If you want to specify the maximal rank, you can pass that through options:
library(recipes)
#> Loading required package: dplyr
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
#>
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#>
#> step
rec <- recipe( ~ ., data = USArrests) %>%
step_normalize(all_numeric()) %>%
step_pca(all_numeric(), num_comp = 2, options = list(rank. = 2))
prep(rec) %>% tidy(number = 2, type = "coef")
#> # A tibble: 8 × 4
#> terms value component id
#> <chr> <dbl> <chr> <chr>
#> 1 Murder -0.536 PC1 pca_AoFOm
#> 2 Assault -0.583 PC1 pca_AoFOm
#> 3 UrbanPop -0.278 PC1 pca_AoFOm
#> 4 Rape -0.543 PC1 pca_AoFOm
#> 5 Murder 0.418 PC2 pca_AoFOm
#> 6 Assault 0.188 PC2 pca_AoFOm
#> 7 UrbanPop -0.873 PC2 pca_AoFOm
#> 8 Rape -0.167 PC2 pca_AoFOm
Created on 2022-01-12 by the reprex package (v2.0.1)
You could also control this via the tol argument from stats::prcomp(), also passed in as an option.
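For instance, a sketch passing tol through options (the 0.5 cutoff is illustrative; prcomp() omits components whose standard deviation falls below tol times that of the first component):

```r
library(recipes)

rec <- recipe(~ ., data = USArrests) %>%
  step_normalize(all_numeric()) %>%
  step_pca(all_numeric(), num_comp = 2, options = list(tol = 0.5))

# only components passing the tolerance are computed
prep(rec) %>% tidy(number = 2, type = "coef")
```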
If you bake the recipe, it seems to work as intended, but I don't know what you aim to achieve afterwards.
library(tidyverse)
library(tidymodels)
USArrests <- USArrests %>%
rownames_to_column("Countries")
rec <-
recipe( ~ ., data = USArrests) %>%
step_normalize(all_numeric()) %>%
step_pca(all_numeric(), num_comp = 2)
prep(rec) %>%
bake(new_data = NULL)
#> # A tibble: 50 x 3
#> Countries PC1 PC2
#> <fct> <dbl> <dbl>
#> 1 Alabama -0.976 1.12
#> 2 Alaska -1.93 1.06
#> 3 Arizona -1.75 -0.738
#> 4 Arkansas 0.140 1.11
#> 5 California -2.50 -1.53
#> 6 Colorado -1.50 -0.978
#> 7 Connecticut 1.34 -1.08
#> 8 Delaware -0.0472 -0.322
#> 9 Florida -2.98 0.0388
#> 10 Georgia -1.62 1.27
#> # ... with 40 more rows
Created on 2022-01-11 by the reprex package (v2.0.1)
