Dividing each row of a dataframe by the sum of squared variables

Is there a clean way to divide each row of a data frame by the sum of its squared variables, in a time series data set where the events are the rows and the columns are the variables? At the moment, my method is the following:
for (i in 1:nrow(base)) {
  base[i, ] <- base[i, ] / (as.numeric(t(c(base[i, ]))) %*% as.numeric(t(c(base[i, ]))))
}
The only way I found to use the %*% operator is the one shown above, which relies on the as.numeric/t/c contortion and doesn't seem clean.
For the sake of a reproducible example:
base <-structure(list(Var1 = c(920, 734, 1001, 1033, 752, 837, 734,
817), Var2 = c(4861, 4966, 4855, 3835, 4782, 5348, 4648, 4595
), Var3 = c(5011, 4618, 2718, 4344, 4872, 5076, 4678, 4563),
Var4 = c(4785, 4610, 4697, 4149, 4693, 4866, 4517, 3271),
Var5 = c(5101, 4220, 4444, 4301, 965, 4557, 3524, 4201),
Var6 = c(5059, 4048, 4217, 4397, 3711, 4032, 5478, 4051),
Var7 = c(2134, 1766, 1640, 1837, 1662, 1711, 1838, 1625)), .Names = c("Var1",
"Var2", "Var3", "Var4", "Var5", "Var6", "Var7"), class = c("data.table",
"data.frame"), row.names = c(NA, -8L), .internal.selfref = <pointer: 0x1b8ec38>)

I would also question what you mean by standardize; however, here is a simpler way to accomplish what you're doing in the for loop:
sweep(base, 1, rowSums(base^2), `/`)
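Here rowSums(base^2) is, for each row, the same dot product of the row with itself that the %*% call computes. A quick sanity check against the first iteration of the loop (a minimal sketch; norm_base and v are just illustrative names):
norm_base <- sweep(base, 1, rowSums(base^2), `/`)
# compare row 1 with the hand-rolled version
v <- as.numeric(unlist(base[1, ]))
all.equal(as.numeric(unlist(norm_base[1, ])), v / sum(v^2))
#> [1] TRUE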

Related

Forecasting with ARIMA and dummy variables

I am attempting to include a dummy regressor that marks the beginning of the pandemic and to run a regression with ARIMA errors.
My dataset covers breaking-and-entering incidents in Toronto from 2014 to 2021. The issue is that the trend takes a turn around 2020 because of COVID-19.
auto.arima() gives me an ARIMA(1,0,1) model, as it is not taking the impact of COVID-19 into account and forecasts the implied return to the series average.
When I try to include a dummy regressor that marks the beginning of the pandemic and run a regression with ARIMA errors, I get the following error:
In ifelse(time(BEDATA_GROUPEDtsssarima) >= yearmonth("2020-03"), :
Incompatible methods ("Ops.ts", ">=.vctrs_vctr") for ">="
Code:
# Create a binary time series that indicates the start of the pandemic
library(fpp3)
library(forecast)
library(zoo)
# Check if timeseries
class(BEDATA_GROUPED)
#Convert timeseries
BEDATA_GROUPEDtsssarima <- ts(BEDATA_GROUPED[,2], frequency = 12, start = c(2014, 1))
class(BEDATA_GROUPEDtsssarima)
#Plot
forecast::autoplot(BEDATA_GROUPEDtsssarima)
# Assume that the pandemic began in March 2020
pandemic_dummy <- ifelse(time(BEDATA_GROUPEDtsssarima) >= yearmonth("2020-03"), 1, 0)
# Use auto.arima() to fit an ARIMA model with the dummy variable as an exogenous variable
beddatamodel <- auto.arima(BEDATA_GROUPEDtsssarima, xreg = pandemic_dummy, ic="aic", trace = TRUE)
# Create a binary time series for the forecast period that includes the pandemic dummy variable
forecast_period <- time(BEDATA_GROUPEDtsssarima)["2022/01/01/":"2023/12/31/"]
pandemic_dummy_forecast <- ifelse(forecast_period >= yearmonth("2020-03"), 1, 0)
# Use the forecast()
forecast(pandemic_dummy_forecast)
Dataset:
structure(list(occurrence_yrmn = c("2014-January", "2014-February",
"2014-March", "2014-April", "2014-May", "2014-June", "2014-July",
"2014-August", "2014-September", "2014-October", "2014-November",
"2014-December", "2015-January", "2015-February", "2015-March",
"2015-April", "2015-May", "2015-June", "2015-July", "2015-August",
"2015-September", "2015-October", "2015-November", "2015-December",
"2016-January", "2016-February", "2016-March", "2016-April",
"2016-May", "2016-June", "2016-July", "2016-August", "2016-September",
"2016-October", "2016-November", "2016-December", "2017-January",
"2017-February", "2017-March", "2017-April", "2017-May", "2017-June",
"2017-July", "2017-August", "2017-September", "2017-October",
"2017-November", "2017-December", "2018-January", "2018-February",
"2018-March", "2018-April", "2018-May", "2018-June", "2018-July",
"2018-August", "2018-September", "2018-October", "2018-November",
"2018-December", "2019-January", "2019-February", "2019-March",
"2019-April", "2019-May", "2019-June", "2019-July", "2019-August",
"2019-September", "2019-October", "2019-November", "2019-December",
"2020-January", "2020-February", "2020-March", "2020-April",
"2020-May", "2020-June", "2020-July", "2020-August", "2020-September",
"2020-October", "2020-November", "2020-December", "2021-January",
"2021-February", "2021-March", "2021-April", "2021-May", "2021-June",
"2021-July", "2021-August", "2021-September", "2021-October",
"2021-November", "2021-December"), MCI = c(586, 482, 567, 626,
625, 610, 576, 634, 636, 663, 657, 556, 513, 415, 510, 542, 549,
618, 623, 666, 641, 632, 593, 617, 541, 523, 504, 536, 498, 552,
522, 519, 496, 541, 602, 570, 571, 492, 560, 525, 507, 523, 593,
623, 578, 657, 683, 588, 664, 582, 619, 512, 630, 644, 563, 654,
635, 732, 639, 748, 719, 567, 607, 746, 739, 686, 805, 762, 696,
777, 755, 675, 704, 617, 732, 609, 464, 487, 565, 609, 513, 533,
505, 578, 526, 418, 428, 421, 502, 452, 509, 492, 478, 469, 457,
457)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-96L))
I see you have used the fpp3 library, so I've had a go with the tidyverts tools, trying three models: a plain ARIMA, a plain regression on the pandemic dummy variable, and a dynamic regression combining ARIMA errors with the dummy variable.
Hope this helps! :-)
library(tsibble)
library(fable)
library(fabletools)
library(feasts)
library(dplyr)
Create a tsibble:
BEDATA_GROUPED <- BEDATA_GROUPED |>
mutate(Month = yearmonth(occurrence_yrmn)) |>
as_tsibble(index = Month)
autoplot(BEDATA_GROUPED)
Assume that the pandemic began in March 2020 and create a dummy variable:
pandemic_start <- yearmonth("2020-03-01")
BEDATA_GROUPED <- BEDATA_GROUPED |>
mutate(pandemic_dummy = ifelse(Month >= pandemic_start, 1, 0))
Work up a plain ARIMA:
BEDATA_GROUPED_arima <- BEDATA_GROUPED |>
model(ARIMA(MCI, stepwise = FALSE))
BEDATA_GROUPED_arima |>
gg_tsresiduals()
BEDATA_GROUPED_arima |>
forecast(h = 5) |>
autoplot()
Work up a plain regression:
BEDATA_GROUPED_TSLM <- BEDATA_GROUPED |>
model(TSLM(MCI ~ pandemic_dummy)) |>
report()
BEDATA_GROUPED_TSLM |>
gg_tsresiduals()
Make a data set to predict on:
new_data <- structure(list(Month = structure(c(18993, 19024, 19052, 19083,
19113), class = c("yearmonth", "vctrs_vctr")), pandemic_dummy = c(1,
1, 1, 1, 1)), class = c("tbl_ts", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -5L), key = structure(list(.rows = structure(list(
1:5), ptype = integer(0), class = c("vctrs_list_of", "vctrs_vctr",
"list"))), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-1L)), index = structure("Month", ordered = TRUE), index2 = "Month", interval = structure(list(
year = 0, quarter = 0, month = 1, week = 0, day = 0, hour = 0,
minute = 0, second = 0, millisecond = 0, microsecond = 0,
nanosecond = 0, unit = 0), .regular = TRUE, class = c("interval",
"vctrs_rcrd", "vctrs_vctr")))
Forecast plain regression:
BEDATA_GROUPED_TSLM |>
forecast(new_data = new_data) |>
autoplot()
Work up a dynamic regression, with ARIMA and the pandemic dummy variable:
BEDATA_GROUPED_dyn_ARIMA <- BEDATA_GROUPED |>
model(ARIMA(MCI ~ pandemic_dummy)) |>
report()
BEDATA_GROUPED_dyn_ARIMA |>
gg_tsresiduals()
BEDATA_GROUPED_dyn_ARIMA |>
forecast(new_data = new_data) |>
autoplot()
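For completeness, if you'd rather stay inside the forecast package: the original error happens because time() returns a plain ts of fractional years, while yearmonth() returns a vctrs-backed class, and R cannot reconcile the two ">=" methods. Comparing plain numbers sidesteps the clash. A minimal sketch, assuming a March 2020 cut-off and a 24-month horizon:
library(forecast)
# time() gives fractional years; March 2020 is 2020 + 2/12
pandemic_dummy <- as.numeric(time(BEDATA_GROUPEDtsssarima) >= 2020 + 2 / 12)
beddatamodel <- auto.arima(BEDATA_GROUPEDtsssarima, xreg = pandemic_dummy,
                           ic = "aic", trace = TRUE)
# future values of the regressor are all 1 over a post-pandemic horizon
forecast(beddatamodel, xreg = rep(1, 24))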

R: Duplicate column with NA values created with bind_rows

Combining multiple data frames with bind_rows() produced an unwanted duplicate column in the resulting data frame:
all_trips_raw <-
bind_rows(X2020_08, X2020_06, X2020_05,
X2020_04, X2020_03, X2020_02, X2020_01,
X2019_12_Dur, X2019_11, X2019_10, X2019_09)
The X data frames are the result of importing 12 csv files uploaded into RStudio Cloud; both the csv files and the resulting data frames have 19 columns. The column in question is Distance_Miles. The column with data came from the separate data frames, and the new data frame has one filled with NA.
structure(list(RouteID = c(13442256, 13442257, 13442261, 13442275,
13442279), PaymentPlan = c("Casual", "Casual", "Subscriber",
"Subscriber", "Casual"), StartHub = c("SW Yamhill at Director Park",
"SW Yamhill at Director Park", NA, "NW Station at Irving", NA
), StartLatitude = c(45.51898132, 45.51898132, 45.5133558, 45.5282777,
45.5167987), StartLongitude = c(-122.6812685, -122.6812685, -122.6828884,
-122.6766282, -122.6729466), StartDate = c("8/1/2020", "8/1/2020",
"8/1/2020", "8/1/2020", "8/1/2020"), StartTime = structure(c(240,
300, 480, 1680, 2040), class = c("hms", "difftime"), units = "secs"),
EndHub = c("SW Yamhill at Director Park", "SW Yamhill at Director Park",
NA, NA, "SE Ladd at Hawthorne"), EndLatitude = c(45.51898132,
45.51898132, 45.5252069, 45.5266354, 45.5120818), EndLongitude = c(-122.6812685,
-122.6812685, -122.6765159, -122.6765624, -122.6533493),
EndDate = c("8/1/2020", "8/1/2020", "8/1/2020", "8/1/2020",
"8/1/2020"), EndTime = structure(c(2100, 2100, 1260, 1740,
2820), class = c("hms", "difftime"), units = "secs"), TripType = c(NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_
), BikeID = c(5995, 6380, 7317, 6177, 6632), BikeName = c("0916 BIKETOWN",
"0694 BIKETOWN", "9890 ASCEND BIKE", "0367 PBOT BIKETOWN",
"0278 BIKETOWN"), Distance_Miles_ = c(1.85, 1.88, 1.05, 0.11,
1.27), Duration = structure(c(1837, 1771, 768, 110, 782), class = c("hms",
"difftime"), units = "secs"), RentalAccessPath = c("keypad",
"keypad", "keypad", "keypad", "mobile"), MultipleRental = c(FALSE,
FALSE, FALSE, FALSE, FALSE), Distance_Miles = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_)), row.names = c(NA,
-5L), class = c("tbl_df", "tbl", "data.frame"))
Would importing the csv files as a dplyr tibble, instead of the base data.frame implementation, make a difference when using bind_rows()?
Should the bind_rows() statement have been written differently, to prevent the duplicate column with NA values?
I also tried this to remove the added column:
# Find Duplicate Column Names
duplicated_names <- duplicated(colnames(my_df))
# Remove Duplicate Column Names
my_df[!duplicated_names]
where my_df was all_trips_raw
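Deduplicating by name won't help here, because the two columns are not duplicates by name: the dput shows both Distance_Miles_ (trailing underscore, holding the data) and Distance_Miles (all NA). That suggests the header of at least one source csv differs, so bind_rows() correctly treats them as distinct columns. The cleanest fix is to rename the stray column in the offending data frame (or repair the csv header) before binding; a post-hoc repair is also possible. A sketch, assuming dplyr (coalesce() takes the first non-NA of the two):
library(dplyr)
all_trips_raw <- all_trips_raw %>%
  mutate(Distance_Miles = coalesce(Distance_Miles, Distance_Miles_)) %>%
  select(-Distance_Miles_)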

R Function help to obtain only the Unique values to then obtain basic metrics

I am trying to obtain the unique values for Number.Full in the below.
n_distinct() gives me the distinct count of Number.Full, but that count doesn't feed into the min()/max()/mean() calculations.
I have tried putting distinct() and unique() inside the filter(), and placing them after the filter() as a new step.
But I can't seem to get it to feed in and work properly.
Any help or suggestions are greatly welcome.
Edit 1 for dput data:
nRequests_byYearMth <- df_Raw_Data %>%
filter(Specimen.Number.Left.2 == "AB") %>%
group_by(Rec_Period_Month_Yr) %>%
summarise(Number.Full = n_distinct(Number.Full), min(TaT_Coll_to_Auth), max(TaT_Coll_to_Auth), mean(TaT_Coll_to_Auth)) %>%
arrange(Rec_Period_Month_Yr)
structure(list(Receive.Date = c("2019-09-20", "2019-09-20", "2019-06-24",
"2019-05-23", "2019-09-05", "2019-07-30"), Number.Full = c("04023119",
"04023119", "02634719", "02190819", "00273419",
"03234219"), Ex.No = c("", "", "19P08645QQ5",
"", "", ""), Order.Comment = c("CT11", "CT11", "HR", "SHU",
"", "ICCZZ"), Coll.Date.Source = c("1931-02-04", "1931-02-04",
"1949-01-04", "2000-12-23", "2012-09-05", "2015-05-02"), Location.Code = c("FH7895SS",
"FHSA785", "VB97S", "RV0158", "FH29567", "N1"), Loc.Des = c("FWC",
"FU", "VHB", "RDO",
"F29", "NSBRU"), Tissue.Code = c("LEX",
"LEX", "RC", "SKL", "NPL", "RC"), T.Name = c("ELung",
"ELung", "Referred", "Skin", "Pleural",
"Referred Case"), Current.Status = c("S", "S", "S", "S",
"S", "S"), Date.Updated = c("2019-10-20", "2019-10-20",
"2019-06-24", "2019-05-28", "2019-09-13", "2019-08-07"), Reporting.1 = c("LYNN",
"LYNN", "ROBCM", "HUSA", "SPOE", "CPATH"), Reporting.2 = c("MAJJ",
"MAJJ", "", "", "ROBB", ""), Reporting.3 = c("",
"", "", "", "FERB", ""), Reporting.4 = c("", "",
"", "", "", ""), Reporting.5 = c("", "", "", "",
"", ""), Number.Left.2 = c("AB", "AB", "AB", "AB", "CN",
"AB"), Auth_Period_Month_Yr = c("2019-10", "2019-10", "2019-06",
"2019-05", "2019-09", "2019-08"), Rec_Period_Month_Yr = c("2019-09",
"2019-09", "2019-06", "2019-05", "2019-09", "2019-07"), TaT_Coll_to_Auth = structure(c(32400,
32400, 25738, 6730, 2564, 1558), class = "difftime", units = "days"),
M.Weighting = c(50L, 50L, 0L, 30L, NA, 0L)), row.names = c(NA,
6L), class = "data.frame")
From the nRequests_byYearMth pipeline I was expecting it to filter() down to only the AB entries, then group those by Rec_Period_Month_Yr; when it was summarised, I had it count the distinct entries with n_distinct(), and the min()/max()/mean() would also reflect the filtered results.
But when I use Excel to look at the data extract, it doesn't seem to be filtering correctly.
I am thinking that I need to have the filter applied to the summarise() somehow.
Edit with outputs:
The resulting output is:
structure(list(Rec_Period_Month_Yr = c("2019-04", "2019-05",
"2019-06", "2019-07", "2019-08", "2019-09", "2019-10", "2019-11",
"2019-12", "2020-01", "2020-02", "2020-03"), Specimen.Number.Full = c(4881L,
4929L, 4902L, 5289L, 4815L, 5043L, 5697L, 5051L, 4552L, 5434L,
4917L, 4556L), `min(TaT_Coll_to_Auth)` = structure(c(0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0), class = "difftime", units = "days"),
`max(TaT_Coll_to_Auth)` = structure(c(368, 6730, 25738, 1558,
222, 32400, 374, 150, 320, 97, 382, 60), class = "difftime", units = "days"),
`mean(TaT_Coll_to_Auth)` = structure(c(9.80235422940049,
10.768904109589, 14.8278848840458, 10.0686706074708, 10.2533425223983,
19.6828624240824, 11.8121527777778, 10.4033579583613, 10.4007004231723,
9.04840344652813, 8.94940393678958, 8.2197571578474), class = "difftime", units = "days")), row.names = c(NA,
-12L), class = c("tbl_df", "tbl", "data.frame"))
The expected output I want is below, but I can only create this if I look at only the AB entries, and summarise() doesn't seem to do that for the min()/max()/mean(); instead it looks at all the entries in the column.
I need it to look at only the entries matching the AB filter(), for all the summarised items.
(The last Max entry shows as 60 in R but, if properly filtered, would show as 50.)
structure(list(Year.and.Mth = c("2019-4", "2019-5", "2019-6",
"2019-7", "2019-8", "2019-9", "2019-10", "2019-11", "2019-12",
"2020-1", "2020-2", "2020-3"), Number.Full = c(4881, 4929, 4902,
5289, 4815, 5043, 5697, 5051, 4552, 5434, 4917, 4556), Max = c(113,
6730, 25738, 1558, 156, 32400, 374, 109, 320, 97, 382, 50), Mean = c(7.97705388240115,
9.34286873605194, 13.514891880865, 8.39194554736245, 7.72294911734164,
15.2502478683323, 9.15850447604002, 8.85389031874876, 9.00021968365554,
7.76573426573427, 7.97335773845841, 7.350526778)), class = "data.frame", row.names = c(NA,
-12L))
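One guess at what is happening: the raw data holds several rows per Number.Full (e.g. 04023119 appears twice in the dput), so min()/max()/mean() run over every row, while n_distinct() collapses the duplicates. Deduplicating before summarising keeps all the metrics consistent. A sketch, assuming the filter column is named Number.Left.2 as in the dput and that the first row per specimen number is representative:
library(dplyr)
nRequests_byYearMth <- df_Raw_Data %>%
  filter(Number.Left.2 == "AB") %>%
  distinct(Number.Full, .keep_all = TRUE) %>%  # one row per specimen number
  group_by(Rec_Period_Month_Yr) %>%
  summarise(Number.Full = n(),
            Min = min(TaT_Coll_to_Auth),
            Max = max(TaT_Coll_to_Auth),
            Mean = mean(TaT_Coll_to_Auth)) %>%
  arrange(Rec_Period_Month_Yr)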

Is there an explanation for this R function merge() error?

I am trying to use the R merge function to combine two data.frames, but keep getting the following error:
Error in fix.by(by.y, y) : 'by' must specify a uniquely valid column
I am not sure what this error means or how to resolve it.
My code thus far is the following:
library(tidyverse)  # read_csv, str_split, %>%
library(gender)     # gender_df
movies <- read_csv("movies.csv")
firsts = vector(length = nrow(movies))
for (i in 1:nrow(movies)) {
  firsts[i] = movies$director[i] %>% str_split(" ", n = 2) %>% unlist %>% .[1]
}
movies$firsts = firsts
movies <- movies[-c(137, 147, 211, 312, 428, 439, 481, 555, 602, 830, 850, 1045, 1080, 1082, 1085, 1096, 1255, 1258, 1286, 1293, 1318, 1382, 1441, 1456, 1494, 1509, 1703, 1719, 1735, 1944, 1968, 1974, 1977, 2098, 2197, 2409, 2516, 2546, 2722, 2751, 2988, 3191,
3227, 3270, 3283, 3285, 3286, 3292, 3413, 3423, 3470, 3480, 3511, 3676, 3698, 3826, 3915, 3923, 3954, 4165, 4381, 4385, 4390, 4397, 4573, 4711, 4729, 4774, 4813, 4967, 4974, 5018, 5056, 5258, 5331, 5405, 5450, 5469, 5481, 4573, 5708, 5715, 5786, 5886, 5888, 5933, 5934, 6052, 6091, 6201, 6234, 6236, 6511, 6544, 6551, 6562, 6803, 4052, 4121, 4326),]
movies <- movies[-c(4521,5846),]
g <- gender_df(movies, name_col = "firsts", year_col = "year", method = c("ssa"))
merge(movies, g, by = c("firsts", "name"), all = FALSE)
I think you are giving the by argument an invalid value. Indeed, the documentation says:
By default the data frames are merged on the columns with names they
both have, but separate specifications of the columns can be given by
by.x and by.y. The rows in the two data frames that match on the
specified columns are extracted, and joined together. If there is more
than one match, all possible matches contribute one row each. For the
precise meaning of ‘match’, see match.
The error means that at least one of the names passed to by is not a column of both data frames: movies has firsts but no name, while g (the output of gender_df()) has name but no firsts. In your case, try the following:
merge(x = movies,y = g, by.x = "firsts", by.y = "name", all = FALSE)
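Equivalently, with dplyr (a sketch; same keys as the merge() call above):
library(dplyr)
merged <- inner_join(movies, g, by = c("firsts" = "name"))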

How to determine where first integer/float value starts in a list

I have a data frame with several columns. The last column has NAs for, say, the first 50 rows. There are brute-force methods, but how do I write something that can tell where the first integer/float value starts?
structure(list(col1 = c(646, 574, 590, 671, 618, 529), col2 = c(438,
744, 730, 748, 507, 479), col3 = c(493, 661, 651, 715, 582, 571
), col4 = c(1047, 1252, 1335, 1269, 1185, 1147), col5 = c(883,
1008, 996, 1019, 901, 846), col6 = c(824, 840, 766, 776, 868,
927), col7 = c(727, 685, 708, 779, 717, 721), col8 = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_)), .Names = c("col1",
"col2", "col3", "col4", "col5", "col6", "col7", "col8"), row.names = c(NA,
6L), class = "data.frame")
For the first 7 columns, I iterate through, isolate each column, and put it into a time series model:
for (col in 1:ncol(so)) {
  isoColumn <- so[, col]
  model <- tbats(isoColumn)
}
Is there a programming method/algorithm I can use to tell where the first value is so I can truncate those rows before I plug it into the tbats model?
You could use which(!is.na(x))[1] to locate the first non-NA value, but why not just do
models <- lapply(so, function(x) tbats(na.omit(x)))
?
If you're dealing with large data, Position is considerably faster than which, because it stops as soon as a match is found rather than evaluating the whole vector and then subsetting:
Position(function(x)!is.na(x), x)
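Putting it together with the loop from the question (a sketch; assumes the forecast package is loaded for tbats(), and skips columns that are entirely NA):
library(forecast)
models <- vector("list", ncol(so))
for (col in seq_len(ncol(so))) {
  x <- so[[col]]
  first_val <- Position(Negate(is.na), x)  # index of the first non-NA value
  if (!is.na(first_val)) {
    models[[col]] <- tbats(x[first_val:length(x)])
  }
}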
