Calculate formula over all rows and specific columns of a dataframe in R

I have the following sample dataframe with prices of toys in different shops:
dfData <- data.frame(article = c("Fix", "Foxi", "Stan", "Olli", "Barbie", "Ken", "Hulk"),
                     priceToys1 = c(10, NA, 10.5, NA, 10.7, 11.2, 12.0),
                     priceAllToys = c(NA, 11.4, NA, 11.9, 11.7, 11.1, NA),
                     price123Toys = c(12, 12.4, 12.7, NA, NA, 11.0, 12.1))
Additionally I generate a min price column by adding:
dfData$MinPrice <- apply(dfData[, grep("price", colnames(dfData))], 1, FUN=min, na.rm = TRUE)
So I have this dataframe now:
# article priceToys1 priceAllToys price123Toys MinPrice
#1 Fix 10.0 NA 12.0 10.0
#2 Foxi NA 11.4 12.4 11.4
#3 Stan 10.5 NA 12.7 10.5
#4 Olli NA 11.9 NA 11.9
#5 Barbie 10.7 11.7 NA 10.7
#6 Ken 11.2 11.1 11.0 11.0
#7 Hulk 12.0 NA 12.1 12.0
How do I get additional columns into the dataframe that express each price as a percentage of the row's minimum price? The new column names should also include the shop name.
The result should look like this:
# article priceToys1 PercToys1 priceAllToys PercAllToys price123Toys Perc123Toys MinPrice
#1 Fix 10.0 100.0 NA NA 12.0 120.0 10.0
#2 Foxi NA NA 11.4 100.0 12.4 108.8 11.4
#3 Stan 10.5 100.0 NA NA 12.7 121.0 10.5
#4 Olli NA NA 11.9 100.0 NA NA 11.9
#5 Barbie 10.7 100.0 11.7 109.3 NA NA 10.7
#6 Ken 11.2 101.8 11.1 100.9 11.0 100.0 11.0
#7 Hulk 12.0 100.0 NA NA 12.1 100.8 12.0

Two possible solutions:
1) With the data.table-package:
# load the 'data.table'-package
library(data.table)
# get the column names on which to operate
cols <- names(dfData)[2:4] # or: grep("price", names(dfData), value = TRUE)
# convert dfData to a 'data.table'
setDT(dfData)
# compute the 'fraction'-columns
dfData[, paste0('Perc', gsub('price','',cols)) := lapply(.SD, function(x) round(100 * x / MinPrice, 1))
, .SDcols = cols][]
which gives:
article priceToys1 priceAllToys price123Toys MinPrice PercToys1 PercAllToys Perc123Toys
1: Fix 10.0 NA 12.0 10.0 100.0 NA 120.0
2: Foxi NA 11.4 12.4 11.4 NA 100.0 108.8
3: Stan 10.5 NA 12.7 10.5 100.0 NA 121.0
4: Olli NA 11.9 NA 11.9 NA 100.0 NA
5: Barbie 10.7 11.7 NA 10.7 100.0 109.3 NA
6: Ken 11.2 11.1 11.0 11.0 101.8 100.9 100.0
7: Hulk 12.0 NA 12.1 12.0 100.0 NA 100.8
2) With base R:
cols <- names(dfData)[2:4] # or: grep("price", names(dfData), value = TRUE)
dfData[, paste0('Perc', gsub('price','',cols))] <- round(100 * dfData[, cols] / dfData$MinPrice, 1)
which will get you the same result.

We can use mutate_at from dplyr. Note that funs() is deprecated in current dplyr, so the list(~ ...) form is used instead (the new columns get names like priceToys1_Perc):
library(dplyr)
library(magrittr)
dfData %<>%
  mutate_at(vars(matches("^price")), list(Perc = ~ round(100 * . / MinPrice, 1)))
dfData
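On dplyr >= 1.0 the idiomatic replacement for mutate_at is across(). A sketch, assuming glue expressions are allowed in the .names spec (they are evaluated with .col bound to the current column name):
library(dplyr)
dfData <- dfData %>%
  mutate(across(matches("^price"),
                ~ round(100 * .x / MinPrice, 1),
                .names = "Perc{sub('price', '', .col)}"))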

Time series forecasting by lm() using lapply

I was trying to forecast a time series using lm(), and my data looks like below:
Customer_key date sales
A35 2018-05-13 31
A35 2018-05-20 20
A35 2018-05-27 43
A35 2018-06-03 31
BH22 2018-05-13 60
BH22 2018-05-20 67
BH22 2018-05-27 78
BH22 2018-06-03 55
Converted my df to a list format by
df <- dcast(df, date ~ customer_key,value.var = c("sales"))
df <- subset(df, select = -c(dt))
demandWithKey <- as.list(df)
I am trying to write a function such that it can be applied across all customers:
my_fun <- function(x) {
  fit <- lm(ds_load ~ date, data = df)  ## after converting to a list, the ds_load
                                        ## and date column names are no longer available for the formula
  fit_b <- forecast(fit$fitted.values, h = 20)  ## forecast using lm()
  return(data.frame(c(fit$fitted.values, fit_b[["mean"]])))
}
fcast <- lapply(df, my_fun)
I know the above function doesn't work, but basically I'm looking to get both the fitted values and the forecasted values for grouped data.
I've also tried other methods, such as tslm() after converting to a time-series object, with no luck, although I can get lm() to work on a single customer. Many existing questions/posts cover only fitting the model, whereas I would like to forecast at the same time.
lm() fits a regression model, but here you have a time series, so to forecast the series you should use one of the time-series models (ARMA, ARCH, GARCH, ...). For that you can use the auto.arima() function from the "forecast" package in R.
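For example, a minimal sketch on one customer (the values and horizon are taken from the question; with only four observations auto.arima() will of course fit a very simple model):
library(forecast)
sales_A35 <- ts(c(31, 20, 43, 31))  # customer A35's sales
fit <- auto.arima(sales_A35)
forecast(fit, h = 20)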
I don't know what you're up to exactly, but you could make this less complicated.
Using by avoids the need to reshape your data: it splits your data, e.g. by customer ID as in your case, and applies a function to the subsets (i.e. it's a combination of split and lapply; see ?by).
Since you want to compare fitted and forecasted values in your result, you probably need predict rather than $fitted.values; otherwise the vectors won't be of the same length. Because your independent variable is a date in weekly intervals, you may use seq.Date with the first date as the starting value; the sequence has length equal to the number of actual values (nrow per customer) plus the h= argument of forecast.
For demonstration purposes I add the fitted values as the first column in the following.
library(forecast)  # needed for forecast()
res <- by(dat, dat$cus_key, function(x) {
  H <- 20  ## forecast horizon 'h'
  fit <- lm(sales ~ date, x)
  fitted <- fit$fitted.values
  pred <- predict(fit, newdata = data.frame(
    date = seq(x$date[1], length.out = nrow(x) + H, by = "week")))
  fcst <- c(fitted, forecast(fitted, h = H)$mean)
  fit.na <- `length<-`(unname(fitted), length(pred))  ## padded with NA for demonstration
  return(cbind(fit.na, pred, fcst))
})
Result
res
# dat$cus_key: A28
# fit.na pred fcst
# 1 41.4 41.4 41.4
# 2 47.4 47.4 47.4
# 3 53.4 53.4 53.4
# 4 59.4 59.4 59.4
# 5 65.4 65.4 65.4
# 6 NA 71.4 71.4
# 7 NA 77.4 77.4
# 8 NA 83.4 83.4
# 9 NA 89.4 89.4
# 10 NA 95.4 95.4
# 11 NA 101.4 101.4
# 12 NA 107.4 107.4
# 13 NA 113.4 113.4
# 14 NA 119.4 119.4
# 15 NA 125.4 125.4
# 16 NA 131.4 131.4
# 17 NA 137.4 137.4
# 18 NA 143.4 143.4
# 19 NA 149.4 149.4
# 20 NA 155.4 155.4
# 21 NA 161.4 161.4
# 22 NA 167.4 167.4
# 23 NA 173.4 173.4
# 24 NA 179.4 179.4
# 25 NA 185.4 185.4
# ----------------------------------------------------------------
# dat$cus_key: B16
# fit.na pred fcst
# 1 49.0 49.0 49.0
# 2 47.7 47.7 47.7
# 3 46.4 46.4 46.4
# 4 45.1 45.1 45.1
# 5 43.8 43.8 43.8
# 6 NA 42.5 42.5
# 7 NA 41.2 41.2
# 8 NA 39.9 39.9
# 9 NA 38.6 38.6
# 10 NA 37.3 37.3
# 11 NA 36.0 36.0
# 12 NA 34.7 34.7
# 13 NA 33.4 33.4
# 14 NA 32.1 32.1
# 15 NA 30.8 30.8
# 16 NA 29.5 29.5
# 17 NA 28.2 28.2
# 18 NA 26.9 26.9
# 19 NA 25.6 25.6
# 20 NA 24.3 24.3
# 21 NA 23.0 23.0
# 22 NA 21.7 21.7
# 23 NA 20.4 20.4
# 24 NA 19.1 19.1
# 25 NA 17.8 17.8
# ----------------------------------------------------------------
# dat$cus_key: C12
# fit.na pred fcst
# 1 56.4 56.4 56.4
# 2 53.2 53.2 53.2
# 3 50.0 50.0 50.0
# 4 46.8 46.8 46.8
# 5 43.6 43.6 43.6
# 6 NA 40.4 40.4
# 7 NA 37.2 37.2
# 8 NA 34.0 34.0
# 9 NA 30.8 30.8
# 10 NA 27.6 27.6
# 11 NA 24.4 24.4
# 12 NA 21.2 21.2
# 13 NA 18.0 18.0
# 14 NA 14.8 14.8
# 15 NA 11.6 11.6
# 16 NA 8.4 8.4
# 17 NA 5.2 5.2
# 18 NA 2.0 2.0
# 19 NA -1.2 -1.2
# 20 NA -4.4 -4.4
# 21 NA -7.6 -7.6
# 22 NA -10.8 -10.8
# 23 NA -14.0 -14.0
# 24 NA -17.2 -17.2
# 25 NA -20.4 -20.4
As you can see, prediction and forecast yield the same values, since both methods are based on the same single explanatory variable date in this case.
Toy data:
set.seed(42)
dat <- transform(expand.grid(cus_key = paste0(LETTERS[1:3], sample(12:43, 3)),
                             date = seq.Date(as.Date("2018-05-13"), length.out = 5, by = "week")),
                 sales = sample(20:80, 15, replace = TRUE))

Where am I going wrong in splitting a time series?

data<-c(10.0,11.1,12.3,13.2,14.8,15.6,16.7,17.5,18.9,19.7,20.7,21.1,22.6,23.5,24.9,25.1,26.3,27.8,28.8,29.6,30.2,31.6,32.1,33.7)
startDate <- '2013-01-01'
endDate <- '2013-01-01'
df <- ts(cbind(data, startDate, endDate))
df
################
smp_size <- 0.80
train_ind <- length(df) * smp_size
train_split <- seq(from = 1, to = train_ind)
test_split <- seq(from = train_ind +1, to = length(df))
train <- data[train_split]
test <- data[-test_split]
(c(train, test))
I have the above data and I am trying to split it into time-series splits, i.e. the first 80% as training and the remaining 20% as testing.
I keep getting weird results:
(c(train, test))
[1] 10.0 11.1 12.3 13.2 14.8 15.6 16.7 17.5 18.9 19.7 20.7 21.1 22.6 23.5 24.9 25.1 26.3 27.8 28.8 29.6 30.2
[22] 31.6 32.1 33.7 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[43] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 10.0 11.1 12.3 13.2 14.8 15.6
[64] 16.7 17.5 18.9 19.7 20.7 21.1 22.6 23.5 24.9 25.1 26.3 27.8 28.8 29.6 30.2 31.6 32.1 33.7
Why are there NA values in the middle of the data?
You should use nrow(df), not length(df), for multi-column time-series objects. length() counts every element of the underlying matrix (24 rows × 3 columns = 72 here), so train_split runs far past the 24 available values of data, and out-of-range indexing produces the NAs. Likewise, test <- data[-test_split] removes nothing, because none of those positions exist, which is why the full series reappears at the end.
data <- c(10.0, 11.1, 12.3, 13.2, 14.8, 15.6, 16.7, 17.5, 18.9,
19.7, 20.7, 21.1, 22.6, 23.5, 24.9, 25.1, 26.3, 27.8,
28.8, 29.6, 30.2, 31.6, 32.1, 33.7)
startDate <- '2013-01-01'
endDate <- '2013-01-01'
df <- ts(cbind(data, startDate, endDate))
train <- df[1:(nrow(df) * .8), ]
test <- df[-(1:(nrow(df) * .8)), ]
> all.equal(df, ts(rbind(train, test)))
[1] TRUE
> length(df)
[1] 72
> nrow(df)
[1] 24
Calculate the number of rows to include in the training set and use the window() function to subset the time series:
train_size <- ceiling(nrow(df) * 0.8)
train_set <- window(df, end = train_size)
test_set <- window(df, start = train_size + 1)
train_set
#Time Series:
#Start = 1
#End = 20
#Frequency = 1
# data startDate endDate
# 1 10 2013-01-01 2013-01-01
# 2 11.1 2013-01-01 2013-01-01
# 3 12.3 2013-01-01 2013-01-01
# 4 13.2 2013-01-01 2013-01-01
# 5 14.8 2013-01-01 2013-01-01
# 6 15.6 2013-01-01 2013-01-01
# 7 16.7 2013-01-01 2013-01-01
# 8 17.5 2013-01-01 2013-01-01
# 9 18.9 2013-01-01 2013-01-01
#10 19.7 2013-01-01 2013-01-01
#11 20.7 2013-01-01 2013-01-01
#12 21.1 2013-01-01 2013-01-01
#13 22.6 2013-01-01 2013-01-01
#14 23.5 2013-01-01 2013-01-01
#15 24.9 2013-01-01 2013-01-01
#16 25.1 2013-01-01 2013-01-01
#17 26.3 2013-01-01 2013-01-01
#18 27.8 2013-01-01 2013-01-01
#19 28.8 2013-01-01 2013-01-01
#20 29.6 2013-01-01 2013-01-01
test_set
#Time Series:
#Start = 21
#End = 24
#Frequency = 1
# data startDate endDate
#21 30.2 2013-01-01 2013-01-01
#22 31.6 2013-01-01 2013-01-01
#23 32.1 2013-01-01 2013-01-01
#24 33.7 2013-01-01 2013-01-01

Replace all duplicated values with NA

My question is similar to "replace duplicate values with NA in time series data using dplyr", but I am applying it to other time series, like those below:
box_num date x y
6-WQ 2018-11-18 20.2 8
6-WQ 2018-11-25 500.75 7.2
6-WQ 2018-12-2 500.75 23
25-LR 2018-11-18 374.95 4.3
25-LR 2018-11-25 0.134 9.3
25-LR 2018-12-2 0.134 4
73-IU 2018-12-2 225.54 0.7562
73-IU 2018-12-9 28 0.7562
73-IU 2018-12-16 225.54 52.8
library(dplyr)
df %>%
group_by(box_num) %>%
mutate_at(vars(x:y), funs(replace(., duplicated(.), NA)))
The above code can identify the duplicates and replace them with NA, but the underlying problem is that in the next step I am going to replace all NAs with a linear trend, since this is time-series data. For box_num 6-WQ, the large shift directly after 20.2 suggests the repeated value is imputed, so I would like to replace both imputed values with NA. In the other case, box_num 73-IU, the imputed values were entered one week apart, and I would like to replace those imputed values with NA as well.
Expected output:
box_num date x y
6-WQ 2018-11-18 20.2 8
6-WQ 2018-11-25 NA 7.2
6-WQ 2018-12-2 NA 23
25-LR 2018-11-18 374.95 4.3
25-LR 2018-11-25 NA 9.3
25-LR 2018-12-2 NA 4
73-IU 2018-12-2 NA NA
73-IU 2018-12-9 28 NA
73-IU 2018-12-16 NA 52.8
foo <- function(x) {
  # NA out every value that occurs more than once in the column
  replace(x, ave(x, x, FUN = length) > 1, NA)
}
myCols = c("x", "y")
df1[myCols] = lapply(df1[myCols], foo)
df1
# box_num date x y
#1 6-WQ 2018-11-18 20.20 8.0
#2 6-WQ 2018-11-25 NA 7.2
#3 6-WQ 2018-12-2 NA 23.0
#4 25-LR 2018-11-18 374.95 4.3
#5 25-LR 2018-11-25 NA 9.3
#6 25-LR 2018-12-2 NA 4.0
#7 73-IU 2018-12-2 NA NA
#8 73-IU 2018-12-9 28.00 NA
#9 73-IU 2018-12-16 NA 52.8
#DATA
df1 = structure(list(box_num = c("6-WQ", "6-WQ", "6-WQ", "25-LR", "25-LR",
"25-LR", "73-IU", "73-IU", "73-IU"), date = c("2018-11-18", "2018-11-25",
"2018-12-2", "2018-11-18", "2018-11-25", "2018-12-2", "2018-12-2",
"2018-12-9", "2018-12-16"), x = c(20.2, 500.75, 500.75, 374.95,
0.134, 0.134, 225.54, 28, 225.54), y = c(8, 7.2, 23, 4.3, 9.3,
4, 0.7562, 0.7562, 52.8)), class = "data.frame", row.names = c(NA,
-9L))
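Note that foo() runs over whole columns rather than per box_num; that works here because no value repeats across boxes. If duplicates could span boxes, a sketch of a grouped variant using ave():
df1[myCols] <- lapply(df1[myCols], function(col) ave(col, df1$box_num, FUN = foo))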
With the tidyverse you can do the following (using list(~ ...), since funs() is deprecated):
df %>%
  group_by(box_num) %>%
  mutate_at(vars(x:y), list(~ ifelse(. %in% subset(rle(sort(.))$values, rle(sort(.))$lengths > 1), NA, .)))
box_num date x y
<fct> <fct> <dbl> <dbl>
1 6-WQ 2018-11-18 20.2 8.00
2 6-WQ 2018-11-25 NA 7.20
3 6-WQ 2018-12-2 NA 23.0
4 25-LR 2018-11-18 375. 4.30
5 25-LR 2018-11-25 NA 9.30
6 25-LR 2018-12-2 NA 4.00
7 73-IU 2018-12-2 NA NA
8 73-IU 2018-12-9 28.0 NA
9 73-IU 2018-12-16 NA 52.8
First, it sorts the values in "x" and "y" and computes the run lengths of equal values. Second, it creates a subset of those values that have a run length > 1. Finally, it checks whether the values in "x" and "y" appear in that subset, and if so, replaces them with NA.
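To see the mechanics on a single column (the x values of box 6-WQ):
x <- c(20.2, 500.75, 500.75)
r <- rle(sort(x))
r$values[r$lengths > 1]
# [1] 500.75  -> both occurrences of 500.75 become NA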

How to change a column classed as NULL to class integer?

So I'm starting with a dataframe called max.mins that has 153 rows.
day Tx Hx Tn
1 1 10.0 7.83 2.1
2 2 7.7 6.19 2.5
3 3 7.1 4.86 0.0
4 4 9.8 7.37 2.7
5 5 13.4 12.68 0.4
6 6 17.5 17.47 3.5
7 7 16.5 15.58 6.5
8 8 21.5 20.30 6.2
9 9 21.7 21.41 9.7
10 10 24.4 28.18 8.0
I'm applying these statements to the dataframe to look for specific criteria:
temp_warnings <- subset(max.mins, Tx >= 32 & Tn >=20)
humidex_warnings <- subset(max.mins, Hx >= 40)
Now when I open up humidex_warnings, for example, I have this dataframe:
row.names day Tx Hx Tn
1 41 10 31.1 40.51 20.7
2 56 25 33.4 42.53 19.6
3 72 11 34.1 40.78 18.1
4 73 12 33.8 40.18 18.8
5 74 13 34.1 41.10 22.4
6 79 18 30.3 41.57 22.5
7 94 2 31.4 40.81 20.3
8 96 4 30.7 40.39 20.2
The next step is to search for 2 or 3 consecutive numbers in the row.names column and get a total of how many times this occurs (I asked this in a previous question and have a function that should work once this problem is sorted out). The issue is that row.names is class NULL, which is preventing me from applying further functions to this dataframe.
Help? :)
Thanks in advance,
Nick
If you need the row names as data, as an integer column:
humidex_warnings$seq <- as.integer(row.names(humidex_warnings))
If you don't need the row names:
row.names(humidex_warnings) <- NULL
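With the integer seq column in place, a sketch of the consecutive-row check (hypothetical, since the actual counting function comes from the earlier question):
r <- rle(diff(humidex_warnings$seq) == 1)
# a TRUE run of length k marks k + 1 consecutive row numbers
sum(r$values)                   # runs of 2+ consecutive rows
sum(r$values & r$lengths >= 2)  # runs of 3+ consecutive rows: here rows 72, 73, 74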

Partially transpose a dataframe in R

Given the following set of data:
transect <- c("B","N","C","D","H","J","E","L","I","I")
sampler <- c(rep("J",5),rep("W",5))
species <- c("ROB","HAW","HAW","ROB","PIG","HAW","PIG","PIG","HAW","HAW")
weight <- c(2.80,52.00,56.00,2.80,16.00,55.00,16.20,18.30,52.50,57.00)
wingspan <- c(13.9, 52.0, 57.0, 13.7, 11.0,52.5, 10.7, 11.1, 52.3, 55.1)
week <- c(1,2,3,4,5,6,7,8,9,9)
# NB: as.data.frame(cbind(...)) coerces every column to factor/character;
# data.frame(transect, sampler, species, weight, wingspan, week) would keep the numeric columns
ex <- as.data.frame(cbind(transect, sampler, species, weight, wingspan, week))
What I'm trying to achieve is to transpose the species and its associated information on weight and wingspan. For a better idea of the expected result, please see below. My data set is about half a million lines long with approximately 200 different species, so it will be a very large dataframe.
transect sampler week ROBweight HAWweight PIGweight ROBwingspan HAWwingspan PIGwingspan
1 B J 1 2.8 0.0 0.0 13.9 0.0 0.0
2 N J 2 0.0 52.0 0.0 0.0 52.0 0.0
3 C J 3 0.0 56.0 0.0 0.0 57.0 0.0
4 D J 4 2.8 0.0 0.0 13.7 0.0 0.0
5 H J 5 0.0 0.0 16.0 0.0 0.0 11.0
6 J W 6 0.0 55.0 0.0 0.0 52.5 0.0
7 E W 7 0.0 0.0 16.2 0.0 0.0 10.7
8 L W 8 0.0 0.0 18.3 0.0 0.0 11.1
9 I W 9 0.0 52.5 0.0 0.0 52.3 0.0
10 I W 9 0.0 57.0 0.0 0.0 55.1 0.0
The main problem is that you don't currently have unique "id" variables, which will create problems for the usual suspects of reshape and dcast.
Here's a solution. I've used getanID from my "splitstackshape" package, but it's pretty easy to create your own unique ID variable using many different methods.
library(splitstackshape)
library(reshape2)
idvars <- c("transect", "sampler", "week")
ex <- getanID(ex, id.vars=idvars)
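A base R alternative to getanID, if you'd rather avoid the extra dependency (a sketch of the same per-group running counter):
ex$.id <- ave(seq_len(nrow(ex)), ex$transect, ex$sampler, ex$week, FUN = seq_along)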
From here, you have two options:
reshape from base R:
reshape(ex, direction = "wide",
        idvar = c("transect", "sampler", "week", ".id"),
        timevar = "species")
melt and dcast from "reshape2"
First, melt your data into a "long" form.
exL <- melt(ex, id.vars=c(idvars, ".id", "species"))
Then, cast your data into a wide form.
dcast(exL, transect + sampler + week + .id ~ species + variable)
# transect sampler week .id HAW_weight HAW_wingspan PIG_weight PIG_wingspan ROB_weight ROB_wingspan
# 1 B J 1 1 NA NA NA NA 2.8 13.9
# 2 C J 3 1 56.0 57.0 NA NA NA NA
# 3 D J 4 1 NA NA NA NA 2.8 13.7
# 4 E W 7 1 NA NA 16.2 10.7 NA NA
# 5 H J 5 1 NA NA 16.0 11.0 NA NA
# 6 I W 9 1 52.5 52.3 NA NA NA NA
# 7 I W 9 2 57.0 55.1 NA NA NA NA
# 8 J W 6 1 55.0 52.5 NA NA NA NA
# 9 L W 8 1 NA NA 18.3 11.1 NA NA
# 10 N J 2 1 52.0 52.0 NA NA NA NA
A better option: "data.table"
Alternatively (and perhaps preferably), you can use the "data.table" package (at least version 1.8.11) as follows:
library(data.table)
library(reshape2) ## Also required here
packageVersion("data.table")
# [1] ‘1.8.11’
DT <- data.table(ex)
DT[, .id := sequence(.N), by = c("transect", "sampler", "week")]
DTL <- melt(DT, measure.vars=c("weight", "wingspan"))
dcast.data.table(DTL, transect + sampler + week + .id ~ species + variable)
# transect sampler week .id HAW_weight HAW_wingspan PIG_weight PIG_wingspan ROB_weight ROB_wingspan
# 1: B J 1 1 NA NA NA NA 2.8 13.9
# 2: C J 3 1 56.0 57.0 NA NA NA NA
# 3: D J 4 1 NA NA NA NA 2.8 13.7
# 4: E W 7 1 NA NA 16.2 10.7 NA NA
# 5: H J 5 1 NA NA 16.0 11.0 NA NA
# 6: I W 9 1 52.5 52.3 NA NA NA NA
# 7: I W 9 2 57.0 55.1 NA NA NA NA
# 8: J W 6 1 55.0 52.5 NA NA NA NA
# 9: L W 8 1 NA NA 18.3 11.1 NA NA
# 10: N J 2 1 52.0 52.0 NA NA NA NA
Add fill = 0 to either of the dcast versions to replace NA values with 0.
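For example:
dcast(exL, transect + sampler + week + .id ~ species + variable, fill = 0)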
