I am trying to generate the series below (see attached image) based on the logic given below. I was able to create the series for one product and store (code given below), but I am having trouble generalizing it to multiple product/store combinations. Could you please advise whether there is an easier way to do this?
Logic:
a: given
b: lag of d by 4
c: initial c for the first week; thereafter (c previous row + b current - a current)
d: initial d - c current
My code:
library(dplyr)
df = structure(list(
Product = c(11078931, 11078931, 11078931, 11078931, 11078931,
11078931, 12021216, 12021216, 12021216, 12021216,
12021216, 12021216, 10932270, 10932270, 10932270,
10932270, 10932270),
STORE = c(90, 90, 90, 90, 90, 90, 90, 90, 90, 90, 90, 90, 547, 547,
547, 547, 547),
WEEK = c(201627, 201628, 201629, 201630, 201631, 201632, 201627, 201628,
201629, 201630, 201631, 201632, 201627, 201628, 201629, 201630,
201631),
WEEK_SEQ = c(914, 915, 916, 917, 918, 919, 914, 915, 916, 917, 918, 919,
914, 915, 916, 917, 918),
a = c(9.161, 9.087, 8.772, 8.698, 7.985, 6.985, 0.945, 0.734, 0.629, 0.599,
0.55, 0.583, 5.789, 5.694, 5.488, 5.47, 5.659),
initial_d = c(179, 179, 179, 179, 179, 179, 18, 18, 18, 18, 18, 18, 37, 37,
37, 37, 37),
Initial_c = c(62, 0, 0, 0, 0, 0, 33, 0, 0, 0, 0, 0, 59, 0, 0, 0, 0)
),
.Names = c("Product", "STORE", "WEEK", "WEEK_SEQ", "a", "initial_d",
"Initial_c"),
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -17L))
# filter to extract one product and store
# df = df %>% filter(Product == 11078931) %>% filter(STORE == 90)
df$b = 0
df$c = 0
df$d = NA
c_init = 62
d_init = 179
df$d <- d_init
df$c[1] <- c_init
RQ <- function(df, ...) {
  for (i in seq_along(df$WEEK_SEQ)) {
    if (i > 4) {
      df[i, "b"] = round(df[i-4, "d"], digits = 0) # Calculate b with the lag
    }
    if (i > 1) {
      df[i, "c"] = round(df[i-1, "c"] + df[i, "b"] - df[i, "a"], digits = 0) # calc c
    }
    df[i, "d"] <- round(d_init - df[i, "c"], digits = 0) # calc d
    if (df[i, "d"] < 0) {
      df[i, "d"] <- 0 # reset negative d values
    }
  }
  return(df)
}
df = df %>% group_by(SKU_CD, STORE_CD) %>% RQ(df)
Expected output series
Could you please advise what is wrong in my code? It works fine for one product and store combination, but not for multiple products and stores. Thanks for your time and input!
Consider base R's by which subsets the input dataframe by each combination of factor types to return a list of the subsetted dataframes. Then, run a do.call(rbind, ...) to row bind the list into one final dataframe.
RQ_dfs <- by(df, df[c("Product", "STORE")], FUN=RQ)
finaldf <- do.call(rbind, RQ_dfs)
While I cannot reproduce the series in your screenshot with the posted data, the pairing from your commented-out filter does show up in finaldf:
# # A tibble: 17 × 10
# Product STORE WEEK WEEK_SEQ a initial_d Initial_c b c d
# * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 11078931 90 201627 914 9.161 179 62 0 62 117
# 2 11078931 90 201628 915 9.087 179 0 0 53 126
# 3 11078931 90 201629 916 8.772 179 0 0 44 135
# 4 11078931 90 201630 917 8.698 179 0 0 35 144
# 5 11078931 90 201631 918 7.985 179 0 117 144 35
# 6 11078931 90 201632 919 6.985 179 0 126 263 0
# 7 12021216 90 201627 914 0.945 18 33 0 0 179
# 8 12021216 90 201628 915 0.734 18 0 0 -1 180
# 9 12021216 90 201629 916 0.629 18 0 0 -2 181
# 10 12021216 90 201630 917 0.599 18 0 0 -3 182
# 11 12021216 90 201631 918 0.550 18 0 179 175 4
# 12 12021216 90 201632 919 0.583 18 0 180 354 0
# 13 10932270 547 201627 914 5.789 37 59 0 0 179
# 14 10932270 547 201628 915 5.694 37 0 0 -6 185
# 15 10932270 547 201629 916 5.488 37 0 0 -11 190
# 16 10932270 547 201630 917 5.470 37 0 0 -16 195
# 17 10932270 547 201631 918 5.659 37 0 179 157 22
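In the output above, the second and third groups start from c = 0 and d = 179, because RQ reads the global d_init and the single df$c[1] assignment rather than each group's own initial_d and Initial_c. A minimal sketch of a per-group variant follows, assuming the stated logic is what is wanted; rq_group is a hypothetical name, and the data frame is a two-product subset of the posted data.

```r
# Hypothetical helper: same recurrence as RQ, but seeded from the group's own
# initial_d and Initial_c columns instead of the globals d_init and c_init.
rq_group <- function(g) {
  g <- g[order(g$WEEK_SEQ), ]            # make sure weeks are in sequence
  n <- nrow(g)
  b <- numeric(n); cc <- numeric(n); d <- numeric(n)
  for (i in seq_len(n)) {
    if (i > 4) b[i] <- round(d[i - 4])   # b: d lagged by 4 weeks
    # c: initial value in week 1, then previous c + current b - current a
    cc[i] <- if (i == 1) g$Initial_c[1] else round(cc[i - 1] + b[i] - g$a[i])
    d[i] <- max(round(g$initial_d[i] - cc[i]), 0)  # d: floored at 0
  }
  g$b <- b; g$c <- cc; g$d <- d
  g
}

# Two-product subset of the posted data
df <- data.frame(
  Product   = rep(c(11078931, 12021216), each = 6),
  STORE     = 90,
  WEEK_SEQ  = rep(914:919, times = 2),
  a         = c(9.161, 9.087, 8.772, 8.698, 7.985, 6.985,
                0.945, 0.734, 0.629, 0.599, 0.55, 0.583),
  initial_d = rep(c(179, 18), each = 6),
  Initial_c = c(62, 0, 0, 0, 0, 0, 33, 0, 0, 0, 0, 0)
)

# Split by Product/STORE, apply the recurrence per group, and row-bind
groups  <- split(df, list(df$Product, df$STORE), drop = TRUE)
finaldf <- do.call(rbind, lapply(groups, rq_group))
```

For product 11078931 this gives the same b, c and d as rows 1 to 6 of the printed output, while the second product now starts from its own Initial_c of 33.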
How do I make a new column in DF with the percentage change in share price over the year?
DF <- data.frame(name = c("EQU", "YAR", "MOWI", "AUSS", "GJF", "KOG", "SUBC"),
price20 = c(183, 343, 189, 88, 179, 169, 62),
price21 = c(221, 453, 183, 85, 198, 232, 67))
Here's a line that will do it for you. I've also added a round function to it so the table is more readable.
DF$percent_change <- round((DF$price21 - DF$price20) / DF$price20 * 100, 2)
name price20 price21 percent_change
1 EQU 183 221 20.77
2 YAR 343 453 32.07
3 MOWI 189 183 -3.17
4 AUSS 88 85 -3.41
5 GJF 179 198 10.61
6 KOG 169 232 37.28
7 SUBC 62 67 8.06
This line should do it:
DF$change <- (DF$price21/DF$price20*100) - 100
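The two formulas are algebraically the same, since (p21 - p20) / p20 * 100 equals p21 / p20 * 100 - 100. A quick check on the first three rows of DF (v1 and v2 are illustrative names, not from the answers):

```r
# First three rows of the posted DF
DF <- data.frame(price20 = c(183, 343, 189), price21 = c(221, 453, 183))

v1 <- (DF$price21 - DF$price20) / DF$price20 * 100  # difference over base
v2 <- DF$price21 / DF$price20 * 100 - 100           # ratio minus 100

all.equal(v1, v2)  # TRUE, up to floating-point tolerance
round(v1, 2)
```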
We can use a simple division and scales::percent:
library(dplyr)
DF %>% mutate(percent_change = scales::percent((price21-price20)/price20))
name price20 price21 percent_change
1 EQU 183 221 20.77%
2 YAR 343 453 32.07%
3 MOWI 189 183 -3.17%
4 AUSS 88 85 -3.41%
5 GJF 179 198 10.61%
6 KOG 169 232 37.28%
7 SUBC 62 67 8.06%
I'm a relative beginner with R so apologies for the simplistic question.
I have a simple data frame with columns x, y and z. They all contain numerical values, and I'd like to write a piece of code that replaces all z values with "115" whenever 300 < x < 600, 0 < y < 100, and z > 160.
It's a very simple problem, but I am not sure why I am having so much trouble piecing together code for it. I'm sure it's some hodge-podge of replace and ifelse arguments, but I can't seem to put it together.
Help is much appreciated! Thanks!
This is how I would do it:
library(tidyverse)
set.seed(1)
df <- tibble("x" = sample(x = 200:700, size = 10, replace = TRUE), # tibble() replaces the deprecated data_frame()
             "y" = sample(x = 0:400, size = 10, replace = TRUE),
             "z" = sample(x = 0:200, size = 10, replace = TRUE))
df
#> A tibble: 10 x 3
#> x y z
#> <int> <int> <int>
#> 1 523 84 109
#> 2 366 276 164
#> 3 328 361 33
#> 4 617 329 105
#> 5 670 262 125
#> 6 498 328 88
#> 7 469 78 171
#> 8 665 212 32
#> 9 386 36 83
#>10 506 104 162
df$z <- ifelse((df$x > 300 & df$x < 600) & (df$y > 0 & df$y < 100) & (df$z > 160), 115, df$z)
df
#> A tibble: 10 x 3
#> x y z
#> <int> <int> <dbl>
#> 1 523 84 109
#> 2 366 276 164
#> 3 328 361 33
#> 4 617 329 105
#> 5 670 262 125
#> 6 498 328 88
#> 7 469 78 115
#> 8 665 212 32
#> 9 386 36 83
#>10 506 104 162
#(#7 was updated to 115 as it met all the criteria)
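Note that z changes from <int> to <dbl> in the output above, because ifelse() mixes the double 115 with the integer column. If the integer type matters, one option is to assign through a logical index with an integer literal. A small sketch on three hand-picked rows (hit is an illustrative name):

```r
# Three rows with integer columns; only the second meets all three conditions
df <- data.frame(x = c(523L, 469L, 617L),
                 y = c(84L, 78L, 329L),
                 z = c(109L, 171L, 105L))

hit <- df$x > 300 & df$x < 600 & df$y > 0 & df$y < 100 & df$z > 160
df$z[hit] <- 115L   # integer literal keeps z as an integer column
```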
Edit
As usual, #TIC's answer is better than mine (fewer steps -> faster) but not by much on my system with a million rows. The data.table method is quickest:
library(tidyverse)
set.seed(1)
df <- tibble("x" = sample(x = 0:700, size = 1000000, replace = TRUE), # tibble() replaces the deprecated data_frame()
             "y" = sample(x = 0:400, size = 1000000, replace = TRUE),
             "z" = sample(x = 0:200, size = 1000000, replace = TRUE))
ifelse_func <- function(df){
  df$z <- ifelse((df$x > 300 & df$x < 600) & (df$y > 0 & df$y < 100) & (df$z > 160), 115, df$z)
  df # return the modified data frame, not just the assigned value
}
transform_func <- function(df){
  transform(df, z = replace(z, 300 < x & x < 600 & 0 < y & y < 100 & z > 160, 115))
}
rowsums_func <- function(df){
  df$z[!rowSums(!(df > list(300, 0, 160) & df < list(600, 100, Inf)))] <- 115
  df # return the modified data frame, not just the assigned value
}
library(data.table)
dt_func <- function(df){
setDT(df)
df[x > 300 & x < 600 & y > 0 & y < 100 & z > 160, z := 115]
}
mbm <- microbenchmark::microbenchmark(ifelse_func(df), transform_func(df),
rowsums_func(df), dt_func(df))
autoplot(mbm)
Edit 2
> system.time(ifelse_func(df))
user system elapsed
0.064 0.020 0.085
> system.time(transform_func(df))
user system elapsed
0.060 0.009 0.069
> system.time(rowsums_func(df))
user system elapsed
0.090 0.021 0.110
> system.time(dt_func(df))
user system elapsed
0.036 0.003 0.039
Do you want this?
transform(
df,
z = replace(z, 300 < x & x < 600 & 0 < y & y < 100 & z > 160, 115)
)
Another option in base R is with rowSums
df$z[!rowSums(!(df >list(300, 0, 160) & df < list(600, 100, Inf)))] <- 115
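The rowSums one-liner is dense, so here is the same idea unpacked into steps on a three-row example. The names lower, upper, inside and all_ok are illustrative only, and the sketch assumes the threshold lists are in the same order as the columns x, y, z.

```r
df <- data.frame(x = c(450, 700, 350),
                 y = c(50, 50, 150),
                 z = c(170, 170, 170))

lower <- list(300, 0, 160)    # lower bounds for x, y, z (column order matters)
upper <- list(600, 100, Inf)  # upper bounds for x, y, z

inside <- df > lower & df < upper   # logical matrix: one column per variable
all_ok <- rowSums(!inside) == 0     # TRUE where no condition fails

df$z[all_ok] <- 115
```

Only the first row satisfies all three conditions, so only its z is replaced.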
We can do this with a single ifelse condition.
Some sample data:
df <- data.frame(x=c(450, runif(10)*200),
y=c(50, runif(10)*100),
z=c(170, runif(10)*100))
> df
x y z
1 450.00000 50.00000 170.00000
2 10.38674 93.33277 74.72619
3 117.66350 48.88015 27.60769
4 128.85086 35.74645 61.32745
5 93.21923 87.15894 53.37949
6 30.09869 86.72846 94.64611
7 104.03966 55.12932 89.78309
8 17.48741 16.50095 42.26284
9 183.52845 39.65171 27.60766
10 79.68355 18.14510 84.17454
11 110.14051 77.85835 33.67199
Then run this:
df$z <- ifelse(df$x > 300 & df$x < 600 & df$y > 0 & df$y < 100 & df$z > 160, 115, df$z)
And we get this:
> df
x y z
1 450.00000 50.00000 115.00000
2 10.38674 93.33277 74.72619
3 117.66350 48.88015 27.60769
4 128.85086 35.74645 61.32745
5 93.21923 87.15894 53.37949
6 30.09869 86.72846 94.64611
7 104.03966 55.12932 89.78309
8 17.48741 16.50095 42.26284
9 183.52845 39.65171 27.60766
10 79.68355 18.14510 84.17454
11 110.14051 77.85835 33.67199
I am trying to get a frequency table for one dataset ("sim") using the intervals and classes from another dataset ("obs"); both are of the same type. I've tried the table() function in R, but it doesn't give me the frequency of "sim" over the "obs" intervals. Some data may fall outside the range defined by "obs"; the idea is that those values are omitted. Is there a simple way to get the frequency table for this case?
Here is a sample of my data (vector):
X obs sim
1 1 11.2 8.44
2 2 22.5 15.51
3 3 26.0 20.08
4 4 28.1 23.57
5 5 29.0 26.46
6 6 29.5 28.95
...etc...
I leave you the lines of code:
# Set working directory
setwd("C:/Users/...")
# Vector has 2 set of data, "obs" and "sim"
vector <- read.csv("vector.csv", fileEncoding = 'UTF-8-BOM')
# Divide the range of "obs" into intervals, using Sturges for number of classes:
factor_obs <- cut(vector$obs, breaks=nclass.Sturges(vector$obs), include.lowest = T)
# Get a frequency table using the table() function for "obs"
obs_out <- as.data.frame(table(factor_obs))
obs_out <- transform(obs_out, cumFreq = cumsum(Freq), relative = prop.table(Freq))
# Get a frequency table using the table() function for "sim", using cut from "obs"
sim_out <- as.data.frame(table(factor_obs, vector$sim > 0))
This is what I get from "obs" frequency table:
> obs_out
factor_obs Freq cumFreq relative
1 [11.1,25.6] 2 2 0.04166667
2 (25.6,40.1] 10 12 0.20833333
3 (40.1,54.5] 17 29 0.35416667
4 (54.5,69] 4 33 0.08333333
5 (69,83.4] 8 41 0.16666667
6 (83.4,97.9] 5 46 0.10416667
7 (97.9,112] 2 48 0.04166667
This is what I get from "sim" frequency table:
> sim_out
factor_obs Var2 Freq
1 [11.1,25.6] TRUE 2
2 (25.6,40.1] TRUE 10
3 (40.1,54.5] TRUE 17
4 (54.5,69] TRUE 4
5 (69,83.4] TRUE 8
6 (83.4,97.9] TRUE 5
7 (97.9,112] TRUE 2
Which is the same frequency from the table for "obs".
The idea is that the elements of "sim" in each interval defined by the classes of "obs" are counted, and that extreme values outside the ranges of "obs" are omitted.
It would be helpful if someone can guide me. Thanks a lot!!
You will need to define your own breakpoints since if you let cut do it, the values are not saved for you to use with the sim variable. First use dput(vector) to put the data in a simple form for R:
vector <- structure(list(X = 1:48, obs = c(11.2, 22.5, 26, 28.1, 29, 29.5,
30.8, 32, 33.5, 35, 35.5, 38.9, 41, 41, 41, 43, 43.51, 44, 46,
48.5, 50, 50, 50, 50, 50.8, 51.5, 51.5, 53, 54.4, 55, 57.5, 59.5,
66.9, 70.6, 74.2, 75, 77, 80.2, 81.5, 82, 83, 83.6, 85, 85.1,
93.8, 94, 106.7, 112.3), sim = c(8.44, 15.51, 20.08, 23.57, 26.46,
28.95, 31.16, 33.17, 35.02, 36.75, 38.37, 39.92, 41.39, 42.81,
44.19, 45.52, 46.82, 48.09, 49.34, 50.56, 51.78, 52.98, 54.18,
55.37, 56.55, 57.75, 58.94, 60.14, 61.36, 62.59, 63.83, 65.1,
66.4, 67.74, 69.11, 70.53, 72.01, 73.55, 75.18, 76.9, 78.75,
80.76, 82.98, 85.46, 88.35, 91.84, 96.41, 103.48)), class = "data.frame",
row.names = c(NA, -48L))
Now we need the number of categories and the breakpoints:
nbreaks <- nclass.Sturges(vector$obs)
minval <- min(vector$obs)
maxval <- max(vector$obs)
int <- round((maxval - minval) / nbreaks, 3) # round to 1 digit more than obs or sim
brks <- c(minval, minval + seq(nbreaks-1) * int, maxval)
The table for the obs data:
factor_obs <- cut(vector$obs, breaks=brks, include.lowest=TRUE)
obs_out <- transform(table(factor_obs), cumFreq = cumsum(Freq), relative = prop.table(Freq))
print(obs_out, digits=3)
# factor_obs Freq cumFreq relative
# 1 [11.2,25.6] 2 2 0.0417
# 2 (25.6,40.1] 10 12 0.2083
# 3 (40.1,54.5] 17 29 0.3542
# 4 (54.5,69] 4 33 0.0833
# 5 (69,83.4] 8 41 0.1667
# 6 (83.4,97.9] 5 46 0.1042
# 7 (97.9,112] 2 48 0.0417
Now the sim data:
factor_sim <- cut(vector$sim, breaks=brks, include.lowest=TRUE)
sim_out <- transform(table(factor_sim), cumFreq = cumsum(Freq), relative = prop.table(Freq))
print(sim_out, digits=3)
# factor_sim Freq cumFreq relative
# 1 [11.2,25.6] 3 3 0.0638
# 2 (25.6,40.1] 8 11 0.1702
# 3 (40.1,54.5] 11 22 0.2340
# 4 (54.5,69] 11 33 0.2340
# 5 (69,83.4] 9 42 0.1915
# 6 (83.4,97.9] 4 46 0.0851
# 7 (97.9,112] 1 47 0.0213
Notice that only 47 cases are shown instead of 48, since one sim value is less than the minimum of obs. A cross-tabulation of the two factors shows that case as <NA>:
addmargins(table(factor_obs, factor_sim, useNA="ifany"))
# factor_sim
# factor_obs [11.2,25.6] (25.6,40.1] (40.1,54.5] (54.5,69] (69,83.4] (83.4,97.9] (97.9,112] <NA> Sum
# [11.2,25.6] 1 0 0 0 0 0 0 1 2
# (25.6,40.1] 2 8 0 0 0 0 0 0 10
# (40.1,54.5] 0 0 11 6 0 0 0 0 17
# (54.5,69] 0 0 0 4 0 0 0 0 4
# (69,83.4] 0 0 0 1 7 0 0 0 8
# (83.4,97.9] 0 0 0 0 2 3 0 0 5
# (97.9,112] 0 0 0 0 0 1 1 0 2
# Sum 3 8 11 11 9 4 1 1 48
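The approach above comes down to computing the breakpoints once from obs and passing the same breaks to cut() for sim; values outside the range become NA and table() drops them by default. A toy sketch with made-up numbers (two equal-width classes chosen only for illustration, not Sturges):

```r
obs <- c(10, 12, 15, 18, 20)   # made-up values
sim <- c(8, 11, 14, 19, 25)    # 8 and 25 lie outside the range of obs

# Breakpoints come from obs only: 10, 15, 20
brks <- seq(min(obs), max(obs), length.out = 3)

factor_sim <- cut(sim, breaks = brks, include.lowest = TRUE)
tab <- table(factor_sim)       # NA (out-of-range) values are dropped by default
tab
sum(is.na(factor_sim))         # number of sim values that were omitted
```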
My data looks like this:
df<- structure(list(label = c("afghanestan", "afghanestan", "afghanestanIndia",
"afghanestanindiaholad", "afghanestanUSA", "USA", "Argentina",
"Brazil", "Argentinabrazil", "Brazil"), Start = c(114, 516, 89,
22, 33, 67, 288, 362, 45, 362), Stop = c(127, 544, 105, 34, 50,
85, 299, 381, 68, 381)), class = "data.frame", .Names = c("label",
"Start", "Stop"), row.names = c(NA, -10L))
When I want to remove the exact duplicates, I simply do this:
df[!duplicated(df[,c('label','Start','Stop')]),]
Now the problem is that I want to flag rows that share the same label but may differ in Start and Stop, so I would like to generate something like this afterwards:
label Start Stop NewLab
1 afghanestan 114 127 TRUE
2 afghanestan 516 544 TRUE
3 afghanestanIndia 89 105 FALSE
4 afghanestanindiaholad 22 34 FALSE
5 afghanestanUSA 33 50 FALSE
6 USA 67 85 FALSE
7 Argentina 288 299 FALSE
8 Brazil 362 381 FALSE
9 Argentinabrazil 45 68 FALSE
Once the exact duplicates have been removed as in the question, this works in a single line of code:
df$NewLab <- df$label %in% df[duplicated(df$label), ]$label
And the output:
> df$NewLab <- df$label %in% df[duplicated(df$label), ]$label
> df
label Start Stop NewLab
1 afghanestan 114 127 TRUE
2 afghanestan 516 544 TRUE
3 afghanestanIndia 89 105 FALSE
4 afghanestanindiaholad 22 34 FALSE
5 afghanestanUSA 33 50 FALSE
6 USA 67 85 FALSE
7 Argentina 288 299 FALSE
8 Brazil 362 381 FALSE
9 Argentinabrazil 45 68 FALSE
Or in dplyr notation:
df <- dplyr::mutate(df, NewLab = label %in% df[duplicated(df$label), ]$label)
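For a self-contained check, here is a sketch on made-up labels that first drops exact duplicates and then flags every row whose label occurs more than once. It uses duplicated() from both directions, which should give the same flags as the %in% form above.

```r
# Made-up data: "a" repeats with different Start/Stop, "c" is an exact duplicate
df <- data.frame(label = c("a", "a", "b", "c", "c"),
                 Start = c(1, 5, 2, 3, 3),
                 Stop  = c(2, 6, 4, 7, 7))

# Step 1: drop exact duplicates (all three columns equal)
df <- df[!duplicated(df[, c("label", "Start", "Stop")]), ]

# Step 2: flag labels that still occur more than once
df$NewLab <- duplicated(df$label) | duplicated(df$label, fromLast = TRUE)
df
```

After step 1 the second "c" row is gone, so "c" is not flagged; both "a" rows are.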
Here is a somewhat convoluted method using dplyr:
library(tidyverse)
df %>%
  group_by(label) %>%
  mutate(n = n()) %>%
  group_by(Start, Stop) %>%
  mutate(n2 = n()) %>%
  mutate(newlabel = ifelse(n>1 & n2==1, TRUE, FALSE)) %>%
  dplyr::select(-n, -n2)
First group by label and take a count (n), then group by start and stop times and take a second count (n2). An ifelse assigns TRUE when a label occurs more than once and its Start/Stop pair is unique, FALSE otherwise; finally the intermediate columns are removed.
This is R language.
From a matrix called temp_warnings that looks like
row.names row day Tx Hx Tn
1 61 61 30 31.9 36.85 19.1
2 84 84 23 33.5 43.07 20.3
3 85 85 24 31.5 39.82 19.2
4 94 94 2 30.9 41.36 20.0
5 99 99 7 34.0 43.17 21.6
6 101 101 9 34.4 42.45 21.0
7 131 131 8 30.1 38.52 19.6
8 132 132 9 30.7 38.35 21.0
I want to save this information, using the row and day columns, into a new matrix called stn:
                          2001
Tmax >= 30 & Tmin >= 19   61, 84, 85, 94, 99, 101, 131, 132
May
June                      30
July                      23, 24
August                    2, 7, 9
September                 8, 9
So I would like the contents of the row column to be saved in the first cell. There are 153 days being tested for Tx, Hx and Tn (May 1st to Sept 30th), and the day column is the day of the month. In the row column, numbers 1-31 are May, 32-61 are June, and so on. I would like the day column numbers to be saved in the correct cells for their month as well.
If you need any other information let me know,
Thanks,
Nick
This is a very unusual format, so things can get messy:
dat <- read.table(header = TRUE, text="row.names row day Tx Hx Tn
1 61 61 30 31.9 36.85 19.1
2 84 84 23 33.5 43.07 20.3
3 85 85 24 31.5 39.82 19.2
4 94 94 2 30.9 41.36 20.0
5 99 99 7 34.0 43.17 21.6
6 101 101 9 34.4 42.45 21.0
7 131 131 8 30.1 38.52 19.6
8 132 132 9 30.7 38.35 21.0")
## creating a column for the months and pasting the days by month
## (breaks follow the month lengths: May 31, Jun 30, Jul 31, Aug 31, Sep 30)
dat <- within(dat, {
  m <- cut(row, breaks = c(0, 31, 61, 92, 123, Inf), labels = month.abb[5:9])
  ms <- ave(dat$day, m, FUN = function(x) paste(x, collapse = ', '))
  # 'Tmax >= 30 & Tmin >= 19' <- paste(row, collapse = ', ')
})
## creating the final data frame to merge into
dat1 <- data.frame(' ' = c('Tmax >= 30 & Tmin >= 19', month.abb[5:9]),
'2001' = c(paste(dat$row, collapse = ', '), rep(NA, 5)),
check.names = FALSE)
dat1 <- merge(dat1, dat[!duplicated(dat[c('m','ms')]), c('m','ms')],
by.x = ' ', by.y = 'm', all = TRUE)
## combining the two columns and some clean-up
dat1 <- within(dat1, {
'2001' <- gsub('NA', '', paste(`2001`, ms))
ms <- NULL
' ' <- factor(` `, levels = c('Tmax >= 30 & Tmin >= 19', month.abb[5:9]))
})
## and ordering the rows as desired
dat1[with(dat1, order(` `)), ]
# 2001
# 6 Tmax >= 30 & Tmin >= 19 61, 84, 85, 94, 99, 101, 131, 132
# 4 May
# 3 Jun 30
# 2 Jul 23, 24
# 1 Aug 2, 7, 9
# 5 Sep 8, 9
This is what I ended up doing:
stn[1,1] <- toString(temp_warnings$row)
stn[2,1] <- toString((subset(temp_warnings, row <= 31))$day)
stn[3,1] <- toString((subset(temp_warnings, 31 < row & row <= 61))$day)
stn[4,1] <- toString((subset(temp_warnings, 61 < row & row <= 92))$day)
stn[5,1] <- toString((subset(temp_warnings, 92 < row & row <= 123))$day)
stn[6,1] <- toString((subset(temp_warnings, 123 < row))$day)
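An alternative sketch that avoids hard-coding the month boundaries: convert each row index to a calendar date and let R work out the month and day. It assumes row 1 corresponds to 1 May 2001, as the question describes; dates, mon, day and by_month are illustrative names.

```r
rows <- c(61, 84, 85, 94, 99, 101, 131, 132)   # the row column of temp_warnings

# Assumption: row 1 is 1 May 2001, so row n is 30 April 2001 plus n days
dates <- as.Date("2001-04-30") + rows

# Month as a factor with all five months, so empty months (May) are kept
mon <- factor(month.name[as.integer(format(dates, "%m"))],
              levels = month.name[5:9])
day <- as.integer(format(dates, "%d"))

# One comma-separated string of days per month; empty months give ""
by_month <- sapply(split(day, mon), toString)
by_month
```

The derived day values match the day column in the question, and by_month can be written into the matrix one cell per month.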