I'm working with a large dataset that has multiple locations measured monthly, but each site has a different number of measurements and NAs, creating broken time series. To work around this, I've written a for loop that iterates over the sites and fills in the gaps using an interpolation technique. From this, I get an interpolated output per site and would ideally like to add it back into the original dataset. For example:
library(imputeTS)
Sites = c(rep("A", 5), rep("B", 4), rep("C", 10))
Meas = c(25,20,NA,21,NA,23,21,22,26,27,15,20,NA,25,NA,28,28,27,NA)
df <- data.frame(Sites, Meas)
for (i in unique(Sites)) {
  d <- subset(df, Sites == i)        # note: ==, not = (the original had a typo)
  d$fit <- na.interpolation(d$Meas)
}
What I would like is to take d$fit and match it back into a new column, df$fit, so that the measurements for each site line up properly. Any suggestions, or complete overhauls to my approach? Thanks in advance!
It's not often that you actually need for loops. You can do this particular task with the ave() function:
df$fit <- ave(df$Meas, df$Sites, FUN=na.interpolation)
In this case the function applies the na.interpolation function to each of the Meas values for each of the different values of Sites and then puts everything back in the right order.
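If you're already in the tidyverse, a dplyr equivalent (a sketch, not part of the original answer) does the same grouping:

library(dplyr)
# group by site, interpolate within each group, then drop the grouping
df <- df %>%
  group_by(Sites) %>%
  mutate(fit = na.interpolation(Meas)) %>%
  ungroup()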
Another strategy you could use for something more complex is split/unsplit. Something like:
ss <- split(df$Meas, df$Sites)
df$fit <- unsplit(lapply(ss, na.interpolation), df$Sites)
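For completeness, the original loop can also be made to write back into df by assigning into the matching rows directly (a base-R sketch using the same na.interpolation call):

df$fit <- NA
for (i in unique(df$Sites)) {
  idx <- df$Sites == i                      # rows belonging to site i
  df$fit[idx] <- na.interpolation(df$Meas[idx])
}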
tl;dr I need to flag whether a promotion was on or not based upon drops (or not) in price over time. I am open to alternative approaches.
I have a data frame of prices split across several grouping factors over time. My goal is, for each Item in each Store, to check the mode of the Price over the past 7 dates (where they exist). If the observed price is more than 10% below that mode, the Promotion column should be populated with a 1; if not, a 0.
EXAMPLE DATA
dat <- data.frame(Date = sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by = "day"), 10),
                  Item = rep(LETTERS[1:4], times = 10),
                  Store = as.factor(sample(rep(c("NY", "SYD", "LON", "PAR"), each = 10))),
                  Price = rnorm(n = 40, mean = 2.5, sd = 1))
So far I have used dplyr's group_split to break out item and store groupings into separate data frames to capture all the conditions. What I believe I need to do now is mutate the new column using an ifelse statement with rollapply. I have so far attempted to use the following line of code...
data %>% mutate(Promotion = ifelse(rollapply(Price, 7, Mode <= Price*0.91,1,0)))
this returns an error statement...
Error: Problem with `mutate()` input `PRMT_IND2`.
x comparison (5) is possible only for atomic and list types
i Input `PRMT_IND2` is `ifelse(...)`.
I am not really sure where to go from here. If you have time I would also appreciate it if you could tell me how to apply this across all the groups created by the group_split, and how to stitch this back together.
Note: observations (dates/rows) are not even across shops, and some groups have fewer than 7 days. I can remove these if the rolling apply will not work without them, but that loses quite a chunk of data.
I am using this function for the Mode...
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]  # most frequent value in x
}
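A quick sanity check of the helper on a small vector:

Mode(c(2.5, 2.5, 3.1, 2.5, 1.9))
# [1] 2.5  -- the most frequent value; ties go to the value seen first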
Maybe you can use rolling mean instead of mode.
library(dplyr)
library(zoo)
dat %>%
  group_by(Item, Store) %>%
  mutate(Promotion = as.integer(
    abs((Price - rollmeanr(Price, 7, fill = NA)) / Price) > 0.1))
This will give NAs for the first 6 values, then 1 if the Price differs by more than 10% from the rolling mean of the previous 7 days' values, and 0 otherwise. Also note that we take the absolute value here, so it gives 1 whether the price increases or decreases by more than 10%.
As Ronah Shak pointed out, the Mode function does not seem like the most appropriate choice.
Also, note that the use of tabulate converts the values to integers, which may be problematic for the values you have.
Regarding the error, as you correctly guessed, the problem was that your split data does not always have 7 dates, so the rollapply function with width = 7 returned an error.
Allowing your function to use the length of the Date vector, or 7 where available, solves the issue.
Also, you can just apply your function using group_by; splitting the data is not necessary.
dat %>%
  group_by(Store, Item) %>%
  mutate(price_check = Price * 0.91,
         # fill/align keep the rolling output the same length as the group,
         # which mutate() requires
         Promotion = ifelse(
           rollapply(Price, width = min(length(Date), 7), Mode,
                     fill = NA, align = "right") >= price_check,
           1, 0))
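Alternatively (a sketch not in the original answer), zoo's rollapplyr with partial = TRUE shrinks the window at the start of each group automatically, which avoids computing min(length(Date), 7) by hand and keeps the output the same length as the input:

library(dplyr)
library(zoo)
dat %>%
  group_by(Store, Item) %>%
  mutate(Promotion = ifelse(
    # partial = TRUE applies Mode to however many points are available
    rollapplyr(Price, width = 7, FUN = Mode, partial = TRUE) >= Price * 0.91,
    1, 0))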
Similar to the do.call/lapply approach here, and the data.table approach here, but both have the setup of:
MainDF with data and startdate/enddate ranges
SubDF with a vector of single dates
Where the users are looking for summaries of all the MainDF ranges that overlap each SubDF date. I have
MainDF with data and a vector of single dates
SubDF with startdate/enddate ranges
And am looking to append summaries, to SubDF, for multiple rows of MainDF data which fall within each SubDF range. Example:
library(lubridate)
MainDF <- data.frame(Dates = seq.Date(from = as.Date("2020-02-12"),
                                      by = "days",
                                      length.out = 10),
                     DataA = 1:10)
SubDF <- data.frame(DateFrom = as.Date(c("2020-02-13", "2020-02-16", "2020-02-19")),
                    DateTo = as.Date(c("2020-02-14", "2020-02-17", "2020-02-21")))
SubDF$interval <- interval(SubDF$DateFrom, SubDF$DateTo)
Trying the data.table approach from the second link I figure it should be something like:
MainDF[SubDF, on = .(Dates >= DateFrom, Dates <= DateTo), allow = TRUE][
, .(SummaryStat = max(DataA)), by = .(Dates)]
But it errors with unused arguments for on. On my actual data I got a result by using (the equivalent of) max(MainDF$DataA), but it was 3 repeats of the second value (in my actual data the final row won't run as it doesn't have a value for DateTo). I suspect that using MainDF$ means I've been subverting the grouping.
I suspect I'm close but I'm really struggling to get my head around the data.table mindset for complex use cases. The summary stats I'm looking to do are (for example data):
Mean & Max of DataA
length(which(DataA > 3))
difftime(last(Dates), first(Dates), units = "mins")
Dates[which.max(DataA)]
I added the interval line above because data.table's %between% help suggests one might be able to use a Dates %between% interval format, but it doesn't mention intervals/difftimes specifically in the text or examples, and my attempts are already failing elsewhere, so I'm loath to work on my running while I can't walk!
I've focused on the data.table approach since it's used for a similar problem, but I've been wondering whether dplyr's group_by/group_by_if could be used instead? group_by_if's .predicate seems to be constrained to tests on the columns (e.g. are they factors) rather than relating to data in the columns' rows, but I could be wrong.
Thanks in advance for any help!
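For reference, a sketch of how the non-equi join might look (my assumption of the intended data.table approach, not a verified answer; the summary column names are placeholders). The on= error above comes from MainDF being a plain data.frame, so both tables need converting first; by = .EACHI then produces one summary row per SubDF range, and the x. prefix retrieves MainDF's own Dates column, since the bare name would return the join boundary values:

library(data.table)
setDT(MainDF)
setDT(SubDF)
SubDF[, interval := NULL]          # the lubridate interval isn't needed for the join
MainDF[SubDF, on = .(Dates >= DateFrom, Dates <= DateTo),
       .(MeanA  = mean(DataA),
         MaxA   = max(DataA),
         Nover3 = sum(DataA > 3),
         Span   = difftime(last(x.Dates), first(x.Dates), units = "mins"),
         AtMax  = x.Dates[which.max(DataA)]),
       by = .EACHI]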
I am quite new to R and have quite a challenging question. I have a large dataframe consisting of 110,000 rows representing high-resolution data from a sediment core. I would like to select multiple rows based on Depth (which is recorded in mm to 3 decimal places). Of course, I do not have the time to go through the entire dataframe and pick out the rows that I need. I would like to be able to select the rows based on the decimal part of the number and not the first digit, i.e. I would like to be able to subset to a dataframe where all the .035 values are returned. I have so far tried using the which() function but had no luck:
newdata <- Linescan_EN18218[which(Linescan_EN18218$Position.mm. == .035), ]
Can anyone offer any hints/suggestions on how I can solve this problem? Link to the first part of the dataframe csv
Welcome to Stack Overflow!
Can you please describe further what you mean by "had no luck"? Did you get an error message or an empty data.frame?
In principle, your method should work. I have replicated it with simulated data.
n <- 100
test <- data.frame(
  a = 1:n,
  b = rnorm(n = n),
  c = sample(c(0.1, 0.035, 0.0001), size = n, replace = TRUE)
)
newdata <- test[which(test$c == 0.035),]
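If the goal is to match on the decimal part regardless of the integer part (e.g. keep 1.035, 2.035, ...), one sketch is to compare the value modulo 1, with a small tolerance; exact == comparisons on doubles often fail silently, which may also be why you had no luck:

# keep rows whose fractional part is (numerically) .035
tol <- 1e-6
newdata <- test[abs(test$c %% 1 - 0.035) < tol, ]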
Goal
If the following description is hard to follow, please see the example "before" and "after" below for a straightforward example.
I have bartering data, with unique trade ids, and two sides of the trade. Side1 and Side2 are baskets, lists of item ids that represent both sides of the barter transaction.
I'd like to count the frequency with which each ITEM appears across TRADES. E.g., if item "001" appeared in 3 trades, I'd have a count of 3 (ignoring how many times the item appeared within each trade).
Further, I'd like to do this with the plyr ddply function.
(If you're interested in my motivation: I'm working over many hundreds of thousands of transactions and am already using a ddply to calculate several other summary statistics. I'd like to add this count to the ddply I'm already using, rather than calculate it afterwards and merge it into the ddply output... sorry if that was difficult to follow.)
In terms of pseudo code I'm working off of:
merge each row of Side1 and Side2
by row, get unique() appearances of each item id
apply table() function
transpose and relabel output from table
Example of the structure of my data, and the output I desire.
Data Example (before):
df <- data.frame(TradeID = c("01","02","03","04"))
df$Side1 = list(c("001","001","002"),
c("002","002","003"),
c("001","004"),
c("001","002","003","004"))
df$Side2 = list(c("001"),c("007"),c("009"),c())
Desired Output (after):
df.ItemRelFreq_byTradeID <- data.frame(ItemID = c("001","002","003","004","007","009"),
RelFreq_byTrade = c(3,3,2,2,1,1))
One method to do this without ddply
I've worked out one way to do this below. My problem is that I can't quite seem to get ddply to do this for me.
temp <- table(unlist(sapply(mapply(c, df$Side1, df$Side2), unique)))
# as.vector() keeps the counts as a plain column; passing the table itself
# would make data.frame() expand it into Var1/Freq columns
df.ItemRelFreq_byTradeID <- data.frame(ItemID = names(temp),
                                       RelFreq_byTrade = as.vector(temp))
Thanks for any help you can offer!
Curtis
I believe this will do what you're asking for. It uses ddply. Twice!
library(plyr)
res <- ddply(df, .(TradeID), function(df)
  data.frame(ItemID = c(df$Side1[[1]], df$Side2[[1]]), TradeID = df$TradeID))
ddply(res, .(ItemID), summarise, RelFreq_byTrade = length(unique(TradeID)))
Note that the ItemIDs are slightly out of order.
Let's start from the end: the R output will be read into Tableau to create a dashboard, and therefore the output needs to look a certain way. With that in mind, I'm starting with a data frame in R containing n groups of time series. I want to run auto.arima (or another forecasting method from the forecast package) on each group. I'm using the by function to do that, but I'm not attached to that approach; it's just what seemed to do the job for an R beginner like me.
The output I need would append a (say) 1 period forecast to the original data frame, filling in the date (variable t) and by variable (variable class).
If possible I'd like the approach to generalize to multiple by variables (i.e. class_1, ..., class_n).
# generate fake data
t <- seq(as.Date("2012/1/1"), by = "month", length.out = 36)
class <- rep(c("A", "B"), each = 18)
set.seed(1234)
metric <- as.numeric(arima.sim(model = list(order = c(2, 1, 1),
                                            ar = c(0.5, 0.3), ma = 0.3),
                               n = 35))
df <- data.frame(t, class, metric)
df$type <- "ORIGINAL"
# sort of what I'd like to do
library(forecast)
ts <- ts(df$metric)
ts <- by(df$metric, df$class, auto.arima)
# extract forecast and relevant other pieces of data
# ???
# what I'd like it to look like
t <- as.Date(c("2013/7/1", "2015/1/1"))
class <- rep(c("A", "B"), each = 1)
metric <- c(1.111, 2.222)
dfn <- data.frame(t, class, metric)
dfn$type <- "FORECAST"
dfinal <- rbind(df, dfn)
I'm not attached to the how-to, as long as it starts with a data frame that looks like what I described, and outputs a data frame like the output I described.
Your description is a little vague, but something along these lines should work:
library(data.table)
dt <- data.table(df)
dt[, {result <- auto.arima(metric)
      rbind(.SD,
            list(seq(t[.N], length.out = 2, by = '1 month')[2],
                 result$sigma2, "FORECAST"))},
   by = class]
I arbitrarily chose to fill in the sigma^2, since it wasn't clear which variable(s) you want there.
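If the metric column should instead hold the 1-step-ahead point forecast, a variant sketch using forecast() from the forecast package (loaded earlier in the question) could look like:

dt[, {fit <- auto.arima(metric)
      fc <- forecast(fit, h = 1)         # 1-period-ahead forecast
      rbind(.SD,
            list(seq(t[.N], length.out = 2, by = '1 month')[2],
                 as.numeric(fc$mean),    # the point forecast
                 "FORECAST"))},
   by = class]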