randomizing data by category in R

So I am a bit new with R, so forgive me if this is a silly question. I have a data set of behaviors that looks something like this:
time behavior
10:04:36 FEED
10:04:37 FEED
10:04:38 REST
10:04:39 REST
10:04:40 RUN
etc..
I have added a column that numbers each new behavior as a unique number, something like:
time behavior Number
10:04:36 FEED 1
10:04:37 FEED 1
10:04:38 REST 2
10:04:39 REST 2
10:04:40 RUN 3
Therefore, if the behaviors at 10:04:36 and 10:30:00 are both FEED, they are still recognized as different behavior events because they have different numbers. I then subsetted my data by behavior category so that I have a dataset of all one behavior. However, in this data set I have Number categories for each time I have a new behavior event, for example:
time behavior Number
10:04:36 FEED 1
10:04:37 FEED 1
10:30:00 FEED 10
10:30:01 FEED 10
10:30:02 FEED 10
11:01:00 FEED 21
11:01:01 FEED 21
etc...
Now, what I would like to do is randomize this new dataset by Number category. So I would like to tell R to take each chunk of data with the same Number value and reorganize these chunks. I tried to use sample(), but that only seems to work for randomizing by row. As you can see the Number categories are not all the same size either. Basically I would like to create a new matrix that looks something like this:
time behavior Number
10:30:00 FEED 10
10:30:01 FEED 10
10:30:02 FEED 10
11:01:00 FEED 21
11:01:01 FEED 21
10:04:36 FEED 1
10:04:37 FEED 1
So, I would like R to recognize each new Number category as a distinct event, and randomly reorganize the data by each new event, not by row.
Does anyone know a way to do what I am trying to do in R?

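You could create a helper function, such as
reorderingFunc <- function(data, indxCol){
  # Shuffle the distinct group labels
  indx <- sample(unique(data[, indxCol]))
  # Sort rows by a key derived from each group's position in the shuffle,
  # so all rows of one group stay together
  data[order(unique(data[, indxCol])[match(data[, indxCol], indx)]), ]
}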
Testing
set.seed(111) # Setting a seed so the outcome of `sample` is reproducible
reorderingFunc(df, "Number")
# time behavior Number
# 3 10:30:00 FEED 10
# 4 10:30:01 FEED 10
# 5 10:30:02 FEED 10
# 6 11:01:00 FEED 21
# 7 11:01:01 FEED 21
# 1 10:04:36 FEED 1
# 2 10:04:37 FEED 1
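For comparison, here is a minimal base R sketch of the same idea using split() and rbind(). It rebuilds the FEED subset from the question as `df` (the answer above assumes `df` already exists); the names `chunks` and `shuffled` are only illustrative, and the resulting order depends on the seed and R version, so no output is shown.

```r
# Rebuild the FEED subset from the question
df <- data.frame(
  time     = c("10:04:36", "10:04:37", "10:30:00", "10:30:01",
               "10:30:02", "11:01:00", "11:01:01"),
  behavior = "FEED",
  Number   = c(1, 1, 10, 10, 10, 21, 21)
)

set.seed(1)                                # any seed; only for reproducibility
chunks   <- split(df, df$Number)           # one data frame per event
shuffled <- do.call(rbind, chunks[sample(length(chunks))])  # shuffle whole events
rownames(shuffled) <- NULL                 # drop the "10.1"-style row names

# Each event should still appear as one contiguous block
runs <- rle(shuffled$Number)
```

If the shuffle worked, `runs$values` contains each Number exactly once and `runs$lengths` holds the original chunk sizes.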

Related

Is there a way I can use r code in order to calculate the average price for specific days? (AVERAGEIF function)

Firstly: I have seen other posts about AVERAGEIF translations from excel into R but I didn't see one that worked on my specific case and I couldn't get around to making one work.
I have a dataset which encompasses the daily pricings of a bunch of listings.
It looks like this
listing_id date price
1 1000 1/2/2015 $100
2 1200 2/4/2016 $150
Sample of the dataset (and desired outcome) # https://send.firefox.com/download/228f31e39d18738d/#rlMmm6UeGxgbkzsSD5OsQw
The dataset I would like to have has only the date and the average prices of all listings on that date. The goal is to get a (different) dataframe which would look something like this so I can work with it:
Date Average Price
1 4/5/2015 204.5438
2 4/6/2015 182.6439
3 4/7/2015 176.553
4 4/8/2015 182.0448
5 4/9/2015 183.3617
6 4/10/2015 205.0997
7 4/11/2015 197.0118
8 4/12/2015 172.2943
I created this in Excel using the Average.if function (and copy pasting by value) from the sample provided above.
I first tried to format the data in Excel, where I could use the AVERAGE.IF function to take the average for a specific date. The problem is that the dataset consists of 30 million rows and Excel only allows about 1 million, so it didn't work.
What I have done so far: I created a data frame in R (where I want the average prices to go) using
Avg = data.frame("Date" =1:2, "Average Price"=1:2)
Avg[nrow(Avg) + 2036,] = list("v1","v2")
Avg$Date = seq(from = as.Date("2015-04-05"), to = as.Date("2020-11-01"), by = 'day')
I tried to create an averageif-like function by this article and another but could not get it to work.
I hope this is enough information to go on otherwise I would be more than happy to provide more.

If your question is how to replicate the AVERAGEIF function, you can use logical indexing:
R code:
> df
Dates Prices
1 1 100
2 2 120
3 3 150
4 1 320
5 2 250
6 3 210
7 1 102
8 2 180
9 3 150
idx <- df$Dates == 1 # Positions where condition is true
mean(df$Prices[idx]) # Prints same output as Excel
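The logical index above gives the average for one date. To get one average per date in a single step, as the question asks, a possible sketch uses base R's aggregate() with a formula. The `Dates` and `Prices` column names are taken from the toy data above, and `avg` is just an illustrative name.

```r
# Same toy data as in the answer above
df <- data.frame(
  Dates  = rep(1:3, times = 3),
  Prices = c(100, 120, 150, 320, 250, 210, 102, 180, 150)
)

# One row per distinct date, holding the mean price for that date
avg <- aggregate(Prices ~ Dates, data = df, FUN = mean)
avg
```

On the real data the price column would first need to be numeric (for example, with the leading "$" stripped) before taking the mean.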

R: Subsetting rows by group based on time difference

I have the following data frame:
group_id date_show date_med
1 1976-02-07 1971-04-14
1 1976-02-09 1976-12-11
1 2011-03-02 1970-03-22
2 1993-08-04 1997-06-13
2 2008-07-25 2006-09-01
2 2009-06-18 2005-11-12
3 2009-06-18 1999-11-03
I want to subset my data frame in such a way that the new data frame only shows the rows in which the values of date_show are further than 10 days apart but this condition should only be applied per group. I.e. if the values in the date_show column are less than 10 days apart but the group_ids are different, I need to keep both entries. What I want my result to look like based on the above table is:
group_id date_show date_med
1 1976-02-07 1971-04-14
1 2011-03-02 1970-03-22
2 1993-08-04 1997-06-13
2 2008-07-25 2006-09-01
2 2009-06-18 2005-11-12
3 2009-06-18 1999-11-03
Which row gets deleted isn't important because the reason why I'm subsetting in the first place is to calculate the number of rows I am left with after applying this criteria.
I've tried playing around with the diff function but I'm not sure how to go about it in the simplest possible way because this problem is already within another sapply function so I'm trying to avoid any kind of additional loop (in this case by group_id).
The df I'm working with has around 100 000 rows. Ideally, I would like to do this with base R because I have no rights to install any additional packages on the machine I'm working on but if this is not possible (or if solving this with an additional package would be significantly better), I can try and ask my admin to install it.
Any tips would be appreciated!
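One possible base R sketch, under a stated assumption: a row is dropped when it falls within 10 days of the previous row in the same group (after sorting by date). This reproduces the desired table above, but it is only one reading of "further than 10 days apart". In a chain of closely spaced dates (day 0, day 8, day 16) it would drop both later rows. The names `gap` and `result` are illustrative.

```r
df <- data.frame(
  group_id  = c(1, 1, 1, 2, 2, 2, 3),
  date_show = as.Date(c("1976-02-07", "1976-02-09", "2011-03-02", "1993-08-04",
                        "2008-07-25", "2009-06-18", "2009-06-18")),
  date_med  = as.Date(c("1971-04-14", "1976-12-11", "1970-03-22", "1997-06-13",
                        "2006-09-01", "2005-11-12", "1999-11-03"))
)

# Sort so that diff() compares neighbouring dates within each group
df <- df[order(df$group_id, df$date_show), ]

# Days since the previous row of the same group; Inf for the first row of a group
gap <- ave(as.numeric(df$date_show), df$group_id,
           FUN = function(x) c(Inf, diff(x)))

result <- df[gap > 10, ]   # keep rows more than 10 days after their predecessor
nrow(result)
```

ave() applies the function per group without an explicit loop, which should be acceptable for around 100 000 rows.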

Calendar (again) manipulations in R

I have code like this:
today<-as.Date(Sys.Date())
spec<-as.Date(today-c(1:1000))
df<-data.frame(spec)
stage.dates<-as.Date(c('2015-05-31','2015-06-07','2015-07-01','2015-08-23','2015-09-15','2015-10-15','2015-11-03'))
stage.vals<-c(1:8)
stagedf<-data.frame(stage.dates,stage.vals)
df['IsMonthInStage']<-ifelse(format(df$spec,'%m')==(format(stagedf$stage.dates,'%m')),stagedf$stage.vals,0)
This is producing the incorrect output, i.e.
df.spec, df.IsMonthInStage
2013-05-01, 0
2013-05-02, 1
2013-05-03, 0
....
2013-05-10, 1
It seems to be looping around, so stage.dates is 8 long, and it is repeating the 'TRUE' match every 8th. How do I fix this so that it would flag 1 for the whole month that it is in stage vals?
Or for bonus reputation - how do I set it up so that between different stage.dates, it will populate 1, 2, 3, etc of the most recent stage?
For example:
31st of May to 7th of June would be populated 1, 7th of June to 1st of July would be populated 2, etc, 3rd of November to 30th of May would be populated 8?
Thanks
Edit:
I appreciate the latter is functionally different to the former question. I am ultimately trying to arrive at both (for different reasons), so all answers appreciated

See if this works. The functions mutate, ldply and arrange used below are all available in the plyr package.
Treat stage.dates as bucket boundaries, then cut and split your data on them. You don't need stage.vals here.
Cut And Split
library(plyr)
data<-split(df, cut(df$spec, stagedf$stage.dates, include.lowest=TRUE))
This should give you a list of data.frames, split by stage.dates.
Now mutate each piece with its index; this is what your stage.vals were going to be.
Mutate
data<-lapply(seq_along(data), function(index) {mutate(data[[index]],
IsMonthInStage=index)})
Now join the data frames in the list using ldply.
Join
data=ldply(data)
This will, however, give out-of-order dates, which you can sort with arrange.
Sort
data<-arrange(data,spec)
Final Output
data[1:10,]
spec IsMonthInStage
1 2015-05-31 1
2 2015-06-01 1
3 2015-06-02 1
4 2015-06-03 1
5 2015-06-04 1
6 2015-06-05 1
7 2015-06-06 1
8 2015-06-07 2
9 2015-06-08 2
10 2015-06-09 2
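As an alternative that needs no extra packages, the stage number can probably be looked up directly with base R's findInterval(), which returns, for each date, how many of the sorted stage dates are less than or equal to it. The sketch below reuses `stage.dates` from the question; the short `spec` range and the `stage` name are chosen only for illustration.

```r
stage.dates <- as.Date(c("2015-05-31", "2015-06-07", "2015-07-01", "2015-08-23",
                         "2015-09-15", "2015-10-15", "2015-11-03"))

# A short run of dates around the first two stage boundaries
spec <- seq(as.Date("2015-05-29"), as.Date("2015-06-09"), by = "day")

# Index of the most recent stage date on or before each date;
# 0 means the date is earlier than the first stage date
stage <- findInterval(as.numeric(spec), as.numeric(stage.dates))

data.frame(spec, stage)
```

findInterval() requires the boundaries to be sorted in non-decreasing order, which holds for `stage.dates` here. Dates before the first boundary get 0 instead of being dropped.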

Creating a Dummy Variable for Observations within a date range

I want to create a new dummy variable that is 1 if my observation is within a certain set of date ranges, and 0 if it's not. My dataset is a list of political contributions over a 10 year range and I want to make a dummy variable to mark whether the donation came during a certain range of dates. I have 10 date ranges I'm looking at.
Does anyone know if the right way to do this is to create a loop? I've been looking at this question, which seems similar, but I think mine would be a bit more complicated: Creating a weekend dummy variable
By way of example, I have a variable listing the dates on which contributions were recorded, and I want to create a dummy to show whether each contribution came during a budget crisis. So, if there were a budget crisis from 2010-02-01 until 2010-03-25 and another from 2009-06-05 until 2009-07-30, the variable would ideally look like this:
Contribution Date    Budget Crisis
2009-06-01           0
2009-06-06           1
2009-07-30           1
2009-07-31           0
2010-01-31           0
2010-03-05           1
2010-03-26           0
Thanks yet again for your help!

This looks like a good opportunity to use the %in% operator, which is a convenience wrapper around the match(...) function.
dat <- data.frame(ContributionDate = as.Date(c("2009-06-01", "2009-06-06", "2009-07-30", "2009-07-31", "2010-01-31", "2010-03-05", "2010-03-26")), CrisisYes = NA)
crisisDates <- c(seq(as.Date("2010-02-01"), as.Date("2010-03-25"), by = "1 day"),
seq(as.Date("2009-06-05"), as.Date("2009-07-30"), by = "1 day")
)
dat$CrisisYes <- as.numeric(dat$ContributionDate %in% crisisDates)
dat
ContributionDate CrisisYes
1 2009-06-01 0
2 2009-06-06 1
3 2009-07-30 1
4 2009-07-31 0
5 2010-01-31 0
6 2010-03-05 1
7 2010-03-26 0
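With many or long ranges, enumerating every day may become unwieldy. A possible variant, sketched below, keeps the ranges as start and end dates and tests each contribution date against them. The `crises` data frame and the `in_any_range` helper are hypothetical names introduced for this example; the two ranges are the ones from the question.

```r
dat <- data.frame(
  ContributionDate = as.Date(c("2009-06-01", "2009-06-06", "2009-07-30",
                               "2009-07-31", "2010-01-31", "2010-03-05",
                               "2010-03-26"))
)

# One row per crisis period (both endpoints inclusive)
crises <- data.frame(
  start = as.Date(c("2010-02-01", "2009-06-05")),
  end   = as.Date(c("2010-03-25", "2009-07-30"))
)

# TRUE if the i-th date falls inside at least one crisis period
in_any_range <- function(i) {
  d <- dat$ContributionDate[i]
  any(d >= crises$start & d <= crises$end)
}

dat$CrisisYes <- as.numeric(vapply(seq_len(nrow(dat)), in_any_range, logical(1)))
dat
```

Adding another range then only requires one more row in `crises`.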

merge same row of different Vector and multiplicate afterwards

I have a dataset like this:
MQ = data.frame(Model=c("C150A","B174","DG18"),Quantity=c(5000,3800,4000))
MQ is a data.frame that shows the production plan for a coming week: Model is the model to be produced and Quantity is how many units of it.
C150A = data.frame( Material=c("A0015", "A0071", "Z00071", "Z00080","Z00090",
"Z00012","SZ0001"), Number=c(1,1,1,1,1,1,4))
B174= data.frame(Material=c("A0014","A0071","Z00080","Z00091","Z00011","SZ0000"),
Number=c(1,1,1,1,2,4))
DG18= data.frame( Material=c("A0014","A0075","Z00085","Z00090","Z00010","SZ0005"),
Number=c(1,1,1,2,3,4))
T75A= data.frame(Material=c("A0013","A0075","Z00085","Z00090","Z00012","SZ0005"),
Number=c(1,1,1,2,3,4))
G95= data.frame(Material=c("A0013","A0075","Z00085","Z00090","Z00017","SZ0008"),
Number=c(1,1,1,2,3,4))
These are the models that could be produced.
My first problem is that, depending on the production plan MQ, I want to automatically look up the models needed and multiply each model's Quantity by the Number of each component, to find out how many of each component (Material) are needed.
The output could be a data.frame that combines all the components needed across every model in the production plan (different models can share components or use different ones, and the amount needed per component can also differ).
Material_Master= data.frame( Material=c( "A0013", "A001","A0015", "A0071", "A0075",
"A0078", "Z00071", "Z00080", "Z00090", "Z00091",
"Z00012","Z00091","Z00010","Z00012","Z00017","SZ0001",
"SZ0005","SZ0005","SZ0000","SZ0008","SZ0009"),
Number=c(20000,180000,250000,480000,250000,170000,
690000,1800000,17000,45000,12000,5000, 5000,
8000,16000,17000,45000,88000,7500,12000,45000))
In the last step, the resulting data.frame should be merged with the Material_Master data, which lists all important components together with their stock.
In my example, every component needed for production also appears in Material_Master, but a component may be missing from Material_Master; in that case, just ignore that component.
The output should compare the required amount of each component with its actual stock and report any component for which more is needed than is in stock.
Thank you for your help.

This should work:
mods <- do.call(rbind,lapply(MQ$Model,function(x)cbind(Model=x,get(x))))
full_plan <- merge(mods,MQ,by="Model")
material_plan <- with(full_plan,aggregate(Quantity*Number,by=list(Material),sum))
# Group.1 x
# 1 A0014 7800
# 2 A0015 5000
# 3 A0071 8800
# 4 A0075 4000
# 5 SZ0000 15200
# 6 SZ0001 20000
# 7 SZ0005 16000
# 8 Z00010 12000
# 9 Z00011 7600
# 10 Z00012 5000
# 11 Z00071 5000
# 12 Z00080 8800
# 13 Z00085 4000
# 14 Z00090 13000
# 15 Z00091 3800
The first line gets each of your models and stacks them, along with the model name. The second line merges back to get the Quantity, and the third aggregates.
I went ahead and made a usable example by trimming off the 1 at the beginning of each Number in your latter models. Also, I read the Model and Material columns in as character instead of factor.
options(stringsAsFactors=FALSE)
MQ = data.frame(Model=c("C150A","B174","DG18"),Quantity=c(5000,3800,4000))
C150A = data.frame(Material=c("A0015","A0071","Z00071","Z00080","Z00090","Z00012","SZ0001"),Number=c(1,1,1,1,1,1,4))
B174= data.frame(Material=c("A0014","A0071","Z00080","Z00091","Z00011","SZ0000"), Number=c(1,1,1,1,2,4))
DG18= data.frame(Material=c("A0014","A0075","Z00085","Z00090","Z00010","SZ0005"),Number=c(1,1,1,2,3,4))
T75A= data.frame(Material=c("A0013","A0075","Z00085","Z00090","Z00012","SZ0005"),Number=c(1,1,1,2,3,4))
G95= data.frame(Material=c("A0013","A0075","Z00085","Z00090","Z00017","SZ0008"),Number=c(1,1,1,2,3,4))
Edit: Added the required stringsAsFactors option, as identified by @RicardoSaporta.
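The question's last step, comparing the requirement with the stock, is not covered above. A minimal sketch of one way to do it follows. The `Needed` values are taken from the aggregated output above, but the `stock` table is a small made-up stand-in for Material_Master (its figures are hypothetical, chosen so that one shortage appears), and the names `report` and `short` are illustrative.

```r
# Requirement per material (a few rows of the aggregated plan above)
material_plan <- data.frame(
  Material = c("A0015", "A0071", "Z00090", "Z00011"),
  Needed   = c(5000, 8800, 13000, 7600)
)

# Hypothetical stock table standing in for Material_Master;
# Z00011 is deliberately absent
stock <- data.frame(
  Material = c("A0015", "A0071", "Z00090"),
  Stock    = c(250000, 480000, 10000)
)

# merge() keeps only materials present in both tables by default,
# so components missing from the stock table are ignored
report <- merge(material_plan, stock, by = "Material")
report$Shortfall <- pmax(report$Needed - report$Stock, 0)

# Materials where the requirement exceeds the stock
short <- report[report$Needed > report$Stock, ]
short
```

Note that the Material_Master in the question lists some materials more than once (for example Z00091 and SZ0005); those rows would probably need to be summed per material before merging.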
