I have an R data frame with 3 columns:
Timestamp
Category
Value
I am trying to find an elegant way (ideally) to find where the values increased or decreased by X percent within a specified time frame. For example, I'd like to know all points in the data where Value increased by 50% or more within 1 week.
Are there any built-in functions or packages where I can just pass a percentage and a number of days and have it return which rows in the data frame are a match?
Something along these lines (pseudocode below):
RowsThatareAMatch <- findmatches(date=MyDF$Timestamp, grouping=MyDF$Category, data=MyDF$Value, growth=0.5, range=7)
The thing that is throwing me off is that I want it to return the rows for each Category that has matching values, and not just look at every value in the data frame. So if Categories A & B had growth of 50% or more within 7 days 8 times in my data, I want those rows returned, and if categories C, D, & E didn't ever have that kind of increase, I don't want data from those categories returned at all.
Right now I am looking at systematically splitting the data frame into multiple data frames for each category and then doing the analysis on each individual data frame. While that approach could work, something is telling me that R has an easier way to do this.
Thoughts?
edit: Ideally what I am looking for returned is a data frame with 3 columns, and 1 row for every match in my data.
Category
Start timestamp of the match
End timestamp of the match.
Based on my experience with R, I would need to identify the row numbers for each grouping and then I could extract the above data from the original data frame, but if there's any good way to go straight to the above output that would be awesome too!
Sample Data
So I have a CSV like this:
Timestamp,Category,Value
2015-01-01,A,1
2015-01-02,A,1.2
2015-01-03,A,1.3
2015-01-04,A,8
2015-01-05,A,8.2
2015-01-06,A,9
2015-01-07,A,9.2
2015-01-08,A,10
2015-01-09,A,11
2015-01-01,B,12
2015-01-02,B,12.75
2015-01-03,B,15
2015-01-04,B,60
2015-01-05,B,62.1
2015-01-06,B,63
2015-01-07,B,12.3
2015-01-08,B,10
2015-01-09,B,11
2015-01-01,C,100
2015-01-02,C,100000
2015-01-03,C,200
2015-01-04,C,350
2015-01-05,C,780
2015-01-06,C,780.2
2015-01-07,C,790
2015-01-08,C,790.3
2015-01-09,C,791
2015-01-01,D,0.5
2015-01-02,D,0.8
2015-01-03,D,0.83
2015-01-04,D,2
2015-01-05,D,0.01
2015-01-06,D,0.03
2015-01-07,D,0.99
2015-01-08,D,1.23
2015-01-09,D,5
I would read that into R like this
df <- read.csv("CategoryMeasurements.csv", header=TRUE)
Say your data.frame is called df. You can do something like this using data.table, which creates a new column that reads "increase over 50%" if the value grew by 50% or more relative to the previous row in the same category (which you can then filter on):
library(data.table)
# shift() is data.table's lag; the first row of each Category has no predecessor and gets NA.
# Rows are assumed to be sorted by Timestamp within each Category.
setDT(df)[, change := ifelse(Value / shift(Value, 1) - 1 >= 0.5, "increase over 50%", "Other"), by = Category]
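For readers without data.table, the same previous-row comparison can be sketched in base R. This is only an illustration: the toy values and the column names `change` and `flag` are made up here, and rows are assumed to be sorted by date within each category. Like the snippet above, it compares each row only with the row before it, not with everything inside a multi-day window.

```r
# Toy data (hypothetical values), already ordered by date within each Category
df <- data.frame(
  Category = c("A", "A", "A", "B", "B", "B"),
  Value    = c(1, 1.2, 8, 12, 12.75, 15)
)

# Previous value within each category; the first row of a category gets NA
prev <- ave(df$Value, df$Category, FUN = function(v) c(NA, head(v, -1)))

# Relative change versus the previous row
df$change <- df$Value / prev - 1

# Label rows whose value grew by 50% or more
df$flag <- ifelse(!is.na(df$change) & df$change >= 0.5,
                  "increase over 50%", "Other")

print(df[df$flag == "increase over 50%", ])
```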
Well, I'm not sure how elegant this is, but it works. I wound up having to subset by category before passing the data frame to my function, so I will need a loop or one of the apply functions to pass each category in turn, but it should get the job done.
Mydf <- read.csv("CategoryMeasurements.csv", header=TRUE)
GetIncreasesWithinRange <- function(df, growth, days) {
  # df = data frame with data you want processed. 1st column should be a date, 2nd column should be the data.
  # growth = fraction of growth you are looking for in the data (0.5 = 50%)
  # days = the number of days that the growth should occur in to be a match.
  df <- df[order(df[,1]), ] # Sort the df by the date column. This is important for the loop logic.
  # Initialize empty data frame to hold results that will be returned from this function.
  ReturnDF <- data.frame(StartDate=as.Date(character()),
                         EndDate=as.Date(character()),
                         Growth=double(),
                         stringsAsFactors=FALSE)
  TotalRows <- nrow(df)
  for(i in seq_len(TotalRows)) {
    StartDate <- toString(df[i,1])
    StartValue <- df[i,2]
    for(x in i:TotalRows) {
      NextDate <- toString(df[x,1])
      DayDiff <- as.numeric(difftime(NextDate, StartDate, units = "days"))
      if(DayDiff >= days) {
        NextValue <- df[x,2]
        # Growth is measured relative to the starting value.
        PercentChange <- (NextValue - StartValue)/StartValue
        if(PercentChange >= growth) {
          ReturnDF[(nrow(ReturnDF)+1),] <- list(StartDate, NextDate, PercentChange)
        }
        break
      }
    }
  }
  return(ReturnDF)
}
subDF <- Mydf[which(Mydf$Category=='A'), ]
subDF$Category <- NULL # Nuke the category column from the subsetting DF. It's not relevant for this.
X <- GetIncreasesWithinRange(subDF, 0.5, 4)
print(X)
Which outputs
StartDate EndDate Growth
1 2015-01-01 2015-01-05 7.200000
2 2015-01-02 2015-01-06 6.500000
3 2015-01-03 2015-01-07 6.076923
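The per-category subsetting can also be avoided by pairing rows within each category in one step. The following is a minimal base R sketch, not a tuned implementation: `find_growth` is a hypothetical helper name, the toy values are made up, and the self-join grows quadratically with the number of rows per category, so it may be too slow for large data.

```r
# Toy data (hypothetical values): A jumps sharply, B grows slowly
df <- data.frame(
  Timestamp = as.Date(rep(c("2015-01-01", "2015-01-02",
                            "2015-01-03", "2015-01-04"), 2)),
  Category  = rep(c("A", "B"), each = 4),
  Value     = c(1, 1.2, 1.3, 8, 12, 12.75, 13, 13.5)
)

find_growth <- function(df, growth, days) {
  # Pair every row with every row of the same Category
  pairs <- merge(df, df, by = "Category", suffixes = c(".start", ".end"))
  # Days between the two observations and relative change between them
  gap  <- as.numeric(pairs$Timestamp.end - pairs$Timestamp.start)
  rate <- pairs$Value.end / pairs$Value.start - 1
  # Keep pairs where the later row falls inside the window and grew enough
  hit <- gap > 0 & gap <= days & rate >= growth
  data.frame(Category = pairs$Category[hit],
             Start    = pairs$Timestamp.start[hit],
             End      = pairs$Timestamp.end[hit],
             Growth   = rate[hit])
}

matches <- find_growth(df, growth = 0.5, days = 7)
print(matches)
```

Categories without any qualifying pair simply do not appear in the result, which is the shape asked for in the question (Category, start timestamp, end timestamp).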
Related
I have a data frame with an ID column that includes duplicates. There is a column called type that takes the values "S" or "N." There are two additional date columns - admission date and discharge date. My question is a bit similar to comparing two data frames and isolating rows based on certain date differences, but not quite. If needed, I could separate my data into two data frames, but I'm wondering if I can accomplish what I want without the extra steps.
Here is a small example of what the data for two patients looks like in R:
example <- data.frame(ID = c(22,22,22,52,52,52),
admission_date = c("2013-10-03","2014-03-11","2014-03-16","2012-02-08","2014-06-10","2014-06-20"),
discharge_date = c("2013-10-11","2014-03-16","2014-03-28","2012-02-13","2014-06-12","2014-06-30"),
type = c('S','S','N','S','S','N'))
What I want to do is compare within patients, entries that take the value "N" and entries that take the value "S" in the type variable. Based on the discharge date for entries with the value "S," I would like to find entries with the value "N" that have an admission date within 5 days of the former's discharge date (the discharge date with value "S" should be before the admission date with value "N").
So in the example data frame, the only two entries that should be retained are rows 2 and 3 and not rows 5 and 6 since the difference between admission date and discharge date is greater than 5.
Does anyone have any suggestions of how I can filter this data? Any help is greatly appreciated.
This was an interesting challenge. One reason for this is because iterating over rows is less intuitive than iterating over columns (see this question for lots of suggestions: For each row in an R dataframe).
Now I know vectorized solutions are preferred over for loops, but one of the challenges with this problem was that instead of just performing functions on each row, we're comparing the iterated rows to other rows and deleting some rows as we go along. I expect there's a better solution out there and I hope someone posts a better solution to help me learn.
One minor note before I begin, "example" isn't a great name for an object because it's also a function in base R. Additionally, the solution is much easier if we're only dealing with alternating rows of "S" and "N" - that is if many S's precede an N then only the lowest S might be within 5 days of N. Nonetheless it was worth the effort to attack the more challenging case.
Ultimately I ended up solving this as a 2-stage problem, each solved with a for loop. First, I took out all the S rows which weren't within 5 days of the corresponding N rows. Then I took out those N rows which didn't have any appropriate S companions. All of this is implemented in base R.
So to begin:
example_df <- data.frame(ID = c(22,22,22,52,52,52),
admission_date = c("2013-10-03","2014-03-11","2014-03-16","2012-02-08","2014-06-10","2014-06-20"),
discharge_date = c("2013-10-11","2014-03-16","2014-03-28","2012-02-13","2014-06-12","2014-06-30"),
type = c('S','S','N','S','S','N'))
example_df$admission_date<-as.numeric(as.Date(example_df$admission_date))
example_df$discharge_date<-as.numeric(as.Date(example_df$discharge_date))
The first thing I did was to take the date columns (which were characters) and convert them to numeric based on date. Originally I was doing mathematical operations with date objects, but this became complicated with the subsetting operations I ended up using.
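As a small aside on that conversion, here is a sketch of how Date values relate to their numeric form; the variable names are only illustrative.

```r
admission <- as.Date("2014-03-16")
discharge <- as.Date("2014-03-11")

# Dates are stored as days since 1970-01-01, so as.numeric() exposes that count
as.numeric(admission)  # 16145

# Subtracting two Dates gives a difftime; as.numeric() turns it into plain days
gap_days <- as.numeric(admission - discharge)
gap_days <= 5

# The numeric form converts back to a Date with an explicit origin
as.Date(16145, origin = "1970-01-01")
```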
Here's the first for loop:
del_vec <- vector("integer")
for (i in 1:nrow(example_df)) {
  if (example_df[i,"type"] == "S") {
    next
  }
  if (example_df[i,"type"] == "N") {
    # which( has to open on the same line as the assignment;
    # otherwise add_on is bound to the function which itself
    add_on <- which(
      example_df["type"] == "S" &
      example_df["ID"] == example_df[i,"ID"] &
      example_df["discharge_date"] < (example_df[i,"admission_date"] - 5)
    )
  }
  del_vec <- append(del_vec, add_on)
}
example_df_new <- example_df[-c(del_vec),]
rownames(example_df_new) <- 1:nrow(example_df_new)
example_df_new
What I did here is start by creating a vector which will contain the row numbers that we delete. To get rid of the inappropriate S rows we need to actually work on the N rows, so I have the loop skip the S rows. Then when the loop encounters an N row, we find the rows which meet the following conditions:
have type S
have the same ID as the N row in question
have a discharge date which is more than 5 days from the admission date for the N row in question
Using which() captures the row numbers that meet these criteria. I then add these row numbers to the initially empty vector and, after the loop, remove them from the original df. I also renumber the rows of the new df to get the following output for example_df_new
ID admission_date discharge_date type
1 22 16140 16145 S
2 22 16145 16157 N
3 52 16241 16251 N
So we've preserved the 2 rows you wanted to keep, but now we have this bottom row that we want to get rid of. I do this in the second loop which iterates over the rows in the new reduced df:
del_vec2 <- vector("integer")
for (i in 1:nrow(example_df_new)) {
  if (example_df_new[i,"type"] == "S") {
    next
  }
  if (example_df_new[i,"type"] == "N") {
    add_on_two <- which(example_df_new["type"] == "S" & example_df_new["ID"] == example_df_new[i,"ID"])
  }
  # keep the N row if at least one matching S row survived the first pass
  if (length(add_on_two) != 0) {
    next
  } else {
    del_vec2 <- append(del_vec2, i)
  }
}
example_df_3 <- example_df_new[-c(del_vec2),]
example_df_3
Again, we tell the loop to skip the S rows; whichever ones made the first cut should stay in. Now when the loop encounters an N row we ask the loop to look for rows that meet the following criteria:
is type S
has the same ID as the N row in question
Again I use which() to save the positions of these rows. If these criteria are met then we skip ahead - we want to keep all the N's that have an appropriate S companion. If not then we add the row number of (i) - that is the row number for the N in question to our vector of rows that we want to delete.
We then delete those rows and end up with the desired output:
ID admission_date discharge_date type
1 22 16140 16145 S
2 22 16145 16157 N
At this point you can change the date columns back to a date format.
Again, while this may be the first, I expect it's not the best solution. I hope to see an improved solution, but the problem is more tricky than it appears at first.
After attempting to filter within the same data frame, I decided to separate the data into two tables: one containing only rows of type "S" and the other containing only rows of type "N." Then I joined the two tables on the ID column (merge() performs an inner join by default). This pairs every "N" row with every "S" row for the same patient, which let me compare the two dates of interest. The resulting data frame contains only one row: the entry of a patient with an admission date of type "N" within 5 days of a discharge date of type "S."
The code in R is as follows:
library(dplyr)
example_df <- data.frame(ID = c(22,22,22,52,52,52),
admission_date = c("2013-10-03","2014-03-11","2014-03-16","2012-02-08","2014-06-10","2014-06-20"),
discharge_date = c("2013-10-11","2014-03-16","2014-03-28","2012-02-13","2014-06-12","2014-06-30"),
type = c('S','S','N','S','S','N'))
N_only <- example_df %>%
filter(type == "N")
S_only <- example_df %>%
filter(type == "S")
example_df_merged <- merge(N_only, S_only, by = "ID")
example_df_merged$admission_date.x <- as.Date(as.character(example_df_merged$admission_date.x), format="%Y-%m-%d")
example_df_merged$discharge_date.y <- as.Date(as.character(example_df_merged$discharge_date.y), format="%Y-%m-%d")
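# days from the "S" discharge to the "N" admission; 0 to 5 means the N admission follows within 5 days
example_df_merged$dateDiff <- example_df_merged$admission_date.x - example_df_merged$discharge_date.y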
example_df_final <- example_df_merged %>%
filter(dateDiff <= 5 & dateDiff >= 0)
For clearer variable names, I would have changed the variables ending in ".x" and ".y," but that is not necessary.
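If the goal is to get back the original "S" and "N" rows rather than one merged row, the join idea can be extended by carrying a row index through the merge. This is a base R sketch under that assumption; the names `patients`, `row`, `pairs` and `kept` are introduced here for illustration only.

```r
patients <- data.frame(
  ID = c(22, 22, 22, 52, 52, 52),
  admission_date = as.Date(c("2013-10-03", "2014-03-11", "2014-03-16",
                             "2012-02-08", "2014-06-10", "2014-06-20")),
  discharge_date = as.Date(c("2013-10-11", "2014-03-16", "2014-03-28",
                             "2012-02-13", "2014-06-12", "2014-06-30")),
  type = c("S", "S", "N", "S", "S", "N"),
  stringsAsFactors = FALSE
)
patients$row <- seq_len(nrow(patients))  # remember original positions

s_rows <- patients[patients$type == "S", ]
n_rows <- patients[patients$type == "N", ]

# Every S row paired with every N row of the same patient
pairs <- merge(s_rows, n_rows, by = "ID", suffixes = c(".s", ".n"))

# Days from the S discharge to the N admission
gap <- as.numeric(pairs$admission_date.n - pairs$discharge_date.s)
ok  <- pairs[gap >= 0 & gap <= 5, ]

# Pull the matching S and N rows back out of the original data
kept <- patients[sort(unique(c(ok$row.s, ok$row.n))), ]
print(kept)
```

On the example data this keeps rows 2 and 3, the pair the question asks for.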
Before I begin explaining, my data is super messy and I am not an advanced programmer, so bear with me.
I have a large data frame with two columns containing data on each row and one column containing spread data and NA's. I want to identify a value in the NA column, using the two other columns, and assign that value to the other rows in the NA column that have the matching other two columns.
To explain what I mean I have created a small sample code with a data frame 270 x 3:
Sample Code:
library(magrittr) # provides the %>% pipe used below
year <- c()
for (i in 1988:1990){
y1 <- i %>% rep(30)
year <- c(year, y1)
}
year <- year %>% rep(3)
stock <- c()
for (i in 1:18) {
s1 <- i %>% rep(15)
stock<- c(stock, s1)
}
df <- data.frame(year,stock)
df$group <- NA
df$group[c(24, 130, 160, 212)] <- 11
df$group[c(60, 88, 140, 240)] <- 12
df$group[c(2, 72, 100, 183)] <- 13
df$group[c(16, 47, 203, 262)] <- 14
I have years and stocks data in every row, but the group is random. The years can be repeated because they are years of a collection of monthly dates and different stocks on the same dates.
I want to identify the group number of a certain row and see if that stock "persists" 12 months by having the same yearly value for i + 11 rows. I then want to assign the value of the stock's group in row i to i:(i+11). In other words, I want to see if the stock in row i and row i+11 is the same stock and the year is the same in row i and i+11
I tried to overcome this problem by only looking at the rows where we have data and create if-else conditions inside a for loop:
identifier <- which(!is.na(df$group))
for (i in identifier) {
if (df$stock[i] == df$stock[i + 11]) {
df$group[(i+1):(i+11)] <- df$group[i]
} else {
df$group[i] <- NA
}
}
I have two problems (that I have found so far) with the code:
It does not look at all 12 values from i to i+11, so it can cause overlapping groups to become NA and assign false groups to certain stocks and periods.
My original data frame consists of >3m observations and the for loop takes more than 30 minutes to run.
My sample code does not fully represent my original data set. Some stocks exist for 15 years and some 20 months. The stocks can also change group from year to year.
Here is an example of the error and how I want it to look:
df$group[24] # is NA, but the surrounding values are not
df$group[16:27]
# They should be NA, in this example
df$group[16:27] <- NA
df$group[262] <-NA # also identified error
# I then want to remove the NA's and get a clean data set containing only stocks with assigned groups
df <- df[-which(is.na(df$group)),]
df
Do you have any suggestions on how to make this more efficient?
Ways I can structure the data better?
More efficiently look for values in a column and then assign it to the other rows?
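One possible direction, offered only as a sketch, is to replace the row-by-row loop with index arithmetic. It assumes the rows are sorted so that equal stock and year at rows i and i+11 imply the same for the rows in between, and that when two 12-row windows overlap the later one may overwrite the earlier. `fill_persistent_groups` and the toy data are hypothetical.

```r
fill_persistent_groups <- function(df, span = 12) {
  n     <- nrow(df)
  start <- which(!is.na(df$group))     # rows that carry a group label
  end   <- start + span - 1
  inside <- end <= n                   # window must fit inside the data
  last   <- pmin(end, n)               # clamp so indexing stays in range
  persists <- inside &
    df$stock[last] == df$stock[start] &
    df$year[last]  == df$year[start]
  start <- start[persists]

  filled <- rep(NA_real_, n)
  if (length(start) > 0) {
    # One column of row indices per persisting start row
    rows <- as.vector(outer(0:(span - 1), start, "+"))
    filled[rows] <- rep(df$group[start], each = span)
  }
  df$group <- filled
  df
}

# Toy data: two stocks with 15 rows each, all in the same year
toy <- data.frame(year = rep(1988, 30), stock = rep(1:2, each = 15),
                  group = NA_real_)
toy$group[c(2, 10, 25)] <- c(13, 11, 12)

result <- fill_persistent_groups(toy)
result[!is.na(result$group), ]
```

Here only the label at row 2 persists for 12 rows of the same stock and year; the labels at rows 10 and 25 are dropped because their windows cross into another stock or run past the end of the data.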
I am trying to reorganize some raw data into a more condensed form. Currently the data looks like the output of the R code below. I would like the final output to have columns for time, ID, and all possible desired prices. Then, I want each ID to have only one row for each time, with the quantity entered under the corresponding desired prices (so how many an ID wants at a particular price during this time). For example, a particular ID might have a quantity of 1 at 100 and a quantity of 2 at 101. If it is a buy, the value should be negative, and if it is a sell, positive: for example, -1 for a buy at 100 and 2 for a sell at 101.
I originally tried doing it through a double for loop with the first loop being time and then the second loop being the ID. Then I was able to look at the quantity column and desired price for an ID and put them into a vector. Afterwards, I combined all the vectors together for that time and then repeated this. When I tried to use this in practice, it was not feasible because the code was too slow as there are hundreds of IDs and thousands of times.
Can someone help me do this in a faster and cleaner way?
set.seed(1)
time <- rep(seq(1,5), each = 15)
id <- sample(342:450,75,replace = TRUE)
price <- sample(99:103,75,replace = TRUE)
Desire.Price <- sample(97:105,75,replace = TRUE)
quantity <- sample(1:4,75,replace = TRUE)
data <- data.frame(time = time, id = id,price = price, Desire.Price = Desire.Price,quantity = quantity)
data$buysell <- 0
data$buysell <- ifelse( data$Desire.Price <= data$price, "BUY","SELL")
I expect the final data set would look something like this.
Final.df <- data.frame(time=NA,id=NA,"97" = NA,"98"=NA ,"99"=NA,"100"=NA,"101"=NA,"102"=NA,"103"=NA
,"104"=NA,"105"=NA)
It would basically condense the original raw data to have all the information for a particular ID in a row during each time period.
Edit: If an ID did not get sampled in that time (for example, ID 342 is not in time 1), it should have a row of NA in that time period (so ID 342 would have a row of NA in time 1). I edited the code that generates the samples to have more IDs to reflect this (so that they can't all possibly be sampled in every time period).
Here's a tidyverse approach. First, make quantity signed based on BUY/SELL, then sum quantity for each id / time / Desire.Price, then spread those into wide format with a column for each Desire.Price.
library(dplyr); library(tidyr)
data %>%
mutate(quantity_signed = if_else(buysell == "BUY", -quantity, quantity)) %>%
count(id, time, Desire.Price, wt = quantity_signed) %>%
complete(id, time) %>% # EDIT to bring in all times for all id's
spread(Desire.Price, n) %>% View("output")
I think this approach is comparatively simple.
# Code
library(reshape2)
#Turning BUY quantity values negative.
data[which(data$buysell=="BUY"),]$quantity <- -(data[which(data$buysell=="BUY"),]$quantity)
#Using dcast function to achieve desired columns.
final.df <- dcast(data, time + id ~ Desire.Price, fun.aggregate = sum, value.var = 'quantity')
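For comparison, the same sign-then-widen idea can be sketched without extra packages using tapply(). The toy `orders` data and the `signed` column are made up for illustration; unlike a sum-based cast, combinations with no orders come out as NA rather than 0.

```r
# Toy orders (hypothetical values)
orders <- data.frame(
  time         = c(1, 1, 1, 2),
  id           = c(342, 342, 350, 342),
  Desire.Price = c(100, 101, 100, 100),
  quantity     = c(1, 2, 3, 4),
  buysell      = c("BUY", "SELL", "BUY", "SELL")
)

# Buys count as negative quantities, sells as positive
orders$signed <- ifelse(orders$buysell == "BUY",
                        -orders$quantity, orders$quantity)

# One row per time/id pair, one column per desired price
wide <- tapply(orders$signed,
               list(paste(orders$time, orders$id), orders$Desire.Price),
               sum)
print(wide)
```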
I have a time series dataset in R.
A variable CB_Day is equal to MPD on some dates and 0 on most of the dates.
I want to delete all rows except the MPD days and the 10 previous days.
I have tried subset, head() and tail(), but they did not work.
Can someone tell me the right command for deleting records based on my condition in R?
The result should be the whole table with all other columns. Only rows need to be deleted.
If I get it right then something like this should help...
# create data where CB_Day is always 0 (please provide reproducible data next time)
df <- data.frame(MPD = 1:100, CB_Day = rep(0, 100))
# sometimes CB_Day is same as MPD
df$CB_Day[c(20, 70)] <- df$MPD[c(20, 70)]
# Find where both are same
same <- which(df$MPD== df$CB_Day)
# for each match, build the row indices from 10 rows before it up to the match itself (never below row 1)
keep <- lapply(same, function(x) max(1, x - 10):x)
# flatten the list into one vector and drop duplicates from overlapping windows
keep <- unique(unlist(keep))
# select the rows
df[keep, ]
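If CB_Day instead holds the literal label "MPD" on event days (another possible reading of the question), the same windowing idea applies. A minimal sketch under that assumption, with made-up data and names:

```r
# Toy series: CB_Day is "0" on ordinary days and "MPD" on event days
ts_data <- data.frame(day = 1:30, CB_Day = rep("0", 30),
                      stringsAsFactors = FALSE)
ts_data$CB_Day[c(5, 25)] <- "MPD"

# Row positions of the event days
events <- which(ts_data$CB_Day == "MPD")

# Each event day plus the 10 rows before it, clipped at the first row
keep <- sort(unique(unlist(lapply(events, function(e) max(1, e - 10):e))))

# All columns are kept; only rows outside the windows are dropped
trimmed <- ts_data[keep, ]
print(trimmed)
```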
I'm new to R and have hit a simple stumbling block that I've been searching for an answer to for too long.
The data frame includes a list of individuals with their performance over a five-year period. The analysis needs to include only those individuals that participated in the most recent year, so I need to identify those individuals and then select all records from the original data frame for those individuals with all columns (there are 50 or more other columns).
Original data frame is performance_fiveyr; variables I'm working with are person_id and year. I have tried any number of possible ways to get what I need; I'm listing one of those ways here...
First step is to create the list of individuals that participated this past year
person_current <- subset (x = performance_fiveyr,
subset = year==2015, # keep only records from 2015
select = person_id # keep only the person_id variable
)
Next step then is to select from performance_fiveyr all rows that have a person_id that exists in person_current and return all other columns (more than 50 columns total).
performance_current <- performance_fiveyr[performance_fiveyr$person_id
%in% person_current, ]
I've tried more than a few variations of this and end up with either all columns and no rows or all rows and no variables.
Here is some example data:
set.seed(0)
p5 <- data.frame(id = sample(5, 20, replace=TRUE), year = sample(2010:2015, 20, replace=TRUE))
p5 <- p5[order(p5$id, p5$year), ]
I think you were on the right track. I think the below does what you are after:
current <- unique(p5[p5$year==2015, 'id'])
p_current <- p5[p5$id %in% current, ]
p_current
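One likely reason the attempt in the question returned no rows is that subset(..., select = person_id) yields a one-column data frame rather than a vector, and %in% against a data frame does not compare against the values inside that column. The sketch below uses made-up data with the question's column names to show the difference.

```r
# Toy data with the question's column names (values are hypothetical)
perf <- data.frame(
  person_id = c(1, 1, 2, 3, 3),
  year      = c(2014, 2015, 2014, 2013, 2015)
)

# subset() with select= returns a data frame, not a vector
current_df <- subset(perf, subset = year == 2015, select = person_id)
class(current_df)

# Pull the column out so %in% compares against a plain vector of ids
current_ids <- unique(current_df$person_id)
performance_current <- perf[perf$person_id %in% current_ids, ]
print(performance_current)
```

All columns of the original data frame are retained because the column position in the final subscript is left empty.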