Subsetting data frame by list of dates - r

I have a time series data frame (1 entry per min) which has a date column, and I have made a list of all the unique dates in the column. The purpose of this is so I can perform functions on the data for one day at a time. So I want to have a loop where I take all the rows where the date is equal to the first date on my unique dates list, work with this, output, then same for second unique date, third, fourth etc.
When I use the following code I am getting zero rows returned; days and df$Date both are factors.
df <- read.csv("file1.csv")
days <- as.list(unique(df$Date))
for(i in 1:length(days)){
df <-df[(df$Date == days[i]),] #Only from this date
# Further work with data here
}
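Two things in the loop above look like likely causes, offered here as a sketch rather than a confirmed diagnosis: days[i] is a one-element list rather than a plain value, and df is overwritten on the first pass, so later dates have nothing left to match. The example below uses a small made-up data frame (the Value column and the dates are assumptions, since file1.csv is not shown) and compares the dates as character strings.

```r
# Toy stand-in for file1.csv (hypothetical data; the real file is not shown)
df <- data.frame(
  Date  = c("2020-01-01", "2020-01-01", "2020-01-02", "2020-01-03"),
  Value = c(1, 2, 3, 4),
  stringsAsFactors = TRUE
)

# Plain character vector of dates instead of a list of factors
days <- unique(as.character(df$Date))

# Option 1: keep the loop, but store each day's rows in a new object
row_counts <- integer(0)
for (d in days) {
  day_df <- df[as.character(df$Date) == d, ]
  # further work with day_df here
  row_counts[d] <- nrow(day_df)
}

# Option 2: split() returns a named list with one data frame per date
by_day <- split(df, df$Date)
day_means <- sapply(by_day, function(x) mean(x$Value))
```

With split(), the per-day work can be written once as a function and applied to every element of the list, so no manual index is needed.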

Related

Transform a country|date|Value dataframe in a dataframe with countries in the columns [duplicate]

I have a data frame, in which data for different countries are listed vertically with this pattern:
country | time | value
I want to transform it into a data frame in which each row is a specific time period and every column holds the values for one country. Data are monthly.
time | countryA-value | countryB-value |countryC-value
Moreover, not all periods are present: when data is missing, the row is simply absent rather than filled with NA or similar. I thought of two possible solutions, but both seem too complicated and inefficient, so I do not include code for them here.
First approach: if the value in a cell of the "time" column is more than one month after the cell above, while the country in both rows is the same, then there is a gap. I would have to fill the gap, and repeat this until all missing dates are included.
At that point each country has the same number of observations, and I can simply copy one block of values per country.
Drawback: it does not seem very efficient.
Second approach: I could create a list of all time periods using the command
allDates <- seq.Date(from = as.Date('2020-02-01'), to = as.Date('2021-01-01'), by = 'month') - 1
Then, for each country's subset of the table, I look up each period in allDates. If the value exists, I copy it; if not, I fill in NA.
Drawback: I have no idea which function I could use for this purpose.
Below is the code to create a small table with two missing rows, namely data2:
data <- data.frame(matrix(NA, 24, 3))
colnames(data) <- c("date", "country", "value")
data["date"] <- rep((seq.Date(from = as.Date('2020-02-01'), to = as.Date('2021-01-01'), by = 'month')-1), 2)
data["country"] <- rep(c("US", "CA"), each = 12)
data["value"] <- round(runif(24, 0, 1), 2)
data2 <- data[c(-4,-5),]
I solved the problem by following r2evans's suggestion: I looked into the function dcast and obtained exactly what I wanted.
I used the code
reshape2::dcast(dataFrame, yearMonth ~ country, fill = NA)
where dataFrame is the name of the data frame, yearMonth is the name of the column that holds the date, and country is the name of the column that holds the country.
The option fill = NA fills all gaps in the data with NA.
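For readers without reshape2, the same idea can be sketched in base R on the data2 example from the question. This is an alternative to the dcast() call, not a reproduction of it: it assumes the columns are named date, country, and value as in that example, uses reshape() for the long-to-wide step, and then merges against the full list of dates so that a month missing for every country would still get a row.

```r
set.seed(1)  # only so the random values are reproducible
dates <- seq.Date(from = as.Date("2020-02-01"), to = as.Date("2021-01-01"),
                  by = "month") - 1
data <- data.frame(
  date    = rep(dates, 2),
  country = rep(c("US", "CA"), each = 12),
  value   = round(runif(24, 0, 1), 2)
)
data2 <- data[c(-4, -5), ]  # drop two US rows to create gaps

# Long -> wide: one row per date, one value column per country
wide <- reshape(data2, idvar = "date", timevar = "country", direction = "wide")

# Join against the full calendar; dates with no data at all become NA rows
wide <- merge(data.frame(date = dates), wide, by = "date", all.x = TRUE)
```

The result has the columns date, value.US, and value.CA, with NA in value.US for the two dropped months.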

Comparing dates in different columns to isolate certain within-group entries in R

I have a data frame with an ID column that includes duplicates. There is a column called type that takes the values "S" or "N." There are two additional date columns - admission date and discharge date. My question is a bit similar to comparing two data frames and isolating rows based on certain date differences, but not quite. If needed, I could separate my data into two data frames, but I'm wondering if I can accomplish what I want without the extra steps.
Here is a small example of what the data for two patients looks like in R:
example <- data.frame(ID = c(22,22,22,52,52,52),
admission_date = c("2013-10-03","2014-03-11","2014-03-16","2012-02-08","2014-06-10","2014-06-20"),
discharge_date = c("2013-10-11","2014-03-16","2014-03-28","2012-02-13","2014-06-12","2014-06-30"),
type = c('S','S','N','S','S','N'))
What I want to do is compare within patients, entries that take the value "N" and entries that take the value "S" in the type variable. Based on the discharge date for entries with the value "S," I would like to find entries with the value "N" that have an admission date within 5 days of the former's discharge date (the discharge date with value "S" should be before the admission date with value "N").
So in the example data frame, the only two entries that should be retained are rows 2 and 3 and not rows 5 and 6 since the difference between admission date and discharge date is greater than 5.
Does anyone have any suggestions of how I can filter this data? Any help is greatly appreciated.
This was an interesting challenge. One reason for this is because iterating over rows is less intuitive than iterating over columns (see this question for lots of suggestions: For each row in an R dataframe).
Now I know vectorized solutions are preferred over for loops, but one of the challenges with this problem was that instead of just performing functions on each row, we're comparing the iterated rows to other rows and deleting some rows as we go along. I expect there's a better solution out there and I hope someone posts a better solution to help me learn.
One minor note before I begin, "example" isn't a great name for an object because it's also a function in base R. Additionally, the solution is much easier if we're only dealing with alternating rows of "S" and "N" - that is if many S's precede an N then only the lowest S might be within 5 days of N. Nonetheless it was worth the effort to attack the more challenging case.
Ultimately I ended up solving this as a 2-stage problem, each solved with a for loop. First, I took out all the S rows which weren't within 5 days of the corresponding N rows. Then I took out those N rows which didn't have any appropriate S companions. All of this is implemented in base R.
So to begin:
example_df <- data.frame(ID = c(22,22,22,52,52,52),
admission_date = c("2013-10-03","2014-03-11","2014-03-16","2012-02-08","2014-06-10","2014-06-20"),
discharge_date = c("2013-10-11","2014-03-16","2014-03-28","2012-02-13","2014-06-12","2014-06-30"),
type = c('S','S','N','S','S','N'))
example_df$admission_date<-as.numeric(as.Date(example_df$admission_date))
example_df$discharge_date<-as.numeric(as.Date(example_df$discharge_date))
The first thing I did was to take the date columns (which were characters) and convert them to numeric based on date. Originally I was doing mathematical operations with date objects, but this became complicated with the subsetting operations I ended up using.
Here's the first for loop:
del_vec <- vector("integer")
for (i in 1:nrow(example_df)) {
  # S rows are only judged relative to an N row, so skip them here
  if (example_df[i, "type"] == "S") {
    next
  }
  if (example_df[i, "type"] == "N") {
    # S rows of the same patient discharged more than 5 days before this admission
    add_on <- which(
      example_df["type"] == "S" &
        example_df["ID"] == example_df[i, "ID"] &
        example_df["discharge_date"] < (example_df[i, "admission_date"] - 5)
    )
  }
  del_vec <- append(del_vec, add_on)
}
example_df_new <- example_df[-c(del_vec), ]
rownames(example_df_new) <- 1:nrow(example_df_new)
example_df_new
Here I start by creating a vector that will hold the numbers of the rows to delete. To get rid of the inappropriate S rows we actually need to work from the N rows, so the loop skips the S rows. When the loop encounters an N row, it finds the rows that meet all of the following conditions:
have type S
have the same ID as the N row in question
have a discharge date more than 5 days before the admission date of the N row in question
Using which() captures the row numbers that meet these criteria. I append these row numbers to the vector and, after the loop, remove them from the original df. I also renumber the rows of the new df to get the following output for example_df_new:
ID admission_date discharge_date type
1 22 16140 16145 S
2 22 16145 16157 N
3 52 16241 16251 N
So we've preserved the 2 rows you wanted to keep, but now we have this bottom row that we want to get rid of. I do this in the second loop which iterates over the rows in the new reduced df:
del_vec2 <- vector()
for (i in 1:nrow(example_df_new)) {
  if (example_df_new[i, "type"] == "S") {
    next
  }
  if (example_df_new[i, "type"] == "N") {
    # Surviving S rows for the same patient
    add_on_two <- which(example_df_new["type"] == "S" & example_df_new["ID"] == example_df_new[i, "ID"])
  }
  if (length(add_on_two) != 0) {
    next
  } else {
    del_vec2 <- append(del_vec2, i)
  }
}
example_df_3 <- example_df_new[-c(del_vec2), ]
example_df_3
Again, we tell the loop to skip the S rows — whichever ones made the first cut should stay in. Now when the loop encounters an N row we ask the loop to look for rows that meet the following criteria:
is type S
has the same ID as the N row in question
Again I use which() to save the positions of these rows. If these criteria are met then we skip ahead - we want to keep all the N's that have an appropriate S companion. If not then we add the row number of (i) - that is the row number for the N in question to our vector of rows that we want to delete.
We then delete those rows and end up with the desired output:
ID admission_date discharge_date type
1 22 16140 16145 S
2 22 16145 16157 N
At this point you can change the date columns back to a date format.
Again, while this may be the first, I expect it's not the best solution. I hope to see an improved solution, but the problem is more tricky than it appears at first.
After attempting to filter within the same data frame, I decided to separate the data into two tables: one containing only rows of type "S" and the other containing only rows of type "N." Then I joined the two tables on the ID column, which pairs every N row with every S row of the same patient. While this creates more rows than before, I was then able to compare the two dates of interest. The resulting data frame contains only one row: the entry of a patient with an admission date of type "N" within 5 days after a discharge date of type "S."
The code in R is as follows:
library(dplyr)
example_df <- data.frame(ID = c(22,22,22,52,52,52),
admission_date = c("2013-10-03","2014-03-11","2014-03-16","2012-02-08","2014-06-10","2014-06-20"),
discharge_date = c("2013-10-11","2014-03-16","2014-03-28","2012-02-13","2014-06-12","2014-06-30"),
type = c('S','S','N','S','S','N'))
N_only <- example_df %>%
  filter(type == "N")
S_only <- example_df %>%
  filter(type == "S")
# Columns ending in .x come from N_only, columns ending in .y from S_only
example_df_merged <- merge(N_only, S_only, by = "ID")
example_df_merged$admission_date.x <- as.Date(as.character(example_df_merged$admission_date.x), format="%Y-%m-%d")
example_df_merged$discharge_date.y <- as.Date(as.character(example_df_merged$discharge_date.y), format="%Y-%m-%d")
# Days from the S discharge to the N admission; non-negative when the discharge comes first
example_df_merged$dateDiff <- as.numeric(example_df_merged$admission_date.x - example_df_merged$discharge_date.y)
example_df_final <- example_df_merged %>%
  filter(dateDiff <= 5 & dateDiff >= 0)
For clearer variable names, I would have changed the variables ending in ".x" and ".y," but that is not necessary.
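If the goal is to get back the original S and N rows (rows 2 and 3 in the question) rather than one merged row, the merge approach can be extended as in the following base R sketch. The helper column row and the suffixes .S and .N are names chosen for this illustration only.

```r
example_df <- data.frame(
  ID = c(22, 22, 22, 52, 52, 52),
  admission_date = as.Date(c("2013-10-03", "2014-03-11", "2014-03-16",
                             "2012-02-08", "2014-06-10", "2014-06-20")),
  discharge_date = as.Date(c("2013-10-11", "2014-03-16", "2014-03-28",
                             "2012-02-13", "2014-06-12", "2014-06-30")),
  type = c("S", "S", "N", "S", "S", "N")
)
example_df$row <- seq_len(nrow(example_df))  # remember original positions

s_rows <- example_df[example_df$type == "S", ]
n_rows <- example_df[example_df$type == "N", ]

# Every S/N pairing within a patient
pairs <- merge(s_rows, n_rows, by = "ID", suffixes = c(".S", ".N"))

# Days from the S discharge to the N admission
gap <- as.numeric(pairs$admission_date.N - pairs$discharge_date.S)
ok <- pairs[gap >= 0 & gap <= 5, ]

# Keep the original rows that take part in a qualifying pair
keep <- sort(unique(c(ok$row.S, ok$row.N)))
result <- example_df[keep, names(example_df) != "row"]
```

Because the dates stay as Date objects throughout, no conversion back from numeric is needed at the end.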

Fastest way to assign values in data frame to matrix in R?

I have a very large data frame with Timestamp, StationId and Value as column names.
I would like to create a new matrix where the rows are Timestamps, columns are StationIds and the matrix elements are Values.
I have tried doing so using a loop but it is taking very long:
for (row in 1:nrow(res))
{
rmatrix[toString(res[row,"Timestamp"]),toString(res[row,"StationId"])] <-
res[row,"Value"]
}
The timestamps in the 'res' data frame cover a year at 5-minute intervals. There are 62 unique station ids. The elements in the Value column are rainfall values.
In the rmatrix I'm trying to build, each row is a unique timestamp at a 5-minute interval and each column is the id of a station. The elements of the matrix are supposed to be the rainfall value for that station at that time.
Is there a faster way to do this?
library(tidyverse)
# One row per Timestamp, one column per StationId, cells filled from Value
df <- res %>% spread(StationId, Value)
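If an actual matrix is wanted and the row-by-row loop is the bottleneck, one base R option is to assign all cells at once with a two-column index matrix. The sketch below uses a tiny made-up res (the timestamps, station ids, and values are assumptions) with the column names from the question.

```r
# Toy stand-in for res (hypothetical values)
res <- data.frame(
  Timestamp = rep(c("2015-01-01 00:00", "2015-01-01 00:05"), each = 3),
  StationId = rep(c("S1", "S2", "S3"), times = 2),
  Value     = c(0.0, 0.2, 0.0, 1.4, 0.0, 0.6)
)

timestamps <- sort(unique(res$Timestamp))
stations   <- sort(unique(res$StationId))

# Empty matrix with timestamps as row names and station ids as column names
rmatrix <- matrix(NA_real_, nrow = length(timestamps), ncol = length(stations),
                  dimnames = list(timestamps, stations))

# A two-column index matrix assigns every (row, column) pair in one call
idx <- cbind(match(res$Timestamp, timestamps), match(res$StationId, stations))
rmatrix[idx] <- res$Value
```

Any timestamp and station combination that does not occur in res stays NA.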

R find X percent growth within Y time frame

I have a R data frame with 3 columns
timestamp
Category
Value
I am trying to find an elegant way (ideally) to find where the values increased, or decreased, by X percent within a specified time frame. For example, I'd like to know all points in the data where Value increased by 50% or more within 1 week.
Are there any built-in functions or packages where I can just pass a percentage and a number of days and have it return which rows in the data frame are a match?
Something along these lines (pseudocode below):
RowsThatareAMatch <- findmatches(date=MyDF$Timestamp, grouping=MyDF$Category, data=MyDF$Value, growth=0.5, range=7)
The thing that is throwing me off is that I want it to return the rows for each Category that has matching values, and not just look at every value in the data frame. So if Categories A and B had growth of 50% or more within 7 days 8 times in my data, I want those rows returned, and if categories C, D, and E never had that kind of increase, I don't want data from those categories returned at all.
Right now I am looking at systematically splitting the data frame into multiple data frames for each category and then doing the analysis on each individual data frame. While that approach could work, something is telling me that R has an easier way to do this.
Thoughts?
edit: Ideally what I am looking for returned is a data frame with 3 columns, and 1 row for every match in my data.
Category
Start timestamp of the match
End timestamp of the match.
Based on my experience with R, I would need to identify the row numbers for each grouping and could then extract the above data from the original data frame, but if there's a good way to go straight to the above output, that would be awesome too!
Sample Data
So I have a CSV like this:
Timestamp,Category,Value
2015-01-01,A,1
2015-01-02,A,1.2
2015-01-03,A,1.3
2015-01-04,A,8
2015-01-05,A,8.2
2015-01-06,A,9
2015-01-07,A,9.2
2015-01-08,A,10
2015-01-09,A,11
2015-01-01,B,12
2015-01-02,B,12.75
2015-01-03,B,15
2015-01-04,B,60
2015-01-05,B,62.1
2015-01-06,B,63
2015-01-07,B,12.3
2015-01-08,B,10
2015-01-09,B,11
2015-01-01,C,100
2015-01-02,C,100000
2015-01-03,C,200
2015-01-04,C,350
2015-01-05,C,780
2015-01-06,C,780.2
2015-01-07,C,790
2015-01-08,C,790.3
2015-01-09,C,791
2015-01-01,D,0.5
2015-01-02,D,0.8
2015-01-03,D,0.83
2015-01-04,D,2
2015-01-05,D,0.01
2015-01-06,D,0.03
2015-01-07,D,0.99
2015-01-08,D,1.23
2015-01-09,D,5
I would read that into R like this
df <- read.csv("CategoryMeasurements.csv", header=TRUE)
Say your data.frame is called df. You can do something like this using data.table, which creates a new column that reads "increase over 50%" if the value grew by more than 50% relative to the previous row in its category (which you can then filter):
lag <- function(x, n) c(rep(NA, n), x[1:(length(x) - n)])
library(data.table)
setDT(df)[, ifelse(Value/lag(Value, 1) - 1 > 0.5, "increase over 50%", "Other"), by = Category]
Well, I'm not sure how elegant this is, but it works. I wound up having to subset by category before passing the data frame to my function, and I will need a loop or one of the apply functions to pass each category to it, but it should get the job done.
Mydf <- read.csv("CategoryMeasurements.csv", header=TRUE)
GetIncreasesWithinRange <- function(df, growth, days) {
  # df = data frame with the data to process. 1st column should be a date, 2nd column should be the data.
  # growth = fraction of growth you are looking for in the data (0.5 = 50%)
  # days = the number of days that the growth should occur in to be a match.
  df <- df[order(df[,1]), ] # Sort the df by the date column. This is important for the loop logic.
  # Initialize empty data frame to hold results that will be returned from this function.
  ReturnDF <- data.frame(StartDate=as.Date(character()),
                         EndDate=as.Date(character()),
                         Growth=double(),
                         stringsAsFactors=FALSE)
  TotalRows <- nrow(df)
  for(i in 1:TotalRows) {
    StartDate <- toString(df[i,1])
    StartValue <- df[i,2]
    for(x in i:TotalRows) {
      NextDate <- toString(df[x,1])
      DayDiff <- as.numeric(difftime(NextDate, StartDate, units = "days"))
      if(DayDiff >= days) {
        NextValue <- df[x,2]
        # Growth relative to the starting value
        PercentChange <- (NextValue - StartValue)/StartValue
        if(PercentChange >= growth) {
          ReturnDF[(nrow(ReturnDF)+1),] <- list(StartDate, NextDate, PercentChange)
        }
        break
      }
    }
  }
  return(ReturnDF)
}
subDF <- Mydf[which(Mydf$Category=='A'), ]
subDF$Category <- NULL # Drop the category column from the subset. It's not relevant for this.
X <- GetIncreasesWithinRange(subDF, 0.5, 4)
print(X)
Which outputs
StartDate EndDate Growth
1 2015-01-01 2015-01-05 7.200000
2 2015-01-02 2015-01-06 6.500000
3 2015-01-03 2015-01-07 6.076923
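To cover every category in one call and return the Category / start / end layout asked for in the question, one possible base R sketch is shown below. It is not the function above: the name find_growth is made up for this example, it measures growth relative to the starting value, and it checks every pair of rows at most days apart, which is quadratic in the number of rows per category and so only suits modest data sizes. The sample data is a small hypothetical set with the same column names as the CSV.

```r
# Small hypothetical sample in the same shape as the CSV
df <- data.frame(
  Timestamp = as.Date(rep(c("2015-01-01", "2015-01-03", "2015-01-05"), 2)),
  Category  = rep(c("A", "B"), each = 3),
  Value     = c(1, 1.2, 8, 10, 11, 12)
)

find_growth <- function(d, growth, days) {
  d <- d[order(d$Timestamp), ]
  n <- nrow(d)
  # All (start, end) row pairs within one category
  p <- expand.grid(start = seq_len(n), end = seq_len(n))
  gap <- as.numeric(d$Timestamp[p$end] - d$Timestamp[p$start])
  change <- (d$Value[p$end] - d$Value[p$start]) / d$Value[p$start]
  # Keep pairs where the end is later, within the window, and grew enough
  hit <- gap > 0 & gap <= days & change >= growth
  data.frame(Category = d$Category[p$start[hit]],
             Start    = d$Timestamp[p$start[hit]],
             End      = d$Timestamp[p$end[hit]])
}

# Run the search per category and stack the results
matches <- do.call(rbind, lapply(split(df, df$Category), find_growth,
                                 growth = 0.5, days = 7))
rownames(matches) <- NULL
```

Categories with no qualifying pair contribute zero rows, so they drop out of the result on their own.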

Create a stack of n subset data frames from a single data frame based on date column

I need to create a set of subset data frames out of a single big df, based on a date column (e.g. "Aug 2015", in month-year format). It should work like the subset function, except that the number of subset dfs should change dynamically depending on the values available in the date column.
All the subset data frames need to have the same structure, such that the date column holds a single value within each subset df.
For example, if my big df currently has the last 10 months of data, I need 10 subset data frames now, and 11 dfs if I run the same command next month (with 11 months of base data).
I have tried something like the code below, but after each iteration the subset subdf_i is overwritten. Thus I end up with only one subset df, which holds the last value of the month column.
I expected it to create 45 subset dfs, subdf_1, subdf_2, ... and subdf_45, one for each of the 45 unique values of the month column.
uniqmnth <- unique(df$mnth)
for (i in 1:length(uniqmnth)){
subdf_i <- subset(df, mnth == uniqmnth[i])
i==i+1
}
I hope there is some option in the subset function for this, or that some kind of loop might do it. I am a beginner in R and not sure how to arrive at this.
I think a good solution for this is to use assign(), so that the iterating variable i is appended to the name of each of the 45 subsets. Thanks to my friend for the note. Here is the solution, which avoids the subset data frame being overwritten on each run of the loop.
uniqmnth <- unique(df$mnth)
for (i in 1:length(uniqmnth)){
  # assign() builds the object name subdf_<i> at run time; the for loop advances i by itself
  assign(paste("subdf_", i, sep=""), subset(df, mnth == uniqmnth[i]))
}
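A commonly used alternative to creating numbered objects is to keep all subsets in one named list with split(). The sketch below assumes a column called mnth as in the question; the sales column and the values are made up for illustration.

```r
# Hypothetical data; the real df and its mnth values are not shown
df <- data.frame(
  mnth  = c("Jun 2015", "Jun 2015", "Jul 2015", "Aug 2015", "Aug 2015"),
  sales = c(10, 20, 30, 40, 50)
)

# One list element per distinct month, however many months there are
subdfs <- split(df, df$mnth)

names(subdfs)            # the month labels
subdfs[["Aug 2015"]]     # the subset for one month
sapply(subdfs, nrow)     # apply the same step to every subset
```

The list grows by itself when a new month appears in the data, and lapply() or sapply() can then process all subsets without referring to them by name.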
