R data frame - row number increases - r

Assuming we have a data frame (the third column is a date column) that contains observations of irregular events starting from Jan 2000 to Oct 2011. The goal is to choose those rows of the data frame that contain observations between two dates
start<-"2005/09/30"
end<-"2011/01/31"
The original data frame contains about 21 000 rows. We can verify this using
length(df_original$date_column).
We now create a new data frame that contains dates newer than the start date:
df_new<-df_original[df_original$date_column>start,]
If I check the length using length(df_new$date_column) it shows about 13 000 for the length.
Now we create another data frame applying the second criterion (smaller than end date):
df_new2<-df_new[df_new$date_column<end,]
If I check again the length using length(df_new2$date_column) it shows about 19 000 counts for the length.
How is it possible that by applying a second criterion on the new data frame df_new the number of rows increases? The df_new should have a number of rows equal or below the 13 000.
The data frame is quite large such that I cannot post it here. Maybe someone can provide a reason under which circumstances this behavior occurs.

The following example works fine for me:
df_original = data.frame(date_column = seq(as.Date('2000/01/01'), Sys.Date(), by=1), value = 1)
start = as.Date('2005/09/30')
end = as.Date('2011/01/31')
df_new = df_original[df_original$date_column>start,]
df_new2 = df_new[df_new$date_column<end,]
> dim(df_original)
[1] 4316 2
> dim(df_new)
[1] 2216 2
> dim(df_new2)
[1] 1948 2
Without seeing an example of your actual data, I would suggest 2 things to look out for:
Make sure your dates are coded as dates.
Make sure you aren't accidentally indexing by row name. This is a common culprit for the behavior you're talking about.

Can you get the results you want via one subset command?
df_new <- df_original[with(df_original, date_column>start & date_column<end),]
# or
df_new <- subset(df_original, date_column>start & date_column<end)

can you give us dput(head(df_original))? which shares with us the first 5 records and their data structure. I am suspicious something is up with the format of your date_column.
If you are storing start and end both as strings (which your example seems to indicate) and the date column is also a string, then you will not be able to use < or > to compare values of dates. So somewhere you need to validate that everything being compared is known by R to be dates.

Related

R: Change the Contents of Column A based on Column B

I am cleaning some data in R, and I am imputing different values for some outliers that are clearly not correct, so I am doing the following:
dat$colA[dat$colA > 10000] <- quantile(dat$colA, c(.95))
This changes the values of two columns. Now, I want to change the contents of another column based on what I changed here. For example, the above line changed the values for rows 24 and 676. Now, I want to impute a constant value in a different column for rows 24 and 676, but I don't want to hard code it. I'd like to perform some sort of indexing to do so. How can I do this in R?
In other words, I want to set colB to 1 for rows 24 and 676. How can I do this by referencing the values in colA?
Create an index i telling where in colA are the changes to take place, then use and reuse the index as many times as you want.
i <- which(dat$colA > 10000)
dat$colA[i] <- quantile(dat$colA, 0.95)
dat$colB[i] <- 1

Comparing dates in different columns to isolate certain within-group entries in R

I have a data frame with an ID column that includes duplicates. There is a column called type that takes the values "S" or "N." There are two additional date columns - admission date and discharge date. My question is a bit similar to comparing two data frames and isolating rows based on certain date differences, but not quite. If needed, I could separate my data into two data frames, but I'm wondering if I can accomplish what I want without the extra steps.
Here is a small example of what the data for two patients looks like in R:
example <- data.frame(ID = c(22,22,22,52,52,52),
admission_date = c("2013-10-03","2014-03-11","2014-03-16","2012-02-08","2014-06-10","2014-06-20"),
discharge_date = c("2013-10-11","2014-03-16","2014-03-28","2012-02-13","2014-06-12","2014-06-30"),
type = c('S','S','N','S','S','N'))
What I want to do is compare within patients, entries that take the value "N" and entries that take the value "S" in the type variable. Based on the discharge date for entries with the value "S," I would like to find entries with the value "N" that have an admission date within 5 days of the former's discharge date (the discharge date with value "S" should be before the admission date with value "N").
So in the example data frame, the only two entries that should be retained are rows 2 and 3 and not rows 5 and 6 since the difference between admission date and discharge date is greater than 5.
Does anyone have any suggestions of how I can filter this data? Any help is greatly appreciated.
This was an interesting challenge. One reason for this is because iterating over rows is less intuitive than iterating over columns (see this question for lots of suggestions: For each row in an R dataframe).
Now I know vectorized solutions are preferred over for loops, but one of the challenges with this problem was that instead of just performing functions on each row, we're comparing the iterated rows to other rows and deleting some rows as we go along. I expect there's a better solution out there and I hope someone posts a better solution to help me learn.
One minor note before I begin, "example" isn't a great name for an object because it's also a function in base R. Additionally, the solution is much easier if we're only dealing with alternating rows of "S" and "N" - that is if many S's precede an N then only the lowest S might be within 5 days of N. Nonetheless it was worth the effort to attack the more challenging case.
Ultimately I ended up solving this as a 2-stage problem, each solved with a for loop. First, I took out all the S rows which weren't within 5 days of the corresponding N rows. Then I took out those N rows which didn't have any appropriate S companions. All of this is implemented in base R.
So to begin:
example_df <- data.frame(ID = c(22,22,22,52,52,52),
admission_date = c("2013-10-03","2014-03-11","2014-03-16","2012-02-08","2014-06-10","2014-06-20"),
discharge_date = c("2013-10-11","2014-03-16","2014-03-28","2012-02-13","2014-06-12","2014-06-30"),
type = c('S','S','N','S','S','N'))
example_df$admission_date<-as.numeric(as.Date(example_df$admission_date))
example_df$discharge_date<-as.numeric(as.Date(example_df$discharge_date))
The first thing I did was to take the date columns (which were characters) and convert them to numeric based on date. Originally I was doing mathematical operations with date objects, but this became complicated with the subsetting operations I ended up using.
Here's the first for loop:
del_vec <- vector("integer")
for( i in 1:nrow(example_df)) {
if (example_df[i,"type"]== "S") {
next
}
if (example_df[i,"type"] == "N") {
add_on <- which
(
example_df["type"] == "S" &
example_df["ID"]==example_df[i,"ID"] &
example_df["discharge_date"] < (example_df[i,"admission_date"] - 5)
)
}
del_vec<- append(del_vec,add_on)
}
example_df_new <- example_df[-c(del_vec),]
rownames(example_df_new) <- 1:nrow(example_df_new)
example_df_new
What I did here is start by creating a vector which will contain the row numbers that we delete. To get rid of the inappropriate S rows we need to actually work on the N rows, so I have the loop skip the S rows. Then when the loop encounters an N row, we find the rows which meet the following conditions:
have type S
have the same ID as the N row in question
have a discharge date which is more than 5 days from the admission date for the N row in question
Using which()captures the row numbers that meet these criteria. Now I add these rows to the empty vector and remove them from the original df. I also rename the rows of the new df to get the following output for example_df_new
ID admission_date discharge_date type
1 22 16140 16145 S
2 22 16145 16157 N
3 52 16241 16251 N
So we've preserved the 2 rows you wanted to keep, but now we have this bottom row that we want to get rid of. I do this in the second loop which iterates over the rows in the new reduced df:
del_vec2 <- vector()
for(i in 1:nrow(example_df_new)) {
if (example_df_new[i,"type"]=="S") {
next
}
if (example_df_new[i,"type"] == "N") {
add_on_two <- which(example_df_new["type"] == "S" & example_df_new["ID"] == example_df_new[i,"ID"])
}
if(length(add_on_two !=0)) {
next
} else {
del_vec2 <- append(del_vec2,i)
}
}
example_df_3<-example_df_new[-c(del_vec2),]
example_df_3
Again, we tell the loop to skip the S rows — whichever ones made the first cut should stay in. Now when the loop encounters an N row we ask the loop to look for rows that meet the following criteria:
is type S
has the same ID as the N row in question
Again I use which() to save the positions of these rows. If these criteria are met then we skip ahead - we want to keep all the N's that have an appropriate S companion. If not then we add the row number of (i) - that is the row number for the N in question to our vector of rows that we want to delete.
We then delete those rows and end up with the desired output:
ID admission_date discharge_date type
1 22 16140 16145 S
2 22 16145 16157 N
At this point you can change the date columns back to a date format.
Again, while this may be the first, I expect it's not the best solution. I hope to see an improved solution, but the problem is more tricky than it appears at first.
After attempting to filter within the same data frame, I decided to separate the data into two tables: one containing only data of type "S" and the other containing only data of type "N." Then, I did a full join while matching on the ID column. While this creates a greater number of rows than before, I was then able to compare the two date of interest. The resulting data frame contains only one row - the entry of a patient with an admission date with type "N" within 5 days of a discharge date with type "S."
The code in R is as follows:
library(dplyr)
example_df <- data.frame(ID = c(22,22,22,52,52,52),
admission_date = c("2013-10-03","2014-03-11","2014-03-16","2012-02-08","2014-06-10","2014-06-20"),
discharge_date = c("2013-10-11","2014-03-16","2014-03-28","2012-02-13","2014-06-12","2014-06-30"),
type = c('S','S','N','S','S','N'))
N_only <- example_df %>%
filter(type == "N")
S_only <- example_df %>%
filter(type == "S")
example_df_merged <- merge(N_only, S_only, by = "ID")
example_df_merged$admission_date.x <- as.Date(as.character(example_df_merged$admission_date.x), format="%Y-%m-%d")
example_df_merged$discharge_date.y <- as.Date(as.character(example_df_merged$discharge_date.y), format="%Y-%m-%d")
example_df_merged$dateDiff <- example_df_merged$discharge_date.y - example_df_merged$admission_date.x
example_df_final <- example_df_merged %>%
filter(dateDiff <= 5 & dateDiff >= 0)
For clearer variable names, I would have changed the variables ending in ".x" and ".y," but that is not necessary.

How to make smaller subsets based upon a fixed number of rows repeating over the dataframe

My Problem:
I have a dataframe consisting of 86016000 rows of observations:
there are 512000 observations for each hour
there are 24 hours data for seven days
So 24*7*512000 = 86016000
there are 40 columns (variables)
There is no column of date or datetimestamp
Only row numbers are good enough to identify how many obs. for each day, and there are no errors in recording of this data.
Given such a large dataset, what I want to do is create subsets of 12288000 (i.e. 24 * 512000) rows, so that we have 7 each day's subset.
What I tried:
d <- split(PltB_Fold3_1_Data, rep(1:12288000, each=7))
But unfortunately after almost half an hour, I termicated the process as there was no result.
Is there any better solution then the one above?
You're probably looking for seq rather than rep. With seq, you can generate a sequence of numbers from 0 to 86016000 incremented by 12288000.
To save resources, you can then use this sequence to generate temporary data frames and do whatever you want with each.
sequence <- seq(from = 0, to = 86016000, by = 12288000)
for(i in 1:(length(sequence)-1)){
temp <- df[sequence[i]+1:sequence[i+1], ]
# do something here with your temporary data frame
}

Create a stack of n subset data frames from a single data frame based on date column

I need to create a bunch of subset data frames out of a single big df, based on a date column (e.g. - "Aug 2015" in month-Year format). It should be something similar to the subset function, except that the count of subset dfs to be formed should change dynamically depending upon the available values on date column
All the subsets data frames need to have similar structure, such that the date column value will be one and same for each and every subset df.
Suppose, If my big df currently has last 10 months of data, I need 10 subset data frames now, and 11 dfs if i run the same command next month (with 11 months of base data).
I have tried something like below. but after each iteration, the subset subdf_i is getting overwritten. Thus, I am getting only one subset df atlast, which is having the last value of month column in it.
I thought that would be created as 45 subset dfs like subdf_1, subdf_2,... and subdf_45 for all the 45 unique values of month column correspondingly.
uniqmnth <- unique(df$mnth)
for (i in 1:length(uniqmnth)){
subdf_i <- subset(df, mnth == uniqmnth[i])
i==i+1
}
I hope there should be some option in the subset function or any looping might do. I am a beginner in R, not sure how to arrive at this.
I think the perfect solution for this might be use of assign() for the iterating variable i, to get appended in the names of each of the 45 subsets. Thanks for the note from my friend. Here is the solution to avoid the subset data frame being overwritten each run of the loop.
uniqmnth <- unique(df$mnth)
for (i in 1:length(uniqmnth)){
assign(paste("subdf_",i,sep=""), subset(df, mnth == uniqmnth[i])) i==i+1
}

Subset column based on a range of time

I am trying to subset a data frame based on a range of time. Someone has asked this question in the past and the answer was to use R CMD INSTALL lubridate_1.3.1.tar.gz (see link: subset rows according to a range of time.
The issue with this answer is that I get the following warning:
> install.packages("lubridate_1.3.2.tar.gz")
Warning in install.packages :
package ‘lubridate_1.3.2.tar.gz’ is not available (for R version 3.1.2)
I am looking for something very similar to this answer but I cannot figure out how to do this. I have a MasterTable with all of my data organized into columns. One of my columns is called maxNormalizedRFU.
My question is simple:
How can I subset my maxNormalizedRFU column by time?
I would simply like to add another column which only displays the maxNormalizedRFU the data between 10 hours and 14 hours. Here is what I have up to now:
#Creates the master table
MasterTable <- inner_join(LongRFU, LongOD, by= c("Time.h", "Well", "Conc.nM", "Assay"))
#normalizes my data by fluorescence (RFU) and optical density (OD) based on 6 different subsets called "Assay"
MasterTable$NormalizedRFU <- MasterTable$AvgRFU/MasterTable$AvgOD
#creates a column that only picks the maximum value of each "Assay"
MasterTable <- ddply(MasterTable, .(Conc.nM, Assay), transform, maxNormalizedRFU=max(NormalizedRFU))
#The issue
MasterTable$CutmaxNormalizedRFU <- ddply(maxNormalizedRFU, "Time.h", transform, [MasterTable$Time.h < 23.00 & MasterTable$Time.h > 10.00,])
Attached is a sample of my dataset. Since the original file has over 90 000 lines, I have only attached a small fraction of it (only one assay and one concentration).
My line is currently using ddply to do the subset but this simply does not work. Does anyone have a suggestion as to how to fix this issue?
Thank you in advance!
Marty
I downloaded your data and had a look. If I am not mistaken, all you need is to subset data using Time.h. Here you have a range of time (10-23) you want. I used dplyr and did the following. You are asking R to pick up rows which have values between 10 and 23 in Time.h. Your data frame is called mydf here.
library(dplyr)
filter(mydf, between(Time.h, 10, 23))

Resources