Subset column based on a range of time - r

I am trying to subset a data frame based on a range of time. Someone asked this question in the past, and the answer was to install lubridate from source with R CMD INSTALL lubridate_1.3.1.tar.gz (see: subset rows according to a range of time).
The issue with this answer is that I get the following warning:
> install.packages("lubridate_1.3.2.tar.gz")
Warning in install.packages :
package ‘lubridate_1.3.2.tar.gz’ is not available (for R version 3.1.2)
I am looking for something very similar to this answer but I cannot figure out how to do this. I have a MasterTable with all of my data organized into columns. One of my columns is called maxNormalizedRFU.
My question is simple:
How can I subset my maxNormalizedRFU column by time?
I would simply like to add another column which only displays the maxNormalizedRFU data between 10 hours and 14 hours. Here is what I have so far:
#Creates the master table
MasterTable <- inner_join(LongRFU, LongOD, by= c("Time.h", "Well", "Conc.nM", "Assay"))
#normalizes my data by fluorescence (RFU) and optical density (OD) based on 6 different subsets called "Assay"
MasterTable$NormalizedRFU <- MasterTable$AvgRFU/MasterTable$AvgOD
#creates a column that only picks the maximum value of each "Assay"
MasterTable <- ddply(MasterTable, .(Conc.nM, Assay), transform, maxNormalizedRFU=max(NormalizedRFU))
#The issue
MasterTable$CutmaxNormalizedRFU <- ddply(maxNormalizedRFU, "Time.h", transform, [MasterTable$Time.h < 23.00 & MasterTable$Time.h > 10.00,])
Attached is a sample of my dataset. Since the original file has over 90 000 lines, I have only attached a small fraction of it (only one assay and one concentration).
My line currently uses ddply to do the subset, but this simply does not work. Does anyone have a suggestion as to how to fix this issue?
Thank you in advance!
Marty

I downloaded your data and had a look. If I am not mistaken, all you need is to subset the data using Time.h over the range of time (10-23) you want. I used dplyr and did the following, which asks R to pick up the rows whose Time.h values fall between 10 and 23. Your data frame is called mydf here.
library(dplyr)
filter(mydf, between(Time.h, 10, 23))
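If the goal is a new column rather than a filtered copy of the data, the same between() test can drive mutate(); a minimal sketch with a made-up MasterTable containing only the two relevant columns:

```r
library(dplyr)

# Made-up stand-in for MasterTable with only the two relevant columns
MasterTable <- data.frame(
  Time.h = c(5, 12, 18, 25),
  maxNormalizedRFU = c(1.1, 2.2, 3.3, 4.4)
)

# New column: maxNormalizedRFU inside the 10-23 h window, NA outside it
MasterTable <- mutate(
  MasterTable,
  CutmaxNormalizedRFU = ifelse(between(Time.h, 10, 23), maxNormalizedRFU, NA)
)
```

Rows outside the window keep their other columns intact, which is often easier to plot than a physically subsetted table.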

Related

How to calculate the average of different groups in a dataset using R

I have a dataset in R that I would like to find the average of a given variable for each year in the dataset (here, from 1871-2019). Not every year has the same number of entries, and so I have encountered two problems: first, how to find the average of the variable for each year, and second, how to add the column of averages to the dataset. I am unsure how to approach the first problem, but I attempted a version of the second problem by simply finding the sum of each group and then trying to add those values to the dataset for each entry of a given year with the code teams$SBtotal <- tapply(teams$SB, teams$yearID, FUN=sum). That code resulted in an error that notes replacement has 149 rows, data has 2925. I know that this can be done less quickly in Excel, but I'm hoping to be able to use R to solve this problem.
tapply should work; for example, with the built-in iris data:
data(iris)
tapply(iris$Sepal.Length, iris$Species, FUN = sum)
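Since the question asks for averages that go back into the data set (the source of the "replacement has 149 rows" error), ave() is the base-R tool that returns one value per row instead of one per group; a sketch on the built-in iris data standing in for the teams data:

```r
data(iris)

# One value per group: a short vector that cannot be assigned as a column
tapply(iris$Sepal.Length, iris$Species, FUN = mean)

# One value per row: same length as the data, so assignment works
iris$Sepal.Length.avg <- ave(iris$Sepal.Length, iris$Species, FUN = mean)
```

For the original problem, the analogous call would be ave(teams$SB, teams$yearID, FUN = mean).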

Subsetting dates from colnames

I have a dataframe as follows:
TAS1 2000 obs. of 9862 variables
Each of these variables (columns) represent daily temperatures from 1979-01-01 to 2005-12-31. The colnames have been set with these dates. I now wish to separate the dataframe into twelve separate monthly data frames - containing Jan, Feb, Mar etc.
I have tried:
TAS1.JAN = subset(TAS1, grepl("-01-"), colnames(TAS1))
But get the error:
Error in grepl("-01-") : argument "x" is missing, with no default
Is there a relatively quick solution for this? I feel there must be but haven't cracked it despite trying various solutions.
I would subset January data like below.
Jan_df <- subset(MyDatSet, select=(grepl("-01-", colnames(MyDatSet))))
I have assumed that your parent dataset is called MyDatSet and a pattern "-01-" defines that it is January data.
You may repeat the process for the other 11 months or come up with an intelligent loop.
As Roland suggested in the comments, I would opt for a melting mechanism too. However, since I do not know your use case, here you go, based on what you posted and asked for.
As your error says, you are missing an argument there:
tas1.jan <- subset(df, grepl("-01-", df$tas1))
Another way to do it with the help of stringr and dplyr would be:
library(stringr)
library(dplyr)
tas1.jan <- df %>% filter(str_detect(tas1, "-01-"))
Downside of this approach: you need to run a loop or do this 12 times for all the months.
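The loop mentioned above could look like the following sketch, assuming the parent data frame is called MyDatSet and its column names are dates formatted as YYYY-MM-DD (the data here are made up):

```r
# Made-up stand-in: one year of daily-temperature columns named by date
dates <- seq(as.Date("1979-01-01"), by = "day", length.out = 365)
MyDatSet <- as.data.frame(matrix(rnorm(10 * 365), nrow = 10))
colnames(MyDatSet) <- format(dates, "%Y-%m-%d")

# One data frame per month, matching "-01-" through "-12-" in the names
monthly <- lapply(sprintf("-%02d-", 1:12), function(pat) {
  MyDatSet[, grepl(pat, colnames(MyDatSet)), drop = FALSE]
})
names(monthly) <- month.abb  # "Jan", "Feb", ...
```

Keeping the twelve frames in a named list avoids twelve separate variables and makes later per-month processing a simple lapply.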

missing values for each participant in the study

I am working in R. What I want to do is make a table or a graph that represents, for each participant, their missing values; i.e. I have 4700+ participants, and for each question there are between 20-40 missings. I would like to represent the missings in such a way that I can see who the people are that did not answer the questions, and possibly look at whether there is a pattern in the missing values. I have done the following:
Count of complete cases in a data frame named 'mydata'
sum(complete.cases(mydata))
Count of incomplete cases
sum(!complete.cases(mydata$Variable1))
Which cases (row numbers) are incomplete?
which(!complete.cases(mydata$Variable1))
I then got a list of numbers (at first I thought these were the patient numbers, but then I noticed that this is not the case, and now I am not quite sure how to interpret them).
I also tried making subsets with only the missings, but then I literally only see how many missings there are, not who they are from.
Could somebody help me? Thanks!
Zas
If there is a column that can distinguish a row in the data.frame mydata, say patient numbers patient_no, then you can easily find the patient numbers of the people with missing data:
> mydata <- data.frame(patient_no = 1:5, variable1 = c(NA,NA,1,2,3))
> mydata[!complete.cases(mydata$variable1),'patient_no']
[1] 1 2
If you want to consider the pattern in which the users have missed a particular question, then this might be useful for you:
Assumption: except column 1, all other columns correspond to questions.
> lapply(mydata[,-1],function(x){mydata[!complete.cases(x),'patient_no']})
Remember that R automatically attaches numbers to the observations in your data set. For example, if your data has 20 observations (20 rows), R attaches the numbers 1 to 20; these are the row numbers and are not part of your original data. The results produced by which(!complete.cases(mydata$Variable1)) correspond to those numbers: they are the rows of your data set that have missing data in that column.
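Putting the two answers together: the row numbers returned by which() can be turned back into participant IDs by indexing the ID column with them; a small sketch with made-up patient numbers:

```r
# Made-up data: five participants, one with-gaps question column
mydata <- data.frame(
  patient_no = c(101, 102, 103, 104, 105),
  Variable1  = c(NA, 2, NA, 4, 5)
)

# Row numbers of the incomplete cases...
rows <- which(!complete.cases(mydata$Variable1))

# ...mapped back to the actual patient numbers
mydata$patient_no[rows]
```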

aggregate data.frame with formula and date variable goes wrong

I want to aggregate and count how often each kind of disease appears in my dataset on a given date. (I don't use duplicated() because I want all rows, not only the duplicated ones.)
My original data set looks like:
id dat kinds kind
AE00302 2011-11-20 valv 1
AE00302 2011-10-31 vask 2
(of course my data.frame is much larger)
I try this:
xagg<-aggregate(kind~id+dat+kinds,subx,length)
names(xagg)<-c("id","dat","kinds","kindn")
and get:
id dat kinds kindn
AE00302 2011-10-31 valv 1
AE00302 2011-11-20 vask 1
I wonder why R goes wrong with the 'dat' and 'kinds' columns.
Has anybody an idea?
I still don't know why, but I found out that aggregate goes wrong because of columns I don't use for aggregating. Therefore these steps solve the problem for me:
# 1st step: reduce the data.frame to only the needed columns
# 2nd Step: aggregate the reduced data.frame
# 3rd Step: merge aggregated data to reduced dataset
# 4th step: remove duplicated rows from reduced dataset (if they occur)
# 5th step: merge reduced dataset without dublicated data to original dataset
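The steps above can be sketched as follows (made-up data in the shape shown in the question; the note column stands in for the unused columns that seemed to break the aggregation):

```r
# Made-up data in the shape shown in the question, plus an unused column
subx <- data.frame(
  id    = c("AE00302", "AE00302", "AE00302"),
  dat   = as.Date(c("2011-11-20", "2011-10-31", "2011-11-20")),
  kinds = c("valv", "vask", "valv"),
  note  = c("a", "b", "c")          # not used for aggregating
)

# 1st step: reduce to only the needed columns
subred <- subx[, c("id", "dat", "kinds")]
subred$kind <- 1

# 2nd step: aggregate the reduced data frame
xagg <- aggregate(kind ~ id + dat + kinds, subred, length)
names(xagg)[4] <- "kindn"

# 3rd-5th steps: merge the counts back onto the original data set
out <- merge(subx, xagg, by = c("id", "dat", "kinds"))
```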
Maybe the problem occurs if there are duplicated rows in the aggregated data.frame.
Thanks for all your help, questions and attempts to solve my problem!
elchvonoslo

R data frame - row number increases

Assume we have a data frame (the third column is a date column) that contains observations of irregular events from Jan 2000 to Oct 2011. The goal is to choose those rows of the data frame that contain observations between two dates:
start<-"2005/09/30"
end<-"2011/01/31"
The original data frame contains about 21 000 rows. We can verify this using
length(df_original$date_column).
We now create a new data frame that contains dates newer than the start date:
df_new<-df_original[df_original$date_column>start,]
If I check the length using length(df_new$date_column) it shows about 13 000 for the length.
Now we create another data frame applying the second criterion (smaller than end date):
df_new2<-df_new[df_new$date_column<end,]
If I check again the length using length(df_new2$date_column) it shows about 19 000 counts for the length.
How is it possible that applying a second criterion to the new data frame df_new increases the number of rows? df_new2 should have at most the 13 000 rows of df_new.
The data frame is quite large such that I cannot post it here. Maybe someone can provide a reason under which circumstances this behavior occurs.
The following example works fine for me:
df_original = data.frame(date_column = seq(as.Date('2000/01/01'), Sys.Date(), by=1), value = 1)
start = as.Date('2005/09/30')
end = as.Date('2011/01/31')
df_new = df_original[df_original$date_column>start,]
df_new2 = df_new[df_new$date_column<end,]
> dim(df_original)
[1] 4316 2
> dim(df_new)
[1] 2216 2
> dim(df_new2)
[1] 1948 2
Without seeing an example of your actual data, I would suggest 2 things to look out for:
Make sure your dates are coded as dates.
Make sure you aren't accidentally indexing by row name. This is a common culprit for the behavior you're talking about.
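The second point deserves an example: after the first subset, df_new keeps its original row names, so positions computed on df_original no longer line up with it, and out-of-range positions silently produce all-NA rows, which can inflate the count. A sketch with made-up dates:

```r
# After a first subset, positions and row names no longer agree
df_original <- data.frame(
  date_column = as.Date(c("2001-01-01", "2006-01-01", "2012-01-01")),
  value = 1:3
)
df_new <- df_original[df_original$date_column > as.Date("2005-09-30"), ]
rownames(df_new)  # "2" "3"

# Positions computed on df_original are wrong for df_new, and
# out-of-range positions quietly yield all-NA rows
wrong  <- df_new[which(df_original$date_column < as.Date("2011-01-31")), ]
padded <- df_new[1:3, ]  # 3 rows from a 2-row frame (one is all NA)

# Safe: compute the logical index on the same frame you subset
right <- df_new[df_new$date_column < as.Date("2011-01-31"), ]
```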
Can you get the results you want via one subset command?
df_new <- df_original[with(df_original, date_column>start & date_column<end),]
# or
df_new <- subset(df_original, date_column>start & date_column<end)
Can you give us dput(head(df_original))? That shows us the first six records and their data structure. I suspect something is up with the format of your date_column.
If you are storing start and end as strings (which your example seems to indicate) and the date column is also a string, then < and > will compare them as text rather than as dates. So somewhere you need to make sure that everything being compared is known to R to be a Date.
