Subsetting dates from colnames - r

I have a dataframe as follows:
TAS1 2000 obs. of 9862 variables
Each of these variables (columns) represent daily temperatures from 1979-01-01 to 2005-12-31. The colnames have been set with these dates. I now wish to separate the dataframe into twelve separate monthly data frames - containing Jan, Feb, Mar etc.
I have tried:
TAS1.JAN = subset(TAS1, grepl("-01-"), colnames(TAS1))
But get the error:
Error in grepl("-01-") : argument "x" is missing, with no default
Is there a relatively quick solution for this? I feel there must be but haven't cracked it despite trying various solutions.

I would subset January data like below.
Jan_df <- subset(MyDatSet, select = grepl("-01-", colnames(MyDatSet)))
I have assumed that your parent dataset is called MyDatSet and that the pattern "-01-" identifies January data.
You may repeat the process for the other 11 months or come up with an intelligent loop, as sketched below.
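For instance, here is a minimal sketch of such a loop (assuming, as above, that the parent dataset is MyDatSet and its column names contain dates such as "1979-01-15"):
# "-01-" ... "-12-" patterns, one per month
month_codes <- sprintf("-%02d-", 1:12)
# a list with one data frame per month, named Jan ... Dec
monthly <- lapply(month_codes, function(m) MyDatSet[, grepl(m, colnames(MyDatSet)), drop = FALSE])
names(monthly) <- month.abb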

Like Roland suggested in the comments, I would opt for a melting mechanism too. However, since I do not know your use case, here you go based on what you posted and asked for.
As your error says, you are missing an argument there:
tas1.jan <- subset(df, grepl("-01-", df$tas1))
Another way to do it with the help of stringr and dplyr would be:
library(stringr)
library(dplyr)
tas1.jan <- df %>% filter(str_detect(tas1, "-01-"))
Downside of this approach: you need to run a loop or do this 12 times, once for each month.
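A sketch of that loop, under the same assumption used above that df has a tas1 column holding the dates as text (month.abb and sprintf are base R):
# build the "-01-" ... "-12-" patterns and filter once per month
monthly <- lapply(sprintf("-%02d-", 1:12), function(m) filter(df, str_detect(tas1, m)))
names(monthly) <- month.abb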

Related

Looking for an R function to divide data by date

I'm just 2 days into R so I hope I can give enough Info on my problem.
I have an Excel Table on Endothelial Cell Angiogenesis with Technical Repeats on 4 different dates. (But those Dates are not in order and in different weeks)
My Data looks like this (of course its not only the 2nd of March):
I want to average the data on those 4 different days, so I can compare i.e the "Nb Nodes" from day 1 to day 4.
So to finally have a jitterplot containing the group, the investigated Data Point and the date.
I'm a medical student so I dont really have yet any knowledge about this kind of stuff but Im trying to learn it. Hopefully I provided enough Info!
Found the solution:
#Group by
library(dplyr)
DateGroup <- group_by(Exclude0, Exp.Date, Group)
#Summarizing the mean in every Group and Date
summarise(DateGroup, mymean = mean(`Nb meshes`))
I think the code below will work.
1. group_by() the dimension you want to summarize by.
2a. across() is a helper verb so that you don't need to type each column manually; it lets us use tidy-select language to quickly reference the columns that contain "Nb" (a pattern I noticed from your screenshot).
2b. The second argument of across() is the function (or formula) you want to apply to each column selected by its first argument.
2c. The optional .names argument of across() gives the new columns a naming convention.
Good luck on your R learning! It's a really great language and you made the right choice.
#df is your data frame
df %>%
  group_by(Exp.Date) %>%
  summarize(across(contains("Nb"), list(mean = mean), .names = "{.fn}_{.col}"))

#if you just want a single column then do this
df %>%
  group_by(Exp.Date) %>%
  summarize(mean_nb_nodes = mean(`Nb nodes`))

R coding, I'm trying to correctly order the variables in my dataframe from 1 to 13 but it goes like 201501, 2015010, 011,012,013, 02...09

I have a large dataframe sorted by fiscal year and fiscal period. I am trying to create a time plot starting at fiscal period 1 of 2015, ending at fiscal period 13 of 2019. I have two columns, one for FY, one for FP. They look like this.
I merged the two columns together separated by a 0 in a new column (C) using the code:
MarkP$C = paste(MarkP$FY, MarkP$FP, sep="0")
This ensures that my new column is a numeric variable.
It looks like this (check column C)
Then, since I want to plot a time plot of total sales per period, I aggregated all sales to the level of C, so all rows with the same C are aggregated together. I used this code for the aggregation.
MarkP11 <- MarkP %>%
  group_by(C) %>%
  summarise(Sales = sum(Sales))
This is what MarkP11 looks like.
The problem I'm having is that the rows are out of order, so when I plot them it gives me an incorrect plot: it has period 10 coming after period 1.
I've done some research and discovered that the sprintf function may work, but I'm not sure how to incorporate it into the code for my data frame.
The code below is how my C column is created by merging the two columns. I believe I need to edit this line with sprintf, but I'm not sure how to get that to work.
MarkP$C = paste(MarkP$FY, MarkP$FP, sep="0")
I expect the ordering of the MarkP dataframe to look something like this:
sprintf is indeed what you want:
sprintf("%0.0f%02.0f", 2019, c(1,10))
# [1] "201901" "201910"
This assumes that FP's range is 0-99. It would not be incorrect to use sprintf("%d%02d", 2019, c(1,10)) since you're intending to use integers, but sometimes I find that seemingly-integer values can trigger Error ... invalid format '%02d', so I just strong-arm it. You could also use as.integer on each set of values ... another workaround.
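For instance, a small illustration of that as.integer workaround:
sprintf("%d%02d", as.integer(2019), as.integer(c(1, 10)))
# [1] "201901" "201910"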
I was speaking with a colleague of mine and he helped me figure out the solution. Like r2evans commented, sprintf is the correct function. The syntax that worked for me was:
MarkP$C = paste(MarkP$FY, sprintf("%02d", MarkP$FP), sep="")
What that did in my code was concatenate the two columns FY and FP together in a new column titled "C".
-First, it added my FY column to the new column.
-Then, since sep="", there was no separator character, so FY and FP were simply merged together.
-Since I wrapped FP in sprintf("%02d", ...), the FP value was zero-padded to two digits before being tacked on.
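For illustration, here is a small sketch of why the zero-padding fixes the ordering (the FY and FP values are made up):
FY <- c(2015, 2015, 2015)
FP <- c(1, 2, 10)
sort(paste(FY, FP, sep = "0"))
# [1] "201501"  "2015010" "201502"   (character sort puts period 10 before period 2)
sort(paste(FY, sprintf("%02d", FP), sep = ""))
# [1] "201501" "201502" "201510"    (zero-padded periods sort chronologically)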

Assigning a variable in one dataset to multiple fields in another dataset

I'm trying to assign a variable in one dataframe into multiple rows of another dataframe - namely the AWND variable here (average wind speed).
I'm trying to obtain the AWND from here, and I am trying to match it with multiple dates based on the date here.
Here's what I've tried so far.
dfNew <- merge(dfWeather, dfFlight, by="DATE")
I'm not sure how to proceed with this.
Should I do a join?
(EDIT: Here's the data: https://shrib.com/#-7dXevTkb12Bt6Kdfxim. This is the dput output of the data I am getting AWND from.)
I got the flights data (that I am trying to match dates with) from the nycflights13 package, and then I subset the flights data to include only the carriers that had at least 1000 flights depart from LaGuardia.
The flights data has a date-time class, as shown in your tibble. First, make sure that the values you want to join on are the same, i.e. 2013-01-01 05:00:00 will not match 2013-01-01 in your dfWeather data.frame.
# Make sure dates match between data.frames
dfFlight$DATE <- stringr::str_extract(dfFlight$DATE, "\\S*")
# Join AWND wherever dates match to left-hand side
dfNew <- dplyr::left_join(dfFlight, dfWeather, by = "DATE")
I did assume some things about your data since I couldn't fully see what you're working with from the screenshots. This is my first answer on Stack Overflow, so feel free to edit or leave me suggestions.

creating a for loop to calculate a sum for a certain year

I wrote some data into a CSV- this should be a shareable link. If it says no access, then just in general terms is greatly appreciated. https://drive.google.com/a/rice.edu/file/d/0B-O6tTyIMPyaNUNtQlJGVkNRcGs/view?usp=sharing
I have a data set with over 220,000 entries. What I am trying to do, without writing 50+ lines of code, is:
There is a category called fyear, ranging from 1980 to 2014. For each year, I want to take the sum of the column called "revenue" for that year, and then divide it by the number of entries for that year.
Without a loop, it would be- for example the year 1980
n80<- subset(returns, fyear=="1980")
sum(n80$returns) / length(n80)
and it would return the value I want, but I don't want to go through and do this 44 times. So, I need to make a loop of some sort, I assume. All I can come up with is
returns=NULL
for (i in 1:fyear) {
year.returns[i]= sum(returns$return)/ length(?)
How do I reference the number of entries for each fiscal year?
Reading up on apply/sapply etc now to see if I can figure out how to do it that way.
You can do this with dplyr
library(dplyr)
data %>%
  group_by(fyear) %>%
  summarize(mean_returns = mean(returns))
Since fyear is a numeric value, it's easy to iterate over the range:
year.returns <- numeric(0)
for (i in 1980:2014) {
  x <- subset(returns, fyear == i)
  # nrow(x), not length(x), gives the number of entries for that year
  year.returns[as.character(i)] <- sum(x$returns) / nrow(x)
}
In your original code you have 1980 in quotes, indicating it's a character; if that is the case you could use fyear == as.character(i).
You could also vectorize the solution using sapply
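Here is a minimal sketch of that, assuming (as above) the data frame is called returns and has columns fyear and returns:
yrs <- 1980:2014
# one mean per year, computed without an explicit loop
year.means <- sapply(yrs, function(i) mean(returns$returns[returns$fyear == i]))
names(year.means) <- yrs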
One simple approach I can think of is using unique. Use years <- unique(returns$fyear) to get a vector containing all the years. And then you can loop through values in years vector and do the calculation you've mentioned in the question.
It will take care of any missing year as well.
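A sketch of that approach (same assumed column names as above); it only visits years that actually occur in the data:
years <- unique(returns$fyear)
year.means <- setNames(numeric(length(years)), years)
for (y in years) {
  yr <- returns$returns[returns$fyear == y]
  year.means[as.character(y)] <- sum(yr) / length(yr)
}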
We can do this with data.table. Convert the 'data.frame' to a 'data.table' (setDT(data)); then, grouped by 'fyear', we get the mean of 'returns'.
library(data.table)
setDT(data)[, list(mean_returns = mean(returns)) , by = fyear]

Subset column based on a range of time

I am trying to subset a data frame based on a range of time. Someone has asked this question in the past and the answer was to use R CMD INSTALL lubridate_1.3.1.tar.gz (see link: subset rows according to a range of time).
The issue with this answer is that I get the following warning:
> install.packages("lubridate_1.3.2.tar.gz")
Warning in install.packages :
package ‘lubridate_1.3.2.tar.gz’ is not available (for R version 3.1.2)
I am looking for something very similar to this answer but I cannot figure out how to do this. I have a MasterTable with all of my data organized into columns. One of my columns is called maxNormalizedRFU.
My question is simple:
How can I subset my maxNormalizedRFU column by time?
I would simply like to add another column which only displays the maxNormalizedRFU data between 10 hours and 14 hours. Here is what I have up to now:
#Creates the master table
MasterTable <- inner_join(LongRFU, LongOD, by= c("Time.h", "Well", "Conc.nM", "Assay"))
#normalizes my data by fluorescence (RFU) and optical density (OD) based on 6 different subsets called "Assay"
MasterTable$NormalizedRFU <- MasterTable$AvgRFU/MasterTable$AvgOD
#creates a column that only picks the maximum value of each "Assay"
MasterTable <- ddply(MasterTable, .(Conc.nM, Assay), transform, maxNormalizedRFU=max(NormalizedRFU))
#The issue
MasterTable$CutmaxNormalizedRFU <- ddply(maxNormalizedRFU, "Time.h", transform, [MasterTable$Time.h < 23.00 & MasterTable$Time.h > 10.00,])
Attached is a sample of my dataset. Since the original file has over 90 000 lines, I have only attached a small fraction of it (only one assay and one concentration).
My line is currently using ddply to do the subset but this simply does not work. Does anyone have a suggestion as to how to fix this issue?
Thank you in advance!
Marty
I downloaded your data and had a look. If I am not mistaken, all you need is to subset the data using Time.h; here the range of time you want is 10-23. I used dplyr and did the following, which asks R to pick up rows whose Time.h value is between 10 and 23. Your data frame is called mydf here.
library(dplyr)
filter(mydf, between(Time.h, 10, 23))
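If you prefer base R, the same subset can be written as follows (a sketch under the same assumption that your data frame is mydf):
mydf[mydf$Time.h >= 10 & mydf$Time.h <= 23, ]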
