I am trying to exclude some rows from a data.table based on, let's say, day and month: excluding, for example, summer holidays that always begin on the 15th of June and end on the 15th of the next month. I can extract those days based on a Date column, but since the as.Date function is awfully slow to work with, I have separate integer columns for Month and Day and I want to do it using only them.
It is easy to select the given entries by:
DT[Month==6][Day>=15]
DT[Month==7][Day<=15]
Is there any way to take the "difference" of the two data.tables (the original one and the ones I selected)? (Why not just subset? Maybe I am missing something simple, but I don't want to exclude days like 10/6 or 31/7.)
I am aware of a way to do it with a join, but only day by day:
setkey(DT, Month, Day)
DT[-DT[J(Month, Day), which = TRUE]]
Can anyone suggest how to solve this in a more general way?
A simple approach that avoids as.Date and reads nicely:
DT[!(Month*100L+Day) %between% c(0615L,0715L)]
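For instance, with a toy table (column names as in the question), the one-liner keeps 10/6 and 31/7 while dropping the holiday range; a quick sketch:
library(data.table)
DT <- data.table(Month = c(6L, 6L, 7L, 7L), Day = c(10L, 20L, 10L, 31L))
# keeps 10/6 and 31/7, drops 20/6 and 10/7
DT[!(Month*100L + Day) %between% c(0615L, 0715L)]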
That's probably fast enough in many cases. If you have a lot of different ranges, then you may want to step up a gear:
DT[, mmdd := Month*100L + Day]
setkey(DT, mmdd)                                   # sort by mmdd so binary search works
from = DT[J(0615L), mult = "first", which = TRUE]  # first row on 15 June
to   = DT[J(0715L), mult = "last",  which = TRUE]  # last row on 15 July
DT[-(from:to)]
That's a bit long and error-prone because it's DIY. So one idea is that a list column in an i table could represent a range query (FR#203, like a binary-search %between%). Then a not-join (also not yet implemented, FR#1384) could be combined with the list-column range query to do exactly what you asked:
setkey(DT,mmdd)
DT[-J(list(0615,0715))]
That would extend to multiple different ranges, or the same range for many different ids, in the usual way; i.e., more rows added to i.
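As an aside, recent data.table versions have non-equi joins (inequality operators in on=), which can express this multi-range exclusion without the DIY steps; a sketch, with a hypothetical ranges table of lo/hi bounds:
library(data.table)
ranges <- data.table(lo = 615L, hi = 715L)   # one row per mmdd range to drop
DT[, mmdd := Month*100L + Day]
hit <- DT[ranges, on = .(mmdd >= lo, mmdd <= hi), which = TRUE, nomatch = NULL]
DT[!hit]                                     # negate the matched row numbers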
Based on the answer here, you might try something like:
# Sample data
DT <- data.table(Month = sample(c(1, 3:12), 100, replace = TRUE),
                 Day = sample(1:30, 100, replace = TRUE),
                 key = "Month,Day")
# Dates that you want to exclude: 15-30 June and 1-15 July
excl <- as.data.table(rbind(expand.grid(6, 15:30), expand.grid(7, 1:15)))
# Join on the key (Month, Day), take the matching row numbers, and drop them
DT[-na.omit(DT[excl, which = TRUE])]
If your data contain at least one entry for each day you want to exclude, na.omit might not be required.
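As an aside, current data.table versions also support not-joins directly, which skips the na.omit step; a sketch:
# With DT keyed on (Month, Day), a not-join drops all rows matching excl
DT[!excl]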
I'm just 2 days into R, so I hope I can give enough info on my problem.
I have an Excel table on endothelial cell angiogenesis with technical repeats on 4 different dates (the dates are not in order and fall in different weeks).
My data looks like this (of course it's not only the 2nd of March): [screenshot of the table]
I want to average the data on those 4 different days, so I can compare e.g. the "Nb Nodes" from day 1 to day 4.
The goal is a jitterplot containing the group, the investigated data point, and the date.
I'm a medical student, so I don't really have any knowledge about this kind of stuff yet, but I'm trying to learn it. Hopefully I provided enough info!
Found the solution:
# Group by date and group
library(dplyr)
DateGroup <- group_by(Exclude0, Exp.Date, Group)
# Summarise the mean within every group and date
summarise(DateGroup, mymean = mean(`Nb meshes`))
I think the code below will work.
1. group_by the dimension you want to summarize by.
2a. across() is a helper verb so that you don't need to type each column manually; it lets us use tidyselect language to quickly reference columns that contain "Nb" (a pattern I noticed from your screenshot).
2b. The second argument of across() is the function (or formula) to apply to each column selected by its first argument.
2c. The optional .names argument of across() gives the new columns a naming convention.
Good luck on your R learning! It's a really great language and you made the right choice.
# df is your data frame. Note .names takes a quoted glue spec; naming the
# function in a list makes {.fn} resolve to "mean" in the new column names.
df %>%
  group_by(Exp.Date) %>%
  summarize(across(contains("Nb"), list(mean = mean), .names = "{.fn}_{.col}"))
# if you just want a single column then do this
df %>%
  group_by(Exp.Date) %>%
  summarize(mean_nb_nodes = mean(`Nb nodes`))
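A quick check with made-up data (column names guessed from your screenshot, so treat them as placeholders):
library(dplyr)
df <- tibble(Exp.Date    = rep(c("02.03", "09.03"), each = 3),
             `Nb nodes`  = c(10, 12, 11, 20, 22, 21),
             `Nb meshes` = c(5, 6, 5, 9, 8, 10))
df %>%
  group_by(Exp.Date) %>%
  summarize(across(contains("Nb"), list(mean = mean), .names = "{.fn}_{.col}"))
# returns one row per Exp.Date with columns mean_Nb nodes and mean_Nb meshes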
I'm new to R and I'm having a little issue. I hope some of you can help me!
I have a data.frame including answers at a single questionnaire.
The rows indicate the participants.
The first columns indicates the participant ID.
The following columns include the answers to each item of the questionnaire (item.1 up to item.20).
I need to create two new vectors:
total.score <- sum of all 20 values for each participant
subscore <- sum of some of the items
I would like to use a function, like sum(A:T) in Excel.
Just to recap, I'm using R and not other software.
I already did it by summing each vector with the + operator
(data$item.1 + data$item.2 + data$item.3 etc...)
but it is a slow way to do it.
Answers range from 0 to 3 for each item, so I expect a total score ranging from 0 to 60.
Thank you in advance!!
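For a data frame laid out as described, the usual replacement for the chain of + operations is rowSums; a minimal sketch, assuming the columns are named item.1 through item.20 as stated:
# All 20 items
item_cols <- paste0("item.", 1:20)
data$total.score <- rowSums(data[, item_cols])
# A subscore over a chosen subset of items, e.g. items 1, 4 and 7
data$subscore <- rowSums(data[, c("item.1", "item.4", "item.7")])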
Let's use as an example this data from a national survey with a questionnaire.
If you download the .csv file to your working directory:
data <- read.csv("2016-SpanishSurveyBreastfeedingKnowledge-AELAMA.csv", sep = "\t")
Item names are p01, p02, p03...
Imagine you want a subtotal of the first five questions (from p01 to p05).
You can give a name to the group:
FirstFive <- c("p01", "p02", "p03", "p04", "p05")
I think this is worthwhile because you will probably want to perform more tasks with this group (analysis, adding or deleting a question from the group...), and because it helps you use meaningful names (for instance "knowledge", "attitudes"...).
And then create the subtotal variable:
data$subtotal1 <- rowSums(data[ , FirstFive])
You can check that the new variable is the sum:
head(data[ , c(FirstFive, "subtotal1")])
(notice that FirstFive is not quoted, because it is an object outside data, but subtotal1 is quoted, because it is the name of a column in data)
You can compute more subtotals and use them to compute a global score
You could maybe save some keystrokes if you know that these variables are columns 20 to 24:
names(data)[20:24]
And then sum them as:
rowSums(data[ , 20:24])
I think this is what you asked for, but I would avoid doing it this way, as it is easier to make mistakes, which can be hard to detect.
I have a simple but large data frame (lateness_tbl) consisting of three columns (Days, Due_Date, End_Date). I need to see how many times each Due_Date is matched in End_Date. I'm currently doing something like this:
x <- c()
for (i in seq_along(lateness_tbl$Due_Date)) {
  x[i] <- sum(lateness_tbl$Due_Date[i] == lateness_tbl$End_Date)
}
The only problem is I have more than 2 million records to compare, and I am looking for help from the community to speed this up. Any tips, tricks, or corrections would be awesome. Thanks!
There is a simple solution. You can define a new vector storing the row-wise differences between End_Date and Due_Date, and then count the entries of this vector that are equal to zero.
differences <- lateness_tbl$Due_Date - lateness_tbl$End_Date
length(which(differences == 0))
If Due_Date and End_Date are dates (and not integers), you can use the difftime function as shown here and apply the same strategy as above.
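Note that the loop in the question compares each Due_Date against the entire End_Date column, not just the same row. If that is the intended behaviour, a lookup table reproduces it without the quadratic loop; a sketch:
# Tally each distinct End_Date once, then look the counts up per Due_Date
end_counts <- table(lateness_tbl$End_Date)
x <- as.integer(end_counts[as.character(lateness_tbl$Due_Date)])
x[is.na(x)] <- 0L   # due dates that never appear as an end date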
I wrote some data into a CSV; this should be a shareable link. If it says no access, then an answer in general terms is greatly appreciated. https://drive.google.com/a/rice.edu/file/d/0B-O6tTyIMPyaNUNtQlJGVkNRcGs/view?usp=sharing
I have a data set with over 220,000 entries. What I am trying to do, without writing 50+ lines of code is:
There is a category called fyear, ranging from 1980 to 2014. For each year, I want to take the sum of the column called "revenue" for that year, and then divide it by the number of entries for that year.
Without a loop, it would be, for example for the year 1980:
n80 <- subset(returns, fyear == "1980")
sum(n80$returns) / nrow(n80)
and it would return the value I want, but I don't want to go through and do this 44 times. So I assume I need to make a loop of some sort. All I can come up with is:
returns = NULL
for (i in 1:fyear) {
  year.returns[i] = sum(returns$return) / length(?)
}
How do I reference the number of entries for each fiscal year?
Reading up on apply/sapply etc now to see if I can figure out how to do it that way.
You can do this with dplyr:
library(dplyr)
data %>%
  group_by(fyear) %>%
  summarize(mean_returns = mean(returns))
Since fyear is a numeric value, it's easy to iterate over the range:
year.means <- numeric(0)
for (i in 1980:2014) {
  x <- subset(returns, fyear == i)
  year.means[i - 1979] <- sum(x$returns) / nrow(x)  # nrow(), not length(): length() counts columns
}
In your original code you have 1980 in quotes, indicating it's a character; if that is the case you could use fyear == as.character(i).
You could also vectorize the solution using sapply.
One simple approach I can think of is using unique. Use years <- unique(returns$fyear) to get a vector containing all the years, then loop through the values in years and do the calculation you mentioned in the question, as sketched below.
It will take care of any missing years as well.
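A sketch of that approach (column names as in the question):
years <- unique(returns$fyear)               # every year that actually occurs
year.means <- sapply(years, function(y) {
  this_year <- returns[returns$fyear == y, ]
  sum(this_year$returns) / nrow(this_year)   # i.e. mean(this_year$returns)
})
names(year.means) <- years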
We can do this with data.table. Convert the 'data.frame' to a 'data.table' (setDT(data)) and, grouped by 'fyear', get the mean of 'returns'.
library(data.table)
setDT(data)[, list(mean_returns = mean(returns)) , by = fyear]
G'day All,
I have a data.frame with multiple columns, with a count corresponding to each column and row. The data looks like the following:
d <- data.frame(replicate(10,sample(0:15,10,rep=TRUE)))
I want to count the number of rows which have a value greater than 0, then repeat this for each threshold up to 14. I then want to do the same thing for every column in the data.frame. So I will end up with data that tells me how many rows exceed each of the values 0:14 for each column.
I can do this one at a time with the following code:
length(which(d[,1]>0))
length(which(d[,1]>1))
length(which(d[,1]>2))
.
.
.
length(which(d[,10]>14))
This is really slow, and my real data set is much larger than this, so I don't want to do it this way. I thought I might be able to do it as follows:
lrows <- data.frame('cnt' = rep(0:14, 10),
                    'column' = rep(1:10, each = 15),
                    't' = 0)
lrows$t <- length(which(d[, lrows[, 2]] > lrows[, 1]))
But it doesn't do it iteratively. I have tried a couple of other things unsuccessfully and was hoping someone here might be able to point me in the right direction.
I know this is going to be obvious once it is pointed out, but I just can't figure it out, sorry. Thank you for your time and help!
All the best
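A vectorized sketch of the counts described above: one inner sapply over the thresholds, one outer sapply over the columns (names are illustrative).
# counts[i, j] = number of rows in column j of d with a value > thresholds[i]
thresholds <- 0:14
counts <- sapply(d, function(col)
  sapply(thresholds, function(t) sum(col > t)))
rownames(counts) <- paste0(">", thresholds)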