I have a dataframe testData which is made up of many unique ids. My objective is to identify whether or not each id contains all of the values that occur in the month, yday, and week columns.
In other words, if an id has all possible values of month, it should receive a t in month. Likewise, if it has all possible values of yday, it should receive a t in yday, and if it has all possible values of week, it should receive a t in week. Otherwise, it should receive an f.
A sample of the data looks like this:
> testData
id month yday week
1 1 1 1 1
2 3 1 2 1
3 4 1 3 1
4 2 1 4 1
5 3 3 5 1
6 4 1 6 1
7 2 1 7 1
8 3 1 8 2
9 1 1 9 2
10 5 1 10 2
11 3 2 11 1
12 4 1 12 1
13 5 1 13 1
14 1 1 14 1
The output should look something like this:
> output
id month yday week
1 1 f f t
2 2 f f f
3 3 t f t
4 4 f f f
5 5 f f t
I know that one can check whether numbers fall within a certain range with findInterval(), but could someone suggest a method to check whether the numbers in a vector contain all integers within a range?
> dput(testData)
structure(list(id = c(1L, 3L, 4L, 2L, 3L, 4L, 2L, 3L, 1L, 5L,
3L, 4L, 5L, 1L), month = c(1L, 1L, 1L, 1L, 3L, 1L, 1L, 1L, 1L,
1L, 2L, 1L, 1L, 1L), yday = 1:14, week = c(1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L)), .Names = c("id", "month",
"yday", "week"), class = "data.frame", row.names = c(NA, -14L
))
This is easy with data.table:
library(data.table)
setDT(testData)  # the data frame is named testData; R names are case-sensitive
output <- testData[, .(month = all(unique(testData$month) %in% .SD$month),
                       yday  = all(unique(testData$yday) %in% .SD$yday),
                       Week  = all(unique(testData$week) %in% .SD$week)),
                   by = id]
output
id month yday Week
1: 1 FALSE FALSE TRUE
2: 2 FALSE FALSE FALSE
3: 3 TRUE FALSE TRUE
4: 4 FALSE FALSE FALSE
5: 5 FALSE FALSE TRUE
Here's how to do it with dplyr:
library(dplyr)
testData_copy <-testData
testData %>%
group_by(id) %>%
summarise(month=n_distinct(month)== n_distinct(testData_copy$month),
yday =n_distinct(yday) == n_distinct(testData_copy$yday),
week =n_distinct(week) == n_distinct(testData_copy$week)
)
# A tibble: 5 × 4
id month yday week
<int> <lgl> <lgl> <lgl>
1 1 FALSE FALSE TRUE
2 2 FALSE FALSE FALSE
3 3 TRUE FALSE TRUE
4 4 FALSE FALSE FALSE
5 5 FALSE FALSE TRUE
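For comparison, here is a minimal base R sketch of the same check, in case neither package is available. The helper name has_all() is made up for illustration and is not part of either answer above; it only tests whether every value seen in the whole column also occurs within one id.

```r
# testData as given by dput() in the question
testData <- data.frame(
  id    = c(1L, 3L, 4L, 2L, 3L, 4L, 2L, 3L, 1L, 5L, 3L, 4L, 5L, 1L),
  month = c(1L, 1L, 1L, 1L, 3L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L),
  yday  = 1:14,
  week  = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L)
)

# has_all() is a hypothetical helper: TRUE if every value in all_values occurs in x
has_all <- function(x, all_values) all(all_values %in% x)

# One row per id; tapply() returns its results in sorted id order
output <- data.frame(id = sort(unique(testData$id)))
for (nm in c("month", "yday", "week")) {
  output[[nm]] <- as.vector(tapply(testData[[nm]], testData$id, has_all,
                                   all_values = unique(testData[[nm]])))
}
output
```

The loop keeps the three columns symmetric, so adding another column to check only means adding its name to the vector.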
Related
If I have this data
Group,start_time
1,9:05:00
1,9:07:00
1,19:09:00
1,9:00:00
1,9:00:00
1,9:02:00
2,9:05:00
2,9:07:00
2,19:09:00
2,9:00:00
2,9:00:00
2,9:02:00
and I would like to get a column check on my data like below. How can I do that? Thanks
Group,start_time,check
1,9:05:00,True
1,9:07:00,True
1,19:09:00,True
1,9:00:00,False
1,9:00:00,False
1,9:02:00,False
2,9:05:00,True
2,9:07:00,True
2,19:09:00,True
2,9:00:00,False
2,9:00:00,False
2,9:02:00,False
Here's a possible solution:
df = structure(list(Group = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L), start_time = structure(c(4L, 5L, 1L, 2L, 2L, 3L,
4L, 5L, 1L, 2L, 2L, 3L), .Label = c("19:09:00", "9:00:00", "9:02:00",
"9:05:00", "9:07:00"), class = "factor")), class = "data.frame", row.names = c(NA, -12L))
library(dplyr)
df %>%
group_by(Group) %>%
mutate(check = as.numeric(gsub(":","",start_time)) >= cummax(as.numeric(gsub(":","",start_time)))) %>%
ungroup()
# # A tibble: 12 x 3
# Group start_time check
# <int> <fct> <lgl>
# 1 1 9:05:00 TRUE
# 2 1 9:07:00 TRUE
# 3 1 19:09:00 TRUE
# 4 1 9:00:00 FALSE
# 5 1 9:00:00 FALSE
# 6 1 9:02:00 FALSE
# 7 2 9:05:00 TRUE
# 8 2 9:07:00 TRUE
# 9 2 19:09:00 TRUE
#10 2 9:00:00 FALSE
#11 2 9:00:00 FALSE
#12 2 9:02:00 FALSE
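I'm assuming that the FALSE cases are the rows where we seem to go back in time, i.e. where the time is earlier than the latest time seen so far within the group.
To compare times, I remove the : characters and build a number from the remaining digits, so 9:05:00 becomes 90500 and 19:09:00 becomes 190900.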
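The digit trick above relies on minutes and seconds always having two digits. As a hedged alternative, here is a base R sketch that converts each time to seconds since midnight before comparing it with the running maximum. The helper to_seconds() is a made-up name for illustration, and the data is retyped from the question as plain character strings.

```r
df <- data.frame(
  Group = rep(1:2, each = 6),
  start_time = rep(c("9:05:00", "9:07:00", "19:09:00",
                     "9:00:00", "9:00:00", "9:02:00"), 2)
)

# to_seconds() is a hypothetical helper: "H:MM:SS" -> seconds since midnight
to_seconds <- function(x) {
  parts <- strsplit(as.character(x), ":", fixed = TRUE)
  vapply(parts, function(p) sum(as.numeric(p) * c(3600, 60, 1)), numeric(1))
}

secs <- to_seconds(df$start_time)
# ave() applies cummax() within each Group; a row passes when it equals the running maximum
df$check <- secs >= ave(secs, df$Group, FUN = cummax)
df
```

The as.character() call means the helper also accepts a factor column like the one in the dput() output above.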
Within a group, I want to find the difference between each row and the first time that user appeared in the data. For example, I need to create the diff variable below. Each user has a different number of rows, as in the following data:
df <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 4L, 4L),
money = c(9L, 12L, 13L, 15L, 5L, 7L, 8L, 5L, 2L, 10L), occurence = c(1L,
2L, 3L, 4L, 1L, 2L, 3L, 1L, 1L, 2L), diff = c(NA, 3L, 4L,
6L, NA, 2L, 3L, NA, NA, 8L)), .Names = c("ID", "money", "occurence",
"diff"), class = "data.frame", row.names = c(NA, -10L))
ID money occurence diff
1 1 9 1 NA
2 1 12 2 3
3 1 13 3 4
4 1 15 4 6
5 2 5 1 NA
6 2 7 2 2
7 2 8 3 3
8 3 5 1 NA
9 4 2 1 NA
10 4 10 2 8
You can use ave(). We just remove the first value per group and replace it with NA, and subtract the first value from the rest of the values.
with(df, ave(money, ID, FUN = function(x) c(NA, x[-1] - x[1])))
# [1] NA 3 4 6 NA 2 3 NA NA 8
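To see what the function passed to ave() does within one group, here is a small sketch on a made-up vector (these values are hypothetical and not taken from df). It pulls the anonymous function out under the illustrative name diff_from_first().

```r
# Hypothetical money values for a single ID; the third value equals the first
x <- c(9L, 12L, 9L, 15L)

# Same body as the function passed to ave(): NA for the first row,
# then the difference between each later value and the first value
diff_from_first <- function(x) c(NA, x[-1] - x[1])

diff_from_first(x)
# [1] NA  3  0  6
```

Only the first position becomes NA; a later value that happens to equal the first one yields a genuine 0. A group with a single row returns just NA, because x[-1] is then empty.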
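A dplyr solution, which uses first() to get the first value per group, calculates the difference, and sets the first row of each group to NA. Replacing by position rather than by diff == 0 keeps a genuine 0 when a later value equals the first one.
library(dplyr)
df2 <- df %>%
group_by(ID) %>%
# difference from the group's first value; position 1 is the first row itself
mutate(diff = replace(money - first(money), 1, NA)) %>%
ungroup()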
df2
# # A tibble: 10 x 4
# ID money occurence diff
# <int> <int> <int> <int>
# 1 1 9 1 NA
# 2 1 12 2 3
# 3 1 13 3 4
# 4 1 15 4 6
# 5 2 5 1 NA
# 6 2 7 2 2
# 7 2 8 3 3
# 8 3 5 1 NA
# 9 4 2 1 NA
# 10 4 10 2 8
Update
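Here is a data.table solution provided by Sotos. Note that it updates money by reference rather than adding a new column, so in the output below the first row of each group shows 0 in money; the NA values are in the diff column that df already contained.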
library(data.table)
setDT(df)[, money := money - first(money), by = ID][]
# ID money occurence diff
# 1: 1 0 1 NA
# 2: 1 3 2 3
# 3: 1 4 3 4
# 4: 1 6 4 6
# 5: 2 0 1 NA
# 6: 2 2 2 2
# 7: 2 3 3 3
# 8: 3 0 1 NA
# 9: 4 0 1 NA
# 10: 4 8 2 8
DATA
dput(df)
structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 4L, 4L),
money = c(9L, 12L, 13L, 15L, 5L, 7L, 8L, 5L, 2L, 10L), occurence = c(1L,
2L, 3L, 4L, 1L, 2L, 3L, 1L, 1L, 2L)), .Names = c("ID", "money",
"occurence"), row.names = c(NA, -10L), class = "data.frame")
I have a dataframe testData which is made up of many unique ids. My objective is to identify whether or not each id contains all of the possible integers in a range of month, yday, and week. The lower bound of the range is the minimum value for that id, and the upper bound is the maximum value in the entire column.
Please note this is different from the related question here
In other words, if id has all possible values in the range in month, then it should receive a t. For example, under month where id = 1, the min value is 2 and the max value for the whole column is 5, therefore 1 should receive a true because there is a value 2, 3, 4, and 5. Where id = 2, however, there are only values 1, 2, 4, and 5, so the 3 was skipped and therefore 2 should receive an f.
So far, I have a formula that takes all the values in the entire range of the column (but NOT the min value per id):
library(data.table)
setDT(testData)
output<-testData[,.(month=all(unique(testData$month)%in%.SD$month),yday=all(unique(testData$yday)%in%.SD$yday),week=all(unique(testData$week)%in%.SD$week)),by=(id)]
Any idea how I could integrate min where min is the minimum value per id and max is the maximum value in the range?
> testData
id month yday week
1 1 2 1 1
2 3 1 2 1
3 4 1 3 1
4 2 1 4 1
5 3 3 5 2
6 4 3 6 3
7 2 2 7 1
8 3 1 8 3
9 1 2 9 2
10 5 4 10 3
11 3 2 11 1
12 4 4 12 1
13 5 4 13 2
14 1 3 14 3
15 1 4 15 1
16 1 5 16 2
17 2 4 17 3
18 2 5 18 1
19 5 5 19 1
> dput(testData)
structure(list(id = c(1L, 3L, 4L, 2L, 3L, 4L, 2L, 3L, 1L, 5L,
3L, 4L, 5L, 1L, 1L, 1L, 2L, 2L, 5L), month = c(2L, 1L, 1L, 1L,
3L, 3L, 2L, 1L, 2L, 4L, 2L, 4L, 4L, 3L, 4L, 5L, 4L, 5L, 5L),
yday = 1:19, week = c(1L, 1L, 1L, 1L, 2L, 3L, 1L, 3L, 2L,
3L, 1L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 1L)), .Names = c("id",
"month", "yday", "week"), class = "data.frame", row.names = c(NA,
-19L))
In the end, the output should look like this:
> output
id month yday week
1 1 t f t
2 2 f f f
3 3 f f t
4 4 f f f
5 5 t f t
Using dplyr you can group by id and then check whether all elements of the range are among the values present for each group. Note that min(month) gives the minimum within the current id group, whereas max(testData$month) gives the maximum of the whole column.
library(dplyr)
tD2 <- testData %>% group_by(id) %>%
summarise(month=all(min(month):max(testData$month) %in% month),
yday=all(min(yday):max(testData$yday) %in% yday),
week=all(min(week):max(testData$week) %in% week))
tD2
# A tibble: 5 × 4
id month yday week
<int> <lgl> <lgl> <lgl>
1 1 TRUE FALSE TRUE
2 2 FALSE FALSE FALSE
3 3 FALSE FALSE TRUE
4 4 FALSE FALSE FALSE
5 5 TRUE FALSE TRUE
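The same idea can be sketched in base R, for readers who want to see the range check in isolation. The helper covers_range() is a made-up name for illustration: it builds the integer sequence from the group's own minimum up to a supplied upper bound and tests whether every element occurs in the group.

```r
# testData as given by dput() in the question
testData <- data.frame(
  id    = c(1L, 3L, 4L, 2L, 3L, 4L, 2L, 3L, 1L, 5L, 3L, 4L, 5L, 1L, 1L, 1L, 2L, 2L, 5L),
  month = c(2L, 1L, 1L, 1L, 3L, 3L, 2L, 1L, 2L, 4L, 2L, 4L, 4L, 3L, 4L, 5L, 4L, 5L, 5L),
  yday  = 1:19,
  week  = c(1L, 1L, 1L, 1L, 2L, 3L, 1L, 3L, 2L, 3L, 1L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 1L)
)

# covers_range() is a hypothetical helper: TRUE if x contains every integer
# from its own minimum up to `upper`
covers_range <- function(x, upper) all(seq(min(x), upper) %in% x)

# One row per id; tapply() returns its results in sorted id order
output <- data.frame(id = sort(unique(testData$id)))
for (nm in c("month", "yday", "week")) {
  output[[nm]] <- as.vector(tapply(testData[[nm]], testData$id, covers_range,
                                   upper = max(testData[[nm]])))
}
output
```

Because the upper bound is the column maximum, min(x) can never exceed it, so seq() always counts upward here.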
I have the following data frame
id<-c(1,1,1,1,1,1,1,1,2,2,2,2,3,3,3,3)
time<-c(0,1,2,3,4,5,6,7,0,1,2,3,0,1,2,3)
value<-c(1,1,6,1,2,0,0,1,2,6,2,2,1,1,6,1)
d<-data.frame(id, time, value)
The value 6 appears only once for each id. For every id, I would like to remove all rows after the row with the value 6, except for the first two rows that follow it.
I've searched and found a similar problem, but I couldn't adapt it myself, so I am using the code from that thread.
In the above case the final data frame should be
id time value
1 0 1
1 1 1
1 2 6
1 3 1
1 4 2
2 0 2
2 1 6
2 2 2
2 3 2
3 0 1
3 1 1
3 2 6
3 3 1
One of the solutions given there seems to get very close to what I need, but I didn't manage to adapt it. Could you help me?
library(plyr)
ddply(d, "id",
function(x) {
if (any(x$value == 6)) {
subset(x, time <= x[x$value == 6, "time"])
} else {
x
}
}
)
Thank you very much.
We could use data.table. Convert the 'data.frame' to 'data.table' (setDT(d)). Grouped by the 'id' column, we get the position of 'value' that is equal to 6. Add 2 to it. Find the min of the number of elements for that group (.N) and the position, get the seq, and use that to subset the dataset. We can also add an if/else condition to check whether there are any 6 in the 'value' column or else to return .SD without any subsetting.
library(data.table)
setDT(d)[, if(any(value==6)) .SD[seq(min(c(which(value==6) + 2, .N)))]
else .SD, by = id]
# id time value
# 1: 1 0 1
# 2: 1 1 1
# 3: 1 2 6
# 4: 1 3 1
# 5: 1 4 2
# 6: 2 0 2
# 7: 2 1 6
# 8: 2 2 2
# 9: 2 3 2
#10: 3 0 1
#11: 3 1 1
#12: 3 2 6
#13: 3 3 1
#14: 4 0 1
#15: 4 1 2
#16: 4 2 5
Or, as @Arun mentioned in the comments, we can use head() to subset, which would be faster:
setDT(d)[, if(any(value==6)) head(.SD, which(value==6L)+2L) else .SD, by = id]
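Or, using dplyr, we group by 'id', get the position of 'value' 6 with which, add 2, get the seq and use that numeric index within slice to extract the rows. The if/else keeps every row of a group that has no 6 (such as id 4 in the data below); without it, seq() on an empty position would select no rows for that group.
library(dplyr)
d %>%
group_by(id) %>%
slice(if (any(value == 6)) seq(which(value == 6) + 2) else seq_len(n()))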
# id time value
#1 1 0 1
#2 1 1 1
#3 1 2 6
#4 1 3 1
#5 1 4 2
#6 2 0 2
#7 2 1 6
#8 2 2 2
#9 2 3 2
#10 3 0 1
#11 3 1 1
#12 3 2 6
#13 3 3 1
#14 4 0 1
#15 4 1 2
#16 4 2 5
data
d <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L,
3L, 3L, 3L, 4L, 4L, 4L), time = c(0L, 1L, 2L, 3L, 4L, 0L, 1L,
2L, 3L, 0L, 1L, 2L, 3L, 0L, 1L, 2L), value = c(1L, 1L, 6L, 1L,
2L, 2L, 6L, 2L, 2L, 1L, 1L, 6L, 1L, 1L, 2L, 5L)), .Names = c("id",
"time", "value"), class = "data.frame", row.names = c(NA, -16L))
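A base R sketch of the same logic, applied to the data frame from the question so that rows are actually dropped. The helper keep_until() and its arguments target and extra are made-up names for illustration. It assumes the rows within each id are already in time order.

```r
# Data frame from the question
id    <- c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3)
time  <- c(0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 0, 1, 2, 3)
value <- c(1, 1, 6, 1, 2, 0, 0, 1, 2, 6, 2, 2, 1, 1, 6, 1)
d <- data.frame(id, time, value)

# keep_until() is a hypothetical helper: TRUE for each row up to `extra` rows
# past the first occurrence of `target`; all TRUE if `target` never occurs
keep_until <- function(v, target = 6, extra = 2) {
  pos <- match(target, v)  # position of the first target, NA if absent
  if (is.na(pos)) rep(TRUE, length(v)) else seq_along(v) <= pos + extra
}

# ave() runs the helper per id and returns the flags in the original row order
keep <- as.logical(ave(d$value, d$id, FUN = keep_until))
result <- d[keep, ]
result
```

Using ave() rather than split() plus rbind() means the data does not have to be sorted by id first, since the flags come back aligned with the original rows.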
I have the following data frame:
Event Scenario Year Cost
1 1 1 10
2 1 1 5
3 1 2 6
4 1 2 6
5 2 1 15
6 2 1 12
7 2 2 10
8 2 2 5
9 3 1 4
10 3 1 5
11 3 2 6
12 3 2 5
I need to produce a pivot table or data frame that sums the total cost per year for each scenario, so the result will be:
Scenario Year Cost
1 1 15
1 2 12
2 1 27
2 2 15
3 1 9
3 2 11
I need to produce a ggplot line graph that plot the cost of each scenario per year. I know how to do that, I just can't get the right data frame.
Try
library(dplyr)
df %>% group_by(Scenario, Year) %>% summarise(Cost=sum(Cost))
Or
library(data.table)
setDT(df)[, list(Cost=sum(Cost)), by=list(Scenario, Year)]
Or
aggregate(Cost~Scenario+Year, df,sum)
data
df <- structure(list(Event = 1:12, Scenario = c(1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 3L, 3L, 3L, 3L), Year = c(1L, 1L, 2L, 2L, 1L, 1L,
2L, 2L, 1L, 1L, 2L, 2L), Cost = c(10L, 5L, 6L, 6L, 15L, 12L,
10L, 5L, 4L, 5L, 6L, 5L)), .Names = c("Event", "Scenario", "Year",
"Cost"), class = "data.frame", row.names = c(NA, -12L))
The following does it:
library(plyr)
ddply(df, .(Scenario, Year), summarize, Cost = sum(Cost))
#Scenario Year Cost
#1 1 1 15
#2 1 2 12
#3 2 1 27
#4 2 2 15
#5 3 1 9
#6 3 2 11