Easiest way to add missing rows in R

I have the following table that I generate via a table/cumsum command.
> temp
numCars
18 1
17 2
16 8
15 18
14 25
13 29
12 42
11 55
10 70
9 134
8 160
7 172
6 177
5 180
3 181
2 181
1 181
0 181
temp <- structure(c(1L, 2L, 8L, 18L, 25L, 29L, 42L, 55L, 70L, 134L, 160L,
172L, 177L, 180L, 181L, 181L, 181L, 181L), .Dim = c(18L, 1L), .Dimnames = list(
c("18", "17", "16", "15", "14", "13", "12", "11", "10", "9",
"8", "7", "6", "5", "3", "2", "1", "0"), "numCars"))
As you can see, the row named 4 is missing. What is the easiest way in R to fill it in, using the value of the next lower row name (in this case 181, from row 3)?
I understand I can do this with a messy for loop: size the table, create a new data frame, then fill in any blank values. I'm just wondering if there's a better way.
Here's the table code:
cohortSizeByMileage <- data.matrix(cumsum(rev(table(cleanMileage$OdometerBucket))))
colnames(cohortSizeByMileage) <- "numCars"

Turn the row names of 'temp' into a column ('df1'), build a second dataset ('df2') that spans every value between the minimum and maximum row name, left_join (or merge) the two, and fill the resulting NA with na.locf from library(zoo). With fromLast=TRUE the NA takes the value of the row below it, i.e. the next lower row name.
df1 <- data.frame(numCars=temp[, 1], rn1=as.numeric(row.names(temp)))
df2 <- data.frame(rn1= max(df1$rn1):min(df1$rn1))
library(dplyr)
library(zoo)
left_join(df2, df1, by = "rn1") %>%
mutate(numCars= na.locf(numCars,fromLast=TRUE ))
# rn1 numCars
#1 18 1
#2 17 2
#3 16 8
#4 15 18
#5 14 25
#6 13 29
#7 12 42
#8 11 55
#9 10 70
#10 9 134
#11 8 160
#12 7 172
#13 6 177
#14 5 180
#15 4 181
#16 3 181
#17 2 181
#18 1 181
#19 0 181
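If pulling in dplyr and zoo is not an option, the same fill can be sketched in base R. This is a minimal sketch rather than part of the answer above: the names rn, full, vals and filled are made up for illustration, and it assumes the last row (the lowest row name) is never missing, so the backward fill always has a value to start from.

```r
# Rebuild the one-column matrix from the question
temp <- matrix(
  c(1L, 2L, 8L, 18L, 25L, 29L, 42L, 55L, 70L, 134L, 160L,
    172L, 177L, 180L, 181L, 181L, 181L, 181L),
  ncol = 1,
  dimnames = list(as.character(c(18:5, 3:0)), "numCars")
)

rn   <- as.integer(rownames(temp))        # existing row names as numbers
full <- max(rn):min(rn)                   # every row name that should exist
vals <- unname(temp[match(full, rn), 1])  # NA wherever a row is missing

# Backward fill: reverse, carry the last seen value forward, reverse again
r      <- rev(vals)
filled <- rev(r[!is.na(r)][cumsum(!is.na(r))])

result <- data.frame(rn1 = full, numCars = filled)
result[result$rn1 == 4, ]
```

The cumsum(!is.na(r)) index repeats the position of the most recent non-missing value, which is what makes the fill work without a loop.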

Related

R replace specific value in many columns across dataframe

I am mainly interested in replacing a specific value (81) in many columns across the dataframe.
For example, if this is my dataset
Id Date Col_01 Col_02 Col_03 Col_04
30 2012-03-31 1 A42.2 20.46 43
36 1996-11-15 42 V73 23 55
96 2010-02-07 X48 81 13 3R
40 2010-03-18 AD14 18.12 20.12 36
69 2012-02-21 8 22.45 12 10
11 2013-07-03 81 V017 78.12 81
22 2001-06-01 11 09 55 12
83 2005-03-16 80.45 V22.15 46.52 X29.11
92 2012-02-12 1 4 67 12
34 2014-03-10 82.12 N72.22 V45.44 10
I would like to replace the value 81 with NA in columns Col_01, Col_02, Col_03 and Col_04. The expected final dataset looks like this:
Id Date Col_01 Col_02 Col_03 Col_04
30 2012-03-31 1 A42.2 20.46 43
36 1996-11-15 42 V73 23 55
96 2010-02-07 X48 NA 13 3R
40 2010-03-18 AD14 18.12 20.12 36
69 2012-02-21 8 22.45 12 10
11 2013-07-03 NA V017 78.12 NA
22 2001-06-01 11 09 55 12
83 2005-03-16 80.45 V22.15 46.52 X29.11
92 2012-02-12 1 4 67 12
34 2014-03-10 82.12 N72.22 V45.44 10
I tried this approach
df %>% select(matches("^Col_\\d+$"))[ df %>% select(matches("^Col_\\d+$")) == 81 ] <- NA
This is modeled on the solution data[ , 2:3 ][ data[ , 2:3 ] == 4 ] <- 10 from
Replacing occurrences of a number in multiple columns of data frame with another value in R
but it did not work.
Any suggestion is much appreciated. Thanks in advance.
Instead of select, we can specify matches directly inside mutate(across(...)) and use na_if to replace the values that are "81" with NA:
library(dplyr)
df <- df %>%
mutate(across(matches("^Col_\\d+$"), ~ na_if(., "81")))
-output
df
Id Date Col_01 Col_02 Col_03 Col_04
1 30 2012-03-31 1 A42.2 20.46 43
2 36 1996-11-15 42 V73 23 55
3 96 2010-02-07 X48 <NA> 13 3R
4 40 2010-03-18 AD14 18.12 20.12 36
5 69 2012-02-21 8 22.45 12 10
6 11 2013-07-03 <NA> V017 78.12 <NA>
7 22 2001-06-01 11 09 55 12
8 83 2005-03-16 80.45 V22.15 46.52 X29.11
9 92 2012-02-12 1 4 67 12
10 34 2014-03-10 82.12 N72.22 V45.44 10
Or we can use base R
i1 <- grep("^Col_\\d+$", names(df))
df[i1][df[i1] == "81"] <- NA
The issue in the OP's code is that the assignment is not triggered as expected. Extracting the values works, i.e.
(df %>%
select(matches("^Col_\\d+$")))[(df %>%
select(matches("^Col_\\d+$"))) == "81" ]
[1] "81" "81" "81"
which is same as
df[i1][df[i1] == "81"]
[1] "81" "81" "81"
and not the assignment
(df %>%
select(matches("^Col_\\d+$")))[(df %>%
select(matches("^Col_\\d+$"))) == "81" ] <- NA
Error in (df %>% select(matches("^Col_\\d+$")))[(df %>% select(matches("^Col_\\d+$"))) == :
could not find function "(<-"
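In base R, df[i1][df[i1] == "81"] <- NA works because the left-hand side starts from a named object (df), so R can dispatch to the replacement function [<-. A piped expression wrapped in parentheses is not an assignable target.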
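To make the role of the named target concrete, here is a small sketch that spells out the base R replacement in separate steps. The three-row df and the intermediate name sub are made up for illustration; the logic is the same as the one-line base R version.

```r
# Small stand-in for the question's data (hypothetical subset)
df <- data.frame(
  Id     = c(96L, 11L, 22L),
  Col_01 = c("X48", "81", "11"),
  Col_02 = c("81", "V017", "09"),
  stringsAsFactors = FALSE
)

i1  <- grep("^Col_\\d+$", names(df))  # positions of the Col_ columns
sub <- df[i1]                         # extract into a named object
sub[sub == "81"] <- NA                # `[<-` has a real variable to modify
df[i1] <- sub                         # write the columns back

df
```

Only the Col_ columns are touched, so an Id of 81 would be left alone.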
data
df <- structure(list(Id = c(30L, 36L, 96L, 40L, 69L, 11L, 22L, 83L,
92L, 34L), Date = c("2012-03-31", "1996-11-15", "2010-02-07",
"2010-03-18", "2012-02-21", "2013-07-03", "2001-06-01", "2005-03-16",
"2012-02-12", "2014-03-10"), Col_01 = c("1", "42", "X48", "AD14",
"8", "81", "11", "80.45", "1", "82.12"), Col_02 = c("A42.2",
"V73", "81", "18.12", "22.45", "V017", "09", "V22.15", "4", "N72.22"
), Col_03 = c("20.46", "23", "13", "20.12", "12", "78.12", "55",
"46.52", "67", "V45.44"), Col_04 = c("43", "55", "3R", "36",
"10", "81", "12", "X29.11", "12", "10")),
class = "data.frame", row.names = c(NA,
-10L))
We can also use replace:
library(dplyr)
df <- df %>%
mutate(across(matches("^Col_\\d+$"), ~ replace(.x, .x == "81", NA)))

Create new rows according to the values stored in a column

I have the following data
structure(list(State = c(56L, 81L, 126L, 161L, 120L, 138L, 71L,
133L, 6L, 171L, 42L, 64L, 28L, 76L, 56L, 117L, 47L, 69L, 65L,
105L, 175L, 151L, 0L, 91L, 150L, 157L, 172L, 69L, 132L, 39L,
152L, 107L, 142L, 174L, 187L, 84L, 58L, 73L, 198L, 5L, 43L, 189L,
34L, 177L, 119L, 69L, 152L, 155L, 44L, 59L, 20L, 120L, 1L, 173L,
190L, 121L, 118L, 168L, 80L, 45L, 26L, 15L, 190L, 25L, 7L, 146L,
177L, 41L, 28L, 190L, 64L, 76L, 194L, 13L, 172L, 120L, 132L,
160L, 58L, 12L), AgentID = 1:80, t_IDs = c("1 15", "2", "3",
"4", "5 52 76", "6", "7", "8", "9", "10", "11", "12 71", "13 69",
"14 72", "1 15", "16", "17", "18 28 46", "19", "20", "21", "22",
"23", "24", "25", "26", "27 75", "18 28 46", "29 77", "30", "31 47",
"32", "33", "34", "35", "36", "37 79", "38", "39", "40", "41",
"42", "43", "44 67", "45", "18 28 46", "31 47", "48", "49", "50",
"51", "5 52 76", "53", "54", "55 63 70", "56", "57", "58", "59",
"60", "61", "62", "55 63 70", "64", "65", "66", "44 67", "68",
"13 69", "55 63 70", "12 71", "14 72", "73", "74", "27 75", "5 52 76",
"29 77", "78", "37 79", "80")), row.names = c(NA, -80L), class = c("data.table",
"data.frame"), .internal.selfref = <pointer: 0x560df68bf760>)
I want to turn the t_IDs column into a new column, IDs, such that there is a new row for each of the values contained in t_IDs.
For example, for such rows
the resulting rows after the transformation would be like
2 198 39 309
2 194 73 73
2 190 55 55
2 190 55 63
2 190 55 70
If the State/AgentID pairs are not guaranteed to be unique, one way to do this is with a row-wise unlisting:
DT[, .(State = State[1], AgentID = AgentID[1], t_IDs = unlist(strsplit(t_IDs, split=" "))), by = seq_len(nrow(DT))][,-1]
# State AgentID t_IDs
# <int> <int> <char>
# 1: 56 1 1
# 2: 56 1 15
# 3: 81 2 2
# 4: 126 3 3
# 5: 161 4 4
# 6: 120 5 5
# 7: 120 5 52
# 8: 120 5 76
# 9: 138 6 6
# 10: 71 7 7
# ---
# 107: 172 75 75
# 108: 120 76 5
# 109: 120 76 52
# 110: 120 76 76
# 111: 132 77 29
# 112: 132 77 77
# 113: 160 78 78
# 114: 58 79 37
# 115: 58 79 79
# 116: 12 80 80
Alternatively, we can work with a list-column and then use tidyr::unnest to break it out:
tidyr::unnest(DT[, t_IDs := strsplit(t_IDs, split=" ")][], t_IDs)
# # A tibble: 116 x 3
# State AgentID t_IDs
# <int> <int> <chr>
# 1 56 1 1
# 2 56 1 15
# 3 81 2 2
# 4 126 3 3
# 5 161 4 4
# 6 120 5 5
# 7 120 5 52
# 8 120 5 76
# 9 138 6 6
# 10 71 7 7
# # ... with 106 more rows
(This has the side-effect of converting to tbl_df.)
I would try this with data.table. Use strsplit to split the values in t_IDs on spaces; unlist then gives one long vector for the new t_IDs column. This is done for each State and AgentID combination.
library(data.table)
setDT(dt)
dt[, list(t_IDs = unlist(strsplit(t_IDs, " "))), by = c("State", "AgentID")]
An alternative that does not assume uniqueness of State/AgentID might be:
dt[, .(State, AgentID, t_IDs = unlist(strsplit(t_IDs, " "))), by = seq_len(nrow(dt))][, -1]
Output
State AgentID t_IDs
1: 56 1 1
2: 56 1 15
3: 81 2 2
4: 126 3 3
5: 161 4 4
---
112: 132 77 77
113: 160 78 78
114: 58 79 37
115: 58 79 79
116: 12 80 80
Using the data you provided (3 columns), I did this:
n = matrix(ncol= 3)[-1,]
for(i in grep(' ', df$t_IDs)){
b = df[i,]
c = length(gregexpr(' ',as.character(b[,3]))[[1]])+1
d = b[rep(1,c),1:2]
d$t_IDs = as.numeric(unlist(strsplit(as.character(b[1,3]), ' ')))
n = rbind(n,as.matrix(d))
}
df_new = df[-grep(' ', df$t_IDs),]
df_new = rbind(df_new, n)
df_new= df_new[order(df_new$AgentID),]
it gives this result
State AgentID t_IDs
1: 56 1 1
2: 56 1 15
3: 81 2 2
4: 126 3 3
5: 161 4 4
---
112: 132 77 77
113: 160 78 78
114: 58 79 37
115: 58 79 79
116: 12 80 80
I hope it helped
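The loop above can be condensed in base R with strsplit, lengths and rep. This is a sketch on a made-up three-row subset of the data; parts, n_ids and long are illustrative names, and it assumes every t_IDs entry is a space-separated list of integers.

```r
# Hypothetical three-row subset of the question's data
df <- data.frame(
  State   = c(56L, 81L, 120L),
  AgentID = c(1L, 2L, 5L),
  t_IDs   = c("1 15", "2", "5 52 76"),
  stringsAsFactors = FALSE
)

parts <- strsplit(df$t_IDs, " ", fixed = TRUE)  # one character vector per row
n_ids <- lengths(parts)                         # how many IDs each row holds

# Repeat the other columns once per ID, then flatten the IDs
long <- data.frame(
  State   = rep(df$State, n_ids),
  AgentID = rep(df$AgentID, n_ids),
  t_IDs   = as.integer(unlist(parts))
)

long
```

Because rows are repeated by position, this does not depend on State/AgentID pairs being unique.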
This is a one-step operation with tidyr's separate_rows:
library(tidyr)
separate_rows(dt, t_IDs, sep = ' ')
# A tibble: 116 x 3
State AgentID t_IDs
<int> <int> <chr>
1 56 1 1
2 56 1 15
3 81 2 2
4 126 3 3
5 161 4 4
6 120 5 5
7 120 5 52
8 120 5 76
9 138 6 6
10 71 7 7
# ... with 106 more rows

R delete first and last x % of rows

I have a data frame with 3 ID variables, then several values for each ID.
user Log Pass Value
2 2 123 342
2 2 123 543
2 2 123 231
2 2 124 257
2 2 124 342
4 3 125 543
4 3 125 231
4 3 125 257
4 3 125 342
4 3 125 543
4 3 125 231
4 3 125 257
4 3 125 543
4 3 125 231
4 3 125 257
4 3 125 543
4 3 125 231
4 3 125 257
4 3 125 543
4 3 125 231
4 3 125 257
The start and end of each set of values is sometimes noisy, and I want to be able to delete the first and last few values. Unfortunately the number of values varies significantly, but it is always the first and last 20% of values that are noisy.
I want to delete the first and last 20% of rows in each group, with a minimum of 1 row deleted from each end.
So, for instance, if there are 20 values for user 2, log 2, pass 123, I want to delete the first and last 4 rows. If there are only 3 values for an ID combination, I want to delete the first and last row.
The resulting dataset would be:
user Log Pass Value
2 2 123 543
4 3 125 543
4 3 125 231
4 3 125 257
4 3 125 543
4 3 125 231
4 3 125 257
4 3 125 543
4 3 125 231
I've tried fiddling around with nrow but I struggle to figure out how to reference the % of rows by id variable.
Thanks.
Jonathan.
I believe the following can do it.
DATA.
dat <-
structure(list(user = c(2L, 2L, 2L, 2L, 2L, 4L, 4L, 4L, 4L, 4L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), Log = c(2L, 2L,
2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L), Pass = c(123L, 123L, 123L, 124L, 124L, 125L, 125L,
125L, 125L, 125L, 125L, 125L, 125L, 125L, 125L, 125L, 125L, 125L,
125L, 125L, 125L), Value = c(342L, 543L, 231L, 257L, 342L, 543L,
231L, 257L, 342L, 543L, 231L, 257L, 543L, 231L, 257L, 543L, 231L,
257L, 543L, 231L, 257L)), .Names = c("user", "Log", "Pass", "Value"
), class = "data.frame", row.names = c(NA, -21L))
CODE.
fun <- function(x, p = 0.20){
n <- nrow(x)
m <- max(1, round(n*p))
inx <- c(seq_len(m), n - seq_len(m) + 1)
x[-inx, ]
}
result <- do.call(rbind, lapply(split(dat, dat$user), fun))
row.names(result) <- NULL
result
# user Log Pass Value
#1 2 2 123 543
#2 2 2 123 231
#3 2 2 124 257
#4 4 3 125 342
#5 4 3 125 543
#6 4 3 125 231
#7 4 3 125 257
#8 4 3 125 543
#9 4 3 125 231
#10 4 3 125 257
#11 4 3 125 543
#12 4 3 125 231
#13 4 3 125 257
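The function above splits on user only and rounds to the nearest integer, which is why its result differs from the expected output in the question. A variant that groups on all three ID columns and rounds up might look like the sketch below; trim_ends, ends and groups are illustrative names, and the choice of ceiling is an assumption based on the question's example (16 rows losing 4 from each end).

```r
# Same data as in the question, built compactly
dat <- data.frame(
  user  = rep(c(2L, 4L), c(5, 16)),
  Log   = rep(c(2L, 3L), c(5, 16)),
  Pass  = rep(c(123L, 124L, 125L), c(3, 2, 16)),
  Value = c(342L, 543L, 231L, 257L, 342L, 543L, 231L, 257L, 342L, 543L,
            231L, 257L, 543L, 231L, 257L, 543L, 231L, 257L, 543L, 231L, 257L)
)

# Drop ceiling(p * n) rows (at least 1) from each end of one group
trim_ends <- function(x, p = 0.20) {
  n    <- nrow(x)
  m    <- max(1, ceiling(n * p))
  ends <- unique(c(seq_len(m), n - seq_len(m) + 1))
  x[-ends, , drop = FALSE]
}

# One group per user/Log/Pass combination that actually occurs
groups <- split(dat, list(dat$user, dat$Log, dat$Pass), drop = TRUE)
result <- do.call(rbind, lapply(groups, trim_ends))
row.names(result) <- NULL
result
```

Groups with one or two rows end up empty, since both ends overlap.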
Would something like this help?
For a dataframe df:
df[-c(1:floor(nrow(df)*0.2), (1+ceiling(nrow(df)*0.8)):nrow(df)),]
This removes the first and last 20%, rounding with floor and ceiling so that a small data frame still keeps some of its rows:
> df<-data.frame(a=1:100)
> df[-c(1:floor(nrow(df)*0.2),(1+ceiling(nrow(df)*0.8)):nrow(df)),]
[1] 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
[31] 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
> df<-data.frame(1:3)
> df[-c(1:floor(nrow(df)*0.2),(1+ceiling(nrow(df)*0.8)):nrow(df)),]
[1] 2
You can do this with dplyr...
library(dplyr)
df2 <- df %>% group_by(user, Log, Pass) %>%
filter(n()>2) %>% #remove those with just two elements or fewer
slice(max(2, 1+ceiling(n()*0.2)):min(n()-1, floor(0.8*n())))
df2
user Log Pass Value
1 2 2 123 543
2 4 3 125 543
3 4 3 125 231
4 4 3 125 257
5 4 3 125 543
6 4 3 125 231
7 4 3 125 257
8 4 3 125 543
9 4 3 125 231
Calculate the index of the first row you want to retain:
rem <- ceiling( nrow( dat ) * .2 ) + 1
Then take out the records you don't want:
dat <- dat[ rem : ( nrow( dat ) - rem + 1 ), ]
Here is an idea using base R that returns the row indices of each user to keep and then subsets on these indices.
idx <- unlist(lapply(split(seq_along(dat[["user"]]), dat[["user"]]), function(x) {
tmp <- max(1, ceiling(.2 * length(x)))
tail(head(x, -tmp), -tmp)}),
use.names=FALSE)
split(seq_along(dat[["user"]]), dat[["user"]]) returns a list of the row indices for each user. lapply loops through these, calculating the number of rows to drop from each end with max(1, ceiling(.2 * length(x))), and then dropping them with tail(head(x, -tmp), -tmp). Since lapply returns a named list, the result is unlisted and the names are dropped.
This returns
idx
2 3 4 10 11 12 13 14 15 16 17
Now subset
dat[idx,]
user Log Pass Value
2 2 2 123 543
3 2 2 123 231
4 2 2 124 257
10 4 3 125 543
11 4 3 125 231
12 4 3 125 257
13 4 3 125 543
14 4 3 125 231
15 4 3 125 257
16 4 3 125 543
17 4 3 125 231

Releveling factor to facilitate use as nested factor in DESeq2 model in R

I am fitting a GLM using the DESeq2 package, and have the situation where individuals (RatIDs) are nested within the treatment (Diet). The author of the package suggests that the individuals be re-leveled from 1:N within each Diet (where N is the number of RatIDs within a specific Diet) rather than their original ID/factor level (DESeq2 vignette, page 35.)
The data looks something like this (there are actually more columns and rows, but omitted for simplicity):
Diet Extraction RatID
199 HAMSP 8 65
74 HAMS 9 108
308 HAMS 18 100
41 HAMSA 3 83
88 HAMSP 12 11
221 HAMSP 14 66
200 HAMSA 8 57
155 HAMSB 1 105
245 HAMSB 19 50
254 HAMS 21 90
182 HAMSB 4 4
283 HAMSA 23 59
180 HAMSP 4 22
71 HAMSP 9 112
212 HAMS 12 63
220 HAMSP 14 54
56 HAMS 7 81
274 HAMSP 1 11
114 HAMS 17 102
143 HAMSP 22 93
And here is a dput() output for the structure:
data = structure(list(Diet = structure(c(4L, 1L, 1L, 2L, 4L, 4L, 2L,
3L, 3L, 1L, 3L, 2L, 4L, 4L, 1L, 4L, 1L, 4L, 1L, 4L), .Label = c("HAMS",
"HAMSA", "HAMSB", "HAMSP", "LAMS"), class = "factor"), Extraction = c(8L,
9L, 18L, 3L, 12L, 14L, 8L, 1L, 19L, 21L, 4L, 23L, 4L, 9L, 12L,
14L, 7L, 1L, 17L, 22L), RatID = structure(c(61L, 7L, 3L, 76L,
9L, 62L, 52L, 6L, 46L, 81L, 37L, 54L, 20L, 12L, 59L, 50L, 74L,
9L, 4L, 84L), .Label = c("1", "10", "100", "102", "103", "105",
"108", "109", "11", "110", "111", "112", "113", "13", "14", "16",
"17", "18", "20", "22", "23", "24", "25", "26", "27", "28", "29",
"3", "30", "31", "32", "34", "35", "36", "37", "39", "4", "40",
"42", "43", "45", "46", "48", "49", "5", "50", "51", "52", "53",
"54", "55", "57", "58", "59", "6", "60", "61", "62", "63", "64",
"65", "66", "67", "68", "69", "70", "71", "73", "77", "78", "79",
"8", "80", "81", "82", "83", "85", "86", "88", "89", "90", "91",
"92", "93", "94", "95", "96", "98", "99"), class = "factor")), .Names = c("Diet",
"Extraction", "RatID"), row.names = c(199L, 74L, 308L, 41L, 88L,
221L, 200L, 155L, 245L, 254L, 182L, 283L, 180L, 71L, 212L, 220L,
56L, 274L, 114L, 143L), class = "data.frame")
Can someone please suggest an elegant way to generate the new factor levels for RatIDs within Diet as an additional column of the above data.frame?
Could this be done with the roll function of data.table?
Desired output (done manually):
Diet Extraction RatID newCol
1 HAMSP 8 65 1
2 HAMS 9 108 1
3 HAMS 18 100 2
4 HAMSA 3 83 1
5 HAMSP 12 11 2
6 HAMSP 14 66 3
7 HAMSA 8 57 2
8 HAMSB 1 105 1
9 HAMSB 19 50 2
10 HAMS 21 90 3
11 HAMSB 4 4 3
12 HAMSA 23 59 3
13 HAMSP 4 22 4
14 HAMSP 9 112 5
15 HAMS 12 63 4
16 HAMSP 14 54 6
17 HAMS 7 81 5
18 HAMSP 1 11 2
19 HAMS 17 102 6
20 HAMSP 22 93 7
NOTE: There are not an equal number of Rats in each treatment. I'd also like the solution to not re-order the rows in the data (if possible).
EDIT: There is no 'natural' order to the RatIDs; as long as there is a 1:1 mapping within a diet, it's fine.
You can convert the 'RatID' to 'factor' and coerce it back to 'numeric'
library(data.table)#v1.9.4+
setDT(data)[, newCol:=as.numeric(factor(RatID,
levels=unique(RatID))), Diet]
# Diet Extraction RatID newCol
# 1: HAMSP 8 65 1
# 2: HAMS 9 108 1
# 3: HAMS 18 100 2
# 4: HAMSA 3 83 1
# 5: HAMSP 12 11 2
# 6: HAMSP 14 66 3
# 7: HAMSA 8 57 2
# 8: HAMSB 1 105 1
# 9: HAMSB 19 50 2
#10: HAMS 21 90 3
#11: HAMSB 4 4 3
#12: HAMSA 23 59 3
#13: HAMSP 4 22 4
#14: HAMSP 9 112 5
#15: HAMS 12 63 4
#16: HAMSP 14 54 6
#17: HAMS 7 81 5
#18: HAMSP 1 11 2
#19: HAMS 17 102 6
#20: HAMSP 22 93 7
Or use match
setDT(data)[, newCol:=match(RatID, unique(RatID)), Diet]
Or similar option with base R
data$newCol <- with(data, ave(as.numeric(levels(RatID))[RatID],
Diet, FUN=function(x) match(x, unique(x))))
Here is the as.numeric(factor(.)) trick implemented in dplyr:
require(dplyr)
data %>% group_by(Diet) %>% mutate(RatIDByDiet=as.numeric(factor(RatID)))
## Source: local data frame [20 x 4]
## Groups: Diet
##
## Diet Extraction RatID RatIDByDiet
## 1 HAMSP 8 65 5
## 2 HAMS 9 108 3
## 3 HAMS 18 100 1
## 4 HAMSA 3 83 3
## 5 HAMSP 12 11 1
## 6 HAMSP 14 66 6
## 7 HAMSA 8 57 1
## 8 HAMSB 1 105 1
## 9 HAMSB 19 50 3
## 10 HAMS 21 90 6
## 11 HAMSB 4 4 2
## 12 HAMSA 23 59 2
## 13 HAMSP 4 22 3
## 14 HAMSP 9 112 2
## 15 HAMS 12 63 4
## 16 HAMSP 14 54 4
## 17 HAMS 7 81 5
## 18 HAMSP 1 11 1
## 19 HAMS 17 102 2
## 20 HAMSP 22 93 7
And here is a solution that avoids going through factor(), if you want more control over how the numbering happens:
data %>% group_by(Diet) %>% mutate(RatIDByDiet=match(RatID, unique(RatID)))
## Source: local data frame [20 x 4]
## Groups: Diet
##
## Diet Extraction RatID RatIDByDiet
## 1 HAMSP 8 65 1
## 2 HAMS 9 108 1
## 3 HAMS 18 100 2
## 4 HAMSA 3 83 1
## 5 HAMSP 12 11 2
## 6 HAMSP 14 66 3
## 7 HAMSA 8 57 2
## 8 HAMSB 1 105 1
## 9 HAMSB 19 50 2
## 10 HAMS 21 90 3
## 11 HAMSB 4 4 3
## 12 HAMSA 23 59 3
## 13 HAMSP 4 22 4
## 14 HAMSP 9 112 5
## 15 HAMS 12 63 4
## 16 HAMSP 14 54 6
## 17 HAMS 7 81 5
## 18 HAMSP 1 11 2
## 19 HAMS 17 102 6
## 20 HAMSP 22 93 7
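The difference between the two numberings shown above comes down to level order versus order of first appearance. A tiny sketch on a made-up vector of IDs (ids is an illustrative name) shows both side by side:

```r
# Hypothetical IDs within a single diet, with one repeat
ids <- c("65", "11", "66", "11", "22")

# factor() sorts the distinct values to build its levels,
# so the codes follow sorted order, not row order
by_level <- as.integer(factor(ids))

# match() against unique() numbers the IDs in order of first appearance
by_first_seen <- match(ids, unique(ids))

by_level       # 3 1 4 1 2
by_first_seen  # 1 2 3 2 4
```

Either gives a valid 1:1 mapping within a group; match only matters if the numbering should follow the row order.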

Merging output in R

max=aggregate(cbind(a$VALUE,Date=a$DATE) ~ format(a$DATE, "%m") + cut(a$CLASS, breaks=c(0,2,4,6,8,10,12,14)) , data = a, max)[-1]
max$DATE=as.Date(max$DATE, origin = "1970-01-01")
Sample Data :
DATE GRADE VALUE
2008-09-01 1 20
2008-09-02 2 30
2008-09-03 3 50
.
.
2008-09-30 2 75
.
.
2008-10-01 1 95
.
.
2008-11-01 4 90
.
.
2008-12-01 1 70
2008-12-02 2 40
2008-12-28 4 30
2008-12-29 1 40
2008-12-31 3 50
My Expected output according to above table for only first month is :
DATE GRADE VALUE
2008-09-30 (0,2] 75
2008-09-02 (2,4] 50
Output in my real data :
format(DATE, "%m")
1 09
2 10
3 11
4 12
5 09
6 10
7 11
cut(a$GRADE, breaks = c(0, 2, 4, 6, 8, 10, 12, 14)) value
1 (0,2] 0.30844444
2 (0,2] 1.00000000
3 (0,2] 1.00000000
4 (0,2] 0.73333333
5 (2,4] 0.16983488
6 (2,4] 0.09368000
7 (2,4] 0.10589335
Date
1 2008-09-30
2 2008-10-31
3 2008-11-28
4 2008-12-31
5 2008-09-30
6 2008-10-31
7 2008-11-28
The output does not correspond to the sample data, as the real data is too big to post. The logic is simple: grades run from 1 to 10, and I want the highest value per month within each grade group, e.g. one maximum for each of (0,2], (2,4], etc.
I used aggregate with the function max, grouping by two columns, Date and Grade. When I run the code and display the value of max, I get 3 tables as output, one after the other. I want to plot this output but am not able to because of this. How can I merge all these outputs?
Try:
library(dplyr)
a %>%
group_by(MONTH=format(DATE, "%m"), GRADE=cut(GRADE, breaks=seq(0,14,by=2))) %>%
summarise(across(everything(), max))
# MONTH GRADE DATE VALUE
#1 09 (0,2] 2008-09-30 75
#2 09 (2,4] 2008-09-03 50
#3 10 (0,2] 2008-10-01 95
#4 11 (2,4] 2008-11-01 90
#5 12 (0,2] 2008-12-29 70
#6 12 (2,4] 2008-12-31 50
Or using data.table
library(data.table)
setDT(a)[, list(DATE=max(DATE), VALUE=max(VALUE)),
by= list(MONTH=format(DATE, "%m"),
GRADE=cut(GRADE, breaks=seq(0,14, by=2)))]
# MONTH GRADE DATE VALUE
#1: 09 (0,2] 2008-09-30 75
#2: 09 (2,4] 2008-09-03 50
#3: 10 (0,2] 2008-10-01 95
#4: 11 (2,4] 2008-11-01 90
#5: 12 (0,2] 2008-12-29 70
#6: 12 (2,4] 2008-12-31 50
Or using aggregate
res <- transform(with(a,
aggregate(cbind(VALUE, DATE),
list(MONTH=format(DATE, "%m") ,GRADE=cut(GRADE, breaks=seq(0,14, by=2))), max)),
DATE=as.Date(DATE, origin="1970-01-01"))
res[order(res$MONTH),]
# MONTH GRADE VALUE DATE
#1 09 (0,2] 75 2008-09-30
#4 09 (2,4] 50 2008-09-03
#2 10 (0,2] 95 2008-10-01
#5 11 (2,4] 90 2008-11-01
#3 12 (0,2] 70 2008-12-29
#6 12 (2,4] 50 2008-12-31
data
a <- structure(list(DATE = structure(c(14123, 14124, 14125, 14152,
14153, 14184, 14214, 14215, 14241, 14242, 14244), class = "Date"),
GRADE = c(1L, 2L, 3L, 2L, 1L, 4L, 1L, 2L, 4L, 1L, 3L), VALUE = c(20L,
30L, 50L, 75L, 95L, 90L, 70L, 40L, 30L, 40L, 50L)), .Names = c("DATE",
"GRADE", "VALUE"), row.names = c(NA, -11L), class = "data.frame")
Update
If you want to include YEAR also in the grouping
library(dplyr)
a %>%
group_by(MONTH=format(DATE, "%m"), YEAR=format(DATE, "%Y"), GRADE=cut(GRADE, breaks=seq(0,14, by=2)))%>%
summarise(across(everything(), max))
# MONTH YEAR GRADE DATE VALUE
#1 09 2008 (0,2] 2008-09-30 75
#2 09 2008 (2,4] 2008-09-03 50
#3 09 2009 (0,2] 2009-09-30 75
#4 09 2009 (2,4] 2009-09-03 50
#5 10 2008 (0,2] 2008-10-01 95
#6 10 2009 (0,2] 2009-10-01 95
#7 11 2008 (2,4] 2008-11-01 90
#8 11 2009 (2,4] 2009-11-01 90
#9 12 2008 (0,2] 2008-12-29 70
#10 12 2008 (2,4] 2008-12-31 50
#11 12 2009 (0,2] 2009-12-29 70
#12 12 2009 (2,4] 2009-12-31 50
data
a <- structure(list(DATE = structure(c(14123, 14124, 14125, 14152,
14153, 14184, 14214, 14215, 14241, 14242, 14244, 14488, 14489,
14490, 14517, 14518, 14549, 14579, 14580, 14606, 14607, 14609
), class = "Date"), GRADE = c(1L, 2L, 3L, 2L, 1L, 4L, 1L, 2L,
4L, 1L, 3L, 1L, 2L, 3L, 2L, 1L, 4L, 1L, 2L, 4L, 1L, 3L), VALUE = c(20L,
30L, 50L, 75L, 95L, 90L, 70L, 40L, 30L, 40L, 50L, 20L, 30L, 50L,
75L, 95L, 90L, 70L, 40L, 30L, 40L, 50L)), .Names = c("DATE",
"GRADE", "VALUE"), row.names = c("1", "2", "3", "4", "5", "6",
"7", "8", "9", "10", "11", "12", "21", "31", "41", "51", "61",
"71", "81", "91", "101", "111"), class = "data.frame")
The following base R code may be helpful (using the 'a' data frame from akrun's answer):
xx = strsplit(as.character(a$DATE), '-')
a$month = sapply(xx, '[', 2)
gradeCats = cut(a$GRADE, breaks = c(0, 2, 4, 6, 8, 10, 12, 14))
aggregate(VALUE~month+gradeCats, data= a, max)
month gradeCats VALUE
1 09 (0,2] 75
2 10 (0,2] 95
3 12 (0,2] 70
4 09 (2,4] 50
5 11 (2,4] 90
6 12 (2,4] 50
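The answers above take max of DATE and max of VALUE independently, so the reported date is the latest date in the group, not necessarily the date on which the maximum occurred. If the latter is what is wanted, as the expected output in the question suggests, one possible base R sketch keeps the whole row holding each group's maximum. The five-row a, and the names key, is_max and res, are made up for illustration.

```r
# Hypothetical five-row subset of the sample data
a <- data.frame(
  DATE  = as.Date(c("2008-09-01", "2008-09-02", "2008-09-03",
                    "2008-09-30", "2008-10-01")),
  GRADE = c(1L, 2L, 3L, 2L, 1L),
  VALUE = c(20L, 30L, 50L, 75L, 95L)
)

a$MONTH     <- format(a$DATE, "%m")
a$GRADE_BIN <- cut(a$GRADE, breaks = seq(0, 14, by = 2))

# One key per month/grade-bin combination that actually occurs
key <- paste(a$MONTH, a$GRADE_BIN)

# Flag rows whose VALUE equals the maximum of their group
is_max <- a$VALUE == ave(a$VALUE, key, FUN = max)

res <- a[is_max, c("MONTH", "GRADE_BIN", "DATE", "VALUE")]
res
```

Ties within a group would return more than one row, which may or may not be desirable.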
