How to make a cross table with NA instead of X? - r

I have the following dataset (the code to load it is at the end of the question):
ID Date qty
1 ID25 2007-12-01 45
2 ID25 2008-01-01 26
3 ID25 2008-02-01 46
4 ID25 2008-03-01 0
5 ID25 2008-04-01 78
6 ID25 2008-05-01 65
7 ID25 2008-06-01 32
8 ID99 2008-02-01 99
9 ID99 2008-03-01 0
10 ID99 2008-04-01 99
I would like to create a pivot table from it. The following command seems to work fine:
pivottable <- xtabs(qty ~ ID + Date, table)
The output is the following:
ID 2007-12-01 2008-01-01 2008-02-01 2008-03-01 2008-04-01 2008-05-01 2008-06-01
ID25 45 26 46 0 78 65 32
ID99 0 0 99 0 99 0 0
However, ID99 has values for only 3 periods; the rest are filled in as 0. I would like to display NA in the cells that have no value in the first table, so that the result looks as follows:
ID 2007-12-01 2008-01-01 2008-02-01 2008-03-01 2008-04-01 2008-05-01 2008-06-01
ID25 45 26 46 0 78 65 32
ID99 NA NA 99 0 99 NA NA
Any suggestion on how to accomplish this?
Loading dataset:
table <- structure(list(ID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L), .Label = c("ID25", "ID99"), class = "factor"), Date = structure(c(7L,
1L, 2L, 3L, 4L, 5L, 6L, 2L, 3L, 4L), .Label = c("01/01/2008",
"01/02/2008", "01/03/2008", "01/04/2008", "01/05/2008", "01/06/2008",
"01/12/2007"), class = "factor"), qty = c(45L, 26L, 46L, 0L,
78L, 65L, 32L, 99L, 0L, 99L)), .Names = c("ID", "Date", "qty"
), class = "data.frame", row.names = c(NA, -10L))
table$Date <- as.POSIXct(table$Date, format='%d/%m/%Y')

You could use xtabs twice to obtain the output you are looking for:
Create the table:
pivottable <- xtabs(qty ~ ID + Date, table)
Replace all zeros of non-existing combinations with NA:
pivottable[!xtabs( ~ ID + Date, table)] <- NA
The output:
Date
ID 2007-12-01 2008-01-01 2008-02-01 2008-03-01 2008-04-01 2008-05-01 2008-06-01
ID25 45 26 46 0 78 65 32
ID99 99 0 99
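Note that the NAs are printed as blanks, which is why the ID99 row above shows only its three non-missing values. This comes from the print method for this class, whose na.print argument defaults to "". You could use print(pivottable, na.print = "NA") or unclass(pivottable) to get the regular print behavior.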
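As a self-contained sketch of the two xtabs steps, the snippet below rebuilds the question's data inline. The names sales and counts are assumptions (sales is used instead of table to avoid masking base::table), and Date is stored as a Date rather than POSIXct for brevity.

```r
# Data copied from the question; object names are illustrative.
sales <- data.frame(
  ID   = c(rep("ID25", 7), rep("ID99", 3)),
  Date = as.Date(c("2007-12-01", "2008-01-01", "2008-02-01", "2008-03-01",
                   "2008-04-01", "2008-05-01", "2008-06-01",
                   "2008-02-01", "2008-03-01", "2008-04-01")),
  qty  = c(45, 26, 46, 0, 78, 65, 32, 99, 0, 99)
)

# Step 1: sum qty for every ID/Date combination (absent combinations become 0).
pivottable <- xtabs(qty ~ ID + Date, sales)

# Step 2: count the rows behind each cell; a count of 0 means "no observation".
counts <- xtabs(~ ID + Date, sales)
pivottable[counts == 0] <- NA

# Show the NAs explicitly instead of as blanks.
print(pivottable, na.print = "NA")
```

Genuine zeros (such as ID99 on 2008-03-01) are kept because the mask is built from the row counts, not from the summed values.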

Related

R - How to calculate value differences between dates with heterogeneous number of rows

My data look like the following example.
# A tibble: 18 x 4
DATE AUTHOR PRODUCT SALES
<dttm> <chr> <chr> <dbl>
1 2019-11-27 James B 80
2 2019-11-28 James B 100
3 2019-11-27 James A 80
4 2019-11-28 James A 100
5 2019-11-26 Frank B 70
6 2019-11-27 Frank B 75
7 2019-11-28 Frank B 65
8 2019-11-26 Frank A 70
9 2019-11-27 Frank A 75
10 2019-11-28 Frank A 65
11 2019-11-25 Mary A 100
12 2019-11-26 Mary A 80
13 2019-11-27 Mary A 95
14 2019-11-28 Mary A 110
15 2019-11-25 Mary B 100
16 2019-11-26 Mary B 80
17 2019-11-27 Mary B 95
18 2019-11-28 Mary B 110
I would like to add a "DIFF" column containing the day-over-day difference in SALES, computed within each AUTHOR. My constraints are the following:
I have a different number of rows for every AUTHOR.
The same DATE can be repeated for an AUTHOR to report different information (PRODUCT in this example), but the SALES value stays the same, since it depends only on DATE and AUTHOR.
I have to keep every row in the dataset because every row contains specific information, so I cannot just drop the rows where DATE is duplicated.
Ideally I would implement the whole thing with a loop in my script.
My desired outcome would be:
# A tibble: 18 x 5
DATE AUTHOR PRODUCT SALES DIFF
<dttm> <chr> <chr> <dbl> <dbl>
1 2019-11-27 James B 80
2 2019-11-28 James B 100 20
3 2019-11-27 James A 80
4 2019-11-28 James A 100 20
5 2019-11-26 Frank B 70
6 2019-11-27 Frank B 75 5
7 2019-11-28 Frank B 65 -10
8 2019-11-26 Frank A 70
9 2019-11-27 Frank A 75 5
10 2019-11-28 Frank A 65 -10
11 2019-11-25 Mary A 100
12 2019-11-26 Mary A 80 -20
13 2019-11-27 Mary A 95 15
14 2019-11-28 Mary A 110 15
15 2019-11-25 Mary B 100
16 2019-11-26 Mary B 80 -20
17 2019-11-27 Mary B 95 15
18 2019-11-28 Mary B 110 15
I tried different things with dplyr and mutate, but nothing seemed to work. Does anyone have suggestions?
Thank you!
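You could use lag to subtract the previous value within each group. Grouping by both AUTHOR and PRODUCT keeps rows for different products from being differenced against each other (the rows are assumed to be sorted by DATE within each group):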
library(dplyr)
df %>% group_by(AUTHOR, PRODUCT) %>% mutate(diff = SALES - lag(SALES))
# DATE AUTHOR PRODUCT SALES diff
# <fct> <fct> <fct> <int> <int>
# 1 2019-11-27 James B 80 NA
# 2 2019-11-28 James B 100 20
# 3 2019-11-27 James A 80 NA
# 4 2019-11-28 James A 100 20
# 5 2019-11-26 Frank B 70 NA
# 6 2019-11-27 Frank B 75 5
# 7 2019-11-28 Frank B 65 -10
# 8 2019-11-26 Frank A 70 NA
# 9 2019-11-27 Frank A 75 5
#10 2019-11-28 Frank A 65 -10
#11 2019-11-25 Mary A 100 NA
#12 2019-11-26 Mary A 80 -20
#13 2019-11-27 Mary A 95 15
#14 2019-11-28 Mary A 110 15
#15 2019-11-25 Mary B 100 NA
#16 2019-11-26 Mary B 80 -20
#17 2019-11-27 Mary B 95 15
#18 2019-11-28 Mary B 110 15
Or using diff
df %>% group_by(AUTHOR, PRODUCT) %>% mutate(diff = c(NA, diff(SALES)))
data
df <- structure(list(DATE = structure(c(3L, 4L, 3L, 4L, 2L, 3L, 4L,
2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), .Label = c("2019-11-25",
"2019-11-26", "2019-11-27", "2019-11-28"), class = "factor"),
AUTHOR = structure(c(2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L,
1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("Frank",
"James", "Mary"), class = "factor"), PRODUCT = structure(c(2L,
2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L), .Label = c("A", "B"), class = "factor"), SALES = c(80L,
100L, 80L, 100L, 70L, 75L, 65L, 70L, 75L, 65L, 100L, 80L,
95L, 110L, 100L, 80L, 95L, 110L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15", "16", "17", "18"))
We can use shift from data.table
library(data.table)
setDT(df)[, diff := SALES - shift(SALES), .(AUTHOR, PRODUCT)][]
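For comparison, the same grouped difference can be sketched in base R with ave(). This is not one of the original answers, just a dependency-free variant; the data frame below is a shortened copy of the question's data and assumes the rows are already sorted by DATE within each AUTHOR/PRODUCT group.

```r
# Shortened copy of the question's data (James and Frank only).
df <- data.frame(
  DATE    = as.Date(c("2019-11-27", "2019-11-28", "2019-11-27", "2019-11-28",
                      "2019-11-26", "2019-11-27", "2019-11-28")),
  AUTHOR  = c("James", "James", "James", "James", "Frank", "Frank", "Frank"),
  PRODUCT = c("B", "B", "A", "A", "B", "B", "B"),
  SALES   = c(80, 100, 80, 100, 70, 75, 65)
)

# ave() applies FUN within each AUTHOR/PRODUCT group and returns the results
# in the original row order; the first row of each group has no predecessor.
df$DIFF <- ave(df$SALES, df$AUTHOR, df$PRODUCT,
               FUN = function(x) c(NA, diff(x)))
df
```

Because ave() keeps the original row order, no rows are dropped or rearranged, which matches the requirement to keep every row.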

R transposing repeat records

I have a data table in which record names repeat. I would like to transpose it so that each unique record name becomes a column.
Below is a sample of the Data table:
V1 V2 id
ClientID 29 1
CheckID 201 1
PaymentAmount 256 1
Gross 301 1
Net 256 1
Invested 130 1
Invested 53 1
Invested 118 1
ClientID 31 2
CheckID 222 2
PaymentAmount 41 2
Gross 46 2
Net 41 2
Invested 46 2
ClientID 43 3
CheckID 310 3
PaymentAmount 41 3
Gross 46 3
Net 41 3
Invested 46 3
You can see from the table above that the record in V1 called "Invested" can occur more than once for a single ClientID. I'd like to transpose the data so that it looks like this:
ClientID CheckID PaymentAmount Gross Net Invested ID
29 201 256 301 256 130 1
29 201 256 301 256 53 1
29 201 256 301 256 118 1
31 222 41 46 41 46 2
43 310 41 46 41 46 3
43 310 41 46 41 48 3
Any support is greatly appreciated!
We can create a sequence column grouped by "V1" and "id" using data.table, convert from 'long' to 'wide' format with dcast, and then fill each NA with the preceding non-NA value using na.locf from zoo.
library(data.table)
library(zoo)
setDT(df1)[, N:= 1:.N , by = .(V1, id)]
dcast(df1, id+N~V1, value.var="V2")[, lapply(.SD, na.locf),
by = id, .SDcols = CheckID:PaymentAmount]
# id CheckID ClientID Gross Invested Net PaymentAmount
#1: 1 201 29 301 130 256 256
#2: 1 201 29 301 53 256 256
#3: 1 201 29 301 118 256 256
#4: 2 222 31 46 46 41 41
#5: 3 310 43 46 46 41 41
data
df1 <- structure(list(V1 = c("ClientID", "CheckID", "PaymentAmount",
"Gross", "Net", "Invested", "Invested", "Invested", "ClientID",
"CheckID", "PaymentAmount", "Gross", "Net", "Invested", "ClientID",
"CheckID", "PaymentAmount", "Gross", "Net", "Invested"), V2 = c(29L,
201L, 256L, 301L, 256L, 130L, 53L, 118L, 31L, 222L, 41L, 46L,
41L, 46L, 43L, 310L, 41L, 46L, 41L, 46L), id = c(1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L,
3L)), .Names = c("V1", "V2", "id"), class = "data.frame",
row.names = c(NA, -20L))
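The same long-to-wide idea can also be sketched in base R, under the assumption that "Invested" is the only record name that repeats within an id. This is an alternative to the data.table/zoo answer above, not a restatement of it; the names repeated, once, invested and wide are illustrative.

```r
# First two ids from the question's data.
df1 <- data.frame(
  V1 = c("ClientID", "CheckID", "PaymentAmount", "Gross", "Net",
         "Invested", "Invested", "Invested",
         "ClientID", "CheckID", "PaymentAmount", "Gross", "Net", "Invested"),
  V2 = c(29, 201, 256, 301, 256, 130, 53, 118, 31, 222, 41, 46, 41, 46),
  id = c(rep(1, 8), rep(2, 6))
)

repeated <- df1$V1 == "Invested"

# Records that occur once per id: one row per id, one column per record name.
once <- reshape(df1[!repeated, ], idvar = "id", timevar = "V1",
                direction = "wide")
names(once) <- sub("^V2\\.", "", names(once))  # drop the "V2." prefix

# The repeating record stays long; merging repeats the once-per-id values.
invested <- data.frame(id = df1$id[repeated], Invested = df1$V2[repeated])
wide <- merge(once, invested, by = "id")
wide
```

The merge plays the role of the na.locf fill: every Invested row picks up the single ClientID, CheckID, and so on for its id.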

R programming - data frame manoeuvre

Suppose I have the following dataframe:
dc tmin tmax cint wcmin wcmax wsmin wsmax gsmin gsmax wd rmin rmax cir lr
1: 24 -1 4 5 -5 -2 20 25 35 40 90 11.8 26.6 14.8 3
2: 41 -3 5 8 -8 -3 15 20 35 40 90 10.0 23.5 13.5 3
3: 48 0 5 5 -4 0 30 35 45 50 45 7.3 19.0 11.7 6
4: 50 0 5 5 -4 0 30 35 45 50 45 7.3 19.0 11.7 6
5: 52 3 5 2 -3 1 20 25 35 40 45 6.7 17.4 10.7 6
6: 57 -2 5 7 -6 -1 25 30 35 40 315 4.4 13.8 9.4 7
lc wc li yd yr nF factdcx
1: 1 3 TRUE 1 2010 2 24
2: 1 3 TRUE 1 2010 8 41
3: 2 3 TRUE 1 2010 0 48
4: 2 3 TRUE 1 2010 0 50
5: 2 3 TRUE 1 2010 0 52
6: 3 3 FALSE 1 2010 0 57
I'd like to turn it into a new dataframe like the following:
dc tmin tmax cint wcmin wcmax wsmin wsmax gsmin gsmax wd rmin rmax cir lr
1: 24 -1 4 5 -5 -2 20 25 35 40 90 11.8 26.6 14.8 3
2: 41 -3 5 8 -8 -3 15 20 35 40 90 10.0 23.5 13.5 3
3: 48 0 5 5 -4 0 30 35 45 50 45 7.3 19.0 11.7 6
4: 52 3 5 2 -3 1 20 25 35 40 45 6.7 17.4 10.7 6
5: 57 -2 5 7 -6 -1 25 30 35 40 315 4.4 13.8 9.4 7
lc wc li yd yr nF factdcx
1: 1 3 TRUE 1 2010 2 24
2: 1 3 TRUE 1 2010 8 41
3: 2 3 TRUE 1 2010 0 (sum of nF for 48 and 50, factdcx) 48
4: 2 3 TRUE 1 2010 0 52
5: 3 3 FALSE 1 2010 0 57
How can I do it? (The real dataframe, abc, is of course much larger, but I want to sum nF over categories 48 and 50 and group them into a single category, say '48'.)
Many thanks!
> dput(head(abc1))
structure(list(dc = c(24L, 41L, 48L, 50L, 52L, 57L), tmin = c(-1L,
-3L, 0L, 0L, 3L, -2L), tmax = c(4L, 5L, 5L, 5L, 5L, 5L), cint = c(5L,
8L, 5L, 5L, 2L, 7L), wcmin = c(-5L, -8L, -4L, -4L, -3L, -6L),
wcmax = c(-2L, -3L, 0L, 0L, 1L, -1L), wsmin = c(20L, 15L,
30L, 30L, 20L, 25L), wsmax = c(25L, 20L, 35L, 35L, 25L, 30L
), gsmin = c(35L, 35L, 45L, 45L, 35L, 35L), gsmax = c(40L,
40L, 50L, 50L, 40L, 40L), wd = c(90L, 90L, 45L, 45L, 45L,
315L), rmin = c(11.8, 10, 7.3, 7.3, 6.7, 4.4), rmax = c(26.6,
23.5, 19, 19, 17.4, 13.8), cir = c(14.8, 13.5, 11.7, 11.7,
10.7, 9.4), lr = c(3L, 3L, 6L, 6L, 6L, 7L), lc = c(1L, 1L,
2L, 2L, 2L, 3L), wc = c(3L, 3L, 3L, 3L, 3L, 3L), li = c(TRUE,
TRUE, TRUE, TRUE, TRUE, FALSE), yd = c(1L, 1L, 1L, 1L, 1L,
1L), yr = c(2010L, 2010L, 2010L, 2010L, 2010L, 2010L), nF = c(2L,
8L, 0L, 0L, 0L, 0L), factdcx = structure(1:6, .Label = c("24",
"41", "48", "50", "52", "57", "70"), class = "factor")), .Names = c("dc",
"tmin", "tmax", "cint", "wcmin", "wcmax", "wsmin", "wsmax", "gsmin",
"gsmax", "wd", "rmin", "rmax", "cir", "lr", "lc", "wc", "li",
"yd", "yr", "nF", "factdcx"), class = c("data.table", "data.frame"
), row.names = c(NA, -6L), .internal.selfref = <pointer: 0x054b24a0>)
I still have a problem:
> head(abc1 (updated))
dc tmin tmax cint wcmin wcmax wsmin wsmax gsmin gsmax wd rmin rmax cir lr
1: 24 -1 4 5 -5 -2 20 25 35 40 90 11.8 26.6 14.8 3
2: 41 -3 5 8 -8 -3 15 20 35 40 90 10.0 23.5 13.5 3
3: 48 0 5 5 -4 0 30 35 45 50 45 7.3 19.0 11.7 6
4: 52 3 5 2 -3 1 20 25 35 40 45 6.7 17.4 10.7 6
5: 57 -2 5 7 -6 -1 25 30 35 40 315 4.4 13.8 9.4 7
6: 70 -2 3 5 -4 -1 20 25 30 35 360 3.6 10.2 6.6 7
lc wc li yd yr nF factdcx
1: 1 3 TRUE 1 2010 2 24
2: 1 3 TRUE 1 2010 8 41
3: 2 3 TRUE 1 2010 57 48
4: 2 3 TRUE 1 2010 0 52
5: 3 3 FALSE 1 2010 0 57
6: 3 2 TRUE 1 2010 1 70
The sum of nF is incorrect; it should be zero.
Try
library(data.table)
unique(setDT(df1)[, factdcx:= as.character(factdcx)][factdcx %chin%
c('48','50'), c('dc', 'factdcx', 'nF') := list('48', '48', sum(nF))])
# dc tmin tmax cint wcmin wcmax wsmin wsmax gsmin gsmax wd rmin rmax cir lr
#1: 24 -1 4 5 -5 -2 20 25 35 40 90 11.8 26.6 14.8 3
#2: 41 -3 5 8 -8 -3 15 20 35 40 90 10.0 23.5 13.5 3
#3: 48 0 5 5 -4 0 30 35 45 50 45 7.3 19.0 11.7 6
#4: 52 3 5 2 -3 1 20 25 35 40 45 6.7 17.4 10.7 6
#5: 57 -2 5 7 -6 -1 25 30 35 40 315 4.4 13.8 9.4 7
# lc wc li yd yr nF factdcx
#1: 1 3 TRUE 1 2010 2 24
#2: 1 3 TRUE 1 2010 8 41
#3: 2 3 TRUE 1 2010 0 48
#4: 2 3 TRUE 1 2010 0 52
#5: 3 3 FALSE 1 2010 0 57
For abc1,
res1 <- unique(setDT(abc1)[, factdcx:= as.character(factdcx)][factdcx %chin%
c('48','50'), c('dc', 'factdcx', 'nF') := list(48, '48', sum(nF))])
res1
# dc tmin tmax cint wcmin wcmax wsmin wsmax gsmin gsmax wd rmin rmax cir lr
#1: 24 -1 4 5 -5 -2 20 25 35 40 90 11.8 26.6 14.8 3
#2: 41 -3 5 8 -8 -3 15 20 35 40 90 10.0 23.5 13.5 3
#3: 48 0 5 5 -4 0 30 35 45 50 45 7.3 19.0 11.7 6
#4: 52 3 5 2 -3 1 20 25 35 40 45 6.7 17.4 10.7 6
#5: 57 -2 5 7 -6 -1 25 30 35 40 315 4.4 13.8 9.4 7
# lc wc li yd yr nF factdcx
#1: 1 3 TRUE 1 2010 2 24
#2: 1 3 TRUE 1 2010 8 41
#3: 2 3 TRUE 1 2010 0 48
#4: 2 3 TRUE 1 2010 0 52
#5: 3 3 FALSE 1 2010 0 57
data
df1 <- structure(list(dc = structure(1:6, .Label = c("24", "41",
"48",
"50", "52", "57"), class = "factor"), tmin = c(-1L, -3L, 0L,
0L, 3L, -2L), tmax = c(4L, 5L, 5L, 5L, 5L, 5L), cint = c(5L,
8L, 5L, 5L, 2L, 7L), wcmin = c(-5L, -8L, -4L, -4L, -3L, -6L),
wcmax = c(-2L, -3L, 0L, 0L, 1L, -1L), wsmin = c(20L, 15L,
30L, 30L, 20L, 25L), wsmax = c(25L, 20L, 35L, 35L, 25L, 30L
), gsmin = c(35L, 35L, 45L, 45L, 35L, 35L), gsmax = c(40L,
40L, 50L, 50L, 40L, 40L), wd = c(90L, 90L, 45L, 45L, 45L,
315L), rmin = c(11.8, 10, 7.3, 7.3, 6.7, 4.4), rmax = c(26.6,
23.5, 19, 19, 17.4, 13.8), cir = c(14.8, 13.5, 11.7, 11.7,
10.7, 9.4), lr = c(3L, 3L, 6L, 6L, 6L, 7L), lc = c(1L, 1L,
2L, 2L, 2L, 3L), wc = c(3L, 3L, 3L, 3L, 3L, 3L), li = c(TRUE,
TRUE, TRUE, TRUE, TRUE, FALSE), yd = c(1L, 1L, 1L, 1L, 1L,
1L), yr = c(2010L, 2010L, 2010L, 2010L, 2010L, 2010L), nF = c(2L,
8L, 0L, 0L, 0L, 0L), factdcx = structure(1:6, .Label = c("24",
"41", "48", "50", "52", "57"), class = "factor")), .Names = c("dc",
"tmin", "tmax", "cint", "wcmin", "wcmax", "wsmin", "wsmax", "gsmin",
"gsmax", "wd", "rmin", "rmax", "cir", "lr", "lc", "wc", "li",
"yd", "yr", "nF", "factdcx"), row.names = c("1:", "2:", "3:",
"4:", "5:", "6:"), class = "data.frame")
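A minimal base R sketch of the same "recode, then collapse" idea follows. It keeps only three columns, and the nF values 3 and 4 for categories 48 and 50 are made up (they are 0 in the question) so that the sum is visible; merged is an illustrative name.

```r
# Toy subset of the question's columns; nF for 48 and 50 is hypothetical.
abc <- data.frame(
  dc = c(24L, 41L, 48L, 50L, 52L),
  lr = c(3L, 3L, 6L, 6L, 6L),
  nF = c(2L, 8L, 3L, 4L, 0L)
)

# Step 1: fold category 50 into category 48.
abc$dc[abc$dc == 50L] <- 48L

# Step 2: sum nF over rows that now share the same key columns.
merged <- aggregate(nF ~ dc + lr, data = abc, FUN = sum)
merged
```

With the full table, every column that should survive would need to appear on the right-hand side of the formula (or the merged rows must agree on it), which is what unique() relies on in the data.table answer above.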

Merging output in R

max=aggregate(cbind(a$VALUE,Date=a$DATE) ~ format(a$DATE, "%m") + cut(a$GRADE, breaks=c(0,2,4,6,8,10,12,14)) , data = a, max)[-1]
max$Date=as.Date(max$Date, origin = "1970-01-01")
Sample Data :
DATE GRADE VALUE
2008-09-01 1 20
2008-09-02 2 30
2008-09-03 3 50
.
.
2008-09-30 2 75
.
.
2008-10-01 1 95
.
.
2008-11-01 4 90
.
.
2008-12-01 1 70
2008-12-02 2 40
2008-12-28 4 30
2008-12-29 1 40
2008-12-31 3 50
My expected output according to the above table, for the first month only, is:
DATE GRADE VALUE
2008-09-30 (0,2] 75
2008-09-03 (2,4] 50
Output in my real data :
format(DATE, "%m")
1 09
2 10
3 11
4 12
5 09
6 10
7 11
cut(a$GRADE, breaks = c(0, 2, 4, 6, 8, 10, 12, 14)) value
1 (0,2] 0.30844444
2 (0,2] 1.00000000
3 (0,2] 1.00000000
4 (0,2] 0.73333333
5 (2,4] 0.16983488
6 (2,4] 0.09368000
7 (2,4] 0.10589335
Date
1 2008-09-30
2 2008-10-31
3 2008-11-28
4 2008-12-31
5 2008-09-30
6 2008-10-31
7 2008-11-28
The output does not correspond to the sample data, because the real data is too big to show. The logic is simple: grades run from 1 to 10, and I want the highest value per month within each grade group, i.e. the maximum for each of (0,2], (2,4], etc.
I used aggregate with the function max, grouping by two columns, Date and Grade. When I run the code and display the value of max, I get three tables as output, one after the other, and I am not able to plot this output because of that. How can I merge these outputs into one?
Try:
library(dplyr)
a %>%
group_by(MONTH=format(DATE, "%m"), GRADE=cut(GRADE, breaks=seq(0,14,by=2))) %>%
summarise(across(everything(), max)) # summarise_each(funs(max)) in dplyr versions before 1.0
# MONTH GRADE DATE VALUE
#1 09 (0,2] 2008-09-30 75
#2 09 (2,4] 2008-09-03 50
#3 10 (0,2] 2008-10-01 95
#4 11 (2,4] 2008-11-01 90
#5 12 (0,2] 2008-12-29 70
#6 12 (2,4] 2008-12-31 50
Or using data.table
library(data.table)
setDT(a)[, list(DATE=max(DATE), VALUE=max(VALUE)),
by= list(MONTH=format(DATE, "%m"),
GRADE=cut(GRADE, breaks=seq(0,14, by=2)))]
# MONTH GRADE DATE VALUE
#1: 09 (0,2] 2008-09-30 75
#2: 09 (2,4] 2008-09-03 50
#3: 10 (0,2] 2008-10-01 95
#4: 11 (2,4] 2008-11-01 90
#5: 12 (0,2] 2008-12-29 70
#6: 12 (2,4] 2008-12-31 50
Or using aggregate
res <- transform(with(a,
aggregate(cbind(VALUE, DATE),
list(MONTH=format(DATE, "%m") ,GRADE=cut(GRADE, breaks=seq(0,14, by=2))), max)),
DATE=as.Date(DATE, origin="1970-01-01"))
res[order(res$MONTH),]
# MONTH GRADE VALUE DATE
#1 09 (0,2] 75 2008-09-30
#4 09 (2,4] 50 2008-09-03
#2 10 (0,2] 95 2008-10-01
#5 11 (2,4] 90 2008-11-01
#3 12 (0,2] 70 2008-12-29
#6 12 (2,4] 50 2008-12-31
data
a <- structure(list(DATE = structure(c(14123, 14124, 14125, 14152,
14153, 14184, 14214, 14215, 14241, 14242, 14244), class = "Date"),
GRADE = c(1L, 2L, 3L, 2L, 1L, 4L, 1L, 2L, 4L, 1L, 3L), VALUE = c(20L,
30L, 50L, 75L, 95L, 90L, 70L, 40L, 30L, 40L, 50L)), .Names = c("DATE",
"GRADE", "VALUE"), row.names = c(NA, -11L), class = "data.frame")
Update
If you want to include YEAR also in the grouping
library(dplyr)
a %>%
group_by(MONTH=format(DATE, "%m"), YEAR=format(DATE, "%Y"), GRADE=cut(GRADE, breaks=seq(0,14, by=2)))%>%
summarise(across(everything(), max)) # summarise_each(funs(max)) in dplyr versions before 1.0
# MONTH YEAR GRADE DATE VALUE
#1 09 2008 (0,2] 2008-09-30 75
#2 09 2008 (2,4] 2008-09-03 50
#3 09 2009 (0,2] 2009-09-30 75
#4 09 2009 (2,4] 2009-09-03 50
#5 10 2008 (0,2] 2008-10-01 95
#6 10 2009 (0,2] 2009-10-01 95
#7 11 2008 (2,4] 2008-11-01 90
#8 11 2009 (2,4] 2009-11-01 90
#9 12 2008 (0,2] 2008-12-29 70
#10 12 2008 (2,4] 2008-12-31 50
#11 12 2009 (0,2] 2009-12-29 70
#12 12 2009 (2,4] 2009-12-31 50
data
a <- structure(list(DATE = structure(c(14123, 14124, 14125, 14152,
14153, 14184, 14214, 14215, 14241, 14242, 14244, 14488, 14489,
14490, 14517, 14518, 14549, 14579, 14580, 14606, 14607, 14609
), class = "Date"), GRADE = c(1L, 2L, 3L, 2L, 1L, 4L, 1L, 2L,
4L, 1L, 3L, 1L, 2L, 3L, 2L, 1L, 4L, 1L, 2L, 4L, 1L, 3L), VALUE = c(20L,
30L, 50L, 75L, 95L, 90L, 70L, 40L, 30L, 40L, 50L, 20L, 30L, 50L,
75L, 95L, 90L, 70L, 40L, 30L, 40L, 50L)), .Names = c("DATE",
"GRADE", "VALUE"), row.names = c("1", "2", "3", "4", "5", "6",
"7", "8", "9", "10", "11", "12", "21", "31", "41", "51", "61",
"71", "81", "91", "101", "111"), class = "data.frame")
The following code using base R may be helpful (using the 'a' data frame from akrun's answer):
a$month = sapply(strsplit(as.character(a$DATE), '-'), '[', 2)
gradeCats = cut(a$GRADE, breaks = c(0, 2, 4, 6, 8, 10, 12, 14))
aggregate(VALUE~month+gradeCats, data= a, max)
month gradeCats VALUE
1 09 (0,2] 75
2 10 (0,2] 95
3 12 (0,2] 70
4 09 (2,4] 50
5 11 (2,4] 90
6 12 (2,4] 50
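One caveat about the answers above: the ones that report a DATE take max(DATE) and max(VALUE) independently, so the DATE shown is the latest date in the group, not necessarily the date on which the highest VALUE occurred. In the sample data, for instance, December (0,2] reports 2008-12-29 next to 70, while the value 70 was recorded on 2008-12-01. If the date of the maximum is what is wanted, a base R sketch could pick the whole row with which.max(); the names BIN, groups and res are illustrative.

```r
# December rows from the question's sample data.
a <- data.frame(
  DATE  = as.Date(c("2008-12-01", "2008-12-02", "2008-12-28",
                    "2008-12-29", "2008-12-31")),
  GRADE = c(1L, 2L, 4L, 1L, 3L),
  VALUE = c(70L, 40L, 30L, 40L, 50L)
)
a$MONTH <- format(a$DATE, "%m")
a$BIN   <- cut(a$GRADE, breaks = seq(0, 14, by = 2))

# One data frame per non-empty MONTH/BIN combination.
groups <- split(a, list(a$MONTH, a$BIN), drop = TRUE)

# Keep the row holding the largest VALUE in each group (first one on ties).
res <- do.call(rbind, lapply(groups, function(g) g[which.max(g$VALUE), ]))
res
```

The result is a single data frame with one row per month and grade group, which can be passed to a plotting function directly.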

aggregate data in columns with duplicate id in R

I have a df like this:
> dat
gen M1 M1 M1 M1 M2 M2 M2
G1 150 142 130 105 96
G2 150 145 142 130 96 89
G3 150 145 130 105 96
G4 145 142 130 105 89
G5 150 142 130 105 96
G6 145 142 130 96 89
G7 150 142 105 96
G8 150 145 130 105 89
G9 150 145 142 96 89
Here, the data are spread across columns with duplicated ids. I would like to aggregate them like this:
>dat1
gen M1 M1 M1 M1 agg M2 M2 M2 agg
G1 150 142 130 150/142/130 105 96 105/96
G2 150 145 142 130 150/145/142/130 96 89 96/89
G3 150 145 130 150/145/130 105 96 105/96
G4 145 142 130 145/142/130 105 89 105/89
G5 150 142 130 150/142/130 105 96 105/96
G6 145 142 130 145/142/130 96 89 96/89
G7 150 142 150/142 105 96 105/96
G8 150 145 130 150/145/130 105 89 105/89
G9 150 145 142 150/145/142 96 89 96/89
Here, in each agg column I combined all the values from the columns that share the same header in the first row.
I would like to create a new column after each set of duplicate columns and aggregate into it.
How can I do this in R? I am very confused.
EDIT:
dput(dat)
structure(list(V1 = structure(c(10L, 1L, 2L, 3L, 4L, 5L, 6L,
7L, 8L, 9L), .Label = c("G1", "G2", "G3", "G4", "G5", "G6", "G7",
"G8", "G9", "gen"), class = "factor"), V2 = structure(c(2L, 1L,
1L, 1L, NA, 1L, NA, 1L, 1L, 1L), .Label = c("150", "M1"), class = "factor"),
V3 = structure(c(2L, NA, 1L, 1L, 1L, NA, 1L, NA, 1L, 1L), .Label = c("145",
"M1"), class = "factor"), V4 = structure(c(2L, 1L, 1L, NA,
1L, 1L, 1L, 1L, NA, 1L), .Label = c("142", "M1"), class = "factor"),
V5 = structure(c(2L, 1L, 1L, 1L, 1L, 1L, 1L, NA, 1L, NA), .Label = c("130",
"M1"), class = "factor"), V6 = structure(c(2L, 1L, NA, 1L,
1L, 1L, NA, 1L, 1L, NA), .Label = c("105", "M2"), class = "factor"),
V7 = structure(c(2L, 1L, 1L, 1L, NA, 1L, 1L, 1L, NA, 1L), .Label = c("96",
"M2"), class = "factor"), V8 = structure(c(2L, NA, 1L, NA,
1L, NA, 1L, NA, 1L, 1L), .Label = c("89", "M2"), class = "factor")), .Names = c("V1",
"V2", "V3", "V4", "V5", "V6", "V7", "V8"), class = "data.frame", row.names = c(NA,
-10L))
This works if the missing values are blanks:
dat$agg1 <- apply(dat[,2:5],1,function(x)paste(x[nchar(x)>0],collapse="/"))
dat$agg2 <- apply(dat[,6:8],1,function(x)paste(x[nchar(x)>0],collapse="/"))
dat <- dat[,c(1:5,9,6:8,10)]
dat
# gen M1 M1.1 M1.2 M1.3 agg1 M2 M2.1 M2.2 agg2
# 1 G1 150 142 130 150/142/130 105 96 105/96
# 2 G2 150 145 142 130 150/145/142/130 96 89 96/89
# 3 G3 150 145 130 150/145/130 105 96 105/96
# 4 G4 145 142 130 145/142/130 105 89 105/89
# ...
This works if the missing values are NA
dat$agg1 <- apply(dat[,2:5],1,function(x)paste(x[!is.na(x)],collapse="/"))
dat$agg2 <- apply(dat[,6:8],1,function(x)paste(x[!is.na(x)],collapse="/"))
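The hard-coded column ranges above can be generalised with a loop over the marker names. The sketch below assumes the columns were read with the default check.names behavior (so the duplicates are named M1, M1.1, M1.2, ...) and that missing values are NA; the "_agg" suffix and the three sample rows (G1, G2, G4 from the question) are illustrative.

```r
dat <- data.frame(
  gen  = c("G1", "G2", "G4"),
  M1   = c(150, 150, NA),  M1.1 = c(NA, 145, 145),
  M1.2 = c(142, 142, 142), M1.3 = c(130, 130, 130),
  M2   = c(105, NA, 105),  M2.1 = c(96, 96, NA),  M2.2 = c(NA, 89, 89)
)

# Marker names with the ".1", ".2" suffixes removed: "M1" "M2".
markers <- unique(sub("\\.[0-9]+$", "", names(dat)[-1]))

for (m in markers) {
  # Columns named exactly m, or m followed by a numeric suffix.
  cols <- grep(paste0("^", m, "(\\.[0-9]+)?$"), names(dat))
  # Paste the non-missing values of each row, separated by "/".
  dat[[paste0(m, "_agg")]] <- apply(dat[cols], 1, function(x)
    paste(x[!is.na(x)], collapse = "/"))
}
dat
```

Passing only the numeric marker columns to apply() matters: if a character column such as gen is included, the data frame is converted to a character matrix and the numbers are likely to come out padded with spaces.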
To aggregate them into a character vector, you can use paste():
x=data.frame(x1=1:10,x2=1:10,x1=11:20)
# note that R created the x object with three columns named x1, x2 and x1.1
xnew=cbind(x,agg=paste(x$x1,x$x2,x$x1.1,sep="/"))
I am not sure if this is what you want to do because I am a bit confused about the structure of your data.
Here is my script... I know some of you can make it simpler and more elegant!
I transposed my df (a simple example) and read it in as a table.
> dat<-read.table("dat.txt", header=T, sep="\t", na.strings="")
> dat
gen A B C D
1 M1 1 NA 3 NA
2 M1 NA 6 NA 3
3 M1 4 8 NA NA
4 M1 NA NA 6 3
5 M2 8 NA 6 NA
6 M2 NA 2 NA 6
7 M3 3 8 NA 2
8 M3 8 9 5 NA
9 M4 3 7 8 5
10 M4 5 NA 3 2
> final<-NULL
> for(i in 1:4){
+ mar<-as.character(dat[1,1])
+ dat1<-dat[dat[,1]%in% c(mar),]
+ dat <- dat[!dat[,1]%in% c(mar),]
+ dat2 <- apply(dat1,2,function(x)paste(x[!is.na(x)],collapse="/"))
+ dat2$gen<-mar
+ dat3<-rbind(dat1,dat2)
+ final<-rbind(final, dat3)
+ }
Warning messages:
1: In dat2$gen <- mar : Coercing LHS to a list
2: In dat2$gen <- mar : Coercing LHS to a list
3: In dat2$gen <- mar : Coercing LHS to a list
4: In dat2$gen <- mar : Coercing LHS to a list
> final
gen A B C D
1 M1 1 <NA> 3 <NA>
2 M1 <NA> 6 <NA> 3
3 M1 4 8 <NA> <NA>
4 M1 <NA> <NA> 6 3
5 M1 1/ 4 6/ 8 3/ 6 3/ 3
51 M2 8 <NA> 6 <NA>
6 M2 <NA> 2 <NA> 6
31 M2 8 2 6 6
7 M3 3 8 <NA> 2
8 M3 8 9 5 <NA>
32 M3 3/8 8/9 5 2
9 M4 3 7 8 5
10 M4 5 <NA> 3 2
33 M4 3/5 7 8/3 5/2
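For the transposed layout used in this last script, the loop can probably be replaced by a single aggregate() call. The following is a sketch rather than a drop-in replacement: it returns only the combined row per gen (not the original rows interleaved with it), and collapse_vals is a hypothetical helper name.

```r
# Data as printed above.
dat <- data.frame(
  gen = c("M1", "M1", "M1", "M1", "M2", "M2", "M3", "M3", "M4", "M4"),
  A = c(1, NA, 4, NA, 8, NA, 3, 8, 3, 5),
  B = c(NA, 6, 8, NA, NA, 2, 8, 9, 7, NA),
  C = c(3, NA, NA, 6, 6, NA, NA, 5, 8, 3),
  D = c(NA, 3, NA, 3, NA, 6, 2, NA, 5, 2)
)

# Paste the non-missing values of one column within one group.
collapse_vals <- function(x) paste(x[!is.na(x)], collapse = "/")

# Apply the helper to every value column, separately for each gen.
res <- aggregate(dat[c("A", "B", "C", "D")],
                 by = list(gen = dat$gen), FUN = collapse_vals)
res
```

Because the helper works on the numeric columns directly, the combined strings come out without padding and without the "Coercing LHS to a list" warnings.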
