R transposing repeat records - r

I have a data table in which record names repeat. I would like to transpose it so that each unique record name becomes a column.
Below is a sample of the Data table:
V1 V2 id
ClientID 29 1
CheckID 201 1
PaymentAmount 256 1
Gross 301 1
Net 256 1
Invested 130 1
Invested 53 1
Invested 118 1
ClientID 31 2
CheckID 222 2
PaymentAmount 41 2
Gross 46 2
Net 41 2
Invested 46 2
ClientID 43 3
CheckID 310 3
PaymentAmount 41 3
Gross 46 3
Net 41 3
Invested 46 3
You can see from the table above that the record in V1 called "Invested" can occur more than once for a single ClientID. I'd like to transpose the data so that it looks like this:
ClientID CheckID PaymentAmount Gross Net Invested ID
29 201 256 301 256 130 1
29 201 256 301 256 53 1
29 201 256 301 256 118 1
31 222 41 46 41 46 2
43 310 41 46 41 46 3
43 310 41 46 41 48 3
any support is greatly appreciated!

We can create a sequence column grouped by the "V1" and "id" columns using data.table, then convert from 'long' to 'wide' format with dcast, and finally replace each NA with the preceding non-NA value using na.locf from zoo.
library(data.table)
library(zoo)
setDT(df1)[, N:= 1:.N , by = .(V1, id)]
dcast(df1, id+N~V1, value.var="V2")[, lapply(.SD, na.locf),
by = id, .SDcols = CheckID:PaymentAmount]
# id CheckID ClientID Gross Invested Net PaymentAmount
#1: 1 201 29 301 130 256 256
#2: 1 201 29 301 53 256 256
#3: 1 201 29 301 118 256 256
#4: 2 222 31 46 46 41 41
#5: 3 310 43 46 46 41 41
data
df1 <- structure(list(V1 = c("ClientID", "CheckID", "PaymentAmount",
"Gross", "Net", "Invested", "Invested", "Invested", "ClientID",
"CheckID", "PaymentAmount", "Gross", "Net", "Invested", "ClientID",
"CheckID", "PaymentAmount", "Gross", "Net", "Invested"), V2 = c(29L,
201L, 256L, 301L, 256L, 130L, 53L, 118L, 31L, 222L, 41L, 46L,
41L, 46L, 43L, 310L, 41L, 46L, 41L, 46L), id = c(1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L,
3L)), .Names = c("V1", "V2", "id"), class = "data.frame",
row.names = c(NA, -20L))
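For readers without data.table, the same reshaping can be sketched in base R. This is only an illustration, assuming that "Invested" is the single repeating record name (it is hard-coded below) and using a two-id subset of the question's data: the fields that occur once per id are pivoted wide with tapply, and merge then repeats them for every Invested row.

```r
# Two-id subset of the question's data (ids 1 and 2).
df1 <- data.frame(
  V1 = c("ClientID", "CheckID", "PaymentAmount", "Gross", "Net",
         "Invested", "Invested", "Invested",
         "ClientID", "CheckID", "PaymentAmount", "Gross", "Net", "Invested"),
  V2 = c(29L, 201L, 256L, 301L, 256L, 130L, 53L, 118L,
         31L, 222L, 41L, 46L, 41L, 46L),
  id = c(rep(1L, 8), rep(2L, 6)),
  stringsAsFactors = FALSE
)

# Assumption: "Invested" is the only record name that repeats within an id.
once <- df1[df1$V1 != "Invested", ]

# One wide row per id for the fields that occur exactly once.
wide <- as.data.frame(tapply(once$V2, list(once$id, once$V1),
                             FUN = function(v) v[1]))
wide$id <- as.integer(rownames(wide))

# The repeated field stays long: one row per occurrence.
inv <- df1[df1$V1 == "Invested", c("id", "V2")]
names(inv)[2] <- "Invested"

# Joining on id repeats the single-valued fields for every Invested row.
res <- merge(wide, inv, by = "id")
res
```

If more than one record name can repeat, the sequence-column approach in the answer above is the more general route.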

Related

data.table efficiently finding common pairs between 2 columns

say I have a dataframe
subject stim1 stim2 feedback
1 1003 50 51 1
2 1003 48 50 1
3 1003 49 51 1
4 1003 47 49 1
5 1003 47 46 1
6 1003 46 48 1
10 1003 50 48 1
428 1003 48 51 0
433 1003 46 50 0
434 1003 50 49 0
435 1003 54 59 0
I want to create a new column "transitive_pair", computed within each subject (column 1).
For every row in which feedback==0 (starting at index 428; otherwise transitive_pair=NaN), I want to return a boolean which tells me whether there is any chain of pairings (but only those in which feedback==1) that would transitively link the stim1 and stim2 values.
Working out a few examples.
row 428- stim1=48 and stim2=51
48 and 51 are not paired but 51 was paired with 50 (e.g.row 1 ) and 50 was paired with 48 (row 10) so transitive_pair[428]=True
row 433- stim 1=46 and stim2=50
46 and 48 were paired (row 6) and 48 was paired with 50 (row 2) so transitive_pair[433]=True
in row 435, stim1=54, stim2=59
there is no chain of pairs that could link them (59 is not paired with anything while feedback==1) so transitive_pair[435]=False
desired output
subject stim1 stim2 feedback transitive_pair
1 1003 50 51 1 NaN
2 1003 48 50 1 NaN
3 1003 49 51 1 NaN
4 1003 47 49 1 NaN
5 1003 47 46 1 NaN
6 1003 46 48 1 NaN
10 1003 50 48 1 NaN
428 1003 48 51 0 1
433 1003 46 50 0 1
434 1003 50 49 0 1
435 1003 54 59 0 0
any help would be greatly appreciated!!
Here is a reproducible df:
structure(list(subject = c(1003L, 1003L, 1003L, 1003L, 1003L,
1003L, 1003L, 1003L, 1003L, 1003L, 1003L), stim1 = c(50L, 48L,
49L, 47L, 47L, 46L, 50L, 48L, 46L, 50L, 54L), stim2 = c(51L,
50L, 51L, 49L, 46L, 48L, 48L, 51L, 50L, 49L, 59L), feedback = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L), transitive_pair = c(NaN,
NaN, NaN, NaN, NaN, NaN, NaN, 1, 1, 1, 0)), row.names = c(1L,
2L, 3L, 4L, 5L, 6L, 10L, 428L, 433L, 434L, 435L), class = "data.frame")
The columns "stim1" and "stim2" define an undirected graph. Create the graph for feedback == 1, get its connected components and for each row of the data.frame, check if the values of "stim1" and "stim2" belong to the same component. In the end assign NaN to the rows where feedback is 1.
suppressPackageStartupMessages(library(igraph))
inx <- df1$feedback == 1
g <- graph_from_data_frame(df1[inx, c("stim1", "stim2")], directed = FALSE)
plot(g)
g_comp <- components(g)$membership
df1$transitive_pair_2 <- apply(df1[c("stim1", "stim2")], 1, \(x) {
i <- names(g_comp) == x[1]
j <- names(g_comp) == x[2]
if(any(i) && any(j))
g_comp[i] == g_comp[j]
else 0L
})
df1$transitive_pair_2[inx] <- NaN
df1
#> subject stim1 stim2 feedback transitive_pair transitive_pair_2
#> 1 1003 50 51 1 NaN NaN
#> 2 1003 48 50 1 NaN NaN
#> 3 1003 49 51 1 NaN NaN
#> 4 1003 47 49 1 NaN NaN
#> 5 1003 47 46 1 NaN NaN
#> 6 1003 46 48 1 NaN NaN
#> 10 1003 50 48 1 NaN NaN
#> 428 1003 48 51 0 1 1
#> 433 1003 46 50 0 1 1
#> 434 1003 50 49 0 1 1
#> 435 1003 54 59 0 0 0
Created on 2022-07-31 by the reprex package (v2.0.1)
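If igraph is not available, the connected-component check can also be sketched in base R with a small union-find (disjoint-set) structure. This is a swapped-in technique rather than what the answer above uses, and the helper names find_root and linked are made up for illustration; the edges are the feedback == 1 rows from the question.

```r
# Edges observed with feedback == 1 (taken from the question's data).
edges <- data.frame(stim1 = c(50, 48, 49, 47, 47, 46, 50),
                    stim2 = c(51, 50, 51, 49, 46, 48, 48))

nodes  <- as.character(unique(c(edges$stim1, edges$stim2)))
parent <- setNames(nodes, nodes)   # every node starts as its own root

# Follow parent links until a node that is its own parent is reached.
find_root <- function(v) {
  while (parent[[v]] != v) v <- parent[[v]]
  v
}

# Union step: merge the components at both ends of each edge.
for (k in seq_len(nrow(edges))) {
  a <- find_root(as.character(edges$stim1[k]))
  b <- find_root(as.character(edges$stim2[k]))
  if (a != b) parent[[a]] <- b
}

# Two stimuli are transitively linked when they share a root.
linked <- function(s1, s2) {
  s1 <- as.character(s1)
  s2 <- as.character(s2)
  if (!(s1 %in% nodes) || !(s2 %in% nodes)) return(FALSE)
  find_root(s1) == find_root(s2)
}

linked(48, 51)
linked(54, 59)
```

For data with several subjects, the same steps would have to be run once per subject, for example inside lapply(split(...)).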

Split data frame by class regarding to OID

I am trying to split a data frame 50/50 within each class. However, I do not want to separate fields that share the same OID (object identifier): fields with the same OID should end up in the same set.
#Data frame:
"b1" "b2" "b3" "CLASS" "OID"
110 134 119 "tree" 1
112 133 118 "tree" 1
105 125 110 "tree" 2
112 132 117 "tree" 2
109 125 115 "meadow" 6
93 110 101 "meadow" 6
86 106 95 "meadow" 7
105 136 116 "meadow" 7
102 128 111 "meadow" 8
108 129 115 "meadow" 8
113 134 119 "meadow" 8
Expected data:
#Expected:
"b1" "b2" "b3" "CLASS" "OID"
110 134 119 "tree" 1
112 133 118 "tree" 1
109 125 115 "meadow" 6
93 110 101 "meadow" 6
86 106 95 "meadow" 7
105 136 116 "meadow" 7
This selects the top half of rows in each group, plus any rows which have the same OID as the rows in that top half.
library(dplyr)
df %>%
group_by(CLASS) %>%
filter(OID %in% head(OID, n() %/% 2)) %>%
ungroup()
# # A tibble: 6 x 5
# b1 b2 b3 CLASS OID
# <int> <int> <int> <chr> <int>
# 1 110 134 119 tree 1
# 2 112 133 118 tree 1
# 3 109 125 115 meadow 6
# 4 93 110 101 meadow 6
# 5 86 106 95 meadow 7
# 6 105 136 116 meadow 7
If your real data is arranged by OID like this example, you could also use top_frac
df %>%
group_by(CLASS) %>%
top_frac(.5, -OID)
# # A tibble: 6 x 5
# b1 b2 b3 CLASS OID
# <int> <int> <int> <chr> <int>
# 1 110 134 119 tree 1
# 2 112 133 118 tree 1
# 3 109 125 115 meadow 6
# 4 93 110 101 meadow 6
# 5 86 106 95 meadow 7
# 6 105 136 116 meadow 7
Your data:
df = structure(list(b1 = c(110L, 112L, 105L, 112L, 109L, 93L, 86L,
105L, 102L, 108L, 113L), b2 = c(134L, 133L, 125L, 132L, 125L,
110L, 106L, 136L, 128L, 129L, 134L), b3 = c(119L, 118L, 110L,
117L, 115L, 101L, 95L, 116L, 111L, 115L, 119L), CLASS = structure(c(2L,
2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("meadow",
"tree"), class = "factor"), OID = c(1L, 1L, 2L, 2L, 6L, 6L, 7L,
7L, 8L, 8L, 8L)), class = "data.frame", row.names = c(NA, -11L
))
First create a function to take 1/2 according to OID
func = function(x){
x[x$OID %in% x$OID[1:round(nrow(x)/2)],]
}
We randomize the way the OID are sorted
df$OID = factor(df$OID,levels=sample(unique(df$OID)))
df = df[order(df$OID),]
do.call(rbind,by(df,df$CLASS,func))
This ensures you get a random ~50% split every time, with each OID kept whole.
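Since the question asks for a split, it may help to see both halves at once. The following base R sketch collects the OIDs found in the first half of each class and then builds the selected set and its complement; it assumes OIDs are not shared between classes, and the names fields, set_a and set_b are chosen only for this illustration.

```r
# Reduced copy of the question's data (only b1 is kept for brevity).
fields <- data.frame(
  b1    = c(110L, 112L, 105L, 112L, 109L, 93L, 86L, 105L, 102L, 108L, 113L),
  CLASS = c(rep("tree", 4), rep("meadow", 7)),
  OID   = c(1L, 1L, 2L, 2L, 6L, 6L, 7L, 7L, 8L, 8L, 8L),
  stringsAsFactors = FALSE
)

# OIDs that appear in the first half of the rows of each class.
keep <- unlist(lapply(split(fields, fields$CLASS), function(g) {
  unique(g$OID[seq_len(nrow(g) %/% 2)])
}))

# Whole OIDs go to one side or the other, never both.
set_a <- fields[fields$OID %in% keep, ]
set_b <- fields[!fields$OID %in% keep, ]
```

Because whole OIDs are kept together, the two sets are only approximately equal in size.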

R delete first and last x % of rows

I have a data frame with 3 ID variables, then several values for each ID.
user Log Pass Value
2 2 123 342
2 2 123 543
2 2 123 231
2 2 124 257
2 2 124 342
4 3 125 543
4 3 125 231
4 3 125 257
4 3 125 342
4 3 125 543
4 3 125 231
4 3 125 257
4 3 125 543
4 3 125 231
4 3 125 257
4 3 125 543
4 3 125 231
4 3 125 257
4 3 125 543
4 3 125 231
4 3 125 257
The start and end of each set of values is sometimes noisy, and I want to be able to delete the first few values. Unfortunately the number of values varies significantly, but it is always the first and last 20% of values that are noisy.
I want to delete the first and last 20% of rows, with a minimum of 1 row deleted at each end.
So for instance if there are 20 values for user 2 log 2 pass 123 I want to delete the first and last 4 rows. If there are only 3 values for the ID variable I want to delete the first and last row.
The resulting dataset would be:
user Log Pass Value
2 2 123 543
4 3 125 543
4 3 125 231
4 3 125 257
4 3 125 543
4 3 125 231
4 3 125 257
4 3 125 543
4 3 125 231
I've tried fiddling around with nrow but I struggle to figure out how to reference the % of rows by id variable.
Thanks.
Jonathan.
I believe the following can do it.
DATA.
dat <-
structure(list(user = c(2L, 2L, 2L, 2L, 2L, 4L, 4L, 4L, 4L, 4L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), Log = c(2L, 2L,
2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L), Pass = c(123L, 123L, 123L, 124L, 124L, 125L, 125L,
125L, 125L, 125L, 125L, 125L, 125L, 125L, 125L, 125L, 125L, 125L,
125L, 125L, 125L), Value = c(342L, 543L, 231L, 257L, 342L, 543L,
231L, 257L, 342L, 543L, 231L, 257L, 543L, 231L, 257L, 543L, 231L,
257L, 543L, 231L, 257L)), .Names = c("user", "Log", "Pass", "Value"
), class = "data.frame", row.names = c(NA, -21L))
CODE.
fun <- function(x, p = 0.20){
n <- nrow(x)
m <- max(1, round(n*p))
inx <- c(seq_len(m), n - seq_len(m) + 1)
x[-inx, ]
}
result <- do.call(rbind, lapply(split(dat, dat$user), fun))
row.names(result) <- NULL
result
# user Log Pass Value
#1 2 2 123 543
#2 2 2 123 231
#3 2 2 124 257
#4 4 3 125 342
#5 4 3 125 543
#6 4 3 125 231
#7 4 3 125 257
#8 4 3 125 543
#9 4 3 125 231
#10 4 3 125 257
#11 4 3 125 543
#12 4 3 125 231
#13 4 3 125 257
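Note that the code above splits by user only and rounds the 20%. A variant that groups by all three ID columns and rounds up might look like the sketch below; the function name trim_ends is made up here, and ceiling is an assumption chosen because it reproduces the result shown in the question.

```r
dat <- data.frame(
  user  = rep(c(2L, 4L), c(5, 16)),
  Log   = rep(c(2L, 3L), c(5, 16)),
  Pass  = rep(c(123L, 124L, 125L), c(3, 2, 16)),
  Value = c(342L, 543L, 231L, 257L, 342L, 543L, 231L, 257L, 342L, 543L, 231L,
            257L, 543L, 231L, 257L, 543L, 231L, 257L, 543L, 231L, 257L)
)

# Drop the first and last p share of rows (at least one row from each end).
trim_ends <- function(x, p = 0.20) {
  n <- nrow(x)
  m <- max(1, ceiling(n * p))
  if (2 * m >= n) return(x[0, , drop = FALSE])  # nothing left in tiny groups
  x[(m + 1):(n - m), , drop = FALSE]
}

# One group per combination of user, Log and Pass that actually occurs.
groups <- split(dat, interaction(dat$user, dat$Log, dat$Pass, drop = TRUE))
result <- do.call(rbind, lapply(groups, trim_ends))
row.names(result) <- NULL
result
```

With this data the two-row group (Pass 124) disappears entirely, since removing one row from each end leaves nothing.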
Would something like this help?
For a dataframe df:
df[-c(1:floor(nrow(df)*0.2), (1+ceiling(nrow(df)*0.8)):nrow(df)),]
This just removes the first and last 20% of rows, using floor for the lower cut and ceiling for the upper one, so that a smaller data frame still keeps some of its rows:
> df<-data.frame(a=1:100)
> df[-c(1:floor(nrow(df)*0.2),(1+ceiling(nrow(df)*0.8)):nrow(df)),]
[1] 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
[31] 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
> df<-data.frame(1:3)
> df[-c(1:floor(nrow(df)*0.2),(1+ceiling(nrow(df)*0.8)):nrow(df)),]
[1] 2
You can do this with dplyr...
library(dplyr)
df2 <- df %>% group_by(user, Log, Pass) %>%
filter(n()>2) %>% #remove those with just two elements or fewer
slice(max(2, 1+ceiling(n()*0.2)):min(n()-1, floor(0.8*n())))
df2
user Log Pass Value
1 2 2 123 543
2 4 3 125 543
3 4 3 125 231
4 4 3 125 257
5 4 3 125 543
6 4 3 125 231
7 4 3 125 257
8 4 3 125 543
9 4 3 125 231
Calculate the index of the first row you want to retain:
rem <- ceiling( nrow( dat ) * .2 ) + 1
Then keep only the rows between the two trimmed ends, which drops the records you don't want:
dat <- dat[ rem : ( nrow( dat ) - rem + 1 ), ]
Here is an idea using base R that returns the row indices of each user to keep and then subsets on these indices.
idx <- unlist(lapply(split(seq_along(dat[["user"]]), dat[["user"]]), function(x) {
tmp <- max(1, ceiling(.2 * length(x)))
tail(head(x, -tmp), -tmp)}),
use.names=FALSE)
split(seq_along(dat[["user"]]), dat[["user"]]) returns a list of the row indices for each user. lapply loops through these index vectors, calculating the number of rows to drop from each end with max(1, ceiling(.2 * length(x))), and then dropping them with tail(head(x, -tmp), -tmp). Since lapply returns a named list, the result is unlisted and the names are dropped.
This returns
idx
2 3 4 10 11 12 13 14 15 16 17
Now subset
dat[idx,]
user Log Pass Value
2 2 2 123 543
3 2 2 123 231
4 2 2 124 257
10 4 3 125 543
11 4 3 125 231
12 4 3 125 257
13 4 3 125 543
14 4 3 125 231
15 4 3 125 257
16 4 3 125 543
17 4 3 125 231

R programming - data frame manoeuvre

Suppose I have the following dataframe:
dc tmin tmax cint wcmin wcmax wsmin wsmax gsmin gsmax wd rmin rmax cir lr
1: 24 -1 4 5 -5 -2 20 25 35 40 90 11.8 26.6 14.8 3
2: 41 -3 5 8 -8 -3 15 20 35 40 90 10.0 23.5 13.5 3
3: 48 0 5 5 -4 0 30 35 45 50 45 7.3 19.0 11.7 6
4: 50 0 5 5 -4 0 30 35 45 50 45 7.3 19.0 11.7 6
5: 52 3 5 2 -3 1 20 25 35 40 45 6.7 17.4 10.7 6
6: 57 -2 5 7 -6 -1 25 30 35 40 315 4.4 13.8 9.4 7
lc wc li yd yr nF factdcx
1: 1 3 TRUE 1 2010 2 24
2: 1 3 TRUE 1 2010 8 41
3: 2 3 TRUE 1 2010 0 48
4: 2 3 TRUE 1 2010 0 50
5: 2 3 TRUE 1 2010 0 52
6: 3 3 FALSE 1 2010 0 57
I'd like to turn it into a new dataframe like the following:
dc tmin tmax cint wcmin wcmax wsmin wsmax gsmin gsmax wd rmin rmax cir lr
1: 24 -1 4 5 -5 -2 20 25 35 40 90 11.8 26.6 14.8 3
2: 41 -3 5 8 -8 -3 15 20 35 40 90 10.0 23.5 13.5 3
3: 48 0 5 5 -4 0 30 35 45 50 45 7.3 19.0 11.7 6
4: 52 3 5 2 -3 1 20 25 35 40 45 6.7 17.4 10.7 6
5: 57 -2 5 7 -6 -1 25 30 35 40 315 4.4 13.8 9.4 7
lc wc li yd yr nF factdcx
1: 1 3 TRUE 1 2010 2 24
2: 1 3 TRUE 1 2010 8 41
3: 2 3 TRUE 1 2010 0 (sum of nF for 48 and 50, factdcx) 48
4: 2 3 TRUE 1 2010 0 52
5: 3 3 FALSE 1 2010 0 57
How can I do it? (The real dataframe, abc, is much larger, but I want to sum nF over the categories 48 and 50 and group them into a single category, say '48'.)
Many thanks!
> dput(head(abc1))
structure(list(dc = c(24L, 41L, 48L, 50L, 52L, 57L), tmin = c(-1L,
-3L, 0L, 0L, 3L, -2L), tmax = c(4L, 5L, 5L, 5L, 5L, 5L), cint = c(5L,
8L, 5L, 5L, 2L, 7L), wcmin = c(-5L, -8L, -4L, -4L, -3L, -6L),
wcmax = c(-2L, -3L, 0L, 0L, 1L, -1L), wsmin = c(20L, 15L,
30L, 30L, 20L, 25L), wsmax = c(25L, 20L, 35L, 35L, 25L, 30L
), gsmin = c(35L, 35L, 45L, 45L, 35L, 35L), gsmax = c(40L,
40L, 50L, 50L, 40L, 40L), wd = c(90L, 90L, 45L, 45L, 45L,
315L), rmin = c(11.8, 10, 7.3, 7.3, 6.7, 4.4), rmax = c(26.6,
23.5, 19, 19, 17.4, 13.8), cir = c(14.8, 13.5, 11.7, 11.7,
10.7, 9.4), lr = c(3L, 3L, 6L, 6L, 6L, 7L), lc = c(1L, 1L,
2L, 2L, 2L, 3L), wc = c(3L, 3L, 3L, 3L, 3L, 3L), li = c(TRUE,
TRUE, TRUE, TRUE, TRUE, FALSE), yd = c(1L, 1L, 1L, 1L, 1L,
1L), yr = c(2010L, 2010L, 2010L, 2010L, 2010L, 2010L), nF = c(2L,
8L, 0L, 0L, 0L, 0L), factdcx = structure(1:6, .Label = c("24",
"41", "48", "50", "52", "57", "70"), class = "factor")), .Names = c("dc",
"tmin", "tmax", "cint", "wcmin", "wcmax", "wsmin", "wsmax", "gsmin",
"gsmax", "wd", "rmin", "rmax", "cir", "lr", "lc", "wc", "li",
"yd", "yr", "nF", "factdcx"), class = c("data.table", "data.frame"
), row.names = c(NA, -6L), .internal.selfref = <pointer: 0x054b24a0>)
Still got a problem, sir/madam:
> head(abc1 (updated))
dc tmin tmax cint wcmin wcmax wsmin wsmax gsmin gsmax wd rmin rmax cir lr
1: 24 -1 4 5 -5 -2 20 25 35 40 90 11.8 26.6 14.8 3
2: 41 -3 5 8 -8 -3 15 20 35 40 90 10.0 23.5 13.5 3
3: 48 0 5 5 -4 0 30 35 45 50 45 7.3 19.0 11.7 6
4: 52 3 5 2 -3 1 20 25 35 40 45 6.7 17.4 10.7 6
5: 57 -2 5 7 -6 -1 25 30 35 40 315 4.4 13.8 9.4 7
6: 70 -2 3 5 -4 -1 20 25 30 35 360 3.6 10.2 6.6 7
lc wc li yd yr nF factdcx
1: 1 3 TRUE 1 2010 2 24
2: 1 3 TRUE 1 2010 8 41
3: 2 3 TRUE 1 2010 57 48
4: 2 3 TRUE 1 2010 0 52
5: 3 3 FALSE 1 2010 0 57
6: 3 2 TRUE 1 2010 1 70
The sum of nF was incorrect, it should be zero.
Try
library(data.table)
unique(setDT(df1)[, factdcx:= as.character(factdcx)][factdcx %chin%
c('48','50'), c('dc', 'factdcx', 'nF') := list('48', '48', sum(nF))])
# dc tmin tmax cint wcmin wcmax wsmin wsmax gsmin gsmax wd rmin rmax cir lr
#1: 24 -1 4 5 -5 -2 20 25 35 40 90 11.8 26.6 14.8 3
#2: 41 -3 5 8 -8 -3 15 20 35 40 90 10.0 23.5 13.5 3
#3: 48 0 5 5 -4 0 30 35 45 50 45 7.3 19.0 11.7 6
#4: 52 3 5 2 -3 1 20 25 35 40 45 6.7 17.4 10.7 6
#5: 57 -2 5 7 -6 -1 25 30 35 40 315 4.4 13.8 9.4 7
# lc wc li yd yr nF factdcx
#1: 1 3 TRUE 1 2010 2 24
#2: 1 3 TRUE 1 2010 8 41
#3: 2 3 TRUE 1 2010 0 48
#4: 2 3 TRUE 1 2010 0 52
#5: 3 3 FALSE 1 2010 0 57
For abc1,
res1 <- unique(setDT(abc1)[, factdcx:= as.character(factdcx)][factdcx %chin%
c('48','50'), c('dc', 'factdcx', 'nF') := list(48, '48', sum(nF))])
res1
# dc tmin tmax cint wcmin wcmax wsmin wsmax gsmin gsmax wd rmin rmax cir lr
#1: 24 -1 4 5 -5 -2 20 25 35 40 90 11.8 26.6 14.8 3
#2: 41 -3 5 8 -8 -3 15 20 35 40 90 10.0 23.5 13.5 3
#3: 48 0 5 5 -4 0 30 35 45 50 45 7.3 19.0 11.7 6
#4: 52 3 5 2 -3 1 20 25 35 40 45 6.7 17.4 10.7 6
#5: 57 -2 5 7 -6 -1 25 30 35 40 315 4.4 13.8 9.4 7
# lc wc li yd yr nF factdcx
#1: 1 3 TRUE 1 2010 2 24
#2: 1 3 TRUE 1 2010 8 41
#3: 2 3 TRUE 1 2010 0 48
#4: 2 3 TRUE 1 2010 0 52
#5: 3 3 FALSE 1 2010 0 57
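Note that sum(nF) in the code above adds up every row whose factdcx is 48 or 50 across the whole table. If the full data holds several days or years, which is only a guess about the asker's table, the sum probably needs to be taken per group. A base R sketch with hypothetical values (the yd and nF numbers below are invented for illustration):

```r
# Hypothetical data: categories 48 and 50 observed on two different days.
abc <- data.frame(
  dc = c(48L, 50L, 52L, 48L, 50L),
  yd = c(1L, 1L, 1L, 2L, 2L),
  nF = c(0L, 0L, 0L, 30L, 27L)
)

# Recode category 50 as 48 so both fall into the same group.
abc$dc[abc$dc %in% c(48L, 50L)] <- 48L

# Sum nF within each (dc, yd) combination rather than over the whole table.
res <- aggregate(nF ~ dc + yd, data = abc, FUN = sum)
res
```

Here day 1 keeps a sum of 0 for category 48, while day 2 gets 57; an ungrouped sum would assign 57 to both.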
data
df1 <- structure(list(dc = structure(1:6, .Label = c("24", "41",
"48",
"50", "52", "57"), class = "factor"), tmin = c(-1L, -3L, 0L,
0L, 3L, -2L), tmax = c(4L, 5L, 5L, 5L, 5L, 5L), cint = c(5L,
8L, 5L, 5L, 2L, 7L), wcmin = c(-5L, -8L, -4L, -4L, -3L, -6L),
wcmax = c(-2L, -3L, 0L, 0L, 1L, -1L), wsmin = c(20L, 15L,
30L, 30L, 20L, 25L), wsmax = c(25L, 20L, 35L, 35L, 25L, 30L
), gsmin = c(35L, 35L, 45L, 45L, 35L, 35L), gsmax = c(40L,
40L, 50L, 50L, 40L, 40L), wd = c(90L, 90L, 45L, 45L, 45L,
315L), rmin = c(11.8, 10, 7.3, 7.3, 6.7, 4.4), rmax = c(26.6,
23.5, 19, 19, 17.4, 13.8), cir = c(14.8, 13.5, 11.7, 11.7,
10.7, 9.4), lr = c(3L, 3L, 6L, 6L, 6L, 7L), lc = c(1L, 1L,
2L, 2L, 2L, 3L), wc = c(3L, 3L, 3L, 3L, 3L, 3L), li = c(TRUE,
TRUE, TRUE, TRUE, TRUE, FALSE), yd = c(1L, 1L, 1L, 1L, 1L,
1L), yr = c(2010L, 2010L, 2010L, 2010L, 2010L, 2010L), nF = c(2L,
8L, 0L, 0L, 0L, 0L), factdcx = structure(1:6, .Label = c("24",
"41", "48", "50", "52", "57"), class = "factor")), .Names = c("dc",
"tmin", "tmax", "cint", "wcmin", "wcmax", "wsmin", "wsmax", "gsmin",
"gsmax", "wd", "rmin", "rmax", "cir", "lr", "lc", "wc", "li",
"yd", "yr", "nF", "factdcx"), row.names = c("1:", "2:", "3:",
"4:", "5:", "6:"), class = "data.frame")

How to make a cross table with NA instead of X?

I have the following dataset (the code to load it is given below)
ID Date qty
1 ID25 2007-12-01 45
2 ID25 2008-01-01 26
3 ID25 2008-02-01 46
4 ID25 2008-03-01 0
5 ID25 2008-04-01 78
6 ID25 2008-05-01 65
7 ID25 2008-06-01 32
8 ID99 2008-02-01 99
9 ID99 2008-03-01 0
10 ID99 2008-04-01 99
And I would like to create a pivot table of that. I do that with the following command and that seems to be working fine:
pivottable <- xtabs(qty ~ ID + Date, table)
The output is the following:
ID 2007-12-01 2008-01-01 2008-02-01 2008-03-01 2008-04-01 2008-05-01 2008-06-01
ID25 45 26 46 0 78 65 32
ID99 0 0 99 0 99 0 0
However, ID99 has values for only 3 periods; the remaining periods are shown as '0'. I would like to display NA in the cells that have no value in the first table, so that the result looks as follows:
ID 2007-12-01 2008-01-01 2008-02-01 2008-03-01 2008-04-01 2008-05-01 2008-06-01
ID25 45 26 46 0 78 65 32
ID99 NA NA 99 0 99 NA NA
Any suggestion on how to accomplish this?
Loading dataset:
table <- structure(list(ID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L), .Label = c("ID25", "ID99"), class = "factor"), Date = structure(c(7L,
1L, 2L, 3L, 4L, 5L, 6L, 2L, 3L, 4L), .Label = c("01/01/2008",
"01/02/2008", "01/03/2008", "01/04/2008", "01/05/2008", "01/06/2008",
"01/12/2007"), class = "factor"), qty = c(45L, 26L, 46L, 0L,
78L, 65L, 32L, 99L, 0L, 99L)), .Names = c("ID", "Date", "qty"
), class = "data.frame", row.names = c(NA, -10L))
table$Date <- as.POSIXct(table$Date, format='%d/%m/%Y')
You could use xtabs twice to obtain the output you are looking for:
Create the table:
pivottable <- xtabs(qty ~ ID + Date, table)
Replace all zeros of non-existing combinations with NA:
pivottable[!xtabs( ~ ID + Date, table)] <- NA
The output:
Date
ID 2007-12-01 2008-01-01 2008-02-01 2008-03-01 2008-04-01 2008-05-01 2008-06-01
ID25 45 26 46 0 78 65 32
ID99 99 0 99
Note that NAs are not displayed. This is due to the print function for this class. But you could use unclass(pivottable) to achieve regular behavior of print.
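As an alternative sketch, tapply leaves cells with no observations as NA by default, so the pivot can be built in one step and prints its NAs as a plain matrix. The data frame is named tab here (an arbitrary name, chosen to avoid masking the base function table), and the dates are entered directly as Date values instead of being parsed from the factor.

```r
tab <- data.frame(
  ID   = c(rep("ID25", 7), rep("ID99", 3)),
  Date = as.Date(c("2007-12-01", "2008-01-01", "2008-02-01", "2008-03-01",
                   "2008-04-01", "2008-05-01", "2008-06-01",
                   "2008-02-01", "2008-03-01", "2008-04-01")),
  qty  = c(45L, 26L, 46L, 0L, 78L, 65L, 32L, 99L, 0L, 99L)
)

# Combinations of ID and Date that never occur stay NA instead of 0.
pivot <- tapply(tab$qty, list(ID = tab$ID, Date = format(tab$Date)), sum)
pivot
```

Real zeros, such as ID99 on 2008-03-01, are kept as 0, so they remain distinguishable from missing combinations.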
