R - How to calculate value differences between dates with heterogeneous number of rows

My data look like the following example.
# A tibble: 18 x 4
DATE AUTHOR PRODUCT SALES
<dttm> <chr> <chr> <dbl>
1 2019-11-27 James B 80
2 2019-11-28 James B 100
3 2019-11-27 James A 80
4 2019-11-28 James A 100
5 2019-11-26 Frank B 70
6 2019-11-27 Frank B 75
7 2019-11-28 Frank B 65
8 2019-11-26 Frank A 70
9 2019-11-27 Frank A 75
10 2019-11-28 Frank A 65
11 2019-11-25 Mary A 100
12 2019-11-26 Mary A 80
13 2019-11-27 Mary A 95
14 2019-11-28 Mary A 110
15 2019-11-25 Mary B 100
16 2019-11-26 Mary B 80
17 2019-11-27 Mary B 95
18 2019-11-28 Mary B 110
I would like to add a "DIFF" column containing the day-over-day difference in SALES, calculated within each AUTHOR. My issues are the following:
I have a different number of rows for every AUTHOR.
The same DATE can be repeated for an AUTHOR to report different information (PRODUCT in this example), but the value of SALES stays the same, since it depends only on the DATE and the AUTHOR.
I have to keep every row in the dataset because each row contains specific information, so I cannot just drop the rows where DATE is duplicated.
Ideally I would implement the whole thing with a loop in my script.
My desired outcome would be:
# A tibble: 18 x 4
DATE AUTHOR PRODUCT SALES DIFF
<dttm> <chr> <chr> <dbl>
1 2019-11-27 James B 80
2 2019-11-28 James B 100 20
3 2019-11-27 James A 80
4 2019-11-28 James A 100 20
5 2019-11-26 Frank B 70
6 2019-11-27 Frank B 75 5
7 2019-11-28 Frank B 65 -10
8 2019-11-26 Frank A 70
9 2019-11-27 Frank A 75 5
10 2019-11-28 Frank A 65 -10
11 2019-11-25 Mary A 100
12 2019-11-26 Mary A 80 -20
13 2019-11-27 Mary A 95 15
14 2019-11-28 Mary A 110 15
15 2019-11-25 Mary B 100
16 2019-11-26 Mary B 80 -20
17 2019-11-27 Mary B 95 15
18 2019-11-28 Mary B 110 15
I tried different things with dplyr and mutate but nothing seemed to work. Does anyone have suggestions?
Thank you!

You could group by AUTHOR and PRODUCT, so that each group holds one row per DATE, and use lag to subtract the previous value within each group:
library(dplyr)
df %>% group_by(AUTHOR, PRODUCT) %>% mutate(diff = SALES - lag(SALES))
# DATE AUTHOR PRODUCT SALES diff
# <fct> <fct> <fct> <int> <int>
# 1 2019-11-27 James B 80 NA
# 2 2019-11-28 James B 100 20
# 3 2019-11-27 James A 80 NA
# 4 2019-11-28 James A 100 20
# 5 2019-11-26 Frank B 70 NA
# 6 2019-11-27 Frank B 75 5
# 7 2019-11-28 Frank B 65 -10
# 8 2019-11-26 Frank A 70 NA
# 9 2019-11-27 Frank A 75 5
#10 2019-11-28 Frank A 65 -10
#11 2019-11-25 Mary A 100 NA
#12 2019-11-26 Mary A 80 -20
#13 2019-11-27 Mary A 95 15
#14 2019-11-28 Mary A 110 15
#15 2019-11-25 Mary B 100 NA
#16 2019-11-26 Mary B 80 -20
#17 2019-11-27 Mary B 95 15
#18 2019-11-28 Mary B 110 15
Or using diff
df %>% group_by(AUTHOR, PRODUCT) %>% mutate(diff = c(NA, diff(SALES)))
data
df <- structure(list(DATE = structure(c(3L, 4L, 3L, 4L, 2L, 3L, 4L,
2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), .Label = c("2019-11-25",
"2019-11-26", "2019-11-27", "2019-11-28"), class = "factor"),
AUTHOR = structure(c(2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L,
1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("Frank",
"James", "Mary"), class = "factor"), PRODUCT = structure(c(2L,
2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L), .Label = c("A", "B"), class = "factor"), SALES = c(80L,
100L, 80L, 100L, 70L, 75L, 65L, 70L, 75L, 65L, 100L, 80L,
95L, 110L, 100L, 80L, 95L, 110L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15", "16", "17", "18"))
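Both versions above subtract the previous row, so they rely on the rows within each AUTHOR/PRODUCT group already being in date order. A minimal base R sketch, assuming a small made-up data frame called sales (not the original data) whose rows are deliberately out of order, sorts first and then uses ave() in place of the dplyr pipeline:

```r
# Hypothetical, deliberately unsorted sample (not the data from the question)
sales <- data.frame(
  DATE    = as.Date(c("2019-11-28", "2019-11-27", "2019-11-27", "2019-11-28")),
  AUTHOR  = c("James", "James", "James", "James"),
  PRODUCT = c("B", "B", "A", "A"),
  SALES   = c(100, 80, 80, 100)
)

# Sort by group and date so that "previous row" means "previous day"
sales <- sales[order(sales$AUTHOR, sales$PRODUCT, sales$DATE), ]

# ave() applies the function within each AUTHOR/PRODUCT group and
# returns a vector aligned with the rows of the data frame
sales$DIFF <- ave(sales$SALES, sales$AUTHOR, sales$PRODUCT,
                  FUN = function(x) c(NA, diff(x)))
sales
```

The first row of every group gets NA because there is no earlier day to compare with.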

We can use shift from data.table
library(data.table)
setDT(df)[, diff := SALES - shift(SALES), .(AUTHOR, PRODUCT)][]

Related

Fill gaps using a group mean in R

I have a data set which has gaps in one of the columns (temp). I am trying to fill the gaps using the "temp" data from a "sensor" or mean of "sensors" within the same "treatment", and of course same date stamp. I am trying to do this using tidyverse/lubridate.
date treatment sensor temp
1/01/2019 1 A 30
2/01/2019 1 A 29.1
3/01/2019 1 A 21.2
4/01/2019 1 A NA
1/01/2019 1 B 20.5
2/01/2019 1 B 19.8
3/01/2019 1 B 35.1
4/01/2019 1 B 23.5
1/01/2019 2 C 31.2
2/01/2019 2 C 32.1
3/01/2019 2 C 28.1
4/01/2019 2 C 31.2
1/01/2019 2 D NA
2/01/2019 2 D 26.5
3/01/2019 2 D 27.9
4/01/2019 2 D 28
This is what I am expecting:
date treatment sensor temp
1/01/2019 1 A 30
2/01/2019 1 A 29.1
3/01/2019 1 A 21.2
4/01/2019 1 A 23.5
1/01/2019 1 B 20.5
2/01/2019 1 B 19.8
3/01/2019 1 B 35.1
4/01/2019 1 B 23.5
1/01/2019 2 C 31.2
2/01/2019 2 C 32.1
3/01/2019 2 C 28.1
4/01/2019 2 C 31.2
1/01/2019 2 D 31.2
2/01/2019 2 D 26.5
3/01/2019 2 D 27.9
4/01/2019 2 D 28
Many thanks for your help.
Another option is na.aggregate from zoo, which by default replaces each NA with the mean of the non-NA values in its group:
library(dplyr)
library(zoo)
df %>%
group_by(date, treatment) %>%
mutate(temp = na.aggregate(temp))
# A tibble: 16 x 4
# Groups: date, treatment [8]
# date treatment sensor temp
# <fct> <int> <fct> <dbl>
# 1 1/01/2019 1 A 30
# 2 2/01/2019 1 A 29.1
# 3 3/01/2019 1 A 21.2
# 4 4/01/2019 1 A 23.5
# 5 1/01/2019 1 B 20.5
# 6 2/01/2019 1 B 19.8
# 7 3/01/2019 1 B 35.1
# 8 4/01/2019 1 B 23.5
# 9 1/01/2019 2 C 31.2
#10 2/01/2019 2 C 32.1
#11 3/01/2019 2 C 28.1
#12 4/01/2019 2 C 31.2
#13 1/01/2019 2 D 31.2
#14 2/01/2019 2 D 26.5
#15 3/01/2019 2 D 27.9
#16 4/01/2019 2 D 28
data
df <- structure(list(date = structure(c(1L, 2L, 3L, 4L, 1L, 2L, 3L,
4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), .Label = c("1/01/2019",
"2/01/2019", "3/01/2019", "4/01/2019"), class = "factor"), treatment = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L),
sensor = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L,
3L, 3L, 3L, 4L, 4L, 4L, 4L), .Label = c("A", "B", "C", "D"
), class = "factor"), temp = c(30, 29.1, 21.2, NA, 20.5,
19.8, 35.1, 23.5, 31.2, 32.1, 28.1, 31.2, NA, 26.5, 27.9,
28)), class = "data.frame", row.names = c(NA, -16L))
How about this:
library(dplyr)
df <- df %>%
  group_by(date, treatment) %>%
  mutate(
    fill = mean(temp, na.rm=TRUE), # group mean, used to fill in blanks
    temp2 = case_when(!is.na(temp) ~ temp, # keep observed values
                      TRUE ~ fill)         # replace NA with the group mean
  )
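The same group-mean fill can also be sketched in base R with ave(), without any packages. This is only an illustration: the data frame temps and the helper fill_with_mean are made-up names, and the values are a cut-down version of the example.

```r
# Hypothetical cut-down version of the example data
temps <- data.frame(
  date      = c("1/01/2019", "4/01/2019", "1/01/2019", "4/01/2019"),
  treatment = c(1, 1, 1, 1),
  sensor    = c("A", "A", "B", "B"),
  temp      = c(30, NA, 20.5, 23.5)
)

# Replace the NAs in a vector with the mean of its non-missing values
fill_with_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Apply the helper within each date/treatment group
temps$temp <- ave(temps$temp, temps$date, temps$treatment, FUN = fill_with_mean)
temps
```

If every sensor in a group is missing for a given date, the mean is NaN, so such gaps stay unfilled.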
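Here is one option using map2_dbl from purrr. We group_by treatment and replace NA temp with the first non-NA temp with the same date in the group. Note that which.max returns 1 when no element is TRUE, so if no other sensor has a value for that date, the first temp in the group is used instead.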
library(dplyr)
library(purrr)
df %>%
group_by(treatment) %>%
mutate(temp = map2_dbl(temp, date, ~if (is.na(.x))
temp[which.max(date == .y & !is.na(temp))] else .x))
# date treatment sensor temp
# <fct> <int> <fct> <dbl>
# 1 1/01/2019 1 A 30
# 2 2/01/2019 1 A 29.1
# 3 3/01/2019 1 A 21.2
# 4 4/01/2019 1 A 23.5
# 5 1/01/2019 1 B 20.5
# 6 2/01/2019 1 B 19.8
# 7 3/01/2019 1 B 35.1
# 8 4/01/2019 1 B 23.5
# 9 1/01/2019 2 C 31.2
#10 2/01/2019 2 C 32.1
#11 3/01/2019 2 C 28.1
#12 4/01/2019 2 C 31.2
#13 1/01/2019 2 D 31.2
#14 2/01/2019 2 D 26.5
#15 3/01/2019 2 D 27.9
#16 4/01/2019 2 D 28
data
df <- structure(list(date = structure(c(1L, 2L, 3L, 4L, 1L, 2L, 3L,
4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), .Label = c("1/01/2019",
"2/01/2019", "3/01/2019", "4/01/2019"), class = "factor"), treatment =
c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L),
sensor = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L,
3L, 3L, 3L, 4L, 4L, 4L, 4L), .Label = c("A", "B", "C", "D"
), class = "factor"), temp = c(30, 29.1, 21.2, NA, 20.5,
19.8, 35.1, 23.5, 31.2, 32.1, 28.1, 31.2, NA, 26.5, 27.9,
28)), class = "data.frame", row.names = c(NA, -16L))

Merging output in R

max=aggregate(cbind(a$VALUE,Date=a$DATE) ~ format(a$DATE, "%m") + cut(a$GRADE, breaks=c(0,2,4,6,8,10,12,14)) , data = a, max)[-1]
max$DATE=as.Date(max$DATE, origin = "1970-01-01")
Sample Data :
DATE GRADE VALUE
2008-09-01 1 20
2008-09-02 2 30
2008-09-03 3 50
.
.
2008-09-30 2 75
.
.
2008-10-01 1 95
.
.
2008-11-01 4 90
.
.
2008-12-01 1 70
2008-12-02 2 40
2008-12-28 4 30
2008-12-29 1 40
2008-12-31 3 50
My expected output for the first month of the table above is:
DATE GRADE VALUE
2008-09-30 (0,2] 75
2008-09-02 (2,4] 50
Output in my real data :
format(DATE, "%m")
1 09
2 10
3 11
4 12
5 09
6 10
7 11
cut(a$GRADE, breaks = c(0, 2, 4, 6, 8, 10, 12, 14)) value
1 (0,2] 0.30844444
2 (0,2] 1.00000000
3 (0,2] 1.00000000
4 (0,2] 0.73333333
5 (2,4] 0.16983488
6 (2,4] 0.09368000
7 (2,4] 0.10589335
Date
1 2008-09-30
2 2008-10-31
3 2008-11-28
4 2008-12-31
5 2008-09-30
6 2008-10-31
7 2008-11-28
The output above does not match the sample data because the real data set is too big to post. The logic is simple: the grades run from 1 to 10, and I want the highest value per month within each grade group, e.g. (0,2], (2,4], etc.
I used aggregate with the function max, grouping by two columns, DATE and GRADE. When I run the code and display the value of max, I get three tables as output, one after the other, and because of this I am not able to plot the result. How can I merge these outputs into one table?
Try:
library(dplyr)
a %>%
  group_by(MONTH=format(DATE, "%m"), GRADE=cut(GRADE, breaks=seq(0,14,by=2))) %>%
  # summarise_each(funs(max)) is deprecated in current dplyr; across() replaces it
  summarise(across(everything(), max))
# MONTH GRADE DATE VALUE
#1 09 (0,2] 2008-09-30 75
#2 09 (2,4] 2008-09-03 50
#3 10 (0,2] 2008-10-01 95
#4 11 (2,4] 2008-11-01 90
#5 12 (0,2] 2008-12-29 70
#6 12 (2,4] 2008-12-31 50
Or using data.table
library(data.table)
setDT(a)[, list(DATE=max(DATE), VALUE=max(VALUE)),
by= list(MONTH=format(DATE, "%m"),
GRADE=cut(GRADE, breaks=seq(0,14, by=2)))]
# MONTH GRADE DATE VALUE
#1: 09 (0,2] 2008-09-30 75
#2: 09 (2,4] 2008-09-03 50
#3: 10 (0,2] 2008-10-01 95
#4: 11 (2,4] 2008-11-01 90
#5: 12 (0,2] 2008-12-29 70
#6: 12 (2,4] 2008-12-31 50
Or using aggregate
res <- transform(with(a,
aggregate(cbind(VALUE, DATE),
list(MONTH=format(DATE, "%m") ,GRADE=cut(GRADE, breaks=seq(0,14, by=2))), max)),
DATE=as.Date(DATE, origin="1970-01-01"))
res[order(res$MONTH),]
# MONTH GRADE VALUE DATE
#1 09 (0,2] 75 2008-09-30
#4 09 (2,4] 50 2008-09-03
#2 10 (0,2] 95 2008-10-01
#5 11 (2,4] 90 2008-11-01
#3 12 (0,2] 70 2008-12-29
#6 12 (2,4] 50 2008-12-31
data
a <- structure(list(DATE = structure(c(14123, 14124, 14125, 14152,
14153, 14184, 14214, 14215, 14241, 14242, 14244), class = "Date"),
GRADE = c(1L, 2L, 3L, 2L, 1L, 4L, 1L, 2L, 4L, 1L, 3L), VALUE = c(20L,
30L, 50L, 75L, 95L, 90L, 70L, 40L, 30L, 40L, 50L)), .Names = c("DATE",
"GRADE", "VALUE"), row.names = c(NA, -11L), class = "data.frame")
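Note that the variants above take max(DATE) and max(VALUE) independently, so the DATE shown is the latest date in the group, not necessarily the day on which the maximum VALUE occurred. For month 12 and grade group (0,2], for instance, the value 70 was recorded on 2008-12-01, while the output shows 2008-12-29. If the matching date is wanted, one possible base R sketch keeps the whole row that holds the group maximum. The column names MONTH, BIN and KEY are made up for this illustration, and only the December rows of the sample are used:

```r
# December rows of the sample data
a <- data.frame(
  DATE  = as.Date(c("2008-12-01", "2008-12-02", "2008-12-28",
                    "2008-12-29", "2008-12-31")),
  GRADE = c(1L, 2L, 4L, 1L, 3L),
  VALUE = c(70L, 40L, 30L, 40L, 50L)
)

a$MONTH <- format(a$DATE, "%m")
a$BIN   <- cut(a$GRADE, breaks = seq(0, 14, by = 2))
a$KEY   <- paste(a$MONTH, a$BIN)   # one key per month/grade-group combination

# Keep the row(s) whose VALUE equals the maximum of their group
is_max <- a$VALUE == ave(a$VALUE, a$KEY, FUN = max)
res <- a[is_max, c("MONTH", "BIN", "DATE", "VALUE")]
res
```

Ties are kept, so a group can contribute more than one row if several days share the maximum.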
Update
If you also want to include YEAR in the grouping:
library(dplyr)
a %>%
  group_by(MONTH=format(DATE, "%m"), YEAR=format(DATE, "%Y"), GRADE=cut(GRADE, breaks=seq(0,14, by=2))) %>%
  # summarise_each(funs(max)) is deprecated in current dplyr; across() replaces it
  summarise(across(everything(), max))
# MONTH YEAR GRADE DATE VALUE
#1 09 2008 (0,2] 2008-09-30 75
#2 09 2008 (2,4] 2008-09-03 50
#3 09 2009 (0,2] 2009-09-30 75
#4 09 2009 (2,4] 2009-09-03 50
#5 10 2008 (0,2] 2008-10-01 95
#6 10 2009 (0,2] 2009-10-01 95
#7 11 2008 (2,4] 2008-11-01 90
#8 11 2009 (2,4] 2009-11-01 90
#9 12 2008 (0,2] 2008-12-29 70
#10 12 2008 (2,4] 2008-12-31 50
#11 12 2009 (0,2] 2009-12-29 70
#12 12 2009 (2,4] 2009-12-31 50
data
a <- structure(list(DATE = structure(c(14123, 14124, 14125, 14152,
14153, 14184, 14214, 14215, 14241, 14242, 14244, 14488, 14489,
14490, 14517, 14518, 14549, 14579, 14580, 14606, 14607, 14609
), class = "Date"), GRADE = c(1L, 2L, 3L, 2L, 1L, 4L, 1L, 2L,
4L, 1L, 3L, 1L, 2L, 3L, 2L, 1L, 4L, 1L, 2L, 4L, 1L, 3L), VALUE = c(20L,
30L, 50L, 75L, 95L, 90L, 70L, 40L, 30L, 40L, 50L, 20L, 30L, 50L,
75L, 95L, 90L, 70L, 40L, 30L, 40L, 50L)), .Names = c("DATE",
"GRADE", "VALUE"), row.names = c("1", "2", "3", "4", "5", "6",
"7", "8", "9", "10", "11", "12", "21", "31", "41", "51", "61",
"71", "81", "91", "101", "111"), class = "data.frame")
The following base R code may also help (using the data frame 'a' from the answer above):
# extract the month as the second "-"-separated field of the date
a$month = sapply(strsplit(as.character(a$DATE), '-'),'[',2)
gradeCats = cut(a$GRADE, breaks = c(0, 2, 4, 6, 8, 10, 12, 14))
aggregate(VALUE~month+gradeCats, data= a, max)
month gradeCats VALUE
1 09 (0,2] 75
2 10 (0,2] 95
3 12 (0,2] 70
4 09 (2,4] 50
5 11 (2,4] 90
6 12 (2,4] 50

aggregate data in columns with duplicate id in R

I have a df like this:
> dat
gen M1 M1 M1 M1 M2 M2 M2
G1 150 142 130 105 96
G2 150 145 142 130 96 89
G3 150 145 130 105 96
G4 145 142 130 105 89
G5 150 142 130 105 96
G6 145 142 130 96 89
G7 150 142 105 96
G8 150 145 130 105 89
G9 150 145 142 96 89
Here, the data are spread over columns with duplicated ids. I would like to aggregate them like this:
>dat1
gen M1 M1 M1 M1 agg M2 M2 M2 agg
G1 150 142 130 150/142/130 105 96 105/96
G2 150 145 142 130 150/145/142/130 96 89 96/89
G3 150 145 130 150/145/130 105 96 105/96
G4 145 142 130 145/142/130 105 89 105/89
G5 150 142 130 150/142/130 105 96 105/96
G6 145 142 130 145/142/130 96 89 96/89
G7 150 142 150/142 105 96 105/96
G8 150 145 130 150/145/130 105 89 105/89
G9 150 145 142 150/145/142 96 89 96/89
Here, in each agg column I combined all the values from the columns that share the same name in the first row.
I would like to create a new column at the end of each set of duplicate columns and aggregate into it.
How can I do this in R? I am very confused.
EDIT:
dput(dat)
structure(list(V1 = structure(c(10L, 1L, 2L, 3L, 4L, 5L, 6L,
7L, 8L, 9L), .Label = c("G1", "G2", "G3", "G4", "G5", "G6", "G7",
"G8", "G9", "gen"), class = "factor"), V2 = structure(c(2L, 1L,
1L, 1L, NA, 1L, NA, 1L, 1L, 1L), .Label = c("150", "M1"), class = "factor"),
V3 = structure(c(2L, NA, 1L, 1L, 1L, NA, 1L, NA, 1L, 1L), .Label = c("145",
"M1"), class = "factor"), V4 = structure(c(2L, 1L, 1L, NA,
1L, 1L, 1L, 1L, NA, 1L), .Label = c("142", "M1"), class = "factor"),
V5 = structure(c(2L, 1L, 1L, 1L, 1L, 1L, 1L, NA, 1L, NA), .Label = c("130",
"M1"), class = "factor"), V6 = structure(c(2L, 1L, NA, 1L,
1L, 1L, NA, 1L, 1L, NA), .Label = c("105", "M2"), class = "factor"),
V7 = structure(c(2L, 1L, 1L, 1L, NA, 1L, 1L, 1L, NA, 1L), .Label = c("96",
"M2"), class = "factor"), V8 = structure(c(2L, NA, 1L, NA,
1L, NA, 1L, NA, 1L, 1L), .Label = c("89", "M2"), class = "factor")), .Names = c("V1",
"V2", "V3", "V4", "V5", "V6", "V7", "V8"), class = "data.frame", row.names = c(NA,
-10L))
This works if the missing values are blanks:
dat$agg1 <- apply(dat[,2:5],1,function(x)paste(x[nchar(x)>0],collapse="/"))
dat$agg2 <- apply(dat[,6:8],1,function(x)paste(x[nchar(x)>0],collapse="/"))
dat <- dat[,c(1:5,9,6:8,10)]
dat
# gen M1 M1.1 M1.2 M1.3 agg1 M2 M2.1 M2.2 agg2
# 1 G1 150 142 130 150/142/130 105 96 105/96
# 2 G2 150 145 142 130 150/145/142/130 96 89 96/89
# 3 G3 150 145 130 150/145/130 105 96 105/96
# 4 G4 145 142 130 145/142/130 105 89 105/89
# ...
This works if the missing values are NA
dat$agg1 <- apply(dat[,2:5],1,function(x)paste(x[!is.na(x)],collapse="/"))
dat$agg2 <- apply(dat[,6:8],1,function(x)paste(x[!is.na(x)],collapse="/"))
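The column ranges 2:5 and 6:8 above are hard-coded. As a rough sketch of a more general version, the groups can be derived from the column names instead. It assumes the duplicated headers were de-duplicated by R as M1, M1.1, M2, M2.1 and so on; the small data frame and the helper collapse_row are made up for illustration.

```r
# Hypothetical small version of the data, with names as R would de-duplicate them
dat <- data.frame(
  gen  = c("G1", "G2"),
  M1   = c(150, 150),
  M1.1 = c(NA, 145),
  M2   = c(105, NA),
  M2.1 = c(96, 96)
)

# Paste the non-missing values of one row together
collapse_row <- function(x) paste(x[!is.na(x)], collapse = "/")

value_cols <- names(dat)[-1]                # every column except 'gen'
marker     <- sub("\\..*$", "", value_cols) # "M1" "M1" "M2" "M2"

# Add one aggregated column per marker
for (m in unique(marker)) {
  cols <- value_cols[marker == m]
  dat[[paste0(m, "_agg")]] <- apply(dat[cols], 1, collapse_row)
}
dat
```

The new columns are appended at the end; they can be moved next to their group by reordering the columns, as in the answer above.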
To aggregate them into a character vector, you can use paste():
x=data.frame(x1=1:10,x2=1:10,x1=11:20)
# note that R created x with three columns: x1, x2 and x1.1
xnew=cbind(x,agg=paste(x$x1,x$x2,x$x1.1,sep="/"))
I am not sure this is what you want, because I am a bit confused about the structure of your data.
Here is my script. I know some of you can make it simpler and more elegant!
I transposed my df (a simple example) and read it in as a table.
> dat<-read.table("dat.txt", header=T, sep="\t", na.strings="")
> dat
gen A B C D
1 M1 1 NA 3 NA
2 M1 NA 6 NA 3
3 M1 4 8 NA NA
4 M1 NA NA 6 3
5 M2 8 NA 6 NA
6 M2 NA 2 NA 6
7 M3 3 8 NA 2
8 M3 8 9 5 NA
9 M4 3 7 8 5
10 M4 5 NA 3 2
> final<-NULL
> for(i in 1:4){
+ mar<-as.character(dat[1,1])
+ dat1<-dat[dat[,1]%in% c(mar),]
+ dat <- dat[!dat[,1]%in% c(mar),]
+ dat2 <- apply(dat1,2,function(x)paste(x[!is.na(x)],collapse="/"))
+ dat2$gen<-mar
+ dat3<-rbind(dat1,dat2)
+ final<-rbind(final, dat3)
+ }
Warning messages:
1: In dat2$gen <- mar : Coercing LHS to a list
2: In dat2$gen <- mar : Coercing LHS to a list
3: In dat2$gen <- mar : Coercing LHS to a list
4: In dat2$gen <- mar : Coercing LHS to a list
> final
gen A B C D
1 M1 1 <NA> 3 <NA>
2 M1 <NA> 6 <NA> 3
3 M1 4 8 <NA> <NA>
4 M1 <NA> <NA> 6 3
5 M1 1/ 4 6/ 8 3/ 6 3/ 3
51 M2 8 <NA> 6 <NA>
6 M2 <NA> 2 <NA> 6
31 M2 8 2 6 6
7 M3 3 8 <NA> 2
8 M3 8 9 5 <NA>
32 M3 3/8 8/9 5 2
9 M4 3 7 8 5
10 M4 5 <NA> 3 2
33 M4 3/5 7 8/3 5/2

Calculate mean across rows with NA values in R

I have a really simple R question but I can't seem to find an adequate solution. Let's say we have the following data frame:
groupid<-rep(1:5, each=3)
names<-rep(c("Bill", "Jim", "Sarah", "Mike", "Jennifer"),3)
test1<-rep(c(90, 70, 90, NA, 100),3)
test2<-rep(c(80, NA, 92, 80, 65), 3)
testscores<-data.frame(groupid, names, test1, test2)
groupid names test1 test2
1 1 Bill 90 80
2 1 Jim 70 NA
3 1 Sarah 90 92
4 2 Mike NA 80
5 2 Jennifer 100 65
6 2 Bill 90 80
7 3 Jim 70 NA
8 3 Sarah 90 92
9 3 Mike NA 80
10 4 Jennifer 100 65
11 4 Bill 90 80
12 4 Jim 70 NA
13 5 Sarah 90 92
14 5 Mike NA 80
15 5 Jennifer 100 65
We are interested in getting the mean across rows (adding an extra column to the data frame) for each test, ignoring the NA values. For example, 'Jim' would have value of 70 for his average and 'Mike' would have a value of 80. All the others would be averaged normally.
I tried using transform from the plyr package but it did not appear to accommodate the NA issue.
testscores$testMean <- rowMeans(testscores[,3:4], na.rm=TRUE)
> testscores
groupid names test1 test2 testMean
1 1 Bill 90 80 85.0
2 1 Jim 70 NA 70.0
3 1 Sarah 90 92 91.0
4 2 Mike NA 80 80.0
5 2 Jennifer 100 65 82.5
6 2 Bill 90 80 85.0
7 3 Jim 70 NA 70.0
8 3 Sarah 90 92 91.0
9 3 Mike NA 80 80.0
10 4 Jennifer 100 65 82.5
11 4 Bill 90 80 85.0
12 4 Jim 70 NA 70.0
13 5 Sarah 90 92 91.0
14 5 Mike NA 80 80.0
15 5 Jennifer 100 65 82.5
You can also select the columns by name:
testscores <- structure(list(groupid = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L,
4L, 4L, 4L, 5L, 5L, 5L), names = structure(c(1L, 3L, 5L, 4L,
2L, 1L, 3L, 5L, 4L, 2L, 1L, 3L, 5L, 4L, 2L), .Label = c("Bill",
"Jennifer", "Jim", "Mike", "Sarah"), class = "factor"), test1 = c(90,
70, 90, NA, 100, 90, 70, 90, NA, 100, 90, 70, 90, NA, 100), test2 = c(80,
NA, 92, 80, 65, 80, NA, 92, 80, 65, 80, NA, 92, 80, 65)), .Names = c("groupid",
"names", "test1", "test2"), row.names = c(NA, -15L), class = "data.frame")
testscores$meanTest=rowMeans(testscores[,c("test1", "test2")], na.rm=TRUE)
# groupid names test1 test2 meanTest
#1 1 Bill 90 80 85.0
#2 1 Jim 70 NA 70.0
#3 1 Sarah 90 92 91.0
#4 2 Mike NA 80 80.0
#5 2 Jennifer 100 65 82.5
#6 2 Bill 90 80 85.0
#7 3 Jim 70 NA 70.0
#8 3 Sarah 90 92 91.0
#9 3 Mike NA 80 80.0
#10 4 Jennifer 100 65 82.5
#11 4 Bill 90 80 85.0
#12 4 Jim 70 NA 70.0
#13 5 Sarah 90 92 91.0
#14 5 Mike NA 80 80.0
#15 5 Jennifer 100 65 82.5
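One edge case is worth keeping in mind: when a row has no scores at all, rowMeans with na.rm=TRUE returns NaN rather than NA. A short sketch with a made-up data frame called scores shows this and one way to convert the result:

```r
# Hypothetical scores; the last row has no test results at all
scores <- data.frame(
  test1 = c(90, 70, NA, NA),
  test2 = c(80, NA, 80, NA)
)

scores$testMean <- rowMeans(scores[, c("test1", "test2")], na.rm = TRUE)

# Rows without any score come back as NaN; turn them into NA if preferred
scores$testMean[is.nan(scores$testMean)] <- NA
scores
```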

How to make a cross table with NA instead of X?

I have the following dataset (see for loading dataset below)
ID Date qty
1 ID25 2007-12-01 45
2 ID25 2008-01-01 26
3 ID25 2008-02-01 46
4 ID25 2008-03-01 0
5 ID25 2008-04-01 78
6 ID25 2008-05-01 65
7 ID25 2008-06-01 32
8 ID99 2008-02-01 99
9 ID99 2008-03-01 0
10 ID99 2008-04-01 99
And I would like to create a pivot table of that. I do that with the following command and that seems to be working fine:
pivottable <- xtabs(qty ~ ID + Date, table)
The output is the following:
ID 2007-12-01 2008-01-01 2008-02-01 2008-03-01 2008-04-01 2008-05-01 2008-06-01
ID25 45 26 46 0 78 65 32
ID99 0 0 99 0 99 0 0
However, for ID99 there are only values for 3 periods the rest is marked as '0'. I would like to display NA in the fields that have no values in the first table. I would like to get a table that looks as following:
ID 2007-12-01 2008-01-01 2008-02-01 2008-03-01 2008-04-01 2008-05-01 2008-06-01
ID25 45 26 46 0 78 65 32
ID99 NA NA 99 0 99 NA NA
Any suggestion on how to accomplish this?
Loading dataset:
table <- structure(list(ID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L), .Label = c("ID25", "ID99"), class = "factor"), Date = structure(c(7L,
1L, 2L, 3L, 4L, 5L, 6L, 2L, 3L, 4L), .Label = c("01/01/2008",
"01/02/2008", "01/03/2008", "01/04/2008", "01/05/2008", "01/06/2008",
"01/12/2007"), class = "factor"), qty = c(45L, 26L, 46L, 0L,
78L, 65L, 32L, 99L, 0L, 99L)), .Names = c("ID", "Date", "qty"
), class = "data.frame", row.names = c(NA, -10L))
table$Date <- as.POSIXct(table$Date, format='%d/%m/%Y')
You could use xtabs twice to obtain the output you are looking for:
Create the table:
pivottable <- xtabs(qty ~ ID + Date, table)
Replace all zeros of non-existing combinations with NA:
pivottable[!xtabs( ~ ID + Date, table)] <- NA
The output:
      Date
ID     2007-12-01 2008-01-01 2008-02-01 2008-03-01 2008-04-01 2008-05-01 2008-06-01
  ID25         45         26         46          0         78         65         32
  ID99                               99          0         99
Note that NAs are not displayed. This is due to the print function for this class. But you could use unclass(pivottable) to achieve regular behavior of print.
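As an alternative that is not part of the answer above, tapply leaves empty ID/Date combinations as NA by itself and returns a plain matrix, so the NAs are printed. A minimal sketch on a made-up subset called orders:

```r
# Hypothetical subset: ID99 has no record for 2008-02-01
orders <- data.frame(
  ID   = c("ID25", "ID25", "ID99"),
  Date = c("2008-02-01", "2008-03-01", "2008-03-01"),
  qty  = c(46, 0, 0)
)

# Sum qty per ID/Date cell; cells without any rows stay NA
pivot <- with(orders, tapply(qty, list(ID = ID, Date = Date), sum))
pivot
```

A recorded quantity of 0 stays 0, so it remains distinguishable from a missing combination.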
