I have a dataset (STATPOP2016 by the Swiss Federal Statistical Office) that contains the number of households of different sizes for each hectare of Swiss territory. In other words, for each hectare i I have:
x1 households consisting of one individual
x2 households consisting of two individuals
...
x6 households with 6 or more individuals (I consider them as having 6 people for simplicity).
I need to create a variable that shows the interquartile range of household size for each hectare. I have code that works, but it is very slow. Is there a smarter way to do the same thing?
Here is my code:
library(Hmisc)  # provides wtd.quantile
# Vector that contains all possible sizes of households
vector_hh_size <- c(1:6)
# Variable for the interquartile range of household size. A is my data frame
A$hh_size_IQR <- 0
# Vector that contains the frequency of each household size in a given hectare
vector_hh_frequency <- c(0,0,0,0,0,0)
for (i in 1:NROW(A)) {
  for (j in 1:6){
    vector_hh_frequency[j] <- eval(parse(text = paste("A$hh",j,"[",i,"]",sep = "")))
  }
  A$hh_size_IQR[i] <- wtd.quantile(vector_hh_size, weights = vector_hh_frequency)[4] - wtd.quantile(vector_hh_size, weights = vector_hh_frequency)[2]
}
Here is an example of the data:
hh1 hh2 hh3 hh4 hh5 hh6 IQR
1 0 3 0 0 0 0 0
2 0 3 0 0 0 0 0
3 0 0 3 0 0 0 0
4 0 3 0 0 0 0 0
5 3 6 3 3 0 0 1
6 0 3 0 0 3 0 3
7 11 7 4 7 3 0 3
8 3 3 0 3 0 0 3
9 3 3 0 3 0 0 3
10 0 3 0 0 0 0 0
#OBS (the unlabeled first column) is the observation number, and hhi shows how many households with i people there are. IQR is the interquartile range for each observation; this is the variable I am building.
Here is a shorter version of your code:
library("Hmisc")
A <- read.table(header=TRUE, text=
" hh1 hh2 hh3 hh4 hh5 hh6
1 0 3 0 0 0 0
2 0 3 0 0 0 0
3 0 0 3 0 0 0
4 0 3 0 0 0 0
5 3 6 3 3 0 0
6 0 3 0 0 3 0
7 11 7 4 7 3 0
8 3 3 0 3 0 0
9 3 3 0 3 0 0
10 0 3 0 0 0 0")
vector_hh_size <- 1:ncol(A)
myIQR <- function(Ai) {
  q <- wtd.quantile(vector_hh_size, weights=Ai)  # compute the weighted quantiles once
  q[4] - q[2]                                    # 75% quantile minus 25% quantile
}
A$IQR <- apply(A, 1, myIQR)
# > A
# hh1 hh2 hh3 hh4 hh5 hh6 IQR
# 1 0 3 0 0 0 0 0
# 2 0 3 0 0 0 0 0
# 3 0 0 3 0 0 0 0
# 4 0 3 0 0 0 0 0
# 5 3 6 3 3 0 0 1
# 6 0 3 0 0 3 0 3
# 7 11 7 4 7 3 0 3
# 8 3 3 0 3 0 0 3
# 9 3 3 0 3 0 0 3
# 10 0 3 0 0 0 0 0
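As a cross-check that does not need Hmisc, here is a minimal base-R sketch. It assumes the weights are non-negative integer counts, so each row can be expanded with rep() and passed to IQR(). The helper name expanded_iqr and the NA returned for an empty hectare are my own choices, not part of the original answer. For the sample rows it gives the same IQR values as the output above.

```r
# Sample data from the question: households of size 1..6 per hectare
A <- data.frame(
  hh1 = c(0, 0, 0, 0, 3, 0, 11, 3, 3, 0),
  hh2 = c(3, 3, 0, 3, 6, 3, 7, 3, 3, 3),
  hh3 = c(0, 0, 3, 0, 3, 0, 4, 0, 0, 0),
  hh4 = c(0, 0, 0, 0, 3, 0, 7, 3, 3, 0),
  hh5 = c(0, 0, 0, 0, 0, 3, 3, 0, 0, 0),
  hh6 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
)

sizes   <- 1:6
hh_cols <- paste0("hh", sizes)

# Hypothetical helper: expand the counts into one value per household,
# then take the ordinary (unweighted) interquartile range.
expanded_iqr <- function(counts) {
  if (sum(counts) == 0) return(NA_real_)  # assumption: empty hectares get NA
  IQR(rep(sizes, times = counts))
}

A$IQR <- apply(A[hh_cols], 1, expanded_iqr)
A
```

Expanding the counts costs memory proportional to the number of households in a row, so this is meant as a check on small data rather than as a replacement for the weighted version.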
Suppose I have something like this:
df<-data.frame(group=c(1, 1,2, 2, 2, 4,4,4,4,6,6,6),
binary1=c(1,0,1,0,0,0,0,0,0,0,0,0),
binary2=c(0,1,0,1,0,1,0,0,0,0,1,1),
binary3=c(0,0,0,0,1,0,1,0,0,0,0,0),
binary4=c(0,0,0,0,0,0,0,1,0,0,0,0))
I want to sum along all possible left-to-right diagonals within groups (i.e. groups 1, 2, 4 and 6) and return the max sum. The data are in a data frame, so I would like to specify that only binary1-binary4 are summed. Does anyone know if this is possible?
Here's my desired output:
group binary1 binary2 binary3 binary4 want
1 1 1 0 0 0 2
2 1 0 1 0 0 2
3 2 1 0 0 0 3
4 2 0 1 0 0 3
5 2 0 0 1 0 3
6 4 0 1 0 0 3
7 4 0 0 1 0 3
8 4 0 0 0 1 3
9 4 0 0 0 0 3
10 6 0 0 0 0 1
11 6 0 1 0 0 1
12 6 0 1 0 0 1
I have circled the "diagonals" I would like summed for group 4 in this image as an example:
Here is another solution that uses the row and column indices to enumerate every diagonal: all cells on the same left-to-right diagonal share the same value of row(x) - col(x). Use by to split by group, then merge the result with the original data frame.
max_diag <- function(x) max(sapply(split(as.matrix(x), row(x) - col(x)), sum))
merge(df, stack(by(df[-1], df$group, max_diag)), by.x = "group", by.y = "ind")
# group binary1 binary2 binary3 binary4 values
#1 1 1 0 0 0 2
#2 1 0 1 0 0 2
#3 2 1 0 0 0 3
#4 2 0 1 0 0 3
#5 2 0 0 1 0 3
#6 4 0 1 0 0 3
#7 4 0 0 1 0 3
#8 4 0 0 0 1 3
#9 4 0 0 0 0 3
#10 6 0 0 0 0 1
#11 6 0 1 0 0 1
#12 6 0 1 0 0 1
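To see why row(x) - col(x) identifies the diagonals, here is a small sketch on the group 4 rows from the question (binary1 to binary4 only). The names m, diag_id and diag_sums are just for illustration.

```r
# Group 4 from the question, one row per observation
m <- matrix(c(0, 1, 0, 0,
              0, 0, 1, 0,
              0, 0, 0, 1,
              0, 0, 0, 0), nrow = 4, byrow = TRUE)

# Every cell on the same left-to-right diagonal has the same row - col value
diag_id <- row(m) - col(m)
diag_id

# Split the cells by that value and sum each diagonal
diag_sums <- sapply(split(m, diag_id), sum)
diag_sums

# The largest diagonal sum is the value wanted for this group
max(diag_sums)
```

The diagonal labelled -1 (one step above the main diagonal) holds the three 1s, which is where the 3 for group 4 comes from.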
You can split the data.frame by group and sum the main diagonal of each piece using diag(). Once you have the diagonal sum per group, put it back into the data.frame by indexing with the group.
With only the main diagonal, group 4 should be zero. Or am I missing something?
DIAG = by(df[,-1],df$group,function(i)sum(diag(as.matrix(i))))
df$want = DIAG[as.character(df$group)]
If I understand your definition correctly, we first define a function that sums the main diagonal and each diagonal to its right (by dropping leading columns one at a time):
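main_diag = function(m){
  # drop=FALSE keeps the subset a matrix (also for one-row groups or a single column),
  # so diag() extracts a diagonal instead of building a diagonal matrix
  sapply(seq_len(ncol(m)),function(i)sum(diag(m[,i:ncol(m),drop=FALSE])))
}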
Thanks to @IceCreamToucan for correcting this. Then we take the max over these sums for the matrix and for its transpose (the transpose covers the diagonals below the main one):
DIAG = by(df[,-1],df$group,function(i){
i = as.matrix(i)
max(main_diag(i),main_diag(t(i)))
})
df$want = DIAG[as.character(df$group)]
group binary1 binary2 binary3 binary4 want
1 1 1 0 0 0 2
2 1 0 1 0 0 2
3 2 1 0 0 0 3
4 2 0 1 0 0 3
5 2 0 0 1 0 3
6 4 0 1 0 0 3
7 4 0 0 1 0 3
8 4 0 0 0 1 3
9 4 0 0 0 0 3
10 6 0 0 0 0 1
11 6 0 1 0 0 1
12 6 0 1 0 0 1
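As a sanity check, the two ideas above (splitting cells by row - col, and summing shifted main diagonals of the matrix and its transpose) can be compared on the question's data. This is a sketch; shift_max, by_split and by_shift are names I made up, and main_diag uses drop = FALSE so that it also behaves for one-row groups.

```r
df <- data.frame(group   = c(1, 1, 2, 2, 2, 4, 4, 4, 4, 6, 6, 6),
                 binary1 = c(1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0),
                 binary2 = c(0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1),
                 binary3 = c(0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0),
                 binary4 = c(0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0))

# Approach 1: group cells by row - col and sum each diagonal
max_diag <- function(x) {
  x <- as.matrix(x)
  max(sapply(split(x, row(x) - col(x)), sum))
}

# Approach 2: main diagonal after dropping leading columns, for m and t(m)
main_diag <- function(m) {
  sapply(seq_len(ncol(m)), function(i) sum(diag(m[, i:ncol(m), drop = FALSE])))
}
shift_max <- function(x) {
  m <- as.matrix(x)
  max(main_diag(m), main_diag(t(m)))
}

pieces   <- split(df[-1], df$group)   # one data frame per group, group column dropped
by_split <- sapply(pieces, max_diag)
by_shift <- sapply(pieces, shift_max)

# Map the per-group maximum back onto the rows
df$want <- unname(by_split[as.character(df$group)])
df
```

Both approaches agree on this data; indexing the named result with as.character(df$group) avoids the merge step.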
Let's say I have a contingency table (made using the table function in R).
digit
ID 1 2 3 4 5 6 7 8 9
1672120 23 16 8 10 12 13 3 3 5
1672121 2 1 0 0 0 0 1 0 0
1672122 1 2 1 0 1 0 0 1 0
1672123 0 1 1 0 0 0 0 0 0
1672124 1 1 0 1 1 0 0 0 0
1672125 5 2 5 1 1 1 0 0 2
1672127 2 1 2 1 0 0 0 0 0
1672128 2 0 0 1 0 1 0 0 1
1672129 1 0 1 0 0 0 1 0 0
If I want to remove the rows where the number of counts is smaller than 5 from the contingency table, how should I do it?
Since you don't provide reproducible sample data, here is an example based on the mtcars dataset.
Let's create a count table of mtcars$gear vs. mtcars$carb:
tbl <- table(mtcars$gear, mtcars$carb)
#
# 1 2 3 4 6 8
# 3 3 4 3 5 0 0
# 4 4 4 0 4 0 0
# 5 0 2 0 1 1 1
We then select only those rows where at least one count is larger than 2
tbl[apply(tbl > 2, 1, any), ]
#
# 1 2 3 4 6 8
# 3 3 4 3 5 0 0
# 4 4 4 0 4 0 0
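The question can also be read as filtering on the total count per row rather than on individual cells. Under that reading, a sketch using rowSums() on the same mtcars table follows. The threshold min_total = 10 is only illustrative (with the question's cutoff of 5, every mtcars row would be kept).

```r
tbl <- table(mtcars$gear, mtcars$carb)

# Assumption: "number of counts" means the row total
min_total <- 10  # illustrative threshold, not taken from the question

# drop = FALSE keeps a two-dimensional table even if only one row survives
kept <- tbl[rowSums(tbl) >= min_total, , drop = FALSE]
kept
```

Here the rows for 3 and 4 gears remain (totals 15 and 12), and the 5-gear row (total 5) is removed.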
I have data.frames of counts such as:
a <- data.frame(id=1:10,
"1"=c(rep(1,3),rep(0,7)),
"3"=c(rep(0,4),rep(1,6)))
names(a)[2:3] <- c("1","3")
a
> a
id 1 3
1 1 1 0
2 2 1 0
3 3 1 0
4 4 0 0
5 5 0 1
6 6 0 1
7 7 0 1
8 8 0 1
9 9 0 1
10 10 0 1
and a template data.frame such as
m <- data.frame(id=1:10,
"1"= rep(0,10),
"2"= rep(0,10),
"3"= rep(0,10),
"4"= rep(0,10))
names(m)[-1] <- 1:4
m
> m
id 1 2 3 4
1 1 0 0 0 0
2 2 0 0 0 0
3 3 0 0 0 0
4 4 0 0 0 0
5 5 0 0 0 0
6 6 0 0 0 0
7 7 0 0 0 0
8 8 0 0 0 0
9 9 0 0 0 0
10 10 0 0 0 0
and I want to add the values of a into the template m in the appropriate columns, leaving the rest as 0.
This works, but I would like to know if there is a more elegant way, perhaps using plyr or data.table:
library(plyr)  # provides rbind.fill
provi <- rbind.fill(a,m)
provi[is.na(provi)] <- 0
mnew <- aggregate(provi[,-1],by=list(provi$id),FUN=sum)
names(mnew)[1] <- "id"
mnew <- mnew[c(1,order(names(mnew)[-1])+1)]
mnew
> mnew
id 1 2 3 4
1 1 1 0 0 0
2 2 1 0 0 0
3 3 1 0 0 0
4 4 0 0 0 0
5 5 0 0 1 0
6 6 0 0 1 0
7 7 0 0 1 0
8 8 0 0 1 0
9 9 0 0 1 0
10 10 0 0 1 0
I guess the most concise option, assuming a and m hold the same ids in the same row order, would be:
m[names(a)] <- a
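If the rows of a might be a subset of the template or in a different order, the same assignment can be aligned on id with match(). This is a sketch on made-up data (three ids in shuffled order), not the question's exact data frames; it assumes every id in a also occurs in m.

```r
# Hypothetical input: only three ids, not in template order
a <- data.frame(id = c(7, 2, 5),
                `1` = c(0, 1, 0),
                `3` = c(1, 0, 1),
                check.names = FALSE)

# Template with all ids and all columns set to 0
m <- data.frame(id = 1:10, `1` = 0, `2` = 0, `3` = 0, `4` = 0,
                check.names = FALSE)

# Row positions in m for each id in a (assumes every id is present in m)
rows <- match(a$id, m$id)

# Fill only the matching rows and the columns that a actually has
m[rows, names(a)] <- a
m
```

check.names = FALSE keeps the numeric column names "1" to "4" as they are instead of turning them into X1, X2, and so on.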
Or we match the column names of 'a' against those of 'm' ('i1'), use max.col to find the column holding the largest value in each row of 'a', and cbind that column index with the row index to build the matrix index 'i2' for 'm'; a similar step creates 'i3' for 'a'. We then replace the values in 'm' at 'i2' with the values of 'a' at 'i3'.
i1 <- match(names(a)[-1], names(m)[-1])
i2 <- cbind(m$id, i1[max.col(a[-1], 'first')]+1L)
i3 <- cbind(a$id, max.col(a[-1], 'first')+1L)
m[i2] <- a[i3]
m
# id 1 2 3 4
#1 1 1 0 0 0
#2 2 1 0 0 0
#3 3 1 0 0 0
#4 4 0 0 0 0
#5 5 0 0 1 0
#6 6 0 0 1 0
#7 7 0 0 1 0
#8 8 0 0 1 0
#9 9 0 0 1 0
#10 10 0 0 1 0
A data.table option would be melt/dcast
library(data.table)
dcast(melt(setDT(a), id.var='id')[,
variable:= factor(variable, levels=1:4)],
id~variable, value.var='value', drop=FALSE, fill=0)
# id 1 2 3 4
# 1: 1 1 0 0 0
# 2: 2 1 0 0 0
# 3: 3 1 0 0 0
# 4: 4 0 0 0 0
# 5: 5 0 0 1 0
# 6: 6 0 0 1 0
# 7: 7 0 0 1 0
# 8: 8 0 0 1 0
# 9: 9 0 0 1 0
#10: 10 0 0 1 0
A similar dplyr/tidyr option would be
library(dplyr)
library(tidyr)
gather(a, Var, Val, -id) %>%
mutate(Var=factor(Var, levels=1:4)) %>%
spread(Var, Val, drop=FALSE, fill=0)
You could use merge, too:
res <- suppressWarnings(merge(a, m, by="id", suffixes = c("", "")))
(res[, which(!duplicated(names(res)))][, names(m)])
# id 1 2 3 4
# 1 1 1 0 0 0
# 2 2 1 0 0 0
# 3 3 1 0 0 0
# 4 4 0 0 0 0
# 5 5 0 0 1 0
# 6 6 0 0 1 0
# 7 7 0 0 1 0
# 8 8 0 0 1 0
# 9 9 0 0 1 0
# 10 10 0 0 1 0
I am using the following R code to produce a confusion matrix comparing the true labels of some data to the output of a neural network.
t <- table(as.factor(test.labels), as.factor(nnetpredict))
However, sometimes the neural network doesn't predict any of a certain class, so the table isn't square (as, for example, there are 5 levels in the test.labels factor, but only 3 levels in the nnetpredict factor). I want to make the table square by adding in any factor levels necessary, and setting their counts to zero.
How should I go about doing this?
Example:
> table(as.factor(a), as.factor(b))
1 2 3 4 5 6 7 8 9 10
1 1 0 0 0 0 0 0 1 0 0
2 0 1 0 0 0 0 0 0 1 0
3 0 0 1 0 0 0 0 0 0 1
4 0 0 0 1 0 0 0 0 0 0
5 0 0 0 0 1 0 0 0 0 0
6 0 0 0 0 0 1 0 0 0 0
7 0 0 0 0 0 0 1 0 0 0
You can see in the table above that there are 7 rows, but 10 columns, because the a factor only has 7 levels, whereas the b factor has 10 levels. What I want to do is to pad the table with zeros so that the row labels and the column labels are the same, and the matrix is square. From the example above, this would produce:
1 2 3 4 5 6 7 8 9 10
1 1 0 0 0 0 0 0 1 0 0
2 0 1 0 0 0 0 0 0 1 0
3 0 0 1 0 0 0 0 0 0 1
4 0 0 0 1 0 0 0 0 0 0
5 0 0 0 0 1 0 0 0 0 0
6 0 0 0 0 0 1 0 0 0 0
7 0 0 0 0 0 0 1 0 0 0
8 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0
10 0 0 0 0 0 0 0 0 0 0
The reason I need to do this is two-fold:
For display to users/in reports
So that I can use a function to calculate the Kappa statistic, which requires a table formatted like this (square, same row and col labels)
EDIT - round II to address the additional details in the question. I deleted my first answer since it wasn't relevant anymore.
This has produced the desired output for the test cases I've given it, but I definitely advise testing thoroughly with your real data. The approach here is to find the full list of levels for both inputs into the table and set that full list as the levels before generating the table.
squareTable <- function(x,y) {
x <- factor(x)
y <- factor(y)
commonLevels <- sort(unique(c(levels(x), levels(y))))
x <- factor(x, levels = commonLevels)
y <- factor(y, levels = commonLevels)
table(x,y)
}
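Since the question mentions the Kappa statistic, here is a sketch of how the squared table could feed an unweighted Cohen's kappa computed by hand. The function cohen_kappa and the vectors truth and pred are hypothetical; squareTable is repeated so the snippet runs on its own.

```r
squareTable <- function(x, y) {
  x <- factor(x)
  y <- factor(y)
  commonLevels <- sort(unique(c(levels(x), levels(y))))
  x <- factor(x, levels = commonLevels)
  y <- factor(y, levels = commonLevels)
  table(x, y)
}

# Hypothetical helper: unweighted Cohen's kappa from a square table
cohen_kappa <- function(tab) {
  n     <- sum(tab)
  p_obs <- sum(diag(tab)) / n                      # observed agreement
  p_exp <- sum(rowSums(tab) * colSums(tab)) / n^2  # agreement expected by chance
  (p_obs - p_exp) / (1 - p_exp)
}

# Made-up labels: the predictions never contain class "c"
truth <- c("a", "a", "b", "b", "c")
pred  <- c("a", "a", "b", "a", "a")

tab <- squareTable(truth, pred)
tab
cohen_kappa(tab)
```

Without the padding, table(truth, pred) would be 3 x 2 and the diagonal would no longer line up matching classes.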
Two test cases:
> #Test case 1
> set.seed(1)
> x <- factor(sample(0:9, 100, TRUE))
> y <- factor(sample(3:7, 100, TRUE))
>
> table(x,y)
y
x 3 4 5 6 7
0 2 1 3 1 0
1 1 0 2 3 0
2 1 0 3 4 3
3 0 3 6 3 2
4 4 4 3 2 1
5 2 2 0 1 0
6 1 2 3 2 3
7 3 3 3 4 2
8 0 4 1 2 4
9 2 1 0 0 3
> squareTable(x,y)
y
x 0 1 2 3 4 5 6 7 8 9
0 0 0 0 2 1 3 1 0 0 0
1 0 0 0 1 0 2 3 0 0 0
2 0 0 0 1 0 3 4 3 0 0
3 0 0 0 0 3 6 3 2 0 0
4 0 0 0 4 4 3 2 1 0 0
5 0 0 0 2 2 0 1 0 0 0
6 0 0 0 1 2 3 2 3 0 0
7 0 0 0 3 3 3 4 2 0 0
8 0 0 0 0 4 1 2 4 0 0
9 0 0 0 2 1 0 0 3 0 0
> squareTable(y,x)
y
x 0 1 2 3 4 5 6 7 8 9
0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0
3 2 1 1 0 4 2 1 3 0 2
4 1 0 0 3 4 2 2 3 4 1
5 3 2 3 6 3 0 3 3 1 0
6 1 3 4 3 2 1 2 4 2 0
7 0 0 3 2 1 0 3 2 4 3
8 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0
>
> #Test case 2
> set.seed(1)
> xx <- factor(sample(0:2, 100, TRUE))
> yy <- factor(sample(3:5, 100, TRUE))
>
> table(xx,yy)
yy
xx 3 4 5
0 4 14 9
1 14 15 9
2 11 11 13
> squareTable(xx,yy)
y
x 0 1 2 3 4 5
0 0 0 0 4 14 9
1 0 0 0 14 15 9
2 0 0 0 11 11 13
3 0 0 0 0 0 0
4 0 0 0 0 0 0
5 0 0 0 0 0 0
> squareTable(yy,xx)
y
x 0 1 2 3 4 5
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 4 14 11 0 0 0
4 14 15 11 0 0 0
5 9 9 13 0 0 0
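One thing to watch: levels() returns character strings, so sort() orders them as text and "10" ends up before "2". That does not show in the test cases above because all labels are single digits. A sketch of a variant that orders the levels numerically when every label is a number follows; the name squareTableNum is hypothetical, and the vectors a and b are made up to mimic the 7 x 10 example from the question.

```r
# Hypothetical variant of squareTable with numeric-aware level ordering
squareTableNum <- function(x, y) {
  lev <- unique(c(as.character(x), as.character(y)))
  lev <- lev[!is.na(lev)]
  num <- suppressWarnings(as.numeric(lev))
  # If every label parses as a number, order numerically; otherwise sort as text
  lev <- if (anyNA(num)) sort(lev) else lev[order(num)]
  table(factor(x, levels = lev), factor(y, levels = lev))
}

# Made-up labels: a only uses classes 1..7, b uses 1..10
a <- c(1:7, 1, 2, 3)
b <- c(1:7, 8, 9, 10)

tab <- squareTableNum(a, b)
tab
```

The result is 10 x 10 with rows and columns both labelled 1 to 10 in numeric order, and rows 8 to 10 filled with zeros.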