convert a data frame into a specifically formatted frequency table - r

I have a data.frame and I'm trying to create a frequency table that shows the frequency of values for each row. So I'm starting with something like this:
d <- data.frame(a=c(1,2,3), b=c(3,4,5), c=c(1,2,5))
which looks like this:
a b c
1 3 1
2 4 2
3 5 5
What I'd really like to create is a contingency data.frame or matrix that looks like this:
1, 2, 3, 4, 5, 6, 7, 8, 9
2, 0, 1, 0, 0, 0, 0, 0, 0
0, 2, 0, 1, 0, 0, 0, 0, 0
0, 0, 1, 0, 2, 0, 0, 0, 0
The top row is simply a label row and need not be in the final result. But I add it there for illustration. Each row shows the digits 1:9 and the number of times each digit shows up in each row of the starting data.
I can't wrap my head around an easy way to create this. Although it seems like the table() function should be helpful, I can't get it to give me any love. Any help or ideas are appreciated.

Here you go:
t(apply(d, 1, tabulate, nbin=9))
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] 2 0 1 0 0 0 0 0 0
[2,] 0 2 0 1 0 0 0 0 0
[3,] 0 0 1 0 2 0 0 0 0
(Though it probably doesn't matter in this application, tabulate() (which is used inside of the code for table()) is also nice for the impressive speed with which it performs its calculations.)
EDIT: tabulate() isn't set up to deal with 0s or negative integers. If you want another one liner that does, you could use table() though, doing something like this:
d <- data.frame(a=c(0,-1,-2), b=c(3,4,5), c=c(1,2,5))
t(apply(d, 1, function(X) table(c(X, -9:9)) - 1))
-9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9
[1,] 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0
[2,] 0 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0
[3,] 0 0 0 0 0 0 0 1 0 0 0 0 0 0 2 0 0 0 0
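A variation on the same idea, as a sketch: converting each row to a factor with fixed levels makes table() report zero counts directly, so there is no need to pad the data and subtract 1. The -9:9 range is simply the one assumed above, and lv and counts are names made up for this example.

```r
d <- data.frame(a = c(0, -1, -2), b = c(3, 4, 5), c = c(1, 2, 5))
lv <- -9:9  # assumed range of possible values
# table() on a factor counts every level, including levels that never occur
counts <- t(apply(d, 1, function(x) table(factor(x, levels = lv))))
counts
```

Values outside the chosen levels would be dropped silently (they become NA in the factor), so the range has to cover the data.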

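Another solution, using table() together with melt() from the reshape package. Note that the result only has columns for values that actually occur in the data (1 to 5 here), not for all of 1:9: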
library(reshape)
d <- data.frame(a=c(1,2,3), b=c(3,4,5), c=c(1,2,5))
d2 <- melt(d)
d2$rows <- rep(1:nrow(d), ncol(d))
table(d2$rows, d2$value)
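If the full 1:9 layout is wanted from table(), one option is to fix the factor levels. A minimal base-R sketch without reshape, where m and tab are names made up for this example:

```r
d <- data.frame(a = c(1, 2, 3), b = c(3, 4, 5), c = c(1, 2, 5))
m <- as.matrix(d)
# row(m) gives the row index of every cell; fixed levels force columns 1 to 9
tab <- table(row(m), factor(m, levels = 1:9))
tab
```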

Related

Collecting same answer from different questions in one variable?

I am completely new to R, but running out of time here.
In my dataset, I have people from several countries answering who they voted for last. People from different countries got different questions, so in each column, only the ones from the country have an answer, the rest is NA.
I am trying to collect everyone who voted for a green party in one variable. So far I have succeeded in coding it into a separate dummy variable for each country using ifelse, but I can't seem to merge these variables. So now I have, e.g., a variable for Germany, where a green vote in the German election is 1, and everyone else is 0. Same goes for France etc.
But how can I collect all this information in just one variable?
Appreciate your help.
Assuming your data set looks like this...
> ctry <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3)
> vote_ctry_1 <- c(1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0)
> vote_ctry_2 <- c(0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0)
> vote_ctry_3 <- c(0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0)
>
> dd <- data.frame(ctry, vote_ctry_1, vote_ctry_2, vote_ctry_3)
> dd
ctry vote_ctry_1 vote_ctry_2 vote_ctry_3
1 1 1 0 0
2 1 0 0 0
3 1 0 0 0
4 1 1 0 0
5 2 0 1 0
6 2 0 1 0
7 2 0 0 0
8 2 0 1 0
9 3 0 0 1
10 3 0 0 0
11 3 0 0 0
12 3 0 0 0
... then just add up the dummy variables:
> dd$vote_all <- vote_ctry_1 + vote_ctry_2 + vote_ctry_3
> dd
ctry vote_ctry_1 vote_ctry_2 vote_ctry_3 vote_all
1 1 1 0 0 1
2 1 0 0 0 0
3 1 0 0 0 0
4 1 1 0 0 1
5 2 0 1 0 1
6 2 0 1 0 1
7 2 0 0 0 0
8 2 0 1 0 1
9 3 0 0 1 1
10 3 0 0 0 0
11 3 0 0 0 0
12 3 0 0 0 0
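If the dummies still contain NA for respondents from other countries (as the question describes for the raw columns), a plain sum would give NA. A sketch of one way around that, on a small made-up data set:

```r
# hypothetical data: each respondent has an answer only in their own country's column
dd <- data.frame(
  vote_ctry_1 = c(1, 0, NA, NA),
  vote_ctry_2 = c(NA, NA, 1, 0)
)
vote_cols <- c("vote_ctry_1", "vote_ctry_2")
# na.rm = TRUE treats the missing answers as 0 when summing across columns
dd$vote_all <- rowSums(dd[, vote_cols], na.rm = TRUE)
dd
```

Note that a respondent with NA in every column ends up with 0 here, which may or may not be what you want.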

Selecting at least x consecutive values that are the same and removing the end points of these values

I have the following data set:
A <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0)
B <- c(0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0)
C <- c(0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1)
df <- cbind(A, B, C)
> df
A B C
[1,] 1 0 0
[2,] 1 0 1
[3,] 1 1 1
[4,] 1 1 1
[5,] 1 1 1
[6,] 1 1 1
[7,] 1 0 1
[8,] 1 1 1
[9,] 1 1 1
[10,] 1 1 0
[11,] 0 1 0
[12,] 1 1 0
[13,] 0 0 1
I want to do two things to each column: first, I want to change all the 1s that are within two places of a 0 to a 0. Then, I want to select regions of columns where there are at least four consecutive 1s; in other words, if a 1 appears that is not in a consecutive string of four or more 1s, it will become a 0. The resulting data set should look like this:
> df
A B C
[1,] 1 0 0
[2,] 1 0 0
[3,] 1 0 0
[4,] 1 0 1
[5,] 1 0 1
[6,] 1 0 1
[7,] 1 0 1
[8,] 1 0 0
[9,] 0 0 0
[10,] 0 0 0
[11,] 0 0 0
[12,] 0 0 0
[13,] 0 0 0
What’s the best way to do this? Thanks!
Here is another possible approach using base functions. Explanation inline with code.
apply(df, 2, function(x) {
#identify 0 locations, create indices 2 places away from these locations
#and set these to 0
idx <- unique(unlist(lapply(which(x==0L), `+`, -2L:2L)))
x[idx[idx > 0L & idx <= length(x)]] <- 0L
#create run length encoding, filter for those with value=1 but less than 4
#and set those lengths to 0
r <- rle(x)
r$values[r$lengths < 4L & r$values==1L] <- 0L
inverse.rle(r)
})
output:
A B C
[1,] 1 0 0
[2,] 1 0 0
[3,] 1 0 0
[4,] 1 0 1
[5,] 1 0 1
[6,] 1 0 1
[7,] 1 0 1
[8,] 1 0 0
[9,] 0 0 0
[10,] 0 0 0
[11,] 0 0 0
[12,] 0 0 0
[13,] 0 0 0
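To see what the run-length step does in isolation, here is a sketch on a single made-up vector (x, r and y are just example names):

```r
x <- c(0, 1, 1, 1, 1, 1, 0, 1, 1, 0)
r <- rle(x)  # runs: one 0, five 1s, one 0, two 1s, one 0
# zero out runs of 1s that are shorter than four
r$values[r$values == 1 & r$lengths < 4] <- 0
y <- inverse.rle(r)
y
```

The run of five 1s survives, while the run of two 1s is set to 0.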
You could use lag and lead for the comparisons (for the first part).
Here's an example using your data (this is on the original version before any edits to your question):
library(tidyverse) # attaches dplyr and tibble
df <-
as_tibble(df) %>%
mutate(A_lag=lag(A)) %>%
mutate(B_lag=lag(B)) %>%
mutate(C_lag=lag(C)) %>%
mutate(A_lag2=lag(A,2)) %>%
mutate(B_lag2=lag(B,2)) %>%
mutate(C_lag2=lag(C,2)) %>%
mutate(A_lead=lead(A)) %>%
mutate(B_lead=lead(B)) %>%
mutate(C_lead=lead(C)) %>%
mutate(A_lead2=lead(A,2)) %>%
mutate(B_lead2=lead(B,2)) %>%
mutate(C_lead2=lead(C,2)) %>%
as.data.frame()
a <- df[,c(1,4,7,10,13)]
b <- df[,c(2,5,8,11,14)]
c <- df[,c(3,6,9,12,15)]
df <- data.frame(A=apply(a,1,min,na.rm=TRUE),
B=apply(b,1,min,na.rm=TRUE),
C=apply(c,1,min,na.rm=TRUE)
)
This leads to an intermediate result table that looks like this:
A B C
1 1 0 0
2 1 0 0
3 1 0 0
4 1 0 1
5 1 0 1
6 1 0 1
7 1 0 1
8 1 0 0
9 0 0 0
10 0 1 0
11 0 1 0
12 0 1 0
13 0 1 0
The result from this step complies with your logic for it.
For the next step, however, your words ask for one thing (keep only 1's that are in a column-wise block of at least four 1's) but your example shows something slightly different.
Your example output shows column B as all 0's, even though the last 4 rows of it are all 1's as a result of the logic from the earlier step.
I've created some code that follows the logic you specified. If you forgot to include a detail or otherwise wanted something slightly different this should hopefully get you close enough (otherwise please let me know in a comment).
# You could do it without a for loop if need be
myfun <- function(x) {
  out <- rep(0, length(x))
  if (length(x) < 4) return(out)
  # slide a window of four over x; keep every position inside an all-1 window
  for (s in 1:(length(x) - 3)) {
    if (sum(x[s:(s + 3)]) == 4) out[s:(s + 3)] <- 1
  }
  return(out)
}
apply(df,2,myfun)

find the best combination of columns in a matrix

Assume I have a large matrix (matrix_1) of 2000 columns. Each cell has a value of 0 or 1. I want to find the best combination of 10 columns. The best combination is the one with the largest number of rows that contain at least one non-0 value. In other words, it maximizes
sum (apply (matrix_2, 1, function(x) any(x == 1)))
I cannot go through all possible combinations since it is too computationally intensive (there are 2.758988e+26 of them). Any suggestions?
As an example, take this matrix. It has 4 rows, and I am only picking 2 columns at a time:
mat <- matrix (c(1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0), nrow = 4, byrow = FALSE)
mat
# combination of columns 2 and 3 is best: 3 rows with at least a single 1 value
sum (apply (mat[, c(2, 3)], 1, function(x) any (x == 1)))
# combination of columns 1 and 2 is worse: 2 rows with at least a single 1 value
sum (apply (mat[, c(1, 2)], 1, function(x) any (x == 1)))
You could use a function like this...
find10 <- function(mat,n=10){
cols <- rep(FALSE,ncol(mat)) #columns to exclude
rows <- rep(TRUE,nrow(mat)) #rows to include
for(i in 1:n){
colsums <- colSums(mat[rows, , drop = FALSE]) #drop = FALSE keeps a matrix even if only one row is left
colsums[cols] <- -1 #to exclude those already accounted for
maxcol <- which.max(colsums)
cols[maxcol] <- TRUE
rows <- rows & !as.logical(mat[,maxcol])
}
return(which(cols))
}
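It is a greedy search: it looks for the column with the most non-zeros, removes the rows that column covers from the comparison, and repeats. It returns the column numbers of the n columns chosen this way, which is fast but not guaranteed to be the optimal combination.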
An example...
m <- matrix(sample(0:1,100,prob = c(0.8,0.2),replace=TRUE),nrow=10)
m
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 0 1 0 0 0 0 0 1 1 0
[2,] 1 0 0 0 0 0 0 0 1 1
[3,] 0 0 0 0 1 0 0 0 0 0
[4,] 0 0 0 1 0 1 0 1 0 1
[5,] 0 0 0 0 1 0 0 1 0 0
[6,] 0 0 0 1 0 1 1 0 0 0
[7,] 0 0 1 0 0 0 0 0 0 0
[8,] 0 0 0 0 0 0 0 0 1 0
[9,] 0 0 0 0 0 0 0 1 0 0
[10,] 0 0 0 0 0 0 0 0 0 0
find10(m,5)
[1] 3 4 5 8 9
It also comes up with 2,3 for the example you give.
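For small cases the result can be checked against an exhaustive search. A minimal sketch using the 4 x 4 mat from the question; coverage is a hypothetical helper name introduced here, and this brute-force approach is only feasible when the number of combinations is small:

```r
mat <- matrix(c(1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0), nrow = 4)
# number of rows that have at least one non-zero value in the chosen columns
coverage <- function(mat, cols) sum(rowSums(mat[, cols, drop = FALSE]) > 0)
combos <- combn(ncol(mat), 2)  # every pair of columns, one pair per column
scores <- apply(combos, 2, function(cols) coverage(mat, cols))
best <- combos[, which.max(scores)]
best
```

Here the exhaustive search also picks columns 2 and 3, covering 3 rows.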

Convert a matrix of 1s and 0s to a row-sum count matrix

I would like to transform a matrix of 0s and 1s into a corresponding matrix that gives the cumulative row sum for non-zero entries. Example input and output is given below:
set.seed(404)
input <- matrix(rbinom(5 * 5, 1, 0.5), ncol = 5, nrow = 5)
output <- data.frame(a = c(1, 1, 1, 1, 0),
b = c(0, 0, 0, 0, 0),
c = c(2, 2, 0, 2, 1),
d = c(3, 0, 0, 3, 2),
e = c(0, 3, 0, 0, 0))
input
# [,1] [,2] [,3] [,4] [,5]
#[1,] 1 0 1 1 0
#[2,] 1 0 1 0 1
#[3,] 1 0 0 0 0
#[4,] 1 0 1 1 0
#[5,] 0 0 1 1 0
output
# a b c d e
#1 1 0 2 3 0
#2 1 0 2 0 3
#3 1 0 0 0 0
#4 1 0 2 3 0
#5 0 0 1 2 0
We can use apply with MARGIN=1 to get the cumsum of each row of 'input', transpose (t) the result, and multiply it by 'input', so that the 1 values get replaced by the cumsum output and the 0s remain the same.
input*t(apply(input, 1, cumsum))
# [,1] [,2] [,3] [,4] [,5]
#[1,] 1 0 2 3 0
#[2,] 1 0 2 0 3
#[3,] 1 0 0 0 0
#[4,] 1 0 2 3 0
#[5,] 0 0 1 2 0
Or we can use rowCumsums from library(matrixStats) to get the cumsum of each row and multiply as before.
library(matrixStats)
input*rowCumsums(input)
# [,1] [,2] [,3] [,4] [,5]
#[1,] 1 0 2 3 0
#[2,] 1 0 2 0 3
#[3,] 1 0 0 0 0
#[4,] 1 0 2 3 0
#[5,] 0 0 1 2 0

R: Replace "off-diagonal" elements of a random matrix

I'm using the following code to generate a random matrix with some elements = 1 near the diagonal, the rest = 0. (This is basically a random walk along the main diagonal.)
n <- 20
rw <- matrix(0, ncol = 2, nrow = n)
indx <- cbind(seq(n), sample(c(1, 2), n, TRUE))
rw[indx] <- 1
rw[,1] <- cumsum(rw[, 1])+1
rw[,2] <- cumsum(rw[, 2])+1
rw2 <- subset(rw, (rw[,1] <= 10 & rw[,2] <= 10))
field <- matrix(0, ncol = 10, nrow = 10)
field[rw2] <- 1
field
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 0 1 1 1 0 0 0 0 0 0
[2,] 0 0 0 1 0 0 0 0 0 0
[3,] 0 0 0 1 0 0 0 0 0 0
[4,] 0 0 0 1 1 1 1 0 0 0
[5,] 0 0 0 0 0 0 1 1 0 0
[6,] 0 0 0 0 0 0 0 1 0 0
[7,] 0 0 0 0 0 0 0 1 0 0
[8,] 0 0 0 0 0 0 0 1 1 1
[9,] 0 0 0 0 0 0 0 0 0 0
[10,] 0 0 0 0 0 0 0 0 0 0
Next thing, I would like to replace the 0 elements to the right-hand/upper side of the 1-elements by 1. For the above matrix the desired output would be:
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 0 1 1 1 1 1 1 1 1 1
[2,] 0 0 0 1 1 1 1 1 1 1
[3,] 0 0 0 1 1 1 1 1 1 1
[4,] 0 0 0 1 1 1 1 1 1 1
[5,] 0 0 0 0 0 0 1 1 1 1
[6,] 0 0 0 0 0 0 0 1 1 1
[7,] 0 0 0 0 0 0 0 1 1 1
[8,] 0 0 0 0 0 0 0 1 1 1
[9,] 0 0 0 0 0 0 0 0 0 0
[10,] 0 0 0 0 0 0 0 0 0 0
I have tried
fill <- function(row) {first = match(1, row); if (is.na(first)) {row = rep(1, 10)} else {row[first:10] = 1}; return(row)}
field2 <- apply(field, 1, fill)
field2
But that gives me instead:
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 0 0 0 0 0 0 0 0 1 1
[2,] 1 0 0 0 0 0 0 0 1 1
[3,] 1 0 0 0 0 0 0 0 1 1
[4,] 1 1 1 1 0 0 0 0 1 1
[5,] 1 1 1 1 0 0 0 0 1 1
[6,] 1 1 1 1 0 0 0 0 1 1
[7,] 1 1 1 1 1 0 0 0 1 1
[8,] 1 1 1 1 1 1 1 1 1 1
[9,] 1 1 1 1 1 1 1 1 1 1
[10,] 1 1 1 1 1 1 1 1 1 1
Can anyone help me fix this?
Cheers,
mce
PS: If the first row is all zeros (as it can happen with the above code) it should be changed to all ones.
Why not just:
t(apply(field,1,cummax))
One instance:
dput(field)
structure(c(0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0), .Dim = c(10L,
10L))
> field
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 0 0 0 0 0 0 0 0 0 0
[2,] 1 1 1 1 1 1 0 0 0 0
[3,] 0 0 0 0 0 1 0 0 0 0
[4,] 0 0 0 0 0 1 0 0 0 0
[5,] 0 0 0 0 0 1 1 1 1 1
[6,] 0 0 0 0 0 0 0 0 0 0
[7,] 0 0 0 0 0 0 0 0 0 0
[8,] 0 0 0 0 0 0 0 0 0 0
[9,] 0 0 0 0 0 0 0 0 0 0
[10,] 0 0 0 0 0 0 0 0 0 0
The output:
> t(apply(field,1,cummax))
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 0 0 0 0 0 0 0 0 0 0
[2,] 1 1 1 1 1 1 1 1 1 1
[3,] 0 0 0 0 0 1 1 1 1 1
[4,] 0 0 0 0 0 1 1 1 1 1
[5,] 0 0 0 0 0 1 1 1 1 1
[6,] 0 0 0 0 0 0 0 0 0 0
[7,] 0 0 0 0 0 0 0 0 0 0
[8,] 0 0 0 0 0 0 0 0 0 0
[9,] 0 0 0 0 0 0 0 0 0 0
[10,] 0 0 0 0 0 0 0 0 0 0
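To also cover the PS (a first row of all zeros should become all ones), one extra line after the cummax step seems to be enough. A sketch on a small made-up 4 x 4 matrix, with filled as an example name:

```r
field <- matrix(0, nrow = 4, ncol = 4)
field[2, 2] <- 1
field[3, 3] <- 1
# cummax carries the first 1 in each row to the right
filled <- t(apply(field, 1, cummax))
# special case from the PS: an all-zero first row becomes all ones
if (all(field[1, ] == 0)) filled[1, ] <- 1
filled
```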
This should work:
MaxFull <- which.max((apply(field,1,sum) > 0) * (1:10))
rbind(t(apply(field[1:MaxFull, , drop=FALSE], 1, fill)),matrix(0,ncol=10,nrow=10-MaxFull))
Notice that it uses fill as you defined it.
The Value section of the help for apply says: "If each call to FUN returns a vector of length n, then apply returns an array of dimension c(n, dim(X)[MARGIN])". In other words, each processed row comes back as a column, so you want the transpose of the result. I added print statements to the fill function to confirm what it does. You may also want to check whether your function is hiding another function: there is a function named fill (for example in tidyr), but it doesn't matter in this case.
n <- 20
rw <- matrix(0, ncol = 2, nrow = n)
indx <- cbind(seq(n), sample(c(1, 2), n, TRUE))
rw[indx] <- 1
rw[,1] <- cumsum(rw[, 1])+1
rw[,2] <- cumsum(rw[, 2])+1
rw2 <- subset(rw, (rw[,1] <= 10 & rw[,2] <= 10))
field <- matrix(0, ncol = 10, nrow = 10)
field[rw2] <- 1
field
myfill <- function(row) {
print("Function start")
print(row)
first = match(1, row)
print(paste("Match", first))
if (is.na(first)) {
row = rep(1, 10)
} else {
row[first:10] = 1
};
print(row)
flush.console()
return(row)
}
field2 = t(apply(field, 1, myfill))
field2
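A small illustration of that transposition, on a made-up 2 x 3 matrix (m, res and fixed are just example names):

```r
m <- matrix(1:6, nrow = 2)
# each call returns a vector of length 3, so the result is 3 x 2, not 2 x 3
res <- apply(m, 1, function(r) r * 10)
dim(res)
# transposing restores the original orientation
fixed <- t(res)
dim(fixed)
```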
