Summarizing count data as proportion in a data.frame

dummy <- data.frame(Q1 = c(0, 1, 0, 1),
Q2 = c(1, 1, 0, 1),
Q3 = c(0, 1, 1, 0))
df_dummy <- data.frame(Question = c("Q1", "Q2", "Q3"),
X1 = c(2/4, 3/4, 2/4),
X0 = c(2/4, 1/4, 2/4))
> dummy
Q1 Q2 Q3
1 0 1 0
2 1 1 1
3 0 0 1
4 1 1 0
> df_dummy
Question X1 X0
1 Q1 0.50 0.50
2 Q2 0.75 0.25
3 Q3 0.50 0.50
I have some data (dummy) with binary responses to Q1, Q2, and Q3. I want to summarize the data in the format shown in df_dummy, where, for each question, column X1 gives the proportion of people who answered 1 and column X0 gives the proportion who answered 0. I tried prop.table, but that didn't return the desired result.

Another way is counting the proportion of 1s and then deducing from that the proportion of 0s:
X1 <- colSums(dummy==1)/nrow(dummy)
df_dummy <- data.frame(X1, X0=1-X1)
df_dummy
# X1 X0
#Q1 0.50 0.50
#Q2 0.75 0.25
#Q3 0.50 0.50
NB, inspired by @akrun's use of colMeans: you can also use colMeans instead of dividing colSums by the number of rows to define X1:
X1 <- colMeans(dummy==1)
df_dummy <- data.frame(X1, X0=1-X1)
df_dummy
# X1 X0
#Q1 0.50 0.50
#Q2 0.75 0.25
#Q3 0.50 0.50
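
To get the exact layout asked for in the question (a Question column and plain row numbers), the same idea can be extended slightly. This is a minimal sketch that assumes the responses are coded strictly as 0 and 1 with no missing values:

```r
dummy <- data.frame(Q1 = c(0, 1, 0, 1),
                    Q2 = c(1, 1, 0, 1),
                    Q3 = c(0, 1, 1, 0))

# Proportion of 1s per question; the mean of a logical vector is the share of TRUEs
X1 <- colMeans(dummy == 1)

# Move the question names into their own column and drop them as row names
df_dummy <- data.frame(Question = names(X1),
                       X1 = unname(X1),
                       X0 = unname(1 - X1))
df_dummy
```

If the data could contain NAs, colMeans(dummy == 1, na.rm = TRUE) would compute the proportion among the non-missing answers instead.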

We can use apply with MARGIN = 2 and divide the counts of each value by the length of the column:
t(apply(dummy, 2, function(x) table(x)/length(x)))
# 0 1
#Q1 0.50 0.50
#Q2 0.25 0.75
#Q3 0.50 0.50

We can do this with table and prop.table
t(sapply(dummy, function(x) prop.table(table(x))))
# 0 1
#Q1 0.50 0.50
#Q2 0.25 0.75
#Q3 0.50 0.50
Or a more efficient approach is to call table once
prop.table(table(stack(dummy)[2:1]),1)
# values
#ind 0 1
# Q1 0.50 0.50
# Q2 0.25 0.75
# Q3 0.50 0.50
Or another option is colMeans (inspired by @Cath's use of colSums)
X0 <- colMeans(!dummy)
data.frame(X1 = 1 - X0, X0)
# X1 X0
#Q1 0.50 0.50
#Q2 0.75 0.25
#Q3 0.50 0.50

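Another way is to use do.call and lapply. Note that this returns counts (first row: number of 0s, second row: number of 1s), so divide the result by nrow(dummy) to get proportions:
do.call(cbind,lapply(dummy,function(x) data.frame(table(x))[,2]))
# Q1 Q2 Q3
#[1,] 2 1 2
#[2,] 2 3 2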

Less elegantly than in the answer above:
d <- t(dummy)
cbind(X0 = (ncol(d) - rowSums(d)) / ncol(d), X1 = rowSums(d) / ncol(d))
Or, to avoid computing the same stuff twice, and to get a data frame:
d <- t(dummy)
i <- ncol(d)
j <- rowSums(d)
data.frame(Question = rownames(d), X0 = (i - j) / i, X1 = j / i)
There you go:
Question X0 X1
Q1 Q1 0.50 0.50
Q2 Q2 0.25 0.75
Q3 Q3 0.50 0.50

A tidyverse option:
library(tidyr)
library(janitor)
dummy %>%
gather(question, val) %>% # reshape to long form
tabyl(question, val) %>% # make crosstab table
adorn_percentages("row") %>%
clean_names()
question x0 x1
Q1 0.50 0.50
Q2 0.25 0.75
Q3 0.50 0.50

Related

block diagonal covariance matrix by group of variable

This is my dataset: 3 subjects, measured at two time points (t1, t2); at each time point each subject is measured twice.
ID Time Observation Y
1 t1 1 3.1
1 t1 2 4.5
1 t2 1 4.2
1 t2 2 1.7
2 t1 1 2.3
2 t1 2 3.1
2 t2 1 1.5
2 t2 2 2.2
3 t1 1 4.1
3 t1 2 4.9
3 t2 1 3.5
3 t2 2 3.2
What is the block diagonal covariance matrix for this dataset? I am assuming the matrix would have three blocks, each block being the 2x2 variance-covariance matrix of subject i with t1 and t2. Is this correct?

The code below computes the covariance matrices for each ID, then creates a block matrix out of them.
make_block_matrix <- function(x, y, fill = NA) {
u <- matrix(fill, nrow = dim(x)[1], ncol = dim(y)[2])
v <- matrix(fill, nrow = dim(y)[1], ncol = dim(x)[2])
u <- cbind(x, u)
v <- cbind(v, y)
rbind(u, v)
}
# compute covariance matrices for each ID
sp <- split(df1, df1$ID)
cov_list <- lapply(sp, \(x) {
y <- reshape(x[-1], direction = "wide", idvar = "Time", timevar = "Observation")
cov(y[-1])
})
# no longer needed, tidy up
rm(sp)
# fill with NA's, simple call
# res <- Reduce(make_block_matrix, cov_list)
# fill with zeros
res <- Reduce(\(x, y) make_block_matrix(x, y, fill = 0), cov_list)
# finally, the results' row and column names
# repeat each ID once per row of its block so the labels line up with the blocks
nms <- paste(unlist(lapply(cov_list, rownames)), rep(names(cov_list), sapply(cov_list, nrow)), sep = ".")
dimnames(res) <- list(nms, nms)
res
#> Y.1.1 Y.2.1 Y.1.2 Y.2.2 Y.1.3 Y.2.3
#> Y.1.1 0.605 -1.54 0.00 0.000 0.00 0.000
#> Y.2.1 -1.540 3.92 0.00 0.000 0.00 0.000
#> Y.1.2 0.000 0.00 0.32 0.360 0.00 0.000
#> Y.2.2 0.000 0.00 0.36 0.405 0.00 0.000
#> Y.1.3 0.000 0.00 0.00 0.000 0.18 0.510
#> Y.2.3 0.000 0.00 0.00 0.000 0.51 1.445
Created on 2023-02-20 with reprex v2.0.2
Data
df1 <- "ID Time Observation Y
1 t1 1 3.1
1 t1 2 4.5
1 t2 1 4.2
1 t2 2 1.7
2 t1 1 2.3
2 t1 2 3.1
2 t2 1 1.5
2 t2 2 2.2
3 t1 1 4.1
3 t1 2 4.9
3 t2 1 3.5
3 t2 2 3.2"
df1 <- read.table(text = df1, header = TRUE)
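
As an alternative to the hand-written make_block_matrix helper, the per-ID covariance matrices can also be assembled with bdiag from the Matrix package (a recommended package that normally ships with R). This is a sketch of a swapped-in technique, not the method used in the answer above; it reuses the same per-ID covariance step and drops the dimnames:

```r
df1 <- read.table(text = "ID Time Observation Y
1 t1 1 3.1
1 t1 2 4.5
1 t2 1 4.2
1 t2 2 1.7
2 t1 1 2.3
2 t1 2 3.1
2 t2 1 1.5
2 t2 2 2.2
3 t1 1 4.1
3 t1 2 4.9
3 t2 1 3.5
3 t2 2 3.2", header = TRUE)

# One 2x2 covariance matrix per ID (columns: observation 1 and observation 2)
cov_list <- lapply(split(df1, df1$ID), function(x) {
  y <- reshape(x[-1], direction = "wide", idvar = "Time", timevar = "Observation")
  cov(y[-1])
})

# bdiag() places the matrices on the diagonal and fills the rest with zeros;
# as.matrix() converts the sparse result back to an ordinary base matrix
res_bdiag <- as.matrix(Matrix::bdiag(unname(cov_list)))
res_bdiag
```

The sparse result from bdiag can be kept as is when there are many subjects, since most entries of a block diagonal matrix are zero.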

Automated fill in columns in r

I have a dataframe (shown below) where there are some asterisks in the "sig" column.
I want to fill in asterisks in the empty cells in the sig column everywhere above the furthest down row where there is an asterisk, which in this case would be everywhere from row "H" up to get something like this:
I'm thinking some sort of a for loop where it identifies the furthest down row where there is an asterisk and then fills in asterisks in empty cells above might be the way to go, but I'm not sure how to code this.
For debugging purposes, I make the data frame in R with
df<- data.frame("variable"= c("a","b","c","d","e","f","g","h","i","j","k"),
"value" = c(0.04,0.03,0.04,0.02,0.03,0.02,0.02,0.01,0.04,0.1,0.02),
"sig" = c("*","*","*","","*","*","","*","","",""))
Any help would be greatly appreciated - thanks!

Another way:
df[1:max(which(df$sig == "*")), "sig"] = "*"
Gives:
variable value sig
1 a 0.04 *
2 b 0.03 *
3 c 0.04 *
4 d 0.02 *
5 e 0.03 *
6 f 0.02 *
7 g 0.02 *
8 h 0.01 *
9 i 0.04
10 j 0.10
11 k 0.02
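
One edge case worth guarding against: if the sig column contains no asterisk at all, which() returns an empty vector and max() of it gives -Inf with a warning. A minimal sketch with such a guard, using the same df as in the question (the name star_rows is only illustrative):

```r
df <- data.frame(variable = c("a","b","c","d","e","f","g","h","i","j","k"),
                 value = c(0.04,0.03,0.04,0.02,0.03,0.02,0.02,0.01,0.04,0.1,0.02),
                 sig = c("*","*","*","","*","*","","*","","",""))

# Positions of all rows that already carry an asterisk
star_rows <- which(df$sig == "*")

# Only fill when at least one asterisk exists; seq_len() is safe for any positive length
if (length(star_rows) > 0) {
  df$sig[seq_len(max(star_rows))] <- "*"
}
df
```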

We could use replace based on finding the index of the last element having *
library(dplyr)
df <- df %>%
mutate(sig = replace(sig, seq(tail(which(sig == "*"), 1)), "*"))
Output:
df
variable value sig
1 a 0.04 *
2 b 0.03 *
3 c 0.04 *
4 d 0.02 *
5 e 0.03 *
6 f 0.02 *
7 g 0.02 *
8 h 0.01 *
9 i 0.04
10 j 0.10
11 k 0.02

Another solution would be using fill, but you need to change "" to NA
Libraries
library(tidyverse)
Data
df <-
data.frame("variable"= c("a","b","c","d","e","f","g","h","i","j","k"),
"value" = c(0.04,0.03,0.04,0.02,0.03,0.02,0.02,0.01,0.04,0.1,0.02),
"sig" = c("*","*","*","","*","*","","*","","",""))
Code
df %>%
mutate(sig = if_else(sig == "",NA_character_,sig)) %>%
fill(sig,.direction = "up")
Output
variable value sig
1 a 0.04 *
2 b 0.03 *
3 c 0.04 *
4 d 0.02 *
5 e 0.03 *
6 f 0.02 *
7 g 0.02 *
8 h 0.01 *
9 i 0.04 <NA>
10 j 0.10 <NA>
11 k 0.02 <NA>

How to use an IF function to update columns in a data frame?

predict <- read.table(header=TRUE, text="
0 1
0.44 0.55
0.76 0.24
0.71 0.29
0.75 0.24
0.25 0.75
")
I have attached a sample data frame with 2 columns titled '0' & '1'. I want to use an IF function so that if the value in the 0 column is bigger than 0.7 the cell updates to have a 0 value in it. Also if the value in the '1' column is bigger than 0.7 the cell updates to have a 1 value in it. Finally if neither the '0' or '1' values are bigger than 0.7 I would like the cells to return as -99. I have attached an example of what my sample would look like after this IF function was applied.
predict <- read.table(header=TRUE, text="
0 1
-99 -99
0 0.24
0 0.29
0 0.24
0.25 1
")
The code I have attempted is;
if(predict[,1] > 0.7 ){predict[,1] == '0' }
if(predict[,1] > 0.7 ){predict[,2] == '1' }
If you could advise me on the best way to update this IF function that would be really appreciated.

Update
Thanks to AniGoyal's input, I updated the answer to match the OP's exact desired output by combining the two approaches:
Code:
library(dplyr)
predict %>%
as_tibble %>%
mutate(a = case_when(X0 > 0.7 ~ 0,
TRUE ~ ifelse(X0 < 0.7 & X1 < 0.7, -99, X0)),
b = case_when(X1 > 0.7 ~ 1,
TRUE ~ ifelse(X1 < 0.7 & X0 < 0.7, -99, X1))
) %>%
select(X0 = a, X1=b)
Output:
X0 X1
<dbl> <dbl>
1 -99 -99
2 0 0.24
3 0 0.290
4 0 0.24
5 0.25 1
We could use case_when from the dplyr package. mutate changes columns X0 and X1 according to the case_when conditions.
library(dplyr)
predict %>%
mutate(X0 = case_when(X0 > 0.7 ~ 0,
TRUE ~ -99),
X1 = case_when(X1 > 0.7 ~ 1,
TRUE ~ -99)
)
Output:
X0 X1
1 -99 -99
2 0 -99
3 0 -99
4 0 -99
5 -99 1
ifelse
Or we could use ifelse https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/ifelse
predict$X0 <- ifelse(predict$X0 > 0.7, 0, -99)
predict$X1 <- ifelse(predict$X1 > 0.7, 1, -99)
predict

Note - numeric names for columns are less desirable ("0" and "1"). Here they are renamed to "X0" and "X1".
One approach with base R is to subset your data for your 3 circumstances, first checking to see if neither are greater than .7 (and set both to -99), then checking the 0 column (set to 0), then checking the 1 column (set to 1):
predict[!(predict$X0 > .7 | predict$X1 > .7), c("X0", "X1")] <- -99
predict[predict$X0 > .7, "X0"] <- 0
predict[predict$X1 > .7, "X1"] <- 1
predict
Output
X0 X1
1 -99.00 -99.00
2 0.00 0.24
3 0.00 0.29
4 0.00 0.24
5 0.25 1.00

This is just another way using dplyr:
library(dplyr)
predict %>%
as_tibble() %>%
mutate(X0 = ifelse(X0 > 0.7, 0, X0),
X1 = ifelse(X1 > 0.7, 1, X1)) %>%
mutate(across(X0:X1, ~ ifelse((X0 < 0.7 & X0 != 0) & (X1 < 0.7 & X0 != 0), -99, .)))
X0 X1
<dbl> <dbl>
1 -99 -99
2 0 0.24
3 0 0.290
4 0 0.24
5 0.25 1

There are two errors in the code you tried:
Base R's if/else is not vectorized: it tests a single condition. To check every element of a vector with it, you have to use it inside a loop.
== is used for assignment. == is for comparison, not assignment; use <- (or =) to assign.
If you still want to do it with base R's if/else:
for(i in 1:nrow(predict)){
if(predict[i, 1] > 0.7){
predict[i, 1] = 0
}
if(predict[i,2] > 0.7){
predict[i, 2] = 1
}
if(predict[i, 1] < 0.7 & predict[i, 2] < 0.7 & predict[i, 1] >0){
predict[i, 1] = -99
predict[i, 2] = -99
}
}
> predict
X0 X1
1 -99.00 -99.00
2 0.00 0.24
3 0.00 0.29
4 0.00 0.24
5 0.25 1.00
You may also consider using replace. The "neither" condition is computed once, before either column is overwritten, so that both columns are updated for the same rows:
predict[, 1] <- replace(predict[,1], predict[,1] > 0.7, 0)
predict[, 2] <- replace(predict[,2], predict[,2] > 0.7, 1)
neither <- predict[, 1] < 0.7 & predict[, 2] < 0.7 & predict[, 1] > 0
predict[, 1] <- replace(predict[, 1], neither, -99)
predict[, 2] <- replace(predict[, 2], neither, -99)
> predict
X0 X1
1 -99.00 -99.00
2 0.00 0.24
3 0.00 0.29
4 0.00 0.24
5 0.25 1.00
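
The first point in this answer (if tests a single condition, while ifelse works element-wise) can be illustrated with a short sketch. The vector x is made up for the illustration; in R 4.2.0 and later a condition of length greater than one is an error, while earlier versions only warn:

```r
x <- c(0.44, 0.76, 0.25)

# if() expects exactly one TRUE/FALSE, so a length-3 condition fails (or warns in older R)
if_result <- tryCatch(
  if (x > 0.7) "big" else "small",
  error = function(e) "error",
  warning = function(w) "warning"
)
if_result

# ifelse() evaluates the condition for every element and returns a vector
vectorized <- ifelse(x > 0.7, 0, x)
vectorized
```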

Storing output for nested loop

I'm trying to do a nested loop for logistic regression.
I'm trying to run a loop for the discretization value and for each class.
Here's the code so far... I'm unable to get an output for each different iteration.
class <- c(1,2,3,4,5)
discretization_value <- seq(0.25, 0.75, by =0.05)
output<-data.frame(matrix(nrow=500, ncol=5))
names(output)=c("discretization_value", "class", "var1_coef", "var2_coef", "var3_coef")
for (i in discretization_value){
for (j in class) {
df$discretization_value <- ifelse(df$score >= i,1,0)
result <- (glm(discretization_value ~
var1 + var2 + var3,
data = df[df$class == j,], family= "binomial"))
output[i,1] <- i
output[i,2] <- j
output[i,3] <- coef(summary(result))[c("var1"),c("Estimate")]
output[i,4] <- coef(summary(result))[c("var2"),c("Estimate")]
output[i,5] <- coef(summary(result))[c("var3"),c("Estimate")]
}
}
a snippet of my df
class score var1 var2 var3
1 0.3 0.18 0.33 356
1 0.5 0.22 0.55 33
1 0.6 0.77 0.44 35
2 0.9 0.99 0.55 2
3 0 0 0 0
3 0.4 0.5 0.11 5
4 0 0.6 0 7
4 0 0.6 0 9
4 0.6 0.2 0.1 6

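The comma in this line is not the problem:
data = df[df$class == j,], family= "binomial"))
It is what makes df[df$class == j, ] select rows; without it, R would try to select columns instead. The more likely issue is how output is indexed: i takes values such as 0.25, which are truncated to 0 when used as a row index, so nothing is stored. Even with integer values of i, every class j would write to the same row and overwrite the previous result. Indexing output with a separate row counter that increases on every inner iteration avoids both problems.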
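
A minimal sketch of storing one row per iteration with an explicit row counter. The data are simulated stand-ins (the asker's full df is not shown), and the names cutoffs, classes, y, and k are illustrative assumptions rather than anything from the original code:

```r
set.seed(1)
# Simulated data with the same column names as the question (values are made up)
df <- data.frame(class = rep(1:2, each = 100),
                 score = runif(200),
                 var1 = rnorm(200), var2 = rnorm(200), var3 = rnorm(200))

classes <- sort(unique(df$class))
cutoffs <- c(0.4, 0.5, 0.6)

# One row per (cutoff, class) combination
output <- data.frame(matrix(NA_real_, nrow = length(cutoffs) * length(classes), ncol = 5))
names(output) <- c("discretization_value", "class", "var1_coef", "var2_coef", "var3_coef")

k <- 0  # row counter, incremented on every inner iteration
for (i in cutoffs) {
  df$y <- ifelse(df$score >= i, 1, 0)  # binary response for this cutoff
  for (j in classes) {
    k <- k + 1
    fit <- glm(y ~ var1 + var2 + var3, data = df[df$class == j, ], family = "binomial")
    output[k, ] <- unname(c(i, j, coef(fit)[c("var1", "var2", "var3")]))
  }
}
output
```

With real data, a class that has only one outcome value at a given cutoff would still need separate handling, since the model cannot be estimated meaningfully in that case.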

replacing values in dataframe with another dataframe r

I have a dataframe of values that represent fold changes as such:
> df1 <- data.frame(A=c(1.74,-1.3,3.1), B=c(1.5,.9,.71), C=c(1.1,3.01,1.4))
A B C
1 1.74 1.50 1.10
2 -1.30 0.90 3.01
3 3.10 0.71 1.40
And a dataframe of p-values whose rows and columns match those of df1:
> df2 <- data.frame(A=c(.02,.01,.8), B=c(NA,.01,.06), C=c(.01,.01,.03))
A B C
1 0.02 NA 0.01
2 0.01 0.01 0.01
3 0.80 0.06 0.03
What I want is to modify df1 so that it retains only the values whose corresponding p-value in df2 is < .05, and replaces the rest with NA. Note that there are also NAs in df2.
> desired <- data.frame(A=c(1.74,-1.3,NA), B=c(NA,.9,NA), C=c(1.1,3.01,1.4))
> desired
A B C
1 1.74 NA 1.10
2 -1.30 0.9 3.01
3 NA NA 1.40
I first tried to use vector syntax on these dataframes and that didn't work. Then I tried a for loop by columns and that also failed.
I don't think I understand how to index each i,j position and then replace df1 values with df2 values based on a logical.
Or if there is a better way in R.

You can try this:
df1[!df2 < 0.05 | is.na(df2)] <- NA
Out:
> df1
A B C
1 1.74 NA 1.10
2 -1.30 0.9 3.01
3 NA NA 1.40
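
The same replacement can be written with an explicit logical mask, which may make the handling of NA in df2 easier to follow. This is a sketch under the assumption that df1 and df2 have identical dimensions and column order; the name keep is illustrative:

```r
df1 <- data.frame(A = c(1.74, -1.3, 3.1), B = c(1.5, .9, .71), C = c(1.1, 3.01, 1.4))
df2 <- data.frame(A = c(.02, .01, .8), B = c(NA, .01, .06), C = c(.01, .01, .03))

# TRUE only where a p-value exists and is below 0.05;
# FALSE & NA evaluates to FALSE, so the mask itself contains no NA
keep <- !is.na(df2) & df2 < 0.05

# A logical matrix can index a data frame cell by cell
df1[!keep] <- NA
df1
```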

ifelse and as.matrix seem to do the trick.
df1 <- data.frame(A=c(1.74,-1.3,3.1), B=c(1.5,.9,.71), C=c(1.1,3.01,1.4))
df2 <- data.frame(A=c(.02,.01,.8), B=c(NA,.01,.06), C=c(.01,.01,.03))
x1 <- as.matrix(df1)
x2 <- as.matrix(df2)
as.data.frame( ifelse( x2 >= 0.05 | is.na(x2), NA, x1) )
Result
A B C
1 1.74 NA 1.10
2 -1.30 0.9 3.01
3 NA NA 1.40
