Frequency across multiple variables - r

My data has more than 50 variables, with the same set of values distributed across them. I need to tabulate each value (there are more than 1,000 distinct values) with its frequency across all of the variables.
For example, I need a table of each value with its count across all variables, like: F447 5, A257 4, G1229 5, C245 2.

You can use table() to get the frequency of each value across the variables, then convert the result to a data frame, make the value names a column, and finally arrange by frequency. If you want descending order instead, you can use arrange(desc(frequency)), shown after the output below.
library(dplyr)

data.frame(frequency = unclass(table(unlist(df)))) %>%
  tibble::rownames_to_column("variable") %>%
  arrange(frequency)
Output
variable frequency
1 C245 1
2 A257 2
3 F447 3
4 G1229 3
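The descending variant mentioned above is the same pipeline with desc():
data.frame(frequency = unclass(table(unlist(df)))) %>%
  tibble::rownames_to_column("variable") %>%
  arrange(desc(frequency))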
Or with base R, you could do:
results <- data.frame(frequency = unclass(table(unlist(df))))
results$variable <- row.names(results)
row.names(results) <- NULL
results <- results[order(results$frequency),c(2,1)]
Another option, if you would like an additional summary and visualization of the frequencies, is the epiDisplay package.
library(epiDisplay)
tab1(unlist(df), sort.group = "decreasing", cum.percent = TRUE)
Output
      Frequency Percent Cum. percent
G1229         3    33.3         33.3
F447          3    33.3         66.7
A257          2    22.2         88.9
C245          1    11.1        100.0
Total         9   100.0        100.0
Data
df <- structure(list(Var1 = c("F447", "A257", "G1229"), Var2 = c("G1229",
"F447", "A257"), Var3 = c("C245", "F447", "G1229")), class = "data.frame", row.names = c(NA,
-3L))

Related

In R, excel COUNTA equivalent, or another way to calculate averages in "char" columns?

I have a spreadsheet where some of the values are entered as "N/A", and some of the cells are blank.
joe   pete   mark   Average (the average per row)
90    85     N/A    87.5
N/A          92     92
88    90            89
3     2      2      3      <-- this row counts all non-blank values in each column
I want to import these into R to do two things:
Get an average of these values for each row across multiple columns, and
get a count of the values per column (see above for an example).
The problem is: I want to be able to count all the non-blank cells, including those with "N/A" values, as they are an important part of the data and are different from blanks.
What I tried:
I replaced the "N/A" values in Excel before importing into R, changing them to 0's so I could import the columns as numbers. But then my averages are thrown off: with a 0 in the first row, for example, the average becomes (90 + 85 + 0)/3 = 58.33.
That is not what I want. I want an average of only those values that are not "N/A".
The other issue is that if I leave them as "N/A", I can get a count, but my columns are no longer numeric and I can't perform the average calculation.
I know I can do this easily in Excel with =COUNTA and =AVERAGE, but I would prefer to do as much wrangling as possible in R.
Any suggestions?
Thanks!
Try something like this. The na.rm = TRUE argument should be what you want:
example_data = c(90, 85, NA)
MEAN = mean(example_data, na.rm=TRUE)
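A minimal sketch of applying this row-wise (dat_num below is a hypothetical all-numeric version of the data, with the "N/A" strings already read in as real NA values):
# Hypothetical all-numeric version of the spreadsheet columns
dat_num <- data.frame(joe = c(90, NA, 88), pete = c(85, NA, 90), mark = c(NA, 92, NA))
# rowMeans() also takes na.rm = TRUE, so each row is averaged over
# its non-missing values only
dat_num$avg <- rowMeans(dat_num, na.rm = TRUE)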
base R
dat$avg <- mapply(function(...) {
  dots <- unlist(list(...))
  mean(suppressWarnings(as.numeric(dots[nzchar(dots)])),
       na.rm = TRUE)
}, dat$joe, dat$pete, dat$mark)
dat
#   joe pete mark  avg
# 1  90   85 <NA> 87.5
# 2  NA        92 92.0
# 3  88   90      89.0
as.data.frame(lapply(dat, function(z) sum(nzchar(z))))
#   joe pete mark avg
# 1   3    2    2   3

dat <- rbind(dat, as.data.frame(lapply(dat, function(z) sum(nzchar(z)))))
dat
#   joe pete mark  avg
# 1  90   85 <NA> 87.5
# 2  NA        92 92.0
# 3  88   90      89.0
# 4   3    2    2  3.0
Data
dat <- structure(list(joe = c(90, NA, 88), pete = c("85", "", "90"), mark = c(NA, "92", "")), class = "data.frame", row.names = c(NA, -3L))

Writing a function to summarize the results of dunn.test::dunn.test

In R, I perform Dunn's test. The function I use has no option to group the input variables by their statistically significant differences. However, this is what I am genuinely interested in, so I tried to write my own function. Unfortunately, I am not able to wrap my head around it. Perhaps someone can help.
I use the airquality dataset that comes with R as an example. The result that I need could look somewhat like this:
> library (tidyverse)
> ozone_summary <- airquality %>% group_by(Month) %>% dplyr::summarize(Mean = mean(Ozone, na.rm=TRUE))
# A tibble: 5 x 2
Month Mean
<int> <dbl>
1 5 23.6
2 6 29.4
3 7 59.1
4 8 60.0
5 9 31.4
When I run the dunn.test, I get the following:
> dunn.test::dunn.test (airquality$Ozone, airquality$Month, method = "bh", altp = T)
Kruskal-Wallis rank sum test
data: x and group
Kruskal-Wallis chi-squared = 29.2666, df = 4, p-value = 0
Comparison of x by group
(Benjamini-Hochberg)
Col Mean-|
Row Mean |          5          6          7          8
---------+--------------------------------------------
       6 |  -0.925158
         |     0.4436
         |
       7 |  -4.419470  -2.244208
         |    0.0001*    0.0496*
         |
       8 |  -4.132813  -2.038635   0.286657
         |    0.0002*     0.0691     0.8604
         |
       9 |  -1.321202   0.002538   3.217199   2.922827
         |     0.2663     0.9980     0.0043*    0.0087*
alpha = 0.05
Reject Ho if p <= alpha
From this result, I deduce that May differs from July and August, June differs from July (but not from August), and so on. So I'd like to append the significantly differing groups to my results table:
# A tibble: 5 x 3
Month Mean Group
<int> <dbl> <chr>
1 5 23.6 a
2 6 29.4 ac
3 7 59.1 b
4 8 60.0 bc
5 9 31.4 a
While I did this by hand, I suppose it must be possible to automate this process. However, I don't find a good starting point. I created a dataframe containing all comparisons:
> ozone_differences <- dunn.test::dunn.test (airquality$Ozone, airquality$Month, method = "bh", altp = T)
> ozone_differences <- data.frame ("P" = ozone_differences$altP.adjusted, "Compare" = ozone_differences$comparisons)
P Compare
1 4.436043e-01 5 - 6
2 9.894296e-05 5 - 7
3 4.963804e-02 6 - 7
4 1.791748e-04 5 - 8
5 6.914403e-02 6 - 8
6 8.604164e-01 7 - 8
7 2.663342e-01 5 - 9
8 9.979745e-01 6 - 9
9 4.314957e-03 7 - 9
10 8.671708e-03 8 - 9
I thought that a function iterating through this data frame and using a selection variable to choose the right letter from the built-in letters vector might work. However, I cannot even think of a starting point, because changing numbers of rows have to be considered at the same time...
Perhaps someone has a good idea?
Perhaps you could look into the cldList() function from the rcompanion library; you can pipe the res element from the output of dunnTest() into it and create a table that gives the compact letter display comparison per group.
Following the advice of @TylerRuddenfort, the following code will work. The first cld is created with rcompanion::cldList, and the second directly uses multcompView::multcompLetters. Note that to use multcompLetters, the spaces have to be removed from the names of the comparisons.
Here, I have used FSA::dunnTest for the Dunn (1964) test.
In general, I recommend ordering groups by e.g. median or mean before running dunnTest if you plan on using a cld, so that the cld comes out in a sensible order.
library(tidyverse)
ozone_summary <- airquality %>% group_by(Month) %>% dplyr::summarize(Mean = mean(Ozone, na.rm=TRUE))

library(FSA)
Result = dunnTest(airquality$Ozone, airquality$Month, method = "bh")$res

### Use cldList()
library(rcompanion)
cldList(P.adj ~ Comparison, data = Result)

### Use multcompView
library(multcompView)
X = Result$P.adj <= 0.05
names(X) = gsub(" ", "", Result$Comparison)
multcompLetters(X)
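To append the letters to the summary table as in the desired output, a minimal sketch (assuming cldList() returns its usual Group/Letter columns, with Group holding the month labels from the comparisons):
library(dplyr)
cld <- cldList(P.adj ~ Comparison, data = Result)
# Group comes back as the month labels; coerce for a clean join
cld$Group <- as.character(cld$Group)
ozone_summary %>%
  mutate(Group = as.character(Month)) %>%
  left_join(cld, by = "Group")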

Create dummy variable with survey package

I want to transform a variable into a dummy using the survey package.
I have a complex sample design defined by:
library(survey)
prestratified_design <- svydesign(
  id = ~ PSU,
  strata = ~ STRAT,
  data = data,
  weights = ~ weight,
  nest = TRUE)
The dataset has a variable for education with 8 different categories:
# A tibble: 8 x 3
education n prop
<int> <int> <dbl>
1 1 2919 20.8
2 2 5551 39.5
3 3 447 3.18
4 4 484 3.45
5 5 3719 26.5
6 6 91 0.65
7 9 790 5.63
8 10 39 0.28
I want to create a dummy variable for categories 5 & 10 == 1 and others == 0.
I know that I have to use the update function, but I don't know how to use it with the survey package.
I have tried:
prestratified_design <- update(
  prestratified_design,
  dummy_educ = as.numeric(education == 5 & education == 10))
but it obviously didn't work.
thank you!
You can create dummy variables in R via ifelse() when the result has just two categories:
df$dummy_educ = with(df, ifelse(education == 5 | education == 10, 1, 0))
If there are more categories, you can use dplyr::case_when(), and if you are creating dummies from a factor variable, model.matrix() is fast and works best; a sketch of both follows.
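For illustration, a minimal sketch of both approaches on a plain data frame (the df below is a made-up stand-in, not the OP's survey design):
library(dplyr)

# Hypothetical stand-in data
df <- data.frame(education = c(1, 2, 5, 9, 10))

# case_when(): scales to more than two output categories
df <- df %>%
  mutate(dummy_educ = case_when(
    education %in% c(5, 10) ~ 1,
    TRUE ~ 0
  ))

# model.matrix(): one indicator column per factor level
dummies <- model.matrix(~ factor(education) - 1, data = df)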
For any new variable to be taken into account by the complex design, you don't need to update your dataset (data in your example); you have to update your survey design by adding the new variable to it, using the survey::update() function.
Following your example, try with the code below:
prestratified_design <- update(prestratified_design,
dummy_educ = as.integer(education == 5 | education == 10))
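Once the design is updated, the new variable can be used in estimation functions as usual; for example, its weighted proportion (a usage sketch, reusing the design object above):
svymean(~ dummy_educ, prestratified_design, na.rm = TRUE)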
Good luck with that!

Creating summary statistic table from subsets of data in R

I have a table that looks something like this:
Time  Carbon        OD
0     Sucrose       1.13
0     Citric acid   1.54
24    Histidine     2.1
24    Glutamine     1.7
48    Maleic acid   2.1
48    Fumaric acid  3.1
72    Tryptophan    2.3
72    Serine        1.2
72    etc           etc
It has four time points, and 9 different carbons that can be split into three groups (organic acids, sugars, amino acids).
EDIT - if it's helpful, the OD was measured for each carbon at each time point 8 times. Previously I used this code to create summary statistics for the entire thing:
summary <- aggregate(dataset2$OD,
                     by = list(Time = dataset2$Time, Carbon = dataset2$Carbon),
                     FUN = function(x) c(mean = mean(x), sd = sd(x),
                                         n = length(x)))
summary <- do.call(data.frame, summary)
summary$se <- summary$x.sd / sqrt(summary$x.n)
But now I would like to generate the same summary statistics for the means of each of the three groups, if possible, so I would get something like this:
Time Group OD SD n SE
0 Group 1
24 Group 1
48 Group 1
72 Group 1
0 Group 2
I'm not quite sure how to specify this in my code?
Using dplyr (the mean gets its own name so that sd(OD) still refers to the raw values; this assumes dataset2 already has a Group column, and a sketch for creating one follows):
dataset2 %>%
  group_by(Time, Group) %>%
  summarise(mean_OD = mean(OD),
            SD = sd(OD),
            n = n(),
            SE = SD / sqrt(n))
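The data shown has no Group column, so here is a hypothetical mapping from each carbon source to its group; the assignments are illustrative and should be adjusted to the real nine carbons:
library(dplyr)
dataset2 <- dataset2 %>%
  mutate(Group = case_when(
    Carbon %in% c("Sucrose") ~ "Sugars",                                            # assumed grouping
    Carbon %in% c("Citric acid", "Maleic acid", "Fumaric acid") ~ "Organic acids",  # assumed grouping
    TRUE ~ "Amino acids"                                                            # assumed grouping
  ))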

data frame column total in R

I have data like this (derived using the table() function):
dat <- read.table(text = "responses freq percent
A 9 25.7
B 13 37.1
C 10 28.6
D 3 8.6", header = TRUE)
dat
responses freq percent
A 9 25.7
B 13 37.1
C 10 28.6
D 3 8.6
All I want is a totals row at the bottom: Total in the responses column, 35 in the freq column, and 100 in the percent column. I am unable to find a solution. colSums doesn't work because of the first column, which is a string.
One option is converting to 'matrix' and using addmargins to get the column sum as a separate row at the bottom. But, this will be a matrix.
m1 <- as.matrix(df1[-1])
rownames(m1) <- df1[,1]
res <- addmargins(m1, 1)
res
# freq percent
#A 9 25.7
#B 13 37.1
#C 10 28.6
#D 3 8.6
#Sum 35 100.0
If you want to convert to data.frame
data.frame(responses=rownames(res), res)
Another option would be getting the sum with colSums for the numeric columns (df1[-1]) (I think this is where the OP got into trouble, i.e. applying colSums to the entire dataset instead of a subset), creating a new data.frame with the responses column, and rbinding it with the original dataset.
rbind(df1, data.frame(responses='Total', as.list(colSums(df1[-1]))))
# responses freq percent
#1 A 9 25.7
#2 B 13 37.1
#3 C 10 28.6
#4 D 3 8.6
#5 Total 35 100.0
data
df1 <- structure(list(responses = c("A", "B", "C", "D"), freq = c(9L,
13L, 10L, 3L), percent = c(25.7, 37.1, 28.6, 8.6)),
.Names = c("responses", "freq", "percent"), class = "data.frame",
row.names = c(NA, -4L))
This might be relevant, using the SciencesPo package; see this example:
library(SciencesPo)
tab(mtcars,gear,cyl)
#output
=================================
              cyl
       --------------------
 gear      4     6     8  Total
---------------------------------
    3      1     2    12     15
        6.7%   13%   80%   100%
    4      8     4     0     12
       66.7%   33%    0%   100%
    5      2     1     2      5
       40.0%   20%   40%   100%
---------------------------------
Total     11     7    14     32
       34.4%   22%   44%   100%
=================================
Chi-Square Test for Independence
Number of cases in table: 32
Number of factors: 2
Test for independence of all factors:
Chisq = 18.036, df = 4, p-value = 0.001214
Chi-squared approximation may be incorrect
X^2 df P(> X^2)
Likelihood Ratio 23.260 4 0.00011233
Pearson 18.036 4 0.00121407
Phi-Coefficient : NA
Contingency Coeff.: 0.6
Cramer's V : 0.531
@akrun I posted this, but you already did the same. Correct me if I'm wrong; I think we can just use this, without creating a new data frame or using as.list:
rbind(df1, c("Total", colSums(df1[-1])))
Output:
responses freq percent
1 A 9 25.7
2 B 13 37.1
3 C 10 28.6
4 D 3 8.6
5 Total 35 100
With sqldf, the classes of the data frame are preserved:
library(sqldf)
sqldf("SELECT * FROM df1
UNION
SELECT 'Total', SUM(freq) AS freq, SUM(percent) AS percent FROM df1")
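One caveat: plain UNION de-duplicates and does not guarantee row order, so if the Total row must stay last, UNION ALL (which simply appends the second result set) is the safer choice:
sqldf("SELECT * FROM df1
       UNION ALL
       SELECT 'Total', SUM(freq) AS freq, SUM(percent) AS percent FROM df1")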
Or, alternatively, you can use the margin.table and rbind functions from base R. Two lines and voila...
PS: The lines here are longer as I am recreating the data, but you know what I mean :-)
Data
df1 <- matrix(c(9,25.7,13,37.1,10,28.6,3,8.6), ncol=2, byrow=TRUE)
colnames(df1) <- c("freq","percent")
rownames(df1) <- c("A","B","C","D")
Creating the total calculation
Total <- margin.table(df1, 2)
Combining the total calculation with the original data
df2 <- rbind(df1, Total)
df2
Inelegant, but it gets the job done (please provide reproducible data frames so we don't have to build them first):
data = data.frame(letters[1:4], c(9,13,10,3), c(25.7,37.1, 28.6, 8.6))
colnames(data) = c("X","Y","Z")
data = rbind(data[,1:3], matrix(c("Sum",lapply(data[,2:3], sum)), nrow = 1)[,1:3])
Using the janitor package:
library(janitor)
dat %>%
  adorn_totals("row")
responses freq percent
A 9 25.7
B 13 37.1
C 10 28.6
D 3 8.6
Total 35 100.0
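As a usage note (assuming a reasonably recent janitor version), adorn_totals() can also add a column total, or both at once:
dat %>%
  adorn_totals(c("row", "col"))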
