Plot of frequencies with counts in R

Plot of frequencies with counts in R - r

The frequency plot that I'm trying to do is
Barplot with counts above each bar
Relative frequency in left side
Cumulative frequency in right side
The dataset is
dput(x2)
c(1L, 5L, 3L, 3L, 5L, 3L, 4L, 1L, 2L, 2L, 7L, 3L, 2L, 2L, 3L,
3L, 2L, 1L, 5L, 4L, 4L, 3L, 5L, 2L, 6L, 2L, 1L, 2L, 5L, 5L, 5L,
3L, 6L, 4L, 5L, 4L, 6L, 7L)
The distribution of frequencies are
table(x2)
x2
1 2 3 4 5 6 7
4 8 8 5 8 3 2
The relative frequencies are
prop.table(table(x2))
x2
1 2 3 4 5 6 7
0.10526316 0.21052632 0.21052632 0.13157895 0.21052632 0.07894737 0.05263158
EDIT: Like in the image below, but with cumulative frequency in the right side, relative frequency in the left and the bars with counts

library(tidyverse)
library(broom)
table(x2) %>%
tidy() %>%
mutate(rel_freq = Freq/sum(Freq), sum = sum(Freq)) %>%
ggplot(aes(reorder(x2, Freq), rel_freq)) +
geom_col() +
geom_text(aes(label = Freq), vjust = -.5) +
scale_y_continuous(sec.axis = sec_axis(~.*length(x2)))

Related

`non-finite value supplied` in ggstatsplot

I am working with ggstatsplot to get visual representations of my statistical analyses.
I have numerous datasets, all very similar in make-up. Some work just fine, while others don't. data1 is a working example, and data2 doesn't work.
data1 <- structure(list(
treatment = structure(c(1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L,
5L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L,
6L),
.Label = c("negative_ctrl", "positive_ctrl", "treatmentA", "treatmentB", "treatmentC", "treatmentD"), class = "factor"),
value = c(1.74501, 2.04001, 1.89501, 1.84001,
1.89501, 9.75001, 8.50001, 8.80001, 11.50001, 10.25001, 7.90001,
9.25001, 11.45001, 7.75001, 7.75001, 7.55001, 8.70001, 8.20001,
6.95001, 6.60001, 7.40001, 7.15001, 8.25001, 9.20001, 8.95001,
6.45001, 6.05001, 5.40001, 7.95001, 6.80001, 4.65001, 6.40001,
6.40001, 6.70001, 5.40001, 3.20001, 2.70001, 4.30001, 4.10001,
3.60001, 4.00001, 3.00001, 4.70001, 3.10001, 3.50001, 6.45001,
5.45001, 4.90001, 7.25001, 4.55001, 4.70001, 6.25001, 5.65001,
6.00001, 5.10001)),
row.names = c(NA, -55L), class = c("tbl_df", "tbl", "data.frame"))
data2 <- structure(list(
treatment = structure(c(1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L,
5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L),
.Label = c("negative_ctrl", "positive_ctrl", "treatmentA", "treatmentB", "treatmentC", "treatmentD"), class = "factor"),
value = c(1.00001, 1.00001, 1.00001, 1.00001, 1.00001, 6.77501,
5.68751, 5.99201, 8.24501, 7.01251, 4.79501, 5.99126, 8.26276,
5.35376, 5.38751, 4.60251, 5.38901, 4.85201, 4.44401, 5.20501,
6.20701, 5.77001, 4.05201, 3.65126, 3.02401, 4.68351, 3.90001,
2.56951, 3.70001, 3.61901, 3.96401, 2.93601, 1.53901, 1.40801,
2.05601, 2.08501, 1.89701, 1.79501, 1.50001, 2.09151, 1.53551,
1.57501, 3.88851, 3.09151, 2.75501, 4.40626, 2.42001, 2.60951,
3.83501, 3.37151, 3.70001, 2.92701)),
row.names = c(NA, -52L), class = c("tbl_df", "tbl", "data.frame"))
I call the most basic analysis for both datasets:
library(Rmpfr)
library(ggstatsplot)
ggstatsplot::ggbetweenstats(
data = data1,
x = treatment,
y = value,
messages = FALSE )
ggstatsplot::ggbetweenstats(
data = data2,
x = treatment,
y = value,
messages = FALSE )
For data1 I get this:
for data2 I get:
> Error in stats::optim(par = 1.1 * rep(lambda, 2), fn = function(x) { : non-finite value supplied by optim
At first I thought the issue might be a few zeros that I passed on in the negative control, but I first upped them by a tiny amount and then by 1 to make sure the range of the values is not an issue. The only discrepancy I can see is that I only have 7 instead of 10 measurements for treatmentA (level 3) in data2 but 10 in data1 (had to remove a few NAs due to sample failure). However, in both cases the negative control (level 1) only has 5 values, and I don't think that in this type of analysis there is an issue with different sample sizes between the groups.

It's a good idea to try basic plots out in these cases eg isolate the boxplots:
So comparing the two datasets:
boxplot(value ~ treatment, data=data1)
boxplot(value ~ treatment, data=data2)
data2 has a treatment with no variability ("negative_ctrl"), 0 SD. I'm guessing this function is doing some tests that require variation. You will need to read the documentation for the function to see if this is brought up but you can get views either by removing these treatments, or forcing a very small amount of variation eg
# run without negative_ctrl
ggstatsplot::ggbetweenstats(
data = data2[data2$treatment != "negative_ctrl",],
x = treatment,
y = value,
messages = FALSE )
# add some tiny fake variation to force it through (this is a hack)
data3 <- data2
data3[data3$treatment=="negative_ctrl",][1,][["value"]] <- 1.0001
ggstatsplot::ggbetweenstats(
data = data3,
x = treatment,
y = value,
messages = FALSE )

Creating a table for frequency analysis results in R

I need to create a table of a certain type and based on a certain template.
This is my data:
df = structure(list(group = c(1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 1L),
degree = structure(c(1L, 1L, 1L, 1L, 1L, 3L, 2L, 1L, 1L, 1L),
.Label = c("Mild severity", "Moderate severity", "Severe severity"),
class = "factor")),
.Names = c("group", "degree"),
class = "data.frame",
row.names = c(NA, -10L))
I conducted a crosstab:
table(df$degree,df$group)
1 2 3
Mild severity 3 3 2
Moderate severity 0 0 1
Severe severity 0 0 1
but I need the results to be formatted in this template:
[![enter image description here][1]][1]
How can I create a table with this structure?
very important edit
full dput() (42 obs.)
df = structure(list(Study.Subject.ID = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 1L, 2L, 3L, 5L, 7L, 8L, 9L, 1L, 2L, 3L, 5L, 8L, 2L, 3L, 5L, 8L, 2L, 3L, 5L, 8L, 2L, 3L, 5L, 8L, 3L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L),
.Label = c("01-06-104", "01-09-108", "01-15-201", "01-16-202", "01-18-204", "01-27-301", "01-28-302", "01-33-305", "01-42-310"),
class = "factor"),
group = c(1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 1L, 1L, 2L, 2L, 3L, 1L, 2L, 2L, 3L, 1L, 2L, 2L, 3L, 1L, 2L, 2L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L),
Degree.of.severity = structure(c(2L, 2L, 2L, 2L, 2L, 4L, 3L, 2L, 2L, 2L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 2L, 3L, 3L, 3L, 3L),
.Label = c("Life-threatening or disabling", "Mild severity", "Moderate severity", "Severe severity"),
class = "factor")),
.Names = c("Study.Subject.ID", "group", "Degree.of.severity"),
class = "data.frame",
row.names = c(NA, -42L))
There is a concept of the subject, and there is concept a number of side effects.
One person can have several side effects.
The side effect can be
severity
Moderate
Severe
I have to count how many people separated by group have this or that side effect,
and how many side effects are in this group?
I.E. In the first group we have 9 observations, but there are two unique people.
01-06-104
01-09-108
but total count Mild severity is 7.
So only two people have side effects of Mild severity (X) and total count Mild severity is 7 (Y).
Total count of patients is 42, so to calculate percentage we must divide by 42 (2/42)=4,7
That's why I expected the output to be:
degree group1 group2 group3
X (%)Y X (%)Y X (%) Y
Mild severity 2 (4,7%)7 3 (7,1%)13 3(7,1%) 12
Moderato 1 (2,3%)1 0(0,0%%)0 2(4,7%) 6
Severe severity 0(0,0%%)0 0(0,0%%)0 1(2,3) 1

I have to admit that I'm not clear on what you're trying to do. Unfortunately your expected output image does not help.
I assume you are asking how to calculate a 2-way contingency table and show both counts and percentages (of total). Here is a tidyverse possibility
library(tidyverse)
df %>%
group_by(group, degree) %>%
summarise(n = n(), perc = n() / nrow(.)) %>%
mutate(entry = sprintf("%i (%3.2f%%)", n, perc * 100)) %>%
select(-n, -perc) %>%
spread(group, entry, fill = "0 (0.0%%)")
## A tibble: 3 x 4
# degree `1` `2` `3`
# <fct> <chr> <chr> <chr>
#1 Mild severity 3 (30.00%) 3 (30.00%) 2 (20.00%)
#2 Moderate severity 0 (0.0%%) 0 (0.0%%) 1 (10.00%)
#3 Severe severity 0 (0.0%%) 0 (0.0%%) 1 (10.00%)

you want the fractions, together with the total numbers? Try:
n=table(df$degree,df$group)
df=as.data.frame(cbind(n/colSums(n)*100,n))

using base R:
a = transform(data.frame(table(df)),Freq = sprintf("%d (%3.2f%%)",Freq,prop.table(Freq)*100))
data.frame(t(unstack(a,Freq~degree)))
X1 X2 X3
Mild.severity 3 (30.00%) 3 (30.00%) 2 (20.00%)
Moderate.severity 0 (0.00%) 0 (0.00%) 1 (10.00%)
Severe.severity 0 (0.00%) 0 (0.00%) 1 (10.00%)

Stacked bar graph with fill ggplot2

I've read through the ggplot2 docs website and other question but I couldn't find a solution. I'm trying to visualize some data for varying age groups. I have sort of managed to do it but it does not look like I would intend it to.
Here is the code for my plot
p <- ggplot(suggestion, aes(interaction(Age,variable), value, color = Age, fill = factor(variable), group = Age))
p + geom_bar(stat = "identity")+
facet_grid(.~Age)![The facetting separates the age variables][1]
My ultimate goal is to created a stack bar graph, which is why I used the fill, but it does not put the TDX values in its corresponding Age group and Year. (Sometimes TDX values == DX values, but I want to visualize when they don't)
Here's the dput(suggestion)
structure(list(Age = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L,
1L, 2L, 3L, 4L, 5L, 6L, 7L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 1L, 2L,
3L, 4L, 5L, 6L, 7L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 1L, 2L, 3L, 4L,
5L, 6L, 7L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 1L, 2L, 3L, 4L, 5L, 6L,
7L), .Label = c("0-2", "3-9", "10-19", "20-39", "40-59", "60-64",
"65+", "UNSP", "(all)"), class = "factor"), variable = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L,
5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 7L, 7L, 7L, 7L, 7L,
8L, 8L, 8L, 8L, 8L, 8L, 8L), .Label = c("Year.10.DX", "Year.11.DX",
"Year.12.DX", "Year.13.DX", "Year.10.TDX", "Year.11.TDX", "Year.12.TDX",
"Year.13.TDX"), class = "factor"), value = c(26.8648932910636,
30.487741796656, 31.9938838749782, 62.8189679326958, 72.8480838120064,
69.3044125928752, 36.9789457527416, 21.808001825378, 24.1073451428435,
40.3305134762935, 70.4486116545885, 68.8342676191755, 63.9227718107745,
34.6086468618636, 8.84033719571875, 13.2807072303835, 28.4781516422802,
55.139497471546, 59.7230544500003, 67.9448927372699, 37.7293286937066,
6.9507024051526, 17.4393054963572, 33.1485743479821, 61.198647580693,
58.6845873573852, 48.0073013177248, 28.4455801248562, 26.8648932910636,
19.8044453272475, 23.0189084635948, 53.7037832071889, 60.6516550126422,
58.1573725886767, 27.0791868812255, 21.808001825378, 19.8146296425633,
35.0587750051557, 62.3308555053346, 59.3299998610862, 56.5341245769817,
27.7229319271878, 8.84033719571875, 13.2807072303835, 22.4081606349585,
48.0252683906252, 52.7560684009579, 65.2890977685045, 32.4142337849399,
6.9507024051526, 15.2833655677215, 24.5268503180754, 52.536784326675,
51.4100599515986, 40.9609231655724, 18.1306673637441)), row.names = c(NA,
-56L), .Names = c("Age", "variable", "value"), class = "data.frame")

It's unclear what you need but perhaps this.
ggplot(a,aes(x=variable,y=value,fill=Age)) + geom_bar(stat='identity')
+facet_wrap(~Age)
If you want to visualize separately the TDX and the DX entries, we'll need to change the dataframe a bit.
> head(a)
Age variable value
1 0-2 Year.10.DX 26.86489
2 3-9 Year.10.DX 30.48774
3 10-19 Year.10.DX 31.99388
4 20-39 Year.10.DX 62.81897
5 40-59 Year.10.DX 72.84808
6 60-64 Year.10.DX 69.30441
The column of interest variable is a combination of year and of TDX/DX value. We'll use the tidyr package to separate this into two columns.
library(tidyr)
library(dplyr)
tidy_a<- a %>% separate(variable, into = c( 'nothing',"year",'label'), sep = "\\.")
This actually splits the levels of column variable into three components, since we split on . and the character . appears twice in each entry.
> head(tidy_a)
Age nothing year label value
1 0-2 Year 10 DX 26.86489
2 3-9 Year 10 DX 30.48774
3 10-19 Year 10 DX 31.99388
4 20-39 Year 10 DX 62.81897
5 40-59 Year 10 DX 72.84808
6 60-64 Year 10 DX 69.30441
So the column nothing is rather useless, just a necessary result of using separate and separating on .. Now this will allow us to visualize TDX/DX separately.
ggplot(tidy_a,aes(x=year,y=value,fill=label)) + geom_bar(stat='identity') + facet_wrap(~Age)

For() loop to ID dates that are between others and calculate a mean value

This is a re-post of "R: For() loop checking if date is between two dates in separate object", that has been changed to incorporate a mock/test minimal after the suggestions of Henrik and Metrics. Thanks to them.
I have two large datasets, both contain columns of date/time fields. My first dataset has a single date, the second has two dates. In short I am trying to find all dates from the first data set that are between the other two dates of the second and then find an average value. In order to provide clarity, I have created a mock minimal data set using values rather than dates.
The head() of my first mock data set is below – as well as the dput() output. The data is specific to an individual noted by the IndID column.
IndID MockDate RandNumber
1 1 5 1.862084
2 1 3 1.103154
3 1 5 1.373760
4 1 1 1.497397
5 1 1 1.319488
6 1 3 2.120354
actData <- structure(list(IndID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L), MockDate = c(5L, 3L, 5L, 1L, 1L, 3L, 4L,
2L, 2L, 5L, 2L, 1L, 5L, 3L, 5L, 3L, 5L, 3L, 5L, 1L, 5L, 3L, 5L,
5L, 2L, 3L, 1L, 4L, 3L, 3L), RandNumber = c(1.862083679, 1.103154127,
1.37376001, 1.497397482, 1.319487885, 2.120353884, 1.895660195,
1.150411874, 2.61036961, 1.99354158, 1.547706758, 1.941501873,
1.739226419, 2.455590044, 2.907382515, 2.110502618, 2.076187012,
2.507527308, 2.167657681, 1.662405916, 2.428807116, 2.04699653,
1.937335768, 1.456518889, 1.948952907, 2.104325112, 2.311519732,
2.092650229, 2.109051215, 2.089144475)), .Names = c("IndID",
"MockDate", "RandNumber"), class = "data.frame", row.names = c(NA,
-30L))
The head() of my 2nd mock data set is below – as well as the dput() output.
IndID StartTime EndTime
1 1 4 5
2 1 7 11
3 1 6 9
4 1 7 9
5 1 6 10
6 1 2 12
clstrData <- structure(list(IndID.1 = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), StartTime = c(4L, 7L,
6L, 7L, 6L, 2L, 6L, 4L, 3L, 5L, 2L, 5L, 7L, 3L, 4L, 3L, 2L, 5L,
5L), EndTime = c(5L, 11L, 9L, 9L, 10L, 12L, 8L, 13L, 5L, 13L,
9L, 9L, 17L, 6L, 8L, 6L, 9L, 15L, 7L)), .Names = c("IndID",
"StartTime", "EndTime"), row.names = c(NA, 19L), class = "data.frame")
The second dataset has two number fields representing a start and end time. As above, these data are also specific to an individual noted by the IndD column.
I need to average the ‘RandNumber’ from dataset one for all the instances when ‘MockDate’ is between ‘StartTime’ and ‘EndTime’ of the second dataset for each unique IndID. Thus, ‘RandNumber’ values should only be averaged if 1) they are within the ‘StartTime’ and ‘EndTime’ and 2) the IndID for both rows are the same.
I started by creating a function to ID if MockDate is between StartTime and EndTime
is.between <- function(x, a, b) {
x > a & x < b
}
Testing that function works for a single value
is.between(actData[1,3], clstrData[,2], clstrData[,3])
But cannot figure out how to loop this for all rows, and then find the mean. My for() loop beginnings are below.
YesNo <- list()
for (i in 1:nrow(actData)) {
YesNo[[i]] <- is.between(actData[1,3], clstrData[,2], clstrData[,3])
}
YesNo[[3]]
This for() gives the same result for all row…
Hope to create...
clstrData$NEWcolum <- mean RandNum for each row.
Thanks, and as always any suggestions are greatly appreciated!

Assuming your machine can handle the data size, you can:
merge the two data frames on the ID, then
group accordingly (ie, by IndID, Start & End dates)
compute mean for those rows where mock date falls between the end dates
Here is some code using data.table
library(data.table)
DT.clstr <- data.table(clstrData, key="IndID")
DT.act <- data.table(actData, key="IndID")
# Adjust to `<=` if needed
ComputedDT <-
merge(DT.clstr, DT.act, allow.cartesian=TRUE)[
MockDate > StartTime & MockDate < EndTime
, list(Mean=mean(RandNumber))
, by=list(IndID, StartTime, EndTime)
]
Results
ComputedDT
IndID StartTime EndTime Mean
1: 1 2 12 1.671002
2: 2 4 13 2.176799
3: 2 2 9 2.244702
4: 3 3 6 1.978828
5: 3 4 8 1.940887
6: 3 2 9 2.033104

Thanks to Ricardo Saporta for earlier thoughts.
However, constructing a long conditional in my for() loop was the best option for me - although not as fast as data.table().
Using the data above, the code below is what I ended up constructing.
clstrData$meanAct = rep(NA, nrow(clstrData))
for (i in 1:nrow(clstrData)){
clstrData$meanAct[i] = mean(actData$RandNumber[actData$IndID==clstrData$IndID[i]
&is.between(actData$RandNumber, clstrData$StartTime[i], clstrData$EndTime[i])])
}
head(clstrData)
tail(clstrData)
Were there is no corresponding value between the Start and End times, NAN's are produced.

ddply summarise proportional count

I am having some trouble using the ddply function from the plyr package. I am trying to summarise the following data with counts and proportions within each group. Here's my data:
structure(list(X5employf = structure(c(1L, 3L, 1L, 1L, 1L, 3L,
1L, 1L, 1L, 3L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 1L, 2L, 2L, 2L,
2L, 2L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 1L, 1L, 3L, 1L,
3L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L,
3L, 3L, 1L), .Label = c("increase", "decrease", "same"), class = "factor"),
X5employff = structure(c(2L, 6L, NA, 2L, 4L, 6L, 5L, 2L,
2L, 8L, 2L, 2L, 2L, 7L, 7L, 8L, 11L, 7L, 2L, 8L, 8L, 11L,
7L, 6L, 2L, 5L, 2L, 8L, 7L, 7L, 7L, 8L, 6L, 7L, 5L, 5L, 7L,
2L, 6L, 7L, 2L, 2L, 2L, 2L, 2L, 5L, 5L, 5L, 2L, 5L, 2L, 2L,
2L, 5L, 12L, 2L, 2L, 2L, 2L, 5L, 5L, 5L, 5L, 2L, 5L, 2L,
13L, 9L, 9L, 9L, 7L, 8L, 5L), .Label = c("", "1", "1 and 8",
"2", "3", "4", "5", "6", "6 and 7", "6 and 7 ", "7", "8",
"1 and 8"), class = "factor")), .Names = c("X5employf", "X5employff"
), row.names = c(NA, 73L), class = "data.frame")
And here's my call using ddply:
ddply(kano_final, .(X5employf, X5employff), summarise, n=length(X5employff), prop=(n/sum(n))*100)
This gives me the counts of each instance of X5employff correctly, but but seems as though the proportion is being calculated across each row and not within each level of the factor X5employf as follows:
X5employf X5employff n prop
1 increase 1 26 100
2 increase 2 1 100
3 increase 3 15 100
4 increase 1 and 8 1 100
5 increase <NA> 1 100
6 decrease 4 1 100
7 decrease 5 5 100
8 decrease 6 2 100
9 decrease 7 1 100
10 decrease 8 1 100
11 same 4 4 100
12 same 5 6 100
13 same 6 5 100
14 same 6 and 7 3 100
15 same 7 1 100
When manually calculating the proportions within each group I get this:
X5employf X5employff n prop
1 increase 1 26 59.09
2 increase 2 1 2.27
3 increase 3 15 34.09
4 increase 1 and 8 1 2.27
5 increase <NA> 1 2.27
6 decrease 4 1 10.00
7 decrease 5 5 50.00
8 decrease 6 2 20.00
9 decrease 7 1 10.00
10 decrease 8 1 10.00
11 same 4 4 21.05
12 same 5 6 31.57
13 same 6 5 26.31
14 same 6 and 7 3 15.78
15 same 7 1 5.26
As you can see the sum of proportions in each level of factor X5employf equals 100.
I know this is probably ridiculously simple, but I can't seem to get my head around it despite reading all sorts of similar posts. Can anyone help with this and my understanding of how the summarise function works?!
Many, many thanks
Marty

You cannot do it in one ddply call because what gets passed to each summarize call is a subset of your data for a specific combination of your group variables. At this lowest level, you do not have access to that intermediate level sum(n). Instead, do it in two steps:
kano_final <- ddply(kano_final, .(X5employf), transform,
sum.n = length(X5employf))
ddply(kano_final, .(X5employf, X5employff), summarise,
n = length(X5employff), prop = n / sum.n[1] * 100)
Edit: using a single ddply call and using table as you hinted towards:
ddply(kano_final, .(X5employf), summarise,
n = Filter(function(x) x > 0, table(X5employff, useNA = "ifany")),
prop = 100* prop.table(n),
X5employff = names(n))

I'd add here an example with dplyr which makes it quite easily in one step, with a short-code and easy-to-read syntax.
d is your data.frame
library(dplyr)
d%.%
dplyr:::group_by(X5employf, X5employff) %.%
dplyr:::summarise(n = length(X5employff)) %.%
dplyr:::mutate(ngr = sum(n)) %.%
dplyr:::mutate(prop = n/ngr*100)
will result in
Source: local data frame [15 x 5]
Groups: X5employf
X5employf X5employff n ngr prop
1 increase 1 26 44 59.090909
2 increase 2 1 44 2.272727
3 increase 3 15 44 34.090909
4 increase 1 and 8 1 44 2.272727
5 increase NA 1 44 2.272727
6 decrease 4 1 10 10.000000
7 decrease 5 5 10 50.000000
8 decrease 6 2 10 20.000000
9 decrease 7 1 10 10.000000
10 decrease 8 1 10 10.000000
11 same 4 4 19 21.052632
12 same 5 6 19 31.578947
13 same 6 5 19 26.315789
14 same 6 and 7 3 19 15.789474
15 same 7 1 19 5.263158

What you apparently want to do is to find out the proportions of X5employff for every value of X5employf. However, you don't tell ddply that X5employf and X5employff are different; to ddply, these two variables are just two variables to split up the data. Also, since there is one observation per line, i.e. count = 1 for every line of the data, the length of each (X5employf, X5employff) combination equals the sum of each (X5employf, X5employff) combination.
The simplest "plyr way" to solve your problem that I can think of is the following:
result <- ddply(kano_final, .(X5employf, X5employff), summarise, n=length(X5employff), drop=FALSE)
n <- result$n
n2 <- ddply(kano_final, .(X5employf), summarise, n=length(X5employff))$n
result <- data.frame(result, prop=n/rep(n2, each=13)*100)
You can also use good old xtabs:
a <- xtabs(~X5employf + X5employff, kano_final)
b <- xtabs(~X5employf, kano_final)
a/matrix(b, nrow=3, ncol=ncol(a))

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Plot of frequencies with counts in R - r

library(tidyverse) library(broom) table(x2) %>% tidy() %>% mutate(rel_freq = Freq/sum(Freq), sum = sum(Freq)) %>% ggplot(aes(reorder(x2, Freq), rel_freq)) + geom_col() + geom_text(aes(label = Freq), vjust = -.5) + scale_y_continuous(sec.axis = sec_axis(~.*length(x2)))

Related

`non-finite value supplied` in ggstatsplot

Creating a table for frequency analysis results in R

Stacked bar graph with fill ggplot2

For() loop to ID dates that are between others and calculate a mean value

ddply summarise proportional count

Categories

Resources