How can I do this standard excel operation in R?

I want to identify which members in an insurance database are default members and which are voluntary members. Default members have a fixed number of units that depends on their age. Voluntary members are any members with more units than a default member has at that age.
I want to create a column in R that says either "Default" or "Voluntary".
I have a table of the number of units a default member has. For example:
Age Units
18 2
19 2
20 2
21 2
22 2
23 2
24 2
25 3
26 3
27 3
28 3
29 3
30 3
31 4
32 4
33 4
34 4
35 4
36 4
37 4
38 4
39 4
40 4
41 4
42 4
43 4
44 4
45 4
46 4
47 4
48 4
49 4
50 3
51 3
52 3
53 3
54 3
55 3
56 3
57 3
58 3
59 3
60 2
61 2
62 2
63 2
64 2
65 1
66 1
67 1
68 1
69 1
I would usually do this in Excel by using VLOOKUP on the member's age: if the member's number of units equals the default number of units from the table above, they are default, and if not, they are voluntary.
This is how I would do it in Excel:
if( MembersUnits = vlookup(memberage,defaultunitstable,2,0),"Default", "Voluntary")
I expect the output to be "Default" or "Voluntary".

Using the data you supplied as a lookup table, I created data of person age and the number of units they have, joined the threshold values from lookup and compared the values with ifelse:
library(dplyr)
lookup <- structure(list(Age = 18:69,
Units = c(2L, 2L, 2L, 2L, 2L, 2L,
2L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L)),
row.names = c(NA,
-52L), class = c("tbl_df", "tbl", "data.frame"))
dat <- tibble(Age = c(50, 50, 49, 32, 18), Units = c(3, 5, 5, 4, 3))
left_join(dat, rename(lookup, "Threshold" = "Units"), by = "Age") %>%
mutate(member = ifelse(Units == Threshold, "Default", "Voluntary"))
# A tibble: 5 x 4
Age Units Threshold member
<dbl> <dbl> <int> <chr>
1 50 3 3 Default
2 50 5 3 Voluntary
3 49 5 4 Voluntary
4 32 4 4 Default
5 18 3 2 Voluntary
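For readers who prefer base R, the same VLOOKUP-style comparison can be sketched with match(), which returns the position of each member's age in the lookup table. This is only a sketch: the lookup table is rebuilt with rep() from the values in the question, and an age that is missing from the lookup would give NA rather than a label.

```r
# Default units by age, rebuilt from the table in the question
lookup <- data.frame(
  Age   = 18:69,
  Units = rep(c(2L, 3L, 4L, 3L, 2L, 1L), times = c(7, 6, 19, 10, 5, 5))
)

# Example members (same values as the dplyr example above)
dat <- data.frame(Age = c(50, 50, 49, 32, 18), Units = c(3, 5, 5, 4, 3))

# match() plays the role of VLOOKUP: find each member's age in the lookup
threshold <- lookup$Units[match(dat$Age, lookup$Age)]

# Compare the member's units with the default units for that age
dat$member <- ifelse(dat$Units == threshold, "Default", "Voluntary")
dat
```

No join or extra package is needed here, at the cost of doing the lookup by hand.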

if (!require("prodlim")) {
install.packages("prodlim")
require("prodlim")
} # ensure installation and loading of package "prodlim"
ifelse(is.na(row.match(as.data.frame(dat), as.data.frame(lookup))),
"Voluntary",
"Default")
## [1] "Default" "Voluntary" "Voluntary" "Default" "Default" "Default"
## the function
## prodlim::row.match(as.data.frame(dat), as.data.frame(lookup))
## returns for each row in dat,
## the matching row number in lookup or
## NA if there is no match
##
## This resulting vector one can use to translate any non-NA to "Default" and
## any NA to "Voluntary" using the vectorized `ifelse`
As example data (define it before running the code above) I used the following, adapted from @Paul's answer:
require(dplyr)
dat <- tibble(Age = c(50, 50, 49, 26, 32, 18), Units = c(3, 5, 5, 3, 4, 2))
lookup <- structure(list(Age = 18:69,
Units = c(2L, 2L, 2L, 2L, 2L, 2L,
2L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L)),
row.names = c(NA,
-52L), class = c("tbl_df", "tbl", "data.frame"))

Related

How to create an unique observation ID using hash functions?

I have received a data frame for analysis in which each observation is a row with 120 variables. Unfortunately, I have not received an observation ID variable that uniquely identifies each observation.
I was thinking I could concatenate all columns into a string and hash this string to obtain a unique ID.
How can I do this without specifying every variable, as I would have to with paste()? Or is there another solution?
The data can contain NA.
Here is a sample dataset:
structure(list(Class = structure(c(1L, 2L, 3L, 4L, 1L, 2L, 3L,
4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L,
4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), levels = c("1st", "2nd",
"3rd", "Crew"), class = "factor"), Sex = structure(c(1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), levels = c("Male",
"Female"), class = "factor"), Age = structure(c(1L, NA, 1L, NA,
1L, NA, 1L, 1L, 2L, 2L, NA, 2L, 2L, 2L, 2L, NA, 1L, 1L, 1L, NA,
NA, 1L, 1L, 1L, NA, 2L, 2L, 2L, 2L, 2L, 2L, NA), levels = c("Child",
"Adult"), class = "factor"), Survived = structure(c(1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), levels = c("No",
"Yes"), class = "factor"), Freq = c(0, 0, 35, 0, 0, 0, 17, 0,
118, 154, 387, 670, 4, 13, 89, 3, 5, 11, 13, 0, 1, 13, 14, 0,
57, 14, 75, 192, 140, 80, 76, 20)), row.names = c(NA, -32L), class = "data.frame")

You could use the unique_identifier function from the udpipe package, which is documented as follows:
Create a unique identifier for each combination of fields in a data
frame. This unique identifier is unique for each combination of the
elements of the fields. The generated identifier is like a primary key
or a secondary key on a table. This is just a small wrapper around
frank
Here is a reproducible example:
df <- structure(list(Class = structure(c(1L, 2L, 3L, 4L, 1L, 2L, 3L,
4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L,
4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), levels = c("1st", "2nd",
"3rd", "Crew"), class = "factor"), Sex = structure(c(1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), levels = c("Male",
"Female"), class = "factor"), Age = structure(c(1L, NA, 1L, NA,
1L, NA, 1L, 1L, 2L, 2L, NA, 2L, 2L, 2L, 2L, NA, 1L, 1L, 1L, NA,
NA, 1L, 1L, 1L, NA, 2L, 2L, 2L, 2L, 2L, 2L, NA), levels = c("Child",
"Adult"), class = "factor"), Survived = structure(c(1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), levels = c("No",
"Yes"), class = "factor"), Freq = c(0, 0, 35, 0, 0, 0, 17, 0,
118, 154, 387, 670, 4, 13, 89, 3, 5, 11, 13, 0, 1, 13, 14, 0,
57, 14, 75, 192, 140, 80, 76, 20)), row.names = c(NA, -32L), class = "data.frame")
library(udpipe)
#> Warning: package 'udpipe' was built under R version 4.1.2
df$ID <- unique_identifier(df, fields = colnames(df))
df
#> Class Sex Age Survived Freq ID
#> 1 1st Male Child No 0 1
#> 2 2nd Male <NA> No 0 12
#> 3 3rd Male Child No 35 17
#> 4 Crew Male <NA> No 0 27
#> 5 1st Female Child No 0 5
#> 6 2nd Female <NA> No 0 16
#> 7 3rd Female Child No 17 21
#> 8 Crew Female Child No 0 29
#> 9 1st Male Adult No 118 3
#> 10 2nd Male Adult No 154 10
#> 11 3rd Male <NA> No 387 20
#> 12 Crew Male Adult No 670 25
#> 13 1st Female Adult No 4 6
#> 14 2nd Female Adult No 13 14
#> 15 3rd Female Adult No 89 23
#> 16 Crew Female <NA> No 3 31
#> 17 1st Male Child Yes 5 2
#> 18 2nd Male Child Yes 11 9
#> 19 3rd Male Child Yes 13 18
#> 20 Crew Male <NA> Yes 0 28
#> 21 1st Female <NA> Yes 1 8
#> 22 2nd Female Child Yes 13 13
#> 23 3rd Female Child Yes 14 22
#> 24 Crew Female Child Yes 0 30
#> 25 1st Male <NA> Yes 57 4
#> 26 2nd Male Adult Yes 14 11
#> 27 3rd Male Adult Yes 75 19
#> 28 Crew Male Adult Yes 192 26
#> 29 1st Female Adult Yes 140 7
#> 30 2nd Female Adult Yes 80 15
#> 31 3rd Female Adult Yes 76 24
#> 32 Crew Female <NA> Yes 20 32
Created on 2022-07-24 by the reprex package (v2.0.1)

Another option is to use unclass on factors (i.e., after pasting all columns together using Reduce), which will convert the factors to their numbers.
df$ID <- c(unclass(as.factor(Reduce(paste, df))))
Output
Class Sex Age Survived Freq ID
1 1st Male Child No 0 6
2 2nd Male <NA> No 0 16
3 3rd Male Child No 35 22
4 Crew Male <NA> No 0 31
5 1st Female Child No 0 3
6 2nd Female <NA> No 0 12
7 3rd Female Child No 17 19
8 Crew Female Child No 0 25
9 1st Male Adult No 118 5
10 2nd Male Adult No 154 13
11 3rd Male <NA> No 387 24
12 Crew Male Adult No 670 29
13 1st Female Adult No 4 1
14 2nd Female Adult No 13 9
15 3rd Female Adult No 89 17
16 Crew Female <NA> No 3 27
17 1st Male Child Yes 5 7
18 2nd Male Child Yes 11 15
19 3rd Male Child Yes 13 23
20 Crew Male <NA> Yes 0 32
21 1st Female <NA> Yes 1 4
22 2nd Female Child Yes 13 11
23 3rd Female Child Yes 14 20
24 Crew Female Child Yes 0 26
25 1st Male <NA> Yes 57 8
26 2nd Male Adult Yes 14 14
27 3rd Male Adult Yes 75 21
28 Crew Male Adult Yes 192 30
29 1st Female Adult Yes 140 2
30 2nd Female Adult Yes 80 10
31 3rd Female Adult Yes 76 18
32 Crew Female <NA> Yes 20 28
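If a row key is enough and a cryptographic hash is not required, all columns can be pasted together without naming them by using do.call(). The sketch below uses a small made-up data frame (`toy`) and a separator chosen on the assumption that it never occurs in the data. Identical rows receive the same ID, so the ID is unique per row only if no rows are duplicated.

```r
# Hypothetical toy data; rows 1 and 4 are deliberately identical
toy <- data.frame(
  Class = c("1st", "2nd", "1st", "1st"),
  Age   = c("Child", NA, "Adult", "Child"),
  Freq  = c(0, 0, 118, 0)
)

# Paste all columns row by row without listing them; NA becomes the text "NA"
key <- do.call(paste, c(toy, sep = "\r"))

# Number the distinct keys in order of first appearance
toy$ID <- match(key, unique(key))

# TRUE here means at least two rows share a key (and therefore an ID)
has_duplicates <- anyDuplicated(key) > 0
toy
```

Checking `has_duplicates` first shows whether a truly unique observation ID can be derived from the columns at all.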

Calculating and looping summaries for individual participants into a table

I have data from several hundred participants who each provided between 1 and 6 sentences. They then rated their sentence(s) on 4 dimensions, as did two external raters.
I'd like to create a table, grouped by participant, with columns showing these values:
Participants' rate of agreement with rater 1 (par1), with rater 2 (par2) and overall (paro)
Participants' rate of agreement for each dimension with rater 1 (pad1.1, pad2.1 etc.), with rater 2 (pad1.2, pad2.2 etc.) and overall (pad1.o, pad2.o etc.)
Mean difference in rating between participant and rater 1 (mdrp1), rater 2 (mdrp2) and both raters (mdrpo)
Mean difference in rating for each dimension between participant and rater 1 (mdr1p1, mdr2p1 etc.), rater 2 (mdr1p2, mdr2p2 etc.) and both raters (mdr1po, mdr2po etc.)
(So with 4 dimensions there should be 30 values per participant)
Due to the size and structure of the data, I'm not sure where to start on this. I'm guessing that a loop would be necessary, but I've struggled to get my head around how to do that as well.
For agreement I'm considering adding TRUE/FALSE variables and then replacing them with 1 and 0 to eventually calculate agreement:
df <- df %>% mutate(par1 = (df$d1 == df$r1.1))
df <- df %>% mutate(par2 = (df$d1 == df$r2.1))
df <- df %>% mutate(paro = (df$d1 == df$r1.1 & df$d1 == df$r2.1))
And similarly for mean differences, adding variables with rating difference for each dimension...
df <- df %>% mutate(mdr1p1 = (df$d1 - df$r1.1))
df <- df %>% mutate(mdr1p2 = (df$d1 - df$r2.1))
df <- df %>% mutate(mdr1po = (df$d1 - ((df$r1.1 + df$r2.1)/2)))
...But these seem to be quite inefficient approaches!
My data looks like this:
ID Ans d1 d2 d3 d4 r1.1 r1.2 r1.3 r1.4 r2.1 r2.2 r2.3 r2.4
1 53 abc 3 3 3 3 3 2 4 3 3 2 4 3
2 a4 def 3 3 3 3 3 1 2 3 3 1 3 3
3 a4 ghi 4 4 4 4 3 2 5 1 3 1 5 2
4 hj jkl 3 3 3 3 3 1 3 3 3 1 5 3
5 32 mno 2 3 3 3 3 1 3 2 3 1 3 3
6 32 pqr 3 3 3 2 3 2 5 3 4 2 3 3
ID = participant
Ans = participants' written answer
d = dimension rated by participant
r1 = dimensions rated by external rater 1
r2 = dimensions rated by external rater 2
Example data:
structure(list(ID = c(1L, 2L, 2L, 3L, 4L, 4L, 5L),
Ans = c("abc", "def", "ghi", "jkl", "mno", "pqr", "stu"),
d1 = c(3L, 3L, 4L, 3L, 2L, 3L, 3L), d2 = c(3L, 3L, 4L, 3L, 3L, 3L, 1L),
d3 = c(3L, 3L, 4L, 3L, 3L, 3L, 1L), d4 = c(3L, 3L, 4L, 3L, 3L, 2L, 3L),
r1.1 = c(3L, 3L, 3L, 3L, 3L, 3L, 3L), r1.2 = c(2L, 1L, 2L, 1L, 1L, 2L, 3L),
r1.3 = c(4L, 2L, 5L, 3L, 3L, 5L, 3L), r1.4 = c(3L, 3L, 1L, 3L, 2L, 3L, 2L),
r2.1 = c(3L, 3L, 3L, 3L, 3L, 4L, 3L), r2.2 = c(2L, 1L, 1L, 1L, 1L, 2L, 1L),
r2.3 = c(4L, 3L, 5L, 5L, 3L, 3L, 5L), r2.4 = c(3L, 3L, 2L, 3L, 3L, 3L, 2L)),
row.names = c(1L, 2L, 3L, 4L, 5L, 6L, 7L), class = "data.frame")
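One way to avoid an explicit loop is to compare whole blocks of columns as matrices and then average per participant. The following is a minimal base R sketch, assuming that "rate of agreement" means the share of exact matches between the participant's rating and the rater's rating; the names `per_sentence` and `per_participant` are illustrative, and only the overall (not per-dimension) columns are computed.

```r
# Example data from the question (Ans column omitted)
df <- data.frame(
  ID   = c(1L, 2L, 2L, 3L, 4L, 4L, 5L),
  d1   = c(3, 3, 4, 3, 2, 3, 3), d2   = c(3, 3, 4, 3, 3, 3, 1),
  d3   = c(3, 3, 4, 3, 3, 3, 1), d4   = c(3, 3, 4, 3, 3, 2, 3),
  r1.1 = c(3, 3, 3, 3, 3, 3, 3), r1.2 = c(2, 1, 2, 1, 1, 2, 3),
  r1.3 = c(4, 2, 5, 3, 3, 5, 3), r1.4 = c(3, 3, 1, 3, 2, 3, 2),
  r2.1 = c(3, 3, 3, 3, 3, 4, 3), r2.2 = c(2, 1, 1, 1, 1, 2, 1),
  r2.3 = c(4, 3, 5, 5, 3, 3, 5), r2.4 = c(3, 3, 2, 3, 3, 3, 2)
)

# One matrix per rating source, columns in the same dimension order
d  <- as.matrix(df[paste0("d", 1:4)])
r1 <- as.matrix(df[paste0("r1.", 1:4)])
r2 <- as.matrix(df[paste0("r2.", 1:4)])

# Per sentence: share of matching dimensions and mean rating difference
per_sentence <- data.frame(
  ID    = df$ID,
  par1  = rowMeans(d == r1),
  par2  = rowMeans(d == r2),
  mdrp1 = rowMeans(d - r1),
  mdrp2 = rowMeans(d - r2)
)

# Per participant: average over that participant's sentences
per_participant <- aggregate(. ~ ID, data = per_sentence, FUN = mean)
per_participant
```

Per-dimension columns could be added in the same way by comparing single columns (for example `df$d1 == df$r1.1`) before aggregating.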

Get sum of unique rows in table function in R

Suppose I have data which looks like this
Id Name Price sales Profit Month Category Mode Supplier
1 A 2 5 8 1 X K John
1 A 2 6 9 2 X K John
1 A 2 5 8 3 X K John
2 B 2 4 6 1 X L Sam
2 B 2 3 4 2 X L Sam
2 B 2 5 7 3 X L Sam
3 C 2 5 11 1 X M John
3 C 2 5 11 2 X L John
3 C 2 5 11 3 X K John
4 D 2 8 10 1 Y M John
4 D 2 8 10 2 Y K John
4 D 2 5 7 3 Y K John
5 E 2 5 9 1 Y M Sam
5 E 2 5 9 2 Y L Sam
5 E 2 5 9 3 Y M Sam
6 F 2 4 7 1 Z M Kyle
6 F 2 5 8 2 Z L Kyle
6 F 2 5 8 3 Z M Kyle
If I apply the table function, it simply counts all the rows, and the result will be
K L M
X 4 4 1
Y 2 1 3
Z 0 1 2
Now, what if I don't want the count of all rows, but only one count per unique Id within each Category/Mode combination,
so that it looks like
K L M
X 2 2 1
Y 1 1 2
Z 0 1 1
Thanks

If df is your data.frame:
# Subset original data.frame to keep columns of interest
df1 <- df[,c("Id", "Category", "Mode")]
# Remove duplicated rows
df1 <- df1[!duplicated(df1),]
# Create table
with(df1, table(Category, Mode))
# Mode
# Category K L M
# X 2 2 1
# Y 1 1 2
# Z 0 1 1
Or in one line using unique
table(unique(df[c("Id", "Category", "Mode")])[-1])
df <- structure(list(Id = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L,
4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L), Name = structure(c(1L, 1L, 1L,
2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L), .Label = c("A",
"B", "C", "D", "E", "F"), class = "factor"), Price = c(2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L
), sales = c(5L, 6L, 5L, 4L, 3L, 5L, 5L, 5L, 5L, 8L, 8L, 5L,
5L, 5L, 5L, 4L, 5L, 5L), Profit = c(8L, 9L, 8L, 6L, 4L, 7L, 11L,
11L, 11L, 10L, 10L, 7L, 9L, 9L, 9L, 7L, 8L, 8L), Month = c(1L,
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L,
3L), Category = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L), .Label = c("X", "Y", "Z"
), class = "factor"), Mode = structure(c(1L, 1L, 1L, 2L, 2L,
2L, 3L, 2L, 1L, 3L, 1L, 1L, 3L, 2L, 3L, 3L, 2L, 3L), .Label = c("K",
"L", "M"), class = "factor"), Supplier = structure(c(1L, 1L,
1L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 2L, 2L, 2L
), .Label = c("John", "Kyle", "Sam"), class = "factor")), .Names = c("Id",
"Name", "Price", "sales", "Profit", "Month", "Category", "Mode",
"Supplier"), class = "data.frame", row.names = c(NA, -18L))

We can try
library(data.table)
dcast(unique(setDT(df1[c('Category', 'Mode', 'Id')])),
Category~Mode, value.var='Id', length)
# Category K L M
#1: X 2 2 1
#2: Y 1 1 2
#3: Z 0 1 1
Or with dplyr
library(dplyr)
library(tidyr) # spread() comes from tidyr
df1 %>%
distinct(Id, Category, Mode) %>%
group_by(Category, Mode) %>%
tally() %>%
spread(Mode, n, fill=0)
# Category K L M
# (chr) (dbl) (dbl) (dbl)
#1 X 2 2 1
#2 Y 1 1 2
#3 Z 0 1 1
Or, as @David Arenburg suggested, a variant of the above is
df1 %>%
distinct(Id, Category, Mode) %>%
select(Category, Mode) %>%
table()

R- group by 2 factor variables to calculate quartile

I have a data set which has number of records collected [reccount] by [hourtime] and [Feedcodes]. What I'm trying to do is to create a column, that tells me which quartile each record falls into (probs=0:4/4), so that I can set up an alert if anything falls below 1st or the 2nd quartile and I can investigate the feed to see if something is out of the ordinary.
I tried first with this, but realized it wasn't grouping by hourtime and feedcode
df<-within(ds, quartile<-as.integer(cut(ds$reccount,quantile(ds$reccount,probs=0:4/4),include.lowest=TRUE)))
Tried this but still it's not returning what I'm expecting
as<-ddply(ds,.(as.factor(ds$hourtime),ds$FeedCode) , function(df)quantile(ds$reccount,probs=0:4/4))
I just need to add a column that classifies it as which quartile.
Here's the data:
dput(head(dss,30))
structure(list(rownames = c(2371L, 2428L, 2459L, 2493L, 2573L,
2581L, 2606L, 2633L, 2668L, 2683L, 2693L, 2748L, 2756L, 2819L,
2865L, 2889L, 2896L, 2970L, 2988L, 3005L, 3047L, 3067L, 3111L,
3132L, 3154L, 3177L, 3209L, 3241L, 3272L), hourtime = c(3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), reccount = c(2864L,
3492L, 968L, 3271L, 6078L, 767L, 1365L, 6222L, 2515L, 3986L,
4327L, 5764L, 3676L, 5338L, 6407L, 1217L, 3058L, 5673L, 3569L,
3391L, 3169L, 6446L, 4201L, 884L, 3529L, 6461L, 3414L, 3246L,
5486L), FeedCode = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L), .Label = "MDSWJD", class = "factor"), quartile = c(4L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L)), .Names = c("rownames",
"hourtime", "reccount", "FeedCode", "quartile"), row.names = c(NA,
29L), class = "data.frame")

You can use ave() to run the cut/quantile by grouping variables:
dss$quartile <- with(dss,
  ave(reccount, hourtime, FeedCode,
      # bin each value by the quartile breaks of its own group
      FUN = function(x) .bincode(x, quantile(x), right = TRUE, include.lowest = TRUE)
  )
)
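To see what the ave() call does, here is a small sketch on made-up data (`toy` is hypothetical, with two hourtime groups and a single FeedCode). Each value is binned against the quartile breaks of its own group only.

```r
# Hypothetical data: two hours, one feed, eight records per group
toy <- data.frame(
  hourtime = rep(c(1, 2), each = 8),
  FeedCode = "A",
  reccount = c(1:8, seq(10, 80, by = 10))
)

# quantile(x) gives the 0/25/50/75/100% breaks within each group;
# .bincode() returns the index of the interval each value falls into
toy$quartile <- with(toy,
  ave(reccount, hourtime, FeedCode,
      FUN = function(x) .bincode(x, quantile(x), right = TRUE, include.lowest = TRUE))
)
toy
```

Although the second group's counts are ten times larger, both groups get the same quartile labels, because the breaks are computed per group.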

I was a little confused by your mention of quartiles alongside five values {0,1,2,3,4}.
I may be missing something, but here is a dplyr approach.
The first snippet calculates the 25% quantile (Q25) for each {hourtime, FeedCode} group and flags everything below it. The second splits reccount into 4 groups (quartiles) within each {hourtime, FeedCode} group and assigns the group number {1 to 4}.
Run the code step by step and let me know if you spot mistakes.
library(dplyr)
# example dataset
dt = data.frame(hourtime = c(1,1,1,1,1,2,2,2,2,2),
FeedCode = c("A","B","A","B","A","B","A","B","A","B"),
reccount = c(946,184,1404,937,137,1199,698,1311,1302,560))
dt %>%
group_by(hourtime, FeedCode) %>%
mutate(Q25 = quantile(reccount,0.25),
FlagBelowQ25 = ifelse(reccount < Q25, 1, 0)) %>%
ungroup
# hourtime FeedCode reccount Q25 FlagBelowQ25
# 1 1 A 946 541.50 0
# 2 1 B 184 372.25 1
# 3 1 A 1404 541.50 0
# 4 1 B 937 372.25 0
# 5 1 A 137 541.50 1
# 6 2 B 1199 879.50 0
# 7 2 A 698 849.00 1
# 8 2 B 1311 879.50 0
# 9 2 A 1302 849.00 0
# 10 2 B 560 879.50 1
dt %>%
group_by(hourtime, FeedCode) %>%
mutate(Quartile = ntile(reccount,4)) %>%
ungroup
# hourtime FeedCode reccount Quartile
# 1 1 A 946 2
# 2 1 B 184 1
# 3 1 A 1404 3
# 4 1 B 937 3
# 5 1 A 137 1
# 6 2 B 1199 2
# 7 2 A 698 1
# 8 2 B 1311 3
# 9 2 A 1302 3
# 10 2 B 560 1

ddply summarise proportional count

I am having some trouble using the ddply function from the plyr package. I am trying to summarise the following data with counts and proportions within each group. Here's my data:
structure(list(X5employf = structure(c(1L, 3L, 1L, 1L, 1L, 3L,
1L, 1L, 1L, 3L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 1L, 2L, 2L, 2L,
2L, 2L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 1L, 1L, 3L, 1L,
3L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L,
3L, 3L, 1L), .Label = c("increase", "decrease", "same"), class = "factor"),
X5employff = structure(c(2L, 6L, NA, 2L, 4L, 6L, 5L, 2L,
2L, 8L, 2L, 2L, 2L, 7L, 7L, 8L, 11L, 7L, 2L, 8L, 8L, 11L,
7L, 6L, 2L, 5L, 2L, 8L, 7L, 7L, 7L, 8L, 6L, 7L, 5L, 5L, 7L,
2L, 6L, 7L, 2L, 2L, 2L, 2L, 2L, 5L, 5L, 5L, 2L, 5L, 2L, 2L,
2L, 5L, 12L, 2L, 2L, 2L, 2L, 5L, 5L, 5L, 5L, 2L, 5L, 2L,
13L, 9L, 9L, 9L, 7L, 8L, 5L), .Label = c("", "1", "1 and 8",
"2", "3", "4", "5", "6", "6 and 7", "6 and 7 ", "7", "8",
"1 and 8"), class = "factor")), .Names = c("X5employf", "X5employff"
), row.names = c(NA, 73L), class = "data.frame")
And here's my call using ddply:
ddply(kano_final, .(X5employf, X5employff), summarise, n=length(X5employff), prop=(n/sum(n))*100)
This gives me the counts of each instance of X5employff correctly, but it seems as though the proportion is being calculated within each row and not within each level of the factor X5employf, as follows:
X5employf X5employff n prop
1 increase 1 26 100
2 increase 2 1 100
3 increase 3 15 100
4 increase 1 and 8 1 100
5 increase <NA> 1 100
6 decrease 4 1 100
7 decrease 5 5 100
8 decrease 6 2 100
9 decrease 7 1 100
10 decrease 8 1 100
11 same 4 4 100
12 same 5 6 100
13 same 6 5 100
14 same 6 and 7 3 100
15 same 7 1 100
When manually calculating the proportions within each group I get this:
X5employf X5employff n prop
1 increase 1 26 59.09
2 increase 2 1 2.27
3 increase 3 15 34.09
4 increase 1 and 8 1 2.27
5 increase <NA> 1 2.27
6 decrease 4 1 10.00
7 decrease 5 5 50.00
8 decrease 6 2 20.00
9 decrease 7 1 10.00
10 decrease 8 1 10.00
11 same 4 4 21.05
12 same 5 6 31.57
13 same 6 5 26.31
14 same 6 and 7 3 15.78
15 same 7 1 5.26
As you can see the sum of proportions in each level of factor X5employf equals 100.
I know this is probably ridiculously simple, but I can't seem to get my head around it despite reading all sorts of similar posts. Can anyone help with this and my understanding of how the summarise function works?!
Many, many thanks
Marty

You cannot do it in one ddply call because what gets passed to each summarize call is a subset of your data for a specific combination of your group variables. At this lowest level, you do not have access to that intermediate level sum(n). Instead, do it in two steps:
kano_final <- ddply(kano_final, .(X5employf), transform,
sum.n = length(X5employf))
ddply(kano_final, .(X5employf, X5employff), summarise,
n = length(X5employff), prop = n / sum.n[1] * 100)
Edit: using a single ddply call and using table as you hinted towards:
ddply(kano_final, .(X5employf), summarise,
n = Filter(function(x) x > 0, table(X5employff, useNA = "ifany")),
prop = 100* prop.table(n),
X5employff = names(n))

Here is a dplyr example that does this in one step, with short and easy-to-read syntax.
d is your data.frame:
library(dplyr)
d %>%
group_by(X5employf, X5employff) %>%
summarise(n = length(X5employff)) %>%
# after summarise() the result is still grouped by X5employf
mutate(ngr = sum(n)) %>%
mutate(prop = n/ngr*100)
will result in
Source: local data frame [15 x 5]
Groups: X5employf
X5employf X5employff n ngr prop
1 increase 1 26 44 59.090909
2 increase 2 1 44 2.272727
3 increase 3 15 44 34.090909
4 increase 1 and 8 1 44 2.272727
5 increase NA 1 44 2.272727
6 decrease 4 1 10 10.000000
7 decrease 5 5 10 50.000000
8 decrease 6 2 10 20.000000
9 decrease 7 1 10 10.000000
10 decrease 8 1 10 10.000000
11 same 4 4 19 21.052632
12 same 5 6 19 31.578947
13 same 6 5 19 26.315789
14 same 6 and 7 3 19 15.789474
15 same 7 1 19 5.263158

What you apparently want to do is to find out the proportions of X5employff for every value of X5employf. However, you don't tell ddply that X5employf and X5employff are different; to ddply, these two variables are just two variables to split up the data. Also, since there is one observation per line, i.e. count = 1 for every line of the data, the length of each (X5employf, X5employff) combination equals the sum of each (X5employf, X5employff) combination.
The simplest "plyr way" to solve your problem that I can think of is the following:
result <- ddply(kano_final, .(X5employf, X5employff), summarise, n=length(X5employff), .drop=FALSE)
n <- result$n
n2 <- ddply(kano_final, .(X5employf), summarise, n=length(X5employff))$n
result <- data.frame(result, prop=n/rep(n2, each=13)*100)
You can also use good old xtabs:
a <- xtabs(~X5employf + X5employff, kano_final)
b <- xtabs(~X5employf, kano_final)
a/matrix(b, nrow=3, ncol=ncol(a))
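The within-group proportion can also be sketched in base R with table() and ave(), which avoids hard-coding the number of factor levels. The data frame `kano` below is a small made-up stand-in for kano_final, used only for illustration.

```r
# Hypothetical stand-in for kano_final
kano <- data.frame(
  X5employf  = c("increase", "increase", "increase", "decrease", "decrease"),
  X5employff = c("1", "1", "3", "5", "6")
)

# Count every combination, then keep only the ones that occur
counts <- as.data.frame(table(kano), stringsAsFactors = FALSE)
counts <- counts[counts$Freq > 0, ]

# ave() returns the group total for every row, so the division is per group
counts$prop <- 100 * counts$Freq / ave(counts$Freq, counts$X5employf, FUN = sum)
counts
```

Here the proportions add up to 100 within each level of X5employf, which is the behaviour the question asks for.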
