summarize categorical data based on grouping


I have a data frame in the following form
Id <- c(101,102,103,101,103,103,102,101,103,102)
Service <- c('A','B','A','C','A','A','B','C','A','B')
Type <- c('C','I','C','I','C','C','C','I','I','C')
Channel <- c('ATM1','ATM2','ATM1','Teller','Teller','ATM2','ATM1','ATM1','ATM2','Teller')
amount <- c(11,34,56,37,65,83,26,94,34,55)
df <- data.frame(Id,Service,Channel,Type,amount)
df in tabular format:
Id Service Channel Type amount
101 A ATM1 C 11
102 B ATM2 I 34
103 A ATM1 C 56
101 C Teller I 37
103 A Teller C 65
103 A ATM2 C 83
102 B ATM1 C 26
101 C ATM1 I 94
103 A ATM2 I 34
102 B Teller C 55
I am able to summarize my data using the amount column:
df %>% group_by(Id) %>% summarise(total = sum(amount)) %>% as.data.frame
Id total
101 142
102 115
103 238
How can I summarize the data in a similar way using the categorical columns (Service/Type/Channel) and group_by(Id)? I know we can use table() here, but I am trying to create a data frame that I can use for further analysis, such as clustering.

One way to restructure the categorical variables so that they can be summarized by Id is to create dummy coded variables, where 1 means presence and 0 means absence. Summing those variables by Id then gives the count of each category (e.g. the number of times ATM1 was used) for each Id.
We use the dummies package to create the dummy coded variables.
Id <- c(101,102,103,101,103,103,102,101,103,102)
Service <- c('A','B','A','C','A','A','B','C','A','B')
Type <- c('C','I','C','I','C','C','C','I','I','C')
Channel <- c('ATM1','ATM2','ATM1','Teller','Teller','ATM2','ATM1','ATM1','ATM2','Teller')
amount <- c(11,34,56,37,65,83,26,94,34,55)
df <- data.frame(Id,Service,Channel,Type,amount)
library(dummies)
df <- dummy.data.frame(df,names=c("Service","Type","Channel"))
aggregate(. ~ Id,data=df,"sum")
...and the output:
> aggregate(. ~ Id,data=df,"sum")
Id ServiceA ServiceB ServiceC ChannelATM1 ChannelATM2 ChannelTeller TypeC TypeI amount
1 101 1 0 2 2 0 1 1 2 142
2 102 0 3 0 1 1 1 2 1 115
3 103 4 0 0 1 2 1 3 1 238
We interpret the results as follows.
Id 101 used Service A once, Service C twice, ATM1 twice, a Teller once, Type C once, and Type I twice, for a total amount of 142.
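If the dummies package is not available, the same per-Id counts can probably be built with base R alone. The following is only a sketch of that idea, not the answer's method: it cross-tabulates Id against each categorical column with table() and binds the pieces together. The names counts, totals and result are made up for illustration.

```r
Id <- c(101,102,103,101,103,103,102,101,103,102)
Service <- c('A','B','A','C','A','A','B','C','A','B')
Type <- c('C','I','C','I','C','C','C','I','I','C')
Channel <- c('ATM1','ATM2','ATM1','Teller','Teller','ATM2','ATM1','ATM1','ATM2','Teller')
amount <- c(11,34,56,37,65,83,26,94,34,55)
df <- data.frame(Id, Service, Channel, Type, amount)

# One count table per categorical column; columns are prefixed with the
# column name so they look like the dummy-coded names (ServiceA, TypeC, ...)
counts <- do.call(cbind, lapply(c("Service", "Channel", "Type"), function(col) {
  m <- as.data.frame.matrix(table(df$Id, df[[col]]))
  names(m) <- paste0(col, names(m))
  m
}))

# Total amount per Id, then line the counts up by Id (row names of counts)
totals <- aggregate(amount ~ Id, data = df, FUN = sum)
result <- cbind(totals["Id"], counts[as.character(totals$Id), ], total = totals$amount)
rownames(result) <- NULL
result
```

The result is an ordinary data frame with one row per Id, so it can be passed on to further analysis in the same way as the aggregate() output.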

Related

How to sort a data frame by column?

I want to sort a data frame by the values of a column (the first column, called Initial). My data frame, which I called t2, is:
Initial Final Changes
1 1 200
1 3 500
3 1 250
24 25 175
21 25 180
1 5 265
3 3 147
I am trying with code:
t2 <- t2[order(t2$Initial, t2$Final, decreasing=False),]
But, the result is of the type:
Initial Final Changes
3 1 250
3 3 147
21 25 180
24 25 175
1 5 265
1 1 200
1 3 500
And when I try with code:
t2 <- t2[order(t2$Initial, t2$Final, decreasing=TRUE),]
The result is:
Initial Final Changes
1 5 265
1 1 200
1 3 500
24 25 175
21 25 180
3 1 250
3 3 147
I don't understand what is happening.
Can you help me, please?

It is possible that the columns are factors. In that case, convert them to numeric and the sort should work:
library(dplyr)
t2 %>%
arrange_at(1:2, ~ desc(as.numeric(as.character(.))))
Or with base R
t2[1:2] <- lapply(t2[1:2], function(x) as.numeric(as.character(x)))
t2[do.call(order, c(t2[1:2], decreasing = TRUE)), ]
The OP's code should work as well once the columns are numeric.
Note that the first attempt used decreasing = False (possibly a typo). R's logical constants are upper case, so it has to be FALSE:
t2[order(t2$Initial, t2$Final, decreasing=FALSE),]
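To see why a factor column can sort in a surprising order, here is a small sketch. It assumes Initial was read in as a factor; the level order used below is hypothetical, chosen only to mimic the kind of output shown in the question.

```r
# Assumption: Initial is a factor whose levels are NOT in numeric order
t2 <- data.frame(
  Initial = factor(c(1, 1, 3, 24, 21, 1, 3), levels = c(3, 21, 24, 1)),
  Final   = c(1, 3, 1, 25, 25, 5, 3),
  Changes = c(200, 500, 250, 175, 180, 265, 147)
)

# order() on a factor sorts by the internal level codes, not by the labels
wrong <- t2[order(t2$Initial, t2$Final), ]

# Convert via character so the labels, not the codes, become the numbers
t2$Initial <- as.numeric(as.character(t2$Initial))
sorted <- t2[order(t2$Initial, t2$Final), ]
sorted
```

Calling as.numeric() directly on the factor would return the level codes instead of the original values, which is why the conversion goes through as.character() first.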

Counting Attempts of an event in R

I'm relatively new to R and still learning. I have the following data frame, called data:
ID grade Test_Date
1 56 01-25-2012
1 63 02-21-2016
1 73 02-31-2016
2 41 12-23-2015
2 76 01-07-2016
3 66 02-08-2016
I am looking to count the number of people (in this case only two unique individuals) who passed their tests after multiple attempts (passing is defined as 65 or over). So the final product would be a list of unique IDs that needed multiple attempts before their test score reached 65. This would tell me that approx. 66% of the clients in this data frame require multiple test sessions before getting a passing grade.
Below is my idea or concept more or less, I've framed it as an if statement
If ID appears twice
count how often it appears, until TEST GRADE >= 65
ifelse(duplicated(data$ID), count(ID), NA)
I'm struggling with the second piece where I want to say, count the occurrence of ID until grade >=65.
The other option I see is some sort of loop. Below is my attempt
for (i in data$ID) {
duplicated(data$ID)
count(data$ID)
# here is where something would say "until grade >= 65"
}
Again the struggle comes in how to tell R to stop counting when grade hits 65.
Appreciate the help!

You can use data.table:
library(data.table)
dt <- fread(" ID grade Test_Date
1 56 01-25-2012
1 63 02-21-2016
1 73 02-31-2016
2 41 12-23-2015
2 76 01-07-2016
3 66 02-08-2016")
# count the number of tries per ID, then keep only the rows with a passing grade
dt <- dt[, N:=.N, by=ID][grade>=65]
# proportion of successful having tried more than once
length(dt[N>1]$ID)/length(dt$ID)
[1] 0.6666667

Another option, though the other two work just fine:
library(dplyr)
dat2 <- dat %>%
group_by(ID) %>%
summarize(
multiattempts = n() > 1 & any(grade < 65),
maxgrade = max(grade)
)
dat2
# Source: local data frame [3 x 3]
# ID multiattempts maxgrade
# <int> <lgl> <int>
# 1 1 TRUE 73
# 2 2 TRUE 76
# 3 3 FALSE 66
sum(dat2$multiattempts) / nrow(dat2)
# [1] 0.6666667

Here is a method using the aggregate function and subsetting. It returns the maximum score for testers who took the test more than once, counting from their second test.
multiTestMax <- aggregate(grade~ID, data=df[duplicated(df$ID),], FUN=max)
multiTestMax
ID grade
1 1 73
2 2 76
To get the number of rows, you can use nrow:
nrow(multiTestMax)
2
or the proportion of all test takers
nrow(multiTestMax) / length(unique(df$ID))
data
df <- read.table(header=T, text="ID grade Test_Date
1 56 01-25-2012
1 63 02-21-2016
1 73 02-31-2016
2 41 12-23-2015
2 76 01-07-2016
3 66 02-08-2016")

Add a row of zeros in a data frame created with ddply if there are no observations

I used the function ddply (package plyr) to calculate the mean of a response variable for each group "Trial" and "Treatment". I get this data frame:
Trial Treatment N Mean
1 A 458 125.258
1 B 459 168.748
2 A 742 214.266
2 B 142 475.786
3 A 247 145.689
3 B 968 234.129
4 A 436 456.287
This data frame shows that for trial 4 and treatment B there are no observations of the response variable (no such row appears in the data frame). So, is it possible to automatically add a row of zeros to the data frame (built with the function "ddply") when there are no observations for a given combination?
I would like to get this data frame:
Trial Treatment N Mean
1 A 458 125.258
1 B 459 168.748
2 A 742 214.266
2 B 142 475.786
3 A 247 145.689
3 B 968 234.129
4 A 436 456.287
4 B 0 0

We can merge the original dataset with another data.frame built from the full combination of unique values in 'Trial' and 'Treatment'. The missing combinations come out filled with NA. If needed, these can be changed to 0 (although it is usually better to leave a missing combination as NA).
res <- merge(expand.grid(lapply(df1[1:2], unique)), df1, all.x=TRUE)
res[is.na(res)] <- 0
Or with dplyr/tidyr, we can use complete (from tidyr)
library(dplyr)
library(tidyr)
df1 %>%
complete(Trial, Treatment, fill= list(N=0, Mean=0))
# Trial Treatment N Mean
# (int) (chr) (dbl) (dbl)
#1 1 A 458 125.258
#2 1 B 459 168.748
#3 2 A 742 214.266
#4 2 B 142 475.786
#5 3 A 247 145.689
#6 3 B 968 234.129
#7 4 A 436 456.287
#8 4 B 0 0.000
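For completeness, the base R merge approach can be run end to end as follows. This is a sketch that rebuilds df1 by hand from the table in the question; the name grid is made up for illustration.

```r
# df1 as printed in the question (typed in by hand for this example)
df1 <- data.frame(
  Trial = c(1, 1, 2, 2, 3, 3, 4),
  Treatment = c("A", "B", "A", "B", "A", "B", "A"),
  N = c(458, 459, 742, 142, 247, 968, 436),
  Mean = c(125.258, 168.748, 214.266, 475.786, 145.689, 234.129, 456.287),
  stringsAsFactors = FALSE
)

# Every Trial x Treatment combination, whether observed or not
grid <- expand.grid(Trial = unique(df1$Trial),
                    Treatment = unique(df1$Treatment),
                    stringsAsFactors = FALSE)

# Left join onto the grid: unobserved combinations get NA in N and Mean
res <- merge(grid, df1, all.x = TRUE)
res[is.na(res)] <- 0   # optional: replace the NAs with zeros
res <- res[order(res$Trial, res$Treatment), ]
res
```

Whether a zero mean is meaningful for a group with no observations is a separate question, so leaving NA in the Mean column may be the safer choice.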

R - Create a new variable where each observation depends on another table and other variables in the data frame

I have the two following tables:
df <- data.frame(eth = c("A","B","B","A","C"),ZIP1 = c(1,1,2,3,5))
Inc <- data.frame(ZIP2 = c(1,2,3,4,5,6,7),A = c(56,98,43,4,90,19,59), B = c(49,10,69,30,10,4,95),C = c(69,2,59,8,17,84,30))
df:
eth ZIP1
A 1
B 1
B 2
A 3
C 5
Inc:
ZIP2 A B C
1 56 49 69
2 98 10 2
3 43 69 59
4 4 30 8
5 90 10 17
6 19 4 84
7 59 95 30
I would like to create a variable Inc in the df data frame where, for each observation, the value is the entry of the Inc table at the intersection of that observation's ZIP (row) and eth (column). In my example, it would lead to:
eth ZIP1 Inc
A 1 56
B 1 49
B 2 10
A 3 43
C 5 17
A loop or some other brute-force approach could solve it, but that takes time on my dataset, so I'm looking for a more elegant way, maybe using data.table. It seems to me that this is a very standard question, and I apologize if it is. My inability to formulate a precise title for this problem (as you may have noticed) is maybe why I haven't found any similar question when searching the forum.
Thanks !

Sure, it can be done in data.table:
library(data.table)
setDT(df)
setDT(Inc)
df[ melt(Inc, id.vars="ZIP2", variable.name="eth", value.name="Inc"),
Inc := i.Inc
, on=c(ZIP1 = "ZIP2","eth") ]
The syntax for this "merge-assign" operation is X[i, Xcol := expression, on=merge_cols].
You can run the i = melt(Inc, id.vars="ZIP2", variable.name="eth", value.name="Inc") part on its own to see how it works. Inside the merge, columns from i can be referred to with i.* prefixes.
Alternately...
setDT(df)
setDT(Inc)
df[, Inc := Inc[.(ZIP1), eth, on="ZIP2", with=FALSE], by=eth]
This is built on a similar idea. The package vignettes are a good place to start for this sort of syntax.

We can use row/column indexing
df$Inc <- Inc[cbind(match(df$ZIP1, Inc$ZIP2), match(df$eth, colnames(Inc)))]
df
# eth ZIP1 Inc
#1 A 1 56
#2 B 1 49
#3 B 2 10
#4 A 3 43
#5 C 5 17
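The row/column indexing above can be unpacked step by step. The sketch below assumes Inc is still a plain data.frame with only numeric columns; the names rows, cols and vals are made up for illustration.

```r
df <- data.frame(eth = c("A","B","B","A","C"), ZIP1 = c(1,1,2,3,5),
                 stringsAsFactors = FALSE)
Inc <- data.frame(ZIP2 = c(1,2,3,4,5,6,7), A = c(56,98,43,4,90,19,59),
                  B = c(49,10,69,30,10,4,95), C = c(69,2,59,8,17,84,30))

rows <- match(df$ZIP1, Inc$ZIP2)       # row of Inc holding each ZIP
cols <- match(df$eth, colnames(Inc))   # column of Inc named after each eth

# A two-column index matrix picks one cell per (row, column) pair
vals <- as.matrix(Inc)[cbind(rows, cols)]
vals
```

Because the rows are found with match() on ZIP2 rather than by position, this should keep working when the ZIP codes are not simply 1, 2, 3 and so on.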

What about this?
library(reshape2)
merge(df, melt(Inc, id="ZIP2"), by.x = c("ZIP1", "eth"), by.y = c("ZIP2", "variable"))
ZIP1 eth value
1 1 A 56
2 1 B 49
3 2 B 10
4 3 A 43
5 5 C 17

Another option:
library(dplyr)
library(tidyr)
Inc %>%
gather(eth, value, -ZIP2) %>%
left_join(df, ., by = c("eth", "ZIP1" = "ZIP2"))

My solution (which may seem awkward):
for (i in seq_len(nrow(df))) {
# look up the row by its ZIP2 value rather than by row position
df$Inc[i] <- Inc[match(df$ZIP1[i], Inc$ZIP2), as.character(df$eth[i])]
}

Printing only certain panels in R lattice

I am plotting a quantile-quantile plot for a certain data that I have. I would like to print only certain panels that satisfy a condition that I put in for panel.qq(x,y,...).
Let me give you an example. The following is my code,
qq(y ~ x|cond,data=test.df,panel=function(x,y,subscripts,...){
if(length(unique(test.df[subscripts,2])) > 3 ){panel.qq(x,y,subscripts,...)}})
Here y is the factor and x is the variable whose quantiles are plotted on the x and y axes; cond is the conditioning variable. What I would like is to print only those panels that pass the condition in the panel function, which is
if(length(unique(test.df[subscripts,2])) > 3).
I hope this information helps. Thanks in advance.
Added Sample data,
y x cond
1 1 6 125
2 2 5 125
3 1 5 125
4 2 6 125
5 1 3 125
6 2 8 125
7 1 8 125
8 2 3 125
9 1 5 125
10 2 6 125
11 1 5 124
12 2 6 124
13 1 6 124
14 2 5 124
15 1 5 124
16 2 6 124
17 1 4 124
18 2 7 124
19 1 0 123
20 2 11 123
21 1 0 123
22 2 11 123
23 1 0 123
24 2 11 123
25 1 0 123
26 2 11 123
27 1 0 123
28 2 2 123
So this is the sample data. What I would like is to not have a panel for 123, as the number of unique x values for 123 is 3, while for the others it is 4. Thanks again.

Yeah, I think it is a subset problem, not a lattice one. You don't include an example, but it looks like you want to keep only rows where there are more than 3 rows for each value of whatever is in column 2 of your data frame. If so, here is a data.table solution.
library(data.table)
test.dt <- as.data.table(test.df)
test.dt.subset <- test.dt[,N:=.N,by=c2][N>3]
Where c2 is that variable in the second column. The last line of code first adds a variable, N, for the count of rows (.N) for each value of c2, then subsets for N>3.
UPDATE: And since a data table is also a data frame, you can use test.dt.subset directly as the data source in the call to qq (or other lattice function).
UPDATE 2: Here is one way to do the same thing without data.table:
d <- data.frame(x=1:15,y=1:15%%2, # example data frame
c2=c(1,2,2,3,3,3,4,4,4,4,5,5,5,5,5))
d$N <- 1 # create a column for count
split(d$N,d$c2) <- lapply(split(d$x,d$c2),length) # populate with count
d
d[d$N>3,] # subset

I did something very similar to DaveTurek.
My sample data frame above is test.df.
test.df.list <- split(test.df,test.df$cond,drop=FALSE)
final.test.df <- do.call("rbind",lapply(test.df.list,function(r){
if(length(unique(r$x)) > 3){r}}))
Here I split test.df into a list of data frames by the conditioning variable. Then, inside lapply, I check the number of unique x values in each subset. If that number is greater than 3, the subset is returned; otherwise it is dropped. Finally, do.call binds the remaining subsets back into one data frame on which to run the quantile-quantile plot.
In case anyone wants to see the qq call once the data has been filtered, it is:
trellis.device(postscript,file="test.ps",color=F,horizontal=T,paper='legal')
qq(y ~ x|cond,data=final.test.df,layout=c(1,1),pch=".",cex=3)
dev.off()
Hope this helps.
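The same filter can probably be written without splitting and re-binding, using ave() from base R. This is a sketch on the sample data from the question, typed in by hand; the names n_unique and kept are made up for illustration.

```r
# Sample data from the question
test.df <- data.frame(
  y = rep(1:2, 14),
  x = c(6,5,5,6,3,8,8,3,5,6,    # cond 125
        5,6,6,5,5,6,4,7,        # cond 124
        0,11,0,11,0,11,0,11,0,2), # cond 123
  cond = rep(c(125, 124, 123), times = c(10, 8, 10))
)

# For every row, the number of distinct x values within its cond group
n_unique <- ave(test.df$x, test.df$cond, FUN = function(v) length(unique(v)))

# Keep only the groups with more than 3 distinct x values
kept <- test.df[n_unique > 3, ]
unique(kept$cond)
```

The filtered data frame kept can then be passed as the data argument of qq(), so no panel is drawn for cond 123.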
