Counting unique items in data frame - r

I want a simple count of the number of subjects in each condition of a study. The data look something like this:
subjectid cond obser variable
     1234    1     1       12
     1234    1     2       14
     2143    2     1       19
     3456    1     1       12
     3456    1     2       14
     3456    1     3       13
      ...  ...   ...      ...
This is a large dataset and it is not always obvious how many unique subjects contribute to each condition, etc.
I have this in a data.frame.
What I want is something like
cond ofSs
1 122
2 98
Where for each "condition" I get a count of the number of unique Ss contributing data to that condition. Seems like this should be painfully simple.

Use the ddply function from the plyr package:
require(plyr)
df <- data.frame(subjectid = sample(1:3,7,T),
cond = sample(1:2,7,T), obser = sample(1:7))
> ddply(df, .(cond), summarize, NumSubs = length(unique(subjectid)))
cond NumSubs
1 1 1
2 2 2
The ddply function "splits" the data-frame by the cond variable, and produces a summary column NumSubs for each sub-data-frame.
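The same split-and-summarise can also be written with dplyr's group_by and n_distinct. A sketch, using a deterministic df built from the question's snippet rather than the sampled one above:

```r
library(dplyr)

# the subject/condition pairs from the question's snippet
df <- data.frame(subjectid = c(1234, 1234, 2143, 3456, 3456, 3456),
                 cond      = c(1, 1, 2, 1, 1, 1))

# one row per condition, counting distinct subject ids
df %>%
  group_by(cond) %>%
  summarise(ofSs = n_distinct(subjectid))
# cond 1 has 2 unique subjects, cond 2 has 1
```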

Using your snippet of data that I loaded into object dat:
> dat
  subjectid cond obser variable
1      1234    1     1       12
2      1234    1     2       14
3      2143    2     1       19
4      3456    1     1       12
5      3456    1     2       14
6      3456    1     3       13
Then one way to do this is to use aggregate to count the unique subjectid values (assuming that is what you meant by "Ss"):
> aggregate(subjectid ~ cond, data = dat, FUN = function(x) length(unique(x)))
cond subjectid
1 1 2
2 2 1

or, if you like SQL and don't mind installing a package:
library(sqldf)
sqldf("select cond, count(distinct subjectid) as ofSs from dat group by cond")

Just to give you even more choice, you could also use tapply:
tapply(dat$subjectid, dat$cond, function(x) length(unique(x)))
1 2
2 1
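For completeness, base R can also do this with nothing but table: cross-tabulate cond against subjectid, then count the non-zero cells in each row. A sketch, rebuilding dat from the question's snippet:

```r
# cross-tabulate condition x subject, then count subjects with >0 rows per condition
dat <- data.frame(subjectid = c(1234, 1234, 2143, 3456, 3456, 3456),
                  cond      = c(1, 1, 2, 1, 1, 1))
rowSums(table(dat$cond, dat$subjectid) > 0)
# 1 2
# 2 1
```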

Related

Perform operations on a data frame based on a factor

I'm having a hard time describing this, so it's best explained with an example (as can probably be seen from the poor question title).
Using dplyr, after a group_by and summarize I have a data frame that I want to manipulate further by factor.
As an example, here's a data frame that looks like the result of my dplyr operations:
> df <- data.frame(run=as.factor(c(rep(1,3), rep(2,3))),
group=as.factor(rep(c("a","b","c"),2)),
sum=c(1,8,34,2,7,33))
> df
run group sum
1 1 a 1
2 1 b 8
3 1 c 34
4 2 a 2
5 2 b 7
6 2 c 33
I want to divide sum by a value that depends on run. For example, if I have:
> total <- data.frame(run=as.factor(c(1,2)),
total=c(45,47))
> total
run total
1 1 45
2 2 47
Then my final data frame will look like this:
> df
run group sum percent
1 1 a 1 1/45
2 1 b 8 8/45
3 1 c 34 34/45
4 2 a 2 2/47
5 2 b 7 7/47
6 2 c 33 33/47
Where I wrote the fractions in the percent column by hand to show the operation I want to do.
I know there is probably some dplyr way to do this with mutate but I can't seem to figure it out right now. How would this be accomplished?
(In base R)
You can use total as a look-up table to get the total for each run of df:
total[df$run,'total']
[1] 45 45 45 47 47 47
And you simply use it to divide the sum and assign the result to a new column:
df$percent <- df$sum / total[df$run,'total']
run group sum percent
1 1 a 1 0.02222222
2 1 b 8 0.17777778
3 1 c 34 0.75555556
4 2 a 2 0.04255319
5 2 b 7 0.14893617
6 2 c 33 0.70212766
If your "run" values are 1,2...n then this will work
divisor <- c(45,47) # c(45,47,...up to n divisors)
df$percent <- df$sum/divisor[df$run]
First you want to merge the total values into your df:
df2 <- merge(df, total, by = "run")
then you can call mutate (the %<>% compound-assignment pipe comes from the magrittr package):
library(dplyr)
library(magrittr)
df2 %<>% mutate(percent = sum / total)
Convert to data.table in-place, then merge and add new column, again in-place:
library(data.table)
setDT(df)[total, on = 'run', percent := sum/total]
df
# run group sum percent
#1: 1 a 1 0.02222222
#2: 1 b 8 0.17777778
#3: 1 c 34 0.75555556
#4: 2 a 2 0.04255319
#5: 2 b 7 0.14893617
#6: 2 c 33 0.70212766
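A pure-dplyr version of the same join-and-divide is also possible; a sketch, where left_join preserves df's row order and the helper total column is dropped afterwards:

```r
library(dplyr)

df <- data.frame(run = as.factor(c(rep(1, 3), rep(2, 3))),
                 group = as.factor(rep(c("a", "b", "c"), 2)),
                 sum = c(1, 8, 34, 2, 7, 33))
total <- data.frame(run = as.factor(c(1, 2)), total = c(45, 47))

# join the per-run totals onto df, divide, then drop the helper column
df %>%
  left_join(total, by = "run") %>%
  mutate(percent = sum / total) %>%
  select(-total)
# percent is approximately 0.022, 0.178, 0.756, 0.043, 0.149, 0.702
```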

Finding "outliers" in a group

I am working with hospital discharge data. All hospitalizations (cases) with the same Patient_ID are supposed to be of the same person. However I figured out that there are Pat_ID's with different ages and both sexes.
Imagine I have a data set like this:
Case_ID <- 1:8
Pat_ID <- c(rep("1",4), rep("2",3),"3")
Sex <- c(rep(1,4), rep(2,2),1,1)
Age <- c(rep(33,3),76,rep(19,2),49,15)
Pat_File <- data.frame(Case_ID, Pat_ID, Sex,Age)
Case_ID Pat_ID Sex Age
1 1 1 33
2 1 1 33
3 1 1 33
4 1 1 76
5 2 2 19
6 2 2 19
7 2 1 49
8 3 1 15
It was relatively easy to identify Pat_ID's with cases that differ from each other. I found these ID's by calculating an average for age and/or sex (coded as 1 and 2) with help of the function aggregate and then calculated the difference between the average and age or sex. I would like to automatically remove/identify cases where age or sex deviate from the majority of the cases of a patient ID. In my example I would like to remove cases 4 and 7.
You could try
library(data.table)
Using Mode from
Is there a built-in function for finding the mode?
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
setDT(Pat_File)[, .SD[Age==Mode(Age) & Sex==Mode(Sex)] , by=Pat_ID]
# Pat_ID Case_ID Sex Age
#1: 1 1 1 33
#2: 1 2 1 33
#3: 1 3 1 33
#4: 2 5 2 19
#5: 2 6 2 19
#6: 3 8 1 15
Testing other cases,
Pat_File$Sex[6] <- 1
Pat_File$Age[4] <- 16
setDT(Pat_File)[, .SD[Age==Mode(Age) & Sex==Mode(Sex)] , by=Pat_ID]
# Pat_ID Case_ID Sex Age
#1: 1 1 1 33
#2: 1 2 1 33
#3: 1 3 1 33
#4: 2 6 1 19
#5: 3 8 1 15
This method works, I believe, though I doubt it's the quickest or most efficient way.
Essentially I split the data frame by your grouping variable, found the 'mode' of each variable you're concerned about, filtered out the observations that didn't match all of the modes, and then stuck everything back together:
library(dplyr) # I used dplyr to 'filter' though you could do it another way
temp <- split(Pat_File, Pat_ID)
Mode.Sex <- lapply(temp, function(x) { temp1 <- table(as.vector(x$Sex)); names(temp1)[temp1 == max(temp1)]})
Mode.Age <- lapply(temp, function(x) { temp1 <- table(as.vector(x$Age)); names(temp1)[temp1 == max(temp1)]})
temp.f<-NULL
for(i in 1:length(temp)){
temp.f[[i]] <- temp[[i]] %>% filter(Sex==Mode.Sex[[i]] & Age==Mode.Age[[i]])
}
do.call("rbind", temp.f)
# Case_ID Pat_ID Sex Age
#1 1 1 1 33
#2 2 1 1 33
#3 3 1 1 33
#4 5 2 2 19
#5 6 2 2 19
#6 8 3 1 15
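The same idea fits in a single dplyr pipeline: group by Pat_ID and keep only rows matching the modal Sex and Age. A sketch, reusing the Mode helper from the data.table answer above and rebuilding the example data:

```r
library(dplyr)

# mode helper, as in the answer above
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

Pat_File <- data.frame(Case_ID = 1:8,
                       Pat_ID  = c(rep("1", 4), rep("2", 3), "3"),
                       Sex     = c(rep(1, 4), rep(2, 2), 1, 1),
                       Age     = c(rep(33, 3), 76, rep(19, 2), 49, 15))

# within each patient, keep only rows matching the modal Sex and Age
Pat_File %>%
  group_by(Pat_ID) %>%
  filter(Sex == Mode(Sex), Age == Mode(Age)) %>%
  ungroup()
# keeps cases 1, 2, 3, 5, 6 and 8; cases 4 and 7 deviate from their group's mode
```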
Here is another approach using the sqldf package:
1) Create new dataframe (called data_groups) with unique groups based on Pat_ID, Sex, and Age
2) For each unique group, check Pat_ID against every other group and if the Pat_ID of one group matches another group, select the group with lower count and store in new vector (low_counts)
3) Take new datafame (data_groups) and take out Pat_IDs from new vector (low_counts)
4) Recombine with Pat_File
Here is the code:
library(sqldf)
# Create new dataframe with unique groups based on Pat_ID, Sex, and Age
data_groups <- sqldf("SELECT *, COUNT(*) FROM Pat_File GROUP BY Pat_ID, Sex, Age")
# Create New Vector to Store Pat_IDs with Sex and Age that differ from mode
low_counts <- vector()
# Unique groups
data_groups
for(i in 1:length(data_groups[,1])){
  for(j in 1:length(data_groups[,1])){
    if(i < j){
      k <- length(low_counts) + 1
      result <- data_groups[i,2] == data_groups[j,2]
      if(is.na(result)){result <- FALSE}
      if(result == TRUE){
        if(data_groups[i,5] < data_groups[j,5]){
          low_counts[k] <- data_groups[i,1]
        } else {
          low_counts[k] <- data_groups[j,1]
        }
      }
    }
  }
}
low_counts <- as.data.frame(low_counts)
# Take out lower counts
data_groups <- sqldf("SELECT * FROM data_groups WHERE Case_ID NOT IN (SELECT * FROM low_counts)")
Pat_File <- sqldf("SELECT Pat_File.Case_ID, Pat_File.Pat_ID, Pat_File.Sex, Pat_File.Age FROM data_groups, Pat_File WHERE data_groups.Pat_ID=Pat_File.Pat_ID AND data_groups.Sex=Pat_File.Sex AND data_groups.Age=Pat_File.Age ORDER BY Pat_File.Case_ID")
Pat_File
Which provides the following results:
Case_ID Pat_ID Sex Age
1 1 1 1 33
2 2 1 1 33
3 3 1 1 33
4 5 2 2 19
5 6 2 2 19
6 8 3 1 15

R sum rows in a telecommunication matrix

I have a big matrix df with a length of over 3000 rows. I am programming in R. It looks like this:
df:
person1 person2 calls
      1       3     5
      1       4     7
      2      11     6
      3       1     5
      3       2     1
      3       4    13
and so on.
What I want to do is to get the total number of calls that each person made and received, in two matrices. This would look like this:
calls:
person madecalls
     1        12
     2         6
     3        19
received:
person receivedcalls
     1             5
     2             1
     3             5
     4            20
    11             6
Can anyone help me with this problem?
Thanks!
Use the aggregate function:
made.calls <- aggregate(df$calls, by = list(person = df$person1), FUN = sum)
Or the plyr way:
library(plyr)
ddply(df, .(person1), function(x) data.frame(madecalls = sum(x$calls)))

Sequentially numbering repetitive interactions in R

I have a data frame in R that has been previously sorted with data that looks like the following:
id creatorid responderid
1 1 2
2 1 2
3 1 3
4 1 3
5 1 3
6 2 3
7 2 3
I'd like to add a value, called repetition to the data frame that shows how many times that combination of (creatorid,responderid) has previously appeared. For example, the output in this case would be:
id creatorid responderid repetition
1 1 2 0
2 1 2 1
3 1 3 0
4 1 3 1
5 1 3 2
6 2 3 0
7 2 3 1
I have a hunch that this is something that can be easily done with dlply and transform, but I haven't been able to work it out. Here's the simple code that I'm using to attempt it:
dlply(df, .(creatorid, responderid), transform, repetition=function(dfrow) {
seq(0,nrow(dfrow)-1)
})
Unfortunately, this throws the following error (pasted from my real data - the first repetition appears 166 times):
Error in data.frame(list(id = c(39684L, 55374L, 65158L, 54217L, 10004L, :
arguments imply differing number of rows: 166, 0
Any suggestions on an easy and efficient way to accomplish this task?
Using plyr:
ddply(df, .(creatorid, responderid), function(x)
transform(x, repetition = seq_len(nrow(x))-1))
Using data.table:
require(data.table)
dt <- data.table(df)
dt[, repetition := seq_len(.N)-1, by = list(creatorid, responderid)]
using ave:
within(df, {repetition <- ave(id, list(creatorid, responderid),
FUN=function(x) seq_along(x)-1)})
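As a quick sanity check, here is the ave approach run on the example data, with df rebuilt from the table above (passing the grouping vectors directly, rather than in a list, also works):

```r
df <- data.frame(id = 1:7,
                 creatorid   = c(1, 1, 1, 1, 1, 2, 2),
                 responderid = c(2, 2, 3, 3, 3, 3, 3))

# number each (creatorid, responderid) pair's occurrences from 0
df$repetition <- ave(df$id, df$creatorid, df$responderid,
                     FUN = function(x) seq_along(x) - 1)
df$repetition
# [1] 0 1 0 1 2 0 1
```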

How to aggregate some columns while keeping other columns in R?

I have a data frame like this:
id no age
1 1 7 23
2 1 2 23
3 2 1 25
4 2 4 25
5 3 6 23
6 3 1 23
and I hope to aggregate the data frame by id to a form like this (just sum the no values if they share the same id, but keep age there):
id no age
1 1 9 23
2 2 5 25
3 3 7 23
How to achieve this using R?
Assuming that your data frame is named df.
aggregate(no~id+age, df, sum)
# id age no
# 1 1 23 9
# 2 3 23 7
# 3 2 25 5
Even better, data.table:
library(data.table)
# convert your object to a data.table (by reference) to unlock data.table syntax
setDT(df)
df[ , .(sum_no = sum(no), unq_age = unique(age)), by = id]
Alternatively, you could use ddply from plyr package:
require(plyr)
ddply(df,.(id,age),summarise,no = sum(no))
In this particular example the results are identical. However, this is not always the case; the difference between the two functions is outlined here. Both functions have their uses and are worth exploring, which is why I felt this alternative should be mentioned.
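And for a third choice, the dplyr equivalent (a sketch; first(age) assumes age is constant within each id, as it is in the example):

```r
library(dplyr)

df <- data.frame(id  = c(1, 1, 2, 2, 3, 3),
                 no  = c(7, 2, 1, 4, 6, 1),
                 age = c(23, 23, 25, 25, 23, 23))

# sum no within each id, carrying age along
df %>%
  group_by(id) %>%
  summarise(no = sum(no), age = first(age))
#   id    no   age
#    1     9    23
#    2     5    25
#    3     7    23
```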
