Finding "outliers" in a group - r

I am working with hospital discharge data. All hospitalizations (cases) with the same Patient_ID are supposed to belong to the same person. However, I discovered that there are Pat_IDs with different ages and both sexes.
Imagine I have a data set like this:
Case_ID <- 1:8
Pat_ID <- c(rep("1",4), rep("2",3),"3")
Sex <- c(rep(1,4), rep(2,2),1,1)
Age <- c(rep(33,3),76,rep(19,2),49,15)
Pat_File <- data.frame(Case_ID, Pat_ID, Sex,Age)
Case_ID Pat_ID Sex Age
1 1 1 33
2 1 1 33
3 1 1 33
4 1 1 76
5 2 2 19
6 2 2 19
7 2 1 49
8 3 1 15
It was relatively easy to identify Pat_IDs whose cases differ from each other. I found these IDs by calculating an average for age and/or sex (coded as 1 and 2) with the help of the aggregate function and then calculating the difference between the average and each case's age or sex. I would now like to automatically remove/identify cases where age or sex deviates from the majority of the cases for a patient ID. In my example, I would like to remove cases 4 and 7.
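For illustration, here is a minimal sketch of that kind of check (using the within-ID range rather than the mean difference; an ID is flagged if any of its cases disagree):
spread <- aggregate(cbind(Sex, Age) ~ Pat_ID, data = Pat_File,
                    FUN = function(x) max(x) - min(x))
# Pat_IDs whose cases are not all identical in Sex and Age
spread$Pat_ID[spread$Sex > 0 | spread$Age > 0]
# [1] "1" "2"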

You could try data.table, using Mode from Is there a built-in function for finding the mode?:
library(data.table)
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
setDT(Pat_File)[, .SD[Age == Mode(Age) & Sex == Mode(Sex)], by = Pat_ID]
# Pat_ID Case_ID Sex Age
#1: 1 1 1 33
#2: 1 2 1 33
#3: 1 3 1 33
#4: 2 5 2 19
#5: 2 6 2 19
#6: 3 8 1 15
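If you also want to identify the removed cases (4 and 7 here) rather than just keep the majority, a sketch negating the same filter:
setDT(Pat_File)[, .SD[Age != Mode(Age) | Sex != Mode(Sex)], by = Pat_ID]
# Pat_ID Case_ID Sex Age
#1: 1 4 1 76
#2: 2 7 1 49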
Testing other cases,
Pat_File$Sex[6] <- 1
Pat_File$Age[4] <- 16
setDT(Pat_File)[, .SD[Age == Mode(Age) & Sex == Mode(Sex)], by = Pat_ID]
# Pat_ID Case_ID Sex Age
#1: 1 1 1 33
#2: 1 2 1 33
#3: 1 3 1 33
#4: 2 6 1 19
#5: 3 8 1 15

This method works, I believe, though I doubt it's the quickest or most efficient way.
Essentially, I split the data frame by your grouping variable, found the 'mode' for the variables you're concerned about, filtered out the observations that didn't match all of the modes, and stuck everything back together:
library(dplyr) # I used dplyr to 'filter', though you could do it another way
temp <- split(Pat_File, Pat_File$Pat_ID)
# modal value(s) of Sex and Age within each patient (can be more than one on ties)
Mode.Sex <- lapply(temp, function(x) { temp1 <- table(as.vector(x$Sex)); names(temp1)[temp1 == max(temp1)] })
Mode.Age <- lapply(temp, function(x) { temp1 <- table(as.vector(x$Age)); names(temp1)[temp1 == max(temp1)] })
temp.f <- list()
for (i in 1:length(temp)) {
  # %in% rather than == so ties in the mode don't break the filter
  temp.f[[i]] <- temp[[i]] %>% filter(Sex %in% Mode.Sex[[i]] & Age %in% Mode.Age[[i]])
}
do.call("rbind", temp.f)
# Case_ID Pat_ID Sex Age
#1 1 1 1 33
#2 2 1 1 33
#3 3 1 1 33
#4 5 2 2 19
#5 6 2 2 19
#6 8 3 1 15
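For reference, the same logic can be written as a single grouped pipeline in dplyr, reusing the Mode() helper from the first answer (a sketch):
library(dplyr)
Pat_File %>%
  group_by(Pat_ID) %>%
  filter(Sex == Mode(Sex), Age == Mode(Age)) %>%
  ungroup()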

Here is another approach using the sqldf package:
1) Create a new data frame (called data_groups) with the unique groups based on Pat_ID, Sex, and Age
2) For each unique group, check its Pat_ID against every other group; if the Pat_IDs of two groups match, select the group with the lower count and store its Case_ID in a new vector (low_counts)
3) Take the new data frame (data_groups) and remove the groups recorded in low_counts
4) Recombine with Pat_File
Here is the code:
library(sqldf)
# Create new dataframe with unique groups based on Pat_ID, Sex, and Age
data_groups <- sqldf("SELECT *, COUNT(*) FROM Pat_File GROUP BY Pat_ID, Sex, Age")
# Create New Vector to Store Pat_IDs with Sex and Age that differ from mode
low_counts <- vector()
# Unique groups
data_groups
for(i in 1:length(data_groups[,1])){
for(j in 1:length(data_groups[,1])){
if(i<j){
k <- length(low_counts)+1
result <- data_groups[i,2]==data_groups[j,2]
if(is.na(result)){result <- FALSE}
if(result==TRUE){
if(data_groups[i,5]<data_groups[j,5]){low_counts[k] <- data_groups[i,1]}
else{low_counts[k] <- data_groups[j,1]}
}
}
}
}
low_counts <- as.data.frame(low_counts)
# Take out lower counts
data_groups <- sqldf("SELECT * FROM data_groups WHERE Case_ID NOT IN (SELECT * FROM low_counts)")
Pat_File <- sqldf("SELECT Pat_File.Case_ID, Pat_File.Pat_ID, Pat_File.Sex, Pat_File.Age FROM data_groups, Pat_File WHERE data_groups.Pat_ID=Pat_File.Pat_ID AND data_groups.Sex=Pat_File.Sex AND data_groups.Age=Pat_File.Age ORDER BY Pat_File.Case_ID")
Pat_File
This provides the following results:
Case_ID Pat_ID Sex Age
1 1 1 1 33
2 2 1 1 33
3 3 1 1 33
4 5 2 2 19
5 6 2 2 19
6 8 3 1 15
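A set-based sqldf sketch that avoids the double loop: compute the group counts once, keep each patient's largest group, and join back (note that on ties this keeps both groups, whereas the loop keeps one arbitrarily):
counts <- sqldf("SELECT Pat_ID, Sex, Age, COUNT(*) AS n
                 FROM Pat_File GROUP BY Pat_ID, Sex, Age")
keep <- sqldf("SELECT c.Pat_ID, c.Sex, c.Age FROM counts c
               JOIN (SELECT Pat_ID, MAX(n) AS maxn FROM counts GROUP BY Pat_ID) m
                 ON c.Pat_ID = m.Pat_ID AND c.n = m.maxn")
sqldf("SELECT p.* FROM Pat_File p
       JOIN keep k ON p.Pat_ID = k.Pat_ID AND p.Sex = k.Sex AND p.Age = k.Age
       ORDER BY p.Case_ID")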

Related

Mapping 2 unrelated data frames in R

I need to use data from dataframe A to fill a column in dataframe B.
Here is a subset of dataframe A:
> dfA <- data.frame(Family=c('A','A','A','B','B'), Count=c(1,2,3,1,2), Start=c(0,10,35,0,5), End=c(10,35,50,5,25))
> dfA
Family Count Start End
1 A 1 0 10
2 A 2 10 35
3 A 3 35 50
4 B 1 0 5
5 B 2 5 25
and a subset of dataframe B
> dfB <- data.frame(Family=c('A','A','A','B','B'), Start=c(1,4,36,2,10), End=c(3,6,40,4,24), BelongToCount=c(NA,NA,NA,NA,NA))
> dfB
Family Start End BelongToCount
1 A 1 3 NA
2 A 4 6 NA
3 A 36 40 NA
4 B 2 4 NA
5 B 10 24 NA
What I want to do is fill in the BelongToCount column in B according to the data from dataframe A, so that dataframe B ends up filled as:
Family Start End BelongToCount
A 1 3 1
A 4 6 1
A 36 40 3
B 2 4 1
B 10 24 2
I need to do this for each family (so grouping by family), and the condition to fill the BelongToCount column is that B$Start >= A$Start and B$End <= A$End.
I can't seem to find a clean (and fast) way to do this in R.
Right now, I am doing as follows:
split_A <- split(dfA, dfA$Family)
split_A_FamilyA <- split_A[["A"]]
split_B <- split(dfB, dfB$Family)
split_B_FamilyA <- split_B[["A"]]
for (i in 1:nrow(split_B_FamilyA)) {
  row <- split_B_FamilyA[i, ]
  start <- row$Start
  end <- row$End
  for (j in 1:nrow(split_A_FamilyA)) {
    row_base <- split_A_FamilyA[j, ]
    start_base <- row_base$Start
    end_base <- row_base$End
    if ((start >= start_base) && (end <= end_base)) {
      split_B_FamilyA$BelongToCount[i] <- row_base$Count
      break
    }
  }
}
I admit this is a very bad way of handling the problem (and it is awfully slow). I usually use dplyr when it comes to applying operations to specific groups, but I can't find a way to do such a thing with it. Joining the tables does not make a lot of sense either because the numbers of rows don't match.
Can someone point me any relevant R function / an efficient way of solving this problem in R?
You can do this with a non-equi join in data.table:
library(data.table)
setDT(dfB)
setDT(dfA)
set(dfB, j='BelongToCount', value = as.numeric(dfB$BelongToCount))
dfB[dfA, BelongToCount := Count, on = .(Family, Start >= Start, End <= End)]
# Family Start End BelongToCount
# 1: A 1 3 1
# 2: A 4 6 1
# 3: A 36 40 3
# 4: B 2 4 1
# 5: B 10 24 2
In case a row in dfB is contained in multiple rows of dfA:
dfA2 <- rbind(dfA, dfA)
dfA2[dfB, .(BelongToCount = sum(Count)),
     on = .(Family, Start <= Start, End >= End), by = .EACHI]
# Family Start End BelongToCount
# 1: A 1 3 2
# 2: A 4 6 2
# 3: A 36 40 6
# 4: B 2 4 2
# 5: B 10 24 4
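If you prefer dplyr, recent versions (1.1.0 or later) support non-equi joins through join_by; a sketch, renaming dfA's boundary columns first so the output stays unambiguous:
library(dplyr) # join_by() needs dplyr >= 1.1.0
dfA_ranges <- rename(dfA, A_Start = Start, A_End = End)
dfB %>%
  select(Family, Start, End) %>%
  left_join(dfA_ranges, by = join_by(Family, Start >= A_Start, End <= A_End)) %>%
  select(Family, Start, End, BelongToCount = Count)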

Create a group variable in a data frame by a string variable starting from a certain value in R

I have following data frame.
sub1=c("2021","2121","M123","M143")
x1=c(10,5,6,7)
x2=c(11,12,34,56)
data=data.frame(sub1,x1,x2)
I need to create a group variable for this data frame such that if sub1 starts with the number 2, it belongs to one group, and if sub1 starts with the letter M, it belongs to a second group.
My desired output should be like this,
sub1 x1 x2 group
1 2021 10 11 1
2 2121 5 12 1
3 M123 6 34 2
4 M143 7 56 2
Can anyone suggest a function I can use for this? I tried the grep function as follows, but I didn't get the desired result.
data$sub1[grep("^[2].*", data$sub1)]
Thank you
What about this:
data$group <- ifelse(substr(data$sub1,1,1)==2,1,2)
data
sub1 x1 x2 group
1 2021 10 11 1
2 2121 5 12 1
3 M123 6 34 2
4 M143 7 56 2
In case you do not know whether there could be cases other than 2 or M:
ifelse(substr(data$sub1,1,1)==2,1,ifelse(substr(data$sub1,1,1)=='M',2,'Missing'))
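With more than two prefixes, the nested ifelse calls get hard to read; dplyr::case_when scales better (a sketch; unmatched values become NA rather than 'Missing'):
library(dplyr)
first_chr <- substr(as.character(data$sub1), 1, 1)
data$group <- case_when(first_chr == "2" ~ 1,
                        first_chr == "M" ~ 2)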
Another way using substring and indexing to assign groups.
data$group <- (substr(data$sub1, 1, 1) == "M") + 1
data
# sub1 x1 x2 group
#1 2021 10 11 1
#2 2121 5 12 1
#3 M123 6 34 2
#4 M143 7 56 2
Or extract first character using regex
sub("(.).*", "\\1", data$sub1)
#[1] "2" "2" "M" "M"
and then use the same method to assign groups
(sub("(.).*", "\\1", data$sub1) == "M") + 1
#[1] 1 1 2 2
You can also do:
as.integer(!grepl("^2", data$sub1)) + 1
[1] 1 1 2 2

Perform operations on a data frame based on a factor

I'm having a hard time describing this, so it's best explained with an example (as can probably be seen from the poor question title).
Using dplyr, after a group_by and summarize I have a data frame that I want to manipulate further by factor.
As an example, here's a data frame that looks like the result of my dplyr operations:
> df <- data.frame(run=as.factor(c(rep(1,3), rep(2,3))),
+                  group=as.factor(rep(c("a","b","c"),2)),
+                  sum=c(1,8,34,2,7,33))
> df
run group sum
1 1 a 1
2 1 b 8
3 1 c 34
4 2 a 2
5 2 b 7
6 2 c 33
I want to divide sum by a value that depends on run. For example, if I have:
> total <- data.frame(run=as.factor(c(1,2)),
+                     total=c(45,47))
> total
run total
1 1 45
2 2 47
Then my final data frame will look like this:
> df
run group sum percent
1 1 a 1 1/45
2 1 b 8 8/45
3 1 c 34 34/45
4 2 a 2 2/47
5 2 b 7 7/47
6 2 c 33 33/47
Where I manually inserted the fraction in the percent column by hand to show the operation I want to do.
I know there is probably some dplyr way to do this with mutate but I can't seem to figure it out right now. How would this be accomplished?
(In base R)
You can use total as a look-up table to get a total for each run of df:
total[df$run,'total']
[1] 45 45 45 47 47 47
And you simply use it to divide the sum and assign the result to a new column:
df$percent <- df$sum / total[df$run,'total']
run group sum percent
1 1 a 1 0.02222222
2 1 b 8 0.17777778
3 1 c 34 0.75555556
4 2 a 2 0.04255319
5 2 b 7 0.14893617
6 2 c 33 0.70212766
If your "run" values are 1,2...n then this will work
divisor <- c(45,47) # c(45,47,...up to n divisors)
df$percent <- df$sum/divisor[df$run]
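If the run labels are not 1, 2, ..., n, a named lookup vector is safer than positional indexing (a sketch):
divisor <- c("1" = 45, "2" = 47) # names are the run labels
df$percent <- df$sum / divisor[as.character(df$run)]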
First you want to merge the total values into your df:
df2 <- merge(df, total, by = "run")
then you can call mutate (note that the compound-assignment pipe %<>% comes from the magrittr package):
df2 %<>% mutate(percent = sum / total)
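The same merge-then-mutate steps can be written as one dplyr pipeline (a sketch; inside mutate, the sum column masks base::sum):
library(dplyr)
df %>%
  left_join(total, by = "run") %>%
  mutate(percent = sum / total)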
Convert to data.table in-place, then merge and add new column, again in-place:
library(data.table)
setDT(df)[total, on = 'run', percent := sum/total]
df
# run group sum percent
#1: 1 a 1 0.02222222
#2: 1 b 8 0.17777778
#3: 1 c 34 0.75555556
#4: 2 a 2 0.04255319
#5: 2 b 7 0.14893617
#6: 2 c 33 0.70212766

How to merge tables and fill the empty cells in the mean time in R?

Assume there are two tables a and b.
Table a:
ID AGE
1 20
2 empty
3 40
4 empty
Table b:
ID AGE
2 25
4 45
5 60
How to merge the two tables in R so that the resulting table becomes:
ID AGE
1 20
2 25
3 40
4 45
You could try
library(data.table)
setkey(setDT(a), ID)[b, AGE := i.AGE][]
# ID AGE
#1: 1 20
#2: 2 25
#3: 3 40
#4: 4 45
data
a <- data.frame(ID=c(1,2,3,4), AGE=c(20,NA,40,NA))
b <- data.frame(ID=c(2,4,5), AGE=c(25,45,60))
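A dplyr sketch of the same fill, starting from the original a: join the tables, then coalesce the two age columns.
library(dplyr)
a %>%
  left_join(b, by = "ID", suffix = c("", ".b")) %>%
  mutate(AGE = coalesce(AGE, AGE.b)) %>%
  select(ID, AGE)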
Assuming you have NA in every position in the first table where you want to use the second table's age numbers, you can use rbind and na.omit.
Example
x <- data.frame(ID=c(1,2,3,4), AGE=c(20,NA,40,NA))
y <- data.frame(ID=c(2,4,5), AGE=c(25,45,60))
na.omit(rbind(x,y))
This results in what you're after (although unordered, and I assume you just forgot ID 5):
ID AGE
1 20
3 40
2 25
4 45
5 60
EDIT
If you want to merge two different data.frames and keep their columns, it's a different thing. You can use merge to achieve this.
Here are two data frames with different columns:
x <- data.frame(ID=c(1,2,3,4), AGE=c(20,NA,40,NA), COUNTY=c(1,2,3,4))
y <- data.frame(ID=c(2,4,5), AGE=c(25,45,60), STATE=c('CA','CA','IL'))
Add them together into one data.frame
res <- merge(x, y, by='ID', all=T)
giving us
ID AGE.x COUNTY AGE.y STATE
1 20 1 NA <NA>
2 NA 2 25 CA
3 40 3 NA <NA>
4 NA 4 45 CA
5 NA NA 60 IL
Then massage it into the form we want
idx <- which(is.na(res$AGE.x)) # find missing rows in x
res$AGE.x[idx] <- res$AGE.y[idx] # replace them with y's values
names(res)[grep('AGE\\.x', names(res))] <- 'AGE' # rename merged column AGE.x to AGE
subset(res, select=-AGE.y) # dump the AGE.y column
Which gives us
ID AGE COUNTY STATE
1 20 1 <NA>
2 25 2 CA
3 40 3 <NA>
4 45 4 CA
5 60 NA IL
The package in the other answer will work. Here is a dirty hack if you don't want to use the package:
x$AGE[is.na(x$AGE)] <- y$AGE[y$ID %in% x$ID]
> x
ID AGE
1 1 20
2 2 25
3 3 40
4 4 45
But, I would use the package to avoid the clunky code.
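Note that the hack above relies on y's rows lining up, in order, with x's missing rows. A match-based lookup (a sketch, starting from the original x) avoids that assumption:
idx <- is.na(x$AGE)
x$AGE[idx] <- y$AGE[match(x$ID[idx], y$ID)]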

Counting unique items in data frame

I want a simple count of the number of subjects in each condition of a study. The data look something like this:
subjectid cond obser variable
1234 1 1 12
1234 1 2 14
2143 2 1 19
3456 1 1 12
3456 1 2 14
3456 1 3 13
etc etc etc etc
This is a large dataset and it is not always obvious how many unique subjects contribute to each condition, etc.
I have this in a data.frame.
What I want is something like
cond ofSs
1 122
2 98
Where for each "condition" I get a count of the number of unique Ss contributing data to that condition. Seems like this should be painfully simple.
Use the ddply function from the plyr package:
require(plyr)
df <- data.frame(subjectid = sample(1:3, 7, TRUE),
                 cond = sample(1:2, 7, TRUE),
                 obser = sample(1:7))
> ddply(df, .(cond), summarize, NumSubs = length(unique(subjectid)))
cond NumSubs
1 1 1
2 2 2
The ddply function "splits" the data-frame by the cond variable, and produces a summary column NumSubs for each sub-data-frame.
Using your snippet of data that I loaded into object dat:
> dat
subjectid cond obser variable
1 1234 1 1 12
2 1234 1 2 14
3 2143 2 1 19
4 3456 1 1 12
5 3456 1 2 14
6 3456 1 3 13
Then one way to do this is to use aggregate to count the unique subjectid values (assuming that is what you meant by "Ss"?):
> aggregate(subjectid ~ cond, data = dat, FUN = function(x) length(unique(x)))
cond subjectid
1 1 2
2 2 1
or, if you like SQL and don't mind installing a package (note the GROUP BY, which is needed to get one count per condition):
library(sqldf)
sqldf("select cond, count(distinct subjectid) from dat group by cond")
Just to give you even more choice, you could also use tapply:
tapply(dat$subjectid, dat$cond, function(x) length(unique(x)))
1 2
2 1
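One more base-R option (a sketch): build a condition-by-subject contingency table and count its non-zero cells per row.
rowSums(table(dat$cond, dat$subjectid) > 0)
# 1 2
# 2 1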
