How to remove individuals with fewer than 5 observations from a data frame [duplicate] - r

This question already has answers here:
Subset data frame based on number of rows per group (4 answers)
Closed last month.
To clarify the question I'll briefly describe the data.
Each row in the data.frame is an observation, and the columns represent variables pertinent to that observation including: what individual was observed, when it was observed, where it was observed, etc. I want to exclude/filter individuals for which there are fewer than 5 observations.
In other words, if there are fewer than 5 rows where individual = x, then I want to remove all rows that contain individual x and reassign the result to a new data.frame. I'm aware of some brute force techniques using something like names == unique(df$individualname) and then subsetting out those names individually and applying nrow to determine whether or not to exclude them...but there has to be a better way. Any help is appreciated, I'm still pretty new to R.

An example using group_by and filter from the dplyr package:
library(dplyr)
df <- data.frame(id = c(rep("a", 2), rep("b", 5), rep("c", 8)),
                 foo = runif(15))
> df
id foo
1 a 0.8717067
2 a 0.9086262
3 b 0.9962453
4 b 0.8980123
5 b 0.1535324
6 b 0.2802848
7 b 0.9366375
8 c 0.8109557
9 c 0.6945285
10 c 0.1012925
11 c 0.6822955
12 c 0.3757085
13 c 0.7348635
14 c 0.3026395
15 c 0.9707223
df %>% group_by(id) %>% filter(n() >= 5) %>% ungroup()
Source: local data frame [13 x 2]
id foo
(fctr) (dbl)
1 b 0.9962453
2 b 0.8980123
3 b 0.1535324
4 b 0.2802848
5 b 0.9366375
6 c 0.8109557
7 c 0.6945285
8 c 0.1012925
9 c 0.6822955
10 c 0.3757085
11 c 0.7348635
12 c 0.3026395
13 c 0.9707223
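With a more recent dplyr (an assumption on my part: version 1.1.0 or later), the per-operation .by argument does the grouping inline, so no group_by()/ungroup() pair is needed:
df %>% filter(n() >= 5, .by = id)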
or with base R:
> df[df$id %in% names(which(table(df$id) >= 5)), ]
id foo
3 b 0.9962453
4 b 0.8980123
5 b 0.1535324
6 b 0.2802848
7 b 0.9366375
8 c 0.8109557
9 c 0.6945285
10 c 0.1012925
11 c 0.6822955
12 c 0.3757085
13 c 0.7348635
14 c 0.3026395
15 c 0.9707223
Still in base R, using with() is a more elegant way to do the very same thing:
df[with(df, id %in% names(which(table(id) >= 5))), ]
or, since subset() already evaluates its condition inside the data frame, the with() wrapper can simply be dropped:
subset(df, id %in% names(which(table(id) >= 5)))
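Yet another base R idiom for the same filter (a sketch doing the very same thing): ave() computes the group size for every row, which can then be used directly as a logical index:
df[ave(seq_along(df$id), df$id, FUN = length) >= 5, ]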

Another way to do the same thing, using the data.table package:
library(data.table)
set.seed(1)
dt <- data.table(id = sample(1:4, 20, replace = TRUE), var = sample(1:100, 20))
dt1 <- dt[, count := .N, by = id][count >= 5]
# := has already added count to dt by reference, so it needn't be recomputed:
dt2 <- dt[count < 5]
dt1
id var count
1: 2 94 5
2: 2 22 5
3: 3 64 5
4: 4 13 6
5: 4 37 6
6: 4 2 6
7: 3 36 5
8: 3 81 5
9: 3 90 5
10: 2 17 5
11: 4 72 6
12: 2 57 5
13: 3 67 5
14: 4 9 6
15: 2 60 5
16: 4 34 6
dt2
id var count
1: 1 26 4
2: 1 31 4
3: 1 44 4
4: 1 54 4
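Because := modifies dt by reference (as noted in the comment above), both subsets carry the helper column. If it is not wanted in a result, it can be dropped afterwards (a small cleanup sketch):
dt1[, count := NULL]
dt2[, count := NULL]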

The same can also be done with data.table, using a logical condition with if after grouping by 'id':
library(data.table)
setDT(df)[, if (.N >= 5) .SD, id]
# id foo
# 1: b 0.9962453
# 2: b 0.8980123
# 3: b 0.1535324
# 4: b 0.2802848
# 5: b 0.9366375
# 6: c 0.8109557
# 7: c 0.6945285
# 8: c 0.1012925
# 9: c 0.6822955
#10: c 0.3757085
#11: c 0.7348635
#12: c 0.3026395
#13: c 0.9707223
data
df <- structure(list(id = c("a", "a", "b", "b", "b", "b", "b", "c",
"c", "c", "c", "c", "c", "c", "c"), foo = c(0.8717067, 0.9086262,
0.9962453, 0.8980123, 0.1535324, 0.2802848, 0.9366375, 0.8109557,
0.6945285, 0.1012925, 0.6822955, 0.3757085, 0.7348635, 0.3026395,
0.9707223)), .Names = c("id", "foo"), class = "data.frame",
row.names = c(NA, -15L))
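As a brief aside for readers new to data.table: .N is the row count of the current group and .SD is the subset of the data for that group, so if (.N >= 5) .SD emits a group's rows only when the group is large enough. A keyby variant (same logic, result additionally sorted and keyed by id) would be:
setDT(df)[, if (.N >= 5) .SD, keyby = id]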

You can also use table(). Take, for instance, the data frame mtcars:
table(mtcars$cyl)
You will see that cyl takes three values: 4, 6 and 8. There are 7 cars with 6 cylinders, so if you want to exclude values with fewer than 10 observations, you can drop the cars with 6 cylinders like this:
mtcars[!mtcars$cyl %in% names(table(mtcars$cyl)[table(mtcars$cyl) < 10]), ]
This excludes the observations using nothing but %in%, names() and table().
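For reference, the counts in the built-in mtcars data are 11, 7 and 14, so the filter keeps 25 of the 32 rows:
table(mtcars$cyl)
#  4  6  8
# 11  7 14
nrow(mtcars[!mtcars$cyl %in% names(table(mtcars$cyl)[table(mtcars$cyl) < 10]), ])
# [1] 25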

Related

Labelling rows according to how many times the group appeared in previous rows

Suppose I have the following data.frame object:
df = data.frame(id = 1:25,
                col1 = c('a', 'a', 'a',
                         'b', 'b', 'b',
                         'c', 'c', 'c',
                         'd', 'd',
                         'b', 'b', 'b',
                         'e',
                         'c', 'c', 'c',
                         'e', 'e',
                         'a', 'a', 'a',
                         'e', 'e'))
From the data above, you can see that there are two groups of rows that have col1=="a": rows 1 through 3 and rows 21 through 23. Similarly, there are three groups of rows that have col1=="e": row 15, rows 19 through 20 and rows 24 through 25 (and so on and so on with "b", "c" and "d").
Here's my main question
Is it possible to label the rows according to what "chunk" we're currently on? More explicitly: since rows 1 through 3 are the first time where we have col1=="a", they should be labelled as 1. Then, rows 21 through 23 should be labelled as 2, because that is the second time that we have a set of rows that have col1=="a". Using the same logic, but for col1=="e", we'd label row 15 as 1, rows 19 and 20 as 2 and rows 24 and 25 as 3 (again, so on and so on with "b", "c" and "d").
Desired output
Here is what the resulting data.frame would look like:
df = data.frame(id = 1:25,
                col1 = c('a', 'a', 'a',
                         'b', 'b', 'b',
                         'c', 'c', 'c',
                         'd', 'd',
                         'b', 'b', 'b',
                         'e',
                         'c', 'c', 'c',
                         'e', 'e',
                         'a', 'a', 'a',
                         'e', 'e'),
                grup = c(1, 1, 1,
                         1, 1, 1,
                         1, 1, 1,
                         1, 1,
                         2, 2, 2,
                         1,
                         2, 2, 2,
                         2, 2,
                         2, 2, 2,
                         3, 3))
My attempt
I tried implementing a solution using a for loop, but that was quite slow (the original data I'm working on has about 500,000 rows), and it just looked a bit sloppy:
my_classifier = function(input_df, ref_column){
  # Keeps a tally of how many times each unique group was "found" before.
  group_counter = list()
  # Dealing with the corner case of the first row
  group_counter[[input_df[[ref_column]][1]]] = 1
  output_groups = rep(-1, nrow(input_df))
  output_groups[1] = 1
  # The for loop starts at the second row because I've already "dealt" with the
  # first row in the corner cases above
  for(i in 2:nrow(input_df)){
    prev_group = input_df[[ref_column]][i-1]
    this_group = input_df[[ref_column]][i]
    if(is.null(group_counter[[this_group]])){
      this_counter = 0
    } else {
      this_counter = group_counter[[this_group]]
    }
    if(prev_group != this_group){
      this_counter = this_counter + 1
    }
    output_groups[i] = this_counter
    group_counter[[this_group]] = this_counter
  }
  return(output_groups)
}
df$grup = my_classifier(df,'col1')
Is there a quicker/more efficient way to solve this problem? Maybe something that relies on vectorized functions or something?
Important notes
Consider that we cannot rely on the number of repetitions of each "block". Sometimes, col1 will have just one single row for a particular group, while other times the "block" will have several rows where col1 shares the same value. Also consider that we cannot assume any logic in the "order" or the number of times each group shows up.
So, for example, there might be a stretch of 10 rows where col1=="z", then a stretch of 15 rows where col1=="x", then another single row where col1=="x" and then finally a stretch of 100 rows where col1=="w".
You can use data.table::rleid() twice, like this:
library(data.table)
setDT(df)[, grp := rleid(col1)][, grp := rleid(grp), by = col1][order(id)]
Output:
id col1 grp
<int> <char> <int>
1: 1 a 1
2: 2 a 1
3: 3 a 1
4: 4 b 1
5: 5 b 1
6: 6 b 1
7: 7 c 1
8: 8 c 1
9: 9 c 1
10: 10 d 1
11: 11 d 1
12: 12 b 2
13: 13 b 2
14: 14 b 2
15: 15 e 1
16: 16 c 2
17: 17 c 2
18: 18 c 2
19: 19 e 2
20: 20 e 2
21: 21 a 2
22: 22 a 2
23: 23 a 2
24: 24 e 3
25: 25 e 3
id col1 grp
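To see why calling rleid() twice works, here is a minimal illustration (my own toy input, not from the question): rleid() assigns a new integer id every time the value changes, and the second call then renumbers those run ids within each col1 group.
rleid(c("a", "a", "b", "a"))
# [1] 1 1 2 3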
Here is a possible base R solution:
change <- with(rle(df$col1), rep(seq_along(values), lengths))
cbind(df, grp = with(df, ave(
  change,
  col1,
  FUN = function(x)
    inverse.rle(within.list(rle(x), values <- seq_along(values)))
)))
Or another option using a combination of rle and dplyr using the function from here:
rle_new <- function(x) {
  x <- rle(x)$lengths
  rep(seq_along(x), times = x)
}
library(dplyr)
df %>%
  mutate(grp = rle_new(col1)) %>%
  group_by(col1) %>%
  mutate(grp = rle_new(grp))
Output
id col1 grp
1 1 a 1
2 2 a 1
3 3 a 1
4 4 b 1
5 5 b 1
6 6 b 1
7 7 c 1
8 8 c 1
9 9 c 1
10 10 d 1
11 11 d 1
12 12 b 2
13 13 b 2
14 14 b 2
15 15 e 1
16 16 c 2
17 17 c 2
18 18 c 2
19 19 e 2
20 20 e 2
21 21 a 2
22 22 a 2
23 23 a 2
24 24 e 3
25 25 e 3
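If you are on a recent dplyr (an assumption: version 1.1.0 or later, which provides consecutive_id() and the .by argument), the same two-step trick works without data.table; consecutive_id() is the tidyverse analogue of rleid():
library(dplyr)
df %>%
  mutate(grp = consecutive_id(col1)) %>%
  mutate(grp = consecutive_id(grp), .by = col1)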

R merging tables, with different column names and retaining all columns

I have a big job to merge two large data.tables. This is new to me, and I need to demonstrate and explain it to colleagues. That is the reason for the paranoid approach: I'd like to randomly select some result rows to assure us all that the merge is doing what we think it is! Here is my MWE. Thanks, J
library(data.table)
first <- data.table(index = c("a", "a", "b", "c", "c"),
                    type = 1:5,
                    value = 3:7)
second <- data.table(i2 = c("a", "a", "b", "c", "c"),
                     t2 = c(1:3, 7, 5),
                     value = 5:9)
second[first, on = c(i2 = "index", t2 = "type"), nomatch = 0L]
This is doing the job correctly AFAIK, and gives this result:
i2 t2 value i.value
1: a 1 5 3
2: a 2 6 4
3: b 3 7 5
4: c 5 9 7
However I would like, if possible to retain all columns from both tables such that the result would look like:
i2 t2 index type value i.value
1: a 1 a 1 5 3
2: a 2 a 2 6 4
3: b 3 b 3 7 5
4: c 5 c 5 9 7
Is it possible to retain all columns?
Yes, that's possible:
second[first, .(i2, t2, index, type, value, i.value), on = c(i2 = "index", t2 = "type"), nomatch = 0L]
i2 t2 index type value i.value
1: a 1 a 1 5 3
2: a 2 a 2 6 4
3: b 3 b 3 7 5
4: c 5 c 5 9 7
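Since the join columns are pairwise identical by construction, an alternative worth knowing (a sketch, for when duplicated key columns are not actually required) is merge(), which keeps a single copy of each key and suffixes the clashing value columns as value.x/value.y:
merge(first, second, by.x = c("index", "type"), by.y = c("i2", "t2"))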

Simplest way to replace a list of values in a data frame with a list of new values

Say we have a data frame with a factor (Group) that is a grouping variable for a list of IDs:
set.seed(123)
data <- data.frame(Group = factor(sample(5, 10, replace = TRUE)),
                   ID = 1:10)
In this example, the ID's belong to one of 5 Groups, labeled 1:5. We simply want to replace 1:5 with A:E. In other words, if Group == 1, we want to change it to A, if Group == 2, we want to change it to B, and so on. What is the simplest way to achieve this?
You may assign new labels via labels= in a named list, using factor() once again.
data$Group1 <- factor(data$Group, labels=list("1"="A", "2"="B", "3"="C", "4"="D", "5"="E"))
## more succinct:
data$Group2 <- factor(data$Group, labels=setNames(list("A", "B", "C", "D", "E"), 1:5))
data
#    Group ID Group1 Group2
# 1      3  1      C      C
# 2      3  2      C      C
# 3      2  3      B      B
# 4      2  4      B      B
# 5      3  5      C      C
# 6      5  6      E      E
# 7      4  7      D      D
# 8      1  8      A      A
# 9      2  9      B      B
# 10     3 10      C      C
This is for the general case; if capital letters specifically are wanted, see @RonakShah's solution below.
You can use R's built-in constant LETTERS:
data$new_group <- LETTERS[data$Group]
data
# Group ID new_group
#1 3 1 C
#2 3 2 C
#3 2 3 B
#4 2 4 B
#5 3 5 C
#6 5 6 E
#7 4 7 D
#8 1 8 A
#9 2 9 B
#10 3 10 C
Created a new column (new_group) here for comparison purposes. You can overwrite the same column if you wish to.
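One more base R route (a sketch that modifies the factor in place, assuming every level 1 through 5 actually occurs so the levels line up in order): overwrite the factor's levels directly.
levels(data$Group) <- LETTERS[1:5]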

Pass variable as column name to dplyr?

I have a very ugly dataset that is a flat file of a relational database. A minimal reproducible example is here:
df <- data.frame(col1 = c(letters[1:4], "c"),
                 col1.p = 1:5,
                 col2 = c("a", "c", "l", "c", "l"),
                 col2.p = 6:10,
                 col3 = letters[3:7],
                 col3.p = 11:20)
I need to be able to identify the '.p' value for the 'col#' that has the "c". My previous question on SO got the first part: In R, find the column that contains a string in for each row. Which I'm providing for context.
tmp <- which(projectdata=='Transmission and Distribution of Electricity', arr.ind=TRUE)
cnt <- ave(tmp[,"row"], tmp[,"row"], FUN=seq_along)
maxnames <- paste0("max",sequence(max(cnt)))
projectdata[maxnames] <- NA
projectdata[maxnames][cbind(tmp[,"row"],cnt)] <- names(projectdata)[tmp[,"col"]]
rm(tmp, cnt, maxnames)
This results in a dataframe that looks like this:
df
col1 col1.p col2 col2.p col3 col3.p max1
1 a 1 a 6 c 11 col3
2 b 2 c 7 d 12 col2
3 c 3 l 8 e 13 col1
4 d 4 c 9 f 14 col2
5 c 5 l 10 g 15 col1
6 a 1 a 6 c 16 col3
7 b 2 c 7 d 17 col2
8 c 3 l 8 e 18 col1
9 d 4 c 9 f 19 col2
10 c 5 l 10 g 20 col1
When I tried to get the ".p" that matched the value in "max1", I kept getting errors. I thought the approach would be:
df %>%
  mutate(my.p = eval(as.name(paste0(max1, '.p'))))
Error: object 'col3.p' not found
Clearly, this did not work, so I thought maybe this was similar to passing a column name in a function, where I need to use 'get'. That also didn't work.
df %>%
  mutate(my.p = get(as.name(paste0(max1, '.p'))))
Error: invalid first argument
df %>%
  mutate(my.p = get(paste0(max1, '.p')))
Error: object 'col3.p' not found
I found something that gets rid of this error, using data.table, from a different but related problem, here: http://codereply.com/answer/7y2ra3/dplyr-error-object-found-using-rle-mutate.html. However, it gives me the values of col3.p for every row: get() takes a single name, so only the first element of paste0(max1, '.p') is used, i.e. max1 from the first row (df$max1[1]).
library('dplyr')
library('data.table') # must have the data.table package
df %>%
  tbl_dt(df) %>%
  mutate(my.p = get(paste0(max1, '.p')))
Source: local data table [10 x 8]
col1 col1.p col2 col2.p col3 col3.p max1 my.p
1 a 1 a 6 c 11 col3 11
2 b 2 c 7 d 12 col2 12
3 c 3 l 8 e 13 col1 13
4 d 4 c 9 f 14 col2 14
5 c 5 l 10 g 15 col1 15
6 a 1 a 6 c 16 col3 16
7 b 2 c 7 d 17 col2 17
8 c 3 l 8 e 18 col1 18
9 d 4 c 9 f 19 col2 19
10 c 5 l 10 g 20 col1 20
Using the lazyeval interp approach (from this SO: How to pass dynamic column names in dplyr into custom function?) doesn't work for me. Perhaps I am implementing it incorrectly?
library(lazyeval)
library(dplyr)
df %>%
  mutate_(my.p = interp(~colp, colp = as.name(paste0(max1, '.p'))))
I get an error:
Error in paste0(max1, ".p") : object 'max1' not found
Ideally, I will have the new column my.p equal the appropriate p based on the column identified in max1.
I can do this all with ifelse, but I am trying to do it with less code and to make it applicable to the next ugly flat table.
We can do this with data.table. We convert the 'data.frame' to 'data.table' (setDT(df)), then, grouping by the row sequence, we get() the column named by the paste0() output and assign (:=) it to a new column ('my.p').
library(data.table)
setDT(df)[, my.p := get(paste0(max1, '.p')), by = 1:nrow(df)]
df
# col1 col1.p col2 col2.p col3 col3.p max1 my.p
# 1: a 1 a 6 c 11 col3 11
# 2: b 2 c 7 d 12 col2 7
# 3: c 3 l 8 e 13 col1 3
# 4: d 4 c 9 f 14 col2 9
# 5: c 5 l 10 g 15 col1 5
# 6: a 1 a 6 c 16 col3 16
# 7: b 2 c 7 d 17 col2 7
# 8: c 3 l 8 e 18 col1 3
# 9: d 4 c 9 f 19 col2 9
#10: c 5 l 10 g 20 col1 5
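For comparison, a vectorized base R alternative that avoids by-row grouping entirely (a sketch, run on the original data.frame before setDT()): build the target column name per row, then pick values with a row/column index matrix.
p_cols <- paste0(df$max1, ".p")
m <- as.matrix(df[unique(p_cols)])  # just the numeric .p columns
df$my.p <- m[cbind(seq_len(nrow(df)), match(p_cols, unique(p_cols)))]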

R building a subset based on value in previous row

I have a problem figuring this out.
Suppose this is what my data looks like:
Num condition y
1 a 1
2 a 2
3 a 3
4 b 4
5 b 5
6 b 6
7 c 7
8 c 8
9 c 9
10 b 10
11 b 11
12 b 12
I now want to perform a calculation (e.g., a mean) on the b rows, depending on which value appeared in the row just before the run of b's, in this example a or c.
Thanks for any help!!!
Angelika
Is this what you want?
# in order to separate between different runs of condition 'b',
# get length and value of runs of equal values of 'condition'
rl <- rle(x = df$condition)
df$run <- rep(x = seq_len(length(rl$lengths)), times = rl$lengths)
# calculate sum of y, on data grouped by condition and run, and where condition is 'b'
aggregate(y ~ condition + run, data = df, subset = condition == "b", sum)
You can add a "lagged" condition column to your dataframe (assuming DF) using
> DF <- within(DF, lag_cond <- c(NA, head(as.character(condition), -1)))
Result:
Num condition y lag_cond
1 a 1 <NA>
2 a 2 a
3 a 3 a
4 b 4 a
5 b 5 b
6 b 6 b
7 c 7 b
8 c 8 c
9 c 9 c
10 b 10 c
11 b 11 b
12 b 12 b
Now you can identify rows you want like this:
> DF[with(DF, condition=="b" & lag_cond %in% c("a","c")),]
Num condition y lag_cond
4 b 4 a
10 b 10 c
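The same lagged-condition idea translates directly to dplyr (a sketch, assuming the dplyr package is available): lag() shifts the condition column down by one row.
library(dplyr)
DF %>%
  mutate(lag_cond = lag(as.character(condition))) %>%
  filter(condition == "b" & lag_cond %in% c("a", "c"))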
