Removing specific rows from a dataframe - r

I have a data frame e.g.:
sub day
1 1
1 2
1 3
1 4
2 1
2 2
2 3
2 4
3 1
3 2
3 3
3 4
and I would like to remove specific rows that can be identified by the combination of sub and day.
For example say I wanted to remove rows where sub='1' and day='2' and sub=3 and day='4'. How could I do this?
I realise that I could specify the row numbers, but this needs to be applied to a huge dataframe which would be tedious to go through and ID each row.

DF[ ! ( ( DF$sub ==1 & DF$day==2) | ( DF$sub ==3 & DF$day==4) ) , ] # note the ! (negation)
Or if sub is a factor as suggested by your use of quotes:
DF[ ! paste(sub,day,sep="_") %in% c("1_2", "3_4"), ]
Could also use subset:
subset(DF, ! paste(sub,day,sep="_") %in% c("1_2", "3_4") )
(And I endorse the use of which in Dirk's answer when using "[" even though some claim it is not needed.)

This boils down to two distinct steps:
Figure out when your condition is true, and hence compute a vector of booleans, or, as I prefer, their indices by wrapping it into which()
Create an updated data.frame by excluding the indices from the previous step.
Here is an example:
R> set.seed(42)
R> DF <- data.frame(sub=rep(1:4, each=4), day=sample(1:4, 16, replace=TRUE))
R> DF
sub day
1 1 4
2 1 4
3 1 2
4 1 4
5 2 3
6 2 3
7 2 3
8 2 1
9 3 3
10 3 3
11 3 2
12 3 3
13 4 4
14 4 2
15 4 2
16 4 4
R> ind <- which(with( DF, sub==2 & day==3 ))
R> ind
[1] 5 6 7
R> DF <- DF[ -ind, ]
R> table(DF)
day
sub 1 2 3 4
1 0 1 0 3
2 1 0 0 0
3 0 1 3 0
4 0 2 0 2
R>
And we see that sub==2 has only one entry remaining with day==1.
Edit The compound condition can be done with an 'or' as follows:
ind <- which(with( DF, (sub==1 & day==2) | (sub=3 & day=4) ))
and here is a new full example
R> set.seed(1)
R> DF <- data.frame(sub=rep(1:4, each=5), day=sample(1:4, 20, replace=TRUE))
R> table(DF)
day
sub 1 2 3 4
1 1 2 1 1
2 1 0 2 2
3 2 1 1 1
4 0 2 1 2
R> ind <- which(with( DF, (sub==1 & day==2) | (sub==3 & day==4) ))
R> ind
[1] 1 2 15
R> DF <- DF[-ind, ]
R> table(DF)
day
sub 1 2 3 4
1 1 0 1 1
2 1 0 2 2
3 2 1 1 0
4 0 2 1 2
R>

Here's a solution to your problem using dplyr's filter function.
Although you can pass your data frame as the first argument to any dplyr function, I've used its %>% operator, which pipes your data frame to one or more dplyr functions (just filter in this case).
Once you are somewhat familiar with dplyr, the cheat sheet is very handy.
> print(df <- data.frame(sub=rep(1:3, each=4), day=1:4))
sub day
1 1 1
2 1 2
3 1 3
4 1 4
5 2 1
6 2 2
7 2 3
8 2 4
9 3 1
10 3 2
11 3 3
12 3 4
> print(df <- df %>% filter(!((sub==1 & day==2) | (sub==3 & day==4))))
sub day
1 1 1
2 1 3
3 1 4
4 2 1
5 2 2
6 2 3
7 2 4
8 3 1
9 3 2
10 3 3

One simple solution:
cond1 <- df$sub == 1 & df$day == 2
cond2 <- df$sub == 3 & df$day == 4
df <- df[!(cond1 | cond2),]

Related

How to keep only first value in every sequence of duplicated values in R [duplicate]

This question already has answers here:
Select first row in each contiguous run by group
(4 answers)
Closed 5 months ago.
I am trying to create a subset where I keep the first value in each sequence of numbers in a column. I tried to use:
df %>% group_by(x) %>% slice_head(n = 1)
But it only works for the first instance of each sequence.
An example data where x column contains the repeated sequence can be seen below:
x = c(2,2,2,3,3,3,1,1,1,5,5,5,2,2,2,1,1,1,3,3,3)
y = c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1)
df= data.frame(x,y)
> df
x y
1 2 1
2 2 1
3 2 1
4 3 1
5 3 1
6 3 1
7 1 1
8 1 1
9 1 1
10 5 1
11 5 1
12 5 1
13 2 1
14 2 1
15 2 1
16 1 1
17 1 1
18 1 1
19 3 1
20 3 1
21 3 1
So the end result that I would like to achive is:
x = c(2,3,1,5,2,1,3)
y = c(1,1,1,1,1,1,1)
df= data.frame(x,y)
> df
x y
1 2 1
2 3 1
3 1 1
4 5 1
5 2 1
6 1 1
7 3 1
Could you please help or point me to any useful existing topics as I haven't managed to find it?
Thanks
You can try rleid from package data.table
> library(data.table)
> setDT(df)[!duplicated(rleid(x))]
x y
1: 2 1
2: 3 1
3: 1 1
4: 5 1
5: 2 1
6: 1 1
7: 3 1
Base R.
df[c(1, diff(df$x)) != 0, ]
Or also with helper functions from data.table.
library(data.table)
df[rowid(rleid(df$x)) == 1L, ]
# x y
# 1 2 1
# 4 3 1
# 7 1 1
# 10 5 1
# 13 2 1
# 16 1 1
# 19 3 1
Using rle and match.
df[match(with(rle(df$x), values), df$x), ]
# x y
# 1 2 1
# 4 3 1
# 7 1 1
# 10 5 1
# 1.1 2 1
# 7.1 1 1
# 4.1 3 1

Select rows of data frame based on a vector with duplicated values

What I want can be described as: give a data frame, contains all the case-control pairs. In the following example, y is the id for the case-control pair. There are 3 pairs in my data set. I'm doing a resampling with respect to the different values of y (the pair will be both selected or neither).
sample_df = data.frame(x=1:6, y=c(1,1,2,2,3,3))
> sample_df
x y
1 1 1
2 2 1
3 3 2
4 4 2
5 5 3
6 6 3
select_y = c(1,3,3)
select_y
> select_y
[1] 1 3 3
Now, I have computed a vector contains the pairs I want to resample, which is select_y above. It means the case-control pair number 1 will be in my new sample, and number 3 will also be in my new sample, but it will occur 2 times since there are two 3. The desired output will be:
x y
1 1
2 1
5 3
6 3
5 3
6 3
I can't find out an efficient way other than writing a for loop...
Solution:
Based on #HubertL , with some modifications, a 'vectorized' approach looks like:
sel_y <- as.data.frame(table(select_y))
> sel_y
select_y Freq
1 1 1
2 3 2
sub_sample_df = sample_df[sample_df$y%in%select_y,]
> sub_sample_df
x y
1 1 1
2 2 1
5 5 3
6 6 3
match_freq = sel_y[match(sub_sample_df$y, sel_y$select_y),]
> match_freq
select_y Freq
1 1 1
1.1 1 1
2 3 2
2.1 3 2
sub_sample_df$Freq = match_freq$Freq
rownames(sub_sample_df) = NULL
sub_sample_df
> sub_sample_df
x y Freq
1 1 1 1
2 2 1 1
3 5 3 2
4 6 3 2
selected_rows = rep(1:nrow(sub_sample_df), sub_sample_df$Freq)
> selected_rows
[1] 1 2 3 3 4 4
sub_sample_df[selected_rows,]
x y Freq
1 1 1 1
2 2 1 1
3 5 3 2
3.1 5 3 2
4 6 3 2
4.1 6 3 2
Another method of doing the same without a loop:
sample_df = data.frame(x=1:6, y=c(1,1,2,2,3,3))
row_names <- split(1:nrow(sample_df),sample_df$y)
select_y = c(1,3,3)
row_num <- unlist(row_names[as.character(select_y)])
ans <- sample_df[row_num,]
I can't find a way without a loop, but at least it's not a for loop, and there is only one iteration per frequency:
sample_df = data.frame(x=1:6, y=c(1,1,2,2,3,3))
select_y = c(1,3,3)
sel_y <- as.data.frame(table(select_y))
do.call(rbind,
lapply(1:max(sel_y$Freq),
function(freq) sample_df[sample_df$y %in%
sel_y[sel_y$Freq>=freq, "select_y"],]))
x y
1 1 1
2 2 1
5 5 3
6 6 3
51 5 3
61 6 3

How Can I check the occurences of the values in each individual in R?

Let's say I have a data.frame that looks like this:
ID B
1 1
1 2
1 1
1 3
2 2
2 2
2 2
2 2
3 2
3 10
3 2
Now I want to check the occurrences of B under each ID, such as that for no. 1, 1 happens twice, 2 and 3 happens 1 time each. And in no. 2, only 2 happens 4 times. How should I accomplish this? I tried to use table in ddply but somehow it did not work. Thanks.
It seems like you may just want a table
> table(dat)
## B
## ID 1 2 3 10
## 1 2 1 1 0
## 2 0 4 0 0
## 3 0 2 0 1
Then the following shows that for ID equal to 1, there are two 1s, one 2, and one 3.
> table(dat)[1, ]
## 1 2 3 10
## 2 1 1 0
And here's an aggregate solution:
> with(data, aggregate(B, list(ID=ID, B=B), length))
ID B x
1 1 1 2
2 1 2 1
3 2 2 4
4 3 2 2
5 1 3 1
6 3 10 1
Here's an approach using "dplyr" (if I understood your question correctly):
library(dplyr)
mydf %.% group_by(ID, B) %.% summarise(count = n())
# Source: local data frame [6 x 3]
# Groups: ID
#
# ID B count
# 1 1 1 2
# 2 1 2 1
# 3 1 3 1
# 4 2 2 4
# 5 3 2 2
# 6 3 10 1
In "plyr", I guess it would be something like:
library(plyr)
ddply(mydf, .(ID, B), summarise, count = length(B))
In base R, you could do something like the following and just remove the rows with 0:
data.frame(table(mydf))
# ID B Freq
# 1 1 1 2
# 2 2 1 0
# 3 3 1 0
# 4 1 2 1
# 5 2 2 4
# 6 3 2 2
# 7 1 3 1
# 8 2 3 0
# 9 3 3 0
# 10 1 10 0
# 11 2 10 0
# 12 3 10 1
And the data.table solution because there must be:
data[, .N, by=c('ID','B')]
The above won't work if you try to apply it to a data.frame. It must be converted to a data.table first. With more recent versions of "data.table", this is most easily done with setDT (as recommended by David in the comments):
library(data.table)
setDT(data)[, .N, by=c('ID', 'B')]

Subsequent row summing in dataframe object

I would like to do subsequent row summing of a columnvalue and put the result into a new columnvariable without deleting any row by another columnvalue .
Below is some R-code and an example that does the trick and hopefully illustrates my question. I was wondering if there is a more elegant way to do since the for loop will be time consuming in my actual object.
Thanks for any feedback.
As an example dataframe:
MyDf <- data.frame(ID = c(1,1,1,2,2,2), Y = 1:6)
MyDf$FIRST <- c(1,0,0,1,0,0)
MyDf.2 <- MyDf
MyDf.2$Y2 <- c(1,3,6,4,9,15)
The purpose of this is so that I can write code that calculates Y2 in MyDf.2 above for each ID, separately.
This is what I came up with and, it does the trick. (Calculating a TEST column in MyDf that has to be equal to Y2 cin MyDf.2)
MyDf$TEST <- NA
for(i in 1:length(MyDf$Y)){
MyDf[i,]$TEST <- ifelse(MyDf[i,]$FIRST == 1, MyDf[i,]$Y,MyDf[i,]$Y + MyDf[i-1,]$TEST)
}
MyDf
ID Y FIRST TEST
1 1 1 1 1
2 1 2 0 3
3 1 3 0 6
4 2 4 1 4
5 2 5 0 9
6 2 6 0 15
MyDf.2
ID Y FIRST Y2
1 1 1 1 1
2 1 2 0 3
3 1 3 0 6
4 2 4 1 4
5 2 5 0 9
6 2 6 0 15
You need ave and cumsum to get the column you want. transform is just to modify your existing data.frame.
> MyDf <- transform(MyDf, TEST=ave(Y, ID, FUN=cumsum))
ID Y FIRST TEST
1 1 1 1 1
2 1 2 0 3
3 1 3 0 6
4 2 4 1 4
5 2 5 0 9
6 2 6 0 15

subtract first value from each subset of dataframe

I want to subtract the smallest value in each subset of a data frame from each value in that subset i.e.
A <- c(1,3,5,6,4,5,6,7,10)
B <- rep(1:4, length.out=length(A))
df <- data.frame(A, B)
df <- df[order(B),]
Subtracting would give me:
A B
1 0 1
2 3 1
3 9 1
4 0 2
5 2 2
6 0 3
7 1 3
8 0 4
9 1 4
I think the output you show is not correct. In any case, from what you explain, I think this is what you want. This uses ave base function:
within(df, { A <- ave(A, B, FUN=function(x) x-min(x))})
A B
1 0 1
5 3 1
9 9 1
2 0 2
6 2 2
3 0 3
7 1 3
4 0 4
8 1 4
Of course there are other alternatives such as plyr and data.table.
Echoing Arun's comment above, I think your expected output might be off. In any event, you should be able to use can use tapply to calculate subsets and then use match to line those subsets up with the original values:
subs <- tapply(df$A, df$B, min)
df$A <- df$A - subs[match(df$B, names(subs))]
df
A B
1 0 1
5 3 1
9 9 1
2 0 2
6 2 2
3 0 3
7 1 3
4 0 4
8 1 4

Resources