Removing rows from a dataset based on conditional statement across factors - r

I am struggling to figure out how to remove rows from a dataset based on conditions across multiple factors in a large dataset. Here is some example data to illustrate the problem I am having with a smaller data frame:
Code<-c("A","B","C","D","C","D","A","A")
Value<-c(1, 2, 3, 4, 1, 2, 3, 4)
data<-data.frame(Code, Value) # no cbind(), so Value stays numeric
data
Code Value
1 A 1
2 B 2
3 C 3
4 D 4
5 C 1
6 D 2
7 A 3
8 A 4
I want to remove the rows where Code is A and Value is < 2 from the dataset. I understand how to select the rows where Code is A and Value < 2, but I can't figure out how to remove them without also removing the rows of A with Value > 2, while keeping the rows of the other codes whose Value is less than 2.
#Easy to select for values of A less than 2
data2<- subset(data, (Code == "A" & Value < 2))
data2
Code Value
1 A 1
#But I want to remove values of A less than 2 without also removing values of A that are greater than 2:
data1<- subset(data, (Code != "A" & Value > 2))
data1
Code Value
3 C 3
4 D 4
### just using Value > 2 does not allow me to include values that are less than 2 for the other Codes (B,C,D):
data2<- subset(data, Value > 2)
data2
Code Value
3 C 3
4 D 4
7 A 3
8 A 4
My ideal dataset would look like this:
data
Code Value
2 B 2
3 C 3
4 D 4
5 C 1
6 D 2
7 A 3
8 A 4
I have tried different iterations of filter(), subset(), and select() but I can't figure out the correct conditional statement that allows me to remove the desired combination of levels of multiple factors. Any suggestions would be greatly appreciated.
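One way this could be approached, as a minimal sketch assuming the example data above: negate the whole combined condition instead of negating each part separately. By De Morgan's law, !(Code == "A" & Value < 2) is the same as Code != "A" | Value >= 2, so only the rows that satisfy both parts are dropped. The names kept and kept2 are chosen here for illustration.

```r
# Rebuild the example data (Value is kept numeric by avoiding cbind()).
Code  <- c("A", "B", "C", "D", "C", "D", "A", "A")
Value <- c(1, 2, 3, 4, 1, 2, 3, 4)
data  <- data.frame(Code, Value)

# Drop only the rows where BOTH conditions hold.
kept <- subset(data, !(Code == "A" & Value < 2))

# Equivalent base-R indexing form.
kept2 <- data[!(data$Code == "A" & data$Value < 2), ]

print(kept)
```

The same negated condition should also work inside dplyr's filter(), since filter() keeps the rows for which the expression is TRUE.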

Related

R randomly allocate different values from vector to dataframe column based on condition

I've got a dataset called df of answers to a question Q1
df = data.frame(ID = c(1:5), Q1 = c(1,1,3,4,2))
I also have a vector where each element is a word
words = c("good","bad","better","improved","fascinating","improvise")
My Objective
IF Q1 = 1 or Q1 = 2, then randomly assign a value from vector words to a newly created column called followup
My Attempt
#If answer to Q1 is 1 or 2, then randomly allocate a word to newly created column "followup"
#Else leave blank
df$followup=ifelse(df$Q1==1 | df$Q1==2,sample(words,1),"")
However doing this leads to repetition of the same randomly selected word for each row that contains a 1 or 2.
ID Q1 followup
1 1 1 fascinating
2 2 1 fascinating
3 3 3
4 4 4
5 5 2 fascinating
I'd like every word to be randomized and different.
Any inputs would be highly appreciated.
For that we may use
df$followup[df$Q1 %in% 1:2] <- sample(words, sum(df$Q1 %in% 1:2))
df
# ID Q1 followup
# 1 1 1 better
# 2 2 1 improvise
# 3 3 3 <NA>
# 4 4 4 <NA>
# 5 5 2 bad
Since we are generating those values in a single call, replace = FALSE (the default value) in sample gives the desired result of all the values being different.
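A hedged sketch of the same idea wrapped with a guard: sample() without replacement fails when more rows match than there are words. The hypothetical helper assign_followup() below (not part of the original answer) falls back to replace = TRUE in that case, at the cost of uniqueness.

```r
# Hypothetical helper, named here for illustration only.
assign_followup <- function(df, words) {
  hit <- df$Q1 %in% 1:2          # rows that should receive a word
  n   <- sum(hit)
  df$followup <- NA_character_   # non-matching rows stay NA
  # Without replacement while enough words exist; otherwise allow repeats.
  df$followup[hit] <- sample(words, n, replace = n > length(words))
  df
}

set.seed(1)  # only to make the illustration repeatable
df    <- data.frame(ID = 1:5, Q1 = c(1, 1, 3, 4, 2))
words <- c("good", "bad", "better", "improved", "fascinating", "improvise")
res   <- assign_followup(df, words)
print(res)
```

With the example data only three rows match and six words are available, so the words assigned are all different.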

For loop to paste rows to create new dataframe from existing dataframe

New to SO, but can't figure out how to get this code to work. I have a dataframe that is very large, and is set up like this:
Number Year Type Amount
1 1 A 5
1 2 A 2
1 3 A 7
1 4 A 1
1 1 B 5
1 2 B 11
1 3 B 0
1 4 B 2
This continues for multiple numbers. I want to take this dataframe and make a new dataframe that pastes two rows together, covering every pairing (for example, row 1 and row 2, row 1 and row 3, row 1 and row 4, row 2 and row 3, row 2 and row 4), so that each combination of years appears once within each type and number.
Example output:
Number Year Type Amount Number Year Type Amount
1 1 A 5 1 2 A 2
1 1 A 5 1 3 A 7
1 1 A 5 1 4 A 1
1 2 A 2 1 3 A 7
1 2 A 2 1 4 A 1
1 3 A 7 1 4 A 1
I thought that I would do a for loop to loop within number and type, but I do not know how to make the rows paste from there, or how to ensure that I am only getting the combinations of the rows once. For example:
for(i in 1:n_number){
for(j in 1:n_type){
....}}
Any tips would be appreciated! I am relatively new to coding, so I don't know if I should be using a for loop at all. Thank you!
df <- data.frame(Number= rep(1,8),
Year = rep(c(1:4),2),
Type = rep(c('A','B'),each=4),
Amount=c(5,2,7,1,5,11,0,2))
My interpretation is that you want to create a dataframe with all row combinations, where Number and Type are the same and Year is different.
First suggestion - join on Number and Type, then drop the pairs that have the same Year. I added an index so that each pair appears only once (1 with 2, but not 2 with 1).
df$index <- 1:nrow(df)
out <- merge(df,df,by=c("Number","Type"))
out <- out[which(out$index.x<out$index.y & out$Year.x!=out$Year.y),]
Second suggestion - if you want to see a version using a loop.
out2 <- NULL
for (i in seq_len(nrow(df)-1)){
  for (j in (i+1):nrow(df)){
    # keep the pair only when Number and Type match and Year differs
    if(df[i,"Year"]!=df[j,"Year"] && df[i,"Number"]==df[j,"Number"] && df[i,"Type"]==df[j,"Type"]){
      out2 <- rbind(out2,cbind(df[i,],df[j,]))
    }
  }
}
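As a further sketch (not from the answer above), combn() can enumerate each pair of rows exactly once within a Number/Type group, which avoids the full self-join. It assumes the example df and that every group has at least two rows; pair_rows() and the ".2" column suffix are hypothetical choices made for illustration.

```r
df <- data.frame(Number = rep(1, 8),
                 Year   = rep(1:4, 2),
                 Type   = rep(c("A", "B"), each = 4),
                 Amount = c(5, 2, 7, 1, 5, 11, 0, 2))

# Hypothetical helper: pair up the rows of one group, each pair once.
pair_rows <- function(g) {
  idx   <- combn(nrow(g), 2)          # 2 x choose(n, 2) matrix of row positions
  left  <- g[idx[1, ], ]
  right <- g[idx[2, ], ]
  names(right) <- paste0(names(right), ".2")  # avoid duplicate column names
  cbind(left, right)
}

# One data frame per Number/Type combination, then pair rows within each.
groups <- split(df, list(df$Number, df$Type), drop = TRUE)
pairs  <- do.call(rbind, lapply(groups, pair_rows))
rownames(pairs) <- NULL
print(head(pairs))
```

Because combn() lists positions in increasing order, the earlier row always lands on the left, which matches the layout of the example output in the question.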

Take the subsets of a data.frame with the same feature and select a single row from each subset

Suppose I have a matrix in R as follows:
ID Value
1 10
2 5
2 8
3 15
4 7
4 9
...
What I need is a random sample where every element is represented once and only once.
That means that ID 1 will be chosen, one of the two rows with ID 2, ID 3 will be chosen, one of the two rows with ID 4, etc...
There can be more than two duplicates.
I'm trying to figure out the most R-esque way to do this without subsetting and sampling the subsets?
Thanks!
tapply across the rownames and grab a sample of 1 in each ID group:
dat[tapply(rownames(dat),dat$ID,FUN=sample,1),]
# ID Value
#1 1 10
#3 2 8
#4 3 15
#6 4 9
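If your data is truly a matrix and not a data.frame, row names may be missing and $ does not work, so select the ID column by name and convert the sampled positions back to integers:
dat[as.integer(tapply(as.character(seq_len(nrow(dat))),dat[,"ID"],FUN=sample,1)),]
Don't be tempted to remove the as.character, as sample gives unintended results when a single number is passed to it: sample(4,1) draws from 1:4 rather than returning 4. E.g.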
replicate(10, sample(4,1) )
#[1] 1 1 4 2 1 2 2 2 3 4
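To make the pitfall concrete, here is a small sketch (an illustration, not taken from the answers) that draws one row position per ID by indexing with sample.int(), which never reinterprets a single number as a range. pick_one() is a hypothetical helper name, and the data frame mirrors the example in the question.

```r
dat <- data.frame(ID = c(1, 2, 2, 3, 4, 4), Value = c(10, 5, 8, 15, 7, 9))

# Hypothetical helper: choose one element of x, safe for length-1 input.
pick_one <- function(x) x[sample.int(length(x), 1)]

# Row positions grouped by ID, then one position drawn from each group.
rows_by_id <- split(seq_len(nrow(dat)), dat$ID)
chosen     <- vapply(rows_by_id, pick_one, integer(1))

picked <- dat[chosen, ]
print(picked)
```

IDs 1 and 3 have a single row each, so their positions (1 and 4) are always returned unchanged, while IDs 2 and 4 vary between runs.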
You can do that with dplyr like so:
library(dplyr)
df %>% group_by(ID) %>% sample_n(1)
The idea is to reorder the rows randomly and then remove duplicates in that order.
df <- read.table(text="ID Value
1 10
2 5
2 8
3 15
4 7
4 9", header=TRUE)
df2 <- df[sample(nrow(df)), ]
df2[!duplicated(df2$ID), ]

From a set of pairs, find all subsets s.t. no pair in the subset shares any element with a pair not in the subset

I have a set of pairs. Each pair is represented as [i,1:2]. That is, the ith pair are the numbers in the first and second column in the ith row.
I need to sort these pairs into distinct groups, s.t. there is no element in any pair in the jth group that is in any group not j. For example:
EXAMPLE 1: DATA
> col1 <- c(3, 4, 6, 7, 10, 8)
> col2 <- c(6, 7, 3, 4, 3, 1)
>
> dat <- cbind(col1, col2)
> rownames(dat) <- 1:nrow(dat)
>
> dat
col1 col2
1 3 6
2 4 7
3 6 3
4 7 4
5 10 3
6 8 1
For all pairs, it doesn't matter if the number is in column 1 or column 2, the pairs should be sorted into groups s.t. every number in every pair in every group exists only in one group. So the solved example would look like this.
col1 col2 groups
1 3 6 1
2 4 7 2
3 6 3 1
4 7 4 2
5 10 3 1
6 8 1 3
Rows 1, 3, and 5 are grouped together because 1 and 3 contain the same numbers and 5 shares the number 3, so it must be grouped with them. 2 and 4 share the same distinct numbers so they are grouped together and 6 has unique numbers so it is left alone.
If we change the data slightly, note the following.
EXAMPLE 2: NEW DATA
Note what happens when we add a row that shares an element with row 6 and row 5.
col1 col2 groups
1 3 6 1
2 4 7 2
3 6 3 1
4 7 4 2
5 10 3 1
6 8 1 1
7 1 10 1
The 10 in the 7th row connects it to the first group because it shares an element with the 5th row. It also shares an element with the 6th row (the number 1), so the 6th row would now be in group 1 as well.
PROBLEM
Is there a simple way to form the groups? A vector operation? A sorting algorithm? It gets very nasty very quickly if you try to do it with a loop, since each subsequent row can change the membership of earlier rows, as demonstrated in the example.
To draw on the old answer at "identify groups of linked episodes which chain together", which assigns a group to each individual value, you could try this to assign a group to each linked pair:
library(igraph)
g <- graph_from_data_frame(dat)
links <- data.frame(col1=V(g)$name,group=components(g)$membership)
merge(dat,links,by="col1",all.x=TRUE,sort=FALSE)
# col1 col2 group
#1 3 6 1
#2 4 7 2
#3 6 3 1
#4 7 4 2
#5 10 3 1
#6 8 1 3
Your elements can be regarded as vertices in an undirected graph, and your pairs can be regarded as edges, and then (assuming that you want to find groups of minimal size -- if you don't, then e.g. the entire set of pairs could be labelled "Group 1") the groups you're looking for are the connected components in this graph. They can all be found in linear time with a depth-first or breadth-first search.
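The connected-components idea can also be sketched without igraph. The following is a minimal union-find illustration in base R, under the assumption that the pairs sit in a two-column matrix as in the example; group_pairs() and its internals are hypothetical names, and group labels are simply numbered by first appearance.

```r
# Hypothetical helper: label each row (pair) with its connected component.
group_pairs <- function(pairs) {
  vals   <- unique(as.vector(pairs))   # distinct elements = graph vertices
  parent <- seq_along(vals)            # each vertex starts as its own root

  find <- function(i) {                # follow parents up to the root
    while (parent[i] != i) i <- parent[i]
    i
  }

  a <- match(pairs[, 1], vals)
  b <- match(pairs[, 2], vals)
  for (k in seq_len(nrow(pairs))) {    # union the two ends of every pair
    ra <- find(a[k]); rb <- find(b[k])
    if (ra != rb) parent[rb] <- ra
  }

  roots <- vapply(a, find, integer(1)) # root of each row's first element
  match(roots, unique(roots))          # relabel roots as 1, 2, 3, ...
}

dat  <- cbind(col1 = c(3, 4, 6, 7, 10, 8), col2 = c(6, 7, 3, 4, 3, 1))
dat2 <- rbind(dat, c(1, 10))           # the extra row from Example 2

print(cbind(dat,  groups = group_pairs(dat)))
print(cbind(dat2, groups = group_pairs(dat2)))
```

This version omits path compression and union by rank to stay short, so it is not the linear-time search mentioned above; for large inputs the igraph route is probably the safer choice.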

Multirow deletion: delete row depending on other row

I'm stuck with a quite complex problem. I have a data frame with three columns: id, info and a row number (row). The data looks like this:
id info row
1 a 1
1 b 2
1 c 3
2 a 4
3 b 5
3 a 6
4 b 7
4 c 8
What I want to do now is delete all other rows of an id if one of its rows contains the info a. For example, rows 2 and 3 should be removed because row 1's column info contains the value a. Please note that the info values are not ordered (id 3/row 5 & 6) and cannot be ordered due to other data limitations.
I solved the case using a for loop:
# select all id containing an "a"-value
a_val <- data$id[grep("a", data$info)]
# check for every id containing an "a"-value
for(i in a_val) {
  temp_data <- data[which(data$id == i),]
  # only go on if the given id contains more than one row
  if (nrow(temp_data) > 1) {
    # loop over every row of this id
    for (ii in seq_len(nrow(temp_data))) {
      if (temp_data$info[ii] != "a") {
        temp <- temp_data$row[ii]
        if (!exists("delete_rows")) {
          delete_rows <- temp
        } else {
          delete_rows <- c(delete_rows, temp)
        }
      }
    }
  }
}
My solution works quite well. Nevertheless, it is very, very slow, as the original data contains more than 700k rows and more than 150k rows with an "a"-value.
I could use a foreach loop with 4 cores to speed it up, but maybe someone could give me a hint for a better solution.
Best regards,
Arne
[UPDATE]
The outcome should be:
id info row
1 a 1
2 a 4
3 a 6
4 b 7
4 c 8
Here is one possible solution.
First find ids where info contains "a":
ids <- with(data, unique(id[info == "a"]))
Subset the data:
subset(data, (id %in% ids & info == "a") | !id %in% ids)
Output:
id info row
1 1 a 1
4 2 a 4
6 3 a 6
7 4 b 7
8 4 c 8
An alternative solution (maybe harder to decipher):
subset(data, info == "a" | !rep.int(tapply(info, id, function(x) any(x == "a")),
table(id)))
Note: @BenBarnes found out that this solution only works if the data frame is ordered by id.
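A sketch of an order-independent variant of the same idea, using ave() to broadcast a per-id flag back to every row. This is an illustration rather than part of the original answer; it assumes the example data from the question, and the names shuffled, has_a and result are hypothetical.

```r
data <- read.table(text = "id info row
1 a 1
1 b 2
1 c 3
2 a 4
3 b 5
3 a 6
4 b 7
4 c 8", header = TRUE)

# Shuffle the rows to show that ordering by id is not required.
set.seed(42)
shuffled <- data[sample(nrow(data)), ]

# TRUE for every row whose id has at least one "a".
has_a <- ave(shuffled$info == "a", shuffled$id, FUN = any)

# Keep the "a" rows, plus all rows of ids without any "a".
result <- shuffled[shuffled$info == "a" | !has_a, ]
result <- result[order(result$row), ]
print(result)
```

The final order() call only restores the original row order for display; the filtering itself does not depend on it.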
You might want to investigate the data.table package:
EDIT: If the row variable is not a sequential numbering of each row in your data (as I assumed it was), you could create such a variable to obtain the original row order:
library(data.table)
# Create data.table of your data
dt <- as.data.table(data)
# Create index to maintain row order
dt[, idx := seq_len(nrow(dt))]
# Set a key on id and info
setkeyv(dt, c("id", "info"))
# Determine unique ids
uid <- dt[, unique(id)]
# subset your data to select rows with "a"
dt2 <- dt[J(uid, "a"), nomatch = 0]
# identify rows of dataset where the id doesn't have an "a"
dt3 <- dt[J(dt2[, setdiff(uid, id)])]
# rbind those two data.tables together
(dt4 <- rbind(dt2, dt3))
# id info row idx
# 1: 1 a 1 1
# 2: 2 a 4 4
# 3: 3 a 6 6
# 4: 4 b 7 7
# 5: 4 c 8 8
# And if you need the original ordering of rows,
dt5 <- dt4[order(idx)]
Note that setting a key for the data.table will order the rows according to the key columns. The last step (creating dt5) sets the row order back to the original.
Here is a way using ddply:
df <- read.table(text="id info row
1 a 1
1 b 2
1 c 3
2 a 4
3 b 5
3 a 6
4 b 7
4 c 8",header=TRUE)
library("plyr")
ddply(df,.(id),subset,rep(!'a'%in%info,length(info))|info=='a')
Returns:
id info row
1 1 a 1
2 2 a 4
3 3 a 6
4 4 b 7
5 4 c 8
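If df is this (re Sacha above), you can use match, which just finds the index of the first occurrence. Note that this keeps only the first 'a' row plus every row that is not 'a', which differs from the outcome requested in the update: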
df <- read.table(text="id info row
1 a 1
1 b 2
1 c 3
2 a 4
3 b 5
3 a 6
4 b 7
4 c 8",header=TRUE)
# the first info row matching 'a' and all other rows that are not 'a'
with(df, df[c(match('a',info), which(info != 'a')),])
id info row
1 1 a 1
2 1 b 2
3 1 c 3
5 3 b 5
7 4 b 7
8 4 c 8
Take a look at subset(); it is quite easy to use and should solve your problem.
You just need to specify a condition on the column you want to subset on; the condition can also involve several columns.
http://stat.ethz.ch/R-manual/R-devel/library/base/html/subset.html
http://www.statmethods.net/management/subset.html
