Multirow deletion: delete row depending on other row - r

I'm stuck on a rather complex problem. I have a data frame with three columns: id, info and row. The data looks like this:
id info row
1 a 1
1 b 2
1 c 3
2 a 4
3 b 5
3 a 6
4 b 7
4 c 8
What I want to do now is to delete all other rows of an id if one of that id's rows contains the info value a. For example, rows 2 and 3 should be removed because row 1's info column contains the value a. Please note that the info values are not ordered (see id 3, rows 5 and 6) and cannot be ordered due to other data limitations.
I solved the case using a for loop:
# select all ids whose info contains an "a" value
a_val <- data$id[grep("a", data$info)]
# check every id containing an "a" value
for (i in a_val) {
  temp_data <- data[which(data$id == i), ]
  # only go on if the given id contains more than one row
  if (nrow(temp_data) > 1) {
    for (ii in seq_len(nrow(temp_data))) {
      if (temp_data$info[ii] != "a") {
        # remember the row number for later deletion
        temp <- temp_data$row[ii]
        if (!exists("delete_rows")) {
          delete_rows <- temp
        } else {
          delete_rows <- c(delete_rows, temp)
        }
      }
    }
  }
}
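For completeness, once delete_rows has been collected, the flagged rows can be dropped by matching on the row column (a sketch I'm adding; this final step was not shown in the original question):
# drop the flagged rows, matching on the row column by value
data <- data[!data$row %in% delete_rows, ]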
My solution works quite well. Nevertheless, it is very, very slow, as the original data contains more than 700k rows and more than 150k rows with an "a" value.
I could use a foreach loop with 4 cores to speed it up, but maybe someone could give me a hint for a better solution.
Best regards,
Arne
[UPDATE]
The outcome should be:
id info row
1 a 1
2 a 4
3 a 6
4 b 7
4 c 8

Here is one possible solution.
First find ids where info contains "a":
ids <- with(data, unique(id[info == "a"]))
Subset the data:
subset(data, (id %in% ids & info == "a") | !id %in% ids)
Output:
id info row
1 1 a 1
4 2 a 4
6 3 a 6
7 4 b 7
8 4 c 8
An alternative solution (maybe harder to decipher):
subset(data, info == "a" | !rep.int(tapply(info, id, function(x) any(x == "a")),
                                    table(id)))
Note: @BenBarnes pointed out that this solution only works if the data frame is ordered by id.
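A variant that does not depend on the row order (my own sketch, not part of the original answer): ave() returns its result aligned with the original rows, so the ordering caveat above goes away.
# keep a row if its info is "a", or if its id contains no "a" at all
subset(data, info == "a" | !ave(info == "a", id, FUN = any))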

You might want to investigate the data.table package:
EDIT: If the row variable is not a sequential numbering of each row in your data (as I assumed it was), you could create such a variable to obtain the original row order:
library(data.table)
# Create data.table of your data
dt <- as.data.table(data)
# Create index to maintain row order
dt[, idx := seq_len(nrow(dt))]
# Set a key on id and info
setkeyv(dt, c("id", "info"))
# Determine unique ids
uid <- dt[, unique(id)]
# subset your data to select rows with "a"
dt2 <- dt[J(uid, "a"), nomatch = 0]
# identify rows of dataset where the id doesn't have an "a"
dt3 <- dt[J(dt2[, setdiff(uid, id)])]
# rbind those two data.tables together
(dt4 <- rbind(dt2, dt3))
# id info row idx
# 1: 1 a 1 1
# 2: 2 a 4 4
# 3: 3 a 6 6
# 4: 4 b 7 7
# 5: 4 c 8 8
# And if you need the original ordering of rows,
dt5 <- dt4[order(idx)]
Note that setting a key for the data.table will order the rows according to the key columns. The last step (creating dt5) sets the row order back to the original.
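With more recent versions of data.table, the same logic can be written without setting a key at all. This is a sketch of my own (the .I idiom is an assumption, not part of the original answer); it also preserves the original row order, so no idx column is needed:
library(data.table)
dt <- as.data.table(data)
# per id: if the group contains an "a", keep only the "a" rows; otherwise keep all rows
dt[dt[, .I[if (any(info == "a")) info == "a" else TRUE], by = id]$V1]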

Here is a way using ddply:
df <- read.table(text="id info row
1 a 1
1 b 2
1 c 3
2 a 4
3 b 5
3 a 6
4 b 7
4 c 8",header=TRUE)
library("plyr")
ddply(df, .(id), subset, rep(!'a' %in% info, length(info)) | info == 'a')
Returns:
id info row
1 1 a 1
2 2 a 4
3 3 a 6
4 4 b 7
5 4 c 8
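For comparison, a dplyr translation of the same per-group logic (my sketch, assuming the dplyr package; not part of the original answer):
library(dplyr)
df %>%
  group_by(id) %>%
  filter(if (any(info == "a")) info == "a" else TRUE) %>%
  ungroup()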

If df is as follows (re: Sacha's answer above), you can use match, which finds the index of the first occurrence:
df <- read.table(text="id info row
1 a 1
1 b 2
1 c 3
2 a 4
3 b 5
3 a 6
4 b 7
4 c 8",header=TRUE)
# the first info row matching 'a' and all other rows that are not 'a'
with(df, df[c(match('a',info), which(info != 'a')),])
id info row
1 1 a 1
2 1 b 2
3 1 c 3
5 3 b 5
7 4 b 7
8 4 c 8

Try taking a look at subset; it's quite easy to use and it will solve your problem.
You just need to specify the value of the column that you want to subset on; alternatively, you can condition on more columns.
http://stat.ethz.ch/R-manual/R-devel/library/base/html/subset.html
http://www.statmethods.net/management/subset.html
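For instance, applied to the data above, a single subset call might look like this (a sketch of mine, not from the original answer):
# keep "a" rows, plus all rows of ids that never contain an "a"
subset(data, info == "a" | !id %in% id[info == "a"])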

Related

R Check for levels within a group and duplicate row if not present

I have a problem in Shiny which I will show with a simple example:
I have the following data:
Group<-c("A","A","B","C","C","D")
Value<-c(1,2,6,7,3,9)
df<-data.frame(Group, Value)
Group Value
A 1
A 2
B 6
C 7
C 3
D 9
Then I add a column, num, to see how many reps a group has:
df$num <- ave(df$Value, df$Group, FUN = seq_along)
Group Value num
A 1 1
A 2 2
B 6 1
C 7 1
C 3 2
D 9 1
Now I would like to check whether each group contains a 2nd rep and, if not, duplicate the group's 1st row (the one with num = 1), setting num to 2.
So I would like to end up with:
Group Value num
A 1 1
A 2 2
B 6 1
B 6 2 #this row is added
C 7 1
C 3 2
D 9 1
D 9 2 #this row is added
I have tried searching for a solution, but I mainly found questions about conditions based on a certain value, rather than conditions within a group.
Could someone help me? I would appreciate it a lot!
Can this code do the trick?
res <- lapply(unique(df$Group), function(x) {
  a <- df[df$Group == x, ]
  # if the group has only one row, duplicate it and renumber
  if (nrow(a) == 1) {
    a <- a[rep(row.names(a), 2), ]
    a$num <- 1:2
  }
  a
})
do.call(rbind, res)
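A vectorized alternative that avoids the per-group lapply (my sketch, assuming num is already 1 for every singleton group, as produced by the ave call above):
# find groups with a single row, duplicate them with num = 2, and reorder
singles <- df[ave(df$Value, df$Group, FUN = length) == 1, ]
singles$num <- 2
out <- rbind(df, singles)
out[order(out$Group, out$num), ]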

Summing the number of times a value appears in either of 2 columns

I have a large data set of around 32 million rows. I have information on the telephone number, the origin of the call, and the destination.
For each telephone number, I want to count the number of times it appeared either as Origin or as Destination.
An example data table is as follows:
library(data.table)
dt <- data.table(Tel=seq(1,5,1), Origin=seq(1,5,1), Destination=seq(3,7,1))
Tel Origin Destination
1: 1 1 3
2: 2 2 4
3: 3 3 5
4: 4 4 6
5: 5 5 7
I have working code, but it takes too long for my data since it involves a for loop. How can I optimize it?
Here it is:
for (i in unique(dt$Tel)) {
  index <- (dt$Origin == i | dt$Destination == i)
  dt[dt$Tel == i, "N"] <- sum(index)
}
Result:
Tel Origin Destination N
1: 1 1 3 1
2: 2 2 4 1
3: 3 3 5 2
4: 4 4 6 2
5: 5 5 7 2
Here N says that Tel = 1 appears once, Tel = 2 appears once, and Tel = 3, 4 and 5 each appear twice.
We can do a melt and match:
dt[, N := melt(dt, id.var = "Tel")[, tabulate(match(value, Tel))]]
Another option is to loop over columns 2 and 3, use %in% to check whether each value of Tel is present, add the resulting logical vectors together with Reduce and + to get the count for each Tel, and assign (:=) the result to N:
dt[, N := Reduce(`+`, lapply(.SD, function(x) Tel %in% x)), .SDcols = 2:3]
dt
# Tel Origin Destination N
#1: 1 1 3 1
#2: 2 2 4 1
#3: 3 3 5 2
#4: 4 4 6 2
#5: 5 5 7 2
A second method constructs a temporary data.table which is then joined to the original. This is longer and likely less efficient than @akrun's, but it can be useful to see.
# get temporary data.table as the sum of origin and destination frequencies
temp <- setnames(data.table(table(unlist(dt[, .(Origin, Destination)], use.names = FALSE))),
                 c("Tel", "N"))
# turn the variables into integers (Tel is the name of the table above, and thus character)
temp <- temp[, lapply(temp, as.integer)]
Now, join the original table on
dt <- temp[dt, on="Tel"]
dt
Tel N Origin Destination
1: 1 1 1 3
2: 2 1 2 4
3: 3 2 3 5
4: 4 2 4 6
5: 5 2 5 7
You can get the desired column order using setcolorder
setcolorder(dt, c("Tel", "Origin", "Destination", "N"))

remove duplicate row based only of previous row

I'm trying to remove duplicate rows from a data frame, based only on the previous row. The duplicated and unique functions remove all duplicates, leaving you with only unique rows, which is not what I want.
I've illustrated the problem here with a loop. I need to vectorize this because my actual data set is much too large to use a loop on.
x <- c(1,1,1,1,3,3,3,4)
y <- c(1,1,1,1,3,3,3,4)
z <- c(1,2,1,1,3,2,2,4)
xy <- data.frame(x,y,z)
xy
x y z
1 1 1 1
2 1 1 2
3 1 1 1
4 1 1 1 #this should be removed
5 3 3 3
6 3 3 2
7 3 3 2 #this should be removed
8 4 4 4
# loop that produces the desired output
toRemove <- NULL
for (i in 2:nrow(xy)) {
  test <- as.vector(xy[i, ] == xy[i - 1, ])
  if (!(FALSE %in% test)) {
    toRemove <- c(toRemove, i)  # build a vector of rows to remove
  }
}
xy[-toRemove, ]  # exclude rows
x y z
1 1 1 1
2 1 1 2
3 1 1 1
5 3 3 3
6 3 3 2
8 4 4 4
I've tried using dplyr's lag function, but it only works on single columns; when I try to run it over all 3 columns it doesn't work.
ifelse(xy[,1:3] == lag(xy[,1:3],1), NA, xy[,1:3])
Any advice on how to accomplish this?
It looks like we want to remove a row if it is the same as the row above:
# index: keep the first row, and any row that differs from the one above
ix <- c(TRUE, rowSums(tail(xy, -1) == head(xy, -1)) != ncol(xy))
# filter
xy[ix, ]
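Since the question tried dplyr's lag, here is how that route could look (a sketch of my own, assuming the dplyr package): compare each column to its lag and keep rows where at least one column differs.
library(dplyr)
# the first row is always kept; lag() is NA there, so handle it explicitly
xy %>%
  filter(row_number() == 1 | x != lag(x) | y != lag(y) | z != lag(z))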
Why don't you just iterate over the rows while keeping track of the previous row, comparing it to the current one?
When they match, remember that row's position, remove it, and then restart the iteration from the beginning.
Don't delete a row while iterating, because the row indices will shift under you.

Order a data frame only from a certain row index to a certain row index

Let's say we have a DF like this:
col1 col2
A 1
A 5
A 3
A 16
B 5
B 4
B 3
C 7
C 2
I'm trying to order col2, but only within equal values of col1. Better said, I want it to look like this:
col1 col2
A 1
A 3
A 5
A 16
B 3
B 4
B 5
C 2
C 7
So order col2 only within the A, B and C groups, not the entire col2 column.
x <- function() {
  values <- unique(DF[, 1])
  for (i in values) {
    currentData <- which(DF$col1 == i)
    ## what to do here?
    data[order(data[, 2]), ]
  }
}
So in currentData I have the indexes of the col2 values for only the As, Bs, etc. But how do I order only those items in my entire DF data frame? Is it somehow possible to tell the order function to operate only on certain row indexes of a data frame?
ave groups its first argument by the grouping variables that follow, and applies the named function within each group. Here is an application of ave that sorts within groups:
DF$col2 <- ave(DF$col2, DF$col1, FUN=sort)
DF
## col1 col2
## 1 A 1
## 2 A 3
## 3 A 5
## 4 A 16
## 5 B 3
## 6 B 4
## 7 B 5
## 8 C 2
## 9 C 7
This will work even if the values in col1 are not consecutive, leaving them in their original positions.
If that is not an important consideration, there are better ways to do this, such as the answer by @user314046.
It seems that
my_df[with(my_df, order(col1, col2)), ]
will do what you want; this just sorts the data frame by col1 and col2. If you don't want to order by col1, a method is provided in the other answer.
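As a further illustration, the same within-group sort in dplyr might look like this (my sketch, assuming the dplyr package; not from either answer):
library(dplyr)
DF %>%
  group_by(col1) %>%
  arrange(col2, .by_group = TRUE) %>%
  ungroup()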

Using R: Make a new column that counts the number of times 'n' conditions from 'n' other columns occur

I have columns 1 and 2 (ID and value). Next, I would like a count column that lists the number of times the same value occurs per ID. If it occurs more than once, it will obviously repeat the value. There are other variables in this data set, but the new count variable needs to be conditional on only 2 of them. I have scoured this site, but I can't find a way to make the new variable conditional on more than one variable.
ID Value Count
1 a 2
1 a 2
1 b 1
2 a 2
2 a 2
3 a 1
3 b 3
3 b 3
3 b 3
Thank you in advance!
You can use ave:
df <- within(df, Count <- ave(ID, list(ID, Value), FUN=length))
You can use ddply from plyr package:
library(plyr)
df1<-ddply(df,.(ID,Value), transform, count1=length(ID))
>df1
ID Value Count count1
1 1 a 2 2
2 1 a 2 2
3 1 b 1 1
4 2 a 2 2
5 2 a 2 2
6 3 a 1 1
7 3 b 3 3
8 3 b 3 3
9 3 b 3 3
> identical(df1$Count,df1$count1)
[1] TRUE
Update: As suggested by @Arun, you can replace transform with mutate if you are working with a large data.frame.
Of course, data.table also has a solution!
data[, Count := .N, by = list(ID, Value)]
The built-in constant, ".N", is a length 1 vector reporting the number of observations in each group.
Because := adds the column by reference, the result keeps the original dimensions, so no join back to the initial data is needed.
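A self-contained run of the data.table approach, using values matching the question's example table (a sketch I'm adding for illustration):
library(data.table)
dt <- data.table(ID = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
                 Value = c("a", "a", "b", "a", "a", "a", "b", "b", "b"))
dt[, Count := .N, by = list(ID, Value)]
dt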
