Remove duplicate rows based only on the previous row - R

I'm trying to remove duplicate rows from a data frame, based only on the previous row. The duplicated and unique functions will remove all duplicates, leaving you only with unique rows, which is not what I want.
I've illustrated the problem here with a loop. I need to vectorize this because my actual data set is much too large to use a loop on.
x <- c(1,1,1,1,3,3,3,4)
y <- c(1,1,1,1,3,3,3,4)
z <- c(1,2,1,1,3,2,2,4)
xy <- data.frame(x,y,z)
xy
x y z
1 1 1 1
2 1 1 2
3 1 1 1
4 1 1 1 #this should be removed
5 3 3 3
6 3 3 2
7 3 3 2 #this should be removed
8 4 4 4
# loop that produces desired output
toRemove <- NULL
for (i in 2:nrow(xy)){
test <- as.vector(xy[i,] == xy[i-1,])
if (!(FALSE %in% test)){
toRemove <- c(toRemove, i) #build a vector of rows to remove
}
}
xy[-toRemove,] #exclude rows
x y z
1 1 1 1
2 1 1 2
3 1 1 1
5 3 3 3
6 3 3 2
8 4 4 4
I've tried using dplyr's lag function, but it only works on single columns. When I try to run it over all 3 columns, it doesn't work.
ifelse(xy[,1:3] == lag(xy[,1:3],1), NA, xy[,1:3])
Any advice on how to accomplish this?

It looks like you want to remove a row if it is the same as the row above it:
# make an index, if cols not same as above
ix <- c(TRUE, rowSums(tail(xy, -1) == head(xy, -1)) != ncol(xy))
# filter
xy[ix, ]
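The same comparison can be wrapped in a small helper. This is only a sketch: the name drop_consecutive_dups is made up for illustration, and it assumes the data frame contains no NA values (a comparison with NA would yield NA in the index).

```r
# Sample data from the question
xy <- data.frame(x = c(1, 1, 1, 1, 3, 3, 3, 4),
                 y = c(1, 1, 1, 1, 3, 3, 3, 4),
                 z = c(1, 2, 1, 1, 3, 2, 2, 4))

# Hypothetical helper: drop rows identical to the row directly above.
# Assumes no NA values in d.
drop_consecutive_dups <- function(d) {
  n <- nrow(d)
  if (n < 2) return(d)  # nothing to compare against
  # Compare rows 2..n with rows 1..(n-1), column by column
  same <- rowSums(d[-1, , drop = FALSE] == d[-n, , drop = FALSE]) == ncol(d)
  # The first row is always kept
  d[c(TRUE, !same), , drop = FALSE]
}

result <- drop_consecutive_dups(xy)
```

The guard for fewer than two rows is there because there is no previous row to compare with in that case.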

Why don't you just iterate over the list while keeping track of the previous row, so you can compare it to the next row?
If they are equal at some point, remember that row's position, remove the row from the list, and then start iterating again from the beginning of the list.
Don't delete a row in the middle of an iteration, because the positions of the later rows shift (in some languages this raises a concurrent modification error).
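One way to follow this advice in R is sketched below. It deviates from the description in one respect, which is an assumption on my part: instead of restarting after each removal, it marks rows in a preallocated logical vector and deletes them once after the loop. Each row is compared with its original predecessor, as in the question's loop.

```r
# Sample data from the question
xy <- data.frame(x = c(1, 1, 1, 1, 3, 3, 3, 4),
                 y = c(1, 1, 1, 1, 3, 3, 3, 4),
                 z = c(1, 2, 1, 1, 3, 2, 2, 4))

# Preallocate the keep/drop flags so nothing grows inside the loop
keep <- rep(TRUE, nrow(xy))

# Start at row 2; seq_len(n)[-1] is empty when there is only one row
for (i in seq_len(nrow(xy))[-1]) {
  if (all(xy[i, ] == xy[i - 1, ])) {
    keep[i] <- FALSE  # mark only; do not delete during the loop
  }
}

# Delete all marked rows in one step after the loop
result_loop <- xy[keep, ]
```

This is still a row-by-row loop, so it will be slow on large data; the vectorized index above is the better fit for the original question.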

Related

Why does dropping which(FALSE) columns delete all columns?

This answer warns of some scary behavior from which. Specifically, if you take any data frame, say df <- data.frame(x=1:5, y=2:6), and then try to subset it with something that evaluates to which(FALSE) (i.e. integer(0)), then you will delete every column in the data set. Why is this? Why would dropping all columns that correspond to integer(0) delete everything? Deleting nothing shouldn't destroy everything.
Example:
>df <- data.frame(x=1:5, y=2:6)
>df
x y
1 1 2
2 2 3
3 3 4
4 4 5
5 5 6
>df <- df[,-which(FALSE)]
>df
data frame with 0 columns and 5 rows
Consider:
identical(integer(0), -integer(0))
# [1] TRUE
So, actually you're selecting nothing, rather than deleting nothing.
If you want to delete nothing, you could use a large negative integer, e.g. the largest possible.
df[, -.Machine$integer.max]
# x y
# 1 1 2
# 2 2 3
# 3 3 4
# 4 4 5
# 5 5 6
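If the set of columns to drop may be empty, it can be safer to avoid negative indices altogether. The sketch below shows two base R alternatives; the variable names (to_drop, kept_setdiff, kept_logical) are made up for illustration.

```r
df <- data.frame(x = 1:5, y = 2:6)

# An empty set of column positions, as produced by which(FALSE)
to_drop <- which(FALSE)

# Negative indexing: -integer(0) is integer(0), so no columns are selected
lost <- df[, -to_drop, drop = FALSE]

# Alternative 1: compute the positions to keep explicitly
kept_setdiff <- df[, setdiff(seq_along(df), to_drop), drop = FALSE]

# Alternative 2: use a logical index, which has no empty-vector special case
kept_logical <- df[, !(seq_along(df) %in% to_drop), drop = FALSE]
```

Both alternatives behave the same when to_drop is non-empty, so they can replace the negative index without a special case.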

For loop to paste rows to create new dataframe from existing dataframe

New to SO, but can't figure out how to get this code to work. I have a dataframe that is very large, and is set up like this:
Number Year Type Amount
1 1 A 5
1 2 A 2
1 3 A 7
1 4 A 1
1 1 B 5
1 2 B 11
1 3 B 0
1 4 B 2
This goes on for multiple numbers. I want to take this dataframe and make a new dataframe that has two of the rows together, but it would be nested (for example, row 1 and row 2, row 1 and row 3, row 1 and row 4, row 2 and row 3, row 2 and row 4) where each combination of each year is together within types and numbers.
Example output:
Number Year Type Amount Number Year Type Amount
1 1 A 5 1 2 A 2
1 1 A 5 1 3 A 7
1 1 A 5 1 4 A 1
1 2 A 2 1 3 A 7
1 2 A 2 1 4 A 1
1 3 A 7 1 4 A 1
I thought that I would do a for loop to loop within number and type, but I do not know how to make the rows paste from there, or how to ensure that I am only getting the combinations of the rows once. For example:
for(i in 1:n_number){
for(j in 1:n_type){
....}}
Any tips would be appreciated! I am relatively new to coding, so I don't know if I should be using a for loop at all. Thank you!
df <- data.frame(Number= rep(1,8),
Year = rep(c(1:4),2),
Type = rep(c('A','B'),each=4),
Amount=c(5,2,7,1,5,11,0,2))
My interpretation is that you want to create a dataframe with all row combinations, where Number and Type are the same and Year is different.
First suggestion - join on Number and Type, then keep only the rows that have different Year. I added an index so that each pair appears only once (1 with 2, but not also 2 with 1).
df$index <- 1:nrow(df)
out <- merge(df,df,by=c("Number","Type"))
# index.x < index.y keeps the earlier row on the left, as in the example output
out <- out[out$index.x<out$index.y & out$Year.x!=out$Year.y,]
Second suggestion - if you want to see a version using a loop.
out2 <- NULL
for (i in c(1:(nrow(df)-1))){
for (j in c((i+1):nrow(df))){
if(df[i,"Year"]!=df[j,"Year"] && df[i,"Number"]==df[j,"Number"] && df[i,"Type"]==df[j,"Type"]){
out2 <- rbind(out2,cbind(df[i,],df[j,]))
}
}
}
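A third possibility, sketched here as an assumption rather than as part of the suggestions above, is to build the row pairs with combn inside each Number/Type group. The .x/.y suffixes and the names groups, pairs and out3 are chosen only for illustration.

```r
df <- data.frame(Number = rep(1, 8),
                 Year   = rep(1:4, 2),
                 Type   = rep(c("A", "B"), each = 4),
                 Amount = c(5, 2, 7, 1, 5, 11, 0, 2))

# Row positions for each Number/Type group
groups <- split(seq_len(nrow(df)), list(df$Number, df$Type), drop = TRUE)

pairs <- lapply(groups, function(idx) {
  if (length(idx) < 2) return(NULL)  # a single row has no partner
  p <- combn(idx, 2)                 # each column is one pair of row positions
  left  <- df[p[1, ], ]
  right <- df[p[2, ], ]
  names(left)  <- paste0(names(df), ".x")
  names(right) <- paste0(names(df), ".y")
  rownames(left)  <- NULL
  rownames(right) <- NULL
  cbind(left, right)
})

# Drop empty groups and stack the rest
out3 <- do.call(rbind, Filter(Negate(is.null), pairs))
rownames(out3) <- NULL
```

Because combn returns each unordered pair once, no index comparison is needed to avoid the mirrored match.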

How to remove columns of data from a data frame using a vector with a regular expression

I am trying to remove columns from a dataframe using a vector of numbers, with those numbers being just a part of the whole column header. What I'm looking to use is something like the wildcard "*" in unix, so that I can say that I want to remove columns with labels xxxx, xxkx, etc... To illustrate what I mean, if I have the following data:
data_test_read <- read.table("batch_1_8c9.structure-edit.tsv",sep="\t", header=TRUE)
data_test_read[1:5,1:5]
samp pop X12706_10 X14223_16 X14481_7
1 BayOfIslands_s088.fq 1 4 1 3
2 BayOfIslands_s088.fq 1 4 1 3
3 BayOfIslands_s089.fq 1 4 1 3
4 BayOfIslands_s089.fq 1 4 3 3
5 BayOfIslands_s090.fq 1 4 1 3
And I want to take out, for example, columns with headers (X12706_10, X14481_7), the following works
data_subs1=subset(data_test_read, select = -c(X12706_10, X14481_7))
data_subs1[1:4,1:4]
samp pop X14223_16 X15213_19
1 BayOfIslands_s088.fq 1 1 3
2 BayOfIslands_s088.fq 1 1 3
3 BayOfIslands_s089.fq 1 1 3
4 BayOfIslands_s089.fq 1 3 3
However, what I need is to be able to identify these columns by only the numbers, so, using (12706,14481). But, if I try this, I get the following
data_subs2=subset(data_test_read, select = -c(12706,14481))
data_subs2[1:4,1:4]
samp pop X12706_10 X14223_16
1 BayOfIslands_s088.fq 1 4 1
2 BayOfIslands_s088.fq 1 4 1
3 BayOfIslands_s089.fq 1 4 1
4 BayOfIslands_s089.fq 1 4 3
This is clearly because I haven't specified anything to do with the "x", or the "_" or what is after the underscore. I've read so many answers on using regular expressions, and I just can't seem to sort it out. Any thoughts, or pointers to what I might turn to would be appreciated.
First you can just extract the numbers from the headers
# for testing
col_names <- c("X12706_10","X14223_16","X14481_7")
# in practice, use
# col_names <- names(data_test_read)
samples <- gsub("X(\\d+)_.*","\\1",col_names)
Then find the indexes of the samples you want to drop.
samples_to_drop <- c(12706, 14481)
cols_to_drop <- match(samples_to_drop, samples)
Then you can use
data_subs2 <- subset(data_test_read, select = -cols_to_drop)
to actually get rid of those columns.
Perhaps put this all in a function to make it easier to use
sample_subset <- function(x, drop) {
samples <- gsub("X(\\d+)_.*","\\1", names(x))
subset(x, select = -match(drop, samples))
}
sample_subset(data_test_read, c(12706, 14481))
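One caveat with match() is that it returns only the first matching position and gives NA for an ID that is not present. A variant using %in% with a logical index sidesteps both. This is a sketch: sample_subset_in and the toy data frame are hypothetical, and it assumes the headers follow the X<number>_<suffix> pattern shown in the question.

```r
# Hypothetical variant: drop every column whose numeric ID is in `drop`
sample_subset_in <- function(x, drop) {
  # Extract the digits between the leading "X" and the underscore;
  # names that do not fit the pattern (e.g. "samp") are left unchanged
  samples <- sub("^X([0-9]+)_.*$", "\\1", names(x))
  # IDs are compared as strings; the logical index keeps non-matching columns
  x[, !(samples %in% as.character(drop)), drop = FALSE]
}

# Toy data (made up) with two columns sharing the ID 12706
toy <- data.frame(samp = c("a", "b"), pop = c(1, 1),
                  X12706_10 = c(4, 4), X12706_11 = c(2, 2),
                  X14223_16 = c(1, 3), X14481_7 = c(3, 3))

res <- sample_subset_in(toy, c(12706, 14481))
res_none <- sample_subset_in(toy, 99999)  # ID not present: nothing is dropped
```

With the logical index, asking for an ID that does not exist simply returns all columns instead of failing.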

freq() renames columns during printing

I want to get a one-way frequency table for each column in my dataframe (a count of each unique value in each column). I am following this tutorial, which suggests using the count() function from the plyr package.
for (col in mtcars[c("gear","carb")]){
freq <- count(col)
write.table(freq, file='filename.txt')
}
I would expect the output to look like this:
gear freq
1 3 15
2 4 12
3 5 5
Instead the column name is replaced with 'x':
x freq
1 3 15
2 4 12
3 5 5
Why is this happening, and how can I modify my for loop so that it prints the column name instead of 'x'?
(There is probably a better, vectorized way to do this other than using a for loop, but I'm new to R and can't quite figure out the syntax.)
In a for loop:
for (col in c("gear","carb")){
print(plyr::count(mtcars, col))
}
Using lapply():
lapply(c("gear","carb"), function(col) plyr::count(mtcars, col))
To be clear, count is not renaming anything. In your loop it receives col, which is a vector. A vector does not have column names, and so count does not know what name it should use. It uses x as a placeholder.
This will also work (with the names of the columns of the dataset mtcars as input, and the result as a list of dataframes):
lapply(c("gear","carb"), function(x){df <- as.data.frame(table(mtcars[x])); names(df) <- c(x, 'freq'); df})
[[1]]
gear freq
1 3 15
2 4 12
3 5 5
[[2]]
carb freq
1 1 7
2 2 10
3 3 3
4 4 10
5 6 1
6 8 1
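Since the original loop also wrote each table to a file, here is a base R sketch that loops over the column names (so the name is available) and writes one file per column. The output directory and file names are assumptions for illustration; the loop in the question would otherwise overwrite the same file on every iteration.

```r
cols <- c("gear", "carb")
out_dir <- tempdir()  # hypothetical output location

freq_tables <- list()
for (col in cols) {
  # table() on the bare vector loses the name, so restore it explicitly
  freq <- as.data.frame(table(mtcars[[col]]), stringsAsFactors = FALSE)
  names(freq) <- c(col, "freq")
  freq_tables[[col]] <- freq
  # One file per column, named after the column
  write.table(freq, file = file.path(out_dir, paste0(col, "_freq.txt")),
              row.names = FALSE)
}
```

Note that with stringsAsFactors = FALSE the value column is stored as character, so convert it if you need numbers.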

Loop through the data frame applying some function on each value in R

I am new to R. I have a data frame (usr.query) with the structure shown below.
Now I want to take the text of each id and compare it to the text of all the other ids, and if there is a match, I want to append it to a new column, say count of match.
A0008 with A0043,A0065,A0082,B0018,B0026
A0043 with A0008,A0065,A0082,B0018,B0026
Function to apply
count_match = length(intersect(unlist(strsplit(query1," ")),unlist(strsplit(query2," "))))
Here query1 is the text of A0008 and query2 is the text of each of A0043, A0065, A0082, B0018, B0026 in turn.
I tried the suggested solution and here is the result.
No loops are necessary; you'll usually find that's the case in R, because it's really good at utilizing vectorized operations. In this case, you can get the necessary combinations with combn, and then make the match_count column by subsetting the original data.frame with the combinations of the new one, and testing for equality. Adding zero changes the values from Boolean to numeric (use as.integer, if you prefer).
# assemble sample data
df <- data.frame(id = 1:5, text = c('apple', 'mango', 'apple', 'apple', 'mango'))
# make combinations
df2 <- as.data.frame(t(combn(df$id, 2)))
# add names
names(df2) <- c('main_id', 'compared_to_id')
# test for match
df2$match_count <- (df[df2$main_id, 'text'] == df[df2$compared_to_id, 'text']) + 0
The result:
> df2
main_id compared_to_id match_count
1 1 2 0
2 1 3 1
3 1 4 1
4 1 5 0
5 2 3 0
6 2 4 0
7 2 5 1
8 3 4 1
9 3 5 0
10 4 5 0
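The example above tests whole texts for equality. If the goal is the number of shared words, as in the asker's count_match expression, the same combn approach can be adapted. This is a sketch with made-up data; the ids are borrowed from the question, but the texts and the names usr_query, words and overlap are assumptions.

```r
# Made-up sample data; the real usr.query was not shown in the question
usr_query <- data.frame(id   = c("A0008", "A0043", "A0065"),
                        text = c("red apple pie", "green apple", "mango pie"),
                        stringsAsFactors = FALSE)

# Split each text into words once, up front
words <- strsplit(usr_query$text, " ", fixed = TRUE)

# All unordered pairs of row positions, one pair per row
pairs <- t(combn(seq_len(nrow(usr_query)), 2))

overlap <- data.frame(
  main_id        = usr_query$id[pairs[, 1]],
  compared_to_id = usr_query$id[pairs[, 2]],
  # Number of distinct words the two texts have in common
  match_count    = mapply(function(i, j) length(intersect(words[[i]], words[[j]])),
                          pairs[, 1], pairs[, 2])
)
```

Splitting the texts once before pairing avoids repeating strsplit for every comparison.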
