R subsetting rows where values in multiple columns don't match

Apologies if this has been asked already, but I searched and could not find an exact example of what I am trying to do. I'm trying to subset a dataframe to exclude rows that have matching numerical values across five columns. For example, for the following dataframe, df, I'd want to return a new dataframe only with rows 1:2, 5:6, and 8:10:
Row A B C D E
1 1 1 2 3 1
2 4 1 2 3 5
3 2 2 2 2 2
4 5 5 5 5 5
5 4 4 2 3 4
6 2 1 3 5 2
7 3 3 3 3 3
8 3 2 5 3 3
9 2 1 2 2 4
10 3 3 3 2 3
I'm having trouble figuring out how to do this for more than two columns. I've tried the following and know they are not right.
df2 <- df[!duplicated(df, c("A", "B", "C", "D", "E"))]
and
df2 <- df[df$A==df$B==df$C==df$D==df$E,]
Thanks in advance.

Data frames are usually operated on column-wise rather than row-wise, which is why your duplicated attempt doesn't work. (It's checking for duplicate rows within those columns.) And your == attempt doesn't work because == is a binary operator that R will not chain: df$A == df$B == df$C is a syntax error. Even with explicit parentheses, df$A == df$B is a vector of TRUE/FALSE values, so (df$A == df$B) == df$C would be testing whether df$C equals TRUE or FALSE, not whether it equals df$A.
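To illustrate the point about ==, here is a minimal sketch of the pairwise form the question was reaching for: each comparison is written out separately and the results are combined with &. The small data frame is made up for the example and is not the data from the question.

```r
# Toy data (an assumption for illustration, not the question's data)
df <- data.frame(A = c(1, 2, 3),
                 B = c(1, 2, 3),
                 C = c(2, 2, 3))

# Compare neighbouring columns one pair at a time, then combine with &
all_equal <- df$A == df$B & df$B == df$C

# Keep only the rows where the columns are NOT all equal
result <- df[!all_equal, ]
result
```

This works for any number of columns but gets verbose quickly, which is why a row-wise function is usually more convenient.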
apply is a good way to run a function on each row. It will convert your data frame to a matrix to run the function, but in this case that's fine because columns A through E are all numeric. Here's one way:
df[apply(df[, -1], 1, function(x) length(unique(x))) > 1, ]
# Row A B C D E
# 1 1 1 1 2 3 1
# 2 2 4 1 2 3 5
# 5 5 4 4 2 3 4
# 6 6 2 1 3 5 2
# 8 8 3 2 5 3 3
# 9 9 2 1 2 2 4
# 10 10 3 3 3 2 3
You could come up with all sorts of different functions to apply to test for all the elements being the same.
I assumed you actually have a column named Row. If that isn't the case, leave out the -1 in my code above.
Using this data, reproducibly shared with dput().
df = structure(list(Row = 1:10, A = c(1L, 4L, 2L, 5L, 4L, 2L, 3L,
3L, 2L, 3L), B = c(1L, 1L, 2L, 5L, 4L, 1L, 3L, 2L, 1L, 3L), C = c(2L,
2L, 2L, 5L, 2L, 3L, 3L, 5L, 2L, 3L), D = c(3L, 3L, 2L, 5L, 3L,
5L, 3L, 3L, 2L, 2L), E = c(1L, 5L, 2L, 5L, 4L, 2L, 3L, 3L, 4L,
3L)), .Names = c("Row", "A", "B", "C", "D", "E"), class = "data.frame", row.names = c(NA,
-10L))
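As a sketch of the "other functions" mentioned above, the two row tests below should be interchangeable with length(unique(x)) > 1 for numeric data without missing values. The data frame is rebuilt from the dput() output; the names vals, same_as_first and min_equals_max are only illustrative.

```r
df <- data.frame(Row = 1:10,
                 A = c(1, 4, 2, 5, 4, 2, 3, 3, 2, 3),
                 B = c(1, 1, 2, 5, 4, 1, 3, 2, 1, 3),
                 C = c(2, 2, 2, 5, 2, 3, 3, 5, 2, 3),
                 D = c(3, 3, 2, 5, 3, 5, 3, 3, 2, 2),
                 E = c(1, 5, 2, 5, 4, 2, 3, 3, 4, 3))

vals <- df[, -1]  # drop the Row column before testing

# Test 1: every element equals the first element of the row
same_as_first <- apply(vals, 1, function(x) all(x == x[1]))

# Test 2: the smallest and largest values in the row coincide
min_equals_max <- apply(vals, 1, function(x) min(x) == max(x))

# Both tests flag the same rows; keep the rows that are not constant
kept <- df[!same_as_first, ]
kept
```

Neither test handles NA specially, so rows containing missing values would need an explicit decision (for example via na.rm or is.na) before using them.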

You can simply compare all the columns against a single column and see if they are all the same
df[rowSums(df[-1] == df[, 1]) < (ncol(df) - 1), ]
# A B C D E
# 1 1 1 2 3 1
# 2 4 1 2 3 5
# 5 4 4 2 3 4
# 6 2 1 3 5 2
# 8 3 2 5 3 3
# 9 2 1 2 2 4
# 10 3 3 3 2 3
Or just df[rowSums(df == df[, 1]) < (ncol(df)), ]
Or similarly, you can avoid matrix conversions altogether and combine Reduce and lapply
df[!Reduce("&" , lapply(df, `==`, df[, 1])), ]
# A B C D E
# 1 1 1 2 3 1
# 2 4 1 2 3 5
# 5 4 4 2 3 4
# 6 2 1 3 5 2
# 8 3 2 5 3 3
# 9 2 1 2 2 4
# 10 3 3 3 2 3
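To make the Reduce and lapply combination easier to follow, here is a small sketch that spells out the intermediate steps. The toy data frame and the names per_column and all_same are assumptions made for the example.

```r
# Toy data (not the question's data)
df <- data.frame(A = c(1, 2, 3),
                 B = c(1, 2, 4),
                 C = c(1, 5, 3))

# Step 1: one logical vector per column, each compared with the first column
per_column <- lapply(df, function(col) col == df[[1]])

# Step 2: fold the list with `&`, so a row is TRUE only if every column matched
all_same <- Reduce(`&`, per_column)

# Step 3: keep the rows where at least one column differs
result <- df[!all_same, ]
result
```

Because everything stays a list of vectors, no matrix is built along the way, which is the point the answer makes about avoiding the conversion.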

Related

R: reordering columns based on order of different column

I have the following data:
x y id
1 2
2 2 1
3 4
5 6 2
3 4
2 1 3
The blanks in column id should have the same values as the next id value. Meaning my data should actually look like this:
x y id
1 2 1
2 2 1
3 4 2
5 6 2
3 4 3
2 1 3
I also have a list:
list[[1]] = 1 3 2
Or alternatively a column:
c(1,3,2) = 1, 3, 2
Now I would like to reorder my data based on column id according to the order in the list. My data should then look like this:
x y id
1 2 1
2 2 1
3 4 3
2 1 3
3 4 2
5 6 2
Is there an efficient way to do this?
EDIT: I don't think it is a duplicate of "in R Sorting by absolute value without changing the data" because I do not want to sort by absolute value but by a specific order that is given in a list.
A base R option would be (assuming that the blanks in the 'id' column are NA)
i1 <- !is.na(df1$id)
df1[i1,][match(df1$id[i1], list[[1]]),] <- df1[i1, ]
df1
# x y id
#1 1 2 NA
#2 2 2 1
#3 3 4 NA
#4 2 1 3
#5 3 4 NA
#6 5 6 2
If we need to replace each NA with the succeeding non-NA element
library(zoo)
df1$id <- na.locf(df1$id, fromLast = TRUE)
data
df1 <- structure(list(x = c(1L, 2L, 3L, 5L, 3L, 2L), y = c(2L, 2L, 4L,
6L, 4L, 1L), id = c(NA, 1L, NA, 2L, NA, 3L)), class = "data.frame",
row.names = c(NA, -6L))
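The two steps above (filling the blank ids, then reordering) can also be sketched in base R alone. This is not the answer's method: it orders the rows by match() instead of assigning into the data frame, and fill_from_next is a hypothetical helper written for this example in place of zoo's na.locf.

```r
# Data from the answer above; NA marks a blank id
df1 <- data.frame(x = c(1L, 2L, 3L, 5L, 3L, 2L),
                  y = c(2L, 2L, 4L, 6L, 4L, 1L),
                  id = c(NA, 1L, NA, 2L, NA, 3L))
id_order <- c(1, 3, 2)  # the desired order of the ids

# Hypothetical helper: walk backwards and copy the next value into each NA
fill_from_next <- function(x) {
  for (i in rev(seq_along(x)[-length(x)])) {
    if (is.na(x[i])) x[i] <- x[i + 1]
  }
  x
}

df1$id <- fill_from_next(df1$id)

# match() gives every row the position of its id in id_order;
# order() then sorts rows by that position and keeps ties in original order
result <- df1[order(match(df1$id, id_order)), ]
result
```

Any id that does not appear in id_order gets NA from match() and is therefore sorted to the end.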

Conditionally remove rows from a database using R

ID Number Var
1 2 6
1 2 7
1 1 8
1 2 9
1 2 10
2 2 3
2 2 4
2 1 5
2 2 6
Each person has several records.
There is only one record of a person whose Number is 1, the rest is 2.
The variable Var has different values for the same person.
When the Number equals to 1, the corresponding Var (we call it P) is different for different persons.
Now, I want to delete the rows whose Var > P for every person.
At the end, I want this
ID Number Var
1 2 6
1 2 7
1 1 8
2 2 3
2 2 4
2 1 5
You can use dplyr::first where Number == 1 to get the first Var value
library(dplyr)
df %>% group_by(ID) %>% mutate(Flag=first(Var[Number==1])) %>%
filter(Var <= Flag) %>% select(-Flag)
# shorter version, if you are sure each ID has exactly one row with Number == 1
df %>% group_by(ID) %>% filter(Var <= Var[Number==1])
Here is a solution with data.table:
library(data.table)
dt <- fread(
"ID Number Var
1 2 6
1 2 7
1 1 8
1 2 9
1 2 10
2 2 3
2 2 4
2 1 5
2 2 6")
dt[, .SD[Var <= Var[Number==1]], ID]
# ID Number Var
# 1: 1 2 6
# 2: 1 2 7
# 3: 1 1 8
# 4: 2 2 3
# 5: 2 2 4
# 6: 2 1 5
A base R option would be
df1[with(df1, Var <= ave(Var * (Number == 1), ID, FUN = function(x) x[x!=0])),]
# ID Number Var
#1 1 2 6
#2 1 2 7
#3 1 1 8
#6 2 2 3
#7 2 2 4
#8 2 1 5
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), Number = c(2L,
2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L), Var = c(6L, 7L, 8L, 9L, 10L,
3L, 4L, 5L, 6L)), row.names = c(NA, -9L), class = "data.frame")
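For readers who find the ave() one-liner dense, the same idea can be sketched as an explicit lookup table: first collect each person's reference value P, then compare every row with its own P. The names is_ref and p are illustrative, and the sketch assumes exactly one row with Number == 1 per ID, as stated in the question.

```r
df1 <- data.frame(ID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
                  Number = c(2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L),
                  Var = c(6L, 7L, 8L, 9L, 10L, 3L, 4L, 5L, 6L))

# Rows that carry the reference value P (Number == 1)
is_ref <- df1$Number == 1

# Lookup table: P for each ID, named by ID
p <- setNames(df1$Var[is_ref], df1$ID[is_ref])

# Look up each row's P by its ID and keep the rows with Var <= P
keep <- unname(df1$Var <= p[as.character(df1$ID)])
result <- df1[keep, ]
result
```

If an ID had no row with Number == 1, its lookup would return NA, so such IDs would need to be handled before subsetting.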

How can I exclude zeros when finding a frequency and separate into 4 categories

I have a data frame data.2016 and am trying to find the frequency with which "DIPL" occurs (excluding zero). "DIPL" is the number of parasitic worms found in a fish.
Data looks something like this:
data.2016
Site DIPL
1 0
1 1
1 1
2 6
2 8
2 1
2 1
3 0
3 0
3 0
4 1258
4 501
I want the output to look like this:
Site freq
1 2
2 4
3 0
4 2
From this I can interpret, out of the 3 fish found in site #1 (from the data frame), 2 of them had worm parasites.
I've tried
aggregate(DIPL~Site, data=data.2016, frequency) #and get:
Site DIPL
1 1 1
2 2 1
3 3 1
4 4 1
Is there a way to count the number of fish with worms from the DIPL column (meaning the value in the column is higher than zero) per site?
Just use a custom function that removes the zeros.
aggregate(DIPL ~ Site, data.2016, function(x) length(x[x != 0])) # or sum(x != 0)
# Site DIPL
# 1 1 2
# 2 2 4
# 3 3 0
# 4 4 2
Another option would be to temporarily transform the DIPL column then just take the sum.
aggregate(DIPL ~ Site, transform(data.2016, DIPL = DIPL != 0), sum)
# Site DIPL
# 1 1 2
# 2 2 4
# 3 3 0
# 4 4 2
xtabs() is fun too ...
xtabs(DIPL ~ Site, transform(data.2016, DIPL = DIPL != 0))
# Site
# 1 2 3 4
# 2 4 0 2
By the way, frequency is for use on time-series data.
Data:
data.2016 <- structure(list(Site = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L,
4L, 4L), DIPL = c(0L, 1L, 1L, 6L, 8L, 1L, 1L, 0L, 0L, 0L, 1258L,
501L)), .Names = c("Site", "DIPL"), class = "data.frame", row.names = c(NA,
-12L))
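The remark about frequency explains the column of 1s in the question: on a plain numeric vector it reports the time-series sampling frequency, which defaults to 1, rather than counting anything. A short sketch with the same data, using tapply as one more way to count the non-zero values per site (the name infected is illustrative):

```r
data.2016 <- data.frame(
  Site = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4),
  DIPL = c(0, 1, 1, 6, 8, 1, 1, 0, 0, 0, 1258, 501)
)

# frequency() on an ordinary vector just returns the default of 1
site1_freq <- frequency(data.2016$DIPL[data.2016$Site == 1])

# DIPL > 0 is TRUE for fish with worms; summing the TRUEs per site counts them
infected <- tapply(data.2016$DIPL > 0, data.2016$Site, sum)
infected
```

Replacing sum with mean in the tapply call would give the proportion of infected fish per site instead of the count.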
Might something like this be what you're looking for?
# first some fake data
site <- c("A","A","A","B","B","B")
numworms <- c(1,0,3,0,0,42)
data.frame(site,numworms)
site numworms
1 A 1
2 A 0
3 A 3
4 B 0
5 B 0
6 B 42
tapply(numworms, site, function(x) sum(x>0))
A B
2 1

R: Sorting columns based on partial match of column names with row names

I have a data frame that can be simplified to look like this (included the dput at the end):
T2_KL_21 A1_LC_11 W3_FA_22 RR_BI_12 PL_EW_12 RT_LC_22 YU_BI_21
FA 1 2 3 4 5 6 7
BI 1 2 3 4 5 6 7
KL 1 2 3 4 5 6 7
EW 1 2 3 4 5 6 7
LC 1 2 3 4 5 6 7
I would like to sort the columns so that they follow the order of the row names (based on partial match). It would then look like this:
W3_FA_22 RR_BI_12 YU_BI_21 T2_KL_21 PL_EW_12 A1_LC_11 RT_LC_22
FA 3 4 7 1 5 2 6
BI 3 4 7 1 5 2 6
KL 3 4 7 1 5 2 6
EW 3 4 7 1 5 2 6
LC 3 4 7 1 5 2 6
If more than one column name contains the string in the row names, they should be kept side by side, but the order does not matter.
I have already filtered the columns so that they all contain a match in the row names.
Here is the dput of the data frame:
structure(list(T2_KL_21 = c(1L, 1L, 1L, 1L, 1L), A1_LC_11 = c(2L,
2L, 2L, 2L, 2L), W3_FA_22 = c(3L, 3L, 3L, 3L, 3L), RR_BI_12 = c(4L,
4L, 4L, 4L, 4L), PL_EW_12 = c(5L, 5L, 5L, 5L, 5L), RT_LC_22 = c(6L,
6L, 6L, 6L, 6L), YU_BI_21 = c(7L, 7L, 7L, 7L, 7L)), .Names = c("T2_KL_21",
"A1_LC_11", "W3_FA_22", "RR_BI_12", "PL_EW_12", "RT_LC_22", "YU_BI_21"
), class = "data.frame", row.names = c("FA", "BI", "KL", "EW",
"LC"))
I have tried using pmatch, grep and match, with no success.
Any advice will be much appreciated! Thanks
We can loop through the rownames and grep to find the index of the column names that match, unlist and use that to arrange the columns
df1[unlist(lapply(gsub("\\d+", "", row.names(df1)), function(x) grep(x, names(df1))))]
#W3_FA_22 RR_BI_12 YU_BI_21 T2_KL_21 PL_EW_12 A1_LC_11 RT_LC_22
#FA 3 4 7 1 5 2 6
#BI 3 4 7 1 5 2 6
#KL 3 4 7 1 5 2 6
#EW 3 4 7 1 5 2 6
#LC 3 4 7 1 5 2 6
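Because grep does substring matching, a row name such as "LC" would also match a column whose prefix or suffix happened to contain "LC". A hedged alternative, assuming every column name has the form prefix_KEY_suffix as in the example, is to extract the middle token and order the columns by its position among the row names. The name key is illustrative.

```r
df1 <- data.frame(T2_KL_21 = rep(1L, 5), A1_LC_11 = rep(2L, 5),
                  W3_FA_22 = rep(3L, 5), RR_BI_12 = rep(4L, 5),
                  PL_EW_12 = rep(5L, 5), RT_LC_22 = rep(6L, 5),
                  YU_BI_21 = rep(7L, 5),
                  row.names = c("FA", "BI", "KL", "EW", "LC"))

# Pull out the token between the first and second underscore
# (assumes the prefix_KEY_suffix naming pattern holds for every column)
key <- sub("^[^_]+_([^_]+)_.*$", "\\1", names(df1))

# Position of each column's key among the row names, then a stable sort
result <- df1[order(match(key, rownames(df1)))]
names(result)
```

Columns that share a key stay next to each other in their original relative order, which satisfies the "side by side, order does not matter" requirement.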

Deleting Rows per ID when value gets greater than... minus 2

I have the following data frame
id<-c(1,1,1,1,1,1,1,1,2,2,2,2,3,3,3,3)
time<-c(0,1,2,3,4,5,6,7,0,1,2,3,0,1,2,3)
value<-c(1,1,6,1,2,0,0,1,2,6,2,2,1,1,6,1)
d<-data.frame(id, time, value)
The value 6 appears only once for each id. For every id, I would like to remove all rows after the row with the value 6, except the first two rows that follow it.
I've searched and found a similar problem, but I couldn't adapt the solution myself. I am therefore using the code from that thread.
In the above case the final data frame should be
id time value
1 0 1
1 1 1
1 2 6
1 3 1
1 4 2
2 0 2
2 1 6
2 2 2
2 3 2
3 0 1
3 1 1
3 2 6
3 3 1
One of the solutions given there seems to get very close to what I need, but I didn't manage to adapt it. Could you help me?
library(plyr)
ddply(d, "id",
function(x) {
if (any(x$value == 6)) {
subset(x, time <= x[x$value == 6, "time"])
} else {
x
}
}
)
Thank you very much.
We could use data.table. Convert the 'data.frame' to 'data.table' (setDT(d)). Grouped by the 'id' column, we get the position of 'value' that is equal to 6. Add 2 to it. Find the min of the number of elements for that group (.N) and the position, get the seq, and use that to subset the dataset. We can also add an if/else condition to check whether there are any 6 in the 'value' column or else to return .SD without any subsetting.
library(data.table)
setDT(d)[, if(any(value==6)) .SD[seq(min(c(which(value==6) + 2, .N)))]
else .SD, by = id]
# id time value
# 1: 1 0 1
# 2: 1 1 1
# 3: 1 2 6
# 4: 1 3 1
# 5: 1 4 2
# 6: 2 0 2
# 7: 2 1 6
# 8: 2 2 2
# 9: 2 3 2
#10: 3 0 1
#11: 3 1 1
#12: 3 2 6
#13: 3 3 1
#14: 4 0 1
#15: 4 1 2
#16: 4 2 5
Or as @Arun mentioned in the comments, we can use the ?head to subset, which would be faster
setDT(d)[, if(any(value==6)) head(.SD, which(value==6L)+2L) else .SD, by = id]
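Or using dplyr, we group by 'id', get the position of 'value' 6 with which, add 2, build the sequence of row numbers and use that numeric index within slice to extract the rows. Groups without a 6 need the explicit else branch, because which returns integer(0) for them and slice would otherwise drop the whole group.
library(dplyr)
d %>%
group_by(id) %>%
# keep every row when the group has no 6; indices past the group size are ignored
slice(if (any(value == 6)) seq_len(which(value == 6) + 2) else seq_len(n()))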
# id time value
#1 1 0 1
#2 1 1 1
#3 1 2 6
#4 1 3 1
#5 1 4 2
#6 2 0 2
#7 2 1 6
#8 2 2 2
#9 2 3 2
#10 3 0 1
#11 3 1 1
#12 3 2 6
#13 3 3 1
#14 4 0 1
#15 4 1 2
#16 4 2 5
data
d <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L,
3L, 3L, 3L, 4L, 4L, 4L), time = c(0L, 1L, 2L, 3L, 4L, 0L, 1L,
2L, 3L, 0L, 1L, 2L, 3L, 0L, 1L, 2L), value = c(1L, 1L, 6L, 1L,
2L, 2L, 6L, 2L, 2L, 1L, 1L, 6L, 1L, 1L, 2L, 5L)), .Names = c("id",
"time", "value"), class = "data.frame", row.names = c(NA, -16L))
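For comparison, the same rule can be sketched without any packages by building a logical "keep" vector per id with ave. This uses the data from the question rather than the data block above, and the names extra and keep are illustrative; groups without a 6 are kept whole, mirroring the if/else in the data.table answer.

```r
# Data from the question
d <- data.frame(id    = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
                time  = c(0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 0, 1, 2, 3),
                value = c(1, 1, 6, 1, 2, 0, 0, 1, 2, 6, 2, 2, 1, 1, 6, 1))

extra <- 2  # number of rows to keep after the row containing 6

# ave() applies the function to each id's slice of the logical vector
keep <- ave(d$value == 6, d$id, FUN = function(is_six) {
  pos <- which(is_six)
  # no 6 in this group: keep all of its rows
  if (length(pos) == 0) return(rep(TRUE, length(is_six)))
  # otherwise keep everything up to `extra` rows past the first 6
  seq_along(is_six) <= pos[1] + extra
})

result <- d[keep, ]
result
```

Changing extra adjusts how many rows survive after the 6 without touching the rest of the code.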
