Considering the following data frame:
df <- data.frame(var1 = 1:5, var2 = c(5,6,7,8,1))
> df
var1 var2
1 1 5
2 2 6
3 3 7
4 4 8
5 5 1
I'd like to remove all rows whose values are flipped across the two columns. In this case, it would be row 1 and row 5 as the values 1 and 5 in row 1 are flipped to 5 and 1 in row 5. These two rows should be removed.
I hope it is clear what I am asking for :-)
Kind regards!
Perhaps something like this could work too:
df <- data.frame(var1 = 1:5, var2 = c(5,6,7,8,1))
df[!do.call(paste, df) %in% do.call(paste, rev(df)), ]
var1 var2
2 2 6
3 3 7
4 4 8
I'd have to test it on a few more test cases though, but the general idea is to use rev to reverse the order of the columns in "df" and paste them together and compare that with the pasted columns from "df".
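To make the paste-and-compare idea easier to test on more cases, it could be wrapped in a small function. This is only a sketch: drop_flipped is a made-up name, the "|" separator is an assumption that keeps pasted keys unambiguous, and a row whose two values are equal counts as its own flip and is dropped too.

```r
# Hypothetical helper: drop rows whose (first, second) pair also occurs flipped.
drop_flipped <- function(d) {
  key     <- paste(d[[1]], d[[2]], sep = "|")  # pair as it appears in the row
  flipped <- paste(d[[2]], d[[1]], sep = "|")  # same pair with the columns swapped
  d[!key %in% flipped, , drop = FALSE]
}

df  <- data.frame(var1 = 1:5, var2 = c(5, 6, 7, 8, 1))
res <- drop_flipped(df)
res
```

On the example data this keeps rows 2 to 4, the same as the one-liner above.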
Here's a simple but not especially elegant way: make a reversed data frame with a flag, and then merge it on to df:
# Make a reversed dataset
fd <- data.frame(var1 = df$var2, var2 = df$var1, flag = TRUE)
# Merge it onto your original df, then drop the matched rows and the flag var
df.sub <- subset(merge(x = df, y = fd, by = c("var1", "var2"), all.x = TRUE),
subset = is.na(flag),
select = c("var1", "var2"))
Using a bit of maths - the two rows are the same up to a permutation if the sum and absolute value of difference are the same:
df[with(df, !duplicated(data.frame(var1 + var2, abs(var1 - var2)), fromLast = TRUE)),]
# var1 var2
#1 1 5
#2 2 6
#3 3 7
#4 4 8
edit: should've read the question more carefully, to remove both duplicates, follow Ananda's suggestion:
df.ind = with(df, data.frame(var1 + var2, abs(var1 - var2)))
df[!duplicated(df.ind) & !duplicated(df.ind, fromLast = TRUE),]
# var1 var2
#2 2 6
#3 3 7
#4 4 8
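The sum and the absolute difference together pin down the unordered pair, since the larger value is (sum + |diff|) / 2 and the smaller is (sum - |diff|) / 2. An equivalent key, sketched here under the assumption that the columns are numeric, is the row-wise minimum and maximum. The column names lo and hi are made up for illustration, and exact duplicates in the same orientation are removed as well.

```r
df <- data.frame(var1 = 1:5, var2 = c(5, 6, 7, 8, 1))

# Order-independent key: smaller value first, larger value second.
key <- data.frame(lo = pmin(df$var1, df$var2),
                  hi = pmax(df$var1, df$var2))

# Flag every member of a duplicated key, not only the later occurrences.
dup <- duplicated(key) | duplicated(key, fromLast = TRUE)
res <- df[!dup, ]
res
```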
If creating a copy doesn't cause memory issues then this works as well -
df <- data.frame(var1 = 1:5, var2 = c(5,6,7,8,1))
df2 <- data.frame(var12 = 1:5, var22 = c(5,6,7,8,1))
df3 <- merge(df,df2, by.x = 'var2', by.y = 'var12', all.x = TRUE)
df3 <- subset(
df3,
is.na(var22),
select = c('var1','var2')
)
Output:
> df3
var1 var2
3 2 6
4 3 7
5 4 8
I tried merging df with df, but that gives a warning about the column var2 being duplicated. Does anybody know what to do?
If you can assume there are no duplicates in the data frame, here's a one-line answer using rbindlist from data.table, though it is still not too concise:
library(data.table)
df[!duplicated(rbindlist(list(df,df[,2:1])))[nrow(df) + 1:nrow(df)],]
## var1 var2
## 2 2 6
## 3 3 7
## 4 4 8
rbindlist is necessary here because rbind(df,df[,2:1]) will match by column name rather than index, so the other option is something like rbind(df,setnames(df[,2:1],names(df))). If you want to keep duplicates from the original, this gets even more unpleasant:
> df <- data.frame(var1 = 1:5, var2 = c(5,6,7,8,1))
> df<-rbind(df,c(2,6))
> df[!duplicated(rbindlist(list(df,df[,2:1])))[nrow(df)+1:nrow(df)],]
var1 var2
2 2 6
3 3 7
4 4 8
> df[!duplicated(rbindlist(list(df,df[,2:1])))[nrow(df)+1:nrow(df)] | duplicated(df),]
var1 var2
2 2 6
3 3 7
4 4 8
6 2 6
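The name-matching behaviour of rbind mentioned above can be checked in base R. This is a small sketch; by_name and by_pos are illustrative names, and setNames from base R stands in for data.table's setnames.

```r
df <- data.frame(var1 = 1:5, var2 = c(5, 6, 7, 8, 1))

# rbind on data frames matches columns by name, so the swapped copy is
# silently un-swapped and the original rows are just repeated.
by_name <- rbind(df, df[, 2:1])

# Renaming the swapped columns first makes rbind append the flipped values.
by_pos <- rbind(df, setNames(df[, 2:1], names(df)))

by_name[6:10, ]
by_pos[6:10, ]
```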
I am trying to concatenate two columns in R using:
df_new$conc_variable <- paste(df$var1, df$var2)
My dataset looks as follows:
id var1 var2
1 10 NA
2 NA 8
3 11 NA
4 NA 1
I am trying to get it such that there is a third column:
id var1 var2 conc_var
1 10 NA 10
2 NA 8 8
3 11 NA 11
4 NA 1 1
but instead I get:
id var1 var2 conc_var
1 10 NA 10NA
2 NA 8 8NA
3 11 NA 11NA
4 NA 1 1NA
Is there a way to exclude NAs in the paste process? I tried including na.rm = FALSE, but that just added FALSE at the end of the NA in the conc_var column. Here is the dataset:
id <- c(1,2,3,4)
var1 <- c(10, NA, 11, NA)
var2 <- c(NA, 8, NA, 1)
df <- data.frame(id, var1, var2)
One out of many options is to use ifelse as in:
df <- data.frame(var1 = c(10, NA, 11, NA),
var2 = c(NA, 8, NA, 1))
df$new <- ifelse(is.na(df$var1), yes = df$var2, no = df$var1)
print(df)
Depending on the circumstances, rowSums might be suitable as well, as in:
df$new2 <- rowSums(df[, c("var1", "var2")], na.rm = TRUE)
print(df)
You can use tidyr::unite -
df <- tidyr::unite(df, conc_var, var1, var2, na.rm = TRUE, remove = FALSE)
df
# id conc_var var1 var2
#1 1 10 10 NA
#2 2 8 NA 8
#3 3 11 11 NA
#4 4 1 NA 1
If, as in the example, each row has only one non-NA value, you can also use pmax or coalesce:
pmax(df$var1, df$var2, na.rm = TRUE)
dplyr::coalesce(df$var1, df$var2)
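If a row may hold more than one non-NA value, pmax and coalesce pick a single value instead of concatenating. A base R sketch that pastes only the non-missing entries of each row is shown here; the single-space separator is an assumption.

```r
df <- data.frame(id = 1:4,
                 var1 = c(10, NA, 11, NA),
                 var2 = c(NA, 8, NA, 1))

# For each row, drop the NAs and paste whatever is left.
df$conc_var <- apply(df[, c("var1", "var2")], 1,
                     function(x) paste(x[!is.na(x)], collapse = " "))
df
```

Note that conc_var is a character column, unlike the numeric results of pmax or coalesce.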
You could use glue from the glue package instead.
glue::glue(10, NA, .na = '')
Given this data.frame:
var1 <- c(1, 2)
var2 <- c(3, 4)
var3 <- c(5, 6)
df <- expand.grid(var1 = var1, var2 = var2, var3 = var3)
var1 var2 var3
1 1 3 5
2 2 3 5
3 1 4 5
4 2 4 5
5 1 3 6
6 2 3 6
7 1 4 6
8 2 4 6
I would like to identify the data.frame row number matching this vector (4 is the answer in this case):
vec <- c(var1 = 2, var2 = 4, var3 = 5)
var1 var2 var3
2 4 5
I can't seem to sort out a simple subsetting method. The best I have been able to come up with is the following:
working <- apply(df, 2, match, vec)
which(apply(working, 1, anyNA) == FALSE)
This seems less straightforward than expected; I was wondering if there was a more straightforward solution?
We can transpose the data frame, compare it with vec, and select the row where all of the values match.
which(colSums(t(df) == vec) == ncol(df))
#[1] 4
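A variant that avoids the transpose, sketched here in base R, compares column by column and combines the results. It matches columns to vec by name rather than by position, which may be safer if the column order differs; hits and idx are illustrative names.

```r
df  <- expand.grid(var1 = c(1, 2), var2 = c(3, 4), var3 = c(5, 6))
vec <- c(var1 = 2, var2 = 4, var3 = 5)

# One logical vector per column, then AND them together row-wise.
hits <- Reduce(`&`, Map(`==`, df[names(vec)], vec))
idx  <- which(hits)
idx
```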
For the sake of completeness, subsetting can be implemented using data.table's join:
library(data.table)
setDT(df)[as.list(vec), on = names(vec), which = TRUE]
[1] 4
This can be solved using the prodlim library:
> library(prodlim)
> row.match(vec, df)
[1] 4
Here is a dplyr option:
library(dplyr)
library(magrittr)
df %>% mutate(new=paste0(var1,var2,var3), num=row_number()) %>%
filter(new=="245") %>% select(num) %>% as.integer()
[1] 4
I have a dataframe with multiple variables, each of which has values of TRUE, FALSE, or NA. I'm trying to summarize the data, but can't get anything to work quite the way I want.
names <- c("n1","n2","n3","n4","n5","n6")
groupname <- c("g1","g2","g3","g4","g4","g4")
var1 <- c(TRUE,TRUE,NA,FALSE,TRUE,NA)
var2 <- c(FALSE,TRUE,NA,FALSE,TRUE,NA)
var3 <- c(FALSE,TRUE,NA,FALSE,TRUE,NA)
df <- data.frame(names,groupname,var1,var2,var3)
I'm trying to summarize the data for individual groups:
G4 TRUE FALSE NA
var1 3 1 2
var2 2 2 2
var3 2 2 2
I can do table(groupname,var1) to do them individually, but I'm trying to get it all in a single table. Any suggestions?
Using dplyr and tidyr (gather comes from tidyr):
library(dplyr)
library(tidyr)
df %>% gather("key", "value", var1:var3) %>%
group_by(key) %>%
summarise(true = sum(value==TRUE, na.rm=TRUE),
false = sum(!value, na.rm=TRUE),
missing = sum(is.na(value)))
# key true false missing
#1 var1 3 1 2
#2 var2 2 2 2
#3 var3 2 2 2
In base R, you could use table to get the counts, lapply to run through the variables, and do.call to put the results together. A minor subsetting with [ orders the columns as desired.
do.call(rbind, lapply(df[3:5], table, useNA="ifany"))[, c(2,1,3)]
TRUE FALSE <NA>
var1 3 1 2
var2 2 2 2
var3 2 2 2
This will work if each variable has all levels (TRUE, FALSE, NA). If one of the levels is missing, you can tell table to fill it with a 0 count by feeding it a factor variable.
Here is an example.
# expand data set
df$var4 <- c(TRUE, NA)
do.call(rbind, lapply(df[3:6],
function(i) table(factor(i, levels=c(TRUE, FALSE, NA)),
useNA="ifany")))[, c(2,1,3)]
FALSE TRUE <NA>
var1 1 3 2
var2 2 2 2
var3 2 2 2
var4 0 3 3
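The missing-level problem can also be sidestepped by counting directly instead of going through table. This is a sketch under the assumption that every summarized column is logical; count_tf is a made-up helper name.

```r
df <- data.frame(var1 = c(TRUE, TRUE, NA, FALSE, TRUE, NA),
                 var2 = c(FALSE, TRUE, NA, FALSE, TRUE, NA),
                 var3 = c(FALSE, TRUE, NA, FALSE, TRUE, NA))

# Hypothetical helper: always returns all three counts, even when one is zero.
count_tf <- function(x) {
  c(`TRUE`  = sum(x, na.rm = TRUE),
    `FALSE` = sum(!x, na.rm = TRUE),
    `NA`    = sum(is.na(x)))
}

# sapply gives one column per variable; transpose to get one row per variable.
res <- t(sapply(df, count_tf))
res
```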
I am trying to find a more R-esque way of selecting the 2nd element (but NOT the first) of each group in R.
I ended up: 1. creating an index rowNumIndex; 2. selecting the first row of each group into one data frame and the first two rows of each group into a separate data frame; and then 3. "reverse merging" the 2 data frames to get just the unique values from the data frame with the first two rows:
firsts <- ddply(df,.(group), function(x) head(x,1)) # 2 records using data below
seconds <- ddply(df,.(group), function(x) head(x,2)) # 4 records using data below
real.seconds <- seconds[!seconds$rowNumIndex %in% firsts$rowNumIndex, ] # 2 records, the second elements only
Here's some pretend data:
group var1 rowNumIndex
A 8 1
A 9 2
A 10 3
B 11 4
B 12 5
B 13 6
B 14 7
structure(list(group = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 2L
), .Label = c("A", "B"), class = "factor"), var1 = 8:14, rowNumIndex = 1:7), .Names = c("group",
"var1", "rowNumIndex"), class = "data.frame", row.names = c(NA,
-7L))
So, data frame firsts looks like:
group var1 rowNumIndex
A 8 1
B 11 4
And data frame seconds looks like:
group var1 rowNumIndex
A 8 1
A 9 2
B 11 4
B 12 5
And data frame real.seconds looks like:
group var1 rowNumIndex
A 9 2
B 12 5
Is there a way to do this w/o resorting to, e.g., the index? Thanks in advance for what will undoubtedly be a soul-crushingly simple and elegant solution!
A solution with dplyr:
library(dplyr)
group_by(df, group) %>% slice(2)
# group var1 rowNumIndex
# <fctr> <int> <int>
# 1 A 9 2
# 2 B 12 5
Pre-dplyr 0.3 alternative:
group_by(df, group)%.%filter(seq_along(var1)==2)
group var1 rowNumIndex
1 A 9 2
2 B 12 5
This solution will keep all the columns of the data. If you just want the two columns (group and var1), you can do this:
group_by(df, group)%.%summarise(var1[2])
group var1[2]
1 A 9
2 B 12
A solution with split, lapply and do.call
real.seconds<-do.call("rbind", lapply(split(df, df$group), function(x) x[2,]))
This will give you:
real.seconds
group var1 rowNumIndex
A A 9 2
B B 12 5
Or, more elegantly, with by:
real.seconds <- do.call(rbind, by(df, df$group, function(x) x[2, ]))
I would use data.table:
library(data.table)
dt = data.table(df)
dt[,var1[2],by=group]
As I think about it, there's no reason you shouldn't be able to do this with plyr:
ddply(df, .(group), function(x) x[2,])
A base alternative, where only 'var1' is aggregated:
aggregate(var1 ~ group, data = df, `[`, 2)
...or if you wish to aggregate all columns in the data frame, you can use the 'dot notation':
aggregate(. ~ group, data = df, `[`, 2)
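Another base R possibility, sketched here, is to compute each row's position within its group with ave and keep the rows at position 2. Unlike x[2, ], this simply skips groups with fewer than two rows instead of returning a row of NAs. pos is an illustrative name and the data is rebuilt without the rowNumIndex column.

```r
df <- data.frame(group = c("A", "A", "A", "B", "B", "B", "B"),
                 var1  = 8:14)

# Within-group running position: 1, 2, 3 for A and 1, 2, 3, 4 for B.
pos <- ave(seq_along(df$group), df$group, FUN = seq_along)

res <- df[pos == 2, ]
res
```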
I have a data set containing product prototype test data. Not all tests were run on all lots, and not all tests were executed with the same sample sizes. To illustrate, consider this case:
> test <- data.frame(name = rep(c("A", "B", "C"), each = 4),
var1 = rep(c(1:3, NA), 3),
var2 = 1:12,
var3 = c(rep(NA, 4), 1:8))
> test
name var1 var2 var3
1 A 1 1 NA
2 A 2 2 NA
3 A 3 3 NA
4 A NA 4 NA
5 B 1 5 1
6 B 2 6 2
7 B 3 7 3
8 B NA 8 4
9 C 1 9 5
10 C 2 10 6
11 C 3 11 7
12 C NA 12 8
In the past, I've only had to deal with cases of mis-matched repetitions, which has been easy with aggregate(cbind(var1, var2) ~ name, test, FUN = mean, na.action = na.omit) (or the default setting). I'll get averages for each lot over three values for var1 and over four values for var2.
Unfortunately, this will leave me with a dataset completely missing lot A in this case:
aggregate(cbind(var1, var2, var3) ~ name, test, FUN = mean, na.action = na.omit)
name var1 var2 var3
1 B 2 6 2
2 C 2 10 6
If I use na.pass, however, I also don't get what I want:
aggregate(cbind(var1, var2, var3) ~ name, test, FUN = mean, na.action = na.pass)
name var1 var2 var3
1 A NA 2.5 NA
2 B NA 6.5 2.5
3 C NA 10.5 6.5
Now I lose the good data I had in var1 since it contained instances of NA.
What I'd like is:
NA as the output of mean() if all unique combinations of varN ~ name are NAs
Output of mean() if there are one or more actual values for varN ~ name
I'm guessing this is pretty simple, but I just don't know how. Do I need to use ddply for something like this? If so... the reason I tend to avoid it is that I end up writing really long equivalents to aggregate() like so:
ddply(test, .(name), summarise,
var1 = mean(var1, na.rm = T),
var2 = mean(var2, na.rm = T),
var3 = mean(var3, na.rm = T))
Yeah... so the result of that apparently does what I want. I'll leave the question anyway in case there's 1) a way to do this with aggregate() or 2) shorter syntax for ddply.
Pass both na.action=na.pass and na.rm=TRUE to aggregate. The former tells aggregate not to delete rows where NAs exist; and the latter tells mean to ignore them.
aggregate(cbind(var1, var2, var3) ~ name, test, mean,
na.action=na.pass, na.rm=TRUE)
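One caveat: mean(x, na.rm = TRUE) returns NaN rather than NA when every value in a group is missing, as for var3 in lot A. If a true NA is wanted there, a small wrapper can be passed as FUN. This is a sketch; mean_or_na is a made-up name.

```r
test <- data.frame(name = rep(c("A", "B", "C"), each = 4),
                   var1 = rep(c(1:3, NA), 3),
                   var2 = 1:12,
                   var3 = c(rep(NA, 4), 1:8))

# Hypothetical helper: NA if the whole group is missing, otherwise the mean
# of the non-missing values.
mean_or_na <- function(x) {
  if (all(is.na(x))) NA_real_ else mean(x, na.rm = TRUE)
}

res <- aggregate(cbind(var1, var2, var3) ~ name, test,
                 FUN = mean_or_na, na.action = na.pass)
res
```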