Select data frame if both values exist - R

Here is an example:
df1 <- data.frame(x=1:2, account=c(-1,-1))
df2 <- data.frame(x=1:3, account=c(1,-1,1))
df3 <- data.frame(x=1, account=c(-1))
ls <- list(df1,df2,df3)
Failed attempt:
for(i in 1:length(ls)){
  d <- ls[[i]]; if(d$account %in% c(-1,1)) { dout <- d } else { next }
}
I also tried: (not sure why this doesn't work)
grepl(paste(c(-1,1), collapse="|"), as.character(df1$account))
gives (correctly, since | means "or", so either value is matched):
[1] TRUE TRUE
However, when I tried this:
df1 <- data.frame(x=1:2, account=c(-1,1))
grepl(paste(c(-1,1), collapse="&"), as.character(df1$account))
gives:
[1] FALSE FALSE
I would like to store only those data frames that contain both -1 and 1 in the account column, and discard the rest.
Desired result:
d
  x account
1 1       1
2 2      -1
3 3       1
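The grepl attempt fails because & is not a regex operator: paste(c(-1,1), collapse="&") builds the literal pattern "-1&1", which no single account value contains. A direct membership test is simpler; a minimal base R sketch using the ls defined above:
# keep only the data frames whose 'account' column contains both -1 and 1
keep <- sapply(ls, function(d) all(c(-1, 1) %in% d$account))
dout <- ls[keep]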

Or, you could stop using a list of data.frames:
library(data.table)
DT <- rbindlist(ls, idcol="id")
# id x account
# 1: 1 1 -1
# 2: 1 2 -1
# 3: 2 1 1
# 4: 2 2 -1
# 5: 2 3 1
# 6: 3 1 -1
And filter the single table:
DT[, if (uniqueN(account) > 1) .SD, by=id]
# id x account
# 1: 2 1 1
# 2: 2 2 -1
# 3: 2 3 1
(This follows @akrun's answer; uniqueN(x) is a fast shortcut for length(unique(x)).)

We could loop through the list and check whether the length of unique elements in 'account' is greater than 1 (assuming that there are only -1 and 1 as possible elements). Use this logical index to filter the list.
ls[sapply(ls, function(x) length(unique(x$account))>1)]
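The same predicate can also be passed to Filter(), which some find more readable; an equivalent base R sketch:
# Filter() keeps the list elements for which the predicate returns TRUE
Filter(function(x) length(unique(x$account)) > 1, ls)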

Related

if values of a column is in between two columns in R, populate a new column

I have two data frames of different lengths, like:
df1
   locusnum CHR     MinBP     MaxBP
1:        1   1  13982248  14126651
2:        2   1  21538708  21560253
3:        3   1  28892760  28992798
4:        4   1  43760070  43927877
5:        5   1 149999059 150971195
6:        6   1 200299701 200441048
df2
      position chr
27751 13982716   1
27750 13982728   1
10256 13984208   1
27729 13985591   1
27730 13988076   1
27731 13988403   1
Both data frames have other columns; df2 has 60,000 rows and df1 has 64 rows.
I want to populate a new column in df2 with locusnum from df1. The condition would be df2$chr == df1$CHR & df2$position %in% df1$MinBP:df1$MaxBP
My expected output would be:
      position chr locusnum
27751 13982716   1        1
27750 13982728   1        1
10256 13984208   1        1
27729 13985591   1        1
27730 13988076   1        1
27731 13988403   1        1
So far I have tried an ifelse statement and a for loop, as below:
if (df2$chr == df1$CHR & df2$position >= df1$MinBP & df2$position <= df1$MaxBP) df2$locusnum=df1$locusnum
and
for(i in 1:length(df2$position)){ # runs the following code for each line
  if(df2$chr[i] == df1$CHR & df2$position[i] %in% df1$MinBP:df1$MaxBP){ # if logical TRUE then it runs the next line
    df2$locusnum[i] <- df1$locusnum # gives value of another column to a new column
  }
}
but got errors:
the condition has length > 1
longer object length is not a multiple of shorter object length
Any help? Did I explain the issue clearly?
Using foverlaps(...) from the data.table package.
Your example is uninteresting because all the rows correspond to locusnum = 1, so I changed df2 a little bit to demonstrate how this works.
##
# df1 is as you provided it
# in df2: note changes to position column in row 2, 3, and 6
#
df2 <- read.table(text="
id position chr
27751 13982716 1
27750 21538718 1
10256 43760080 1
27729 13985591 1
27730 13988076 1
27731 200299711 1", header=TRUE)
##
# you start here
#
library(data.table)
setDT(df1)
setDT(df2)
df2[, c('indx', 'start', 'end'):=.(seq(.N), position, position)]
setkey(df1, CHR, MinBP, MaxBP)
setkey(df2, chr, start, end)
result <- foverlaps(df2, df1)[order(indx), .(id, position, chr, locusnum)]
## id position chr locusnum
## 1: 27751 13982716 1 1
## 2: 27750 21538718 1 2
## 3: 10256 43760080 1 4
## 4: 27729 13985591 1 1
## 5: 27730 13988076 1 1
## 6: 27731 200299711 1 6
foverlaps(...) works best if both data.tables are keyed, but keying changes the order of the rows in df2, so I added an index column (indx) to recover the original ordering, then dropped it in the final column selection.
This should be extremely fast, but honestly 60,000 rows is a small dataset, so you might not notice a difference.
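For what it's worth, recent data.table versions (1.9.8+) also support non-equi joins, which express the same interval lookup without the extra start/end columns or keys; a sketch assuming the same setDT(df1) and setDT(df2) setup:
# update join: for each df1 interval, find the df2 rows on the same
# chromosome with position inside [MinBP, MaxBP] and copy locusnum over;
# df2 rows with no matching interval are left as NA
df2[df1, on = .(chr == CHR, position >= MinBP, position <= MaxBP),
    locusnum := i.locusnum]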

Is there a quick way of filtering rows of a data.frame more than once?

I have a dataframe xd from which I wish to filter the rows with id = 1 or 2, but with each of those rows repeated twice.
set.seed(12)
xd <- data.frame(id = sort(sample(3,20, rep=TRUE)), y = rnorm(20))
fxd <- subset(xd, subset = id %in% c(1,2,1,2)) # doesn't work
str(fxd)
However, this doesn't work: it selects the id = 1 and id = 2 rows only once. Is there a quick way of getting around this?
The subset argument of subset() expects a logical expression: you select rows by supplying TRUE/FALSE for each row, so repeating a value inside %in% has no effect.
If you want to replicate the selection, one option is to use which(). which() returns row numbers, and row numbers can be replicated. Hence:
set.seed(12)
xd <- data.frame(id = sort(sample(3,20, rep=TRUE)), y = rnorm(20))
fxd <- xd[rep(which(xd$id %in% c(1,2)), each = 2),]
fxd
# id y
# 1 1 -0.77771958
# 1.1 1 -0.77771958
# 2 1 -1.29388230
# 2.1 1 -1.29388230
# 3 1 -0.77956651
# 3.1 1 -0.77956651
# 4 1 0.01195176
# 4.1 1 0.01195176
# 5 1 -0.15241624
# 5.1 1 -0.15241624
# 6 1 -0.70346425
# 6.1 1 -0.70346425
# 7 1 1.18887916
# 7.1 1 1.18887916
# 8 1 0.34051227
# 8.1 1 0.34051227
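Note the choice of each = 2: it repeats every selected row in place (row 1, row 1, row 2, row 2, ...). If you instead wanted the whole matching block repeated, times does that:
# times = 2 repeats the full index vector, so the block of matching rows
# appears twice in sequence rather than each row being doubled in place
fxd2 <- xd[rep(which(xd$id %in% c(1, 2)), times = 2), ]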

How to use merge or replace to update a table in R with multiple columns

I want to do something VERY similar to this question: how to use merge() to update a table in R
but instead of just one column being the index, I want to match the new values on an arbitrary number of columns >=1.
foo <- data.frame(index1=c('a', 'b', 'b', 'd','e'),index2=c(1, 1, 2, 3, 2), value=c(100,NA, 101, NA, NA))
Which has the following values:
foo
  index1 index2 value
1      a      1   100
2      b      1    NA
3      b      2   101
4      d      3    NA
5      e      2    NA
And the data frame bar
bar <- data.frame(index1=c('b', 'd'),index2=c(1,3), value=c(200, 201))
Which has the following values:
bar
  index1 index2 value
1      b      1   200
2      d      3   201
merge(foo, bar, by='index', all=T)
This fails, since there is no single index column; and merging on both keys with by=c('index1','index2') instead yields separate value.x and value.y columns rather than the single updated value column I'm after.
Desired output:
foo
  index1 index2 value
1      a      1   100
2      b      1   200
3      b      2   101
4      d      3   201
5      e      2    NA
I think you don't need a merge here, but rather an rbind followed by filtering. I'm using data.table for its syntactic sugar.
dx <- rbind(bar, foo)
library(data.table)
setDT(dx)
## note this can be applied to any number of index columns
setkeyv(dx, grep("index", names(dx), value=TRUE))
## unique() by the key removes duplicates, keeping the first occurrence;
## since bar was rbinded first, its non-missing values win over the
## duplicated foo rows with missing values, which is the expected behavior
## (recent data.table versions need by=key(dx), as unique() now defaults
## to using all columns)
unique(dx, by=key(dx))
# index1 index2 value
# 1: b 1 200
# 2: b 2 101
# 3: d 3 201
# 4: a 1 100
# 5: e 2 NA
You can be more explicit and filter your rows by group of indices:
dx[,ifelse(length(value)>1,value[!is.na(value)],value),key(dx)]
Here's a base R approach:
> temp <- merge(foo, bar, by=c("index1","index2"), all=TRUE)
> temp$value <- with(temp, ifelse(is.na(value.x) & is.na(value.y), NA, rowSums(temp[,3:4], na.rm=TRUE)))
> temp <- temp[, -c(3,4)]
> temp
index1 index2 value
1 a 1 100
2 b 1 200
3 b 2 101
4 d 3 201
5 e 2 NA
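One caveat: the rowSums trick works only because no key has a non-NA value in both tables at once. A sketch of a variant (not from the original answer) that instead prefers bar's value whenever it is present:
temp <- merge(foo, bar, by=c("index1","index2"), all=TRUE)
# value.y comes from bar; fall back to foo's value.x where bar has no match
temp$value <- ifelse(is.na(temp$value.y), temp$value.x, temp$value.y)
temp <- temp[, c("index1","index2","value")]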
You can use some dplyr voodoo to produce what you want. The following subsets the data by unique combinations of "index1" and "index2", and checks the contents of "value" for each subset. If "value" has any non-NA values, those are returned. If only an NA value is found, that is returned.
It's a little specific, but it does what you want!
library(dplyr)
df.merged <- merge(foo, bar, all = T) %>%
group_by(index1, index2) %>%
do(
if (any(!is.na(.$value))) {
return(subset(., !is.na(value)))
} else {
return(.)
}
)
Output:
index1 index2 value
<fctr> <fctr> <dbl>
1 a 1 100
2 b 1 200
3 b 2 101
4 d 3 201
5 e 2 NA
You can specify as many columns as you want with merge:
out <- merge(foo, bar, by=c("index1", "index2"), all.x=TRUE)
new <- apply(out[,3:4], 1, function(x) sum(x, na.rm=TRUE))
new <- ifelse(is.na(out[,3]) & is.na(out[,4]), NA, new)
out <- cbind(out[,1:2], new)
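As an aside, with data.table already loaded an update join does the whole thing in one step, overwriting value in foo only where bar matches on both index columns (a sketch, not from the original answers):
library(data.table)
setDT(foo)
# i.value refers to bar's value column; foo rows without a match in bar
# keep their existing value
foo[bar, on = .(index1, index2), value := i.value]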

Matching without replacement by id in R

In R, I can easily match unique identifiers using the match function:
match(c(1,2,3,4),c(2,3,4,1))
# [1] 4 1 2 3
When I try to match non-unique identifiers, I get the following result:
match(c(1,2,3,1),c(2,3,1,1))
# [1] 3 1 2 3
Is there a way to match the indices "without replacement", that is, each index appearing only once?
othermatch(c(1,2,3,1),c(2,3,1,1))
# [1] 3 1 2 4 # note the 4 where there was a 3 at the end
You're looking for pmatch:
pmatch(c(1,2,3,1),c(2,3,1,1))
# [1] 3 1 2 4
A more naive approach:
library(data.table)
a <- data.table(p = c(1,2,3,1))
a[,indexa := .I]
b <- data.table(q = c(2,3,1,1))
b[,indexb := .I]
setkey(a,p)
setkey(b,q)
# since p and q are permutations of each other, cbinding the key-ordered
# tables gives ab with ab$p equal to ab$q
ab <- cbind(a,b)
setkey(ab,indexa)
ab[,indexb]
#[1] 3 1 2 4
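Another base R route (an alternative sketch, not from the original answers) is to disambiguate the duplicates with make.unique(), after which plain match() behaves without replacement:
x <- c(1, 2, 3, 1)
y <- c(2, 3, 1, 1)
# make.unique() renames repeats ("1", "1.1", ...), so each occurrence in x
# pairs with a distinct occurrence in y
match(make.unique(as.character(x)), make.unique(as.character(y)))
# [1] 3 1 2 4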

R: How can I sum across variables, within cases, while counting NA as zero

Fake data for illustration:
df <- data.frame(a=c(1,2,3,4,5), b=c(2,2,2,2,NA),
                 c=c(NA,2,3,4,5))
This would get me the answer I want IF it weren't for the NA values:
df$count <- with(df, (a==1) + (b==2) + (c==3))
Also, would there be an even more elegant way if I was only interested in, e.g. variables==2?
df$count <- with(df, (a==2) + (b==2) + (c==2))
Many thanks!
The following works for your specific example, but I have a suspicion that your real use case is more complicated:
df$count <- apply(df,1,function(x){sum(x == 1:3,na.rm = TRUE)})
> df
  a  b  c count
1 1  2 NA     2
2 2  2  2     1
3 3  2  3     2
4 4  2  4     1
5 5 NA  5     0
But this general approach should work. For instance, your second example would be something like this:
df$count <- apply(df,1,function(x){sum(x == 2,na.rm = TRUE)})
or more generally you could allow yourself to pass in a variable for the comparison:
df$count <- apply(df,1,function(x,compare){sum(x == compare,na.rm = TRUE)},compare = 1:3)
Another way is to subtract your target vector from each row of your data.frame, negate and then do rowSums with na.rm=TRUE:
target <- 1:3
rowSums(!(df-rep(target,each=nrow(df))),na.rm=TRUE)
[1] 2 1 2 1 0
target <- rep(2,3)
rowSums(!(df-rep(target,each=nrow(df))),na.rm=TRUE)
[1] 1 3 1 1 0
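And for the fixed-value case the same idea collapses to a one-liner, since df == 2 compares every cell at once (a sketch using df as originally defined, before any count column is added):
# NA == 2 yields NA, which na.rm = TRUE drops from the row sums
rowSums(df == 2, na.rm = TRUE)
# [1] 1 3 1 1 0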
