Summarise whether a value is contained in multiple other columns - r

I am investigating a large dataset with 100+ columns. One set of columns contains integers, and no integer is repeated across those columns within a row. For example, the number 6 may or may not appear in a row, but if it does, it appears only once across the columns.
An example mock-up (bearing in mind that there are hundreds of other, non-related columns surrounding these):
> x1 <- c(1,6,4,5)
> x2 <- c(6,0,11,3)
> x3 <- c(5,0,9,6)
> df <- data.frame(cbind(x1, x2, x3))
> df
x1 x2 x3
1 1 6 5
2 6 0 0
3 4 11 9
4 5 3 6
Ideally using dplyr (since I am trying to become more "fluent"), how would I most cleanly create a new column to indicate whether or not a 6 was contained in the other columns? I am hesitant to use a function like reshape2's melt given the 100s of other columns in the dataset.
My current, messy, solution:
> library(dplyr)
> df <- mutate(df, Contains6 = (x1 == 6) + (x2 == 6) + (x3 == 6),
+ Contains6 = plyr::revalue(as.factor(as.character(Contains6)),
+ c("0"="No","1"="Yes")))
> df
x1 x2 x3 Contains6
1 1 6 5 Yes
2 6 0 0 Yes
3 4 11 9 No
4 5 3 6 Yes
Possible extension to this: would there be a clean, programmatic way of creating similar columns for all values contained in x1:x3, e.g. Contains1, Contains4, etc?

We can use apply with MARGIN=1
df$Contains6 <- c("no", "yes")[(apply(df==6, 1, any))+1L]
df$Contains6
#[1] "yes" "yes" "no" "yes"
If we need to create multiple "Contains" columns, we can loop with lapply
v1 <- c(1,4,6)
df[paste0("Contains", v1)] <- lapply(v1, function(i)
c('no', 'yes')[(apply(df==i, 1, any))+1L])
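Since the real dataset has hundreds of unrelated columns, one variation is to restrict the comparison to the columns of interest rather than testing the whole data frame. The following is a minimal base R sketch, not taken from the answer above; the names cols and vals are illustrative, and it assumes the selected columns contain no NA values.

```r
# Mock data from the question
df <- data.frame(x1 = c(1, 6, 4, 5), x2 = c(6, 0, 11, 3), x3 = c(5, 0, 9, 6))

cols <- c("x1", "x2", "x3")   # assumed: the only columns to search
vals <- c(1, 4, 6)            # assumed: the values to flag

for (v in vals) {
  # rowSums counts the matches in each row; > 0 means the value occurs at least once
  found <- unname(rowSums(df[cols] == v) > 0)
  df[[paste0("Contains", v)]] <- ifelse(found, "yes", "no")
}
df
```

Because only df[cols] is compared, the newly added Contains columns and any other columns in the data frame never take part in the test.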

Related

changing column names of a data frame by changing values - R

Suppose I have the data frame below.
df.open<-c(1,4,5)
df.close<-c(2,8,3)
df<-data.frame(df.open, df.close)
> df
df.open df.close
1 1 2
2 4 8
3 5 3
I want to rename the columns whose names include "open" to "a" and the columns whose names include "close" to "b".
That is, I want to obtain the data frame below:
a b
1 1 2
2 4 8
3 5 3
I have many such data frames. The prefix (here "df.") changes, but "open" and "close" are fixed.
Thanks a lot.
We can create a function for reuse
f1 <- function(dat) {
names(dat)[grep('open$', names(dat))] <- 'a'
names(dat)[grep('close$', names(dat))] <- 'b'
dat
}
and apply on the data
df <- f1(df)
-output
df
a b
1 1 2
2 4 8
3 5 3
if these datasets are in a list
lst1 <- list(df, df)
lst1 <- lapply(lst1, f1)
Thanks to @akrun's suggestion, we can do it in one go. We pass character vectors as the pattern and replacement arguments of str_replace so that both operations are carried out at once. Each argument can be a character vector of length one or longer. In the latter case the lengths must correspond, because str_replace is vectorised over string, pattern and replacement: here the first pattern is applied to the first column name and the second pattern to the second. As the documentation says of the replacement argument:
References of the form \1, \2, etc will be replaced with the contents
of the respective matched group (created by ())
library(dplyr)
library(stringr)
df %>%
rename_with(~ str_replace(., c(".*\\.open", ".*\\.close"), c("a", "b")))
a b
1 1 2
2 4 8
3 5 3
Another base R option using gsub + match + setNames. The gsub call strips everything up to the last dot, leaving just "open" or "close" to look up:
setNames(
df,
c("a", "b")[match(
gsub(".*\\.", "", names(df)),
c("open", "close")
)]
)
gives
a b
1 1 2
2 4 8
3 5 3
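For many data frames whose prefixes differ, the suffix lookup can be wrapped in a small function. The following is only a sketch in base R: rename_by_suffix and its lookup argument are hypothetical names, and it assumes the part of each column name after the last dot is what identifies the column.

```r
# Hypothetical helper: rename columns according to the suffix after the last dot
rename_by_suffix <- function(dat, lookup = c(open = "a", close = "b")) {
  suffix <- sub(".*\\.", "", names(dat))   # "df.open" -> "open"
  hit <- suffix %in% names(lookup)         # columns we know how to rename
  names(dat)[hit] <- unname(lookup[suffix[hit]])
  dat
}

d1 <- data.frame(df.open = c(1, 4, 5), df.close = c(2, 8, 3))
d2 <- data.frame(zz.open = 1:2, zz.close = 3:4, id = 5:6)

renamed <- lapply(list(d1, d2), rename_by_suffix)
names(renamed[[1]])
names(renamed[[2]])
```

Columns whose suffix is not in the lookup (such as id here) keep their original names.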

R data.table sum number of columns exceeding threshold

I would like to sum the number of columns whose values exceed a threshold in an observation. Additionally, I would like to specify those column names and thresholds as vectors (cols, th)
Take the example data set:
x <- data.table(x1=c(1,2,3),x2=c(3,2,1))
The goal is to create a new column exceed.count with number of columns in which x1 and x2 exceed a respective threshold. Assuming the case in which the thresholds for both x1 and x2 are 2:
th <- c(2,2)
The function could be defined as:
fn <- function(z,th) (sum(z[,x1]>th[1],z[,x2]>th[2]))
And the number of columns exceeding the thresholds calculated by:
x[,exceed.count:=fn(.SD,th),by=seq_len(nrow(x))]
The results are:
x1 x2 exceed.count
1: 1 3 1
2: 2 2 0
3: 3 1 1
What I would like to do is be able to specify the column names as vector, e.g.
cols <- c("x1","x2")
I was playing around with a function of the form:
fn.i <- function(z,i) (sum(z[,cols[i],with=FALSE] > th[i]))
which works for a single i, but how do I vectorize this across elements of cols? (cols and th will always be the same length)
I think there is an easier way to solve your problem:
x<-data.table(x1=c(1,2,3),x2=c(3,2,1))
th<-c(2,2)
x[,exceed.count:=sum(.SD>th),by=seq_len(nrow(x))]
Or, taking into account your input (only a subset of columns):
x<-data.table(x1=c(1,2,3),x2=c(3,2,1))
sd.cols = c("x1")
th<-c(2)
x[,exceed.count:=sum(.SD>th),by=seq_len(nrow(x)), .SDcols=sd.cols]
Or
x<-data.table(x1=c(1,2,3),x2=c(3,2,1))
sd.cols = c("x1")
th<-c(2,2)
x[,exceed.count:=sum(.SD>th[1]),by=seq_len(nrow(x)), .SDcols=sd.cols]
@JonnyCrunch's approach, specifying a subset of columns with .SDcols=sd.cols, works fine (as long as you ensure that the number of columns in .SD equals length(th), otherwise vector recycling will mess things up).
Here's an alternative with shorter syntax (but it will be less performant for very wide tables):
x[,exceed.count:=sum(.SD>th), by=seq_len(nrow(x)) ]
There is no need to explicitly specify .SDcols; let it default to all columns.
Define the threshold vector th for all columns, using the don't-care value +Inf in those columns you don't want counted:
> x <- data.table(x0=4:6, x1=1:3, x2=3:1, x3=7:5)
x0 x1 x2 x3
1: 4 1 3 7
2: 5 2 2 6
3: 6 3 1 5
> th <- c(+Inf, 2, +Inf, 2)
> fn <- function(z,th) (z>th)
> x[,exceed.count:=sum(.SD>th), by=seq_len(nrow(x)) ]
x0 x1 x2 x3 exceed.count
1: 4 1 3 7 1
2: 5 2 2 6 1
3: 6 3 1 5 2
Here's one way to get around iteration over rows:
x <- data.table(x1=c(1,2,3), x2=c(3,2,1))
thL <- list(x1 = 2, x2 = 2)
nm = names(thL)
x[, n := 0L]
for (i in seq_along(thL)) x[thL[i], on=sprintf("%s>%s", nm[i], nm[i]), n := n + 1L][]
x1 x2 n
1: 1 3 1
2: 2 2 0
3: 3 1 1
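The original question asked how to pair each column name in cols with its own threshold in th. One way to sketch that pairing without iterating over rows is Map plus Reduce. The example below uses a plain data frame so that it runs in base R; count_exceed is a hypothetical helper name, not part of data.table.

```r
# Hypothetical helper: per row, count how many of `cols` exceed their threshold
count_exceed <- function(dat, cols, th) {
  stopifnot(length(cols) == length(th))
  # One logical vector per column, each compared with its own threshold
  flags <- Map(function(nm, t) dat[[nm]] > t, cols, th)
  # Summing the logical vectors gives the per-row count (0L keeps it integer)
  Reduce(`+`, flags, 0L)
}

x <- data.frame(x1 = c(1, 2, 3), x2 = c(3, 2, 1))
cols <- c("x1", "x2")
th <- c(2, 2)

x$exceed.count <- count_exceed(x, cols, th)
x
```

Because the helper only uses [[ to pull out columns, it should carry over to a data.table as well, for example as x[, exceed.count := count_exceed(.SD, cols, th), .SDcols = cols].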

Anti Merging Large DataSets with Multiple Conditions

Suppose I have two data frames:
A <- data.frame(X1=c(1,2,3,4,5), X2=c(3,3,4,4,6), X3=c(3,2,14,5,4))
B <- data.frame(X1=c(1,3,5), X2=c(3,4,6))
I want to filter A so that when the X1 and X2 values of a row in A do not appear together in any row of B, that entire row (with all columns) is returned from A. I have tried anti_join and merge, but the results are not what I expected, and merge cannot handle larger data frames. I have also tried the data.table package.
I would like the below dataframe to be returned or saved to a new object.
C <- data.frame(X1=c(2,4), X2=c(3,4), X3=c(2,5))
Wouldn't you just do A %>% anti_join(B, by = c("X1", "X2")) (with dplyr loaded)? With by set to both X1 and X2, you get every row of A that has no match in B.
> A <- data.frame(X1=c(1,2,3,4,5), X2=c(3,3,4,4,6), X3=c(3,2,14,5,4))
> B <- data.frame(X1=c(1,3,5), X2=c(3,4,6))
> A%>%inner_join(B, by = c("X1", "X2"))
X1 X2 X3
1 1 3 3
2 3 4 14
3 5 6 4
> A%>%anti_join(B, by = c("X1", "X2"))
X1 X2 X3
1 2 3 2
2 4 4 5
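If dplyr is not available, the same anti-join idea can be sketched in base R by building one composite key per row and keeping the rows of A whose key is absent from B. This is an illustrative alternative rather than what the answer above uses; the "|" separator is an assumption and must not occur in the key columns themselves.

```r
A <- data.frame(X1 = c(1, 2, 3, 4, 5), X2 = c(3, 3, 4, 4, 6), X3 = c(3, 2, 14, 5, 4))
B <- data.frame(X1 = c(1, 3, 5), X2 = c(3, 4, 6))

# One composite key per row; "|" is an assumed separator not present in the data
key_A <- paste(A$X1, A$X2, sep = "|")
key_B <- paste(B$X1, B$X2, sep = "|")

# Keep the rows of A whose (X1, X2) pair has no match in B
C <- A[!(key_A %in% key_B), ]
rownames(C) <- NULL
C
```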

How can I split a character string in a dataframe into multiple columns

I'm working with a dataframe, one column of which contains values that are mostly numeric but may contain non-numeric entries. I would like to split this column into multiple columns. One of the new columns should contain the numeric portion of the original entry and another column should contain any non-numeric elements.
Here is a sample data frame:
df <- data.frame(ID=1:4,x=c('< 0.1','100','A 2.5', '200'))
Here is what I would like the data frame to look like:
ID  x1  x2
1   <   0.1
2       100
3   A   2.5
4       200
One feature of the data I am currently taking advantage of is that the structure of the character strings is always as follows: the non-numeric elements (if they exist) always precede the numeric elements and the two elements are always separated with a space.
I can use colsplit from the reshape package to split the column based on whitespace. The problem with this is that it replicates any entry that can't be split into two elements,
require(reshape)
df <- transform(df, x=colsplit(x,split=" ", names=c("x1","x2")))
df
ID x1 x2
1 < 0.1
2 100 100
3 A 2.5
4 200 200
This is not terribly problematic as I can just do some post-processing to remove the numeric elements from column "x1."
I can also accomplish what I would like to do using strsplit inside a function:
split.fn <- function(id){
new.val <- unlist(strsplit(as.character(df$x[df$ID==id])," "))
if(length(new.val)==1){
return(data.frame(ID=id,x1="NA",x2=new.val))
}else{
return(data.frame(ID=id,x1=new.val[1],x2=new.val[2]))
}
}
data.frame(rbindlist(lapply(unique(df$ID),split.fn)))
ID x1 x2
1 < 0.1
2 NA 100
3 A 2.5
4 NA 200
but this seems cumbersome.
Basically both options I've outlined here will work. But I suspect there is a more elegant or direct way to get the desired data frame.
You can use separate() from tidyr
tidyr::separate(df, x, c("x1", "x2"), " ", fill = "left")
# ID x1 x2
# 1 1 < 0.1
# 2 2 <NA> 100
# 3 3 A 2.5
# 4 4 <NA> 200
If you absolutely need to remove the NA values, then you can do
tdy <- tidyr::separate(df, x, c("x1", "x2"), " ", fill = "left")
tdy[is.na(tdy)] <- ""
and then we have
tdy
# ID x1 x2
# 1 1 < 0.1
# 2 2 100
# 3 3 A 2.5
# 4 4 200
This does not use any packages:
transform(df,
x1 = ifelse(grepl(" ", x), sub(" .*", "", x), NA),
x2 = sub(".* ", "", paste(x)))
giving:
ID x x1 x2
1 1 < 0.1 < 0.1
2 2 100 <NA> 100
3 3 A 2.5 A 2.5
4 4 200 <NA> 200
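Since the question asks for the numeric portion in its own column, a variation on the base R approach is to convert that part with as.numeric and drop the original column. This is a minimal sketch that relies on the structure described in the question (an optional non-numeric prefix, then a single space, then the number); the variable has_prefix is illustrative.

```r
df <- data.frame(ID = 1:4, x = c("< 0.1", "100", "A 2.5", "200"),
                 stringsAsFactors = FALSE)

# Entries with a space carry a non-numeric prefix before the number
has_prefix <- grepl(" ", df$x, fixed = TRUE)

df$x1 <- ifelse(has_prefix, sub(" .*$", "", df$x), "")  # prefix, or empty string
df$x2 <- as.numeric(sub("^.* ", "", df$x))              # numeric part as a number
df$x <- NULL                                            # drop the original column
df
```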

Multiply various subsets of a data frame by different vectors

I would like to multiply several columns in my data frame by a vector of values. The specific vector of values changes depending on the value in another column.
--EDIT--
What if I make the data set more complicated, i.e., more than 2 conditions and the conditions are randomly shuffled around the data set?
Here is an example of my data set:
df=data.frame(
Treatment=(rep(LETTERS[1:4],each=2)),
Species=rep(1:4,each=2),
Value1=c(0,0,1,3,4,2,0,0),
Value2=c(0,0,3,4,2,1,4,5),
Value3=c(0,2,4,5,2,1,4,5),
Condition=c("A","B","A","C","B","A","B","C")
)
Which looks like:
Treatment Species Value1 Value2 Value3 Condition
A 1 0 0 0 A
A 1 0 0 2 B
B 2 1 3 4 A
B 2 3 4 5 C
C 3 4 2 2 B
C 3 2 1 1 A
D 4 0 4 4 B
D 4 0 5 5 C
If Condition=="A", I would like to multiply columns 3-5 by the vector c(1,2,3). If Condition=="B", I would like to multiply columns 3-5 by the vector c(4,5,6). If Condition=="C", I would like to multiply columns 3-5 by the vector c(0,1,0). The resulting data frame would therefore look like this:
Treatment Species Value1 Value2 Value3 Condition
A 1 0 0 0 A
A 1 0 0 12 B
B 2 1 6 12 A
B 2 0 4 0 C
C 3 16 10 12 B
C 3 2 2 3 A
D 4 0 20 24 B
D 4 0 5 0 C
I have tried subsetting the data frame and multiplying by the vector:
t(t(subset(df[,3:5],df[,6]=="A")) * c(1,2,3))
But I can't return the subsetted data frame to the original. Is there any way to perform this operation without subsetting the data frame, so that other columns (e.g., Treatment, Species) are preserved?
Here's a fairly general solution that you should be able to adapt to fit your needs.
Note the first argument in the outer call is a logical vector and the second is numeric, so before multiplication TRUE and FALSE are converted to 1 and 0, respectively. We can add the outer results because the conditions are non-overlapping and the FALSE elements will be zero.
multiples <-
outer(df$Condition=="A",c(1,2,3)) +
outer(df$Condition=="B",c(4,5,6)) +
outer(df$Condition=="C",c(0,1,0))
df[,3:5] <- df[,3:5] * multiples
Here's a non-vectorized, but easy to understand solution:
replaceFunction <- function(v){
m <- as.numeric(v[3:5])
if (v[6]=="A")
out <- m * c(1,2,3)
else if (v[6]=="B")
out <- m * c(4,5,6)
else if (v[6]=="C")
out <- m * c(0,1,0)
else
out <- m
return(out)
}
g <- apply(df, 1, replaceFunction)
df[3:5] <- t(g)
df
Edited to reflect some notes from the comments
Assuming that Condition is a factor, you could do this:
#Modified to reflect OP's edit - the same solution works just fine
m <- matrix(c(1:6,0,1,0),3,3,byrow = TRUE)
df[,3:5] <- with(df,df[,3:5] * m[Condition,])
which makes use of fairly quick vectorized multiplication. And obviously, wrapping this in with isn't strictly necessary, it's just what popped out of my brain. Also note the subsetting comment below by Backlin.
More globally, remember that every subsetting you can do with subset you can also do with [, and crucially, [ supports assignment via [<-. So if you want to alter a portion of a data frame or matrix, you can always use this type of idiom:
df[rowCondition,colCondition] <- <replacement values>
assuming of course that <replacement values> is the same dimension as your subset of df. It may work otherwise, but you will run afoul of R's recycling rules and R may kick back a warning.
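The lookup-matrix idea depends on Condition being a factor whose levels are in the same order as the rows of m. A variation that does not depend on this is to name the rows of the multiplier matrix and index it by the condition as a character vector. The sketch below assumes the three conditions and vectors from the question; mult and value_cols are illustrative names.

```r
df <- data.frame(
  Treatment = rep(LETTERS[1:4], each = 2),
  Species   = rep(1:4, each = 2),
  Value1    = c(0, 0, 1, 3, 4, 2, 0, 0),
  Value2    = c(0, 0, 3, 4, 2, 1, 4, 5),
  Value3    = c(0, 2, 4, 5, 2, 1, 4, 5),
  Condition = c("A", "B", "A", "C", "B", "A", "B", "C"),
  stringsAsFactors = FALSE
)

# One named row of multipliers per condition
mult <- rbind(A = c(1, 2, 3), B = c(4, 5, 6), C = c(0, 1, 0))
value_cols <- c("Value1", "Value2", "Value3")

# Indexing by name builds an 8 x 3 matrix whose i-th row matches row i's condition
df[value_cols] <- as.matrix(df[value_cols]) * mult[as.character(df$Condition), ]
df
```

Indexing by row name works whether Condition is stored as a character vector or as a factor, since as.character is applied first.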
df[3:5] <- df[3:5] * t(sapply(df$Condition, function(x) if(x=="B") 4:6 else 1:3))
Or by matrix multiplication (like the line above, this treats every row that is not condition B as condition A):
df[3:5] <- df[3:5] * (3*(df$Condition == "B") %*% matrix(1, 1, 3)
+ matrix(1:3, nrow(df), 3, byrow=T))

Resources