data frame condition multiple rows - r

Say one has a data frame as follows:
data <- data.frame('obs' = c('a','c','b'), 'top1' = c('a','b','c'), 'top2' = c('b', 'c', 'f'), 'top3' = c('g', 'h', 'd'))
I want to compute a new column topn that works as follows: if the value of obs appears in any of the top columns, topn should equal obs; otherwise topn can be assigned any value, say top1. I know I can do this with an or and ifelse, but I'm looking for a shorter way to write it because my table can have up to 10 top columns.
obs top1 top2 top3 topn
a a b g a
c b c h c
b c f d c

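If we are looking for a vectorized approach, we can compare every top column against obs to build a logical matrix, use rowSums on it to find the rows with any match, and then pick the value with ifelse based on that logical vector.
# obs is recycled down each column, so every cell is compared with the obs of its own row
i1 <- as.matrix(data[-1]) == as.character(data$obs)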
data$topn <- ifelse(rowSums(i1) != 0, as.character(data$obs), as.character(data$top1))
data$topn
#[1] "a" "c" "c"
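Since the question mentions up to 10 top columns, here is a minimal sketch of the same rowSums idea that selects the columns by name instead of by position. It assumes the columns follow the top1, top2, ... naming pattern and that the values are character strings; top_cols and hit are illustrative names, not part of the original answer.

```r
data <- data.frame(obs  = c("a", "c", "b"),
                   top1 = c("a", "b", "c"),
                   top2 = c("b", "c", "f"),
                   top3 = c("g", "h", "d"),
                   stringsAsFactors = FALSE)

# Assumption: every candidate column is named "top" followed by digits
top_cols <- grep("^top[0-9]+$", names(data), value = TRUE)

# as.matrix() gives a character matrix; obs is recycled down each column,
# so each cell is compared with the obs value of its own row
hit <- rowSums(as.matrix(data[top_cols]) == data$obs) > 0

data$topn <- ifelse(hit, data$obs, data$top1)
data$topn
#[1] "a" "c" "c"
```

Adding top4 to top10 would need no change to the code, because the column set is derived from the names.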

This might be helpful and quick.
f <- function(a) {
  if (a[1] %in% a[-1]) {
    a[1]
  } else {
    sample(a[-1], 1)
  }
}
data$topn <- apply(data, 1, f)

Related

In which column there is a value of a specific variable

I have this dataframe:
a <- c(2,5,90,77,56,65,85,75,12,24,52,32)
b <- c(45,78,98,55,63,12,23,38,75,68,99,73)
c <- c(77,85,3,22,4,69,86,39,78,36,96,11)
d <- c(52,68,4,25,79,120,97,20,7,19,37,67)
e <- c(14,73,91,87,94,38,1,685,47,102,666,74)
df <- data.frame(a,b,c,d,e)
and this variable:
bb <- 120
I need to know the column number of df that contains the value of the variable "bb". How can I do this?
Thx everyone!
We could use which with arr.ind = TRUE to extract the row/col index after creating a logical matrix. Then, extract the second column to get the column index
which(df == bb, arr.ind = TRUE)[,2]
col
4
If there are duplicate elements in the column for the value compared, wrap with unique to return the unique column index
unique(which(df == bb, arr.ind = TRUE)[,2])
[1] 4
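I think we could use grep. Note that grep coerces each column to a single string and treats bb as a pattern, so it also hits values that merely contain those digits (1120, for example); it is only safe when no such partial matches can occur.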
grep(bb, df)
[1] 4
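To illustrate the difference between an exact comparison and pattern matching, here is a small sketch on a cut-down, made-up data frame (the values are not the ones from the question). It uses vapply to test each column for an exact match; exact is an illustrative name.

```r
# Made-up data: column e contains 1120, which only partially matches 120
df <- data.frame(a = c(2, 5, 90),
                 d = c(52, 120, 4),
                 e = c(1120, 73, 91))
bb <- 120

# Exact comparison, one logical value per column
exact <- which(vapply(df, function(column) any(column == bb), logical(1)))
unname(exact)
#[1] 2

# Pattern matching on the columns coerced to strings also reports column e
grep(bb, df)
```

The vapply version returns a named index, so names(exact) gives the column name directly.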

extract a column in dataframe based on condition for another column R

I want to extract a column from a dataframe in R based on a condition for another column in the same dataframe, the dataframe is given below.
b <- c(1,2,3,4)
g <- c("a", "b" ,"b", "c")
df <- data.frame(b,g)
row.names(df) <- c("aa", "bb", "cc" , "dd")
I want to extract all values of column b as a data frame (with row names) where column g has the value 'b'.
My required output is given below:
   b
bb 2
cc 3
I have tried several methods such as which and subset, but they did not work. I also searched Stack Overflow for an answer but could not find one. Is there a way to do it?
Thanks,
You can use the subset function in base R -
subset(df, g == 'b', select = b)
# b
#bb 2
#cc 3
Using data.table
library(data.table)
setDT(df, key = 'g')['b', .(b)]
b
1: 2
2: 3
Or with collapse
library(collapse)
sbt(df, g == 'b', b)
b
1 2
2 3
This is the basic way of slicing data in R; drop = FALSE keeps the result a data frame with its row names
df[df$g == 'b', 'b', drop = FALSE]
Or the tidyverse answer
library(dplyr)
df %>%
  filter(g == 'b') %>%
  select(b)
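For the base R indexing route, the asker's requirement to keep the row names hinges on the drop argument. A minimal sketch using the question's data; as_vector and as_frame are illustrative names:

```r
b <- c(1, 2, 3, 4)
g <- c("a", "b", "b", "c")
df <- data.frame(b, g)
row.names(df) <- c("aa", "bb", "cc", "dd")

# Selecting a single column drops to a plain vector by default
as_vector <- df[df$g == "b", "b"]
as_vector
#[1] 2 3

# drop = FALSE keeps a one-column data frame, row names included
as_frame <- df[df$g == "b", "b", drop = FALSE]
as_frame
#   b
#bb 2
#cc 3
```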

Find row with certain value in column and move it to top of dataframe

I have the dataframe below and I would like to move the row with C in column Reg as 1st row.
Reg <- rep(LETTERS[1:3], each = 1)
Res <- c("Urban", "Rural","Urban")
df <- data.frame(Reg, Res)
The simple way is to create a logical vector on 'Reg' with != (not equal to), which is FALSE for the row where Reg is 'C' and TRUE for all other rows. When we order, FALSE sorts before TRUE, so the 'C' rows come first, followed by the rest in their original order.
df[order(df$Reg != 'C'),]
Or similar options in dplyr are
library(dplyr)
df %>%
slice(order(Reg != 'C'))
Or with arrange
df %>%
arrange(Reg != 'C')
Here is another option
df[order(replace(seq(nrow(df)), df$Reg == "C", 0)),]
Reg Res
3 C Urban
1 A Urban
2 B Rural
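If several values need to move to the top in a chosen order, the same order() idea can be extended with match(). This is a sketch on a slightly larger, made-up data frame; wanted is a hypothetical priority list, not something from the question.

```r
Reg <- LETTERS[1:5]
Res <- c("Urban", "Rural", "Urban", "Rural", "Urban")
df <- data.frame(Reg, Res)

# Hypothetical priority list: D first, then B
wanted <- c("D", "B")

# match() gives the position in `wanted` (NA for everything else);
# order() puts NA last and leaves ties in their original row order
reordered <- df[order(match(df$Reg, wanted)), ]
reordered$Reg
#[1] "D" "B" "A" "C" "E"
```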

Replace each instance of multiple values in a vector with a value specific for the unique value

I have vector
a <- c(2, 1, 1, 2)
and I want to replace each unique value with another unique value. For instance, I want 2 -> 'a' and 1 -> 'b' to create a vector with the same order like this:
c <- c('a', 'b', 'b', 'a')
I tried something like this, but it didn't work:
replace(a, a %in% unique(a), b)
I want to avoid manually going through all unique values, so that this generalizes in case a is large. The replacement strings are just examples. The solution should also generalize to completely different strings or values, e.g. 2 -> 'Walter' and 1 -> 'Getrude'.
We can use match to get a numeric index and then replace based on it (using base R)
c('a', 'b')[match(a, unique(a))]
#[1] "a" "b" "b" "a"
One way to proceed here would be to define two data frames, one for the starting values and the other for the values to be mapped:
library(plyr)
df1 <- data.frame(a=c(2, 1, 1, 2))
df2 <- data.frame(a=c(1, 2), value=c('a', 'b'), stringsAsFactors=FALSE)
result <- join(df1, df2)$value
result
[1] "b" "a" "a" "b"
Having a dedicated data frame or table for mapping the values is probably a good long term strategy.
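A lighter-weight variant of the mapping-table idea in base R is a named character vector used as a lookup. This is a sketch; lookup is an illustrative name, and the numeric keys are assumed to convert cleanly to strings with as.character.

```r
a <- c(2, 1, 1, 2)

# Names are the old values (as strings), elements are the replacements
lookup <- c("2" = "Walter", "1" = "Getrude")

# Index the lookup by name, then drop the names from the result
result <- unname(lookup[as.character(a)])
result
#[1] "Walter"  "Getrude" "Getrude" "Walter"
```

Values of a that have no entry in lookup come back as NA, which makes missing mappings easy to spot.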

Filter Data Frame by Matching Multiple String in Multiple Columns

I have been trying, without success, to filter my data frame with dplyr and grep using a list of strings across multiple columns. I assumed this would be a simple task, but either nobody has asked my specific question or it's not as easy as I originally thought.
For the following data frame...
foo <- data.frame(var.1 = c('a', 'b', 'c'),
                  var.2 = c('b', 'd', 'e'),
                  var.3 = c('c', 'f', 'g'),
                  var.4 = c('z', 'a', 'b'))
... I would like to be able to filter row wise to find rows that contain all three variables a, b, and c in them. My sought after answer would only return row 1, as it contains a, b, and c, and not return rows 2 and 3 even though they contain two of the three sought after variables, they do not contain all three in the same row.
I'm running into issues where grep only allows specifying vectors or one column at a time, when I really just care about finding strings across many columns in the same row.
I've also used dplyr to filter using %in%, but it just returns when any of the variables are present:
foo %>%
  filter(var.1 %in% c('a', 'b', 'c') |
         var.2 %in% c('a', 'b', 'c') |
         var.3 %in% c('a', 'b', 'c'))
Thanks for any and all help and please, let me know if you need any clarification!
Here's an approach in base R where we check if the elements of foo are equal to "a", "b", or "c" successively, add the Booleans and check if the sum of those Booleans for each row is greater than or equal to 3
Reduce("+", lapply(c("a", "b", "c"), function(x) rowSums(foo == x) > 0)) >=3
#[1] TRUE FALSE FALSE
Timings
foo = matrix(sample(letters[1:26], 1e7, replace = TRUE), ncol = 5)
system.time(Reduce("+", lapply(letters[1:20], function(x) rowSums(foo == x) > 0)) >=20)
# user system elapsed
# 3.26 0.48 3.79
system.time(apply(foo, 1, function(x) all(letters[1:20] %in% x)))
# user system elapsed
# 18.86 0.00 19.19
identical(Reduce("+", lapply(letters[1:20], function(x) rowSums(foo == x) > 0)) >=20,
apply(foo, 1, function(x) all(letters[1:20] %in% x)))
#[1] TRUE
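The Reduce approach can be wrapped in a small helper so the target strings are passed in once and the threshold follows from their number. This is a sketch; has_all is a hypothetical function name, and the data is the question's foo with 'c' entered without a leading space.

```r
foo <- data.frame(var.1 = c("a", "b", "c"),
                  var.2 = c("b", "d", "e"),
                  var.3 = c("c", "f", "g"),
                  var.4 = c("z", "a", "b"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: TRUE for rows that contain every string in `targets`
has_all <- function(df, targets) {
  m <- as.matrix(df)
  # One logical vector per target: does the row contain it at least once?
  found <- lapply(targets, function(x) rowSums(m == x) > 0)
  # Adding the logical vectors counts how many targets each row contains
  Reduce(`+`, found) == length(targets)
}

keep <- has_all(foo, c("a", "b", "c"))
foo[keep, ]
#  var.1 var.2 var.3 var.4
#1     a     b     c     z
```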
Your problem arises from trying to apply "tidyverse" solutions to data that isn't tidy. Here's the tidy solution, which uses melt to make your data tidy. See how much tidier this solution is?
library(dplyr)
library(reshape2)
rows = foo %>%
  mutate(id = 1:nrow(foo)) %>%
  melt(id = "id") %>%
  filter(value %in% c("a", "b", "c")) %>%
  group_by(id) %>%
  # count distinct matches so a repeated letter is not counted twice
  summarize(N = n_distinct(value)) %>%
  filter(N == 3) %>%
  select(id) %>%
  unlist
Warning message:
attributes are not identical across measure variables; they will be dropped
That gives you a vector of matching row indexes, which you can then subset your original data frame with:
foo[rows,]
var.1 var.2 var.3 var.4
1 a b c z
