Keeping rows if any column matches one of a set of values - r

I have a simple question about subsetting using R; I think I am close but can't quite get it. Basically, I have 25 columns of interest and about 100 values. Any row that has ANY of those values in at least one of those columns, I want to keep. Simple example:
Values <- c(1,2,5)
col1 <- c(2,6,8,1,3,5)
col2 <- c(1,4,5,9,0,0)
col3 <- c('dog', 'cat', 'cat', 'pig', 'chicken', 'cat')
df <- cbind.data.frame(col1, col2, col3)
df1 <- subset(df, col1%in%Values)
(Note that the third column is to indicate that there are additional columns but I don't need to match the values to those; the rows retained only depend upon columns 1 and 2). I know that in this trivial case I could just add
| col2%in%Values
to get the additional rows from column 2, but with 25 columns I don't want to add an OR statement for every single one. I tried
file2011_test <- file2011[file2011[,9:33]%in%CO_codes] #real names of values
but it didn't work. (And yes I know this is mixing subsetting types; I find subset() easier to understand but I don't think it can help me with what I need?)

Maybe you can try:
df[Reduce(`|`, lapply(as.data.frame(df), function(x) x %in% Values)),]
# col1 col2
#[1,] 2 1
#[2,] 8 5
#[3,] 1 9
#[4,] 5 0
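Or (this relies on df being a matrix, as in the output above; %in% does not work element-wise on a data frame)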
indx <- df %in% Values
dim(indx) <- dim(df)
df[!!rowSums(indx),]
# col1 col2
# [1,] 2 1
# [2,] 8 5
# [3,] 1 9
# [4,] 5 0
Update
Using the new dataset
df[Reduce(`|`, lapply(df[sapply(df, is.numeric)], function(x) x %in% Values)),]
# col1 col2 col3
#1 2 1 dog
#3 8 5 cat
#4 1 9 pig
#6 5 0 cat
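For the real data in the question, where only a block of columns (9 to 33) should be checked, the same idea can be restricted to a chosen set of columns. A minimal sketch on the toy data; the `cols` vector is a stand-in assumed here for the real column selection:

```r
Values <- c(1, 2, 5)
df <- data.frame(
  col1 = c(2, 6, 8, 1, 3, 5),
  col2 = c(1, 4, 5, 9, 0, 0),
  col3 = c("dog", "cat", "cat", "pig", "chicken", "cat")
)

# Columns to test (assumed for illustration; for the real data this
# could be a range of positions such as 9:33).
cols <- c("col1", "col2")

# One logical column per tested column: TRUE where the value is in Values.
hit <- sapply(df[cols], function(x) x %in% Values)

# Keep a row if at least one tested column matched.
keep <- rowSums(hit) > 0
result <- df[keep, ]
result   # keeps rows 1, 3, 4 and 6
```

Note that `sapply` returns a matrix here only because the data frame has more than one row; for a one-row input the result would need to be reshaped before calling `rowSums`.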

Take a look at the data.table package. It is intuitive and is often much faster on large data.
library(data.table)
df <- data.table(col1, col2, col3)
df[col1%in%Values | col2%in%Values]
# col1 col2 col3
#1: 2 1 dog
#2: 8 5 cat
#3: 1 9 pig
#4: 5 0 cat
If you want to do this for all columns, you can use:
df[rowSums(sapply(df, '%in%', Values) )>0]
# col1 col2 col3
#1: 2 1 dog
#2: 8 5 cat
#3: 1 9 pig
#4: 5 0 cat

Related

Ifelse condition over a mask

I have a problem. Imagine I have a data set a:
row1 row2 row3
col1 2 3 5
col2 5 3 4
col3 3 1 6
And I have a mask that identifies the entries which should be transformed:
row1 row2 row3
col1 T F F
col2 F F T
col3 F T F
So basically, I want every entry labelled TRUE (T) to be replaced by its current value minus the corresponding value from another dataset b:
row1 row2 row3
col1 1 4 8
col2 4 1 1
col3 6 2 7
So the result should be:
row1 row2 row3
col1 1 3 5
col2 5 3 3
col3 3 -1 6
What I tried was:
new_dataset <- ifelse(Mask == 'FALSE', a, a - b)
However, I end up with a list instead of a data frame. I know that this is because R builds every entry of the list from the whole dataset a or (a - b). But how can I handle this?
Thank you very much in advance! :)
You can do a - b * mask:
a - b * mask
# row1 row2 row3
# col1 1 3 5
# col2 5 3 3
# col3 3 -1 6
It works for both data frames and matrices:
as.data.frame(a) - as.data.frame(b) * as.data.frame(mask)
# row1 row2 row3
# col1 1 3 5
# col2 5 3 3
# col3 3 -1 6
as.matrix(a) - as.matrix(b) * as.matrix(mask)
# row1 row2 row3
# col1 1 3 5
# col2 5 3 3
# col3 3 -1 6
Assuming that all the datasets are data.frames, we convert 'i1' (i.e. the TRUE/FALSE dataset) to a matrix, use it to extract the elements from 'a' and 'b', subtract the corresponding elements, and assign the result to the elements of 'a' where 'i1' is TRUE.
i2 <- as.matrix(i1)
a[i2] <- a[i2] - b[i2]
a
# row1 row2 row3
#col1 1 3 5
#col2 5 3 3
#col3 3 -1 6
Or, if the datasets are really big, looping through the columns might be more efficient. We can use mapply to update each column of 'a' from the corresponding column of 'b', based on the matching index column in 'i1':
mapply(function(x, y, z) {x[z] <- x[z] - y[z]; x}, a, b, i1)
Another way around:
res <- a-b
w <- which(mask == FALSE, arr.ind = TRUE)
res[w] <- a[w]
OR
res <- a
res[mask] <- a[mask]-b[mask]
# row1 row2 row3
# col1 1 3 5
# col2 5 3 3
# col3 3 -1 6
w holds the positions (row and column indices) of the entries of a that remain unchanged.
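For reference, here is a self-contained sketch that rebuilds the example data as matrices and checks that three of the approaches above agree. It also shows that ifelse keeps the matrix shape when the mask itself is a logical matrix rather than a data frame. The object names are taken from the question; storing the data as matrices is an assumption.

```r
nm <- list(c("col1", "col2", "col3"), c("row1", "row2", "row3"))

# Values are filled column by column to match the tables in the question.
a    <- matrix(c(2, 5, 3,  3, 3, 1,  5, 4, 6), nrow = 3, dimnames = nm)
b    <- matrix(c(1, 4, 6,  4, 1, 2,  8, 1, 7), nrow = 3, dimnames = nm)
mask <- matrix(c(TRUE, FALSE, FALSE,
                 FALSE, FALSE, TRUE,
                 FALSE, TRUE, FALSE), nrow = 3, dimnames = nm)

# 1. Arithmetic: TRUE counts as 1 and FALSE as 0.
res_arith <- a - b * mask

# 2. ifelse on matrices: the result takes the shape of the logical test.
res_ifelse <- ifelse(mask, a - b, a)

# 3. Logical indexing: only the masked entries are modified.
res_index <- a
res_index[mask] <- a[mask] - b[mask]

res_arith
```

If the inputs are data frames, converting them with as.matrix() first (and back with as.data.frame() afterwards, if needed) gives the same behaviour.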

Transforming dataset to aggregate values [duplicate]

I have a dataset like this below
Col Value
A 1
A 0
A 1
A 1
A 1
B 0
B 1
B 0
B 1
B 1
How do I transform this so that it looks like this below
Col1 Col2 Col3
A 4 1
B 3 2
Col2 counts all the 1s and Col3 counts all the 0s for each factor value in Col1.
Or we can use dcast
library(reshape2)
dcast(df1, Col~Value, value.var='Value', length)
For this, you can just use table:
table(mydf)
## Value
## Col 0 1
## A 1 4
## B 2 3
Or:
library(data.table)
as.data.table(mydf)[, as.list(table(Value)), by = Col]
## Col 0 1
## 1: A 1 4
## 2: B 2 3
Another approach to aggregating the values is:
df <- data.frame(Col=c("A","A","A","A","A","B","B","B","B","B"), Value=c(1,0,1,1,1,0,1,0,1,1))
new_df <- as.data.frame(with(df, tapply(Value, list(Col, Value), FUN = length)))
new_df <- setNames(cbind(rownames(new_df), new_df), c("Col1","Col2","Col3"))
new_df
Col1 Col2 Col3
A A 1 4
B B 2 3
We can set rownames to NULL if we do not wish to see them:
rownames(new_df) <- NULL
Result:
Col1 Col2 Col3
1 A 1 4
2 B 2 3
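The outputs above list the count of 0s before the count of 1s, whereas the question asks for the 1s in Col2 and the 0s in Col3. A small sketch of getting exactly that layout from table; the data frame name mydf follows the earlier answer and the column names follow the question:

```r
mydf <- data.frame(
  Col   = rep(c("A", "B"), each = 5),
  Value = c(1, 0, 1, 1, 1, 0, 1, 0, 1, 1)
)

# Cross-tabulate: one row per Col level, one column per Value level.
tab <- table(mydf$Col, mydf$Value)

# Pick the columns by their label so the order is explicit:
# Col2 = number of 1s, Col3 = number of 0s.
out <- data.frame(
  Col1 = rownames(tab),
  Col2 = as.vector(tab[, "1"]),
  Col3 = as.vector(tab[, "0"])
)
out
```

Selecting the table columns by label ("1", "0") rather than by position avoids depending on the sort order of the values.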

Maintain NA's after aggregation R

I have a data frame as follows
test_df<-data.frame(col1=c(1,NA,NA,4,5),col2=c(3,NA,NA,5,6),col3=c("a","b","c","d","c"))
test_df
col1 col2 col3
1 3 a
NA NA b
NA NA c
4 5 d
5 6 c
I am aggregating data based on col3
agg_test<-aggregate(list(test_df$col1,test_df$col2),by=list(test_df$col3),sum,na.rm=T)
agg_test
Col3 col1 col2
a 1 3
b 0 0
c 5 6
d 4 5
From what I know, for the summation to be correct we need to state explicitly what is to be done with NAs; in this case I have specified that NAs are to be removed from the summation. I guess that internally R converts all NAs to 0 and sums according to the by condition. I need to treat the NAs and 0s in my data differently, and therefore have to keep the NAs that are valid (in this case the observations for b are NAs, not 0). How can I achieve this?
Expected output
Col3 col1 col2
a 1 3
b NA NA
c 5 6
d 4 5
library(data.table)
unique(setDT(test_df)[, lapply(.SD, function(x)
replace(x, !all(is.na(x)), sum(x, na.rm=TRUE))) , by=col3])
# col3 col1 col2
#1: a 1 3
#2: b NA NA
#3: c 5 6
#4: d 4 5
test_df1 <- test_df
test_df1$col2[2] <- 2
unique(setDT(test_df1)[, lapply(.SD, function(x)
replace(x, !all(is.na(x)), sum(x, na.rm=TRUE))) , by=col3])
# col3 col1 col2
#1: a 1 3
#2: b NA 2
#3: c 5 6
#4: d 4 5
Update
Or using the compact code suggested by @Arun
test_df1$col2[5] <- NA
setDT(test_df1)[, lapply(.SD,
function(x) sum(x,na.rm= !all(is.na(x)))), by=col3]
# col3 col1 col2
#1: a 1 3
#2: b NA 2
#3: c 5 NA
#4: d 4 5
It sounds like (based on your comments to requests for clarification) you want to aggregate your groups so that you get NA if all the values in a group are missing, and otherwise the sum of the non-missing values. You can pass aggregate a user-defined function that has this behavior:
aggregate(list(test_df$col1,test_df$col2), by=list(test_df$col3),
function(x) ifelse(all(is.na(x)), NA, sum(x, na.rm=T)))
# Group.1 c.1..NA..NA..4..5. c.3..NA..NA..5..6.
# 1 a 1 3
# 2 b NA NA
# 3 c 5 6
# 4 d 4 5
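The generated column names in that output can be avoided by passing aggregate a data frame with named columns and a named by list. A minimal sketch of the same idea; the helper name sum_or_na is made up for illustration:

```r
test_df <- data.frame(
  col1 = c(1, NA, NA, 4, 5),
  col2 = c(3, NA, NA, 5, 6),
  col3 = c("a", "b", "c", "d", "c")
)

# NA when every value in the group is missing,
# otherwise the sum of the non-missing values.
sum_or_na <- function(x) {
  if (all(is.na(x))) NA_real_ else sum(x, na.rm = TRUE)
}

agg <- aggregate(
  test_df[c("col1", "col2")],      # named columns keep their names
  by  = list(col3 = test_df$col3), # named grouping variable
  FUN = sum_or_na
)
agg
```

Using if/else instead of ifelse is a small design choice: the condition is a single TRUE/FALSE per group, so a vectorised ifelse is not needed.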

Order a data frame only from a certain row index to a certain row index

Let's say we have a DF like this:
col1 col2
A 1
A 5
A 3
A 16
B 5
B 4
B 3
C 7
C 2
I'm trying to order col2 but only for same values in col1. Better said, I want it to look like this:
col1 col2
A 1
A 3
A 5
A 16
B 3
B 4
B 5
C 2
C 7
So I want to order col2 only within the A, B and C groups, not the entire col2 column.
x <- function() {
values<- unique(DF[, 1])
for (i in values) {
currentData <- which(DF$col1== i)
## what to do here ?
data[order(data[, 2]), ]
}
}
So in currentData I have the indexes of the col2 values for only the As, Bs, etc. But how do I order only those items within my entire DF data frame? Is it somehow possible to tell the order function to order only certain row indexes of a data frame?
ave splits its first argument into groups defined by the argument(s) that follow it, and applies the named function to each group. Here is an application of ave to sorting within groups:
DF$col2 <- ave(DF$col2, DF$col1, FUN=sort)
DF
## col1 col2
## 1 A 1
## 2 A 3
## 3 A 5
## 4 A 16
## 5 B 3
## 6 B 4
## 7 B 5
## 8 C 2
## 9 C 7
This will work even if the values in col1 are not consecutive, leaving them in their original positions.
If that is not an important consideration, there are better ways to do this, such as the answer by @user314046.
It seems that
my_df[with(my_df, order(col1, col2)), ]
will do what you want: this just sorts the data frame by col1 and then by col2. If you don't want to order by col1, a method is provided in the other answer.
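The difference between the two answers only shows up when the groups in col1 are interleaved. A small sketch with made-up data (not the data from the question) that contrasts them:

```r
# Hypothetical data in which the groups are not consecutive.
DF <- data.frame(
  col1 = c("A", "B", "A", "B"),
  col2 = c(3, 9, 1, 2)
)

# ave: sort col2 within each group, leaving col1 where it is.
by_ave <- DF
by_ave$col2 <- ave(DF$col2, DF$col1, FUN = sort)
by_ave$col2      # 1 2 3 9, with col1 still A B A B

# order: reorder whole rows by col1, then by col2.
by_order <- DF[order(DF$col1, DF$col2), ]
by_order$col2    # 1 3 2 9, with col1 now A A B B
```

One caveat: ave rewrites col2 alone, so if the data frame has further columns that belong to the same observation, they no longer line up with the sorted values. Reordering whole rows with order keeps each row intact.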

How to exclude a set of elements in R?

I have two data frames, A and B, with the same column names and the same kind of content. Data frame B is a subset of A. I want to get A without B. I have tried different functions such as setdiff, duplicated, which and others. None of them worked for me; perhaps I didn't use them correctly. Any help is appreciated.
You could use merge e.g.:
df1 <- data.frame(col1=c('A','B','C','D','E'),col2=1:5,col3=11:15)
subset <- df1[c(2,4),]
subset$EXTRACOL <- 1 # use a column name that is not present among
# the original data.frame columns
merged <- merge(df1,subset,all=TRUE)
dfdifference <- merged[is.na(merged$EXTRACOL),]
dfdifference$EXTRACOL <- NULL
-----------------------------------------
> df1:
col1 col2 col3
1 A 1 11
2 B 2 12
3 C 3 13
4 D 4 14
5 E 5 15
> subset:
col1 col2 col3
2 B 2 12
4 D 4 14
> dfdifference:
col1 col2 col3
1 A 1 11
3 C 3 13
5 E 5 15
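A base R alternative without merge is to build one text key per row and drop the rows of A whose key also occurs in B. This is only a sketch, and it assumes that the rows of B appear in A with exactly the same values in every column; the helper name row_key is made up for illustration:

```r
A <- data.frame(col1 = c("A", "B", "C", "D", "E"), col2 = 1:5, col3 = 11:15)
B <- A[c(2, 4), ]

# Paste all columns of a row into a single string. The separator is a
# character that is assumed not to occur in the data.
row_key <- function(d) do.call(paste, c(d, sep = "\r"))

# Keep the rows of A whose key is not found among the keys of B.
A_without_B <- A[!(row_key(A) %in% row_key(B)), ]
A_without_B
```

If A contains duplicated rows that also occur in B, all of the copies are removed, which may or may not be what is wanted.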
