Ifelse condition over a mask in R

I have a problem. Imagine I have a data set a:
row1 row2 row3
col1 2 3 5
col2 5 3 4
col3 3 1 6
And I have a mask which identifies the entries that should be transformed:
row1 row2 row3
col1 T F F
col2 F F T
col3 F T F
So basically, I want every entry labelled TRUE (T) to be replaced by its current value minus the corresponding value from another dataset b:
row1 row2 row3
col1 1 4 8
col2 4 1 1
col3 6 2 7
So the result should be:
row1 row2 row3
col1 1 3 5
col2 5 3 3
col3 3 -1 6
What I tried was:
new_dataset <- ifelse(Mask == 'FALSE', a, a - b)
However, I end up with a list instead of a data frame. I know this is because R fills each entry of the list with the whole dataset a or (a - b). How can I handle this?
Thank you very much in advance! :)

You can do a - b * mask (logical TRUE/FALSE values are coerced to 1/0 in arithmetic, so b is subtracted only where the mask is TRUE):
a - b * mask
# row1 row2 row3
# col1 1 3 5
# col2 5 3 3
# col3 3 -1 6
It works for both data frames and matrices:
as.data.frame(a) - as.data.frame(b) * as.data.frame(mask)
# row1 row2 row3
# col1 1 3 5
# col2 5 3 3
# col3 3 -1 6
as.matrix(a) - as.matrix(b) * as.matrix(mask)
# row1 row2 row3
# col1 1 3 5
# col2 5 3 3
# col3 3 -1 6
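For a self-contained run of the answer above, the example data could be built like this (a sketch; the names a, b, and mask come from the question, and storing them as data frames is an assumption):
a <- data.frame(row1 = c(2, 5, 3), row2 = c(3, 3, 1), row3 = c(5, 4, 6),
                row.names = c("col1", "col2", "col3"))
b <- data.frame(row1 = c(1, 4, 6), row2 = c(4, 1, 2), row3 = c(8, 1, 7),
                row.names = c("col1", "col2", "col3"))
mask <- data.frame(row1 = c(TRUE, FALSE, FALSE), row2 = c(FALSE, FALSE, TRUE),
                   row3 = c(FALSE, TRUE, FALSE), row.names = c("col1", "col2", "col3"))
a - b * mask  # reproduces the result shown above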

Assuming that all the datasets are data.frames, we convert 'i1' (i.e. the TRUE/FALSE dataset) to a matrix, use that to extract the elements from 'a' and 'b', subtract the corresponding elements, and assign the result back to the positions of 'a' that are TRUE in 'i1'.
i2 <- as.matrix(i1)
a[i2] <- a[i2] - b[i2]
a
# row1 row2 row3
#col1 1 3 5
#col2 5 3 3
#col3 3 -1 6
Or, if the datasets are really big, looping through the columns might be more efficient. We can use mapply to update the corresponding columns of 'a' using 'b', based on the matching index column in 'i1':
mapply(function(x, y, z) {x[z] <- x[z] - y[z]; x}, a, b, i1)
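Note that mapply() simplifies its result to a matrix by default. A sketch (not part of the original answer) of keeping the data-frame shape is to assign the result back into a copy of 'a':
res <- a
res[] <- mapply(function(x, y, z) {x[z] <- x[z] - y[z]; x}, a, b, i1)
res  # still a data frame, with only the masked entries changed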

Another way around:
res <- a - b
w <- which(mask == FALSE, arr.ind = TRUE)
res[w] <- a[w]
OR
res <- a
res[mask] <- a[mask] - b[mask]
# row1 row2 row3
# col1 1 3 5
# col2 5 3 3
# col3 3 -1 6
w contains the (row, column) positions of the values in a that remain unchanged.
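As a side note, arr.ind = TRUE is what makes which() return (row, column) positions instead of a flat index, so w can be used for matrix-style subsetting. A small illustration, assuming mask holds the question's TRUE/FALSE values:
which(mask == FALSE, arr.ind = TRUE)  # two-column matrix: one (row, col) pair per FALSE entry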

Related

How to sort a data frame by row in R and split into multiple data frames?

Suppose I have a data frame :
Col1 Col2 Col3 Col4 Col5 Col6
Row1 1 0 20 4 8 23
Row2 0 1 3 61 2 1
Row3 1 1 2 4 3 54
I want to sort each row of this data frame in decreasing order, splitting the result into multiple data frames, since I want to keep the column information too.
Col6 Col3 Col5 Col4 Col1 Col2
Row1 23 20 8 4 1 0
Col4 Col3 Col5 Col2 Col6 Col1
Row2 61 3 2 1 1 0
Col6 Col4 Col5 Col3 Col1 Col2
Row3 54 4 3 2 1 1
As pointed out in the comment by Eric Lecoutre, you could use lapply() in this situation. There are different ways to do this, and I'll give you two examples.
The sorting is done in both examples with the function order(). How it works is easiest understood with an example:
x <- c(3, 4, 1, 2)
x[order(x)]
## [1] 1 2 3 4
x[order(x, decreasing = TRUE)]
## [1] 4 3 2 1
So, order() does not return an ordered vector but rather the indices that can be used to order the vector.
Now to the solutions of your actual problem. You could apply over row indices as follows:
lapply(1:nrow(df), function(row) df[row, order(df[row, ], decreasing = TRUE)])
Or you could split the data frame into a list of single-row data frames and apply over that list:
lapply(split(df, 1:nrow(df)), function(row) row[, order(row, decreasing = TRUE)])
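A reproducible version of the first approach (the data frame is reconstructed from the question; unlist() is added defensively because some R versions refuse to order() a one-row data frame directly):
df <- data.frame(Col1 = c(1, 0, 1), Col2 = c(0, 1, 1), Col3 = c(20, 3, 2),
                 Col4 = c(4, 61, 4), Col5 = c(8, 2, 3), Col6 = c(23, 1, 54),
                 row.names = c("Row1", "Row2", "Row3"))
lapply(1:nrow(df), function(row) df[row, order(unlist(df[row, ]), decreasing = TRUE)])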

Transforming dataset to aggregate values [duplicate]

I have a dataset like this below
Col Value
A 1
A 0
A 1
A 1
A 1
B 0
B 1
B 0
B 1
B 1
How do I transform this so that it looks like the table below?
Col1 Col2 Col3
A 4 1
B 3 2
Col2 counts all the 1s and Col3 counts all the 0s for each factor value in Col1.
Or we can use dcast from reshape2:
library(reshape2)
dcast(df1, Col~Value, value.var='Value', length)
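Here df1 (and mydf in the answers below) is assumed to be a data frame holding the question's two columns, e.g.:
df1 <- mydf <- data.frame(Col   = rep(c("A", "B"), each = 5),
                          Value = c(1, 0, 1, 1, 1, 0, 1, 0, 1, 1))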
For this, you can just use table:
table(mydf)
## Value
## Col 0 1
## A 1 4
## B 2 3
Or:
library(data.table)
as.data.table(mydf)[, as.list(table(Value)), by = Col]
## Col 0 1
## 1: A 1 4
## 2: B 2 3
Another approach to aggregating the values is:
df <- data.frame(Col=c("A","A","A","A","A","B","B","B","B","B"), Value=c(1,0,1,1,1,0,1,0,1,1))
new_df <- as.data.frame(with(df, tapply(Value, list(Col, Value), FUN = function(x) length(x))))
new_df <- setNames(cbind(rownames(new_df), new_df), c("Col1","Col2","Col3"))
new_df
Col1 Col2 Col3
A A 1 4
B B 2 3
We can set the rownames to NULL if we do not wish to see them:
rownames(new_df) <- NULL
Result:
Col1 Col2 Col3
1 A 1 4
2 B 2 3

Keeping rows if any column matches one of a set of values

I have a simple question about subsetting using R; I think I am close but can't quite get it. Basically, I have 25 columns of interest and about 100 values. I want to keep any row that has ANY of those values in at least one of those columns. Simple example:
Values <- c(1,2,5)
col1 <- c(2,6,8,1,3,5)
col2 <- c(1,4,5,9,0,0)
col3 <- c('dog', 'cat', 'cat', 'pig', 'chicken', 'cat')
df <- cbind.data.frame(col1, col2, col3)
df1 <- subset(df, col1%in%Values)
(Note that the third column is to indicate that there are additional columns but I don't need to match the values to those; the rows retained only depend upon columns 1 and 2). I know that in this trivial case I could just add
| col2%in%Values
to get the additional rows from column 2, but with 25 columns I don't want to add an OR statement for every single one. I tried
file2011_test <- file2011[file2011[,9:33]%in%CO_codes] #real names of values
but it didn't work. (And yes I know this is mixing subsetting types; I find subset() easier to understand but I don't think it can help me with what I need?)
Maybe you can try the following, which builds one logical vector per column and combines them with `|` via Reduce, so a row is kept if any column matches:
df[Reduce(`|`, lapply(as.data.frame(df), function(x) x %in% Values)),]
# col1 col2
#[1,] 2 1
#[2,] 8 5
#[3,] 1 9
#[4,] 5 0
Or
indx <- df %in% Values
dim(indx) <- dim(df)
df[!!rowSums(indx),]
# col1 col2
# [1,] 2 1
# [2,] 8 5
# [3,] 1 9
# [4,] 5 0
Update
Using the new dataset
df[Reduce(`|`, lapply(df[sapply(df, is.numeric)], function(x) x %in% Values)),]
# col1 col2 col3
#1 2 1 dog
#3 8 5 cat
#4 1 9 pig
#6 5 0 cat
Take a look at the data.table package. It is very intuitive and can be dramatically faster on large data.
library(data.table)
df <- data.table(col1, col2, col3)
df[col1%in%Values | col2%in%Values]
# col1 col2 col3
#1: 2 1 dog
#2: 8 5 cat
#3: 1 9 pig
#4: 5 0 cat
If you want to do this for all columns, you can do it with:
df[rowSums(sapply(df, '%in%', Values) )>0]
# col1 col2 col3
#1: 2 1 dog
#2: 8 5 cat
#3: 1 9 pig
#4: 5 0 cat
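Applied to the asker's real data, the same rowSums() idea restricted to columns 9 through 33 might look like this (an untested sketch that reuses the names from the question and assumes file2011 is a regular data frame):
file2011_test <- file2011[rowSums(sapply(file2011[, 9:33], `%in%`, CO_codes)) > 0, ]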

Order a data frame only from a certain row index to a certain row index

Let's say we have a DF like this:
col1 col2
A 1
A 5
A 3
A 16
B 5
B 4
B 3
C 7
C 2
I'm trying to order col2, but only within groups of equal values in col1. Better said, I want it to look like this:
col1 col2
A 1
A 3
A 5
A 16
B 3
B 4
B 5
C 2
C 7
So order col2 only within the A, B, and C groups, not the entire col2 column at once. This is what I tried:
x <- function() {
  values <- unique(DF[, 1])
  for (i in values) {
    currentData <- which(DF$col1 == i)
    ## what to do here ?
    data[order(data[, 2]), ]
  }
}
So in currentData I have the indices of the col2 values for only the As, Bs, etc. But how do I order only those items in my entire DF data frame? Is it somehow possible to tell the order function to operate only on certain row indices of the data frame?
ave groups its first argument by the grouping variable(s) given after it, and applies the function passed as FUN within each group. Here is an application of ave sorting within groups:
DF$col2 <- ave(DF$col2, DF$col1, FUN=sort)
DF
## col1 col2
## 1 A 1
## 2 A 3
## 3 A 5
## 4 A 16
## 5 B 3
## 6 B 4
## 7 B 5
## 8 C 2
## 9 C 7
This will work even if the values in col1 are not consecutive, leaving them in their original positions.
If that is not an important consideration, there are better ways to do this, such as the answer by @user314046.
It seems that
my_df[with(my_df, order(col1, col2)), ]
will do what you want; this just sorts the data frame by col1 and then col2. If you don't want to order by col1, a method is provided in the other answer.
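For completeness, the loop the asker started could be finished along these lines (a sketch; DF is reconstructed from the question's data):
DF <- data.frame(col1 = c("A", "A", "A", "A", "B", "B", "B", "C", "C"),
                 col2 = c(1, 5, 3, 16, 5, 4, 3, 7, 2))
for (i in unique(DF$col1)) {
  idx <- which(DF$col1 == i)          # row positions belonging to this group
  DF$col2[idx] <- sort(DF$col2[idx])  # sort col2 within the group only
}
DF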

How to exclude a set of elements in R?

I have two data frames, A and B, with the same columns (same names and content types). Data frame B is a subset of A. I want to get A without B, i.e. the rows of A that are not in B. I have tried different functions like setdiff, duplicated, and which, but none of them worked for me; perhaps I didn't use them correctly. Any help is appreciated.
You could use merge: add a marker column to the subset, do a full merge, and keep only the rows where the marker ends up NA (those rows occur only in df1). For example:
df1 <- data.frame(col1=c('A','B','C','D','E'),col2=1:5,col3=11:15)
subset <- df1[c(2,4),]
subset$EXTRACOL <- 1 # use a column name that is not present among
# the original data.frame columns
merged <- merge(df1,subset,all=TRUE)
dfdifference <- merged[is.na(merged$EXTRACOL),]
dfdifference$EXTRACOL <- NULL
-----------------------------------------
> df1:
col1 col2 col3
1 A 1 11
2 B 2 12
3 C 3 13
4 D 4 14
5 E 5 15
> subset:
col1 col2 col3
2 B 2 12
4 D 4 14
> dfdifference:
col1 col2 col3
1 A 1 11
3 C 3 13
5 E 5 15
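If using a package is an option, dplyr's anti_join() expresses the same idea directly, keeping the rows of the first data frame that have no match in the second (a sketch, not part of the answer above):
library(dplyr)
df1 <- data.frame(col1 = c('A', 'B', 'C', 'D', 'E'), col2 = 1:5, col3 = 11:15)
sub <- df1[c(2, 4), ]
anti_join(df1, sub, by = c("col1", "col2", "col3"))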
