Conditions & Subtraction from Matrix in R

I've looked at R create a vector from conditional operation on matrix, and using a similar solution does not yield what I want (and I'm not sure why).
My goal is to evaluate df with the following condition: if df > 2, df -2, else 0
Take df:
a <- seq(1,5)
b <- seq(0,4)
df <- cbind(a,b) %>% as.data.frame()
df is simply:
a b
1 0
2 1
3 2
4 3
5 4
df_final should look like this after a suitable function:
a b
0 0
0 0
1 0
2 1
3 2
I applied the following function with the result, and I'm not sure why it doesn't work (further explanation of a solution would be appreciated)
apply(df,2,function(df){
ifelse(any(df>2),df-2,0)
})
Yielding the following:
a b
-1 -2
Thank you SO community!

Let's fix your function and understand why it didn't work:
apply(df, # apply to df
2, # to each *column* of df
function(df){ # this function. Call the function argument (each column) df
# (confusing because this is the same name as the data frame...)
ifelse( # Looking at each column...
any(df > 2), # if there are any values > 2
df - 2, # then df - 2
0 # otherwise 0
)
})
any() returns a single value. ifelse() returns something the same shape as the test, so by making your test any(df > 2) (a single value), ifelse() will also return a single value.
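To see the shape rule in isolation, here is a small sketch on a plain vector (the names x, scalar_test and vector_test are just for illustration). It also reproduces the -1 you got for column a:

```r
x <- 1:5

# Scalar test: ifelse() returns one value, taken from the first element of x - 2
scalar_test <- ifelse(any(x > 2), x - 2, 0)

# Vector test: ifelse() returns one value per element of x
vector_test <- ifelse(x > 2, x - 2, 0)

print(scalar_test)  # -1
print(vector_test)  # 0 0 1 2 3
```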
Let's fix this by (a) giving the function argument a different name than the data frame (for readability) and (b) getting rid of the any():
apply(df, # apply to df
2, # to each *column* of df
function(x){ # this function. Call the function argument (each column) x
ifelse( # Looking at each column...
x > 2, # when x is > 2
x - 2, # make it x - 2
0 # otherwise 0
)
})
apply is made for working on matrices. When you give it a data frame, the first thing it does is convert it to a matrix. If you want the result to be a data frame, you need to convert it back to a data frame.
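A minimal sketch of that round trip, assuming the same df as in the question (built here with data.frame() so the snippet runs on its own; res_matrix and res_df are illustrative names):

```r
df <- data.frame(a = 1:5, b = 0:4)

# apply() coerces df to a matrix and returns a matrix
res_matrix <- apply(df, 2, function(x) ifelse(x > 2, x - 2, 0))

# Convert the matrix back to a data frame
res_df <- as.data.frame(res_matrix)

print(is.matrix(res_matrix))   # TRUE
print(is.data.frame(res_df))   # TRUE
print(res_df)
```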
Or we can use lapply instead. lapply returns a list, and by assigning it to the columns of df with df[] <- lapply(), we won't need to convert. (And since lapply doesn't do the matrix conversion, it knows by default to apply the function to each column.)
df[] <- lapply(df, function(x) ifelse(x > 2, x - 2, 0))
As a side note, df <- cbind(a,b) %>% as.data.frame() is a more complicated way of writing df <- data.frame(a, b)

Create the 'out' dataset by subtracting 2, then use a logical condition to replace the negative values with 0
out <- df - 2
out[out < 0] <- 0
Or in a single step
(df-2) * ((df - 2) > 0)
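The single-step version works because the logical mask is coerced to 1/0 when multiplied. A quick check that both versions agree, assuming the df from the question (rebuilt here so the snippet is self-contained; one_step is an illustrative name):

```r
df <- data.frame(a = 1:5, b = 0:4)

# Two-step version: subtract, then zero out the negatives
out <- df - 2
out[out < 0] <- 0

# One-step version: TRUE/FALSE become 1/0 in the multiplication
one_step <- (df - 2) * ((df - 2) > 0)

print(all(out == one_step))  # TRUE
```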

Using apply
library(dplyr) # provides %>% and as_tibble()
a <- seq(1,5)
b <- seq(0,4)
df <- cbind(a,b) %>% as.data.frame()
new_matrix <- apply(df, MARGIN=2,function(i)ifelse(i >2, i-2,0))
new_matrix
###if you want it to return a tibble/df
new_tibble <- apply(df, MARGIN=2,function(i)ifelse(i >2, i-2,0)) %>% as_tibble()

Related

R converting a vector to a dataframe with one row

I have a vector a<-c(0, 0). I want to convert this to a dataframe and then remove duplicated rows (as part of a loop).
This is my code:
a<-c(0, 0)
df<-t(as.data.frame(a))
distinct(df)
This isn't working because df isn't a dataframe even though I have converted it to a dataframe in the second step. I'm not sure how to make this work when the dataframe only has one row.
Swap t and as.data.frame, like this:
library(dplyr)
a<-c(0, 0)
df<-as.data.frame(t(a))
distinct(df)
Output:
V1 V2
1 0 0
The function t() converts the data.frame back to a matrix. The simplest solution is to change the order of the functions:
a <- c(0,0)
df <- as.data.frame(t(a))
distinct(df)
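The difference between the two orderings can be checked in base R alone. This is a small sketch; wrong_order and right_order are illustrative names:

```r
a <- c(0, 0)

# t() on a data frame goes through as.matrix(), so the result is a matrix
wrong_order <- t(as.data.frame(a))

# Transposing the vector first gives a 1 x 2 matrix, which then becomes a data frame
right_order <- as.data.frame(t(a))

print(is.data.frame(wrong_order))  # FALSE
print(is.data.frame(right_order))  # TRUE
print(dim(right_order))            # 1 2
```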
You can use rbind.data.frame function as follows:
a <- c(0,0)
df = rbind.data.frame(a)
distinct(df)
Using dplyr and matrix:
library(dplyr)
data.frame(matrix(0, 1, 2)) %>%
distinct
#> X1 X2
#> 1 0 0

Return the IDs of a matrix that has all elements of a vector

I know the answer is probably easy, but I have not been able to find it so far, and similar questions did not help. I need to return the IDs of matrix m that have all elements of a vector (NoN). In the example below, I need to return IDs 1 and 3.
Example:
m<-matrix(c(1,1,1,1,2,2,34,45,4,4,4,4,4,5,6,3,3,3,3,21,22,3425,345,65,22,42,65,86,456,454,5678,5,234,22,65,21,22,786),nrow=19)
colnames(m)<-c("ID","LO")
NoN<-c(21,22)
My attempts so far are as follows:
1: m[all(m[,2] %in% NoN),1]
2: m[match(NoN, m[,2]),1]
3: subset(m, m[,2] %in% NoN)
4: m[which(m[,2] %in% NoN),1]
Appreciate!
Here's a function using base R:
FOO <- function(m, NoN){
# split matrix based on ID column
m2 <- lapply(split(m, m[, 1]), function(x) matrix(x, ncol = 2))
# match every element of NoN per ID; builds a matrix of match positions (NA = not found)
matchresult <- do.call(cbind, lapply(lapply(m2, function(x) lapply(NoN, function(y) match(y, x[,2]))), unlist))
# return colnames (= ID) of columns with no NA
as.numeric(colnames(matchresult)[colSums(apply(matchresult, 2, is.na)) == 0])
}
Result of function call:
> FOO(m, NoN)
[1] 1 3
Untested except for your example, but this should be able to handle any length of NoN as well as duplicated combinations of ID and LO.
Edit: More concise and efficient variant provided by #docendodiscimus:
FOO <- function(m, NoN){
df <- as.data.frame(m)
unique(df[as.logical(ave(df$LO, df$ID, FUN = function(x) all(NoN %in% x))),"ID"])
}
A not so safe way using base R:
m<-matrix(c(1,1,1,1,2,2,34,45,4,4,4,4,4,5,6,3,3,3,3,21,22,3425,345,65,22,42,65,86,456,454,5678,5,234,22,65,21,22,786),nrow=19)
colnames(m)<-c("ID","LO")
NoN<-c(21,22)
IDs <- m[m[, 2] %in% NoN, 1]
IDs <- table(IDs)
IDs <- names(IDs)[IDs >= length(NoN)]
> IDs
[1] "1" "3"
But beware, this does not take duplicated values into account. So if ID 1 had two LOs of value 21 but no 22, it would still return ID 1.
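To make the caveat concrete, here is a sketch on made-up data (m2 is hypothetical, not the matrix from the question) in which ID 1 has 21 twice but no 22. The counting approach lets ID 1 through, while a per-ID membership check does not:

```r
# Hypothetical data: ID 1 has 21 twice but no 22; ID 3 has both 21 and 22
m2 <- cbind(ID = c(1, 1, 1, 3, 3, 3), LO = c(21, 21, 5, 21, 22, 7))
NoN <- c(21, 22)

# Counting approach from above: counts matching rows, so ID 1 slips through
hits <- table(m2[m2[, "LO"] %in% NoN, "ID"])
by_count <- names(hits)[hits >= length(NoN)]

# Per-ID check that every element of NoN is present
has_all <- tapply(m2[, "LO"], m2[, "ID"], function(x) all(NoN %in% x))
by_membership <- names(has_all)[has_all]

print(by_count)       # "1" "3"
print(by_membership)  # "3"
```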
EDIT: A safe way using dplyr:
library(dplyr)
m <- data.frame(m)
IDs <- m %>%
slice(which(LO %in% NoN)) %>% # get all rows which contain values from NoN
group_by(ID) %>% # group by ID
summarise(uniques = n_distinct(LO)) %>% # count unique values per ID
filter(uniques == length(NoN)) %>% # number of unique values has to be the same as the number of values in NoN
select(ID) %>% # select ID columns
unlist() %>% # unlist it
as.numeric() # convert from named num to numeric
> IDs
[1] 1 3
Here's an alternative solution that saves matrix m as a dataframe and performs a process for each ID:
# example data
m<-matrix(c(1,1,1,1,2,2,34,45,4,4,4,4,4,5,6,3,3,3,3,21,22,3425,345,65,22,42,65,86,456,454,5678,5,234,22,65,21,22,786),nrow=19)
colnames(m)<-c("ID","LO")
NoN<-c(21,22)
library(dplyr)
data.frame(m) %>% # save m as dataframe
group_by(ID) %>% # for each ID
summarise(sum_flag = sum(LO %in% NoN)) %>% # count number of LO elements in NoN
filter(sum_flag == length(NoN)) %>% # keep rows where this number matches the length of NoN
pull(ID) # get the corresponding IDs
# [1] 1 3
Keep in mind that this process assumes (based on your example) that elements of NoN and rows of m are unique.
I took this answer from #docendo discimus; I found it efficient and concise.
df <- as.data.frame(m);
unique(df[as.logical(ave(df$LO, df$ID, FUN = function(x) all(NoN %in% x))),"ID"])

Filter by ranges supplied by two vectors, without a join operation

I wish to do exactly this: Take dates from one dataframe and filter data in another dataframe - R
except without joining, as I am afraid that after I join my data the result will be too big to fit in memory, prior to the filter.
Here is sample data:
tmp_df <- data.frame(a = 1:10)
I wish to do an operation that looks like this:
lower_bound <- c(2, 4)
upper_bound <- c(2, 5)
tmp_df %>%
filter(a >= lower_bound & a <= upper_bound) # does not work as <= is vectorised inappropriately
and my desired result is:
> tmp_df[(tmp_df$a <= 2 & tmp_df$a >= 2) | (tmp_df$a <= 5 & tmp_df$a >= 4), , drop = F]
# one way to get indices to subset data frame, impractical for a long range vector
a
2 2
4 4
5 5
My problem with memory requirements (with respect to the join solution linked) is when tmp_df has many more rows and the lower_bound and upper_bound vectors have many more entries. A dplyr solution, or a solution that can be part of pipe is preferred.
Maybe you could borrow the inrange function from data.table, which checks whether each value in x is in between any of the intervals provided in lower,upper.
Usage:
inrange(x, lower, upper, incbounds=TRUE)
library(dplyr); library(data.table)
tmp_df %>% filter(inrange(a, c(2,4), c(2,5)))
# a
#1 2
#2 4
#3 5
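For intuition about what such a check does, here is a base R sketch of the same idea. The helper in_any_range() is hypothetical and not part of data.table or dplyr; it builds one logical vector per interval instead of joining, and it assumes closed intervals like inrange's default:

```r
tmp_df <- data.frame(a = 1:10)
lower_bound <- c(2, 4)
upper_bound <- c(2, 5)

# Hypothetical helper: TRUE where x falls inside at least one
# closed interval [lower[i], upper[i]]
in_any_range <- function(x, lower, upper) {
  hits <- Map(function(lo, hi) x >= lo & x <= hi, lower, upper)
  Reduce("|", hits)
}

result <- tmp_df[in_any_range(tmp_df$a, lower_bound, upper_bound), , drop = FALSE]
print(result$a)  # 2 4 5
```

With many intervals this loops once per interval, so it will likely be slower than inrange; it is only meant to show the logic.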
If you'd like to stick with dplyr it has similar functionality provided through the between function.
# ranges I want to check between
my_ranges <- list(c(2,2), c(4,5), c(6,7))
tmp_df <- data.frame(a=1:10)
tmp_df %>%
filter(apply(bind_rows(lapply(my_ranges,
FUN=function(x, a){
data.frame(t(between(a, x[1], x[2])))
}, a)
), 2, any))
a
1 2
2 4
3 5
4 6
5 7
Just be aware that the boundaries are included by default and that, unlike with inrange, this cannot be changed.

Counting function in R

I have a dataset like this
id <- 1:12
b <- c(0,0,1,2,0,1,1,2,2,0,2,2)
c <- rep(NA,3)
d <- rep(NA,3)
df <-data.frame(id,b)
newdf <- data.frame(c,d)
I want to do simple math: count how many values equal 1 and how many equal 2 in this dataset. But I don't want to count over the whole dataset; I want my function to count them in groups of four rows.
I want a result like this:
> newdf
one two
1 1 1
2 2 1
3 0 3
I tried this with lots of variations but I couldn't succeed.
afonk <- function(x) {
ifelse(x==1 | x==2, x, newdf <- (x[1]+x[2]))
}
afonk(newdf$one)
lapply(newdf, afonk)
Thanks in advance!
ismail
Fun with base R:
# counting function
countnum <- function(x,num){
sum(x == num)
}
# make list of groups of 4
df$group <- rep(1:ceiling(nrow(df)/4),each = 4)[1:nrow(df)]
dfl <- split(df$b,f = df$group)
# make data frame of counts
newdf <- data.frame(one = sapply(dfl,countnum,1),
two = sapply(dfl,countnum,2))
Edit based on comment:
# make list of groups of 4
df$group <- rep(1:ceiling(nrow(df)/4),each = 4)[1:nrow(df)]
table(subset(df, b != 0L)[c("group", "b")])
Which you prefer depends on what type of result you need. A table will work for a small visual count, and you can likely pull the data out of the table, but if it is as simple as your example, you might opt for the data.frame.
We could use dcast from data.table. Create a grouping variable using %/% and then dcast from 'long' to 'wide' format.
library(data.table)
dcast(setDT(df)[,.N ,.(grp=(id-1)%/%4+1L, b)],
grp~b, value.var='N', fill =0)[,c(3,4), with=FALSE]
Or a slightly more compact version would be using fun.aggregate as length.
res <- dcast(setDT(df)[,list((id-1)%/%4+1L, b)][b!=0],
V1~b, length)[,V1:=NULL][]
res
# 1 2
#1: 1 1
#2: 2 1
#3: 0 3
If we need the column names to be 'one', 'two'
library(english)
names(res) <- as.character(english(as.numeric(names(res))))
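The grouping expression (id-1)%/%4+1 used above can be checked without data.table. This is a base R sketch on the data from the question; grp and counts are illustrative names:

```r
id <- 1:12
b <- c(0, 0, 1, 2, 0, 1, 1, 2, 2, 0, 2, 2)

# Integer division maps ids 1-4 to group 1, 5-8 to group 2, 9-12 to group 3
grp <- (id - 1) %/% 4 + 1

# Count the 1s and 2s within each group
counts <- data.frame(one = as.vector(tapply(b == 1, grp, sum)),
                     two = as.vector(tapply(b == 2, grp, sum)))
print(counts)
```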

How to subset a large data frame through FOR loops and print the desired result?

I have a data frame that looks something like this:
x y
1 a
1 b
1 c
1 NA
1 NA
2 d
2 e
2 NA
2 NA
And my desired output should be a data frame that displays the number of complete cases of Y (that is, the non-NA values) for each corresponding X. So, supposing Y has 2500 complete observations for X = 1, and 557 observations for X = 2, I should get this simple data frame:
x y(c.cases)
1 2500
2 557
Currently my function works, but only for a single X. When I give X as a range (e.g. 30:25), I get the sum of all the Ys specified instead of the individual complete observations for each X. This is an outline of my function:
complete <- function(){
files <- file.list()
dat<- c() #Creates an empty vector
Y <- c() #Empty vector that will list down the Ys
result <- c()
for(i in c(X)){
dat <- rbind(dat, read.csv(files[i]))
}
dat_subset_Y <- dat[which(dat[, 'X'] %in% x), ]
Y <- c(Y, sum(complete.cases(dat)))
result <- cbind(X, Y)
print(result)
}
There are no errors or warning messages but only wrong results in a range of Xs.
We can use data.table. We convert the 'data.frame' to 'data.table' (setDT(df1)), grouped by 'x', get the sum of all non NA elements (!is.na(y)).
library(data.table)
setDT(df1)[, list(y=sum(!is.na(y))), by = x]
Or another option is table
with(df1, table(x, !is.na(y)))
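The same per-group count can also be sketched in base R with tapply. The df1 below is a small stand-in built from the example in the question, and non_na and result are illustrative names:

```r
# Small stand-in for the data frame in the question
df1 <- data.frame(x = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
                  y = c("a", "b", "c", NA, NA, "d", "e", NA, NA))

# Sum the non-NA indicator within each value of x
non_na <- tapply(!is.na(df1$y), df1$x, sum)

result <- data.frame(x = as.numeric(names(non_na)), y = as.vector(non_na))
print(result)
```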
No need for that loop.
library(dplyr)
df %>%
filter(complete.cases(.))%>%
group_by(x) %>%
summarise(sumy=length(y))
Or
df %>%
group_by(x) %>%
summarise(sumy=sum(!is.na(y)))
