Multiple subsets

Could you suggest a more elegant solution to the following problem? Remove rows containing more than one 0 in columns x, y, z or in columns a, b, c.
df <- data.frame(x = 0, y = 1:5, z = 0:4, a = 4:0, b = 1:5, c = 0)
My solution (rows 1 and 5 should be removed):
df_new <- subset(df, ((((x != 0 & y != 0) | (x != 0 & z != 0) | (y != 0 & z != 0)) & ((a != 0 & b != 0) | (a != 0 & c != 0) | (b != 0 & c != 0)))))

# 1:3 selects columns 'x', 'y', 'z'; similarly, 4:6 selects 'a', 'b', 'c'.
# You can also specify the column names explicitly.
# Add na.rm = TRUE inside rowSums() in case you also have missing data.
(rowSums(df[, 1:3]==0)>1)|(rowSums(df[, 4:6]==0)>1)
# did you mean this ?
df[!((rowSums(df[, 1:3]==0)>1)|(rowSums(df[, 4:6]==0)>1)),]
# x y z a b c
#2 0 2 1 3 2 0
#3 0 3 2 2 3 0
#4 0 4 3 1 4 0
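The same filter can be written with explicit column names, which avoids relying on column positions. This is only a minimal sketch, assuming the two column groups from the question; the names groups, too_many_zeros and df_new are illustrative and not part of the original answer.

```r
df <- data.frame(x = 0, y = 1:5, z = 0:4, a = 4:0, b = 1:5, c = 0)

# Column groups to check (taken from the question)
groups <- list(c("x", "y", "z"), c("a", "b", "c"))

# For each group, flag rows holding more than one zero, then combine the flags with OR
too_many_zeros <- Reduce(`|`, lapply(groups, function(cols) {
  rowSums(df[cols] == 0, na.rm = TRUE) > 1
}))

# Keep only the rows that are not flagged
df_new <- df[!too_many_zeros, ]
df_new
```

Adding a third group of columns then only requires another entry in the list.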

Related

Insert a blank row before zero

x<-c(0,1,1,0,1,1,1,0,1,1)
aaa<-data.frame(x)
How can I insert a blank row before each zero? When the first row is zero, do not add a blank row. Thank you.
Result:
0
1
1
.
0
1
1
1
.
0
1
1

Below we used dot but you can replace "." with NA or "" or something else depending on what you want.
1) We can use Reduce and append:
Append <- function(x, y) append(x, ".", y - 1)
data.frame(x = Reduce(Append, setdiff(rev(which(aaa$x == 0)), 1), init = aaa$x))
2) gsub Another possibility is to convert to a character string, use gsub and convert back:
data.frame(x = strsplit(gsub("(.)0", "\\1.0", paste(aaa$x, collapse = "")), "")[[1]])
3) We can create a two row matrix in which the first row is dot before each 0 and NA otherwise. Then unravel it to a vector and use na.omit to remove the NA values.
data.frame(x = na.omit(c(rbind(replace(ifelse(aaa$x == 0, ".", NA), 1, NA), aaa$x))))
4) We can lapply over aaa$x[-1] outputting c(".", 0) or 1. Unlist that and insert aaa$x[1] back in. No packages are used.
repl <- function(x) if (!x) c(".", 0) else 1
data.frame(x = c(aaa$x[1], unlist(lapply(aaa$x[-1], repl))))
5) Create a list of all but the first element and replace the 0's in that list with c(".", 0) . Unlist that and insert the first element back in. No packages are used.
L <- as.list(aaa$x[-1])
L[aaa$x[-1] == 0] <- list(c(".", 0))
data.frame(x = c(aaa$x[1], unlist(L)))
6) Assuming aaa has two columns where the second column is character (NOT factor). Append a row of dots to aaa and then create an index vector using unlist and Map to access the appropriate row of the extended aaa.
aaa <- data.frame(x = c(0,1,1,0,1,1,1,0,1,1), y = letters[1:10],
stringsAsFactors = FALSE)
nr <- nrow(aaa); nc <- ncol(aaa)
fun <- function(ix, x) if (!is.na(x) & x == 0 & ix > 1) c(nr + 1, ix) else ix
rbind(aaa, rep(".", nc))[unlist(Map(fun, 1:nr, aaa$x)), ]
If we did want y to be a factor, note that we can't add a dot to a factor unless it is one of its levels, which raises the question of what levels the factor should have. To get around that, add an NA rather than a dot to the factor. The code is then the same as above except that aaa is redefined so that y is a factor (stringsAsFactors = TRUE is needed for this in R 4.0.0 and later, where the default is FALSE), nc is no longer needed since we assume 2 columns, and rep(...) in the last line is replaced with c(".", NA).
aaa <- data.frame(x = c(0,1,1,0,1,1,1,0,1,1), y = letters[1:10],
stringsAsFactors = TRUE)
nr <- nrow(aaa)
fun <- function(ix, x) if (!is.na(x) & x == 0 & ix > 1) c(nr + 1, ix) else ix
rbind(aaa, c(".", NA))[unlist(Map(fun, 1:nr, aaa$x)), ]
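As a quick sanity check, approaches 1) and 2) above can be run on the question's vector and compared with the result the question asks for. This is a small sketch; the names res1, res2 and expected are introduced here for illustration only.

```r
x <- c(0, 1, 1, 0, 1, 1, 1, 0, 1, 1)

# Approach 1: insert "." before each zero, working from the back so that
# earlier positions are not shifted by later insertions
Append <- function(v, pos) append(v, ".", pos - 1)
res1 <- Reduce(Append, setdiff(rev(which(x == 0)), 1), init = x)

# Approach 2: collapse to one string, put "." in front of every 0 that is
# preceded by another character, then split back into single characters
res2 <- strsplit(gsub("(.)0", "\\1.0", paste(x, collapse = "")), "")[[1]]

# Result requested in the question
expected <- c("0", "1", "1", ".", "0", "1", "1", "1", ".", "0", "1", "1")
identical(res1, expected)
identical(res2, expected)
```

Note that the gsub approach splits the string into single characters, so it is only suitable when every value is a single digit, as it is here.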

One dplyr and tidyr possibility may be:
library(dplyr)
library(tidyr)
aaa %>%
  uncount(ifelse(row_number() > 1 & x == 0, 2, 1)) %>%
  mutate(x = ifelse(x == 0 & lag(x == 1, default = first(x)), NA_integer_, x))
x
1 0
2 1
3 1
4 NA
5 0
6 1
7 1
8 1
9 NA
10 0
11 1
12 1
It is not adding a blank row as you have a numeric vector. Instead, it is adding a row with NA. If you need a blank row, you can convert it into a character vector and then replace NA with blank.
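The conversion to a blank can be sketched in base R as follows. The vector x here is a short illustrative example rather than the full result above, and x_chr is a name introduced only for this sketch.

```r
# Numeric vector with NA marking where a blank row should appear
x <- c(0, 1, 1, NA, 0)

# Convert to character; NA stays NA after the conversion
x_chr <- as.character(x)

# Replace the NA entries with an empty string
x_chr[is.na(x_chr)] <- ""
x_chr
```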

# Repeat every non-leading zero row twice, then set the first copy of each pair to NA
ind = with(aaa, ifelse(x == 0 & seq_along(x) > 1, 2, 1))
d = aaa[rep(1:NROW(aaa), ind), , drop = FALSE]
transform(d, x = replace(x, which(sequence(ind) == 2) - 1, NA))

Here is an option with rleid
library(data.table)
setDT(aaa)[, .(x = if(x[.N] == 1) c(x, NA) else x), rleid(x)][-.N, .(x)]
# x
# 1: 0
# 2: 1
# 3: 1
# 4: NA
# 5: 0
# 6: 1
# 7: 1
# 8: 1
# 9: NA
#10: 0
#11: 1
#12: 1

data.frame(x = unname(unlist(by(aaa$x,cumsum(aaa==0),c,'.'))))
x
1 0
2 1
3 1
4 .
5 0
6 1
7 1
8 1
9 .
10 0
11 1
12 1
13 .

My solution is
aaa <- data.frame(x = c(0,1,1,0,1,1,1,0,1,1), y = letters[1:10])
aaa$ind = with(aaa, ifelse(x == 0 & seq_along(x) > 1, 2, 1))
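aaa <- aaa[rep(1:nrow(aaa), aaa$ind), ]
# Blank the first copy of each duplicated row; fixed = TRUE makes "." match a literal dot
aaa[aaa$ind == 2 & !grepl(".1", rownames(aaa), fixed = TRUE), ] <- NA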
aaa$ind<- NULL
aaa
x y
1 0 a
2 1 b
3 1 c
4 NA <NA>
4.1 0 d
5 1 e
6 1 f
7 1 g
8 NA <NA>
8.1 0 h
9 1 i
10 1 j

Is there an R function or SQL solution for grouping consecutive numbers in a row and assigning the group to all of their rows?

I want to group the consecutive numbers in a sequence into a single pair. And the final goal is to count the number of pairs per group.
I tried to solve this problem by using a combination of row_number, lag and lead in Redshift.
** I do not care about the decreasing interval, but I want to build the group only in the increasing part.
My table
id | number
-----------
a | 0
a | 0
a | 1
a | 2
a | 3
a | 2
a | 1
a | 2
a | 1
Expected
id | number | group
-------------------
a | 0 | 0
a | 0 | 0
a | 1 | 3
a | 2 | 3
a | 3 | 3
a | 2 | 0
a | 1 | 2
a | 2 | 2
a | 1 | 0
Final table
group | cnt
---------
2 | 2
3 | 3
Thanks in advance!

My solution (all intermediate steps are intentionally left in the expected data frame):
library(dplyr)
df <- tibble(id = "a", number = c(0,0,1,2,3,1,2,1))
expected <- df %>%
  mutate(l = lag(number),
         l = if_else(is.na(l), 0, l),
         # remove `& l > 0` if starting from 0 is allowed; change to l + 1 == number if the step must be 1
         splits = l < number & l > 0,
         g = cumsum(!splits)) %>%
  group_by(g) %>%
  mutate(group = n()) %>%
  ungroup()
final <- expected %>%
  filter(group != 1) %>%
  group_by(group) %>%
  summarise(cnt = n())
Note that here group is the length of each run, so group and cnt are identical in the final table whenever each run length occurs only once, as it does in this example. In that case unique() on the group column gives the same information. I'm not sure whether that is what you expected.

You can also approach the problem with a for loop that identifies each sequence of at least 2 values and assigns to the group variable the last number appearing in the sequence. The result can be either the raw dataset with the group variable added or the aggregation.
X <- data.frame(number = c(0L,0L,1L,2L,3L,2L,1L,2L,1L))
aggrIt <- function(DF = X, raw = TRUE){
  g <- 1L
  result <- rep(0L, nrow(DF))
  for(i in seq_len(nrow(DF))){
    if(i == nrow(DF)) break
    if(i == 1L) {
      if(DF$number[i] != 0L && DF$number[i+1L] == DF$number[i] + 1L) result[i] <- g
      if(DF$number[i] != 0L && DF$number[i+1L] != DF$number[i] + 1L) result[i] <- 0L
    } else {
      if(DF$number[i] != 0L && DF$number[i+1L] == DF$number[i] + 1L) {
        result[i] <- g
      } else {
        if(DF$number[i-1L] == DF$number[i] - 1L) {
          result[i] <- g
          g <- g + 1L
        }
      }
    }
  }
  transl <- tapply(DF$number[result != 0L], result[result != 0L], function(i) rep(max(i), length(i)), simplify = FALSE)
  DF$group <- 0L
  DF$group[result %in% names(transl)] <- unlist(transl)
  if(raw) return(DF)
  return(setNames(aggregate(number~group, DF, length, subset = group != 0L), c("group", "cnt")))
}
aggrIt(X, raw = FALSE)
#group cnt
#1 2 2
#2 3 3
aggrIt(X, raw = T)
#number group
#1 0 0
#2 0 0
#3 1 3
#4 2 3
#5 3 3
#6 2 0
#7 1 2
#8 2 2
#9 1 0
You can apply the function on groups of ids.
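For comparison, the same grouping can be sketched without a loop in base R. This is not the method of either answer above. It assumes, based on the expected table in the question, that a run is a stretch where each value is the previous value plus 1, that a run never starts at 0, and that runs of length 1 get group 0. The names prev, starts, run_id, run_len, run_max and cnt are illustrative.

```r
number <- c(0L, 0L, 1L, 2L, 3L, 2L, 1L, 2L, 1L)

# A new run starts where the step is not +1 or where the previous value is 0
prev <- head(number, -1)
starts <- c(TRUE, diff(number) != 1L | prev == 0L)
run_id <- cumsum(starts)

# Length and maximum of each run, repeated for every row of that run
run_len <- ave(number, run_id, FUN = length)
run_max <- ave(number, run_id, FUN = max)

# Rows in runs of at least two values get the run's last (largest) number
group <- ifelse(run_len > 1L, run_max, 0L)
group

# Number of rows per non-zero group
cnt <- table(group[group != 0L])
cnt
```

To handle several ids, the same steps could be wrapped in a function and applied to each id's rows separately.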

R - How to use sum and group_by inside apply?

I'm fairly new to R and I have the following issue.
I have a dataframe like this:
A | B | C | E | F |G
1 02 XXX XXX XXX 1
1 02 XXX XXX XXX 1
2 02 XXX XXX XXX NA
2 02 XXX XXX XXX NA
3 02 XXX XXX XXX 1
3 Z1 XXX XXX XXX 1
4 02 XXX XXX XXX 2
....
M 02 XXX XXX XXX 1
The thing is that the dataframe possibly has 150k rows or more, and I need to generate another dataframe grouping by A (which is an ID) and count the following occurrences:
When B is 02 and G has 1 <- V
When B is 02 and G is NA <- W
When B is Z1 and G has 1 <- X
When B is Z1 and G is NA <- Y
Any other kind of occurrence <- Z
For this simple example, the result should look something like this
A | V | W | X | Y | Z
1 2 0 0 0 0
2 0 2 0 0 0
3 1 0 1 0 0
4 0 0 0 0 1
...
M 1 0 0 0 0
At this point I managed to get the results using a for loop:
get_counters <- function(df){
counters <- data.frame(matrix(ncol = 6, nrow = length(unique(df$A))))
colnames(counters) <- c("A", "V", "W", "X", "Y", "Z")
counters$A<- unique(df$A)
for (i in 1:nrow(counters)) {
counters$V[i] <- sum(df$A == counters$A[i] & df$B == "02" & df$G == 1, na.rm = TRUE)
counters$W[i] <- sum(df$A == counters$A[i] & df$B == "02" & is.na(df$G), na.rm = TRUE)
counters$X[i] <- sum(df$A == counters$A[i] & df$B == "Z1" & df$G== 1, na.rm = TRUE)
counters$Y[i] <- sum(df$A == counters$A[i] & df$B == "Z1" & is.na(df$G), na.rm = TRUE)
counters$Z[i] <- sum(df$A == counters$A[i] & (df$B == "Z1" | df$B == "02") & df$G!= 1, na.rm = TRUE)
}
return(counters)
}
Trying that on a small test dataframe returns all the correct results, but with the real data it is extremely slow. I'm not sure how to use the apply functions; it seems like a simple problem, but I have not found an answer. So far I've assumed that if I could use apply with the sum statement from my for loop (maybe using group_by(A)) I could do it, but I receive all kinds of errors.
counters$V <- df%>%
group_by(A)%>%
sum(df$A == counters$A& df$B == "02" &df$G == 1, na.rm = TRUE)
Error in FUN(X[[i]], ...) :
only defined on a data frame with all numeric variables
In addition: Warning message:
In df$A== counters$A:
longer object length is not a multiple of shorter object length
If I change the function to not use a for loop and not use $ (I get an error referring to "$ operator is invalid for atomic vectors"), I either get more errors or weird unreadable results (large lists that contain more values than the original dataframe, huge empty matrices, etc.).
Is there a simple (maybe not simple but fast and efficient) way to solve this problem? Thanks in advance.

You can do this very quickly using data.table.
Creating Dummy Data:
set.seed(123)
counters <- data.frame(A = rep(1:100000, each = 3), B = sample(c("02","Z1"), size = 300000, replace = T), G = sample(c(1,NA), size = 300000, replace = T))
All I am doing is counting the instances of the combination, then reshaping the data in the format you need:
library(data.table)
setDT(counters)
counters[,comb := paste0(B,"_",G)]
dcast(counters, A ~ comb, fun.aggregate = length, value.var = "A")
A 02_1 02_NA Z1_1 Z1_NA
1: 1 0 2 1 0
2: 2 1 0 1 1
3: 3 0 0 2 1
4: 4 1 1 0 1
5: 5 0 1 2 0
---
99996: 99996 0 1 1 1
99997: 99997 0 2 1 0
99998: 99998 2 0 1 0
99999: 99999 1 0 1 1
100000: 100000 0 2 0 1
I adopted a naming convention that is a bit more extensible (the new columns indicate what combination you are counting), but if you want to override, replace the comb := line with four lines like the following:
counters[B == "02" & G == 1, comb := "V"]
counters[B == "02" & is.na(G), comb := "W"]
....
But I think the above is a bit more flexible.
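If you prefer the V to Z labels from the question and want to stay in base R, the counting can also be sketched with table(). This swaps in a different technique from the data.table answer above. The data frame holds only the visible sample rows from the question, and the names is_one, is_missing, category and counts are illustrative.

```r
# Visible sample rows from the question (columns C, E, F omitted)
df <- data.frame(
  A = c(1, 1, 2, 2, 3, 3, 4),
  B = c("02", "02", "02", "02", "02", "Z1", "02"),
  G = c(1, 1, NA, NA, 1, 1, 2),
  stringsAsFactors = FALSE
)

is_one <- !is.na(df$G) & df$G == 1
is_missing <- is.na(df$G)

# Label every row once; anything not matched below stays "Z"
category <- rep("Z", nrow(df))
category[df$B == "02" & is_one] <- "V"
category[df$B == "02" & is_missing] <- "W"
category[df$B == "Z1" & is_one] <- "X"
category[df$B == "Z1" & is_missing] <- "Y"

# Fixing the levels keeps all five columns even when a category never occurs
category <- factor(category, levels = c("V", "W", "X", "Y", "Z"))

# One row per value of A (stored in the row names), one column per category
counts <- as.data.frame.matrix(table(df$A, category))
counts
```

Because each row is labelled once and counted in a single pass, this avoids rescanning the whole data frame for every id, which is what makes the for loop slow.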

Removing columns that are all 0

I am trying to remove all columns in my dataframe that solely contain the value 0. My code is the following that I found on this website.
dataset = dataset[ ,colSums(dataset != 0) > 0]
However, I keep returning an error:
Error in [.data.frame(dataset, , colSums(dataset != 0) > 0) :
undefined columns selected

It's because you have an NA in at least one column. Fix like this:
dataset = dataset[ , colSums(dataset != 0, na.rm = TRUE) > 0]
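The effect of the NA can be seen on a small example. This is a sketch with a made-up three-column data frame; the names sums_raw, keep and cleaned are illustrative.

```r
dataset <- data.frame(a = 1:3, b = 0, d = c(0, NA, 0))

# Without na.rm the column containing NA sums to NA, so the logical
# column index contains NA and the subsetting fails
sums_raw <- colSums(dataset != 0)
sums_raw

# With na.rm = TRUE every column gets a proper count of non-zero values
keep <- colSums(dataset != 0, na.rm = TRUE) > 0
cleaned <- dataset[, keep, drop = FALSE]
cleaned
```

Using drop = FALSE keeps the result a data frame even when only one column survives.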

Here's some code that will check which columns are numeric (or integer) and drop those that contain all zeros and NAs:
# example data
df <- data.frame(
  one = rep(0, 100),
  two = sample(letters, 100, TRUE),
  three = rep(0L, 100),
  four = 1:100,
  stringsAsFactors = FALSE
)
# create function that checks numeric columns for all zeros
only_zeros <- function(x) {
  # is.numeric() is TRUE for integer and double columns and, unlike
  # class(x) %in% ..., always yields a single TRUE/FALSE for if()
  if (is.numeric(x)) {
    all(x == 0, na.rm = TRUE)
  } else {
    FALSE
  }
}
# apply that function to your data
df_without_zero_cols <- df[ , !sapply(df, only_zeros)]

There is an alternative using all():
dataset[, !sapply(dataset, function(x) all(x == 0))]
a c d f
1 1 -1 -1 a
2 2 0 NA a
3 3 1 1 a
For a large dataset, time- and memory-consuming copying can be avoided by removing the columns by reference:
library(data.table)
cols <- which(sapply(dataset, function(x) all(x == 0)))
setDT(dataset)[, (cols) := NULL]
dataset
a c d f
1: 1 -1 -1 a
2: 2 0 NA a
3: 3 1 1 a
Data
dataset <- data.frame(a = 1:3, b = 0, c = -1:1, d = c(-1, NA, 1), e = 0, f ="a")
dataset
a b c d e f
1 1 0 -1 -1 0 a
2 2 0 0 NA 0 a
3 3 0 1 1 0 a

if else with multiple conditions combined with AND and OR

I am looking for a way to create a new variable (1,0) with 1 for multiple conditions combined with AND and OR.
i.e. if
a > 3 AND b > 5
OR
c > 3 AND d > 5
OR
e > 3 AND f > 5
1
if not
0
I've tried coding it as;
df$newvar <- ifelse(df$a > 3 & df$b > 5 | df$c > 3 & df$d > 5 | df$e > 3 & df$f > 5,"1","0")
But in my output many variables are coded as NA and the numbers do not seem to add up.
Does anyone have advice on a proper way to code this?

We can subset the columns to evaluate for values greater than 3, get a list of logical vectors ('l1'), similarly for values greater than 5 ('l2'), then compare the corresponding elements of list using Map and Reduce it to a single vector. With as.integer, we coerce the logical vector to binary
l1 <- lapply(df[c('a', 'c', 'e')] , function(x) x > 3 & !is.na(x))
l2 <- lapply(df[c('b', 'd', 'f')], function(x) x > 5 & !is.na(x))
df$newvar <- as.integer(Reduce(`|`, Map(`&`, l1, l2)))
df$newvar
#[1] 0 0 1 1 0 1 0 0 1 0
Or using the OP's method
with(df, as.integer((a >3 & !is.na(a) & b > 5 & !is.na(b)) | (c > 3 & !is.na(c) &
d > 5 & !is.na(d)) | (e > 3 & !is.na(e) & f > 5 & !is.na(f))))
#[1] 0 0 1 1 0 1 0 0 1 0
data
set.seed(24)
df <- as.data.frame(matrix(sample(c(NA, 1:8), 6 * 10, replace = TRUE),
ncol = 6, dimnames = list(NULL, letters[1:6])))
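The NA values in the original attempt come from how R's logical operators treat missing data: a comparison with NA yields NA, and NA & TRUE is NA. A short sketch with made-up vectors a and b shows the difference the !is.na() guards make; the names raw and safe are illustrative.

```r
a <- c(4, NA, 1)
b <- c(6, 6, NA)

# Unguarded: NA > 3 is NA, and NA & TRUE stays NA; FALSE & NA is FALSE
raw <- a > 3 & b > 5
raw

# Guarded: a missing value counts as "condition not met"
safe <- (a > 3 & !is.na(a)) & (b > 5 & !is.na(b))
safe

as.integer(safe)
```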
