If condition is true, set all other column values to 0 in R

I create a random dataset via:
#create dataset
first_column <- c(1:10) #random column
second_column <- c(1:10) #random column
third_column <- c(1:10) #random column
group <- c(1,1,1,2,2,1,2,1,1,1) #column used for selection
#merge columns
df <- data.frame(first_column, second_column, third_column, group)
#examine
print(df)
which outputs:
first_column second_column third_column group
1 1 1 1 1
2 2 2 2 1
3 3 3 3 1
4 4 4 4 2
5 5 5 5 2
6 6 6 6 1
7 7 7 7 2
8 8 8 8 1
9 9 9 9 1
10 10 10 10 1
I would like to set the values of columns first_column, second_column, and third_column to 0 if the value of group is equal to 2 (while retaining the values if the value of group equals 1), which should result in:
first_column second_column third_column group
1 1 1 1 1
2 2 2 2 1
3 3 3 3 1
4 0 0 0 2
5 0 0 0 2
6 6 6 6 1
7 0 0 0 2
8 8 8 8 1
9 9 9 9 1
10 10 10 10 1
What is the most convenient way to achieve this?

We can build a logical row index from group and assign 0 to columns 1 to 3 in the matching rows:
df[df$group == 2, 1:3] <- 0
-output
> df
first_column second_column third_column group
1 1 1 1 1
2 2 2 2 1
3 3 3 3 1
4 0 0 0 2
5 0 0 0 2
6 6 6 6 1
7 0 0 0 2
8 8 8 8 1
9 9 9 9 1
10 10 10 10 1
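Positional indices such as 1:3 stop working if the column order changes. A minimal sketch of the same assignment by column name follows. The names value_cols and rows are only illustrative, and which() is used on the assumption that NA values in group should be skipped rather than passed into the subscripted assignment.

```r
# Same data as in the question
df <- data.frame(
  first_column  = 1:10,
  second_column = 1:10,
  third_column  = 1:10,
  group         = c(1, 1, 1, 2, 2, 1, 2, 1, 1, 1)
)

# Every column except the selector (value_cols is an illustrative name)
value_cols <- setdiff(names(df), "group")

# which() turns the logical test into row numbers and drops NA results
rows <- which(df$group == 2)
df[rows, value_cols] <- 0
```

After this runs, rows 4, 5 and 7 hold 0 in the three value columns, and group is unchanged.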

Related

Create column with ID starting at 1 and increment when value in another column changes in R

I have a data frame like so:
ID <- c('A','A','A','A','A','A','A','A','A','A','A','B','B','B','B','B')
val1 <- c(0,1,2,3,4,5,6,7,8,9,10,11,0,1,2,3)
val2 <- c(0,1,2,3,4,5,0,1,0,1,2,0,1,0,1,2)
df <- data.frame(ID, val1, val2)
Output:
ID val1 val2
1 A 0 0
2 A 1 1
3 A 2 2
4 A 3 3
5 A 4 4
6 A 5 5
7 A 6 0
8 A 7 1
9 A 8 0
10 A 9 1
11 A 10 2
12 B 11 0
13 B 0 1
14 B 1 0
15 B 2 1
16 B 3 2
I am trying to create a third column (val3) that acts as an index, grouped by ID. When val1 = 0 and val2 = 0 it should be 1. It should stay at that value and then increment by 1 each time val2 = 0 again. The desired output is:
ID val1 val2 val3
1 A 0 0 1
2 A 1 1 1
3 A 2 2 1
4 A 3 3 1
5 A 4 4 1
6 A 5 5 1
7 A 6 0 2
8 A 7 1 2
9 A 8 0 3
10 A 9 1 3
11 A 10 2 3
12 B 11 0 1
13 B 0 1 1
14 B 1 0 2
15 B 2 1 2
16 B 3 2 2
How can this be achieved? I tried:
df <- df %>%
group_by(ID, val2) %>%
mutate(val3 = row_number())
And:
df$val3 <- cumsum(c(1,diff(df$val2)==0))
But neither provides the desired outcome.
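Inside cumsum, use the logical comparison val2 == 0. Each TRUE counts as 1, so the running total goes up by one every time val2 is 0 within an ID:
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(val3 = cumsum(val2 == 0))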
# A tibble: 16 × 4
# Groups: ID [2]
ID val1 val2 val3
<chr> <dbl> <dbl> <int>
1 A 0 0 1
2 A 1 1 1
3 A 2 2 1
4 A 3 3 1
5 A 4 4 1
6 A 5 5 1
7 A 6 0 2
8 A 7 1 2
9 A 8 0 3
10 A 9 1 3
11 A 10 2 3
12 B 11 0 1
13 B 0 1 1
14 B 1 0 2
15 B 2 1 2
16 B 3 2 2
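The same grouped running count can also be sketched in base R, without dplyr. This is an alternative rather than part of the answer above: ave applies cumsum within each ID, and the logical test is converted to integer first so the result is a plain integer column.

```r
df <- data.frame(
  ID   = c('A','A','A','A','A','A','A','A','A','A','A','B','B','B','B','B'),
  val1 = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3),
  val2 = c(0, 1, 2, 3, 4, 5, 0, 1, 0, 1, 2, 0, 1, 0, 1, 2)
)

# Running count of rows where val2 is 0, restarted for every ID
df$val3 <- ave(as.integer(df$val2 == 0), df$ID, FUN = cumsum)
```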

Filter dataframe on occurrence of same values across columns AND at column end

I have a dataframe like this:
df <- data.frame(
id = 1:19,
Area_l = c(1,2,0,0,0,2,3,1,2,0,0,0,0,3,4,0,0,0,0),
Area_r = c(3,2,2,0,0,2,3,1,0,0,0,1,3,3,4,3,0,0,0)
)
I need to filter the dataframe in such a way that all rows are omitted that fulfill two conditions:
(i): Area_l and Area_r are 0
(ii): the paired 0 values in Area_l and Area_r are the last values in the columns.
I really have no clue how to implement these two conditions using dplyr. The desired result is this:
df
id Area_l Area_r
1 1 1 3
2 2 2 2
3 3 0 2
4 4 0 0
5 5 0 0
6 6 2 2
7 7 3 3
8 8 1 1
9 9 2 0
10 10 0 0
11 11 0 0
12 12 0 1
13 13 0 3
14 14 3 3
15 15 4 4
16 16 0 3
Any help?
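Reverse the order of the dataframe, filter with a cumany condition, then reverse it back. map_df comes from purrr, so that package needs to be loaded alongside dplyr.
library(dplyr)
library(purrr) # provides map_df
df %>%
  map_df(rev) %>%
  filter(cumany(Area_l + Area_r != 0)) %>%
  map_df(rev)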
output
# A tibble: 16 x 3
id Area_l Area_r
<int> <dbl> <dbl>
1 1 1 3
2 2 2 2
3 3 0 2
4 4 0 0
5 5 0 0
6 6 2 2
7 7 3 3
8 8 1 1
9 9 2 0
10 10 0 0
11 11 0 0
12 12 0 1
13 13 0 3
14 14 3 3
15 15 4 4
16 16 0 3
We may use rle
library(dplyr)
df %>%
filter(!if_all(starts_with("Area"), ~ .x == 0 &
inverse.rle(within.list(rle(.x == 0), values[-length(values)] <- FALSE))))
-output
id Area_l Area_r
1 1 1 3
2 2 2 2
3 3 0 2
4 4 0 0
5 5 0 0
6 6 2 2
7 7 3 3
8 8 1 1
9 9 2 0
10 10 0 0
11 11 0 0
12 12 0 1
13 13 0 3
14 14 3 3
15 15 4 4
16 16 0 3
Or another option is
df %>%
filter(if_any(starts_with("Area"),
~ row_number() <= max(row_number() * (.x != 0))))
Or another option is revcumsum from spatstat.utils
library(spatstat.utils)
df %>%
filter(!if_all(starts_with("Area"), ~ revcumsum(.x != 0) <1))
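For comparison, a minimal base R sketch of the same idea: find the last row where either area is non-zero and keep everything up to it. The names nonzero and trimmed are illustrative, and the empty-result branch is an assumption about what should happen when every row is a paired zero.

```r
df <- data.frame(
  id = 1:19,
  Area_l = c(1,2,0,0,0,2,3,1,2,0,0,0,0,3,4,0,0,0,0),
  Area_r = c(3,2,2,0,0,2,3,1,0,0,0,1,3,3,4,3,0,0,0)
)

# Row numbers where at least one of the two columns is non-zero
nonzero <- which(df$Area_l != 0 | df$Area_r != 0)

# Keep rows up to the last such row; paired zeros in the middle are retained
trimmed <- if (length(nonzero) > 0) df[seq_len(max(nonzero)), ] else df[0, ]
```

Only the trailing block of paired zeros (ids 17 to 19) is dropped. The paired zeros at ids 4, 5, 10 and 11 stay because a non-zero row follows them.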

Ranking duplicated rows in R [duplicate]

I am trying to create an additional variable (flag) that numbers the repeated observations of each id, starting from 0.
dataset <- data.frame(id = c(1,1,1,2,2,4,6,6,6,7,7,7,7,8))
The intended result looks like:
id flag
1 0
1 1
1 2
2 0
2 1
4 0
6 0
6 1
6 2
7 0
7 1
7 2
7 3
8 0
Thank You!
You may try
dataset$flag <- unlist(lapply(rle(dataset$id)$lengths, function(x) seq_len(x) - 1))
id flag
1 1 0
2 1 1
3 1 2
4 2 0
5 2 1
6 4 0
7 6 0
8 6 1
9 6 2
10 7 0
11 7 1
12 7 2
13 7 3
14 8 0
data.table:
library(data.table)
setDT(dataset)[, flag := rowid(id) - 1]
dataset
id flag
1: 1 0
2: 1 1
3: 1 2
4: 2 0
5: 2 1
6: 4 0
7: 6 0
8: 6 1
9: 6 2
10: 7 0
11: 7 1
12: 7 2
13: 7 3
14: 8 0
Base R:
dataset$flag = sequence(rle(dataset$id)$lengths) - 1
dataset
id flag
1 1 0
2 1 1
3 1 2
4 2 0
5 2 1
6 4 0
7 6 0
8 6 1
9 6 2
10 7 0
11 7 1
12 7 2
13 7 3
14 8 0
Another base option:
transform(dataset,
flag = Reduce(function(x, y) y * x + y, duplicated(id), accumulate = TRUE))
id flag
1 1 0
2 1 1
3 1 2
4 2 0
5 2 1
6 4 0
7 6 0
8 6 1
9 6 2
10 7 0
11 7 1
12 7 2
13 7 3
14 8 0
dplyr -
library(dplyr)
dataset %>% group_by(id) %>% mutate(flag = row_number() - 1)
# id flag
# <dbl> <dbl>
# 1 1 0
# 2 1 1
# 3 1 2
# 4 2 0
# 5 2 1
# 6 4 0
# 7 6 0
# 8 6 1
# 9 6 2
#10 7 0
#11 7 1
#12 7 2
#13 7 3
#14 8 0
Base R with similar logic
transform(dataset, flag = ave(id, id, FUN = seq_along) - 1)
Another way to get the expected result, though it takes a little more code:
library(dplyr)
# count the rows per id
x <- dataset %>%
  group_by(id) %>%
  summarise(nreg = n())
# build the 0-based counter for each id and stack the pieces
df <- data.frame()
for (i in seq_len(nrow(x))) {
  flag <- data.frame(id = rep(x$id[i], x$nreg[i]),
                     flag = seq(0, x$nreg[i] - 1))
  df <- rbind(df, flag)
}
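One difference between the answers is worth keeping in mind: the rle-based versions count within runs of consecutive equal ids, while rowid, ave and group_by count within each distinct id. On the sorted data in the question both give the same result. The sketch below uses a hypothetical unsorted vector ids, which is not part of the question, to show where they would differ.

```r
# Hypothetical ids where the value 1 appears again after a 2
ids <- c(1, 1, 2, 1)

# Run-based: the counter restarts whenever the value changes
run_based <- sequence(rle(ids)$lengths) - 1

# Group-based: the counter continues across separated occurrences
group_based <- ave(ids, ids, FUN = seq_along) - 1

run_based    # 0 1 0 0
group_based  # 0 1 0 2
```

If the ids may be unsorted and the numbering should follow each id, the group-based forms are probably the safer choice. Sorting by id first makes both approaches agree.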

Count number of values which are less than current value

I'd like to count, for each row, how many rows in the column input have a value smaller than the current row's value (please see the wanted results below). The difficulty for me is that the condition depends on the current row's value, which is very different from the usual case where the condition is a fixed number.
data <- data.frame(input = c(1,1,1,1,2,2,3,5,5,5,5,6))
input
1 1
2 1
3 1
4 1
5 2
6 2
7 3
8 5
9 5
10 5
11 5
12 6
The results I expect to get are like this. For example, for observations 5 and 6 (with value 2), there are 4 observations with value 1, which is less than their value 2. Hence count is 4.
input count
1 1 0
2 1 0
3 1 0
4 1 0
5 2 4
6 2 4
7 3 6
8 5 7
9 5 7
10 5 7
11 5 7
12 6 11
Edit: as I am dealing with grouped data with dplyr, the result I ultimately want looks like the one below; that is, I would like the condition to be dynamic within each group.
data <- data.frame(id = c(1,1,2,2,2,3,3,4,4,4,4,4),
input = c(1,1,1,1,2,2,3,5,5,5,5,6),
count=c(0,0,0,0,2,0,1,0,0,0,0,4))
id input count
1 1 1 0
2 1 1 0
3 2 1 0
4 2 1 0
5 2 2 2
6 3 2 0
7 3 3 1
8 4 5 0
9 4 5 0
10 4 5 0
11 4 5 0
12 4 6 4
Here is an option with tidyverse
library(tidyverse)
data %>%
mutate(count = map_int(input, ~ sum(.x > input)))
# input count
#1 1 0
#2 1 0
#3 1 0
#4 1 0
#5 2 4
#6 2 4
#7 3 6
#8 5 7
#9 5 7
#10 5 7
#11 5 7
#12 6 11
Update
With the updated data, add the group by 'id' in the above code
data %>%
group_by(id) %>%
mutate(count1 = map_int(input, ~ sum(.x > input)))
# A tibble: 12 x 4
# Groups: id [4]
# id input count count1
# <dbl> <dbl> <dbl> <int>
# 1 1 1 0 0
# 2 1 1 0 0
# 3 2 1 0 0
# 4 2 1 0 0
# 5 2 2 2 2
# 6 3 2 0 0
# 7 3 3 1 1
# 8 4 5 0 0
# 9 4 5 0 0
#10 4 5 0 0
#11 4 5 0 0
#12 4 6 4 4
In base R, we can use sapply and, for each input, count how many values are smaller than it.
data$count <- sapply(data$input, function(x) sum(x > data$input))
data
# input count
#1 1 0
#2 1 0
#3 1 0
#4 1 0
#5 2 4
#6 2 4
#7 3 6
#8 5 7
#9 5 7
#10 5 7
#11 5 7
#12 6 11
With dplyr, one way is to use rowwise and follow the same logic.
library(dplyr)
data %>%
rowwise() %>%
mutate(count = sum(input > data$input))
1. outer and rowSums
data$count <- with(data, rowSums(outer(input, input, `>`)))
2. table and cumsum
tt <- cumsum(table(data$input))
v <- setNames(c(0, head(tt, -1)), names(tt))
data$count <- v[match(data$input, names(v))]
3. data.table non-equi join
Perhaps more efficient with a non-equi join in data.table. Count number of rows (.N) for each match (by = .EACHI).
library(data.table)
setDT(data)
data[data, on = .(input < input), .N, by = .EACHI]
If your data is grouped by 'id', as in your update, join on that variable as well:
data[data, on = .(id, input < input), .N, by = .EACHI]
# id input N
# 1: 1 1 0
# 2: 1 1 0
# 3: 2 1 0
# 4: 2 1 0
# 5: 2 2 2
# 6: 3 2 0
# 7: 3 3 1
# 8: 4 5 0
# 9: 4 5 0
# 10: 4 5 0
# 11: 4 5 0
# 12: 4 6 4
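Comparing every value against every other value grows quadratically with the number of rows. As an alternative not taken from the answers above, rank with ties.method = "min" gives each value the position of its first tie in sorted order, so subtracting 1 leaves the number of strictly smaller values. A minimal sketch, with ave handling the grouped case (the names count_all and count_by_id are illustrative):

```r
input <- c(1, 1, 1, 1, 2, 2, 3, 5, 5, 5, 5, 6)
id    <- c(1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4)

# Ungrouped: number of values strictly smaller than each element
count_all <- rank(input, ties.method = "min") - 1

# Grouped: apply the same idea within each id
count_by_id <- ave(input, id,
                   FUN = function(x) rank(x, ties.method = "min") - 1)

count_all    # 0 0 0 0 4 4 6 7 7 7 7 11
count_by_id  # 0 0 0 0 2 0 1 0 0 0 0 4
```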

Conditionally delete columns in R

I know how to delete columns in R, but I am not sure how to delete them based on the following set of conditions.
Suppose a data frame such as:
DF <- data.frame(L = c(2,4,5,1,NA,4,5,6,4,3), J= c(3,4,5,6,NA,3,6,4,3,6), K= c(0,1,1,0,NA,1,1,1,1,1),D = c(1,1,1,1,NA,1,1,1,1,1))
DF
L J K D
1 2 3 0 1
2 4 4 1 1
3 5 5 1 1
4 1 6 0 1
5 NA NA NA NA
6 4 3 1 1
7 5 6 1 1
8 6 4 1 1
9 4 3 1 1
10 3 6 1 1
The data frame has to be set up in this fashion. Column K corresponds to column L, and column D corresponds to column J. Because the values in column D are all equal to one, I would like to delete column D and the corresponding column J, yielding a data frame that looks like:
DF
L K
1 2 0
2 4 1
3 5 1
4 1 0
5 NA NA
6 4 1
7 5 1
8 6 1
9 4 1
10 3 1
I know there has got to be a simple command to do so, I just can't think of one. If it makes any difference, the NAs must be retained.
Additional helpful information: in my real data frame there are a total of 20 columns, so there are 10 columns like L and J and another 10 like K and D. I need a function that can recognize the correspondence between these two groups and delete columns accordingly if necessary.
Thank you in advance!
Okay, assuming the correspondence is based on column position, here is an example:
> n <- 10
>
> # sample data
> d <- data.frame(lapply(1:n, function(x)sample(n)), lapply(1:n, function(x)sample(2, n, T, c(0.1, 0.9))-1))
> names(d) <- c(LETTERS[1:n], letters[1:n])
> head(d)
A B C D E F G H I J a b c d e f g h i j
1 5 5 2 7 4 3 4 3 5 8 0 1 1 1 1 1 1 1 1 1
2 9 8 4 6 7 8 8 2 10 5 1 1 1 1 1 1 1 1 1 1
3 6 6 10 3 5 6 2 1 8 6 1 1 1 1 1 1 1 1 1 1
4 1 7 5 5 1 10 10 4 2 4 1 1 1 1 1 1 1 1 1 1
5 10 9 6 2 9 5 6 9 9 9 1 1 0 1 1 1 1 1 1 1
6 2 1 1 4 6 1 5 8 4 10 1 1 1 1 1 1 1 1 1 1
>
> # find the column that should be left.
> idx <- which(colMeans(d[(n+1):(2*n)], na.rm = TRUE) != 1)
>
> # filter the data
> d[, c(idx, idx+n)]
A B C D F a b c d f
1 5 5 2 7 3 0 1 1 1 1
2 9 8 4 6 8 1 1 1 1 1
3 6 6 10 3 6 1 1 1 1 1
4 1 7 5 5 10 1 1 1 1 1
5 10 9 6 2 5 1 1 0 1 1
6 2 1 1 4 1 1 1 1 1 1
7 8 4 7 10 2 1 1 1 1 0
8 7 3 9 9 4 1 0 1 0 1
9 3 10 3 1 9 1 1 0 1 1
10 4 2 8 8 7 1 0 1 1 1
I basically agree with koshke (whose SO work is excellent), but would suggest that the test to use is colSums(d[(n+1):(2*n)], na.rm=TRUE) == NROW(d), since a paired 0 and 2 or -1 and 3 could throw off the colMeans test.
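Both a mean and a sum can match the all-ones value when individual entries differ but offset each other, and with na.rm = TRUE a row of NAs lowers the column sum below NROW(d). A minimal sketch of an element-wise test on the data frame from the question may therefore be more robust. It assumes, as the answer above does, that the value columns come first and the flag columns follow in the same order. The names flags, all_one, keep and result are illustrative.

```r
DF <- data.frame(
  L = c(2, 4, 5, 1, NA, 4, 5, 6, 4, 3),
  J = c(3, 4, 5, 6, NA, 3, 6, 4, 3, 6),
  K = c(0, 1, 1, 0, NA, 1, 1, 1, 1, 1),
  D = c(1, 1, 1, 1, NA, 1, 1, 1, 1, 1)
)

n <- ncol(DF) / 2                 # assumes value columns first, flag columns second
flags <- DF[(n + 1):(2 * n)]

# A flag column is "all ones" when no non-NA entry differs from 1
all_one <- colSums(flags != 1, na.rm = TRUE) == 0

# Keep each value column together with its flag column
keep <- which(!all_one)
result <- DF[, c(keep, keep + n), drop = FALSE]
```

Here result contains only L and K, and the NA row is retained.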
