How to find the number of rows which match a condition - r

If I have a dataframe A like
A:
x. y. z. a. b. c.
1 0 0 3 0 0
2 0 0 5 6 5
3 0 0 6 8 2
4 0 1 8 0 6
5 0 0 20 2 0
6 0 1 3 3 7
How could I obtain a data frame B with 3 columns, one for each of the a, b and c columns of data frame A, where each column contains the number of rows that match the following condition:
The a, b or c value is between 5 and 10 (5 <= i <= 10) AND the z value is equal to 1. For instance, in column a, row 3 is 6, which is > 5 and < 10, but its z. value is not 1, so that row is not counted. On the other hand, in row 4, a. is > 5 and < 10 and the z. value is 1, so this row is counted.
B would be like:
B:
a. b. c.
1 0 2

Here is a solution using tidyverse tools. The approach is to reduce to only the rows that have z == 1 using filter, and then use summarise_at to condense the other rows. We first apply the function (. > 5 & . < 10) which makes a logical vector for whether or not each of a, b, c are between 5 and 10, and then wrap it in sum. When applied to logical vectors, sum treats TRUE as 1 and FALSE as 0, so this is equivalent to counting the TRUE values.
library(tidyverse)
tbl_A <- read_table2(
"x y z a b c
1 0 0 3 0 0
2 0 0 5 6 5
3 0 0 6 8 2
4 0 1 8 0 6
5 0 0 20 2 0
6 0 1 3 3 7"
)
tbl_b <- tbl_A %>%
filter(z == 1) %>%
summarise_at(vars(a:c), ~ sum(. > 5 & . < 10)) %>%
print()
# A tibble: 1 x 3
a b c
<int> <int> <int>
1 1 0 2
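The counting idea described above (sum over a logical vector) can also be sketched without any packages. This is only an illustration: the data frame is rebuilt by hand with plain column names, and the bounds are taken as inclusive (5 <= i <= 10) as the question states, which gives the same counts on this data as the strict comparison.

```r
# Sample data from the question, rebuilt with plain column names
tbl_A <- data.frame(
  x = 1:6,
  y = 0,
  z = c(0, 0, 0, 1, 0, 1),
  a = c(3, 5, 6, 8, 20, 3),
  b = c(0, 6, 8, 0, 2, 3),
  c = c(0, 5, 2, 6, 0, 7)
)

# sum() on a logical vector counts the TRUE values
in_range <- tbl_A$a >= 5 & tbl_A$a <= 10
count_a <- sum(in_range & tbl_A$z == 1)

# The same count for all three columns at once:
# keep only rows with z == 1, then count in-range values per column
keep <- tbl_A[tbl_A$z == 1, c("a", "b", "c")]
counts <- colSums(keep >= 5 & keep <= 10)
counts
```

colSums works here because comparing a data frame with a number yields a logical matrix, and TRUE is counted as 1.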

Or in base R:
sapply(c("a.", "b.", "c."), function(x)
nrow(df[(df[, x] >= 5 & df[, x] <= 10) & df[, "z."] == 1, ])
)
#a. b. c.
# 1 0 2
Sample data
df <- read.table(text =
"x. y. z. a. b. c.
1 0 0 3 0 0
2 0 0 5 6 5
3 0 0 6 8 2
4 0 1 8 0 6
5 0 0 20 2 0
6 0 1 3 3 7", header = T)

Here is an option using data.table
library(data.table)
setDT(df)[z. == 1, lapply(.SD, function(x) sum(x > 5 & x < 10)) , .SDcols = a.:c.]
# a. b. c.
#1: 1 0 2

Related

Creating a new column with conditions in addition to the row value of the new column

Any ideas on how to create a new column B from the values of column A,
while also using the value in the row above of the newly created column B?
The value of B should depend on A as follows:
A = 0: B is the value of the row above.
A = 1: B is 1.
A = 2: B is the value of the row above + 1.
Current dataframe + desired outcome
Dataframe Desired outcome
A A B
1 1 1
0 0 1
2 2 2
0 0 2
2 2 3
0 0 3
2 2 4
0 0 4
2 2 5
0 0 5
2 2 6
0 0 6
1 1 1
0 0 1
1 1 1
0 0 1
2 2 2
0 0 2
2 2 3
0 0 3
1 1 1
0 0 1
2 2 2
0 0 2
Data Frame
A <- c(1,0,2,0,2,0,2,0,2,0,2,0,1,0,1,0,2,0,2,0,1,0,2,0)
Bdesiredoutcome <- c(1,1,2,2,3,3,4,4,5,5,6,6,1,1,1,1,2,2,3,3,1,1,2,2)
df = data.frame(A,Bdesiredoutcome)
I tried using dplyr with mutate(), case_when() and lag(), but I keep running into errors because of the lag() function: lag(B) refers to a column that does not exist yet, and with lag(A) the desired outcome cannot be generated.
Any ideas on how to solve this problem?
df <- df %>%
mutate(B = case_when((A == 0) ~ lag(B),
(A == 1) ~ 1,
(A == 2) ~ (lag(B)+1)
))
Error in UseMethod("mutate_") :
no applicable method for 'mutate_' applied to an object of class "function"
In addition: Warning message:
We can create a grouping column with cumsum and then create the 'B' column
library(dplyr)
df %>%
group_by(grp = cumsum(A == 1)) %>%
mutate(B = cumsum(A != 0)) %>%
ungroup %>%
select(-grp) %>%
as.data.frame
-output
A Bdesired B
1 1 1 1
2 0 1 1
3 2 2 2
4 0 2 2
5 2 3 3
6 0 3 3
7 2 4 4
8 0 4 4
9 2 5 5
10 0 5 5
11 2 6 6
12 0 6 6
13 1 1 1
14 0 1 1
15 1 1 1
16 0 1 1
17 2 2 2
18 0 2 2
19 2 3 3
20 0 3 3
21 1 1 1
22 0 1 1
23 2 2 2
24 0 2 2
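To see why the cumsum grouping works, here is a minimal base R sketch of the same idea using ave instead of dplyr. The variable names grp and B are chosen for illustration only.

```r
A <- c(1,0,2,0,2,0,2,0,2,0,2,0,1,0,1,0,2,0,2,0,1,0,2,0)
Bdesiredoutcome <- c(1,1,2,2,3,3,4,4,5,5,6,6,1,1,1,1,2,2,3,3,1,1,2,2)

# Every A == 1 starts a new group, so cumsum() yields a group id
grp <- cumsum(A == 1)

# Within each group, B grows by one at every non-zero A
# (the 1 that opens the group, then each 2) and stays flat on 0
B <- ave(as.integer(A != 0), grp, FUN = cumsum)

all(B == Bdesiredoutcome)
```

This avoids lag() entirely: the running count replaces the "value of the row above" lookup.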
On your original question I got the following:
library(tidyverse)
library(lubridate)
df$date <-dmy(df$date)
df <- df %>%
arrange(id, date) %>%
group_by(id) %>%
mutate(daysbetween = replace_na(date - lag(date),0),
ind = 1,
NewA= case_when (daysbetween < 7 ~ 0, daysbetween > 7 ~ 1),
NewB= case_when (daysbetween < 85 ~ 0, daysbetween > 85 ~ 1),
A = case_when (1 + cumsum(ind*NewA) <= 6 ~ 1 + cumsum(ind*NewA),
1 + cumsum(ind*NewA) > 6 ~ 1 + cumsum(ind*NewA) - 6),
B = 1 + cumsum(ind*NewB))%>%
select(id, date, A, B)
It only works if the reset for A is at 6. I used cumsum() as suggested above.
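If the goal is a counter that wraps back to 1 after reaching some limit, one possible way to avoid hard-coding 6 is modular arithmetic. The sketch below is an assumption about the intent, not part of the answer above: wrap_counter and reset_at are hypothetical names, and the steps vector is made-up data standing in for NewA.

```r
# Hypothetical helper: wrap a running counter back to 1 after `reset_at`
wrap_counter <- function(k, reset_at) {
  ((k - 1) %% reset_at) + 1
}

steps <- c(0, 1, 1, 1, 1, 1, 1, 1, 0, 1)  # made-up increments
raw <- 1 + cumsum(steps)                   # unwrapped counter

wrapped6 <- wrap_counter(raw, reset_at = 6)
wrapped4 <- wrap_counter(raw, reset_at = 4)
```

Unlike subtracting the limit once, the modulo form keeps wrapping however large the raw counter becomes.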

column-wise operations depending on data on a data frame in R

I have a data frame with negative values in one column, something like this:
df <- data.frame("a" = 1:6,"b"= -(5:10), "c" = rep(8:6,2))
a b c
1 1 -5 8
2 2 -6 7
3 3 -7 6
4 4 -8 8
5 5 -9 7
6 6 -10 6
I want to convert this to a data frame with no negative values in "b" keeping row totals unchanged. I can use column "a" only if "c" is not big enough to absorb the negative values in "b".
The end result should look like this
a b c
1 1 0 3
2 2 0 1
3 2 0 0
4 4 0 0
5 3 0 0
6 2 0 0
I feel that sapply could be used, but I don't know how.
You can use pmin and pmax to get the new values for a, b and c.
df$c <- df$c + pmin(0, df$b)
df$b <- pmax(0, df$b)
df$a <- df$a + pmin(0, df$c)
df$c <- pmax(0, df$c)
df
# a b c
#1 1 0 3
#2 2 0 1
#3 2 0 0
#4 4 0 0
#5 3 0 0
#6 2 0 0
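As a sanity check on the pmin/pmax steps, the sketch below repeats them on a copy of the data and verifies the two requirements from the question: row totals are unchanged and no negative values remain in b or c. The names before and out are illustrative.

```r
df <- data.frame(a = 1:6, b = -(5:10), c = rep(8:6, 2))
before <- rowSums(df)

out <- df
out$c <- out$c + pmin(0, out$b)  # c absorbs the negative part of b
out$b <- pmax(0, out$b)          # b is floored at zero
out$a <- out$a + pmin(0, out$c)  # a absorbs whatever c could not cover
out$c <- pmax(0, out$c)          # c is floored at zero

totals_kept <- all(rowSums(out) == before)
no_negatives <- all(out$b >= 0 & out$c >= 0)
```

Each step moves a deficit one column to the left before flooring, which is why the row sums stay the same.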
You could use dplyr:
library(dplyr)
df %>%
mutate(total=rowSums(.)) %>%
rowwise() %>%
mutate(c=max(b+c, 0),
b=max(b,0),
a=total - c - b) %>%
select(-total)
which returns
# A tibble: 6 x 3
# Rowwise:
a b c
<dbl> <dbl> <dbl>
1 1 0 3
2 2 0 1
3 2 0 0
4 4 0 0
5 3 0 0
6 2 0 0
Here is a base R solution.
df2 <- df
df2$c <- df$c + df$b
df2$a <- ifelse(df2$c < 0, df2$a + df2$c, df2$a)
df2[df2 < 0 ] <- 0
df2
# a b c
# 1 1 0 3
# 2 2 0 1
# 3 2 0 0
# 4 4 0 0
# 5 3 0 0
# 6 2 0 0

A Custom sort of the values within a dataframe in R

I am a newbie trying to learn R and I have a data frame like this:
a b c d
a 0 6 2 0
b 1 0 3 0
c 0 0 0 2
d 0 0 0 0
I want to sort a dataframe by two actions:
1. First, compute the TOTAL of each row to find the row with the maximum total, creating this:
a b c d TOTAL
a 0 6 2 0 8
b 1 0 3 0 4
c 0 0 0 2 2
d 0 0 0 0 0
2. Second, go through the rows in order of decreasing TOTAL and record each non-zero crossed value
next to its row-column pair, from max to min. This results in a new dataframe like this:
'x'
a-b 6 #considering values for "a" where it meets "b"
a-c 2
b-c 3 #b has the second max TOTAL value
b-a 1
c-d 2 # finally, values in front of c
I'd appreciate your help on this one.
EDIT: adding source data at bottom
library(tidyr); library(dplyr)
df %>%
gather(col, val, -row) %>% # Pull into long form, with one row for each row-col
arrange(row, -val) %>% # Sort by row and descending value
filter(val != 0) %>% # Only keep non-zeros
unite("row", c("row", "col")) # Combine row and col columns
row val
1 a_b 6
2 a_c 2
3 b_c 3
4 b_a 1
5 c_d 2
# Inputing data with "row" column
df <- read.table(
header = T,
stringsAsFactors = F,
text = "row a b c d
a 0 6 2 0
b 1 0 3 0
c 0 0 0 2
d 0 0 0 0 ")
Not completely certain, but is this what you want? You say you have a dataframe but it looks more like you have a matrix and it's not clear if you want to keep your first action or if that's just an intermediate step.
mat <- as.matrix(df)
df1 <- data.frame(addmargins(mat, 2))
df1
a b c d Sum
a 0 6 2 0 8
b 1 0 3 0 4
c 0 0 0 2 2
d 0 0 0 0 0
df2 <- as.data.frame(as.table(mat))
df2 <- df2[df2$Freq != 0,]
df2[with(df2, order(ave(Freq, Var1, FUN = sum), Freq, decreasing = TRUE)), ]
Var1 Var2 Freq
5 a b 6
9 a c 2
10 b c 3
2 b a 1
15 c d 2
Data:
df <- read.table(text="a b c d
0 6 2 0
1 0 3 0
0 0 0 2
0 0 0 0", header = TRUE, row.names = letters[1:4])
The first question is just rowSums. For the second, I use melt, then order by the per-row maximum (computed with ave) and by the value itself.
s=setNames(reshape2::melt(as.matrix(df)), c('rows', 'vars', 'values'))
s=s[s$values!=0,]
s[order(-ave(s$values,s$rows,FUN=max),-s$values),]
rows vars values
5 a b 6
9 a c 2
10 b c 3
2 b a 1
15 c d 2
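The rowSums step mentioned above is not shown in code, so here is a minimal sketch of it, assuming the data is read with row names as in the "Data:" block earlier. The name with_total is illustrative.

```r
df <- read.table(text = "a b c d
0 6 2 0
1 0 3 0
0 0 0 2
0 0 0 0", header = TRUE, row.names = letters[1:4])

# Add the row totals as a new column
with_total <- df
with_total$TOTAL <- rowSums(df)

# Sort rows by decreasing TOTAL (already in that order for this data)
with_total <- with_total[order(-with_total$TOTAL), ]
with_total
```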

Fill a column based on max values by condition in R

I need to fill a new column based on the max values per group.
So I have
A B C
1 1 0
1 9 0
2 5 0
2 10 0
2 15 0
3 1 0
3 2 0
4 5 0
4 6 0
I need to fill $C with 1 for each maximum value in $B per grouping of $A
So:
A B C
1 1 0
1 9 1
2 5 0
2 10 0
2 15 1
3 1 0
3 2 1
4 5 0
4 6 1
Appreciate the help
We can use base R ave to match maximum value in each group
df$C <- +(with(df, B == ave(B, A, FUN = max)))
df
# A B C
#1 1 1 0
#2 1 9 1
#3 2 5 0
#4 2 10 0
#5 2 15 1
#6 3 1 0
#7 3 2 1
#8 4 5 0
#9 4 6 1
The same in dplyr would be
library(dplyr)
df %>%
group_by(A) %>%
mutate(C = +(B == max(B)))
We can also match it with index of maximum value
df$C <- with(df, ave(B, A, FUN = function(x) seq_along(x) == which.max(x)))
and
df %>%
group_by(A) %>%
mutate(C = +(row_number() == which.max(B)))
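The two approaches differ when a group has a tied maximum: comparing with max flags every tied row, while which.max picks only the first. A small base R sketch on made-up data (one group with a tie at 9) shows the difference.

```r
# Made-up data: a single group whose maximum 9 appears twice
tied <- data.frame(A = c(1, 1, 1), B = c(9, 9, 1))

# Flags every row equal to the group maximum
all_ties <- +(tied$B == ave(tied$B, tied$A, FUN = max))

# Flags only the first row holding the group maximum
first_only <- ave(tied$B, tied$A,
                  FUN = function(x) seq_along(x) == which.max(x))
```

Which behaviour is wanted depends on whether a group may have more than one row marked with 1.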

Filter out from a dataset rows where values of two variables are both = 0 in dplyr

I have the following data:
x y z
A 0 0
B 1 0
C 0 2
D 1 1
E 2 0
F 2 3
G 1 3
H 0 0
I 3 3
I want to filter out of this dataset all the rows where 'y' and 'z' are both 0 at the same time, using dplyr (that is, I want to exclude A and H only).
Using dplyr:
library(dplyr)
df %>%
filter(y != 0 | z != 0)
# x y z
# 1 B 1 0
# 2 C 0 2
# 3 D 1 1
# 4 E 2 0
# 5 F 2 3
# 6 G 1 3
# 7 I 3 3
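If your dataset is stored in a data.frame called df, you can do this with dplyr.
Note that comma-separated conditions in filter() are combined with AND, so the two tests have to be negated together rather than one by one:
filter(df, !(y == 0 & z == 0))
which will return: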
x y z
B 1 0
C 0 2
D 1 1
E 2 0
F 2 3
G 1 3
I 3 3
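The condition "not both zero" can be written in two equivalent ways (De Morgan's law), and it is easy to confuse with "neither is zero". The base R sketch below rebuilds the question's data by hand and compares the three conditions; the variable names are illustrative.

```r
df <- data.frame(
  x = LETTERS[1:9],
  y = c(0, 1, 0, 1, 2, 2, 1, 0, 3),
  z = c(0, 0, 2, 1, 0, 3, 3, 0, 3)
)

keep_or  <- df$y != 0 | df$z != 0      # at least one is non-zero
keep_not <- !(df$y == 0 & df$z == 0)   # not (both are zero)
keep_and <- df$y != 0 & df$z != 0      # both non-zero: a stricter filter

same <- identical(keep_or, keep_not)
kept <- df$x[keep_or]
kept_strict <- df$x[keep_and]
```

Only the first two drop exactly A and H; the AND version also drops B, C and E.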
