Column-wise operations depending on data in a data frame in R

I have a data frame with negative values in one column, something like this:
df <- data.frame("a" = 1:6,"b"= -(5:10), "c" = rep(8:6,2))
a b c
1 1 -5 8
2 2 -6 7
3 3 -7 6
4 4 -8 8
5 5 -9 7
6 6 -10 6
I want to convert this to a data frame with no negative values in "b" while keeping row totals unchanged. Column "a" may be used only if "c" is not big enough to absorb the negative values in "b".
The end result should look like this:
a b c
1 1 0 3
2 2 0 1
3 2 0 0
4 4 0 0
5 3 0 0
6 2 0 0
I feel that sapply could be used, but I don't know how.

You can use pmin and pmax to get the new values for a, b and c.
df$c <- df$c + pmin(0, df$b)
df$b <- pmax(0, df$b)
df$a <- df$a + pmin(0, df$c)
df$c <- pmax(0, df$c)
df
# a b c
#1 1 0 3
#2 2 0 1
#3 2 0 0
#4 4 0 0
#5 3 0 0
#6 2 0 0
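The requirement that row totals stay unchanged can be checked directly. The following is a minimal sketch, not part of the original answer: it wraps the four pmin/pmax steps in a helper (the name absorb_negatives is hypothetical) and compares rowSums before and after. It assumes "a" is always large enough to cover what "c" cannot; otherwise "a" itself would go negative.

```r
# Example data from the question
df <- data.frame(a = 1:6, b = -(5:10), c = rep(8:6, 2))
totals_before <- rowSums(df)

# Hypothetical helper: push the deficit in b into c first, then into a
absorb_negatives <- function(d) {
  d$c <- d$c + pmin(0, d$b)  # c absorbs the negative part of b
  d$b <- pmax(0, d$b)        # b is now non-negative
  d$a <- d$a + pmin(0, d$c)  # a absorbs whatever c could not cover
  d$c <- pmax(0, d$c)        # c is now non-negative
  d
}

res <- absorb_negatives(df)

# Row totals are preserved and b, c contain no negative values
stopifnot(all(rowSums(res) == totals_before))
stopifnot(all(res$b >= 0), all(res$c >= 0))
```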

You could use dplyr:
library(dplyr)
df %>%
mutate(total=rowSums(.)) %>%
rowwise() %>%
mutate(c=max(b+c, 0),
b=max(b,0),
a=total - c - b) %>%
select(-total)
which returns
# A tibble: 6 x 3
# Rowwise:
a b c
<dbl> <dbl> <dbl>
1 1 0 3
2 2 0 1
3 2 0 0
4 4 0 0
5 3 0 0
6 2 0 0

Here is a base R solution.
df2 <- df
df2$c <- df$c + df$b
df2$a <- ifelse(df2$c < 0, df2$a + df2$c, df2$a)
df2[df2 < 0 ] <- 0
df2
# a b c
# 1 1 0 3
# 2 2 0 1
# 3 2 0 0
# 4 4 0 0
# 5 3 0 0
# 6 2 0 0

Related

Conditioning error, progression of logic in mutate/elseif_ pipeline

I'm trying to work out why code like this won't give me the expected results. I understand there are better ways of achieving the results (cut, etc.), but I am specifically trying to understand why chaining mutate and ifelse steps to replace values doesn't work.
A <- c(1,0,0,0,NA,0,1,0,1,0,0,1,1,1,NA,NA,NA,1,0,0,0,1,1,1,0,1,NA)
B <- c(1,0,0,NA,0,1,1,1,0,1,NA,1,0,1,NA,NA,1,0,01,0,0,0,NA,0,1,0,1)
C <- c(0,NA,0,1,0,1,NA,1,0,1,NA,0,1,0,NA,NA,1,0,01,NA,0,0,NA,1,NA,NA,1)
df <- data.frame(A, B, C)
df$D <- NA
df <- df %>%
mutate(D=ifelse(A==0 & B==0 & C==0,0,D)) %>% #assign 0 to d IF all 3 variables 0
mutate(D=ifelse(A==0 | B==0 | C==0,0,D)) %>% #now assign 0 to d IF ANY of 3 variables 0
mutate(D=ifelse(A==1 | B==1 | C==1,1,D)) #now reassign d to 1 if any of the variables has the value 1
> summary(as.factor(df$D))
0 1 NA's
2 19 6
But looking at the cross tabulation, my aim is to get 0=2 and NA=2, with the rest assigned 1. I can't figure out why my code's logic is not working.
> ftable(xtabs(~A+B+C, df, addNA = TRUE, na.action = NULL)) #matches AV variable
C 0 1 NA
A B
0 0 2 0 2
1 0 4 1
NA 0 1 1
1 0 3 2 1
1 3 0 1
NA 0 0 1
NA 0 1 0 0
1 0 2 0
NA 0 0 2
Edit: corrected typo
Look at your code step by step, specifically the two mutate commands with the OR conditions. For rows that contain missing values and 1s (but no zeroes), R can't check whether the row contains a zero, because it does not know what NA might be. So the second mutate returns NA for any row that has only 1s and NAs. The third step does the same, just with 1s: any row that contains only 0s and NAs will then return NA.
You can verify this by:
x <- c(0, 0, NA)
any(x == 0)
[1] TRUE
any(x == 1)
[1] NA
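The same behaviour can be seen on single values. This is a small illustrative sketch (the vectors D and cond are made up for the example): `|` and `&` follow three-valued logic, and ifelse returns NA wherever its condition is NA, which is how D ends up overwritten with NA in the pipeline.

```r
# Three-valued logic: NA means "unknown"
FALSE | NA   # NA: the unknown value could still make this TRUE
TRUE | NA    # TRUE: already decided, whatever NA is
FALSE & NA   # FALSE: already decided
TRUE & NA    # NA

# ifelse() propagates an NA condition to the result,
# regardless of what the "no" argument holds
D <- c(0, 0)
cond <- c(TRUE, NA)
out <- ifelse(cond, 1, D)
out
# [1]  1 NA
```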
You can do:
library(tidyverse)
df2 <- df %>%
mutate(D = case_when(A == 0 & B == 0 & C == 0 ~ 0,
is.na(A) & is.na(B) & is.na(C) ~ NA_real_,
TRUE ~ 1))
which gives:
A B C D
1 1 1 0 1
2 0 0 NA 1
3 0 0 0 0
4 0 NA 1 1
5 NA 0 0 1
6 0 1 1 1
7 1 1 NA 1
8 0 1 1 1
9 1 0 0 1
10 0 1 1 1
11 0 NA NA 1
12 1 1 0 1
13 1 0 1 1
14 1 1 0 1
15 NA NA NA NA
16 NA NA NA NA
17 NA 1 1 1
18 1 0 0 1
19 0 1 1 1
20 0 0 NA 1
21 0 0 0 0
22 1 0 0 1
23 1 NA NA 1
24 1 0 1 1
25 0 1 NA 1
26 1 0 NA 1
27 NA 1 1 1
And then
df2 %>% count(D)
D n
1 0 2
2 1 23
3 NA 2

Fill a column based on max values by condition in R

I need to fill a new column based on the max values per group.
So I have
A B C
1 1 0
1 9 0
2 5 0
2 10 0
2 15 0
3 1 0
3 2 0
4 5 0
4 6 0
I need to fill $C with 1 for each maximum value in $B per grouping of $A
So:
A B C
1 1 0
1 9 1
2 5 0
2 10 0
2 15 1
3 1 0
3 2 1
4 5 0
4 6 1
Appreciate the help
We can use base R ave to match maximum value in each group
df$C <- +(with(df, B == ave(B, A, FUN = max)))
df
# A B C
#1 1 1 0
#2 1 9 1
#3 2 5 0
#4 2 10 0
#5 2 15 1
#6 3 1 0
#7 3 2 1
#8 4 5 0
#9 4 6 1
The same in dplyr would be
library(dplyr)
df %>%
group_by(A) %>%
mutate(C = +(B == max(B)))
We can also match it with index of maximum value
df$C <- with(df, ave(B, A, FUN = function(x) seq_along(x) == which.max(x)))
and
df %>%
group_by(A) %>%
mutate(C = +(row_number() == which.max(B)))
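The two approaches differ when a group has tied maxima. As a rough illustration on made-up data (the data frame d and the column names C_all and C_first are hypothetical), comparing against the group maximum flags every tied row, while which.max flags only the first one:

```r
# Hypothetical data with a tie for the maximum in group 1
d <- data.frame(A = c(1, 1, 1, 2, 2), B = c(9, 3, 9, 4, 7))

# Comparison with the group maximum flags every tied row
d$C_all <- +(d$B == ave(d$B, d$A, FUN = max))

# which.max() returns only the first position of the maximum
d$C_first <- ave(d$B, d$A, FUN = function(x) seq_along(x) == which.max(x))

d
#   A B C_all C_first
# 1 1 9     1       1
# 2 1 3     0       0
# 3 1 9     1       0
# 4 2 4     0       0
# 5 2 7     1       1
```

Which one is appropriate depends on whether ties should produce several 1s per group or exactly one.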

add column with total count of rows meeting a condition in dplyr

Trying to get totals by class and condition but not grouping data.
Reproducible example:
df <- data.frame("class" = c("a","b","c","d","b","b","b","b","c","c","a"),"increment" = c(0,0,0,0,0,0,32,12,0,0,0))
R> df
class increment
1 a 0
2 b 0
3 c 0
4 d 0
5 b 0
6 b 0
7 b 32
8 b 12
9 c 0
10 c 0
11 a 0
I want the number of cases where increment is non-zero, reported for every class.
Desired output:
R> df
class increment increment_count_per_class
1 a 0 0
2 b 0 2
3 c 0 0
4 d 0 0
5 b 0 2
6 b 0 2
7 b 32 2
8 b 12 2
9 c 0 0
10 c 0 0
11 a 0 0
My first approach is below, but I know there must be a less convoluted way using dplyr:
df <- df %>% mutate(has.increment = ifelse(increment>0,1,0))
R> df
class increment has.increment
1 a 0 0
2 b 0 0
3 c 0 0
4 d 0 0
5 b 0 0
6 b 0 0
7 b 32 1
8 b 12 1
9 c 0 0
10 c 0 0
11 a 0 0
Get totals per class when increment exists
N <- df %>% group_by(class,has.increment) %>% tally() %>% filter(has.increment == 1)
R> N
# A tibble: 1 x 3
# Groups: class [1]
class has.increment n
<chr> <dbl> <int>
1 b 1 2
Then join:
merge(N,df, by = "class", all = TRUE)
R> merge(N,df, by = "class", all = TRUE)
class has.increment.x n increment has.increment.y
1 a NA NA 0 0
2 a NA NA 0 0
3 b 1 2 0 0
4 b 1 2 12 1
5 b 1 2 0 0
6 b 1 2 0 0
7 b 1 2 32 1
8 c NA NA 0 0
9 c NA NA 0 0
10 c NA NA 0 0
11 d NA NA 0 0
Try this:
df %>%
group_by(class) %>%
mutate(increment_count_per_class = sum(increment!=0))
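For comparison, a base R sketch of the same idea, assuming the example data from the question: ave applies sum within each class and returns a vector aligned with the original rows, so no grouping or join is needed.

```r
# Example data from the question
df <- data.frame(class = c("a", "b", "c", "d", "b", "b", "b", "b", "c", "c", "a"),
                 increment = c(0, 0, 0, 0, 0, 0, 32, 12, 0, 0, 0))

# Count the non-zero increments within each class, one value per row
df$increment_count_per_class <- ave(as.numeric(df$increment != 0),
                                    df$class, FUN = sum)
df$increment_count_per_class
# [1] 0 2 0 0 2 2 2 2 0 0 0
```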

R: df header columns are ordinal ranking and spread across columns for each observation

I have questionnaire data that looks like this:
items no_stars1 no_stars2 no_stars3 average satisfied bad
1 A 1 0 0 0 0 1
2 B 0 1 0 1 0 0
3 C 0 0 1 0 1 0
4 D 0 1 0 0 1 0
5 E 0 0 1 1 0 0
6 F 0 0 1 0 1 0
7 G 1 0 0 0 0 1
Basically, the header columns (number of stars and satisfaction) are ordinal rankings for each item. I would like to summarize the no_stars columns (2:4) and the satisfaction columns (5:7) into one column each, so that the output looks like this:
items no_stars satisfactory
1 A 1 1
2 B 2 2
3 C 3 3
4 D 2 3
5 E 3 2
6 F 3 3
7 G 1 1
$no_stars: 1 is for no_stars1, 2 for no_stars2, 3 for no_stars3
$satisfactory: 1 is for bad, 2 for average, 3 for satisfied
I have tried the code below
df$no_stars2[df$no_stars2 == 1] <- 2
df$no_stars3[df$no_stars3 == 1] <- 3
df$average[df$average == 1] <- 2
df$satisfied[df$satisfied == 1] <- 3
no_stars <- df$no_stars1 + df$no_stars2 + df$no_stars3
satisfactory <- df$bad + df$average + df$satisfied
tidy_df <- data.frame(items = df$items, no_stars, satisfactory)
tidy_df
Is there a function in R that can do the same thing, or does anyone have a better and simpler solution?
Thanks
Just use max.col and set preferences:
starsOrder<-c("no_stars1","no_stars2","no_stars3")
satOrder<-c("bad","average","satisfied")
data.frame(items=df$items,no_stars=max.col(df[,starsOrder]),
satisfactory=max.col(df[,satOrder]))
# items no_stars satisfactory
#1 A 1 1
#2 B 2 2
#3 C 3 3
#4 D 2 3
#5 E 3 2
#6 F 3 3
#7 G 1 1
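One point worth knowing about max.col is that its default is ties.method = "random". With exactly one 1 per row, as in the data above, there are no ties, so the result is deterministic. The sketch below uses a made-up matrix m with an all-zero row to show what happens otherwise, and one possible guard (the names first and idx are hypothetical):

```r
# Hypothetical one-hot rows plus an all-zero row (row 3)
m <- rbind(c(1, 0, 0),
           c(0, 0, 1),
           c(0, 0, 0))

# "first" makes tie-breaking deterministic; the all-zero row maps to column 1
first <- max.col(m, ties.method = "first")
first
# [1] 1 3 1

# Optional guard: return NA for rows that do not contain exactly one 1
idx <- ifelse(rowSums(m) == 1, first, NA)
idx
# [1]  1  3 NA
```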
Another tidyverse solution, making use of factor-to-integer conversions to encode no_stars and satisfactory, and gathering from wide to long twice:
library(tidyverse)
df %>%
gather(no_stars, v1, starts_with("no_stars")) %>%
mutate(no_stars = as.integer(factor(no_stars))) %>%
gather(satisfactory, v2, average, satisfied, bad) %>%
filter(v1 > 0 & v2 > 0) %>%
mutate(satisfactory = as.integer(factor(
satisfactory, levels = c("bad", "average", "satisfied")))) %>%
select(-v1, -v2) %>%
arrange(items)
# items no_stars satisfactory
#1 A 1 1
#2 B 2 2
#3 C 3 3
#4 D 2 3
#5 E 3 2
#6 F 3 3
#7 G 1 1
While there may be more elegant solutions, using dplyr::case_when() gives you the flexibility to code things however you want:
library(dplyr)
df %>%
dplyr::mutate(
no_stars = dplyr::case_when(
no_stars1 == 1 ~ 1,
no_stars2 == 1 ~ 2,
no_stars3 == 1 ~ 3)
, satisfactory = dplyr::case_when(
average == 1 ~ 2,
satisfied == 1 ~ 3,
bad == 1 ~ 1)
)
# items no_stars1 no_stars2 no_stars3 average satisfied bad no_stars satisfactory
# 1 A 1 0 0 0 0 1 1 1
# 2 B 0 1 0 1 0 0 2 2
# 3 C 0 0 1 0 1 0 3 3
# 4 D 0 1 0 0 1 0 2 3
# 5 E 0 0 1 1 0 0 3 2
# 6 F 0 0 1 0 1 0 3 3
# 7 G 1 0 0 0 0 1 1 1
dat%>%
replace(.==1,NA)%>%
replace_na(setNames(as.list(names(.)),names(.)))%>%
replace(.==0,NA)%>%
mutate(s=coalesce(!!!.[2:4]),
no_stars=as.numeric(factor(s,unique(s))),
t=coalesce(!!!.[5:7]),
satisfactory=as.numeric(factor(t,unique(t))))%>%
select(items,no_stars,satisfactory)
items no_stars satisfactory
1 A 1 1
2 B 2 2
3 C 3 3
4 D 2 3
5 E 3 2
6 F 3 3
7 G 1 1
Using apply and match:
data.frame(
items = df1$items,
no_stars = apply(df1[2:4], 1, match, x=1),
satisfactory = apply(df1[c(7,5:6)], 1, match, x=1))
# items no_stars satisfactory
# 1 A 1 1
# 2 B 2 2
# 3 C 3 3
# 4 D 2 3
# 5 E 3 2
# 6 F 3 3
# 7 G 1 1
Data:
df1 <- read.table(header=TRUE,stringsAsFactors=FALSE,text="
items no_stars1 no_stars2 no_stars3 average satisfied bad
1 A 1 0 0 0 0 1
2 B 0 1 0 1 0 0
3 C 0 0 1 0 1 0
4 D 0 1 0 0 1 0
5 E 0 0 1 1 0 0
6 F 0 0 1 0 1 0
7 G 1 0 0 0 0 1")

How to find the number of rows which match a condition

If I have a dataframe A like
A:
x. y. z. a. b. c.
1 0 0 3 0 0
2 0 0 5 6 5
3 0 0 6 8 2
4 0 1 8 0 6
5 0 0 20 2 0
6 0 1 3 3 7
How could I obtain a data frame B like this:
3 columns, one for each of the a, b and c columns of data frame A, each containing the number of rows which match the following condition:
The value in a, b or c is between 5 and 10 (5 <= i <= 10) AND the z value is equal to 1. For instance, in column a, row 3 is 6, which is between 5 and 10, but its z. value is not 1, so that row is not counted. On the other hand, in row 4, a. is between 5 and 10 and the z. value is 1, so this row is counted.
B would be like:
B:
a. b. c.
1 0 2
Here is a solution using tidyverse tools. The approach is to reduce the data to only the rows that have z == 1 using filter, and then use summarise_at to condense the remaining rows. We first apply the function (. > 5 & . < 10), which makes a logical vector for whether or not each of a, b, c is between 5 and 10, and then wrap it in sum. When applied to logical vectors, sum treats TRUE as 1 and FALSE as 0, so this is equivalent to counting the TRUE values. Note that the inequalities here are strict; use >= and <= if the bounds 5 and 10 should be included, as in the condition stated in the question. For this sample data both give the same counts.
library(tidyverse)
tbl_A <- read_table2(
"x y z a b c
1 0 0 3 0 0
2 0 0 5 6 5
3 0 0 6 8 2
4 0 1 8 0 6
5 0 0 20 2 0
6 0 1 3 3 7"
)
tbl_b <- tbl_A %>%
filter(z == 1) %>%
summarise_at(vars(a:c), ~ sum(. > 5 & . < 10)) %>%
print()
# A tibble: 1 x 3
a b c
<int> <int> <int>
1 1 0 2
Or in base R:
sapply(c("a.", "b.", "c."), function(x)
nrow(df[(df[, x] >= 5 & df[, x] <= 10) & df[, "z."] == 1, ])
)
#a. b. c.
# 1 0 2
Sample data
df <- read.table(text =
"x. y. z. a. b. c.
1 0 0 3 0 0
2 0 0 5 6 5
3 0 0 6 8 2
4 0 1 8 0 6
5 0 0 20 2 0
6 0 1 3 3 7", header = TRUE)
Here is an option using data.table
library(data.table)
setDT(df)[z. == 1, lapply(.SD, function(x) sum(x > 5 & x < 10)) , .SDcols = a.:c.]
# a. b. c.
#1: 1 0 2
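A vectorised base R variant is also possible. This is only a sketch, assuming the sample data above and the inclusive bounds from the question (the names sub and B are chosen for illustration): comparing a data frame with a number yields a logical matrix, and colSums counts the TRUE values per column.

```r
# Sample data from the question, built directly
df <- data.frame(x. = 1:6, y. = 0, z. = c(0, 0, 0, 1, 0, 1),
                 a. = c(3, 5, 6, 8, 20, 3),
                 b. = c(0, 6, 8, 0, 2, 3),
                 c. = c(0, 5, 2, 6, 0, 7))

# Keep only rows with z. == 1 and the three columns of interest
sub <- df[df$z. == 1, c("a.", "b.", "c.")]

# Logical matrix of "between 5 and 10" (inclusive), summed per column
B <- colSums(sub >= 5 & sub <= 10)
B
# a. b. c.
#  1  0  2
```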
