I am new to dplyr/tidyverse and I would like to sum across the rows of a dataset when the values in given columns exceed given values. For example, given this data frame:
a<-c(2,3,2,1,0)
b<-c(2,3,3,2,1)
z<-c(3,2,1,1,0)
data.abz <- data.frame(a,b,z)
data.abz
a b z
1 2 2 3
2 3 3 2
3 2 3 1
4 1 2 1
5 0 1 0
I would like to sum across the rows if the value in column a or b is greater than 1 and the value in column z is greater than 0. If the condition is not satisfied, the row sum is 0. For example, given the previous data frame, I would like to get the following:
  a b z sum_values
1 2 2 3          7
2 3 3 2          8
3 2 3 1          6
4 1 2 1          3
5 0 1 0          0
The last two rows do not satisfy the condition and were therefore assigned a value of 0. This is what I have done, but I am sure there is a better way to achieve the same result.
data.abz <- data.frame(a, b, z) %>%
  mutate_at(vars(c(a, b)),
            function(x) case_when(x < 2 ~ 0, TRUE ~ as.double(x))) %>%
  mutate(sum_values = rowSums(.[1:3]))
Any more idiomatic and better ideas with R and dplyr?
I like to use dplyr's case_when function for conditional calculations, but depending on your requirements you may need something else.
library(dplyr)
df <- tibble(a = c(2, 3, 2, 1, 0),
b = c(2, 3, 3, 2, 1),
z = c(3, 2, 1, 1, 0))
df <- df %>%
  mutate(sum_values = case_when((a > 1 | b > 1) & (z > 0) ~ a + b + z,
                                TRUE ~ 0))
That code produces different results from your desired output (specifically row 4), so let me know if it works, or give a little more explanation of your desired output.
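If the goal is actually to zero out the individual a/b values below 2 before summing (which is what the mutate_at code does, and what row 4 of the desired output suggests), here is a sketch of the same idea using the newer across() syntax; it assumes dplyr >= 1.0 and keeps the original a and b columns untouched:
library(dplyr)
data.abz <- data.frame(a, b, z) %>%
  # zero out a and b where they are below 2, then add z row-wise
  mutate(sum_values = rowSums(across(c(a, b), ~ if_else(.x < 2, 0, as.double(.x)))) + z)
#   a b z sum_values
# 1 2 2 3          7
# 2 3 3 2          8
# 3 2 3 1          6
# 4 1 2 1          3
# 5 0 1 0          0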
I have a dataframe with a variable containing 9-digit numbers. I would like to create a new variable with mutate and ifelse whose value depends on the 7th digit. I saw that the grepl function could do what I want, but I don't know how to write the code.
For example, if the 7th digit is a 1 I would like to have 1 in the new variable, if the 7th digit is a 2 I would like 5, and if the 7th digit is a 3 I would like 6.
Please find an example of the data below.
Thank you for your help.
# Library
library(tidyverse)
# Data
ID = c(1, 2, 3, 4, 5, 6)
Vectest2 = c("9079870989", "907007123", "907865345", "907098432", "907347567", "907845120")
data = data.frame(ID, Vectest2)
> data
  ID   Vectest2
1  1 9079870989
2  2  907007123
3  3  907865345
4  4  907098432
5  5  907347567
6  6  907845120
# code
data %>% mutate(variable2 = ifelse(grepl(), 3,
                            ifelse(grepl(), 1,
                            ifelse(grepl(), 2, NA))))
Instead of grepl I would suggest substr, since you need to check a fixed position. I have created a new column called seventh_d which stores the 7th digit so you don't have to compute it again and again.
Also, since you are using dplyr, use case_when, which is a better alternative to nested ifelse.
library(dplyr)
data %>%
  mutate(seventh_d = substr(Vectest2, 7, 7),
         variable2 = case_when(seventh_d == 1 ~ 1,
                               seventh_d == 2 ~ 5,
                               seventh_d == 3 ~ 6,
                               TRUE ~ NA_real_))
#  ID   Vectest2 seventh_d variable2
#1  1 9079870989         0        NA
#2  2  907007123         1         1
#3  3  907865345         3         6
#4  4  907098432         4        NA
#5  5  907347567         5        NA
#6  6  907845120         1         1
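For completeness, the grepl approach hinted at in the question could look like the sketch below; the regex "^.{6}1" just means "any six characters, then a 1", i.e. the 7th character is 1. substr is still the simpler tool here:
library(dplyr)
data %>%
  mutate(variable2 = case_when(grepl("^.{6}1", Vectest2) ~ 1,   # 7th character is 1
                               grepl("^.{6}2", Vectest2) ~ 5,   # 7th character is 2
                               grepl("^.{6}3", Vectest2) ~ 6,   # 7th character is 3
                               TRUE ~ NA_real_))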
I was searching on the internet for a similar solution, but I was not able to find one specific to my case. Let's say I have the following data frame:
a = c(1, 1, 1, 2, 2)
b = c(2, 1, 1, 1, 2)
c = c(2, 2, 1, 1, 1)
d = c(1, 2, 2, 1, 1)
df <- data.frame(a = a, b = b, c = c, d = d)
and df looks like this:
a b c d
1 1 2 2 1
2 1 1 2 2
3 1 1 1 2
4 2 1 1 1
5 2 2 1 1
Note: in this example I use the pair of values [1, 2], but it could be a different set of values, e.g. [-1, 1], or even more than two possible values, e.g. [-1, 1, 2].
Now I would like to have a matrix where each [i, j] element represents the number of rows with the value 1 in both column i and column j. For this particular case we have (showing only the upper triangle, because it is symmetric):
  a b c d
a 3 2 1 1
b   3 2 1
c     3 2
d       3
The diagonal should count the number of rows with the value 1 in a given column. In this case all columns have the same number of 1s. The format should be similar to the output of the cor() function (a correlation matrix).
I was trying to use table() (and also crosstab from the descr package), but it shows the information for one pair of columns at a time.
It can be done by manually counting the occurrences of 1 for each pair of columns (e.g. nrow(df[df$a == 1 & df$b == 1, ]) gives 2) and then filling a matrix, but I was wondering if there is a built-in function that simplifies the process.
We can use crossprod on a matrix to count the occurrences of the value 1 in the question's example:
m1 <- as.matrix(df == 1) # see Note[1]
out <- crossprod(m1)
Note[1]: Pointed out by @imo (see comments below) to address the general case (a matrix with values [x, y]). For a matrix with [0, 1] values, df == 1 can be replaced by df. To count the value 2 from the question's example, use df == 2 instead.
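Why this works: crossprod(m1) is t(m1) %*% m1, and because the logical matrix is coerced to 0/1, entry [i, j] counts the rows where both column i and column j are TRUE (i.e. equal to 1). A quick check against the question's data, before masking the lower triangle:
m1 <- as.matrix(df == 1)   # logical matrix: TRUE where the value is 1
crossprod(m1)              # same as t(m1) %*% m1
#   a b c d
# a 3 2 1 1
# b 2 3 2 1
# c 1 2 3 2
# d 1 1 2 3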
If the lower triangle should be 0 or NA (here NA):
out[lower.tri(out)] <- NA
out
# a b c d
#a 3 2 1 1
#b NA 3 2 1
#c NA NA 3 2
#d NA NA NA 3
Let's say I have a dataframe:
x <- data.frame(a=c(1,2,3), b=c(2,3,2), c=c(4,5,1))
# a b c
#1 1 2 4
#2 2 3 5
#3 3 2 1
For each column, I would like to calculate the difference between that and the max of the other columns:
# Desired result:
# a b c
#1 -3 -2 2
#2 -3 -2 2
#3 1 -1 -2
For example, the (1,1) entry is -3 because in the first row a = 1 and max(b, c) = 4, so 1 - 4 = -3.
Note that I don't necessarily know the number of columns in the dataframe up front, so there could be arbitrarily many columns.
This should work on any number of columns:
sapply(1:ncol(x), function(i) {
  x[, i] - do.call(pmax, x[, -i])
})
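If you need the result back as a data frame with the original column names, one way (a small sketch building on the same idea) is:
res <- sapply(seq_along(x), function(i) x[[i]] - do.call(pmax, x[-i]))
colnames(res) <- names(x)   # sapply drops the column names, so restore them
as.data.frame(res)
#    a  b  c
# 1 -3 -2  2
# 2 -3 -2  2
# 3  1 -1 -2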
If you want a dplyr solution with a bit of RC indexing, you can use transmute to generate a new data frame, or mutate to add to your existing dataframe.
library(dplyr)
x <- data.frame(a = c(1, 2, 3), b = c(2, 3, 2), c = c(4, 5, 1))
x %>% transmute(a = a - do.call(pmax, x[, -1]),
                b = b - do.call(pmax, x[, -2]),
                c = c - do.call(pmax, x[, -3]))
Say I have a dataset like this:
id <- c(1, 2, 3, 4, 5,6)
number <- c(1, 4, 7, 4, NA, 4)
dat <- data.frame(id, number)
I.e.,
id number
1 1 1
2 2 4
3 3 7
4 4 4
5 5 NA
6 6 4
Using the filter function from dplyr, I can subset just the rows with numbers greater than 3:
dat.new <- filter(dat, number > 3)
id number
1 2 4
2 3 7
3 4 4
4 6 4
And I can also subset the rows with a missing number:
dat.new <- filter(dat, is.na(number))
id number
1 5 NA
But when I try to keep both the rows where number is NA and the rows where number is greater than 3, it doesn't work.
dat.new <- filter(dat, is.na(number) || number > 3)
id number
No data available in table
What's going on?
dat.new <- filter(dat, is.na(number) | number > 3)
The problem is the || operator: || is not vectorized and only uses the first element of each side, while the single | is the element-wise (vectorized) OR, which is what filter needs. See https://www.r-bloggers.com/logical-operators-in-r/ for more details.
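For completeness, here is the single-| version run against the sample data (output shown as comments); the NA row is kept because is.na(number) is TRUE there:
library(dplyr)
dat <- data.frame(id = c(1, 2, 3, 4, 5, 6),
                  number = c(1, 4, 7, 4, NA, 4))
filter(dat, is.na(number) | number > 3)
#   id number
# 1  2      4
# 2  3      7
# 3  4      4
# 4  5     NA
# 5  6      4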
I would like to perform an operation across a column of a data frame wherein the output is dependent on a comparison between two values.
My data frame dat is arranged like this:
region value1
a 0
a 0
a 6
a 7
a 3
a 0
a 4
b 5
b 1
b 0
I want to create a vector of factor values based on integers. The factor value should increment every time the region value changes or every time value1 is 0. So in this case the vector I want would be equivalent to c(1, 2, 2, 2, 2, 3, 3, 4, 4, 5).
I have code to make a factor vector that increments ONLY when value1 is 0:
fac <- as.factor(cumsum(dat[,2]==0))
and I have C-style code that gets roughly the vector I want, but it runs extremely slowly on my full data and is just plain ugly:
p <- 1
facint <- 1
for (i in 2:length(dat[, 2])) {
  facint <- c(facint, p)
  if (dat[i, 2] == 0 || dat[i, 1] != dat[i - 1, 1])
    p <- p + 1
}
fac <- as.factor(facint)
So how can I accomplish an operation like this on every row using vectorized, R-style programming?
Try
cumsum(dat[,2]==0|c(FALSE,dat$region[-1]!=dat$region[-nrow(dat)]))
# [1] 1 2 2 2 2 3 3 4 4 5
Or
cumsum(!duplicated(dat[,1]) | dat[,2]==0)
#[1] 1 2 2 2 2 3 3 4 4 5
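Either expression can be wrapped in as.factor() to reproduce the original fac; a quick check against the sample data (assuming dat is built from the table above):
dat <- data.frame(region = c(rep("a", 7), rep("b", 3)),
                  value1 = c(0, 0, 6, 7, 3, 0, 4, 5, 1, 0))
fac <- as.factor(cumsum(!duplicated(dat[, 1]) | dat[, 2] == 0))
fac
# [1] 1 2 2 2 2 3 3 4 4 5
# Levels: 1 2 3 4 5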