I would like to perform an operation across a column of a data frame wherein the output is dependent on a comparison between two values.
My data frame dat is arranged like this:
region value1
a 0
a 0
a 6
a 7
a 3
a 0
a 4
b 5
b 1
b 0
I want to create a vector of factor values based on integers. The factor value should increment every time the region value changes or every time value1 is 0. So in this case the vector I want would be equivalent to c(1, 2, 2, 2, 2, 3, 3, 4, 4, 5).
I have code to make a factor vector that increments ONLY when value1 is 0:
fac <- as.factor(cumsum(dat[,2]==0))
and I have C-style code that gets roughly the vector I want, but runs extremely slowly on my overall data and is just plain ugly:
p <- 1
facint <- 1
for (i in 2:length(dat[,2])) {
  facint <- c(facint, p)
  if (dat[i, 2] == 0 || dat[i, 1] != dat[i-1, 1])
    p <- p + 1
}
fac <- as.factor(facint)
So how can I accomplish an operation such as this when operating on every row in R-style programming?
Try
cumsum(dat[,2]==0|c(FALSE,dat$region[-1]!=dat$region[-nrow(dat)]))
# [1] 1 2 2 2 2 3 3 4 4 5
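Or, if each region occupies one contiguous block of rows (duplicated() flags only the first occurrence of each region):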
cumsum(!duplicated(dat[,1]) | dat[,2]==0)
#[1] 1 2 2 2 2 3 3 4 4 5
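As a self-contained check, the sketch below rebuilds the example data frame (assuming region is stored as character) and compares the two approaches. It also shows where they can differ: the duplicated() version fires only on the first appearance of each region, so it misses a region that reappears later. dat2 is a hypothetical variation added only to illustrate this; it is not part of the question's data.

```r
dat <- data.frame(
  region = c(rep("a", 7), rep("b", 3)),
  value1 = c(0, 0, 6, 7, 3, 0, 4, 5, 1, 0),
  stringsAsFactors = FALSE
)

# TRUE wherever the region differs from the previous row (never for row 1)
region_changed <- c(FALSE, dat$region[-1] != dat$region[-nrow(dat)])

grp_lag <- cumsum(dat$value1 == 0 | region_changed)
grp_dup <- cumsum(!duplicated(dat$region) | dat$value1 == 0)

grp_lag                      # 1 2 2 2 2 3 3 4 4 5
identical(grp_lag, grp_dup)  # TRUE for this data

# hypothetical variation: region "a" comes back after "b"
dat2 <- data.frame(region = c("a", "a", "b", "a"),
                   value1 = c(0, 3, 2, 5),
                   stringsAsFactors = FALSE)
changed2 <- c(FALSE, dat2$region[-1] != dat2$region[-nrow(dat2)])
lag2 <- cumsum(dat2$value1 == 0 | changed2)                  # 1 1 2 3
dup2 <- cumsum(!duplicated(dat2$region) | dat2$value1 == 0)  # 1 1 2 2
```

If regions can repeat in non-adjacent blocks, the comparison with the previous row is the safer of the two.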
Given a random integer vector below:
z <- c(3, 2, 4, 2, 1)
I'd like to create a new vector that contains each index of z, repeated as many times as the corresponding element of z specifies. To illustrate, the desired result in this case is:
[1] 1 1 1 2 2 3 3 3 3 4 4 5
There must be a simple way to do this.
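You can use rep and seq_along to repeat the indices of a vector based on the values of that same vector: seq_along gets the indices and rep repeats them. (seq(z) gives the same indices here, but seq_along also behaves correctly when z has length one.)
rep(seq_along(z), z)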
# [1] 1 1 1 2 2 3 3 3 3 4 4 5
Start with all the indices of the vector z. These are given by:
1:length(z)
Each index then has to be repeated as many times as the corresponding value of z specifies. This can be done by combining lapply or sapply with rep:
unlist(lapply(X = 1:length(z), FUN = function(x) rep(x = x, times = z[x])))
[1] 1 1 1 2 2 3 3 3 3 4 4 5
unlist(sapply(X = 1:length(z), FUN = function(x) rep(x = x, times = z[x])))
[1] 1 1 1 2 2 3 3 3 3 4 4 5
Both alternatives give the same result.
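One caveat when building the index vector with seq(z) or 1:length(z): both can misbehave on edge cases. The sketch below, using a hypothetical length-one vector z1 that is not in the question, shows why seq_along is the more defensive choice.

```r
z <- c(3, 2, 4, 2, 1)
idx <- rep(seq_along(z), times = z)
idx                     # 1 1 1 2 2 3 3 3 3 4 4 5

# hypothetical length-one input: seq(z1) expands to 1:3, not the single index 1
z1 <- 3
bad  <- rep(seq(z1), z1)        # 1 2 3 1 2 3 1 2 3
good <- rep(seq_along(z1), z1)  # 1 1 1
```

Similarly, 1:length(z) evaluates to c(1, 0) when z is empty, whereas seq_along(z) returns an empty integer vector.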
I am struggling to figure out how to remove rows from a dataset based on conditions across multiple factors in a large dataset. Here is some example data to illustrate the problem I am having with a smaller data frame:
Code <- c("A", "B", "C", "D", "C", "D", "A", "A")
Value <- c(1, 2, 3, 4, 1, 2, 3, 4)
# build the data frame directly; cbind() would first coerce Value to character
data <- data.frame(Code, Value)
data
Code Value
1 A 1
2 B 2
3 C 3
4 D 4
5 C 1
6 D 2
7 A 3
8 A 4
I want to remove values where the Code is A and the Value is < 2 from the dataset. I understand the logic of how to select for values where Code is A and Values <2, but I can't figure out how to remove these values from the dataset without also removing all values of A that are > 2, while maintaining values of the other codes that are less than 2.
#Easy to select for values of A less than 2
data2<- subset(data, (Code == "A" & Value < 2))
data2
Code Value
1 A 1
#But I want to remove values of A less than 2 without also removing values of A that are greater than 2:
data1<- subset(data, (Code != "A" & Value > 2))
data1
Code Value
3 C 3
4 D 4
### just using Value > 2 does not allow me to include values that are less than 2 for the other Codes (B,C,D):
data2<- subset(data, Value > 2)
data2
Code Value
3 C 3
4 D 4
7 A 3
8 A 4
My ideal dataset would look like this:
data
Code Value
2 B 2
3 C 3
4 D 4
5 C 1
6 D 2
7 A 3
8 A 4
I have tried different iterations of filter(), subset(), and select() but I can't figure out the correct conditional statement that allows me to remove the desired combination of levels of multiple factors. Any suggestions would be greatly appreciated.
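One way to express "remove the rows where Code is A and Value is below 2" is to negate the selecting condition rather than negate each part separately. The sketch below is a minimal illustration on the example data, not necessarily the only approach; kept and kept2 are names chosen here for the example.

```r
Code  <- c("A", "B", "C", "D", "C", "D", "A", "A")
Value <- c(1, 2, 3, 4, 1, 2, 3, 4)
data  <- data.frame(Code, Value)

# keep every row that is NOT (Code "A" with Value below 2)
kept <- subset(data, !(Code == "A" & Value < 2))
kept

# equivalent form by De Morgan's law: not (p and q) is (not p) or (not q)
kept2 <- subset(data, Code != "A" | Value >= 2)
identical(kept, kept2)
```

The attempt with Code != "A" & Value > 2 fails because it requires both parts to hold at once; negating an "and" turns it into an "or". The same negated condition can be passed to dplyr's filter().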
I am new to dplyr/tidyverse and I would like to sum rows of a dataset if the values in a given column(s) exceed a given value. For example given this dataframe,
a<-c(2,3,2,1,0)
b<-c(2,3,3,2,1)
z<-c(3,2,1,1,0)
data.abz <- data.frame(a,b,z)
data.abz
a b z
1 2 2 3
2 3 3 2
3 2 3 1
4 1 2 1
5 0 1 0
I would like to sum across the rows if the value in column a or b is greater than 1 and if the value at column z is greater than 0. If the condition is not satisfied the row sum is 0. For example,
given the previous data frame, I would like to get the following,
a b z sum_values
1 2 2 3 7
2 3 3 2 8
3 2 3 1 6
4 1 2 1 3
5 0 1 0 0
The last two rows do not satisfy the condition and therefore they were assigned a value of 0. This is what I have done but I am sure there is a better way to achieve the same.
library(dplyr)
data.abz <- data.frame(a, b, z) %>%
  mutate_at(vars(c(a, b)),
            function(x) case_when(x < 2 ~ 0, TRUE ~ as.double(x))) %>%
  mutate(sum_values = rowSums(.[1:3]))
Any more idiomatic and better ideas with R and dplyr?
I like to use dplyr's case_when function for conditional calculations. But depending on your needs you may need something else.
library(dplyr)
df <- tibble(a = c(2, 3, 2, 1, 0),
             b = c(2, 3, 3, 2, 1),
             z = c(3, 2, 1, 1, 0))
df <- df %>%
  mutate(sum_values = case_when((a > 1 | b > 1) & (z > 0) ~ (a + b + z),
                                TRUE ~ 0))
That code gives a different result from yours for row 4: it returns 4 (the full a + b + z) rather than 3, because b > 1 and z > 0 satisfy the condition. Let me know if that works, or explain your desired output in a little more detail.
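For comparison, the same condition can be written in base R without dplyr. This is a sketch under the same reading of the condition as the case_when version (a or b greater than 1, and z greater than 0); keep is a helper name introduced here.

```r
a <- c(2, 3, 2, 1, 0)
b <- c(2, 3, 3, 2, 1)
z <- c(3, 2, 1, 1, 0)
data.abz <- data.frame(a, b, z)

# logical vector: TRUE for rows that satisfy the condition
keep <- with(data.abz, (a > 1 | b > 1) & z > 0)

# full row sum where the condition holds, 0 elsewhere
data.abz$sum_values <- ifelse(keep, rowSums(data.abz[c("a", "b", "z")]), 0)
data.abz$sum_values   # 7 8 6 4 0
```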
I searched the internet for a similar solution, but I could not find one for my specific case. Let's say I have the following data frame:
a = c(1, 1, 1, 2, 2)
b = c(2, 1, 1, 1, 2)
c = c(2, 2, 1, 1, 1)
d = c(1, 2, 2, 1, 1)
df <- data.frame(a = a, b = b, c = c, d = d)
and df looks like this:
a b c d
1 1 2 2 1
2 1 1 2 2
3 1 1 1 2
4 2 1 1 1
5 2 2 1 1
Note: In this example I use [1,2] pair of values, but it could be a set of different values: [-1,1] or even more than two possible values: [-1,1,2].
Now I would like a matrix in which each [i,j] element is the number of rows where both column i and column j have the value 1. For this particular case we have (showing only the upper triangle, because the matrix is symmetric):
  a b c d
a 3 2 1 1
b   3 2 1
c     3 2
d       3
The diagonal should count the number of rows with the value 1 in a given column. In this case all columns have the same number of 1s. The format should be similar to the output of the cor() function (correlation matrix).
I was trying to use table() (and also crosstab from descr package) but it shows the information by pairs of columns.
It can be done by computing manually the occurrence of 1 of each pair of columns (i.e.: nrow(df[df$a==1 & df$b==1,])=2) and then putting into a matrix, but I was wondering if there is a built-in function that simplify the process.
We can use crossprod on a logical matrix to count the occurrences of the value 1 in the question's example:
m1 <- as.matrix(df == 1) # see Note[1]
out <- crossprod(m1)
Note [1]: Pointed out by @imo (see the comments below) to address the general case (a matrix with values [x, y]). For a matrix with [0, 1] values, df == 1 can be replaced by df. To count the value 2 in the question's example instead, use df == 2.
If the lower triangle should be NA (or 0):
out[lower.tri(out)] <- NA
out
# a b c d
#a 3 2 1 1
#b NA 3 2 1
#c NA NA 3 2
#d NA NA NA 3
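To see why crossprod gives the counts: for a logical matrix m, entry [i,j] of t(m) %*% m sums m[r,i] * m[r,j] over the rows r, and each term is 1 only when both columns are TRUE in that row. The sketch below wraps this in a small helper so any value can be counted; count_pairs is a hypothetical name, not a built-in function.

```r
df <- data.frame(a = c(1, 1, 1, 2, 2),
                 b = c(2, 1, 1, 1, 2),
                 c = c(2, 2, 1, 1, 1),
                 d = c(1, 2, 2, 1, 1))

# hypothetical helper: for every pair of columns, count the rows
# in which both columns equal `value`
count_pairs <- function(df, value) {
  crossprod(as.matrix(df) == value)
}

ones <- count_pairs(df, 1)
twos <- count_pairs(df, 2)

# spot check against the manual count mentioned in the question
ones["a", "b"] == nrow(df[df$a == 1 & df$b == 1, ])
```

For a set of values such as [-1, 1, 2], the helper can simply be called once per value.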
I illustrate my question with a small data frame such as:
X1 X2 X3
1 0 1 2
2 0 1 3
3 0 1 4
4 0 2 3
5 0 2 4
6 0 3 4
7 1 2 3
8 1 2 4
9 1 3 4
10 2 3 4
(The real one will have a huge number of rows...)
I have to expand each row of this data frame with 12 additional values, considering that the 3 values already present are the 3 starting terms of a series defined by the recurrence equation:
U(n) = U(n-1) - Min(U(n-2), U(n-3))
Consider for example the 1st row with 0, 1, 2. The next term (4th) has to be :
2 - Min(1, 0) = 2 - 0 = 2
etc. At the end, my first row will be :
0 1 2 2 1 -1 -2 -1 1 3 4 3 0 -3 -3
And I have to repeat this operation on each row of my initial data frame. Of course, I know I can use intricate for loops to do this, but that is time-consuming.
Is there any way to build the final data frame column by column? (I mean not listing the rows but constructing at once entire columns based on the recurrence equation)
You do not have to write an 'intricate loop' and work row by row. You can write just one simple loop and calculate column by column such as:
# recreate the sample data frame
data <- data.frame(X1 = c(rep(0, 6), 1, 1, 1, 2),
                   X2 = c(1, 1, 1, 2, 2, 3, 2, 2, 3, 3),
                   X3 = c(2, 3, 4, 3, 4, 4, 3, 4, 4, 4))
# create a placeholder data frame full of zeros
temp <- data.frame(matrix(data = 0, nrow = 10, ncol = 15))
# write the original data frame into the first three columns of the placeholder
temp[, 1:3] <- data
# for columns 4 to 15 of the placeholder
for (i in 4:15) {
  # compute the whole column at once; pmin takes the element-wise (per-row) minimum
  temp[, i] <- temp[, i-1] - pmin(temp[, i-2], temp[, i-3])
}
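The same column-by-column idea can be wrapped in a function and checked against the first row worked out in the question. This is a sketch that assumes a three-column input and at least four terms; extend_rows and n_terms are names introduced here. It uses a plain matrix rather than a data frame, which is usually cheaper for column assignment when there are many rows.

```r
# hypothetical helper: extend each row of `start` (3 columns) to `n_terms` terms
# using U(n) = U(n-1) - min(U(n-2), U(n-3)); assumes n_terms >= 4
extend_rows <- function(start, n_terms = 15) {
  out <- matrix(0, nrow = nrow(start), ncol = n_terms)
  out[, 1:3] <- as.matrix(start)
  for (i in 4:n_terms) {
    # pmin works row by row, so each column is filled in one vectorised step
    out[, i] <- out[, i - 1] - pmin(out[, i - 2], out[, i - 3])
  }
  out
}

start <- data.frame(X1 = c(0, 0, 0, 0, 0, 0, 1, 1, 1, 2),
                    X2 = c(1, 1, 1, 2, 2, 3, 2, 2, 3, 3),
                    X3 = c(2, 3, 4, 3, 4, 4, 3, 4, 4, 4))
res <- extend_rows(start)
res[1, ]   # 0 1 2 2 1 -1 -2 -1 1 3 4 3 0 -3 -3
```

The loop runs over the 12 new columns only, so its cost does not grow with the number of rows beyond the vectorised arithmetic inside it.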