r - dplyr mutate refer new column itself

I have a data frame like this named 'a'.
ID V1
1 -1
1 0
1 1
1 1000
1 0
1 1
2 -1
2 0
2 1000
...
I have shortened the data frame here for brevity.
Now I want to create a new column with a conditional mutate(), but the condition needs to refer to the very column that mutate() is creating.
a %>%
group_by(ID) %>%
mutate(V2, ifelse(row_number() == 1, 1,
ifelse(V1 < 1000, 1,
ifelse(V1 >= 1000, lag(V2) + 1))
"Error: Then 'V2' not found" message is produced.
This is the result I want:
ID V1 V2
1 -1 1
1 0 1
1 1 1
1 1000 2
1 0 2
1 1 2
2 -1 1
2 0 1
2 1000 2
How do I get this? Thanks for your help.

We can try
a %>%
group_by(ID) %>%
mutate(V2 = cumsum(V1 >= 1000)+1L)
# ID V1 V2
# <int> <int> <int>
#1 1 -1 1
#2 1 0 1
#3 1 1 1
#4 1 1000 2
#5 1 0 2
#6 1 1 2
#7 2 -1 1
#8 2 0 1
#9 2 1000 2
data
a <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L),
V1 = c(-1L,
0L, 1L, 1000L, 0L, 1L, -1L, 0L, 1000L)), .Names = c("ID", "V1"
), class = "data.frame", row.names = c(NA, -9L))
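For readers wondering why cumsum replaces the self-referencing lag(V2): mutate() builds the whole column at once, so V2 does not exist yet while its own expression is evaluated. The recursion "add 1 to the previous value whenever V1 >= 1000" is simply a running count of those rows within each ID. Here is a minimal base R sketch of the same idea, assuming the example data from the question (ave() applies cumsum per group):

```r
# Example data from the question
a <- data.frame(
  ID = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L),
  V1 = c(-1L, 0L, 1L, 1000L, 0L, 1L, -1L, 0L, 1000L)
)

# 1 where the threshold is reached, 0 elsewhere
hit <- as.integer(a$V1 >= 1000)

# Running count of hits within each ID, shifted so every group starts at 1
a$V2 <- ave(hit, a$ID, FUN = cumsum) + 1L

print(a)
```

This only mirrors the accepted cumsum approach without dplyr; it is not a different algorithm.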

This should work:
a %>% group_by(ID) %>% mutate(V2 = ifelse(row_number() == 1, 1, 0) +
ifelse(row_number() > 1 & V1 <= 1000, 1, 0) +
cumsum(ifelse(V1 >= 1000, 1, 0)))
Update: Changed second ifelse logic statement from row_number() > 1 & V1 < 1000 to that shown above. This alteration should give the results as requested in the comments.

Related

How to filter for multiple possible values?

I have a table that looks something like this:
ID Var1P Var1C Var2P Var2C Var3P Var3P NoDxP NoDxC
101 1 3 3 1 1 1 1 1
102 1 1 1 2 1 1 1 1
103 2 1 1 3 1 1 1 1
104 1 0 2 0 1 1 1 1
What I have been trying to do is keep only the observations whose values are all 0, 1, or 2, i.e. drop any row that has a score of 3 or higher. I'm attempting to use the filter method below. Note that every column in this data frame is of class character:
namesnovalue <- dataframe[c(2:7)]
namesnovalue <- names(namesnovalue)
filternovalue <- function(x) {
filter(dataframe, x == '1' | x == '0' | x == '2')
}
novalue <- sapply(dataframe[namesnovalue], FUN=filternovalue, simplify=TRUE, USE.NAMES=TRUE)
novalue <- as.data.frame(novalue)
I think the function attempts to do what I intended. But before I convert novalue to a data frame, I get the data back as a matrix, and after the conversion I get a data frame made up of those matrices (or so it appears). I'm not sure where I'm writing the argument incorrectly.
For reference, the data output I'm trying to get would be this:
ID Var1P Var1C Var2P Var2C Var3P Var3C NoDxP NoDxC
102 1 1 1 2 1 1 1 1
104 1 0 2 0 1 1 1 1
Thank you all for any help and time!

EDITED based on updated question.
Using dplyr:
library(dplyr)
df1 %>%
  # if_all() needs dplyr >= 1.0.4; using across() inside filter() is deprecated
  filter(if_all(-starts_with("ID"), ~ . < 3))
Result:
ID Var1P Var1C Var2P Var2C Var3P Var3C
1 102 1 1 1 2 1 1
2 104 1 0 2 0 1 1
Where data df1 is:
df1 <- structure(list(ID = 101:104, Var1P = c(1L, 1L, 2L, 1L),
Var1C = c(3L, 1L, 1L, 0L), Var2P = c(3L, 1L, 1L, 2L),
Var2C = c(1L, 2L, 3L, 0L), Var3P = c(1L, 1L, 1L, 1L),
Var3C = c(1L, 1L, 1L, 1L)),
class = "data.frame",
row.names = c(NA, -4L))

In base R, you can use rowSums to select the rows that have no value greater than or equal to 3.
df[rowSums(df[-1] >= 3) == 0, ]
# ID Var1P Var1C Var2P Var2C Var3P Var3P.1 NoDxP NoDxC
#2 102 1 1 1 2 1 1 1 1
#4 104 1 0 2 0 1 1 1 1
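Since the question mentions that the columns are of class character, comparing them with 3 directly would compare strings rather than numbers. A minimal sketch of the same rowSums idea with an explicit numeric conversion follows. The data frame df_chr is a hypothetical, shortened character-typed copy of the example, not the asker's real data:

```r
# Hypothetical character-typed version of (part of) the example table
df_chr <- data.frame(
  ID    = c("101", "102", "103", "104"),
  Var1P = c("1", "1", "2", "1"),
  Var1C = c("3", "1", "1", "0"),
  Var2P = c("3", "1", "1", "2"),
  Var2C = c("1", "2", "3", "0"),
  stringsAsFactors = FALSE
)

# Convert every score column (everything except ID) to numeric
scores <- sapply(df_chr[-1], as.numeric)

# Keep the rows in which no score is 3 or higher
keep <- rowSums(scores >= 3) == 0
result <- df_chr[keep, ]

print(result)
```

Here only IDs 102 and 104 survive, matching the output the question asks for.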

R: subset dataframe for all rows after a condition is met

So I'm having a dataset of the following form:
ID Var1 Var2
1 2 0
1 8 0
1 12 0
1 11 1
1 10 1
2 5 0
2 8 0
2 7 0
2 6 1
2 5 1
I would like to subset the dataframe and create a new dataframe, containing only the rows after Var1 first reached its group-maximum (including the row this happens) up to the row where Var2 becomes 1 for the first time (also including this row). So what I'd like to have should look like this:
ID Var1 Var2
1 12 0
1 11 1
2 8 0
2 7 0
2 6 1
The original dataset contains a number of NAs, and the function should simply ignore those. Also, if Var2 never reaches 1 for a group, it should just add all of that group's rows to the new data frame (of course only the ones after Var1 reaches its group maximum).
However, I cannot wrap my head around the programming. Does anyone know how to do this?

A dplyr solution with cumsum based filter will do what the question asks for.
library(dplyr)
df1 %>%
group_by(ID) %>%
filter(cumsum(Var1 == max(Var1)) == 1, cumsum(Var2) <= 1)
## A tibble: 5 x 3
## Groups: ID [2]
# ID Var1 Var2
# <int> <int> <int>
#1 1 12 0
#2 1 11 1
#3 2 8 0
#4 2 7 0
#5 2 6 1
Edit
Here is a solution that tries to address the OP's comment and question edit.
library(tidyr)  # replace_na() comes from tidyr
df1 %>%
  group_by(ID) %>%
  mutate_at(vars(starts_with('Var')), ~replace_na(., 0L)) %>%
  filter(cumsum(Var1 == max(Var1)) == 1, cumsum(Var2) <= 1)
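As a cross-check, the same selection can be sketched in base R with the two conditions spelled out. This is only an illustration under a few assumptions: the helper name keep_rows is made up, NAs are treated as "not the maximum" and "not 1" rather than replaced by 0, and a row after the first Var2 == 1 is always dropped even if Var2 falls back to 0:

```r
# Example data from the question
df1 <- data.frame(
  ID   = rep(1:2, each = 5),
  Var1 = c(2, 8, 12, 11, 10, 5, 8, 7, 6, 5),
  Var2 = c(0, 0, 0, 1, 1, 0, 0, 0, 1, 1)
)

# Hypothetical helper: logical vector of rows to keep within one group
keep_rows <- function(var1, var2) {
  at_max <- !is.na(var1) & var1 == max(var1, na.rm = TRUE)
  reached_max <- cumsum(at_max) >= 1          # from the first maximum onwards
  is_one <- !is.na(var2) & var2 == 1
  ones_so_far <- cumsum(is_one)
  # rows before the first 1, plus the first 1 itself
  up_to_first_one <- ones_so_far == 0 | (ones_so_far == 1 & is_one)
  reached_max & up_to_first_one
}

# Apply the helper group by group while preserving the original row order
keep <- logical(nrow(df1))
for (idx in split(seq_len(nrow(df1)), df1$ID)) {
  keep[idx] <- keep_rows(df1$Var1[idx], df1$Var2[idx])
}
result <- df1[keep, ]

print(result)
```

Using >= 1 for the first condition also keeps the rows coming if the group maximum occurs more than once.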
Data
df1 <- read.table(text = "
ID Var1 Var2
1 2 0
1 8 0
1 12 0
1 11 1
1 10 1
2 5 0
2 8 0
2 7 0
2 6 1
2 5 1
", header = TRUE)

Using data.table with .I
library(data.table)
setDT(df1)[df1[, .I[cumsum(Var1 == max(Var1)) & cumsum(Var2) <= 1], by="ID"]$V1]
# ID Var1 Var2
#1: 1 12 0
#2: 1 11 1
#3: 2 8 0
#4: 2 7 0
#5: 2 6 1
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
Var1 = c(2L, 8L, 12L, 11L, 10L, 5L, 8L, 7L, 6L, 5L), Var2 = c(0L,
0L, 0L, 1L, 1L, 0L, 0L, 0L, 1L, 1L)), class = "data.frame",
row.names = c(NA,
-10L))

Here is a data.table translation of Rui Barradas' working solution:
library(data.table)
dat <- fread(text = "
ID Var1 Var2
1 2 0
1 8 0
1 12 0
1 11 1
1 10 1
2 5 0
2 8 0
2 7 0
2 6 1
2 5 1
", header = TRUE)
dat[, .SD[cumsum(Var1 == max(Var1)) & cumsum(Var2) <= 1], by="ID"]

Grouping by a column and counting number of positive and negative values corresponding to each value in R

I want to have a list of positive and negative values corresponding to each value that comes after grouping a column. My data looks like this:
dataset <- read.table(text =
"id value
1 4
1 -2
1 0
2 6
2 -4
2 -5
2 -1
3 0
3 0
3 -4
3 -5",
header = TRUE, stringsAsFactors = FALSE)
I want my result to look like this:
id num_pos_value num_neg_value num_zero_value
1 1 1 1
2 1 3 0
3 0 2 2
I also want to extend the result above with columns holding the sums of the positive and of the negative values:
id num_pos num_neg num_zero sum_pos sum_neg
1 1 1 1 4 -2
2 1 3 0 6 -10
3 0 2 2 0 -9

We group by 'id' and sum the logical vectors; each TRUE counts as 1, so the sum is the number of matching rows.
library(dplyr)
df1 %>%
group_by(id) %>%
summarise(num_pos = sum(value > 0),
num_neg = sum(value < 0),
num_zero = sum(value == 0))
# A tibble: 3 x 4
# id num_pos num_neg num_zero
# <int> <int> <int> <int>
#1 1 1 1 1
#2 2 1 3 0
#3 3 0 2 2
Or get the table of sign of 'value' and spread it to 'wide'
library(tidyr)
df1 %>%
group_by(id) %>%
summarise(num = list(table(factor(sign(value), levels = -1:1)))) %>%
unnest %>%
# table() returns the counts in level order -1, 0, 1, i.e. neg, zero, pos
mutate(grp = rep(paste0("num", c("neg", "zero", "pos")), 3)) %>%
spread(grp, num)
Or using count
df1 %>%
count(id, val = sign(value)) %>%
spread(val, n, fill = 0)
data
df1 <- structure(list(id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L,
3L), value = c(4L, -2L, 0L, 6L, -4L, -5L, -1L, 0L, 0L, -4L, -5L
)), class = "data.frame", row.names = c(NA, -11L))
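The answers above cover the counts; the second part of the question (the sums of the positive and negative values) follows the same pattern, subsetting before summing. A minimal base R sketch, assuming the example data from the question; the helper name summarise_group is made up for illustration:

```r
# Example data from the question
dataset <- data.frame(
  id    = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
  value = c(4, -2, 0, 6, -4, -5, -1, 0, 0, -4, -5)
)

# Hypothetical helper: one summary row per group
summarise_group <- function(g) {
  data.frame(
    id       = g$id[1],
    num_pos  = sum(g$value > 0),
    num_neg  = sum(g$value < 0),
    num_zero = sum(g$value == 0),
    sum_pos  = sum(g$value[g$value > 0]),  # sum() of an empty vector is 0
    sum_neg  = sum(g$value[g$value < 0])
  )
}

# Split by id, summarise each piece, and stack the rows again
result <- do.call(rbind, lapply(split(dataset, dataset$id), summarise_group))
rownames(result) <- NULL

print(result)
```

In the dplyr version, the equivalent would be adding sum(value[value > 0]) and sum(value[value < 0]) to the summarise() call.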

Create counter with multiple variables that restart within each subgroup

I have a dataframe with two columns (ident and value). I would like to create a counter that restarts every time ident changes, and also every time value changes within an ident. Here is an example to make it clear.
# ident value counter
#--------------------
# 1 0 1
# 1 0 2
# 1 1 1
# 1 1 2
# 1 1 3
# 1 0 1
# 1 1 1
# 1 1 2
# 2 1 1
# 2 0 1
# 2 0 2
# 2 0 3
I've tried the plyr package
ddply(mydf, .(ident, value), transform, .id = seq_along(ident))
Same result with the data.frame package.

A data.table alternative with the use of the rleid/rowid functions. With rleid you create a run length id for consecutive values, which can be used as a group. 1:.N or rowid can be used to create the counter. The code:
library(data.table)
# option 1:
setDT(d)[, counter := 1:.N, by = .(ident,rleid(value))]
# option 2:
setDT(d)[, counter := rowid(ident, rleid(value))]
which both give:
> d
ident value counter
1: 1 0 1
2: 1 0 2
3: 1 1 1
4: 1 1 2
5: 1 1 3
6: 1 0 1
7: 1 1 1
8: 1 1 2
9: 2 1 1
10: 2 0 1
11: 2 0 2
12: 2 0 3
With dplyr it is a bit less straightforward:
library(dplyr)
d %>%
group_by(ident, val.gr = cumsum(value != lag(value, default = first(value)))) %>%
mutate(counter = row_number()) %>%
ungroup() %>%
select(-val.gr)
As an alternative to the cumsum-function you could also use rleid from data.table.
Used data:
d <- structure(list(ident = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
value = c(0L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 0L, 0L)),
.Names = c("ident", "value"), class = "data.frame", row.names = c(NA, -12L))
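To make the run-length idea more concrete, here is a rough base R sketch of what the grouping amounts to: a run id that increases whenever ident or value differs from the previous row, followed by a position counter within each run. The variable names changed and run_id are just illustrative:

```r
# Example data from the question
d <- data.frame(
  ident = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2),
  value = c(0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0)
)

n <- nrow(d)

# TRUE on the first row and wherever ident or value differs from the row before
changed <- c(TRUE,
             d$ident[-1] != d$ident[-n] | d$value[-1] != d$value[-n])

# Each TRUE starts a new run, so the cumulative sum labels the runs
run_id <- cumsum(changed)

# Position of each row within its run
d$counter <- ave(seq_len(n), run_id, FUN = seq_along)

print(d)
```

Comparing both columns with the previous row is what makes the counter restart at the ident boundary even when value stays the same.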

We can paste the two columns together and use the lengths component of rle to get the length of each run of consecutive identical values. We then use sequence to generate the counter.
df$counter <- sequence(rle(paste0(df$ident, df$value))$lengths)
df
# ident value counter
#1 1 0 1
#2 1 0 2
#3 1 1 1
#4 1 1 2
#5 1 1 3
#6 1 0 1
#7 1 1 1
#8 1 1 2
#9 2 1 1
#10 2 0 1
#11 2 0 2
#12 2 0 3

In R, how can I elegantly compute the medians for multiple columns, and then count the number of cells in each row that exceed the median?

Suppose I have the following data frame:
Base Coupled Derived Decl
1 0 0 1
1 7 0 1
1 1 0 1
2 3 12 1
1 0 4 1
Here is the dput output:
temp <- structure(list(Base = c(1L, 1L, 1L, 2L, 1L), Coupled = c(0L,7L, 1L, 3L, 0L), Derived = c(0L, 0L, 0L, 12L, 4L), Decl = c(1L, 1L, 1L, 1L, 1L)), .Names = c("Base", "Coupled", "Derived", "Decl"), row.names = c(NA, 5L), class = "data.frame")
I want to compute the median for each column. Then, for each row, I want to count the number of cell values greater than the median for their respective columns and append this as a column called AboveMedians.
In the example, the medians would be c(1,1,0,1). The resulting table I want would be
Base Coupled Derived Decl AboveMedians
1 0 0 1 0
1 7 0 1 1
1 1 0 1 0
2 3 12 1 3
1 0 4 1 1
What is the elegant R way to do this? I have something involving a for-loop and sapply, but this doesn't seem optimal.
Thanks.

We can use colMedians from matrixStats after converting the data.frame to a matrix.
library(matrixStats)
Medians <- colMedians(as.matrix(temp))
Medians
#[1] 1 1 0 1
Then, replicate the 'Medians' to make the dimensions equal to that of 'temp', do the comparison and get the rowSums on the logical matrix.
temp$AboveMedians <- rowSums(temp >Medians[col(temp)])
temp$AboveMedians
#[1] 0 1 0 3 1
Or a base R only option is
apply(temp, 2, median)
# Base Coupled Derived Decl
# 1 1 0 1
rowSums(sweep(temp, 2, apply(temp, 2, median), FUN = ">"))
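If the sweep() call feels opaque, the same computation can be written step by step. This is just a sketch in base R on the example data: compute one median per column, compare each column with its own median, then count the TRUEs per row:

```r
# Example data from the question
temp <- data.frame(
  Base    = c(1, 1, 1, 2, 1),
  Coupled = c(0, 7, 1, 3, 0),
  Derived = c(0, 0, 0, 12, 4),
  Decl    = c(1, 1, 1, 1, 1)
)

# One median per column
meds <- vapply(temp, median, numeric(1))

# Compare every column with its own median; gives a 5 x 4 logical matrix
above <- mapply(function(column, m) column > m, temp, meds)

# Count, per row, how many cells exceed their column median
temp$AboveMedians <- rowSums(above)

print(meds)
print(temp)
```

Note that the medians are computed before AboveMedians is appended, so the new column does not affect them.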

Another alternative:
library(dplyr)
library(purrrlyr)  # by_row() now lives in purrrlyr, not purrr
temp %>%
by_row(function(x) {
sum(x > summarise_each(., funs(median))) },
.to = "AboveMedian",
.collate = "cols"
)
Which gives:
#Source: local data frame [5 x 5]
#
# Base Coupled Derived Decl AboveMedian
# <int> <int> <int> <int> <int>
#1 1 0 0 1 0
#2 1 7 0 1 1
#3 1 1 0 1 0
#4 2 3 12 1 3
#5 1 0 4 1 1
