R: codify a variable based on previous observations and on another variable

Consider the following data:
Var1 Var2 Target
A 0 no
A 250 no
A 0 si
A 0 si
B 0 no
B 0 no
B 0 no
B 250 no
C 0 no
C 250 no
C 0 si
C 250 no
Look at the variable called Target. I need to reproduce it with the same values.
The condition to obtain "si" or "no" is the following:
within the same level of Var1 (e.g. A), if Var2 = 250 and the following values are 0, then Target = "si" for those following rows.
I made this code:
df$Target <- NA
for(i in unique(df$Var1)){
  subset.data.frame(df, Var1==i)
  for(n in 1:length(df$Var1)){
    df$Target <-
      ifelse(df$Var2[n]==250 && df$Var2[n+1]==0 && df$Var1[n+1]==df$Var1[n], "si", "no")
  }
}
But I get Target=si only if the next Var2=0.
Instead, as described in the dataset above, all observations with Var2=0 after a 250 have to be Target=si.
Could you help me to solve the problem, please?
Thank you,
Andrea

Solution
library(dplyr)
df %>%
group_by(Var1) %>%
mutate(Target = ifelse(cumsum(lag(Var2, default=0) == 250) > 0
& Var2 == 0, 'si', 'no'))
Result
# A tibble: 12 x 3
# Groups: Var1 [3]
Var1 Var2 Target
<fctr> <int> <chr>
1 A 0 no
2 A 250 no
3 A 0 si
4 A 0 si
5 B 0 no
6 B 0 no
7 B 0 no
8 B 250 no
9 C 0 no
10 C 250 no
11 C 0 si
12 C 250 no
Explanation
We use dplyr to group df by the levels of Var1. Within each group, cumsum(lag(Var2, default=0) == 250) > 0 tells us, for every row, whether any previous observation of Var2 in that group was 250, and Var2 == 0 tells us whether the current observation of Var2 is 0. If both conditions are TRUE, we code Target as "si"; otherwise we code it as "no".
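To see the two conditions separately, here is a minimal base R sketch of the same logic. It is an illustration rather than the answer's own code: it uses ave() in place of group_by(), keeps Var1 as a character column for simplicity, and the helper name seen_250_before is made up for this example.

```r
# Same values as in the question (Var1 kept as character for simplicity)
df <- data.frame(
  Var1 = rep(c("A", "B", "C"), each = 4),
  Var2 = c(0, 250, 0, 0, 0, 0, 0, 250, 0, 250, 0, 250)
)

# Hypothetical helper: TRUE once a 250 has been seen on an EARLIER row
seen_250_before <- function(v) {
  previous <- c(0, head(v, -1))   # shift down by one, like lag(v, default = 0)
  cumsum(previous == 250) > 0
}

# ave() applies the helper within each level of Var1; it returns 0/1 here
# because Var2 is numeric, so convert back to logical
after_250 <- as.logical(ave(df$Var2, df$Var1, FUN = seen_250_before))

# "si" only when a 250 came earlier in the group AND the current value is 0
df$Target <- ifelse(after_250 & df$Var2 == 0, "si", "no")
df
```

Printing after_250 next to Var2 shows why row 12 (C, 250) stays "no": a 250 was seen earlier in the group, but the current value is not 0.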
Data
The data I started with for df are
structure(list(Var1 = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 3L, 3L, 3L, 3L), .Label = c("A", "B", "C"), class = "factor"),
Var2 = c(0L, 250L, 0L, 0L, 0L, 0L, 0L, 250L, 0L, 250L, 0L,
250L)), .Names = c("Var1", "Var2"), row.names = c(NA, -12L
), class = "data.frame")
Comparison to akrun's Solution
The output of akrun's solution is below so you can determine which approach is more appropriate for your problem.
# A tibble: 12 x 3
# Groups: Var1 [3]
Var1 Var2 Target
<fctr> <int> <chr>
1 A 0 si
2 A 250 no
3 A 0 no
4 A 0 no
5 B 0 no
6 B 0 no
7 B 0 si
8 B 250 no
9 C 0 si
10 C 250 no
11 C 0 si
12 C 250 no

We can use dplyr
library(dplyr)
df1 %>%
group_by(Var1) %>%
mutate(Target = replace(Target, Var2==0 & lead(Var2, default = Var2[n()])==250, 'si'))

Related

How can I count the frequency of AMT by A, B, C and D?

C1:
AMT A B C D
1 13 0 1 0 0
2 17 0 0 1 0
3 19 0 0 0 1
4 1 0 0 1 0
5 9 0 1 0 0
I tried:
C2 <- t(as.matrix(C1[1])) %*% as.matrix(C1[2:5])
but it gives me the total sum by region.
My desired output combines A, B, C and D into one column (since they are binary) and then counts the frequency by type, i.e.
AMT GROUP N
1 1 A 1
2 9 B 1
3 13 B 1
4 17 C 1
5 19 D 1
...
AMT is not limited to 1, 9, 13, 17, ...; it ranges from 0 to 30.
res <- C1 %>% group_by( ) %>% summarise(Freq=n())
library(tidyverse)
C1 %>%
  tidyr::pivot_longer(
    cols = A:D,
    names_to = "Names",
    values_to = "Values"
  ) %>%
  group_by(Names) %>%
  filter(Values == 1) %>%
  # summarise() already returns only Names and AMT, so no select() is needed
  summarise(AMT = sum(AMT))
Output:
Names AMT
<chr> <dbl>
1 B 22
2 C 18
3 D 19
You can use max.col to get the column name which has value 1 in it.
library(dplyr)
C1 %>%
transmute(AMT,
GROUP = names(.)[-1][max.col(select(., -1))],
N = 1) %>%
arrange(AMT) -> res
res
# AMT GROUP N
#4 1 C 1
#5 9 B 1
#1 13 B 1
#2 17 C 1
#3 19 D 1
data
C1 <- structure(list(AMT = c(13L, 17L, 19L, 1L, 9L), A = c(0L, 0L,
0L, 0L, 0L), B = c(1L, 0L, 0L, 0L, 1L), C = c(0L, 1L, 0L, 1L,
0L), D = c(0L, 0L, 1L, 0L, 0L)), class = "data.frame", row.names = c(NA, -5L))
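The max.col idea can also be sketched in base R, without dplyr. This is only an illustration: it assumes every row has exactly one 1 among A to D (a row of all zeros would silently be assigned to the first column), and the names indicators, long and res are chosen for this example.

```r
C1 <- data.frame(
  AMT = c(13L, 17L, 19L, 1L, 9L),
  A = c(0L, 0L, 0L, 0L, 0L),
  B = c(1L, 0L, 0L, 0L, 1L),
  C = c(0L, 1L, 0L, 1L, 0L),
  D = c(0L, 0L, 1L, 0L, 0L)
)

# Matrix of the 0/1 indicator columns only
indicators <- as.matrix(C1[c("A", "B", "C", "D")])

# max.col() gives, per row, the index of the column holding the largest value,
# i.e. the column that contains the 1
group <- colnames(indicators)[max.col(indicators, ties.method = "first")]

# One row per observation, then count rows per (AMT, GROUP) combination
long <- data.frame(AMT = C1$AMT, GROUP = group, N = 1L)
res <- aggregate(N ~ AMT + GROUP, data = long, FUN = sum)
res <- res[order(res$AMT), ]
res
```

With repeated AMT values in the real data, N would count how often each AMT occurs within each group.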

R: subset dataframe for all rows after a condition is met

So I'm having a dataset of the following form:
ID Var1 Var2
1 2 0
1 8 0
1 12 0
1 11 1
1 10 1
2 5 0
2 8 0
2 7 0
2 6 1
2 5 1
I would like to subset the dataframe and create a new dataframe containing only the rows from the point where Var1 first reaches its group maximum (including the row where this happens) up to the row where Var2 becomes 1 for the first time (also including this row). So what I'd like to have should look like this:
ID Var1 Var2
1 12 0
1 11 1
2 8 0
2 7 0
2 6 1
The original dataset contains a number of NAs, and the function should simply ignore those. Also, if Var2 never reaches 1 for a group, it should just add all rows to the new dataframe (of course only the ones after Var1 reaches its group maximum).
However, I cannot wrap my head around the programming. Can anyone help?
A dplyr solution with cumsum based filter will do what the question asks for.
library(dplyr)
df1 %>%
group_by(ID) %>%
filter(cumsum(Var1 == max(Var1)) == 1, cumsum(Var2) <= 1)
## A tibble: 5 x 3
## Groups: ID [2]
# ID Var1 Var2
# <int> <int> <int>
#1 1 12 0
#2 1 11 1
#3 2 8 0
#4 2 7 0
#5 2 6 1
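To make the two cumsum conditions visible, here is a minimal base R sketch that computes them as separate vectors with ave(). It is an illustration, not the answer's code; the names reached_max, ones_so_far and keep are made up here. It also uses >= 1 rather than == 1 for the first condition, on the assumption that rows should be kept even if the group maximum occurs more than once.

```r
df1 <- data.frame(
  ID   = rep(1:2, each = 5),
  Var1 = c(2, 8, 12, 11, 10, 5, 8, 7, 6, 5),
  Var2 = c(0, 0, 0, 1, 1, 0, 0, 0, 1, 1)
)

# 0 before the group maximum is reached, >= 1 from that row onwards
reached_max <- ave(df1$Var1, df1$ID, FUN = function(v) cumsum(v == max(v)))

# Running count of rows with Var2 == 1 within each ID; it is still <= 1
# on the first row where Var2 is 1, so that row is included
ones_so_far <- ave(df1$Var2, df1$ID, FUN = cumsum)

keep <- reached_max >= 1 & ones_so_far <= 1
res <- df1[keep, ]
res
```

If Var2 never reaches 1 in a group, ones_so_far stays at 0 and every row from the maximum onwards is kept, which matches the behaviour asked for in the question.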
Edit
Here is a solution that addresses the OP's comment and question edit by replacing NA values with 0 before filtering.
library(tidyr)  # provides replace_na()
df1 %>%
group_by(ID) %>%
mutate_at(vars(starts_with('Var')), ~replace_na(., 0L)) %>%
filter(cumsum(Var1 == max(Var1)) == 1, cumsum(Var2) <= 1)
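The effect of replacing NA with 0 can be sketched in base R as follows. The data frame df_na is hypothetical (the question does not show its NA rows), and the approach assumes the real Var1 values are positive, since otherwise a filled-in 0 could itself become the group maximum.

```r
# Hypothetical data: the question's values with a few NAs inserted
df_na <- data.frame(
  ID   = rep(1:2, each = 5),
  Var1 = c(2, NA, 12, 11, 10, 5, 8, 7, NA, 5),
  Var2 = c(0, 0, NA, 1, 1, 0, 0, 0, 1, 1)
)

# Work on a copy where NA is replaced by 0, so cumsum() and max() never see NA
filled <- df_na
filled$Var1[is.na(filled$Var1)] <- 0
filled$Var2[is.na(filled$Var2)] <- 0

reached_max <- ave(filled$Var1, filled$ID, FUN = function(v) cumsum(v == max(v)))
ones_so_far <- ave(filled$Var2, filled$ID, FUN = cumsum)
keep <- reached_max >= 1 & ones_so_far <= 1

# Subset the ORIGINAL data so the NAs are still visible in the result
res_na <- df_na[keep, ]
res_na
```

Without the replacement, a single NA would turn max() and every later cumsum() value in that group into NA.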
Data
df1 <- read.table(text = "
ID Var1 Var2
1 2 0
1 8 0
1 12 0
1 11 1
1 10 1
2 5 0
2 8 0
2 7 0
2 6 1
2 5 1
", header = TRUE)
Using data.table with .I
library(data.table)
setDT(df1)[df1[, .I[cumsum(Var1 == max(Var1)) & cumsum(Var2) <= 1], by="ID"]$V1]
# ID Var1 Var2
#1: 1 12 0
#2: 1 11 1
#3: 2 8 0
#4: 2 7 0
#5: 2 6 1
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
Var1 = c(2L, 8L, 12L, 11L, 10L, 5L, 8L, 7L, 6L, 5L), Var2 = c(0L,
0L, 0L, 1L, 1L, 0L, 0L, 0L, 1L, 1L)), class = "data.frame",
row.names = c(NA,
-10L))
Here is data.table translation of Rui Barradas' working solution:
library(data.table)
dat <- fread(text = "
ID Var1 Var2
1 2 0
1 8 0
1 12 0
1 11 1
1 10 1
2 5 0
2 8 0
2 7 0
2 6 1
2 5 1
", header = TRUE)
dat[, .SD[cumsum(Var1 == max(Var1)) & cumsum(Var2) <= 1], by="ID"]

Grouping by a column and counting number of positive and negative values corresponding to each value in R

I want to have a list of positive and negative values corresponding to each value that comes after grouping a column. My data looks like this:
dataset <- read.table(text =
"id value
1 4
1 -2
1 0
2 6
2 -4
2 -5
2 -1
3 0
3 0
3 -4
3 -5",
header = TRUE, stringsAsFactors = FALSE)
I want my result to look like this:
id num_pos_value num_neg_value num_zero_value
1 1 1 1
2 1 3 0
3 0 2 2
I want to extend the columns of the above result by adding sum of the positive and negative values.
id num_pos num_neg num_zero sum_pos sum_neg
1 1 1 1 4 -2
2 1 3 0 6 -10
3 0 2 2 0 -9
We group by 'id' and sum logical vectors: sum(value > 0) counts the positive values, and likewise for the negative and zero values.
library(dplyr)
df1 %>%
group_by(id) %>%
summarise(num_pos = sum(value > 0),
num_neg = sum(value < 0),
num_zero = sum(value == 0))
# A tibble: 3 x 4
# id num_pos num_neg num_zero
# <int> <int> <int> <int>
#1 1 1 1 1
#2 2 1 3 0
#3 3 0 2 2
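The question also asks for the sums of the positive and negative values. A minimal base R sketch of that extension, assuming the same data as in the question (the helper name summarise_id is made up for this example):

```r
dataset <- data.frame(
  id    = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
  value = c(4, -2, 0, 6, -4, -5, -1, 0, 0, -4, -5)
)

# Hypothetical helper: one summary row for the rows of a single id
summarise_id <- function(d) {
  data.frame(
    id       = d$id[1],
    num_pos  = sum(d$value > 0),            # TRUE counts as 1
    num_neg  = sum(d$value < 0),
    num_zero = sum(d$value == 0),
    sum_pos  = sum(d$value[d$value > 0]),   # sum of an empty subset is 0
    sum_neg  = sum(d$value[d$value < 0])
  )
}

# Split by id, summarise each piece, and stack the results
res <- do.call(rbind, lapply(split(dataset, dataset$id), summarise_id))
rownames(res) <- NULL
res
```

In the dplyr version the same two columns could be added inside summarise() as sum(value[value > 0]) and sum(value[value < 0]).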
Or get the table of sign of 'value' and spread it to 'wide'
library(tidyr)
df1 %>%
group_by(id) %>%
summarise(num = list(table(factor(sign(value), levels = -1:1)))) %>%
unnest %>%
# labels follow the factor levels -1, 0, 1, i.e. neg, zero, pos
mutate(grp = rep(paste0("num_", c("neg", "zero", "pos")), 3)) %>%
spread(grp, num)
Or using count
df1 %>%
count(id, val = sign(value)) %>%
spread(val, n, fill = 0)
data
df1 <- structure(list(id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L,
3L), value = c(4L, -2L, 0L, 6L, -4L, -5L, -1L, 0L, 0L, -4L, -5L
)), class = "data.frame", row.names = c(NA, -11L))

Subsetting and repetition of rows in a dataframe using R

Suppose we have the following data with column names "id", "time" and "x":
df<-
structure(
list(
id = c(1L, 1L, 1L, 2L, 2L, 3L, 3L),
time = c(20L, 6L, 7L, 11L, 13L, 2L, 6L),
x = c(1L, 1L, 0L, 1L, 1L, 1L, 0L)
),
.Names = c("id", "time", "x"),
class = "data.frame",
row.names = c(NA,-7L)
)
Each id has multiple observations for time and x. I want to extract the last observation for each id and form a new dataframe which repeats these observations according to the number of observations per id in the original data. I am able to extract the last observation for each id using the following code:
library(dplyr)
df<-df%>%
group_by(id) %>%
filter( ((x)==0 & row_number()==n())| ((x)==1 & row_number()==n()))
What is left unresolved is the repetition aspect. The expected output would look like
df <-
structure(
list(
id = c(1L, 1L, 1L, 2L, 2L, 3L, 3L),
time = c(7L, 7L, 7L, 13L, 13L, 6L, 6L),
x = c(0L, 0L, 0L, 1L, 1L, 0L, 0L)
),
.Names = c("id", "time", "x"),
class = "data.frame",
row.names = c(NA,-7L)
)
Thanks for your help in advance.
We can use ave to find the max row number for each ID and subset it from the data frame.
df[ave(1:nrow(df), df$id, FUN = max), ]
# id time x
#3 1 7 0
#3.1 1 7 0
#3.2 1 7 0
#5 2 13 1
#5.1 2 13 1
#7 3 6 0
#7.1 3 6 0
You can do this by using last() to grab the last row within each id.
df %>%
group_by(id) %>%
mutate(time = last(time),
x = last(x))
Because last(x) returns a single value, it gets expanded out to fill all the rows in the mutate() call.
This can also be applied to an arbitrary number of variables using mutate_at:
df %>%
group_by(id) %>%
mutate_at(vars(-id), ~ last(.))
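The same "take the last value and spread it over the group" idea can be sketched in base R, column by column. This is an illustration under the assumption that the rows are already in the desired order within each id; the helper name last_value is made up here.

```r
df <- data.frame(
  id   = c(1L, 1L, 1L, 2L, 2L, 3L, 3L),
  time = c(20L, 6L, 7L, 11L, 13L, 2L, 6L),
  x    = c(1L, 1L, 0L, 1L, 1L, 1L, 0L)
)

# Hypothetical helper: repeat the last element over the whole group
last_value <- function(v) rep(v[length(v)], length(v))

res <- df
res$time <- ave(df$time, df$id, FUN = last_value)
res$x    <- ave(df$x, df$id, FUN = last_value)
res
```

The explicit rep() spells out the expansion that mutate() performs automatically when last() returns a single value.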
slice will be your friend in the tidyverse I reckon:
df %>%
group_by(id) %>%
slice(rep(n(),n()))
## A tibble: 7 x 3
## Groups: id [3]
# id time x
# <int> <int> <int>
#1 1 7 0
#2 1 7 0
#3 1 7 0
#4 2 13 1
#5 2 13 1
#6 3 6 0
#7 3 6 0
In data.table, you could also use the mult= argument of a join:
library(data.table)
setDT(df)
df[df[,.(id)], on="id", mult="last"]
# id time x
#1: 1 7 0
#2: 1 7 0
#3: 1 7 0
#4: 2 13 1
#5: 2 13 1
#6: 3 6 0
#7: 3 6 0
And in base R, a merge will get you there too:
merge(df["id"], df[!duplicated(df$id, fromLast=TRUE),])
# id time x
#1 1 7 0
#2 1 7 0
#3 1 7 0
#4 2 13 1
#5 2 13 1
#6 3 6 0
#7 3 6 0
Using data.table you can try
library(data.table)
setDT(df)[,.(time=rep(time[.N],.N), x=rep(x[.N],.N)), by=id]
id time x
1: 1 7 0
2: 1 7 0
3: 1 7 0
4: 2 13 1
5: 2 13 1
6: 3 6 0
7: 3 6 0
Following @thelatemai, to avoid naming the columns you can also try
df[, .SD[rep(.N,.N)], by=id]
id time x
1: 1 7 0
2: 1 7 0
3: 1 7 0
4: 2 13 1
5: 2 13 1
6: 3 6 0
7: 3 6 0

Reshaping only a few columns in a dataframe

I am trying to reshape a dataframe in R. Here is the dataframe I have in dput:
dput(newdata)
structure(list(var1 = c(0L, 0L, 0L, 0L, 0L, 0L), var2 = c(0L,
0L, 0L, 0L, 0L, 0L), var3 = c(0L, 0L, 0L, 0L, 0L, 0L), Date = structure(c(15260,
15260, 15260, 15169, 15169, 15169), class = "Date"), Success = structure(c(2L,
1L, 1L, 2L, 1L, 1L), .Label = c("N", "Y"), class = "factor")), .Names = c("var1",
"var2", "var3", "Date", "Success"), row.names = c(NA, 6L), class = "data.frame")
The output I am looking for:
Variable Date N Y
var1 3/2/2012 0 1
var1 3/4/2012 0 1
var1 3/6/2012 0 1
var2 3/2/2012 1 0
var2 3/4/2012 1 0
var2 3/6/2012 1 0
var3 3/2/2012 0 1
var3 3/4/2012 0 1
var3 3/6/2012 0 1
I am fairly new to R. I have been trying to use the reshape() function but have been unsuccessful so far. Any insight would be hugely appreciated. Thank you.
Thank you for providing reproducible input and desired output. This helps a lot. Unfortunately, your input as presented is flawed: rows 2 and 3 in your data frame are identical, and so are rows 5 and 6. It would not be possible to perform your desired data transformation correctly on such data.
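The duplicates can be checked directly with duplicated(). A minimal sketch, rebuilding the question's data by hand from the dput output (the names dup and unique_rows are only for illustration):

```r
newdata <- data.frame(
  var1 = rep(0L, 6),
  var2 = rep(0L, 6),
  var3 = rep(0L, 6),
  Date = as.Date(c(15260, 15260, 15260, 15169, 15169, 15169),
                 origin = "1970-01-01"),
  Success = factor(c("Y", "N", "N", "Y", "N", "N"))
)

# TRUE for every row that repeats an earlier row in all columns
dup <- duplicated(newdata)
which(dup)   # rows 3 and 6 repeat rows 2 and 5

# Keep only the first copy of each distinct row
unique_rows <- newdata[!dup, ]
nrow(unique_rows)
```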
Assuming your duplicate rows are not relevant, you can accomplish your desired output via tidyr::spread() and tidyr::gather(). I call your data structure df:
library("dplyr")
library("tidyr")
# call to duplicated() removes all identical rows from df
wide <- df %>%
filter(!duplicated(.)) %>%
gather(Variable, value, starts_with("var")) %>%
spread(Success, value, fill = NA, drop = FALSE)
wide
Date Variable N Y
1 2011-07-14 var1 0 0
2 2011-07-14 var2 0 0
3 2011-07-14 var3 0 0
4 2011-10-13 var1 0 0
5 2011-10-13 var2 0 0
6 2011-10-13 var3 0 0
So as kgolyaev stated, you have duplicate rows which means that spread can't simplify down to a single row when spreading the columns. One way around this is to just use a mutate with ifelse instead of spreading. This works because you just have "N" and "Y" for Success values. Had it been 12 unique values, it would have been a different solution.
We can gather the vars into vars and num. And then we can just use a simple nested ifelse statement to get the 1s and 0s. Then remove unneeded columns and arrange by Date.
library(tidyverse)
df %>% gather("vars", "num", -c(Date, Success)) %>%
mutate(Y = ifelse(Success == "N", 0, 1),
N = ifelse(Success == "N", 1, 0)) %>%
select(-c(Success, num)) %>%
arrange(Date)
Date vars Y N
1 2011-07-14 var1 1 0
2 2011-07-14 var1 0 1
3 2011-07-14 var1 0 1
4 2011-07-14 var2 1 0
5 2011-07-14 var2 0 1
6 2011-07-14 var2 0 1
7 2011-07-14 var3 1 0
8 2011-07-14 var3 0 1
9 2011-07-14 var3 0 1
10 2011-10-13 var1 1 0
11 2011-10-13 var1 0 1
12 2011-10-13 var1 0 1
13 2011-10-13 var2 1 0
14 2011-10-13 var2 0 1
15 2011-10-13 var2 0 1
16 2011-10-13 var3 1 0
17 2011-10-13 var3 0 1
18 2011-10-13 var3 0 1

Resources