Fill a column in a data frame only one step up - r

I want to fill one variable in a data frame one step up:
> id <- rep(1:3,each=2)
> trt <- rep(c("A","B"),3)
> score <- c("1", "","", 3, "",6)
> df <- data.frame(id,trt,score)
> df
  id trt score
1  1   A     1
2  1   B
3  2   A
4  2   B     3
5  3   A
6  3   B     6
I want it to look like this:
> id <- rep(1:3,each=2)
> trt <- rep(c("A","B"),3)
> score <- c(1, "",3, 3, 6,6)
> df <- data.frame(id,trt,score)
> df
  id trt score
1  1   A     1
2  1   B
3  2   A     3
4  2   B     3
5  3   A     6
6  3   B     6
I know this code fills the column all the way up, but I just want it to fill one step up. Is that possible to do?
library(tidyr)
library(dplyr)
df %>% fill(score, .direction = "up")

To use fill we need NAs, whereas here the missing entries are empty strings. We can conditionally replace blanks with NA, but only in the one row directly above each non-blank value, and then use fill:
library(dplyr)
df %>%
  mutate(score = replace(score, which(score != "") - 1, NA)) %>%
  tidyr::fill(score, .direction = "up")
# id trt score
#1 1 A 1
#2 1 B
#3 2 A 3
#4 2 B 3
#5 3 A 6
#6 3 B 6
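A small note on the edge case (not in the original answer): if the very first row is non-blank, as it is here, which(score != "") - 1 yields an index of 0, and replace() with a zero index is simply a no-op, so nothing breaks:
# a zero index selects nothing, so replace() leaves the vector unchanged
replace(c("1", "", "3"), 0, NA)
#> [1] "1" ""  "3"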
An alternative and simple base R option would be
inds <- which(df$score != '')
inds <- inds[inds > 1]
df$score[inds - 1] <- df$score[inds]
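Starting from the original df (the one with blanks), this should reproduce the desired column (assuming R >= 4.0, where score stays a character column):
df$score
#> [1] "1" ""  "3" "3" "6" "6"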

Related

How can I remove rows with the same value in 2 or more rows in R

I have a dataframe in the following format with IDs and A/B's. The dataframe is very long, with over 3000 IDs.
id  type
1   A
2   B
3   A
4   A
5   B
6   A
7   B
8   A
9   B
10  A
11  A
12  A
13  B
... ...
I need to remove all rows (the A's plus the following B) wherever more than one A appears in a row. So I don't want to remove just the duplicates: if there is a run of 2 or more A's, I want to remove all of those A's and the B that follows, up to the next A.
id  type
1   A
2   B
6   A
7   B
8   A
9   B
... ...
Do I need a loop for this problem? I hope someone can help, thank you!
This might be what you want:
First, define a function that notes the indices of what you want to remove:
row_sequence <- function(value) {
  # indices where a value is immediately repeated
  inds <- which(value == lead(value))
  # those rows plus the two rows that follow each of them
  sort(unique(c(inds, inds + 1, inds + 2)))
}
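For illustration, on a short made-up type vector the function returns the indices of the duplicated A's plus the following rows (lead() comes from dplyr):
library(dplyr)
row_sequence(c("A", "B", "A", "A", "B"))
#> [1] 3 4 5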
Apply the function to your dataframe by first extracting the rows that you want to remove into df1, and then anti_join-ing df1 with df to obtain the final dataframe:
library(dplyr)
df1 <- df %>% slice(row_sequence(type))
df2 <- df %>% anti_join(., df1)
Result:
df2
id type
1 1 A
2 2 B
3 6 A
4 7 B
5 8 A
6 9 B
Data:
df <- data.frame(
  id = 1:13,
  type = c("A","B","A","A","B","A","B","A","B","A","A","A","B")
)
I assumed there is only one B after a series of duplicated A values; if that is not the case, just let me know and I will modify my code:
library(dplyr)
library(tidyr)
library(data.table)
df %>%
  mutate(rles = data.table::rleid(type)) %>%
  group_by(rles) %>%
  mutate(rles = ifelse(length(rles) > 1, NA, rles)) %>%
  ungroup() %>%
  mutate(rles = ifelse(!is.na(rles) & is.na(lag(rles)) & type == "B", NA, rles)) %>%
  drop_na() %>%
  select(-rles)
# A tibble: 6 x 2
id type
<int> <chr>
1 1 A
2 2 B
3 6 A
4 7 B
5 8 A
6 9 B
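As a side note on how this works: data.table::rleid() gives every run of equal values its own id, so a run of duplicated A's becomes a group of length > 1, which is what the pipeline marks with NA. A quick made-up illustration:
data.table::rleid(c("A", "B", "A", "A", "A", "B"))
#> [1] 1 2 3 3 3 4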
Data
df <- read.table(header = TRUE, text = "
id type
1 A
2 B
3 A
4 A
5 B
6 A
7 B
8 A
9 B
10 A
11 A
12 A
13 B")

Remove columns that have only a unique value

I want to remove columns that have only a unique value.
First, I try it for a single column and it works:
data %>%
  select_if(length(unique(data$policy_id)) > 1)
then I try it for multiple columns as below:
data %>%
  select_if(length(unique(data[, c("policy_date", "policy_id")])) > 1)
but it does not work. I think it is a conceptual mistake due to my lack of experience. Thanks in advance!
You can use select(where()).
Suppose I have a data frame like this:
df <- data.frame(A = LETTERS[1:5], B = 1:5, C = 2)
df
#> A B C
#> 1 A 1 2
#> 2 B 2 2
#> 3 C 3 2
#> 4 D 4 2
#> 5 E 5 2
Then I can do:
df %>% select(where(~ n_distinct(.) > 1))
#> A B
#> 1 A 1
#> 2 B 2
#> 3 C 3
#> 4 D 4
#> 5 E 5
Some base R options:
Using lengths + unique + sapply:
subset(df, select = lengths(sapply(df, unique)) > 1)
Using Filter + length + unique:
Filter(function(x) length(unique(x)) > 1, df)
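With the example df from above (only C is constant), both options should return just columns A and B:
Filter(function(x) length(unique(x)) > 1, df)
#>   A B
#> 1 A 1
#> 2 B 2
#> 3 C 3
#> 4 D 4
#> 5 E 5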
Does this work:
> df <- data.frame(col1 = 1:10,
+ col2 = rep(10,10),
+ col3 = round(rnorm(10,1)))
> df
col1 col2 col3
1 1 10 1
2 2 10 0
3 3 10 1
4 4 10 1
5 5 10 1
6 6 10 0
7 7 10 2
8 8 10 1
9 9 10 1
10 10 10 1
> df %>% select_if(~length(unique(.)) > 1)
col1 col3
1 1 1
2 2 0
3 3 1
4 4 1
5 5 1
6 6 0
7 7 2
8 8 1
9 9 1
10 10 1
Another option would be to use purrr:
df %>% purrr::keep(~all(n_distinct(.) > 1))
df %>% purrr::keep(~all(length(unique(.)) > 1))
df %>% purrr::discard(~!all(n_distinct(.) > 1))
df %>% purrr::discard(~!all(length(unique(.)) > 1))
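Since n_distinct() returns a single number per column, the all() wrappers above are redundant; a slightly simpler sketch of the same idea:
library(dplyr)
library(purrr)
df %>% keep(~ n_distinct(.x) > 1)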
Mixing table with apply generates the same output.
df[, apply(df, 2, function(i) length(table(i)) > 1)]
df <- data.frame(A = LETTERS[1:5], B = 1:5, C = 2)
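One caveat worth knowing about the apply() version: apply() coerces the data frame to a matrix first, so mixed columns all become character. length(table(...)) still counts distinct values after that coercion, so the result is unchanged here:
# the constant column becomes c("2","2","2","2","2") after coercion,
# which still yields a single table entry
length(table(c("2", "2", "2", "2", "2")))
#> [1] 1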
An option with base R
df[sapply(df, function(x) length(unique(x))) > 1]
data
df <- data.frame(A = LETTERS[1:5], B = 1:5, C = 2)

R: Generating indicators that values differ within groups

I have a data frame where each row is an observation and I have two columns:
the group membership of the observation
the outcome for the observation.
I'm trying to create a new variable outcome_change that takes a value of 1 if outcome is NOT identical for all observations in a given group and 0 otherwise.
The code below shows an example of the data I have (dat). Meanwhile, dat_out1 shows what I'm looking for the code to produce when there are no NA values. dat_out2 is identical, except it shows that the same results should arise when there are missing values among a group's values.
Surely there is some way to do this with dplyr::group_by()? I don't know how to make these comparisons within groups.
# Input (2 groups: 1 with identical values of outcome
# in the group (group a) and 1 with differing values of
# outcome in the group (group b)
dat <- data.frame(group = c("a","a","a","b","b","b"),
                  outcome = c(1,1,1,3,2,2))
# Output 1: add a variable for all observations belonging to
# a group where the outcome changed within each group
dat_out1 <- data.frame(group = c("a","a","a","b","b","b"),
                       outcome = c(1,1,1,3,2,2),
                       outcome_change = c(0,0,0,1,1,1))
# Output 2: same as Output 1, but able to ignore NA values
dat_out2 <- data.frame(group = c("a","a","a","b","b","b"),
                       outcome = c(1,1,NA,3,2,NA),
                       outcome_change = c(0,0,0,1,1,1))
Here is an approach:
library(tidyverse)
dat %>%
  group_by(group) %>%
  mutate(outcome_change = ifelse(length(unique(outcome[!is.na(outcome)])) > 1, 1, 0))
#output
# A tibble: 6 x 3
# Groups: group [2]
group outcome outcome_change
<fctr> <dbl> <dbl>
1 a 1 0
2 a 1 0
3 a 1 0
4 b 3 1
5 b 2 1
6 b 2 1
with dat2
# A tibble: 6 x 3
# Groups: group [2]
group outcome outcome_change
<fctr> <dbl> <dbl>
1 a 1 0
2 a 1 0
3 a NA 0
4 b 3 1
5 b 2 1
6 b NA 1
library(dplyr)
dat <- data.frame(group = c("a","a","a","b","b","b"),
                  outcome = c(1,1,1,3,2,2))
dat2 <- data.frame(group = c("a","a","a","b","b","b"),
                   outcome = c(1,1,NA,3,2,NA))
dat_out1 <- dat %>%
  group_by(group) %>%
  mutate(outcome_change = ifelse(min(outcome) == max(outcome), 0, 1))
dat_out2 <- dat2 %>%
  group_by(group) %>%
  mutate(outcome_change = ifelse(min(outcome, na.rm = TRUE) == max(outcome, na.rm = TRUE), 0, 1))
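For completeness, a base R sketch of the same idea using ave(), counting the distinct non-NA outcomes per group (assuming dat as defined above):
dat$outcome_change <- ave(dat$outcome, dat$group,
                          FUN = function(x) as.numeric(length(unique(x[!is.na(x)])) > 1))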
Here is an option using data.table (with dat1 a data.table copy of dat above):
library(data.table)
setDT(dat1)[, outcome_change := as.integer(uniqueN(outcome[!is.na(outcome)])>1), group]
dat1
# group outcome outcome_change
#1: a 1 0
#2: a 1 0
#3: a 1 0
#4: b 3 1
#5: b 2 1
#6: b 2 1
If we apply the same with 'dat2'
dat2
# group outcome outcome_change
#1: a 1 0
#2: a 1 0
#3: a NA 0
#4: b 3 1
#5: b 2 1
#6: b NA 1

Create NAs for first two rows using group_by

I am trying to replace some of my observations with NAs. I would like to replace only the first two observations with NA within each group, where groups are represented by a given ID.
So from:
id b
1 1 0.1125294
2 1 -0.6871102
3 1 0.1721639
4 2 0.2714921
5 2 0.1012665
6 2 -0.3538989
Get:
id b
1 1 NA
2 1 NA
3 1 0.1721639
4 2 NA
5 2 NA
6 2 -0.3538989
Tried this, but it does not work...
data <- data %>% group_by(id) %>% mutate(data$b[1:2] = NA)
Thanks for any help!
library(dplyr)
df <- data.frame(id = rep(1:2, each = 3), value = rnorm(6))
df %>% group_by(id) %>% mutate(value=replace(value, 1:2, NA))
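An equivalent sketch keeping the same replace() idea, but with an explicit row_number() condition instead of positional indices:
df %>%
  group_by(id) %>%
  mutate(value = replace(value, row_number() <= 2, NA))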

R, dplyr: cumulative version of n_distinct

I have a dataframe as follows. It is ordered by column time.
Input -
df = data.frame(time = 1:20,
                grp = sort(rep(1:5, 4)),
                var1 = rep(c('A','B'), 10))
head(df,10)
time grp var1
1 1 1 A
2 2 1 B
3 3 1 A
4 4 1 B
5 5 2 A
6 6 2 B
7 7 2 A
8 8 2 B
9 9 3 A
10 10 3 B
I want to create another variable var2 which counts the number of distinct var1 values so far, i.e. up to that point in time, for each group grp. This is a little different from what I'd get if I were to use n_distinct.
Expected output -
time grp var1 var2
1 1 1 A 1
2 2 1 B 2
3 3 1 A 2
4 4 1 B 2
5 5 2 A 1
6 6 2 B 2
7 7 2 A 2
8 8 2 B 2
9 9 3 A 1
10 10 3 B 2
I want to create a function say cum_n_distinct for this and use it as -
d_out = df %>%
  arrange(time) %>%
  group_by(grp) %>%
  mutate(var2 = cum_n_distinct(var1))
A dplyr solution inspired by @akrun's answer.
The logic is basically to set the 1st occurrence of each unique value of var1 to 1 and the rest to 0 within each group grp, and then apply cumsum on it:
df = df %>%
  arrange(time) %>%
  group_by(grp, var1) %>%
  mutate(var_temp = ifelse(row_number() == 1, 1, 0)) %>%
  group_by(grp) %>%
  mutate(var2 = cumsum(var_temp)) %>%
  select(-var_temp)
head(df,10)
Source: local data frame [10 x 4]
Groups: grp
time grp var1 var2
1 1 1 A 1
2 2 1 B 2
3 3 1 A 2
4 4 1 B 2
5 5 2 A 1
6 6 2 B 2
7 7 2 A 2
8 8 2 B 2
9 9 3 A 1
10 10 3 B 2
Assuming stuff is ordered by time already, first define a cumulative distinct function:
dist_cum <- function(var)
  sapply(seq_along(var), function(x) length(unique(head(var, x))))
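As a quick sanity check on made-up values:
dist_cum(c(5, 10, 10, 15, 5))
#> [1] 1 2 2 3 3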
Then a base solution that uses ave to create groups (note, assumes var1 is factor), and then applies our function to each group:
transform(df, var2=ave(as.integer(var1), grp, FUN=dist_cum))
A data.table solution, basically doing the same thing:
library(data.table)
(data.table(df)[, var2:=dist_cum(var1), by=grp])
And dplyr, again, same thing:
library(dplyr)
df %>% group_by(grp) %>% mutate(var2=dist_cum(var1))
Update
With your new dataset, an approach in base R
df$var2 <- unlist(lapply(split(df, df$grp), function(x) {
  x$var2 <- 0
  indx <- match(unique(x$var1), x$var1)
  x$var2[indx] <- 1
  cumsum(x$var2)
}))
head(df,7)
# time grp var1 var2
# 1 1 1 A 1
# 2 2 1 B 2
# 3 3 1 A 2
# 4 4 1 B 2
# 5 5 2 A 1
# 6 6 2 B 2
# 7 7 2 A 2
Here's another solution using data.table that's pretty quick.
Generic Function
cum_n_distinct <- function(x, na.include = TRUE){
  # Given a vector x, returns a corresponding vector y
  # where the ith element of y gives the number of unique
  # elements observed up to and including index i.
  # If na.include = TRUE (default), NA is counted as an
  # additional unique element; otherwise it's essentially ignored.
  temp <- data.table(x, idx = seq_along(x))
  firsts <- temp[temp[, .I[1L], by = x]$V1]
  if(na.include == FALSE) firsts <- firsts[!is.na(x)]
  y <- rep(0, times = length(x))
  y[firsts$idx] <- 1
  y <- cumsum(y)
  return(y)
}
Example Use
cum_n_distinct(c(5,10,10,15,5)) # 1 2 2 3 3
cum_n_distinct(c(5,NA,10,15,5)) # 1 2 3 4 4
cum_n_distinct(c(5,NA,10,15,5), na.include = FALSE) # 1 1 2 3 3
Solution To Your Question
d_out = df %>%
  arrange(time) %>%
  group_by(grp) %>%
  mutate(var2 = cum_n_distinct(var1))
