R lag across arbitrary number of missing values [duplicate] - r

library(tidyverse)
testdata <- tibble(ID = c(1, NA, NA, 2, NA, 3),
                   Observation = LETTERS[1:6])
testdata1 <- testdata %>%
  mutate(
    ID1 = case_when(
      is.na(ID) ~ lag(ID, 1),
      TRUE ~ ID
    )
  )
testdata1
I have a dataset like testdata, with a valid ID only when ID changes. There can be an arbitrary number of records in a set, but the above case_when and lag() structure does not fill in ID for all records, just for record 2 in each group. Is there a way to get the 3rd (or deeper) IDs filled with the appropriate value?
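To see why only the second record in each group is filled, it may help to look at what lag() returns on its own. This is a minimal sketch, assuming dplyr is installed; the vector ID is copied from testdata, and the names shifted and one_step are only illustrative.

```r
library(dplyr)

ID <- c(1, NA, NA, 2, NA, 3)

# lag() shifts the ORIGINAL vector by one position; it never sees values
# that were filled in for earlier rows.
shifted <- lag(ID, 1)
shifted
# [1] NA  1 NA NA  2 NA

# Same logic as the case_when() in the question: use the lagged value
# only where ID is missing.
one_step <- ifelse(is.na(ID), shifted, ID)
one_step
# [1]  1  1 NA  2  2  3
```

Row 3 stays NA because the value one row above it was itself NA in the original column.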

We can use fill from the tidyr package. Since you are loading tidyverse, tidyr is already attached.
testdata1 <- testdata %>%
  fill(ID)
testdata1
# # A tibble: 6 x 2
# ID Observation
# <dbl> <chr>
# 1 1 A
# 2 1 B
# 3 1 C
# 4 2 D
# 5 2 E
# 6 3 F
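Or we can use na.locf from the zoo package. Setting na.rm = FALSE keeps any leading NA in place, so the result always has the same length as the input.
library(zoo)
testdata1 <- testdata %>%
  mutate(ID = na.locf(ID, na.rm = FALSE))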
testdata1
# # A tibble: 6 x 2
# ID Observation
# <dbl> <chr>
# 1 1 A
# 2 1 B
# 3 1 C
# 4 2 D
# 5 2 E
# 6 3 F
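If neither package is available, the same "last observation carried forward" idea can be sketched in base R. The helper name locf below is hypothetical and not part of any package; it assumes a plain atomic vector.

```r
# Hypothetical helper: carry the last non-NA value forward.
locf <- function(x) {
  seen <- !is.na(x)
  idx <- cumsum(seen)        # 0 before the first non-NA value
  c(NA, x[seen])[idx + 1]    # position 1 holds NA for any leading gap
}

ID <- c(1, NA, NA, 2, NA, 3)
locf(ID)
# [1] 1 1 1 2 2 3
```

The running count of non-NA values tells each row which observed value it belongs to, so gaps of any length are filled in one pass.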

Related

How to convert a column to a different type using NSE?

I'm writing a function that takes a data frame and a column name as arguments, and returns the data frame with the indicated column converted to character type. However, I'm stuck at the non-standard evaluation part of dplyr.
My current code:
df <- tibble(id = 1:5, value = 6:10)
col <- "id"
mutate(df, "{col}" := as.character({{ col }}))
# # A tibble: 5 x 2
# id value
# <chr> <int>
# 1 id 6
# 2 id 7
# 3 id 8
# 4 id 9
# 5 id 10
As you can see, instead of transforming the contents of the column to character type, the column values are replaced by the column names. {{ col }} isn't evaluated like I expected it to be. What I want is a dynamic equivalent of this:
mutate(df, id = as.character(id))
# # A tibble: 5 x 2
# id value
# <chr> <int>
# 1 1 6
# 2 2 7
# 3 3 8
# 4 4 9
# 5 5 10
I've tried to follow the instructions provided in dplyr's programming vignette, but I'm not finding a solution that works. What am I doing wrong?
Use the .data pronoun -
library(dplyr)
df <- tibble(id = 1:5, value = 6:10)
col <- "id"
mutate(df, "{col}" := as.character(.data[[col]]))
# id value
# <chr> <int>
#1 1 6
#2 2 7
#3 3 8
#4 4 9
#5 5 10
Some other alternatives -
mutate(df, "{col}" := as.character(get(col)))
mutate(df, "{col}" := as.character(!!sym(col)))
We may use across, which can also do this on multiple columns.
library(dplyr)
df %>%
  mutate(across(all_of(col), as.character))
# # A tibble: 5 x 2
# id value
# <chr> <int>
# 1 1 6
# 2 2 7
# 3 3 8
# 4 4 9
# 5 5 10
data
df <- tibble(id = 1:5, value = 6:10)
col <- "id"
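Since the question asks for a function, the .data approach can be wrapped as follows. This is a sketch under assumptions: it needs dplyr 1.0 or later, and the function names col_to_character and col_to_character2 are hypothetical. The second variant takes an unquoted column name, which is the case where {{ }} does what the question expected.

```r
library(dplyr)

# Hypothetical wrapper: `col` is a column name supplied as a string.
col_to_character <- function(df, col) {
  mutate(df, "{col}" := as.character(.data[[col]]))
}

# Hypothetical variant: `col` is supplied as a bare (unquoted) name,
# so {{ col }} refers to the column rather than to a string.
col_to_character2 <- function(df, col) {
  mutate(df, "{{ col }}" := as.character({{ col }}))
}

df <- tibble(id = 1:5, value = 6:10)
out1 <- col_to_character(df, "id")
out2 <- col_to_character2(df, id)
```

In the original attempt, col held the string "id", so {{ col }} evaluated to that string and it was recycled down the column.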

R reset counter based on two columns [duplicate]

I have the following kind of data and I need the output shown in the second data frame:
a <- c(1,1,1,1,2,2,2,2,2,2,2)
b <- c(1,1,1,2,3,3,3,3,4,5,6)
d <- c(1,2,3,4,1,2,3,4,5,6,7)
df <- as.data.frame(cbind(a,b,d))
output <- c(1,1,1,2,1,1,1,1,2,3,4)
df_output <- as.data.frame(cbind(df,output))
I have tried cumsum and I am not able to get the desired results. Please guide. Regards, Enthu.
The counter should restart at 1 whenever the value in column a changes.
Within each value of a, rows with the same value of b share the same counter value, and every change in b increments the counter by 1.
For example, in the 5th record a changes to 2 and b is 3, so the counter resets to 1. Rows 5 to 8 all have b equal to 3, so they all get 1, and each later change in b adds 1.
We can group by column 'a' and then create the new column by matching each value of 'b' against the unique values of 'b' within the group.
library(dplyr)
df2 <- df %>%
  group_by(a) %>%
  mutate(out = match(b, unique(b)))
df2
# A tibble: 11 x 4
# Groups: a [2]
# a b d out
# <dbl> <dbl> <dbl> <int>
# 1 1 1 1 1
# 2 1 1 2 1
# 3 1 1 3 1
# 4 1 2 4 2
# 5 2 3 1 1
# 6 2 3 2 1
# 7 2 3 3 1
# 8 2 3 4 1
# 9 2 4 5 2
#10 2 5 6 3
#11 2 6 7 4
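Another option is to convert 'b' to a factor and coerce it to integer. Note that this numbers the values by sorted order rather than by order of appearance; the two agree here only because 'b' is increasing within each group.
df %>%
  group_by(a) %>%
  mutate(out = as.integer(factor(b)))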
data
df <- data.frame(a, b, d)
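For comparison, the grouped match can also be sketched in base R with ave, which applies a function within each level of a grouping variable. The small vector x is made-up data, added only to show where match and factor give different numbering.

```r
df <- data.frame(
  a = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2),
  b = c(1, 1, 1, 2, 3, 3, 3, 3, 4, 5, 6)
)

# Within each value of a, number the values of b by first appearance.
df$out <- ave(df$b, df$a, FUN = function(x) match(x, unique(x)))
df$out
# [1] 1 1 1 2 1 1 1 1 2 3 4

# Made-up unsorted input: the two approaches differ.
x <- c(3, 3, 1)
by_appearance <- match(x, unique(x))   # 1 1 2
by_sorted <- as.integer(factor(x))     # 2 2 1
```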

Is there an R function to count values separated by semi-colon in a column and across rows

I am wondering if there is a smart way to count the occurrence of values separated by semi-colon in a column and across the entire rows.
Sample data and output expected
Here's a tidyverse approach:
library(tidyverse)
# example data
df1 <- data.frame(var1 = c(2, 4, 3, 5),
                  var2 = c("3;5;2;0;1", "2;3;8;5", "9;6;2", "8;5;4;7;0;1"),
                  stringsAsFactors = FALSE)
df1 %>%
  separate_rows(var2) %>%           # split values to different rows
  filter(var2 %in% df1$var1) %>%    # keep values that match var1
  count(var2)                       # count each value
# # A tibble: 4 x 2
# var2 n
# <chr> <int>
# 1 2 3
# 2 3 2
# 3 4 1
# 4 5 3
And a base R approach:
v = unlist(strsplit(df1$var2, ";"))
data.frame(table(v[v %in% df1$var1]))
# Var1 Freq
# 1 2 3
# 2 3 2
# 3 4 1
# 4 5 3
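If a per-row count is also wanted (how many values each cell of var2 holds), a possible base R sketch is shown below. It reuses the example data from the answer; the column name n_values and the object all_counts are assumptions made for illustration.

```r
df1 <- data.frame(var1 = c(2, 4, 3, 5),
                  var2 = c("3;5;2;0;1", "2;3;8;5", "9;6;2", "8;5;4;7;0;1"),
                  stringsAsFactors = FALSE)

# Split each cell on ";" into a list with one character vector per row.
parts <- strsplit(df1$var2, ";", fixed = TRUE)

# Number of values in each row.
df1$n_values <- lengths(parts)
df1$n_values
# [1] 5 4 3 6

# Occurrences of every value across all rows, without filtering on var1.
all_counts <- table(unlist(parts))
```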

R Show duplicates in dataframe

I am trying to "highlight" duplicates in my dataframe. I found various tutorials on dropping duplicates or creating a new dataset containing only duplicates. But since I suspect something went wrong in earlier stages of my data work, I would (for now) just like to see which observations appear to be duplicates in order to understand what went wrong. I would like R to create column c:
a <- c("C","A","A","B","A","C","C")
b <- c(1,1,2,1,2,1,2)
c <- c(2,1,2,1,2,2,1)
df <-data.frame(a,b,c)
a <- c("C","A","A","B","A","C","C")
b <- c(1,1,2,1,2,1,2)
df <-data.frame(a,b)
library(dplyr)
df %>%
  group_by(a, b) %>%    # for each combination of a and b
  mutate(c = n()) %>%   # count times they appear
  ungroup()
# # A tibble: 7 x 3
# a b c
# <fct> <dbl> <int>
# 1 C 1 2
# 2 A 1 1
# 3 A 2 2
# 4 B 1 1
# 5 A 2 2
# 6 C 1 2
# 7 C 2 1
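If a simple TRUE/FALSE flag is enough, base R's duplicated can mark every row whose combination of a and b occurs more than once. This is a sketch; the column name is_dup is an assumption. duplicated alone skips the first occurrence, so it is applied from both ends.

```r
df <- data.frame(a = c("C", "A", "A", "B", "A", "C", "C"),
                 b = c(1, 1, 2, 1, 2, 1, 2))

key <- df[c("a", "b")]

# Flag every member of a duplicated group, including the first occurrence.
df$is_dup <- duplicated(key) | duplicated(key, fromLast = TRUE)
df$is_dup
# [1]  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE

# Inspect only the suspicious rows.
df[df$is_dup, ]
```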

Subsetting a dataframe to exclude values in the same row as NA

I have a dataframe called "data". One of the columns is called "reward" and another is called "X.targetResp". I want to create a new dataframe, called "reward", that consists of all values from the column "reward" in "data." HOWEVER, I want to exclude values of the "reward" column that are in the same row as an NA value in the "X.targetResp" column of "data".
I've tried the following:
reward <- data$reward %in% filter(!is.na(data$X.targetResp))
reward <- subset(data, reward, !(X.targetResp=="NA"))
reward <- subset(data, reward, !is.na(X.targetResp))
...but I get errors for each of them.
Thanks for your input!
In dplyr, you can use filter and !is.na() to filter out the ones with NA in X.targetResp, and then use the select function to select the reward column.
library(dplyr)
# Create example data frame
dat <- tibble(reward = 1:5,
              X.targetResp = c(2, 4, NA, NA, 10))
# Print the data frame
dat
# # A tibble: 5 x 2
# reward X.targetResp
# <int> <dbl>
# 1 1 2
# 2 2 4
# 3 3 NA
# 4 4 NA
# 5 5 10
# Use the filter function
reward <- dat %>%
  filter(!is.na(X.targetResp)) %>%
  select(reward)
reward
# # A tibble: 3 x 1
# reward
# <int>
# 1 1
# 2 2
# 3 5
And here is a base R solution with similar logic.
subset(dat, !is.na(X.targetResp), "reward")
# # A tibble: 3 x 1
# reward
# <int>
# 1 1
# 2 2
# 3 5
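The attempts in the question likely failed for two reasons: subset takes its arguments in the order subset(x, subset, select), so the row condition has to come before the column, and comparing with the string "NA" does not detect missing values, which is what is.na is for. A minimal sketch with plain indexing, using the example values from this answer (the object names keep, reward_vec, reward_df, and reward_sub are illustrative):

```r
data <- data.frame(reward = 1:5,
                   X.targetResp = c(2, 4, NA, NA, 10))

# Logical vector: TRUE where X.targetResp is present.
keep <- !is.na(data$X.targetResp)

# As a plain vector.
reward_vec <- data$reward[keep]
reward_vec
# [1] 1 2 5

# As a one-column data frame; drop = FALSE prevents simplification.
reward_df <- data[keep, "reward", drop = FALSE]

# The same with subset(), naming the arguments to avoid mixing them up.
reward_sub <- subset(data, subset = !is.na(X.targetResp), select = reward)
```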
You can also consider using drop_na from the tidyr package on X.targetResp.
library(dplyr)
library(tidyr)
reward <- dat %>%
  drop_na(X.targetResp) %>%
  select(reward)
reward
# # A tibble: 3 x 1
# reward
# <int>
# 1 1
# 2 2
# 3 5
Here is an example using the data.table package.
library(data.table)
setDT(dat)
reward <- dat[!is.na(X.targetResp), .(reward)]
reward
# reward
# 1: 1
# 2: 2
# 3: 5
You can simply use na.omit, which drops every row that contains an NA in any column:
# replicating the same example data frame given by @www
data <- data.frame(
  reward = 1:5,
  X.targetResp = c(2, 4, NA, NA, 10)
)
# omitting the rows containing NAs
reward <- na.omit(data)
# resulting data frame with both columns
reward
# reward X.targetResp
# 1 1 2
# 2 2 4
# 5 5 10
# you can easily extract the first column if necessary
reward[1]
# reward
# 1 1
# 2 2
# 5 5
Following up on @www's comment:
In case there are other columns whose NA values you want to ignore:
# omitting the rows where only X.targetResp is NA
reward <- data[complete.cases(data["X.targetResp"]), ]
# resulting data frame with both columns
reward
# reward X.targetResp
# 1 1 2
# 2 2 4
# 5 5 10
# you can easily extract the first column if necessary
reward[1]
# reward
# 1 1
# 2 2
# 5 5
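The difference between the two approaches only shows up when another column also contains NA. In the sketch below, the third column other is made-up data added purely for illustration.

```r
data <- data.frame(
  reward = 1:5,
  X.targetResp = c(2, 4, NA, NA, 10),
  other = c(NA, 1, 1, 1, 1)   # hypothetical extra column with its own NA
)

# na.omit() checks every column, so row 1 is dropped as well.
all_cols <- na.omit(data)
all_cols$reward
# [1] 2 5

# complete.cases() on one column ignores NA values elsewhere.
one_col <- data[complete.cases(data["X.targetResp"]), ]
one_col$reward
# [1] 1 2 5
```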
