I have data that looks like this
ID v1 v2
1 1 0
2 0 1
3 1 0
3 0 1
4 0 1
I want to replace all values with 'NA' if the ID occurs more than once in the dataframe. The final product should look like this
ID v1 v2
1 1 0
2 0 1
3 NA NA
3 NA NA
4 0 1
I could do this by hand, but I want R to detect all the duplicate cases (in this case two times ID '3') and replace the values with 'NA'.
Thanks for your help!
You could use duplicated() from either end, and then replace.
idx <- duplicated(df$ID) | duplicated(df$ID, fromLast = TRUE)
df[idx, -1] <- NA
which gives
ID v1 v2
1 1 1 0
2 2 0 1
3 3 NA NA
4 3 NA NA
5 4 0 1
This will also work if the duplicated IDs are not next to each other.
Data:
df <- structure(list(ID = c(1L, 2L, 3L, 3L, 4L), v1 = c(1L, 0L, 1L,
0L, 0L), v2 = c(0L, 1L, 0L, 1L, 1L)), .Names = c("ID", "v1",
"v2"), class = "data.frame", row.names = c(NA, -5L))
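As a quick check of the claim about non-adjacent duplicates, here is a minimal sketch on a made-up data frame where the two rows with ID 3 are separated. The name df_shuffled and its values are assumptions for illustration, not part of the question:

```r
# Hypothetical data: the duplicated ID (3) appears in rows 1 and 4
df_shuffled <- data.frame(
  ID = c(3L, 1L, 2L, 3L, 4L),
  v1 = c(1L, 1L, 0L, 0L, 0L),
  v2 = c(0L, 0L, 1L, 1L, 1L)
)

# Flag every row whose ID occurs more than once, scanning from both ends
idx <- duplicated(df_shuffled$ID) | duplicated(df_shuffled$ID, fromLast = TRUE)

# Blank out all non-ID columns in the flagged rows
df_shuffled[idx, -1] <- NA
df_shuffled
```

Rows 1 and 4 should end up with NA in v1 and v2, while the ID column stays intact.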
One more option:
df1[df1$ID %in% df1$ID[duplicated(df1$ID)], -1] <- NA
#> df1
# ID v1 v2
#1 1 1 0
#2 2 0 1
#3 3 NA NA
#4 3 NA NA
#5 4 0 1
data
df1 <- structure(list(ID = c(1L, 2L, 3L, 3L, 4L), v1 = c(1L, 0L, 1L,
0L, 0L), v2 = c(0L, 1L, 0L, 1L, 1L)), .Names = c("ID", "v1",
"v2"), class = "data.frame", row.names = c(NA, -5L))
Here is a base R method
# get list of repeated IDs
repeats <- rle(df$ID)$values[rle(df$ID)$lengths > 1]
# set the corresponding variables to NA
df[, -1] <- sapply(df[, -1], function(i) {i[df$ID %in% repeats] <- NA; i})
In the first line, we use rle to extract the repeated IDs. In the second, we use sapply to loop over the non-ID variables and set their values to NA in every row whose ID repeats.
Note that this assumes that rows with the same ID are adjacent, which holds if the data set is sorted by ID. Sorting can be done with the order function: df <- df[order(df$ID), ].
If the data set is very large, you might split the first line into two steps to avoid computing the rle twice:
dfRle <- rle(df$ID)
repeats <- dfRle$values[dfRle$lengths > 1]
data
df <- read.table(header = TRUE, text = "ID v1 v2
1 1 0
2 0 1
3 1 0
3 0 1
4 0 1")
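To see why the rle approach needs equal IDs to sit next to each other, here is a small sketch on a made-up ID vector (ids is a hypothetical name). rle only counts runs of consecutive equal values, so a repeated ID that is split across non-adjacent rows is missed until the vector is sorted:

```r
# Hypothetical IDs: 3 is repeated, but the two 3s are not adjacent
ids <- c(3L, 1L, 3L, 2L)

# Unsorted: every run has length 1, so no repeats are detected
r_unsorted <- rle(ids)
repeats_unsorted <- r_unsorted$values[r_unsorted$lengths > 1]

# Sorted: the two 3s form a run of length 2 and are detected
r_sorted <- rle(sort(ids))
repeats_sorted <- r_sorted$values[r_sorted$lengths > 1]

repeats_unsorted
repeats_sorted
```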
I have a table similar to this minimal example without the difference column:
trigger values difference
0       3
0       NA
0       NA
1       5      2
0       4
0       NA
1       10     6
At each trigger point (trigger = 1), I want to subtract the number above it (skipping the NAs) from the number at the trigger point.
Is there a way to do this in R?
Edit:
I have now the situation where the triggers lie close together like in this example:
trigger values difference
0       3
0       NA
0       NA
1       5      2
0       4
1       5      1
0       10
How can I tackle this problem?
Create a grouping column by taking the cumsum of 'trigger' and lagging it by one row, so that each group ends at a trigger row. Then compute the difference between the last and the first element of each group and store it in the last row of the group, leaving NA elsewhere.
library(dplyr)
df1 %>%
group_by(grp = lag(cumsum(trigger), default = 0)) %>%
mutate(difference = replace(rep(NA, n()), n(),
values[n()] - values[1])) %>%
ungroup %>%
select(-grp)
-output
# A tibble: 7 × 3
trigger values difference
<int> <int> <int>
1 0 3 NA
2 0 NA NA
3 0 NA NA
4 1 5 2
5 0 4 NA
6 0 NA NA
7 1 10 6
For the second case, we need an if/else condition on the number of rows per group: the computation is only done when the group has more than one row, otherwise the result is NA.
df2 %>%
group_by(grp = lag(cumsum(trigger), default = 0)) %>%
mutate(difference = if(n() > 1) replace(rep(NA, n()), n(),
values[n()] - values[1]) else NA) %>%
ungroup
-output
# A tibble: 7 × 4
trigger values grp difference
<int> <int> <dbl> <int>
1 0 3 0 NA
2 0 NA 0 NA
3 0 NA 0 NA
4 1 5 0 2
5 0 4 1 NA
6 1 5 1 1
7 0 10 2 NA
data
df1 <- structure(list(trigger = c(0L, 0L, 0L, 1L, 0L, 0L, 1L), values = c(3L,
NA, NA, 5L, 4L, NA, 10L)), class = "data.frame", row.names = c(NA,
-7L))
df2 <- structure(list(trigger = c(0L, 0L, 0L, 1L, 0L, 1L, 0L), values = c(3L,
NA, NA, 5L, 4L, 5L, 10L)), class = "data.frame", row.names = c(NA,
-7L))
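To make the grouping step easier to follow, here is a rough base R sketch of the same idea without dplyr. It is not the code from the answer above: the vectors trigger and values copy the df1 data, while grp and difference are names chosen for illustration, and ave stands in for group_by/mutate:

```r
trigger <- c(0L, 0L, 0L, 1L, 0L, 0L, 1L)
values  <- c(3L, NA, NA, 5L, 4L, NA, 10L)

# Lagged cumulative sum: the group id only increases on the row AFTER a trigger,
# so each trigger row is the last row of its group
grp <- c(0L, head(cumsum(trigger), -1))

# Per group: NA everywhere, except last minus first value in the last row
# (only when the group has more than one row)
difference <- ave(values, grp, FUN = function(v) {
  out <- rep(NA_integer_, length(v))
  if (length(v) > 1) out[length(v)] <- v[length(v)] - v[1]
  out
})

data.frame(trigger, values, grp, difference)
```

Printing grp on its own shows how rows 1 to 4 and rows 5 to 7 end up in separate groups.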
# Import data: df => data.frame
df <- structure(list(trigger = c(0L, 0L, 0L, 1L, 0L, 0L, 1L), values = c(3L,
NA, NA, 5L, 4L, NA, 10L), diff_col = c(NA, NA, NA, 2L, -1L, NA,
6L)), row.names = c(NA, -7L), class = "data.frame")
# Create an empty vector: diff_col => integer vector
df$diff_col <- NA_integer_
# Difference the X.values vector, ignoring NAs:
# diff_col => integer vector
df[which(!(is.na(df$values)))[-1], "diff_col"] <- diff(
na.omit(
df$values
)
)
# Nullify the value if the trigger is 0:
# diff_col => integer vector
df$diff_col <- with(
df,
ifelse(
trigger == 0,
NA_integer_,
diff_col
)
)
I am attempting to have R read across columns by row and evaluate whether values from two adjacent cells are equal. If the values are equal, I want R to count this occurrence in a new variable. Here is example data (df):
Var1 Var2 Var3
2    3    3
3    3    3
1    2    3
3    2    1
...and I want to get here:
Var1 Var2 Var3 NewVar
2    3    3    1
3    3    3    2
1    2    3    0
3    2    1    0
One example set of code I have tried out is the following:
df$NewVar <- 0
for (i in 1:2){
if (df[i]==df[i+1]){
df$NewVar <- df$NewVar + 1
}
else{
df$NewVar <- df$NewVar
}
}
This particular set of code just returns 0s in the NewVar variable.
Any sort of help would be much appreciated!
Here's a vectorized solution using rowSums:
df$NewVar <- rowSums(df[-1] == df[-ncol(df)])
df
# Var1 Var2 Var3 NewVar
#1 2 3 3 1
#2 3 3 3 2
#3 1 2 3 0
#4 3 2 1 0
data
df <- structure(list(Var1 = c(2L, 3L, 1L, 3L), Var2 = c(3L, 3L, 2L,
2L), Var3 = c(3L, 3L, 3L, 1L)), class = "data.frame", row.names = c(NA,-4L))
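It may help to look at the intermediate comparison before it is summed. The sketch below rebuilds the same data under a hypothetical name (df_adj) and stores the logical matrix in eq, also a name chosen for illustration. Dropping the first column and dropping the last column lines up each column with its left-hand neighbour:

```r
df_adj <- data.frame(
  Var1 = c(2L, 3L, 1L, 3L),
  Var2 = c(3L, 3L, 2L, 2L),
  Var3 = c(3L, 3L, 3L, 1L)
)

# Column 1 of eq compares Var2 with Var1, column 2 compares Var3 with Var2
eq <- df_adj[-1] == df_adj[-ncol(df_adj)]
eq

# Each TRUE counts as 1, so the row sums give the number of equal adjacent pairs
new_var <- unname(rowSums(eq))
new_var
```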
We can use Reduce
df$NewVar <- Reduce(`+`, Map(`==`, df[-1], df[-ncol(df)]))
data
df <- structure(list(Var1 = c(2L, 3L, 1L, 3L), Var2 = c(3L, 3L, 2L,
2L), Var3 = c(3L, 3L, 3L, 1L)), class = "data.frame", row.names = c(NA,-4L))
I would like to count how many rows in each column are >0 and how many of those rows (that are >0) start with "mt-".
The result should also be in a data frame.
Here is an example.
df1
mt-abc 1 0 2
mt-dca 1 1 2
cla 0 2 0
dla 0 3 0
result
above0 2 3 2
mt 2 1 2
In base R you can do:
mat <- df[-1] > 0
rbind(above0 = colSums(mat),
mt = colSums(startsWith(df$V1, 'mt') & mat))
# V2 V3 V4
#above0 2 3 2
#mt 2 1 2
If the actual data has only numbers in the columns and the names in the rownames, we can do:
mat <- df > 0
rbind(above0 = colSums(mat),
mt = colSums(startsWith(rownames(df), 'mt') & mat))
data
df <- structure(list(V1 = c("mt-abc", "mt-dca", "cla", "dla"), V2 = c(1L,
1L, 0L, 0L), V3 = 0:3, V4 = c(2L, 2L, 0L, 0L)), class = "data.frame",
row.names = c(NA, -4L))
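Since the question asks for the result as a data frame, here is a small sketch that builds on the first base R snippet above and converts the matrix with as.data.frame. The names df_mt, is_mt, res and res_df are assumptions for illustration, and the prefix is matched as "mt-" to follow the wording of the question:

```r
df_mt <- data.frame(
  V1 = c("mt-abc", "mt-dca", "cla", "dla"),
  V2 = c(1L, 1L, 0L, 0L),
  V3 = 0:3,
  V4 = c(2L, 2L, 0L, 0L)
)

# Logical matrix: TRUE where a value is above 0
mat <- df_mt[-1] > 0

# One flag per row; it is recycled down each column of mat in the & below
is_mt <- startsWith(df_mt$V1, "mt-")

res <- rbind(above0 = colSums(mat),
             mt     = colSums(is_mt & mat))

# Convert the 2 x 3 matrix to a data frame, keeping the row names
res_df <- as.data.frame(res)
res_df
```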
I don't think this is the most elegant approach in the tidyverse, but just out of curiosity:
library(tidyverse)
my_df <- data.frame(
stringsAsFactors = FALSE,
var = c("mt-abc", "mt-dca", "cla", "dla"),
x = c(1L, 1L, 0L, 0L),
y = c(0L, 1L, 2L, 3L),
z = c(2L, 2L, 0L, 0L)
)
df_1 <- my_df %>%
summarize(across(.cols=x:z, .fns=~sum(.x > 0))) %>%
mutate(var="above0")
df_2 <- my_df %>%
filter(str_detect(var, "^mt")) %>%
summarise(across(.cols=x:z, .fns=~sum(.x > 0))) %>%
mutate(var="mt")
bind_rows(df_1, df_2)
#> x y z var
#> 1 2 3 2 above0
#> 2 2 1 2 mt
Created on 2020-12-04 by the reprex package (v0.3.0)
I would like to know how to keep a running count of the number of times that a column in my data.frame satisfies a condition. Let's consider a data.frame such as:
x hour count
1 0 NA
2 1 NA
3 2 NA
4 3 NA
5 0 NA
6 1 NA
...
I would like to have this output:
x hour count
1 0 1
2 1 NA
3 2 NA
4 3 NA
5 0 2
6 1 NA
...
With the count column increasing by 1 every time the condition hour==0 is met.
Is there a smart and efficient way to perform this? Thanks
You can number the rows where hour == 0 with seq_len.
i <- x$hour == 0
x$count[i] <- seq_len(sum(i))
x
# x hour count
#1 1 0 1
#2 2 1 NA
#3 3 2 NA
#4 4 3 NA
#5 5 0 2
#6 6 1 NA
Data:
x <- structure(list(x = 1:6, hour = c(0L, 1L, 2L, 3L, 0L, 1L), count = c(NA,
NA, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA,
-6L))
You can use cumsum to get a running count of the occurrences of 0, then replace the count with NA wherever hour is not 0.
library(dplyr)
df %>%
mutate(count = cumsum(hour == 0),
count = replace(count, hour != 0 , NA))
# x hour count
#1 1 0 1
#2 2 1 NA
#3 3 2 NA
#4 4 3 NA
#5 5 0 2
#6 6 1 NA
data
df <- structure(list(x = 1:6, hour = c(0L, 1L, 2L, 3L, 0L, 1L)),
class = "data.frame", row.names = c(NA, -6L))
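The same cumsum idea can also be written without dplyr. This is only a sketch in base R on the hour values from the example, with hour and count as standalone vectors rather than data frame columns:

```r
hour <- c(0L, 1L, 2L, 3L, 0L, 1L)

# cumsum(hour == 0) is the running number of zeros seen so far;
# ifelse keeps it only on the rows where hour is 0
count <- ifelse(hour == 0, cumsum(hour == 0), NA_integer_)
count
```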
Using data.table
library(data.table)
setDT(df)[hour == 0, count := seq_len(.N)]
df
# x hour count
#1: 1 0 1
#2: 2 1 NA
#3: 3 2 NA
#4: 4 3 NA
#5: 5 0 2
#6: 6 1 NA
data
df <- structure(list(x = 1:6, hour = c(0L, 1L, 2L, 3L, 0L, 1L)),
class = "data.frame", row.names = c(NA, -6L))
I would like to create variable "Time" which basically indicates the number of times variable ID showed up within each day minus 1. In other words, the count is lagged by 1 and the first time ID showed up in a day should be left blank. Second time the same ID shows up on a given day should be 1.
Basically, I want to create the "Time" variable in the example below.
ID Day Time Value
1  1        0
1  1   1    0
1  1   2    0
1  2        0
1  2   1    0
1  2   2    0
1  2   3    1
2  1        0
2  1   1    0
2  1   2    0
Below is the code I am working on. Have not been successful with it.
data$time<-data.frame(data$ID,count=ave(data$ID==data$ID, data$Day, FUN=cumsum))
We can do this with data.table. Convert the 'data.frame' to 'data.table' (setDT(df1)), grouped by 'ID', 'Day', we get the lag of sequence of rows (shift(seq_len(.N))) and assign (:=) it as "Time" column.
library(data.table)
setDT(df1)[, Time := shift(seq_len(.N)), .(ID, Day)]
df1
# ID Day Value Time
# 1: 1 1 0 NA
# 2: 1 1 0 1
# 3: 1 1 0 2
# 4: 1 2 0 NA
# 5: 1 2 0 1
# 6: 1 2 0 2
# 7: 1 2 1 3
# 8: 2 1 0 NA
# 9: 2 1 0 1
#10: 2 1 0 2
Or with base R
with(df1, ave(Day, Day, ID, FUN= function(x)
ifelse(seq_along(x)!=1, seq_along(x)-1, NA)))
#[1] NA 1 2 NA 1 2 3 NA 1 2
Or without the ifelse
with(df1, ave(Day, Day, ID, FUN= function(x)
NA^(seq_along(x)==1)*(seq_along(x)-1)))
#[1] NA 1 2 NA 1 2 3 NA 1 2
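The version without ifelse relies on how R evaluates powers of NA: NA^0 is 1, while NA^1 stays NA. A short sketch of that trick on a single group of four rows (s and time_one_group are hypothetical names):

```r
NA^TRUE   # TRUE is treated as 1, so the result is NA
NA^FALSE  # FALSE is treated as 0, and anything to the power 0 is 1

s <- 1:4  # stand-in for seq_along(x) within one ID/Day group

# First position is multiplied by NA, the rest by 1
time_one_group <- NA^(s == 1) * (s - 1)
time_one_group
```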
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L),
Day = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L), Value = c(0L,
0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L)), .Names = c("ID", "Day",
"Value"), row.names = c(NA, -10L), class = "data.frame")