I want to compare two date columns in a data frame and add a column indicating whether the date in "A" is <= the date in "B" (1) or later than it (0).
df <- data.frame( list (A=c("15-10-2000", "15-10-2000", "15-10-2000","20-10-2000"),
B=c("15-10-2000", "16-10-2000", "14-10-2000","19-10-2000")))
What I would like to add is a new column C = (1, 1, 0, 0).
I have tried:
df$C = ifelse (df$A <= df$B, 1, 0)
It works except for the "equal" comparison.
I get: C = (0, 1, 0, 0)
Sorry, I should add that before doing the comparison I converted the columns to Date, and it still does not work:
df$A= as.Date(df$A, format = "%d-%m-%Y")
df$B = as.Date(df$B, format = "%d-%m-%Y")
The date columns are factors (or character strings in R >= 4.0, where stringsAsFactors defaults to FALSE). You need to first convert them to Date class and then compare.
library(dplyr)
df %>%
mutate_at(vars(A:B), as.Date, format = "%d-%m-%Y") %>%
mutate(C = as.integer(A <= B))
# A B C
#1 2000-10-15 2000-10-15 1
#2 2000-10-15 2000-10-16 1
#3 2000-10-15 2000-10-14 0
#4 2000-10-20 2000-10-19 0
Or in base R that would be
df[1:2] <- lapply(df[1:2], as.Date, format = "%d-%m-%Y")
df$C <- as.integer(df$A <= df$B)
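To see why the conversion matters, here is a small sketch with made-up values (the vectors a and b are illustrative, not the asker's data). Text in day-month-year order is compared character by character, so the day decides the result before the month is even looked at.

```r
# Illustrative values only: 2 November vs 20 October
a <- c("15-10-2000", "02-11-2000")
b <- c("15-10-2000", "20-10-2000")

# Comparing the raw text: "02-..." sorts before "20-...", so the second
# pair looks like "a is earlier" even though November is after October
as_text <- a <= b

# Comparing real Date objects gives the chronological answer
as_date <- as.Date(a, format = "%d-%m-%Y") <= as.Date(b, format = "%d-%m-%Y")

as_text
as_date
```

The text comparison reports TRUE for both pairs, while the Date comparison reports TRUE for the first pair only.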
You should convert the factors to dates (as Jon Spring pointed out). Then it should work:
library(dplyr)
df %>%
mutate_all(lubridate::dmy) %>%
mutate(C = ifelse(A<=B,1,0))
A B C
1 2000-10-15 2000-10-15 1
2 2000-10-15 2000-10-16 1
3 2000-10-15 2000-10-14 0
4 2000-10-20 2000-10-19 0
Related
I am looking to use a conditional statement to select the rows whose date is before 0021-01-11 and which have an NA value in a specific column (People_vaccinated, for example). For those rows I want to impute zero.
I want to use an IF statement with (condition 1 AND condition 2).
Condition 1 can be df$People_vaccinated == NA and condition 2 can be df$date < 'given date'.
Maybe this will help -
df <- data.frame(Date = c('0021-01-07', '0021-01-08','0021-01-11', '0021-01-12'),
a = c(2, NA, 3, NA),
b = c(1, NA, 2, 3))
ind <- match('0021-01-11', df$Date)
df$a[1:ind][is.na(df$a[1:ind])] <- 0
df
# Date a b
#1 0021-01-07 2 1
#2 0021-01-08 0 NA
#3 0021-01-11 3 2
#4 0021-01-12 NA 3
Or using dplyr -
library(dplyr)
df <- df %>%
mutate(a = replace(a,
row_number() <= match('0021-01-11', Date) & is.na(a), 0))
df
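The two conditions from the question can also be combined directly into one logical index, which does not depend on the rows being sorted by date. This is only a sketch on the same toy data. Note that x == NA always returns NA, so is.na() is used for the first condition, and the comparison here is strictly "before" the cutoff (unlike the match() versions above, which include the cutoff row). The names cutoff and to_fill are made up for illustration.

```r
df <- data.frame(Date = c("0021-01-07", "0021-01-08", "0021-01-11", "0021-01-12"),
                 a = c(2, NA, 3, NA),
                 b = c(1, NA, 2, 3))

cutoff <- as.Date("0021-01-11")

# condition 1: value is missing; condition 2: date is strictly before the cutoff
to_fill <- is.na(df$a) & as.Date(df$Date) < cutoff

# impute zero only where both conditions hold
df$a[to_fill] <- 0
df
```

Only the NA on 0021-01-08 is replaced; the NA on 0021-01-12 is after the cutoff and stays missing.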
I would like to identify the binary columns in a data.frame and make a new data frame based on that condition.
For example, this table
my.table <-read.table(text="a,b,c
0,2,0
0.25,1,1
1,0,0", header=TRUE, as.is=TRUE,sep = ",")
You can keep the columns that contain only the values 0 and 1.
Filter(function(x) all(x %in% c(0, 1)), my.table)
# c
#1 0
#2 1
#3 0
A few other variations that do the same thing:
library(dplyr)
library(purrr)
#2
my.table[colSums(my.table == 0 | my.table == 1) == nrow(my.table)]
#3
my.table %>% select(where(~all(. %in% c(0, 1))))
#4
keep(my.table, ~all(. %in% c(0, 1)))
We can use base R
my.table[colSums(sapply(my.table, `%in%`, c(0, 1))) == nrow(my.table)]
# c
#1 0
#2 1
#3 0
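One thing the versions above do not cover is missing values: a column such as c(0, 1, NA) fails the %in% check. If NA should be tolerated, a small helper could look like the sketch below. The helper name is_binary and the NA in column c are assumptions added for illustration; they are not part of the original question.

```r
# Hypothetical helper: TRUE if a numeric column holds only 0/1, ignoring NA
is_binary <- function(x) {
  is.numeric(x) && all(x[!is.na(x)] %in% c(0, 1))
}

# Same shape as the example table, but with an NA added to column c
my.table <- data.frame(a = c(0, 0.25, 1),
                       b = c(2, 1, 0),
                       c = c(0, 1, NA))

# vapply returns one logical per column, used here to subset the columns
binary_df <- my.table[vapply(my.table, is_binary, logical(1))]
binary_df
```

Here only column c is kept, because a contains 0.25 and b contains 2.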
Suppose I have:
A <- c(1,0,0,0)
B <- c(0,1,0,0)
C <- c(0,0,1,0)
D <- c(0,0,0,1)
library(xts)
data <- xts(cbind(A,B,C,D),order.by = as.Date(1:4))
Then I get...
A B C D
1970-01-02 1 0 0 0
1970-01-03 0 1 0 0
1970-01-04 0 0 1 0
1970-01-05 0 0 0 1
I would like to extract the dates for each column where the value is 1.
So I want to see something like this...
A "1970-01-02"
B "1970-01-03"
C "1970-01-04"
D "1970-01-05"
Here's the manual way of getting the answer. So I basically want to run a loop that can do this...
index(data$A[data$A==1])
index(data$B[data$B==1])
index(data$C[data$C==1])
index(data$D[data$D==1])
If for a particular row there are multiple 1's and you want to return the index only once for that row, we can use rowSums and subset the index
zoo::index(data)[rowSums(data == 1) > 0]
#[1] "1970-01-02" "1970-01-03" "1970-01-04" "1970-01-05"
If we want index value for each 1, we can use which with arr.ind = TRUE
zoo::index(data)[which(data == 1, arr.ind = TRUE)[, 1]]
To get both column name as well as index, we can reuse the matrix from which
mat <- which(data == 1, arr.ind = TRUE)
data.frame(index = zoo::index(data)[mat[, 1]], column = colnames(data)[mat[,2]])
# index column
#1 1970-01-02 A
#2 1970-01-03 B
#3 1970-01-04 C
#4 1970-01-05 D
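If it is not obvious what which(..., arr.ind = TRUE) returns, this small sketch on a plain matrix (made-up values, no xts involved) may help. It gives one row per match, with the row position in the column "row" and the column position in the column "col".

```r
# Illustrative 3 x 2 matrix: column A has a 1 in row 1, column B in rows 2 and 3
m <- matrix(c(1, 0, 0,
              0, 1, 1),
            nrow = 3, dimnames = list(NULL, c("A", "B")))

# One row per matching cell, as (row, col) positions
pos <- which(m == 1, arr.ind = TRUE)
pos

# The positions can be mapped back to column names
matched_cols <- colnames(m)[pos[, "col"]]
matched_cols
```

In the answer above, pos[, 1] is what picks the dates out of the index and pos[, 2] is what picks the column names.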
Starting from your original data object, you can create a tibble first and then reshape it to long format to get your desired output:
library(tidyverse)
as_tibble(data) %>%
mutate(time = time(data)) %>%
gather("group", "value", -time) %>%
filter(value == 1) %>%
select(group, time)
Using sapply, I am returning, column by column, the dates (the index) of the rows that contain a 1. This should also work if there are multiple 1s in a column.
one_days <- as.Date(unlist(
sapply(1:ncol(data),
function(x) time(data)[which(data[, x] == 1)])))
# "1970-01-02" "1970-01-03" "1970-01-04" "1970-01-05"
If you want the column names as well:
rown <- unlist(sapply(1 : ncol(data), function(x) rep(colnames(data)[x], sum(data[, x]))))
names(one_days) <- rown
# A B C D
# "1970-01-02" "1970-01-03" "1970-01-04" "1970-01-05"
Testing for multiple 1s
A <- c(1,1,0,0)
data <- xts(cbind(A,B,C,D), order.by = as.Date(1:4)) # rebuild data with the new A
one_days <- as.Date(unlist(
sapply(1:ncol(data),
function(x) time(data)[which(data[, x] == 1)])))
rown <- unlist(sapply(1 : ncol(data), function(x) rep(colnames(data)[x], sum(data[, x]))))
names(one_days) <- rown
one_days
# A A B C D
#"1970-01-02" "1970-01-03" "1970-01-03" "1970-01-04" "1970-01-05"
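Since the question asked for a loop over the columns, here is one more sketch that keeps the result as a named list with one element per column. It assumes the xts package is installed; the name dates_by_col is made up for illustration.

```r
library(xts)

data <- xts(cbind(A = c(1, 0, 0, 0),
                  B = c(0, 1, 0, 0),
                  C = c(0, 0, 1, 0),
                  D = c(0, 0, 0, 1)),
            order.by = as.Date("1970-01-01") + 1:4)

# For each column name, keep the index entries where that column equals 1
dates_by_col <- lapply(colnames(data), function(nm) {
  index(data)[coredata(data)[, nm] == 1]
})
names(dates_by_col) <- colnames(data)

dates_by_col
```

A list is used rather than a vector so that a column with several 1s (or none) simply gets a longer (or empty) Date vector.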
I have a data frame with several numeric variables along with factors. I want to go over the numeric variables and replace the negative values with missing values, but I couldn't get it to work.
My alternative idea was to write a function that takes a data frame and a variable and does the replacement. That didn't work either.
My code is:
NegativeToMissing = function(df,var)
{
df$var[df$var < 0] = NA
}
Error in $<-.data.frame(`*tmp*`, "var", value = logical(0)) : replacement has 0 rows, data has 40
What am I doing wrong?
Thank you.
Here is an example with some dummy data.
df1 <- data.frame(col1 = c(-1, 1, 2, 0, -3),
col2 = 1:5,
col3 = LETTERS[1:5])
df1
# col1 col2 col3
#1 -1 1 A
#2 1 2 B
#3 2 3 C
#4 0 4 D
#5 -3 5 E
Now find columns that are numeric
numeric_cols <- sapply(df1, is.numeric)
And replace negative values
df1[numeric_cols] <- lapply(df1[numeric_cols], function(x) replace(x, x < 0 , NA))
df1
# col1 col2 col3
#1 NA 1 A
#2 1 2 B
#3 2 3 C
#4 0 4 D
#5 NA 5 E
You could also do
df1[df1 < 0] <- NA
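As for why the function in the question fails: df$var looks for a column literally named "var" rather than the column whose name is stored in var, and the function also never returns the modified data frame. A minimal base R sketch of a corrected version, assuming the column name is passed as a string (the name negative_to_missing is made up here):

```r
# Hypothetical rewrite of the asker's function
negative_to_missing <- function(df, var) {
  # [[ ]] uses the value of var as the column name, unlike $
  neg <- which(df[[var]] < 0)   # which() drops any existing NAs from the index
  df[[var]][neg] <- NA
  df                            # return the modified copy
}

df1 <- data.frame(col1 = c(-1, 1, 2, 0, -3),
                  col2 = 1:5)
df1 <- negative_to_missing(df1, "col1")
df1
```

The result has to be assigned back, because R modifies a copy of the data frame inside the function.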
With tidyverse, we can make use of mutate_if
library(tidyverse)
df1 %>%
mutate_if(is.numeric, ~ replace(., . < 0, NA))
If you still want to change only one selected variable, a solution with dplyr would be to use non-standard evaluation:
library(dplyr)
NegativeToMissing <- function(df, var) {
quo_var = quo_name(var)
df %>%
mutate(!!quo_var := ifelse(!!var < 0, NA, !!var))
}
NegativeToMissing(data, var=quo(val1)) # use quo() function without ""
# val1 val2
# 1 1 1
# 2 NA 2
# 3 2 3
Data used:
data <- data.frame(val1 = c(1, -1, 2),
val2 = 1:3)
data
# val1 val2
# 1 1 1
# 2 -1 2
# 3 2 3
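With a more recent dplyr (1.0 or later is assumed here), the same idea can be written with the embrace operator, so the caller does not need quo(). This is only a sketch; the function name negative_to_missing is made up.

```r
library(dplyr)

negative_to_missing <- function(df, var) {
  df %>%
    # "{{ var }}" on the left names the output column after the input column
    mutate("{{ var }}" := replace({{ var }}, which({{ var }} < 0), NA))
}

data <- data.frame(val1 = c(1, -1, 2),
                   val2 = 1:3)
result <- negative_to_missing(data, val1)   # bare column name, no quotes
result
```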
I have a data frame where column "A" has 6 distinct values. Column "B" has float values. By using dplyr, I can group by column "A" and find mean of column "B" of each group as follows:
mydf %>% group_by(A) %>% summarize(Mean = mean(B, na.rm=TRUE))
My ultimate aim is to find the rows in each group whose "B" values are higher than the group average. How can I achieve this (using base R or dplyr)?
A simple alternative with base R ave would be
df[df$b > ave(df$b, df$a) , ]
# a b
#4 1 4
#5 1 5
#9 2 9
#10 2 10
The default FUN for ave is mean, so there is no need to mention it explicitly. If there are NA values present in b, modify it to
df[df$b > ave(df$b, df$a, FUN = function(x) mean(x,na.rm = TRUE)) , ]
Another solution with subset and ave, as suggested by @Onyambu
subset(df,b>ave(b,a))
# a b
#4 1 4
#5 1 5
#9 2 9
#10 2 10
data
df <- data.frame(a = rep(c(1, 2), each = 5), b = 1:10)
df
# a b
#1 1 1
#2 1 2
#3 1 3
#4 1 4
#5 1 5
#6 2 6
#7 2 7
#8 2 8
#9 2 9
#10 2 10
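To make the ave step less of a black box: it returns a vector as long as the input, with every element replaced by the mean of its group, so it can be compared element by element with the original column. A small sketch on the same data (the name group_mean is made up):

```r
df <- data.frame(a = rep(c(1, 2), each = 5), b = 1:10)

# One group mean per row, aligned with df$b
group_mean <- ave(df$b, df$a)
group_mean

# Row positions where the value exceeds its own group's mean
above <- which(df$b > group_mean)
above
```

The group means are 3 for a == 1 and 8 for a == 2, which is why rows 4, 5, 9 and 10 are the ones kept.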
You can just group and then filter:
mydf %>%
group_by(A) %>%
filter(B > mean(B, na.rm = TRUE)) %>%
ungroup()
Using Base R, I would go for this. It is not as elegant as dplyr.
mean.df <- aggregate(mydf$b, by =list(a = mydf$a), FUN = mean)
names(mean.df)[2] <- "mean"
mydf <- merge(mydf, mean.df, by = "a")
# Rows whose values are higher than mean
new.df <- subset(mydf, b > mean, select = -mean)
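I like working with data tables. So a data.table solution would be,
library(data.table)
mydt <- data.table(mydf)
mydt[, mean := mean(b), by = a] # add the group mean as a column
new.dt <- mydt[b > mean, -c("mean"), with = FALSE] # with = FALSE so that -c("mean") drops that column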
Another way to do it using base R and tapply:
mydf = cbind.data.frame(A = sample(6, 20, replace = TRUE), B = runif(20))
mydf.ave = tapply(mydf$B,mydf$A,mean)
newdf = mydf[mydf$B > mydf.ave[as.character(mydf$A)],]
(thus the one-liner would be: mydf[mydf$B > tapply(mydf$B,mydf$A,mean)[as.character(mydf$A)],])