I have a df that looks like the one below.
> a <- data.frame(col1=c(1, 2),col2=c(10,11))
> a
  col1 col2
1    1   10
2    2   11
Then I want to fill two extra columns, col3 and col4, conditionally:
if col1 == 1, then copy col2 to col3 and fill col4 with 0;
if col1 == 2, then copy col2 to col4 and fill col3 with 0.
Finally I should see a df like this:
  col1 col2 col3 col4
1    1   10   10    0
2    2   11    0   11
Are there any good packages or base R functions that can do this?
A case for dplyr:
library(dplyr)
a <- data.frame(col1 = c(1, 2), col2 = c(10, 11))
a %>%
  mutate(col3 = case_when(col1 == 1 ~ col2,
                          col1 == 2 ~ 0),
         col4 = case_when(col1 == 2 ~ col2,
                          col1 == 1 ~ 0))
  col1 col2 col3 col4
1    1   10   10    0
2    2   11    0   11
This will fill col3 and col4 with NAs if col1 is neither 1 nor 2. An ifelse statement such as the one from Anatolii is also possible, but in my opinion it gets unwieldy once it has to be this general:
library(dplyr)
a <- data.frame(col1 = c(1, 2), col2 = c(10, 11))
a %>%
  mutate(col3 = ifelse(col1 == 1, col2,
                       ifelse(col1 == 2, 0, NA)),
         col4 = ifelse(col1 == 1, 0,
                       ifelse(col1 == 2, col2, NA)))
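To see the NA behaviour, here is a quick check with an extra, made-up row whose col1 matches neither branch:

b <- data.frame(col1 = c(1, 2, 3), col2 = c(10, 11, 12))  # row with col1 = 3 is made up
b %>%
  mutate(col3 = case_when(col1 == 1 ~ col2,
                          col1 == 2 ~ 0),
         col4 = case_when(col1 == 2 ~ col2,
                          col1 == 1 ~ 0))
#>   col1 col2 col3 col4
#> 1    1   10   10    0
#> 2    2   11    0   11
#> 3    3   12   NA   NA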
You can do this easily with an ifelse statement:
library(dplyr)
a <- data.frame(col1 = c(1, 2), col2 = c(10, 11))
a %>%
  mutate(col3 = ifelse(col1 == 1, col2, 0),
         col4 = ifelse(col1 == 2, col2, 0))
  col1 col2 col3 col4
1    1   10   10    0
2    2   11    0   11
I think you might be overcomplicating it by looking for a package or an if-else. A simple index should do it:
col1 <- c(1,2)
col2 <- c(10,11)
a <- data.frame(col1,col2,col3 = 0, col4 = 0)
a$col3[which(col1 == 1)] <- col2[which(col1 == 1)]
a$col4[which(col1 == 2)] <- col2[which(col1 == 2)]
results in
  col1 col2 col3 col4
1    1   10   10    0
2    2   11    0   11
The which(col1 == 1) call picks out the rows where col1 equals 1, so col2 is copied into col3 only for those rows; the same logic fills col4 using which(col1 == 2).
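As a side note, the which() calls are not strictly required here; plain logical indexing does the same thing:

# equivalent logical indexing, without which()
a$col3[col1 == 1] <- col2[col1 == 1]
a$col4[col1 == 2] <- col2[col1 == 2]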
I have a dataframe with the following structure:
Df = data.frame(
  Col1 = c(1, 0, 0),
  Col2 = c(0, 2, 1),
  Col3 = c(0, 0, 0)
)
What I'm trying to get is a dataframe where those cells with a value greater than 0 get replaced with the column name and those lower than 1 get replaced by NA. The resulting dataframe would be something like this:
Df = data.frame(
  Col1 = c("Col1", NA, NA),
  Col2 = c(NA, "Col2", "Col2"),
  Col3 = c(NA, NA, NA)
)
So far I have tried this solution and functions like apply(), mutate_if(), and across(), but I can't get what I'm after.
You could do:
Df %>%
  mutate(across(everything(), ~ if_else(. > 0, cur_column(), NA_character_)))
  Col1 Col2 Col3
1 Col1 <NA> <NA>
2 <NA> Col2 <NA>
3 <NA> Col2 <NA>
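For comparison, a base R sketch of the same idea, assuming every column should be treated this way:

# pair each column with its name and rebuild the data frame in place
Df[] <- Map(function(x, nm) ifelse(x > 0, nm, NA_character_), Df, names(Df))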
I have an example df:
df <- data.frame(
  col1 = c(4, 5, 6, 11),
  col2 = c('b', 'b', 'c', 'b')
)
> df
  col1 col2
1    4    b
2    5    b
3    6    c
4   11    b
and I mutate based on these conditions:
df2 <- df %>%
  mutate(col3 = case_when(
    col2 == 'b' & col1 == 4 ~ 10,
    col2 == 'b' & col1 == 5 ~ 15,
    col2 == 'b' & col1 == 11 ~ 20,
    col2 == 'c' & col1 == 6 ~ 7)
  )
> df2
  col1 col2 col3
1    4    b   10
2    5    b   15
3    6    c    7
4   11    b   20
You can see that the first three conditions all require col2 == 'b'. Is there some syntax, another package, or a more efficient way of combining the same or similar conditions so that I don't need to repeat col2 == 'b'? Something like a one-liner saying: if col2 == 'b', then apply these transformations.
You can write a nested case_when:
df %>% mutate(col3 = case_when(
  col2 == "b" ~ case_when(
    col1 == 4 ~ 10,
    col1 == 5 ~ 15,
    col1 == 11 ~ 20),
  col2 == "c" ~ case_when(
    col1 == 6 ~ 7)))
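As another rough sketch: when all the col2 == "b" branches are plain equality lookups on col1, a named vector can stand in for them (b_map is a made-up helper name here):

# collapse the col2 == "b" branches into a named lookup vector
b_map <- c(`4` = 10, `5` = 15, `11` = 20)
df %>%
  mutate(col3 = case_when(
    col2 == "b" ~ unname(b_map[as.character(col1)]),
    col2 == "c" & col1 == 6 ~ 7))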
Another solution could be using an auxiliary table. This limits the flexibility of case_when and only works for equality matches, but I suspect it is a lot faster.
library(dplyr)
df <- data.frame(
  col1 = c(4, 5, 6, 11),
  col2 = c('b', 'b', 'c', 'b')
)
choices <- data.frame(
  col2 = c(rep("b", 3), "c"),
  col1 = c(4, 5, 11, 6),
  col3 = c(10, 15, 20, 7)
)
df %>% left_join(choices, by = c("col1", "col2"))
#>   col1 col2 col3
#> 1    4    b   10
#> 2    5    b   15
#> 3    6    c    7
#> 4   11    b   20
Be sure to handle the default for unmatched cases.
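If a (col1, col2) pair is missing from choices, the join leaves NA in col3. One way to fill in a default afterwards (the 0 below is just an assumed default):

# turn the NAs produced by unmatched join keys into a default value
df %>%
  left_join(choices, by = c("col1", "col2")) %>%
  mutate(col3 = coalesce(col3, 0))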
I have a simple problem but I haven't figured out the solution yet. I don't know how to refer to a variable outside the data frame when I'm using dplyr. Here is a small chunk of code:
library(dplyr)
var <- 1
df <- data.frame(col1 = c("a", "b", "c"), col2 = c(1, 2, 3))
df %>% mutate(col2 = ifelse(var == 1, col2 + var, col2))
Result:
  col1 col2
1    a    2
2    b    2
3    c    2
Desired output:
  col1 col2
1    a    2
2    b    3
3    c    4
This is not a dplyr-specific issue: when the condition you want to check has length 1, use if and else instead of the vectorized ifelse.
library(dplyr)
df %>% mutate(col2 = if(var == 1) col2 + var else col2)
#   col1 col2
# 1    a    2
# 2    b    3
# 3    c    4
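The reason the original attempt returned 2 for every row: ifelse() truncates its result to the length of its test, so a length-1 test yields a length-1 result that mutate() then recycles. A quick illustration:

# ifelse() returns a result as long as its test, not as long as yes/no
ifelse(TRUE, 1:3, 4:6)
#> [1] 1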
We could use rowwise and sum:
df %>%
  rowwise() %>%
  mutate(col2 = ifelse(var == 1, sum(col2, var), col2))
  col1   col2
  <chr> <dbl>
1 a         2
2 b         3
3 c         4
We could use base R for this. Since the condition is on var, not on the column, check it once and update the whole column:

if (var == 1) df$col2 <- df$col2 + var

-output

> df
  col1 col2
1    a    2
2    b    3
3    c    4

Or use data.table:

library(data.table)
setDT(df)[, col2 := if (var == 1) col2 + var else col2]
Suppose I have the following df
df <- data.frame(col1 = c(1, 3, 1), col2 = c(2, 4, 2), col3 = c(NA, NA, "c"))
> df
  col1 col2 col3
1    1    2 <NA>
2    3    4 <NA>
3    1    2    c
My goal is to delete all duplicate rows based on col1 and col2 such that the longer row "survives". In this case, the first row should be deleted. I tried
df[duplicated(df[, 1:2]), ]
but this gives me only the third row (and not the third and the second one). How to do it properly?
EDIT: The real df has 15 columns, of which the first 13 are used for identifying duplicates. In the last two columns roughly 2/3 of the rows are filled with NAs (the first 13 columns do not contain any NAs). Thus, my example df was misleading in the sense that there are two columns to be excluded for identifying the duplicates. I am sorry for that.
You can try this:

library(dplyr)
df %>%
  group_by(col1, col2) %>%
  slice(which.min(is.na(col3)))

(which.min(is.na(col3)) returns the index of the first row whose col3 is not NA, since FALSE sorts before TRUE.)
or this:

df %>%
  group_by(col1, col2) %>%
  arrange(col3) %>%
  slice(1)
# # A tibble: 2 x 3
# # Groups: col1, col2 [2]
#    col1  col2 col3
#   <dbl> <dbl> <fctr>
# 1     1     2 c
# 2     3     4 NA
A GENERAL SOLUTION
With the most general solution there can be only one row per value of col1; see the comment in the code below to add col2 to the grouping variables. It assumes all NAs are on the right.
df %>%
  mutate(nna = df %>% is.na %>% rowSums) %>%
  group_by(col1) %>% # or group_by(col1, col2)
  slice(which.min(nna)) %>%
  select(-nna)
df <- data.frame(col1 = c(1, 3, 1), col2 = c(2, 4, 2), col3 = c(NA, NA, "c"))
df <- df[order(df$col3), ]
duplicates <- duplicated(df[, 1:2])
duplicates_sub <- subset(df, duplicates == FALSE)
> duplicates_sub
  col1 col2 col3
3    1    2    c
2    3    4 <NA>
EDIT: Keep all non-NA rows
df <- data.frame(col1 = c(1, 3, 1, 3, 1), col2 = c(2, 4, 2, 4, 2), col3 = c("a", NA, "c", NA, "b"))
df <- df[order(df$col3), ]
duplicates <- duplicated(df[, 1:2]) & is.na(df[, 3])
duplicates_sub <- subset(df, duplicates == FALSE)
> duplicates_sub
  col1 col2 col3
1    1    2    a
5    1    2    b
3    1    2    c
2    3    4 <NA>
You can sort NAs to the top or bottom before dropping dupes:
# in base, which puts NAs last
odf = df[do.call(order, df), ]
odf[!duplicated(odf[, c("col1", "col2")]), ]
#   col1 col2 col3
# 3    1    2    c
# 2    3    4 <NA>
# or with data.table, which puts NAs first
library(data.table)
DF = setorder(data.table(df))
unique(DF, by = c("col1", "col2"), fromLast = TRUE)
#    col1 col2 col3
# 1:    1    2    c
# 2:    3    4   NA
This approach cannot be taken with dplyr, which doesn't offer "sort by all columns" in arrange, nor fromLast in distinct.
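(For what it's worth, a rough approximation is possible in more recent dplyr versions, assuming arrange(across(everything())) is available in yours:)

# sort by all columns, NAs last, then keep the first row per key
df %>%
  arrange(across(everything())) %>%
  distinct(col1, col2, .keep_all = TRUE)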
Imagine that I have a data frame like this:
> col1 <- rep(1:3,10)
> col2 <- rep(c("a","b"),15)
> col3 <- rnorm(30,10,2)
> sample_df <- data.frame(col1 = col1, col2 = col2, col3 = col3)
> head(sample_df)
  col1 col2      col3
1    1    a 13.460322
2    2    b  3.404398
3    3    a  8.952066
4    1    b 11.148271
5    2    a  9.808366
6    3    b  9.832299
I only want to keep combinations of predictors which, together, have a col3 standard deviation below 2. I can find the combinations using ddply, but I don't know how to backtrack to the original DF and select the correct levels.
> sample_df_summ <- ddply(sample_df, .(col1, col2), summarize, sd = sd(col3), count = length(col3))
> head(sample_df_summ)
  col1 col2       sd count
1    1    a 2.702328     5
2    1    b 1.032371     5
3    2    a 2.134151     5
4    2    b 3.348726     5
5    3    a 2.444884     5
6    3    b 1.409477     5
For clarity, in this example I'd like the rows with col1 = 3, col2 = b and with col1 = 1, col2 = b. How would I do this?
You can add a "keep" column that is TRUE only if the standard deviation is below 2. Then you can use a left join (merge) to add the "keep" column to the initial data frame. In the end, you just select the rows where keep is TRUE.
# add the keep column
sample_df_summ$keep <- sample_df_summ$sd < 2
sample_df_summ$sd <- NULL
sample_df_summ$count <- NULL
# join and select the rows
sample_df_keep <- merge(sample_df, sample_df_summ, by = c("col1", "col2"), all.x = TRUE, all.y = FALSE)
sample_df_keep <- sample_df_keep[sample_df_keep$keep, ]
sample_df_keep$keep <- NULL
Using dplyr:
library(dplyr)
sample_df %>% group_by(col1, col2) %>% mutate(sd = sd(col3)) %>% filter(sd < 2)
You get:
#Source: local data frame [6 x 4]
#Groups: col1, col2
#
#  col1 col2      col3        sd
#1    1    a 10.516437 1.4984853
#2    1    b 11.124843 0.8652206
#3    2    a  7.585740 1.8781241
#4    3    b  9.806124 1.6644076
#5    1    a  7.381209 1.4984853
#6    1    b  9.033093 0.8652206
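If you don't need to keep the sd column, you can also filter directly on the grouped data:

# filter whole groups; sd(col3) is evaluated once per group
sample_df %>%
  group_by(col1, col2) %>%
  filter(sd(col3) < 2) %>%
  ungroup()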