replace cell values greater than 0 with column name - r

I have a dataframe with the following structure:
Df = data.frame(
Col1 = c(1,0,0),
Col2 = c(0,2,1),
Col3 = c(0,0,0)
)
What I'm trying to get is a dataframe where those cells with a value greater than 0 get replaced with the column name and those lower than 1 get replaced by NA. The resulting dataframe would be something like this:
Df = data.frame(
Col1 = c("Col1",NA,NA),
Col2 = c(NA,"Col2","Col2"),
Col3 = c(NA,NA,NA)
)
So far I have tried this solution and functions like apply(), mutate_if(), and across(), but I can't get what I'm after.

You could do:
library(dplyr)
Df %>%
  mutate(across(everything(), ~ if_else(. > 0, cur_column(), NA_character_)))
Col1 Col2 Col3
1 Col1 <NA> <NA>
2 <NA> Col2 <NA>
3 <NA> Col2 <NA>
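For comparison, the same replacement can be sketched in base R without dplyr. This is only an illustration and assumes every column is numeric; `res` is a name introduced here, not part of the original answer.

```r
# Example data from the question
Df <- data.frame(
  Col1 = c(1, 0, 0),
  Col2 = c(0, 2, 1),
  Col3 = c(0, 0, 0)
)

# Walk over the columns and their names in parallel. ifelse() is appropriate
# here because the test (x > 0) has one element per row.
res <- Df
res[] <- Map(function(x, nm) ifelse(x > 0, nm, NA_character_), Df, names(Df))

print(res)
```

Assigning into `res[]` keeps the data frame structure while replacing every column.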

Related

dplyr ifelse mutate reference to variable outside the data frame

I have a simple problem, but I haven't figured out the solution yet. I don't know how to refer to a variable outside the data frame when I'm using dplyr. Here is a small chunk of code:
library(dplyr)
var <- 1
df <- data.frame(col1 = c("a", "b", "c"), col2 = c(1, 2, 3))
df %>% mutate(col2 = ifelse(var == 1, col2 + var, col2))
Result:
col1 col2
1 a 2
2 b 2
3 c 2
Desired output:
col1 col2
1 a 2
2 b 3
3 c 4
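This is not a dplyr-specific issue. ifelse() returns a result of the same length as its test, so a length-1 condition yields a length-1 result, which mutate() then recycles to every row. When the condition to check has length 1, use if and else instead of the vectorized ifelse.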
library(dplyr)
df %>% mutate(col2 = if(var == 1) col2 + var else col2)
# col1 col2
#1 a 2
#2 b 3
#3 c 4
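The difference between the two forms can be seen outside mutate(). The following is a small sketch using plain vectors; `short` and `full` are names introduced here for illustration.

```r
var  <- 1
col2 <- c(1, 2, 3)

# ifelse() shapes its result like the test. var == 1 has length 1,
# so only the first element of col2 + var is returned.
short <- ifelse(var == 1, col2 + var, col2)

# if / else evaluates the scalar condition once and returns the whole vector.
full <- if (var == 1) col2 + var else col2

print(short)
print(full)
```

Inside mutate(), the length-1 value from ifelse() is recycled to every row, which is why the question's result shows the same number in each row.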
We could use rowwise and sum
df %>%
rowwise() %>%
mutate(col2 = ifelse(var == 1, sum(col2,var), col2))
col1 col2
<chr> <dbl>
1 a 2
2 b 3
3 c 4
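In base R, we could update only the rows where col2 equals var (note that this gives a different result from the desired output above):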
i1 <- df$col2 == var
df$col2[i1] <- df$col2[i1] + var
Output:
> df
col1 col2
1 a 2
2 b 2
3 c 3
Or use data.table
library(data.table)
setDT(df)[col2 == var, col2 := col2 + var]

How to delete duplicate rows (the shorter ones) based on certain columns?

Suppose I have the following df
df <- data.frame(col1 = c(1, 3, 1), col2 = c(2, 4, 2), col3 = c(NA, NA, "c"))
> df
col1 col2 col3
1 1 2 <NA>
2 3 4 <NA>
3 1 2 c
My goal is to delete all duplicate rows based on col1 and col2 such that the longer row "survives". In this case, the first row should be deleted. I tried
df[duplicated(df[, 1:2]), ]
but this gives me only the third row (and not the third and the second one). How to do it properly?
EDIT: The real df has 15 columns, of which the first 13 are used for identifying duplicates. In the last two columns roughly 2/3 of the rows are filled with NAs (the first 13 columns do not contain any NAs). Thus, my example df was misleading in the sense that there are two columns to be excluded for identifying the duplicates. I am sorry for that.
You can try this:
library(dplyr)
df %>% group_by(col1,col2) %>%
slice(which.min(is.na(col3)))
or this :
df %>%
group_by(col1,col2) %>%
arrange(col3) %>%
slice(1)
# # A tibble: 2 x 3
# # Groups: col1, col2 [2]
# col1 col2 col3
# <dbl> <dbl> <fctr>
# 1 1 2 c
# 2 3 4 NA
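The first variant relies on how which.min() treats logical values. A minimal sketch of that behaviour, using a made-up single group (`first_non_na` and `all_na` are names introduced here):

```r
# One group where the first and third rows have NA in col3
col3 <- c(NA, "c", NA)

# is.na() gives TRUE/FALSE; which.min() treats FALSE as 0 and TRUE as 1,
# so it returns the position of the first FALSE (the first non-NA value).
first_non_na <- which.min(is.na(col3))

# If every value is NA, all entries tie and the first position is returned,
# so a group without any filled row still keeps one row.
all_na <- which.min(is.na(c(NA, NA)))

print(first_non_na)
print(all_na)
```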
A GENERAL SOLUTION
This more general solution keeps a single row per value of col1, namely the one with the fewest NAs; see the comment in the code for adding col2 to the grouping variables. It assumes all NAs are on the right.
df %>% mutate(nna = df %>% is.na %>% rowSums) %>%
group_by(col1) %>% # or group_by(col1,col2)
slice(which.min(nna)) %>%
select(-nna)
df <- data.frame(col1 = c(1, 3, 1), col2 = c(2, 4, 2), col3 = c(NA, NA, "c"))
df <- df[order(df$col3),]
duplicates <- duplicated(df[,1:2])
duplicates_sub <- subset(df , duplicates == FALSE)
> duplicates_sub
col1 col2 col3
3 1 2 c
2 3 4 <NA>
EDIT: Keep all non-NA rows
df <- data.frame(col1 = c(1, 3, 1,3, 1), col2 = c(2, 4, 2,4, 2), col3 = c("a", NA, "c",NA, "b"))
df <- df[order(df$col3),]
duplicates <- duplicated(df[,1:2]) & is.na(df[,3])
duplicates_sub <- subset(df , duplicates == FALSE)
> duplicates_sub
col1 col2 col3
1 1 2 a
5 1 2 b
3 1 2 c
2 3 4 <NA>
You can sort NAs to the top or bottom before dropping dupes:
# in base, which puts NAs last
odf = df[do.call(order, df), ]
odf[!duplicated(odf[, c("col1", "col2")]), ]
# col1 col2 col3
# 3 1 2 c
# 2 3 4 <NA>
# or with data.table, which puts NAs first
library(data.table)
DF = setorder(data.table(df))
unique(DF, by=c("col1", "col2"), fromLast=TRUE)
# col1 col2 col3
# 1: 1 2 c
# 2: 3 4 NA
This approach cannot be taken with dplyr, which doesn't offer "sort by all columns" in arrange, nor fromLast in distinct.
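In later dplyr releases (1.0.0 and newer, which is an assumption about what is installed), across(everything()) inside arrange() gives a sort by all columns. Since arrange() places NAs last, a similar result can be sketched without fromLast; `deduped` is a name introduced here.

```r
library(dplyr)

df <- data.frame(col1 = c(1, 3, 1), col2 = c(2, 4, 2), col3 = c(NA, NA, "c"),
                 stringsAsFactors = FALSE)

deduped <- df %>%
  arrange(across(everything())) %>%        # NAs in col3 sort after "c"
  distinct(col1, col2, .keep_all = TRUE)   # keep the first row per (col1, col2)

print(deduped)
```

With more identifying columns, the column names passed to distinct() would need to list all of them.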

How do you clear column elements from an R data frame based on another column's elements in the same data frame?

I have the following data frame
>data.frame
col1 col2
     A
x    B
     C
     D
y    E
I need a new data frame that looks like:
>new.data.frame
col1 col2
     A
x
     C
     D
y
I just need a method for reading from col1 and, if there are any characters in col1, clearing the corresponding row value of col2. I was thinking about using an if statement and data.table for this, but I am unsure how to delete col2's values based on any characters being present in col1.
Something like this works:
# Create data frame
dat <- data.frame(col1=c(NA,"x", NA, NA, "y"), col2=c("A", "B", "C", "D", "E"))
# Create new data frame
dat_new <- dat
dat_new$col2[!is.na(dat_new$col1)] <- NA
# Check that it worked
dat
dat_new
This depends on what you mean by 'remove'. Here I'm assuming a blank string "". However, the same principle will apply for NAs
## create data frame
df <- data.frame(col1 = c("", "x", "","", "y"),
col2 = LETTERS[1:5],
stringsAsFactors = FALSE)
df
#   col1 col2
# 1         A
# 2    x    B
# 3         C
# 4         D
# 5    y    E
## select the rows where col1 is not blank, and blank out the values in col2
df[df$col1 != "",]$col2 <- ""
## or df$col2[df$col1 != ""] <- ""
df
#   col1 col2
# 1         A
# 2    x
# 3         C
# 4         D
# 5    y
And as you mentioned data.table, the code for this would be
library(data.table)
setDT(df)
## filter by non-blank entries in col1, and update col2 by-reference (:=)
df[col1 != "", col2 := ""]
df
Using dplyr
library(dplyr)
df %>%
mutate(col2 = replace(col2, col1!="", ""))
#  col1 col2
#1         A
#2    x
#3         C
#4         D
#5    y
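If col1 might contain both NA and empty strings, the two checks can be combined. This is a sketch under that assumption; the data and the name `has_value` are made up for illustration.

```r
dat <- data.frame(
  col1 = c(NA, "x", "", NA, "y"),
  col2 = c("A", "B", "C", "D", "E"),
  stringsAsFactors = FALSE
)

# TRUE only where col1 holds an actual non-empty string.
# The is.na() check comes first so that NA entries yield FALSE, not NA.
has_value <- !is.na(dat$col1) & dat$col1 != ""

# Clear col2 wherever col1 has a value
dat$col2[has_value] <- ""

print(dat)
```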

Only Keep Certain Combinations of Predictors in a Dataframe

Imagine that I have a data frame like this:
> col1 <- rep(1:3,10)
> col2 <- rep(c("a","b"),15)
> col3 <- rnorm(30,10,2)
> sample_df <- data.frame(col1 = col1, col2 = col2, col3 = col3)
> head(sample_df)
col1 col2 col3
1 1 a 13.460322
2 2 b 3.404398
3 3 a 8.952066
4 1 b 11.148271
5 2 a 9.808366
6 3 b 9.832299
I only want to keep combinations of predictors which, together, have a col3 standard deviation below 2. I can find the combinations using ddply, but I don't know how to backtrack to the original DF and select the correct levels.
> library(plyr)
> sample_df_summ <- ddply(sample_df, .(col1, col2), summarize, sd = sd(col3), count = length(col3))
> head(sample_df_summ)
col1 col2 sd count
1 1 a 2.702328 5
2 1 b 1.032371 5
3 2 a 2.134151 5
4 2 b 3.348726 5
5 3 a 2.444884 5
6 3 b 1.409477 5
For clarity, in this example, I'd like the DF with col1 = 3 and col2 = b, and col1 = 1 and col2 = b. How would I do this?
You can add a "keep" column that is TRUE only if the standard deviation is below 2. Then, you can use a left join (merge) to add the "keep" column to the initial dataframe. In the end, you just select with keep equal to TRUE.
# add the keep column
sample_df_summ$keep <- sample_df_summ$sd < 2
sample_df_summ$sd <- NULL
sample_df_summ$count <- NULL
# join and select the rows
sample_df_keep <- merge(sample_df, sample_df_summ, by = c("col1", "col2"), all.x = TRUE, all.y = FALSE)
sample_df_keep <- sample_df_keep[sample_df_keep$keep, ]
sample_df_keep$keep <- NULL
Using dplyr:
library(dplyr)
sample_df %>% group_by(col1, col2) %>% mutate(sd = sd(col3)) %>% filter(sd < 2)
You get:
#Source: local data frame [6 x 4]
#Groups: col1, col2
#
# col1 col2 col3 sd
#1 1 a 10.516437 1.4984853
#2 1 b 11.124843 0.8652206
#3 2 a 7.585740 1.8781241
#4 3 b 9.806124 1.6644076
#5 1 a 7.381209 1.4984853
#6 1 b 9.033093 0.8652206
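The grouped filter can also be sketched in base R with ave(), which returns each row's group statistic aligned to the original rows, so no join back is needed. The seed is an arbitrary choice for reproducibility, so the values will differ from the ones shown above; `group_sd` and `kept` are names introduced here.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
sample_df <- data.frame(
  col1 = rep(1:3, 10),
  col2 = rep(c("a", "b"), 15),
  col3 = rnorm(30, 10, 2)
)

# Standard deviation of col3 within each (col1, col2) combination,
# repeated for every row of that combination
group_sd <- ave(sample_df$col3, sample_df$col1, sample_df$col2, FUN = sd)

# Keep only the rows whose combination has a standard deviation below 2
kept <- sample_df[group_sd < 2, ]

print(kept)
```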

R parallel execution

I have a dataframe containing 5 columns
COL1 | COL2 | COL3 | COL4 | COL5
I need to aggregate on COL1 and apply 4 different functions to columns COL2 through COL5
a1<-aggregate( COL2 ~ COL1, data = dataframe, sum)
a2<-aggregate( COL3 ~ COL1, data = dataframe, length)
a3<-aggregate( COL4 ~ COL1, data = dataframe, max)
a4<-aggregate( COL5 ~ COL1, data = dataframe, min)
finalDF<- Reduce(function(x, y) merge(x, y, all=TRUE), list(a1,a2,a3,a4))
1) I have 24 cores on the machine.
How can I execute the above 4 lines of code (a1, a2, a3, a4) in parallel?
I want to use 4 cores simultaneously and then use Reduce to compute finalDF.
2) Can I use a different function on each column in one aggregate call?
(I can use one function on multiple columns, and I can also use multiple functions on one column in aggregate, but I was unable to apply different functions to different columns:
[COL2-sum, COL3-length, COL4-max, COL5-min].)
This is an example of how you might do it with dplyr, as suggested by @Roland
set.seed(2)
df <- data.frame(COL1 = sample(LETTERS, 1e6, replace=T),
COL2 = rnorm(1e6),
COL3 = runif(1e6, 100, 1000),
COL4 = rnorm(1e6, 25, 100),
COL5 = runif(1e6, -100, 10))
#> head(df)
# COL1 COL2 COL3 COL4 COL5
#1 E 1.0579823 586.2360 -3.157057 -14.462318
#2 S 0.1238110 872.3868 129.579090 9.525772
#3 O 0.4902512 498.0537 93.063487 1.910506
#4 E 1.7215843 200.7077 126.716256 -5.865204
#5 Y 0.6515853 275.3369 12.554218 -26.301225
#6 Y 0.7959678 134.4977 54.789415 -33.145334
library(dplyr)
df <- df %>%
group_by(COL1) %>%
summarize(a1 = sum(COL2),
a2 = length(COL3),
a3 = max(COL4),
a4 = min(COL5)) #add as many calculations as you like
On my machine this took 0.064 seconds.
#> head(df)
#Source: local data frame [6 x 5]
#
# COL1 a1 a2 a3 a4
#1 A -0.9068368 38378 403.4208 -99.99943
#2 B 6.0557452 38551 419.0970 -99.99449
#3 C 108.5680251 38673 491.8061 -99.99382
#4 D -34.1217133 38469 481.0626 -99.99697
#5 E -68.2998926 38168 452.8280 -99.99602
#6 F -185.9059338 38159 417.2271 -99.99995
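For the first part of the question (running the four aggregations on separate cores), one possible sketch uses mclapply() from the parallel package. The smaller data set, the `jobs` list and `n_cores` are assumptions made for illustration. For summaries this cheap, the cost of starting worker processes may well outweigh any gain, so the single grouped summary above is likely the simpler choice.

```r
library(parallel)  # ships with base R

set.seed(2)
dataframe <- data.frame(
  COL1 = sample(LETTERS[1:5], 1000, replace = TRUE),
  COL2 = rnorm(1000),
  COL3 = runif(1000, 100, 1000),
  COL4 = rnorm(1000, 25, 100),
  COL5 = runif(1000, -100, 10)
)

# One summary function per column, as in the question
jobs <- list(COL2 = sum, COL3 = length, COL4 = max, COL5 = min)

# mclapply() forks worker processes; forking is unavailable on Windows,
# so fall back to a single core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 4L

# Each worker aggregates one column by COL1
parts <- mclapply(names(jobs), function(col) {
  aggregate(dataframe[col], by = dataframe["COL1"], FUN = jobs[[col]])
}, mc.cores = n_cores)

# Merge the four per-column results on COL1
finalDF <- Reduce(function(x, y) merge(x, y, all = TRUE), parts)

print(finalDF)
```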
