Exchange the values between two columns based on a condition using R - r

I have the following df. If the value in the dm column is less than 20000, that value should move to the nd column. Similarly, if the value in the nd column is greater than 20000, that value should move to the dm column.
structure(list(id = c(1, 2, 3), nd = c(NA, 20076, NA), dm = c(10113,
NA, 10188)), class = "data.frame", row.names = c(NA, -3L))
I want my final df to look like this
structure(list(id = c(1, 2, 3), nd = c(10113, NA, 10188), dm = c(NA,
20076, NA)), class = "data.frame", row.names = c(NA, -3L))
Thank you

ifelse is your friend for this.
base R
transform(df,
nd = ifelse(dm < 20000, dm, nd),
dm = ifelse(nd > 20000, nd, dm)
)
# id nd dm
# 1 1 10113 NA
# 2 2 NA 20076
# 3 3 10188 NA
Note that this works in base R because unlike dplyr::mutate, the calculation for the dm= (second) expression (and beyond) does not see the change from the previous expressions, so the nd that it sees is the original, unchanged nd.
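A small sketch of that evaluation-order difference, using a made-up one-column data frame (demo and its column a are illustrative only, not the df from the question). In transform, every expression is evaluated against the original columns, so b is computed from the untouched a:

```r
# Hypothetical toy data, not the df from the question
demo <- data.frame(a = 1:3)

# Both expressions are evaluated against the ORIGINAL columns of demo,
# so b is built from the old a even though a is overwritten in the same call
res <- transform(demo, a = a * 10, b = a + 1)

res$a  # 10 20 30
res$b  # 2 3 4
```

With dplyr::mutate the same two expressions would give b = 11, 21, 31, because the second expression sees the updated a.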
We can also use the temporary-variable trick illustrated in the dplyr example below:
df |>
transform(
nd2 = ifelse(dm < 20000, dm, nd),
dm2 = ifelse(nd > 20000, nd, dm)
) |>
subset(select = -c(nd, dm))
and then rename nd2 to nd (etc).
dplyr
Because mutate "sees" the changes immediately, we need to store into other variables and then reassign.
library(dplyr)
df %>%
mutate(
nd2 = ifelse(dm < 20000, dm, nd),
dm2 = ifelse(nd > 20000, nd, dm)
) %>%
select(-nd, -dm) %>%
rename(nd=nd2, dm=dm2)
# id nd dm
# 1 1 10113 NA
# 2 2 NA 20076
# 3 3 10188 NA

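Another base R option using apply. Note that x[c(1, 3, 2)] moves the names along with the values, so the result is the original data with its columns reordered (id, dm, nd) and still needs renaming: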
as.data.frame(t(apply(df, 1, function(x) {
if(x[2] > 20000 | x[3] < 20000) x[c(1, 3, 2)] else x})))
#> id dm nd
#> 1 1 10113 NA
#> 2 2 NA 20076
#> 3 3 10188 NA
Created on 2023-02-18 with reprex v2.0.2
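As a further sketch (not taken from the answers above), the swap can also be written as a row-indexed assignment. which() drops the NA results of the comparisons, and assigning one data frame slice into another fills columns by position, so the two columns trade values on the selected rows. The name swap_rows is made up for this example.

```r
df <- data.frame(id = c(1, 2, 3),
                 nd = c(NA, 20076, NA),
                 dm = c(10113, NA, 10188))

# Rows where either value sits in the wrong column; which() ignores NA
swap_rows <- which(df$dm < 20000 | df$nd > 20000)

# The right-hand side lists the columns in reverse order;
# the assignment is positional, so nd receives dm and dm receives nd
df[swap_rows, c("nd", "dm")] <- df[swap_rows, c("dm", "nd")]

df
```

Because the column names never move, no renaming step is needed afterwards.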

Related

remove rows containing NA based on condition

df <- data.frame(x = 1:7, y = c(NA, NA, 5, 10, NA, 20, 30))
From df I want to remove rows containing NA in y based on the condition that the x value in that row is smaller than the x value in the row with the minimum y value to obtain this data frame.
data.frame(x = 3:7, y = c(5, 10, NA, 20, 30))
dplyr solutions preferable!
We could use which.min to get the index of the minimum 'y' value and use it to subset 'x'. Comparing the 'x' values against that value, combining the result with the check for NA elements in 'y', and negating (!) gives the rows to keep.
subset(df, !(x< x[which.min(y)] & is.na(y)))
-output
x y
3 3 5
4 4 10
5 5 NA
6 6 20
7 7 30
Or the same logic can be applied with dplyr::filter
library(dplyr)
df %>%
filter(!(x< x[which.min(y)] & is.na(y)))
-output
x y
1 3 5
2 4 10
3 5 NA
4 6 20
5 7 30
data
df <- structure(list(x = 1:7, y = c(NA, NA, 5, 10, NA, 20, 30)),
class = "data.frame", row.names = c(NA,
-7L))
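To see why the which.min condition works, it can help to evaluate each piece separately. This is only an unpacked sketch of the same expression; the intermediate names (idx_min, x_at_min, to_drop, kept) are made up here.

```r
df <- data.frame(x = 1:7, y = c(NA, NA, 5, 10, NA, 20, 30))

idx_min  <- which.min(df$y)    # position of the smallest non-NA y
x_at_min <- df$x[idx_min]      # x value in that row

# TRUE only for rows that are NA in y AND lie before the minimum-y row
to_drop <- df$x < x_at_min & is.na(df$y)

kept <- df[!to_drop, ]
kept
```

The row with x = 5 survives because, although its y is NA, its x is not smaller than the x of the minimum-y row.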
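Use logical indices for each of the conditions and combine them with logical AND, &. Note that this keeps only rows where y is not NA and x < y, so row 5 (x = 5, y = NA) is dropped as well: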
df <- data.frame(x = 1:7, y = c(NA, NA, 5, 10, NA, 20, 30))
i <- is.na(df$y)
j <- df$x < df$y
df[!i & j, ]
# x y
#3 3 5
#4 4 10
#6 6 20
#7 7 30

adding multiple columns include na in dataframe in r

I have a data frame like this:
I want to create a new column that is the sum of the other columns, ignoring NA if there is any numeric value in the row. But if all values in a row are NA (like the second row), the sum column should be NA.
As this is your first activity here on SO, you should have a look at this, which describes how a minimal, reproducible example is made. You will certainly need it if you have more questions in the future. An image is mostly not accepted as a starting point.
Fortunately your table was a small one. I turned it into a tribble and then used rowSums to calculate the numbers you seem to want.
df <- tibble::tribble(
~x, ~y, ~z,
6000, NA, NA,
NA, NA, NA,
100, 7000, 1000,
0, 0, NA
)
df$sum <- rowSums(df, na.rm = T)
df
#> # A tibble: 4 x 4
#> x y z sum
#> <dbl> <dbl> <dbl> <dbl>
#> 1 6000 NA NA 6000
#> 2 NA NA NA 0
#> 3 100 7000 1000 8100
#> 4 0 0 NA 0
Created on 2020-06-15 by the reprex package (v0.3.0)
Let's say that your data frame is called df
cbind(df, apply(df, 1, function(x){if (all(is.na(x))) {NA} else {sum(x, na.rm = TRUE)}}))
Note that if your data frame has other columns, you will need to restrict the df call within apply to only be the columns you're after.
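A minimal sketch of that restriction, assuming a data frame with an extra non-numeric column. The id column and the num_cols vector are hypothetical and only serve the illustration:

```r
# Hypothetical data with a non-numeric id column alongside x, y, z
df <- data.frame(id = c("a", "b", "c"),
                 x = c(6000, NA, 100),
                 y = c(NA, NA, 7000),
                 z = c(NA, NA, 1000))

num_cols <- c("x", "y", "z")  # only these columns take part in the sum

# NA when every value in the row is NA, otherwise the sum of the non-NA values
df$sum <- apply(df[num_cols], 1, function(r) {
  if (all(is.na(r))) NA_real_ else sum(r, na.rm = TRUE)
})

df$sum
```

Passing df[num_cols] instead of df keeps apply from coercing the whole row to character because of the id column.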
You can count the NA values in df. If in a row there is no non-NA value you can assign output as NA or calculate row-wise sum otherwise using rowSums.
ifelse(rowSums(!is.na(df)) == 0, NA, rowSums(df, na.rm = TRUE))
#[1] 6000 NA 10000 8100 0
data
df <- structure(list(x = c(6000, NA, 10000, 100, 0), y = c(NA, NA,
NA, 7000, 0), z = c(NA, NA, NA, 1000, NA)), class = "data.frame",
row.names = c(NA, -5L))

set a threshold for complete cases to remove NA from multiple columns in R

There might be an easy answer to this, but I am not able to make it work. I have a data table that looks like this:
df <- data.table(t = c(1, 2, 3), a = c(NA, NA, 4), b = c(NA, 4, NA), c = c(NA, 4, NA))
How can I remove only the rows where all columns but "t" are NA? It should be fast because my data files are big, so I would especially like to do it with complete.cases. I couldn't find a solution to this problem yet.
The result should look like this
dfRes <- data.table(t = c(2, 3), a = c(NA, 4), b = c(4, NA), c = c(4, NA))
We can use complete.cases with Reduce
library(data.table)
df[df[, Reduce(`|`, lapply(.SD, complete.cases)), .SDcols = a:c]]
# t a b c
#1: 2 NA 4 4
#2: 3 4 NA NA
We can use rowSums on columns other than "t".
library(data.table)
cols <- which(names(df) != 't')
df[rowSums(!is.na(df[, ..cols])) > 0, ]
# t a b c
#1: 2 NA 4 4
#2: 3 4 NA NA
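Since the question title asks for a threshold, the rowSums idea can be generalised to "keep rows with at least k non-NA values". The sketch below uses a plain data.frame to stay dependency-free; min_non_na and value_cols are names invented here, and the same rowSums expression should carry over to the data.table form shown above.

```r
df <- data.frame(t = c(1, 2, 3),
                 a = c(NA, NA, 4),
                 b = c(NA, 4, NA),
                 c = c(NA, 4, NA))

value_cols <- setdiff(names(df), "t")  # every column except t
min_non_na <- 1                        # hypothetical threshold

# Number of non-NA values per row among the value columns
n_non_na <- rowSums(!is.na(df[value_cols]))

res <- df[n_non_na >= min_non_na, ]
res
```

With min_non_na <- 2 only the row with t = 2 would remain.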

How can I replace zeros with half the minimum value within a column?

I am trying to replace 0's in my dataframe of thousands of rows and columns with half the minimum value greater than zero from that column. I would also not want to include the first four columns as they are indexes.
So if I start with something like this:
index <- c("100p", "200p", "300p", "400p")
ratio <- c(5, 4, 3, 2)
gene <- c("gapdh", NA, NA, "actb")
species <- c("mouse", NA, NA, "rat")
a1 <- c(0, 3, 5, 2)
b1 <- c(0, 0, 4, 6)
c1 <- c(1, 2, 3, 4)
q <- as.data.frame(cbind(index, ratio, gene, species, a1, b1, c1))
index ratio gene species a1 b1 c1
100p 5 gapdh mouse 0 0 1
200p 4 NA NA 3 0 2
300p 3 NA NA 5 4 3
400p 2 actb rat 2 6 4
I would hope to gain a result such as this:
index ratio gene species a1 b1 c1
100p 5 gapdh mouse 1 2 1
200p 4 NA NA 3 2 2
300p 3 NA NA 5 4 3
400p 2 actb rat 2 6 4
I have tried the following code:
apply(q[-4], 2, function(x) "[<-"(x, x==0, min(x[x > 0]) / 2))
but I keep getting the error: Error in min(x[x > 0])/2 : non-numeric argument to binary operator
Any help on this? Thank you very much!
We can use lapply and replace the 0 values with half the minimum positive value in each column.
cols<- 5:7
q[cols] <- lapply(q[cols], function(x) replace(x, x == 0, min(x[x>0], na.rm = TRUE)/2))
q
# index ratio gene species a1 b1 c1
#1 100p 5 gapdh mouse 1 2 1
#2 200p 4 <NA> <NA> 3 2 2
#3 300p 3 <NA> <NA> 5 4 3
#4 400p 2 actb rat 2 6 4
In dplyr, we can use mutate_at
library(dplyr)
q %>% mutate_at(cols,~replace(., . == 0, min(.[.>0], na.rm = TRUE)/2))
data
q <- structure(list(index = structure(1:4, .Label = c("100p", "200p",
"300p", "400p"), class = "factor"), ratio = c(5, 4, 3, 2), gene = structure(c(2L,
NA, NA, 1L), .Label = c("actb", "gapdh"), class = "factor"),
species = structure(c(1L, NA, NA, 2L), .Label = c("mouse",
"rat"), class = "factor"), a1 = c(0, 3, 5, 2), b1 = c(0,
0, 4, 6), c1 = c(1, 2, 3, 4)), class = "data.frame", row.names = c(NA, -4L))
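In recent dplyr releases mutate_at is superseded by across(). A possible equivalent of the mutate_at call above, sketched on a cut-down version of the data and selecting the columns by name rather than by position, might look like this:

```r
library(dplyr)

# Cut-down version of q with only one index column, for illustration
q <- data.frame(index = c("100p", "200p", "300p", "400p"),
                a1 = c(0, 3, 5, 2),
                b1 = c(0, 0, 4, 6),
                c1 = c(1, 2, 3, 4))

# For each of a1..c1, replace zeros with half the smallest positive value
res <- q %>%
  mutate(across(a1:c1,
                ~ replace(.x, .x == 0, min(.x[.x > 0], na.rm = TRUE) / 2)))
res
```

Selecting a1:c1 by name avoids depending on the exact number of index columns in front.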
A slightly different (and potentially faster for large datasets) dplyr option with a bit of maths could be:
q %>%
mutate_at(vars(5:length(.)), ~ (. == 0) * min(.[. != 0])/2 + .)
index ratio gene species a1 b1 c1
1 100p 5 gapdh mouse 1 2 1
2 200p 4 <NA> <NA> 3 2 2
3 300p 3 <NA> <NA> 5 4 3
4 400p 2 actb rat 2 6 4
And the same with base R:
q[, 5:length(q)] <- lapply(q[, 5:length(q)], function(x) (x == 0) * min(x[x != 0])/2 + x)
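One edge case the variants above share: if a column has no positive (or non-zero) values, min() is called on an empty vector and returns Inf with a warning, so the zeros would become Inf. A guarded helper could look like the sketch below. The name half_min_fill is invented here, and leaving such columns unchanged is just one possible choice.

```r
# Hypothetical helper: replace zeros with half the smallest positive value
half_min_fill <- function(x) {
  pos <- x[!is.na(x) & x > 0]           # positive, non-NA values only
  if (length(pos) == 0) return(x)       # nothing positive: leave column as is
  replace(x, !is.na(x) & x == 0, min(pos) / 2)
}

half_min_fill(c(0, 3, 5, 2))   # 1 3 5 2
half_min_fill(c(0, NA, 4))     # 2 NA 4
half_min_fill(c(0, 0))         # 0 0
```

It can be applied column-wise in the same way as the anonymous functions above, for example q[5:7] <- lapply(q[5:7], half_min_fill).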
For reference, considering your original code, I believe your function was not the issue. Instead, the error comes from applying the function to non-numeric data.
# original data
index <- c("100p", "200p", "300p" , "400p")
ratio <- c(5, 4, 3, 2)
gene <- c("gapdh", NA, NA,"actb")
species <- c("mouse", NA, NA, "rat")
a1 <- c(0,3,5,2)
b1 <- c(0, 0, 4, 6)
c1 <- c(1, 2, 3, 4)
# data frame
q <- as.data.frame(cbind(index, ratio, gene, species, a1, b1, c1))
# examine structure (all cols are factors in R < 4.0, character in R >= 4.0)
str(q)
# convert factors to numeric
fac_to_num <- function(x){
x <- as.numeric(as.character(x))
x
}
# apply to cols 5 thru 7 only
q[, 5:7] <- apply(q[, 5:7],2,fac_to_num)
# examine structure
str(q)
# use original function only on numeric data
apply(q[, 5:7], 2, function(x) "[<-"(x, x==0, min(x[x > 0]) / 2))
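The coercion itself is easy to check in isolation. In this sketch (q2 is a made-up name, and only three of the vectors are used), cbind on mixed vectors produces a character matrix, while data.frame keeps each column's own type, which sidesteps the conversion step entirely:

```r
ratio <- c(5, 4, 3, 2)
gene  <- c("gapdh", NA, NA, "actb")
a1    <- c(0, 3, 5, 2)

# cbind() on mixed vectors builds a character matrix, so numbers become strings
m <- cbind(ratio, gene, a1)
typeof(m)          # "character"

# data.frame() keeps each column's own type
q2 <- data.frame(ratio, gene, a1)
sapply(q2, class)
```

Building the data frame with data.frame(...) directly means the numeric columns can be passed to the replacement function without fac_to_num.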

How to select rows based on the combination of columns

I have the following data frame:
structure(list(Species = 1:4, Ni = c(1, NA, 1, 1), Zn = c(1,
1, 1, 1), Cu = c(NA, NA, 1, NA)), .Names = c("Species", "Ni",
"Zn", "Cu"), row.names = c(NA, -4L), class = "data.frame")
and I would like to get a vector containing all the species where Ni = 1, Zn = 1 and Cu = NA. So in this example that would be (1,4)
I thought I could have a try with the R script select * from where, but I can't seem to install the package RMySQL on RStudio (R version 2.15.1).
df <- structure(list(Species=1:4,Ni=c(1,NA,1,1),Zn=c(1,1,1,1),Cu=c(NA,NA,1,NA)),
.Names=c("Species","Ni","Zn","Cu"),row.names=c(NA,-4L),class="data.frame")
with(df, Species[Ni %in% 1 & Zn %in% 1 & Cu %in% NA])
[1] 1 4
Rather than using Ni == 1 you should use Ni %in% 1, as the former will return NA elements where Ni is NA. Cu %in% NA produces the same result as is.na(Cu).
with(df, Species[Ni == 1 & Zn %in% 1 & Cu %in% NA])
[1] 1 NA 4
Note though that Ni == 1 used in subset, as in @MadScone's answer, does not suffer from this (which came as a surprise to me).
subset(df, Ni == 1 & Zn == 1 & is.na(Cu), Species)
Species
1 1
4 4
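The reason subset() is not affected is that it treats missing values in the condition as FALSE, whereas plain [ indexing turns an NA index into an all-NA row. A short sketch on a two-column version of the data (cond is a name made up here):

```r
df <- data.frame(Species = 1:4,
                 Ni = c(1, NA, 1, 1))

cond <- df$Ni == 1               # TRUE NA TRUE TRUE

nrow(df[cond, ])                 # 4: the NA index yields an all-NA row
nrow(df[cond & !is.na(cond), ])  # 3: NA explicitly treated as FALSE
nrow(subset(df, Ni == 1))        # 3: subset() applies the same guard itself
df$Ni %in% 1                     # TRUE FALSE TRUE TRUE
```

So %in% and subset() both end up with a logical vector free of NA, which is what [ needs to select only real rows.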
Have a look at subset().
x <- structure(list(Species = 1:4, Ni = c(1, NA, 1, 1), Zn = c(1,
1, 1, 1), Cu = c(NA, NA, 1, NA)), .Names = c("Species", "Ni",
"Zn", "Cu"), row.names = c(NA, -4L), class = "data.frame")
subset(x, Ni == 1 & Zn == 1 & is.na(Cu), Species)
Edit:
I stand corrected by Backlin...
It is much better to use %in% rather than == x & !is.na() !
MadScone's suggestion of using subset() is better yet, since the result remains a data.frame, even when one selects only one column for the output, i.e.
> class(subset(df, Ni == 1 & Zn == 1 & is.na(Cu), Species))
[1] "data.frame"
#whereby we get a vector when only one column is selected...
> class(df[df$Ni %in% 1 & df$Zn %in% 1 & is.na(df$Cu), 1])
[1] "integer"
# but we get data.frame when using multiple columns...
> class(df[df$Ni %in% 1 & df$Zn %in% 1 & is.na(df$Cu), 1:2])
[1] "data.frame"
I'm just leaving my sub-par answer to mention this alternative idiom, as one that one should avoid!
Setup:
> df <- structure(list(Species = 1:4, Ni = c(1, NA, 1, 1), Zn = c(1,
1, 1, 1), Cu = c(NA, NA, 1, NA)), .Names = c("Species", "Ni",
"Zn", "Cu"), row.names = c(NA, -4L), class = "data.frame")
> df
Species Ni Zn Cu
1 1 1 1 NA
2 2 NA 1 NA
3 3 1 1 1
4 4 1 1 NA
Query:
> df[df$Ni == 1 & !is.na(df$Ni)
& df$Zn == 1 & !is.na(df$Zn)
& is.na(!df$Cu), ]
Species Ni Zn Cu
1 1 1 1 NA
4 4 1 1 NA
The trick with NA values is to explicitly exclude them, e.g. for Ni, request the value 1 and !is.na(), etc. Failing to do so returns an all-NA row wherever, say, Ni is NA.
As mentioned above the df[df$Ni %in% 1 & df$Zn %in% 1 & is.na(!df$Cu), ] idiom is much preferable and using subset() is typically better yet.
> df[df$Ni == 1 & df$Zn == 1 & is.na(!df$Cu), ]
Species Ni Zn Cu
1 1 1 1 NA
NA NA NA NA NA # OOPS...
4 4 1 1 NA
df <- structure(list(Species = 1:4, Ni = c(1, NA, 1, 1), Zn = c(1, 1, 1, 1),
Cu = c(NA, NA, 1, NA)),
.Names = c("Species", "Ni", "Zn", "Cu"), row.names = c(NA, -4L),
class = "data.frame")
If the columns Ni, Zn, and Cu contain 1 and NA only, you could simply use:
subset(df, Ni & Zn & is.na(Cu), Species)
