How to select rows based on the combination of columns - r

I have the following data frame:
structure(list(Species = 1:4, Ni = c(1, NA, 1, 1), Zn = c(1,
1, 1, 1), Cu = c(NA, NA, 1, NA)), .Names = c("Species", "Ni",
"Zn", "Cu"), row.names = c(NA, -4L), class = "data.frame")
and I would like to get a vector containing all the species where Ni = 1, Zn = 1 and Cu = NA. In this example that would be (1, 4).
I thought I could try an SQL-style query (select * from where), but I can't seem to install the package RMySQL in RStudio (R version 2.15.1).

df <- structure(list(Species=1:4,Ni=c(1,NA,1,1),Zn=c(1,1,1,1),Cu=c(NA,NA,1,NA)),
.Names=c("Species","Ni","Zn","Cu"),row.names=c(NA,-4L),class="data.frame")
with(df, Species[Ni %in% 1 & Zn %in% 1 & Cu %in% NA])
[1] 1 4
Rather than using Ni == 1 you should use Ni %in% 1, as the former will return NA elements where Ni is NA. Cu %in% NA produces the same result as is.na(Cu).
with(df, Species[Ni == 1 & Zn %in% 1 & Cu %in% NA])
[1] 1 NA 4
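To make the difference concrete, here is a minimal sketch on two small vectors (the names ni and species are only for illustration): == propagates NA, %in% always returns TRUE or FALSE, and a logical index that contains NA yields an NA element.

```r
ni <- c(1, NA, 1, 1)
species <- 1:4

eq  <- ni == 1      # NA wherever ni is NA
inn <- ni %in% 1    # never NA

print(eq)            # TRUE NA TRUE TRUE
print(inn)           # TRUE FALSE TRUE TRUE
print(species[eq])   # 1 NA 3 4  (the NA index produces an NA element)
print(species[inn])  # 1 3 4
```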
Note, though, that Ni == 1 used in subset(), as in @MadScone's answer, does not suffer from this (which came as a surprise to me).
subset(df, Ni == 1 & Zn == 1 & is.na(Cu), Species)
Species
1 1
4 4
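The likely reason is that the help page for subset() states that missing values in the condition are taken as false. Wrapping the condition in which() gives plain [ indexing the same behaviour. A minimal sketch (the variable name cond is just for illustration):

```r
df <- data.frame(Species = 1:4, Ni = c(1, NA, 1, 1),
                 Zn = c(1, 1, 1, 1), Cu = c(NA, NA, 1, NA))

# the raw condition is NA for the row where Ni is NA
cond <- with(df, Ni == 1 & Zn == 1 & is.na(Cu))

raw     <- df$Species[cond]         # NA index is kept: 1 NA 4
dropped <- df$Species[which(cond)]  # which() keeps only TRUE: 1 4
viasub  <- subset(df, Ni == 1 & Zn == 1 & is.na(Cu))$Species  # 1 4
```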

Have a look at subset().
x <- structure(list(Species = 1:4, Ni = c(1, NA, 1, 1), Zn = c(1,
1, 1, 1), Cu = c(NA, NA, 1, NA)), .Names = c("Species", "Ni",
"Zn", "Cu"), row.names = c(NA, -4L), class = "data.frame")
subset(x, Ni == 1 & Zn == 1 & is.na(Cu), Species)

Edit:
I stand corrected by Backlin...
It is much better to use %in% rather than == x & !is.na() !
MadScone's suggestion of using subset() is better yet, since the result remains a data.frame even when one selects only one column for the output, i.e.
> class(subset(df, Ni == 1 & Zn == 1 & is.na(Cu), Species))
[1] "data.frame"
#whereby we get a vector when only one column is selected...
> class(df[df$Ni %in% 1 & df$Zn %in% 1 & is.na(df$Cu), 1])
[1] "integer"
# but we get data.frame when using multiple columns...
> class(df[df$Ni %in% 1 & df$Zn %in% 1 & is.na(df$Cu), 1:2])
[1] "data.frame"
I'm just leaving my sub-par answer to mention this alternative idiom as one that one should avoid!
Setup:
> df <- structure(list(Species = 1:4, Ni = c(1, NA, 1, 1), Zn = c(1,
1, 1, 1), Cu = c(NA, NA, 1, NA)), .Names = c("Species", "Ni",
"Zn", "Cu"), row.names = c(NA, -4L), class = "data.frame")
> df
Species Ni Zn Cu
1 1 1 1 NA
2 2 NA 1 NA
3 3 1 1 1
4 4 1 1 NA
Query:
> df[df$Ni == 1 & !is.na(df$Ni)
& df$Zn == 1 & !is.na(df$Zn)
& is.na(!df$Cu), ]
Species Ni Zn Cu
1 1 1 1 NA
4 4 1 1 NA
The trick with NA values is to explicitly exclude them, e.g. for Ni, request the value 1 and !is.na(), etc. Failure to do so returns an all-NA row for every record where, say, Ni is NA.
As mentioned above, the df[df$Ni %in% 1 & df$Zn %in% 1 & is.na(df$Cu), ] idiom is much preferable, and using subset() is typically better yet.
> df[df$Ni == 1 & df$Zn == 1 & is.na(!df$Cu), ]
Species Ni Zn Cu
1 1 1 1 NA
NA NA NA NA NA # OOPS...
4 4 1 1 NA

df <- structure(list(Species = 1:4, Ni = c(1, NA, 1, 1), Zn = c(1, 1, 1, 1),
Cu = c(NA, NA, 1, NA)),
.Names = c("Species", "Ni", "Zn", "Cu"), row.names = c(NA, -4L),
class = "data.frame")
If the columns Ni, Zn, and Cu contain 1 and NA only, you could simply use:
subset(df, Ni & Zn & is.na(Cu), Species)

Related

Exchange the values between two columns based on a condition using R

I have the following df. If the value in the dm column is less than 20000, that value should go to the nd column. Similarly, if the value in the nd column is greater than 20000, that value should go to the dm column.
structure(list(id = c(1, 2, 3), nd = c(NA, 20076, NA), dm = c(10113,
NA, 10188)), class = "data.frame", row.names = c(NA, -3L))
I want my final df to look like this
structure(list(id = c(1, 2, 3), nd = c(10113, NA, 10188), dm = c(NA,
20076, NA)), class = "data.frame", row.names = c(NA, -3L))
Thank you
ifelse is your friend for this.
base R
transform(df,
nd = ifelse(dm < 20000, dm, nd),
dm = ifelse(nd > 20000, nd, dm)
)
# id nd dm
# 1 1 10113 NA
# 2 2 NA 20076
# 3 3 10188 NA
Note that this works in base R because unlike dplyr::mutate, the calculation for the dm= (second) expression (and beyond) does not see the change from the previous expressions, so the nd that it sees is the original, unchanged nd.
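A small sketch of that behaviour, using a made-up two-column data frame (d, a and b are illustrative names): because transform() evaluates every expression against the original columns, two columns can be swapped in a single call. With dplyr::mutate the second expression would instead see the already-modified a.

```r
d <- data.frame(a = 1:3, b = 4:6)

# both right-hand sides are evaluated on the original d
swapped <- transform(d, a = b, b = a)

print(swapped)
```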
We can also use the temporary-variable trick illustrated in the dplyr example below:
df |>
transform(
nd2 = ifelse(dm < 20000, dm, nd),
dm2 = ifelse(nd > 20000, nd, dm)
) |>
subset(select = -c(nd, dm))
and then rename nd2 to nd (etc).
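One possible way to do the renaming step in base R is to assign into names(). This is only a sketch that spells out the steps without the pipe, using the data from the question:

```r
df <- data.frame(id = c(1, 2, 3),
                 nd = c(NA, 20076, NA),
                 dm = c(10113, NA, 10188))

# compute into temporary columns, then drop the originals
out <- transform(df,
                 nd2 = ifelse(dm < 20000, dm, nd),
                 dm2 = ifelse(nd > 20000, nd, dm))
out <- subset(out, select = -c(nd, dm))

# rename the temporary columns back to the original names
names(out)[names(out) == "nd2"] <- "nd"
names(out)[names(out) == "dm2"] <- "dm"
```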
dplyr
Because mutate "sees" the changes immediately, we need to store into other variables and then reassign.
library(dplyr)
df %>%
mutate(
nd2 = ifelse(dm < 20000, dm, nd),
dm2 = ifelse(nd > 20000, nd, dm)
) %>%
select(-nd, -dm) %>%
rename(nd=nd2, dm=dm2)
# id nd dm
# 1 1 10113 NA
# 2 2 NA 20076
# 3 3 10188 NA
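Another base R option using apply (note that the swapped values keep their original names, so the columns of the result are labelled id, dm, nd):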
as.data.frame(t(apply(df, 1, function(x) {
if(x[2] > 20000 | x[3] < 20000) x[c(1, 3, 2)] else x})))
#> id dm nd
#> 1 1 10113 NA
#> 2 2 NA 20076
#> 3 3 10188 NA
Created on 2023-02-18 with reprex v2.0.2

How do I change NA in df$x when df$y == 1

I have a simple data frame read from a csv:
df <- read.csv("test.csv", na.strings = c("", NA))
that gives this (I am using R):
x y
1 a 1
2 <NA> 1
3 <NA> 2
4 a 1
5 b 2
6 <NA> 3
I would like to change the NA in df$x to another value based on the value in df$y. Such that, if df$x = NA and df$y = 1, df$x = "p1", if df$x = NA and df$y = 2, df$x = "p2" and so on to look like this:
x y
1 a 1
2 p1 1
3 p2 2
4 a 1
5 b 2
6 p3 3
df <- df %>% replace_na(list(x = "p1"))
Lets me change the whole of df$x, but I am unable to find anything that puts a stringent condition on it. It would seem I need to use an ifelse statement but I cannot seem to get the syntax correct.
mutate(df, x = ifelse(is.na(x) && y == 1, x == 'p1', x == 'p2'))
Thanks for any help.
You can use the following solution. Note that it assumes the <NA> values you are trying to change are stored as character strings, so they are not real NA values (you can verify that with is.na(df1$x)). If they are real NA values, test with is.na(x) instead of x == "<NA>".
library(dplyr)
df1 %>%
mutate(x = ifelse(x == "<NA>" & y == 1, "p1", ifelse(x == "<NA>" & y == 2,
"p2", ifelse(x == "<NA>" & y == 3,
"p3", x))))
x y
1 a 1
2 p1 1
3 p2 2
4 a 1
5 b 2
6 p3 3
Here is another very concise way that handles any number of replacement values:
library(dplyr)
library(stringr)
library(glue)
df1 %>%
mutate(x = str_replace(x, "<NA>", glue::glue("p{y}")))
x y
1 a 1
2 p1 1
3 p2 2
4 a 1
5 b 2
6 p3 3
You can try the code below using replace + is.na
transform(
df,
x = replace(x, is.na(x), paste0("p", y[is.na(x)]))
)
which gives
x y
1 a 1
2 p1 1
3 p2 2
4 a 1
5 b 2
6 p3 3
Data
> dput(df)
structure(list(x = c("a", NA, NA, "a", "b", NA), y = c(1L, 1L,
2L, 1L, 2L, 3L)), class = "data.frame", row.names = c("1", "2",
"3", "4", "5", "6"))
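In base R you can use a simple one-liner (as.character() guards against x being a factor, in which case ifelse() would return integer codes instead of the labels):
df$x <- ifelse(is.na(df$x), paste0('p', df$y), as.character(df$x))
# x y
# 1 a 1
# 2 p1 1
# 3 p2 2
# 4 a 1
# 5 b 2
# 6 p3 3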
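One caveat with ifelse() here: if x is a factor (the default for strings read by read.csv before R 4.0.0), the untouched elements come back as integer codes rather than labels. A minimal sketch, assuming a factor column built by hand:

```r
x <- factor(c("a", NA, NA, "a", "b", NA))
y <- c(1L, 1L, 2L, 1L, 2L, 3L)

# factor passed straight through: the labels are lost
with_codes  <- ifelse(is.na(x), paste0("p", y), x)

# converting to character first keeps the labels
with_labels <- ifelse(is.na(x), paste0("p", y), as.character(x))

print(with_codes)   # "1" "p1" "p2" "1" "2" "p3"
print(with_labels)  # "a" "p1" "p2" "a" "b" "p3"
```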

How can I replace zeros with half the minimum value within a column?

I am trying to replace 0's in my data frame of thousands of rows and columns with half the minimum value greater than zero from that column. I also do not want to include the first four columns, as they are indexes.
So if I start with something like this:
index <- c("100p", "200p", "300p", "400p")
ratio <- c(5, 4, 3, 2)
gene <- c("gapdh", NA, NA, "actb")
species <- c("mouse", NA, NA, "rat")
a1 <- c(0, 3, 5, 2)
b1 <- c(0, 0, 4, 6)
c1 <- c(1, 2, 3, 4)
q <- as.data.frame(cbind(index, ratio, gene, species, a1, b1, c1))
index ratio gene species a1 b1 c1
100p 5 gapdh mouse 0 0 1
200p 4 NA NA 3 0 2
300p 3 NA NA 5 4 3
400p 2 actb rat 2 6 4
I would hope to gain a result such as this:
index ratio gene species a1 b1 c1
100p 5 gapdh mouse 1 2 1
200p 4 NA NA 3 2 2
300p 3 NA NA 5 4 3
400p 2 actb rat 2 6 4
I have tried the following code:
apply(q[-4], 2, function(x) "[<-"(x, x==0, min(x[x > 0]) / 2))
but I keep getting the error: Error in min(x[x > 0])/2 : non-numeric argument to binary operator
Any help on this? Thank you very much!
We can use lapply and replace the 0 values with half the minimum positive value in the column.
cols<- 5:7
q[cols] <- lapply(q[cols], function(x) replace(x, x == 0, min(x[x>0], na.rm = TRUE)/2))
q
# index ratio gene species a1 b1 c1
#1 100p 5 gapdh mouse 1 2 1
#2 200p 4 <NA> <NA> 3 2 2
#3 300p 3 <NA> <NA> 5 4 3
#4 400p 2 actb rat 2 6 4
In dplyr, we can use mutate_at
library(dplyr)
q %>% mutate_at(cols,~replace(., . == 0, min(.[.>0], na.rm = TRUE)/2))
data
q <- structure(list(index = structure(1:4, .Label = c("100p", "200p",
"300p", "400p"), class = "factor"), ratio = c(5, 4, 3, 2), gene = structure(c(2L,
NA, NA, 1L), .Label = c("actb", "gapdh"), class = "factor"),
species = structure(c(1L, NA, NA, 2L), .Label = c("mouse",
"rat"), class = "factor"), a1 = c(0, 3, 5, 2), b1 = c(0,
0, 4, 6), c1 = c(1, 2, 3, 4)), class = "data.frame", row.names = c(NA, -4L))
A slightly different (and potentially faster for large datasets) dplyr option with a bit of maths could be:
q %>%
mutate_at(vars(5:length(.)), ~ (. == 0) * min(.[. != 0])/2 + .)
index ratio gene species a1 b1 c1
1 100p 5 gapdh mouse 1 2 1
2 200p 4 <NA> <NA> 3 2 2
3 300p 3 <NA> <NA> 5 4 3
4 400p 2 actb rat 2 6 4
And the same with base R:
q[, 5:length(q)] <- lapply(q[, 5:length(q)], function(x) (x == 0) * min(x[x != 0])/2 + x)
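The arithmetic works because (x == 0) is coerced to 0 or 1 when multiplied, so half the minimum is added only where the value is zero. A worked sketch on the b1 column, assuming the column has no NA and no negative values (half_min and filled are illustrative names):

```r
b1 <- c(0, 0, 4, 6)

half_min <- min(b1[b1 != 0]) / 2   # 4 / 2 = 2

# TRUE * 2 + 0 = 2 for zeros, FALSE * 2 + x = x otherwise
filled <- (b1 == 0) * half_min + b1

print(filled)   # 2 2 4 6
```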
For reference, considering your original code, I believe your function was not the issue. Instead, the error comes from applying the function to non-numeric data.
# original data
index <- c("100p", "200p", "300p" , "400p")
ratio <- c(5, 4, 3, 2)
gene <- c("gapdh", NA, NA,"actb")
species <- c("mouse", NA, NA, "rat")
a1 <- c(0,3,5,2)
b1 <- c(0, 0, 4, 6)
c1 <- c(1, 2, 3, 4)
# data frame
q <- as.data.frame(cbind(index, ratio, gene, species, a1, b1, c1))
# examine structure (all cols are factors; in R >= 4.0.0 they are character instead)
str(q)
# convert factors (or character) to numeric
fac_to_num <- function(x){
x <- as.numeric(as.character(x))
x
}
# apply to cols 5 thru 7 only
q[, 5:7] <- apply(q[, 5:7],2,fac_to_num)
# examine structure
str(q)
# use original function only on numeric data
apply(q[, 5:7], 2, function(x) "[<-"(x, x==0, min(x[x > 0]) / 2))
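The conversion step can be avoided altogether by not going through cbind(), which builds a matrix and therefore coerces everything to a single type. A minimal sketch with two short vectors, assuming you are free to build the data frame with data.frame() directly:

```r
ratio <- c(5, 4)
gene <- c("gapdh", "actb")

# cbind() returns a matrix, so the numbers become strings
m <- cbind(ratio, gene)
print(typeof(m))            # "character"

# data.frame() keeps each column's own type
d <- data.frame(ratio, gene)
print(is.numeric(d$ratio))  # TRUE
```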

How can I make a vector using 2 columns in a data set?

I have these two columns in my data.frame :
df1 <- structure(list(Mode = c("car", "walk", "passenger", "car", "bus"
), Licence = c(1, 1, 0, 1, 1)), row.names = c(NA, -5L), class = "data.frame")
df1
# Mode Licence
# 1 car 1
# 2 walk 1
# 3 passenger 0
# 4 car 1
# 5 bus 1
I want to make an indicator vector b that is 1 if that person's mode is not car and they have a driver's licence, and 0 otherwise. In the above example I need b to be:
df2 <- structure(list(Mode = c("car", "walk", "passenger", "car", "bus"
), Licence = c(1, 1, 0, 1, 1), b = c(0, 1, 0, 0, 1)), row.names = c(NA,
-5L), class = "data.frame")
df2
# Mode Licence b
# 1 car 1 0
# 2 walk 1 1
# 3 passenger 0 0
# 4 car 1 0
# 5 bus 1 1
Here you go. You could use an ifelse statement for this, as it's easier to understand.
data = data.frame(mode = c("car", "walk", "passenger", "car", "bus"), License = c(1,1,0,1,1))
data$b = ifelse(data$mode !="car" & data$License == 1, 1,0)
Another solution using logical operations and implicit conversion between numeric and logical:
df1$b <- with(df1, Mode!="car" & Licence)*1
Note: 0 is equivalent to FALSE and everything else is equivalent to TRUE, so if the possible values are just 0 and 1, we can shorten Licence == 1 to just Licence. The *1 part at the end converts truth values to 0's and 1's again.
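A short sketch of the two conversions on plain vectors (travel_mode and licence are illustrative names): multiplying by 1 gives a numeric vector, while as.integer() on the explicit comparison gives the same values as integers.

```r
travel_mode <- c("car", "walk", "passenger", "car", "bus")
licence <- c(1, 1, 0, 1, 1)

# numeric licence is coerced to logical by &, then back to 0/1 by * 1
b_num <- (travel_mode != "car" & licence) * 1

# same indicator, written with an explicit comparison
b_int <- as.integer(travel_mode != "car" & licence == 1)

print(b_num)   # 0 1 0 0 1
print(b_int)   # 0 1 0 0 1
```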
Another solution with dplyr:
library(dplyr)
df1 %>% mutate(b = if_else(Mode != 'car' & Licence == 1, # condition
true = 1,
false = 0))

R: Efficient extraction of events (continuous increase in a variable)

The task is to efficiently extract events from this data:
data <- structure(
list(i = c(1, 1, 1, 2, 2, 2), t = c(1, 2, 3, 1, 3, 4), x = c(1, 1, 2, 1, 2, 3)),
.Names = c("i", "t", "x"), row.names = c(NA, -6L), class = "data.frame"
)
> data
i t x
1 1 1 1
2 1 2 1
3 1 3 2
4 2 1 1
5 2 3 2
6 2 4 3
Let's call i facts, t is time, and x is the number of selections of i at t.
An event is an uninterrupted sequence of selections of one fact. Fact 1 is selected all throughout t=1 to t=3 with a sum of 4 selections. But fact 2 is split into two events, the first from t=1 to t=1 (sum=1) and the second from t=3 to t=4 (sum=5). Therefore, the event data frame is supposed to look like this:
> event
i from to sum
1 1 1 3 4
2 2 1 1 1
3 2 3 4 5
This code does what is needed:
event <- structure(
list(i = logical(0), from = logical(0), to = logical(0), sum = logical(0)),
.Names = c("i", "from", "to", "sum"), row.names = integer(0),
class = "data.frame"
)
l <- nrow(data) # get rows of data frame
c <- 1 # set counter
d <- 1 # set initial row of data to start with
e <- 1 # set initial row of event to fill
repeat{
event[e,1] <- data[d,1] # store "i" in event data frame
event[e,2] <- data[d,2] # store "from" in event data frame
while((data[d+1,1] == data[d,1]) & (data[d+1,2] == data[d,2]+1)){
c <- c+1
d <- d+1
if(d >= l) break
}
event[e,3] <- data[d,2] # store "to" in event data frame
event[e,4] <- sum(data[(d-c+1):d,3]) # store "sum" in event data frame
c <- 1
d <- d+1
e <- e+1
}
The problem is that this code takes 3 days to extract the events from a data frame with 1 million rows and my data frame has 5 million rows.
How can I make this more efficient?
P.S.: There's also a minor bug in my code related to termination.
P.P.S.: The data is sorted first by i, then by t.
Can you check whether this dplyr implementation is faster?
library(dplyr)
data <- structure(
list(fact = c(1, 1, 1, 2, 2, 2), timing = c(1, 2, 3, 1, 3, 4), x = c(1, 1, 2, 1, 2, 3)),
.Names = c("fact", "timing", "x"), row.names = c(NA, -6L), class = "data.frame"
)
group_by(data, fact) %>%
mutate(fromto=cumsum(c(0, diff(timing) > 1))) %>%
group_by(fact, fromto) %>%
summarize(from=min(timing), to=max(timing), sumx=sum(x)) %>%
select(-fromto) %>%
ungroup()
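The key step is the run identifier built with cumsum() and diff(). A minimal sketch on a made-up timing vector for a single fact shows how each gap larger than 1 starts a new run:

```r
timing <- c(1, 3, 4, 6, 7, 8)

gap <- diff(timing) > 1        # TRUE where the next time step is not consecutive
run_id <- cumsum(c(0, gap))    # increases by 1 at every gap

print(run_id)   # 0 1 1 2 2 2
```

Grouping by the fact and this run identifier then gives one group per event.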
How about this data.table implementation?
library(data.table)
data <- structure(
list(fact = c(1, 1, 1, 2, 2, 2), timing = c(1, 2, 3, 1, 3, 4), x = c(1, 1, 2, 1, 2, 3)),
.Names = c("fact", "timing", "x"), row.names = c(NA, -6L), class = "data.frame"
)
setDT(data)[, fromto:=cumsum(c(0, diff(timing) > 1)), by=fact]
event <- data[, .(from=min(timing), to=max(timing), sumx=sum(x)), by=c("fact", "fromto")][,fromto:=NULL]
## results when I enter event in the R console (data.table version 1.9.6)
> event
fact from to sumx
1: 1 1 3 4
2: 2 1 1 1
3: 2 3 4 5
> str(event)
Classes ‘data.table’ and 'data.frame': 3 obs. of 4 variables:
$ fact: num 1 2 2
$ from: num 1 1 3
$ to : num 3 1 4
$ sumx: num 4 1 5
- attr(*, ".internal.selfref")=<externalptr>
> dput(event)
structure(list(fact = c(1, 2, 2), from = c(1, 1, 3), to = c(3,
1, 4), sumx = c(4, 1, 5)), row.names = c(NA, -3L), class = c("data.table",
"data.frame"), .Names = c("fact", "from", "to", "sumx"), .internal.selfref = <pointer: 0x0000000000120788>)
Reference
detect intervals of the consequent integer sequences
Assuming the data frame is sorted according to data$t, you can try something like this:
event <- NULL
for (i in unique(data$i)) {
x <- data[data$i == i, ]
ev <- cumsum(c(1, diff(x$t)) > 1)
smry <- lapply(split(x, ev), function(z) c(i, range(z$t), sum(z$x)))
event <- c(event, smry)
}
event <- do.call(rbind, event)
rownames(event) <- NULL
colnames(event) <- c('i', 'from', 'to', 'sum')
The result is a matrix, not a data frame.
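If a data frame is wanted and the loop is still too slow, one possible fully vectorized base R variant is sketched below. It assumes the data is sorted by i and then t (as stated in the question) and contains no NA; new_event, last and cs are illustrative names.

```r
data <- data.frame(i = c(1, 1, 1, 2, 2, 2),
                   t = c(1, 2, 3, 1, 3, 4),
                   x = c(1, 1, 2, 1, 2, 3))

# a row starts a new event if the fact changes or the time is not consecutive
new_event <- c(TRUE, diff(data$i) != 0 | diff(data$t) != 1)
# a row ends an event if the next row starts one (or it is the last row)
last <- c(new_event[-1], TRUE)

cs <- cumsum(data$x)
event <- data.frame(
  i    = data$i[new_event],
  from = data$t[new_event],
  to   = data$t[last],
  sum  = diff(c(0, cs[last]))   # per-event totals from the running sum
)

print(event)
```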

Resources