Subsetting a data frame using another data frame - r

I'm having trouble with something that shouldn't be that hard to get around. I would like to subset a data.frame using another data.frame, more precisely by a certain value in one of its columns.
Here goes the example:
df1<- t(data.frame(A=c("ABI", "BET", "ALN", "SPH", "PTI", "DIC", "PTD"), B=c("0.5","3","0","0","5","0","15"), C=c("0","0","3","15","15","0","0"), D=c("0.5","0.5","0.5","0","0","0","0"), E=c("37.5","37.5","0.5","62.5","0.5","0.5","1")))
df2<- data.frame(A=c("ABI", "BET", "ALN", "SPH", "PTI", "DIC", "PTD"), B=c("vasc", "vasc","vasc","spha", "moss","moss","moss"), C=c("a", "a", "b", "a", "c","d","a"))
Now, let's say that I want to keep in df1 only the objects A (here they are species) that are "vasc" in df2.
For that I've tried a few things such as:
df3 <- subset(df2, B=="vasc")
df4 <- df1[,c(df1, as.vector(df2))]
But doing so gives the following error:
Error in df1[, c(df1, as.vector(df2))] : invalid subscript type 'list'
Therefore, I've tried to unlist my dataframes but nothing seems to work. I've been on this problem for a while now and I did explore the forum to see if anyone had an elegant solution to my problem but it looks like not.
Another way of doing this subsetting was to do the following bit of code, but it didn't work either even though I felt closer to the solution:
try11 <- list(df2, df1)%>% rbindlist(., fill=T) # with df1 not transposed
df11 <- try11[try11=="vasc",]
I hope the code is good enough and my explanation clear enough.
Thank you!

You might try:
library(data.table)
setDT(df1)
setDT(df2)
dtPruned <- df1[A %in% df2[B == "vasc", A]]
Make sure to remove the t() call in your df1 definition for this to work, however. Basically, it selects the A column in df2 where B == "vasc", then keeps the rows of df1 whose A is among those values.

You can do it with dplyr (again with df1 built without the t() call):
library(dplyr)
species <- as.character(df2[df2$B == "vasc", 1])
df1 %>%
  filter(A %in% species)
#     A   B C   D    E
# 1 ABI 0.5 0 0.5 37.5
# 2 BET   3 0 0.5 37.5
# 3 ALN   0 3 0.5  0.5
PS
Your data contains only factors (or character strings in R >= 4.0). You may want to store the numbers as numeric instead.
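To illustrate the PS, here is a minimal sketch of that conversion. It assumes df1 was built without t() and that columns B to E are meant to be numbers; going through as.character() first keeps the conversion safe whether the columns are factors or character vectors.

```r
# df1 built without t(); the values are stored as strings, as in the question
df1 <- data.frame(
  A = c("ABI", "BET", "ALN", "SPH", "PTI", "DIC", "PTD"),
  B = c("0.5", "3", "0", "0", "5", "0", "15"),
  C = c("0", "0", "3", "15", "15", "0", "0"),
  D = c("0.5", "0.5", "0.5", "0", "0", "0", "0"),
  E = c("37.5", "37.5", "0.5", "62.5", "0.5", "0.5", "1")
)

# Convert every column except A; as.character() first avoids getting
# the internal integer codes back if a column happens to be a factor
num_cols <- setdiff(names(df1), "A")
df1[num_cols] <- lapply(df1[num_cols], function(x) as.numeric(as.character(x)))

str(df1)
```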

This should do it. First we create a character vector (x) of all A values where B == "vasc" in df2. Then we select the columns of df1 whose value in row A is in x:
# Create a character vector of all A values when B == vasc
x <- as.character(df2[df2$B == "vasc", 1])
# Select columns whose value in row A is in x
df1[, which(df1[1, ] %in% x)]
[,1] [,2] [,3]
A "ABI" "BET" "ALN"
B "0.5" "3" "0"
C "0" "0" "3"
D "0.5" "0.5" "0.5"
E "37.5" "37.5" "0.5"
If we avoid the t call, we can do:
df1[df1$A %in% df2[df2$B == "vasc", 1], ]
A B C D E
1 ABI 0.5 0 0.5 37.5
2 BET 3 0 0.5 37.5
3 ALN 0 3 0.5 0.5
We could transpose the data frame to retain the same format as above:
t(df1[df1$A %in% df2[df2$B == "vasc", 1], ])
1 2 3
A "ABI" "BET" "ALN"
B "0.5" "3" "0"
C "0" "0" "3"
D "0.5" "0.5" "0.5"
E "37.5" "37.5" "0.5"
Data:
df1 <- t(data.frame(
A = c("ABI", "BET", "ALN", "SPH", "PTI", "DIC", "PTD"),
B = c("0.5","3","0","0","5","0","15"),
C = c("0","0","3","15","15","0","0"),
D = c("0.5","0.5","0.5","0","0","0","0"),
E = c("37.5","37.5","0.5","62.5","0.5","0.5","1")
)
)
df2 <- data.frame(
A = c("ABI", "BET", "ALN", "SPH", "PTI", "DIC", "PTD"),
B = c("vasc", "vasc","vasc","spha", "moss","moss","moss"),
C = c("a", "a", "b", "a", "c","d","a")
)

Related

Row name of the smallest element

I have the following dataframe:
d <- data.frame(a=c(1,2,3,4), b=c(20,19,18,17))
row.names(d) <- c("A", "B", "C", "D")
I want another data.frame, with the same columns and 2 rows, which contain the row names of the 2 smallest elements in that column.
In the example the expected result would be:
# Expected results
exp <- data.frame(a=c("A", "B"), b=c("C","D"))
We loop over the columns with lapply, order the values, use that index to subset the n corresponding row.names of 'd', and wrap the result with data.frame:
n <- 2
data.frame(lapply(d, function(x) sort(head(row.names(d)[order(x)], n))))
-output
# a b
#1 A C
#2 B D
With R 4.1.0 or later, we can also chain the functions with the |> operator (so they read in the order they are applied) and use the \(x) shorthand for an anonymous function in base R:
# // ordered the column values
# // get corresponding row names
lapply(d, \(x) row.names(d)[order(x)] |>
head(n) |> # // get the first n values
sort()) |> # // sort them
data.frame() # // convert the list to data.frame
# a b
#1 A C
#2 B D
Or using dplyr
library(dplyr)
d %>%
summarise(across(everything(),
~ sort(head(row.names(d)[order(.)], n))))
# a b
#1 A C
#2 B D
Using sapply in base R -
rn <- rownames(d)
sapply(d, function(x) sort(rn[order(x)[1:2]]))
# a b
#[1,] "A" "C"
#[2,] "B" "D"
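To check the approach on columns that are not already sorted, here is a small base R sketch. The data frame d2 and the helper n_smallest_names() are hypothetical names introduced only for this illustration.

```r
# Hypothetical data where neither column is monotone
d2 <- data.frame(a = c(3, 1, 2, 4), b = c(20, 17, 19, 18))
row.names(d2) <- c("A", "B", "C", "D")

# Hypothetical helper: row names of the n smallest values in each column
n_smallest_names <- function(df, n = 2) {
  data.frame(lapply(df, function(x) sort(row.names(df)[order(x)[seq_len(n)]])))
}

res <- n_smallest_names(d2, 2)
res
#   a b
# 1 B B
# 2 C D
```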

Using grep in list in order to fill a new df in R

Hello, I have a list:
list=c("OK_67J","GGT_je","Ojj_OK_778","JUu3","JJE")
and I would like to transform it into a df:
COL1 COL2
OK_67J A
GGT_je B
Ojj_OK_778 A
JUu3 B
JJE B
where COL2 is A if the value contains the pattern OK_ and B if not.
I tried :
COL2<-rep("Virus",length(list))
list[grep("OK_",tips)]<-"A"
df <- data.frame(COL1=list,COL2=COL2)
Use grepl :
ifelse(grepl('OK_', list), "A", "B")
#[1] "A" "B" "A" "B" "B"
You can also do it without ifelse :
c("B", "A")[grepl('OK_', list) + 1]
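The indexing trick works because grepl returns a logical vector, and adding 1 turns FALSE/TRUE into the positions 1/2 of the label vector. A step-by-step sketch (the names x, hit, idx and labels are only illustrative):

```r
x <- c("OK_67J", "GGT_je", "Ojj_OK_778", "JUu3", "JJE")

hit <- grepl("OK_", x, fixed = TRUE)  # logical: TRUE where the pattern occurs
idx <- hit + 1                        # FALSE -> 1, TRUE -> 2
labels <- c("B", "A")[idx]            # position 1 is "B", position 2 is "A"

data.frame(COL1 = x, COL2 = labels)
```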
It is better not to name a variable list, since list is a built-in function in R.
If you replace list[grep("OK_",tips)]<-"A" with COL2[grep("OK_",list)] <- "A" and initialise COL2 with "B", your solution will work.
list <- c("OK_67J", "GGT_je", "Ojj_OK_778", "JUu3", "JJE")
COL2 <- rep("B", length(list))
COL2[grep("OK_", list)] <- "A"
df <- data.frame(COL1 = list, COL2 = COL2)
df
# COL1 COL2
#1 OK_67J A
#2 GGT_je B
#3 Ojj_OK_778 A
#4 JUu3 B
#5 JJE B
First off, list is not a list but a character vector:
list=c("OK_67J","GGT_je","Ojj_OK_778","JUu3","JJE")
class(list)
[1] "character"
To transform it to a dataframe:
df <- data.frame(v1 = list)
To add the new column use grepl:
df$v2 <- ifelse(grepl("OK_", df$v1), "A", "B")
or use str_detect:
library(stringr)
df$v2 <- ifelse(str_detect(df$v1, "OK_"), "A", "B")
Result:
df
v1 v2
1 OK_67J A
2 GGT_je B
3 Ojj_OK_778 A
4 JUu3 B
5 JJE B

Unique Data Frame by Prioritize Value in R

I have the following data frame in R:
A<-c(1,0,0,1,0)
B<-c("A","A","B","B","C")
df<-cbind(A,B)
and I want to deduplicate this data frame on column B, giving priority to a value in column A:
a value of 1 should be preferred over a value of 0.
I tried to write the code as follows:
uniq<-unique(subset(df, df[,1]==1))
and result:
A B
[1,] "1" "A"
[2,] "1" "B"
But i want:
A B
[1,] "1" "A"
[2,] "1" "B"
[3,] "0" "C"
How can I achieve this? Thanks in advance.
First, your df is actually a matrix, so you could start with df <- data.frame(df, stringsAsFactors = FALSE)
Then sort so that A == 1 comes first, and finally weed out duplicates:
df <- df[order(df[["A"]], decreasing = TRUE), ]
df[!duplicated(df[["B"]]), ]
A B
1 1 A
4 1 B
5 0 C
You can use aggregate, if you make sure you have a data frame and not a matrix:
df <- data.frame(A, B, stringsAsFactors = FALSE)
aggregate(A ~ B, df, max)
# B A
# 1 A 1
# 2 B 1
# 3 C 0
If you want to prioritize a value and simple sorting isn't good enough (because you want to prioritize a character or factor value, or a numeric value that is not a min/max or you want to leave the order of other values intact), you can use :
df2 <- df[order(df$A!=1),]
df2 <- df2[!duplicated(df2[["B"]]), ]
which is a minor twist on @snoram's answer
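This works because order() on a logical vector puts FALSE before TRUE and leaves ties in their original order. A minimal sketch on hypothetical data (dat, score and group are made-up names), where the preferred value 1 is neither the minimum nor the maximum:

```r
# Hypothetical data: the preferred value of `score` is 1
dat <- data.frame(
  score = c(0, 2, 1, 2, 1, 0),
  group = c("x", "x", "x", "y", "y", "z"),
  stringsAsFactors = FALSE
)

# FALSE sorts before TRUE, so rows with score == 1 come first;
# the remaining rows keep their original relative order
ordered <- dat[order(dat$score != 1), ]

# Keep the first row seen for each group
result <- ordered[!duplicated(ordered$group), ]
result
```

Group z has no row with score 1, so its only row is kept as it is.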
First sort the data by the first column (decreasing order), then remove rows with duplicate value for second column.
df <- df[order(df[, 1], decreasing = TRUE), ]
df[!duplicated(df[, 2]), ]
A B
[1,] "1" "A"
[2,] "1" "B"
[3,] "0" "C"
With data.table you can do it as follows:
A <- c(1, 0, 0, 1, 0)
B <- c("A", "A", "B", "B", "C")
df <- data.frame(A, B, stringsAsFactors = FALSE)
df1 <- dplyr::arrange(df, dplyr::desc(A), B)
library(data.table)
DT <- data.table(df1)
setkey(DT, B)
d <- DT[J(unique(B)), mult = "first"]
tidyverse solution
library(tidyverse)
df %>% as.data.frame( stringsAsFactors = FALSE ) %>%
arrange( B, desc(A) ) %>%
filter( !duplicated(B) )
# A B
# 1 1 A
# 2 1 B
# 3 0 C

Generate random numbers in an R dataframe which are constant across similar-rows

I have a dataframe containing X rows per 'user', where X is not constant between users. What I would like to do is to be able to generate random numbers to fill a new column, but for each 'user' the random number is the same across all of the rows that correspond to that user. For example, the data might look something like this:
user feature1 feature2
1 "A" "B"
1 "L" "L"
1 "Q" "B"
1 "D" "M"
1 "D" "M"
1 "P" "E"
2 "A" "B"
2 "R" "P"
2 "A" "F"
3 "X" "U"
... ... ...
and I would like to generate a new column that might look something like this:
user feature1 feature2 new_rand
1 "A" "B" 0.183
1 "L" "L" 0.183
1 "Q" "B" 0.183
1 "D" "M" 0.183
1 "D" "M" 0.183
1 "P" "E" 0.183
2 "A" "B" 0.971
2 "R" "P" 0.971
2 "A" "F" 0.971
3 "X" "U" 0.302
... ... ...
The first approach I tried was basically to use s <- split(df, df$user), but the dataframe contains a huge number of users and I think this is probably an extremely inefficient way to do this.
Many thanks.
@akrun's method is a great one-off, but it doesn't leverage vectorization (rnorm is called once for every level of user), so it's probably on the slow side. A more general way to do this is:
library(data.table)
setDT(df)
df[unique(df, by = "user")[ , new_rand := rnorm(.N)],
new_rand := i.new_rand, on = "user"]
What's going on here? unique returns a new data.table where all the duplicate observations (as defined by by, here user) are removed; we then add a column to this new object ([, := ]). Finally, this augmented data.table is joined back to the original table.
Note that here we only call rnorm once, returning a vector of exactly the right size. We then join this back to the original data set, "spreading" the value as needed across all observations of each user.
Or for assigning to a more specific group, say user and feature1 and feature2:
grps <- c("user", "feature1", "feature2")
df[unique(df, by = grps)[ , new_rand := rnorm(.N)],
new_rand := i.new_rand, on = grps]
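The same "draw once, then spread" idea can also be sketched in base R with match() instead of a join. This is an illustrative alternative on hypothetical data, not part of the data.table answer; set.seed() is there only to make the sketch reproducible.

```r
# Hypothetical data shaped like the example in the question
df <- data.frame(
  user     = c(1, 1, 1, 2, 2, 3),
  feature1 = c("A", "L", "Q", "A", "R", "X")
)

set.seed(1)                    # only for reproducibility of the sketch
users <- unique(df$user)       # one entry per user
draws <- rnorm(length(users))  # a single vectorized call to rnorm

# match() gives, for each row, the position of its user in `users`,
# so indexing `draws` with it spreads each draw over that user's rows
df$new_rand <- draws[match(df$user, users)]
df
```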
We can try data.table. We convert the 'data.frame' to a 'data.table' (setDT(df1)); grouped by 'user', we draw a single random number (rnorm(1)) and assign (:=) it to create 'new_rand'.
library(data.table)
setDT(df1)[, new_rand := rnorm(1) , by = user]
Or we can use dplyr.
library(dplyr)
df1 %>%
group_by(user) %>%
mutate(new_rand = rnorm(1))
Or another option with left_join
distinct(df1, user) %>%
mutate(new_rand=rnorm(n())) %>%
left_join(df1, ., by='user')
and a base R solution:
df_ <- data.frame(user = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3), feature1 = c("A", "L", "Q", "D", "D", "P", "A", "R", "A", "X"), feature2 = c("B", "L", "B", "M", "M", "E", "B", "P", "F", "U"))
tmp <- by(df_, df_[, 'user'], FUN = function(x) data.frame(x, new_rand = rnorm(1)))
do.call(rbind, tmp)
# user feature1 feature2 new_rand
# 1.1 1 A B -0.6145338
# 1.2 1 L L -0.6145338
# 1.3 1 Q B -0.6145338
# 1.4 1 D M -0.6145338
# 1.5 1 D M -0.6145338
# 1.6 1 P E -0.6145338
# 2.7 2 A B -1.4292151
# 2.8 2 R P -1.4292151
# 2.9 2 A F -1.4292151
# 3 3 X U -0.3309754
or as per akrun's suggestion:
df_[, 'new_rand'] <- ave(seq_along(df_$user), df_$user, FUN = function(x) rnorm(1))

How to convert a factor to numeric in a predefined order in R

I have a factor column, with three values: "b", "c" and "free".
I did
df$new_col = as.numeric (df$factor_col)
But it will convert "b" to 1, "c" to 2 and "free" to 3.
But I want to convert "free" to 0, "b" to 2 and "c" to 5. How can I do it in R?
Thanks a lot
f <- factor(c("b", "c", "c", "free", "b", "free"))
You can try renaming the factor levels,
levels(f)[levels(f)=="b"] <- 2
levels(f)[levels(f)=="c"] <- 5
levels(f)[levels(f)=="free"] <- 0
> f
#[1] 2 5 5 0 2 0
#Levels: 2 5 0
One option is to call factor again, specifying the levels and labels arguments according to the custom mapping, and then convert to numeric via character (or through the levels):
df$new_col <- as.numeric(as.character(factor(df$factor_col,
levels=c('b', 'c', 'free'), labels=c(2, 5, 0))))
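The detour through as.character() matters because as.numeric() applied directly to a factor returns the internal integer codes rather than the labels. As an illustrative alternative (not taken from the answers), a named lookup vector gives the same mapping; lookup and new_col are hypothetical names.

```r
f <- factor(c("b", "c", "c", "free", "b", "free"))

as.numeric(f)  # internal codes of the sorted levels: 1 2 2 3 1 3

# Hypothetical lookup table mapping each level to the desired number
lookup <- c(free = 0, b = 2, c = 5)

# Index the lookup by the level names, then drop the names
new_col <- unname(lookup[as.character(f)])
new_col        # 2 5 5 0 2 0
```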
Another option is recode from library(car). The output will be of class factor. If we need 'numeric', we can convert as in the earlier solution (as.numeric(as.character(..))).
library(car)
df$new_col <- with(df, recode(factor_col, "'b'=2; 'c'=5; 'free'=0"))
data
df <- data.frame(factor_col= c('b', 'c', 'b', 'free', 'c', 'free'))
You can use the following approach to create the new column:
# an example data frame
f <- data.frame(factor_col = c("b", "c", "free"))
# create new_col
f <- transform(f, new_col = (factor_col == "b") * 2 + (factor_col == "c") * 5)
The result (f):
factor_col new_col
1 b 2
2 c 5
3 free 0
