R subset df based on multiple columns from another data frame

I am trying to find a more succinct way to filter a data frame using rows from another data frame (I am currently using a loop).
For example, suppose you have the following data frame df1 consisting of quantities of apples, pears, lemons and oranges. There is also a 5th column which we will call happiness.
library(gtools) # permutations()
library(dplyr) # %>%, filter(), bind_rows()
df1 <- data.frame(permutations(n = 4, r = 4, v = 1:4)) %>% cbind(sample(1:24))
colnames(df1) <- c("Apples", "Pears", "Lemons", "Oranges", "Happiness")
However, you wish to filter this data frame to leave only certain combinations of fruit which exist in a second data frame (whose columns are not in the same order):
df2 = data.frame(Apples = c(1, 3, 2, 4), Pears = c(4, 1, 1, 3), Lemons = c(2, 2, 3, 1), Oranges = c(3, 4, 4, 2))
Currently I am using a loop to apply each row of df2 as a filter condition one-by-one and then binding the result e.g:
df.ss = list()
for (i in 1:nrow(df2)){
  # compare against row i of df2 only
  df.ss[[i]] = filter(df1,
                      Apples == df2$Apples[i] &
                      Pears == df2$Pears[i] &
                      Lemons == df2$Lemons[i] &
                      Oranges == df2$Oranges[i])
}
df.ss %>% bind_rows()
Is there a more elegant way of going about this?

I think you are looking for an inner join, which by default matches on all column names the two data frames have in common:
dplyr::inner_join(df1, df2)
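A minimal sketch of the join approach with the key columns spelled out, assuming dplyr is installed. The `fruit_cols` vector is an illustrative name, and the small `df1` and `df2` below are hand-built stand-ins so the example does not depend on gtools. `semi_join()` is shown as an alternative that only filters `df1` and never adds columns from `df2`.

```r
library(dplyr)

# Small stand-in for df1: a few fruit combinations plus a Happiness score
df1 <- data.frame(
  Apples    = c(1, 3, 2, 4, 1),
  Pears     = c(4, 1, 1, 3, 2),
  Lemons    = c(2, 2, 3, 1, 3),
  Oranges   = c(3, 4, 4, 2, 4),
  Happiness = c(10, 20, 30, 40, 50)
)

# Combinations to keep; the column order deliberately differs from df1
df2 <- data.frame(
  Oranges = c(3, 4),
  Lemons  = c(2, 3),
  Pears   = c(4, 1),
  Apples  = c(1, 2)
)

fruit_cols <- c("Apples", "Pears", "Lemons", "Oranges")

# Joins match on column names, so the column order in df2 does not matter
joined <- inner_join(df1, df2, by = fruit_cols)

# semi_join keeps the rows of df1 that have a match in df2 and does not
# duplicate rows if df2 lists the same combination more than once
filtered <- semi_join(df1, df2, by = fruit_cols)
```

Naming the keys explicitly avoids surprises if the two data frames later gain another column with the same name.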

Related

Piping over a list, subsetting and calculate a function of my own

I have a dataset with these three columns and other additional columns
structure(list(from = c(1, 8, 3, 3, 8, 1, 4, 5, 8, 3, 1, 8, 4,
1), to = c(8, 3, 8, 54, 3, 4, 1, 6, 7, 1, 4, 3, 8, 8), time = c(1521823032,
1521827196, 1521827196, 1522678358, 1522701516, 1522701993, 1522702123,
1522769399, 1522780956, 1522794468, 1522794468, 1522794468, 1522794468,
1522859524)), class = "data.frame", row.names = c(NA, -14L))
I need the code to take all indices less than a number (e.g. 5) and for each of them do the following: Subset the data set if the index is either in column "from" or in column "to" and calculate a function (e.g the difference between the min and max in time). As a result I expect a dataframe with the indexes and the results of the calculation.
This is what I have, but it does not work.
dur<-function(x)max(x)-min(x) #The function to calculate the difference. In other cases I need to use other functions of my own
filternumber <- function(number,x){ #A function to filter data x by the number in the two columns
x <- x%>% subset(from == number | to == number)
return(x)
}
lista <- unique(c(data$from, data$to)) # Creates a list with all the indexes in the data. I do this to avoid having non-existing indexes
lista <-lista[lista <= 5] #Limit the list to 5. In my code this number would be an argument to a function
result<-lista%>%filternumber(.,data) %>% select(time) %>% dur() #I use select because I have many other columns in the data
The result in this case should be a dataframe with 1036492 for 1, 967272 for 3 and 92475 for 4
I've also tried putting filternumber(.,data) %>% select(time) %>% dur() inside mutate, but that does not work either.
Perhaps you are looking for something like this:
library(purrr)
library(dplyr)
index <- c(1, 3, 4)
names(index) <- index
# data and dur() are the objects defined in the question
index %>%
  map_dfr(~ data %>%
            filter(from == .x | to == .x) %>%
            summarize(result = dur(time)),
          .id = "index")
This returns
index result
1 1 1036492
2 3 967272
3 4 92475
filternumber() compares with ==, which is elementwise, so it cannot be handed the whole vector of indices at once. Instead, we can loop over the indices:
library(dplyr)
library(purrr)
map_dbl(lista, ~ filternumber(.x, data) %>%
          select(time) %>%
          dur)
[1] 1036492 967272 92475 0
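The question mentions that the cutoff would be an argument to a function and that the result should be a data frame. A minimal base R sketch of that idea follows; `summarise_by_index` and its parameters `limit` and `fun` are hypothetical names chosen for illustration, not part of either answer.

```r
# Sample data from the question
data <- data.frame(
  from = c(1, 8, 3, 3, 8, 1, 4, 5, 8, 3, 1, 8, 4, 1),
  to   = c(8, 3, 8, 54, 3, 4, 1, 6, 7, 1, 4, 3, 8, 8),
  time = c(1521823032, 1521827196, 1521827196, 1522678358, 1522701516,
           1522701993, 1522702123, 1522769399, 1522780956, 1522794468,
           1522794468, 1522794468, 1522794468, 1522859524)
)

dur <- function(x) max(x) - min(x)

# Hypothetical helper: apply `fun` to the `time` values of every index <= limit
summarise_by_index <- function(x, limit, fun = dur) {
  indices <- sort(unique(c(x$from, x$to)))
  indices <- indices[indices <= limit]
  results <- vapply(
    indices,
    function(i) fun(x$time[x$from == i | x$to == i]),
    numeric(1)
  )
  data.frame(index = indices, result = results)
}

out <- summarise_by_index(data, limit = 5)
```

As in the output above, index 5 is included with a result of 0 because it appears in only one row; filter it out afterwards if only indices with several rows are of interest.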

use mutate_at for variables that meet two criteria dplyr R

I'm trying to reverse score (recode) some items in a dataframe. All reverse scored items end in an R, and each scale has a unique start ("hc", "out", and "hm"). I normally would just select all variables that end with an "r", but the issue is that some scales are on a 5-point scale ("hc" and "out") and others are on a 7-point scale ("hm").
Here is a sample of the much, much larger dataset:
library(tidyverse)
data <- tibble(name = c("Mike", "Ray", "Hassan"),
hc_1 = c(1, 2, 3),
hc_2r = c(5, 5, 4),
out_1r = c(5, 4, 2),
out_2 = c(2, 4, 5),
out_3r = c(2, 2, 1),
hm_1 = c(6, 7, 7),
hm_2r = c(7, 1, 7))
Let's say that I want to do this one scale at a time, so I start with hm, which is on a seven-point scale.
I want to try something like this with an & statement, but I get an error:
library(tidyverse)
library(car)
data %>%
  mutate_at(vars(ends_with("r") & starts_with("hm")),
            ~(recode(., "1=7; 2=6; 3=5; 4=4; 5=3; 6=2; 7=1")))
Error: ends_with("r") & starts_with("hm") must evaluate to column positions or names, not a logical vector
What's a clean way to make it perform the reverse scoring on these few variables at a time? Once again, the dataset is too big to practically select individual variables one at a time.
Thanks!
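It would be easier to use matches() here, with a regular expression that requires the name to start with "hm" and end with "r". Note that recode() must be the one from car, since dplyr's recode() does not accept this specification string:
library(tidyverse)
data %>%
  mutate_at(vars(matches("^hm.*r$")),
            ~ car::recode(., "1=7; 2=6; 3=5; 4=4; 5=3; 6=2; 7=1"))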

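For readers on a newer dplyr, here is a hedged sketch of the same reverse scoring with `across()`. It assumes dplyr 1.0.0 or later (with the tidyselect version that supports `&` between selection helpers) and that every value on the scale is a valid score from 1 to 7, so that reversing is simply `8 - x`. The trimmed-down `data` tibble is a stand-in for the question's data.

```r
library(dplyr)

data <- tibble(
  name  = c("Mike", "Ray", "Hassan"),
  hc_2r = c(5, 5, 4),
  hm_1  = c(6, 7, 7),
  hm_2r = c(7, 1, 7)
)

# Reverse-score only the 7-point "hm" items whose names end in "r";
# on a 1..7 scale the reversed score is 8 - x
reversed <- data %>%
  mutate(across(starts_with("hm") & ends_with("r"), ~ 8 - .x))
```

The 5-point scales could be handled the same way with `6 - .x` and a different prefix.
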
Trying to compare two dataframes, and writing a logical result to a new dataframe in R

I have an R data frame with 18 columns. I would like to write a function that compares column 1 to column 2 and writes a logical result (T or F) to a new column, depending on whether the two columns contain the same value. That part is not too hard for me. However, I would like to repeat the process for each following pair of columns, writing T/F to a new column each time.
In other words: values in col 1 = values in col 2, write T/F to a new column; values in col 3 = values in col 4, write T/F to a new column (or write the results to a new data frame).
I have been trying to do this with the purrr package, using the pmap/map functions, but I know I am making a mistake and missing some important part.
This function should work if I understand your problem correctly.
df <-
  data.frame(a = c(18, 6, 2, 0),
             b = c(0, 6, 2, 18),
             c = c(1, 5, 6, 8),
             d = c(3, 5, 9, 2))
# Compare columns pairwise (1 vs 2, 3 vs 4, ...) and append the logical results
compare_columns <-
  function(x){
    n_columns <- ncol(x)
    odd_columns <- 2*1:(n_columns/2) - 1
    even_columns <- 2*1:(n_columns/2)
    comparisons_list <-
      lapply(seq_len(n_columns/2),
             function(y){
               # use the argument x, not the global df
               x[, odd_columns[y]] == x[, even_columns[y]]
             })
    comparisons_df <-
      as.data.frame(comparisons_list,
                    col.names = paste0("column", odd_columns, "_column", even_columns))
    return(cbind(x, comparisons_df))
  }
compare_columns(df)
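The same pairwise comparison can also be sketched without an explicit loop, by comparing the block of odd-numbered columns against the block of even-numbered columns in one step. This is only an illustration that assumes an even number of columns of comparable types; the `_eq_` naming scheme is an arbitrary choice.

```r
df <- data.frame(a = c(18, 6, 2, 0),
                 b = c(0, 6, 2, 18),
                 c = c(1, 5, 6, 8),
                 d = c(3, 5, 9, 2))

odd  <- seq(1, ncol(df), by = 2)
even <- seq(2, ncol(df), by = 2)

# Two matrices of the same shape are compared elementwise,
# which yields one logical column per pair
same <- as.matrix(df[odd]) == as.matrix(df[even])
colnames(same) <- paste0(names(df)[odd], "_eq_", names(df)[even])

result <- cbind(df, same)
```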

Using list's elements in loops in r (example: setDT)

I have multiple data frames and I want to perform the same action on all of them, for example transforming them all into data.tables (this is just an example; I want to apply other functions too).
A simple example can be (df1=df2=df3, without loss of generality here)
df1 <- data.frame(var1 = c(1, 2, 3, 4, 5), var2 =c(1, 2, 2, 1, 2), var3 = c(10, 8, 15, 7, 9))
df2 <- data.frame(var1 = c(1, 2, 3, 4, 5), var2 =c(1, 2, 2, 1, 2), var3 = c(10, 8, 15, 7, 9))
df3 <- data.frame(var1 = c(1, 2, 3, 4, 5), var2 =c(1, 2, 2, 1, 2), var3 = c(10, 8, 15, 7, 9))
My approach was: (i) to create a list of the data frames (list.df), (ii) to create a list of what they should be called afterwards (list.dt) and (iii) to loop over those two lists:
list.df:
list.df<-vector('list',3)
for(j in 1:3){
  name <- paste('df',j,sep='')
  list.df[j] <- name
}
list.dt:
list.dt<-vector('list',3)
for(j in 1:3){
  name <- paste('dt',j,sep='')
  list.dt[j] <- name
}
Loop (to make all data frames into data tables):
for(i in 1:3){
  name<-list.dt[i]
  assign(unlist(name), setDT(list.df[i]))
}
I am definitely doing something wrong, as the result is three data tables with 1 variable and 1 observation each (exactly the name stored in list.df[i]).
I've tried to unlist list.df[i], thinking R would recognize that as an entire data frame and not only as a string:
for(i in 1:3){
  name<-list.dt[i]
  assign(unlist(name), setDT(unlist(list.df[i])))
}
But I get the error message:
Error in setDT(unlist(list.df[i])) :
Argument 'x' to 'setDT' should be a 'list', 'data.frame' or 'data.table'
Any suggestions?
You can just put all the data into one data frame. Then, if you want to iterate over the original data frames, use dplyr::do or, preferably, other dplyr functions on the grouped result:
library(dplyr)
data =
  list(df1 = df1, df2 = df2, df3 = df3) %>%
  bind_rows(.id = "source") %>%
  group_by(source)
list.df[[i]] holds only the name of the data frame as a string, so use get() to retrieve the object itself. Change your last snippet to this:
for(i in 1:3){
  name <- list.dt[i]
  assign(unlist(name), setDT(get(list.df[[i]])))
}
# Alternative to using lists
list.df <- paste0("df", 1:3)
# For loop that works with the length of the input 'list'/vector
# Creates the 'dt' objects on the fly
for(i in seq_along(list.df)){
  assign(paste0("dt", i), setDT(get(list.df[i])))
}
Using data.table (which deserves far more advertising):
a) If you need all your data.frames converted to data.tables, then, as was already suggested in the comments by @A5C1D2H2I1M1N2O1R2T1, iterate over your data.frames with setDT:
library(data.table)
lapply(mget(paste0("df", 1:3)), setDT)
# or, if you wish to type them one by one:
lapply(list(df1, df2, df3), setDT)
class(df1) # check if coercion took place
# [1] "data.table" "data.frame"
b) If you need to bind your data.frames by rows, then use data.table::rbindlist
data <- rbindlist(mget(paste0("df", 1:3)), idcol = TRUE)
# or, if you wish to type them one by one:
data <- rbindlist(list(df1 = df1, df2 = df2, df3 = df3), idcol = TRUE)
Side note: If you like chaining/piping with the magrittr package (which you see almost always in combination with dplyr syntax), then it goes like:
library(data.table)
library(magrittr)
# for a)
mget(paste0("df", 1:3)) %>% lapply(setDT)
# for b)
data <- mget(paste0("df", 1:3)) %>% rbindlist(idcol = TRUE)
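Since the question also asks about applying other functions to every data frame, here is a base R sketch of the list-based pattern without `assign()`. The function `add_total` and the column `total` are hypothetical placeholders for whatever per-data-frame step is needed.

```r
df1 <- data.frame(var1 = 1:5, var2 = c(1, 2, 2, 1, 2), var3 = c(10, 8, 15, 7, 9))
df2 <- df1
df3 <- df1

# Collect the data frames once; mget(paste0("df", 1:3)) would build the same
# named list from the object names
dfs <- list(df1 = df1, df2 = df2, df3 = df3)

# Hypothetical per-data-frame step standing in for "other functions"
add_total <- function(d) {
  d$total <- d$var1 + d$var3
  d
}

# lapply returns a new named list; the original data frames are left untouched
results <- lapply(dfs, add_total)
names(results) <- sub("^df", "dt", names(results))

results$dt2$total
```

Keeping the results in one named list usually makes later steps (binding, plotting, saving) easier than creating separate `dt1`, `dt2`, `dt3` objects.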

Combining frequency tables in R

I have a vector containing the frequencies of molecules within their respective molecular class for all molecules measured. I also have a vector that contains the per class frequency of significant molecules identified by variable selection. How can I merge these 2 vectors into a data frame and fill in empty frequencies with 0's (in R)?
Here is a workable example:
full = rep(letters[1:4], 4:7)
fullTable = table(full)
sub = rep(letters[1:2], c(2, 4))
subTable = table(sub)
I would like the table to look like:
print(data.frame(Letter=letters[1:4], fullFreq=c(4, 5, 6, 7), subFreq=c(2, 4, 0, 0)))
Try this (I assumed you meant subTable=table(sub) in your last line):
res<-merge(as.data.frame(fullTable),as.data.frame(subTable),by.x=1,by.y=1,all=TRUE)
colnames(res)<-c("Letter","fullFreq","subFreq")
res[is.na(res)]<-0
With the dplyr library:
library(dplyr)
full=rep(letters[1:4], 4:7)
sub=rep(letters[1:2], c(2,4))
df <- data.frame(Letter=unique(c(full, sub)))
df <- df %>%
  left_join(as.data.frame(table(full)), by=c("Letter"="full")) %>%
  left_join(as.data.frame(table(sub)), by=c("Letter"="sub"))
df[is.na(df)] <- 0
df
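Another way to get the zeros, sketched here in base R, is to tabulate both vectors as factors that share the same levels, so that `table()` reports a count of 0 for classes missing from `sub`. This assumes every class in `sub` also occurs in `full`; `classes` is an illustrative name.

```r
full <- rep(letters[1:4], 4:7)
sub  <- rep(letters[1:2], c(2, 4))

classes <- sort(unique(full))

# Fixing the factor levels makes table() report 0 for classes absent from `sub`
fullTable <- table(factor(full, levels = classes))
subTable  <- table(factor(sub,  levels = classes))

res <- data.frame(
  Letter   = classes,
  fullFreq = as.vector(fullTable),
  subFreq  = as.vector(subTable)
)
```

Because both tables come out in the same order and length, no merge or NA replacement is needed.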
