Selecting columns from a set of names with dplyr - R

I'm attempting to make subsets of a large data frame based on whether the column names are in an externally defined set. So I'm starting with something like:
> x <- c(1,2,3)
> y <- c("a","b","c")
> z <- c(4,5,6)
>
> df <- data.frame(x=x,y=y,z=z)
> df
x y z
1 1 a 4
2 2 b 5
3 3 c 6
chosen_columns <- c(x,y)
And I'm attempting to use this to end up with:
x y
1 1 a
2 2 b
3 3 c
It seems like using select() from dplyr should be able to handle this perfectly, but I'm not sure what the arguments would be to get that. Something like:
df_chosen <- df %>%
select(is.element(___,chosen_columns))
I'm just not sure what would go in the ___ there.
Thank you!

c(x, y) is not a vector of two columns: it's combining your objects x and y into a vector of characters: c("1", "2", "3", "a","b","c").
You may want to create a vector of column names and then pass it directly to select():
library(dplyr)
chosen_columns <- c("x", "y")
df |> select(all_of(chosen_columns))
(Thank you, Gregor Thomas, for the advice to wrap column names in all_of()).
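If some names in the external set might not exist in the data frame, any_of() may be a better fit than all_of(). A minimal sketch, assuming a made-up name "w" that is not a column of df:

```r
library(dplyr)

df <- data.frame(x = c(1, 2, 3), y = c("a", "b", "c"), z = c(4, 5, 6))

# "w" is a hypothetical name that does not exist in df
wanted <- c("x", "y", "w")

# any_of() silently skips names that are missing from the data frame
df_any <- df |> select(any_of(wanted))

# all_of() is strict: it raises an error when a name is missing
strict <- tryCatch(
  df |> select(all_of(wanted)),
  error = function(e) "error"
)
```

Here df_any keeps only x and y, while the all_of() call fails. Which behaviour you want probably depends on whether a missing column should be treated as a bug.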

Related

Sum numeric sub-dataframe within a list

Here I have an R list of data frames. All data frames have the same format and the same dimensions: the first 2 columns are strings, like IDs and names (identical for all data frames), and the rest are numeric values. I want to sum the numeric parts of all the data frames element-wise, like matrices, i.e. the output at index (1,3) is the sum of the values at index (1,3) of all the data frames.
e.g. Given a list L consisting of data frames x and y, I want to get output like z
x <-data.frame(ID=c("aa","bb"),name=c("cc","dd"),v1=c(1,2),v2=c(3,4))
y <-data.frame(ID=c("aa","bb"),name=c("cc","dd"),v1=c(5,6),v2=c(7,8))
L <- list(x,y)
z <- data.frame(ID=c("aa","bb"),name=c("cc","dd"),v1=c(1+5,2+6),v2=c(3+7,4+8))
I know how to do this using a for loop, but I want to learn to do it in a more R-like way, by which I mean using vectorized functions, like the apply family.
Currently my idea is to create a new data frame with only the ID and name columns, then use a global data frame variable to sum the numeric parts, and finally cbind these 2 parts:
output <- x[,1:2]
num_sum <- matrix(0,nrow=nrow(L[[1]]),ncol=ncol(L[[1]][,-c(1,2)]))
lapply(L,function(a){num_sum <<- a[3:length(a)]+num_sum})
cbind(output,num_sum)
but this approach has some problems I'd prefer to avoid:
I need to manually set up the 2 parts of the output and then manually join them
lapply() returns a list in which each element is an intermediate num_sum from one iteration, which requires much more memory
I'm using the global variable num_sum to keep track of the progress, but num_sum is not needed afterwards and I have to manually remove it
If the order of the two first columns is always the same, you can do:
#Get all numeric columns
num <- sapply(L[[1]], is.numeric)
#Sum them across elements of the list
df_num <- Reduce(`+`, lapply(L, `[`, num))
#Get the non-numeric columns and bind them with sum of numeric columns
cbind(L[[1]][!num], df_num)
output
ID name v1 v2
1 aa cc 6 10
2 bb dd 8 12
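This relies on the ID and name columns lining up row by row in every data frame. A small sketch of a guard you could run first (the check and the error message are my addition, reusing the example data from the question):

```r
x <- data.frame(ID = c("aa", "bb"), name = c("cc", "dd"), v1 = c(1, 2), v2 = c(3, 4))
y <- data.frame(ID = c("aa", "bb"), name = c("cc", "dd"), v1 = c(5, 6), v2 = c(7, 8))
L <- list(x, y)

# Logical mask of the numeric columns, taken from the first data frame
num <- sapply(L[[1]], is.numeric)

# Non-numeric (key) columns of every data frame
keys <- lapply(L, `[`, !num)

# TRUE only if every data frame has exactly the same key columns as the first
keys_match <- all(vapply(keys[-1], identical, logical(1), keys[[1]]))
if (!keys_match) stop("ID/name columns differ between data frames")

# Safe to sum position by position
df_num <- Reduce(`+`, lapply(L, `[`, num))
result <- cbind(keys[[1]], df_num)
```

If the check fails, a join-based approach is probably the safer route.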
If they are different, you can use powerjoin to do an inner join on the selected columns and sum the rest with the conflict argument:
library(powerjoin)
sum_inner_join <-
function(x, y) power_inner_join(x, y, by = c("ID", "name"), conflict = ~ .x + .y)
Reduce(sum_inner_join, L)
output
ID name v1 v2
1 aa cc 6 10
2 bb dd 8 12
Using dplyr and purrr (which has somewhat nicer map functions), you could do something like this:
library(purrr)
library(dplyr)
result <- reduce(L, function(x,y){
xVals <- x |> select(-ID, -name)
yVals <- y |> select(-ID, -name)
totalVals <- xVals |> map2(yVals, function(x,y) {
rowSums(cbind(x,y))
})
return(x |> select(ID, name) |> cbind(totalVals))
})
Similar logic to Maël's answer, but squishing it all into a Map call:
data.frame(do.call(Map,
c(\(...) if(is.numeric(..1)) Reduce(`+`, list(...)) else ..1, L)
))
# ID name v1 v2
#1 aa cc 6 10
#2 bb dd 8 12
If the first chunk (..1) of a column is numeric, sum (+) that column's values across all the data frames in the list; otherwise return the first chunk (..1) unchanged.
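The same call can be unpacked with a named helper, which may be easier to read. A sketch under the same assumptions, where sum_or_first is a hypothetical name of mine:

```r
x <- data.frame(ID = c("aa", "bb"), name = c("cc", "dd"), v1 = c(1, 2), v2 = c(3, 4))
y <- data.frame(ID = c("aa", "bb"), name = c("cc", "dd"), v1 = c(5, 6), v2 = c(7, 8))
L <- list(x, y)

# Hypothetical helper: receives the same column from each data frame
sum_or_first <- function(...) {
  chunks <- list(...)
  if (is.numeric(chunks[[1]])) Reduce(`+`, chunks) else chunks[[1]]
}

# do.call() expands this to Map(sum_or_first, L[[1]], L[[2]]),
# so Map walks over the columns of all data frames in parallel
res <- data.frame(do.call(Map, c(sum_or_first, L)))
```

Because a data frame is a list of columns, Map hands the helper one column from each data frame per call.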
You could also do it via an aggregation if all the rows are unique:
tmp <- do.call(rbind, L)
nums <- sapply(tmp, is.numeric)
aggregate(tmp[nums], tmp[!nums], FUN=sum)
# ID name v1 v2
#1 aa cc 6 10
#2 bb dd 8 12

Using mutate with a stored list of columns and procedures

I would like to iterate through a stored list of columns and procedures to create n new columns based on this list. In the example below, we start with 3 columns, a, b, c, and two simple functions, func1 and func2.
The data frame col_mod contains two sets of modifications that should be applied to the data frame. Each of these modifications should be an addition to the data frame, rather than replacements of the specified columns.
In col_mod row 1, we see that column a should be modified using func1, and in row 2, we see that column c should be modified using func2. The new names of these columns should be a_new and c_new, respectively.
At the bottom of the reprex below, I obtain my desired result, but I would like to do so without hard-coding each modification individually. Is there any way to use something like purrr::map or anything similar?
library(tidyverse)
## fake data
dat <- data.frame(a = 1:5,
b = 6:10,
c = 11:15)
## functions
func1 <- function(x) {x + 2}
func2 <- function(x) {x - 4}
## modification list
col_mod <- data.frame("col" = c("a", "c"),
"func" = c("func1", "func2"),
stringsAsFactors = FALSE)
## desired end result
dat %>%
mutate("a_new" = func1(a),
"c_new" = func2(c))
edit: if it is easier to store the modifications in a list, as shown below, a solution using that would be fine as well, as I am able to store the modifications in either a data frame or list.
col_mod <- list("set1" = list("a", "func1"),
"set2" = list("c", "func2"))
We can do this with the help of Map, using match.fun to look up each function by name and apply it:
dat[paste0(col_mod$col, '_new')] <- Map(function(x, y) match.fun(y)(x),
dat[col_mod$col], col_mod$func)
dat
# a b c a_new c_new
#1 1 6 11 3 7
#2 2 7 12 4 8
#3 3 8 13 5 9
#4 4 9 14 6 10
#5 5 10 15 7 11
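For the list version of col_mod from the question's edit, a plain loop is one option. A minimal sketch, assuming each entry is list(column name, function name) in that order; get(..., mode = "function") is my stand-in for match.fun here:

```r
dat <- data.frame(a = 1:5, b = 6:10, c = 11:15)

func1 <- function(x) x + 2
func2 <- function(x) x - 4

col_mod <- list(set1 = list("a", "func1"),
                set2 = list("c", "func2"))

for (m in col_mod) {
  col <- m[[1]]                          # name of the source column
  fn  <- get(m[[2]], mode = "function")  # look the function up by name
  dat[[paste0(col, "_new")]] <- fn(dat[[col]])
}
```

After the loop, dat has the original three columns plus a_new and c_new.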
Using col_mod as a data frame:
col_mod <- data.frame("col" = c("a", "c"),"func" = c("func1", "func2"))
We can also do this with a tidyverse approach:
library(dplyr)
library(purrr)
library(stringr)
library(tibble)
imap_dfc(deframe(col_mod), ~ dat %>%
transmute(!! str_c(.y, "_new") := match.fun(.x)(!! rlang::sym(.y)))) %>%
bind_cols(dat, .)

Nested for loop leading to: Error in `[<-.data.frame`(`*tmp*`, ...) : replacement has x rows, data has y

I have 6 data frames (dfs) with a lot of data of different biological groups and another 6 data frames (tax.dfs) with taxonomical information about those groups. I want to replace a column of each of the 6 dfs with a column with the scientific name of each species present in the 6 tax.dfs.
To do that I created two lists of the data frames and I'm trying to apply a nested for loop:
dfs <- list(df.birds, df.mammals, df.crocs, df.snakes, df.turtles, df.lizards)
tax.dfs <- list(tax.birds,tax.mammals, tax.crocs, tax.snakes, tax.turtles, tax.lizards )
for(i in dfs){
for(y in tax.dfs){
i[,1] <- y[,2]
}}
And this is the output I'm getting:
Error in `[<-.data.frame`(`*tmp*`, , 1, value = c("Aotus trivirgatus", :
replacement has 64 rows, data has 43
But both data frames have the same number of rows. I actually used dfs to create tax.dfs by applying the tnrs_match_names function from the rotl package.
Any suggestions on how I could fix this error, or on another way to do what I need, would be greatly appreciated.
Thank You!
For what it is worth, to iterate over two objects simultaneously, the following works:
Example Data:
df1 <- data.frame(a=1, b=2)
df2 <- data.frame(c=3, d=4)
df3 <- data.frame(e=5, f=6)
df_1 <- data.frame(a='A', b='B')
df_2 <- data.frame(c='C', d='D')
df_3 <- data.frame(e='E', f='F')
dfs <- list(df1, df2, df3)
df_s <- list(df_1, df_2, df_3)
Using mapply:
out <- mapply(function(one, two) {
one[,1] <- two[,2]
return(one)
}, dfs, df_s, SIMPLIFY = FALSE)
out
[[1]]
a b
1 B 2
[[2]]
c d
1 D 4
[[3]]
e f
1 F 6
Here, one and two in mapply correspond to the different elements in dfs and df_s. Having said that, let's make it a bit more interesting. Let's change my third example to the following:
df_3 <- data.frame(e=c('E', 'e'), f=c('F', 'f'))
df_s <- list(df_1, df_2, df_3) # needs to be executed again
Now, let's adjust the function:
out <- mapply(function(one, two) {
if(nrow(one) != nrow(two)){return('Wrong dimensions')}
one[,1] <- two[,2]
return(one)
}, dfs, df_s, SIMPLIFY = FALSE)
out
[[1]]
a b
1 B 2
[[2]]
c d
1 D 4
[[3]]
[1] "Wrong dimensions"
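As for why the nested loop in the question fails: it pairs every element of dfs with every element of tax.dfs, so data frames with different row counts meet, and the loop variable i is only a copy, so assigning into it never changes dfs. A sketch of an index-based loop that pairs elements by position and writes back into the list (the toy data and the name tax_dfs are made up for illustration):

```r
# Toy stand-ins for the real data; column names are arbitrary
dfs <- list(data.frame(a = 1, b = 2),
            data.frame(c = 3:4, d = 5:6))
tax_dfs <- list(data.frame(k = "x", name = "A"),
                data.frame(k = c("y", "z"), name = c("B", "C")))

stopifnot(length(dfs) == length(tax_dfs))

for (i in seq_along(dfs)) {
  # Pair the i-th data frame with the i-th taxonomy table only
  stopifnot(nrow(dfs[[i]]) == nrow(tax_dfs[[i]]))
  # Assign through dfs[[i]] so the change is stored in the list
  dfs[[i]][, 1] <- tax_dfs[[i]][, 2]
}
```

The row-count check makes a mismatch fail early, at the pair that causes it.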

Counting function in R

I have a dataset like this
id <- 1:12
b <- c(0,0,1,2,0,1,1,2,2,0,2,2)
c <- rep(NA,3)
d <- rep(NA,3)
df <-data.frame(id,b)
newdf <- data.frame(c,d)
I want to do some simple counting: how many values equal 1 and how many equal 2 in this dataset. But I don't want to count over the whole dataset; I want my function to count them in blocks of four rows.
I want a result like this:
> newdf
one two
1 1 1
2 2 1
3 0 3
I tried this with lots of variations but I couldn't succeed.
afonk <- function(x) {
ifelse(x==1 | x==2, x, newdf <- (x[1]+x[2]))
}
afonk(newdf$one)
lapply(newdf, afonk)
Thanks in advance!
ismail
Fun with base R:
# counting function
countnum <- function(x,num){
sum(x == num)
}
# make list of groups of 4
df$group <- rep(1:ceiling(nrow(df)/4),each = 4)[1:nrow(df)]
dfl <- split(df$b,f = df$group)
# make data frame of counts
newdf <- data.frame(one = sapply(dfl,countnum,1),
two = sapply(dfl,countnum,2))
Edit based on comment:
# make list of groups of 4
df$group <- rep(1:ceiling(nrow(df)/4),each = 4)[1:nrow(df)]
table(subset(df, b != 0L)[c("group", "b")])
Which you prefer depends on what type of result you need. A table will work for a small visual count, and you can likely pull the data out of the table, but if it is as simple as your example, you might opt for the data.frame.
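If you want the table route but with a data frame at the end, one possibility is to fix the factor levels so that both count columns always exist, even when a value never occurs in a block. A sketch, where the integer-division grouping and the column names one/two are my choices:

```r
id <- 1:12
b <- c(0, 0, 1, 2, 0, 1, 1, 2, 2, 0, 2, 2)

# Block number for each row: rows 1-4 -> 1, rows 5-8 -> 2, ...
grp <- (id - 1) %/% 4 + 1

# Only 1 and 2 are kept as levels; zeros become NA and are dropped by table()
counts <- table(group = grp, value = factor(b, levels = c(1, 2)))

# Convert the 2-d table to a data frame with one row per block
newdf <- as.data.frame.matrix(counts)
names(newdf) <- c("one", "two")
```

A block with no 1s still gets a 0 in the one column, because the level is declared up front.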
We could use dcast from data.table. Create a grouping variable using %/% and then dcast from 'long' to 'wide' format.
library(data.table)
# after dcast the columns are grp, 0, 1, 2; keep only the counts for 1 and 2
dcast(setDT(df)[,.N ,.(grp=(id-1)%/%4+1L, b)],
grp~b, value.var='N', fill =0)[,c("1","2"), with=FALSE]
Or a slightly more compact version would be using fun.aggregate as length.
res <- dcast(setDT(df)[,list((id-1)%/%4+1L, b)][b!=0],
V1~b, length)[,V1:=NULL][]
res
# 1 2
#1: 1 1
#2: 2 1
#3: 0 3
If we need the column names to be 'one', 'two'
library(english)
names(res) <- as.character(english(as.numeric(names(res))))

How I can select rows from a dataframe that do not match?

I'm trying to identify the values in a data frame that do not match, but can't figure out how to do this.
# make data frame
a <- data.frame( x = c(1,2,3,4))
b <- data.frame( y = c(1,2,3,4,5,6))
# select only values from b that are not in 'a'
# attempt 1:
results1 <- b$y[ !a$x ]
# attempt 2:
results2 <- b[b$y != a$x,]
With a$x = c(1,2,3) attempt 2 happens to work, because the length of b$y is a multiple of the length of a$x and R recycles the shorter vector. However, I'm trying to select all the values from b$y that are not in a$x, and don't understand what function to use.
If I understand correctly, you need the negation of the %in% operator. Something like this should work:
subset(b, !(y %in% a$x))
> subset(b, !(y %in% a$x))
y
5 5
6 6
Try the set difference function setdiff. So you would have
results1 = setdiff(a$x, b$y) # elements in a$x NOT in b$y
results2 = setdiff(b$y, a$x) # elements in b$y NOT in a$x
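One thing to be aware of: setdiff() returns unique values, whereas %in% filtering keeps duplicates. A quick sketch, where the repeated 5 in b is my addition to make the difference visible:

```r
a <- data.frame(x = c(1, 2, 3, 4))
b <- data.frame(y = c(1, 2, 3, 4, 5, 5, 6))  # note the duplicated 5

# setdiff() treats its inputs as sets, so the duplicate is collapsed
only_in_b_set <- setdiff(b$y, a$x)

# Negated %in% keeps every matching element, duplicates included
only_in_b_all <- b$y[!(b$y %in% a$x)]
```

So setdiff() gives 5 and 6, while the %in% version gives 5, 5 and 6. If you need the full rows rather than the bare values, the %in% form is probably what you want.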
You could also use dplyr for this task. To find what is in b but not a:
library(dplyr)
anti_join(b, a, by = c("y" = "x"))
# y
# 1 5
# 2 6
