I have a list of data frames as follows:
library(carData)
library(datasets)
l = list(Salaries,iris)
I want to keep only the numeric columns of each data frame in this list. I already tried lapply with select_if(is.numeric), but it did not work for me.
In newer versions of dplyr we can use select with where: loop over the list with map and select the numeric columns of each data.frame.
library(purrr)
library(dplyr)
map(l, ~ .x %>%
select(where(is.numeric)))
Or using base R
lapply(l, Filter, f = is.numeric)
A base R option using lapply twice:
library(carData)
library(datasets)
l = list(Salaries,iris)
lapply(l, \(x) x[, unlist(lapply(x, is.numeric), use.names = FALSE)])
#> [[1]]
#> yrs.since.phd yrs.service salary
#> 1 19 18 139750
#> 2 20 16 173200
#> 3 4 3 79750
#> 4 45 39 115000
#> 5 40 41 141500
#>
#> [[2]]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 5.1 3.5 1.4 0.2
#> 2 4.9 3.0 1.4 0.2
#> 3 4.7 3.2 1.3 0.2
#> 4 4.6 3.1 1.5 0.2
#> 5 5.0 3.6 1.4 0.2
Created on 2022-09-25 with reprex v2.0.2
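One caveat with indexing of the form x[, cols]: when only one column qualifies, the result is dropped to a plain vector instead of a data frame. A minimal base R sketch that guards against this, assuming a hypothetical helper name keep_numeric and using only built-in datasets:

```r
# Two built-in data frames with mixed column types
dfs <- list(iris, ToothGrowth)

# keep_numeric is a hypothetical helper name, not part of any package
keep_numeric <- function(df) {
  # vapply always returns a logical vector, even for zero columns
  is_num <- vapply(df, is.numeric, logical(1))
  # drop = FALSE keeps a data frame even if a single column is selected
  df[, is_num, drop = FALSE]
}

numeric_only <- lapply(dfs, keep_numeric)
lapply(numeric_only, names)

# A data frame with a single numeric column stays a data frame
one_col <- keep_numeric(data.frame(a = 1:3, b = letters[1:3]))
class(one_col)
```

vapply is used instead of sapply so that the type of the result does not depend on the input.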
Suppose I have a dataset named "df" with many columns, and I need to extract every fifth element of only one column, named "country". Could anyone suggest sample code for it?
A base solution is:
df[seq_len(nrow(df)) %% 5 == 0, ]
Also, you could recycle a logical vector:
df[c(rep(FALSE, 4), TRUE), ]
Just use seq to create a sequence of the row numbers you want, and use [seq,] for indexing. Additionally, to select a given column, use [,"col_name"].
df <- iris
row_seq <- seq(5, nrow(df), by=5)
df[row_seq,]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 5 5.0 3.6 1.4 0.2 setosa
#> 10 4.9 3.1 1.5 0.1 setosa
#> 15 5.8 4.0 1.2 0.2 setosa
#> 20 5.1 3.8 1.5 0.3 setosa
...
Created on 2022-05-22 by the reprex package (v2.0.1)
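The row and column selection can also be combined so that only the one column is returned. A minimal sketch, assuming a made-up data frame with a country column (the data here are placeholders, not from the question):

```r
# Hypothetical data: 26 rows, with the column of interest named "country"
df <- data.frame(id = 1:26, country = LETTERS, stringsAsFactors = FALSE)

# Positions 5, 10, 15, ... up to the number of rows
# (seq() raises an error here if the data frame has fewer than 5 rows)
row_seq <- seq(5, nrow(df), by = 5)

# Index the single column directly to get a vector back
every_fifth <- df$country[row_seq]
every_fifth
#> [1] "E" "J" "O" "T" "Y"
```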
Maybe something like this:
library(tidyverse)
tibble(country = LETTERS) |>
filter(row_number() %% 5 == 0)
#> # A tibble: 5 × 1
#> country
#> <chr>
#> 1 E
#> 2 J
#> 3 O
#> 4 T
#> 5 Y
Created on 2022-05-22 by the reprex package (v2.0.1)
I have several data frames for which I need to fix the classes of multiple columns before I can proceed. Because the data frames all have the same variables but the classes differ from one data frame to the next, I figured I would use a for loop and specify the number of unique values below which a column should be coded as a factor rather than numeric.
I tried the following for factor:
dataframes <- list(dataframe1, dataframe2, dataframe2, dataframe3)
for (i in dataframes){
cols.to.factor <-sapply(i, function(col) length(unique(col)) < 6)
i[cols.to.factor] <- apply(i[cols.to.factor] , factor)
}
Now the code runs, but it doesn't change anything. What am I missing?
Thanks for the help in advance!
The instruction
for(i in dataframes)
extracts each element of the list dataframes into i, so the loop modifies a copy that is never assigned back to the original list. One way to correct the problem is
for (i in seq_along(dataframes)){
x <- dataframes[[i]]
cols.to.factor <-sapply(x, function(col) length(unique(col)) < 6)
x[cols.to.factor] <- lapply(x[cols.to.factor] , factor)
dataframes[[i]] <- x
}
An equivalent lapply based solution is
dataframes <- lapply(dataframes, \(x){
cols.to.factor <- sapply(x, function(col) length(unique(col)) < 6)
x[cols.to.factor] <- lapply(x[cols.to.factor], factor)
x
})
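The copy behaviour described above can be checked with a small sketch. The data below are made up for illustration:

```r
# Two tiny data frames in a named list (illustrative data only)
dfs <- list(a = data.frame(x = c(1, 2, 1)),
            b = data.frame(x = c(3, 3, 4)))

# Looping over the elements: d is a copy, so the list is left untouched
for (d in dfs) {
  d$x <- factor(d$x)
}
before <- class(dfs$a$x)

# Looping over the indices and assigning back changes the list itself
for (i in seq_along(dfs)) {
  dfs[[i]]$x <- factor(dfs[[i]]$x)
}
after <- class(dfs$a$x)

c(before = before, after = after)
```

Here before is "numeric" and after is "factor", which is why the original loop appeared to do nothing.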
library(tidyverse)
# example data
list(
iris,
iris %>% mutate(Sepal.Length = Sepal.Length %>% as.character())
) %>%
# unify column classes
map(~ .x %>% mutate(across(everything(), as.character))) %>%
# optional joining if wished
bind_rows() %>%
mutate(Species = Species %>% as.factor()) %>%
as_tibble()
#> # A tibble: 300 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <chr> <chr> <chr> <chr> <fct>
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> 7 4.6 3.4 1.4 0.3 setosa
#> 8 5 3.4 1.5 0.2 setosa
#> 9 4.4 2.9 1.4 0.2 setosa
#> 10 4.9 3.1 1.5 0.1 setosa
#> # … with 290 more rows
Created on 2021-10-05 by the reprex package (v2.0.1)
I have two data frames, df1 and df2, with the same number of columns (and the same column names) but different numbers of rows. I want to combine them into one big data frame in which the first column is the first column of df1, the second column is the first column of df2, the third column is the second column of df1, the fourth column is the second column of df2, and so on.
Using the built-in BOD data frame, construct sample df1 and df2 inputs.
Then, iterating over the columns, convert the jth column of each to a ts series (ts series can be cbind'ed even when they have different numbers of rows), cbind them, and convert the result to a data frame. Finally, give it nicer names. No packages are used.
# test data
df1 <- BOD # 6x2 data frame w Time and demand col names
df2 <- head(10 * BOD, 3) # 3x2 data frame w same names
nc <- ncol(df1)
out <- do.call("data.frame", lapply(1:nc, function(j) cbind(ts(df1[,j]), ts(df2[,j]))))
names(out) <- make.names(rep(names(df1), each = 2), unique = TRUE)
out
giving:
Time Time.1 demand demand.1
1 1 10 8.3 83
2 2 20 10.3 103
3 3 30 19.0 190
4 4 NA 16.0 NA
5 5 NA 15.6 NA
6 7 NA 19.8 NA
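As an alternative to going through ts, the shorter data frame can be padded with NA rows and the columns interleaved by position. This is only a sketch under the same BOD-based test data; the helper name pad_rows is hypothetical:

```r
# Same test data as above
df1 <- BOD
df2 <- head(10 * BOD, 3)

# pad_rows is a hypothetical helper: indexing past the last row yields NA rows
pad_rows <- function(d, n) {
  d <- d[seq_len(n), , drop = FALSE]
  rownames(d) <- NULL
  d
}

n <- max(nrow(df1), nrow(df2))
both <- cbind(pad_rows(df1, n), pad_rows(df2, n))

# order() on the repeated column positions gives 1, 3, 2, 4:
# column 1 of df1, column 1 of df2, column 2 of df1, column 2 of df2
idx <- order(c(seq_along(df1), seq_along(df2)))
out <- both[, idx]
names(out) <- make.unique(rep(names(df1), each = 2))
out
```

The order() call relies on both data frames having the same number of columns, as stated in the question.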
A base solution using the iris dataset.
First, rename the columns so that they are named in sequence. Second, given two data frames, df1 and df2, create a dummy variable in both of them that will serve as a key.
Third, left-join df1 and df2 on dummy (merge with the all.x = TRUE argument). Fourth, remove dummy. Fifth, reorder the columns.
df <- iris
names(df) <- paste0("column",c("A", "B", "C", "D", "E"))
head(df)
df1 <- df[1:10,] #first data frame
df2 <- df[11:16,] #second data frame different number of rows
df1$dummy <- 1:nrow(df1) #Creating a dummy variable for merging
df2$dummy <- 1:nrow(df2) #Creating a dummy variable for merging
result <- base::merge(df1, df2, by = "dummy", all.x = TRUE) #merging per dummy
result$dummy<- NULL # I don't need dummy anymore
result[,sort(names(result))] #Your result
#Output
columnA.x columnA.y columnB.x columnB.y columnC.x columnC.y columnD.x columnD.y columnE.x columnE.y
1 5.1 5.4 3.5 3.7 1.4 1.5 0.2 0.2 setosa setosa
2 4.9 4.8 3.0 3.4 1.4 1.6 0.2 0.2 setosa setosa
3 4.7 4.8 3.2 3.0 1.3 1.4 0.2 0.1 setosa setosa
4 4.6 4.3 3.1 3.0 1.5 1.1 0.2 0.1 setosa setosa
5 5.0 5.8 3.6 4.0 1.4 1.2 0.2 0.2 setosa setosa
6 5.4 5.7 3.9 4.4 1.7 1.5 0.4 0.4 setosa setosa
7 4.6 NA 3.4 NA 1.4 NA 0.3 NA setosa <NA>
8 5.0 NA 3.4 NA 1.5 NA 0.2 NA setosa <NA>
9 4.4 NA 2.9 NA 1.4 NA 0.2 NA setosa <NA>
10 4.9 NA 3.1 NA 1.5 NA 0.1 NA setosa <NA>
# load packages
library(tidyverse)
# define function to allow cbinding DFs of different length
# taken from: https://stackoverflow.com/a/7962286
cbind.fill <- function(...){
nm <- list(...)
nm <- lapply(nm, as.matrix)
n <- max(sapply(nm, nrow))
do.call(cbind, lapply(nm, function (x)
rbind(x, matrix(, n-nrow(x), ncol(x)))))
}
# make some sample DFs
df1 <- data.frame(a = 1:3, b = 4:6)
df2 <- data.frame(a = 1:5, b = 6:10)
# bind them side by side, then reorder so matching columns sit next to each other
cbind.fill(df1, df2) %>%
as_tibble(.name_repair = "unique") %>%
select(c(seq(1, ncol(.), 2), seq(2, ncol(.), 2)))
#> New names:
#> * a -> a...1
#> * b -> b...2
#> * a -> a...3
#> * b -> b...4
#> # A tibble: 5 x 4
#> a...1 a...3 b...2 b...4
#> <int> <int> <int> <int>
#> 1 1 1 4 6
#> 2 2 2 5 7
#> 3 3 3 6 8
#> 4 NA 4 NA 9
#> 5 NA 5 NA 10
Created on 2021-03-25 by the reprex package (v1.0.0)
Here's a step by step guide using the iris dataset
# Use iris dataframe used for both df1 and df2
df1 <- iris
df2 <- iris
tot_cols <- ncol(df1) + ncol(df2)
tot_rows <- max(nrow(df1), nrow(df2)) # the output needs as many rows as the longer input, not the sum
# Create empty output data.frame with desired number of rows and cols
df <- data.frame(matrix(ncol=tot_cols, nrow=tot_rows))
# Assign columns to output data.frame
df[, c(seq(1, tot_cols, 2))] <- df1
df[, seq(2, tot_cols, 2)] <- df2
# Assign interleaved column names to output data.frame
colnames(df) <- c(rbind(colnames(df1), colnames(df2)))
# View output
head(df)
# Sepal.Length Sepal.Length Sepal.Width Sepal.Width Petal.Length Petal.Length Petal.Width Petal.Width Species Species
# 1 5.1 5.1 3.5 3.5 1.4 1.4 0.2 0.2 setosa setosa
# 2 4.9 4.9 3.0 3.0 1.4 1.4 0.2 0.2 setosa setosa
# 3 4.7 4.7 3.2 3.2 1.3 1.3 0.2 0.2 setosa setosa
# 4 4.6 4.6 3.1 3.1 1.5 1.5 0.2 0.2 setosa setosa
# 5 5.0 5.0 3.6 3.6 1.4 1.4 0.2 0.2 setosa setosa
# 6 5.4 5.4 3.9 3.9 1.7 1.7 0.4 0.4 setosa setosa
Reference:
See How to merge 2 vectors alternating indexes? for more info on why we use rbind().
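The rbind() trick can be seen in isolation with two short vectors. rbind() stacks them as rows of a matrix, and c() then reads that matrix column by column, which alternates the elements. A minimal sketch with made-up names:

```r
# Two name vectors of equal length (placeholder values)
left  <- c("a", "b", "c")
right <- c("x", "y", "z")

# rbind() builds a 2 x 3 matrix; c() flattens it in column-major order
interleaved <- c(rbind(left, right))
interleaved
#> [1] "a" "x" "b" "y" "c" "z"
```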
I am importing a spreadsheet where I have a known vector of what the column headings were originally. When read_excel imports the data, it rightly complains of the duplicated columns and renames them to distinguish them. This is great behaviour. My question is how might I select (from the duplicated columns) the first occurrence of that duplicated column, drop all other duplicated ones and then rename the column back to the original name. I have a working script but it seems clunky. I always struggle to manipulate column headers programmatically within a pipeline.
library(readxl)
library(dplyr, warn.conflicts = FALSE)
cols_names <- c("Sepal.Length", "Sepal.Length", "Petal.Length", "Petal.Length", "Species")
datasets <- readxl_example("datasets.xlsx")
d <- read_excel(datasets, col_names = cols_names, skip = 1)
#> New names:
#> * Sepal.Length -> Sepal.Length...1
#> * Sepal.Length -> Sepal.Length...2
#> * Petal.Length -> Petal.Length...3
#> * Petal.Length -> Petal.Length...4
d_sub <- d %>%
select(!which(duplicated(cols_names)))
new_col_names <- gsub("\\.\\.\\..*","", colnames(d_sub))
colnames(d_sub) <- new_col_names
d_sub
#> # A tibble: 150 x 3
#> Sepal.Length Petal.Length Species
#> <dbl> <dbl> <chr>
#> 1 5.1 1.4 setosa
#> 2 4.9 1.4 setosa
#> 3 4.7 1.3 setosa
#> 4 4.6 1.5 setosa
#> 5 5 1.4 setosa
#> 6 5.4 1.7 setosa
#> 7 4.6 1.4 setosa
#> 8 5 1.5 setosa
#> 9 4.4 1.4 setosa
#> 10 4.9 1.5 setosa
#> # ... with 140 more rows
Created on 2020-04-08 by the reprex package (v0.3.0)
Any idea how to do this in a more streamlined manner?
Based on @rawr's comment, here is the answer as I see it:
library(readxl)
library(dplyr, warn.conflicts = FALSE)
datasets <- readxl_example("datasets.xlsx")
cols_names <- c("Sepal.Length", "Sepal.Length", "Petal.Length", "Petal.Length", "Species")
d <- read_excel(datasets, col_names = cols_names, skip = 1, .name_repair = make.unique) %>%
select(all_of(cols_names))
#> New names:
#> * Sepal.Length -> Sepal.Length.1
#> * Petal.Length -> Petal.Length.1
Created on 2020-04-08 by the reprex package (v0.3.0)
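The same idea can be sketched in base R without touching the repaired names at all: duplicated() marks every occurrence after the first, so its negation both selects the columns to keep and picks their original names. The data frame below is a small stand-in for the imported sheet, not the real file:

```r
# Original headings, with duplicates
cols_names <- c("Sepal.Length", "Sepal.Length", "Petal.Length", "Petal.Length", "Species")

# Stand-in for the imported data, with names repaired the way the import shows
d <- data.frame(1:3, 4:6, 7:9, 10:12, c("a", "b", "c"))
names(d) <- c("Sepal.Length...1", "Sepal.Length...2",
              "Petal.Length...3", "Petal.Length...4", "Species")

# TRUE for the first occurrence of each heading, FALSE for later repeats
first <- !duplicated(cols_names)

# Keep only the first occurrences and restore the original names
d_sub <- d[, first, drop = FALSE]
names(d_sub) <- cols_names[first]
names(d_sub)
#> [1] "Sepal.Length" "Petal.Length" "Species"
```

This assumes the columns of d are in the same order as cols_names, which holds when the vector was passed as col_names on import.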
I can select and rename the column name like this without any problem:
library(tidyverse)
iris <- as_tibble(iris)
iris %>% select(sepal_ln = Sepal.Length, sepal_wd = Sepal.Width)
#> # A tibble: 150 × 2
#> sepal_ln sepal_wd
#> <dbl> <dbl>
#> 1 5.1 3.5
#> 2 4.9 3.0
#> 3 4.7 3.2
#> 4 4.6 3.1
#> 5 5.0 3.6
#> 6 5.4 3.9
#> 7 4.6 3.4
#> 8 5.0 3.4
#> 9 4.4 2.9
#> 10 4.9 3.1
#> # ... with 140 more rows
But what I want to do is refer to the column by a string instead of by its bare name. I tried the following, but it failed:
> wanted <- "Sepal"
> iris %>% select(sepal_ln = !! paste0(wanted,".Length"),
+ sepal_wd = !! paste0(wanted,".Width"),
+ )
Error: "Sepal.Length", "Sepal.Width": must resolve to integer column positions, not string
>
What's the right way to do that?
We can use select_
iris %>%
select_(sepal_ln = paste0(wanted, ".Length"), paste0(wanted, ".Width"))
Also, there are wrappers within select to do this more easily i.e. one_of, contains, matches etc. to select the required columns from the data
iris %>%
select(setNames(one_of(paste0(wanted, c(".Length", ".Width"))),
c("sepal_ln", "sepal_wd"))) %>%
head(2)
# A tibble: 2 × 2
# sepal_ln sepal_wd
# <dbl> <dbl>
#1 5.1 3.5
#2 4.9 3.0
NOTE: select_ and the other underscored verbs were deprecated in dplyr 0.7.0, so the select helpers shown above are the safer choice in current code.
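In current dplyr (assuming version 1.0.0 or later, where all_of() is available), a string can be turned into a selection with all_of() and renamed in the same call. A minimal sketch, not the original answer's method:

```r
library(dplyr)

wanted <- "Sepal"

# all_of() selects columns by character vector and errors if one is missing;
# the name on the left-hand side renames the selected column
res <- iris %>%
  select(sepal_ln = all_of(paste0(wanted, ".Length")),
         sepal_wd = all_of(paste0(wanted, ".Width")))

names(res)
#> [1] "sepal_ln" "sepal_wd"
```

If a missing column should be skipped silently instead of raising an error, any_of() can be used in place of all_of().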