extract every nth element of a column in a dataset - r

Let suppose I have a dataset, named "df", with many columns, and I need to extract every fifth element of only one column, named “country”. Could anyone suggest a sample code for it?

A base solution is:
df[seq_len(nrow(df)) %% 5 == 0, ]
Also, you could recycle a logical vector:
df[c(rep(FALSE, 4), TRUE), ]

Just use seq to create a sequence of the numbers you want, and use [seq,] for indexing. Aditionally, to select a given oclumn, use [,"col_name"]
df <- iris
row_seq <- seq(5, nrow(df), by=5)
df[row_seq,]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 5 5.0 3.6 1.4 0.2 setosa
#> 10 4.9 3.1 1.5 0.1 setosa
#> 15 5.8 4.0 1.2 0.2 setosa
#> 20 5.1 3.8 1.5 0.3 setosa
...
Created on 2022-05-22 by the reprex package (v2.0.1)

Maybe something like this:
library(tidyverse)
tibble(country = LETTERS) |>
filter(row_number() %% 5 == 0)
#> # A tibble: 5 × 1
#> country
#> <chr>
#> 1 E
#> 2 J
#> 3 O
#> 4 T
#> 5 Y
Created on 2022-05-22 by the reprex package (v2.0.1)

Related

Select only numeric columns from dataframes in a list

I have this list of dataframes as follows :
library(carData)
library(datasets)
l = list(Salaries,iris)
I only want to select the numeric columns in this list of datasets. Already tried lapply with the function select_if(is.numeric) but it did not work with me.
We can use select with where in the newer version of dplyr - loop over the list with map and select the columns of the data.frames
library(purrr)
library(dplyr)
map(l, ~ .x %>%
select(where(is.numeric)))
Or using base R
lapply(l, Filter, f = is.numeric)
base R option using lapply twice like this:
library(carData)
library(datasets)
l = list(Salaries,iris)
lapply(l, \(x) x[, unlist(lapply(x, is.numeric), use.names = FALSE)])
#> [[1]]
#> yrs.since.phd yrs.service salary
#> 1 19 18 139750
#> 2 20 16 173200
#> 3 4 3 79750
#> 4 45 39 115000
#> 5 40 41 141500
#>
#> [[2]]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 5.1 3.5 1.4 0.2
#> 2 4.9 3.0 1.4 0.2
#> 3 4.7 3.2 1.3 0.2
#> 4 4.6 3.1 1.5 0.2
#> 5 5.0 3.6 1.4 0.2
Created on 2022-09-25 with reprex v2.0.2

How to add new row with concatenated strings for each group? [duplicate]

If I add a new row to the iris dataset with:
iris <- as_tibble(iris)
> iris %>%
add_row(.before=0)
# A tibble: 151 × 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <chr>
1 NA NA NA NA <NA> <--- Good!
2 5.1 3.5 1.4 0.2 setosa
3 4.9 3.0 1.4 0.2 setosa
It works. So, why can't I add a new row on top of each "subset" with:
iris %>%
group_by(Species) %>%
add_row(.before=0)
Error: is.data.frame(df) is not TRUE
If you want to use a grouped operation, you need do like JasonWang described in his comment, as other functions like mutate or summarise expect a result with the same number of rows as the grouped data frame (in your case, 50) or with one row (e.g. when summarising).
As you probably know, in general do can be slow and should be a last resort if you cannot achieve your result in another way. Your task is quite simple because it only involves adding extra rows in your data frame, which can be done by simple indexing, e.g. look at the output of iris[NA, ].
What you want is essentially to create a vector
indices <- c(NA, 1:50, NA, 51:100, NA, 101:150)
(since the first group is in rows 1 to 50, the second one in 51 to 100 and the third one in 101 to 150).
The result is then iris[indices, ].
A more general way of building this vector uses group_indices.
indices <- seq(nrow(iris)) %>%
split(group_indices(iris, Species)) %>%
map(~c(NA, .x)) %>%
unlist
(map comes from purrr which I assume you have loaded as you have tagged this with tidyverse).
A more recent version would be using group_modify() instead of do().
iris %>%
as_tibble() %>%
group_by(Species) %>%
group_modify(~ add_row(.x,.before=0))
#> # A tibble: 153 x 5
#> # Groups: Species [3]
#> Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa NA NA NA NA
#> 2 setosa 5.1 3.5 1.4 0.2
#> 3 setosa 4.9 3 1.4 0.2
With a slight variation, this could also be done:
library(purrr)
library(tibble)
iris %>%
group_split(Species) %>%
map_dfr(~ .x %>%
add_row(.before = 1))
# A tibble: 153 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 NA NA NA NA NA
2 5.1 3.5 1.4 0.2 setosa
3 4.9 3 1.4 0.2 setosa
4 4.7 3.2 1.3 0.2 setosa
5 4.6 3.1 1.5 0.2 setosa
6 5 3.6 1.4 0.2 setosa
7 5.4 3.9 1.7 0.4 setosa
8 4.6 3.4 1.4 0.3 setosa
9 5 3.4 1.5 0.2 setosa
10 4.4 2.9 1.4 0.2 setosa
# ... with 143 more rows
This also can be used for grouped data frame, however, it's a bit verbose:
library(dplyr)
iris %>%
group_by(Species) %>%
summarise(Sepal.Length = c(NA, Sepal.Length),
Sepal.Width = c(NA, Sepal.Width),
Petal.Length = c(NA, Petal.Length),
Petal.Width = c(NA, Petal.Width),
Species = c(NA, Species))

for loop to change columns with a specified unique length to factor in multiple dataframes

I have several dataframes for which I need to fix the classes of multiple columns, before I can proceed. Because the dataframes all have the same variables but the classes seemed to differ from one dataframe to the other, I figured I would go for a 'for loop'and specify the unique length upon which a column should be coded as factor or numeric.
I tried the following for factor:
dataframes <- list(dataframe1, dataframe2, dataframe2, dataframe3)
for (i in dataframes){
cols.to.factor <-sapply(i, function(col) length(unique(col)) < 6)
i[cols.to.factor] <- apply(i[cols.to.factor] , factor)
}
now the code runs, but it doesn't change anything. What am I missing?
Thanks for the help in advance!
The instruction
for(i in dataframes)
extracts i from the list dataframes and the loop changes the copy, that is never reassigned to the original. A way to correct the problem is
for (i in seq_along(dataframes)){
x <- dataframes[[i]]
cols.to.factor <-sapply(x, function(col) length(unique(col)) < 6)
x[cols.to.factor] <- lapply(x[cols.to.factor] , factor)
dataframes[[i]] <- x
}
An equivalent lapply based solution is
dataframes <- lapply(dataframes, \(x){
cols.to.factor <- sapply(x, function(col) length(unique(col)) < 6)
x[cols.to.factor] <- lapply(x[cols.to.factor], factor)
x
})
library(tidyverse)
# example data
list(
iris,
iris %>% mutate(Sepal.Length = Sepal.Length %>% as.character())
) %>%
# unify column classes
map(~ .x %>% mutate(across(everything(), as.character))) %>%
# optional joining if wished
bind_rows() %>%
mutate(Species = Species %>% as.factor()) %>%
as_tibble()
#> # A tibble: 300 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <chr> <chr> <chr> <chr> <fct>
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> 7 4.6 3.4 1.4 0.3 setosa
#> 8 5 3.4 1.5 0.2 setosa
#> 9 4.4 2.9 1.4 0.2 setosa
#> 10 4.9 3.1 1.5 0.1 setosa
#> # … with 290 more rows
Created on 2021-10-05 by the reprex package (v2.0.1)

Programmatically deal with duplicated columns using name_repair

I am importing a spreadsheet where I have a known vector of what the column headings were originally. When read_excel imports the data, it rightly complains of the duplicated columns and renames them to distinguish them. This is great behaviour. My question is how might I select (from the duplicated columns) the first occurrence of that duplicated column, drop all other duplicated ones and then rename the column back to the original name. I have a working script but it seems clunky. I always struggle to manipulate column headers programmatically within a pipeline.
library(readxl)
library(dplyr, warn.conflicts = FALSE)
cols_names <- c("Sepal.Length", "Sepal.Length", "Petal.Length", "Petal.Length", "Species")
datasets <- readxl_example("datasets.xlsx")
d <- read_excel(datasets, col_names = cols_names, skip = 1)
#> New names:
#> * Sepal.Length -> Sepal.Length...1
#> * Sepal.Length -> Sepal.Length...2
#> * Petal.Length -> Petal.Length...3
#> * Petal.Length -> Petal.Length...4
d_sub <- d %>%
select(!which(duplicated(cols_names)))
new_col_names <- gsub("\\.\\.\\..*","", colnames(d_sub))
colnames(d_sub) <- new_col_names
d_sub
#> # A tibble: 150 x 3
#> Sepal.Length Petal.Length Species
#> <dbl> <dbl> <chr>
#> 1 5.1 1.4 setosa
#> 2 4.9 1.4 setosa
#> 3 4.7 1.3 setosa
#> 4 4.6 1.5 setosa
#> 5 5 1.4 setosa
#> 6 5.4 1.7 setosa
#> 7 4.6 1.4 setosa
#> 8 5 1.5 setosa
#> 9 4.4 1.4 setosa
#> 10 4.9 1.5 setosa
#> # ... with 140 more rows
Created on 2020-04-08 by the reprex package (v0.3.0)
Any idea how to do this in a more streamlined manner?
Based on #rawr's comment, here is the answer as I see it:
library(readxl)
library(dplyr, warn.conflicts = FALSE)
datasets <- readxl_example("datasets.xlsx")
cols_names <- c("Sepal.Length", "Sepal.Length", "Petal.Length", "Petal.Length", "Species")
d <- read_excel(datasets, col_names = cols_names, skip = 1, .name_repair = make.unique) %>%
select(all_of(cols_names))
#> New names:
#> * Sepal.Length -> Sepal.Length.1
#> * Petal.Length -> Petal.Length.1
Created on 2020-04-08 by the reprex package (v0.3.0)

Is there a way to use a lookup value from a table in a mutate column?

library(tidyverse)
df <- iris %>%
group_by(Species) %>%
mutate(Petal.Dim = Petal.Length * Petal.Width,
rank = rank(desc(Petal.Dim))) %>%
mutate(new_col = rank == 4, Sepal.Width)
table <- df %>%
filter(rank == 4) %>%
select(Species, new_col = Sepal.Width)
correct_df <- left_join(df, table, by = "Species")
df
#> # A tibble: 150 x 8
#> # Groups: Species [3]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal.Dim
#> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
#> 1 5.1 3.5 1.4 0.2 setosa 0.280
#> 2 4.9 3 1.4 0.2 setosa 0.280
#> 3 4.7 3.2 1.3 0.2 setosa 0.26
#> 4 4.6 3.1 1.5 0.2 setosa 0.3
#> 5 5 3.6 1.4 0.2 setosa 0.280
#> 6 5.4 3.9 1.7 0.4 setosa 0.68
#> 7 4.6 3.4 1.4 0.3 setosa 0.42
#> 8 5 3.4 1.5 0.2 setosa 0.3
#> 9 4.4 2.9 1.4 0.2 setosa 0.280
#> 10 4.9 3.1 1.5 0.1 setosa 0.15
#> # ... with 140 more rows, and 2 more variables: rank <dbl>, new_col <lgl>
I'm basically looking for new_col to show the value that corresponds with rank = 4 from the Sepal.Width column. In this case, those values would be 3.9, 3.3, and 3.8. I'm envisioning this similar to a VLookup, or Index/Match in Excel.
When ever I think "now I need to use VLOOKUP like I did in the past in Excel" I find the left_join() function helpful. It's also part of the dplyr package. Instead of "looking up" values in one table in another table, it's easier for R to just make one bigger table where one table remains unchanged (here the "left" one or the first term you put in the function) and the other is added using a column or columns they have in common as an index.
In your specific example, I can't entirely understand what you want new_col to have in it. If you want to do Excel-style VLOOKUP in R, then left_join() is the best starting point.
The question is not clear since it does not mention the purpose of a Vlookup or Index/Match like operation from Excel.
Also, you don't mention what value should "new_col" have if rank is not equal to 4.
Assuming the value is NA, the below solution with a simple ifelse would work:
df <- iris %>%
group_by(Species) %>%
mutate(Petal.Dim = Petal.Length * Petal.Width,
rank = rank(desc(Petal.Dim))) %>%
ungroup() %>%
mutate(new_col = ifelse(rank == 4, Sepal.Width,NA))
df

Resources