Adding a counting sequence to a nested dataframe in R using purrr

I have a list of nested data frames. I want to add a column of a sequential count to each nested dataframe.
This works on an ordinary data frame:
library(tidyverse)
mtcars %>% mutate(index = sequence(n()))
mpg cyl disp hp drat wt qsec vs am gear carb index
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 1
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 2
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 3
4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 4
5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 5
But it does not work in a nested structure:
table <-
tribble(~id, ~data, 1, mtcars) %>%
mutate(data_index = map(data, ~mutate(.x, index = sequence(n()))))
table
# A tibble: 1 x 3
id data data_index
<dbl> <list> <list>
1 1 <data.frame [32 x 11]> <data.frame [32 x 12]>
head(table$data_index[[1]])
mpg cyl disp hp drat wt qsec vs am gear carb index
1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 1
2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 1
3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 1
4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 1
5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 1
6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 1
Any thoughts on how to overcome this?

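I modified your code a little bit. The key is that n() here returns the number of rows of the outer tibble (which has a single row), not of the nested mtcars data frame, so sequence(n()) is just 1 and gets recycled down the column. So I used nrow(.x) instead.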
library(tidyverse)
dt <- tribble(~id, ~data, 1, mtcars)
dt2 <- dt %>%
mutate(data_index = map(data, ~mutate(.x, index = sequence(nrow(.x)))))
head(dt2$data_index[[1]])
# mpg cyl disp hp drat wt qsec vs am gear carb index
# 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 1
# 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 2
# 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 3
# 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 4
# 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 5
# 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 6
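As a minimal sketch of the same idea (not the asker's exact setup), the index can also be built with seq_len(nrow(.x)), which counts the rows of each nested data frame no matter how n() is resolved. The two-row tibble `nested` and its contents are illustrative assumptions, not part of the original question.

```r
library(dplyr)
library(purrr)
library(tibble)

# Hypothetical nested tibble with two data frames of different sizes
nested <- tibble(id = c(1, 2), data = list(mtcars, head(iris, 10)))

# seq_len(nrow(.x)) depends only on the nested data frame itself
indexed <- nested %>%
  mutate(data_index = map(data, ~ mutate(.x, index = seq_len(nrow(.x)))))

head(indexed$data_index[[1]]$index)
tail(indexed$data_index[[2]]$index)
```

Using two data frames of different lengths makes it easy to see that each index restarts at 1 and runs to that data frame's own row count.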

Related

How to obtain a random sample in order to their category?

I have a DF called 'listings_df'. I need to fill in the NA values in the variables 'number_of_reviews' and 'review_scores_rating' with random samples drawn according to each group of 'room_type'.
First, I tried grouping by 'room_type':
test <- listings_df %>% group_by(room_type)
Then I select the columns where I want to replace the NA values and create the samples:
test$number_of_reviews[is.na(listings_df$number_of_reviews)] <-
sample(listings_df$number_of_reviews, size = sum(is.na(listings_df$number_of_reviews)))
test$review_scores_rating[is.na(listings_df$review_scores_rating)] <-
sample(listings_df$review_scores_rating, size = sum(is.na(listings_df$review_scores_rating)))
I am not sure whether this creates the random data according to room_type. I would also like to know whether it's possible to do this with a loop.
Thanks!
What you're asking for is called imputation. I'll demonstrate using mtcars as the data, cyl as the grouping variable (your room_type, I suspect), and other columns with NA values.
mt <- mtcars
set.seed(42)
mt$disp[sample(32,10)] <- NA
mt$hp[sample(32,10)] <- NA
head(mt)
# mpg cyl disp hp drat wt qsec vs am gear carb
# Mazda RX4 21.0 6 NA 110 3.90 2.620 16.46 0 1 4 4
# Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
# Datsun 710 22.8 4 108 NA 3.85 2.320 18.61 1 1 4 1
# Hornet 4 Drive 21.4 6 NA NA 3.08 3.215 19.44 1 0 3 1
# Hornet Sportabout 18.7 8 NA NA 3.15 3.440 17.02 0 0 3 2
# Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
From here:
dplyr
library(dplyr)
set.seed(42)
mt %>%
group_by(cyl) %>%
mutate(across(c(disp, hp), ~ coalesce(., sample(na.omit(.), size=n(), replace=TRUE)))) %>%
ungroup()
# # A tibble: 32 × 11
# mpg cyl disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
# 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
# 3 22.8 4 108 91 3.85 2.32 18.6 1 1 4 1
# 4 21.4 6 145 110 3.08 3.22 19.4 1 0 3 1
# 5 18.7 8 301 180 3.15 3.44 17.0 0 0 3 2
# 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
# 7 14.3 8 301 245 3.21 3.57 15.8 0 0 3 4
# 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
# 9 22.8 4 141. 52 3.92 3.15 22.9 1 0 4 2
# 10 19.2 6 225 123 3.92 3.44 18.3 1 0 4 4
# # … with 22 more rows
data.table
library(data.table)
cols <- c("disp", "hp")
set.seed(42)
as.data.table(mt)[, c(cols) := lapply(.SD, \(z) fcoalesce(z, sample(na.omit(z), size=.N, replace=TRUE))), .SDcols = cols][] |>
head()
# mpg cyl disp hp drat wt qsec vs am gear carb
# <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
# 1: 21.0 6 79.0 110 3.90 2.620 16.46 0 1 4 4
# 2: 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
# 3: 22.8 4 108.0 113 3.85 2.320 18.61 1 1 4 1
# 4: 21.4 6 460.0 97 3.08 3.215 19.44 1 0 3 1
# 5: 18.7 8 146.7 62 3.15 3.440 17.02 0 0 3 2
# 6: 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Note that this call has no by = cyl, so the replacement values are drawn from the whole column rather than from each cyl group (row 1, a 6-cylinder car, receives disp 79.0, a value that occurs only among 4-cylinder cars in mtcars). Adding by = cyl inside the [ call restricts the sampling to each group.
base R
set.seed(42)
mt[cols] <- lapply(mt[cols], \(z) ave(z, mt$cyl, FUN = \(z) ifelse(is.na(z), sample(na.omit(z), size=length(z), replace=TRUE), z)))
head(mt)
# mpg cyl disp hp drat wt qsec vs am gear carb
# Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
# Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
# Datsun 710 22.8 4 108 91 3.85 2.320 18.61 1 1 4 1
# Hornet 4 Drive 21.4 6 145 110 3.08 3.215 19.44 1 0 3 1
# Hornet Sportabout 18.7 8 301 180 3.15 3.440 17.02 0 0 3 2
# Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Note: while I set the random seed for imputation in each dialect, there is no expectation that the order of columns and fixes will be the same between the dialects. For this reason we see that the replacement for the NA values is not the same between the dialects of code; the seed is provided for basic reproducibility, not identical results.
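One caveat with sample(na.omit(z), ...): when a group has exactly one observed numeric value, sample() treats that value as the upper bound of 1:n rather than as a one-element vector. The sketch below works around this by sampling positions with sample.int(). The helper name `impute_from_group` and the hand-picked NA rows are assumptions made for illustration.

```r
# Hypothetical helper: fill NAs in x with values drawn from x's own observed values.
impute_from_group <- function(x) {
  observed <- x[!is.na(x)]
  if (length(observed) == 0) return(x)  # nothing to sample from; leave as-is
  # Sample positions, so a single observed value is not expanded to 1:n
  draws <- observed[sample.int(length(observed), size = length(x), replace = TRUE)]
  ifelse(is.na(x), draws, x)
}

mt <- mtcars
mt$disp[c(1, 3, 5, 8, 12)] <- NA  # blank out values in several cyl groups

set.seed(42)
# ave() applies the helper separately within each cyl group
mt$disp <- ave(mt$disp, mt$cyl, FUN = impute_from_group)
head(mt)
```

Because ave() splits by cyl before calling the helper, every imputed value comes from a car with the same number of cylinders.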

Split variable into new column variable for each factor

Using the R dataset mtcars, I want to make a new binary variable for each level of the "cyl" variable.
For example, the values of cyl are 6, 4, and 8.
I want a new dataset with variables "cyl_4", "cyl_6", and "cyl_8" equal to 1 when each of these numbers occur.
I am looking for solutions that create a new variable for each level of the original variable.
Have:
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Want:
mpg cyl disp hp drat wt qsec vs am gear carb cyl_4 cyl_6 cyl_8
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 0 1 0
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 0 1 0
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 1 0 0
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 0 1 0
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 0 0 1
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 0 1 0
Here's one tidyverse solution: pivot on the cyl column, then replace the values in the 3 resulting columns with 0 where they are NA, otherwise with 1.
library(dplyr)
library(tidyr)
library(tibble)
mtcars %>%
rownames_to_column(var = "model") %>%
pivot_wider(names_from = "cyl",
values_from = "cyl",
names_prefix = "cyl_",
names_sort = TRUE) %>%
mutate(across(starts_with("cyl"), ~ ifelse(is.na(.), 0, 1)))
Result (first 5 rows):
# A tibble: 32 × 14
model mpg disp hp drat wt qsec vs am gear carb cyl_4 cyl_6 cyl_8
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mazda RX4 21 160 110 3.9 2.62 16.5 0 1 4 4 0 1 0
2 Mazda RX4 Wag 21 160 110 3.9 2.88 17.0 0 1 4 4 0 1 0
3 Datsun 710 22.8 108 93 3.85 2.32 18.6 1 1 4 1 1 0 0
4 Hornet 4 Drive 21.4 258 110 3.08 3.22 19.4 1 0 3 1 0 1 0
5 Hornet Sportabout 18.7 360 175 3.15 3.44 17.0 0 0 3 2 0 0 1
You could use model.matrix() to create the design matrix for a categorical variable.
cbind(mtcars, model.matrix(~ cyl - 1, transform(mtcars, cyl = as.factor(cyl))))
# mpg cyl disp hp drat wt qsec vs am gear carb cyl4 cyl6 cyl8
# Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 0 1 0
# Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 0 1 0
# Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 1 0 0
# Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 0 1 0
# Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 0 0 1
# Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 0 1 0
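If the exact column names cyl_4, cyl_6 and cyl_8 from the question matter, a plain base R sketch such as the following may be enough. It compares cyl against each level in turn; the object names `lvls`, `dummies` and `result` are illustrative.

```r
# One 0/1 column per distinct value of cyl, in sorted order
lvls <- sort(unique(mtcars$cyl))
dummies <- sapply(lvls, function(level) as.integer(mtcars$cyl == level))
colnames(dummies) <- paste0("cyl_", lvls)

# Append the indicator columns to the original data
result <- cbind(mtcars, dummies)
head(result)
```

Each row has exactly one indicator set to 1, which is a quick sanity check on the result.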

"Total score" based on multiple "above/below median values"

I am new to R and I have a problem that I simply can't find the solution to. I have tried reading through older posts etc., but I can't figure out how it could/should be done. I hope that some of you might be able to help.
Using the mtcars dataset as an example,
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
I wish to assign the cars a value (e.g. 1) if, for instance, the hp value is above the median and 0 if it is below. This could be called "hp_index".
I would do the same for, let's say, cyl, disp and drat. Then I would like to assign the cars an "index_score", where a car would be given the value 1 if at least 3 of the 4 variables hp, cyl, disp and drat are above the median (that is, if 3 or 4 of "hp_index", "cyl_index", "disp_index" and "drat_index" are 1).
Once again, I really hope that some of you might be able to help!
Thanks in advance, and have a nice day!
Here's a tidyverse solution:
library(tidyverse)
mtcars %>%
mutate(across(c(hp, cyl, disp, drat), .fns = ~if_else(.x >= median(.x), 1, 0), .names = "{.col}_index"),
index_score = apply(across(c(hp_index, cyl_index, disp_index, drat_index)), 1, sum),
index_score = if_else(index_score >= 3, 1, 0))
Note: I defined the conditions so that values are split into >= median (1) and < median (0). If you used only > and <, a value exactly equal to the median would match neither condition and could end up as a missing value.
First six cases:
mpg cyl disp hp drat wt qsec vs am gear carb hp_index cyl_index disp_index drat_index index_score
1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 0 1 0 1 0
2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 0 1 0 1 0
3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 0 0 0 1 0
4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 0 1 1 0 0
5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 1 1 1 0 1
6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 0 1 1 0 0
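The tie-handling point in the note above can be seen on a tiny vector. This is only an illustration with made-up values; `strict` and `inclusive` are hypothetical names.

```r
x <- c(1, 2, 3)  # the median is 2, so the middle value is a tie

# Strict comparisons on both sides leave the tie unclassified
strict <- ifelse(x > median(x), 1, ifelse(x < median(x), 0, NA))

# An inclusive lower bound puts the tie in the "1" group
inclusive <- ifelse(x >= median(x), 1, 0)

strict     # 0 NA 1
inclusive  # 0 1 1
```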
Using dplyr:
library(dplyr)
cols <- c('cyl', 'disp', 'drat', 'hp')
mtcars %>%
mutate(across(all_of(cols), list(index = ~+(.x > median(.x))))) %>%
mutate(index_score = +(rowSums(select(., ends_with('index'))) >= 3)) -> result
result
Used select(., ends_with('index')) in rowSums assuming there are no other columns that end with index in your actual dataset apart from the newly created ones with across.
This version is based on all the columns; otherwise, specific columns can be selected:
mtcars$index=rowSums(
sapply(mtcars,function(x){
ifelse(x>median(x),1,0)
})
)
mpg cyl disp hp drat wt qsec vs am gear carb index
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 5
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 4
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 4
...
From here you can do another ifelse on the "index" column to decide whether it should be 0 or 1.
To use different conditions for different columns, as you proposed:
rowSums(
cbind(
sapply(mtcars[c("cyl", "disp", "drat")],function(x){ifelse(x>median(x),1,0)}),
sapply(mtcars["hp"],function(x){ifelse(x<median(x),1,0)})
)
)
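Putting the base R pieces together for the four columns named in the question, a sketch could look like this. It assumes the same >= median rule and the "at least 3 of 4" threshold used in the tidyverse answer; `above` and `index_score` are illustrative names.

```r
cols <- c("hp", "cyl", "disp", "drat")

# 0/1 matrix: is each value at or above its column median?
above <- sapply(mtcars[cols], function(x) as.integer(x >= median(x)))

# 1 when at least 3 of the 4 indicators are set
index_score <- as.integer(rowSums(above) >= 3)

head(cbind(mtcars[cols], above, index_score))
```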

Transform string of expression into quotable expression

How do I transform a string of expression into a quotable expression?
Example:
This is the result I want:
mutate(mtcars,answer=wt+wt)
# mpg cyl disp hp drat wt qsec vs am gear carb answer
# 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 5.240
# 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 5.750
# 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 4.640
# 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 6.430
# 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 6.880
# 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 6.920
...
Here's the function I am writing:
f<-function(df,string_expression){
se<-enexpr(string_expression)
mutate(df,answer=!!se)
}
It will work if I use the following functional call:
f(mtcars,wt+wt)
# mpg cyl disp hp drat wt qsec vs am gear carb answer
# 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 5.240
# 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 5.750
# 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 4.640
# 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 6.430
# 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 6.880
# 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 6.920
...
However, I would like to provide the expression as a string, so I must use the following function call:
f(mtcars,'wt+wt')
# mpg cyl disp hp drat wt qsec vs am gear carb answer
# 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 wt+wt
# 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 wt+wt
# 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 wt+wt
# 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 wt+wt
# 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 wt+wt
...
How do I make it (either change the function definition or function call) to get the result I want?
What I have tried:
I have tried to sym(string_expression) -- didn't work.
I have tried to quo(string_expression) -- didn't work.
Thank you!
You could change your f function to something like this:
f<-function(df,string_expression){
mutate(df, answer = eval(parse(text = string_expression)))
}
head(f(mtcars,'wt+wt'))
mpg cyl disp hp drat wt qsec vs am gear carb answer
1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 5.24
2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 5.75
3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 4.64
4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 6.43
5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 6.88
6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 6.92
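An alternative that stays within tidy evaluation is rlang::parse_expr(), which turns a string into an expression that can then be unquoted with !!. This is a sketch rather than the accepted approach, and the function name `f2` is made up for the example. As with eval(parse()), the string is run as code, so it should only come from a trusted source.

```r
library(dplyr)
library(rlang)

f2 <- function(df, string_expression) {
  # Parse the string into a single R expression, e.g. wt + wt
  expr <- parse_expr(string_expression)
  # Unquote it so mutate() evaluates it against the columns of df
  mutate(df, answer = !!expr)
}

res <- f2(mtcars, "wt+wt")
head(res)
```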

use dplyr mutate to create new columns based on a vector of column names

I would like to take the log of some columns, and create new columns that are all named log[original column name].
The code below works, but how can I pass the vector called columnstolog into mutate? Thank you.
library(dplyr)
data(mtcars)
columnstolog <- c('mpg', 'cyl', 'disp', 'hp')
mtcars %>% mutate(logmpg = log(mpg))
mtcars %>% mutate(logcyl = log(cyl))
Use mutate_at, if you can bear with _log being appended to the original column names:
mtcars %>% mutate_at(columnstolog, funs(log = log(.)))
# mpg cyl disp hp drat wt qsec vs am gear carb mpg_log cyl_log disp_log hp_log
#1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 3.044522 1.791759 5.075174 4.700480
#2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 3.044522 1.791759 5.075174 4.700480
#3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 3.126761 1.386294 4.682131 4.532599
#4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 3.063391 1.791759 5.552960 4.700480
#5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 2.928524 2.079442 5.886104 5.164786
# ...
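In dplyr 1.0.0 and later, mutate_at() is superseded by across(), and funs() has been deprecated since dplyr 0.8.0. A sketch of the equivalent call is below; its .names argument also gives the log-prefixed names asked for in the question. The object name `logged` is illustrative.

```r
library(dplyr)

columnstolog <- c("mpg", "cyl", "disp", "hp")

# all_of() selects the columns named in the character vector;
# .names controls how the new columns are named
logged <- mtcars %>%
  mutate(across(all_of(columnstolog), log, .names = "log{.col}"))

head(logged)
```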
Ignoring the "use dplyr" part...
require(data.table)
mtcars <- as.data.table(mtcars)
mtcars[, paste0('log', columnstolog) := lapply(.SD, log), .SDcols = columnstolog]
You could also make use of rowwise from dplyr package:
mtcars %>%
rowwise %>%
mutate(logmpg = log(mpg),
logcyl = log(cyl))
# A tibble: 32 x 13
mpg cyl disp hp drat wt qsec vs am gear carb logmpg logcyl
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 3.044522 1.791759
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 3.044522 1.791759
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 3.126761 1.386294
4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 3.063391 1.791759
5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 2.928524 2.079442
6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 2.895912 1.791759
7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 2.660260 2.079442
8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 3.194583 1.386294
9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 3.126761 1.386294
10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 2.954910 1.791759
# ... with 22 more rows

Resources