With dplyr, when I use e.g. mutate, I can experiment with my data manipulations without assigning the result, and only assign once I like what I see. With data.table, however, it seems that performing an operation overwrites the original data frame, so if I change my mind I have to reload the data and start over. When I use the pipe, I often run the code up to the current step to check that everything is okay before adding the next one.
library(data.table, warn.conflicts = FALSE)
#> Warning: package 'data.table' was built under R version 3.6.1
library(dplyr, warn.conflicts = FALSE)
#> Warning: package 'dplyr' was built under R version 3.6.1
df <- as.data.table(mtcars)
# dplyr version
mtcars %>%
as_tibble() %>%
mutate(am = 2*am)
#> # A tibble: 32 x 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 2 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 2 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 2 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> # ... with 22 more rows
# Here I still have my original data frame mtcars.
df[, am := 2*am]
head(df)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> 1: 21.0 6 160 110 3.90 2.620 16.46 0 2 4 4
#> 2: 21.0 6 160 110 3.90 2.875 17.02 0 2 4 4
#> 3: 22.8 4 108 93 3.85 2.320 18.61 1 2 4 1
#> 4: 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
#> 5: 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
#> 6: 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
df[cyl ==6, am := 2*am]
head(df)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> 1: 21.0 6 160 110 3.90 2.620 16.46 0 4 4 4
#> 2: 21.0 6 160 110 3.90 2.875 17.02 0 4 4 4
#> 3: 22.8 4 108 93 3.85 2.320 18.61 1 2 4 1
#> 4: 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
#> 5: 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
#> 6: 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Created on 2019-07-11 by the reprex package (v0.3.0)
So here, if I just want to add a filter with data.table, I end up multiplying am by 2 a second time. Is this how data.table works? Is there a way not to overwrite the data frame? Or should I always make a copy when I am afraid of making a mistake?
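One way to keep experimenting, sketched here on the assumption that only data.table is loaded: `:=` updates a data.table by reference, whereas a `j` expression without `:=` returns a new object, and `copy()` gives an independent table that can be modified safely. The names `dt`, `preview` and `dt2` are made up for this illustration.

```r
library(data.table)

dt <- as.data.table(mtcars)

# Experiment without assignment: j without `:=` returns a new data.table,
# so dt itself is left untouched.
preview <- dt[, .(cyl, am, am2 = 2 * am)]

# When `:=` is wanted, work on an explicit copy instead of the original.
dt2 <- copy(dt)
dt2[cyl == 6, am := 2 * am]

# The original column is unchanged.
identical(dt$am, mtcars$am)
```

Only `dt2` carries the doubled `am` for the six-cylinder rows; `dt` can be reused for the next attempt without reloading the data.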
I have a DF called 'listings_df'. I need to fill in the NA values in the variables 'number_of_reviews' and 'review_scores_rating' by drawing random samples within each 'room_type' group.
I attach a picture of what the DF looks like:
First I tried grouping by 'room_type':
test <- listings_df %>% group_by(room_type)
Then I select the columns whose NA values I want to replace and create the samples:
test$number_of_reviews[is.na(listings_df$number_of_reviews)] <-
sample(listings_df$number_of_reviews, size = sum(is.na(listings_df$number_of_reviews)))
test$review_scores_rating[is.na(listings_df$review_scores_rating)] <-
sample(listings_df$review_scores_rating, size = sum(is.na(listings_df$review_scores_rating)))
I am not sure whether this creates the random data according to room_type. I would also like to know if it's possible to do this with a loop.
Thanks!
What you're asking for is called imputation. I'll demonstrate using mtcars as the data, cyl as the grouping variable (your room_type, I suspect), and other columns with NA values.
mt <- mtcars
set.seed(42)
mt$disp[sample(32,10)] <- NA
mt$hp[sample(32,10)] <- NA
head(mt)
# mpg cyl disp hp drat wt qsec vs am gear carb
# Mazda RX4 21.0 6 NA 110 3.90 2.620 16.46 0 1 4 4
# Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
# Datsun 710 22.8 4 108 NA 3.85 2.320 18.61 1 1 4 1
# Hornet 4 Drive 21.4 6 NA NA 3.08 3.215 19.44 1 0 3 1
# Hornet Sportabout 18.7 8 NA NA 3.15 3.440 17.02 0 0 3 2
# Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
From here:
dplyr
library(dplyr)
set.seed(42)
mt %>%
group_by(cyl) %>%
mutate(across(c(disp, hp), ~ coalesce(., sample(na.omit(.), size=n(), replace=TRUE)))) %>%
ungroup()
# # A tibble: 32 × 11
# mpg cyl disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
# 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
# 3 22.8 4 108 91 3.85 2.32 18.6 1 1 4 1
# 4 21.4 6 145 110 3.08 3.22 19.4 1 0 3 1
# 5 18.7 8 301 180 3.15 3.44 17.0 0 0 3 2
# 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
# 7 14.3 8 301 245 3.21 3.57 15.8 0 0 3 4
# 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
# 9 22.8 4 141. 52 3.92 3.15 22.9 1 0 4 2
# 10 19.2 6 225 123 3.92 3.44 18.3 1 0 4 4
# # … with 22 more rows
data.table
library(data.table)
cols <- c("disp", "hp")
set.seed(42)
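# Note: as written there is no `by = cyl`, so NAs are filled from the whole column rather than within each cyl group.
as.data.table(mt)[, c(cols) := lapply(.SD, \(z) fcoalesce(z, sample(na.omit(z), size=.N, replace=TRUE))), .SDcols = cols][] |>
head()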
# mpg cyl disp hp drat wt qsec vs am gear carb
# <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
# 1: 21.0 6 79.0 110 3.90 2.620 16.46 0 1 4 4
# 2: 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
# 3: 22.8 4 108.0 113 3.85 2.320 18.61 1 1 4 1
# 4: 21.4 6 460.0 97 3.08 3.215 19.44 1 0 3 1
# 5: 18.7 8 146.7 62 3.15 3.440 17.02 0 0 3 2
# 6: 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
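The data.table call above has no `by`, which is why a six-cylinder car can end up with a displacement taken from another group. A grouped variant might look like the sketch below. The helper `impute_from_group` is a hypothetical name introduced here, and it samples by index so that a group with a single non-missing value is not misread by `sample()` as a range.

```r
library(data.table)

# Rebuild the example data with missing values (same construction as above).
mt <- mtcars
set.seed(42)
mt$disp[sample(32, 10)] <- NA
mt$hp[sample(32, 10)] <- NA

cols <- c("disp", "hp")
dt <- as.data.table(mt)

# Hypothetical helper: replace NAs in z with random draws (with replacement)
# from the non-missing values of z. Assumes z has at least one non-NA value.
impute_from_group <- function(z) {
  pool <- z[!is.na(z)]
  fill <- pool[sample.int(length(pool), length(z), replace = TRUE)]
  fcoalesce(z, fill)
}

# `by = cyl` makes .SD the rows of one cyl group at a time.
set.seed(42)
dt[, (cols) := lapply(.SD, impute_from_group), by = cyl, .SDcols = cols]
head(dt)
```

The imputed values will differ from the outputs shown in the other dialects, because the order of random draws is not the same.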
base R
set.seed(42)
mt[cols] <- lapply(mt[cols], \(z) ave(z, mt$cyl, FUN = \(z) ifelse(is.na(z), sample(na.omit(z), size=length(z), replace=TRUE), z)))
head(mt)
# mpg cyl disp hp drat wt qsec vs am gear carb
# Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
# Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
# Datsun 710 22.8 4 108 91 3.85 2.320 18.61 1 1 4 1
# Hornet 4 Drive 21.4 6 145 110 3.08 3.215 19.44 1 0 3 1
# Hornet Sportabout 18.7 8 301 180 3.15 3.440 17.02 0 0 3 2
# Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Note: while I set the random seed for imputation in each dialect, there is no expectation that the order of columns and fixes will be the same between the dialects. For this reason we see that the replacement for the NA values is not the same between the dialects of code; the seed is provided for basic reproducibility, not identical results.
Using the dataframe mtcars I would like to add the column qsec_control which is calculated as the mean(qsec) of all rows that don't have the same cyl as the current row (e.g. if cyl == 6, it would take mean(qsec[cyl != 6])).
The question feels somewhat dumb, but I can't figure out how to do this.
This solution groups by cyl, then uses dplyr::cur_group_rows() to index into mtcars$qsec:
library(dplyr)
mtcars %>%
group_by(cyl) %>%
mutate(qsec_control = mean(
mtcars$qsec[-cur_group_rows()]
)) %>%
ungroup()
# A tibble: 32 × 12
mpg cyl disp hp drat wt qsec vs am gear carb qsec_cont…¹
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 17.8
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 17.8
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 17.2
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 17.8
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 18.7
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 17.8
7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 18.7
8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 17.2
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 17.2
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 17.8
# … with 22 more rows, and abbreviated variable name ¹qsec_control
Replicating zephryl's answer in data.table:
library(data.table)
data(mtcars)
setDT(mtcars)
mtcars[, qsec_control := mtcars[-.I, mean(qsec)] , by = .(cyl)]
head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb cyl2 qsec_control
1: 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 6 17.81280
2: 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 6 17.81280
3: 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 4 17.17381
4: 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 6 17.81280
5: 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 8 18.68611
6: 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 6 17.81280
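The same quantity can also be obtained from sums alone, since the mean over all rows outside a group equals (total sum minus group sum) divided by (total count minus group count). A small base R sketch on a fresh copy of `datasets::mtcars`; the names `df`, `grp_sum` and `grp_n` are illustrative only.

```r
# Fresh copy, independent of any earlier setDT() call on mtcars.
df <- datasets::mtcars

total_sum <- sum(df$qsec)
total_n   <- nrow(df)

# Per-row group totals: ave() repeats the group statistic for every row of the group.
grp_sum <- ave(df$qsec, df$cyl, FUN = sum)
grp_n   <- ave(df$qsec, df$cyl, FUN = length)

# Mean of qsec over all rows whose cyl differs from the current row's cyl.
df$qsec_control <- (total_sum - grp_sum) / (total_n - grp_n)
head(df[, c("cyl", "qsec", "qsec_control")])
```

This avoids indexing back into the full table for each group, at the cost of being specific to means.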
Using the mtcars dataset, I have created a cross table as follows -
tab = with(mtcars, ftable(gear, cyl))
tab
Here is how it looks -
cyl 4 6 8
gear
3 1 2 12
4 8 4 0
5 2 1 2
For this crosstable, I have calculated the row-wise probability
tab_prob = tab %>% prop.table(1) %>% round(4) * 100
tab_prob
cyl 4 6 8
gear
3 6.67 13.33 80.00
4 66.67 33.33 0.00
5 40.00 20.00 40.00
I want to add two columns to the original mtcars dataset
Column 1 cyl_exp - Fill in the expected outcome based on cross-table. For example, in mtcars dataset, if the number of gears is 3, this new column (refer to the tab cross table) should have the value 8, since there is 80% probability that if the number of gears is 3, then cyl should be 8.
Column 2 cyl_prob - Write the probability from table tab_prob in this column based on the value in cyl_exp column.
Here is the expected outcome -
head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb cyl_prob cyl_exp
1: 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 66.67 4
2: 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 66.67 4
3: 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 66.67 4
4: 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 80.00 8
5: 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 80.00 8
6: 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 80.00 8
Is there an easy way to accomplish this?
Thanks!
With data.table, I would do it this way:
library(data.table)
mtcars <- as.data.table(mtcars, keep.rownames = TRUE)
tab <- mtcars[, .N, by = .(gear, cyl)]
tab[, prob := N/sum(N), by = .(gear)]
tab <- tab[order(-prob, cyl)][!duplicated(gear)]
mtcars[tab, `:=`(cyl_exp = i.cyl, cyl_prob = i.prob), on = .(gear)]
# > head(mtcars)
# rn mpg cyl disp hp drat wt qsec vs am gear carb cyl_exp cyl_prob
# 1: Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 4 0.6666667
# 2: Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 4 0.6666667
# 3: Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 4 0.6666667
# 4: Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 8 0.8000000
# 5: Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 8 0.8000000
# 6: Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 8 0.8000000
Here's a way to do this in dplyr:
library(dplyr)
mtcars %>%
count(cyl_exp = cyl, gear, name = 'cyl_prob') %>%
group_by(gear) %>%
mutate(cyl_prob = prop.table(cyl_prob) * 100) %>%
slice(which.max(cyl_prob)) %>%
inner_join(mtcars, by = 'gear')
# cyl_exp gear cyl_prob mpg cyl disp hp drat wt qsec vs am carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 8 3 80 21.4 6 258 110 3.08 3.22 19.4 1 0 1
# 2 8 3 80 18.7 8 360 175 3.15 3.44 17.0 0 0 2
# 3 8 3 80 18.1 6 225 105 2.76 3.46 20.2 1 0 1
# 4 8 3 80 14.3 8 360 245 3.21 3.57 15.8 0 0 4
# 5 8 3 80 16.4 8 276. 180 3.07 4.07 17.4 0 0 3
# 6 8 3 80 17.3 8 276. 180 3.07 3.73 17.6 0 0 3
# 7 8 3 80 15.2 8 276. 180 3.07 3.78 18 0 0 3
# 8 8 3 80 10.4 8 472 205 2.93 5.25 18.0 0 0 4
# 9 8 3 80 10.4 8 460 215 3 5.42 17.8 0 0 4
#10 8 3 80 14.7 8 440 230 3.23 5.34 17.4 0 0 4
# … with 22 more rows
I have kept the data in long format so that it is easier to join. The first part of the answer creates the cross table.
mtcars %>%
count(cyl_exp = cyl, gear, name = 'cyl_prob') %>%
group_by(gear) %>%
mutate(cyl_prob = prop.table(cyl_prob) * 100)
# cyl_exp gear cyl_prob
# <dbl> <dbl> <dbl>
#1 4 3 6.67
#2 4 4 66.7
#3 4 5 40
#4 6 3 13.3
#5 6 4 33.3
#6 6 5 20
#7 8 3 80
#8 8 5 40
From here we keep only the row with highest probability for each gear and join the data.
I used a regular table and prop.table instead of ftable. I propose the following solution:
library(dplyr)
tab <- table(mtcars$gear, mtcars$cyl)
tab_prob <- round(prop.table(tab, margin = 1) * 100, 2)
# most frequent cyl for a given gear value
exp_cyl <- function(x) {
  as.numeric(names(which.max(tab[toString(x), ])))
}
# row-wise probability (in percent) of that cyl
prob_cyl <- function(x) {
  max(tab_prob[toString(x), ])
}
df <- mtcars
df %>% mutate(cyl_prob = sapply(gear, prob_cyl), cyl_exp = sapply(gear, exp_cyl))
Output :
mpg cyl disp hp drat wt qsec vs am gear carb cyl_prob cyl_exp
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 66.67 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 66.67 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 66.67 4
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 80.00 8
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 80.00 8
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 80.00 8
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 80.00 8
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 66.67 4
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 66.67 4
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 66.67 4
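The per-row sapply calls can be replaced by a small lookup table built once per gear value and matched back onto the data. This is only a sketch in base R on `datasets::mtcars`; the names `best`, `lookup` and `idx` are invented for the example.

```r
df <- datasets::mtcars

tab      <- table(gear = df$gear, cyl = df$cyl)
tab_prob <- round(prop.table(tab, margin = 1) * 100, 2)

# For each gear (row), the column position of the most likely cyl.
best <- apply(tab_prob, 1, which.max)

# One row per gear: most likely cyl and its row-wise probability in percent.
lookup <- data.frame(
  gear     = as.numeric(rownames(tab_prob)),
  cyl_exp  = as.numeric(colnames(tab_prob))[best],
  cyl_prob = as.numeric(apply(tab_prob, 1, max))
)

# match() keeps the original row order of df.
idx <- match(df$gear, lookup$gear)
df$cyl_exp  <- lookup$cyl_exp[idx]
df$cyl_prob <- lookup$cyl_prob[idx]
head(df)
```

Note that `which.max` returns the first maximum, so for gear 5, where cyl 4 and cyl 8 are tied at 40%, this sketch picks cyl 4.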
I am trying to create a duplicate column using tidyeval. In each loop the name of the column to duplicate varies and is obtained using a regular expression. For example,
library(tidyverse)
a <- str_subset(string = names(mtcars), pattern = "^a")
a
# am
to get the column to be duplicated.
Then I have no idea how to use the string here to duplicate the column (to a new column a2). Tried various combinations from the code below, but struggling to get my head around tidy evaluations.
# a <- enquo(a)
mtcars %>%
as_tibble() %>%
mutate(a2 := {{a}})
# mutate(a2 := !!a)
# mutate(a2 := vars(!!!a))
# # A tibble: 32 x 12
# mpg cyl disp hp drat wt qsec vs am gear carb am2
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
# 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 am
# 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 am
# 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 am
# 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 am
# 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 am
# 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 am
# 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 am
# 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 am
# 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 am
# 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 am
(I am looking for am2 to be a copy of am here, so 1 and 0 in each row, not "am")
If only one column is selected, e.g. am:
a <- "am"
mtcars %>%
mutate("{a}2" := !!sym(a))
# mpg cyl disp hp drat wt qsec vs am gear carb am2
# 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 1
# 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 1
# 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 1
# 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 0
# 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 0
# 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 0
# 7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 0
# 8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 0
If more than one column is selected, e.g. mpg and cyl, you can use the .names argument in across().
a <- c("mpg", "cyl")
mtcars %>%
mutate(across(all_of(a), ~ ., .names = "{col}2"))
# mpg cyl disp hp drat wt qsec vs am gear carb mpg2 cyl2
# 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 21.0 6
# 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 21.0 6
# 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 22.8 4
# 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 21.4 6
# 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 18.7 8
# 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 18.1 6
# 7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 14.3 8
# 8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 24.4 4
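For the single-column case, the `.data` pronoun is another option that avoids `sym()` and `!!` altogether. A minimal sketch, assuming a dplyr version that supports glue-style names on the left of `:=`; `res` is just a name chosen for the result.

```r
library(dplyr)

a <- "am"

# .data[[a]] looks up the column whose name is stored in the string a;
# "{a}2" builds the new column name ("am2") from the same string.
res <- mtcars %>%
  mutate("{a}2" := .data[[a]])

head(res)
```

Because `a` is an ordinary character string, this works unchanged inside a loop over column names.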
I would like to take the log of some columns, and create new columns that are all named log[original column name].
The code below works, but how can I pass the vector called columnstolog into mutate? Thank you.
library(dplyr)
data(mtcars)
columnstolog <- c('mpg', 'cyl', 'disp', 'hp')
mtcars %>% mutate(logmpg = log(mpg))
mtcars %>% mutate(logcyl = log(cyl))
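Use mutate_at, if you can bear with _log being appended to the original column names (note that funs() is deprecated as of dplyr 0.8.0, and mutate_at() is superseded by across() as of dplyr 1.0.0):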
mtcars %>% mutate_at(columnstolog, funs(log = log(.)))
# mpg cyl disp hp drat wt qsec vs am gear carb mpg_log cyl_log disp_log hp_log
#1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 3.044522 1.791759 5.075174 4.700480
#2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 3.044522 1.791759 5.075174 4.700480
#3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 3.126761 1.386294 4.682131 4.532599
#4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 3.063391 1.791759 5.552960 4.700480
#5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 2.928524 2.079442 5.886104 5.164786
# ...
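With dplyr 1.0.0 or later, roughly the same thing can be written with across(), and the .names argument allows the log prefix asked for in the question. This is a sketch rather than a drop-in replacement for the call above; `res` is an illustrative name.

```r
library(dplyr)

columnstolog <- c("mpg", "cyl", "disp", "hp")

# all_of() selects the columns named in the character vector;
# .names = "log{.col}" produces logmpg, logcyl, logdisp, loghp.
res <- mtcars %>%
  mutate(across(all_of(columnstolog), log, .names = "log{.col}"))

head(res)
```

Changing the template to "{.col}_log" would reproduce the suffix naming shown in the mutate_at output.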
Ignoring the "use dplyr" part...
require(data.table)
mtcars <- as.data.table(mtcars)
mtcars[, paste0('log', columnstolog) := lapply(.SD, log), .SDcols = columnstolog]
You could also make use of rowwise from the dplyr package (although log() is vectorised, so a plain mutate gives the same values):
mtcars %>%
rowwise %>%
mutate(logmpg = log(mpg),
logcyl = log(cyl))
# A tibble: 32 x 13
mpg cyl disp hp drat wt qsec vs am gear carb logmpg logcyl
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 3.044522 1.791759
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 3.044522 1.791759
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 3.126761 1.386294
4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 3.063391 1.791759
5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 2.928524 2.079442
6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 2.895912 1.791759
7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 2.660260 2.079442
8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 3.194583 1.386294
9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 3.126761 1.386294
10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 2.954910 1.791759
# ... with 22 more rows