Best practice to get a dropped column in dplyr tbl_df

I remember a comment on r-help in 2001 saying that drop = TRUE in [.data.frame was the worst design decision in R history.
dplyr corrects that and does not drop implicitly. When converting old code to dplyr style, this introduces some nasty bugs wherever d[, 1] or d[1] is assumed to be a vector.
My current workaround uses unlist, as shown below, to obtain the column as a vector. Any better ideas?
library(dplyr)
d2 = data.frame(x = 1:5, y = (1:5) ^ 2)
str(d2[,1]) # implicit drop = TRUE
# int [1:5] 1 2 3 4 5
str(d2[,1, drop = FALSE])
# 'data.frame': 5 obs. of 1 variable:
# $ x: int 1 2 3 4 5
# With dplyr functions
d1 = data_frame(x = 1:5, y = x ^ 2)
str(d1[,1])
# Classes ‘tbl_df’ and 'data.frame': 5 obs. of 1 variable:
# $ x: int 1 2 3 4 5
str(unlist(d1[,1]))
# This ugly construct gives the same as str(d2[,1])
str(d1[,1][[1]])

You can just use the [[ extract function instead of [.
d1[[1]]
## [1] 1 2 3 4 5
If you use a lot of piping with dplyr, you may also want to use the convenience functions extract and extract2 from the magrittr package:
d1 %>% magrittr::extract(1) %>% str
## Classes ‘tbl_df’ and 'data.frame': 5 obs. of 1 variable:
## $ x: int 1 2 3 4 5
d1 %>% magrittr::extract2(1) %>% str
## int [1:5] 1 2 3 4 5
Or if extract is too verbose for you, you can just use [ directly in the pipe:
d1 %>% `[`(1) %>% str
## Classes ‘tbl_df’ and 'data.frame': 5 obs. of 1 variable:
## $ x: int 1 2 3 4 5
d1 %>% `[[`(1) %>% str
## int [1:5] 1 2 3 4 5
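On newer dplyr versions (my assumption: 0.7.0 or later, where pull() exists), pull() is another tidy way to drop down to a vector, by position or by name:
d1 %>% pull(1) %>% str
## int [1:5] 1 2 3 4 5
d1 %>% pull(x) %>% str
## int [1:5] 1 2 3 4 5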

Related

recode/replace multiple values in a shared data column to a single value across data frames

I hope I haven't missed it, but I haven't been able to find a working solution to this problem.
I have a set of data frames with a shared column. These columns contain multiple and varying transcription errors (some shared across data frames, some not) for several values.
I would like to replace/recode the transcription errors (bad_values) with the correct values (good_values) across all data frames.
I have tried nesting the map*() family of functions across lists of data frames, bad_values, and good_values to do this, among other things. Here is an example:
df1 = data.frame(grp = c("a1","a.","a.",rep("b",7)), measure = rnorm(10))
df2 = data.frame(grp = c(rep("as", 3), "b2",rep("a",22)), measure = rnorm(26))
df3 = data.frame(grp = c(rep("b-",3),rep("bq",2),"a", rep("a.", 3)), measure = 1:9)
df_list = list(df1, df2, df3)
bad_values = list(c("a1","a.","as"), c("b2","b-","bq"))
good_values = list("a", "b")
dfs = map(df_list, function(x) {
  x %>% mutate(grp = plyr::mapvalues(grp, bad_values, rep(good_values, length(bad_values))))
})
I didn't necessarily expect this to work beyond a single good/bad value pair. However, I thought nesting another call to map*() within it might work:
dfs = map(df_list, function(x) {
  x %>% mutate(grp = map2(bad_values, good_values, function(x, y) {
    recode(grp, bad_values = good_values)
  }))
})
I have tried a number of other approaches, none of which have worked.
Ultimately, I would like to go from a set of data frames with errors, as here:
[[1]]
grp measure
1 a1 0.5582253
2 a. 0.3400904
3 a. -0.2200824
4 b -0.7287385
5 b -0.2128275
6 b 1.9030766
[[2]]
grp measure
1 as 1.6148772
2 as 0.1090853
3 as -1.3714180
4 b2 -0.1606979
5 a 1.1726395
6 a -0.3201150
[[3]]
grp measure
1 b- 1
2 b- 2
3 b- 3
4 bq 4
5 bq 5
6 a 6
To a list of 'fixed' data frames, as such:
[[1]]
grp measure
1 a -0.7671052
2 a 0.1781247
3 a -0.7565773
4 b -0.3606900
5 b 1.9264804
6 b 0.9506608
[[2]]
grp measure
1 a 1.45036125
2 a -2.16715639
3 a 0.80105611
4 b 0.24216723
5 a 1.33089426
6 a -0.08388404
[[3]]
grp measure
1 b 1
2 b 2
3 b 3
4 b 4
5 b 5
6 a 6
Any help would be very much appreciated.
Here is an option using tidyverse with recode_factor. When there are multiple elements to be changed, create a named key/value list and splice it into recode_factor to match and change the values to the new levels:
library(tidyverse)
keyval <- setNames(rep(good_values, lengths(bad_values)), unlist(bad_values))
out <- map(df_list, ~ .x %>%
  mutate(grp = recode_factor(grp, !!! keyval)))
Output:
out
#[[1]]
# grp measure
#1 a -1.63295876
#2 a 0.03859976
#3 a -0.46541610
#4 b -0.72356671
#5 b -1.11552841
#6 b 0.99352861
#....
#[[2]]
# grp measure
#1 a 1.26536789
#2 a -0.48189740
#3 a 0.23041056
#4 b -1.01324689
#5 a -1.41586086
#6 a 0.59026463
#....
#[[3]]
# grp measure
#1 b 1
#2 b 2
#3 b 3
#4 b 4
#5 b 5
#6 a 6
#....
NOTE: This doesn't change the class of the initial dataset column
str(out)
#List of 3
# $ :'data.frame': 10 obs. of 2 variables:
# ..$ grp : Factor w/ 2 levels "a","b": 1 1 1 2 2 2 2 2 2 2
# ..$ measure: num [1:10] -1.633 0.0386 -0.4654 -0.7236 -1.1155 ...
# $ :'data.frame': 26 obs. of 2 variables:
# ..$ grp : Factor w/ 2 levels "a","b": 1 1 1 2 1 1 1 1 1 1 ...
# ..$ measure: num [1:26] 1.265 -0.482 0.23 -1.013 -1.416 ...
# $ :'data.frame': 9 obs. of 2 variables:
# ..$ grp : Factor w/ 2 levels "a","b": 2 2 2 2 2 1 1 1 1
# ..$ measure: int [1:9] 1 2 3 4 5 6 7 8 9
Once we have the keyval pair list, it can also be used with base R functions:
out1 <- lapply(df_list, transform, grp = unlist(keyval[grp]))
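A slightly more defensive base R variant, sketched under the assumption that grp may be a factor and that grp values not present in keyval should be left unchanged:
out1 <- lapply(df_list, function(x) {
  g <- as.character(x$grp)          # work on character values, not factor codes
  hit <- g %in% names(keyval)       # only touch values that have a replacement
  g[hit] <- unlist(keyval[g[hit]])
  transform(x, grp = g)
})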
Any reason mapping a case_when statement wouldn't work?
library(tidyverse)
df_list %>%
  map(~ mutate_if(.x, is.factor, as.character)) %>% # convert factor to character
  map(~ mutate(.x, grp = case_when(grp %in% bad_values[[1]] ~ good_values[[1]],
                                   grp %in% bad_values[[2]] ~ good_values[[2]],
                                   TRUE ~ grp)))
I could see it working for your reprex, but possibly not for the larger problem.
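For the larger problem, one way to avoid hard-coding bad_values[[1]], bad_values[[2]], ... is to build a single named lookup vector once; this is only a sketch, and it is essentially the same keyval idea as above:
library(dplyr)
library(purrr)
lookup <- set_names(rep(unlist(good_values), lengths(bad_values)), unlist(bad_values))
df_list %>%
  map(~ mutate(.x, grp = coalesce(unname(lookup[as.character(grp)]), as.character(grp))))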
A base R option if you have a lot of good_values and bad_values and it is not practical to list each one individually:
lapply(df_list, function(x) {
  vec = x[['grp']]
  mapply(function(p, q) vec[vec %in% p] <<- q, bad_values, good_values)
  transform(x, grp = vec)
})
#[[1]]
# grp measure
#1 a -0.648146527
#2 a -0.004722549
#3 a -0.943451194
#4 b -0.709509396
#5 b -0.719434286
#....
#[[2]]
# grp measure
#1 a 1.03131291
#2 a -0.85558910
#3 a -0.05933911
#4 b 0.67812934
#5 a 3.23854093
#6 a 1.31688645
#7 a 1.87464048
#8 a 0.90100179
#....
#[[3]]
# grp measure
#1 b 1
#2 b 2
#3 b 3
#4 b 4
#5 b 5
#....
Here, for every list element we extract its grp column, replace any bad_values found with the corresponding good_values, and return the corrected data frame.

Changing Data types of dataframe columns based on template with matching columns in R

I have 2 data frames:
template - I will be using the data types from this data frame.
df - I want to change the data types of this data frame based on template.
That is, I want to change the data types of the second data frame based on the first. Let's suppose I have the data frame below, which I am using as the template.
> template
id <- c(1,2,3,4)
a <- c(1,4,5,6)
b <- as.character(c(0,1,1,4))
c <- as.character(c(0,1,1,0))
d <- c(0,1,1,0)
template <- data.frame(id,a,b,c,d, stringsAsFactors = FALSE)
> str(template)
'data.frame': 4 obs. of 5 variables:
$ id: num 1 2 3 4
$ a : num 1 4 5 6
$ b : chr "0" "1" "1" "4"
$ c : chr "0" "1" "1" "0"
$ d : num 0 1 1 0
I am looking for the following:
The data types of df's columns should be cast to exactly match those in template.
The output should have the same columns as template.
Note: columns from template that are not available in df should be added, filled with NA.
> df
id <- c(6,7,12,14,1,3,4,4)
a <- c(0,1,13,1,3,4,5,6)
b <- c(1,4,12,3,4,5,6,7)
c <- c(0,0,13,3,4,45,6,7)
e <- c(0,0,13,3,4,45,6,7)
df <- data.frame(id,a,b,c,e)
> str(df)
'data.frame': 8 obs. of 5 variables:
$ id: num 6 7 12 14 1 3 4 4
$ a : num 0 1 13 1 3 4 5 6
$ b : num 1 4 12 3 4 5 6 7
$ c : num 0 0 13 3 4 45 6 7
$ e : num 0 0 13 3 4 45 6 7
Desired output:
> output
id a b c d
1 6 0 1 0 NA
2 7 1 4 0 NA
3 12 13 12 13 NA
4 14 1 3 3 NA
5 1 3 4 4 NA
6 3 4 5 45 NA
7 4 5 6 6 NA
8 4 6 7 7 NA
> str(output)
'data.frame': 8 obs. of 5 variables:
$ id: num 6 7 12 14 1 3 4 4
$ a : num 0 1 13 1 3 4 5 6
$ b : chr "1" "4" "12" "3" ...
$ c : chr "0" "0" "13" "3" ...
$ d : logi NA NA NA NA NA NA ...
My attempts:
template <- fread("template.csv", header = TRUE, stringsAsFactors = FALSE)
n <- names(template)
template[,(n) := lapply(.SD,function(x) gsub("[^A-Za-z0-90 _/.-]","", as.character(x)))]
n <- names(df)
df[,(n) := lapply(.SD,function(x) gsub("[^A-Za-z0-90 _/.-]","", as.character(x)))]
output <- rbindlist(list(template,df),use.names = TRUE,fill = TRUE,idcol="template")
After this I write the output data frame with write.csv and then re-read it to get the data types. But I am messing up the data types. Please suggest an appropriate way to deal with this.
I'd do
res = data.frame(
  lapply(setNames(, names(template)), function(x)
    if (x %in% names(df)) as(df[[x]], class(template[[x]]))
    else template[[x]][NA_integer_]
  ), stringsAsFactors = FALSE)
or with magrittr
library(magrittr)
setNames(, names(template)) %>%
  lapply(. %>% {
    if (. %in% names(df)) as(df[[.]], class(template[[.]]))
    else template[[.]][NA_integer_]
  }) %>% data.frame(stringsAsFactors = FALSE)
verifying ...
'data.frame': 8 obs. of 5 variables:
$ id: num 6 7 12 14 1 3 4 4
$ a : num 0 1 13 1 3 4 5 6
$ b : chr "1" "4" "12" "3" ...
$ c : chr "0" "0" "13" "3" ...
$ d : num NA NA NA NA NA NA NA NA
I'd suggest looking at the vetr package if you're going to be doing a lot of stuff like this. It has a good approach to templates for data frames and their columns.
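As a quick illustration of the template idea (an assumption on my part that the vetr package is installed), its alike() function compares an object against a template and reports the first structural mismatch, or TRUE if everything conforms:
library(vetr)
alike(template, df)   # TRUE if df's structure/types match the template, otherwise a description of the mismatch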
Here's some code that does what you want.
require(tidyverse)
new_types <-
  map_df(template, class) %>%
  t %>%
  as.data.frame(stringsAsFactors = F) %>%
  rownames_to_column %>%
  setNames(c('col', 'type'))
new_data <- df %>%
  gather(col, value) %>%
  right_join(new_types, by = 'col') %>%
  group_by(col) %>%
  mutate(rownum = row_number()) %>%
  ungroup %>%
  complete(col, rownum = 1:max(rownum)) %>%
  group_by(col) %>%
  summarize(val = list(value), type = first(type)) %>%
  mutate(new_val = map2(val, type, ~as(.x, .y, strict = T))) %>%
  select(col, new_val) %>%
  spread(col, new_val) %>%
  unnest
The main idea here is to use map2() from the purrr package to apply the as() function from base R. This function takes in an object (e.g. a vector or column from a dataframe) and a character string that describes a new type, and returns the coerced object. This is the core capability that you need.
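For instance, on plain vectors (a minimal illustration; as() lives in the methods package, which is attached by default):
as(c(1, 2, 3), "character")
## [1] "1" "2" "3"
as(c("0", "1", "4"), "numeric")
## [1] 0 1 4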
My new_types data frame just lists the column names of the template and the (character-string) name of their type.
Except for the map2() line, everything else is mucky data wrangling that could probably be improved.
Some key features:
right_join here is essential to keep only the columns you want.
the lines from mutate(rownum = row_number()) to complete(col, rownum = 1:max(rownum)) are necessary only when template has columns that aren't in df; they make sure that the number of NAs produced for those columns matches the number of rows in the other columns.

rename a matrix column which has no initial name with dplyr

I'm trying to rename the columns of a matrix that has no column names, using dplyr:
set.seed(1234)
v1 <- table(round(runif(50,0,10)))
v2 <- table(round(runif(50,0,10)))
library(dplyr)
bind_rows(v1,v2) %>%
t
[,1] [,2]
0 3 4
1 1 9
2 8 6
3 11 7
5 7 8
6 7 1
7 3 4
8 6 3
9 3 6
10 1 NA
4 NA 2
I usually use rename for that, with the form rename(new_name = old_name); however, because there is no old_name, it doesn't work. I've tried:
rename("v1","v2")
rename(c("v1","v2")
rename(v1=1, v2=2)
rename(v1=[,1],v2=[,v2])
rename(v1="[,1]",v2="[,v2]")
rename_(.dots = c("v1","v2"))
setNames(c("v1","v2"))
None of these works.
I know the base R way to do it (colnames(obj) <- c("v1","v2")) but I'm specifically looking for a dplyr way to do it.
This one with magrittr:
library(dplyr)
bind_rows(v1, v2) %>%
  t %>%
  magrittr::set_colnames(c("new1", "new2"))
In order to use rename you need some sort of list-like structure (such as a data frame or a tibble). So you can do one of two things: either convert to a tibble and use rename, or use colnames and leave the structure as is, i.e.
new_d <- bind_rows(v1, v2) %>%
  t() %>%
  as.tibble() %>%
  rename('A' = 'V1', 'B' = 'V2')
#where
str(new_d)
#Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 11 obs. of 2 variables:
# $ A: int 3 1 8 11 7 7 3 6 3 1 ...
# $ B: int 4 9 6 7 8 1 4 3 6 NA ...
Or
new_d1 <- bind_rows(v1, v2) %>%
  t() %>%
  `colnames<-`(c('A', 'B'))
#where
str(new_d1)
# int [1:11, 1:2] 3 1 8 11 7 7 3 6 3 1 ...
# - attr(*, "dimnames")=List of 2
# ..$ : chr [1:11] "0" "1" "2" "3" ...
# ..$ : chr [1:2] "A" "B"

Function in R that creates dummy variables if a condition is met

I am looking to create a function that will convert any factor variable with more than 4 levels into a dummy variable. The dataset has ~2311 columns, so I would really need to create a function. Your help would be immensely appreciated.
I have compiled the code below and was hoping to get it to work.
library(dummies)
# example function
for (i in names(Final_Dataset)) {
  if (count(Final_Dataset[i]) > 4) {
    y <- Final_Dataset[i]
    Final_Dataset <- cbind(Final_Dataset, dummy(y, sep = "_"))
  }
}
I was also considering an alternative approach: get the indices of all the columns that need to be dummied, then loop through the columns and, if a column's index is in that set, create dummy variables for it.
Example data
fct = data.frame(a = as.factor(letters[1:10]), b = 1:10, c = as.factor(sample(letters[1:4], 10, replace = T)), d = as.factor(letters[10:19]))
str(fct)
'data.frame': 10 obs. of 4 variables:
$ a: Factor w/ 10 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10
$ b: int 1 2 3 4 5 6 7 8 9 10
$ c: Factor w/ 4 levels "a","b","c","d": 2 4 1 3 1 1 2 3 1 2
$ d: Factor w/ 10 levels "j","k","l","m",..: 1 2 3 4 5 6 7 8 9 10
# identify factor columns with more than 4 levels
fact_cols = sapply(fct, function(x) is.factor(x) && length(levels(x)) > 4)
# create dummy variables for subset (omit intercept)
dummy_cols = model.matrix(~. -1, fct[, fact_cols])
# cbind new data
out_df = cbind(fct[, !fact_cols], dummy_cols)
You could get all the columns with more than a given number of levels (n = 4) with something like
which(sapply(Final_Dataset, function (c) length(levels(c)) > n))
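A minimal sketch of that loop-based idea, assuming model.matrix() for the dummy coding rather than the dummies package, and factor columns without missing values:
n <- 4
idx <- which(sapply(Final_Dataset, function(col) is.factor(col) && length(levels(col)) > n))
for (i in idx) {
  d <- model.matrix(~ Final_Dataset[[i]] - 1)   # one indicator column per level
  colnames(d) <- paste(names(Final_Dataset)[i], levels(Final_Dataset[[i]]), sep = "_")
  Final_Dataset <- cbind(Final_Dataset, d)      # columns are appended, so original positions stay valid
}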

Replicate same data.frame n times and save them into a list

I just need to replicate my data.frame n times (e.g. 100) and save all the outputs into a list.
It should be quite easy and straightforward, but I have not been able to find a solution yet.
Fake data.frame:
df = read.table(text = 'a b
1 2
5 6
4 4
11 78
23 99', header = TRUE)
With lapply:
df_list <- lapply(1:100, function(x) df)
We can use replicate
n <- 100
lst <- replicate(n, df, simplify = FALSE)
You can use rep if you wrap the data frame in list(), as rep tries to return the same type of object you pass it:
df_list <- rep(list(df), 100)
str(df_list[1:2])
#> List of 2
#> $ :'data.frame': 5 obs. of 2 variables:
#> ..$ a: int [1:5] 1 5 4 11 23
#> ..$ b: int [1:5] 2 6 4 78 99
#> $ :'data.frame': 5 obs. of 2 variables:
#> ..$ a: int [1:5] 1 5 4 11 23
#> ..$ b: int [1:5] 2 6 4 78 99
