Assigning same random value to all of one variable?

Using mtcars as an example, I am trying to create a new column that assigns the same random value to all rows sharing the same value of cyl.
I tried:
mtcars$cyl <- as.factor(mtcars$cyl)
mtcars %>%
group_by(cyl) %>%
mutate(rand = sample(c("A", "B"), replace = TRUE))
However, the length seems to be wrong, and I'm not sure whether this would just assign a random A or B to each row instead of the same random A or B to every row with the same cyl. Any insight? Should I be writing a for loop over each unique(cyl)?

You need to set size to 1 in sample so that a single value is drawn per group; mutate then recycles it to every row with the same cyl.
library(dplyr)
set.seed(567)
mtcars %>% group_by(cyl) %>% mutate(rand = sample(c("A", "B"), 1))
# mpg cyl disp hp drat wt qsec vs am gear carb rand
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
# 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 B
# 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 B
# 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 A
# 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 B
# 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 A
# 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 B
# 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 A
# 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 A
# 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 A
#10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 B
# … with 22 more rows

Given your clarification, I think an easy solution is to use a merge. First generate a data frame that associates each cyl with a random value, then merge on cyl (df below holds the mtcars data):
dfrand <- data.frame(
rand = sample(c("A","B"), size = length(unique(df$cyl)), replace = TRUE),
cyl = unique(df$cyl), stringsAsFactors = FALSE
)
dfrand
rand cyl
1 B 6
2 A 4
3 B 8
And then you merge. You can use base R
merge(df, dfrand, by = "cyl")
or dplyr:
dplyr::left_join(
df, dfrand, by = 'cyl'
)
The result should look like the following (five random rows of the merged data):
merged <- merge(df, dfrand, by = "cyl")
merged[sample(nrow(merged), size = 5), ]
cyl mpg disp hp drat wt qsec vs am gear carb rand
1: 8 13.3 350.0 245 3.73 3.84 15.41 0 0 3 4 B
2: 4 24.4 146.7 62 3.69 3.19 20.00 1 0 4 2 A
3: 8 17.3 275.8 180 3.07 3.73 17.60 0 0 3 3 B
4: 4 32.4 78.7 66 4.08 2.20 19.47 1 1 4 1 A
5: 4 22.8 108.0 93 3.85 2.32 18.61 1 1 4 1 A
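The same lookup idea also works without a merge. A minimal base R sketch, assuming a working copy called cars (a name chosen here for illustration): draw one value per distinct cyl, store the draws in a named vector, and index that vector by each row's cyl.

```r
set.seed(42)                      # only to make the draw reproducible
cars <- mtcars                    # work on a copy

# one random label per distinct cyl value, named by that value
lev <- unique(cars$cyl)
lookup <- setNames(sample(c("A", "B"), length(lev), replace = TRUE), lev)

# index the lookup by each row's cyl (used as a character name)
cars$rand <- unname(lookup[as.character(cars$cyl)])

# every cyl group should contain exactly one distinct label
tapply(cars$rand, cars$cyl, function(v) length(unique(v)))
```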

We can use data.table
library(data.table)
as.data.table(mtcars)[, rand := sample(c("A", "B"), 1), by = cyl]

Related

How to filter inside only certain groups (that satisfy a particular condition) in grouped tibble, using dplyr?

Using the mtcars dataset as an example, I would like to:
group the table by the number of cylinders
within each group, test whether any car has miles per gallon higher than 25 (mpg > 25)
for only those groups that have at least one car with mpg > 25, remove the cars that have mpg < 20
The expected output is the dataset with every car removed that has mpg < 20 and belongs to a cylinder group in which at least one car has mpg > 25.
PS: I can think of several ways to address this problem, but I wanted to see if someone can come up with a more straightforward and elegant solution than, e.g.,
xx <- split(mtcars, f = mtcars$cyl)
for (i in seq_along(xx)) {
if (any(xx[[i]]$mpg > 25)) xx[[i]] <- filter(xx[[i]], mpg >= 20)
}
xx <- bind_rows(xx)

Maybe this ?
library(dplyr)
mtcars %>%
group_by(cyl) %>%
filter(if(any(mpg > 25)) mpg > 20 else TRUE) %>%
ungroup
# mpg cyl disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
# 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
# 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
# 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
# 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
# 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
# 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
# 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
# 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# … with 22 more rows
From the groups that have at least one mpg value greater than 25, we keep only the rows whose mpg is greater than 20. If a group has no value greater than 25, we keep all of its rows.
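The same rule can be sketched in base R, which makes the logic explicit: a row is dropped only when its group's maximum mpg exceeds the upper threshold and its own mpg does not exceed the lower one. The function name filter_within_groups and its arguments are made up for this illustration.

```r
filter_within_groups <- function(df, high = 25, low = 20) {
  # group-wise maximum of mpg, repeated for every row of the group
  grp_max <- ave(df$mpg, df$cyl, FUN = max)
  # keep rows from groups without any mpg > high, or rows with mpg > low
  df[grp_max <= high | df$mpg > low, ]
}

res <- filter_within_groups(mtcars)
nrow(res)
```

In mtcars only the 4-cylinder group contains a car above 25 mpg, and every car in that group is above 20 mpg, so with these thresholds no row is removed.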

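We can also express this as a single logical condition: keep a row if its group has no mpg above 25, or if its own mpg is above 20.
library(dplyr)
mtcars %>%
group_by(cyl) %>%
filter(!any(mpg > 25) | mpg > 20) %>%
ungroup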

Mutate a dynamic column name with conditions using other dynamic column names

I'm trying to use dplyr::mutate to change a dynamic column with conditions using other columns dynamically.
I've got this bit of code:
d <- mtcars %>% tibble
fld_name <- "mpg"
other_fld_name <- "cyl"
d <- d %>% mutate(!!fld_name := ifelse(!!other_fld_name < 5,NA,!!fld_name))
which sets mpg to
mpg
<chr>
1 mpg
2 mpg
3 mpg
4 mpg
5 mpg
6 mpg
7 mpg
8 mpg
9 mpg
10 mpg
It seems to select the field on the LHS of the assignment operator, but just pastes the field name on the RHS.
Removing the unquotes on the RHS yields the same result.
Any help is much appreciated.

Use get to retrieve the column values instead:
library(tidyverse)
d <- mtcars %>% tibble
fld_name <- "mpg"
other_fld_name <- "cyl"
d %>% mutate(!!fld_name := ifelse(get(other_fld_name) < 5 ,NA, get(fld_name)))
#> # A tibble: 32 x 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 NA 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 NA 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 NA 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> # ... with 22 more rows
Created on 2021-06-22 by the reprex package (v2.0.0)

We can also convert a variable name stored as a string into a symbol with sym and unquote it with !! (ensym is intended for capturing function arguments, whereas sym converts a string directly):
library(dplyr)
library(rlang)
d <- mtcars %>% tibble
fld_name <- "mpg"
other_fld_name <- "cyl"
d %>%
mutate(!!sym(fld_name) := ifelse(!!sym(other_fld_name) < 5, NA, !!sym(fld_name)))
# A tibble: 32 x 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 NA 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 NA 4 147. 62 3.69 3.19 20 1 0 4 2
9 NA 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# ... with 22 more rows

We could also use .data
library(dplyr)
d %>%
mutate(!! fld_name := case_when(.data[[other_fld_name]] >=5 ~
.data[[fld_name]]))
Output:
# A tibble: 32 x 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 NA 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 NA 4 147. 62 3.69 3.19 20 1 0 4 2
9 NA 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# … with 22 more rows
Data:
d <- mtcars %>%
as_tibble
fld_name <- "mpg"
other_fld_name <- "cyl"
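For comparison, a minimal base R sketch of the same operation. It sidesteps tidy evaluation entirely, because [[ ]] looks a column up by the string held in a variable. The variable names are the ones from the question; working on a plain data frame copy is an assumption made here to keep the example free of package dependencies.

```r
d <- mtcars
fld_name <- "mpg"
other_fld_name <- "cyl"

# set the target column to NA wherever the other column is below 5
d[[fld_name]][d[[other_fld_name]] < 5] <- NA

head(d[, c(fld_name, other_fld_name)])
```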

How to print tibble without row.names / row numbers

Tibbles print with row numbers as row names. See 1, 2 in the left margin below:
tibble::as_tibble(mtcars)
# A tibble: 32 x 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
Can I suppress those numbers from printing, in an argument to tibble:::print.tbl() or otherwise? I know I can use the row.names = FALSE argument in print.data.frame: print.data.frame(as_tibble(mtcars), row.names = FALSE) but then I don't get all the other nice printing options of it being a tibble, it just prints like a regular data.frame.
I'd like to keep the output the same as in print.tbl() - like what's above, here - but without row numbers.

The row numbers are added by pillar::squeeze. You could create a copy of the squeeze function (source here) with the relevant section commented out. Note that this relies on unexported pillar internals, so it may break with other pillar versions:
squeeze <- function(x, width = NULL, ...) {
zero_height <- length(x) == 0L || length(x[[1]]) == 0L
if (zero_height) {
return(pillar:::new_colonnade_sqeezed(list(), colonnade = x, extra_cols = seq_along(x)))}
if (is.null(width)) {width <- pillar:::get_width(x)}
if (is.null(width)) {width <- getOption("width")}
rowid <- pillar:::get_rowid_from_colonnade(x)
if (is.null(rowid)) {
rowid_width <- 0 }
else { rowid_width <- max(pillar:::get_widths(rowid)) + 1L }
col_widths <- pillar:::colonnade_get_width(x, width, rowid_width)
col_widths_show <- split(col_widths, factor(col_widths$tier != 0, levels = c(FALSE, TRUE)))
col_widths_shown <- col_widths_show[["TRUE"]]
col_widths_tiers <- split(col_widths_shown, col_widths_shown$tier)
out <- map(col_widths_tiers, function(tier) {
map2(tier$pillar, tier$width, pillar:::pillar_format_parts)
})
#if (!is.null(rowid)) {
# rowid_formatted <- pillar:::pillar_format_parts(rowid, rowid_width - 1L)
# out <- map(out, function(x) c(list(rowid_formatted), x))
#}
extra_cols <- rlang:::seq2(nrow(col_widths_shown) + 1L, length(x))
pillar:::new_colonnade_sqeezed(out, colonnade = x, extra_cols = extra_cols)
}
Then assign the new version into the pillar namespace, and you are set.
library(tidyverse)
assignInNamespace("squeeze", squeeze, ns = "pillar")
as_tibble(mtcars)
# A tibble: 32 x 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
21 6 160 110 3.9 2.62 16.5 0 1 4 4
21 6 160 110 3.9 2.88 17.0 0 1 4 4
22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
24.4 4 147. 62 3.69 3.19 20 1 0 4 2
22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# … with 22 more rows

You could capture the printed output and replace the row numbers with white space, and then hijack tibble:::print.tbl
tibble::as_tibble(mtcars)
# A tibble: 32 x 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# … with 22 more rows
print.tbl <- function(x) {
## tibble:::print.tbl
o <- capture.output(tibble::as_tibble(x))
m <- gregexpr('^ *\\d+', o)
regmatches(o, m) <- ' '
cli::cat_line(o)
invisible(x)
}
tibble::as_tibble(mtcars)
# A tibble: 32 x 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
21 6 160 110 3.9 2.62 16.5 0 1 4 4
21 6 160 110 3.9 2.88 17.0 0 1 4 4
22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
24.4 4 147. 62 3.69 3.19 20 1 0 4 2
22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# … with 22 more rows
Edit:
Here is another attempt with a few more features. You can pass n, width, or n_extra to tibble:::print.tbl as usual:
x <- tibble::as_tibble(mtcars)
print(x, n = 3, width = 60, n_extra = 1)
# A tibble: 32 x 11
mpg cyl disp hp drat wt qsec vs am
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1
2 21 6 160 110 3.9 2.88 17.0 0 1
3 22.8 4 108 93 3.85 2.32 18.6 1 1
# … with 29 more rows, and 2 more variables: gear <dbl>, …
And you can also exclude the row names, dimensions, and variable class or any combination:
print(x, row.names = FALSE, dims = FALSE, classes = FALSE, n = 3)
mpg cyl disp hp drat wt qsec vs am gear carb
21 6 160 110 3.9 2.62 16.5 0 1 4 4
21 6 160 110 3.9 2.88 17.0 0 1 4 4
22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
# … with 29 more rows
Here is the function, although to get these extra features you must use print(x), where x is of class tbl, instead of just x.
print.tbl <- function(x, ..., row.names = TRUE, classes = TRUE, dims = TRUE) {
## tibble:::print.tbl
o <- capture.output(tibble:::print.tbl(tibble::as_tibble(x), ...))
if (!row.names) {
m <- gregexpr('^ *\\d+', o)
regmatches(o, m) <- ' '
}
if (!classes)
o <- o[!grepl('^ +<...>', o)]
if (!dims)
o <- o[!grepl('^# A tibble.*', o)]
cli::cat_line(o)
invisible(x)
}

How to tell readr::read_csv to guess double column correctly

I have runoff data with a lot of zero values and occasionally some non-zero double values.
readr::read_csv guesses an integer column type because of the many zeros.
How can I make read_csv guess the correct double column type?
I do not know the mapping of the variable names beforehand, hence I cannot give name-type mapping.
Here is a small example
# create a column of doubles with many zeros (runoff data)
#dsTmp <- data.frame(x = c(rep(0.0, 2), 0.5)) # this works
dsTmp <- data.frame(x = c(rep(0.0, 1e5), 0.5))
write_csv(dsTmp, "tmp/dsTmp.csv")
# 0.0 is written as 0
# read_csv now guesses integer instead of double and reports
# a parsing failure.
ans <- read_csv("tmp/dsTmp.csv")
# the last value is NA instead of 0.5
tail(ans)
Can I tell it to try wider column types instead of issuing a parsing failure?
Issue 645 mentions this problem, but the workaround given there is on the writing side. I have little influence on the writing side.

Here are two techniques. (Data prep at the bottom. $hp and $vs and beyond are the integer columns.)
NB: I add cols(.default=col_guess()) to most of the first-time calls so that we don't get the large message of what read_csv found the columns to be. It can be omitted at the cost of a noisier console.
Force all columns to be double with the cols(.default=...) setting. This works safely only as long as you know there are no non-numbers in the file (here the cyl column is not numeric, so it becomes NA):
read_csv("mtcars.csv", col_types = cols(.default = col_double()))
# Warning in rbind(names(probs), probs_f) :
# number of columns of result is not a multiple of vector length (arg 1)
# Warning: 32 parsing failures.
### ...snip...
# See problems(...) for more details.
# # A tibble: 32 x 11
# mpg cyl disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 21 NA 160 110 3.9 2.62 16.5 0 1 4 4
# 2 21 NA 160 110 3.9 2.88 17.0 0 1 4 4
# 3 22.8 NA 108 93 3.85 2.32 18.6 1 1 4 1
# 4 21.4 NA 258 110 3.08 3.22 19.4 1 0 3 1
# 5 18.7 NA 360 175 3.15 3.44 17.0 0 0 3 2
# 6 18.1 NA 225 105 2.76 3.46 20.2 1 0 3 1
# 7 14.3 NA 360 245 3.21 3.57 15.8 0 0 3 4
# 8 24.4 NA 147. 62 3.69 3.19 20 1 0 4 2
# 9 22.8 NA 141. 95 3.92 3.15 22.9 1 0 4 2
# 10 19.2 NA 168. 123 3.92 3.44 18.3 1 0 4 4
# # ... with 22 more rows
Change only the <int> (col_integer()) columns, taking a little more care. The choice of n_max needs to be balanced; as with guess_max=, a little more is better. Here I use n_max=1, so the first mpg value (21) suggests an integer column, which is fine because integer columns are converted to double anyway. But if you have other fields that are ambiguous between other classes, you'll need more rows. Since you're talking about not wanting to read in the whole file but are willing to read in "a bit" to get the right guess, I'd think a reasonable value (100s? 1000s?) is robust enough for chr and lgl.
types <- attr(read_csv("mtcars.csv", n_max=1, col_types = cols(.default = col_guess())), "spec")
(intcols <- sapply(types$cols, identical, col_integer()))
# mpg cyl disp hp drat wt qsec vs am gear carb
# TRUE FALSE TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
types$cols[intcols] <- replicate(sum(intcols), col_double())
and the final read, noting that $hp and beyond are now <dbl> (unlike in the data prep read, below).
read_csv("mtcars.csv", col_types = types)
# # A tibble: 32 x 11
# mpg cyl disp hp drat wt qsec vs am gear carb
# <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 21 c6 160 110 3.9 2.62 16.5 0 1 4 4
# 2 21 c6 160 110 3.9 2.88 17.0 0 1 4 4
# 3 22.8 c4 108 93 3.85 2.32 18.6 1 1 4 1
# 4 21.4 c6 258 110 3.08 3.22 19.4 1 0 3 1
# 5 18.7 c8 360 175 3.15 3.44 17.0 0 0 3 2
# 6 18.1 c6 225 105 2.76 3.46 20.2 1 0 3 1
# 7 14.3 c8 360 245 3.21 3.57 15.8 0 0 3 4
# 8 24.4 c4 147. 62 3.69 3.19 20 1 0 4 2
# 9 22.8 c4 141. 95 3.92 3.15 22.9 1 0 4 2
# 10 19.2 c6 168. 123 3.92 3.44 18.3 1 0 4 4
# # ... with 22 more rows
Data:
library(readr)
mt <- mtcars
mt$cyl <- paste0("c", mt$cyl) # for fun
write_csv(mt, "mtcars.csv")
read_csv("mtcars.csv", col_types = cols(.default = col_guess()))
# # A tibble: 32 x 11
# mpg cyl disp hp drat wt qsec vs am gear carb
# <dbl> <chr> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <int> <int>
# 1 21 c6 160 110 3.9 2.62 16.5 0 1 4 4
# 2 21 c6 160 110 3.9 2.88 17.0 0 1 4 4
# 3 22.8 c4 108 93 3.85 2.32 18.6 1 1 4 1
# 4 21.4 c6 258 110 3.08 3.22 19.4 1 0 3 1
# 5 18.7 c8 360 175 3.15 3.44 17.0 0 0 3 2
# 6 18.1 c6 225 105 2.76 3.46 20.2 1 0 3 1
# 7 14.3 c8 360 245 3.21 3.57 15.8 0 0 3 4
# 8 24.4 c4 147. 62 3.69 3.19 20 1 0 4 2
# 9 22.8 c4 141. 95 3.92 3.15 22.9 1 0 4 2
# 10 19.2 c6 168. 123 3.92 3.44 18.3 1 0 4 4
# # ... with 22 more rows

I transferred the code of the solution of r2evans to a small function:
read_csvDouble <- function(
### read_csv but read guessed integer columns as double
... ##<< further arguments to \code{\link{read_csv}}
, n_max = Inf ##<< see \code{\link{read_csv}}
, col_types = cols(.default = col_guess()) ##<< see \code{\link{read_csv}}
## the default suppresses the type guessing messages
){
##details<< Sometimes, double columns are guessed as integer, e.g. with
## runoff data where there are many zeros and only occasionally
## positive values that can be recognized as double.
## This function modifies \code{read_csv} by changing guessed integer
## columns to double columns.
#https://stackoverflow.com/questions/52934467/how-to-tell-readrread-csv-to-guess-double-column-correctly
colTypes <- read_csv(..., n_max = 3, col_types = col_types) %>% attr("spec")
isIntCol <- map_lgl(colTypes$cols, identical, col_integer())
colTypes$cols[isIntCol] <- replicate(sum(isIntCol), col_double())
##value<< tibble as returned by \code{\link{read_csv}}
ans <- read_csv(..., n_max = n_max, col_types = colTypes)
ans
}

data.table::fread seems to work fine for this.
write_csv(dsTmp, ttfile <- tempfile())
ans <- fread(ttfile)
tail(ans)
# x
# 1: 0.0
# 2: 0.0
# 3: 0.0
# 4: 0.0
# 5: 0.0
# 6: 0.5
From the ?fread help page
Rarely, the file may contain data of a higher type in rows outside the
sample (referred to as an out-of-sample type exception). In this event
fread will automatically reread just those columns from the beginning
so that you don't have the inconvenience of having to set colClasses
yourself;
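As a further alternative that the answers above do not mention, base R's read.csv decides a column's type only after reading the column in full, so a single non-integer value at the end is enough to make the column double. A minimal sketch reproducing the example from the question with base functions only (the temporary file and the names csv_file and ds are assumptions for illustration):

```r
csv_file <- tempfile(fileext = ".csv")
ds <- data.frame(x = c(rep(0, 1e5), 0.5))
write.csv(ds, csv_file, row.names = FALSE)   # zeros are written as 0

# the whole column is read before its type is chosen
ans <- read.csv(csv_file)
class(ans$x)
tail(ans$x, 1)
```

The trade-off is speed: read.csv is usually slower than read_csv or fread on large files.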

Use assign() inside purrr::walk()

I have a number of data frames and a series of changes I want to make to each of them. For this example, let the desired change be simply making each data frame a tibble using as_tibble. I know there are various ways of doing this, but I'd like to do this using purrr::walk.
For data frames df1 and df2,
df1 <- mtcars
df2 <- mtcars
I'd like to do the equivalent of
df1 %<>% as_tibble
df2 %<>% as_tibble
using walk. My attempt:
library(tidyverse)
walk(c(df1, df2), ~ assign(deparse(substitute(.)), as_tibble(.)))
This runs but does not make the desired change:
is_tibble(df1)
#> [1] FALSE

Here is how you can combine assign with walk (see the comments in the code for more explanation):
library(tidyverse)
# data
df1 <- mtcars
df2 <- mtcars
# creating tibbles
# this creates a list of objects with names ("df1", "df2")
tibble::lst(df1, df2) %>%
purrr::walk2(
.x = names(.), # names to assign
.y = ., # object to be assigned
.f = ~ assign(x = .x,
value = tibble::as_tibble(.y),
envir = .GlobalEnv)
)
# checking the newly created tibbles
df1
#> # A tibble: 32 x 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> # ... with 22 more rows
df2
#> # A tibble: 32 x 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> # ... with 22 more rows
Created on 2018-11-13 by the reprex package (v0.2.1)
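Two things go wrong in the original attempt: c(df1, df2) concatenates the columns of both data frames into one list of 22 vectors, and assign() without an envir argument writes into the environment of the anonymous function, which is discarded afterwards. The sketch below shows the same name-based pattern in base R with an explicit target environment. The environment env and the helper add_id are hypothetical stand-ins (add_id replaces as_tibble so that the example needs no packages).

```r
env <- new.env()
env$df1 <- mtcars
env$df2 <- mtcars

# hypothetical change applied to every data frame: add a row id column
add_id <- function(df) {
  df$id <- seq_len(nrow(df))
  df
}

# loop over the names, not the objects, and assign into the target environment
for (nm in c("df1", "df2")) {
  assign(nm, add_id(get(nm, envir = env)), envir = env)
}

names(env$df1)
```

Keeping the data frames in a named list and using lapply avoids assign altogether and is usually easier to reason about.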
