I know there are several ways to create a column based on another column, but I would like to know how to do it while creating the data frame.
For example, this works, but it is not the way I want to do it:
v1 = rnorm(10)
sample_df <- data.frame(v1 = v1,
cs = cumsum(v1))
This does not work:
sample_df2 <- data.frame(v2 = rnorm(10),
cs = cumsum(v2))
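Is there a way to do it directly in the data.frame() function? Thanks in advance.
This cannot be done with data.frame(), because its arguments are evaluated before the data frame exists, so v2 is not visible when cumsum(v2) is evaluated. The tibble package implements a data frame analogue that evaluates its arguments sequentially, which gives you the functionality you want.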
library("tibble")
tib <- tibble(x = 1:6, y = cumsum(x))
tib
# # A tibble: 6 × 2
# x y
# <int> <int>
# 1 1 1
# 2 2 3
# 3 3 6
# 4 4 10
# 5 5 15
# 6 6 21
In most cases, the resulting object (called a "tibble") can be treated as if it were a data frame, but if you truly need a data frame, then you can do this:
dat <- as.data.frame(tib)
dat
# x y
# 1 1 1
# 2 2 3
# 3 3 6
# 4 4 10
# 5 5 15
# 6 6 21
You can wrap everything in a function if you like:
f <- function(...) as.data.frame(tibble(...))
f(x = 1:6, y = cumsum(x))
# x y
# 1 1 1
# 2 2 3
# 3 3 6
# 4 4 10
# 5 5 15
# 6 6 21
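If adding a package is not an option, a similar effect is possible in base R with transform(), which creates the first column and derives the second in a single expression. This is only a sketch: the data frame is still built in two steps internally, the seed is arbitrary, and the names sample_df2, v2 and cs are taken from the question.

```r
set.seed(42)  # arbitrary seed, only to make the example reproducible

# transform() evaluates its arguments with the data frame's columns in scope,
# so cs can refer to v2 even though v2 never exists as a standalone variable.
sample_df2 <- transform(data.frame(v2 = rnorm(10)),
                        cs = cumsum(v2))

# Check that the derived column equals a cumulative sum computed afterwards.
all.equal(sample_df2$cs, cumsum(sample_df2$v2))
```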
Related
I have two data frames, df1 and df2, whose column names are dates. When I join them, I get columns like
date1.x, date1.y, date2.x, date2.y, date3.x, date3.y, date4.x, date4.y, ...
I want to create new columns whose values are the product of date1.x and date1.y, and likewise for the other date pairs.
df <- data.frame(id=11:13, date1.x=1:3, date2.x=4:6, date1.y=7:9, date2.y=10:12)
df
# id date1.x date2.x date1.y date2.y
# 1 11 1 4 7 10
# 2 12 2 5 8 11
# 3 13 3 6 9 12
grep("^date.*\\.x$", colnames(df), value = TRUE)
# [1] "date1.x" "date2.x"
datenms <- grep("^date.*\\.x$", colnames(df), value = TRUE)
### make sure all of our 'date#.x' columns have matching 'date#.y' columns
datenms <- datenms[ gsub("x$", "y", datenms) %in% colnames(df) ]
datenms
# [1] "date1.x" "date2.x"
subset(df, select = datenms)
# date1.x date2.x
# 1 1 4
# 2 2 5
# 3 3 6
subset(df, select = gsub("x$", "y", datenms))
# date1.y date2.y
# 1 7 10
# 2 8 11
# 3 9 12
subset(df, select = datenms) * subset(df, select = gsub("x$", "y", datenms))
# date1.x date2.x
# 1 7 40
# 2 16 55
# 3 27 72
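The product above keeps the names of the .x columns. If the goal is to store the products as new columns next to the originals, one possible sketch follows. The ".prod" suffix and the variable names xcols, ycols and prod_names are made up for illustration.

```r
df <- data.frame(id = 11:13, date1.x = 1:3, date2.x = 4:6,
                 date1.y = 7:9, date2.y = 10:12)

# Names of the .x columns and of their .y partners, in matching order.
xcols <- grep("^date.*\\.x$", names(df), value = TRUE)
ycols <- sub("x$", "y", xcols)

# Keep only the pairs for which the .y partner really exists.
keep  <- ycols %in% names(df)
xcols <- xcols[keep]
ycols <- ycols[keep]

# Hypothetical naming scheme for the new columns: date1.prod, date2.prod, ...
prod_names <- sub("\\.x$", ".prod", xcols)

# Multiplying two data frames of equal shape works element by element.
df[prod_names] <- df[xcols] * df[ycols]
df
```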
There are a number of ways to do this, but I suggest that it is good practice to get used to transforming your data into a format that is easy to work with. The first answer showed you one way to do what you want without transforming your data. My answer shows how to transform the data so that calculations (this one and others) are easy, and then how to perform the calculation once the data is tidy.
Tidy data makes it easier to aggregate, to graph results, to engineer features for models, and so on.
library(dplyr)
library(tidyr)
df <- data.frame(id=11:13, date1.x=1:3, date2.x=4:6, date1.y=7:9, date2.y=10:12)
df
# id date1.x date2.x date1.y date2.y
# 1 11 1 4 7 10
# 2 12 2 5 8 11
# 3 13 3 6 9 12
# Convert the data to a tidy format that is easier for computers to calculate
tidy_df <- df %>%
pivot_longer(
cols = starts_with("date"), # We are tidying any column starting with date
names_to = c("date_num","date_source"), # creating two columns for names
values_to = c("date_value"), # creating one column for values
names_prefix = "date", # removing the "date" prefix
names_sep = "\\." # splitting the names on the period `.`
)
tidy_df
# id date_num date_source date_value
# <int> <chr> <chr> <int>
# 1 11 1 x 1
# 2 11 2 x 4
# 3 11 1 y 7
# 4 11 2 y 10
# 5 12 1 x 2
# 6 12 2 x 5
# 7 12 1 y 8
# 8 12 2 y 11
# 9 13 1 x 3
# 10 13 2 x 6
# 11 13 1 y 9
# 12 13 2 y 12
# Now that the data is tidy we can do easier dataframe grouping and aggregation
tidy_df %>%
group_by(id,date_num) %>%
summarise(date_value_mult = prod(date_value)) %>%
ungroup()
# id date_num date_value_mult
# <int> <chr> <dbl>
# 1 11 1 7
# 2 11 2 40
# 3 12 1 16
# 4 12 2 55
# 5 13 1 27
# 6 13 2 72
# If/When you eventually want the data in a more human readable format you can
# pivot the data back into a human readable format. This is likely after all
# computer calculations are done and you want to present the data. For storing
# the data (such as in a database) you would not need/want this step.
tidy_df %>%
group_by(id,date_num) %>%
summarise(date_value_mult = prod(date_value)) %>%
ungroup() %>%
pivot_wider(
names_from = date_num,
values_from = date_value_mult,
names_prefix = "date"
)
# id date1 date2
# <int> <dbl> <dbl>
# 1 11 7 40
# 2 12 16 55
# 3 13 27 72
I am trying to use the expand() function to create combinations of multiple variables named in a character vector. The following code correctly produces 27 rows of combinations when the columns are listed explicitly. However, when I try to use var_list in various forms, expand() does not produce the desired 27 combinations. How can I use var_list to dynamically create combinations of multiple columns in a data frame df?
abc <- letters[1:3]
num <- c(1,2,3)
xyz <- letters[24:26]
df <- as.data.frame(cbind(abc,num,xyz))
combinations_1 <- expand(df,abc,num,xyz) #This returns 27 combinations
var_list <- c("abc","num","xyz")
combinations_2 <- expand(df,var_list) #This returns 3 combinations
combinations_3 <- expand(df,df[var_list]) #This returns 3 combinations
combinations_4 <- expand(df,noquote(var_list)) #This returns 3 combinations
We can use mget to retrieve the objects whose names are stored in var_list:
expand.grid(mget(var_list))
Or if we need to make use of 'df', just extract with [
expand.grid(df[var_list])
Or using expand
library(dplyr)
library(tidyr)
expand(df, !!! rlang::syms(var_list))
# A tibble: 27 x 3
# abc num xyz
# <fct> <fct> <fct>
# 1 a 1 x
# 2 a 1 y
# 3 a 1 z
# 4 a 2 x
# 5 a 2 y
# 6 a 2 z
# 7 a 3 x
# 8 a 3 y
# 9 a 3 z
#10 b 1 x
# … with 17 more rows
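You can turn var_list into a list of symbols and splice them into the call. syms() comes from rlang, so it is qualified here in case rlang or dplyr is not attached.
library(tidyr)
df %>%
expand(!!!rlang::syms(var_list))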
# A tibble: 27 x 3
abc num xyz
<fct> <fct> <fct>
1 a 1 x
2 a 1 y
3 a 1 z
4 a 2 x
5 a 2 y
6 a 2 z
7 a 3 x
8 a 3 y
9 a 3 z
10 b 1 x
# ... with 17 more rows
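One difference between the base R and tidyr routes is worth keeping in mind: expand() works on the unique values of each column, while expand.grid(df[var_list]) uses every value as it appears, duplicates included. The sketch below shows this with the question's data plus one deliberately repeated row. The name df_dup is made up for illustration.

```r
abc <- letters[1:3]
num <- c(1, 2, 3)
xyz <- letters[24:26]
df <- as.data.frame(cbind(abc, num, xyz))
var_list <- c("abc", "num", "xyz")

# Hypothetical variant of df with its first row repeated.
df_dup <- rbind(df, df[1, ])

# Every value is used as-is, so the repeated row inflates the result.
nrow(expand.grid(df_dup[var_list]))   # 4 * 4 * 4 = 64

# Taking unique values per column first restores the 27 combinations.
combos <- expand.grid(lapply(df_dup[var_list], unique))
nrow(combos)                          # 3 * 3 * 3 = 27
```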
I am trying to expand an existing dataset, which currently looks like this:
df <- tibble(
site = letters[1:3],
years = rep(4, 3),
tr = c(3, 6, 4)
)
tr is the total number of replicates for each site/year combination. I simply want to add in the replicates and later the response variable for each replicate. This was easy for a single site/year combination using the following function:
f <- function(site=NULL, years=NULL, t=NULL){
df <- tibble(
site = rep(site, each = t, times= years),
tr = rep(1:t, times = years),
year = rep(1:years, each = t)
)
df
}
# For one site:
f(site='a', years=4, t=3)
# Producing this:
# # A tibble: 12 x 3
# site tr year
# <chr> <int> <int>
# 1 a 1 1
# 2 a 2 1
# 3 a 3 1
# 4 a 1 2
# 5 a 2 2
# 6 a 3 2
# 7 a 1 3
# 8 a 2 3
# 9 a 3 3
# 10 a 1 4
# 11 a 2 4
# 12 a 3 4
How can the function be applied to each row of the input data frame to produce the final data frame? One of the apply functions in base R or pmap_df() in the purrr package would seem ideal, but being unfamiliar with how these functions work, all my efforts have only produced errors.
If we want to apply the same function to every row, use pmap:
library(purrr)
pmap_dfr(df, ~ f(..1, ..2, ..3))
# A tibble: 52 x 3
# site tr year
# * <chr> <int> <int>
# 1 a 1 1
# 2 a 2 1
# 3 a 3 1
# 4 a 1 2
# 5 a 2 2
# 6 a 3 2
# 7 a 1 3
# 8 a 2 3
# 9 a 3 3
#10 a 1 4
# … with 42 more rows
Another option is condense() from the development version of dplyr (an experimental function that may not exist in released versions):
library(dplyr)
library(tidyr)
df %>%
group_by(rn = row_number()) %>%
condense(out = f(site, years, tr)) %>%
unnest(c(out))
Or in base R, we can also use do.call with Map
do.call(rbind, do.call(Map, c(f, unname(as.data.frame(df)))))
Well, in base R you could do:
do.call(rbind,do.call(Vectorize(f,SIMPLIFY = FALSE),unname(df)))
# A tibble: 52 x 3
site tr year
* <chr> <int> <int>
1 a 1 1
2 a 2 1
3 a 3 1
4 a 1 2
5 a 2 2
6 a 3 2
7 a 1 3
8 a 2 3
9 a 3 3
10 a 1 4
# ... with 42 more rows
do.call(rbind, lapply(split(df, df$site), function(x){
with(x, data.frame(site,
years = rep(sequence(years), each = tr),
tr = rep(sequence(tr), years)))
}))
We can use Map to apply f to every value of site, years and tr.
do.call(rbind, Map(f, df$site, df$years, df$tr))
# A tibble: 52 x 3
# site tr year
# * <chr> <int> <int>
# 1 a 1 1
# 2 a 2 1
# 3 a 3 1
# 4 a 1 2
# 5 a 2 2
# 6 a 3 2
# 7 a 1 3
# 8 a 2 3
# 9 a 3 3
#10 a 1 4
# … with 42 more rows
Akrun's answer worked well for me, so I modified it to make the function that is applied to each row of the data frame a little more explicit:
df1 <- pmap_df(df, function(site, years, tr){
site = rep(site, each = tr, times=years)
year = rep(1:years, each = tr)
tr = rep(1:tr, times=years)
return(tibble(site, year, tr))
})
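The version above needs no ..1, ..2, ..3 placeholders because pmap-style functions pass each column as a named argument, and here the argument names (site, years, tr) match the column names. The original f() has an argument called t while the column is called tr, which is why it had to be called by position. A small sketch of the name-matching route, where renaming the column is my own assumption about how one might reconcile the two:

```r
library(purrr)
library(tibble)

df <- tibble(site = letters[1:3], years = rep(4, 3), tr = c(3, 6, 4))

# f() as defined in the question; its third argument is t, not tr.
f <- function(site = NULL, years = NULL, t = NULL) {
  tibble(
    site = rep(site, each = t, times = years),
    tr   = rep(1:t, times = years),
    year = rep(1:years, each = t)
  )
}

# Rename the tr column to t so that every column name matches an argument
# of f(); pmap() then calls f(site = ..., years = ..., t = ...) per row.
pieces <- pmap(setNames(df, c("site", "years", "t")), f)

# Stack the per-row results into one table.
out <- do.call(rbind, pieces)
nrow(out)   # 12 + 24 + 16 = 52
```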
I have a function**:
do_thing <- function(x) {
return(x + runif(1, 0, 100))
}
That I'd like to apply to my data:
df <- tibble(x = 1:10)
Preferably with mutate:
set.seed(1)
df %>%
mutate(y = do_thing(x))
The function, however, is not performing as expected:
# x y
# 1 1 27.55087
# 2 2 28.55087
# 3 3 29.55087
# 4 4 30.55087
# 5 5 31.55087
# 6 6 32.55087
# 7 7 33.55087
# 8 8 34.55087
# 9 9 35.55087
# 10 10 36.55087
I actually want the function to apply in a rowwise fashion:
df %>%
rowwise() %>%
mutate(y = do_thing(x))
# x y
# 1 1 38.21239
# 2 2 59.28534
# 3 3 93.82078
# 4 4 24.16819
# 5 5 94.83897
# 6 6 100.46753
# 7 7 73.07978
# 8 8 70.91140
# 9 9 15.17863
# 10 10 30.59746
Is there a way to rewrite my function so that it is flexible and automatically behaves rowwise, while still working with a single input (i.e., do_thing(100))?
** actual function is a lot more complex
Instead of drawing a single runif value, which is then recycled across every row, we can set n to the number of rows (n()) of the dataset:
set.seed(24)
df %>%
mutate(y = x + runif(n(), 0, 100))
# A tibble: 10 x 2
# x y
# <int> <dbl>
# 1 1 46.952549
# 2 2 61.939816
# 3 3 94.972191
# 4 4 102.282408
# 5 5 8.780258
# 6 6 63.793740
# 7 7 80.331417
# 8 8 32.874240
# 9 9 39.073652
#10 10 83.346670
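To keep the logic inside the function, as the question asks, the number of draws can be tied to the length of the input instead of n(). The sketch below shows that, plus a Vectorize() fallback for the case where the real function can only handle one value at a time. The names do_thing_scalar and do_thing_each are made up for illustration.

```r
# One draw per element of x, so the function is vectorised by construction
# and works both inside mutate() and on a single value.
do_thing <- function(x) {
  x + runif(length(x), 0, 100)
}

# Fallback: if the real function only makes sense for one value at a time,
# Vectorize() wraps it so that it is called once per element.
do_thing_scalar <- function(x) x + runif(1, 0, 100)
do_thing_each <- Vectorize(do_thing_scalar)

set.seed(1)  # arbitrary seed for reproducibility
length(do_thing(100))        # a single input still returns a single value
y <- do_thing(1:10)
length(unique(y - 1:10))     # each element received its own random offset
```

Either version can be used as mutate(y = do_thing(x)) without rowwise(). The Vectorize() route is typically slower on large inputs because it loops over the elements.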
I have two data frames. Data frame A has many observations/rows, an ID for each observation, and many additional columns. For a subset of observations X, the values for a set of columns are missing/NA. Data frame B contains a subset of the observations in X (which can be matched across data frames using the ID) and variables with identical names as in data frame A, but containing values to replace the missing values in the set of columns with missing/NA.
My code below (using a join operation) merely adds columns rather than replacing missing values. For each of the additional variables (let's name them W) in B, the resulting table produces W.x and W.y.
library(dplyr)
foo <- data.frame(id = seq(1:6), x = c(NA, NA, NA, 1, 3, 8), z = seq_along(10:15))
bar <- data.frame(id = seq(1:2), x = c(10, 9))
dplyr::left_join(x = foo, y = bar, by = "id")
I am trying to replace the missing values in A using the values in B based on the ID, but do so in an efficient manner since I have many columns and many rows. My goal is this:
id x z
1 1 10 1
2 2 9 2
3 3 NA 3
4 4 1 4
5 5 3 5
6 6 8 6
One thought was to use ifelse() after joining, but typing out ifelse() functions for all of the variables is not feasible. Is there a way to do this simply without the database join or is there a way to apply a function across all columns ending in .x to replace the values in .x with the value in .y if the value in .x is missing?
Another attempt, which should essentially be a single assignment operation. Using @alistaire's data again:
vars <- c("x","y")
foo[vars] <- Map(pmax, foo[vars], bar[match(foo$id, bar$id), vars], na.rm=TRUE)
foo
# id x y z
#1 1 10 1 1
#2 2 9 2 2
#3 3 NA 3 3
#4 4 1 4 4
#5 5 3 5 5
#6 6 8 6 6
EDIT
Updating the answer to use @alistaire's example data frame.
We can extend the original answer given below with mapply so that it handles multiple columns in both foo and bar.
First find the columns common to the two data frames and sort them so they are in the same order.
vars <- sort(intersect(names(foo), names(bar))[-1])
foo[vars] <- mapply(function(x, y) {
ind = is.na(x)
replace(x, ind, y[match(foo$id[ind], bar$id)])
}, foo[vars], bar[vars])
foo
# id x y z
#1 1 10 1 1
#2 2 9 2 2
#3 3 NA 3 3
#4 4 1 4 4
#5 5 3 5 5
#6 6 8 6 6
Original Answer
I think this does what you are looking for :
foo[-1] <- sapply(foo[-1], function(x) {
ind = is.na(x)
replace(x, ind, bar$x[match(foo$id[ind], bar$id)])
})
foo
# id x z
#1 1 10 1
#2 2 9 2
#3 3 NA 3
#4 4 1 4
#5 5 3 5
#6 6 8 6
For every column (except id) we find the missing values in foo and replace them with the corresponding values from bar.
If you don't mind verbose base R approaches, you can easily accomplish this using merge() and careful subsetting of your data frame.
df <- merge(foo, bar, by="id", all.x=TRUE)
names(df) <- c("id", "x", "z", "y")
df$x[is.na(df$x)] <- df$y[is.na(df$x)]
df <- df[c("id", "x", "z")]
df
id x z
1 1 10 1
2 2 9 2
3 3 NA 3
4 4 1 4
5 5 3 5
6 6 8 6
You can iterate dplyr::coalesce over the intersect of non-grouping columns. It's not elegant, but it should scale reasonably well:
library(tidyverse)
foo <- data.frame(id = seq(1:6),
x = c(NA, NA, NA, 1, 3, 8),
y = 1:6, # add extra shared variable
z = seq_along(10:15))
bar <- data.frame(id = seq(1:2),
y = c(1L, NA),
x = c(10, 9))
# names of non-grouping variables in both
vars <- intersect(names(foo), names(bar))[-1]
foobar <- left_join(foo, bar, by = 'id')
foobar <- vars %>%
map(paste0, c('.x', '.y')) %>% # make list of columns to coalesce
map(~foobar[.x]) %>% # for each set, subset foobar to a two-column data.frame
invoke_map(.f = coalesce) %>% # ...and coalesce it into a vector
set_names(vars) %>% # add names to list elements
bind_cols(foobar) %>% # bind into data.frame and cbind to foobar
select(union(names(foo), names(bar))) # drop duplicated columns
foobar
#> # A tibble: 6 x 4
#> id x y z
#> <int> <dbl> <int> <int>
#> 1 1 10 1 1
#> 2 2 9 2 2
#> 3 3 NA 3 3
#> 4 4 1 4 4
#> 5 5 3 5 5
#> 6 6 8 6 6
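For readers on a recent dplyr (1.0.0 or later, as far as I know), rows_patch() appears to cover this case directly: it matches rows by key and overwrites only the NA values in the first table. A minimal sketch with the question's data, assuming the ids in bar are unique and all present in foo. The name patched is made up for illustration.

```r
library(dplyr)

foo <- data.frame(id = 1:6, x = c(NA, NA, NA, 1, 3, 8), z = 1:6)
bar <- data.frame(id = 1:2, x = c(10, 9))

# Only NA cells of foo are filled; existing values are left untouched,
# and columns that bar does not have (z) pass through unchanged.
patched <- rows_patch(foo, bar, by = "id")
patched$x   # 10 9 NA 1 3 8
```

This avoids the .x/.y suffixes altogether, since no join result is materialised.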