I have this df:
df <- data.table(id=c(1,2,3,4,5,6,7,8,9,10),
var1=c(0,4,5,6,99,3,5,5,23,0),
var2=c(22,4,6,25,6,70,75,23,24,21))
I would like to create a third column being:
df <- data.table(id=c(1,2,3,4,5,6,7,8,9,10),
var1=c(0,4,5,6,99,3,5,5,23,0),
var2=c(22,4,6,25,6,70,75,23,24,21),
var3=c("0_22","4_4","5_6","6_25","99_6","3_70","5_75","5_23","23_24","0_21"))
where the value of each cell will be "var1 underscore var2".
Var1 and Var2 are categorical variables, as they represent medications; Var3 would represent a combination of medications.
How can I do this?
thanks!
Load packages
library(data.table)
library(dplyr)
Create dataframe
df <- data.table(
id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
var1 = c(0, 4, 5, 6, 99, 3, 5, 5, 23, 0),
var2 = c(22, 4, 6, 25, 6, 70, 75, 23, 24, 21)
)
Add new variable
Using the dplyr package and sprintf
df <- df %>%
mutate(var3 = sprintf("%d_%d", var1, var2))
Using the dplyr package and paste0
df <- df %>%
mutate(var3 = paste0(var1, "_", var2))
Using base R and sprintf
df$var3 <- sprintf("%d_%d", df$var1, df$var2)
Using base R and paste0
df$var3 <- paste0(df$var1, "_", df$var2)
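Since the original object is a data.table, an equivalent data.table-style sketch (adding the column by reference, so no copy of df is made):
# := adds var3 in place
df[, var3 := paste(var1, var2, sep = "_")]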
As @Wimpel says, the solution is df$var3 <- paste(df$var1, df$var2, sep = "_").
thanks!!
You can do this efficiently using the tidyverse and the unite() function:
library(tidyverse)
df <- tibble(id=c(1,2,3,4,5,6,7,8,9,10),
var1=c(0,4,5,6,99,3,5,5,23,0),
var2=c(22,4,6,25,6,70,75,23,24,21)) %>%
# create new variable
unite(var3, c(var1, var2), sep = "_", remove = FALSE)
I want to center, but not standardize, a set of variables in a data frame. I tried the code for doing that using mutate_at, but the scale function uses scale = TRUE as the default, and I can't figure out how to set it to scale = FALSE. This scales the desired variables, but standardizes in addition to centering:
centdata <- mydat %>%
mutate_at(.vars = c(1, 2, 3, 4, 5, 6, 7, 8, 14),
.funs = list("scaled" = scale))
You can use a purrr-style formula or an anonymous function here.
library(dplyr)
cols <- c(1, 2, 3, 4, 5, 6, 7, 8, 14)
centdata <- mydat %>%
mutate_at(.vars = cols,
.funs = list("scaled" = ~scale(., scale = FALSE)))
Since mutate_at has been superseded, you can use across (wrapping the column positions in all_of()).
centdata <- mydat %>%
  mutate(across(all_of(cols), list("scaled" = ~scale(., scale = FALSE))))
In base R -
mydat[paste0(names(mydat)[cols], '_scaled')] <- lapply(mydat[cols], scale, scale = FALSE)
scale also works on a data frame directly.
mydat[paste0(names(mydat)[cols], '_scaled')] <- scale(mydat[cols])
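A small caveat: scale() returns a one-column matrix, so the new variables become matrix columns. Wrapping the call in as.numeric() keeps them as plain numeric vectors; a minimal sketch with a made-up toy data frame:
library(dplyr)
# toy data, only for illustration
mydat <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))
cols <- c(1, 2)
centdata <- mydat %>%
  mutate(across(all_of(cols),
                list(scaled = ~ as.numeric(scale(., scale = FALSE)))))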
I have a column of type "list" in my data frame, and I want to create a column with the sum of each element.
I guess there is no visual difference, but my column consists of list(1,2,3)s and not c(1,2,3)s:
tibble(
MY_DATA = list(
list(2, 7, 8),
list(3, 10, 11),
list(4, 2, 8)
),
NOT_MY_DATA = list(
c(2, 7, 8),
c(3, 10, 11),
c(4, 2, 8)
)
)
Unfortunately when I try
mutate(NEW_COL = MY_LIST_COL_D %>% unlist() %>% sum())
the result is that every cell in the new column contains the sum of the entire source column (so a value in the millions).
I tried reduce and it did work, but it was slow, so I was looking for a better solution.
You could use purrr::map_dbl, which returns a vector of type double:
library(tibble)
library(dplyr)
library(purrr)
df = tibble(
MY_LIST_COL_D = list(
c(2, 7, 8),
c(3, 10, 11),
c(4, 2, 8)
)
)
df %>%
mutate(NEW_COL= map_dbl(MY_LIST_COL_D, sum), .keep = 'unused')
#   NEW_COL
#     <dbl>
# 1      17
# 2      24
# 3      14
Is this what you were looking for? If you don't want to remove the list column, just disregard the .keep argument.
Update
With the underlying structure being lists, you can still apply the same logic, but one way to solve the issue is to unlist:
df = tibble(
MY_LIST_COL_D = list(
list(2, 7, 8),
list(3, 10, 11),
list(4, 2, 8)
)
)
df %>%
mutate(NEW_COL = map_dbl(MY_LIST_COL_D, ~ sum(unlist(.x))), .keep = 'unused')
#   NEW_COL
#     <dbl>
# 1      17
# 2      24
# 3      14
You can use rowwise in dplyr:
library(dplyr)
df %>% rowwise() %>% mutate(NEW_COL = sum(MY_LIST_COL_D))
rowwise will also make your attempt work:
df %>% rowwise() %>% mutate(NEW_COL = MY_LIST_COL_D %>% unlist() %>% sum())
You can also use sapply in base R:
df$NEW_COL <- sapply(df$MY_LIST_COL_D, sum)
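Note that if the column holds list() elements (as in the update above) rather than atomic vectors, sum() alone will fail, so unlist() each element first; a sketch:
# works for list() elements as well as atomic vectors
df$NEW_COL <- sapply(df$MY_LIST_COL_D, function(x) sum(unlist(x)))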
For all numeric fields in a Shiny app I am dynamically adding a slider. Now I also want to add a dynamic filter for the data based on the slider input.
To illustrate the problem, here is some code with data and static filtering:
library(glue)
library(tidyverse)
data <-
tibble(
a = c(61, 7, 10, 2, 5, 7, 23, 60),
b = c(2, 7, 1, 9, 6, 7, 3, 6),
c = c(21, 70, 1, 4, 6, 2, 3, 61)
)
input <- list("a" = c(2, 10),
"b" = c(7, 10),
"c" = c(1, 5))
data %>% filter(
between(a, input$a[1], input$a[2]),
between(b, input$b[1], input$b[2]),
between(c, input$c[1], input$c[2])
)
Is there a way to implement dynamic filtering?
I built myself a dynamic filter, which basically works:
query <- data %>%
  select(where(is.numeric)) %>%   # take the names of the numeric columns
  colnames() %>%
  map(function(feature) {
    "between({feature}, input${feature}[1], input${feature}[2])" %>% glue()
  }) %>%
  paste0(collapse = ", ")
eval(parse(text = "data %>% filter({query})" %>% glue()))
Is there a dplyr way?
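One possible dplyr-only sketch (assuming dplyr >= 1.0.4 for if_all(), and that input stays a named list keyed by column name, as above):
library(dplyr)
data %>%
  filter(if_all(where(is.numeric),
                ~ between(.x, input[[cur_column()]][1], input[[cur_column()]][2])))
# cur_column() looks up the matching slider range for each numeric column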
I have a data frame that looks like this. Names and number of columns will NOT be consistent (sometimes 'C' will not be present, other times 'D', 'E', 'F' may be present, etc.).
# name and number of columns varies...so need flexible process
A <- c(1, 2, 1, 2, 3, 2, 1, 1, 1, 2, 1, 4, 3, 1, 2, 2, 1, 2, 4, 8)
B <- c(5, 6, 6, 5, 3, 7, 2, 1, 1, 2, 7, 4, 7, 8, 5, 7, 6, 6, 4, 7)
C <- c(9, 1, 2, 2, 1, 4, 5, 6, 7, 8, 89, 9, 7, 6, 5, 6, 8, 9 , 67, 6)
ABC <- data.frame(A, B, C)
I want to loop through each variable and collect various information. This is a simple example, but what I am doing will be more complicated. I say that so that somebody doesn't just recommend some sort of summary() type solution.
maximum_value <- max(A)
mean_value <- mean(A)
# lots of other calculations for A
ID = 'A'
tempA <- data.frame(ID, maximum_value, mean_value)
maximum_value <- max(B)
mean_value <- mean(B)
# lots of other calculations for B
ID = 'B'
tempB <- data.frame(ID, maximum_value, mean_value)
maximum_value <- max(C)
mean_value <- mean(C)
# lots of other calculations for C
ID = 'C'
tempC <- data.frame(ID, maximum_value, mean_value)
output <- rbind(tempA, tempB, tempC)
Here is my attempt at creating a loop to go through the variables one by one and aggregate output. I can't figure out how to get [i] to point at an individual column of the data frame ABC.
# initialize data frame
data__ <- data.frame(ID__ = as.character(),
max__ = as.numeric(),
mean__ = as.numeric())
# loop through A, then B, then C
for(i in A:C) {
ID__ <- '[i]'
max__ <- maximum[i]
mean__ <- mean[i]
data__temp <- (ID__, max__, mean__)
data__ <- rbind(data__, data__temp)
}
If I were doing this in SAS, I would use a select into within proc sql to create a list of the variable names, then write an array, then I could loop through them that way, but there's something I'm missing here.
How would I tell R to do this process for each variable in the data frame?
If you use the tidyverse dplyr and tidyr packages, you can do
library(dplyr)
library(tidyr)
ABC %>% gather(ID, value) %>% group_by(ID) %>% summarize_all(funs(mean, max))
or
ABC %>% gather(ID, value) %>% group_by(ID) %>%
summarize(maximum_value = max(value), mean_value=mean(value))
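gather() and funs() have since been superseded; an equivalent sketch using pivot_longer() (assuming tidyr >= 1.0):
library(dplyr)
library(tidyr)
ABC %>%
  pivot_longer(everything(), names_to = "ID", values_to = "value") %>%
  group_by(ID) %>%
  summarize(maximum_value = max(value), mean_value = mean(value))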
If you'd rather use base functions and there are a lot of "weird" calculations, you can use purrr's map2_df function:
library(purrr)
map2_df(ABC, names(ABC), function(a, n) {
data_frame(ID=n, max_val=max(a), mean_val=mean(a))
})
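If you prefer the explicit loop from the question, a working sketch that iterates over the column names and indexes the data frame with [[ ]]:
# build the summary one column at a time
output <- data.frame()
for (nm in names(ABC)) {
  temp <- data.frame(ID = nm,
                     maximum_value = max(ABC[[nm]]),
                     mean_value = mean(ABC[[nm]]))
  output <- rbind(output, temp)
}
output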
If I have a data frame, say
df <- data.frame(x = c(1, 2, 3), y = c(2, 4, 7), z = c(3, 6, 10))
then I can modify entries with the which function:
w <- which(df[,"y"] == 7)
df[w,c("y", "z")] <- data.frame(6, 9)
One way I see to do this with the dplyr package is the following:
df <- df %>%
mutate(W = (y==7),
y = ifelse(W, 6, y),
z = ifelse(W, 9, z)) %>%
select(-W)
But I find it a bit inelegant, and I am not sure it would replace all kinds of which uses. Ideally I would imagine something like:
df <- df %>%
keep(y == 7) %>%
mutate(y = 6) %>%
unkeep()
where keep would provisionally select rows where modifications are to be made, and unkeep would unselect them to recover the full data frame.
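dplyr does not provide a keep()/unkeep() pair like this, but one sketch that comes close without a temporary column: evaluate the condition once (as which() would) and reuse it with replace() inside mutate():
library(dplyr)
df <- data.frame(x = c(1, 2, 3), y = c(2, 4, 7), z = c(3, 6, 10))
w <- df$y == 7   # evaluate the condition once
df <- df %>%
  mutate(y = replace(y, w, 6),
         z = replace(z, w, 9))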