Performing the Same mutate on all variables in a data frame - r

I have a 28-variable data frame, and I would like to mutate each variable in the same data frame with the same function. For example, add an extra column for each variable in the data frame where the new column is the log of the variable. So for example if I had
dataframe <- data.frame(X=data1, Y=data2, Z=data3)
I want a new data frame that contains X Y and Z, but also log(X), log(Y) and log(Z). This is easy enough to do using
mutate(dataframe, log(X)); mutate(dataframe, log(Y))
etc., but for 28 variables (and multiple transformations on each variable; I also want sqrt and ^2 of each) it's a bit too much. I'm aware of the existence of mutate_all, but for some reason when I try to use it, it replaces all the variables rather than adding new ones.

We can use mutate_all and name the function inside funs, so that the result is added as a new column with that name as a suffix. Without the name, mutate_all replaces the original columns with the output of the function.
dataframe %>%
  mutate_all(funs(log = log(.)))
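In dplyr 1.0.0 and later, mutate_all and funs have been superseded by across. The following is a minimal sketch of the same idea with several transformations at once, assuming a recent dplyr; the data frame df and the suffixes log, sqrt and sq are illustrative choices, not part of the original question.

```r
library(dplyr)

# Small stand-in for the asker's 28-variable data frame
df <- head(iris[1:2])

# A named list of functions adds one new column per (variable, function) pair;
# .names controls how the new columns are called
out <- df %>%
  mutate(across(
    everything(),
    list(log = log, sqrt = sqrt, sq = ~ .x^2),
    .names = "{.col}_{.fn}"
  ))

names(out)
```

The original columns are kept, and each new column is named after its source column plus the function suffix.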

A base R option would be
df <- head(iris[1:2])
df[paste("log", names(df), sep = "_")] <- log(df)
df
# Sepal.Length Sepal.Width log_Sepal.Length log_Sepal.Width
#1 5.1 3.5 1.629241 1.252763
#2 4.9 3.0 1.589235 1.098612
#3 4.7 3.2 1.547563 1.163151
#4 4.6 3.1 1.526056 1.131402
#5 5.0 3.6 1.609438 1.280934
#6 5.4 3.9 1.686399 1.360977
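The base R approach can be extended to several transformations with a short loop over a named list of functions. This is only a sketch; the list fns and its names are assumptions made for illustration.

```r
df <- head(iris[1:2])
vars <- names(df)  # remember the original columns before adding new ones

# Hypothetical set of transformations, keyed by the prefix to use
fns <- list(log = log, sqrt = sqrt, sq = function(x) x^2)

for (f in names(fns)) {
  # Apply the function to each original column and store the results
  # under prefixed names such as log_Sepal.Length
  df[paste(f, vars, sep = "_")] <- lapply(df[vars], fns[[f]])
}

names(df)
```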

Related

Using distinct() with a vector of column names

I have a question about using distinct() from dplyr on a tibble/data.frame. From the documentation it is clear that you can use it by explicitly naming the columns. I have a data frame with >100 columns and want to use the function on just a subset. My intuition was to put the column names in a vector and pass it as an argument to distinct. But distinct uses only the first vector element.
Example on iris:
data(iris)
library(dplyr)
exclude.columns <- c('Species', 'Sepal.Width')
distinct_(iris, exclude.columns)
This is different from
exclude.columns <- c('Sepal.Width', 'Species')
distinct_(iris, exclude.columns)
I think distinct is not made for this operation. Another option would be to subset the data.frame then use distinct and join again with the excluded columns. But my question is if there is another option using just one function?
As suggested in my comment, you could also try:
data(iris)
library(dplyr)
exclude.columns <- c('Species', 'Sepal.Width')
distinct(iris, !!! syms(exclude.columns))
Output (first 10 rows):
Sepal.Width Species
1 3.5 setosa
2 3.0 setosa
3 3.2 setosa
4 3.1 setosa
5 3.6 setosa
6 3.9 setosa
7 3.4 setosa
8 2.9 setosa
9 3.7 setosa
10 4.0 setosa
However, that was suggested more than 2 years ago. With current dplyr, a more idiomatic approach is:
distinct(iris, across(all_of(exclude.columns)))
It is not entirely clear to me whether you would like to keep only the exclude.columns or actually exclude them; if the latter then you just put minus in front i.e. distinct(iris, across(-all_of(exclude.columns))).
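Both readings can be compared side by side. The sketch below assumes dplyr 1.0.0 or later; the variable names cols, kept and dropped are illustrative.

```r
library(dplyr)

cols <- c("Species", "Sepal.Width")

# Keep only the listed columns and return their distinct combinations
kept <- distinct(iris, across(all_of(cols)))

# Drop the listed columns and return the distinct rows of what remains
dropped <- distinct(iris, across(-all_of(cols)))

names(kept)
names(dropped)
```

Unlike distinct_ with a character vector, the result here does not depend on the order of the names in cols, apart from possibly the column order of the output.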
Your objective sounds unclear. Are you trying to get all distinct rows across all columns except $Species and $Sepal.Width? If so, that doesn't make sense.
Let's say two rows are the same in all other variables except for $Sepal.Width. Using distinct() in the way you described would throw out the second row because it was not distinct from the first. Except that it was in the column you ignored.
You need to rethink your objective and whether it makes sense.
If you are just worried about duplicate rows, then
data %>%
distinct(across(everything()))
will do the trick.

Elements of one list as arguments to a function acting on another list

I have a list of data frames, where every data frame is similar (has the same columns with the same names) but contains information on a different, related "thing" (say, species of flower). I need an elegant way to re-categorize one of the columns in all of these data frames from continuous to categorical using the function cut(). The problem is each "thing" (flower) has different cut-points and will use different labels.
I got as far as putting the cut-points and labels in a separate list. If we're following my fake example, it basically looks like this:
iris <- iris
peony <- iris #pretending that this is actually different data!
flowers <- list(iris = iris, peony = peony)
params <- list(iris_param = list(cutpoints = c(1, 4.5),
                                 labels = c("low", "medium", "high")),
               peony_param = list(cutpoints = c(1.5, 2.5, 5),
                                  labels = c("too_low", "kinda_low", "okay", "just_right")))
#And we want to cut 'Sepal.Width' on both peony and iris
I am now really stuck. I have tried using some combinations of lapply() and do.call() but I'm kind of just guessing (and guessing wrong).
More generalized, I want to know: how can I use a changing set of arguments to apply a function over different data frames in a list?
I think this is a great time for a for loop. It's straightforward to write and clear:
for (petal in seq_along(flowers)) {
  flowers[[petal]]$Sepal.Width.Cut = cut(
    x = flowers[[petal]]$Sepal.Width,
    breaks = c(-Inf, params[[petal]]$cutpoints, Inf),
    labels = params[[petal]]$labels
  )
}
Note that (a) I had to augment your breaks to make cut happy about the length of the labels, (b) really I'm just iterating 1, 2. A more robust version would possibly iterate over the names of the list and as a safety check would require the params list to have the same names. Since the names of your lists were different, I just used the indexes.
This could probably be done using mapply. I see no advantage to that - unless you're already comfortable with mapply the only real difference will be that the mapply version will take you 10 times longer to write.
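For comparison, here is a rough sketch of what a Map version (Map is a thin wrapper around mapply) might look like. It assumes the params list is renamed so that its names match those of flowers, which is not the case in the question.

```r
flowers <- list(iris = iris, peony = iris)  # peony is a stand-in, as in the question

# Assumption: params uses the same names as flowers
params <- list(
  iris  = list(cutpoints = c(1, 4.5),
               labels = c("low", "medium", "high")),
  peony = list(cutpoints = c(1.5, 2.5, 5),
               labels = c("too_low", "kinda_low", "okay", "just_right"))
)

# Safety check: both lists must describe the same things in the same order
stopifnot(identical(names(flowers), names(params)))

# Map walks over both lists in parallel, pairing each data frame with its parameters
flowers <- Map(function(df, p) {
  df$Sepal.Width.Cut <- cut(df$Sepal.Width,
                            breaks = c(-Inf, p$cutpoints, Inf),
                            labels = p$labels)
  df
}, flowers, params)

table(flowers$peony$Sepal.Width.Cut)
```

The result is again a named list of data frames, each with its own Sepal.Width.Cut factor.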
I like Gregor's solution, but I'd probably stack the data instead:
library(data.table)
# rearrange parameters
params0 = setNames(params, c("iris", "peony"))
my_params = c(list(.id = names(params0)), do.call(Map, c(list, params0)))
# stack
DT = rbindlist(flowers, id = TRUE)
# merge and make cuts
DT[my_params, Sepal.Width.Cut :=
     cut(Sepal.Width, breaks = c(-Inf, cutpoints[[1]], Inf), labels = labels[[1]]),
   on = ".id", by = .EACHI]
(I've borrowed Gregor's translation of the cutpoints.) The result is:
.id Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Width.Cut
1: iris 5.1 3.5 1.4 0.2 setosa kinda_low
2: iris 4.9 3.0 1.4 0.2 setosa kinda_low
3: iris 4.7 3.2 1.3 0.2 setosa kinda_low
4: iris 4.6 3.1 1.5 0.2 setosa kinda_low
5: iris 5.0 3.6 1.4 0.2 setosa kinda_low
---
296: peony 6.7 3.0 5.2 2.3 virginica okay
297: peony 6.3 2.5 5.0 1.9 virginica kinda_low
298: peony 6.5 3.0 5.2 2.0 virginica okay
299: peony 6.2 3.4 5.4 2.3 virginica okay
300: peony 5.9 3.0 5.1 1.8 virginica okay
I think stacked data usually make more sense than a list of data.frames. You don't need to use data.table to stack or make the cuts, but it's designed well for those tasks.
How it works.
I guess rbindlist is clear.
The code
DT[my_params, on = ".id"]
makes a merge. To see what that means, look at:
as.data.table(my_params)
# .id cutpoints labels
# 1: iris 1.0,4.5 low,medium,high
# 2: peony 1.5,2.5,5.0 too_low,kinda_low,okay,just_right
So, we're merging this table with DT by their common .id column.
When we do a merge like
DT[my_params, j, on = ".id", by=.EACHI]
this means:
1. Do the merge, matching each row of my_params with related rows of DT.
2. Do j for each row of my_params, using columns found in either of the two tables.
j in this case is of the form column_for_DT := cut(...), which makes a new column in DT.
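The same join-and-update pattern can be seen on a much smaller example. This is a sketch with made-up tables; the names DT, lookup, grp, x, mult and y are all hypothetical.

```r
library(data.table)

# Main table: several rows per group
DT <- data.table(grp = c("a", "a", "b"), x = c(1, 2, 3))

# Parameter table: one row per group
lookup <- data.table(grp = c("a", "b"), mult = c(10, 100))

# For each row of lookup, find the matching rows of DT and compute j there;
# j can use columns from both tables (x from DT, mult from lookup)
DT[lookup, y := x * mult, on = "grp", by = .EACHI]

DT
```

Here each group in DT gets its own multiplier, just as each flower gets its own cutpoints and labels above.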

Create a new (identical) data frame by sampling an existing data frame column-wise

I am trying to create a new data frame which is identical in the number of columns (but not rows) of an existing data frame. All columns are of identical type, numeric. I need to sample each column of the original data frame (n=241 samples, replace=T) and add those samples to the new data frame at the same column number as the original data frame.
My code so far:
#create the new data frame
tree.df <- data.frame(matrix(nrow=0, ncol=72))
#give same column names as original data frame (data3)
colnames(tree.df)<-colnames(data3)
#populate with NA values
tree.df[1:241,]=NA
#sample original data frame column wise and add to new data frame
for (i in colnames(data3)){
rbind(sample(data3[i], 241, replace = T),tree.df)}
The code isn't working out. Any ideas on how to get this to work?
Use the fact that a data frame is a list, and pass to lapply to perform a column-by-column operation.
Here's an example, taking 5 elements from each column in iris:
as.data.frame(lapply(iris, sample, size=5, replace=TRUE))
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.7 3.2 1.7 0.2 versicolor
## 2 5.8 3.1 1.5 1.2 setosa
## 3 6.0 3.8 4.9 1.9 virginica
## 4 4.4 2.5 5.3 0.2 versicolor
## 5 5.1 3.1 3.3 0.3 setosa
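Applied to the question, the same one-liner would take 241 draws per column. The sketch below uses a randomly generated data3 as a stand-in, since the original data are not shown.

```r
set.seed(1)

# Hypothetical stand-in for data3: 300 rows, 72 numeric columns
data3 <- as.data.frame(matrix(rnorm(300 * 72), ncol = 72))

# Sample each column independently, with replacement, 241 times
tree.df <- as.data.frame(lapply(data3, sample, size = 241, replace = TRUE))

dim(tree.df)
```

The column names carry over from data3, so there is no need to create and name an empty data frame first.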
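There are several issues here. The one most likely causing the failure is how you access a column of data3: data3[i] returns a one-column data frame rather than a vector, so sample() treats it as a list with a single element. To extract the column as a vector, use data3[, i]. Note the comma, which separates the row index from the column index. In addition, the result of rbind is never assigned to anything, so tree.df is not changed by the loop.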
Additionally, since you already know how big your data frame will be, allocate the space from the beginning:
tree.df <- data.frame(matrix(nrow = 241, ncol = 72))
tree.df is already prepopulated with missing (NA) values so you don't need to do it again. You can now rewrite your for loop as
for (i in colnames(data3)) {
  tree.df[, i] <- sample(data3[, i], 241, replace = TRUE)
}
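Notice I spelled out TRUE. This is better practice than using T, because T is an ordinary variable that can be reassigned, whereas TRUE is a reserved word. Compare:
T              # TRUE
T <- FALSE     # allowed: T is just a variable
T              # now FALSE
TRUE <- FALSE  # error: TRUE cannot be reassigned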

R: new variable in the for loop [duplicate]

Is it possible to create new variable names on the fly?
I'd like to read data frames from a list into new variables with numbers at the end. Something like orca1, orca2, orca3...
If I try something like
paste("orca",i,sep="")=list_name[[i]]
I get this error
target of assignment expands to non-language object
Is there another way around this?
Use assign:
assign(paste("orca", i, sep = ""), list_name[[i]])
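Inside a loop this might look like the following sketch, where list_name is filled with built-in data sets purely for illustration.

```r
# Hypothetical list of data frames
list_name <- list(head(iris), head(swiss), head(airquality))

for (i in seq_along(list_name)) {
  # Build the name "orca1", "orca2", ... and bind the i-th data frame to it
  assign(paste0("orca", i), list_name[[i]])
}

identical(orca1, list_name[[1]])
```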
It seems to me that you might be better off with a list rather than using orca1, orca2, etc. Then it would be orca[[1]], orca[[2]], ...
Usually you're making a set of variables differentiated by nothing but a number because that number would be a convenient way to access them later.
orca <- list()
orca[[1]] <- "Hi"
orca[[2]] <- 59
Otherwise, assign is just what you want.
Don't make data frames. Keep the list, name its elements but do not attach it.
The biggest reason for this is that if you make variables on the go, almost always you will later on have to iterate through each one of them to perform something useful. There you will again be forced to iterate through each one of the names that you have created on the fly.
It is far easier to name the elements of the list and iterate through the names.
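A small sketch of that approach, again using built-in data sets as stand-ins for the real list:

```r
# Hypothetical list of data frames
list_name <- list(head(iris), head(swiss), head(airquality))

# Name the elements instead of creating separate variables
names(list_name) <- paste0("orca", seq_along(list_name))

# Apply a function to every element; the result keeps the names
row_counts <- vapply(list_name, nrow, integer(1))

# Or iterate over the names explicitly
for (nm in names(list_name)) {
  cat(nm, "has", ncol(list_name[[nm]]), "columns\n")
}

# Individual elements stay accessible by name
head(list_name[["orca2"]], 2)
```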
As far as attach is concerned, it's really bad programming practice in R and can lead to a lot of trouble if you are not careful.
FAQ says:
If you have
varname <- c("a", "b", "d")
you can do
get(varname[1]) + 2
for
a + 2
or
assign(varname[1], 2 + 2)
for
a <- 2 + 2
So it looks like you use get when you want to evaluate an expression that uses a variable whose name is stored as a string, and assign when you want to assign a value to a variable whose name is stored as a string (assign creates the variable if it does not already exist).
Syntax for assign:
assign(x, value)
x: a variable name, given as a character string. No coercion is done, and the first element of a character vector of length greater than one will be used, with a warning.
value: value to be assigned to x.
Another tricky solution is to name elements of list and attach it:
list_name = list(
head(iris),
head(swiss),
head(airquality)
)
names(list_name) <- paste("orca", seq_along(list_name), sep="")
attach(list_name)
orca1
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
# 5 5.0 3.6 1.4 0.2 setosa
# 6 5.4 3.9 1.7 0.4 setosa
And this option?
list_name<-list()
for(i in 1:100){
paste("orca",i,sep="")->list_name[[i]]
}
It works perfectly. In the example you put, first line is missing, and then gives you the error message.

