R: Apply function on specific columns preserving the rest of the dataframe

I'd like to learn how to apply functions to specific columns of my dataframe without dropping the other columns. For example, I'd like to multiply some specific columns by 1000 and leave the others as they are.
Using the sapply function, for example like this:
a<-as.data.frame(sapply(table.xy[,1], function(x){x*1000}))
I get a new dataframe with the first column multiplied by 1000, but without the other columns that I didn't use in the operation. So my next attempt was to do it like this:
a<-as.data.frame(sapply(table.xy, function(x) if (colnames=="columnA") {x/1000} else {x}))
but this one didn't work.
My workaround was to give both dataframes an extra ID column and later merge the old dataframe with the newly created one to get a complete one. But I think there must be a better solution. Is there?

If you only want to do a computation on one or a few columns, you can use transform or simply index them manually:
# with transform:
df <- data.frame(A = 1:10, B = 1:10)
df <- transform(df, A = A*1000)
# Manually:
df <- data.frame(A = 1:10, B = 1:10)
df$A <- df$A * 1000
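The manual approach also extends to several columns at once. The following is a minimal sketch, assuming a small made-up data frame and a hypothetical selection vector called cols; assigning the lapply() result back into df[cols] changes only those columns and leaves the rest untouched.

```r
# Toy data; the column names are made up for illustration
df <- data.frame(A = 1:10, B = 1:10, C = 1:10)

# Hypothetical selection of the columns to transform
cols <- c("A", "C")

# lapply() returns a list with one element per selected column;
# assigning it back into df[cols] overwrites only those columns
df[cols] <- lapply(df[cols], function(x) x * 1000)

head(df)
```

Column B keeps its original values, and the column order is unchanged.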

The following code applies the desired function to only the columns you specify.
I'll create a simple data frame as a reproducible example.
(df <- data.frame(x = 1, y = 1:10, z=11:20))
(df <- cbind(df[1], apply(df[2:3],2, function(x){x*1000})))
Basically, use cbind() to keep the columns you don't want the function to run on, and use apply() with the desired function on the target columns.

In dplyr we would use mutate_at, in which you can select or exclude (by preceding the variable name with a "-" minus sign) specific variables.
You can just name a function
df <- df %>%
mutate_at(vars(columnA), scale)
or create your own
df <- df %>%
mutate_at(vars(columnA, columnC), function(x) x * 1000)  # put your own computation here
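Since dplyr 1.0.0, mutate_at is superseded by across() inside mutate(). The following is a sketch of the same idea, assuming dplyr 1.0.0 or later is installed and using a made-up data frame with the column names from the snippets above.

```r
library(dplyr)

# Toy data; columnA/columnB/columnC are placeholder names
df <- data.frame(columnA = 1:3, columnB = 4:6, columnC = 7:9)

# across() selects the columns to transform; all others pass through unchanged
df <- df %>%
  mutate(across(c(columnA, columnC), ~ .x * 1000))

df
```

To exclude columns instead, a negative selection such as across(-columnB, ...) should work in the same way.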

Related

Short syntax simplification with mutate and lapply

I have code that creates a new dataframe, df2 which is a copy of an existing dataframe, df but with four new columns a,b,c,d. The values of these columns are given by their own functions.
The code below works as intended but it seems repetitive. Is there a more succinct form that you would recommend?
df2 <- df %>% mutate(a = lapply(df[,c("value")], f_a),
b = lapply(df[,c("value")], f_b),
c = lapply(df[,c("value")], f_c),
d = lapply(df[,c("value")], f_d)
)
Example of cell contents in "value" column "-0.57(-0.88 to -0.26)".
I am applying a function to extract first number:
f_a <- function(x){
substring(x, 1, regexpr("\\(", x)[1] - 1)
}
This works fine when applied to a single string (-0.57 from the example). In the data frame I found that lapply gives correct values based on input from any cell in the "value" column. The code seems a bit repetitive but works.
We can use map
library(tidyverse)
df[c('a', 'b', 'c', 'd')] <- map(list(f_a, f_b, f_c, f_d), ~ lapply(df$value, .x))
Note: Without the functions or an example, it is not clear whether this is the optimal solution. Also, as noted in the comments, many of the functions can be applied directly to the column instead of looping through each element.
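To illustrate applying functions directly to the column, here is a minimal base R sketch. It assumes every cell follows the "-0.57(-0.88 to -0.26)" pattern from the question; f_a_vec and f_low are hypothetical vectorized helpers, not the asker's actual f_b, f_c or f_d, which are not shown.

```r
# Sample values in the format given in the question
df <- data.frame(value = c("-0.57(-0.88 to -0.26)", "1.20(0.90 to 1.50)"),
                 stringsAsFactors = FALSE)

# Vectorized variant of f_a: dropping the [1] lets regexpr() return
# one position per string, so the whole column is handled in one call
f_a_vec <- function(x) substring(x, 1, regexpr("\\(", x) - 1)

# Hypothetical second helper: the number between "(" and " to "
f_low <- function(x) sub("^.*\\((.*) to .*$", "\\1", x)

# A named list of functions; each name becomes a new column
funs <- list(a = f_a_vec, b = f_low)
df[names(funs)] <- lapply(funs, function(f) f(df$value))

df
```

The new columns are plain character vectors rather than list columns, so as.numeric() can be applied to them afterwards if numbers are needed.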

How can I select certain columns in a dataframe based on their number of valid values (except NA) in R?

I'm using R, and I have a dataframe with multiple columns. I want to run some code that automatically counts the valid (non-NA) values in each column. Then it should select the columns in which at least 50% of the rows hold valid values, and save them in a new dataframe.
Can anybody help me do this? Thank you very much.
Is there any way that the codes can be applied for an uncertain number of columns?
Using the purrr package, you can use the code below to compute the proportion of missing values in each column:
pct_missing <- purrr::map_dbl(df,~mean(is.na(.x)))
After that, you can select those columns that have less than 50% missing values by their names.
selected_column <- colnames(df)[pct_missing < 0.5]
To create a new dataset, you may use:
library(dplyr)
df_new <- df %>% select(one_of(selected_column))
You can also create a function in base R to automatically retrieve the columns matching the criteria:
Function:
ColSel <- function(df){
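# TRUE for columns with less than 50% missing values
vals <- apply(df, 2, function(fo) mean(is.na(fo))) < .5
# drop = FALSE keeps a data frame even if only one column qualifies
return(df[, vals, drop = FALSE])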
}
Some toy data
## example
df1 <- data.frame(
a = c(runif(19),NA),
b = c(rep(NA,11),runif(9)),
d = rep(NA,20),
e = runif(20)
)
Test
df2 <- ColSel(df1)
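The same selection can be written without apply(), because is.na() on a data frame returns a logical matrix and colMeans() then gives the share of missing values per column. This is a sketch using toy data shaped like the example above, with a fixed seed so the result is reproducible; it works for any number of columns.

```r
set.seed(42)  # only so the random toy data is reproducible

df1 <- data.frame(
  a = c(runif(19), NA),          # 1 of 20 missing
  b = c(rep(NA, 11), runif(9)),  # 11 of 20 missing
  d = rep(NA, 20),               # all missing
  e = runif(20)                  # none missing
)

# Share of NA values in each column
pct_missing <- colMeans(is.na(df1))

# Keep columns with less than 50% missing values;
# drop = FALSE guarantees the result stays a data frame
df2 <- df1[, pct_missing < 0.5, drop = FALSE]

names(df2)
```

Only columns a and e pass the threshold here, since b is 55% missing and d is entirely missing.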

using adist on two columns of data frame

I want to use adist to calculate edit distance between the values of two columns in each row.
I am using it in more-or-less this way:
A <- c("mad","car")
B <- c("mug","cat")
my_df <- data.frame(A,B)
my_df$dist <- adist(my_df$A, my_df$B, ignore.case = TRUE)
my_df <- my_df[order(dist),]
The last two lines are the same as in my real code, but the actual data frame looks a bit different: the columns of my original data frame are character type, not factor. Also, the dist column seems to be returned as a 2-column matrix, and I have no idea why.
Update:
I have read a bit and found that I need to apply it over the rows, so my new code is following:
apply(my_df, 1, function(d) adist(d[1], d[2]))
It works fine, but for my original dataset calling it by column numbers is impractical. How can I refer to column names in this function?
Using tidyverse approach, you may use the following code:
library(tidyverse)
A <- c("mad","car")
B <- c("mug","cat")
my_df <- data.frame(A,B)
my_df %>%
rowwise() %>%
mutate(Lev_dist=adist(x=A,y=B,ignore.case=TRUE))
You can overcome that problem by using mapply, which calls adist once per row on the paired values, i.e.
mapply(adist, my_df$A, my_df$B)
#[1] 2 1
As per the adist function definition, the x and y arguments should be character vectors. In your example the function returns a 2x2 matrix because it also compares the cross pairs "mad" with "cat" and "car" with "mug".
The row-wise distances you want are on the main diagonal of that matrix.
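Both suggestions can be sketched with the example data from the question. The code refers to the columns by name, and sorted_df is a hypothetical name for the ordered result.

```r
my_df <- data.frame(A = c("mad", "car"), B = c("mug", "cat"),
                    stringsAsFactors = FALSE)

# adist() on two vectors compares every element of A with every element of B
full <- adist(my_df$A, my_df$B, ignore.case = TRUE)
dim(full)  # 2 x 2

# Option 1: the row-wise distances sit on the main diagonal
d_diag <- diag(full)

# Option 2: call adist() once per row; USE.NAMES = FALSE gives a plain vector
my_df$dist <- mapply(adist, my_df$A, my_df$B, USE.NAMES = FALSE)

# Order by the new column, referring to it by name
sorted_df <- my_df[order(my_df$dist), ]
sorted_df
```

For large data frames the mapply() form avoids building the full n-by-n matrix that the diagonal approach needs.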

Data.frame row calculation

I want to compute a value for each row of my data.frame with a simple function (e.g. [sqrt(column1 * column 2)]). This is my function. I have 17 rows for which I want to evaluate the function to create a new column called d.
How can I do it? With combine? By using t(x) to convert the data.frame into a matrix? Or with some other function? I want the result to still be a data.frame (as a table).
You have several options to choose from:
Base R
df$d <- sqrt(df$column1^2 +df$column2^2)
#or
transform(df, d=sqrt(column1^2+column2^2))
Tidyverse
library(tidyverse)
df <- df %>%
mutate( d = sqrt(column1^2+column2^2))
head(df)
All these methods preserve your data as a data frame.
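The formula sqrt(column1 * column 2) written in the question follows the same pattern. A minimal sketch, assuming a made-up three-row data frame with columns named column1 and column2:

```r
# Toy data; the values are chosen so the square roots are whole numbers
df <- data.frame(column1 = c(1, 4, 9), column2 = c(4, 9, 16))

# Arithmetic on columns is vectorized, so this computes one value per row
df$d <- sqrt(df$column1 * df$column2)

df
is.data.frame(df)
```

No loop or apply call is needed, because sqrt() and * already operate element by element.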

Use apply to add multiple columns (more than 100) of random numbers or other function in R

I would like to build a function that adds many columns of random numbers (or the output of another function) to a dataframe. Here I am trying to append them to map data.
library(plyr)
add <- function(name, df){
new.df = mutate(df, name = runif(length(df[,1])))
new.df
}
The function works to add a column of data...
add("e", iris)
iris2<- add("f", iris)
The apply does not work...
I am trying to add 26 columns from the list of letters so that df$a, df$b, df$c are all random vectors.
new <- lapply(letters, add, df = tx)
What is the most efficient way to add columns from a list of column names?
I would like to later loop through all of the column names in another function.
It's not very clear to me what you want to achieve. This adds multiple columns of random numbers to a data.frame:
cbind(iris,
matrix(runif(nrow(iris)*5), ncol=5))
I don't see a reason to use an *apply function.
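If the new columns should carry the names from a character vector such as letters, one option is to assign a list of vectors into those names. This is a sketch under the assumption that iris stands in for the map data, which is not shown in the question.

```r
set.seed(1)  # only so the random columns are reproducible

df <- iris

# One vector of random numbers per name in `letters` (26 in total);
# simplify = FALSE keeps the result as a list instead of a matrix
new_cols <- replicate(length(letters), runif(nrow(df)), simplify = FALSE)

# Assigning a list into new column names creates columns a, b, ..., z
df[letters] <- new_cols

names(df)
```

Afterwards df$a, df$b and so on are ordinary numeric columns, and a later function can loop over letters to access them as df[[nm]].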
