I am trying to write some R code that takes the iris dataset and log-transforms the numeric columns according to some criterion, say if skewness > 0.2. I have tried to use ldply, but it doesn't quite give me the output I want: it returns a transposed data frame, the variable names are missing, and the non-numeric column entries are messed up.
Before posting this question I searched and found the following related topics, but they didn't quite match what I was looking for:
Selecting only numeric columns from a data frame
extract only numeric columns from data frame
Below is the code. Appreciate the help!
data(iris)
df <- iris
library(plyr)    # for ldply
library(e1071)   # for skewness
df <- ldply(names(df), function(x) {
  if (class(df[[x]]) == "numeric") {
    tmp <- df[[x]][!is.na(df[[x]])]
    if (abs(skewness(tmp)) > 0.2) {
      df[[x]] <- log10(1 + df[[x]])
    } else {
      df[[x]] <- df[[x]]
    }
  } else {
    df[[x]] <- df[[x]]
  }
  #df[[x]] <- data.frame(df[[x]])
  #df2 <- cbind(df2, df[[x]])
  #return(NULL)
})
Try with lapply:
#Skewness package
library(e1071)
lapply(iris, function(x) {
  if (is.numeric(x)) {
    if (abs(skewness(x, na.rm = TRUE)) > 0.2) log10(1 + x) else x
  } else {
    x
  }
})
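Note that lapply() returns a list. If you want a data frame back, one way (a small sketch of the same idea, assigning into the existing columns) is:
df <- iris
df[] <- lapply(df, function(x) {
  if (is.numeric(x) && abs(skewness(x, na.rm = TRUE)) > 0.2) log10(1 + x) else x
})
str(df)  # still a data.frame; Species is untouched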
We can use lapply
library(e1071)
# && short-circuits, so skewness() is never called on non-numeric columns
lapply(iris, function(x) if (is.numeric(x) && abs(skewness(x, na.rm = TRUE)) > 0.2)
  log10(1 + x) else x)
We can also loop by the columns of interest after creating a logical index
i1 <- sapply(iris, is.numeric)
i2 <- sapply(iris[i1], function(x) abs(skewness(x, na.rm = TRUE)) > 0.2)
iris[i1][i2] <- lapply(iris[i1][i2], function(x) log10(1+x))
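To see which columns the logical index flagged (and, if you keep an untouched copy, to compare skewness before and after; iris0 below is a hypothetical copy taken before the replacement):
i2                                  # TRUE for numeric columns with |skewness| > 0.2
# iris0 <- iris                     # copy taken before running the replacement above
# sapply(iris0[i1][i2], skewness)   # skewness before the log transform
# sapply(iris[i1][i2], skewness)    # skewness after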
How can you iterate over specific column names in a for loop in R? Below is the dataset I am using; the columns I want to iterate over are Wing, Weight, Culmen and Hallux (column numbers 10 to 13).
When I try to iterate, the code does not run. I need this to create a multiple-cluster data visualization.
if (!require('Stat2Data')) install.packages('Stat2Data')
library(Stat2Data)
data("Hawks")
#summary(Hawks)
for (i in 10:13(Hawks)) {
  print(Hawks$ColumnName)
}
for (i in Hawks(c("Wing", "Weight", "Culmen", "Hallux"))) {
  print(Hawks$ColumnName)
}
EDIT
After what Martin told me, this error occurs:
Error in `[.data.frame`(Hawks, , i) : undefined columns selected
This is the code I have:
if(!require('DescTools')) {
install.packages('DescTools')
library('DescTools')
}
Hawks$Wing[is.na(Hawks$Wing)] <- mean(Hawks$Wing, na.rm = TRUE)
Hawks$Weight[is.na(Hawks$Weight)] <- mean(Hawks$Weight, na.rm = TRUE)
Hawks$Culmen[is.na(Hawks$Culmen)] <- mean(Hawks$Culmen, na.rm = TRUE)
Hawks$Hallux[is.na(Hawks$Hallux)] <- mean(Hawks$Hallux, na.rm = TRUE)
# Wing parameter
n <- nrow(Hawks)  # number of rows
for (col_names in 10:13) {
  x <- matrix(Hawks[, i], 0.95 * n)
  #x <- rbind(x1, x2)
  plot(x)
  fit2 <- kmeans(x, 2)
  y_cluster2 <- fit2$cluster
  fit3 <- kmeans(x, 3)
  y_cluster3 <- fit3$cluster
  fit4 <- kmeans(x, 4)
  y_cluster4 <- fit4$cluster
}
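For reference, a minimal sketch of how such a loop is usually written, assuming the goal is simply to visit columns 10 to 13 (Wing, Weight, Culmen, Hallux) one at a time:
# loop over column indices
for (i in 10:13) {
  print(summary(Hawks[[i]]))
}
# or loop over column names
for (nm in c("Wing", "Weight", "Culmen", "Hallux")) {
  print(summary(Hawks[[nm]]))
}
Inside the loop, Hawks[[i]] (or Hawks[, i]) extracts the current column; the loop variable itself is just an index or a name, which is why printing Hawks$ColumnName does not work.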
I have a df (10 rows, 15 columns)
df<-data.frame(replicate(15,sample(0:1,10,rep=TRUE)))
I want to loop over each column, do something to each row and create a new df with the answer.
What I actually want to do is run a linear regression on each column; lm() gives me back a list for each column. I have a second df containing what I want to put into the lm:
df2 <- data.frame(replicate(2, sample(0:1, 10, rep = TRUE)))
I then want to do something like:
new_df <- data.frame()
for (i in 1:ncol(df)) {
  j <- lm(df[, i] ~ df2$X1 + df2$X2)
  temp_df <- j$residuals
  new_df[, i] <- cbind(new_df, temp_df)
}
I get the error:
Error in data.frame(..., check.names = FALSE) : arguments imply
differing number of rows: 0, 8
I have checked other similar posts, but they always seem to involve a function or something similarly complex for a newbie like me. Please help.
This can be done without loops, but for your understanding, using a loop we can do
new_df <- df
for (i in names(df)) {
  j <- lm(df[, i] ~ df2$X1 + df2$X2)
  new_df[i] <- j$residuals
}
You are initialising new_df as an empty data frame with 0 rows and 0 columns, so when you try to assign a column of residuals to it, you get an error. Instead, assign the original df to new_df (they share the same dimensions) and then use the loop above.
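A quick illustration of the failure mode:
new_df <- data.frame()   # 0 rows, 0 columns
nrow(new_df)             # 0: a 10-row column of residuals has nothing to line up against
new_df <- df             # same dimensions as df, so new_df[i] <- j$residuals works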
Update
Based on the new example
lst1 <- lapply(names(df), function(nm) {
  dat <- cbind(df[nm], df2[c('X1', 'X2')])
  lm(paste0(nm, " ~ X1 + X2"), data = dat)$residuals
})
out <- setNames(data.frame(lst1), names(df))
Also, this doesn't need any loop: with a matrix response, lm fits all the columns at once (note data = df2, so that X1 and X2 refer to the predictors in df2 rather than to columns of df)
out2 <- lm(as.matrix(df) ~ X1 + X2, data = df2)$residuals
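As a quick sanity check, the result is one residual column per response column of df:
dim(out2)
# [1] 10 15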
Old
We can do this easily without any loop
new_df <- df + 10
---
If we need a loop, it can be done with `lapply`
new_df <- df
new_df[] <- lapply(df, function(x) x + 10)
---
Or with a `for` loop
lst1 <- vector('list', ncol(df))
for(i in seq_along(df)) lst1[[i]] <- df[, i] + 10
new_df <- as.data.frame(setNames(lst1, names(df)))  # keep the original column names
data
set.seed(24)
df <- data.frame(replicate(15,sample(0:1,10,rep=TRUE)))
df2 <- data.frame(replicate(2,sample(0:1,10,rep=TRUE)))
I would do as suggested by akrun. But if you do need (or want) to loop for some reason, you can use:
df<-data.frame(replicate(15,sample(0:1,10,rep=TRUE)))
new_df <- data.frame(replicate(15, rep(NA, 10)))
for (i in 1:ncol(df)) {
  new_df[, i] <- df[, i] + 10
}
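As an optional check (a small sketch), the looped result should match the vectorised version:
all.equal(new_df, df + 10, check.attributes = FALSE)  # expected: TRUE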
When I use the aggregate function on a data.frame that contains both character and numeric columns, aggregate fails and returns only NAs for every column. How can I solve this? My first idea was to check the class of each value, but it did not work.
name <- rep(LETTERS[1:5],each=2)
feat <- paste0("Feat",name)
valuesA <- runif(10)*10
valuesB <- runif(10)*10
daf <- data.frame(ID=name,feature=feat,valueA=valuesA,valueB=valuesB, stringsAsFactors = FALSE)
aggregate(. ~ ID, data = daf, FUN = mean)
aggregate(. ~ ID, data = daf, FUN = function(x) {
  if (is.character(x)) {
    return(NA)
  } else {
    return(mean(x))
  }
})
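One common workaround (a sketch of the idea, not a full answer) is to aggregate only the numeric columns, either by naming them in the formula or by selecting them programmatically:
# name the numeric columns explicitly in the formula
aggregate(cbind(valueA, valueB) ~ ID, data = daf, FUN = mean)
# or pick out the numeric columns first
num_cols <- sapply(daf, is.numeric)
aggregate(daf[num_cols], by = list(ID = daf$ID), FUN = mean)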
pollutantmean <- function(directory, pollutant, ID = 1:332){
  files_list <- list.files("specdata", full.names = TRUE)
  dat <- data.frame()
  for (i in 1:332){
    dat <- rbind(dat, read.csv(files_list[i]))
  }
  dat_subset <- subset(dat, dat$ID == ID)
  mean(dat_subset$nitrate, na.rm = TRUE)
  mean(dat_subset$sulfate, na.rm = TRUE)
}
pollutantmean(specdata, sulfate, 1:10)
[1] 3.189369
pollutantmean(specdata, nitrate, 70:72)
[1] 3.189369
pollutantmean("specdata", "sulfate", 1:10)
[1] 3.189369
I think the issue is that dat does not exist, so we can try:
if (!exists("dat")) {
  dat <- read.csv(files_list[i])
} else {
  dat <- rbind(dat, read.csv(files_list[i]))
}
The behavior you are observing is caused by the line
dat_subset <- subset(dat, dat$ID == ID)
subset will look for any variables in the second argument within the first argument. So when you say dat$ID == ID, it looks for ID first within dat. Since dat has an element named ID, it is equivalent to saying dat$ID == dat$ID. Thus, you aren't actually getting a subset of your data; you're using the complete data set.
To illustrate, let's use the mtcars data set. This data set has 32 rows. One of the columns is named am and is made up of the values 0 and 1, indicating whether the vehicle has an automatic or manual transmission.
nrow(mtcars)
[1] 32
Let's define a vector with which to subset mtcars on am.
am <- 0
When we subset against am, we want to get only those rows in mtcars where am == 0.
nrow(subset(mtcars, mtcars$am %in% am))
[1] 32
That didn't seem to work, though. Notice that we get the same result if we use the next line
nrow(subset(mtcars, am %in% am))
[1] 32
Now let's change the name of our subsetting vector and watch what happens.
am_list <- 0
nrow(subset(mtcars, am %in% am_list))
[1] 19
In this case, subset did not find a column named am_list in mtcars, so it looked outside of mtcars to find an object called am_list. This behavior is one of the reasons behind the warning in ?subset
This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [, and in particular the non-standard evaluation of argument subset can have unanticipated consequences.
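To make that concrete, here is a small sketch contrasting the two forms; inside a function, the standard-subsetting version is the safe one:
wanted <- 1                              # hypothetical vector of am values to keep
nrow(subset(mtcars, am %in% wanted))     # non-standard evaluation, fine interactively
nrow(mtcars[mtcars$am %in% wanted, ])    # standard subsetting, safe inside functions
# both return 13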
I think you would be better off writing your function this way (I took some liberty with rewriting it)
pollutantmean <- function(directory, pollutant, ID = 1:332){
  # Get file names from directory argument
  files_list <- list.files(directory, full.names = TRUE)
  # Create a single data frame from all of the files
  dat <- lapply(files_list, read.csv)
  dat <- do.call("rbind", dat)
  # Subset the data appropriately
  dat_subset <- dat[dat$ID %in% ID, ]
  # Get the mean of the two columns, return in a list
  lapply(dat_subset[c("nitrate", "sulfate")], mean, na.rm = TRUE)
}
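If you only want the mean of the requested pollutant (the version above returns both and ignores the pollutant argument), a small variation could use it directly; this is a sketch assuming pollutant is passed as a column name such as "sulfate" or "nitrate":
pollutantmean2 <- function(directory, pollutant, ID = 1:332) {
  files_list <- list.files(directory, full.names = TRUE)
  dat <- do.call("rbind", lapply(files_list, read.csv))
  dat_subset <- dat[dat$ID %in% ID, ]
  mean(dat_subset[[pollutant]], na.rm = TRUE)  # pick the column named by the pollutant argument
}
# hypothetical call, mirroring the question:
# pollutantmean2("specdata", "sulfate", 1:10)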
I'd like to make a function that removes all outliers from a data set. I've read a lot of Stack Overflow articles about this, so I am aware of the dangers of removing outliers. Here's what I have so far:
# Remove outliers from a column
remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}
# Removes all outliers from a data set
remove_all_outliers <- function(df){
  # We only want the numeric columns
  a <- df[, sapply(df, is.numeric)]
  b <- df[, sapply(df, !is.numeric)]
  a <- lapply(a, function(x) remove_outliers(x))
  d <- merge(a, b)
  d
}
There are a few things wrong with this that I know of, but please correct me if anything could be handled better.
!is.numeric() is not a thing. How should I accomplish this?
I have also tried is.numeric == FALSE.
is.numeric() converts factors into ints. How do I prevent this?
Did I do lapply right?
Is there a better / easier way to perform the remove_outliers function than separating the data set, performing it, then merging it back with the non-numeric set?
Factors are stored as integers internally, but is.numeric() returns FALSE for them, so they won't be converted or touched here.
# Remove outliers from a column
remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}
You can replace the columns by index, so you don't need to create separate data sets. Just make sure you pass the same data to lapply; e.g., you don't want to do data[, 1:3] <- lapply(data, FUN), which I have done many times.
# Removes all outliers from a data set
remove_all_outliers1 <- function(df){
  # We only want the numeric columns
  df[, sapply(df, is.numeric)] <- lapply(df[, sapply(df, is.numeric)], remove_outliers)
  df
}
Similar to above (and slightly easier, I think), you can pass the entire data set to lapply. Also make sure not to write
data <- lapply(data, function(x) if (cond) something else anotherthing)  # replaces the data frame with a plain list
or
data[] <- lapply(data, function(x) if (cond) something)                  # missing else: non-matching columns come back NULL
which are also mistakes I have made many times.
remove_all_outliers2 <- function(df){
  df[] <- lapply(df, function(x) if (is.numeric(x)) remove_outliers(x) else x)
  df
}
## test
mt <- within(mtcars, {
  mpg <- factor(mpg)
  gear <- letters[1:2]
})
head(mt)
identical(remove_all_outliers1(mt), remove_all_outliers2(mt))
# [1] TRUE
Your ideas can work with a few minor adjustments. !is.numeric can be written as Negate(is.numeric), as the more verbose function(x) !is.numeric(x), or as !sapply(df, is.numeric) on the whole data frame. Generally, applying ! directly to a function doesn't work in R out of the box.
# Removes all outliers from a data set
remove_all_outliers <- function(df){
  # We only want the numeric columns
  ## drop = FALSE in case only one column for either
  a <- df[, sapply(df, is.numeric), drop = FALSE]
  b <- df[, sapply(df, Negate(is.numeric)), drop = FALSE]
  ## note brackets
  a[] <- lapply(a, function(x) remove_outliers(x))
  ## stack them back together, not merge
  ## you could merge if you had a unique id, one id per row
  ## then make sure the columns are returned in the original order
  d <- cbind(a, b)
  d[, names(df)]
}
identical(remove_all_outliers2(mt), remove_all_outliers(mt))
# [1] TRUE
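As a rough usage sketch, you can count how many values each column loses when the function is applied to a plain numeric data set:
colSums(is.na(remove_all_outliers(mtcars)))  # number of outliers turned into NA per column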