Apply function not working in R

I created a data frame:
fy <- c(2010,2011,2012,2010,2011,2012,2010,2011,2012)
company <-c("Apple","Apple","Apple","Google","Google","Google","Microsoft","Microsoft","Microsoft")
revenue <- c(65225,108249,156508,29321,37905,50175,62484,69943,73723)
profit <- c(14013,25922,41733,8505,9737,10737,18760,23150,16978)
companiesData <- data.frame(fy, company, revenue, profit)
I am trying to create a new column using apply, but it gives an error:
companiesData$Margin<-apply(companiesData,1,function(x){(x[4]/x[3])*100})
Error in x[4]/x[3] : non-numeric argument to binary operator
Can someone please tell me what is the mistake here?

The mistake is that apply coerces its first argument to a matrix. Since companiesData has both numeric and non-numeric columns, every column is converted to character, so x[4]/x[3] is invalid: division is not defined for character data.
Solution: you don't need apply in this case.
companiesData$Margin <- 100 * companiesData$profit / companiesData$revenue
or equivalently
companiesData <- within(companiesData, Margin <- 100 * profit / revenue)
Both do what you want.
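To see the coercion for yourself, here is a minimal sketch on a cut-down, two-company version of the data frame. The object names df, m, margin_apply and margin_vec are only illustrative and do not appear in the question.

```r
# Small mixed-type data frame, mirroring the structure of companiesData
df <- data.frame(
  company = c("Apple", "Google"),
  revenue = c(65225, 29321),
  profit  = c(14013, 8505)
)

# apply() converts its input with as.matrix(); a single character column
# turns every cell of the matrix into a character string
m <- as.matrix(df)
typeof(m)   # "character"

# If you do want apply(), restrict it to the numeric columns first
margin_apply <- apply(df[, c("revenue", "profit")], 1,
                      function(x) 100 * x["profit"] / x["revenue"])

# The vectorised form gives the same numbers without apply()
margin_vec <- 100 * df$profit / df$revenue
```

Restricting apply to numeric columns works, but the vectorised form is shorter and avoids the matrix conversion altogether.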

Related

Trying to write a loop that makes a calculation for every vector in a dataframe

I am trying to write a loop that will perform a calculation on every value of every vector in a dataframe. Essentially, I am trying to standardize the values in the dataframe. I am trying to find the mean of each vector. Then I subtract that mean from the individual data values in each vector. Then I want to divide the difference (data value subtract mean of vector) by the standard deviation of the vector.
The expected result is that the mean is 0 and the standard deviation 1 for every individual vector in the dataframe.
I tried using this code:
for(i in colnames(metabolites)) {
metabolites<-metabolites %>%
(i-(mean(i)))/sd(i)
}
But it returns this error:
> for(i in colnames(metabolites)) {
+ metabolites<-metabolites %>%
+ (i-(mean(i)))/sd(i)
+ }
Error in i - (mean(i)) : non-numeric argument to binary operator
In addition: Warning message:
In mean.default(i) : argument is not numeric or logical: returning NA
I tried writing the loop a couple of different ways. I expected it to produce a standardized dataset in which every vector has a mean of 0 and a standard deviation of 1.
The issue is that in the for-loop, i takes the value of each column name in turn, which is a character string. So you are passing the name of the column to mean, not the column values, hence the error "non-numeric argument".
Column values are accessed using metabolites[, i], so something like this should work:
for (i in colnames(metabolites)) {
  metabolites[, i] <- (metabolites[, i] - mean(metabolites[, i])) / sd(metabolites[, i])
}
You may also want to look at the scale function, or dplyr::mutate as a way to alter column values.
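As a rough illustration of the scale route, the sketch below uses a small made-up data frame in place of the real metabolites data and assumes every column is numeric. The names standardized and standardized2 are hypothetical.

```r
# Hypothetical stand-in for the metabolites data frame (all columns numeric)
metabolites <- data.frame(
  m1 = c(1.2, 3.4, 2.2, 5.1),
  m2 = c(10, 14, 9, 21)
)

# scale() centres each column on its mean and divides by its standard
# deviation; it returns a matrix, so convert back to a data frame
standardized <- as.data.frame(scale(metabolites))

# Equivalent column-by-column version without an explicit for-loop
standardized2 <- as.data.frame(
  lapply(metabolites, function(col) (col - mean(col)) / sd(col))
)

colMeans(standardized)     # approximately 0 for every column
sapply(standardized, sd)   # 1 for every column
```

If the data frame also contains non-numeric columns, they would need to be excluded before calling scale.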

passing multiple column to function for fixing outlier

summary(standard_airline)
#outlier treatment
outFix <- function(x){
  quant <- quantile(x,probs = c(.25,.75))
  h <- 1.5*IQR(x,na.rm = T)
  x[x<(quant[1]-h)] <- quant[1]
  x[x>(quant[2]+h)] <- quant[2]
}
v <- colnames(airline[,-1])
data2 <- lapply(v,outFix)
Error - Error in (1 - h) * qs[i] : non-numeric argument to binary operator
I couldn't work out what causes this error, although the logic seems right. Is there any way in R to pass multiple columns of a dataset to a particular function? Here I want to pass every column except ID to fix the outliers.
Problem
The issue you are encountering is that v is a character vector of column names, while your function outFix expects a numeric vector. So what your lapply code is actually doing is something like this: outFix("Balance"). It tries to compute quantiles and IQRs on a string, which is why you get the error.
quantile("Balance")
Error in (1 - h) * qs[i] : non-numeric argument to binary operator
Solutions
In the following code replace df with airline for your specific data.
In base R:
df[,-1] <- lapply(df[, -1], function(x) outFix(as.numeric(x))) # exclude first column
Or using your code:
df[, v] <- lapply(df[, v], function(x) outFix(as.numeric(x)))
Using dplyr you can apply your function to every column except ID with:
library(dplyr)
df %>%
  dplyr::mutate_at(dplyr::vars(-ID), ~ outFix(as.numeric(.))) # remove ID by name
df %>%
  dplyr::mutate_at(-1, ~ outFix(as.numeric(.))) # remove ID by column position
This makes sure that all your columns are numeric before being passed to your function outFix.
If you're certain ahead of time that all of your columns are numeric, then you don't need the as.numeric call, but it can be a useful safeguard.
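One further point about outFix as posted in the question: it never returns x. In R a function returns its last evaluated expression, and here that is an assignment whose value is quant[2], so each column would end up replaced by a single number. A possible revision is sketched below. The added na.rm = TRUE in quantile, the explicit NA guards, and the example data frame are assumptions made for illustration and are not part of the original post.

```r
# Variant of outFix that returns the modified vector and tolerates NAs
outFix <- function(x) {
  quant <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  h <- 1.5 * IQR(x, na.rm = TRUE)
  lower <- quant[1] - h
  upper <- quant[2] + h
  # Same replacement values as in the question: the quartiles themselves
  x[!is.na(x) & x < lower] <- quant[1]
  x[!is.na(x) & x > upper] <- quant[2]
  x   # return the whole vector explicitly
}

# Hypothetical example data: an ID column plus two numeric columns
df <- data.frame(ID = 1:6,
                 Balance = c(10, 12, 11, 13, 12, 500),
                 Miles   = c(5, 6, 5, 7, 6, 6))

# Apply to every column except the first and write the results back
df[, -1] <- lapply(df[, -1], outFix)
```

With this version each column keeps its original length, and only the values outside the 1.5 * IQR fences are changed.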

Creation of summary variable in R with values of "substitution" variable if focal variable is NA

I have a question. I have three "Nationality" variables in my dataset (as three separate columns):
"Nationality_birth",
"Nationality_now"
"Nationality_difference" (this variable has a value, if "Nationality_birth" and "Nationality_now" are diverging - indicating which of the two nationalities someone feels more detached to -, otherwise NA)
Now I want to create a fourth variable "Nationality", which is based on "Nationality_difference" (focal variable), but has the values of "Nationality_birth" (substitute variable) if "Nationality_difference" is NA.
I tried the following code:
data$Nationality <- data$Nationality_difference
data$Nationality[is.na(data$Nationality_difference)] <- data$Nationality_birth
I get the following error:
Error in data$Nationality[is.na(data$Nationality_difference)] <- data$Nationality_birth :
replacement has length zero
What am I missing?
Thanks a lot in advance!
You can try:
data$Nationality <-
  ifelse(!is.na(data$Nationality_difference),
         data$Nationality_difference,
         data$Nationality_birth)
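The ifelse form works because all three arguments have the same length. The indexed assignment from the question can also work once the right-hand side is subset with the same logical mask. The "replacement has length zero" message generally means the right-hand side is NULL or empty, which would happen, for instance, if the column name were misspelled; that is only a guess here, since the data is not shown. The sketch below uses made-up data with character columns. With factor columns, ifelse would return integer codes, so converting with as.character first would be advisable.

```r
# Hypothetical data; the columns are character vectors, not factors
data <- data.frame(
  Nationality_birth      = c("DE", "FR", "IT"),
  Nationality_now        = c("DE", "US", "IT"),
  Nationality_difference = c(NA, "US", NA),
  stringsAsFactors = FALSE
)

# Start from the focal variable, then fill its NAs from the substitute,
# subsetting BOTH sides with the same logical mask
missing <- is.na(data$Nationality_difference)
data$Nationality <- data$Nationality_difference
data$Nationality[missing] <- data$Nationality_birth[missing]

# A misspelled column name yields NULL, which has length zero
is.null(data$Nationality_brith)   # TRUE
```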

Error in seq.default(1, 1, length.out = nrow(x)) : argument 'length.out' must be of length 1

I am trying to write a simple function that finds outliers and marks each observation with valid.obs=1 if it is not an outlier, or valid.obs=0 if it is an outlier.
For example, for the variable "income", outliers are identified with the following rule:
If income>=(99percentile(income)+standard_deviation(income)), then it is an outlier.
If income<(99percentile(income)+standard_deviation(income)), then it is not an outlier.
rem= function(x){
  u=quantile(x,probs=0.99,na.rm=TRUE) #calculating the 99th percentile
  s=sapply(x,sd,na.rm=TRUE) #calculating the standard deviation
  uc=u+s
  v=seq(1,1,length.out = nrow(x))
  v[x>=uc]=0
  v[x<uc]=1
  x$valid.obs=v
  return(x)
}
I go on to apply this function to a single column of a dataframe. The dataframe has 132 variables with 5000 entries. I choose the variable "income"
apply(data["income"],2,rem)
It, then shows the error:
Error in seq.default(1, 1, length.out = nrow(x)) :
argument 'length.out' must be of length 1
Outside the function "rem", the following code works just fine:
nrow(data["income"])
[1] 5000
I am new to R and there aren't many functions in my armoury yet. The objective of this function is very simple. Please let me know why this error has crept in and whether there is an easier way to go about this.
Use
v = rep(1, length.out = length(x))
apply iterates over the "margins" (rows or columns) of a data frame and passes each one to FUN as a vector. A vector has a length but no row count,
i.e. inside rem you are effectively calling
> nrow(c(1,2,3))
NULL
A few other things not directly related to your error:
For the same reason as above, there is no need to call sd inside sapply. Just call it normally on the vector.
s=sd(x,na.rm=TRUE) #calculating the standard deviation
You can also simplify three lines (and remove your initial problem entirely) by using
v=as.numeric(x<uc)
This creates a logical vector (automatically the same length as x) with TRUE/FALSE values based on x < uc. To get your 0s and 1s, just coerce the logical values with as.numeric.
Finally, if all you need is to add one column to data based on the values in income, return v instead of x and call the function like so:
data$valid.obs <- rem(data$income)
Your function will now return a vector, which is added to data as the new column valid.obs.
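Putting these suggestions together, a revised rem might look like the sketch below. The income values are simulated purely for illustration and are not the asker's data.

```r
# Returns 1 for valid observations and 0 for outliers, using the rule
# from the question: outlier if x >= 99th percentile + standard deviation
rem <- function(x) {
  u  <- quantile(x, probs = 0.99, na.rm = TRUE)  # 99th percentile
  s  <- sd(x, na.rm = TRUE)                      # standard deviation
  uc <- u + s                                    # cut-off value
  as.numeric(x < uc)                             # NA inputs stay NA
}

# Hypothetical income data with one extreme value in the last row
set.seed(1)
data <- data.frame(income = c(rnorm(99, mean = 50000, sd = 5000), 1e6))

# Call the function on the column itself, not through apply()
data$valid.obs <- rem(data$income)
table(data$valid.obs)
```

Because the function now takes and returns a plain vector, there is no need for apply or for nrow inside it.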

How to perform RMSE with missing values?

I have a dataset with 679 rows and 16 columns, about 30% of which are missing values. So I decided to impute these missing values with the function impute.knn from the package impute, and I got a dataset with 679 rows and 16 columns but without the missing values.
But now I want to check the accuracy using the RMSE and I tried 2 options:
load the package hydroGOF and apply the rmse function
sqrt(mean((obs - sim)^2, na.rm = TRUE))
In both cases I get the error: Error in sim - obs : non-numeric argument to binary operator.
This is happening because the original data set contains an NA value (some values are missing).
How can I calculate the RMSE if I remove the missing values? Then obs and sim will have different sizes.
How about simply...
sqrt( sum( (df$model - df$measure)^2 , na.rm = TRUE ) / nrow(df) )
This obviously assumes your data frame is called df, and you have to decide on your N: nrow(df) includes the two rows with missing data. Do you want to exclude these from the N observations? I'd guess yes, so instead of nrow(df) you probably want to use sum( !is.na(df$measure) ) or, following @Joshua, just
sqrt( mean( (df$model-df$measure)^2 , na.rm = TRUE ) )
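The same idea can be wrapped in a small helper that drops a pair whenever either value is missing. This is only a sketch: rmse_complete is a hypothetical name, and the vectors are made up. Note also that NA values on their own produce NA rather than a "non-numeric argument" error, so that message suggests the inputs are not plain numeric vectors (for example, a data frame with a character column).

```r
# RMSE over the pairs where both values are present
rmse_complete <- function(sim, obs) {
  keep <- !is.na(sim) & !is.na(obs)   # drop a pair if either side is NA
  sqrt(mean((sim[keep] - obs[keep])^2))
}

# Hypothetical observed and simulated (imputed) vectors
obs <- c(1.0, 2.0, NA, 4.0)
sim <- c(1.5, 2.0, 3.0, 3.0)

rmse_complete(sim, obs)

# Same result as the one-liner with na.rm = TRUE
sqrt(mean((sim - obs)^2, na.rm = TRUE))
```

Here N is the number of complete pairs, which matches the behaviour of mean with na.rm = TRUE.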
The rmse() function in R package hydroGOF has an NA-remove parameter:
# require(hydroGOF)
rmse(sim, obs, na.rm=TRUE, ...)
which, according to the documentation, does the expected when na.rm is TRUE:
"When an ’NA’ value is found at the i-th position in obs OR sim, the i-th value
of obs AND sim are removed before the computation."
Without a minimal reproducible example, it's hard to say why that didn't work for you.
If you want to eliminate the missing values before you input to the hydroGOF::rmse() function, you could do:
my.rmse <- rmse(df.sim[rownames(df.obs[!is.na(df.obs$col_with_missing_data),]),]
, df.obs[!is.na(df.obs$col_with_missing_data),])
assuming you have the "simulated" (imputed) and "observed" (original) data sets in different data frames named df.sim and df.obs, respectively, that were created from the same original data frame so have the same dimensions and row names.
Here is a canonical way to do the same thing if you have more than one column with missing data:
rows.wout.missing.values <- with(df.obs, rownames(df.obs[!is.na(col_with_missing_data1) & !is.na(col_with_missing_data2) & !is.na(col_with_missing_data3),]))
my.rmse <- rmse(df.sim[rows.wout.missing.values,], df.obs[rows.wout.missing.values,])
