I have written several functions and want to apply them only to the last two columns of an input CSV file. How do I convert the last two columns to vectors and apply my functions to them?
myAvg <- function(anyVector){
  average <- sum(anyVector) / length(anyVector)
  return(average)
}
mySD <- function(anyVector){
  std_Dev <- sqrt(sum((anyVector - mean(anyVector)) ^ 2) / (length(anyVector) - 1))
  return(std_Dev)
}
myRange <- function(anyVector){
  myRange <- max(anyVector) - min(anyVector)
  return(myRange)
}
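For reference, on a toy vector these agree with base R's mean(), sd(), and diff(range()):
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
myAvg(x)    # 5, same as mean(x)
mySD(x)     # ~2.14, same as sd(x)
myRange(x)  # 7, same as diff(range(x))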
data <- read.csv("CardioGoodnessFit.csv")
print(data)
As #Mako212 suggested, this can simply be achieved by using the apply function in R:
avg <- apply(data[, c('Income', 'Miles')], MARGIN = 2, FUN = myAvg)
sdev <- apply(data[, c('Income', 'Miles')], MARGIN = 2, FUN = mySD)
The function myAvg will be applied to each column of the subset of data. The columns of interest can be specified in a vector, either by name or by column number. apply is generally used on a matrix or data.frame object, and MARGIN controls whether FUN is applied column-wise (MARGIN = 2), row-wise (MARGIN = 1), or to each element (MARGIN = c(1, 2)).
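As a quick illustration of MARGIN on a toy matrix (hypothetical data, not the CSV):
m <- matrix(1:6, nrow = 2)
apply(m, MARGIN = 2, FUN = sum)        # column sums: 3 7 11
apply(m, MARGIN = 1, FUN = sum)        # row sums: 9 12
apply(m, MARGIN = c(1, 2), FUN = sqrt) # square root of each element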
There is no need to convert to vectors (or in this case, even to write functions) if you use e.g. dplyr:
library(dplyr)
# means
data %>% summarise(avg = mean(Income))
data %>% summarise(avg = mean(Miles))
# standard deviations
data %>% summarise(sdev = sd(Income))
data %>% summarise(sdev = sd(Miles))
# range
data %>% summarise(range = max(Income) - min(Income))
data %>% summarise(range = max(Miles) - min(Miles))
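With dplyr >= 1.0.0 you can also get all three statistics for both columns in a single call via across() (a sketch under that version assumption, not part of the answer above):
data %>%
  summarise(across(c(Income, Miles),
                   list(avg = mean, sdev = sd, range = ~ max(.) - min(.))))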
Related
I was trying to figure out how to transform all values in selected columns of my dataset with dplyr, using the equation $$x_i = x_{max} - x_i$$. I'm not sure how to correctly do this for one column, let alone multiple columns. My attempt at mutating 1 column:
df1 <- df %>% mutate(column1 = replace(column1, ., x = max(column1) - x))
My x = max(column1) - x part is not literal; I just want to know how I can apply that equation to every row entry in the column. Furthermore, how can I do this for multiple columns in the same line? Any help is appreciated. Thanks!
If the goal is to replace all values across multiple columns, loop across the numeric columns and subtract each value from that column's maximum:
library(dplyr)
df <- df %>%
mutate(across(where(is.numeric), ~ max(., na.rm = TRUE) - .))
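For example, on a hypothetical two-column data frame:
df <- data.frame(a = c(1, 3, 5), b = c(10, 20, 30))
df %>% mutate(across(where(is.numeric), ~ max(., na.rm = TRUE) - .))
#   a  b
# 1 4 20
# 2 2 10
# 3 0  0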
So I have this R code that I'm using on a dataframe df where each row is a wavelength (823 rows/wavelengths) and each column is a pixel (named V1-V2554).
I have the code to normalise each reflectance value as such per each spectrum/pixel:
# Define function to find vector length
veclen <- function(vec) {
  sqrt(sum(vec^2))
}
# Find vector length for spectrum of each pixel
df_vecV6 <- df %>%
  group_by(Wavelength) %>%
  summarise(veclengthV6 = veclen(V6))
# Join new variable "veclength"
df <- df %>%
  left_join(df_vecV6, by = "Wavelength")
# Define function that returns a normalized vector
vecnorm <- function(vector) {
  vector / veclen(vector)
}
# Normalize by dividing each reflectance value by the vector’s length
df$refl_normV6 <- vecnorm(df$V6)
but I want to create a loop that does this for all 2553 columns. I started writing one but ran into problems. In this case df is finaldatat, and I wanted to create a list svec to store the vector lengths before the next steps:
for (i in 1:ncol(finaldatat)) {
  svec[[i]] <- finaldatat %>%
    # group_by(Wavelength) %>%
    summarise(x = veclen(finaldatat[, i]))
}
That first step runs, but the vector lengths that are meant to be below zero come out way above, so I already know there's a problem. Any help is appreciated!
Ideally in the final dataframe I would only have the normalised results in the same 2554x824 format.
You can use dplyr's across function to apply vecnorm to all columns from V1 to V2554.
result <- df %>%
  group_by(Wavelength) %>%
  summarise(across(V1:V2554, vecnorm))
# In older versions of dplyr, use summarise_at:
summarise_at(vars(V1:V2554), vecnorm)
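If each column is one whole spectrum and the Wavelength grouping is not actually needed, a base R sketch (my assumption, not part of the answer above) would be:
cols <- paste0("V", 1:2554)
norms <- sqrt(colSums(df[cols]^2))        # L2 norm of each pixel column
df_norm <- sweep(df[cols], 2, norms, "/") # divide each column by its norm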
This is my dataset - https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data
I've been trying to create a histogram of the busiest host IDs in the dataset.
So first I create the counts:
hostcount <- plyr::count(nycab2$host_id)
Then I try to take just the top 10:
hostcounttop <- head(arrange(hostcount, decreasing = TRUE), n = 10)
But I get this error
Error: Length of ordering vectors don't match data frame size
What's wrong with my code, given that it's the same data frame size?
arrange does not have a decreasing = TRUE argument. You can use desc() on the column name.
However, plyr has been retired and most of its functions are available in dplyr. You can do:
library(dplyr)
hostcounttop <- nycab2 %>% count(host_id) %>% slice_max(n, n = 10)
# If you have dplyr version < 1.0.0, use top_n:
# hostcounttop <- nycab2 %>% count(host_id) %>% top_n(10, n)
In base R, we can do this as:
mat <- stack(table(nycab2$host_id))
mat <- mat[order(mat$values), ]
# to get the top 10 values, use tail
tail(mat, 10)
You could try using order() instead of arrange, and subsetting the dataframe using brackets?
E.g. assuming the problem isn't with your dataframe, try:
hostcount[order(hostcount$freq, decreasing = TRUE), ][1:10, ] # freq is the count column from plyr::count
You can change the 1:10 to get as many as you want.
I have a tibble (or data frame, if you like) that is 19 columns of pure numerical data and I want to filter it down to only the rows where at least one value is above or below a threshold. I prefer a tidyverse/dplyr solution but whatever works is fine.
This is related to this question but distinct in at least two ways that I can see:
I have no identifier column (besides the row number, I suppose)
I need to subset based on the max across the current row being evaluated, not across a column
Here are attempts I've tried:
data %>% filter(max(.) < 8)
data %>% filter(max(value) < 8)
data %>% slice(which.max(.))
Here's a way that keeps rows with at least one value above the threshold. To keep rows with values below the threshold, just reverse the inequality inside any():
data %>%
  filter(apply(., 1, function(x) any(x > threshold)))
Actually, #r2evans has a better answer in the comments:
data %>%
  filter(rowSums(. > threshold) >= 1)
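A quick demonstration on hypothetical all-numeric data (the . pronoun relies on the magrittr pipe):
library(dplyr)
threshold <- 8
dat <- as.data.frame(matrix(sample(1:10, 40, replace = TRUE), ncol = 4))
dat %>% filter(rowSums(. > threshold) >= 1) # rows with at least one value > 8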
Couple more options that should scale pretty well:
library(dplyr)
# a more dplyr-y option
iris %>%
  filter_all(any_vars(. > 5))
# or taking advantage of base functions
iris %>%
  filter(do.call(pmax, as.list(.)) > 5)
Maybe there are better and more efficient ways, but these two functions should do what you need if I understood correctly. This solution assumes you have only numerical data.
1. Transpose the tibble (so you obtain a numerical matrix).
2. Use map to get the max or min by column (which is the max/min by row of the initial dataset).
3. Obtain the row indices you are looking for.
4. Filter your dataset.
# Random data -------------------------------------------------------------
library(dplyr)
library(purrr)
# replicate() returns a matrix; as.data.frame() gives it V1..V10 names
data <- as_tibble(as.data.frame(replicate(10, runif(20))))
# Thresholds to be used ----------------------------------------------------
max_threshold <- 0.9
min_threshold <- 0.1
# Lesser_max --------------------------------------------------------------
lesser_max <- function(data, max_threshold = 0.9) {
  index_max_list <- data %>%
    t() %>%
    as.data.frame() %>% # handles the unnamed matrix returned by t()
    map_dbl(max) %>%    # row-wise max of the original data
    unname()
  index_max <- index_max_list < max_threshold
  data[index_max, ]
}
# Greater_min -------------------------------------------------------------
greater_min <- function(data, min_threshold = 0.1) {
  index_min_list <- data %>%
    t() %>%
    as.data.frame() %>%
    map_dbl(min) %>%    # row-wise min of the original data
    unname()
  index_min <- index_min_list > min_threshold
  data[index_min, ]
}
# Examples ----------------------------------------------------------------
data %>%
  lesser_max(max_threshold)
data %>%
  greater_min(min_threshold)
We can use base R methods:
data[Reduce(`|`, lapply(data, `>`, threshold)), ]
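This works because lapply() returns one logical vector per column and Reduce(`|`) ORs them element-wise; spelled out, with a hypothetical threshold:
threshold <- 8                              # hypothetical threshold
per_column <- lapply(data, `>`, threshold)  # one logical vector per column
keep <- Reduce(`|`, per_column)             # TRUE where any column exceeds it
data[keep, ]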
I have a dataframe with one row per company and different variables (some numeric, others not):
data <- data.frame(id = 1:5,
                   CA = c(1200, 1500, 1550, 200, 0),
                   EBE = c(800, 50, 654, 8555, 0),
                   VA = c(6984, 6588, 633, 355, 84),
                   FBCF = c(35, 358, 358, 1331, 86),
                   name = c("qsdf", "xdwfq", "qsdf", "sqdf", "qsdfaz"),
                   weight = c(1, 5, 10, 1, 1))
I would like to summarise all numeric variables by a weighted sum. If I wanted a simple sum I would do:
data %>% summarise_if(is.numeric,sum)
but I don't see how to define a weighted sum.
I tried:
w.sum <- function(x) {sum(x*weight) %>% return()}
but without any success.
We can compute it inside funs:
data %>%
  summarise_if(is.numeric, funs(sum(. * weight)))
Note that the above applies to every column of numeric class. In this example the 'id' column is numeric too, and it probably doesn't need summarising. A better option would be summarise_at, to specify only the columns of interest:
data %>%
  summarise_at(names(.)[2:5], funs(sum(. * weight)))
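funs() has since been deprecated; with dplyr >= 1.0.0 the same weighted sums can be written with across() (a sketch under that version assumption):
data %>%
  summarise(across(CA:FBCF, ~ sum(. * weight)))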