This question already has answers here:
Standardize data columns in R
(16 answers)
Scale only certain columns R [closed]
(1 answer)
Closed 3 years ago.
Hey awesome community,
I am trying to learn how to use loops to loop through aspects of a dataset. I'm using the sns data set provided free for machine learning and trying to run a k-means cluster analysis. The first thing I need to do is center and scale the variables. I want to do this using a loop, and I need to select all but the first four variables in the data set. Here's what I tried, and I'm not sure why it doesn't work:
for(i in names(sns.nona[, -c(1:4)])){
scale(i, center = TRUE, scale = TRUE)
}
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
I get the above error, which must mean it's not selecting the actual column of the data set, just the name. I guess I should expect that, but how do I make it reference the data?
edit: I also tried:
for(i in names(sns.nona)[-c(1:4)]){
scale(sns.nona[,i], center = TRUE, scale = TRUE)
}
This did not return an error, but it does not appear to be centering the data. I should get some negative values if the original value was 0, as I'd be subtracting the column mean from it...
A way to do this avoiding writing a loop:
scale(data[-1:-4])
Also, if you want to modify the selected columns in place, without creating a new data frame:
data[-1:-4] <- lapply(data[-1:-4], scale)
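A quick way to confirm that the scaling took effect is to check that every scaled column ends up with mean 0 and standard deviation 1. This is only a sketch: the data frame below is a made-up stand-in for the asker's data (its column names are assumptions), and `as.numeric()` is added so each column stays a plain vector rather than the one-column matrix that `scale()` returns.

```r
# Hypothetical stand-in for the real data: four descriptive columns, then counts
data <- data.frame(
  id = 1:6, year = 2006L, gender = c("F", "M", "F", "F", "M", "F"), age = 17:22,
  football = c(0, 1, 0, 3, 0, 2),
  music    = c(5, 0, 0, 1, 2, 0)
)

# Center and scale everything except the first four columns, in place
data[-(1:4)] <- lapply(data[-(1:4)], function(x) as.numeric(scale(x)))

# Each scaled column should now have mean ~0 and standard deviation 1
col_means <- sapply(data[-(1:4)], mean)
col_sds   <- sapply(data[-(1:4)], sd)
print(round(col_means, 10))
print(col_sds)
```

Original zeros become negative after centering, which is the behaviour the question expected to see.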
You could use the tidyverse family of packages, which is what I use for pretty much everything I do in R.
It's never too early to start using them imo.
library(tidyverse)
# Convert sns.nona to a tibble (a robust data format which we can do cool stuff to)
sns.nona = as_tibble(sns.nona)
# Do cool stuff: mutate_at("columns to change", "function to apply to columns")
# as.numeric() drops the one-column matrix wrapper that scale() returns
sns.nona = sns.nona %>%
mutate_at(5:ncol(sns.nona), function(x) as.numeric(scale(x, center = TRUE, scale = TRUE)))
NB: don't be alarmed by the %>%. Basically, x %>% f(y, z) is equivalent to f(x, y, z).
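To make the pipe equivalence concrete, here is a small sketch with a throwaway vector (the values are arbitrary). It loads magrittr directly, which is where `%>%` comes from; loading dplyr or tidyverse makes it available as well.

```r
library(magrittr)  # provides %>%

x <- c(4, 9, 16)

# The left-hand side becomes the first argument of the call on the right
piped_head  <- x %>% head(2)
nested_head <- head(x, 2)

# Chained pipes read left to right; nested calls read inside out
piped_sum  <- x %>% sqrt() %>% sum()
nested_sum <- sum(sqrt(x))

print(identical(piped_head, nested_head))
print(identical(piped_sum, nested_sum))
```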
You might need to assign the result back after applying scale
for(i in names(df)[-(1:4)]){
df[, i] <- scale(df[,i], center = TRUE, scale = TRUE)
}
Or with lapply you could do
df[-(1:4)] <- lapply(df[-(1:4)], scale, center = TRUE, scale = TRUE)
and with dplyr we can use mutate_at
library(dplyr)
df %>% mutate_at(-(1:4), scale, center = TRUE, scale = TRUE)
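In dplyr 1.0.0 and later, `mutate_at()` is superseded by `across()`, so the same idea can also be written as below. This is a sketch on a made-up data frame (the column names are assumptions), with `as.numeric()` added so the result is a plain numeric column rather than a one-column matrix.

```r
library(dplyr)

# Hypothetical data: four columns to leave alone, two to scale
df <- data.frame(
  id = 1:5, a = letters[1:5], b = 5:1, c = c(TRUE, FALSE, TRUE, TRUE, FALSE),
  v1 = c(0, 0, 1, 4, 5),
  v2 = c(10, 20, 30, 40, 50)
)

# Scale every column except the first four
scaled <- df %>%
  mutate(across(-(1:4), ~ as.numeric(scale(.x))))

print(scaled)
```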
tl;dr I need to flag whether a promotion was on or not, based on drops (or not) in price over time. I am open to alternative approaches.
I have a data frame of prices split across several grouping factors over time. My goal is, for each 'Item' in each 'Store', to check the mode of the 'Price' for the past 7 dates (if they exist). If the value of the observation is less than 10% of the mode of price, then the 'Promotion' column should be populated with a 1, if not a 0.
EXAMPLE DATA
dat <- data.frame(Date = sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by="day"), 10),
Item = rep(LETTERS[1:4], times = 10),
Store = as.factor(sample(rep(c("NY","SYD","LON","PAR"), each = 10))),
Price = rnorm(n = 40, mean = 2.5, sd = 1))
So far I have used dplyr's group_split to break out item and store groupings into separate data frames to capture all the conditions. What I believe I need to do now is mutate the new column using an ifelse statement with rollapply. I have so far attempted to use the following line of code...
data %>% mutate(Promotion = ifelse(rollapply(Price, 7, Mode <= Price*0.91,1,0)))
this returns an error statement...
Error: Problem with `mutate()` input `PRMT_IND2`.
x comparison (5) is possible only for atomic and list types
i Input `PRMT_IND2` is `ifelse(...)`.
I am not really sure where to go from here. If you have time I would also appreciate it if you could tell me how to apply this across all the groups created by the group_split, and how to stitch this back together.
Note: observations (dates/rows) are not even across shops, and some shops have fewer than 7 days of data. I can remove these if the rolling apply will not work without it, but that loses quite a chunk of data.
I am using this function for the Mode...
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
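For reference, this is how the Mode helper behaves on two small made-up price vectors. The sketch shows that when values repeat it returns the most frequent one, and when every value is unique (as is typical for prices drawn with rnorm) it simply returns the first element.

```r
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Repeated values: the most frequent price wins
repeated <- c(2.5, 2.5, 1.9, 2.5, 2.0)
print(Mode(repeated))

# All values unique: every count is 1, so which.max() picks the first one
all_unique <- c(2.31, 1.87, 2.94)
print(Mode(all_unique))
```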
Maybe you can use rolling mean instead of mode.
library(dplyr)
library(zoo)
dat %>%
group_by(Item, Store) %>%
mutate(Promotion = as.integer(abs((Price -
rollmeanr(Price, 7, fill = NA))/Price) > 0.1))
This gives NA for the first 6 values in each group, then 1 if Price differs by more than 10% (relative to Price) from the mean of the 7-observation window ending at the current row, and 0 otherwise. Also note that we take the absolute value here, so it gives 1 whether the price increases or decreases by more than 10%.
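As Ronah Shak pointed out, the Mode function does not seem like the most appropriate choice.
Also, note that with continuous prices like those in your example almost every value is unique, so tabulate counts each one once and Mode simply returns the first value in the window, which may be problematic for the values you have.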
Regarding the error, as you correctly guessed, the problem was that your split data does not always have 7 dates, so the rollapply function with width = 7 returned an error.
Allowing your function to use the length of the Date vector, or 7 if available, solves the issue.
Also, you can just apply your function using group_by; splitting the data is not necessary.
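library(dplyr)
library(zoo)
dat %>%
group_by(Store,Item)%>%
# fill = NA and align = "right" keep the result the same length as the group
mutate(price_check = Price*0.91,
Promotion = ifelse(rollapply(Price, width = min(length(Date),7), Mode, fill = NA, align = "right")>=price_check,1,0))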
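A variant that avoids dropping short groups is to let the window shrink at the start of each group with `partial = TRUE`. The sketch below makes two assumptions that are not stated in the question: a promotion means the price is at least 10% below the rolling mode, and the window includes the current observation. The toy data is invented so the result can be checked by eye.

```r
library(dplyr)
library(zoo)

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Hypothetical toy data: one item in one store, regular price 2.5 with two dips
toy <- data.frame(
  Date  = as.Date("1999-01-01") + 0:9,
  Item  = "A",
  Store = "NY",
  Price = c(2.5, 2.5, 2.5, 2.0, 2.5, 2.5, 2.5, 2.5, 1.9, 2.5)
)

promo <- toy %>%
  arrange(Item, Store, Date) %>%
  group_by(Item, Store) %>%
  mutate(
    # Right-aligned window of up to 7 rows; shorter windows are allowed early on
    RollMode  = rollapplyr(Price, width = 7, FUN = Mode, partial = TRUE),
    # Assumed rule: flag prices at least 10% below the rolling mode
    Promotion = as.integer(Price <= 0.9 * RollMode)
  ) %>%
  ungroup()

print(promo)
```

Because the grouping stays inside one pipeline, there is nothing to stitch back together afterwards.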
I have a data set with Air Quality Data. The Data Frame is a matrix of 153 rows and 5 columns.
I want to find the mean of the first column in this Data Frame.
There are missing values in the column, so I want to exclude those while finding the mean.
And finally, I want to do that using control structures (for loops and if-else statements).
I have tried writing code as seen below. I have created 'y' instead of the actual Air Quality data set to have a reproducible example.
y <- c(1,2,3,NA,5,6,NA,NA,9,10,11,NA,13,NA,15)
x <- matrix(y,nrow=15)
for(i in 1:15){
if(is.na(data.frame[i,1]) == FALSE){
New.Vec <- c(x[i,1])
}
}
print(mean(New.Vec))
I expected the output to be the mean. Though the error I received is this:
Error: object 'New.Vec' not found
One line of code, no need for for loop.
mean(data.frame$name_of_the_first_column, na.rm = TRUE)
Setting na.rm = TRUE makes the mean function ignore NAs.
Here, we can make use of na.aggregate from zoo
library(zoo)
df1[] <- na.aggregate(df1)
This assumes that 'df1' is a data.frame with all numeric columns and that you want to fill the NA elements with the mean of the corresponding column. By default, na.aggregate uses FUN = mean.
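The same idea (replace each NA with its column mean) can be sketched in base R without zoo. This is an illustrative equivalent on a made-up two-column data frame, not the na.aggregate implementation itself.

```r
# Hypothetical numeric data frame with a missing value in each column
df1 <- data.frame(a = c(1, NA, 3), b = c(NA, 4, 6))

# Replace each NA with the mean of the non-missing values in its column
df1[] <- lapply(df1, function(col) {
  col[is.na(col)] <- mean(col, na.rm = TRUE)
  col
})

print(df1)
```

Note that this fills in values rather than skipping them, so it answers a slightly different need than computing a mean that ignores NAs.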
I can't see your data, but probably like this? The vector needed to be initialized. It's better to avoid loops in R when you can...
myDataFrame <- read.csv("hw1_data.csv")
New.Vec <- c()
for(i in 1:153){
if(!is.na(myDataFrame[i,1])){
New.Vec <- c(New.Vec, myDataFrame[i,1])
}
}
print(mean(New.Vec))
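If the exercise requires control structures, another option is to keep a running total and a count instead of growing a vector. The sketch below reuses the y vector from the question and compares the loop result with mean(..., na.rm = TRUE); the names total, count and loop_mean are just illustrative.

```r
y <- c(1, 2, 3, NA, 5, 6, NA, NA, 9, 10, 11, NA, 13, NA, 15)
x <- matrix(y, nrow = 15)

total <- 0  # running sum of the non-missing values
count <- 0  # how many non-missing values were seen

for (i in seq_len(nrow(x))) {
  if (!is.na(x[i, 1])) {
    total <- total + x[i, 1]
    count <- count + 1
  }
}

loop_mean <- total / count
print(loop_mean)                   # 7.5
print(mean(x[, 1], na.rm = TRUE))  # same value
```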
I have data that has multiple sequences that I'd like to replace by sampling from another data frame. In my head, it would work something like
x = seq(1,100, 0.5)
sample_set = rnorm(20,1,1)
# here I want to replace certain values in x and replace them with values sampled from the normal distribution
x[c(2:5,30:32,50:56),1] = sample(sample_set, length([c(2:5,30:32,50:56)]), replace = TRUE)
In my data, this replacement only works for the first sequence specified in
x[c(2:5,30:32,50:56),1] # i.e. items 2:5
I've explored recode() and several other options, but nothing has completed the replacement at all locations. Thanks in advance! I'm probably overthinking this...
You have some inconsistencies in the way you refer to x. First you declare it as a one-dimensional object and then you refer to it as a matrix. I believe if you fix that and remove the dat from inside the sample function, everything works as you described:
x[c(2:5,30:32,50:56)] = sample(sample_set, length(c(2:5,30:32,50:56)), replace = TRUE)
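A self-contained check of that fix, storing the positions once in a helper variable (idx is a name introduced here for illustration) so the index and the sample size cannot drift apart:

```r
set.seed(42)  # only so the sketch is reproducible

x <- seq(1, 100, 0.5)
sample_set <- rnorm(20, 1, 1)

# Positions to overwrite, defined once and reused
idx <- c(2:5, 30:32, 50:56)

x_new <- x
x_new[idx] <- sample(sample_set, length(idx), replace = TRUE)

# Every targeted position now holds a sampled value; the rest are untouched
print(all(x_new[idx] %in% sample_set))
print(identical(x_new[-idx], x[-idx]))
```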
I have a large data set I am attempting to sample rows from. Each row has a family ID, and there may be one or multiple rows for each family ID. I want to parse the data set by randomly sampling one row for each family ID. I have attempted to accomplish this by using both tapply() and split() + lapply() functions, but to no avail. Below is code that reproduces my issue - the size and scope of the factor levels and data entries mirror the data set I am working with.
set.seed(63)
f1 <- factor(c(rep(30000:32000, times=1),
rep(30500:31700, times = 2),
rep(30900:31900, times = 3)))
f2 <- factor(rep(sample(1:7, replace = TRUE), times = length(f1)/7))
x1 <- round(matrix(rnorm(length(f1)*300), nrow = length(f1), ncol = 300),3)
df <- data.frame(f1, f2, x1)
Next, I used tapply to sample one row per factor from f1, and then check for repeats. (f2 is a secondary factor that indexes another aspect of the observations, but is [hopefully] irrelevant here; I only include it for full disclosure of the structure of my data set).
s1 <- tapply(1:nrow(df), df$f1, sample, size=1)
any(duplicated(s1))
The output for the second line of code using duplicated is TRUE, which means there are repeats. Stumped, I tried split to see if that was the problem.
df.split <- split(1:nrow(df), df$f1)
any(duplicated(df.split))
The output here for duplicated is FALSE, so the problem is not split. I then used the output df.split with lapply and sample to see if the problem was with tapply.
df.unique <- unlist(lapply(df.split, sample, size = 1, replace = FALSE,
prob = NULL))
any(duplicated(df.unique))
In the first line, I sampled one value from each element of df.split which outputs a list, then I used unlist to convert into a vector. The output for duplicated here is also TRUE.
Somewhere within sample and lapply there is funky stuff going on (since tapply merely calls lapply). I'm not sure how to fix the issue (I searched SO and Google and found nothing related to my issue), so any help would be greatly appreciated!
EDIT: I'm hoping someone could tell me why the above code using tapply and lapply is not working as intended. Arthur has provided a nice answer, and I have coded a loop for sample as well. I'm wondering why the above code is misbehaving.
I would do that:
library(data.table)
data.table(df)[,.SD[sample(.N,1)],by='f1']
... but actually your original approach with tapply is faster if you just want an index and not the actual subset table. However, note that sample(n) actually samples from 1:n when length(n) == 1. See ?sample. This version is error-proof:
s1 <- tapply(1:nrow(df), list(df$f1), function(v) v[sample(1:length(v), 1)])
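This is also why the original code produced duplicates: a family with a single row hands sample() a length-one vector, so a row index is drawn from 1:n instead of being returned as is, and it can collide with another family's row. A minimal sketch of the pitfall, using a made-up single index of 10:

```r
set.seed(1)  # only so the sketch is reproducible

v <- 10L  # a "group" that contains exactly one row index

# Pitfall: sample(10, 1) draws from 1:10, not from the single value 10
unsafe <- replicate(200, sample(v, size = 1))

# Safe: sample a position within v, then index v with it
safe <- replicate(200, v[sample.int(length(v), 1)])

print(range(unsafe))
print(unique(safe))
```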
This question already has answers here:
Adaptive moving average - top performance in R
(3 answers)
Closed 8 years ago.
Say I have two columns in a dataframe/data.table, one the level and the other one volume. I want to compute a rolling average of the level, weighted by volume, so volume acts as the weight (normalized to 1) for some rolling window.
Base R has a weighted.mean() function which does a similar calculation for two static vectors. I tried using sapply to pass a list/vector of arguments to it and create a rolling series, but to no avail.
Which "apply" mechanism should I use with weighted.mean() to get the desired result, or would I have to loop/write my own function?
////////////////////////////////////////////////////////////////////////////////////////
P.S. In the end I settled on writing a simple custom function, which utilizes the great RcppRoll package. I found RcppRoll to be wicked fast, much faster than other rolling methods, which is important to me, as my data is several million rows.
The code for the function looks like this (I've added some NAs at the beginning since RcppRoll returns data without NAs):
require(RcppRoll)
my.rollmean.weighted <- function(vec1,vec2,width){
return(c(rep(NA,width-1),roll_sum(vec1*vec2,width)/roll_sum(vec2,width)))
}
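For checking results on small inputs without RcppRoll, the same right-aligned quantity can be sketched in base R by calling weighted.mean() on each window through sapply. The function name roll_wmean and the sample numbers are made up for illustration, and this version will be far slower on millions of rows.

```r
# Rolling volume-weighted mean, right-aligned, NA until a full window exists
roll_wmean <- function(level, volume, width) {
  n <- length(level)
  out <- rep(NA_real_, n)
  if (n < width) return(out)
  out[width:n] <- sapply(width:n, function(end) {
    win <- (end - width + 1):end
    weighted.mean(level[win], volume[win])
  })
  out
}

level  <- c(10, 11, 12, 13, 14)
volume <- c(1, 1, 2, 1, 3)

res <- roll_wmean(level, volume, width = 3)
print(res)
```

The first full window gives (10*1 + 11*1 + 12*2) / 4 = 11.25, which matches the ratio of rolling sums used in the RcppRoll function.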
I think this might work. It employs the technique demonstrated in the rollapply documentation for rolling regression. The key is by.column=FALSE. This provides a matrix of all the columns on a rolling basis.
require(zoo)
df <- data.frame(
price = cumprod(1 + runif(1000,-0.03,0.03)) * 25,
volume = runif(1000,1e6,2e6)
)
rollapply(
df,
width = 50,
function(z){
#uncomment if you want to see the structure
#print(head(z,3))
return(
weighted_mean = weighted.mean(z[,"price"],z[,"volume"])
)
},
by.column = FALSE,
align = "right"
)
Let me know if it doesn't work or is not clear.
Here is a code snippet that might help. It uses the rollmean function from the zoo package, with a window of two (you pick the window). I assume weightedVariable is the variable you would calculate using the weighted.mean function:
library(zoo) # for the rollmean() function
movavg <- rollmean(df$weightedVariable, k = 2, align = "right")