R: Looping in R and writing the results column-wise to another data frame

Since I am fairly new to R, I have been struggling for days to find the right solution. Searching the internet and Stack Overflow has not got me any further so far.
All my attempts with rbind, cbind, lapply and sapply did not work. So here is the problem:
I have a data frame with a time series in column "value X".
I want to calculate simple and exponential moving averages (SMA and EMA) on this column.
Since the window size "n" is a parameter of the SMA/EMA calculation, I want to vary it in a loop from 5 to 150 in steps of 5 and write each result into a data frame.
So the data frame should look like this:
SMA_5 | SMA_10 | SMA_15 .... EMA_5 | EMA_10 | EMA_15 ...
Ideally, the column names are also created in this loop.
Can you help me out?
Thank you in advance.

As far as I know, loops are seen as a non-optimal solution in R and should be avoided where possible. It seems to me that the built-in R functions sapply and colnames provide quite a simple solution to your problem:
library("TTR")
# example of data
test <- data.frame(moments = 101:600, values = 1:500)
seq_of_windows_size <- seq(from = 5, to = 150, by = 5)
col_names_of_sma <- paste("SMA", seq_of_windows_size, sep = "_")
SMA_columns <- sapply(FUN = function(i) SMA(x = test$values, n = i),
X = seq_of_windows_size)
colnames(SMA_columns) <- col_names_of_sma
Then you just have to add SMA_columns to your original data frame, for example with cbind. The steps for EMA are much the same.
Hope it helps :)
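A minimal sketch of the complete workflow, assuming the TTR package is installed and that its EMA function is called the same way as SMA (a series plus a window size n). The name result is chosen here for illustration only; the leading rows of each column are NA until the window is filled.

```r
library("TTR")

# example data, same shape as in the answer above
test <- data.frame(moments = 101:600, values = 1:500)
seq_of_windows_size <- seq(from = 5, to = 150, by = 5)

# one column per window size; sapply returns a 500 x 30 matrix each time
SMA_columns <- sapply(seq_of_windows_size, function(i) SMA(test$values, n = i))
EMA_columns <- sapply(seq_of_windows_size, function(i) EMA(test$values, n = i))

# build the column names from the window sizes
colnames(SMA_columns) <- paste("SMA", seq_of_windows_size, sep = "_")
colnames(EMA_columns) <- paste("EMA", seq_of_windows_size, sep = "_")

# `result` is a hypothetical name: original columns plus all averages
result <- cbind(test, SMA_columns, EMA_columns)
```

Binding everything once at the end avoids growing the data frame inside a loop.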

Related

R: Efficiently Calculate Deviations from the Mean Using Row Operations on a DF (Without Using a For Loop)

I am generating a very large data frame consisting of a large number of combinations of values. As such, my code has to be as efficient as possible, or else 1) I get errors like "R cannot allocate vector of size XX" or 2) the calculations take forever.
I am at the point where I need to calculate r deviations from the mean (in the example below r = 3) for each sample (1 sample per row of the df) (labeled dev1 - dev3 in pic below):
These are my data in R:
I tried this (r is the number of values in each sample, here set to 3):
X2<-apply(X1[,1:r],1,function(x) x-X1$x.bar)
When I try this, I get:
I am guessing that this code is attempting to calculate the difference between each row of X1 (x) and the entire vector of X1$x.bar instead of 81 for the 1st row, 81.25 for the 2nd row, etc.
Once again, I can easily do this using for loops, but I'm assuming that is not the most efficient way.
Can someone please steer me in the right direction? Any assistance is appreciated.
Here is the whole code for the small sample version with r<-3. WARNING: This computes all possible combinations, so the df's get very large very quickly.
options(scipen = 999)
dp <- function(x) {
dp1<-nchar(sapply(strsplit(sub('0+$', '', as.character(format(x, scientific = FALSE))), ".",
fixed=TRUE),function(x) x[2]))
ifelse(is.na(dp1),0,dp1)
}
retain1<-function(x,minuni) length(unique(floor(x)))>=minuni
# =======================================================
r<-3
x0<-seq(80,120,.25)
X0<-data.frame(t(combn(x0,r)))
names(X0)<-paste("x",1:r,sep="")
X<-X0[apply(X0,1,retain1,minuni=r),]
rm(X0)
gc()
X$x.bar<-rowMeans(X)
dp1<-dp(X$x.bar)
X1<-X[dp1<=2,]
rm(X)
gc()
X2<-apply(X1[,1:r],1,function(x) x-X1$x.bar)
Because R is vectorized, you only need to subtract x.bar from x1, x2 and x3 collectively:
devs <- X1[ , 1:3] - X1[ , 4]
X1devs <- cbind(X1, devs)
That's it...
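As a small worked illustration of the vectorized subtraction, here is a sketch on made-up toy data standing in for X1. The values and the dev1..dev3 column names are assumptions taken from the question's description, not from the real data.

```r
# toy stand-in for X1 (two samples of r = 3 values each; values are made up)
r <- 3
X1 <- data.frame(x1 = c(80, 80.25), x2 = c(81, 81.25), x3 = c(82, 82.25))
X1$x.bar <- rowMeans(X1[, 1:r])

# subtracting a vector from a data frame works column by column,
# so every row is compared with its own mean
devs <- X1[, 1:r] - X1$x.bar

# rename so the deviation columns do not clash with x1..x3
names(devs) <- paste0("dev", 1:r)
X1devs <- cbind(X1, devs)
X1devs
```

Renaming before cbind keeps the column names unique, which makes later selection by name unambiguous.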
I think you just got the margin wrong: in apply you are using 1 (row-wise), but you want to work column-wise, so use 2:
X2<-apply(X1[,1:r], 2, function(x) x-X1$x.bar)
But from what I quickly searched, the apply family isn't better in performance than loops, only in clarity. Check this post: Is R's apply family more than syntactic sugar?

How to make a data frame with function in R?

For some basic publications I have to write almost the same code for many tables. So I need reasonably fast code that builds data frames from files and applies the same operations to the data using a single function.
Example:
# Creating function
basic_sum <- function (place, DF, factor_col, sum) {
# Uploading data.frame
DF <- read.csv (place, sep = ";")
# Converting to factor
for (i in factor_col) {
DF [, i] <- as.factor (DF [, i])
}
# Summary
sum <- summary (DF)
View (sum)
}
Then I run that code and get a function basic_sum.
When I want to work with my data, I call this function with arguments:
basic_sum (place = "~/DataFrame.csv", DF = DataFrame,
factor_col = c (1, 6 : 11), sum = DF_sum)
After running it, nothing happens. I mean, I don't have anything new in the Environment: no new data, no new variables, nothing else.
I expected to end up with:
1) a data.frame "DataFrame", loaded from DataFrame.csv;
2) the 1st, 6th, 7th and all other columns up to the 11th converted to factors;
3) a data.frame "DF_sum" with the summary of all columns of "DataFrame";
4) a view of the data.frame "DF_sum".
Well, I see all of it in the console, but I need it in the Environment so that I can save it somewhere.
It seems that I'm doing something wrong, but I don't know what.
P.S.: If I run it without the function (of course replacing DF with DataFrame, factor_col with c (1, 6 : 11) and so on), everything is all right. But then I have to rewrite the code every time, or at least replace every DF and the other names, which bothers me.
With great regards,
Dmitrii
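One possible way to get these objects into the Environment, sketched under assumptions: objects created inside an R function are local to that call and disappear when it returns, so the function has to return what it built and the caller has to assign it. The reduced argument list, the list element names and the temporary demo file below are illustrative choices, not part of the original code.

```r
# return both the data frame and its summary instead of only viewing them
basic_sum <- function(place, factor_col, sep = ";") {
  DF <- read.csv(place, sep = sep)
  # convert the selected columns to factors in one step
  DF[factor_col] <- lapply(DF[factor_col], as.factor)
  list(data = DF, summary = summary(DF))
}

# demo on a small temporary file (made-up data standing in for DataFrame.csv)
tmp <- tempfile(fileext = ".csv")
write.table(data.frame(id = 1:4, grp = c("a", "b", "a", "b")),
            tmp, sep = ";", row.names = FALSE, quote = FALSE)

res <- basic_sum(tmp, factor_col = 2)
DataFrame <- res$data    # now visible in the calling environment
DF_sum <- res$summary
```

Passing names such as DF = DataFrame or sum = DF_sum as arguments cannot create those objects; assigning the returned values, as in the last two lines, does.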

simple [SAS] macro in [R], how?

I have datasets called example1, example2, example3 and example4 in my working library, each with a variable SEX (1 or 2),
and I've made datasets called exampleS1, exampleS2, exampleS3 and exampleS4, restricted to SEX=1, by using a macro in SAS
like this:
%macro ms(var=);
data exampleS&var.;
set example&var.; IF SEX=1;
run;
%mend ms;%ms(var=1);%ms(var=2);%ms(var=3);%ms(var=4);
Now I want to do the same job in R.
This is not easy for me to do in R. How can I do it? (Assume example1, example2, example3 and example4 are data.frames.)
Thank you in advance.
Having variables with a numeric index in the name is a very SAS thing to do, and not at all R-like. If you have related data.frames in R, you keep them in a list. There are many ways to read many files into a list (see here). So say you have a list of data.frames
examples <- list(
data.frame(id=1:3, SEX=c(1,2,1)),
data.frame(id=4:6, SEX=c(1,1,2)),
data.frame(id=7:9, SEX=c(2,2,1))
)
Then you can get all the SEX=1 values with
exampleS <- lapply(examples, subset, SEX==1)
and you access them with
exampleS[[1]]
exampleS[[2]]
exampleS[[3]]
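If the data frames already exist as separate objects named example1 to example4, as the question assumes, they can be gathered into a list first. A small sketch with made-up data; the exampleS1..exampleS4 names mirror the SAS output names and are only one possible choice.

```r
# made-up stand-ins for the four existing data frames
example1 <- data.frame(id = 1:3,   SEX = c(1, 2, 1))
example2 <- data.frame(id = 4:6,   SEX = c(1, 1, 2))
example3 <- data.frame(id = 7:9,   SEX = c(2, 2, 1))
example4 <- data.frame(id = 10:12, SEX = c(2, 1, 2))

# mget() looks the objects up by name and returns them as a named list
examples <- mget(paste0("example", 1:4))

# keep only the rows with SEX == 1 in every data frame
exampleS <- lapply(examples, function(d) d[d$SEX == 1, , drop = FALSE])
names(exampleS) <- paste0("exampleS", 1:4)

exampleS$exampleS1
```

Elements of a named list can be accessed either by position or by name.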
You should program R the R way, not the SAS way, because the SAS way will lead to endless pain. The SAS macro language and R don't mix, in my opinion, but this is how:
# create example df's
for (i in 1:4) {
  assign(paste0("example", i), data.frame(sex = sample(0:1, 10, replace = TRUE)))
}
example1; example2; example3; example4
# filter and store the result in a list of df's
# (almost certainly what you should do)
l <- list(example1 = example1, example2 = example2, example3 = example3, example4 = example4)
want <- lapply(l, function(x) subset(x, sex == 1))
want$example1; want$example2; want$example3; want$example4 # access the filtered data frames
# in principle it is possible to do this too, but I advise against it:
# list2env overwrites example1..example4 in the global environment
list2env(lapply(l, function(x) subset(x, sex == 1)), .GlobalEnv)
example1; example2; example3; example4

For loop to produce a series of dataframes in output based on a criteria

I have gleaned from existing, somewhat similar questions that 1) for loops are slow and 2) outputting to a list and then making a dataframe is preferable to outputting directly to a dataframe.
Nonetheless:
So I have a bunch of NIBRS/UCR (Incident-Based Uniform Crime Reporting) data. I want to create 50 new lists/dataframes/tables, each segregating the data by state.
Data is:
Date CrimeDataField1 CDF2 CDF3 etc State.Abbrev.
xxx xxx xxx xxx xxx xxx
My clumsy attempt at a for loop:
for(i in unique(State.Abbrev.)){
i.allyrs<-HCtest1[State.Abbrev.=="i",]}
Thanks for any help!
Edit: I should add that the goal here is that each new dataframe of data-by-state should ideally be named AbbreviationforthatState.allyrs, and I wanted to take care of both the output and naming in this manner via my single for loop. Maybe not intelligent?
You can use the built-in split function:
x = data.frame(num = 1:26, let = letters, LET = LETTERS)
set.seed(1)
split(x, sample(rep(1:2, 13)))
split(x, x$let)
In your case probably
list_of_dfs = split(HCtest1,HCtest1$State.Abbrev.)
By the way, for loops aren't bad as such; it is extending data within a for loop that is bad. If you can pre-allocate, then a loop is not that bad (just not as nice looking).
Have a look at the R Inferno, which will give you insight into R's method of working (it is copy on change) and, given that you are starting out, this link.
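The difference between extending and pre-allocating can be sketched as follows. The vector length and the squared values are arbitrary illustration choices.

```r
n <- 1000

# extending: the vector is re-created with one more element on every pass
grown <- c()
for (i in seq_len(n)) {
  grown <- c(grown, i^2)
}

# pre-allocating: the full-length vector exists up front and is filled in place
pre <- numeric(n)
for (i in seq_len(n)) {
  pre[i] <- i^2
}

# both loops produce the same values
identical(grown, pre)
```

Both versions are plain for loops; only the first one pays for repeated copying as the result grows.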
Edit: To name your list:
names(list_of_dfs) = paste("MyName",1:length(list_of_dfs),sep="*")
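Note that split already names each list element after its group, so the state-based names asked for in the question can be derived from those. A minimal sketch on made-up rows; the column values are invented and only State.Abbrev. follows the question's layout.

```r
# made-up stand-in for HCtest1
HCtest1 <- data.frame(
  Date = c("2001", "2002", "2001", "2003"),
  CDF2 = c(5, 7, 1, 3),
  State.Abbrev. = c("NY", "NY", "CA", "TX")
)

# one data frame per state; elements are named by the state abbreviation
list_of_dfs <- split(HCtest1, HCtest1$State.Abbrev.)

# append the suffix wanted in the question
names(list_of_dfs) <- paste0(names(list_of_dfs), ".allyrs")

list_of_dfs[["NY.allyrs"]]
```

Keeping the per-state data frames in one named list makes it easy to apply the same operation to all of them with lapply later.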

How to rewrite this Stata code in R?

One of the things Stata does well is the way it constructs new variables (see example below). How to do this in R?
foreach i in A B C D {
    forval n=1990/2000 {
        local m = `n'-1
        * create new columns from existing ones on-the-fly
        generate pop`i'`n' = pop`i'`m' * (1 + trend`n')
    }
}
DON'T do it in R. The reason it's messy is that it's UGLY code. Constructing lots of variables with programmatic names is a BAD THING. Names are names. They have no structure, so do not try to impose one on them. Decent programming languages have structures for this; rubbishy programming languages have tacked-on 'Macro' features and end up with this awful pattern of constructing variable names by pasting strings together. This is a practice from the 1970s that should have died out by now. Don't be a programming dinosaur.
For example, how do you know how many popXXXX variables you have? How do you know if you have a complete sequence of pop1990 to pop2000? What if you want to save the variables to a file to give to someone? Yuck, yuck, yuck.
Use a data structure that the language gives you. In this case, probably a list.
Both Spacedman and Joshua have very valid points. As Stata has only one dataset in memory at any given time, I'd suggest adding the variables to a dataframe (which is also a kind of list) instead of to the global environment (see below).
But honestly, the more R-ish way to do this is to keep your factors as factors instead of encoding them in variable names.
I make up some data as I believe it exists in your R version now (at least, I hope so...)
Data <- data.frame(
popA1989 = 1:10,
popB1989 = 10:1,
popC1989 = 11:20,
popD1989 = 20:11
)
Trend <- replicate(11,runif(10,-0.1,0.1))
You can then use the stack() function to obtain a dataframe with a factor pop and a numeric variable year:
newData <- stack(Data)
newData$pop <- substr(newData$ind,4,4)
newData$year <- as.numeric(substr(newData$ind,5,8))
newData$ind <- NULL
Filling up the dataframe is then quite easy:
for(i in 1:11){
  tmp <- newData[newData$year==(1988+i),]
  # grow each population by (1 + trend), as in the Stata code
  newData <- rbind(newData,
    data.frame( values = tmp$values*(1+Trend[,i]),
                pop = tmp$pop,
                year = tmp$year+1
    )
  )
}
In this format, you'll find most R commands (selections of some years, of a single population, modelling effects of either or both, ...) a whole lot easier to perform later on.
And if you insist, you can still create a wide format with unstack()
unstack(newData,values~paste("pop",pop,year,sep=""))
Adaptation of Joshua's answer to add the columns to the dataframe :
for(L in LETTERS[1:4]) {
for(i in 1990:2000) {
new <- paste("pop",L,i,sep="") # create name for new variable
old <- get(paste("pop",L,i-1,sep=""),Data) # get old variable
trend <- Trend[,i-1989] # get trend variable
Data <- within(Data,assign(new, old*(1+trend)))
}
}
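The same columns can also be added with plain [[ ]] indexing on the data frame, which avoids get and assign altogether. This is a sketch on the example Data and Trend from above; the seed and the loop variable yr are illustration choices.

```r
set.seed(42)  # only to make the random trends reproducible

Data <- data.frame(
  popA1989 = 1:10,
  popB1989 = 10:1,
  popC1989 = 11:20,
  popD1989 = 20:11
)
Trend <- replicate(11, runif(10, -0.1, 0.1))  # one column per year 1990..2000

for (L in LETTERS[1:4]) {
  for (yr in 1990:2000) {
    new <- paste0("pop", L, yr)      # name of the column to create
    old <- paste0("pop", L, yr - 1)  # name of the previous year's column
    Data[[new]] <- Data[[old]] * (1 + Trend[, yr - 1989])
  }
}

names(Data)[1:6]
```

Indexing by a computed column name keeps all variables inside the data frame, which is the closest R analogue to Stata's single dataset in memory.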
Assuming popA1989, popB1989, popC1989, popD1989 already exist in your global environment, the code below should work. There are certainly more "R-like" ways to do this, but I wanted to give you something similar to your Stata code.
for(L in LETTERS[1:4]) {
for(i in 1990:2000) {
new <- paste("pop",L,i,sep="") # create name for new variable
old <- get(paste("pop",L,i-1,sep="")) # get old variable
trend <- get(paste("trend",i,sep="")) # get trend variable
assign(new, old*(1+trend))
}
}
Assuming you have population data in vector pop1989 and data for trend in trend.
require(stringr) # because str_c has better default for sep parameter
dta <- kronecker(pop1989,cumprod(1+trend))
names(dta) <- kronecker(str_c("pop",LETTERS[1:4]),1990:2000,str_c)
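A base R variant of the same idea, shown as a sketch: outer builds a population-by-year matrix from the starting values and the cumulative growth factors. The starting populations and the constant 10% trend are made-up numbers, and this swaps kronecker for outer, so the result is a matrix rather than a flat named vector.

```r
# made-up inputs: one starting value per population, one trend per year
pop1989 <- c(A = 100, B = 200, C = 300, D = 400)
trend <- rep(0.1, 11)  # years 1990..2000

# cumulative growth factor for each year relative to 1989
growth <- cumprod(1 + trend)

# rows are populations, columns are years
dta <- outer(pop1989, growth)
dimnames(dta) <- list(paste0("pop", LETTERS[1:4]), 1990:2000)

dta["popA", "1990"]
```

A matrix with populations as rows and years as columns can be indexed by name in both dimensions, so no pasted variable names are needed.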

Resources