I am reproducing some Stata code in R and struggling with the following command:
gen new_var=((var1_a==1 & var1_b==0) | (var2_a==1 & var2_b==0))
I am generally familiar with the gen syntax, but in this case I do not understand how values are assigned based on the boolean condition.
What would the above be in R?
In Stata, the above gen command works because your in-memory dataset (similar to a single R data frame) contains variables named var1_a, var1_b, var2_a, and var2_b. If these exist as vectors in your R environment, then our colleague Nick Cox is exactly correct: all that is needed is the statement without the leading gen (although typically in R we would write it like this):
new_var <- (var1_a==1 & var1_b==0) | (var2_a==1 & var2_b==0)
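One nuance worth noting (an addition of ours, not part of the original answer): Stata's gen stores the result of a logical expression as numeric 0/1, while the R expression above yields a logical TRUE/FALSE vector. If a numeric 0/1 column is needed to match Stata exactly, wrap the expression in as.integer():
new_var <- as.integer((var1_a==1 & var1_b==0) | (var2_a==1 & var2_b==0))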
However, if you have a data frame object, say df, that contains columns with these names, and the objective is to add another column to df reflecting your logical conditions (like adding a new variable ("column") to the dataset in Stata using generate / gen), then the above approach will not work, as the columns var1_a, var1_b, etc. will not be found in the global environment.
Instead, to add a new column called new_var to the data frame df, we would write something like this:
df["new_var"] <- (df$var1_a==1 & df$var1_b==0) | (df$var2_a==1 & df$var2_b==0)
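For completeness, a shorter equivalent using with(), which evaluates the expression with df's columns in scope so the df$ prefixes are not needed:
# same result as above; with() looks up the column names inside df
df$new_var <- with(df, (var1_a==1 & var1_b==0) | (var2_a==1 & var2_b==0))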
I have a data set with around 1000 columns/parameters and want to perform regression between each pair of these parameters. So the data in column 1 will be regressed against each of the other 999 parameters, and so on.
The non-optimized version of this approach would be:
loop <- 1:ncol(Data)
for (column in loop) {
  # Index of the next column to compare against
  nextColumn <- column + 1
  while (nextColumn <= ncol(Data)) {
    # Analysis logic goes here
    # Increment the counter
    nextColumn <- nextColumn + 1
  }
}
The above code will work, but it will take a lot of time. To optimize it, I want to use parallel processing in R. There are many packages that can be useful here, for example parallel and doParallel, as explained in this question.
However, there might be some overhead involved which, as a new R programmer, I might not be aware of. I am looking for suggestions from R experts on a better way to write the above code and on whether any specific package can be useful.
Looking forward to suggestions, thanks.
Use mapply like this:
X <- 1:(ncol(mtcars)-1) # first through penultimate column
Y <- 2:ncol(mtcars) # second through last column
mapply(function(x,y) sum(mtcars[,x],mtcars[,y]), X, Y)
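For the all-pairs case in the question, a sketch that extends this idea, assuming the unspecified "Analysis logic" is a simple lm() fit and that Data is all-numeric; combn() enumerates the column pairs and mclapply() from the parallel package distributes the fits over cores (mclapply() relies on forking, so on Windows use parLapply() from the same package instead):
library(parallel)
pairs <- combn(ncol(Data), 2)  # 2 x choose(n, 2) matrix of column index pairs
fits <- mclapply(seq_len(ncol(pairs)), function(k) {
  lm(Data[[pairs[1, k]]] ~ Data[[pairs[2, k]]])
}, mc.cores = 2)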
The following is an example of how I want to treat my data sets. It might be a bit difficult to understand how my data frame is structured, but I hope it makes sense:
First, density must be calculated for columns A, B, and C using raw data from columns ADry, AEthanol, BDry ...... (since these were defined earlier as vectors too, I used the vectors instead of the data frame columns, as it was shorter: ADry_1_0 instead of Sample_1_0$ADry_1_0):
Sample_1_0$ADensi_1_0=(ADry_1_0/(ADry_1_0-AEthanol_1_0))*(peth-pair)+pair
Sample_1_0$BDensi_1_0=(BDry_1_0/(BDry_1_0-BEthanol_1_0))*(peth-pair)+pair
Sample_1_0$CDensi_1_0=(CDry_1_0/(CDry_1_0-CEthanol_1_0))*(peth-pair)+pair
This yields 10 densities for each of A, B, and C. What's interesting is the mean density:
Mean_1_0=apply(Sample_1_0[7:9],2,mean)
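As an aside, a shorter equivalent for these column means is colMeans():
Mean_1_0=colMeans(Sample_1_0[7:9])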
Next, standard deviations are found. We are mainly interested in the standard deviations of our raw data columns (ADry and AEthanol), as error propagation calculations are carried out afterwards to find out how the deviations add up when calculating the densities:
StdAfv_1_0=apply(Sample_1_0,2,sd)
Error propagation (same for B and C)
ASd_1_0=(sqrt((sd(Sample_1_0$ADry_1_0)/mean(Sample_1_0$ADry_1_0))^2+(sqrt((sd(Sample_1_0$ADry_1_0)^2+sd(Sample_1_0$AEthanol_1_0)^2))/(mean(Sample_1_0$ADry_1_0)-mean(Sample_1_0$AEthanol_1_0)))^2))*mean(Sample_1_0$ADensi_1_0)
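For readability, the formula this line implements appears to be the standard relative-error propagation rule for the density ratio (our reading of the code, with D = ADry, E = AEthanol, and rho the resulting density):
sd_rho/rho = sqrt( (sd_D/mean_D)^2 + (sd_D^2 + sd_E^2)/(mean_D - mean_E)^2 )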
In the end we semi-manually gathered the final information (mean density and its deviation) into a plottable data frame. Some of the code might be a tad long, and maybe we could have achieved equal results with shorter code, but bear with us, we are rookies.
So now to the actual problem:
This was for A_1_0, B_1_0, and C_1_0. We would like to apply the same series of commands to 15 other data frames. The dimensions are the same, and they will be named A_1_1, A_1_2, A_2_0 and so on.
Is it possible to use some kind of loop function, or to make a loadable script containing x and y placeholders where we can easily insert A_1_1, for instance?
Thanks in advance! I tried to keep the amount of confusion to a minimum, although it's tough.
Data list
If, instead of individual vectors, you combine the raw data into data frames (or, even better, data.tables) and then store the data frames for all runs in a list, as @Gregor suggested, you can use the function below together with lapply.
my_func <- function(dataset, peth, pair){
require(data.table)
names <- names(dataset)
setDT(dataset)[, `:=` (ADens = (get(names[1])/(get(names[1])-get(names[4])))*(peth-pair)+pair,
BDens = (get(names[2])/(get(names[2])-get(names[5])))*(peth-pair)+pair,
CDens = (get(names[3])/(get(names[3])-get(names[6])))*(peth-pair)+pair)
][, .(ADens_mean = mean(ADens),
ADens_sd = sd(ADens),
AErr = (sqrt((sd(get(names[1]))/mean(get(names[1])))^2) +
(sqrt((sd(get(names[1]))^2 + sd(get(names[4]))^2))/
(mean(get(names[1])) - mean(get(names[4]))))^2)* mean(ADens),
BDens_mean = mean(BDens),
BDens_sd = sd(BDens),
BErr = (sqrt((sd(get(names[2]))/mean(get(names[2])))^2) +
(sqrt((sd(get(names[2]))^2 + sd(get(names[5]))^2))/
(mean(get(names[2])) - mean(get(names[5]))))^2)* mean(BDens),
CDens_mean = mean(CDens),
CDens_sd = sd(CDens),
CErr = (sqrt((sd(get(names[3]))/mean(get(names[3])))^2) +
(sqrt((sd(get(names[3]))^2 + sd(get(names[6]))^2))/
(mean(get(names[3])) - mean(get(names[6]))))^2)* mean(CDens))
]
}
rbindlist(lapply(list_datasets, my_func, peth = 2, pair = 1))
Now, this assumes that you put your raw vectors into data frames with the columns in the order in which they appeared in your example (and that they are the only columns in the data set). If this is not the case, you may just have to edit the indices in the names[x] calls. If you wanted a little more flexibility, you could also define a list of lists with the column names for each data set, add that as an argument to my_func, and then replace all the instances of names[x] with get(list_column_names[x]).
This function should output a data.table with the results for each of the data sets (1-16) in individual rows, with 9 columns (ADens_mean, ADens_sd, AErr, and likewise for B and C).
NOTE: since there was no actual data to work with, I can't say for sure that this function does exactly what you want, but I think it will be close. This will also require you to install the data.table package.
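If the 16 runs already sit in your workspace as data frames following the Sample_x_y naming from the question, one way to build list_datasets is to collect them by name pattern (a sketch; the "^Sample_" pattern is an assumption about your naming):
# gather every object whose name starts with "Sample_" into a named list
list_datasets <- mget(ls(pattern = "^Sample_"))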
I've got this dataset
install.packages("combinat")
install.packages("quantmod")
library(quantmod)
library(combinat)
library(utils)
getSymbols("AAPL",from="2012-01-01")
data<-AAPL
p1<-4
dO<-data[,1]
dC<-data[,4]
emaO<-EMA(dO,n=p1)
emaC<-EMA(dC,n=p1)
Pos_emaO_dO_UP<-emaO>dO
Pos_emaO_dO_D<-emaO<dO
Pos_emaC_dC_UP<-emaC>dC
Pos_emaC_dC_D<-emaC<dC
Pos_emaC_dO_D<-emaC<dO
Pos_emaC_dO_UP<-emaC>dO
Pos_emaO_dC_UP<-emaO>dC
Pos_emaO_dC_D<-emaO<dC
Profit_L_1<-((lag(dC,-1)-lag(dO,-1))/(lag(dO,-1)))*100
Profit_L_2<-(((lag(dC,-2)-lag(dO,-1))/(lag(dO,-1)))*100)/2
Profit_L_3<-(((lag(dC,-3)-lag(dO,-1))/(lag(dO,-1)))*100)/3
Profit_L_4<-(((lag(dC,-4)-lag(dO,-1))/(lag(dO,-1)))*100)/4
Profit_L_5<-(((lag(dC,-5)-lag(dO,-1))/(lag(dO,-1)))*100)/5
Profit_L_6<-(((lag(dC,-6)-lag(dO,-1))/(lag(dO,-1)))*100)/6
Profit_L_7<-(((lag(dC,-7)-lag(dO,-1))/(lag(dO,-1)))*100)/7
Profit_L_8<-(((lag(dC,-8)-lag(dO,-1))/(lag(dO,-1)))*100)/8
Profit_L_9<-(((lag(dC,-9)-lag(dO,-1))/(lag(dO,-1)))*100)/9
Profit_L_10<-(((lag(dC,-10)-lag(dO,-1))/(lag(dO,-1)))*100)/10
which are combined into this data frame:
frame<-data.frame(Pos_emaO_dO_UP,Pos_emaO_dO_D,Pos_emaC_dC_UP,Pos_emaC_dC_D,Pos_emaC_dO_D,Pos_emaC_dO_UP,Pos_emaO_dC_UP,Pos_emaO_dC_D,Profit_L_1,Profit_L_2,Profit_L_3,Profit_L_4,Profit_L_5,Profit_L_6,Profit_L_7,Profit_L_8,Profit_L_9,Profit_L_10)
colnames(frame)<-c("Pos_emaO_dO_UP","Pos_emaO_dO_D","Pos_emaC_dC_UP","Pos_emaC_dC_D","Pos_emaC_dO_D","Pos_emaC_dO_UP","Pos_emaO_dC_UP","Pos_emaO_dC_D","Profit_L_1","Profit_L_2","Profit_L_3","Profit_L_4","Profit_L_5","Profit_L_6","Profit_L_7","Profit_L_8","Profit_L_9","Profit_L_10")
There is a vector of variable names for later use:
vector<-c("Pos_emaO_dO_UP","Pos_emaO_dO_D","Pos_emaC_dC_UP","Pos_emaC_dC_D","Pos_emaC_dO_D","Pos_emaC_dO_UP","Pos_emaO_dC_UP","Pos_emaO_dC_D")
I made all possible combinations of 4 variables from the vector (there are no dependent variables among them):
comb<-as.data.frame(combn(vector,4))
comb
and removed the "nonsense" combinations (those in which both possible values of the same variable appear):
rc<-comb[!sapply(comb, function(x) any(duplicated(sub('_D|_UP', '', x))))]
rc
Then I prepare the first combination for later subsetting:
var<-paste(rc[,1],collapse=" & ")
var
and subset the frame (keeping all DVs):
kr<-eval(parse(text=paste0('subset(frame,' , var,')' )))
kr
Now I have the data frame subsetted by the first combination of 4 variables.
Then I use the evaluation function on it:
evaluation<-function(x){
s_1<-nrow(x[x$Profit_L_1>0,])/nrow(x)
s_2<-nrow(x[x$Profit_L_2>0,])/nrow(x)
s_3<-nrow(x[x$Profit_L_3>0,])/nrow(x)
s_4<-nrow(x[x$Profit_L_4>0,])/nrow(x)
s_5<-nrow(x[x$Profit_L_5>0,])/nrow(x)
s_6<-nrow(x[x$Profit_L_6>0,])/nrow(x)
s_7<-nrow(x[x$Profit_L_7>0,])/nrow(x)
s_8<-nrow(x[x$Profit_L_8>0,])/nrow(x)
s_9<-nrow(x[x$Profit_L_9>0,])/nrow(x)
s_10<-nrow(x[x$Profit_L_10>0,])/nrow(x)
n_1<-nrow(x[x$Profit_L_1>0,])/nrow(frame)
n_2<-nrow(x[x$Profit_L_2>0,])/nrow(frame)
n_3<-nrow(x[x$Profit_L_3>0,])/nrow(frame)
n_4<-nrow(x[x$Profit_L_4>0,])/nrow(frame)
n_5<-nrow(x[x$Profit_L_5>0,])/nrow(frame)
n_6<-nrow(x[x$Profit_L_6>0,])/nrow(frame)
n_7<-nrow(x[x$Profit_L_7>0,])/nrow(frame)
n_8<-nrow(x[x$Profit_L_8>0,])/nrow(frame)
n_9<-nrow(x[x$Profit_L_9>0,])/nrow(frame)
n_10<-nrow(x[x$Profit_L_10>0,])/nrow(frame)
# average profit per selected row; the original referenced the global kr and
# used a malformed denominator (nrow(kr[,kr=="Profit_L_1"])), fixed here to
# use the function argument x (note: the Profit columns can contain NAs from
# lag(), so add na.rm=TRUE to sum() if needed)
pr_1<-sum(x[,"Profit_L_1"])/nrow(x)
pr_2<-sum(x[,"Profit_L_2"])/nrow(x)
pr_3<-sum(x[,"Profit_L_3"])/nrow(x)
pr_4<-sum(x[,"Profit_L_4"])/nrow(x)
pr_5<-sum(x[,"Profit_L_5"])/nrow(x)
pr_6<-sum(x[,"Profit_L_6"])/nrow(x)
pr_7<-sum(x[,"Profit_L_7"])/nrow(x)
pr_8<-sum(x[,"Profit_L_8"])/nrow(x)
pr_9<-sum(x[,"Profit_L_9"])/nrow(x)
pr_10<-sum(x[,"Profit_L_10"])/nrow(x)
mat<-matrix(c(s_1,n_1,pr_1,s_2,n_2,pr_2,s_3,n_3,pr_3,s_4,n_4,pr_4,s_5,n_5,pr_5,s_6,n_6,pr_6,s_7,n_7,pr_7,s_8,n_8,pr_8,s_9,n_9,pr_9,s_10,n_10,pr_10),ncol=3,nrow=10,dimnames=list(c(1:10),c("s","n","pr")))
df<-as.data.frame(mat)
return(df)
}
result<-evaluation(kr)
result
And I need help with several things.
1. In the evaluation function, the matrix is built the wrong way (s_1, n_1, pr_1 run down the first column, but I need them filled in row order); see the sketch after this list.
2. I need some loop/lapply construct to go through all possible combinations (not only the first one, as in var<-paste(rc[,1],collapse=" & ")), with understandable output in which the evaluation function is applied to every combination and I can recognize which combination of variables each evaluation belongs to, so that I can compare the evaluation results across combinations.
3. This is not the main point, but I generally want to evaluate all possible combinations (that is, for 2:n variables, and all combinations of each size) and then pick the best combination according to a specific DV (Profit_L_1 or Profit_L_2 and so on). I am so weak at looping for now, so if this is possible, keep in mind what I am going to do with it later.
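A minimal sketch for points 1 and 2, reusing the objects defined above (frame, rc, and evaluation); the byrow fix and the looping idiom are suggestions, not tested against real data:
# Point 1: build the matrix row-wise so each row holds (s, n, pr) for one
# horizon -- add byrow=TRUE to the existing matrix() call:
#   mat<-matrix(c(s_1,n_1,pr_1,s_2,n_2,pr_2, ...), ncol=3, byrow=TRUE,
#               dimnames=list(c(1:10),c("s","n","pr")))
# Point 2: apply the same subset-and-evaluate step to every combination in
# rc, and name each result after its combination so it stays recognizable
results<-lapply(seq_along(rc), function(i){
  var<-paste(rc[,i],collapse=" & ")
  kr_i<-eval(parse(text=paste0('subset(frame,',var,')')))
  evaluation(kr_i)
})
names(results)<-sapply(rc, paste, collapse=" & ")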
Thanks. Feel free to update, repair or improve the question (if there is something that could be done more easily or effectively, do it); I am open to every sensible piece of advice.
I have to aggregate (of course, with a categorical break variable) a rather big data table containing some continuous variables, computing the mean, median, standard deviation and interquartile range (IQR) of the required variables.
The first three are easy with the SPSS AGGREGATE command, but I have no idea how to compute the IQR while aggregating the data table.
I know I could compute the IQR using Descriptives (by quartiles), but as I need the calculation inside the aggregation, this is not an option. Unfortunately, using R fails too, thanks to some odd circumstances (I am not able to load the huge comma-separated file into R with base::read.table, nor with the sqldf, bigmemory, or ff packages).
Any idea is welcomed! And of course: thank you in advance.
P.S.: I thought about estimating the IQR by multiplying the standard deviation by 1.5, but that method would not work, as the distributions are skewed, so assuming normality does not stand.
P.P.S.: do you think using R within SPSS would avoid memory problems like those encountered when opening the dataset in pure R?
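For reference, if the loading problem were ever solved, the aggregation itself is a one-liner on the R side; a sketch assuming a data frame dat with a numeric column x and a categorical break variable g (both names hypothetical):
# IQR per group; IQR() and aggregate() are base/stats functions
aggregate(x ~ g, data = dat, FUN = IQR)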
This syntax should do the trick. There is no need to migrate back and forth between SPSS and R solely for this task.
*making fake data, 4 million records and 150 variables.
input program.
loop i = 1 to 4000000.
end case.
end loop.
end file.
end input program.
dataset name Temp.
execute.
vector X(150).
do repeat X = X1 to X150.
compute X = RV.NORMAL(0,1).
end repeat.
*This is the command you are interested in, puts the stats table into a new dataset.
Dataset declare IQR.
OMS
/SELECT TABLES
/IF SUBTYPES = 'Statistics'
/DESTINATION FORMAT = SAV outfile = 'IQR' VIEWER=NO.
freq var = X1
/format = notable
/ntiles = 4.
OMSEND.
This still takes a long time with such a large dataset, but that's to be expected. Just search the SPSS help files for "OMS" to find example syntax showing how OMS works.
Given the further constraint that you want to calculate the IQR for many groups, there are a few different ways I can see to proceed. One would be to just use the SPLIT FILE command and run the above frequency command again:
split file by group.
freq var = X1 X2
/format = notable
/ntiles = 4.
split file end.
You could also get specific percentiles within CTABLES (and can do whatever grouping/nesting you want there). A potentially more useful solution at this point, though, is to make a program that actually saves separate files (or reduces the full dataset to the specific group while still loaded), does the calculation on each separate file, and dumps the results into a dataset. Working with a dataset of 4 million records is a pain, and it does not appear to be necessary if you are just splitting the file up anyway. This could be accomplished via macro commands.
OMS can capture any pivot table as a dataset, so any statistical results displayed that way can be used as a dataset. Another approach, however, in this case would be to use the RANK command. RANK allows for grouping variables, so you could get rank within group, and it can compute the quartiles and percentiles within group. For example,
RANK VARIABLES=salary (A) BY jobcat minority
/RANK /NTILES(4) /PERCENT.
Then aggregating with FIRST and the group variables as breaks would give you a dataset of the quartiles by group from which to compute the IQR.
Many ways to skin a cat.
-Jon Peck