Create a subsample from a data frame in R


I have five data frames that I want to use in regressions:
df1: stock returns
df2: housing returns
df3: actual inflation rate
df4: expected inflation rate
df5: unexpected inflation rate
Dataframe example
Each of the data frames has the same format as above, with only different data inside it.
I want to run separate regressions of housing and stock returns on expected and unexpected inflation, as below:
df1[i] ~ df4[i] + df5[i]
df2[i] ~ df4[i] + df5[i]
I want to compare the regression results for periods where actual inflation (held in df3) is above its median with the results for periods where it is below the median. To do that, I need to create two subsamples from each data frame, based on the values in the corresponding column of df3.
I don't know R well enough to work out how to do this. Is it possible, and if so, how? Or would it be better to create 13 separate data frames, one per country?
Thank you in advance!
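One possible way to do the median split is sketched below. It assumes that every data frame has one column per country, in the same order, and that the rows are the same time periods in all five data frames. The simulated data, the country names, and the helper fit_subsample are all made up for illustration.

```r
# Simulated stand-ins for the five data frames (assumption: one column per
# country, rows are the same time periods in every data frame).
set.seed(1)
n <- 60
countries <- c("A", "B", "C")  # hypothetical country names
make_df <- function() {
  as.data.frame(matrix(rnorm(n * length(countries)), nrow = n,
                       dimnames = list(NULL, countries)))
}
df1 <- make_df()  # stock returns
df2 <- make_df()  # housing returns
df4 <- make_df()  # expected inflation
df5 <- make_df()  # unexpected inflation
df3 <- df4 + df5  # actual inflation (assumed here to be expected + unexpected)

# Fit y ~ expected + unexpected on the rows selected by the logical vector keep.
fit_subsample <- function(y, i, keep) {
  d <- data.frame(y = y[[i]], expected = df4[[i]], unexpected = df5[[i]])
  lm(y ~ expected + unexpected, data = d[keep, ])
}

# For each country, split the periods at the median of actual inflation.
results <- lapply(setNames(countries, countries), function(i) {
  high <- df3[[i]] > median(df3[[i]], na.rm = TRUE)
  list(stocks_high  = fit_subsample(df1, i, high),
       stocks_low   = fit_subsample(df1, i, !high),
       housing_high = fit_subsample(df2, i, high),
       housing_low  = fit_subsample(df2, i, !high))
})

coef(results$A$stocks_high)
```

Because the split is a logical vector computed per column of df3, there is no need to build separate data frames per country; the same loop covers all of them.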

Related

Subsetting data based on whether a variable is in a list

I have some data on NFL player statistics. I want to separate it into training and test datasets where the split is based on the year of observation.
In particular, my data contains observations of player statistics from 1999 through 2019. I want to randomly select 20% of years (4 years) of data to serve as my test set and then have the remaining 17 years of data be my training set.
What I have now is:
# Set seed
set.seed(43)
# Determine how many years of data should be in test
split <- round(length(unique(data$year)) * 0.20)
# Pick (split) random distinct years to keep as test
# (sampling from data$year directly could draw the same year twice)
test_years <- sample(unique(data$year), split)
What I don't know how to write is:
train <- data where year is not in test_years
How would I do this?

We can use %in% to create a logical vector, negate it with ! so that TRUE and FALSE swap, and use the result to subset the rows of data:
train <- data[!data$year %in% test_years,]
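A small self-contained sketch of the whole split, using simulated data in place of the real player statistics. The yards column and the row counts are made up; only the year column matters here.

```r
# Simulated stand-in for the player data (assumption: one row per
# player-season, with a year column covering 1999 through 2019).
set.seed(43)
data <- data.frame(year  = rep(1999:2019, each = 5),
                   yards = round(runif(105, 0, 1500)))

years <- unique(data$year)
n_test <- round(length(years) * 0.20)  # 21 years * 0.20 rounds to 4
test_years <- sample(years, n_test)    # sample distinct years, not rows

test  <- data[data$year %in% test_years, ]
train <- data[!data$year %in% test_years, ]
```

Every row lands in exactly one of the two sets, and no year is shared between them.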

t-test for multiple columns of dataframe

I have hourly PM10 data from 2014 to 2019 for 4 stations. I want to apply t-tests to compare the stations' values. My data contains only 4 columns, named bafra, atakum, canik, and ilkadim (the station names).
The linked question below is related to mine, but its code always gave me an error.
R: t test over multiple columns using t.test function
I tried the code below, but I want the results shown more clearly, e.g. with the station names, p-value, and confidence interval:
library(reshape2)
meltdf <- melt(samsun)
pairwise.t.test(meltdf$value, meltdf$Var2, p.adjust.method = "none")
How can I do this?
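One way to get a table with station names, p-values, and confidence intervals is to run t.test on every pair of columns and collect the pieces. The sketch below uses simulated values in place of the real PM10 data and assumes the data frame is called samsun with one numeric column per station.

```r
# Simulated stand-in for the hourly PM10 data (assumption: a data frame
# named samsun with one numeric column per station).
set.seed(1)
samsun <- data.frame(bafra   = rnorm(200, 40, 10),
                     atakum  = rnorm(200, 45, 10),
                     canik   = rnorm(200, 40, 12),
                     ilkadim = rnorm(200, 50, 10))

# All unordered pairs of station names, one pair per column.
pairs <- combn(names(samsun), 2)

# Run a Welch two-sample t-test per pair and keep the key numbers.
results <- do.call(rbind, lapply(seq_len(ncol(pairs)), function(k) {
  a <- pairs[1, k]
  b <- pairs[2, k]
  tt <- t.test(samsun[[a]], samsun[[b]])
  data.frame(station1 = a, station2 = b,
             mean_diff = unname(tt$estimate[1] - tt$estimate[2]),
             conf_low = tt$conf.int[1], conf_high = tt$conf.int[2],
             p_value = tt$p.value)
}))
print(results)
```

The p-values here are unadjusted, matching the "none" setting in the question; p.adjust() can be applied to results$p_value if a multiple-comparison correction is wanted. Hourly measurements are usually autocorrelated, so the independence assumption of the t-test may not hold for the real data.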

remove trend from data in R

I have two data sets, one of which shows seasonality while the other shows a trend.
I have removed seasonality from the first data set but I am not able to remove trend from the other data set.
Also, once I remove the trend from the second data set and try to combine the two altered data sets into one data frame, they will have different numbers of rows: I removed seasonality from the first data set using a lag, so it is 52 values shorter than the second.
How do I go about it?

There are several ways to detrend a time series; a commonly used one is the Hodrick-Prescott (HP) filter from the "mFilter" package:
library(mFilter)
a <- hpfilter(x,freq=270400,type="lambda",drift=FALSE)
With type="lambda", freq is the smoothing parameter lambda, and 270400 is a value chosen for weekly data. drift=FALSE tells the filter that the series has no drift. The function computes the cyclical and trend components and returns them separately (a$cycle and a$trend).
If both series share the same time index (i.e. weekly), you could merge them on it, where x and y are your data frames:
final <- merge(x, y, by = "week", all = FALSE) # "week" stands for whatever column holds the shared time index
You can always set all.x=TRUE (all.y=TRUE) to see which rows of x (y) have no matching row in y (x). See ?merge for details.
Hope this helps.
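The row mismatch from the question can also be handled in base R alone. The sketch below is not the HP filter: it swaps in simple linear detrending (residuals of a regression on time) so that it runs without extra packages. The two weekly series are simulated and all names are made up.

```r
# Simulated weekly series (assumption: both cover the same 208 weeks).
set.seed(1)
n <- 208
week <- seq_len(n)
seasonal_series <- 10 * sin(2 * pi * week / 52) + rnorm(n)
trend_series    <- 0.5 * week + rnorm(n)

# Seasonal differencing at lag 52 drops the first 52 observations.
deseasoned <- diff(seasonal_series, lag = 52)

# Linear detrending: keep the residuals of a regression on time.
# (A simple alternative to the HP filter, not the same method.)
detrended <- unname(residuals(lm(trend_series ~ week)))

# Align the two series by dropping the same first 52 weeks from the
# detrended series, so every row refers to the same week.
final <- data.frame(week = week[-(1:52)],
                    deseasoned = deseasoned,
                    detrended = detrended[-(1:52)])
```

Trimming the longer series to the weeks that survive the differencing has the same effect as an inner merge on the time index.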

R assign categorical variables to matrix

I have 5 categorical variables: age(5 levels), sex(2 levels), zone(4 levels), qmat(5 levels), and qsoc(5 levels) for a total of 1000 unique combinations. Each unique combination has a corresponding data value (e.g. population size). I would like to assign this data to a 1000 x 6 table where the first five columns contain the indices of age, sex, zone, qmat, qsoc and the 6th column holds the data value.
I would like to avoid using nested for loops which are inefficient in R (some of my datasets will have more than 1000 unique combinations). I know there exist many tools in R for parallel operations (but am not familiar with them). Is there an efficient way to perform the above variable assignment using parallel/vector operations? Any suggestions or references would be appreciated.

It's hard to tell what your original data looks like, but assuming it is in a data frame, you may want to use aggregate().
# simulating a data frame
set.seed(1)
N <- 9000
df <- data.frame(pop = rnorm(N),
                 age = sample(1:5, N, replace = TRUE),
                 sex = sample(1:2, N, replace = TRUE))
# 'aggregate' this data frame by 'age' and 'sex'
newData <- aggregate(pop ~ age + sex, data = df, FUN = sum)

The R function expand.grid() solves my problem, e.g.
expand.grid(list(age,sex,zone,qmat,qsoc))
Thanks for all the responses and I apologize for any possible vagueness in the wording of my question.
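A minimal sketch of the expand.grid() approach with the level counts from the question. The levels are represented as integer indices, and the data values are random placeholders rather than real population sizes.

```r
# Levels of each categorical variable, as indices (sizes from the question).
age  <- 1:5
sex  <- 1:2
zone <- 1:4
qmat <- 1:5
qsoc <- 1:5

# One row per unique combination: 5 * 2 * 4 * 5 * 5 = 1000 rows.
combos <- expand.grid(age = age, sex = sex, zone = zone, qmat = qmat, qsoc = qsoc)

# Hypothetical data values, one per combination, stored as the 6th column.
set.seed(1)
combos$value <- rpois(nrow(combos), lambda = 100)

dim(combos)
```

expand.grid() varies the first variable fastest, which is the same order in which R stores a five-dimensional array indexed as [age, sex, zone, qmat, qsoc]. If the values already live in such an array, as.vector() on it should line up with the rows of combos.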

Time Dummy Variables and Regressing columns of a dataframe as dependent variables

I searched for this question here but couldn't find anything, so sorry if it has already been answered. My dataset consists of daily information for a large number of stocks (1000+) over a 10-year period. I have read the dataset in as a data frame time series in which each column is a separate stock. I would like to regress each stock on month dummy variables to capture the seasonal variation, and obtain the residuals. What I have done is the following:
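for (i in 1:1000) {
  # month of each observation, as a factor with one level per month
  month.f <- factor(months(time(stockinfo[, i])))
  # lm() expands a factor into month dummies itself, dropping one base month
  # store the residuals in column i (not column 1) of residStock
  residStock[, i] <- residuals(lm(stockinfo[, i] ~ month.f, na.action = na.exclude))
}
# stockinfo is a data.frame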
Is this the correct way to do it?
Secondly, I would like to run a regression using the residuals as the dependent variable and independent variables from another data frame. What would be the best way to do this? Would I have to use a for loop again?
Thank you a lot for your help.

You can create a list of stocks as follows and then use Map instead of an explicit for loop (not tested, since you didn't provide sample data).
Assume your data is mydata, with a column month coded 1, 2, and so on. With 12 months you use 11 month dummies, and the first month serves as the base.
mystock <- list("APP~", "INTEL~", "MICROSOFT~") # stock column names, each followed by a tilde
# factor(month) expands to 11 dummies, leaving out the first month as the base
myresi <- Map(function(x) resid(lm(as.formula(paste(x, "factor(month)")), data = mydata)), mystock)
Say your independent variables are indep1, indep2, and indep3, stored in another data frame (called otherdata here as a placeholder) and the same for every stock. The residuals of each stock are then the dependent variable:
myestimate <- Map(function(r) lm(r ~ indep1 + indep2 + indep3, data = otherdata), myresi)
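The two steps can be tried end to end on simulated data. In the sketch below, the dates, the three stock columns, and the second data frame other with indep1 and indep2 are all made up; it assumes the dates are available as a Date vector alongside the stock columns.

```r
# Simulated stand-in for the stock data (assumption: a Date vector plus a
# data frame with one column of daily values per stock).
set.seed(1)
dates <- seq(as.Date("2020-01-01"), by = "day", length.out = 730)
stockinfo <- data.frame(APP = rnorm(730), INTEL = rnorm(730), MICROSOFT = rnorm(730))
stockinfo$APP[c(5, 50)] <- NA  # a few missing values

month.f <- factor(months(dates))  # one level per calendar month

# Step 1: residuals from regressing each stock on the month dummies.
# na.exclude pads the residuals with NA so every column keeps 730 rows.
residStock <- as.data.frame(lapply(stockinfo, function(y) {
  residuals(lm(y ~ month.f, na.action = na.exclude))
}))

# Step 2: regress each residual series on variables from another data frame.
other <- data.frame(indep1 = rnorm(730), indep2 = rnorm(730))  # hypothetical
fits <- lapply(residStock, function(r) lm(r ~ indep1 + indep2, data = other))
```

lapply() over the columns of a data frame plays the role of the for loop in both steps, and summary(fits$APP) or coef(fits$APP) then gives the second-stage results for a single stock.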
