I have a dataset with four variables (df)
household
group
income
post
1
0
20'000
0
1
0
22'000
1
2
1
10'000
0
2
1
20'000
1
3
0
20'000
0
3
0
21'000
1
4
1
9'000
0
4
1
16'000
1
5
1
8'000
0
5
1
18'000
1
6
0
22'000
0
6
0
26'000
1
7
1
12'000
0
7
1
24'000
1
8
0
24'000
0
8
0
27'000
1
Group is a binary variable and is 1, when household got support from state. and post variable is also binary and is 1, when it is after some household got support from state.
Now I would like to run a before vs after regression that estimates the group effect by comparing post-period and before period for the supported group. I would like to put the dependent variable in logs, to have the effect in percentage, so the impact of state support on income.
I used that code, but I don't know if it is right to get the answer?
library("fixest")
feols(log(income) ~ group + post,data=df) %>% etable()
Is there another way?
If you are looking for the classic 2x2 design your code was almost correct. Change '+' with '*'. This tell us that the supported group increased the income with 7 250 more than the group which not received support.
comparing = feols(income ~ group * post,data)
comparing_log = feols(log(income) ~ group * post,data)
etable(comparing,comparing_log)
PS: The interpretation of the coefficient as percentage change is a good approximation for small numbers. The correct formula for % change is: exp(beta)-1. In this case it is exp(0.5829)-1 = 0.7912.
So the change here is 79,12%.
Related
I have a dataset containing insurance pricing and coverage information. The first column refers to the policy identifier, and the remaining columns refer to premium, limit, deductible, and further details as dummy variables (State and coverage).
Identifier
Price
Limit
Deductible
Peril1
Peril2
Peril3
Peril4
Peril5
Peril6
State1
State2
State3
State4
POL1
250.0
100000
500.0
1
1
1
0
0
1
1
0
0
0
POL1
625.0
100000
1000.0
1
1
1
0
0
1
1
0
0
0
POL1
1650.0
500000
1000.0
1
1
1
0
0
1
1
0
0
0
POL1
2500.0
1000000
1000.0
1
1
1
0
0
1
1
0
0
0
POL1
4375.0
2000000
2000.0
1
1
1
0
0
1
1
0
0
0
POL2
25.0
50000
500.0
0
0
1
1
0
0
1
0
0
0
POL3
60.25
25000
500.0
1
1
1
1
1
1
1
0
0
0
POL3
73.25
50000
500.0
1
1
1
1
1
1
1
0
0
0
Moreover, as it can be seen from the sample dataframe, several rows can refer to the same insurance product. In the original data frame, up to 40 rows may refer to a single policy, while other policies are described in a single row.
I am trying to conduct a multivariate regression
reg <- lm(log(Premium) ~ Limit + Deductible + Peril1 + Peril2 + Peril3 + Peril4 + Peril5 + Peril6 + State1+ State2 + State3 + State4, data=df)
By conducting the multivariate regression, it emerges that the distribution of residual errors does not follow a normal distribution. I therefore decided to Log() the dependent variable. Moreover, in my dataframe there are several outliers and presence of heteroscedasticity.
For the reasons above I thought WLS regression could be a solution to my problem, because it can help me assigning an appropriate weight to each error term. Trying to understand the functioning and theory behind WLS I tried to conduct simple weighted regression as explained here
wt <- 1 / lm(abs(reg$residuals) ~ reg$fitted.values)$fitted.values^2
wls_model <- lm(log(Premium) ~ Limit + Deductible + Peril1 + Peril2 + Peril3 + Peril4 + Peril5 + Peril6 + State1+ State2 + State3 + State4, data=df, weight=wt)
But when looking at the results I don’t think this is the correct approach to tackle my problem, also considering the fact that by trying to solve this issue many rows are not considered.
From my understand, as the weight parameter of lm should be a vector, I could assign a specific weight to each policy. For instance, each row pertaining POL1 is 1/5. Despite having read documentation, relevant posts, and searched for packages that could facilitate my work, it is not clear to me how to implement WLS in my case.
I have a data frame and I want to generate a new column that has the result of an calculation based on the row before. Additionally the calculation has some conditions.
The data frame consist of energy production = p, energy consumption = c, energy grid = g, energy safe = s
My goal is to calculate the usage of a battery in a PV-System. When the modules produces more then needed the battery gets loaded and otherwise unloaded. When the batterie don't have enough energy the grid delivers the remainig energy.
So in the first line the batterie gets loaded because I produce more than I need. In the 5 line I need more energy than I produce, so the batterie gets unloaded and so on.
One row is one hour. So n+1 is based on the energy demand and supply of n.
### Old:
n p c g
1 2 1 0
2 3 1 0
3 4 3 0
4 3 5 2
5 5 8 3
6 2 1 0
### New:
n p c g s
1 2 1 0 1
2 3 1 0 3
3 4 3 0 4
4 3 5 0 2
5 5 8 1 0
6 2 1 0 1
When i use your code the result is like this:
First column - c
Second Colum - p
Third colum - g
Fourth colum - s
The battery gets loaded but the unload process does not fit from what is expected. The battery has 2.3801 energy and the demand in n+1 is 0.875.
So the result should be 2.3801 - 0.875 = 1.5015
This process should end when s = 0
I dont understand why your codes works for the rest of data.
I found a solution here which works very well for my problem.
My battery is floored at 0 and limited to 16 kWh, so I added just the pmin function.
mutate(result = accumulate(production-consumw1, ~ pmin(16,pmax(0, .x + .y)), .init = 0)[-1])
Thanks for your help!
For my previous lines of code for making tables from column names, they successfully made short and dense matrices for me to readily process data from two questions (from survey results): (2nd example).
However, when I try using the same line of code (above), I don't get that sleek matrix. I end up getting a list of un-linked tables, which I do not want. Perhaps it's due to the new column only having 0's and 1's as numeric characters, vs. the others that have more than 2: (1st example).
[Please forgive my formatting issues (StackOverflow Status: Newbie). Also, many thanks in advance to those checking in on and answering my question!]
>table(select(data_final, `Relationship 2Affected Individual`, Satisfied_Treatments))
Relationship 2Affected Individual 1
1 0
2 0
3 0
6 0
Other (please specify) 0
, , 1 = 1, Response = 10679308122
0
Relationship 2Affected Individual 1
1 0
2 0
3 0
6 0
Other (please specify) 0
, ,
...
> table(select(data_final, `Relationship 2Affected Individual`, Indirect_Benefits))
Indirect_Benefits
Relationship 2Affected Individual 0 1 2 3
1 4 1 0 0
2 42 17 9 3
3 12 1 1 0
6 5 2 2 0
Other (please specify) 1 0 0 0
>#rstudioapi::versionInfo()
>#packageVersion("dplyr")
table(data_final$Relationship 2Affected Individual, data_final$Satisfied_Treatments)
Problem Solved^
I have created a document term matrix that looks something like this:
inspect(dtm[1:4,1:6])
allowed allowing almost alone companyunder companywide
Doc1.txt 1 1 1 0 1 0
Doc2.txt 0 1 1 0 1 1
Doc3.txt 0 0 0 1 0 1
Doc4.txt 1 0 1 0 1 1
After taking it's column sum it gives me.
colSums(dtm)
allowed 2
allowing 2
almost 3
alone 1
companyunder 3
companywide 3
This essentially indicates that these words are found in how many documents (for eg allowed 2 tells me that allowed is found in two documents.).
I'm having difficulty in creating a frequency distribution plot which will have x-axis as the document number and y-axis as the number of words the document contains.
Is this what you're looking for?
dtm = array(c(1,0,0,1,1,1,0,0,1,1,0,1,0,0,1,0,1,1,0,1,0,1,1,1),dim=c(4,6))
dimnames(dtm) = list(c("Doc1","Doc2","Doc3","Doc4"),c("allowed","allowing","almost","alone","companyunder","companywide"))
print(dtm)
plot(rowSums(dtm))
I am trying to use svm() to classify my data. A sample of my data is as follows:
ID call_YearWeek week WeekCount oc
x 2011W01 1 0 0
x 2011W02 2 1 1
x 2011W03 3 0 0
x 2011W04 4 0 0
x 2011W05 5 1 1
x 2011W06 6 0 0
x 2011W07 7 0 0
x 2011W08 8 1 1
x 2011W09 9 0 0
x 2011W10 10 0 0
x 2011W11 11 0 0
x 2011W12 12 1 1
x 2011W13 13 1 1
x 2011W14 14 1 1
x 2011W15 15 0 0
x 2011W16 16 2 1
x 2011W17 17 0 0
x 2011W18 18 0 0
x 2011W19 19 1 1
The third column shows week of the year. The 4th column shows number of calls in that week and the last column is a binary factor (if a call was received in that week or not). I used the following lines of code:
train <- data[1:105,]
test <- data[106:157,]
model <- svm(oc~week,data=train)
plot(model,train,week)
plot(model,train)
none of the last two lines work. they dont show any plots and they return no error. I wonder why this is happening.
Thanks
Seems like there are two problems here, first is that not all svm types are supported by plot.svm -- only the classification methods are, and not the regression methods. Because your response is numeric, svm() assumes you want to do regression so it chooses "eps-regression" by default. If you want to do classification, change your response to a factor
model <- svm(factor(oc)~week,data=train)
which will then use "C-classification" by default.
The second problem is that there does not seem to be a univariate predictor plot implemented. It seems to want two variables (one for x and one for y).
It may be better to take a step back and describe exactly what you want your plot to look like.