I tried to use the omit and the omit.yes.no options in stargazer() to omit a dummy variable. It seems that there is a bug involving this option.
These are what I expect get from the output from stargazer.
logit_1 logit_2 logit_3
| covariates 1 | 21*** 20 *** 21.4***
(0.2) (0.12) (0.10)
| covariate 2 | 0.5 0.3*** 0.31***
(0.4) (0.13) (0.15)
| factor(covariate 3) A | 0.123*** 0.3***
(0.06) (0.08)
| factor(covariate 3) B | 1.5** 1.03***
(O.78) (0.073)
| OM | No No Yes
My stargazer command is the following;
stargazer (logit_1,logit_2,logit_3, omit='OM', omit.labels="OM", omit.yes.no = c("Yes","No")).
When I run the previous command the results of the OM variable is No Yes Yes.
When I run
stargazer (logit_1,logit_2, omit='OM', omit.labels="OM", omit.yes.no = c("Yes","No"))
I get No No.
And when I run
stargazer (logit_2,logit_3, omit='OM', omit.labels="OM", omit.yes.no = c("Yes","No"))
I get Yes Yes.
Yes, this is a known bug that will be removed in the next release. For now, you can apply the following fix:
On line 3956 of stargazer-internal.R, please replace:
if (!is.na(.global.coefficients[k,i])) {
by the following:
if (!is.na(.global.coefficients[.global.coefficient.variables[k],i]))
Then, install again from source. You can also e-mail the package's author for a working version of stargazer that corrects this issue.
Related
I used a 'arulesSequences' in R to analyze the user's sequence of actions.
And I got results like this
'#' | sequence | support
1 | <{A,B}> | 0.87
2 | <{A},{B}> | 0.68
They seem to be the same sequence. What's the difference?
Does this mean that session has changed?
The sequence 1 means that A and B happened together at the same time. The sequence 2 means that first A happened, then B happened
I am very new to statistics and R in general so my question might be a bit dumb, but since I cannot find my solutions online I thought I should try ask it here.
I have a data frame dataset of a whole lot of different variables very similar to as follows:
Item | Size | Value | Town
----------------------------------
A | 10 | 800 | 1
B | 11 | 100 | 2
A | 17 | 900 | 2
D | 13 | 200 | 3
B | 15 | 500 | 1
C | 12 | 250 | 3
E | 14 | NA | 2
A | | 800 | 1
C | | 800 | 2
Basically, I have to try and 'guess' the Size based on the type of Item, it's Value, and the Town it was sold in, so I think a regression method would be a good idea.
I try and use a polynomial regression (although I'm not even sure if that's correct) to see how that looks by using a function similar to the following:
summary(lm(Size~ polym(factor(Item), Value, factor(Town), degree=2, raw=TRUE), dataset))
But I get this Warning message when I try to do this:
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
0 (non-NA) cases
In addition: Warning message:
In Ops.factor(X, Y, ...) : ‘^’ not meaningful for factors
Can anyone tell me why this happens? More importantly, is what I've done even correct?
My second question is regarding NA values in a regression. In the dataset above, I have an NA value in the Value Column. From what I understand, R ignores rows which have an NA value in a column. But what if I have a lot of NA values? Also, it seems like a waste of data to automatically eliminate entire rows if there is only one NA value in a column, so I was wondering if there is perhaps a better way of solving or working around this issue. Thanks!
EDIT: I just have one more question: In the regression model I have created it appears there are new 'levels' in the testing data which were not in the training data (e.g. the error says factor(Town) has new levels). What would be the right thing to do for cases such as this?
Yes, follow #RemkoDuursma's suggestion in using lm(Size ~ factor(Item) + factor(Town) + Value,...) and look into other degrees as well (was there a reason why you chose squared?) by comparing residuals.
In regards to substituting NA values, you have many options:
substitute all with median variable value
substitute all with mean variable value
substitute each with prediction based on values of other variables
good luck, and next time you might want to check out https://stats.stackexchange.com/!
I'm having trouble running a Friedman test over my data.
I'm trying to run a Friedman test using this command:
friedman.test(mean ~ isi | expId, data=monoSum)
On the following database (https://www.dropbox.com/s/2ox0y1b4gwld0ai/monoSum.csv):
> monoSum
expId isi N mean
1 m80B1 1 10 100.000000
2 m80B1 2 10 73.999819
3 m80B1 3 10 45.219362
4 m80B1 4 10 116.566174
. . . . .
18 m80L2 2 10 82.945491
19 m80L2 3 10 57.675480
20 m80L2 4 10 207.169277
. . . . . .
25 m80M2 1 10 100.000000
26 m80M2 2 10 49.752687
27 m80M2 3 10 19.042592
28 m80M2 4 10 150.411035
It gives me back the error:
Error in friedman.test.default(c(100, 73.9998193095267, 45.2193621626293, :
not an unreplicated complete block design
I figure it gives the error because, when monoSum$isi==1 the value of mean is always 100. Is this correct?
However, monoSum$isi==1 is alway 100 because it is the control group on which all the other monoSum$isi groups are normalized. I can not assume a normal distribution, so I cannot run a rmANOVA…
Is there a way to run a friedman test on this data or am I missing a very essential point here?
Many thanks in advance!
I don't get an error if I run your dataset:
Friedman rank sum test
data: mean and isi and expId
Friedman chi-squared = 17.9143, df = 3, p-value = 0.0004581
However, you have to make sure that expId and isi are coded as factors. Run these commands:
monoSum$expID$<-factor(monoSum$expID)
monoSum$isi$<-factor(monoSum$isi)
Then run the test again. This has worked for me with a similar problem.
I know this is pretty old but for future generations (see also: me when I forget and google this again):
You can determine what the missing values are in your dataframe by running table(groups, blocks) or in the case of this question table(monoSum$isi, monoSum$expID). This will return a table of 0s and 1s. This missing records are in the the cells with 0s.
I ran into this problem after trying to remove the blocks that had incomplete results; taking a subset of the data did not remove the blocks for some reason.
Just thought I would mention I found this post because I was getting a similar error message. The above suggestions did not solve it. Strangely, I had to sort my dataframe so that block by block the groups appeared in order (i.e. I could not have the following:
Block 1 A
Block 1 B
Block 2 B
Block 2 A
It has to appear as A, B, A, B)
I ran into the same cryptic error message in R, though in my case it was resolved when I applied the 'as.matrix' function to what was originally a dataframe for the CSV file I imported in using the read.csv() function.
I also had a missing data point in my original data set, and I found that when my data was transformed into a matrix for the friedman.test() call, the entire row containing the missing data point was omitted automatically.
Using the function as.matrix() to transform my dataframe is the magic that got the function to run for me.
I had this exact error too with my dataset.
It turns out that the function friedman.test() accepts data frames (fx those created by data.frame() ) but not tibbles (those created by dplyr and other modern tools). The solution for me was to convert my dataset to a dataframe first.
D_fri <- D_all %>% dplyr::select(FrustrationEpisode, Condition, Participant)
D_fri <- as.data.frame(D_fri)
str(D_fri) # confirm the object should now be a 'data.frame'
friedman.test(FrustrationEpisode ~ Condition | Participant, D_fri)
I ran into this problem too. Fixed mine by removing the NAs.
# My data (called layers) looks like:
| resp.no | av.l.all | av.baseem | av.base |
| 1 | 1.5 | 1.3 | 2.3 |
| 2 | 1.4 | 3.2 | 1.4 |
| 3 | 2.5 | 2.8 | 2.9 |
...
| 1088 | 3.6 | 1.1 | 3.3 |
# Remove NAs
layers1 <- na.omit(layers)
# Re-organise data so the scores are stacked, and a column added with the original column name as a factor
layers2 <- layers1 %>%
gather(key = "layertype", value = "score", av.l.all, av.baseem, av.base) %>%
convert_as_factor(resp.no, layertype)
# Data now looks like this
| resp.no | layertype | score |
| 1 | av.l.all | 1.5 |
| 1 | av.baseem | 1.3 |
| 1 | av.base | 2.3 |
| 2 | av.l.all | 1.4 |
...
| 1088 | av.base | 3.3 |
# Then do Friedman test
friedman.test(score ~ layertype | resp.no, data = layers2)
Just want to share what my problem was. My ID factor did not have correct levels after doing pivot_longer(). Because of this, the same error was given. I made sure the correct level and it worked by the following:as.factor(as.character(df$ID))
Reviving an old thread with new information. I ran into a similar problem after removing NAs. My group and block were factors before the NA removal. However, after removing NAs, the factors retained the levels before the removal even though some levels were no longer in the data!
Running the friedman.test() with the as.matrix() trick (e.g., friedman.test(a ~ b | c, as.matrix(df))) was fine but running frdAllPairsExactTest() or friedman_effsize() would throw the not an unreplicated complete block design error. I ended up re-factoring the group and block (i.e., dropping the levels that were no longer in the data, df$block <- factor(df$block)) to make things work. After the re-factor, I did not need the as.matrix() trick, either.
I know its not a code level question but wanted your views.
I need to perform “Prediction Analysis” in UNIX level using Time series model (like ARIMA).
We have implemented the same using R , but my work environment is not supporting R.
Data snapshot
Year | Month| Data1| Data2 | Data3
2012 | Jan | 1 |1 |3
2012 | Feb | 2 |21 | 4
So I wanted to implement some algorithm which will help me in finding the predicted values for future months.
Is there any other way of implementing “Time series Prediction Analysis” in UNIX (preferably Perl/Shell).
Since you are interested in perl and statistics, I'm sure you are aware of PDL. There are some specific time-series statistics modules available and of course, since perl is involved, other CPAN modules can be used.
R is still king and has a lot of packages to choose from - and, lucky us, R and perl play nice together using Statistics::R. I've not tried using Statistics-R from the PDL shell but this too may be possible to some extent.
Here's a pdl example session using MVA
/home/zombiepl % pdl
pdl> use Statistics::MVA::MultipleRegression;
pdl> $lol = [ [qw/745 36 66/],
[qw/895 37 68/],
[qw/442 47 64/],
[qw/440 32 53/],
[qw/1598 1 101/],];
pdl> linear_regression($lol);
The coefficients are: B[0] = -281.426985090045, B[1] = -7.61102966577879,
B[2] = 19.0102910918022.
R^2 is 0.943907302962818
Cheers and good luck with your project.
I use R for most of my statistical analysis. However, cleaning/processing data, especially when dealing with sizes of 1Gb+, is quite cumbersome. So I use common UNIX tools for that. But my question is, is it possible to, say, run them interactively in the middle of an R session? An example: Let's say file1 is the output dataset from an R processes, with 100 rows. From this, for my next R process, I need a specific subset of columns 1 and 2, file2, which can be easily extracted through cut and awk. So the workflow is something like:
Some R process => file1
cut --fields=1,2 <file1 | awk something something >file2
Next R process using file2
Apologies in advance if this is a foolish question.
Try this (adding other read.table arguments if needed):
# 1
DF <- read.table(pipe("cut -fields=1,2 < data.txt| awk something_else"))
or in pure R:
# 2
DF <- read.table("data.txt")[1:2]
or to not even read the unwanted fields assuming there are 4 fields:
# 3
DF <- read.table("data.txt", colClasses = c(NA, NA, "NULL", "NULL"))
The last line could be modified for the case where we know we want the first two fields but don't know how many other fields there are:
# 3a
n <- count.fields("data.txt")[1]
read.table("data.txt", header = TRUE, colClasses = c(NA, NA, rep("NULL", n-2)))
The sqldf package can be used. In this example we assume a csv file, data.csv and that the desired fields are called a and b . If its not a csv file then use appropriate arguments to read.csv.sql to specify other separator, etc. :
# 4
library(sqldf)
DF <- read.csv.sql("data.csv", sql = "select a, b from file")
I think you may be looking for littler which integrates R into the Unix command-line pipelines.
Here is a simple example computing the file size distribution of of /bin:
edd#max:~/svn/littler/examples$ ls -l /bin/ | awk '{print $5}' | ./fsizes.r
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
4 5736 23580 61180 55820 1965000 1
The decimal point is 5 digit(s) to the right of the |
0 | 00000000000000000000000000000000111111111111111111111111111122222222+36
1 | 01111112233459
2 | 3
3 | 15
4 |
5 |
6 |
7 |
8 |
9 | 5
10 |
11 |
12 |
13 |
14 |
15 |
16 |
17 |
18 |
19 | 6
edd#max:~/svn/littler/examples$
and it takes for that is three lines:
edd#max:~/svn/littler/examples$ cat fsizes.r
#!/usr/bin/r -i
fsizes <- as.integer(readLines())
print(summary(fsizes))
stem(fsizes)
See ?system for how to run shell commands from within R.
Staying in the tradition of literate programming, using e.g. org-mode and org-babel will do the job perfectly:
You can combine several different programming languages in one script and execute then separate, in sequence, export the results or the code, ...
It is a little bit like sweave, only that the code blocks can by python, bash, R, sql, and numerous other. Check t out: org-mode and bable and an example using different programming languages
Apart from that, I think org-mode and babel is the perfect way of writing even pure R scripts.
Preparing data before working with it in R is quite common, and I have a lot of scripts for Unix and Perl pre-processing, and have, at various times, maintained scripts/programs for MySQL, MongoDB, Hadoop, C, etc. for pre-processing.
However, you may get better mileage for portability if you do some kinds of pre-processing in R. You might try asking new questions focused on some of these particulars. For instance, to load large amounts of data into memory mapped files, I seem to evangelize bigmemory. Another example is found in the answers (especially JD Long's) to this question.