How to access output of CrossTable - r

I have the following output from CrossTable
| predict
actual | bad | good | Row Total |
-------------|-----------|-----------|-----------|
bad | 412 | 188 | 600 |
-------------|-----------|-----------|-----------|
good | 149 | 451 | 600 |
-------------|-----------|-----------|-----------|
Column Total | 561 | 639 | 1200 |
-------------|-----------|-----------|-----------|
and I need to assign the output to individual variables, like
a1<-412
a2<-451
and so on. How can I do it?

CrossTable produces that screen output as a "service" for users of SAS and SPSS who are used to it. The table function is what "real useRs" would use, and it delivers a table object that can be accessed via standard indexing:
with( dfrm, table(actual, predict)[1,1] ) # should be "a1" = 412
Assuming this is gmodels::CrossTable, the help page tells you that the value returned is a list whose t node is a similar matrix, so this should succeed:
with( dfrm, CrossTable(actual, predict)$t[1,1]) # should be "a1" = 412
The proportions tables have different names. Read the Value section of ?CrossTable. Looking at the help page for descr::CrossTable, it appears to do things a bit differently but to have retained that structure and those names in the value returned.
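As a minimal sketch of the table approach, the block below rebuilds a hypothetical dfrm whose counts match the output in the question (the real data frame is not shown in the post) and then pulls individual cells out by name. Only base R is used, so it should run without gmodels installed.

```r
# Hypothetical data frame reproducing the four cell counts from the question
dfrm <- data.frame(
  actual  = rep(c("bad", "bad", "good", "good"), times = c(412, 188, 149, 451)),
  predict = rep(c("bad", "good", "bad", "good"), times = c(412, 188, 149, 451))
)

# Cross-tabulate; rows are 'actual', columns are 'predict'
tab <- with(dfrm, table(actual, predict))

# Index by dimnames (or by position, e.g. tab[1, 1])
a1 <- tab["bad", "bad"]
a2 <- tab["good", "good"]

# Row and column totals, if those are needed as well
tab_with_totals <- addmargins(tab)
grand_total <- tab_with_totals["Sum", "Sum"]
```

Indexing by name rather than by position keeps the assignments readable if the factor levels are ever reordered.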

Related

Join and merge are only working partially, not completely, in R

I am working with a dataset that was obtained from a global environment and loaded into R. It has been saved as a CSV and is read into R as a data frame from that CSV. This dataset (survey_df) has almost 3 million entries. I am trying to join it, based on an ID column (repeated multiple times, since there are multiple entries per id), to what was originally a shapefile and is now loaded in R as a data frame, shapefile_df. That data frame has 60,000 unique entries, each representing a geometry in a country. We expect to have many entries per geometry in most cases. I am using a simple left_join, which should in theory join these two datasets together. The issue is that they are not fully joining: only some entries match. I have tried inner, full, and right joins as well as merge, and I keep getting the same issue. I made a full_join and a copy of the id columns to compare the ones that are not joining, and I do not see any pattern. They seem to be the same ids, yet they are not joining for some reason. I tried converting them with as.character and as.factor, and still nothing. Below I pasted a sample of the joined/unjoined df.
Matched ids
| survey_df_id | survey_id_copy | shapefile_df_id
-------------- | -------------- |--------------
0901200010229 | 0901200010229 | 0901200010229
0901500010729 | 0901500010729 | 0901500010729
090050001087A | 090050001087A | 090050001087A
0900600010467 | 0900600010467 | 0900600010467
0901400010897 | 0901400010897 | 0901400010897
0901200011960 | 0901200011960 | 0901200011960
Unmatched ids
| survey_df_id | survey_id_copy | shapefile_df_id
-------------- | -------------- |--------------
01903900010480 | 01903900010480 | NA
070470001010A | NA | 070470001010A
0704700010117 | NA | 0704700010117
0704700010140 | NA | 0704700010140
0705200010672 | NA | 0705200010672
0705200010742 | NA | 0705200010742
Most of the unmatched entries are like the first row, where shapefile_df_id is NA. However, there are a few where survey_id_copy is NA. This field is simply a mutate of survey_df_id and in theory should not be any different, yet it is. Any idea what could be causing this? I suspect this is a formatting issue but, as I said, using the as.* conversions hasn't fixed it. I am using tidyverse and read.csv. Any help?
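The post does not include a confirmed cause, so the following is only a diagnostic sketch under stated assumptions: that the keys are stored as character, and that stray whitespace or differing key lengths might be involved. The two small data frames and their columns (id, answer, region) are hypothetical stand-ins for survey_df and shapefile_df.

```r
# Hypothetical miniature stand-ins for the two data frames in the question
survey_df <- data.frame(
  id     = c("0901200010229", " 0901500010729", "01903900010480"),
  answer = 1:3,
  stringsAsFactors = FALSE
)
shapefile_df <- data.frame(
  id     = c("0901200010229", "0901500010729", "0704700010117"),
  region = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

# Assumption: leading/trailing whitespace may be hiding otherwise equal keys
survey_df$id    <- trimws(survey_df$id)
shapefile_df$id <- trimws(shapefile_df$id)

# Compare key lengths; a different number of characters points to real differences
survey_lengths <- table(nchar(survey_df$id))
shape_lengths  <- table(nchar(shapefile_df$id))

# List keys that exist on one side only
only_in_survey <- setdiff(survey_df$id, shapefile_df$id)
only_in_shape  <- setdiff(shapefile_df$id, survey_df$id)

# Left join in base R; unmatched survey rows get NA in 'region'
joined <- merge(survey_df, shapefile_df, by = "id", all.x = TRUE)
```

When reading the CSV, passing colClasses = "character" to read.csv for the id column keeps the keys as text from the start, which avoids any numeric conversion of ids that consist only of digits.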

Converting 1 row instance to a suitable format in R for repeated measures ANOVA

I'm really struggling with how to format my data to a suitable one in R.
At the moment, I have my data in the format of:
ParticipantNo | Sex | Age | IV1(0)_IV2(0)_DV1 | IV1(1)_IV2(0)_DV1 | etc
There are two levels for IV1, and 3 for IV2, so 6 columns per DV.
I've stacked them, so that I compare all IV1 results with each other, and the same for IV2 using a Friedman test.
However, I'd like to compare across groups like Sex and Age, and was told ANOVA is the best for this. I've used ANOVA directly before in SPSS, which accepts this data format.
The problem I have is getting this data into the correct format in R.
As I understand it, it should look like:
1 | M | 40 | IV1(0)_IV2(0)_DV1_Result
1 | M | 40 | IV1(1)_IV2(0)_DV1_Result
1 | M | 40 | IV1(0)_IV2(1)_DV1_Result
1 | M | 40 | IV1(1)_IV2(1)_DV1_Result
1 | M | 40 | IV1(0)_IV2(2)_DV1_Result
1 | M | 40 | IV1(1)_IV2(2)_DV1_Result
Then I can do
aov(sex~DV1_result, data=data)
Does this seem like the correct thing to do, and if so, how can I convert from the format I have to the one I need in R?
Figured it out!
I used stack on my data, and then separate, i.e. s = separate(stack(data), "ind", c("IV1", "IV2")).
Then I could do the ANOVA by aov(values ~ IV1 * IV2, data = s)
Hope this helps someone!
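For anyone who wants to see the reshaping step spelled out, here is a minimal sketch using base R only (strsplit stands in for tidyr's separate). The data frame wide, its column names of the form A0_B0, and the numbers in it are all hypothetical; the real columns in the question are named differently.

```r
# Hypothetical wide data: one row per participant, one column per IV1/IV2 cell
wide <- data.frame(
  ParticipantNo = 1:4,
  A0_B0 = c(5, 6, 7, 8),
  A1_B0 = c(6, 7, 8, 9),
  A0_B1 = c(4, 5, 6, 7),
  A1_B1 = c(7, 8, 9, 10)
)

# stack() turns the measurement columns into two columns: values and ind
s <- stack(wide[, -1])

# Split the former column names into the two factors
parts <- do.call(rbind, strsplit(as.character(s$ind), "_", fixed = TRUE))
s$IV1 <- factor(parts[, 1])
s$IV2 <- factor(parts[, 2])

# stack() goes column by column, so participant ids repeat once per column
s$ParticipantNo <- factor(rep(wide$ParticipantNo, times = ncol(wide) - 1))

# Same model as in the answer above
fit <- aov(values ~ IV1 * IV2, data = s)
fit_summary <- summary(fit)
```

Keeping ParticipantNo in the long data makes it possible to add an Error() term to the aov formula later if the repeated-measures structure needs to be modelled explicitly.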

Issues solving a Regression with numeric and categorical variables in R

I am very new to statistics and R in general, so my question might be a bit dumb, but since I cannot find a solution online I thought I should try asking it here.
I have a data frame dataset of a whole lot of different variables very similar to as follows:
Item | Size | Value | Town
----------------------------------
A | 10 | 800 | 1
B | 11 | 100 | 2
A | 17 | 900 | 2
D | 13 | 200 | 3
B | 15 | 500 | 1
C | 12 | 250 | 3
E | 14 | NA | 2
A | | 800 | 1
C | | 800 | 2
Basically, I have to try and 'guess' the Size based on the type of Item, its Value, and the Town it was sold in, so I think a regression method would be a good idea.
I try and use a polynomial regression (although I'm not even sure if that's correct) to see how that looks by using a function similar to the following:
summary(lm(Size~ polym(factor(Item), Value, factor(Town), degree=2, raw=TRUE), dataset))
But I get this error and warning message when I try to do this:
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
0 (non-NA) cases
In addition: Warning message:
In Ops.factor(X, Y, ...) : ‘^’ not meaningful for factors
Can anyone tell me why this happens? More importantly, is what I've done even correct?
My second question is regarding NA values in a regression. In the dataset above, I have an NA value in the Value Column. From what I understand, R ignores rows which have an NA value in a column. But what if I have a lot of NA values? Also, it seems like a waste of data to automatically eliminate entire rows if there is only one NA value in a column, so I was wondering if there is perhaps a better way of solving or working around this issue. Thanks!
EDIT: I just have one more question: In the regression model I have created it appears there are new 'levels' in the testing data which were not in the training data (e.g. the error says factor(Town) has new levels). What would be the right thing to do for cases such as this?
Yes, follow @RemkoDuursma's suggestion of using lm(Size ~ factor(Item) + factor(Town) + Value,...) and look into other degrees as well (was there a reason why you chose squared?) by comparing residuals.
In regards to substituting NA values, you have many options:
substitute all with median variable value
substitute all with mean variable value
substitute each with prediction based on values of other variables
Good luck, and next time you might want to check out https://stats.stackexchange.com/!
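A minimal sketch of that suggestion, combined with the median substitution from the list above, might look like the following. The data are simulated purely for illustration (the real dataset is not available), and the coefficients used to generate Size are arbitrary assumptions.

```r
set.seed(1)

# Hypothetical data with the same columns as in the question
n <- 40
dataset <- data.frame(
  Item  = sample(c("A", "B", "C"), n, replace = TRUE),
  Town  = sample(1:3, n, replace = TRUE),
  Value = round(runif(n, 100, 900))
)
dataset$Size <- 10 + 0.005 * dataset$Value + (dataset$Item == "B") + rnorm(n)

# Introduce some missing values, as in the question
dataset$Value[c(3, 17)] <- NA
dataset$Size[c(5, 20)]  <- NA

# Option from the list above: replace missing Value with the median
value_median <- median(dataset$Value, na.rm = TRUE)
dataset$Value[is.na(dataset$Value)] <- value_median

# Fit on rows where Size is known; categorical predictors enter as factors
known   <- dataset[!is.na(dataset$Size), ]
unknown <- dataset[is.na(dataset$Size), ]
fit <- lm(Size ~ factor(Item) + factor(Town) + Value, data = known)

# 'Guess' the missing sizes
predicted_size <- predict(fit, newdata = unknown)
```

Wrapping Item and Town in factor() gives each level its own coefficient, which is why raising them to a power inside polym fails with the "not meaningful for factors" warning.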

Plotting mean-values of elements with same frequencies

I have two columns in my Google Sheet that correspond to 1) the frequencies of the elements and 2) their respective 'values'.
What I want is a diagram that holds the different frequencies on the x-axis, and for each frequency I want the y-axis to hold that specific frequency's value (and if more than one element has that frequency, I want it to plot their mean value).
Two elements can share the same frequency and/or the same score, and that's why I want the mean functionality added as well.
If the following data would be my values:
280 6
280 4
250 2
240 1
230 3
Forgive my ascii-skills, but I'd want the graph to plot the following in that case:
[ASCII bar chart, spacing lost in extraction: frequencies 230 to 280 on the x-axis, one vertical bar per frequency whose height is that frequency's mean value]
I'm not entirely familiar with Google Sheets yet and I'm not really sure how to accomplish this.
I think a pivot table should serve. Put your left-hand column in Rows and your right-hand column in Values, with Summarise by AVERAGE. Then chart the results (select what in the image is B14:B17, Insert..., Chart and accept the first recommendation).
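To make the averaging step concrete, here is the same group-and-average computation sketched in R rather than in Sheets, using the five sample rows from the question. The column names frequency and value are assumptions, since the sheet's headers are not given.

```r
# The sample rows from the question; column names are assumed
d <- data.frame(
  frequency = c(280, 280, 250, 240, 230),
  value     = c(6, 4, 2, 1, 3)
)

# One row per distinct frequency, holding the mean of its values
means <- aggregate(value ~ frequency, data = d, FUN = mean)

# Bar heights for the chart, labelled by frequency
bar_heights <- setNames(means$value, means$frequency)
```

The two rows with frequency 280 collapse into a single bar of height 5, which is what the pivot table's AVERAGE summary produces as well.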

Getting summary data from graphite

I have created a dashboard using the data published from my application using statsd with a graphite backend. This has worked great for building nice data visualizations. (Kudos to etsy and others!)
Now I need to make a summary dashboard that will display a grid, more-or-less, that shows each stat with a count. (No graphs on this page but clicking on the stat name will take you to the graph.)
So, for example, I am collecting statistics for how many messages each node in our cluster receives, processes successfully, and fails to process. What I need is something like the following:
| Node Name | Messages Received | Successful | Failed |
| ------------- |:-----------------:| -----------:| ---------:|
| Node1 | 1126 | 1120 | 6 |
| Node2 | 1155 | 1100 | 55 |
| Node3 | 1124 | 1119 | 5 |
| Node4 | 1204 | 1198 | 6 |
I have a timespan selector on the toolbar, and based on that selection these numbers should be updated to reflect the selected timespan. I'm having a hard time getting numbers that align with what I expect. In some scenarios with the summarize function I am getting decimal values back, which does not make sense to me.
Any help or guidance would be greatly appreciated.
