how to load dataset package - r

I downloaded the datasets package but I'm not sure how to load it. I know how to read csv files, but not how to read this data.
http://www.inside-r.org/r-doc/datasets/state.division
I have to use state.division.
Thanks

Welcome to StackOverflow and R. First I would start with:
> library(help = "datasets")
This tells you a little about the available datasets in this package.
This package is part of the base R installation, and you don't need to load it. If you're curious where these datasets are stored on your machine, you can enter:
> system.file("data",package = "datasets")
For more info on the state datasets, you can enter: ?state
This tells you that state.division is one of the datasets available in this package.
> str(state.division)
However, it won't make a lot of sense without some additional context, so try something like:
> head(df <- data.frame(state.abb, state.division, state.x77))
           state.abb     state.division Population Income Illiteracy Life.Exp Murder HS.Grad
Alabama           AL East South Central       3615   3624        2.1    69.05   15.1    41.3
Alaska            AK            Pacific        365   6315        1.5    69.31   11.3    66.7
Arizona           AZ           Mountain       2212   4530        1.8    70.55    7.8    58.1
Arkansas          AR West South Central       2110   3378        1.9    70.66   10.1    39.9
California        CA            Pacific      21198   5114        1.1    71.71   10.3    62.6
Colorado          CO           Mountain       2541   4884        0.7    72.06    6.8    63.9
           Frost   Area
Alabama       20  50708
Alaska       152 566432
Arizona       15 113417
Arkansas      65  51945
California    20 156361
Colorado     166 103766
With a data.frame you should have the context you need to start making interesting plots or models, for example a linear regression model:
summary(lm(Murder ~ state.division + Illiteracy, data=df, weights=Population))
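As a quick check of how the state vectors fit together, here is a minimal sketch using only base R. It relies on state.name and state.division being parallel vectors (element i of each describes the same state), as ?state describes; the names by_division and new_england are mine.

```r
# state.name and state.division ship with base R (package "datasets")
# and are parallel vectors of the same length.
stopifnot(length(state.name) == length(state.division))

# Number of states in each census division
print(table(state.division))

# Group the state names by division; the result is a named list
by_division <- split(state.name, state.division)
new_england <- by_division[["New England"]]
print(new_england)
```

No library() or data() call is needed here because the datasets package is attached by default.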

R dplyr select() isn't working and there's no error code [closed]

I'm trying to create a table from the dslabs gapminder dataset showing only the life expectancies and fertility rates of African countries in 2012. The first part of my code works, but after select() it just stops working and I don't even get an error code. I'm sure it's probably something super trivial but I'm still new at this. Will someone please help me figure out why this code isn't working?
library(dplyr)
library(dslabs)
data(gapminder)
df <- gapminder %>% filter(., continent == "Africa" & year == as.factor(2012)) %>% select(., 'fertility' <= 3 & life_expectancy >= 70)
Based on the code, all of the logical expressions should go inside filter(). If filter() or select() is being masked by a same-named function from another package, either run this in a fresh R session with only dplyr loaded, or use :: to name the package that provides the function (as shown in the comments).
library(dplyr)
library(dslabs)  # provides the gapminder data
gapminder %>%
  dplyr::filter(continent == "Africa",
                year == 2012,
                fertility <= 3,
                life_expectancy >= 70)
-output
#country year infant_mortality life_expectancy fertility population gdp continent region
#1 Algeria 2012 22.4 76.2 2.82 37439427 NA Africa Northern Africa
#2 Cape Verde 2012 22.4 71.9 2.33 500870 NA Africa Western Africa
#3 Egypt 2012 22.6 70.5 2.81 85660902 NA Africa Northern Africa
#4 Libya 2012 12.9 75.5 2.41 6283403 NA Africa Northern Africa
#5 Mauritius 2012 12.8 74.1 1.50 1258335 NA Africa Eastern Africa
#6 Morocco 2012 26.4 74.1 2.71 32984190 NA Africa Northern Africa
#7 Seychelles 2012 12.2 73.7 2.21 94524 NA Africa Eastern Africa
#8 Tunisia 2012 13.6 77.4 2.02 10881450 NA Africa Northern Africa
select() picks columns, not rows; use filter() or slice() to pick rows.
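To make the row/column distinction concrete, here is a minimal sketch on a small made-up data frame. The data frame toy and its values are hypothetical, not taken from gapminder. filter() keeps the rows that satisfy every condition, then select() keeps only the named columns.

```r
library(dplyr)

# Hypothetical toy data standing in for gapminder
toy <- data.frame(
  country         = c("A", "B", "C", "D"),
  continent       = c("Africa", "Africa", "Asia", "Africa"),
  fertility       = c(2.5, 4.8, 1.9, 2.9),
  life_expectancy = c(74, 61, 80, 68)
)

result <- toy %>%
  # filter() works on rows: every condition must hold
  dplyr::filter(continent == "Africa", fertility <= 3, life_expectancy >= 70) %>%
  # select() works on columns: keep only these three
  dplyr::select(country, fertility, life_expectancy)

print(result)
```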

How do I plot the data I have in a horizontal bar graph with descending values so that all the names of the states appear?

I would like to plot the following data:
Alabama Alaska Arizona
5471 1349 2328
Arkansas California Colorado
2842 16306 3201
Connecticut Delaware District of Columbia
3067 1685 3195
Florida Georgia Hawaii
15029 8925 289
Idaho Illinois Indiana
661 17556 5852
Iowa Kansas Kentucky
2517 2145 4157
Louisiana Maine Maryland
8103 907 5798
Massachusetts Michigan Minnesota
5981 6136 2408
Mississippi Missouri Montana
3599 6631 638
Nebraska Nevada New Hampshire
1651 1952 964
New Jersey New Mexico New York
5387 1645 9712
North Carolina North Dakota Ohio
8739 573 10244
Oklahoma Oregon Pennsylvania
3455 2286 8929
Rhode Island South Carolina South Dakota
895 6939 544
Tennessee Texas Utah
7626 13577 1072
Vermont Virginia Washington
472 5949 3434
West Virginia Wisconsin Wyoming
1575 4787 494
In a horizontal bar graph with descending values. I tried various plots, but the names of the states do not appear. Only some names are printed.
I have used the simple Plot function, but I am unable to figure out how to get the names of the states to appear.
Plotting the above data in a horizontal histogram
plot(table(dfnew$state), type = "h")
Only a few names of the states appear.
While I see that you tried to provide your data (Thank you), it is not in a format that I can use without typing it all in again. I don't want to do that, so I will use the built-in USArrests data instead.
You can get a horizontal bar graph using the barplot function. Trying to squeeze 50 states in there, you will need to adjust the margins and use small print, but it certainly can be done. You can use order to sort the entries.
data(USArrests)
par(mar=c(4,7,1,2))
barplot(USArrests$Murder[order(USArrests$Murder)],
        names.arg=row.names(USArrests)[order(USArrests$Murder)],
        las=2, cex.names=0.7, horiz=TRUE)
I think that what you need for your data is
par(mar=c(4,7,1,2))
TAB = table(dfnew$state)
barplot(sort(TAB), names.arg=names(TAB)[order(TAB)],
        las=2, cex.names=0.7, horiz=TRUE)
but without your data, that is untested. BTW, you may also need to make your graphics window bigger than the default.
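The same idea can be tried on a handful of the counts from the question, typed in as a named vector. This is only a sketch on six of the states; the names counts and sorted are mine.

```r
# A few of the counts from the question, as a named vector
counts <- c(Alabama = 5471, Alaska = 1349, Arizona = 2328,
            Arkansas = 2842, California = 16306, Colorado = 3201)

# sort() keeps the names attached, so no separate names.arg is needed.
# Bars are drawn bottom to top, so ascending order puts the largest on top.
sorted <- sort(counts)

par(mar = c(4, 7, 1, 2))   # widen the left margin for the state names
barplot(sorted, horiz = TRUE, las = 1, cex.names = 0.7)
```

With all 50 states, a smaller cex.names and a taller graphics window are likely to be needed, as noted above.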
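Start with arrange() from the dplyr package if you want the data frame itself sorted in descending order:
data %>% arrange(desc(value))
Sorting the rows does not change the order of the bars, though, because ggplot2 orders a discrete axis by factor levels (alphabetically for character data). Use reorder() inside aes() for that, and geom_col() (or geom_bar(stat = "identity")), since the bar heights are already in the data; geom_bar() on its own counts rows and rejects a y aesthetic. coord_flip() then gives you the horizontal bars. Try something like this:
library(ggplot2)
ggplot(data, aes(x = reorder(state, value), y = value)) +
  geom_col() +
  coord_flip()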

state.division index in R

I'm asked to use the state.x77 data set and find the minimum income for each division defined by state.division and then use the state.name to find the name of the state that is in New England that has the minimum income. I'm getting some weird answers. Does anyone know what I'm doing wrong?
x <- tapply(state.x77$Income, state.division, min)
x
New England Middle Atlantic South Atlantic East South Central
3694 4449 3617 3098
West South Central East North Central West North Central Mountain
3378 4458 4167 3601
Pacific
4660
x1 <- tapply(state.x77$Income, state.name[state.division], min)
x1
Alabama Alaska Arizona Arkansas California Colorado
3694 4449 3617 3098 3378 4458
Connecticut Delaware Florida
4167 3601 4660
I personally tend to go straight for dplyr. state.x77 is a matrix, so first turn it into a data frame that also carries the state names and divisions; then you could use either
library(dplyr)
df <- data.frame(state = state.name, division = state.division, state.x77)
result <- df %>%
  group_by(division) %>%
  filter(Income == min(Income))
if you want to preserve all minimum value rows (as in, if two states tie for the minimum) or
df %>%
  group_by(division) %>%
  slice(which.min(Income))
if you want only one minimum value row.
If you want to only use the base package, you could try using ave() with min on the same df:
df[df$Income == ave(df$Income, df$division, FUN = min), ]
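For the specific task in the question (the New England state with the lowest income), here is a base-R sketch with tapply() and which.min(). Income is taken with matrix indexing, state.x77[, "Income"], because $ does not work on a matrix; the names income, min_by_division, in_ne and poorest_ne are mine.

```r
# state.x77 is a numeric matrix with state names as row names
income <- state.x77[, "Income"]

# Minimum income within each census division
min_by_division <- tapply(income, state.division, min)
print(min_by_division)

# Restrict to New England, then find which state holds the minimum
in_ne <- state.division == "New England"
poorest_ne <- state.name[in_ne][which.min(income[in_ne])]
print(poorest_ne)
```

Indexing state.name by the logical vector in_ne is what ties the division back to the state names; state.name[state.division] in the question instead uses the factor's integer codes as positions, which is why it returned the first nine states alphabetically.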

How to calculate mean of values from specified rows and order it in R?

I have a set of data like t(USArrests):
Alabama Alaska Arizona Arkansas California Colorado Connecticut
Murder 13.2 10.0 8.1 8.8 9.0 7.9 3.3
Assault 236.0 263.0 294.0 190.0 276.0 204.0 110.0
UrbanPop 58.0 48.0 80.0 50.0 91.0 78.0 77.0
Rape 21.2 44.5 31.0 19.5 40.6 38.7 11.1
I would like to calculate the mean of Murder and Assault only for each state and sort the states from high to low based on their mean values.
I am new to R and am lost on how to do this. Could someone help me? Thanks!
If you want the mean of Murder and Assault together (assuming this is the case, since each state has only 1 observation for each), you could do:
sort(colMeans(df[c("Murder","Assault"),]), decreasing = TRUE)
Or, if your data is actually not transposed, use rowMeans instead:
sort(rowMeans(USArrests[,c("Murder","Assault")]), decreasing = TRUE)
dplyr is a good solution for this. There is no need to t() the data.
library(dplyr)
library(tibble)
USArrests %>%
  rownames_to_column(var = "State") %>%
  # perform operations by row
  rowwise() %>%
  # add a column with the mean
  mutate(Mean = mean(c(Murder, Assault))) %>%
  # should ungroup after using rowwise()
  ungroup() %>%
  # sort by Mean descending
  arrange(desc(Mean))
Consider using the data.table package:
library(data.table)
DT <- data.table(cbind(USArrests), State = row.names(USArrests))
mean_stats <- DT[, list(mean_murder = mean(Murder),
                        mean_assault = mean(Assault)), by = State]
mean_stats[order(-mean_murder, -mean_assault)]
Here, I've ordered the results in decreasing order, first by mean murder rate and then by mean assault rate. However, as you can see, it is trivial to change that. Here is some sample output:
> head(mean_stats[order(-mean_murder, -mean_assault)])
State mean_murder mean_assault
1: Georgia 17.4 211
2: Mississippi 16.1 259
3: Florida 15.4 335
4: Louisiana 15.4 249
5: South Carolina 14.4 279
6: Alabama 13.2 236
If you are new to R, do yourself a favor and use the data.table package. Generally, it is fast for merging and aggregation and has a compact and understandable syntax.

subsetting by a variable name of a column r

  row.names Hospital                                 State Heart Attack Heart Failure
1 2275      PROVIDENCE MEMORIAL HOSPITAL             TX            16.1           9.1
2 2276      MEMORIAL HERMANN BAPTIST ORANGE HOSPITAL TX            16.3          14.3
4 2278      UNITED REGIONAL HEALTH CARE SYSTEM       TX            17.4          15.1
5 2279      ST JOSEPH REGIONAL HEALTH CENTER         TX            15.7          15.6
6 2280      PARKLAND HEALTH AND HOSPITAL SYSTEM      TX            12.9          11.2
7 2281      UNIVERSITY OF TEXAS MEDICAL BRANCH GAL   TX            17.4          11.8
Hello R peeps, I need to get the row.name where the input, which is a variable column name (Heart Attack or Heart Failure), is at its minimum for that column. In the example above, if I input "Heart Failure" it needs to return
[1] 2275
which is the row name in the first row. So far I got this:
inds<-subset(wfperstate, wfperstate[[outname]]==min)
where wfperstate is my data frame and outname is my input. Please help!
To turn my last comment into a function:
get_min_rowname <- function(dat, col)
  dat[which.min(dat[[col]]), "row.names"]
Then you apply it:
get_min_rowname(wfperstate, "Heart Attack")
get_min_rowname(wfperstate, "Heart Failure")
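To try this without the original file, here is a self-contained sketch that rebuilds the numeric columns of the table from the question. It assumes, as the answer does, that row.names is an ordinary column of the data frame. The columns are named after construction because row.names is also an argument of data.frame().

```r
# Same function as in the answer above
get_min_rowname <- function(dat, col)
  dat[which.min(dat[[col]]), "row.names"]

# Numeric columns copied from the table in the question
wfperstate <- data.frame(
  c(2275, 2276, 2278, 2279, 2280, 2281),
  c(16.1, 16.3, 17.4, 15.7, 12.9, 17.4),
  c(9.1, 14.3, 15.1, 15.6, 11.2, 11.8)
)
# Assign names afterwards so the spaces are kept
names(wfperstate) <- c("row.names", "Heart Attack", "Heart Failure")

print(get_min_rowname(wfperstate, "Heart Failure"))
print(get_min_rowname(wfperstate, "Heart Attack"))
```

which.min() returns only the first position of the minimum, so a tie would report a single row.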
