How do I plot the data I have in a horizontal bar graph with descending values so that all the names of the states appear? - r

I would like to plot the following data:
Alabama Alaska Arizona
5471 1349 2328
Arkansas California Colorado
2842 16306 3201
Connecticut Delaware District of Columbia
3067 1685 3195
Florida Georgia Hawaii
15029 8925 289
Idaho Illinois Indiana
661 17556 5852
Iowa Kansas Kentucky
2517 2145 4157
Louisiana Maine Maryland
8103 907 5798
Massachusetts Michigan Minnesota
5981 6136 2408
Mississippi Missouri Montana
3599 6631 638
Nebraska Nevada New Hampshire
1651 1952 964
New Jersey New Mexico New York
5387 1645 9712
North Carolina North Dakota Ohio
8739 573 10244
Oklahoma Oregon Pennsylvania
3455 2286 8929
Rhode Island South Carolina South Dakota
895 6939 544
Tennessee Texas Utah
7626 13577 1072
Vermont Virginia Washington
472 5949 3434
West Virginia Wisconsin Wyoming
1575 4787 494
I want a horizontal bar graph with the values in descending order. I tried various plots, but only some of the state names are printed.
I have used the simple plot function, but I cannot figure out how to get all the state names to appear.
Plotting the above data in a horizontal histogram
plot(table(dfnew$state), type = "h")
Only a few names of the states appear.

While I see that you tried to provide your data (Thank you), it is not in a format that I can use without typing it all in again. I don't want to do that, so I will use the built-in USArrests data instead.
You can get a horizontal bar graph using the barplot function. Trying to squeeze 50 states in there, you will need to adjust the margins and use small print, but it certainly can be done. You can use order to sort the entries.
data(USArrests)
par(mar=c(4,7,1,2))
barplot(USArrests$Murder[order(USArrests$Murder)],
names.arg=row.names(USArrests)[order(USArrests$Murder)],
las=2, cex.names=0.7, horiz=TRUE)
I think that what you need for your data is
par(mar=c(4,7,1,2))
TAB = table(dfnew$state)
barplot(sort(TAB), names.arg=names(TAB)[order(TAB)],
las=2, cex.names=0.7, horiz=TRUE)
but without your data, that is untested. BTW, you may also need to make your graphics window bigger than the default.
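As a rough sketch of the same idea on a few of the counts from the question (typed in by hand as a named vector, which is only an assumption about how the data might be stored): sort() keeps the names attached, so names.arg is not strictly needed.

```r
# A handful of the question's values as a named numeric vector (hand-typed subset).
counts <- c(Alabama = 5471, Alaska = 1349, Arizona = 2328,
            Arkansas = 2842, California = 16306, Hawaii = 289)

# Ascending sort: barplot(horiz = TRUE) draws the first bar at the bottom,
# so the largest value ends up at the top of the chart.
sorted <- sort(counts)

pdf(NULL)                      # draw to a null device; drop this line to see the plot
par(mar = c(4, 7, 1, 2))       # wide left margin leaves room for the state names
barplot(sorted, horiz = TRUE, las = 2, cex.names = 0.7)
invisible(dev.off())
```

With all 50 states, cex.names and the height of the graphics window would probably need further tuning.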

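arrange() from the dplyr package sorts the rows, but ggplot2 orders a discrete axis by factor levels, not by row order, so reorder the states by value inside aes() instead.
Then use ggplot2's geom_col (or geom_bar(stat = "identity"); plain geom_bar() counts rows and rejects a y aesthetic) along with coord_flip, which will give you the horizontal bars. Try something like this:
library(ggplot2)
ggplot(data, aes(x = reorder(state, value), y = value)) +
geom_col() +
coord_flip()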
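ggplot2 orders a discrete axis by factor levels rather than by row order, so it may help to see what reorder() from base R does to the levels. A minimal sketch, assuming a data frame with the hypothetical columns state and value:

```r
# Hypothetical toy data using four of the question's values.
data <- data.frame(
  state = c("Alabama", "Alaska", "Arizona", "California"),
  value = c(5471, 1349, 2328, 16306)
)

# reorder() sorts the factor levels by the accompanying values (ascending).
data$state <- reorder(factor(data$state), data$value)
levels(data$state)
```

Mapping this factor to the x aesthetic should put the largest bar at the top once coord_flip() is applied, because the first level is drawn at the bottom.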

Related

R Find unique value and find record like %_%

I have a data table of 10,000 records having multiple columns. Below is the code and part of the data set
states <- str_trim(unlist(strsplit(as.vector(search_data_set$location_name), ";")))
Part of Dataset:
Maine Virginia;
Oklahoma;
Kansas Minnesota South Dakota;
Delaware;
West Virginia;
Utah South Carolina;
Utah South Dakota Utah;
Indiana; Michigan Alaska Washington;
Washington Connecticut Maine;
Maine Oregon South Carolina Oregon;
Alabama Alaska;
Iowa Alabama New Mexico;
Virgin Islands South Dakota;
Maine Louisiana; Colorado;
District of Columbia Virgin Islands;
Pennsylvania Alabama;
I need to meet the following requirements and would appreciate help:
Each record should count each location only once. (In "Utah South Dakota Utah;", Utah should be counted once.)
When the user searches the dataset, a record should be returned if the location appears anywhere in it (like %Oregon%). The current code does not return the record "Maine Oregon South Carolina Oregon;" when the user searches for "Oregon".
Thanks in advance!
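One possible approach, sketched on a few of the records above: grepl() with fixed = TRUE behaves like a %term% search, and because it only tests for presence, a name repeated within a record is counted once. The helper name find_records is hypothetical. Note the limitation that a plain substring search for "Virginia" would also match "West Virginia".

```r
# A few records copied from the sample data in the question.
records <- c("Maine Virginia;", "Utah South Dakota Utah;",
             "Maine Oregon South Carolina Oregon;", "Alabama Alaska;")

# Hypothetical helper: return every record that contains the term anywhere.
find_records <- function(records, term) {
  records[grepl(term, records, fixed = TRUE)]
}

find_records(records, "Oregon")

# A presence test counts a record once even if the name is repeated in it.
sum(grepl("Utah", records, fixed = TRUE))
```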

Rvest read table with cells that span multiple rows

I'm trying to scrape an irregular table from Wikipedia using rvest. The table has cells that span multiple rows. The documentation for html_table clearly states that this is a limitation. I'm just wondering if there's a workaround.
The table looks like this:
My code:
library(rvest)
url <- "https://en.wikipedia.org/wiki/Arizona_League"
parks <- url %>%
read_html() %>%
html_nodes(xpath='/html/body/div[3]/div[3]/div[4]/div/table[2]') %>%
html_table(fill=TRUE) %>% # fill=FALSE yields the same results
.[[1]]
Returns this:
Where there are several errors, for example: row 4 under "City" should be "Mesa", NOT "Chicago Cubs". I'd be happy with blank cells as I could "fill down" as needed, but the wrong data is a problem. Help is much appreciated.
I have a way to code it.
It is not perfect, a bit long but it does the trick:
library(rvest)
url <- "https://en.wikipedia.org/wiki/Arizona_League"
# get the lines of the table
lines <- url %>%
read_html() %>%
html_nodes(xpath="//table[starts-with(@class, 'wikitable')]") %>%
html_nodes(xpath = 'tbody/tr')
#define the empty table
ncol <- lines %>%
.[[1]] %>%
html_children()%>%
length()
nrow <- length(lines)
table <- as.data.frame(matrix(nrow = nrow,ncol = ncol))
# fill the table
for(i in 1:nrow){
# get content of the line
linecontent <- lines[[i]]%>%
html_children()%>%
html_text()%>%
gsub("\n","",.)
# attribute the content to free columns
colselect <- is.na(table[i,])
table[i,colselect] <- linecontent
# get the line repetition of each columns
repetition <- lines[[i]]%>%
html_children()%>%
html_attr("rowspan")%>%
ifelse(is.na(.),1,.) %>% # if no rowspan, then it is a normal row, not a multiple one
as.numeric
# repeat the cells of the multiple rows down
for(j in 1:length(repetition)){
span <- repetition[j]
if(span > 1){
table[(i+1):(i+span-1),colselect][,j] <- rep(linecontent[j],span-1)
}
}
}
The idea is to collect the HTML rows of the table in the lines variable by selecting the tr nodes. I then create an empty table: the number of columns is the number of children of the first row (because it contains the titles), and the number of rows is the length of lines. I fill it by hand in a for loop (I didn't manage to find a nicer way here).
The difficulty is that the number of cells a row provides changes when a cell from an earlier row already spans the current row. For example:
lines[[3]]%>%
html_children()%>%
html_text()%>%
gsub("\n","",.)
gives only 5 values :
[1] "Arizona League Athletics Gold" "Oakland Athletics" "Mesa" "Fitch Park"
[5] "10,000"
instead of the 6 columns, because the first column is East on 8 rows. This East value appears only in the first of the rows it spans.
The trick is to repeat cells down the table when they have a rowspan attribute (meaning they span several rows). On the next row we can then select only the NA columns, so that the number of cells given by the HTML row matches the number of free columns in the table we fill.
This is done with the colselect variable, a boolean vector marking the free columns of the current row before that row's cells are repeated downward.
The result :
V1 V2 V3 V4 V5 V6
1 Division Team MLB Affiliation City Stadium Capacity
2 East Arizona League Angels Los Angeles Angels Tempe Tempe Diablo Stadium 9,785
3 East Arizona League Athletics Gold Oakland Athletics Mesa Fitch Park 10,000
4 East Arizona League Athletics Green Oakland Athletics Mesa Fitch Park 10,000
5 East Arizona League Cubs 1 Chicago Cubs Mesa Sloan Park 15,000
6 East Arizona League Cubs 2 Chicago Cubs Mesa Sloan Park 15,000
7 East Arizona League Diamondbacks Arizona Diamondbacks Scottsdale Salt River Fields at Talking Stick 11,000
8 East Arizona League Giants Black San Francisco Giants Scottsdale Scottsdale Stadium 12,000
9 East Arizona League Giants Orange San Francisco Giants Scottsdale Scottsdale Stadium 12,000
10 Central Arizona League Brewers Gold Milwaukee Brewers Phoenix American Family Fields of Phoenix 8,000
11 Central Arizona League Dodgers Lasorda Los Angeles Dodgers Phoenix Camelback Ranch 12,000
12 Central Arizona League Indians Blue Cleveland Indians Goodyear Goodyear Ballpark 10,000
13 Central Arizona League Padres 2 San Diego Padres Peoria Peoria Sports Complex 12,882
14 Central Arizona League Reds Cincinnati Reds Goodyear Goodyear Ballpark 10,000
15 Central Arizona League White Sox Chicago White Sox Phoenix Camelback Ranch 12,000
16 West Arizona League Brewers Blue Milwaukee Brewers Phoenix American Family Fields of Phoenix 8,000
17 West Arizona League Dodgers Mota Los Angeles Dodgers Phoenix Camelback Ranch 12,000
18 West Arizona League Indians Red Cleveland Indians Goodyear Goodyear Ballpark 10,000
19 West Arizona League Mariners Seattle Mariners Peoria Peoria Sports Complex 12,882
20 West Arizona League Padres 1 San Diego Padres Peoria Peoria Sports Complex 12,882
21 West Arizona League Rangers Texas Rangers Surprise Surprise Stadium 10,500
22 West Arizona League Royals Kansas City Royals Surprise Surprise Stadium 10,500
Edit
I made a shorter version of the function, with more explanation here
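The fill-down idea can also be sketched without rvest. The toy input below is hypothetical: each row lists only the cells it actually contains, together with their rowspan, the way the tr nodes do. This is a simplified illustration, not the shorter version mentioned above.

```r
# Hypothetical toy input: cell text and rowspan for each HTML row.
rows <- list(
  list(text = c("Division", "Team", "City"), span = c(1, 1, 1)),
  list(text = c("East", "Angels", "Tempe"),  span = c(3, 1, 1)),
  list(text = c("Athletics Gold", "Mesa"),   span = c(1, 2)),
  list(text = c("Athletics Green"),          span = c(1))
)

ncols <- length(rows[[1]]$text)            # the header row defines the column count
tab <- matrix(NA_character_, nrow = length(rows), ncol = ncols)

for (i in seq_along(rows)) {
  free <- which(is.na(tab[i, ]))           # columns not already filled from a row above
  cells <- rows[[i]]
  for (j in seq_along(cells$text)) {
    last <- min(i + cells$span[j] - 1, nrow(tab))
    tab[i:last, free[j]] <- cells$text[j]  # write the cell and repeat it down its span
  }
}
tab
```

Writing each cell into all the rows it spans in one step means later rows only ever see the columns that are still free.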

state.division index in R

I'm asked to use the state.x77 data set and find the minimum income for each division defined by state.division and then use the state.name to find the name of the state that is in New England that has the minimum income. I'm getting some weird answers. Does anyone know what I'm doing wrong?
x <- tapply(state.x77$Income, state.division, min)
x
New England Middle Atlantic South Atlantic East South Central
3694 4449 3617 3098
West South Central East North Central West North Central Mountain
3378 4458 4167 3601
Pacific
4660
x1 <- tapply(state.x77$Income, state.name[state.division], min)
x1
Alabama Alaska Arizona Arkansas California Colorado
3694 4449 3617 3098 3378 4458
Connecticut Delaware Florida
4167 3601 4660
I personally tend to go straight for dplyr. Note that state.x77 is a matrix, so first combine it with state.name and state.division into a data frame. Then you could use either
library(dplyr)
states <- data.frame(state.name, state.division, state.x77)
result <- states %>%
group_by(state.division) %>%
filter(Income == min(Income))
if you want to preserve all minimum value rows (as in, if there are two minimums) or
states %>%
group_by(state.division) %>%
slice(which.min(Income))
if you want only one minimum value row.
If you want to only use the base package, you could try using ave() with min:
states[states$Income == ave(states$Income, states$state.division, FUN = min), ]
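For the original question, a base R sketch may also help. Indexing state.name by state.division uses the factor's integer codes (1 to 9), which is presumably why x1 is labelled with the first nine states alphabetically rather than with the states that hold each minimum. One way around this, pulling Income out of the state.x77 matrix by column name:

```r
income <- state.x77[, "Income"]            # state.x77 is a matrix, so index by column name

# Minimum income per division
min_by_division <- tapply(income, state.division, min)

# Name of the state holding the minimum income in each division
poorest <- tapply(seq_along(income), state.division,
                  function(idx) state.name[idx][which.min(income[idx])])

poorest[["New England"]]
```

Here tapply() groups row indices rather than incomes, so the function can look up both the income and the state name for each group.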

R append function

I'm writing an R script that parses out a state abbreviation from a column in a data.frame. It then uses the which() function to determine the index of the found state abbreviation in a lookup data frame that contains state abbreviations and their corresponding full state names. I then use the found index to access the full state name and append it to a vector called completeList. Finally, I add completeList, which should contain the full state names, to my original data frame as a newly created column STATE_NAME.
However, for some reason completeList only contains integer indexes and not the full state names that I expected. What did I do wrong?
#read in csv weather data file
file <- read.csv(header = TRUE, file = "C:\\Users\\michael.guarino1\\Desktop\\Work\\weather\\nov_2_1976\\734677_cleaned.csv")
#read in csv state Abbreviation file
abbreviationsFile<-read.csv(header=TRUE, file="C:\\Users\\michael.guarino1\\Desktop\\Work\\weather\\stateAbbreviationMatches.csv")
#iterate through STATION_NAME and store abreviations
completeList<-c()
for(stateAbvr in file$STATION_NAME){
addTo<-(substring(stateAbvr,(nchar(stateAbvr)-4),(nchar(stateAbvr)-3)))
index<-which(abbreviationsFile$Abbreviation==addTo)
addCompleteStateName<-(abbreviationsFile[index,1])
completeList<-append(completeList, addCompleteStateName)
}
file["STATE_NAME"]<-completeList
>completeList
[1] 27 17 17 29 42 50 20 53 45 19 22 52 9 29 26 37 8 58 35
Here is the csv file where the abbreviation of the station is found
STATION STATION_NAME ELEVATION
GHCND:USC00202381 EAST JORDAN MI US 180.1
GHCND:USC00111290 CARLYLE RESERVOIR IL US 153
GHCND:USC00116661 PAW PAW 2 S IL US 274.9
GHCND:USC00228556 SUMRALL MS US 88.1
GHCND:USC00340292 ARDMORE OK US 267.9
GHCND:USC00408522 SPARTA WASTEWATER PLANT TN US 289.9
GHCND:USC00148341 VALLEY FALLS KS US 283.5
GHCND:USW00014742 BURLINGTON INTERNATIONAL AIRPORT VT US 101.2
GHCND:USC00367782 SALINA 3 W PA US 338
GHCND:USC00134142 IOWA FALLS IA US 356.9
GHCND:USC00161565 CARVILLE 2 SW LA US 9.1
GHCND:USC00421446 CITY CRK WATER PLANT UT US 1628.9
GHCND:USW00013781 WILMINGTON NEW CASTLE CO AIRPORT DE US 22.6
GHCND:USC00229400 WATER VALLEY MS US 116.1
GHCND:USC00190562 BELCHERTOWN MA US 171
GHCND:USW00094728 NEW YORK CENTRAL PARK OBS BELVEDERE TOWER NY US 40.2
GHCND:USC00060973 BURLINGTON CT US 155.4
GHCND:USC00475516 MINOCQUA WI US 484.9
GHCND:USC00286055 NEW BRUNSWICK 3 SE NJ US 38.1
Here is the csv file where we look up abbreviations and find the corresponding full state name
State/Possession Abbreviation
Alabama AL
Alaska AK
American Samoa AS
Arizona AZ
Arkansas AR
California CA
Colorado CO
Connecticut CT
Delaware DE
District of Columbia DC
Federated States of Micronesia FM
Florida FL
Georgia GA
Guam GU
Hawaii HI
Idaho ID
Illinois IL
Indiana IN
Iowa IA
Kansas KS
Kentucky KY
Louisiana LA
Maine ME
Marshall Islands MH
Maryland MD
Massachusetts MA
Michigan MI
Minnesota MN
Mississippi MS
Missouri MO
Montana MT
Nebraska NE
Nevada NV
New Hampshire NH
New Jersey NJ
New Mexico NM
New York NY
North Carolina NC
North Dakota ND
Northern Mariana Islands MP
Ohio OH
Oklahoma OK
Oregon OR
Palau PW
Pennsylvania PA
Puerto Rico PR
Rhode Island RI
South Carolina SC
South Dakota SD
Tennessee TN
Texas TX
Utah UT
Vermont VT
Virgin Islands VI
Virginia VA
Washington WA
West Virginia WV
Wisconsin WI
Wyoming WY
Why am I not getting the full state name?
figured it out 😎
#read in csv weather data file
file <- read.csv(header = TRUE, file = "C:\\Users\\michael.guarino1\\Desktop\\Work\\weather\\nov_2_1976\\734677_cleaned.csv")
#read in csv state Abbreviation file
abbreviationsFile<-read.csv(header=TRUE, file="C:\\Users\\michael.guarino1\\Desktop\\Work\\weather\\stateAbbreviationMatches.csv")
#iterate through STATION_NAME and store abreviations
completeList<-c()
for(stateAbvr in file$STATION_NAME){
addTo<-(substring(stateAbvr,(nchar(stateAbvr)-4),(nchar(stateAbvr)-3)))
index<-which(abbreviationsFile$Abbreviation==addTo)
addCompleteStateName<-(abbreviationsFile[index,1])
completeList<-append(completeList, toString(addCompleteStateName))
}
file["STATE_NAME"]<-completeList
the type was being forced to an integer
The variable addCompleteStateName is a factor. You can convert it to a character to append the labels.
#iterate through STATION_NAME and store abreviations
completeList<-c()
for(stateAbvr in file$STATION_NAME){
addTo<-(substring(stateAbvr,(nchar(stateAbvr)-4),(nchar(stateAbvr)-3)))
index<-which(abbreviationsFile$Abbreviation==addTo)
addCompleteStateName<-(abbreviationsFile[index,1])
# modified to convert addCompleteStateName to character
completeList<-append(completeList, as.character(addCompleteStateName))
}
file["STATE_NAME"]<-completeList
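The loop can probably be avoided altogether, since substring() and match() are vectorised. A minimal sketch on a hand-typed subset of the two files (the object names lookup and stations are hypothetical, and stringsAsFactors = TRUE mimics the factor columns that older read.csv defaults produced):

```r
# Hand-typed subset of the lookup table; factors mimic older read.csv behaviour.
lookup <- data.frame(State = c("Michigan", "Illinois", "Mississippi"),
                     Abbreviation = c("MI", "IL", "MS"),
                     stringsAsFactors = TRUE)

# Three station names copied from the question.
stations <- c("EAST JORDAN MI US", "CARLYLE RESERVOIR IL US", "SUMRALL MS US")

# Same substring positions as in the loop: the two letters before " US".
abbr <- substring(stations, nchar(stations) - 4, nchar(stations) - 3)

# match() gives the row of each abbreviation; as.character() keeps labels, not codes.
state_name <- as.character(lookup$State)[match(abbr, lookup$Abbreviation)]
state_name
```

An abbreviation missing from the lookup table would yield NA here, instead of silently shortening the result as appending a zero-length value in the loop would.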

how to load dataset package

I downloaded the dataset package but not sure how to load it. I know how to read csv files but not sure how to read the data.
http://www.inside-r.org/r-doc/datasets/state.division
I have to use state.division.
Thanks
Welcome to StackOverflow and R. First I would start with:
> library(help = "datasets")
This tells you a little about the available datasets in this package.
This package is part of the base R installation, and you don't need to load it. If you're curious where these datasets are stored on your machine, you can enter:
> system.file("data",package = "datasets")
For more info on the state datasets, you can enter: ?state
This tells you that state.division is one of the datasets available in this package.
> str(state.division)
However, it won't make a lot of sense without some additional context, so try something like:
> head(df <- data.frame(state.abb, state.division, state.x77))
state.abb state.division Population Income Illiteracy Life.Exp Murder HS.Grad
Alabama AL East South Central 3615 3624 2.1 69.05 15.1 41.3
Alaska AK Pacific 365 6315 1.5 69.31 11.3 66.7
Arizona AZ Mountain 2212 4530 1.8 70.55 7.8 58.1
Arkansas AR West South Central 2110 3378 1.9 70.66 10.1 39.9
California CA Pacific 21198 5114 1.1 71.71 10.3 62.6
Colorado CO Mountain 2541 4884 0.7 72.06 6.8 63.9
Frost Area
Alabama 20 50708
Alaska 152 566432
Arizona 15 113417
Arkansas 65 51945
California 20 156361
Colorado 166 103766
With a data.frame you should have the context you need to start making interesting plots or models, for example a linear regression model:
summary(lm(Murder ~ state.division + Illiteracy, data=df, weights=Population))
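As a quick check that nothing has to be loaded first, the following sketch can be run in a fresh R session, since the datasets package is attached by default:

```r
# No library() call is needed: datasets is attached by default.
class(state.division)                      # a factor
nlevels(state.division)                    # number of divisions
table(state.division)[["New England"]]     # how many states fall in New England
```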
